Step 2: Preprocessing Text Data

Text preprocessing is a crucial step in any Natural Language Processing (NLP) project. Raw text from resumes contains a lot of information that isn't directly usable by machine learning algorithms. In this step, we'll transform that raw text into a clean, structured format.

What is Text Preprocessing?

Text preprocessing involves cleaning and standardizing text data to make it suitable for analysis. For resumes, this is particularly important because they arrive in inconsistent formats and often contain URLs, contact details, punctuation, and numbers that add noise rather than useful signal.

Installing and Importing NLTK

We'll use the Natural Language Toolkit (NLTK), a powerful Python library for working with human language data:


# If NLTK isn't installed yet, install it first (pip install nltk)

# Import necessary libraries
import nltk
import re
import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Download the NLTK data used below
nltk.download('punkt')      # tokenizer models
nltk.download('stopwords')  # English stop word list
nltk.download('wordnet')    # lexical database used by the lemmatizer
# On newer NLTK releases you may also need: nltk.download('punkt_tab')
            

Basic Text Cleaning

First, let's define a function to clean our resume text:


def clean_resume(resume_text):
    # Convert to lowercase
    resume_text = resume_text.lower()
    
    # Remove URLs
    resume_text = re.sub(r'http\S+\s*', ' ', resume_text)
    
    # Remove "RT" and "cc" as whole words (the text is already lowercase, so match
    # lowercase and use word boundaries to avoid mangling words such as "account")
    resume_text = re.sub(r'\brt\b|\bcc\b', ' ', resume_text)
    
    # Remove hashtags
    resume_text = re.sub(r'#\S+', ' ', resume_text)
    
    # Remove mentions
    resume_text = re.sub(r'@\S+', ' ', resume_text)
    
    # Remove punctuation
    resume_text = re.sub('[%s]' % re.escape(string.punctuation), ' ', resume_text)
    
    # Remove numbers
    resume_text = re.sub(r'[0-9]', ' ', resume_text)
    
    # Collapse extra whitespace
    resume_text = re.sub(r'\s+', ' ', resume_text).strip()
    
    return resume_text
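
To see the cleaner in action, here's a quick check on a made-up snippet (the sample string is ours, for illustration only, and the printed result is approximate):


sample = "Senior Python Developer, 5+ years. Portfolio: http://example.com #opentowork"
print(clean_resume(sample))
# Roughly: "senior python developer years portfolio"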
            

Tokenization

Tokenization is the process of breaking text into individual words or tokens:


def tokenize_resume(resume_text):
    # Clean the resume text first
    cleaned_text = clean_resume(resume_text)
    
    # Tokenize the text
    tokens = word_tokenize(cleaned_text)
    
    return tokens
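
As a quick illustration on an invented sentence (not taken from the dataset):


print(tokenize_resume("Skilled in Python, SQL, and data visualization."))
# Roughly: ['skilled', 'in', 'python', 'sql', 'and', 'data', 'visualization']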
            

Removing Stop Words

Stop words are common words like "the," "and," "is" that don't carry much meaning:


def remove_stopwords(tokens):
    # Get English stop words
    stop_words = set(stopwords.words('english'))
    
    # Filter out stop words
    filtered_tokens = [word for word in tokens if word not in stop_words]
    
    return filtered_tokens
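
Continuing the same invented example, the filler words drop out while the skill terms remain:


tokens = ['skilled', 'in', 'python', 'sql', 'and', 'data', 'visualization']
print(remove_stopwords(tokens))
# Roughly: ['skilled', 'python', 'sql', 'data', 'visualization']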
            

Stemming and Lemmatization

These techniques reduce words to their root forms:


def stem_tokens(tokens):
    # Initialize stemmer
    stemmer = PorterStemmer()
    
    # Stem each token
    stemmed_tokens = [stemmer.stem(token) for token in tokens]
    
    return stemmed_tokens

def lemmatize_tokens(tokens):
    # Initialize lemmatizer
    lemmatizer = WordNetLemmatizer()
    
    # Lemmatize each token
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]
    
    return lemmatized_tokens
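
To compare the two side by side, here's a small sketch on a handful of resume-style words (the word list is ours, and the outputs shown are approximate and may vary slightly by NLTK version):


sample_tokens = ['managed', 'developing', 'experience', 'technologies', 'analysis']
print(stem_tokens(sample_tokens))
# Roughly: ['manag', 'develop', 'experi', 'technolog', 'analysi']
print(lemmatize_tokens(sample_tokens))
# Roughly: ['managed', 'developing', 'experience', 'technology', 'analysis']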
            

Putting It All Together

Now, let's create a complete preprocessing pipeline:


def preprocess_resume(resume_text):
    # Clean and tokenize
    tokens = tokenize_resume(resume_text)
    
    # Remove stopwords
    tokens = remove_stopwords(tokens)
    
    # Choose either stemming or lemmatization (lemmatization is usually better for resumes)
    # tokens = stem_tokens(tokens)
    tokens = lemmatize_tokens(tokens)
    
    # Join tokens back into a single string for further processing
    processed_text = ' '.join(tokens)
    
    return processed_text

# Apply preprocessing to our dataset
resume_data['processed_resume'] = resume_data['Resume'].apply(preprocess_resume)

# Display a sample of original vs processed text
for i in range(2):  # Show 2 examples
    print(f"Original Resume {i+1} (excerpt):\n{resume_data['Resume'][i][:300]}...\n")
    print(f"Processed Resume {i+1}:\n{resume_data['processed_resume'][i][:300]}...\n")
    print("-" * 80)
            

Understanding the Differences

Let's understand what each preprocessing step does:

  1. Cleaning: Removes irrelevant characters and standardizes the text
  2. Tokenization: Breaks text into individual words for analysis
  3. Stop Word Removal: Eliminates common words that don't add meaning
  4. Stemming: Reduces words to their stems (e.g., "running" → "run")
  5. Lemmatization: Reduces words to their dictionary form, or lemma (e.g., "technologies" → "technology"; given part-of-speech information, even "better" → "good")

For resumes, lemmatization is often preferred over stemming because it produces more meaningful words. For example, stemming might convert "experience" to "experi," while lemmatization would keep it as "experience."

Next Steps

Now that we have clean, preprocessed resume text, we're ready to convert it into numerical features that machine learning algorithms can understand. In the next step, we'll explore feature extraction techniques like TF-IDF and word embeddings.