Step 2: Preprocessing Text Data
Text preprocessing is a crucial step in any Natural Language Processing (NLP) project. Raw text extracted from resumes is full of formatting artifacts, inconsistencies, and noise that machine learning algorithms can't use directly. In this step, we'll transform that raw text into a clean, standardized format.
What is Text Preprocessing?
Text preprocessing involves cleaning and standardizing text data to make it suitable for analysis. For resumes, this is particularly important because:
- Resumes contain various formatting, headers, and sections
- They may include special characters, numbers, and dates
- Different candidates use different terminology for similar skills
- There's often irrelevant information that can confuse our model
Installing and Importing NLTK
We'll use the Natural Language Toolkit (NLTK), a powerful Python library for working with human language data:
# Import necessary libraries
import nltk
import re
import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
# Download necessary NLTK data
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')  # multilingual WordNet data; the lemmatizer needs it in some NLTK versions
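If the imports fail because NLTK isn't installed in your environment yet, you can typically install it with pip first (this is a shell command, not Python; adjust for your own setup, e.g. conda):

# Install NLTK (run once in a terminal or notebook cell)
pip install nltk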
Basic Text Cleaning
First, let's define a function to clean our resume text:
def clean_resume(resume_text):
    # Convert to lowercase
    resume_text = resume_text.lower()
    # Remove URLs
    resume_text = re.sub(r'http\S+\s*', ' ', resume_text)
    # Remove standalone "rt" and "cc" tokens (text is already lowercased, so match lowercase,
    # and use word boundaries so we don't damage words like "accounting")
    resume_text = re.sub(r'\brt\b|\bcc\b', ' ', resume_text)
    # Remove hashtags
    resume_text = re.sub(r'#\S+', ' ', resume_text)
    # Remove mentions
    resume_text = re.sub(r'@\S+', ' ', resume_text)
    # Remove punctuation
    resume_text = re.sub('[%s]' % re.escape(string.punctuation), ' ', resume_text)
    # Remove numbers
    resume_text = re.sub(r'[0-9]+', ' ', resume_text)
    # Collapse extra whitespace
    resume_text = re.sub(r'\s+', ' ', resume_text).strip()
    return resume_text
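As a quick sanity check, we can run the cleaner on a short snippet (the text below is made up purely for illustration, not taken from the dataset):

# Try the cleaner on an illustrative, made-up snippet
sample = "Senior Data Scientist (2019-2023): built NLP pipelines! See http://example.com"
print(clean_resume(sample))
# Expected output (roughly): senior data scientist built nlp pipelines see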
Tokenization
Tokenization is the process of breaking text into individual words or tokens:
def tokenize_resume(resume_text):
    # Clean the resume text first
    cleaned_text = clean_resume(resume_text)
    # Split the cleaned text into individual tokens
    tokens = word_tokenize(cleaned_text)
    return tokens
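Running the tokenizer on another made-up snippet shows how the cleaned text is split into individual tokens:

# Tokenize an illustrative snippet (numbers are removed by the cleaning step)
tokens = tokenize_resume("Python developer with 5 years of Django experience")
print(tokens)
# Likely output: ['python', 'developer', 'with', 'years', 'of', 'django', 'experience']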
Removing Stop Words
Stop words are common words like "the," "and," "is" that don't carry much meaning:
def remove_stopwords(tokens):
    # Get the set of English stop words
    stop_words = set(stopwords.words('english'))
    # Filter out stop words
    filtered_tokens = [word for word in tokens if word not in stop_words]
    return filtered_tokens
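Applied to the tokens from the previous example, the filter drops words like "with" and "of":

# Remove stop words from the example tokens above
filtered = remove_stopwords(['python', 'developer', 'with', 'years', 'of', 'django', 'experience'])
print(filtered)
# Likely output: ['python', 'developer', 'years', 'django', 'experience']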
Stemming and Lemmatization
These techniques reduce words to their root forms:
def stem_tokens(tokens):
    # Initialize the Porter stemmer
    stemmer = PorterStemmer()
    # Stem each token
    stemmed_tokens = [stemmer.stem(token) for token in tokens]
    return stemmed_tokens

def lemmatize_tokens(tokens):
    # Initialize the WordNet lemmatizer
    lemmatizer = WordNetLemmatizer()
    # Lemmatize each token
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]
    return lemmatized_tokens
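To see how the two approaches differ, we can run both helpers on the same small list of tokens (exact outputs may vary slightly between NLTK versions):

# Compare stemming and lemmatization on the same illustrative tokens
tokens = ['managed', 'testing', 'databases', 'analytics']
print(stem_tokens(tokens))       # e.g. ['manag', 'test', 'databas', 'analyt']
print(lemmatize_tokens(tokens))  # e.g. ['managed', 'testing', 'database', 'analytics']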
Putting It All Together
Now, let's create a complete preprocessing pipeline:
def preprocess_resume(resume_text):
    # Clean and tokenize
    tokens = tokenize_resume(resume_text)
    # Remove stop words
    tokens = remove_stopwords(tokens)
    # Choose either stemming or lemmatization (lemmatization is usually better for resumes)
    # tokens = stem_tokens(tokens)
    tokens = lemmatize_tokens(tokens)
    # Join tokens back into a single string for further processing
    processed_text = ' '.join(tokens)
    return processed_text
# Apply preprocessing to our dataset
resume_data['processed_resume'] = resume_data['Resume'].apply(preprocess_resume)
# Display a sample of original vs processed text
for i in range(2):  # Show 2 examples
    print(f"Original Resume {i+1} (excerpt):\n{resume_data['Resume'][i][:300]}...\n")
    print(f"Processed Resume {i+1}:\n{resume_data['processed_resume'][i][:300]}...\n")
    print("-" * 80)
Understanding the Differences
Let's understand what each preprocessing step does:
- Cleaning: Removes irrelevant characters and standardizes the text
- Tokenization: Breaks text into individual words for analysis
- Stop Word Removal: Eliminates common words that don't add meaning
- Stemming: Reduces words to their stems (e.g., "running" → "run")
- Lemmatization: Reduces words to their dictionary form, or lemma (e.g., "studies" → "study")
For resumes, lemmatization is often preferred over stemming because it produces more meaningful words. For example, stemming might convert "experience" to "experi," while lemmatization would keep it as "experience."
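You can verify this behaviour directly (exact output may vary slightly between NLTK versions):

# Contrast the stemmer and the lemmatizer on the words discussed above
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
print(stemmer.stem('experience'), stemmer.stem('running'))                            # likely: experi run
print(lemmatizer.lemmatize('experience'), lemmatizer.lemmatize('running', pos='v'))   # experience run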
Next Steps
Now that we have clean, preprocessed resume text, we're ready to convert it into numerical features that machine learning algorithms can understand. In the next step, we'll explore feature extraction techniques like TF-IDF and word embeddings.
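As a brief preview (the vectorizer settings below are placeholders we'll revisit in the next step), scikit-learn's TfidfVectorizer can turn the processed text into a numeric feature matrix:

from sklearn.feature_extraction.text import TfidfVectorizer

# Preview of the next step: convert processed resumes into TF-IDF features
vectorizer = TfidfVectorizer(max_features=1000)
features = vectorizer.fit_transform(resume_data['processed_resume'])
print(features.shape)  # (number of resumes, number of TF-IDF features)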