Step 3: Feature Extraction

After preprocessing our resume text data, we need to convert it into a numerical format that machine learning algorithms can understand. This process is called feature extraction, and it's a crucial step in building our Resume Parser AI.

What is Feature Extraction?

Feature extraction transforms text data into numerical vectors (or features) that represent the important characteristics of the text. For our resume parser, these features will help the model understand the content and context of each resume.

Common Text Feature Extraction Methods

There are several ways to convert text into numerical features. We'll explore two popular methods:

  1. TF-IDF (Term Frequency-Inverse Document Frequency)
  2. Word Embeddings (Word2Vec)

TF-IDF Vectorization

TF-IDF measures how important a word is to a document in a collection of documents. It combines two quantities:

  1. Term Frequency (TF): how often a word appears in a given document.
  2. Inverse Document Frequency (IDF): how rare the word is across all documents in the collection.

Words that appear frequently in a single document but rarely in others receive higher TF-IDF scores.
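
To see how the scoring behaves, here's a minimal sketch on a made-up three-document corpus (the documents below are illustrative, not from our resume dataset):


from sklearn.feature_extraction.text import TfidfVectorizer

# A made-up corpus: "python" appears in every document, "tensorflow" in only one
toy_docs = [
    "python developer with tensorflow experience",
    "python developer with sql experience",
    "python analyst with sql experience",
]

vectorizer = TfidfVectorizer()
scores = vectorizer.fit_transform(toy_docs).toarray()
vocab = vectorizer.get_feature_names_out()

# In the first document, "tensorflow" (rare in the corpus) outscores
# "python" (present everywhere), even though each appears exactly once
for word, score in sorted(zip(vocab, scores[0]), key=lambda pair: -pair[1]):
    if score > 0:
        print(f"{word}: {score:.3f}")


Now let's apply the same idea to our full resume dataset: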


from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize the TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer(max_features=5000)  # Limit to top 5000 features

# Fit and transform the preprocessed resume text
tfidf_features = tfidf_vectorizer.fit_transform(resume_data['processed_resume'])

# Convert to a DataFrame for easier viewing
import pandas as pd
feature_names = tfidf_vectorizer.get_feature_names_out()
tfidf_df = pd.DataFrame(tfidf_features.toarray(), columns=feature_names)

# Display the first few rows and columns
print(tfidf_df.iloc[:3, :10])  # First 3 resumes, first 10 features

# Save the features for later use
# Save the features for later use (index=False avoids writing a junk index column)
tfidf_df.to_csv('data/tfidf_features.csv', index=False)

Understanding TF-IDF Output

The output is a matrix where:

  • Each row represents one resume
  • Each column represents one word (feature) from the vocabulary
  • Each cell holds the TF-IDF score of that word in that resume

Higher scores indicate words that are more important to a particular resume.
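
As a quick sanity check, we can pull the highest-scoring terms for a single resume out of the tfidf_df DataFrame built above:


# Ten highest-scoring terms for the first resume
top_terms = tfidf_df.iloc[0].sort_values(ascending=False).head(10)
print(top_terms)
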

Word Embeddings with Word2Vec

While TF-IDF treats words as independent entities, Word2Vec captures semantic relationships between words by representing them as vectors in a multi-dimensional space. Words with similar meanings have similar vector representations.


from gensim.models import Word2Vec
import numpy as np

# For Word2Vec, we need tokenized text (lists of words)
tokenized_resumes = [resume.split() for resume in resume_data['processed_resume']]

# Train Word2Vec model
word2vec_model = Word2Vec(sentences=tokenized_resumes, 
                          vector_size=100,  # Dimension of the word vectors
                          window=5,         # Context window size
                          min_count=1,      # Minimum word frequency (1 keeps every word)
                          workers=4)        # Number of threads to run in parallel

# Save the model
word2vec_model.save("models/word2vec_resume.model")

# Function to create document vectors by averaging word vectors
def document_vector(doc, model):
    # Remove out-of-vocabulary words
    doc = [word for word in doc if word in model.wv.key_to_index]
    if len(doc) == 0:
        return np.zeros(model.vector_size)
    return np.mean([model.wv[word] for word in doc], axis=0)

# Create document vectors for each resume
resume_vectors = [document_vector(resume.split(), word2vec_model) for resume in resume_data['processed_resume']]

# Convert to DataFrame
resume_vectors_df = pd.DataFrame(resume_vectors)
resume_vectors_df.columns = [f'feature_{i}' for i in range(resume_vectors_df.shape[1])]

# Display the first few rows
print(resume_vectors_df.head())

# Save the features
resume_vectors_df.to_csv('data/word2vec_features.csv', index=False)

Understanding Word2Vec Output

The output is a matrix where:

  • Each row represents one resume
  • Each column is one of the 100 embedding dimensions (set by vector_size)
  • Each value is the average of that resume's word vectors, so individual columns don't map to specific words
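
Because similar words end up with similar vectors, we can probe the trained model and the document vectors directly. A brief sketch (the query word 'python' is an assumed vocabulary entry; substitute any word that actually appears in your resumes):


# Words closest to a query word in the embedding space
# ('python' is assumed to be in the vocabulary)
print(word2vec_model.wv.most_similar('python', topn=5))

# Cosine similarity between two resume vectors:
# values near 1 suggest resumes with similar content
v0, v1 = resume_vectors[0], resume_vectors[1]
cosine = np.dot(v0, v1) / (np.linalg.norm(v0) * np.linalg.norm(v1))
print(f"Similarity between resume 0 and resume 1: {cosine:.3f}")
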

Comparing TF-IDF and Word2Vec

Both methods have their strengths:

TF-IDF:

  • Simpler to understand and implement
  • Works well for keyword matching
  • Captures word importance within documents
  • Doesn't capture word relationships or context

Word2Vec:

  • Captures semantic relationships between words
  • Better for understanding context
  • Can handle synonyms and related terms
  • More complex to implement and interpret

For our resume parser, we might use both: TF-IDF for identifying important keywords and Word2Vec for understanding the context and relationships between skills.
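
If we do combine them, one straightforward approach is to concatenate the two feature matrices column-wise. Here's a sketch, assuming the tfidf_features matrix and resume_vectors_df DataFrame from the code above:


from scipy.sparse import hstack, csr_matrix

# Stack the sparse TF-IDF matrix and the dense Word2Vec vectors side by side
combined_features = hstack([tfidf_features, csr_matrix(resume_vectors_df.values)])
print(combined_features.shape)  # (number of resumes, TF-IDF vocabulary size + 100)
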

Visualizing the Features

Let's visualize our features to better understand them:


import matplotlib.pyplot as plt
import numpy as np

# For TF-IDF, let's look at the most important words for a few job categories
from sklearn.preprocessing import normalize

# Normalize the TF-IDF features
normalized_tfidf = normalize(tfidf_features)

# Group resumes by job category
categories = resume_data['Category'].unique()

plt.figure(figsize=(15, 10))
for i, category in enumerate(categories[:5]):  # Plot first 5 categories
    # Get positional row indices of resumes in this category
    category_indices = np.where(resume_data['Category'] == category)[0]

    # Average TF-IDF scores for this category
    # (sparse .mean() returns a 2-D matrix, so flatten it to a 1-D array)
    category_tfidf = np.asarray(normalized_tfidf[category_indices].mean(axis=0)).ravel()

    # Get top 10 words
    top_indices = category_tfidf.argsort()[-10:]
    top_words = [feature_names[j] for j in top_indices]
    top_scores = [category_tfidf[j] for j in top_indices]

    # Plot
    plt.subplot(2, 3, i+1)
    plt.barh(top_words, top_scores)
    plt.title(f'Top words for {category}')

plt.tight_layout()
plt.savefig('results/tfidf_visualization.png')
plt.close()

# For Word2Vec, let's visualize the resume vectors using t-SNE
from sklearn.manifold import TSNE

# Apply t-SNE to reduce dimensions to 2D for visualization
tsne = TSNE(n_components=2, random_state=42)
resume_vectors_2d = tsne.fit_transform(np.array(resume_vectors))  # t-SNE expects a 2-D array

# Plot
plt.figure(figsize=(12, 10))
scatter = plt.scatter(resume_vectors_2d[:, 0], resume_vectors_2d[:, 1], 
                     c=resume_data['Category'].astype('category').cat.codes, 
                     cmap='viridis', alpha=0.6)
plt.colorbar(scatter, label='Job Category')
plt.title('t-SNE Visualization of Resume Vectors')
plt.savefig('results/word2vec_visualization.png')
plt.close()

print("Feature visualizations saved to results folder.")

Next Steps

Now that we have converted our resume text into numerical features, we can move on to the next step: skill clustering. This will help us identify groups of related skills and extract meaningful patterns from our resume data.