Step 4: Skill Clustering
After extracting features from our resume text, the next step is to identify and group related skills. This process, called skill clustering, helps us understand the relationships between different skills and extract meaningful patterns from resumes.
What is Skill Clustering?
Skill clustering is the process of grouping similar or related skills together. For example, "Python," "Java," and "C++" might be clustered together as "Programming Languages," while "Photoshop," "Illustrator," and "InDesign" might form an "Adobe Creative Suite" cluster.
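To make the idea concrete, here is a purely illustrative, hand-written mapping of the kind of grouping we want the algorithm to discover on its own (the group names and members below are made up for this example, not produced by any model):
# Hand-written illustration of the target outcome
example_clusters = {
    "Programming Languages": ["python", "java", "c++"],
    "Adobe Creative Suite": ["photoshop", "illustrator", "indesign"],
}
for name, skills in example_clusters.items():
    print(f"{name}: {', '.join(skills)}")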
This clustering helps us:
- Understand the broader skill categories in our dataset
- Identify candidates with complementary skill sets
- Match candidates to job requirements more effectively
- Reduce the dimensionality of our feature space
Implementing K-Means Clustering
K-Means is one of the most popular clustering algorithms. It partitions the data into K clusters by assigning each point to the cluster with the nearest mean (centroid), then recomputing the centroids and repeating until the assignments stabilize.
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
# We'll use our TF-IDF features from the previous step
# Assuming tfidf_features and the fitted tfidf_vectorizer are available from Step 3
# Determine the optimal number of clusters using the Elbow Method
inertia = []
k_range = range(1, 15)
for k in k_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(tfidf_features)
    inertia.append(kmeans.inertia_)
# Plot the Elbow Method graph
plt.figure(figsize=(10, 6))
plt.plot(k_range, inertia, 'bo-')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method for Optimal k')
plt.savefig('results/elbow_method.png')
plt.close()
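# Optional: cross-check the elbow choice with silhouette scores (higher is better).
# This is an extra sanity check, not part of the main walkthrough; it assumes
# tfidf_features is still in memory and can be slow on large datasets.
from sklearn.metrics import silhouette_score
for k in range(2, 11):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(tfidf_features)
    print(f"k={k}: silhouette score = {silhouette_score(tfidf_features, labels):.3f}")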
# Based on the elbow method, let's choose an appropriate number of clusters
# For this example, let's say we choose k=8
optimal_k = 8 # You should adjust this based on your elbow plot
# Apply K-Means clustering with the optimal k
kmeans = KMeans(n_clusters=optimal_k, random_state=42, n_init=10)
cluster_labels = kmeans.fit_predict(tfidf_features)
# Add cluster labels to our dataset
resume_data['cluster'] = cluster_labels
# Examine the clusters
for cluster_id in range(optimal_k):
    print(f"\nCluster {cluster_id}:")
    cluster_resumes = resume_data[resume_data['cluster'] == cluster_id]
    print(f"Number of resumes: {len(cluster_resumes)}")
    # Get the most common job categories in this cluster
    print("Top job categories:")
    print(cluster_resumes['Category'].value_counts().head(3))
    # Get the most important terms for this cluster
    cluster_center = kmeans.cluster_centers_[cluster_id]
    top_indices = cluster_center.argsort()[-10:][::-1]  # Top 10 terms
    feature_names = tfidf_vectorizer.get_feature_names_out()
    top_terms = [feature_names[i] for i in top_indices]
    print("Top terms:", ", ".join(top_terms))
Visualizing the Clusters
Let's visualize our clusters to better understand the groupings:
from sklearn.decomposition import PCA
# Use PCA to reduce dimensions for visualization
pca = PCA(n_components=2)
reduced_features = pca.fit_transform(tfidf_features.toarray())
# Plot the clusters
plt.figure(figsize=(12, 8))
scatter = plt.scatter(reduced_features[:, 0], reduced_features[:, 1],
                      c=cluster_labels, cmap='viridis', alpha=0.6)
plt.colorbar(scatter, label='Cluster')
# Plot cluster centers
centers = pca.transform(kmeans.cluster_centers_)
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=100, alpha=0.8, marker='X')
plt.title('K-Means Clustering of Resumes')
plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')
plt.savefig('results/kmeans_clusters.png')
plt.close()
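Keep in mind that two PCA components usually retain only a small share of the variance in a high-dimensional TF-IDF space, so treat the 2-D plot as a rough sketch of the cluster structure rather than a faithful map. You can check how much variance the projection keeps using the pca object fitted above:
# How much of the variance do the two components retain?
explained = pca.explained_variance_ratio_
print(f"Variance explained by 2 components: {explained.sum():.1%}")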
Alternative: DBSCAN Clustering
K-Means requires us to specify the number of clusters in advance. An alternative is DBSCAN (Density-Based Spatial Clustering of Applications with Noise), which determines the number of clusters from the density of the data and labels points in low-density regions as noise (cluster -1):
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
# Standardize the features
scaler = StandardScaler()
scaled_features = scaler.fit_transform(tfidf_features.toarray())
# Apply DBSCAN
dbscan = DBSCAN(eps=0.5, min_samples=5) # You may need to adjust these parameters
dbscan_labels = dbscan.fit_predict(scaled_features)
# Add DBSCAN labels to our dataset
resume_data['dbscan_cluster'] = dbscan_labels
# Count the number of clusters (excluding noise points labeled as -1)
n_clusters = len(set(dbscan_labels)) - (1 if -1 in dbscan_labels else 0)
n_noise = list(dbscan_labels).count(-1)
print(f"Number of clusters: {n_clusters}")
print(f"Number of noise points: {n_noise}")
# Visualize DBSCAN clusters
plt.figure(figsize=(12, 8))
scatter = plt.scatter(reduced_features[:, 0], reduced_features[:, 1],
                      c=dbscan_labels, cmap='viridis', alpha=0.6)
plt.colorbar(scatter, label='Cluster')
plt.title('DBSCAN Clustering of Resumes')
plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')
plt.savefig('results/dbscan_clusters.png')
plt.close()
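Euclidean distance on standardized TF-IDF vectors can behave poorly in high-dimensional, sparse spaces, which often leads DBSCAN to label most points as noise. One common variation, not used in the walkthrough above, is to run DBSCAN with cosine distance directly on the TF-IDF matrix. A minimal sketch, assuming tfidf_features from Step 3; the eps value here is only a starting point you would still need to tune:
# Alternative: DBSCAN with cosine distance on the raw TF-IDF matrix
# (eps is now a cosine distance, roughly between 0 and 1 for TF-IDF vectors)
dbscan_cos = DBSCAN(eps=0.7, min_samples=5, metric='cosine')
dbscan_cos_labels = dbscan_cos.fit_predict(tfidf_features)
n_cos_clusters = len(set(dbscan_cos_labels)) - (1 if -1 in dbscan_cos_labels else 0)
print(f"Cosine DBSCAN found {n_cos_clusters} clusters, "
      f"{list(dbscan_cos_labels).count(-1)} noise points")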
Extracting Skills from Clusters
Now that we have our clusters, we can extract the most important skills from each cluster:
# Create a dictionary to store skills by cluster
cluster_skills = {}
for cluster_id in range(optimal_k):
    # Get the cluster center
    center = kmeans.cluster_centers_[cluster_id]
    # Get the top 20 terms for this cluster
    top_indices = center.argsort()[-20:][::-1]
    top_terms = [feature_names[i] for i in top_indices]
    # Store in our dictionary
    cluster_skills[f"Cluster {cluster_id}"] = top_terms
# Convert to DataFrame for easier viewing
cluster_skills_df = pd.DataFrame(cluster_skills)
print(cluster_skills_df)
# Save to CSV
cluster_skills_df.to_csv('results/cluster_skills.csv')
Naming the Skill Clusters
To make our clusters more interpretable, we can assign meaningful names based on the dominant skills:
# This would typically be done manually after examining the clusters
# Here's an example of how you might name them
cluster_names = {
    0: "Software Development",
    1: "Data Science & Analytics",
    2: "Marketing & Communications",
    3: "Project Management",
    4: "Design & Creative",
    5: "Sales & Business Development",
    6: "Administrative & Support",
    7: "Engineering & Technical"
}
# Add cluster names to our dataset
resume_data['cluster_name'] = resume_data['cluster'].map(cluster_names)
# Display the distribution of named clusters
print(resume_data['cluster_name'].value_counts())
The cluster names you choose should reflect the dominant skills and job categories in each cluster. This makes it easier to understand and work with the clusters in subsequent steps.
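Because both the K-Means model and the TF-IDF vectorizer are already fitted, you can also assign new, unseen text, such as a fresh resume or a job description, to one of these named clusters. Here is a minimal sketch, assuming the fitted tfidf_vectorizer from Step 3 is available and that the text has been preprocessed the same way as the training resumes:
def assign_to_cluster(text):
    """Assign a new piece of text to one of the named skill clusters."""
    # Vectorize with the SAME vectorizer fitted in Step 3
    vector = tfidf_vectorizer.transform([text])
    cluster_id = int(kmeans.predict(vector)[0])
    return cluster_names.get(cluster_id, f"Cluster {cluster_id}")

# Example usage with a short job description
print(assign_to_cluster("Looking for a Python developer with Django and REST API experience"))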
Next Steps
Now that we've identified skill clusters in our resume data, we can move on to contextual analysis. In the next step, we'll implement techniques like Named Entity Recognition (NER) to understand words in context rather than just matching keywords.