Building an AI Agent for Value Investing

A step-by-step guide for beginners

Step 3: Preprocess and Normalize the Data

Now that you've collected financial and market data, the next step is to preprocess and normalize it so it is suitable for analysis. Raw financial data often contains inconsistencies, missing values, and metrics on very different scales, all of which need to be addressed before your AI agent can analyze it effectively.

Types of Data Preprocessing

There are two main types of data you'll need to preprocess for your value investing AI agent:

  • Numerical Data: Financial metrics, ratios, and time series data
  • Text Data: News articles, financial reports, and analyst comments

Numerical Data Preprocessing

Financial numerical data often requires several preprocessing steps:

  • Handling Missing Values: Financial data may have gaps due to various reasons (e.g., a company not reporting certain metrics).
  • Outlier Detection and Treatment: Extreme values can skew your analysis and need to be identified and handled appropriately.
  • Normalization/Standardization: Different financial metrics have different scales, making direct comparisons difficult without normalization.
  • Time Series Alignment: Ensuring that time-dependent data points are properly aligned across different data sources (a short alignment sketch follows this list).
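
Time series alignment is not covered in the larger example later in this step, so here is a minimal sketch using pandas. The dates, prices, and EPS values are hypothetical, purely for illustration: each price observation is matched to the most recently reported fundamental value.

Python: Time Series Alignment (illustrative sketch)
import pandas as pd

# Hypothetical daily closing prices (dates and values are illustrative)
prices = pd.DataFrame(
    {'close': [150.0, 152.5, 149.8, 155.1]},
    index=pd.to_datetime(['2024-01-02', '2024-02-01', '2024-03-01', '2024-04-01'])
)

# Hypothetical quarterly fundamentals (e.g., reported EPS)
fundamentals = pd.DataFrame(
    {'eps': [1.46, 1.52]},
    index=pd.to_datetime(['2023-12-31', '2024-03-31'])
)

# Align each price date with the most recently reported EPS (backward-looking match)
aligned = pd.merge_asof(
    prices.sort_index(), fundamentals.sort_index(),
    left_index=True, right_index=True, direction='backward'
)
print(aligned)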

Text Data Preprocessing

Text data from financial reports and news requires Natural Language Processing (NLP) techniques:

  • Tokenization: Breaking text into individual words or tokens.
  • Stop Word Removal: Eliminating common words that don't add analytical value.
  • Named Entity Recognition: Identifying and extracting entities like company names, products, or key people.
  • Sentiment Analysis: Determining the sentiment (positive, negative, neutral) of text content.

Preprocessing Numerical Financial Data

Let's look at how to preprocess numerical financial data using Python's pandas and scikit-learn libraries:

Python: Preprocessing Numerical Financial Data
# Install required libraries (run this once)
# pip install pandas numpy matplotlib scikit-learn

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.impute import SimpleImputer

# Sample financial data (in practice, this would be loaded from your data sources)
data = {
    'Ticker': ['AAPL', 'MSFT', 'GOOGL', 'AMZN', 'META', 'TSLA', 'NVDA', 'JPM', 'V', 'JNJ'],
    'PE_Ratio': [25.6, 30.2, 22.8, 40.5, 18.7, 55.3, 45.8, 12.3, 28.9, 15.6],
    'PB_Ratio': [35.2, 12.8, 5.3, 10.2, 4.8, 15.7, 25.3, 1.5, 12.8, 5.2],
    'ROE': [0.35, 0.42, 0.25, 0.22, 0.18, 0.15, 0.38, 0.12, 0.32, 0.21],
    'Debt_to_Equity': [1.2, 0.5, 0.3, 0.8, 0.4, 0.6, 0.2, 2.5, 0.7, 0.9],
    'FCF_Yield': [0.03, 0.025, 0.02, 0.015, 0.035, 0.01, 0.02, 0.045, 0.03, 0.04],
    'Dividend_Yield': [0.005, 0.008, 0.0, 0.0, 0.0, 0.0, 0.001, 0.03, 0.007, 0.025]
}

# Create DataFrame
df = pd.DataFrame(data)
print("Original Financial Data:")
print(df.head())

# 1. Check for missing values
print("\nMissing Values Count:")
print(df.isnull().sum())

# Let's introduce some missing values for demonstration
df.loc[2, 'PE_Ratio'] = np.nan
df.loc[5, 'PB_Ratio'] = np.nan
df.loc[8, 'ROE'] = np.nan

print("\nData with Introduced Missing Values:")
print(df.head(10))
print("\nMissing Values Count:")
print(df.isnull().sum())

# 2. Handle missing values
# Method 1: Fill with mean
df_mean_filled = df.copy()
df_mean_filled.fillna(df_mean_filled.mean(numeric_only=True), inplace=True)  # numeric_only avoids errors on the Ticker column

print("\nData after Filling Missing Values with Mean:")
print(df_mean_filled.head(10))

# Method 2: Using SimpleImputer from scikit-learn
imputer = SimpleImputer(strategy='median')
numeric_columns = ['PE_Ratio', 'PB_Ratio', 'ROE', 'Debt_to_Equity', 'FCF_Yield', 'Dividend_Yield']
df_imputed = df.copy()
df_imputed[numeric_columns] = imputer.fit_transform(df_imputed[numeric_columns])

print("\nData after Imputing Missing Values with Median:")
print(df_imputed.head(10))

# 3. Detect and handle outliers
# Method: Z-score
def detect_outliers_zscore(df, column, threshold=3):
    z_scores = np.abs((df[column] - df[column].mean()) / df[column].std())
    return df[z_scores > threshold].index

# Check for outliers in PE_Ratio
outlier_indices = detect_outliers_zscore(df_imputed, 'PE_Ratio')
print(f"\nOutliers in PE_Ratio (Z-score method): {list(outlier_indices)}")

# Handle outliers by capping
def cap_outliers(df, column, lower_percentile=0.05, upper_percentile=0.95):
    lower_limit = df[column].quantile(lower_percentile)
    upper_limit = df[column].quantile(upper_percentile)
    df[column] = df[column].clip(lower=lower_limit, upper=upper_limit)
    return df

df_capped = df_imputed.copy()
for col in numeric_columns:
    df_capped = cap_outliers(df_capped, col)

print("\nData after Capping Outliers:")
print(df_capped.head(10))

# 4. Normalize/Standardize data
# Method 1: Min-Max Scaling (values between 0 and 1)
scaler_minmax = MinMaxScaler()
df_normalized = df_capped.copy()
df_normalized[numeric_columns] = scaler_minmax.fit_transform(df_normalized[numeric_columns])

print("\nData after Min-Max Normalization:")
print(df_normalized.head(10))

# Method 2: Standardization (Z-score normalization)
scaler_standard = StandardScaler()
df_standardized = df_capped.copy()
df_standardized[numeric_columns] = scaler_standard.fit_transform(df_standardized[numeric_columns])

print("\nData after Standardization:")
print(df_standardized.head(10))

# 5. Visualize the effect of normalization
plt.figure(figsize=(15, 10))

# Unscaled data (after imputation and outlier capping)
plt.subplot(3, 1, 1)
df_capped[numeric_columns].boxplot()
plt.title('Unscaled Data (After Imputation and Outlier Capping)')
plt.xticks(rotation=45)

# Min-Max Normalized
plt.subplot(3, 1, 2)
df_normalized[numeric_columns].boxplot()
plt.title('Min-Max Normalized Data')
plt.xticks(rotation=45)

# Standardized
plt.subplot(3, 1, 3)
df_standardized[numeric_columns].boxplot()
plt.title('Standardized Data')
plt.xticks(rotation=45)

plt.tight_layout()
plt.savefig('normalization_comparison.png')
print("\nNormalization comparison chart saved as 'normalization_comparison.png'")

# 6. Save the preprocessed data
df_standardized.to_csv('preprocessed_financial_data.csv', index=False)
print("\nPreprocessed data saved to 'preprocessed_financial_data.csv'")

Preprocessing Text Data with NLP

Now let's look at how to preprocess text data from financial news and reports using Natural Language Processing techniques:

Python: Preprocessing Financial Text Data with NLP
# Install required libraries (run this once)
# pip install nltk spacy pandas matplotlib textblob

import nltk
import spacy
import pandas as pd
import matplotlib.pyplot as plt
from textblob import TextBlob
from collections import Counter
import re

# Download necessary NLTK data (run this once)
# nltk.download('punkt')
# nltk.download('stopwords')
# nltk.download('wordnet')

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Download the spaCy English model (run this once from your terminal)
# python -m spacy download en_core_web_sm

# Sample financial news headlines
financial_news = [
    "Apple reports record quarterly revenue, beating analyst expectations",
    "Microsoft's cloud business drives strong growth in Q2 earnings",
    "Tesla faces production challenges amid supply chain disruptions",
    "Amazon announces 20-for-1 stock split and $10 billion buyback",
    "Federal Reserve raises interest rates by 0.25% to combat inflation",
    "Oil prices surge as global demand recovers and supply remains tight",
    "Google parent Alphabet misses earnings expectations, shares drop 5%",
    "JPMorgan Chase reports decline in investment banking revenue",
    "Housing market shows signs of cooling as mortgage rates rise",
    "Nvidia stock soars on strong demand for AI chips and data center growth"
]

# Create a DataFrame
news_df = pd.DataFrame({'headline': financial_news})
print("Original Financial News Headlines:")
print(news_df)

# 1. Basic text preprocessing
def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove special characters and numbers
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Tokenize
    tokens = word_tokenize(text)
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]
    # Lemmatization
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    return tokens

# Apply preprocessing to headlines
news_df['processed_tokens'] = news_df['headline'].apply(preprocess_text)
print("\nProcessed Tokens:")
print(news_df[['headline', 'processed_tokens']].head())

# 2. Named Entity Recognition using spaCy
nlp = spacy.load('en_core_web_sm')

def extract_entities(text):
    doc = nlp(text)
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    return entities

news_df['entities'] = news_df['headline'].apply(extract_entities)
print("\nNamed Entities:")
print(news_df[['headline', 'entities']].head())

# 3. Sentiment Analysis using TextBlob
def analyze_sentiment(text):
    blob = TextBlob(text)
    return blob.sentiment.polarity  # Returns a value between -1 (negative) and 1 (positive)

news_df['sentiment_score'] = news_df['headline'].apply(analyze_sentiment)
print("\nSentiment Analysis:")
print(news_df[['headline', 'sentiment_score']].head(10))

# Categorize sentiment
def categorize_sentiment(score):
    if score > 0.1:
        return 'Positive'
    elif score < -0.1:
        return 'Negative'
    else:
        return 'Neutral'

news_df['sentiment_category'] = news_df['sentiment_score'].apply(categorize_sentiment)
print("\nSentiment Categories:")
print(news_df[['headline', 'sentiment_category']].head(10))

# 4. Visualize sentiment distribution
# Keep a fixed category order so the bar colors match: green=Positive, gray=Neutral, red=Negative
sentiment_counts = news_df['sentiment_category'].value_counts().reindex(['Positive', 'Neutral', 'Negative'], fill_value=0)
plt.figure(figsize=(10, 6))
sentiment_counts.plot(kind='bar', color=['green', 'gray', 'red'])
plt.title('Sentiment Distribution in Financial News Headlines')
plt.xlabel('Sentiment')
plt.ylabel('Count')
plt.tight_layout()
plt.savefig('sentiment_distribution.png')
print("\nSentiment distribution chart saved as 'sentiment_distribution.png'")

# 5. Extract most common words
all_words = [word for tokens in news_df['processed_tokens'] for word in tokens]
word_freq = Counter(all_words)
common_words = word_freq.most_common(10)

print("\nMost Common Words in Headlines:")
for word, count in common_words:
    print(f"{word}: {count}")

# Visualize common words
plt.figure(figsize=(12, 6))
words, counts = zip(*common_words)
plt.barh(words, counts, color='skyblue')
plt.title('Most Common Words in Financial News Headlines')
plt.xlabel('Count')
plt.tight_layout()
plt.savefig('common_words.png')
print("\nCommon words chart saved as 'common_words.png'")

# 6. Save the preprocessed data
news_df.to_csv('preprocessed_financial_news.csv', index=False)
print("\nPreprocessed news data saved to 'preprocessed_financial_news.csv'")

Combining Numerical and Text Data

For a comprehensive value investing analysis, you'll often need to combine insights from both numerical financial data and text-based sentiment analysis:

Integration Strategies

Here are some approaches to combining numerical and text data:

Feature Concatenation

The simplest approach is to create a combined feature set that includes both numerical features (like P/E ratio, ROE) and text-derived features (like sentiment scores).


# Example of feature concatenation; numerical_features and text_features are
# placeholder DataFrames (e.g., indexed by ticker) produced by the two pipelines above
combined_features = pd.concat([
    numerical_features,  # DataFrame with financial ratios
    text_features        # DataFrame with sentiment scores, entity counts, etc.
], axis=1)

Weighted Scoring

Create separate scores for numerical and text analysis, then combine them with appropriate weights:


# Example of weighted scoring; the 0.7/0.3 weights are illustrative, not recommendations
final_score = (0.7 * financial_score) + (0.3 * sentiment_score)
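
As a concrete (and entirely hypothetical) illustration, the sketch below ranks a few tickers by such a composite score. The value_score and sentiment_score columns are made-up stand-ins for outputs of the numerical and text pipelines above.

import pandas as pd

# Hypothetical per-ticker scores (stand-ins for real pipeline outputs)
scores = pd.DataFrame({
    'Ticker': ['AAPL', 'MSFT', 'JPM'],
    'value_score': [0.62, 0.48, 0.81],       # e.g., average of normalized value metrics
    'sentiment_score': [0.35, 0.10, -0.05]   # e.g., average headline sentiment
})

# Illustrative 0.7/0.3 weighting, then rank from most to least attractive
scores['final_score'] = 0.7 * scores['value_score'] + 0.3 * scores['sentiment_score']
print(scores.sort_values('final_score', ascending=False))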

Multi-modal Models

More advanced approaches use models specifically designed to handle multiple types of input data:

  • Ensemble methods that combine predictions from separate models (a minimal sketch follows this list)
  • Neural networks with multiple input branches for different data types
  • Transformer-based models that can process both numerical and text inputs
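
To make the ensemble idea concrete, here is a minimal sketch, not a production model: it trains two simple scikit-learn classifiers on a randomly generated feature table that mimics the combined numerical-plus-sentiment features, with a made-up binary label (e.g., "outperformed the market").

Python: Ensemble Combining Numerical and Sentiment Features (illustrative sketch)
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression

# Randomly generated stand-ins for the combined numerical + sentiment features built above
rng = np.random.default_rng(42)
features = pd.DataFrame({
    'PE_Ratio': rng.normal(0, 1, 100),          # standardized numerical features
    'ROE': rng.normal(0, 1, 100),
    'sentiment_score': rng.uniform(-1, 1, 100)  # text-derived feature
})
labels = rng.integers(0, 2, 100)                # illustrative binary target (e.g., outperformed or not)

# Soft voting averages the predicted probabilities of the two base models
ensemble = VotingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(n_estimators=100, random_state=0)),
        ('lr', LogisticRegression(max_iter=1000)),
    ],
    voting='soft'
)
ensemble.fit(features, labels)
print(ensemble.predict_proba(features.head())[:, 1])  # probability of the positive class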

Knowledge Check

Which of the following is NOT a common preprocessing step for numerical financial data?

  • Handling missing values
  • Normalizing data to a common scale
  • Tokenization
  • Outlier detection and treatment

What is the purpose of sentiment analysis in financial text data?

  • To count the number of words in a financial report
  • To determine if news or reports express positive, negative, or neutral opinions about a company
  • To translate financial reports into different languages
  • To compress text data for more efficient storage