SVM Lab¶
In [15]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score
from sklearn.preprocessing import OrdinalEncoder
import numpy as np
Load the Data¶
In [16]:
# Step 1: Load the dataset (replace the URL with the path to your own copy if needed)
# The lab uses two columns: 'label' (Fake/True) and the article title (text)
url = "https://aet-cs.github.io/white/ML/data/Fake_News.csv"
data = pd.read_csv(url, sep=',')
In [17]:
data = data.dropna()
EDA¶
Do some basic EDA here. How balanced is the dataset? What are the data types of the columns?
In [ ]:
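One way to approach the EDA questions above is sketched below. It runs on a tiny made-up DataFrame (the rows and the `toy` name are hypothetical, and the `title` column name is an assumption) so that it is self-contained; the same calls should work on `data` once it is loaded.

```python
import pandas as pd

# Hypothetical stand-in for the lab's DataFrame; these rows are invented
toy = pd.DataFrame({
    "title": ["Headline one", "Headline two", "Headline three", "Headline four"],
    "label": ["Fake", "True", "Fake", "Fake"],
})

shape = toy.shape                                        # (rows, columns)
dtypes = toy.dtypes                                      # data type of each column
missing = toy.isna().sum()                               # missing values per column
class_share = toy["label"].value_counts(normalize=True)  # fraction of each class

print(shape)
print(dtypes)
print(missing)
print(class_share)
```

If one class has a much larger share than the other, plain accuracy can be misleading, so it is worth checking this before training.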
We will ordinal-encode the target column.
In [8]:
# Encode labels into 0/1 using OrdinalEncoder (don't change)
encoder = OrdinalEncoder()
data['label'] = encoder.fit_transform(data[['label']])
In [9]:
# Print the mapping from labels to numerical values (don't change)
label_mapping = {category: idx for idx, category in enumerate(encoder.categories_[0])}
print("Label Encoding:")
for label, value in label_mapping.items():
    print(f"'{label}' -> {value}")
Label Encoding:
'Fake' -> 0
'True' -> 1
Define X and y to be the title and the label, respectively.
In [10]:
# define the data
# Your code here
In [11]:
# Step 2: Split the dataset into training and testing sets
# Make a train/test split
# Your code here
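A minimal sketch of these two steps follows, assuming the text column is named `title` (check `data.columns` for the real name). The `toy` DataFrame is a hypothetical stand-in, and `test_size`, `random_state`, and `stratify` are illustrative choices rather than values the lab prescribes.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the encoded lab data (labels already 0.0 / 1.0)
toy = pd.DataFrame({
    "title": [f"headline {i}" for i in range(10)],
    "label": [0.0, 1.0] * 5,
})

# X is the raw text, y is the encoded target
X = toy["title"]
y = toy["label"]

# Hold out 20% for testing; stratify keeps the class ratio similar in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(len(X_train), len(X_test))
```

Keeping `X` as a pandas Series matters later: the misclassification cell indexes the test text with `X_test.iloc[idx]`.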
In [12]:
# Step 3: Text encoding using bag-of-words counts (CountVectorizer)
# Don't change
vectorizer = CountVectorizer(stop_words='english', max_features=1000)
X_train_counts = vectorizer.fit_transform(X_train)
X_test_counts = vectorizer.transform(X_test)
In [13]:
# Step 4: Train the SVM
# your code here
# be sure to use the counts vectors
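The training step might look like the sketch below. It is not the lab's reference solution: the example titles and labels are invented, and the linear kernel is just one reasonable starting point for sparse text features.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.svm import SVC

# Invented toy data: 0 = Fake, 1 = True
train_titles = [
    "SHOCKING secret the media hides from you",
    "You won't believe this outrageous scandal",
    "Senate passes budget bill after debate",
    "Central bank raises interest rates",
]
train_labels = [0, 0, 1, 1]
test_titles = ["Outrageous secret scandal exposed", "Senate debate on interest rates"]
test_labels = [0, 1]

# Same encoding as the lab: fit on the training text only, then transform the test text
vectorizer = CountVectorizer(stop_words='english', max_features=1000)
X_train_counts = vectorizer.fit_transform(train_titles)
X_test_counts = vectorizer.transform(test_titles)

# Fit the SVM on the count vectors and predict on the held-out titles
model = SVC(kernel='linear')
model.fit(X_train_counts, train_labels)
y_pred = model.predict(X_test_counts)

acc = accuracy_score(test_labels, y_pred)
print("Accuracy:", acc)
print(classification_report(test_labels, y_pred, zero_division=0))
```

In the lab itself, `X_train_counts`, `X_test_counts`, `y_train`, and `y_test` from the earlier cells take the place of the toy variables, and the resulting `y_pred` feeds the cells that follow.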
Print Results¶
In [107]:
# Find misclassified examples
y_pred = np.array(y_pred)
y_test = np.array(y_test)
misclassified_indices = (y_pred != y_test).nonzero()[0]  # Indices where predictions differ from true labels
# Print a sample of the misclassified examples
print("\nMisclassified Examples:")
for idx in misclassified_indices[10:20]:  # Limit to 10 examples (the 11th through 20th)
    print(f"Message: {X_test.iloc[idx]}")
    print(f"True Label: {encoder.inverse_transform([[y_test[idx]]])[0][0]}")
    print(f"Predicted Label: {encoder.inverse_transform([[y_pred[idx]]])[0][0]}")
    print("-" * 50)
Misclassified Examples:
Message: Meet The CA Sheriff Who Won’t Be Bullied By Obama And Illegal Immigrant Activists Who Believe The Laws Don’t Apply To Lawbreakers
True Label: Fake
Predicted Label: True
--------------------------------------------------
Message: ALARMING: NSA Refuses to Release Clinton-Lynch Tarmac Transcript with Lame Excuse
True Label: Fake
Predicted Label: True
--------------------------------------------------
Message: Norway appoints its first female foreign minister
True Label: True
Predicted Label: Fake
--------------------------------------------------
Message: Anti-Abortion Laws Collapse In Major Defeat For The Right
True Label: Fake
Predicted Label: True
--------------------------------------------------
Message: Shaquille O’Neal: “The Earth is flat. Yes, it is.”
True Label: Fake
Predicted Label: True
--------------------------------------------------
Message: THE LIST OF WHO’S WHO TAKING ADVANTAGE OF FAILED EU AUSTERITY EXPERIMENT IN GREECE
True Label: Fake
Predicted Label: True
--------------------------------------------------
Message: Obama to visit Hiroshima, will not apologize for World War Two bombing
True Label: True
Predicted Label: Fake
--------------------------------------------------
Message: White House Staff Reportedly Went Behind Trump’s Back On HUGE Issue Because Trump Is Too Reckless
True Label: Fake
Predicted Label: True
--------------------------------------------------
Message: Tensions simmer below surface as Trump, Republicans map strategy
True Label: True
Predicted Label: Fake
--------------------------------------------------
Message: Susan Collins Bucked Party, Voted To Protect Kids, Seniors, Women and Entitlements
True Label: Fake
Predicted Label: True
--------------------------------------------------
Confusion Matrix¶
In [1]:
# print/draw a confusion matrix
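A small sketch of the printing half of this step, using `sklearn.metrics.confusion_matrix`. The label arrays here are invented for illustration; in the lab, `y_test` and `y_pred` would be passed instead.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Invented labels for illustration: 0 = Fake, 1 = True
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y_hat = np.array([0, 1, 0, 1, 1, 0, 1, 0])

# Rows are true classes, columns are predicted classes, in the order given by `labels`
cm = confusion_matrix(y_true, y_hat, labels=[0, 1])
print(cm)
```

Here the top-right entry counts Fake titles predicted as True, and the bottom-left entry counts True titles predicted as Fake. To draw the matrix rather than print it, `ConfusionMatrixDisplay.from_predictions` in `sklearn.metrics` is one option, assuming matplotlib is installed.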
TF-IDF¶
Redo the above analysis with a TF-IDF encoding (TfidfVectorizer) instead of the bag-of-words counts from CountVectorizer.
In [ ]:
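The encoding swap could look like the following sketch, which reuses the `stop_words` and `max_features` settings from the count-based cell (an assumption; the lab does not say which settings to use here). The titles are invented.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented example titles
titles = [
    "Senate passes budget bill after debate",
    "Shocking scandal rocks the senate",
    "Central bank raises interest rates",
]

# TF-IDF down-weights words that appear in many documents
tfidf = TfidfVectorizer(stop_words='english', max_features=1000)
X_tfidf = tfidf.fit_transform(titles)

# With the default norm='l2', every non-empty row has unit length
row_norms = np.sqrt(np.asarray(X_tfidf.multiply(X_tfidf).sum(axis=1))).ravel()

print(X_tfidf.shape)
print(row_norms)
```

As with the counts, the vectorizer should be fit on the training text only and then applied to the test text with `transform`; the SVM training and evaluation steps stay the same.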
Optimize¶
There are lots of parameters to SVM/SVC. Try them out and see how well you can do!
In [ ]:
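One possible way to explore the parameters systematically is a cross-validated grid search. The sketch below is an illustration under stated assumptions: the titles are invented, the grid values for `C` and `kernel` are arbitrary starting points, and `cv=2` is chosen only because the toy data is tiny.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Invented toy data: 0 = Fake, 1 = True
titles = [
    "Shocking secret scandal exposed",
    "Outrageous hoax stuns the internet",
    "Unbelievable scandal the media hides",
    "Secret plot exposed by insider",
    "Senate passes budget bill",
    "Central bank raises interest rates",
    "Parliament approves trade agreement",
    "Court upholds election results",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# Putting the vectorizer inside the pipeline refits it on each training fold,
# so no vocabulary information leaks from the validation fold
pipe = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words='english', max_features=1000)),
    ("svc", SVC()),
])

# Illustrative grid; step-name prefixes route each parameter to the right step
param_grid = {
    "svc__C": [0.1, 1, 10],
    "svc__kernel": ["linear", "rbf"],
}

search = GridSearchCV(pipe, param_grid, cv=2, scoring="accuracy")
search.fit(titles, labels)

print(search.best_params_)
print(search.best_score_)
```

On the real dataset, a larger `cv` (for example 5) and a wider grid, including `gamma` for the RBF kernel, would be more informative, at the cost of longer run time.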