SVM Lab¶

In [15]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score
from sklearn.preprocessing import OrdinalEncoder
import numpy as np

Load the Data¶

In [16]:
# Step 1: Load the dataset (replace with the path to your dataset)
# The dataset includes a 'title' column (headline text) and a 'label' column (Fake/True)
url = "https://aet-cs.github.io/white/ML/data/Fake_News.csv"
data = pd.read_csv(url, sep=',')
In [17]:
data = data.dropna()

EDA¶

Do some basic EDA here. How balanced is the dataset? What are the data types of the columns?
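One possible starting point for the EDA is sketched below. It runs on a small made-up frame (`toy`), not the real data; only the `title` and `label` column names are taken from this lab, and everything else is an assumption for illustration.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real data; only the 'title'
# and 'label' column names come from this lab.
toy = pd.DataFrame({
    "title": ["Headline one", "Headline two", "Headline three", "Headline four"],
    "label": ["Fake", "True", "Fake", "Fake"],
})

# Data type of each column
dtypes = toy.dtypes

# Class balance, as raw counts and as fractions of the whole
label_counts = toy["label"].value_counts()
label_fractions = toy["label"].value_counts(normalize=True)

# Title length in characters, a quick sanity check on the text column
title_lengths = toy["title"].str.len()

print(dtypes)
print(label_counts)
print(label_fractions)
print(title_lengths.describe())
```

On the real data, replace `toy` with `data`. A heavily skewed `label_fractions` would be a reason to read accuracy with caution later on.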

In [ ]:
 

We will ordinal-encode the target column.

In [8]:
# Encode labels using OrdinalEncoder into 0,1 (don't change)
encoder = OrdinalEncoder()
data['label'] = encoder.fit_transform(data[['label']])
In [9]:
# Print the mapping from labels to numerical values (don't change)
label_mapping = {category: idx for idx, category in enumerate(encoder.categories_[0])}
print("Label Encoding:")
for label, value in label_mapping.items():
    print(f"'{label}' -> {value}")
Label Encoding:
'Fake' -> 0
'True' -> 1

Define X and y to be the title and label columns.

In [10]:
# define the data
# Your code here
In [11]:
# Step 2: Split the dataset into training and testing sets
# Make a train/test split
# Your code here
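A minimal sketch of these two steps on made-up data follows. The `toy` frame, the `test_size` and `random_state` values, and the use of `stratify` are all assumptions chosen for illustration, not requirements of the lab.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data; the real lab uses the `data` frame loaded earlier.
toy = pd.DataFrame({
    "title": [f"headline number {i}" for i in range(10)],
    "label": [0.0, 1.0] * 5,
})

# Select single columns so X is a 1-D Series of strings,
# which is what the text vectorizers expect.
X = toy["title"]
y = toy["label"]

# test_size, random_state and stratify are illustrative choices.
# stratify=y keeps the class proportions similar in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(len(X_train), len(X_test))
```

Note that `X` is a Series, not a one-column DataFrame: the vectorizer in the next step iterates over its input, and iterating over a DataFrame yields column names instead of rows.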
In [12]:
# Step 3: Text encoding using bag-of-words counts (CountVectorizer)
# Don't change
vectorizer = CountVectorizer(stop_words='english', max_features=1000)
X_train_counts = vectorizer.fit_transform(X_train)
X_test_counts = vectorizer.transform(X_test)
In [13]:
# Step 4: Train the SVM
# your code here
# be sure to use the counts vectors
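The sketch below shows one way Steps 3 and 4 could fit together, on a handful of made-up headlines. The toy texts, the linear kernel, and `C=1.0` are assumptions for illustration; the lab leaves the choice of SVM settings open.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report

# Hypothetical toy headlines; 1 = True, 0 = Fake (same encoding as above)
X_train = [
    "senate passes budget bill",
    "minister visits capital for talks",
    "court rules on trade dispute",
    "SHOCKING secret they hide from you",
    "you will not BELIEVE this scandal",
    "ALARMING truth exposed by insider",
]
y_train = [1, 1, 1, 0, 0, 0]
X_test = ["senate rules on budget dispute", "SHOCKING scandal exposed"]
y_test = [1, 0]

# Same vectorizer settings as Step 3: fit on train, only transform test
vectorizer = CountVectorizer(stop_words="english", max_features=1000)
X_train_counts = vectorizer.fit_transform(X_train)
X_test_counts = vectorizer.transform(X_test)

# A linear kernel is one common starting point for sparse text features
model = SVC(kernel="linear", C=1.0)
model.fit(X_train_counts, y_train)

y_pred = model.predict(X_test_counts)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred, zero_division=0))
```

The model is trained and evaluated on the count matrices, never on the raw strings. `y_pred` is the name the results cell below expects.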

Print Results¶

In [107]:
# Find misclassified examples
y_pred = np.array(y_pred)
y_test = np.array(y_test)
misclassified_indices = (y_pred != y_test).nonzero()[0]  # Indices where predictions differ from true labels

# Print a sample of misclassified examples
print("\nMisclassified Examples:")
for idx in misclassified_indices[10:20]:  # Show 10 examples (the 11th through 20th)
    print(f"Message: {X_test.iloc[idx]}")
    print(f"True Label: {encoder.inverse_transform([[y_test[idx]]])[0][0]}")
    print(f"Predicted Label: {encoder.inverse_transform([[y_pred[idx]]])[0][0]}")
    print("-" * 50)
Misclassified Examples:
Message: Meet The CA Sheriff Who Won’t Be Bullied By Obama And Illegal Immigrant Activists Who Believe The Laws Don’t Apply To Lawbreakers
True Label: Fake
Predicted Label: True
--------------------------------------------------
Message: ALARMING: NSA Refuses to Release Clinton-Lynch Tarmac Transcript with Lame Excuse
True Label: Fake
Predicted Label: True
--------------------------------------------------
Message: Norway appoints its first female foreign minister
True Label: True
Predicted Label: Fake
--------------------------------------------------
Message:  Anti-Abortion Laws Collapse In Major Defeat For The Right
True Label: Fake
Predicted Label: True
--------------------------------------------------
Message: Shaquille O’Neal: “The Earth is flat. Yes, it is.”
True Label: Fake
Predicted Label: True
--------------------------------------------------
Message: THE LIST OF WHO’S WHO TAKING ADVANTAGE OF FAILED EU AUSTERITY EXPERIMENT IN GREECE
True Label: Fake
Predicted Label: True
--------------------------------------------------
Message: Obama to visit Hiroshima, will not apologize for World War Two bombing
True Label: True
Predicted Label: Fake
--------------------------------------------------
Message:  White House Staff Reportedly Went Behind Trump’s Back On HUGE Issue Because Trump Is Too Reckless
True Label: Fake
Predicted Label: True
--------------------------------------------------
Message: Tensions simmer below surface as Trump, Republicans map strategy
True Label: True
Predicted Label: Fake
--------------------------------------------------
Message:  Susan Collins Bucked Party, Voted To Protect Kids, Seniors, Women and Entitlements
True Label: Fake
Predicted Label: True
--------------------------------------------------

Confusion Matrix¶

In [1]:
# print/draw a confusion matrix
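As a rough sketch, a confusion matrix can be computed with `sklearn.metrics.confusion_matrix`. The label arrays here are made up for illustration; in the lab they would be `y_test` and `y_pred`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels; in the lab these would be y_test and y_pred
y_true_demo = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y_pred_demo = np.array([0, 1, 0, 1, 0, 1, 1, 1])

# Rows are true classes, columns are predicted classes, in `labels` order
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1])
print(cm)

# For a binary problem the four cells unpack in this order
tn, fp, fn, tp = cm.ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```

With the encoding used in this lab (0 = Fake, 1 = True), the top-right cell counts Fake headlines predicted as True. To draw the matrix instead of printing it, `sklearn.metrics.ConfusionMatrixDisplay.from_predictions` is one option, assuming matplotlib is installed.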

TF-IDF¶

Redo the above analysis with a TF-IDF encoding (TfidfVectorizer) instead of raw counts (CountVectorizer, i.e. bag of words).
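The swap can be as small as changing the vectorizer class. The sketch below uses made-up headlines and mirrors the settings from Step 3; the variable names `docs_train`, `docs_test`, and `tfidf` are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy headlines standing in for X_train and X_test
docs_train = [
    "senate passes budget bill",
    "minister visits capital for talks",
    "SHOCKING secret they hide from you",
    "ALARMING truth exposed by insider",
]
docs_test = ["senate budget secret exposed"]

# Same options as the CountVectorizer in Step 3, different weighting
tfidf = TfidfVectorizer(stop_words="english", max_features=1000)
X_train_tfidf = tfidf.fit_transform(docs_train)
X_test_tfidf = tfidf.transform(docs_test)

# With the default norm="l2", each non-empty row has unit Euclidean length
squared = X_train_tfidf.multiply(X_train_tfidf).sum(axis=1)
row_norms = np.sqrt(np.asarray(squared).ravel())

print(X_train_tfidf.shape, X_test_tfidf.shape)
print(row_norms)
```

Unlike raw counts, TF-IDF down-weights words that appear in many documents and rescales each row, so long and short headlines end up on a comparable scale before they reach the SVM.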

In [ ]:
 

Optimize¶

There are lots of parameters to SVM/SVC. Try them out and see how well you can do!
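One way to explore the parameters systematically is a cross-validated grid search. This is only a sketch on made-up data: the pipeline step names (`vec`, `svm`), the grid values, and `cv=3` are assumptions, and a search over the real dataset may take a while.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical toy headlines; 1 = True, 0 = Fake
titles = [
    "senate passes budget bill",
    "minister visits capital for talks",
    "court rules on trade dispute",
    "parliament approves new trade deal",
    "central bank holds interest rates",
    "city council votes on housing plan",
    "SHOCKING secret they hide from you",
    "you will not BELIEVE this scandal",
    "ALARMING truth exposed by insider",
    "WATCH senator destroyed in epic rant",
    "BREAKING hoax exposed by blogger",
    "UNBELIEVABLE scandal rocks capital",
]
labels = [1] * 6 + [0] * 6

# Putting the vectorizer inside the pipeline refits it on each CV fold,
# so no vocabulary leaks from the validation part into training.
pipeline = Pipeline([
    ("vec", TfidfVectorizer(stop_words="english")),
    ("svm", SVC()),
])

# Illustrative grid; "svm__" routes each parameter to the SVC step
param_grid = {
    "svm__kernel": ["linear", "rbf"],
    "svm__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(titles, labels)

print("Best parameters:", search.best_params_)
print("Best CV accuracy:", search.best_score_)
```

Other SVC parameters worth trying include `gamma` (for the rbf kernel), `degree` (for the poly kernel), and `class_weight`.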

In [ ]: