k-NN¶

Here's some code to generate a mixture of Gaussians. Each point is drawn from one of three 2D Gaussian distributions, each with a fixed mean and covariance (dispersion) matrix. The label of a point is the index of the distribution it was drawn from. A plot showing the 3 classes is made below, after the train/test split.

In [ ]:
import numpy as np
import matplotlib.pyplot as plt
import sklearn.model_selection as sms

# Set the parameters for each Gaussian
means = [
    [2, 2],   # Mean of the first Gaussian
    [-1, -1], # Mean of the second Gaussian
    [3, -3]   # Mean of the third Gaussian
]
covariances = [
    [[1, 0.5], [0.5, 1]],  # Covariance of the first Gaussian
    [[1, -0.3], [-0.3, 1]], # Covariance of the second Gaussian
    [[1, 0.2], [0.2, 1]]   # Covariance of the third Gaussian
]
n_samples = 300  # Samples per Gaussian

# Generate data
data = []
labels = []
for i, (mean, cov) in enumerate(zip(means, covariances)):
    points = np.random.multivariate_normal(mean, cov, n_samples)
    data.append(points)
    labels += [i] * n_samples  # Label each Gaussian with a different number

# Combine all the data
data = np.vstack(data)
labels = np.array(labels)

Make a Train Test Split¶

In [ ]:
# your code here
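One possible sketch, using the sklearn.model_selection module imported above as sms; the 80/20 ratio and the random_state value are arbitrary choices, not part of the exercise:

In [ ]:
# A possible split: 80% train / 20% test, stratified so each class
# appears in the same proportion in both sets; random_state fixed for reproducibility
X_train, X_test, y_train, y_test = sms.train_test_split(
    data, labels, test_size=0.2, random_state=42, stratify=labels
)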

Plot the train data¶

In [ ]:
# Visualize the data
plt.figure(figsize=(8, 6))
for i in range(len(means)):
    plt.scatter(X_train[y_train == i, 0], X_train[y_train == i, 1], label=f'Class {i}')
#for i in range(len(means)):
#    plt.scatter(X_test[y_test == i, 0], X_test[y_test == i, 1], label=f'Class ?')

plt.legend()
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Synthetic Dataset with Gaussian Clusters')
plt.show()

Implement k-NN with k=1¶

Write a knn_label function. Inputs:

  • X_train
  • y_train
  • X (a single test point)
  • y (its true label)

It finds the 1 nearest neighbor to X in X_train and returns that neighbor's label, yhat = y_train[best].

In [2]:
def knn_label(X_train, y_train, X, y):
    # your code
    pass
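One possible brute-force implementation (a sketch; note that y is not needed to compute the prediction, it is only kept to match the requested signature):

In [ ]:
def knn_label(X_train, y_train, X, y):
    # Euclidean distance from the single point X to every training point
    dists = np.linalg.norm(X_train - X, axis=1)
    best = np.argmin(dists)   # index of the nearest neighbor
    yhat = y_train[best]      # its label is the prediction
    return yhat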

Print out one row for each test data point in the format

T/F, X1, X2, y, yhat

where T/F is True if y == yhat (i.e., the prediction is correct).

In [3]:
# your code
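A sketch of the printing loop, assuming the train/test split and the knn_label function above:

In [ ]:
# Classify each test point with 1-NN and print one row per point
for X, y in zip(X_test, y_test):
    yhat = knn_label(X_train, y_train, X, y)
    print(f'{y == yhat}, {X[0]:.3f}, {X[1]:.3f}, {y}, {yhat}')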

Come up with a nice way to visualize the locations of the test points. Mislabeled points should be clearly visible through some graphic attribute.

In [4]:
# Visualize the data
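One possibility (a sketch): color each test point by its predicted class and overdraw mislabeled points with a red cross so they stand out.

In [ ]:
# Predict every test point, then flag the mistakes
yhat_all = np.array([knn_label(X_train, y_train, X, y)
                     for X, y in zip(X_test, y_test)])
wrong = yhat_all != y_test

plt.figure(figsize=(8, 6))
for i in range(len(means)):
    plt.scatter(X_test[yhat_all == i, 0], X_test[yhat_all == i, 1],
                label=f'Predicted class {i}', alpha=0.6)
plt.scatter(X_test[wrong, 0], X_test[wrong, 1],
            marker='x', c='red', s=80, label='Mislabeled')
plt.legend()
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Test Points Colored by Prediction (mislabeled marked in red)')
plt.show()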

Extension¶

Expand this technique. You could:

  • compute k-NN with k > 1 and analyze the error rate as a function of k (a sketch follows this list)
  • create d-dimensional datasets and analyze the error rate as a function of d (in this case the number of points should also scale appropriately to achieve a similar density)
  • vary the centers or dispersions of the distributions and analyze the error
  • vary the number of distributions AND also k. Is there a relation?
  • devise an algorithm for solving k-nearest neighbors more quickly than the brute-force search you probably used above. This is not an easy problem (implementing it is optional)
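A sketch for the first extension: a majority-vote k-NN with the error rate printed for a range of k values (the name knn_label_k and the particular k values are illustrative assumptions, not part of the exercise).

In [ ]:
from collections import Counter

def knn_label_k(X_train, y_train, X, k):
    # Indices of the k training points closest to X (Euclidean distance)
    dists = np.linalg.norm(X_train - X, axis=1)
    nearest = np.argsort(dists)[:k]
    # Majority vote among the k neighbors' labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

for k in [1, 3, 5, 9, 15, 25]:
    preds = np.array([knn_label_k(X_train, y_train, X, k) for X in X_test])
    print(f'k={k:2d}  error rate = {np.mean(preds != y_test):.3f}')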
In [5]:
# stuff here! and below
In [ ]: