k-NN¶
Here's some code to generate a mixture of Gaussians. Each point is drawn from one of three 2D Gaussian distributions, each with a fixed center (mean) and dispersion (covariance) matrix; the label of a point corresponds to the center of its distribution. A plot below shows the three categories.
In [ ]:
import numpy as np
import matplotlib.pyplot as plt
import sklearn.model_selection as sms
# Set the parameters for each Gaussian
means = [
    [2, 2],    # Mean of the first Gaussian
    [-1, -1],  # Mean of the second Gaussian
    [3, -3]    # Mean of the third Gaussian
]
covariances = [
    [[1, 0.5], [0.5, 1]],    # Covariance of the first Gaussian
    [[1, -0.3], [-0.3, 1]],  # Covariance of the second Gaussian
    [[1, 0.2], [0.2, 1]]     # Covariance of the third Gaussian
]
n_samples = 300  # Samples per Gaussian

# Generate data
data = []
labels = []
for i, (mean, cov) in enumerate(zip(means, covariances)):
    points = np.random.multivariate_normal(mean, cov, n_samples)
    data.append(points)
    labels += [i] * n_samples  # Label each Gaussian with a different number

# Combine all the data
data = np.vstack(data)
labels = np.array(labels)
Make a Train Test Split¶
In [ ]:
# your code here
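A minimal sketch using train_test_split from sklearn.model_selection (imported above as sms); the 80/20 split, the random seed, and stratification are arbitrary choices, and the names X_train, X_test, y_train, y_test are assumed by the code below.

# Stratified 80/20 split so each class appears in both sets
X_train, X_test, y_train, y_test = sms.train_test_split(
    data, labels, test_size=0.2, random_state=42, stratify=labels
)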
Plot the train data¶
In [ ]:
# Visualize the data
plt.figure(figsize=(8, 6))
for i in range(len(means)):
    plt.scatter(X_train[y_train == i, 0], X_train[y_train == i, 1], label=f'Class {i}')
# for i in range(len(means)):
#     plt.scatter(X_test[y_test == i, 0], X_test[y_test == i, 1], label=f'Class ?')
plt.legend()
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Synthetic Dataset with Gaussian Clusters')
plt.show()
Implement k-NN with k=1¶
Write a knn_label function. Inputs:
- X_train
- y_train
- X
- y
It finds and returns the label yhat by locating the single nearest neighbor to X in X_train and assigning the label of that training point, yhat = y_train[best], where best is the index of the nearest neighbor.
In [2]:
def knn_label(X_train, y_train, X, y):
    # your code
    pass
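A minimal brute-force sketch: compute the squared Euclidean distance from X to every training point and return the training label of the closest one. (The y argument is part of the requested signature but is not needed for the prediction itself.)

def knn_label(X_train, y_train, X, y):
    # Squared Euclidean distance from X to every row of X_train
    dists = np.sum((X_train - X) ** 2, axis=1)
    best = np.argmin(dists)  # index of the single nearest neighbor
    yhat = y_train[best]
    return yhat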
Print one row per test data point in the format
T/F, X1, X2, y, yhat
where T/F is True if y == yhat.
In [3]:
# your code
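A sketch of the printing loop, assuming the split and the knn_label function defined above:

for X, y in zip(X_test, y_test):
    yhat = knn_label(X_train, y_train, X, y)
    print(f"{y == yhat}, {X[0]:.2f}, {X[1]:.2f}, {y}, {yhat}")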
Come up with a nice way to visualize the locations of the test points. Mislabeled points should be clearly visible through some graphical attribute (color, marker shape, size, etc.).
In [4]:
# Visualize the data
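One possible approach, assuming the names defined above: draw the training data faintly for context, then mark correctly labeled test points with hollow circles and mislabeled ones with red crosses.

plt.figure(figsize=(8, 6))
# Training points, faded, for context
for i in range(len(means)):
    plt.scatter(X_train[y_train == i, 0], X_train[y_train == i, 1],
                alpha=0.2, label=f'Train class {i}')
# Predict every test point with the 1-NN function above
yhats = np.array([knn_label(X_train, y_train, X, y)
                  for X, y in zip(X_test, y_test)])
correct = yhats == y_test
plt.scatter(X_test[correct, 0], X_test[correct, 1],
            marker='o', facecolors='none', edgecolors='k', label='Correct')
plt.scatter(X_test[~correct, 0], X_test[~correct, 1],
            marker='x', c='red', s=80, label='Mislabeled')
plt.legend()
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Test points (mislabeled marked with red x)')
plt.show()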
Extension¶
Expand this technique. You could
- compute k-NN with k > 1 and analyze the error rate as a function of k (a starting sketch is given below)
- create d-dimensional datasets and analyze the error rate as a function of d (in this case your number of points should also scale appropriately to achieve a similar density)
- vary the centers or dispersions of the distributions and analyze the error
- vary the number of distributions AND also k. Is there a relation?
- devise an algorithm for solving k-nearest neighbors quickly, or at least quicker than the brute-force search you probably used above. This is not an easy problem (implementing it is optional)
In [5]:
# stuff here! and below
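As a starting point for the first extension, a sketch of k-NN with a majority vote over the k nearest neighbors; the helper name knn_label_k and the particular k values tried are illustrative choices, not part of the assignment.

def knn_label_k(X_train, y_train, X, k=1):
    # Distances from X to all training points; take the k closest
    dists = np.sum((X_train - X) ** 2, axis=1)
    nearest = np.argsort(dists)[:k]
    # Majority vote among the k neighbors' labels
    return np.argmax(np.bincount(y_train[nearest]))

# Error rate as a function of k, using the split above
for k in [1, 3, 5, 11, 21]:
    yhats = np.array([knn_label_k(X_train, y_train, X, k) for X in X_test])
    print(f"k={k:3d}: error rate = {np.mean(yhats != y_test):.3f}")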
In [ ]: