Here’s a nice writeup of the approximate nearest neighbor problem with an eye towards ML applications like Spotify and Netflix recommendations (it may be behind a paywall; I’m not sure).
Note: Oops, I forgot that sklearn’s decision trees can’t handle categorical features directly. Ugh. The fix is to one-hot or ordinal encode everything, but that makes for ugly trees.
Some notes on entropy are contained in this chapter. I go through the ABCDEFGH example from class here, starting on p. 63.
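As a quick sketch of the entropy calculation, here is the formula H = -Σ p·log2(p) applied to two eight-symbol distributions. The probabilities are hypothetical, chosen so the arithmetic is clean, and are not necessarily the ones used in the class example.

```python
import math

def entropy(probs):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical distributions over eight symbols A..H.
uniform = [1 / 8] * 8
skewed = [1 / 2, 1 / 4, 1 / 8, 1 / 16, 1 / 64, 1 / 64, 1 / 64, 1 / 64]

h_uniform = entropy(uniform)  # eight equally likely symbols
h_skewed = entropy(skewed)    # a few symbols dominate
print(h_uniform, h_skewed)
```

The uniform case needs 3 bits per symbol, while the skewed case needs only 2 bits on average, because the likely symbols carry little surprise.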
The Wikipedia article on Huffman trees is pretty good too if you want notes on that algorithm.
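For reference, a minimal sketch of the Huffman construction: repeatedly merge the two least likely nodes, and every symbol under a merge gets one bit longer. The function name `huffman_code_lengths` and the probabilities are hypothetical and only for illustration.

```python
import heapq
from itertools import count

def huffman_code_lengths(probs):
    """Return {symbol: code length in bits} for a Huffman code."""
    tiebreak = count()  # keeps heap entries comparable when weights tie
    heap = [(p, next(tiebreak), {sym: 0}) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Pop the two least likely nodes and merge them into one.
        p1, _, depths1 = heapq.heappop(heap)
        p2, _, depths2 = heapq.heappop(heap)
        # Every symbol under the merged node moves one level deeper.
        merged = {s: d + 1 for s, d in {**depths1, **depths2}.items()}
        heapq.heappush(heap, (p1 + p2, next(tiebreak), merged))
    return heap[0][2]

# Hypothetical distribution over eight symbols.
probs = dict(zip("ABCDEFGH",
                 [1 / 2, 1 / 4, 1 / 8, 1 / 16, 1 / 64, 1 / 64, 1 / 64, 1 / 64]))
lengths = huffman_code_lengths(probs)
average_length = sum(probs[s] * lengths[s] for s in probs)
print(lengths, average_length)
```

Because every probability here is a power of 1/2, the average code length matches the entropy of the distribution exactly; in general Huffman codes land within one bit of it.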
11/14/2024 (Thursday)
A Decision Tree learning algorithm based on notes from last class. See how it handles the slightly modified restaurant dataset. The original notebook is here, but running it requires installing graphviz, the graphviz development package, and pygraphviz.
Upload your final notebook to grumpy. Please name it Ensemble_Methods.ipynb
Report #2 due end of next week. Find your dataset if you haven’t already. Plan to do CV, GridSearch and Ensemble methods on this report.
Friday and Tuesday will probably be report workdays.
12/11/2024 (Wednesday)
ROC-AUC curves, tuning decision thresholds
Read through the notes here and here. The first link is all review for you, but has great visuals and interactive demos. The second one does a great job demonstrating ROC curves.
Revisit the Mushroom project.
- (You may want to omit the neural network from this because it’s slow.)
- Make ROC curves and compute AUC scores for each of the classifiers we sampled.
- Make one plot with ROC curves for all the models on the same graph.
- For at least one model:
  - Plot an ROC curve with cross validation (documentation here).
  - Tune the classifier threshold using sklearn’s tuning ability (documentation here).
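The ROC steps above can be sketched roughly as follows. This assumes synthetic data from `make_classification` in place of the Mushroom data and two stand-in classifiers. The threshold is picked by hand with Youden's J (the point maximizing TPR minus FPR), which is a simple substitute for sklearn's built-in threshold tuner, not that tool itself.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in data (assumption: binary labels, numeric features).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

models = {
    "logistic": LogisticRegression(max_iter=1000),
    "forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

results = {}
for name, model in models.items():
    # Out-of-fold probabilities for the positive class (5-fold CV).
    scores = cross_val_predict(model, X, y, cv=5, method="predict_proba")[:, 1]
    fpr, tpr, thresholds = roc_curve(y, scores)
    best = int(np.argmax(tpr - fpr))  # Youden's J statistic
    results[name] = {
        "auc": roc_auc_score(y, scores),
        "fpr": fpr,
        "tpr": tpr,
        "threshold": thresholds[best],
    }
    print(name, results[name]["auc"], results[name]["threshold"])
```

Passing each model's `fpr` and `tpr` arrays to matplotlib's `plt.plot` on the same axes gives the combined ROC plot.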
Upload your results to grumpy, named Mushroom-tuned.ipynb
Finalize your ideas for Report-02.ipynb, due next Friday
12/13/2024 (Friday)
Turn in Ensemble and Mushroom notebooks
Work on Report-02.ipynb
Pick a new dataset (see me for rare exceptions!)
Limit yourself to binary classification (you can do multiclass as an extension if you want but start with binarized data)
Focus on new skills: CV, GridSearch, Ensembles and reporting AUC and drawing ROC curves
Even if your first model’s accuracy is 98.5%, you still need to improve it using new techniques!
Due next Thursday
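A rough sketch of how the new skills for the report fit together, assuming synthetic data and a random forest as the ensemble; the parameter grid is a small hypothetical example, not a recommendation. Cross validation happens inside `GridSearchCV`, and AUC is reported on a held-out test split.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for a binary classification dataset.
X, y = make_classification(n_samples=400, n_features=12, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Hypothetical grid; every combination is scored with 5-fold CV on AUC.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=5,
    scoring="roc_auc",
)
search.fit(X_train, y_train)

# The best model is refit on all training data; evaluate it on the test split.
test_auc = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])
print(search.best_params_, search.best_score_, test_auc)
```

Keeping the test split out of the grid search means the reported AUC is not inflated by the tuning itself.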
Quick notes on precision/recall curves, RandomizedSearchCV, Naive Bayes, and Bayes error rate.
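A small sketch tying three of these together, assuming imbalanced synthetic data: `RandomizedSearchCV` samples a handful of `var_smoothing` values for a Gaussian Naive Bayes model, and the tuned model is summarized with a precision/recall curve. The search range is an arbitrary illustrative choice.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import average_precision_score, precision_recall_curve
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.naive_bayes import GaussianNB

# Imbalanced synthetic data (about 20% positives), where PR curves are most useful.
X, y = make_classification(
    n_samples=400, n_features=8, weights=[0.8, 0.2], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Try 10 randomly chosen candidates instead of the full grid of 50.
search = RandomizedSearchCV(
    GaussianNB(),
    param_distributions={"var_smoothing": np.logspace(-12, -2, 50).tolist()},
    n_iter=10,
    cv=5,
    scoring="average_precision",
    random_state=0,
)
search.fit(X_train, y_train)

scores = search.predict_proba(X_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, scores)
ap = average_precision_score(y_test, scores)
print(search.best_params_, ap)
```

Plotting `recall` against `precision` gives the curve; `ap` summarizes it as a single number, much as AUC does for ROC.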
12/17/2024 (Tuesday)
Work on Report due this week
Be sure to consult the specification from last class and the rubric.
Pandoc update: it looks like “sudo apt-get install texlive texlive-xetex pandoc” will get it working.