Jupyter Report Rubric
Some guidelines for formatting an ML report as a Jupyter notebook.
This is a draft of a living document so expect updates.
- Title: something like “Predicting Voter Registration Data from
Demographic Data.” Tell us quickly what you’re doing. It’s OK to be
clever if you can pull it off “To Eat or Not to Eat: Identifying
Poisonous Mushrooms” (this is borderline.) “Report #2” (this is
bad).
- Intro: Say what you’re solving, where the data comes from, and what
techniques you’re going to use. It’s good to put the final result here
too. “Ultimately our Random Forest model achieves a test accuracy of 95%
and an MCC score of 0.92.”
- Outlining: Use sections and subsections as appropriate. The title is
“#” in markdown. Sections are “##” and so on.
- Sections. You should usually have (there can be variations, of
course)
- Intro
- The Data (load the data and give us info – size, shape, features,
target. Describe all the columns if possible or maybe point to a website
that does if there’s a lot of data. Deal with basic problems here – is
it balanced? is there missing data?)
- Preprocessing (usually this is where you OneHotEncode, maybe drop
obvious distractors, impute missing data, rebalance, pick a subset of
the rows)
- Data Analysis – more detailed search for co-linearity, predictive
features, relationship to the target field, scaling, normality and skew,
histogram plots, heatmaps, etc.
- Feature Extraction – we haven’t done this much yet, outside of
TF-IDF, but this can be a very big part of your analysis. Especially
with datasets like images, audio, video, time series, EEG, ECG, biology
and genetics, etc.
- Modeling – Now you do some ML. Depending on the purpose of the
report you will give your best model here, or several attempts. Give a
detailed analysis; don’t just print “accuracy = 0.92”. Use relevant
metrics: confusion matrix, ROC curves, precision, recall, F1, MCC, RMS,
etc. You should use cross-validation and discuss the variance of the
results.
- Refinement – Pick your best model (or a few) and make them better.
There are many techniques here. GridSearch is an obvious one but
massaging the data, combining models, better preprocessing to find
salient features, dropping fields, etc. Lots of room here to be
creative. This might be a place for ensemble methods.
- Conclusion – What’s your best result. Make sure to address the pros
and cons of close performing models (variance, precision, recall, in
larger problems there are resource requirements, deployment.) It is also
a good time to reflect on your data more generallly – how well does the
dataset represent the problem here. What would you like to look at next
if you had more time, more resources, more data. What are the
shortcomings in your own analysis. Do you trust your own results? How
much does this generalize? It’s good to give the reader something to
think about beyond your paper, both to make your paper more interesting,
and also to prove that you are capable of interesting thoughts.
- General voice: The entire document should be narrative. You should
rarely have a computation cell without some text before, setting up what
you’re doing “let’s remove the columns for country and state because
they are irrelevant” and tell us what you learned “it looks like pH and
log[H+] are collinear so we will drop one of them” or “this dataset is
highly imbalanced”. Do not expect the reader to read or even
look at your Jupyter code. It’s just there to prove you’re not cheating.
It’s the text in the document that carries the weight.
- Format: Every graph has a title, every axis has a label. If your
heatmap is too small to read, then fix it somehow (maybe do 2 heatmaps,
maybe drop features from the heatmap that won’t be helpful). It’s fine
and expected to draft your visuals before you publish
the final versions. We don’t need to see the drafts.
- Citations: Tell us where to get your data. If you use any
non-standard library (what’s standard? that’s a valid question) just
point it out and say how to install it (probably ‘pip’). If a brilliant
paper or website gave you a brilliant idea, don’t pretend like it’s your
idea. Eventually somebody will realize they have seen it before and
you’ll be cooked. If you actually publish something, we’ll work harder
to make citations, but for classwork just make sure you leave
sufficient. breadcrumbs.
- Final version: I think that a .pdf of your notebook might be best.
That may take some fiddling. let’s experiment but I hope we can get .pdf
to work. Then submit your .ipynb and .pdf files (and any relevant
datasets – either upload them or if they’re huge link to them. Do
not make the mistake of relying on data in the cloud to stay
there. Make your own copy. Put it in your own cloud. Link to that!)