Decision Trees Intro¶
Part 1: The Restaurant Dataset¶
import pandas as pd
# This will get the dataset
# It's a good practice to go ahead and download it (curl/wget)
# and change this cell to read locally
df = pd.read_csv("https://aet-cs.github.io/white/ML/lessons/restaurant.csv")
df
|    | Alt | Bar | Fri | Hun | Pat  | Price | Rain | Res | Type    | Est   | Wait |
|----|-----|-----|-----|-----|------|-------|------|-----|---------|-------|------|
| 0  | Yes | No  | No  | Yes | Some | $$$   | No   | Yes | French  | 0-10  | Yes  |
| 1  | Yes | No  | No  | Yes | Full | $     | No   | No  | Thai    | 30-60 | No   |
| 2  | No  | Yes | No  | No  | Some | $     | No   | No  | Burger  | 0-10  | Yes  |
| 3  | Yes | No  | Yes | Yes | Full | $     | No   | No  | Thai    | 10-30 | Yes  |
| 4  | Yes | No  | Yes | No  | Full | $$$   | No   | Yes | French  | >60   | No   |
| 5  | No  | Yes | No  | Yes | Some | $$    | Yes  | Yes | Italian | 0-10  | Yes  |
| 6  | No  | Yes | No  | No  | None | $     | Yes  | No  | Burger  | 0-10  | No   |
| 7  | No  | No  | No  | Yes | Some | $$    | Yes  | Yes | Thai    | 0-10  | Yes  |
| 8  | No  | Yes | Yes | No  | Full | $     | Yes  | No  | Burger  | >60   | No   |
| 9  | Yes | Yes | Yes | Yes | Full | $$$   | No   | Yes | Italian | 10-30 | No   |
| 10 | No  | No  | No  | No  | None | $     | No   | No  | Thai    | 0-10  | No   |
| 11 | Yes | Yes | Yes | Yes | Full | $     | No   | No  | Burger  | 30-60 | Yes  |
Check out the documentation for Decision Tree Classifiers and implement one for the Restaurant dataset. Print out your decision tree and its accuracy. (It's a small dataset, so using all the data for training is OK.) Unfortunately, the scikit-learn tree classifiers require numerical data, so we will label-encode our dataset first. (One-hot encoding is also a possibility.)
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
for c in df.columns:
    le.fit(df[c])
    df[c] = le.transform(df[c])
## Create your training X and y (you can use the whole dataset)
## use scikit-learn to make a decision tree
## calculate its accuracy and metrics
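One possible sketch of a solution is below. The dataset is reproduced inline from the table above so the cell runs standalone; in the notebook you can reuse the encoded `df` from the previous cell instead. The choices of `random_state=0` and `export_text` for printing the tree are ours, not requirements.

```python
import pandas as pd
from sklearn import preprocessing
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.metrics import accuracy_score

# The restaurant dataset, copied from the table above
data = {
    "Alt":   ["Yes","Yes","No","Yes","Yes","No","No","No","No","Yes","No","Yes"],
    "Bar":   ["No","No","Yes","No","No","Yes","Yes","No","Yes","Yes","No","Yes"],
    "Fri":   ["No","No","No","Yes","Yes","No","No","No","Yes","Yes","No","Yes"],
    "Hun":   ["Yes","Yes","No","Yes","No","Yes","No","Yes","No","Yes","No","Yes"],
    "Pat":   ["Some","Full","Some","Full","Full","Some","None","Some","Full","Full","None","Full"],
    "Price": ["$$$","$","$","$","$$$","$$","$","$$","$","$$$","$","$"],
    "Rain":  ["No","No","No","No","No","Yes","Yes","Yes","Yes","No","No","No"],
    "Res":   ["Yes","No","No","No","Yes","Yes","No","Yes","No","Yes","No","No"],
    "Type":  ["French","Thai","Burger","Thai","French","Italian","Burger","Thai","Burger","Italian","Thai","Burger"],
    "Est":   ["0-10","30-60","0-10","10-30",">60","0-10","0-10","0-10",">60","10-30","0-10","30-60"],
    "Wait":  ["Yes","No","Yes","Yes","No","Yes","No","Yes","No","No","No","Yes"],
}
df = pd.DataFrame(data)

# Label-encode every column, as in the cell above
for c in df.columns:
    df[c] = preprocessing.LabelEncoder().fit_transform(df[c])

# X is every feature except the target column "Wait"
X = df.drop(columns="Wait")
y = df["Wait"]

clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# Print a text rendering of the tree, then the training accuracy
print(export_text(clf, feature_names=list(X.columns)))
print("Training accuracy:", accuracy_score(y, clf.predict(X)))
```

Since we train and evaluate on the same 12 rows and no two rows have identical features with different labels, the tree can grow until every leaf is pure, so the training accuracy is 1.0.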
Because of the required encoding, and renaming of features, it's not easy to interpret this tree and compare it to the ones we made in class. Nevertheless, it's a good little example of how to make a decision tree in scikit-learn. In a later notebook we'll look at a tree made using the algorithm from class.
Part 2: The entropy of English¶
Install nltk (the Natural Language Toolkit) using the command below.
!pip install nltk
The next cell will open an interactive window (which is a bit weird). Follow the prompts to download a corpus called 'brown'.
import nltk
## delete the next line after you download "brown" (or comment it)
nltk.download()
`brown.words()` returns the corpus as a list of words.
from nltk.corpus import brown
brown.words()
len(brown.words())
Your job is to use these words to compute, using standard Python, the per-character entropy of the English language. Only consider 27 characters: the alphabet plus space.
# your code!
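One way to approach this is sketched below. The helper name `char_entropy`, the choice to insert one space after each word, and the toy sanity check are all our own assumptions, not part of the assignment; you may make different modeling choices (e.g. how to handle punctuation-only tokens).

```python
from collections import Counter
from math import log2

def char_entropy(words):
    """First-order entropy in bits per character over 27 symbols: a-z plus space."""
    counts = Counter()
    for word in words:
        # Keep only alphabetic characters, lowercased
        letters = [ch for ch in word.lower() if "a" <= ch <= "z"]
        counts.update(letters)
        if letters:               # skip punctuation-only "words"
            counts[" "] += 1      # model the space separating consecutive words
    total = sum(counts.values())
    # H = -sum p * log2(p) over the observed symbol frequencies
    return -sum((n / total) * log2(n / total) for n in counts.values())

# Sanity check on a toy input: 'a', 'b', and space each occur twice,
# so the distribution is uniform over 3 symbols
print(char_entropy(["ab", "ab"]))  # log2(3) ≈ 1.585

# On the Brown corpus (uncomment after downloading it):
# from nltk.corpus import brown
# print(char_entropy(brown.words()))
```

On real English text this first-order estimate typically comes out to roughly 4 bits per character; Shannon's estimates of English entropy are the classic point of comparison.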