Decision Trees Intro¶

Part 1: The Restaurant Dataset¶

In [2]:
import pandas as pd
In [3]:
# This will get the dataset
# It's a good practice to go ahead and download it (curl/wget)
# and change this cell to read locally

df = pd.read_csv("https://aet-cs.github.io/white/ML/lessons/restaurant.csv")
In [4]:
df
Out[4]:
Alt Bar Fri Hun Pat Price Rain Res Type Est Wait
0 Yes No No Yes Some $$$ No Yes French 0-10 Yes
1 Yes No No Yes Full $ No No Thai 30-60 No
2 No Yes No No Some $ No No Burger 0-10 Yes
3 Yes No Yes Yes Full $ No No Thai 10-30 Yes
4 Yes No Yes No Full $$$ No Yes French >60 No
5 No Yes No Yes Some $$ Yes Yes Italian 0-10 Yes
6 No Yes No No None $ Yes No Burger 0-10 No
7 No No No Yes Some $$ Yes Yes Thai 0-10 Yes
8 No Yes Yes No Full $ Yes No Burger >60 No
9 Yes Yes Yes Yes Full $$$ No Yes Italian 10-30 No
10 No No No No None $ No No Thai 0-10 No
11 Yes Yes Yes Yes Full $ No No Burger 30-60 Yes

Check out the documentation for Decision Tree Classifiers and implement one for the Restaurant dataset. Print out your decision tree and its accuracy. (It's a small dataset, so using all the data for training is OK.) Unfortunately, the scikit-learn tree classifiers require numerical data, so we will label encode our dataset first. (One-hot encoding is also a possibility.)

In [5]:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
for c in df.columns:
    le.fit(df[c])
    df[c] = le.transform(df[c])
In [ ]:
## Create your training X and y (you can use the whole dataset)
## use scikit-learn to make a decision tree
## calculate its accuracy and metrics

Because of the required encoding, and renaming of features, it's not easy to interpret this tree and compare it to the ones we made in class. Nevertheless, it's a good little example of how to make a decision tree in scikit-learn. In a later notebook we'll look at a tree made using the algorithm from class.

Part 2: The entropy of English¶

Install nltk (the Natural Language Toolkit) with the command below.

In [ ]:
!pip install nltk

The next cell will open an interactive download window (which is a bit weird). Follow the prompts to download a corpus called 'brown'.

In [ ]:
import nltk

## Delete (or comment out) the next line after you download "brown".
## Alternatively, nltk.download("brown") fetches the corpus directly
## without opening the interactive window.
nltk.download()

brown.words() returns the list of words in the corpus.

In [ ]:
from nltk.corpus import brown
In [ ]:
brown.words()
In [ ]:
len(brown.words())

Your job is to use these words to compute, in plain Python, the entropy of the English language: H = -Σ p(c) log2 p(c), where p(c) is the relative frequency of character c. Only consider 27 characters -- the 26 letters of the alphabet plus the space.

In [18]:
# your code!