{ "cells": [ { "cell_type": "markdown", "id": "641979b4-84cb-4927-a762-1ea790835388", "metadata": {}, "source": [ "# Decision Trees Intro" ] }, { "cell_type": "markdown", "id": "164f1089-3865-48f4-995b-892d5732c8e0", "metadata": {}, "source": [ "## Part 1: The Restaurant Dataset" ] }, { "cell_type": "code", "execution_count": 2, "id": "a87e1e8a-0d4c-4f3d-ae7f-0bcced53566f", "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 3, "id": "d0a618b2-9c66-4d5e-9d73-1a0c496dad29", "metadata": {}, "outputs": [], "source": [ "# This will get the dataset\n", "# It's a good practice to go ahead and download it (curl/wget)\n", "# and change this cell to read locally\n", "\n", "df = pd.read_csv(\"https://aet-cs.github.io/white/ML/lessons/restaurant.csv\")" ] }, { "cell_type": "code", "execution_count": 4, "id": "1b87b405-ee93-415d-b73a-0b055f30fa2b", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>Alt</th><th>Bar</th><th>Fri</th><th>Hun</th><th>Pat</th><th>Price</th><th>Rain</th><th>Res</th><th>Type</th><th>Est</th><th>Wait</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>Yes</td><td>No</td><td>No</td><td>Yes</td><td>Some</td><td>$$$</td><td>No</td><td>Yes</td><td>French</td><td>0-10</td><td>Yes</td></tr>\n",
"    <tr><th>1</th><td>Yes</td><td>No</td><td>No</td><td>Yes</td><td>Full</td><td>$</td><td>No</td><td>No</td><td>Thai</td><td>30-60</td><td>No</td></tr>\n",
"    <tr><th>2</th><td>No</td><td>Yes</td><td>No</td><td>No</td><td>Some</td><td>$</td><td>No</td><td>No</td><td>Burger</td><td>0-10</td><td>Yes</td></tr>\n",
"    <tr><th>3</th><td>Yes</td><td>No</td><td>Yes</td><td>Yes</td><td>Full</td><td>$</td><td>No</td><td>No</td><td>Thai</td><td>10-30</td><td>Yes</td></tr>\n",
"    <tr><th>4</th><td>Yes</td><td>No</td><td>Yes</td><td>No</td><td>Full</td><td>$$$</td><td>No</td><td>Yes</td><td>French</td><td>&gt;60</td><td>No</td></tr>\n",
"    <tr><th>5</th><td>No</td><td>Yes</td><td>No</td><td>Yes</td><td>Some</td><td>$$</td><td>Yes</td><td>Yes</td><td>Italian</td><td>0-10</td><td>Yes</td></tr>\n",
"    <tr><th>6</th><td>No</td><td>Yes</td><td>No</td><td>No</td><td>None</td><td>$</td><td>Yes</td><td>No</td><td>Burger</td><td>0-10</td><td>No</td></tr>\n",
"    <tr><th>7</th><td>No</td><td>No</td><td>No</td><td>Yes</td><td>Some</td><td>$$</td><td>Yes</td><td>Yes</td><td>Thai</td><td>0-10</td><td>Yes</td></tr>\n",
"    <tr><th>8</th><td>No</td><td>Yes</td><td>Yes</td><td>No</td><td>Full</td><td>$</td><td>Yes</td><td>No</td><td>Burger</td><td>&gt;60</td><td>No</td></tr>\n",
"    <tr><th>9</th><td>Yes</td><td>Yes</td><td>Yes</td><td>Yes</td><td>Full</td><td>$$$</td><td>No</td><td>Yes</td><td>Italian</td><td>10-30</td><td>No</td></tr>\n",
"    <tr><th>10</th><td>No</td><td>No</td><td>No</td><td>No</td><td>None</td><td>$</td><td>No</td><td>No</td><td>Thai</td><td>0-10</td><td>No</td></tr>\n",
"    <tr><th>11</th><td>Yes</td><td>Yes</td><td>Yes</td><td>Yes</td><td>Full</td><td>$</td><td>No</td><td>No</td><td>Burger</td><td>30-60</td><td>Yes</td></tr>\n",
"  </tbody>\n",
"</table>
\n", "
" ], "text/plain": [ " Alt Bar Fri Hun Pat Price Rain Res Type \\\n", "0 Yes No No Yes Some $$$ No Yes French \n", "1 Yes No No Yes Full $ No No Thai \n", "2 No Yes No No Some $ No No Burger \n", "3 Yes No Yes Yes Full $ No No Thai \n", "4 Yes No Yes No Full $$$ No Yes French \n", "5 No Yes No Yes Some $$ Yes Yes Italian \n", "6 No Yes No No None $ Yes No Burger \n", "7 No No No Yes Some $$ Yes Yes Thai \n", "8 No Yes Yes No Full $ Yes No Burger \n", "9 Yes Yes Yes Yes Full $$$ No Yes Italian \n", "10 No No No No None $ No No Thai \n", "11 Yes Yes Yes Yes Full $ No No Burger \n", "\n", " Est Wait \n", "0 0-10 Yes \n", "1 30-60 No \n", "2 0-10 Yes \n", "3 10-30 Yes \n", "4 >60 No \n", "5 0-10 Yes \n", "6 0-10 No \n", "7 0-10 Yes \n", "8 >60 No \n", "9 10-30 No \n", "10 0-10 No \n", "11 30-60 Yes " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "markdown", "id": "56c68b23-ff4f-42c8-b63d-10f482bdbdb7", "metadata": {}, "source": [ "Check out the documentation for [Decision Tree Classifiers](https://scikit-learn.org/1.5/modules/tree.html) and implement one for the Restaurant dataset. Print out your decision tree and its accuracy. (It's a small dataset, so using all the data for training is OK.) Unfortunately, the scikit-learn tree classifiers require numerical data, so we will label-encode our dataset first. (One-hot encoding is also a possibility.)" 
] }, { "cell_type": "code", "execution_count": 5, "id": "a7d74cd6-2a20-45d1-be11-2c2fdea53d92", "metadata": {}, "outputs": [], "source": [ "from sklearn import preprocessing\n", "\n", "# Label-encode every column in place: each distinct value becomes an integer code\n", "le = preprocessing.LabelEncoder()\n", "for c in df.columns:\n", "    df[c] = le.fit_transform(df[c])" ] }, { "cell_type": "code", "execution_count": null, "id": "0d117cc2", "metadata": {}, "outputs": [], "source": [ "## Create your training X and y (you can use the whole dataset)\n", "## use scikit-learn to make a decision tree\n", "## calculate its accuracy and metrics" ] }, { "cell_type": "markdown", "id": "421ebfbd", "metadata": {}, "source": [ "Because the features had to be encoded and renamed, this tree is not easy to interpret or to compare with the ones we made in class. Nevertheless, it's a good little example of how to make a decision tree in scikit-learn. In a later notebook we'll look at a tree made using the algorithm from class." ] }, { "cell_type": "markdown", "id": "04cc17f9-83d4-4561-9697-fa6632d4a42d", "metadata": {}, "source": [ "## Part 2: The entropy of English" ] }, { "cell_type": "markdown", "id": "c3cd0a34-0e86-43a0-b670-82f05164ed25", "metadata": {}, "source": [ "Install nltk (the Natural Language Toolkit) by running the command below." ] }, { "cell_type": "code", "execution_count": null, "id": "438a674d-d3be-4559-b9de-7c26efe2976c", "metadata": {}, "outputs": [], "source": [ "!pip install nltk" ] }, { "cell_type": "markdown", "id": "8a2bfe6a-1e32-4cfc-b0c8-f00214642d48", "metadata": {}, "source": [ "The next cell will open an interactive window (which is a bit weird). 
Follow the prompts to download the corpus called 'brown'." ] }, { "cell_type": "code", "execution_count": null, "id": "b6f622c4-1136-4088-93d9-972831d032ae", "metadata": {}, "outputs": [], "source": [ "import nltk\n", "\n", "## Delete (or comment out) the next line after you download \"brown\"\n", "nltk.download()" ] }, { "cell_type": "markdown", "id": "77336ff5-9586-4f56-94ae-60ecce864c8a", "metadata": {}, "source": [ "`brown.words()` returns the words of the corpus as a list-like sequence." ] }, { "cell_type": "code", "execution_count": null, "id": "80c0ed99-d906-4c04-b191-20809bd41deb", "metadata": {}, "outputs": [], "source": [ "from nltk.corpus import brown" ] }, { "cell_type": "code", "execution_count": null, "id": "8dfa49e2-196d-46c0-9296-7340bf4a1679", "metadata": {}, "outputs": [], "source": [ "brown.words()" ] }, { "cell_type": "code", "execution_count": null, "id": "acd2575d-9976-4116-a812-bbc3ff09502d", "metadata": {}, "outputs": [], "source": [ "len(brown.words())" ] }, { "cell_type": "markdown", "id": "85c5a672-0789-4638-89a9-12ed0480cf3a", "metadata": {}, "source": [ "Your job is to use these words to compute, using only standard Python, the per-character entropy of English, $H = -\\sum_i p_i \\log_2 p_i$, where $p_i$ is the relative frequency of character $i$. Consider only 27 characters: the 26 letters of the alphabet plus the space." ] }, { "cell_type": "code", "execution_count": 18, "id": "18b98771-c53e-4be4-92a0-e43ba7730536", "metadata": {}, "outputs": [], "source": [ "# your code!" ] } ], "metadata": { "kernelspec": { "display_name": "ML Python", "language": "python", "name": "myenv" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.9" } }, "nbformat": 4, "nbformat_minor": 5 }