{ "cells": [ { "cell_type": "markdown", "id": "7e9349d0", "metadata": {}, "source": [ "# Poisoned Mushroom Dataset" ] }, { "cell_type": "markdown", "id": "382f85ee", "metadata": {}, "source": [ "We are going to take a quick tour of machine learning by working on an example dataset. The mushroom dataset\n", "categorizes mushrooms as 'poisonous' or 'edible' and collects several descriptive properties of each mushroom example." ] }, { "cell_type": "code", "execution_count": 41, "id": "771ea7b9-e7de-43c7-96ae-bc3a30862715", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import os" ] }, { "cell_type": "markdown", "id": "c1121f80", "metadata": {}, "source": [ "## Loading the dataset" ] }, { "cell_type": "code", "execution_count": 118, "id": "95d691a5-1638-4953-b929-761a8095a773", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " class cap-shape cap-surface cap-color ruises odor gill-attachment \\\n", "0 e x f n f n f \n", "1 p NaN y g t NaN f \n", "2 e b y n t n f \n", "3 e x g g t n f \n", "4 e NaN f NaN t n a \n", "... ... ... ... ... ... ... ... \n", "25981 e f NaN r f n f \n", "25982 e f s e f NaN f \n", "25983 p f g e NaN NaN a \n", "25984 e x g g t n f \n", "25985 e b y y t l f \n", "\n", " gill-spacing gill-size gill-color ... stalk-surface-below-ring \\\n", "0 w n b ... y \n", "1 c b k ... s \n", "2 c NaN n ... s \n", "3 w b n ... s \n", "4 w n n ... k \n", "... ... ... ... ... ... \n", "25981 NaN n NaN ... NaN \n", "25982 c n y ... y \n", "25983 c b b ... y \n", "25984 w b h ... f \n", "25985 c b y ... k \n", "\n", " stalk-color-above-ring stalk-color-below-ring veil-type veil-color \\\n", "0 w p NaN n \n", "1 n c p w \n", "2 p NaN p w \n", "3 p NaN p w \n", "4 NaN w p w \n", "... ... ... ... ... \n", "25981 n p p w \n", "25982 w p p w \n", "25983 w NaN p w \n", "25984 NaN NaN p w \n", "25985 g o p w \n", "\n", " ring-number ring-type spore-print-color population habitat \n", "0 o p w v NaN \n", "1 n e NaN y g \n", "2 o p b y w \n", "3 n n NaN NaN d \n", "4 NaN l w v d \n", "... ... ... ... ... ... 
\n", "25981 o p k v NaN \n", "25982 NaN p r y d \n", "25983 o p h v m \n", "25984 t e NaN s NaN \n", "25985 o l k s g \n", "\n", "[25986 rows x 23 columns]" ] }, "execution_count": 118, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# These lines would load the data locally\n", "# data_root = \"./\"\n", "# filename = \"mushroom.csv\"\n", "# filepath = os.path.join(data_root, filename)\n", "\n", "# We'll fetch it directly from the web\n", "data_url = \"https://aet-cs.github.io/white/ML/lessons/mushroom.csv\"\n", "df = pd.read_csv(data_url)\n", "df" ] }, { "cell_type": "markdown", "id": "a71ed4f4-5106-4425-88cd-ab2acbc24b86", "metadata": {}, "source": [ "`describe` gives a quick overview of each feature" ] }, { "cell_type": "code", "execution_count": 43, "id": "a363da42-73b6-4d09-b1c6-f885b020c633", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " class cap-shape cap-surface cap-color ruises odor gill-attachment \\\n", "count 25986 22513 22507 22527 22514 22536 22505 \n", "unique 2 6 4 10 2 9 2 \n", "top e x y n f n f \n", "freq 14354 7674 7602 4810 12361 6986 17811 \n", "\n", " gill-spacing gill-size gill-color ... stalk-surface-below-ring \\\n", "count 22587 22494 22418 ... 22563 \n", "unique 2 2 12 ... 4 \n", "top c b b ... s \n", "freq 16092 13997 3679 ... 10619 \n", "\n", " stalk-color-above-ring stalk-color-below-ring veil-type veil-color \\\n", "count 22413 22553 22489 22483 \n", "unique 9 9 1 4 \n", "top w w p w \n", "freq 8580 8403 22489 15742 \n", "\n", " ring-number ring-type spore-print-color population habitat \n", "count 22497 22478 22493 22475 22502 \n", "unique 3 5 9 6 7 \n", "top o p w v d \n", "freq 15713 8501 5085 8409 6573 \n", "\n", "[4 rows x 23 columns]" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.describe()" ] }, { "cell_type": "markdown", "id": "dc3a4445-274a-4a7f-b397-6f96ddc26d7a", "metadata": {}, "source": [ "All the features in this dataset are categorical, with between 2 and 12 categories. Except the curious `veil-type` which has only one value. Since `veil-type` has only one unique value, we'll drop it." ] }, { "cell_type": "code", "execution_count": 119, "id": "9494a6c7-67e8-419c-9aaa-0f8ff89221bb", "metadata": {}, "outputs": [], "source": [ "df = df.drop('veil-type', axis=1)" ] }, { "cell_type": "markdown", "id": "b7c00c5e", "metadata": {}, "source": [ "## Data Exploration" ] }, { "cell_type": "markdown", "id": "66b733f8-c4db-41d1-8e9e-ac316c0b8ecb", "metadata": {}, "source": [ "Show all the columns. Notice the target is the first column!" 
] }, { "cell_type": "code", "execution_count": 120, "id": "91eae28c-7318-4630-b527-3cb6763babeb", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['class', 'cap-shape', 'cap-surface', 'cap-color', 'ruises', 'odor',\n", " 'gill-attachment', 'gill-spacing', 'gill-size', 'gill-color',\n", " 'stalk-shape', 'stalk-root', 'stalk-surface-above-ring',\n", " 'stalk-surface-below-ring', 'stalk-color-above-ring',\n", " 'stalk-color-below-ring', 'veil-color', 'ring-number', 'ring-type',\n", " 'spore-print-color', 'population', 'habitat'],\n", " dtype='object')" ] }, "execution_count": 120, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.columns" ] }, { "cell_type": "markdown", "id": "b5a2457e-739c-41ca-a91e-bbabbd5686f5", "metadata": {}, "source": [ "Get the size of the dataframe. Shape returns (rows, cols)" ] }, { "cell_type": "code", "execution_count": 121, "id": "7b8d5fcc-1043-427a-9264-6f0a5deeb995", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(25986, 22)" ] }, "execution_count": 121, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.shape" ] }, { "cell_type": "markdown", "id": "3e22afb6-b203-4542-9c36-f4910be75a60", "metadata": {}, "source": [ "Let's peek at the target" ] }, { "cell_type": "code", "execution_count": 122, "id": "2c1e74a9-08b9-4e73-8855-8af51df9a2ea", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 e\n", "1 p\n", "2 e\n", "3 e\n", "4 e\n", " ..\n", "25981 e\n", "25982 e\n", "25983 p\n", "25984 e\n", "25985 e\n", "Name: class, Length: 25986, dtype: object" ] }, "execution_count": 122, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['class']" ] }, { "cell_type": "markdown", "id": "a2d5b6d9-e9dc-4bc6-808f-a5a5523f585c", "metadata": {}, "source": [ "This dataset has a LOT of \"N/A\" datapoints. 
One way to clean the data is to drop every row that contains a missing value." ] }, { "cell_type": "code", "execution_count": 123, "id": "635ab8ad-939b-4451-ae75-7414cc9436f1", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(1327, 22)" ] }, "execution_count": 123, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.dropna().shape" ] }, { "cell_type": "markdown", "id": "13baf05c-9ed1-4d7f-a794-3f8d6ae332d5", "metadata": {}, "source": [ "But this *significantly* reduces our dataset: only 1,327 of the 25,986 rows survive. Let's instead use a data imputation strategy that fills each N/A with the *mode* (the most frequent value) of its column." ] }, { "cell_type": "code", "execution_count": 124, "id": "90fc05fd-0f98-4d68-bce8-4bf9c3cd77db", "metadata": {}, "outputs": [], "source": [ "for c in df.columns:\n", "    df = df.fillna({c: df[c].mode()[0]})" ] }, { "cell_type": "markdown", "id": "5aebfa3d-f8c3-4da8-a28f-fdc8bbe6f40f", "metadata": {}, "source": [ "Look at `df` again." ] }, { "cell_type": "code", "execution_count": 125, "id": "63191ab4-1b33-49fe-b9c3-8a9cfce68f7d", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " class cap-shape cap-surface cap-color ruises odor gill-attachment \\\n", "0 e x f n f n f \n", "1 p x y g t n f \n", "2 e b y n t n f \n", "3 e x g g t n f \n", "4 e x f n t n a \n", "... ... ... ... ... ... ... ... \n", "25981 e f y r f n f \n", "25982 e f s e f n f \n", "25983 p f g e f n a \n", "25984 e x g g t n f \n", "25985 e b y y t l f \n", "\n", " gill-spacing gill-size gill-color ... stalk-surface-above-ring \\\n", "0 w n b ... s \n", "1 c b k ... f \n", "2 c b n ... s \n", "3 w b n ... s \n", "4 w n n ... s \n", "... ... ... ... ... ... \n", "25981 c n b ... s \n", "25982 c n y ... s \n", "25983 c b b ... s \n", "25984 w b h ... k \n", "25985 c b y ... k \n", "\n", " stalk-surface-below-ring stalk-color-above-ring stalk-color-below-ring \\\n", "0 y w p \n", "1 s n c \n", "2 s p w \n", "3 s p w \n", "4 k w w \n", "... ... ... ... \n", "25981 s n p \n", "25982 y w p \n", "25983 y w w \n", "25984 f w w \n", "25985 k g o \n", "\n", " veil-color ring-number ring-type spore-print-color population habitat \n", "0 n o p w v d \n", "1 w n e w y g \n", "2 w o p b y w \n", "3 w n n w v d \n", "4 w o l w v d \n", "... ... ... ... ... ... ... \n", "25981 w o p k v d \n", "25982 w o p r y d \n", "25983 w o p h v m \n", "25984 w t e w s d \n", "25985 w o l k s g \n", "\n", "[25986 rows x 22 columns]" ] }, "execution_count": 125, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "markdown", "id": "df21f38d-9f68-46ca-bf03-682cd3e49315", "metadata": {}, "source": [ "Let's see what the classifications are and how balanced the dataset is. Highly imbalanced datasets require special techniques to ensure valid models." 
] }, { "cell_type": "code", "execution_count": 127, "id": "e7f294dd-8f1e-47dd-9a12-9656ba1e893f", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "class\n", "e 14354\n", "p 11632\n", "Name: count, dtype: int64" ] }, "execution_count": 127, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['class'].value_counts()" ] }, { "cell_type": "markdown", "id": "33d056df-7b9b-41db-8760-f1dec051f951", "metadata": {}, "source": [ "We'll introduce a new plotting library, \"seaborn\", which is built on top of matplotlib and works nicely with pandas dataframes. Here we show how to quickly make a count plot (a bar chart of how often each category occurs) from a dataframe." ] }, { "cell_type": "code", "execution_count": 128, "id": "fd3c5ce9-dbfb-4bed-8105-8587a23c35c6", "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import seaborn as sns\n", "from matplotlib import pyplot as plt\n", "\n", "# Count plot\n", "sns.countplot(x='class', data=df)\n", "plt.title('Count Plot of Class Frequencies')\n", "plt.xlabel('Class')\n", "plt.ylabel('Frequency')\n", "plt.show()\n" ] }, { "cell_type": "markdown", "id": "cdc198c6-6a81-4ced-bb41-5316396d5051", "metadata": {}, "source": [ "As another example let's plot the \"cap color\" feature." ] }, { "cell_type": "code", "execution_count": 129, "id": "d89439c0", "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Count plot\n", "sns.countplot(x='cap-color', data=df, )\n", "plt.title('Count Plot of Cap Color Frequencies')\n", "plt.xlabel('Cap Color')\n", "plt.ylabel('Frequency')\n", "plt.show()\n" ] }, { "cell_type": "markdown", "id": "7a2f82fe-774f-4a74-98f8-8a14a4a5e343", "metadata": {}, "source": [ "I wonder how the color correlates to the outcome -- are some color more poisonous? We'll do some pandas work to make this summary for us. (Here's a nice overview of `groupby`: https://builtin.com/data-science/pandas-groupby)" ] }, { "cell_type": "code", "execution_count": 175, "id": "618f14e5", "metadata": {}, "outputs": [], "source": [ "# Count observations by color and toxicity\n", "counts = df.groupby(['cap-color', 'class']).size().reset_index(name='count')" ] }, { "cell_type": "code", "execution_count": 176, "id": "f0d0b10f-85a4-4f4d-9c5f-7d1dc6a93dd0", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " cap-color class count\n", "0 b e 660\n", "1 b p 551\n", "2 c e 526\n", "3 c p 413\n", "4 e e 1834\n", "5 e p 1627\n", "6 g e 2226\n", "7 g p 1765\n", "8 n e 2673\n", "9 n p 2137\n", "10 p e 611\n", "11 p p 483\n", "12 r e 467\n", "13 r p 380\n", "14 u e 502\n", "15 u p 413\n", "16 w e 1452\n", "17 w p 1137\n", "18 y e 1460\n", "19 y p 1210" ] }, "execution_count": 176, "metadata": {}, "output_type": "execute_result" } ], "source": [ "counts" ] }, { "cell_type": "code", "execution_count": 165, "id": "931ce993-ecce-4aee-98b6-df53b08a66f6", "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Create the bar plot\n", "plt.figure(figsize=(10, 6))\n", "sns.barplot(x='cap-color', y='count', hue='class', data=counts, palette={'p': 'blue', 'e': 'red'})\n", "\n", "# Add plot title and labels\n", "plt.title('Distribution of Mushroom Colors with Poisonous Indication')\n", "plt.xlabel('Color')\n", "plt.ylabel('Count')\n", "\n", "\n", "# Customize the legend\n", "legend = plt.legend(title='Toxicity', labels=['Edible', 'Poisonous'])\n", "legend.get_texts()[0].set_color('red') # Edible in red\n", "legend.get_texts()[1].set_color('blue') # Poisonous in blue\n", "\n", "\n", "# Show the plot\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 143, "id": "405b65e9-f10a-4d98-b812-4fadff77660f", "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " class cap-shape cap-surface cap-color ruises odor gill-attachment \\\n", "count 25986 25986 25986 25986 25986 25986 25986 \n", "unique 2 6 4 10 2 9 2 \n", "top e x y n f n f \n", "freq 14354 11147 11081 8269 15833 10436 21292 \n", "\n", " gill-spacing gill-size gill-color ... stalk-surface-above-ring \\\n", "count 25986 25986 25986 ... 25986 \n", "unique 2 2 12 ... 4 \n", "top c b b ... s \n", "freq 19491 17489 7247 ... 14548 \n", "\n", " stalk-surface-below-ring stalk-color-above-ring stalk-color-below-ring \\\n", "count 25986 25986 25986 \n", "unique 4 9 9 \n", "top s w w \n", "freq 14042 12153 11836 \n", "\n", " veil-color ring-number ring-type spore-print-color population habitat \n", "count 25986 25986 25986 25986 25986 25986 \n", "unique 4 3 5 9 6 7 \n", "top w o p w v d \n", "freq 19245 19202 12009 8578 11920 10057 \n", "\n", "[4 rows x 22 columns]" ] }, "execution_count": 143, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.describe()" ] }, { "cell_type": "code", "execution_count": 144, "id": "e8187689-c4d0-4181-aff4-666b0a43138b", "metadata": {}, "outputs": [], "source": [ "# Count observations by odor and toxicity\n", "counts = df.groupby(['odor', 'class']).size().reset_index(name='count')" ] }, { "cell_type": "code", "execution_count": 145, "id": "2e3c3a6c-4d7b-4fed-befe-0ef7f74147fe", "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Create the bar plot\n", "plt.figure(figsize=(10, 6))\n", "sns.barplot(x='odor', y='count', hue='class', data=counts, palette={'p': 'blue', 'e': 'red'})\n", "\n", "# Add plot title and labels\n", "plt.title('Distribution of Mushroom Odor with Poisonous Indication')\n", "plt.xlabel('Odor')\n", "plt.ylabel('Count')\n", "\n", "# Customize the legend\n", "legend = plt.legend(title='Toxicity', labels=['Edible', 'Poisonous'])\n", "legend.get_texts()[0].set_color('red') # Edible in red\n", "legend.get_texts()[1].set_color('blue') # Poisonous in blue\n", "\n", "# Show the plot\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "0bdfeadc-13f3-4041-a954-d46ed9ead5b2", "metadata": {}, "source": [ "## Correlation matrix heat map" ] }, { "cell_type": "markdown", "id": "7fc020d8-adfa-48de-b9bb-9dabde9499a4", "metadata": {}, "source": [ "Let's get a quick visual representation of the relationshop between features in this dataset. We'll use a version of a Chi-Squared test on all pairs $(n,m)$ of features in the dataset, including the target. (Heat maps for continuous data are easy to plot -- because of the categories we have to do some extra work here. You can treat `cramers_v` as a black box for now.)" ] }, { "cell_type": "code", "execution_count": 62, "id": "1cd6565e-12d2-48a1-bfc1-35c44e39e22b", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " class cap-shape cap-surface cap-color \\\n", "class 0.999922 0.030397 0.043075 0.020084 \n", "cap-shape 0.030397 1.000000 0.011547 0.020761 \n", "cap-surface 0.043075 0.011547 1.000000 0.022589 \n", "cap-color 0.020084 0.020761 0.022589 1.000000 \n", "ruises 0.089974 0.014305 0.008736 0.021542 \n", "odor 0.465191 0.020779 0.024878 0.022804 \n", "gill-attachment 0.001138 0.008688 0.015569 0.017171 \n", "gill-spacing 0.063779 0.017270 0.027291 0.023427 \n", "gill-size 0.122908 0.017995 0.020465 0.021974 \n", "gill-color 0.088500 0.020523 0.025335 0.024858 \n", "stalk-shape 0.016855 0.017822 0.003229 0.035745 \n", "stalk-root 0.047294 0.019882 0.019969 0.029892 \n", "stalk-surface-above-ring 0.129749 0.011107 0.012902 0.019545 \n", "stalk-surface-below-ring 0.102529 0.017404 0.012060 0.016909 \n", "stalk-color-above-ring 0.050202 0.018379 0.022897 0.024867 \n", "stalk-color-below-ring 0.044377 0.017432 0.021619 0.020831 \n", "veil-color 0.005063 0.017128 0.012788 0.017487 \n", "ring-number 0.022175 0.007926 0.016706 0.015235 \n", "ring-type 0.092049 0.012183 0.014214 0.023721 \n", "spore-print-color 0.124672 0.016205 0.016301 0.019681 \n", "population 0.063370 0.018208 0.022058 0.023478 \n", "habitat 0.053764 0.015867 0.020441 0.022749 \n", "\n", " ruises odor gill-attachment gill-spacing \\\n", "class 0.089974 0.465191 0.001138 0.063779 \n", "cap-shape 0.014305 0.020779 0.008688 0.017270 \n", "cap-surface 0.008736 0.024878 0.015569 0.027291 \n", "cap-color 0.021542 0.022804 0.017171 0.023427 \n", "ruises 0.999919 0.043431 0.002767 0.015149 \n", "odor 0.043431 1.000000 0.017001 0.026458 \n", "gill-attachment 0.002767 0.017001 0.999870 0.004789 \n", "gill-spacing 0.015149 0.026458 0.004789 0.999897 \n", "gill-size 0.011158 0.046978 0.007753 0.011227 \n", "gill-color 0.041328 0.027601 0.023883 0.021896 \n", "stalk-shape 0.000832 0.036932 0.008490 0.001425 \n", "stalk-root 0.030638 0.031082 0.011925 0.016121 \n", "stalk-surface-above-ring 0.025761 
0.027740 0.008627 0.011253 \n", "stalk-surface-below-ring 0.021659 0.031568 0.011742 0.014541 \n", "stalk-color-above-ring 0.027672 0.025946 0.020285 0.022406 \n", "stalk-color-below-ring 0.031543 0.023086 0.021662 0.028543 \n", "veil-color 0.006508 0.015490 0.010471 0.008439 \n", "ring-number 0.011759 0.019458 0.004408 0.013913 \n", "ring-type 0.037341 0.030861 0.010966 0.013069 \n", "spore-print-color 0.034545 0.032236 0.010641 0.018936 \n", "population 0.014283 0.026293 0.012992 0.027893 \n", "habitat 0.026531 0.021654 0.012999 0.029246 \n", "\n", " gill-size gill-color ... \\\n", "class 0.122908 0.088500 ... \n", "cap-shape 0.017995 0.020523 ... \n", "cap-surface 0.020465 0.025335 ... \n", "cap-color 0.021974 0.024858 ... \n", "ruises 0.011158 0.041328 ... \n", "odor 0.046978 0.027601 ... \n", "gill-attachment 0.007753 0.023883 ... \n", "gill-spacing 0.011227 0.021896 ... \n", "gill-size 0.999913 0.031603 ... \n", "gill-color 0.031603 1.000000 ... \n", "stalk-shape 0.007682 0.031472 ... \n", "stalk-root 0.041062 0.033103 ... \n", "stalk-surface-above-ring 0.008155 0.030065 ... \n", "stalk-surface-below-ring 0.009570 0.031451 ... \n", "stalk-color-above-ring 0.027692 0.023405 ... \n", "stalk-color-below-ring 0.030796 0.027307 ... \n", "veil-color 0.004786 0.019835 ... \n", "ring-number 0.008013 0.019987 ... \n", "ring-type 0.039065 0.033363 ... \n", "spore-print-color 0.041701 0.031731 ... \n", "population 0.025668 0.029769 ... \n", "habitat 0.023956 0.026966 ... 
\n", "\n", " stalk-surface-above-ring stalk-surface-below-ring \\\n", "class 0.129749 0.102529 \n", "cap-shape 0.011107 0.017404 \n", "cap-surface 0.012902 0.012060 \n", "cap-color 0.019545 0.016909 \n", "ruises 0.025761 0.021659 \n", "odor 0.027740 0.031568 \n", "gill-attachment 0.008627 0.011742 \n", "gill-spacing 0.011253 0.014541 \n", "gill-size 0.008155 0.009570 \n", "gill-color 0.030065 0.031451 \n", "stalk-shape 0.033136 0.014859 \n", "stalk-root 0.015758 0.015929 \n", "stalk-surface-above-ring 1.000000 0.025875 \n", "stalk-surface-below-ring 0.025875 1.000000 \n", "stalk-color-above-ring 0.029341 0.024655 \n", "stalk-color-below-ring 0.026336 0.024549 \n", "veil-color 0.010873 0.011019 \n", "ring-number 0.008614 0.015529 \n", "ring-type 0.032624 0.030188 \n", "spore-print-color 0.031709 0.034359 \n", "population 0.012388 0.021407 \n", "habitat 0.018257 0.018069 \n", "\n", " stalk-color-above-ring stalk-color-below-ring \\\n", "class 0.050202 0.044377 \n", "cap-shape 0.018379 0.017432 \n", "cap-surface 0.022897 0.021619 \n", "cap-color 0.024867 0.020831 \n", "ruises 0.027672 0.031543 \n", "odor 0.025946 0.023086 \n", "gill-attachment 0.020285 0.021662 \n", "gill-spacing 0.022406 0.028543 \n", "gill-size 0.027692 0.030796 \n", "gill-color 0.023405 0.027307 \n", "stalk-shape 0.028966 0.027132 \n", "stalk-root 0.031278 0.024221 \n", "stalk-surface-above-ring 0.029341 0.026336 \n", "stalk-surface-below-ring 0.024655 0.024549 \n", "stalk-color-above-ring 1.000000 0.025180 \n", "stalk-color-below-ring 0.025180 1.000000 \n", "veil-color 0.018200 0.018126 \n", "ring-number 0.020252 0.013893 \n", "ring-type 0.024944 0.026438 \n", "spore-print-color 0.021869 0.026470 \n", "population 0.023604 0.024288 \n", "habitat 0.020929 0.025119 \n", "\n", " veil-color ring-number ring-type \\\n", "class 0.005063 0.022175 0.092049 \n", "cap-shape 0.017128 0.007926 0.012183 \n", "cap-surface 0.012788 0.016706 0.014214 \n", "cap-color 0.017487 0.015235 0.023721 \n", "ruises 0.006508 
0.011759 0.037341 \n", "odor 0.015490 0.019458 0.030861 \n", "gill-attachment 0.010471 0.004408 0.010966 \n", "gill-spacing 0.008439 0.013913 0.013069 \n", "gill-size 0.004786 0.008013 0.039065 \n", "gill-color 0.019835 0.019987 0.033363 \n", "stalk-shape 0.008627 0.017158 0.036730 \n", "stalk-root 0.016108 0.015848 0.022331 \n", "stalk-surface-above-ring 0.010873 0.008614 0.032624 \n", "stalk-surface-below-ring 0.011019 0.015529 0.030188 \n", "stalk-color-above-ring 0.018200 0.020252 0.024944 \n", "stalk-color-below-ring 0.018126 0.013893 0.026438 \n", "veil-color 1.000000 0.009771 0.013793 \n", "ring-number 0.009771 1.000000 0.008527 \n", "ring-type 0.013793 0.008527 1.000000 \n", "spore-print-color 0.018753 0.018441 0.036000 \n", "population 0.015103 0.020609 0.016219 \n", "habitat 0.017111 0.016895 0.017606 \n", "\n", " spore-print-color population habitat \n", "class 0.124672 0.063370 0.053764 \n", "cap-shape 0.016205 0.018208 0.015867 \n", "cap-surface 0.016301 0.022058 0.020441 \n", "cap-color 0.019681 0.023478 0.022749 \n", "ruises 0.034545 0.014283 0.026531 \n", "odor 0.032236 0.026293 0.021654 \n", "gill-attachment 0.010641 0.012992 0.012999 \n", "gill-spacing 0.018936 0.027893 0.029246 \n", "gill-size 0.041701 0.025668 0.023956 \n", "gill-color 0.031731 0.029769 0.026966 \n", "stalk-shape 0.026428 0.020362 0.029390 \n", "stalk-root 0.032983 0.025749 0.029484 \n", "stalk-surface-above-ring 0.031709 0.012388 0.018257 \n", "stalk-surface-below-ring 0.034359 0.021407 0.018069 \n", "stalk-color-above-ring 0.021869 0.023604 0.020929 \n", "stalk-color-below-ring 0.026470 0.024288 0.025119 \n", "veil-color 0.018753 0.015103 0.017111 \n", "ring-number 0.018441 0.020609 0.016895 \n", "ring-type 0.036000 0.016219 0.017606 \n", "spore-print-color 1.000000 0.020755 0.023303 \n", "population 0.020755 1.000000 0.025203 \n", "habitat 0.023303 0.025203 1.000000 \n", "\n", "[22 rows x 22 columns]" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" 
} ], "source": [ "import pandas as pd\n", "import numpy as np\n", "import seaborn as sns\n", "import matplotlib.pyplot as plt\n", "from scipy.stats import chi2_contingency\n", "\n", "# Function to calculate Cramér's V\n", "# Note: chi2_contingency applies Yates' continuity correction to 2x2 tables by default,\n", "# so binary features score slightly below 1 against themselves.\n", "def cramers_v(x, y):\n", "    contingency = pd.crosstab(x, y)\n", "    chi2 = chi2_contingency(contingency)[0]\n", "    n = contingency.sum().sum()\n", "    r, k = contingency.shape\n", "    return np.sqrt(chi2 / (n * (min(r, k) - 1)))\n", "\n", "categorical_columns = df.select_dtypes(include=['object', 'category']).columns\n", "corr_matrix = pd.DataFrame(index=categorical_columns, columns=categorical_columns)\n", "\n", "for col1 in categorical_columns:\n", "    for col2 in categorical_columns:\n", "        corr_matrix.loc[col1, col2] = cramers_v(df[col1], df[col2])\n", "\n", "# Convert to numeric values for plotting\n", "corr_matrix = corr_matrix.astype(float)\n", "corr_matrix" ] }, { "cell_type": "code", "execution_count": 63, "id": "6625d1c5-3aad-4a3a-8aaf-3fdb5042930c", "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Plotting the correlation matrix\n", "plt.figure(figsize=(10,8))\n", "sns.heatmap(corr_matrix, annot=False, cmap='Blues', square=True, cbar_kws={\"shrink\": .8})\n", "plt.title(\"Cramér's V Correlation Matrix for Categorical Features\")\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "e406b72a-e7e7-4294-853c-66ec7f36f6bc", "metadata": {}, "source": [ "Which features seem to be important?" ] }, { "cell_type": "code", "execution_count": 64, "id": "7ff1e04d-41dd-48b9-97e4-c17ca7351f75", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "class 0.999922\n", "odor 0.465191\n", "gill-size 0.122908\n", "stalk-surface-above-ring 0.129749\n", "stalk-surface-below-ring 0.102529\n", "spore-print-color 0.124672\n", "Name: class, dtype: float64" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "corr_matrix['class'][corr_matrix['class']>0.1]" ] }, { "cell_type": "markdown", "id": "961a4b15-dbfb-4585-86f6-5d0bddd1f444", "metadata": {}, "source": [ "## Data Modeling" ] }, { "cell_type": "markdown", "id": "7416c6a5-a15e-4fc9-81a1-529ebdd44b6b", "metadata": {}, "source": [ "We're finally ready to do some data modeling using scikit-learn. In this cell we import the tools we'll use, reload the data frame (just to be safe), repeat the preprocessing, and one-hot encode all the categorical variables." ] }, { "cell_type": "code", "execution_count": 148, "id": "f5cf2242-b167-4111-a509-ede1e11c8c75", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
cap-shape_b cap-shape_c cap-shape_f cap-shape_k cap-shape_s cap-shape_x cap-surface_f cap-surface_g cap-surface_s cap-surface_y ... population_s population_v population_y habitat_d habitat_g habitat_l habitat_m habitat_p habitat_u habitat_w
count 25986 25986 25986 25986 25986 25986 25986 25986 25986 25986 ... 25986 25986 25986 25986 25986 25986 25986 25986 25986 25986
unique 2 2 2 2 2 2 2 2 2 2 ... 2 2 2 2 2 2 2 2 2 2
top False False False False False False False False False False ... False False False False False False False False False False
freq 23726 24509 19264 23074 24518 18312 19762 23825 19466 18384 ... 22365 17577 21675 19413 21040 23275 24278 22840 24129 24425
\n", "

4 rows × 116 columns

\n", "
" ], "text/plain": [ " cap-shape_b cap-shape_c cap-shape_f cap-shape_k cap-shape_s \\\n", "count 25986 25986 25986 25986 25986 \n", "unique 2 2 2 2 2 \n", "top False False False False False \n", "freq 23726 24509 19264 23074 24518 \n", "\n", " cap-shape_x cap-surface_f cap-surface_g cap-surface_s cap-surface_y \\\n", "count 25986 25986 25986 25986 25986 \n", "unique 2 2 2 2 2 \n", "top False False False False False \n", "freq 18312 19762 23825 19466 18384 \n", "\n", " ... population_s population_v population_y habitat_d habitat_g \\\n", "count ... 25986 25986 25986 25986 25986 \n", "unique ... 2 2 2 2 2 \n", "top ... False False False False False \n", "freq ... 22365 17577 21675 19413 21040 \n", "\n", " habitat_l habitat_m habitat_p habitat_u habitat_w \n", "count 25986 25986 25986 25986 25986 \n", "unique 2 2 2 2 2 \n", "top False False False False False \n", "freq 23275 24278 22840 24129 24425 \n", "\n", "[4 rows x 116 columns]" ] }, "execution_count": 148, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.preprocessing import LabelEncoder\n", "from sklearn.tree import DecisionTreeClassifier\n", "from sklearn.metrics import accuracy_score, classification_report\n", "\n", "df = pd.read_csv(data_url)\n", "\n", "# drop the useless feature\n", "df = df.drop('veil-type', axis=1)\n", "\n", "# drop the target from X -- and store it as y\n", "X = df.drop('class', axis = 1)\n", "y = df['class']\n", "\n", "# one-hot encode all columns at once\n", "X = pd.get_dummies(X)\n", "\n", "# show it to me\n", "X.describe()" ] }, { "cell_type": "markdown", "id": "f2be1f0e", "metadata": {}, "source": [ "### Decision Tree Classifier" ] }, { "cell_type": "markdown", "id": "65e61b9e-d3b2-4aba-ac92-5d333bb5b802", "metadata": {}, "source": [ "Our first model is a decision tree, which is one of the oldest algorithms for classifying observations. 
Before we create any models, we *always* create a train-test split, so that the model is evaluated on data it never saw during training." ] }, { "cell_type": "code", "execution_count": 151, "id": "1d56f782-edf6-4ea1-a615-c89f683ca68c", "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0 1 2 3 4 5 6 7 8 9 ... 5188 5189 5190 5191 5192 5193 5194 5195 5196 5197
0 p p e p p p e e e p ... p e p e p p p p e p
1 p e e p p p e p e p ... p p p e e e p p e p
\n", "

2 rows × 5198 columns

\n", "
" ], "text/plain": [ " 0 1 2 3 4 5 6 7 8 9 ... 5188 5189 5190 5191 \\\n", "0 p p e p p p e e e p ... p e p e \n", "1 p e e p p p e p e p ... p p p e \n", "\n", " 5192 5193 5194 5195 5196 5197 \n", "0 p p p p e p \n", "1 e e p p e p \n", "\n", "[2 rows x 5198 columns]" ] }, "execution_count": 151, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Split the data into training and testing sets\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n", "\n", "# Initialize the DecisionTreeClassifier\n", "clf = DecisionTreeClassifier()\n", "\n", "# Fit the model\n", "clf.fit(X_train, y_train)\n", "\n", "# Make predictions\n", "y_pred = clf.predict(X_test)\n", "# Row 0 holds the predictions, row 1 the true labels\n", "results = pd.DataFrame([y_pred, y_test])\n", "results" ] }, { "cell_type": "markdown", "id": "e8a398ef-cfab-4c4a-9da8-c54e8007cb35", "metadata": {}, "source": [ "There are many metrics for evaluating classification models, and they are sometimes at odds. Accuracy is simplistic and obscures what could be more important: are there more false positives or more false negatives, and which kind of error matters more? In a task to identify poisonous mushrooms, a false negative (labeling a 'p' as an 'e') can be deadly. The \"recall\" on \"p\" below captures this: it measures the fraction of poisonous mushrooms that are correctly labeled as poisonous.\n", "\n", "Recall is not everything, though. You can easily get perfect \"poison\" recall by labeling every mushroom as poisonous! Precision measures the fraction of mushrooms you label as poisonous which actually are poisonous.\n", "\n", "The $F_1$-score is the harmonic mean of precision and recall, $F_1 = 2PR/(P+R)$, and strikes a bit of a balance between the two.\n", "\n", "In this notebook, look first at \"p-recall\", but keep an eye on the other metrics." 
] }, { "cell_type": "code", "execution_count": 153, "id": "4126dbec-ab98-4a33-a19e-4cd5ad088cb6", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Accuracy: 0.6671796844940362\n", "Classification Report:\n", " precision recall f1-score support\n", "\n", " e 0.70 0.69 0.70 2873\n", " p 0.62 0.64 0.63 2325\n", "\n", " accuracy 0.67 5198\n", " macro avg 0.66 0.66 0.66 5198\n", "weighted avg 0.67 0.67 0.67 5198\n", "\n" ] } ], "source": [ "# Evaluate the model\n", "accuracy = accuracy_score(y_test, y_pred)\n", "print(f\"Accuracy: {accuracy}\")\n", "\n", "print(\"Classification Report:\")\n", "print(classification_report(y_test, y_pred))" ] }, { "cell_type": "markdown", "id": "f8466d20-62df-44d3-a115-d54d07e572d7", "metadata": {}, "source": [ "A confusion matrix is a nice way to show everything that a classifier is doing. Rows are the actual classes and columns are the predicted classes, so the main diagonal holds the counts of correctly classified observations and the off-diagonal entries are errors. Unfortunately, this simple version is unlabeled." ] }, { "cell_type": "code", "execution_count": 157, "id": "c641e9dc-e55a-4b53-8ee7-4b16f61ad6f7", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[1978, 895],\n", " [ 835, 1490]])" ] }, "execution_count": 157, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay\n", "\n", "confusion_matrix(y_test, y_pred)" ] }, { "cell_type": "markdown", "id": "d11a4123-27e4-4fe0-a349-11886cc5b7e8", "metadata": {}, "source": [ "With a bit more work we can get a labeled version." ] }, { "cell_type": "code", "execution_count": 158, "id": "3bc8d713-cb12-46c1-9ace-d977e48e5b68", "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Calculate the confusion matrix\n", "cm = confusion_matrix(y_test, clf.predict(X_test))\n", "cm_display = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=clf.classes_)\n", "\n", "# Plot the confusion matrix\n", "fig, ax = plt.subplots(figsize=(8, 6))\n", "sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=clf.classes_, yticklabels=clf.classes_)\n", "plt.xlabel('Predicted')\n", "plt.ylabel('Actual')\n", "plt.title('Confusion Matrix')\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "390a17b1", "metadata": {}, "source": [ "### Testing more methods" ] }, { "cell_type": "markdown", "id": "209a30d2-00b0-4418-b188-fefad96c93cc", "metadata": {}, "source": [ "First I'll define a helper function that takes a classifier \"clf\" and a train/test split, and\n", "* fits the model to the training data\n", "* applies the model to the test data\n", "* prints an accuracy score and a classification report for the test data" ] }, { "cell_type": "code", "execution_count": 159, "id": "38933f3c", "metadata": {}, "outputs": [], "source": [ "def classifier_tryout(clf, X_train, y_train, X_test, y_test):\n", "\t\"\"\"Fit clf on the training data, then print test-set accuracy and a classification report.\"\"\"\n", "\tclf.fit(X_train, y_train)\n", "\n", "\t# Make predictions\n", "\ty_pred = clf.predict(X_test)\n", "\n", "\t# Evaluate the model\n", "\taccuracy = accuracy_score(y_test, y_pred)\n", "\tprint(f\"Accuracy: {accuracy}\")\n", "\n", "\tprint(\"Classification Report:\")\n", "\tprint(classification_report(y_test, y_pred))" ] }, { "cell_type": "markdown", "id": "d5cdbfcd-beb8-4ace-86e2-b72ccb121a08", "metadata": {}, "source": [ "In what follows, we run several very different models and compare their performance. We won't go into much detail about the models. 
But note how the scikit-learn API makes working with each of these models nearly identical." ] }, { "cell_type": "markdown", "id": "06ef4cb1-23c9-468a-8803-0db9dc26ad0b", "metadata": {}, "source": [ "### Random Forest" ] }, { "cell_type": "code", "execution_count": 160, "id": "876eb165-83e6-404d-9397-9ca9e0ed0985", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Accuracy: 0.7464409388226241\n", "Classification Report:\n", " precision recall f1-score support\n", "\n", " e 0.76 0.80 0.78 2873\n", " p 0.73 0.69 0.71 2325\n", "\n", " accuracy 0.75 5198\n", " macro avg 0.74 0.74 0.74 5198\n", "weighted avg 0.75 0.75 0.75 5198\n", "\n" ] } ], "source": [ "from sklearn.ensemble import RandomForestClassifier\n", "\n", "clf = RandomForestClassifier(random_state=42)\n", "classifier_tryout(clf, X_train, y_train, X_test, y_test)" ] }, { "cell_type": "markdown", "id": "1a69f38e-fbec-461b-9a67-6384638eb203", "metadata": {}, "source": [ "### Support Vector Machines" ] }, { "cell_type": "code", "execution_count": 116, "id": "4decd3d2-2549-40f2-8d6f-6dd9a6e9ef6d", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Accuracy: 0.7504809542131589\n", "Classification Report:\n", " precision recall f1-score support\n", "\n", " e 0.76 0.79 0.78 2873\n", " p 0.73 0.70 0.71 2325\n", "\n", " accuracy 0.75 5198\n", " macro avg 0.75 0.75 0.75 5198\n", "weighted avg 0.75 0.75 0.75 5198\n", "\n" ] } ], "source": [ "from sklearn.svm import SVC\n", "\n", "# Initialize the support vector classifier with an RBF kernel\n", "clf = SVC(random_state=42, kernel='rbf')\n", "classifier_tryout(clf, X_train, y_train, X_test, y_test)" ] }, { "cell_type": "markdown", "id": "aed528e4-5585-4ec9-9fe6-399a1cb54256", "metadata": {}, "source": [ "### Logistic Regression" ] }, { "cell_type": "code", "execution_count": 101, "id": "fd9ebd7b-48da-4652-a149-04bcf0bc015e", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Accuracy: 
0.746056175452097\n", "Classification Report:\n", " precision recall f1-score support\n", "\n", " e 0.76 0.78 0.77 2873\n", " p 0.72 0.70 0.71 2325\n", "\n", " accuracy 0.75 5198\n", " macro avg 0.74 0.74 0.74 5198\n", "weighted avg 0.75 0.75 0.75 5198\n", "\n" ] } ], "source": [ "from sklearn.linear_model import LogisticRegression\n", "\n", "# Initialize the LogisticRegression\n", "clf = LogisticRegression(random_state=42)\n", "classifier_tryout(clf, X_train, y_train, X_test, y_test)" ] }, { "cell_type": "markdown", "id": "eb802b79-197a-411d-aec9-c203c956ef30", "metadata": {}, "source": [ "### k-Nearest Neighbors" ] }, { "cell_type": "code", "execution_count": 102, "id": "31601aa2-588b-4997-aee2-dbc36e9ea45e", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Accuracy: 0.6854559445940747\n", "Classification Report:\n", " precision recall f1-score support\n", "\n", " e 0.70 0.76 0.73 2873\n", " p 0.67 0.60 0.63 2325\n", "\n", " accuracy 0.69 5198\n", " macro avg 0.68 0.68 0.68 5198\n", "weighted avg 0.68 0.69 0.68 5198\n", "\n" ] } ], "source": [ "from sklearn.neighbors import KNeighborsClassifier\n", "\n", "# Initialize the KNeighborsClassifier\n", "clf = KNeighborsClassifier(weights='uniform')\n", "classifier_tryout(clf, X_train, y_train, X_test, y_test)" ] }, { "cell_type": "markdown", "id": "f9bf489b-b3f8-45c3-8d4d-979e76eb1f7c", "metadata": {}, "source": [ "### Gradient Boosting" ] }, { "cell_type": "code", "execution_count": 103, "id": "3f393c0c-534d-44d7-8b12-3f5659a5117b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Accuracy: 0.7518276260100039\n", "Classification Report:\n", " precision recall f1-score support\n", "\n", " e 0.77 0.79 0.78 2873\n", " p 0.73 0.70 0.72 2325\n", "\n", " accuracy 0.75 5198\n", " macro avg 0.75 0.75 0.75 5198\n", "weighted avg 0.75 0.75 0.75 5198\n", "\n" ] } ], "source": [ "from sklearn.ensemble import GradientBoostingClassifier\n", "\n", "# Initialize the 
GradientBoostingClassifier\n", "clf = GradientBoostingClassifier(random_state=42)\n", "classifier_tryout(clf, X_train, y_train, X_test, y_test)" ] }, { "cell_type": "markdown", "id": "74f7245a-3478-4c53-afbf-53c2a9b3e0f4", "metadata": {}, "source": [ "### Neural Network" ] }, { "cell_type": "code", "execution_count": 104, "id": "786d1642-2aec-4987-bad9-afe56337cae2", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Accuracy: 0.7052712581762216\n", "Classification Report:\n", " precision recall f1-score support\n", "\n", " e 0.73 0.73 0.73 2873\n", " p 0.67 0.67 0.67 2325\n", "\n", " accuracy 0.71 5198\n", " macro avg 0.70 0.70 0.70 5198\n", "weighted avg 0.71 0.71 0.71 5198\n", "\n" ] } ], "source": [ "from sklearn.neural_network import MLPClassifier\n", "\n", "# Initialize the MLPClassifier\n", "# Note: learning_rate='adaptive' only takes effect with solver='sgd'; the default solver is 'adam'\n", "clf = MLPClassifier(random_state=42, hidden_layer_sizes=(1000,10,), learning_rate='adaptive')\n", "classifier_tryout(clf, X_train, y_train, X_test, y_test)" ] }, { "cell_type": "markdown", "id": "366f9ce3-2956-4dfd-a960-92697dd3e669", "metadata": {}, "source": [ "## Conclusion" ] }, { "cell_type": "markdown", "id": "e1c5e40a-a3ca-4a21-b9db-a3f6394d49a6", "metadata": {}, "source": [ "This notebook gave a quick overview of the full data analysis workflow on a real-world dataset: data ingestion and cleaning, exploratory data analysis, and modeling. There is much more we could do next to optimize the models, but that is beyond the scope of this notebook. Also, we only looked at categorical data here; continuous (quantitative) data is handled a bit differently, but not by much (linear regression is the first stop with continuous data).\n", "\n", "Which model above do you think is the best? Which one would you start with to try to get even better results?\n", "\n", "As an application, find your own dataset and try to mirror the process we took here. 
Not all the steps in this notebook will apply directly to whatever dataset you find, but see what you can come up with and how well you can model the target in your data!" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.9" } }, "nbformat": 4, "nbformat_minor": 5 }