Running Time Analysis¶

Use numpy and pandas for this lab. Download the csv file "matrix_multiply_times.csv" and put it in this folder

In [ ]:

!wget https://aet-cs.github.io/white/ML/lessons/matrix_multiply_times.csv

Read the csv as a DataFrame and display it (pd.read_csv will help)

In [ ]:

Drop the column "Matrix row size" and save it as a new dataframe (I like to use the names df,df2,df3,etc as I do this kind of work) (read about df.drop)

In [ ]:

Plot the data frame data, x is the 'exp' column, the y values are the other columns (times in seconds). Use pandas e.g.

df2.plot(...)

In [ ]:

Add new columns that represent the log transformations of the existing time columns, because we want a log-log analysis eventually. Here's an example

df2['strass_log'] = np.log(df['Strassen Time']) ## PLEASE use tab-completion for this!

Then drop the old (non-log) columns. Store the result in a new data frame that has only 5 columns (one 'exp' and four log-times)

In [ ]:

Plot the new dataframe line plots: x vs. log-times

Now we want to do a log-log regression. You will need numpy. Luckily pandas and numpy play nicely together. For example you could say

df = #some dataframe
x = df['x']
y = df['y']
coeffs = np.polyfit(x,y,1)

And you can even add in things like

y = np.log(y)
x = 2**x
x = x**2
y = y + np.sqrt(y)

for example

Do the log-log regressions and print out a table in this format

   algorithm name, rate of growth (degree), corrcoeff

Recall you can get corrcoeff from

np.corrcoeff(x,y)[0,1]

In [ ]:

You should notice some problems in your results, and the problems will depend on what you did above. What did you expect and how does it compare to what you got? Now your job is to fix it. Here are some things to consider

Some regression may not even work because it doesn't like the data
Are the units of your "x" column in the regression correct? (They should be log of the row length)
What base are you using with the logs and are you consistent?
Small values of $n$ might be skewing the data. Filter them out (see below)
Rows with NA value are also problematic. Drop them judiciously
Make sure you understand how to interpret the regression equations and numpy results

Work through the Pandas filtering tricks section and then do your analysis afterwards.

Pandas filtering tricks¶

In [ ]:

import pandas as pd
import numpy as np

In [ ]:

df = pd.DataFrame({'x': [1,2,3,10,20,-10,np.nan], 'y':[5,4,3,4,1,0,9]})
df['log_y'] = np.log2(df['y'])

In [ ]:

df

You can index columns by boolean comparisons

In [ ]:

df[df.x>5]

This just returns a list of T/F values

In [ ]:

df.x>5

Here's a different syntax

In [ ]:

df['x']<5

In [ ]:

df['log_y']==-np.inf

Boolean 'not' is ~ in pandas

In [ ]:

df[~(df['log_y']==-np.inf)]

In [ ]:

df[~df['x'].isna()]

Analysis¶

Complete your final analysis here of the 4 running times by using techniques you have learned in this notebook. Print out your final table the the running time comparisons, based on your best justifiable arguments

In [ ]: