Running Time Analysis¶
Use numpy and pandas for this lab. Download the csv file "matrix_multiply_times.csv" and put it in this folder
!wget https://aet-cs.github.io/white/ML/lessons/matrix_multiply_times.csv
Read the csv as a DataFrame and display it (pd.read_csv
will help)
Drop the column "Matrix row size" and save it as a new dataframe (I like to use the names df,df2,df3,etc as I do this kind of work) (read about df.drop
)
Plot the data frame data, x
is the 'exp' column, the y
values are the other columns (times in seconds). Use pandas e.g.
df2.plot(...)
Add new columns that represent the log transformations of the existing time columns, because we want a log-log analysis eventually. Here's an example
df2['strass_log'] = np.log(df['Strassen Time']) ## PLEASE use tab-completion for this!
Then drop the old (non-log) columns. Store the result in a new data frame that has only 5 columns (one 'exp' and four log-times)
Plot the new dataframe line plots: x vs. log-times
Now we want to do a log-log regression. You will need numpy. Luckily pandas and numpy play nicely together. For example you could say
df = #some dataframe
x = df['x']
y = df['y']
coeffs = np.polyfit(x,y,1)
And you can even add in things like
y = np.log(y)
x = 2**x
x = x**2
y = y + np.sqrt(y)
for example
Do the log-log regressions and print out a table in this format
algorithm name, rate of growth (degree), corrcoeff
Recall you can get corrcoeff from
np.corrcoeff(x,y)[0,1]
You should notice some problems in your results, and the problems will depend on what you did above. What did you expect and how does it compare to what you got? Now your job is to fix it. Here are some things to consider
- Some regression may not even work because it doesn't like the data
- Are the units of your "x" column in the regression correct? (They should be log of the row length)
- What base are you using with the logs and are you consistent?
- Small values of $n$ might be skewing the data. Filter them out (see below)
- Rows with NA value are also problematic. Drop them judiciously
- Make sure you understand how to interpret the regression equations and numpy results
Work through the Pandas filtering tricks section and then do your analysis afterwards.
Pandas filtering tricks¶
import pandas as pd
import numpy as np
df = pd.DataFrame({'x': [1,2,3,10,20,-10,np.nan], 'y':[5,4,3,4,1,0,9]})
df['log_y'] = np.log2(df['y'])
df
You can index columns by boolean comparisons
df[df.x>5]
This just returns a list of T/F values
df.x>5
Here's a different syntax
df['x']<5
df['log_y']==-np.inf
Boolean 'not' is ~
in pandas
df[~(df['log_y']==-np.inf)]
df[~df['x'].isna()]
Analysis¶
Complete your final analysis here of the 4 running times by using techniques you have learned in this notebook. Print out your final table the the running time comparisons, based on your best justifiable arguments