# Cloudy Days in Leesburg

This notebook was inspired by the suspicion that, during a certain recent time window, weekends were much cloudier and rainier than weekdays. To check that suspicion, I downloaded weather data and analyzed it with Python and pandas. Read through this code carefully and figure out what each line is doing. When in doubt, use the internet!

In [None]:
import pandas as pd
pd.set_option("display.max_columns",300)

Analyze Weather data

In [None]:
# Before running this cell, make sure weather-daylight.csv is in the same directory as this Jupyter notebook.
df = pd.read_csv('weather-daylight.csv')

In [None]:
df.columns

In [None]:
df.describe()

In [None]:
df.shape

In [None]:
df['skyc1'].value_counts()

In [None]:
df.loc[30:50]

In [None]:
df[df['skyc1']=='CLR'].shape


In [None]:
df[(df['skyc1']=='CLR') & (df['p01i']>0)].shape


In [None]:
df[df['p01i']>0].shape[0] / df[df['p01i']>=0].shape[0]


In [None]:
for c in ['skyc1', 'skyc2', 'skyc3', 'skyc4']:
    print(c, df[df[c].isna()].shape)

In [None]:
df[(df['skyc1'] != df['skyc2']) & (df['skyc1']=='CLR')][['skyc1','skyc2']]

In [None]:
df[(df['skyc1'] != df['skyc2'])][['skyc1','skyc2']].value_counts()

In [None]:
total_p01i_per_day_et_new = df.groupby('Day Of Week ET')['p01i'].sum()

In [None]:
total_p01i_per_day_et_new

We would like to count the number of clear observations for each day of the week. We can do this with a `groupby` followed by `size()`.

In [None]:
df[df['skyc1'] == 'CLR'].groupby('Day Of Week ET').size()
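
To see what the filter-then-`groupby`-then-`size()` pattern does on a small scale, here is a minimal sketch on a made-up table. The column names mirror the notebook, but the values are hypothetical and are not taken from `weather-daylight.csv`. It also shows one thing to watch for: a day with no matching rows simply disappears from the result unless you put it back.

```python
import pandas as pd

# Hypothetical toy data: column names mirror the notebook, values are made up.
toy = pd.DataFrame({
    'Day Of Week ET': [0, 0, 1, 1, 1, 2],
    'skyc1': ['CLR', 'OVC', 'CLR', 'CLR', 'FEW', 'OVC'],
})

# Filter first, then count the surviving rows in each group.
clr_counts = toy[toy['skyc1'] == 'CLR'].groupby('Day Of Week ET').size()

# Day 2 has no CLR rows, so it is absent from clr_counts.
# Reindexing against every day present in the data restores it with a count of 0.
all_days = sorted(toy['Day Of Week ET'].unique())
clr_counts_full = clr_counts.reindex(all_days, fill_value=0)

print(clr_counts_full)
```

Whether the reindex step matters for the real data depends on whether every day of the week has at least one clear observation, which the bar chart in the next cell makes easy to check.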

In [None]:
df[df['skyc1'] == 'CLR'].groupby('Day Of Week ET').size().plot(kind="bar");

Does every day have the same number of recorded observations?

In [None]:
df.groupby('Day Of Week ET').size().plot(kind="bar");
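
A bar chart answers this question by eye. A table of counts and shares can make small imbalances easier to spot. The sketch below uses a hypothetical helper, `observation_balance`, and made-up data; neither is part of the original notebook.

```python
import pandas as pd

def observation_balance(frame, day_col='Day Of Week ET'):
    """Return per-day row counts and each day's share of all rows."""
    counts = frame.groupby(day_col).size()
    return pd.DataFrame({'rows': counts, 'share': counts / counts.sum()})

# Made-up example: day 0 has three rows, day 1 has two, day 2 has one.
toy = pd.DataFrame({'Day Of Week ET': [0, 0, 0, 1, 1, 2]})
balance = observation_balance(toy)
print(balance)
```

If the shares differ noticeably between days, raw counts of clear observations are not comparable across days, which is one reason to work with percentages instead.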

In [None]:
percentage_clr_days_et_new = (df[df['skyc1'] == 'CLR'].groupby('Day Of Week ET').size() / df.groupby('Day Of Week ET').size()) * 100

In [None]:
percentage_clr_days_et_new

In [None]:
percentage_clr_days_et_new.plot(kind='bar');

In [None]:
# Recalculate the necessary components based on the 'Day Of Week ET' column from the new dataset with daylight information

# Total precipitation (p01i) for each day of the week
total_p01i_per_day_et_new = df.groupby('Day Of Week ET')['p01i'].sum()

# Percentage of observations reporting OVC (in skyc1, skyc2 or skyc3) and CLR (in skyc1) for each day of the week
percentage_ovc_days_et_new = (df[(df['skyc1'] == 'OVC') | (df['skyc2'] == 'OVC') | (df['skyc3'] == 'OVC')].groupby('Day Of Week ET').size() / df.groupby('Day Of Week ET').size()) * 100
percentage_clr_days_et_new = (df[df['skyc1'] == 'CLR'].groupby('Day Of Week ET').size() / df.groupby('Day Of Week ET').size()) * 100

# Total number of rows (observations) for each day of the week
total_rows_per_day_et_new = df.groupby('Day Of Week ET').size()

# Combine all the data into one DataFrame
combined_et_df_new = pd.DataFrame({
    'Day Of Week ET': total_rows_per_day_et_new.index,
    'Total Number of Rows ET': total_rows_per_day_et_new.values,
    'Total p01i ET': total_p01i_per_day_et_new.values,
    'Percentage of OVC Days ET': percentage_ovc_days_et_new.values,
    'Percentage of CLR Days ET': percentage_clr_days_et_new.values
})

combined_et_df_new

Conclusions? BTW, which days of the week are days 5 and 6? How else could you define a good or bad day? In what ways could you change the question to make your desired outcome more likely? Is the result statistically significant?
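
One possible way to approach the significance question is a permutation test: shuffle the weekend/weekday labels many times and see how often a difference at least as large as the observed one appears by chance. The sketch below uses only the standard library. The function name `permutation_p_value` and all the numbers are made up for illustration and are not results from this dataset. It also assumes you have first reduced the data to one value per calendar date, since observations taken hours apart on the same day are not independent.

```python
import random

def permutation_p_value(weekend, weekday, n_resamples=10_000, seed=0):
    """Two-sided permutation test for a difference in means."""
    rng = random.Random(seed)
    observed = abs(sum(weekend) / len(weekend) - sum(weekday) / len(weekday))
    pooled = list(weekend) + list(weekday)
    n_weekend = len(weekend)
    extreme = 0
    for _ in range(n_resamples):
        # Shuffling the pooled values reassigns the weekend/weekday labels at random.
        rng.shuffle(pooled)
        a, b = pooled[:n_weekend], pooled[n_weekend:]
        diff = abs(sum(a) / len(a) - sum(b) / len(b))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    return (extreme + 1) / (n_resamples + 1)

# Made-up daily cloud fractions (0 = clear all day, 1 = overcast all day).
weekend_days = [0.9, 0.8, 0.7, 0.95, 0.6, 0.85]
weekday_days = [0.4, 0.5, 0.3, 0.7, 0.2, 0.6, 0.45, 0.35, 0.55, 0.5]

p = permutation_p_value(weekend_days, weekday_days)
print(f"p-value: {p:.4f}")
```

A small p-value here only says the made-up weekend values are unlikely to differ this much by chance. For the real data, keep in mind that trying several definitions of a "bad day" and reporting only the one that works inflates the chance of a false positive.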