Exploring data

The file weather-daylight.csv contains observational weather data for Leesburg, VA. We want to analyze the hypothesis “the fall of 2023 had more cloudy and rainy weekends than normal.” As a class, let’s look at the data and talk about our ideas for processing it. Open this file in Microsoft Excel, or something like it.

What questions do you have? What pre-processing is relevant? What types of calculations would support or refute the claim? Do you know how to make those calculations in Python or another language or tool?

Look at It! LOOK at IT!

(Bonus points if you know the Seinfeld reference). The first thing to do is just look at the data set. What do you see? Here are some questions you should ask?

  • How many rows, how many columns?
  • Is the data rectangular (are all rows the same length?)
  • What types of data? (Numerical, categorical?)
  • What domain of data (numerical: min and max, precision, mean, variance; categorical: number of categories)
  • Any missing data?
  • Is the file clean (read/write errors? paragraphs of text before or after? anything else weird?)
  • Find meaning: What do the columns are rows mean? Are there headers? Are they defined?
  • What is the source? Is this data reliable?
  • Weather columns

Analyse it

  • What question are you asking?
  • What does the data say about the question?
  • Repeat the last 2 steps as needed!

Jupyter Notebook

  • Follow this link to a jupyter notebook
  • I believe you can save and open notebooks from this interface, as long as you are using the same Chrome profile and history (it uses local storage to save state).

Homework

  • Open the jupyter notebook above from class on mybinder.org (or you can run it locally as jupyter-lab in your ml folder on WSL. Make sure to copy the notebook to your ml folder first – download it).
  • Devise a different way to analyse the question about the weather in Leesburg and try to analyse it using pandas. What conclusion can you draw?
  • Make some interesting plots from this dataset. You will need to read up on plotting in dataframes using pandas and possibly some things about pyplot.
  • Consider the London Weather dataset. Investigate the question “Has the weather in London gotten worse in the last 50 years?” Analyse the data and make a claim that you can support. Source for dataset: https://www.kaggle.com/datasets/emmanuelfwerr/london-weather-data, which retrieved the data from https://www.ecad.eu/dailydata/index.php.
  • Be prepared to discuss your findings and present to the class if asked to! (Your research doesn’t need to be profound but I do need to see you’re learning how to use pandas).
  • To be safe, you should download any notebooks you create on mybinder.org because I don’t know how reliable its storage system is.
  • WSL Problems: Fix posted!
  • In general I’m always happy to answer homework questions. Remind is the best way to reach me in the evening so don’t shy away from asking. If I can’t help, I’ll say so. Otherwise I’ll try my best!