Module 07 — Data Analysis¶

Overview¶

This module teaches you how to load, clean, explore, and visualize data using Python's most popular data analysis libraries: pandas and matplotlib. You will work with CSV files containing realistic data sets — student grades, messy sales records, and transaction logs — progressively building up to a complete analysis pipeline.

Every project uses local CSV files so you do not need a database or internet connection. By the end you will be able to take a raw data file, clean it, answer questions about it, produce charts, and export a summary report.

Prerequisites¶

Complete Level 2 before starting this module. You should be comfortable with:

Functions and return values
Reading and writing files
Dictionaries and lists
Basic testing with pytest
Running scripts from the command line

Learning objectives¶

By the end of this module you will be able to:

Load CSV data into a pandas DataFrame and explore its shape, types, and summary statistics.
Filter rows with boolean indexing, group data, and compute aggregations.
Detect and handle missing values, fix data types, remove duplicates, and merge DataFrames.
Create bar charts, line charts, scatter plots, and multi-panel figures with matplotlib.
Build a complete analysis pipeline: load, clean, analyze, visualize, and export a report.

Projects¶

#	Project	What you learn
01	Pandas Basics	DataFrame, read_csv, head(), describe(), info(), shape, dtypes
02	Filtering & Grouping	Boolean indexing, .loc[], groupby(), agg(), value_counts()
03	Data Cleaning	isna(), fillna(), dropna(), dtype conversion, duplicates, merge
04	Visualization	matplotlib bar, line, scatter, subplots, labels, saving figures
05	Analysis Report	Full pipeline — load, clean, analyze, visualize, export summary

Work through them in order. Each project builds on the previous one.

Setup¶

Create a virtual environment and install dependencies before starting:

cd projects/modules/07-data-analysis
python -m venv .venv
source .venv/bin/activate    # macOS/Linux
.venv\Scripts\activate       # Windows
pip install -r requirements.txt

See concepts/virtual-environments.md for a full explanation of virtual environments.

Dependencies¶

This module requires three packages (listed in requirements.txt):

pandas — the workhorse of data analysis in Python. It gives you the DataFrame, a spreadsheet-like structure you can filter, group, and transform with one-liners.
matplotlib — the standard plotting library. You can create bar charts, line charts, scatter plots, histograms, and complex multi-panel figures.
openpyxl — an Excel file engine that pandas uses under the hood when reading or writing .xlsx files. We install it so pandas has full Excel support available if you want to experiment.

A note on data analysis workflow¶

Real data analysis follows a predictable cycle:

Load — read the raw data from a file or database.
Explore — look at shape, types, summary stats, and sample rows.
Clean — fix missing values, wrong types, duplicates, and inconsistencies.
Analyze — filter, group, aggregate, and compute the numbers you need.
Visualize — create charts that make patterns visible.
Report — export findings so others can act on them.

This module walks you through each step across five projects, then combines them all in the final project.