# Module 07 / Project 03 — Data Cleaning

## Learn Your Way
| Read | Build | Watch | Test | Review | Visualize | Try |
|---|---|---|---|---|---|---|
| — | This project | — | — | Flashcards | — | — |
## Focus

- Detecting missing values with `isna()` and `isna().sum()`
- Handling missing values with `fillna()` and `dropna()`
- Converting data types with `astype()` and `pd.to_numeric()`
- Finding and removing duplicates with `duplicated()` and `drop_duplicates()`
- Merging DataFrames with `pd.merge()`
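As a quick taste of the detection step, here is a minimal sketch using a hypothetical toy DataFrame (the project's real data lives in `data/messy_sales.csv`):

```python
import pandas as pd

# Hypothetical toy frame standing in for the messy CSV.
df = pd.DataFrame({
    "quantity": [2, None, 4],          # one missing value
    "price": [9.99, None, None],       # two missing values
})

# isna() marks missing cells as True; .sum() counts them per column.
s = df.isna().sum()
print(s)
```

`df.isna()` returns a same-shaped boolean frame, so summing it column-wise gives a per-column missing-value count.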
## Why this project exists
Real-world data is messy. CSV files from production systems have missing values, wrong data types, duplicate rows, and inconsistencies. Before you can analyze data, you must clean it. This project walks you through a realistic cleaning workflow: detect problems, fix them step by step, merge in reference data, and save the cleaned result. These skills are used in every data analysis job.
## Run
## Expected output

```text
=== Step 1: Load messy data ===
Loaded 51 rows from data/messy_sales.csv
=== Step 2: Inspect the mess ===
Missing values per column:
order_id    0
product_id  0
quantity    4
price       4
date        2
region      3
dtype: int64
Data types:
order_id      int64
product_id   object
quantity     object  <-- should be numeric!
price       float64
date         object
region       object
dtype: object
=== Step 3: Fix data types ===
Converted quantity: 2 values could not be converted (set to NaN)
=== Step 4: Handle missing values ===
Before: 51 rows
After dropping rows with missing quantity or price: 43 rows
Filled 2 missing region values with "Unknown"
=== Step 5: Remove duplicates ===
Found 1 duplicate rows
After removing duplicates: 42 rows
=== Step 6: Merge with product names ===
After merge: 42 rows, 8 columns
New columns: product_name, category
=== Step 7: Save cleaned data ===
Saved cleaned data to data/cleaned_sales.csv
Done. Cleaned 51 messy rows down to 42 clean rows.
```
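The project's script is not reproduced here, but the whole workflow can be sketched on a small inline stand-in for the CSV (rows, products, and file contents below are invented for illustration):

```python
import pandas as pd
from io import StringIO

# Inline stand-in for data/messy_sales.csv (hypothetical rows).
raw = StringIO(
    "order_id,product_id,quantity,price,region\n"
    "1,A1,2,9.99,North\n"
    "2,A2,oops,4.50,South\n"   # unparseable quantity
    "3,A1,3,,North\n"          # missing price
    "4,A3,1,2.00,\n"           # missing region
    "4,A3,1,2.00,\n"           # exact duplicate of the row above
)
df = pd.read_csv(raw)

# Step 3: coerce quantity to numeric; bad values become NaN.
df["quantity"] = pd.to_numeric(df["quantity"], errors="coerce")

# Step 4: drop rows missing quantity or price, fill missing region.
df = df.dropna(subset=["quantity", "price"])
df["region"] = df["region"].fillna("Unknown")

# Step 5: remove exact duplicate rows.
df = df.drop_duplicates()

# Step 6: merge in a hypothetical product reference table.
products = pd.DataFrame({"product_id": ["A1", "A2", "A3"],
                         "product_name": ["Widget", "Gadget", "Gizmo"]})
df = df.merge(products, on="product_id", how="left")
print(df)
```

Five messy rows come in; the unparseable and incomplete ones are dropped, the duplicate collapses, and two clean rows with product names come out, mirroring the 51-to-42 reduction above at toy scale.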
## Alter it

- Instead of dropping rows with missing `quantity`, fill them with the median quantity. How does the row count change?
- Fill missing `region` values with the most common region instead of "Unknown". (Hint: `df["region"].mode()[0]`.)
- After merging, group by `category` and count how many orders each category has.
- Add a new calculated column: `total = quantity * price`.
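A minimal sketch of the fill-based alterations, on a hypothetical toy frame rather than the project's CSV:

```python
import pandas as pd

# Hypothetical frame with gaps (the project uses data/messy_sales.csv).
df = pd.DataFrame({
    "quantity": [2.0, None, 4.0, 6.0],
    "price": [1.0, 2.0, 3.0, 4.0],
    "region": ["North", "North", None, "South"],
})

# Fill missing quantity with the median instead of dropping the row.
df["quantity"] = df["quantity"].fillna(df["quantity"].median())

# Fill missing region with the most common value.
df["region"] = df["region"].fillna(df["region"].mode()[0])

# Calculated column.
df["total"] = df["quantity"] * df["price"]
print(df)
```

Note that `median()` and `mode()` both ignore NaN, so they are computed from the values that are present.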
## Break it

- Try `df["quantity"].astype(int)` before cleaning the non-numeric values. What error do you get?
- Merge on a column name that does not exist in one of the DataFrames. Read the error.
- Call `dropna()` with no arguments. How many rows survive? Why is this usually too aggressive?
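The first and third experiments can be reproduced safely on toy data (frames below are invented for illustration):

```python
import pandas as pd

# An unparseable string makes astype(int) raise instead of producing NaN.
df = pd.DataFrame({"quantity": ["3", "oops", "5"]})
try:
    df["quantity"].astype(int)
    converted = True
except ValueError as err:
    converted = False
    print("astype failed:", err)

# dropna() with no arguments drops any row containing at least one NaN.
df2 = pd.DataFrame({"a": [1.0, None, 3.0], "b": [None, 2.0, 3.0]})
survivors = df2.dropna()
print(len(survivors), "of", len(df2), "rows survive")
```

Only the single fully complete row survives the bare `dropna()`, which is why it tends to discard far more data than intended.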
## Fix it

- Use `pd.to_numeric(df["quantity"], errors="coerce")` instead of `astype(int)`. The `errors="coerce"` flag turns unparseable values into NaN instead of crashing.
- Check that the merge column exists in both DataFrames before merging.
- Use `dropna(subset=["quantity", "price"])` to only drop rows where specific columns are missing, not every column.
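The three fixes together, sketched on a hypothetical frame that deliberately lacks the merge key so the guard has something to catch:

```python
import pandas as pd

df = pd.DataFrame({"quantity": ["3", "oops", None],
                   "price": [1.0, 2.0, None]})

# Coerce instead of crashing: bad strings become NaN.
df["quantity"] = pd.to_numeric(df["quantity"], errors="coerce")

# Only drop rows where the columns we actually need are missing.
df = df.dropna(subset=["quantity", "price"])

# Guard the merge: confirm the key exists in both frames first.
products = pd.DataFrame({"product_id": ["A1"], "product_name": ["Widget"]})
key = "product_id"
if key in df.columns and key in products.columns:
    df = df.merge(products, on=key)
else:
    print(f"cannot merge: {key!r} missing from one side")
```

The guard trades an exception at merge time for an explicit, readable message, which is usually easier to act on in a longer pipeline.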
Explain it¶
- What is the difference between
fillna()anddropna()? When would you use each? - What does
errors="coerce"do inpd.to_numeric()? Why is it useful for messy data? - What is a merge/join? How is
pd.merge()similar to a SQL JOIN? - Why should you check for duplicates before analyzing data?
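For the merge/join question, a small sketch of the SQL analogy (toy frames invented for illustration):

```python
import pandas as pd

orders = pd.DataFrame({"product_id": ["A1", "A2", "A2"],
                       "qty": [2, 1, 5]})
products = pd.DataFrame({"product_id": ["A1", "A2"],
                         "name": ["Widget", "Gadget"]})

# Roughly: SELECT * FROM orders JOIN products USING (product_id)
merged = pd.merge(orders, products, on="product_id", how="inner")
print(merged)
```

As in SQL, each order row is matched with every product row sharing its key, and `how=` plays the role of INNER/LEFT/RIGHT/FULL OUTER.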
## Mastery check
You can move on when you can:
- Detect missing values in any column and decide whether to fill or drop them.
- Convert a column from string to numeric, handling unparseable values gracefully.
- Find and remove duplicate rows.
- Merge two DataFrames on a shared column.
- Explain the cleaning decisions you made and why.
## Related Concepts
- Collections Explained
- Files and Paths
- How Loops Work
- Types and Conversions
- Quiz: Collections Explained