Module 07 / Project 01 — Pandas Basics¶
Learn Your Way¶
| Read | Build | Watch | Test | Review | Visualize | Try |
|---|---|---|---|---|---|---|
| — | This project | Walkthrough | — | Flashcards | — | — |
Focus¶
- Loading CSV data with `pd.read_csv()`
- Exploring a DataFrame: `head()`, `tail()`, `shape`, `dtypes`, `info()`, `describe()`
- Selecting columns by name
- Sorting rows with `sort_values()`
Why this project exists¶
Before you can analyze data, you need to know how to load it and look at it. This project teaches you how to get a CSV file into a pandas DataFrame and use built-in methods to understand what the data looks like — how many rows, what columns exist, what types the values are, and what the basic statistics tell you. These exploration steps are the first thing every data analyst does with a new data set.
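The project script itself is not reproduced on this page, so the following is only a minimal sketch of the load-and-explore steps described above. It assumes a tiny in-memory stand-in for `data/students.csv` (the five rows shown under "Expected output"; the real file has 30), so it runs without any file on disk.

```python
import io

import pandas as pd

# Hypothetical stand-in for data/students.csv: only the five rows that appear
# in the expected output, kept in memory so no file is needed.
csv_text = """name,subject,grade,age
Alice Chen,Math,92,17
Bob Martinez,Science,78,16
Carol Johnson,English,85,17
David Kim,Math,67,16
Eva Patel,Science,91,18
"""

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

print(df.head(3))     # first three rows
print(df.tail(2))     # last two rows
print(df.shape)       # (number of rows, number of columns)
print(df.dtypes)      # one dtype per column
df.info()             # prints column names, non-null counts, and memory usage
print(df.describe())  # count, mean, std, min, quartiles, max for numeric columns
```

With the real file you would pass the path instead, for example `pd.read_csv("data/students.csv")`.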
Run¶
Expected output¶
=== Loading student data ===
Loaded 30 rows and 4 columns from data/students.csv
=== First 5 rows (head) ===
name subject grade age
0 Alice Chen Math 92 17
1 Bob Martinez Science 78 16
2 Carol Johnson English 85 17
3 David Kim Math 67 16
4 Eva Patel Science 91 18
=== Shape ===
Rows: 30, Columns: 4
=== Column types (dtypes) ===
name object
subject object
grade int64
age int64
dtype: object
=== Summary statistics (describe) ===
grade age
count 30.000000 30.000000
mean 80.100000 17.000000
...
=== Selecting just name and grade columns ===
(first 5 rows)
name grade
0 Alice Chen 92
1 Bob Martinez 78
...
=== Sorted by grade (highest first) ===
(first 10 rows)
name subject grade age
18 Sam Turner Math 96 17
...
Done.
The exact numbers come from the CSV data. The `...` lines mark output that is abbreviated here; your run will print every row and statistic for each section.
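As a rough illustration of how the last two sections of that output could be produced, here is a sketch using the same five-row stand-in data (an assumption; the project's own script and full CSV are not shown here).

```python
import io

import pandas as pd

# Hypothetical five-row stand-in for data/students.csv.
csv_text = """name,subject,grade,age
Alice Chen,Math,92,17
Bob Martinez,Science,78,16
Carol Johnson,English,85,17
David Kim,Math,67,16
Eva Patel,Science,91,18
"""
df = pd.read_csv(io.StringIO(csv_text))

print("=== Selecting just name and grade columns ===")
name_grade = df[["name", "grade"]]  # a list of names selects several columns
print(name_grade.head())

print("=== Sorted by grade (highest first) ===")
top = df.sort_values("grade", ascending=False)  # descending order
print(top.head(10))
```

Sorting returns a new DataFrame and leaves `df` unchanged. Each row keeps its original index label, which is why the first sorted row in the expected output is labelled `18` rather than `0`.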
Alter it¶
- Change `head()` to `head(10)` and see what happens. Try `tail(3)`.
- Sort by `age` instead of `grade`. What happens when two students have the same age?
- Select three columns instead of two. What does `df[["name", "subject", "grade"]]` return?
- Try `df["grade"].mean()` and `df["grade"].max()`. What do they return?
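If you want to compare your results, the experiments above might look like the sketch below. It again assumes a five-row stand-in for the CSV, and the second sort key (`name`) is an addition for illustration: with the default sort algorithm, the order of rows that tie on `age` is not guaranteed, so a tie-breaker makes the result predictable.

```python
import io

import pandas as pd

# Hypothetical five-row stand-in for data/students.csv.
csv_text = """name,subject,grade,age
Alice Chen,Math,92,17
Bob Martinez,Science,78,16
Carol Johnson,English,85,17
David Kim,Math,67,16
Eva Patel,Science,91,18
"""
df = pd.read_csv(io.StringIO(csv_text))

# head(n) and tail(n) return at most n rows; asking for more rows than
# exist simply returns all of them.
first_ten = df.head(10)
last_three = df.tail(3)

# Sort by age, breaking ties alphabetically by name.
by_age = df.sort_values(["age", "name"])

# A list of three names returns a DataFrame with three columns.
three_cols = df[["name", "subject", "grade"]]

# Aggregating a single column returns one value, not a table.
mean_grade = df["grade"].mean()
max_grade = df["grade"].max()

print(by_age)
print(three_cols.shape, mean_grade, max_grade)
```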
Break it¶
- Change the filename in `read_csv()` to a file that does not exist. What error do you get?
- Try selecting a column that does not exist: `df["score"]`. Read the error message.
- Remove the `import pandas as pd` line. What happens?
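To check what you saw, this small sketch triggers the first two failures on purpose and records which exception each one raises. The filename `data/no_such_file.csv` is hypothetical and is assumed not to exist.

```python
import io

import pandas as pd

df = pd.read_csv(io.StringIO("name,grade\nAlice Chen,92\n"))

# Reading a path that does not exist raises FileNotFoundError.
try:
    pd.read_csv("data/no_such_file.csv")  # hypothetical missing file
except FileNotFoundError as err:
    missing_file_error = type(err).__name__
    print("read_csv failed:", err)

# Asking for a label that is not in df.columns raises KeyError.
try:
    df["score"]
except KeyError as err:
    missing_column_error = type(err).__name__
    print("column lookup failed:", err)
```

The third case is not shown because it cannot be caught this way: without the import, the name `pd` is undefined, so Python raises `NameError` the first time the script uses `pd`.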
Fix it¶
- Wrap `read_csv()` in a try/except that catches `FileNotFoundError` and prints a friendly message.
- Before selecting a column, check if it exists: `if "score" in df.columns`.
- Put the import back.
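One possible shape for these fixes is sketched below. The helper names `load_students` and `get_column` are hypothetical, as is the choice to return `None` on failure; your own script may handle the problem differently.

```python
import io

import pandas as pd


def load_students(path):
    """Return the CSV as a DataFrame, or None if the file is missing."""
    try:
        return pd.read_csv(path)
    except FileNotFoundError:
        print(f"Could not find {path}. Check the filename and folder.")
        return None


def get_column(df, column):
    """Return the named column, or None if the DataFrame has no such column."""
    if column in df.columns:
        return df[column]
    print(f"No column named {column!r}. Available: {list(df.columns)}")
    return None


# Hypothetical missing file: prints the friendly message instead of a traceback.
missing = load_students("data/no_such_file.csv")

# In-memory stand-in data so the column check can be tried without a file.
df = load_students(io.StringIO("name,grade\nAlice Chen,92\n"))
score = get_column(df, "score")  # prints a message and returns None
grade = get_column(df, "grade")  # returns the grade column
```

Catching only `FileNotFoundError` is deliberate: other problems, such as a malformed file, still surface as errors instead of being hidden.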
Explain it¶
- What is a DataFrame? How is it different from a list of dictionaries?
- What does `describe()` tell you that `info()` does not?
- Why does `dtypes` show `object` for the name and subject columns instead of `string`?
- What is the difference between `df["grade"]` (one column) and `df[["grade"]]` (double brackets)?
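For the first and last questions, a short experiment can help you form your own answer. The sketch below uses two made-up records, built directly from a list of dictionaries rather than from the project CSV.

```python
import pandas as pd

# The same data held two ways: a plain list of dictionaries and a DataFrame.
records = [
    {"name": "Alice Chen", "grade": 92},
    {"name": "Bob Martinez", "grade": 78},
]
df = pd.DataFrame(records)

# With a list of dictionaries you loop over rows yourself.
mean_from_list = sum(r["grade"] for r in records) / len(records)

# A DataFrame works on a whole column at once.
mean_from_frame = df["grade"].mean()

# Single brackets return a Series (one-dimensional).
one_column = df["grade"]
# Double brackets pass a list of names and return a DataFrame (two-dimensional).
one_column_frame = df[["grade"]]

print(type(one_column).__name__, one_column.shape)
print(type(one_column_frame).__name__, one_column_frame.shape)
```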
Mastery check¶
You can move on when you can:
- Load any CSV file into a DataFrame without looking up the syntax.
- Use `head()`, `shape`, `dtypes`, `info()`, and `describe()` to explore a new data set.
- Select one or more columns from a DataFrame.
- Sort a DataFrame by any column, ascending or descending.
Related Concepts¶
- Collections Explained
- Files and Paths
- Types and Conversions
- What is a Variable
- Quiz: Collections Explained