Walkthrough: CSV First Reader

This guide walks through the thinking process for building this project. It does NOT give you the complete solution. For that, see SOLUTION.md.

Before reading this

Try the project yourself first. Spend at least 20 minutes. If you have not tried yet, close this file and open the project README.

Understanding the problem

You need to build a program that reads a CSV (comma-separated values) file, displays the data as a nicely formatted text table, and computes statistics for any columns that contain numbers. The program should auto-detect which columns are numeric.

A CSV file stores tabular data as plain text: the first line is the header (column names), and each subsequent line is a data row:

name,department,salary,years
Alice Johnson,Engineering,85000,5
Bob Smith,Marketing,62000,3

Planning before code

flowchart TD
    A[Read CSV file] --> B[load_csv: parse into list of dicts]
    B --> C[format_table: build aligned text table]
    B --> D[detect_numeric_columns: find columns with numbers]
    D --> E[column_stats: compute min/max/avg for each numeric column]
    C --> F[Print table]
    E --> F
    F --> G[Save results to JSON]

Four functions to build:

load_csv() -- read the file using csv.DictReader, return a list of dicts
detect_numeric_columns() -- figure out which columns contain numbers
column_stats() -- compute min, max, sum, average for a numeric column
format_table() -- display the data as an aligned text table
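
The flowchart ends with saving results to JSON, a step that none of the four functions covers. Here is a minimal sketch of that last step, assuming a hypothetical save_results() helper and a results.json output path (neither name comes from the project):

```python
import json
from pathlib import Path


def save_results(stats: list[dict], path: Path) -> None:
    """Write per-column statistics to a JSON file (hypothetical helper)."""
    path.write_text(json.dumps(stats, indent=2), encoding="utf-8")


# Example payload shaped like the dictionaries column_stats() returns.
example_stats = [
    {"column": "salary", "count": 2, "min": 62000.0, "max": 85000.0,
     "sum": 147000.0, "average": 73500.0},
]

if __name__ == "__main__":
    # The output file name is an assumption for illustration.
    save_results(example_stats, Path("results.json"))
```

The statistics are already plain dicts, lists, strings, and numbers, so json.dumps() can serialize them without any custom encoder.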

Step 1: Loading CSV data

Python's csv module has a DictReader class that automatically uses the first row as dictionary keys:

import csv
from pathlib import Path

def load_csv(path: Path) -> list[dict]:
    rows = []
    with open(path, encoding="utf-8", newline="") as f:
        reader = csv.DictReader(f)
        for row in reader:
            rows.append(row)
    return rows

After this, each row is a dictionary like {"name": "Alice Johnson", "department": "Engineering", "salary": "85000", "years": "5"}.

Important: all values come back as strings, even numbers. The string "85000" is not the number 85000 -- you need to convert it later.
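
To see the string behavior without touching a file, this small sketch feeds the sample rows to csv.DictReader through io.StringIO (an in-memory stand-in for the file, used here only for illustration):

```python
import csv
import io

# In-memory stand-in for a CSV file, using the sample rows from above.
sample = io.StringIO(
    "name,department,salary,years\n"
    "Alice Johnson,Engineering,85000,5\n"
    "Bob Smith,Marketing,62000,3\n"
)

rows = list(csv.DictReader(sample))
first = rows[0]

print(type(first["salary"]).__name__)                  # str
print(first["salary"] + first["years"])                # string concatenation: 850005
print(float(first["salary"]) + float(first["years"]))  # numeric addition: 85005.0
```

Adding the raw values glues the strings together, which is why the later steps call float() before doing any arithmetic.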

Predict before you scroll

Why use csv.DictReader instead of just splitting each line on commas? Think about what happens if a value itself contains a comma, like "Smith, Jr.".

Step 2: Detecting numeric columns

Not all columns can be averaged. You need to check whether every (non-empty) value in a column can be converted to a number:

def detect_numeric_columns(rows: list[dict]) -> list[str]:
    if not rows:
        return []

    columns = list(rows[0].keys())
    numeric = []

    for col in columns:
        is_numeric = True
        for row in rows:
            value = row[col].strip()
            if not value:
                continue  # skip empty cells
            try:
                float(value)
            except ValueError:
                is_numeric = False
                break
        if is_numeric:
            numeric.append(col)

    return numeric

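The strategy is try-and-fail: attempt to convert each non-empty value to a float. If any value in the column raises a ValueError, that column is not numeric. One consequence: because empty cells are skipped, a column whose cells are all empty is still classified as numeric, so column_stats() must cope with a column that yields no values.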
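
It helps to know exactly which strings float() accepts, since that decides what counts as "numeric" here. The sketch below wraps the same try/except in a hypothetical is_number() helper (not part of the project) and probes a few inputs:

```python
def is_number(text: str) -> bool:
    """Return True if float() accepts the text (hypothetical helper)."""
    try:
        float(text)
    except ValueError:
        return False
    return True


candidates = ["42", "3.14", "-7", "1e3", " 8 ", "nan", "12,500", "$40"]
results = {text: is_number(text) for text in candidates}

for text, ok in results.items():
    print(repr(text), ok)
```

Scientific notation, surrounding whitespace, and even "nan" pass the check, while thousands separators and currency symbols do not. If your data uses formats like "12,500", the column will be treated as text unless you clean the values first.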

Predict before you scroll

If a column has values ["10", "20", "N/A", "30"], would detect_numeric_columns classify it as numeric or text? Why?

Step 3: Computing column statistics

Once you know which columns are numeric, calculate stats for each:

def column_stats(rows: list[dict], column: str) -> dict:
    values = []
    for row in rows:
        val = row[column].strip()
        if val:
            values.append(float(val))

    if not values:
        return {"column": column, "count": 0}

    return {
        "column": column,
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "sum": round(sum(values), 2),
        "average": round(sum(values) / len(values), 2),
    }

The guard if not values returns early when a column has no non-empty cells. Without it, min() and max() would raise a ValueError on the empty list, and the average would divide by zero.
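
To see what the early return protects against, this small sketch runs the same operations on an empty list and records which exceptions appear:

```python
values: list[float] = []  # what column_stats() collects from an all-empty column
failures = []

try:
    min(values)
except ValueError as exc:
    failures.append(type(exc).__name__)

try:
    sum(values) / len(values)
except ZeroDivisionError as exc:
    failures.append(type(exc).__name__)

print(failures)  # ['ValueError', 'ZeroDivisionError']
```

Returning {"column": column, "count": 0} sidesteps both failures and still tells the caller that the column was seen but held no data.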

Step 4: Formatting the table

The table needs aligned columns. The approach: decide on a maximum cell width, pad shorter values with spaces, and truncate longer values.

def format_table(rows: list[dict], max_width: int = 15) -> str:
    if not rows:
        return "(empty table)"

    headers = list(rows[0].keys())

    def truncate(val: str) -> str:
        if len(val) > max_width:
            return val[:max_width - 3] + "..."
        return val

    header_line = "  ".join(truncate(h).ljust(max_width) for h in headers)
    separator = "  ".join("-" * max_width for _ in headers)

    lines = [header_line, separator]
    for row in rows:
        cells = [truncate(row.get(h, "")).ljust(max_width) for h in headers]
        lines.append("  ".join(cells))

    return "\n".join(lines)

Two string operations do the heavy lifting:

.ljust(width) pads a string with spaces on the right until it is width characters long
Slicing (val[:max_width - 3] + "...") cuts long strings short and marks the cut with ...
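
Here is a short sketch of padding and truncation in isolation, using an illustrative max_width of 10 rather than the function's default of 15:

```python
max_width = 10

short = "Bob"
long = "Engineering Department"

padded = short.ljust(max_width)           # pad on the right with spaces
truncated = long[:max_width - 3] + "..."  # keep 7 characters, then add "..."

print(repr(padded))                 # 'Bob       '
print(repr(truncated))              # 'Enginee...'
print(len(padded), len(truncated))  # 10 10
```

Both results are exactly max_width characters long, which is what keeps the columns aligned from row to row.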

Common mistakes

| Mistake | Why it happens | How to fix |
|---|---|---|
| Splitting on commas manually instead of using the `csv` module | Seems simpler, but breaks on quoted values like `"Smith, Jr."` | Always use the `csv` module |
| Treating all columns as numeric | Forgetting that `float("Alice")` raises `ValueError` | Detect numeric columns first with try/except |
| `column_stats` crashes on empty data | Calling `min()` or dividing by zero when no values exist | Check `if not values` and return early |
| Table columns misaligned | Not accounting for different value lengths | Use `.ljust()` to pad every cell to the same width |
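
The first mistake is easy to reproduce. This sketch compares a manual split with csv.reader on a single line containing a quoted comma (the line itself is made up for illustration):

```python
import csv
import io

line = '"Smith, Jr.",Marketing,62000\n'

naive = line.strip().split(",")                   # splits inside the quotes too
parsed = next(csv.reader(io.StringIO(line)))      # respects the quoting

print(len(naive), naive)    # 4 fields
print(len(parsed), parsed)  # 3 fields
```

The manual split produces four fields and leaves stray quote characters in two of them, while the csv module returns the three intended values.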

Testing your solution

Run the tests from the project directory:

pytest -q

The tests verify:

load_csv() returns the correct number of rows as dicts
detect_numeric_columns() correctly identifies numeric vs text columns
column_stats() computes accurate statistics
format_table() produces aligned output with headers
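
If you want to add a test of your own, a sketch might look like the following. It is not taken from the project's test suite: the test name and file contents are invented, and load_csv() is copied locally only so the example runs on its own.

```python
import csv
from pathlib import Path


def load_csv(path: Path) -> list[dict]:
    # Local copy of the Step 1 function so this sketch is self-contained;
    # in the project you would import it from your own module instead.
    with open(path, encoding="utf-8", newline="") as f:
        return list(csv.DictReader(f))


def test_load_csv_returns_dicts(tmp_path: Path) -> None:
    # tmp_path is pytest's built-in fixture that provides a temporary directory.
    csv_file = tmp_path / "people.csv"
    csv_file.write_text(
        "name,salary\nAlice Johnson,85000\nBob Smith,62000\n",
        encoding="utf-8",
    )

    rows = load_csv(csv_file)

    assert len(rows) == 2
    assert rows[0] == {"name": "Alice Johnson", "salary": "85000"}
```

Writing the input file inside the test keeps it independent of the sample data, so the expected row count and values are visible right next to the assertions.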

You can also test manually with the sample data:

python project.py --input data/sample_input.txt

What to explore next

Add a --columns flag that lets the user select which columns to display (comma-separated names)
Add row numbering to the formatted table so the user can reference specific rows