This post gets you productive with the core Python stack for data work: pandas, NumPy, and Jupyter. You will create a clean workspace, learn idiomatic patterns, and run a small example on provided sample data. If Part 1 was about “what” and “why,” this part is about “how do I start doing it with the right habits.”
Why Python for Data Science?
- Huge ecosystem: pandas for tabular data, NumPy for arrays, scikit-learn for ML, Matplotlib/Seaborn/Plotly for viz.
- Fast path from prototype to production: the same language can power notebooks, APIs (FastAPI), and pipelines (Airflow/Prefect).
- Community and packages: almost every data source (databases, cloud storage, APIs) has a Python client.
- Ergonomics: readable syntax, rich notebooks, and plenty of learning resources.
Core Tools at a Glance
| Tool | Purpose | Notes |
|---|---|---|
| pandas | Tabular data (DataFrames), joins, groupby, IO | Think “Excel + SQL + Python” |
| NumPy | Fast n-dimensional arrays and math | Underpins pandas; great for vectorized ops |
| JupyterLab | Interactive notebooks for exploration | Mix code, text, and charts |
| Matplotlib/Seaborn | Plotting basics and statistical visuals | Seaborn builds on Matplotlib for nicer defaults |
| VS Code (optional) | IDE for refactoring and debugging | Use with Python + Jupyter extensions |
Learning Goals
- Set up a reproducible Python environment and Jupyter workspace (so your notebook runs the same next week).
- Load and inspect tabular data with pandas; use NumPy for fast array math.
- Apply clean code patterns: tidy data, assign, pipe, query, and explicit dtypes.
- Keep a project folder organized so future you (and teammates) can debug quickly.
- Know what “good enough” exploration looks like before you jump to modeling.
Setup: Environment and Dependencies
- Install Python 3.11+ (or your team’s standard). Consistency avoids “works on my machine.”
- Create a virtual environment and install essentials (keeps dependencies isolated per project):
python3 -m venv .venv
source .venv/bin/activate
pip install -U pip
pip install pandas numpy jupyterlab matplotlib seaborn
Shortcut: run make install (creates .venv and installs from requirements.txt), then make notebook to launch JupyterLab.
- Launch JupyterLab in this repo for notebooks:
jupyter lab
- Optional VS Code setup: install the Python and Jupyter extensions; set the interpreter to .venv/bin/python.
- Alternative: if you prefer Conda, create an environment with conda create -n ds python=3.11 pandas numpy jupyterlab seaborn.
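Whichever route you take, confirm that the notebook kernel actually points at the environment you just created. A one-cell sanity check like the following is enough (exact versions will differ on your machine):
import sys
import numpy as np
import pandas as pd
# Confirm which interpreter the kernel is using and which library versions are installed
print(sys.executable)
print("python", sys.version.split()[0])
print("pandas", pd.__version__, "| numpy", np.__version__)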
Quick Data Check: pandas + NumPy in Action
Sample files: part2/orders.csv (orders with product, quantity, price) and part2/customers.csv (customer segment and country). This tiny example mirrors a common starter task: join transactions to customer attributes, compute revenue, and summarize by segment.
import pandas as pd
import numpy as np
orders = pd.read_csv("part2/orders.csv", parse_dates=["order_date"])
customers = pd.read_csv("part2/customers.csv")
# Tidy types for memory and consistency
orders["product"] = orders["product"].astype("category")
customers["segment"] = customers["segment"].astype("category")
# Basic quality checks (fast assertions catch bad inputs early)
assert orders["quantity"].ge(0).all()
assert orders["price"].ge(0).all()
# Revenue per order
orders = orders.assign(order_revenue=lambda d: d["quantity"] * d["price"])
# Join customer info (left join preserves all orders)
orders = orders.merge(customers, on="customer_id", how="left")
# Segment-level summary
summary = (
    orders.groupby("segment")
    .agg(
        orders=("order_id", "count"),
        revenue=("order_revenue", "sum"),
        avg_order_value=("order_revenue", "mean"),
    )
    .sort_values("revenue", ascending=False)
)
print(summary)
# NumPy example: standardize revenue for quick z-scores (population std, ddof=0)
revenue = orders["order_revenue"].to_numpy()
orders["revenue_z"] = (revenue - revenue.mean()) / revenue.std()
What you just did:
- Loaded CSVs with explicit date parsing and dtypes (prevents surprises later).
- Added a computed column via assign to keep the transformation readable.
- Joined customer attributes to transactions with a left join (common pattern).
- Summarized by segment to answer “which customer type drives revenue?”
- Used NumPy to add a quick z-score—handy for outlier checks or bucketing.
Expected summary output (with the sample data):
orders revenue avg_order_value
segment
Enterprise 4 651.00 162.75
SMB 5 385.50 77.10
Consumer 1 66.00 66.00
Use this as a quick sense check: values are positive, orders count matches the CSV, and Enterprise drives the most revenue.
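If you want that sense check to be executable, a couple of optional assertions on the summary and orders frames from above will do it:
# Optional: turn the sense check into quick assertions (uses summary and orders from the example above)
assert (summary["revenue"] > 0).all()
assert summary["orders"].sum() == len(orders)  # every order landed in a segment
assert summary["revenue"].idxmax() == "Enterprise"  # matches the expected ranking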
Companion Notebook
- Path: part2/notebooks/part2-example.ipynb.
- Run with the Makefile: make install (first time), then make notebook and open the notebook.
- Without Makefile: jupyter lab part2/notebooks/part2-example.ipynb (use your activated environment).
Idiomatic pandas Patterns
- assign: Add columns without breaking method chains.
- pipe: Encapsulate reusable transformations and keep chains readable.
- query: Express simple filters with readable expressions.
- Explicit dtypes: use astype and to_datetime to avoid silent conversions.
- Small helpers: prefer value_counts(normalize=True) for quick proportions (short example after the pipe/query snippet below).
Example using pipe and query:
def add_order_revenue(df):
    return df.assign(order_revenue=lambda d: d["quantity"] * d["price"])

(orders
 .pipe(add_order_revenue)
 .query("order_revenue > 200")
 .groupby("product")["order_revenue"]
 .mean()
)
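The astype and value_counts helpers from the list above are just as short. For illustration, using the customers frame from the earlier example (and assuming its country column is literally named country):
# Explicit dtypes: cast low-cardinality string columns to category to save memory
customers = customers.astype({"segment": "category", "country": "category"})
# Quick proportions: what share of customers falls into each segment?
print(customers["segment"].value_counts(normalize=True))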
Notebook Hygiene and Habits
- Start every notebook with imports, configuration, and a short Markdown cell stating the question you are answering.
- Pin a random seed for reproducibility when sampling or modeling.
- Keep side effects contained: write outputs to a data/ or reports/ folder, not your repo root (see the starter cell after this list).
- Restart-and-run-all before sharing; if it fails, fix it before committing.
- When a notebook grows too large, move reusable code into src/ functions and re-import; treat notebooks as experiments, not long-term code storage.
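A minimal starter cell that covers most of these habits might look like the following; the folder name and seed value are placeholders, not project requirements:
from pathlib import Path
import numpy as np
import pandas as pd
# Configuration up front: an output folder for side effects and a pinned seed for any sampling
REPORTS_DIR = Path("reports")
REPORTS_DIR.mkdir(exist_ok=True)
rng = np.random.default_rng(42)  # placeholder seed; any fixed value keeps runs reproducible
pd.set_option("display.max_columns", 50)  # easier to scan wide DataFrames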
Common Pitfalls to Avoid Early
- Silent type coercion: always inspect df.dtypes after loading; parse dates explicitly.
- Chained indexing (df[df["x"] > 0]["y"] = ...) can create copies; use .loc and assign instead (see the sketch after this list).
- Skipping data checks: use quick assertions for non-negativity, allowed categories, and unique keys.
- Mixing raw and cleaned data: keep a clear path (raw → interim/clean → features) with filenames that show the stage.
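To make the chained-indexing pitfall concrete, here is a small before/after sketch with made-up column names:
import pandas as pd
df = pd.DataFrame({"x": [-1, 2, 3], "y": [10, 20, 30]})
# Risky: chained indexing may write to a temporary copy, so the update can silently vanish
# df[df["x"] > 0]["y"] = 0
# Safer: one .loc call with a boolean mask updates the original frame in place
df.loc[df["x"] > 0, "y"] = 0
# Or stay functional: build a new frame with assign instead of mutating
df2 = df.assign(y_capped=lambda d: d["y"].where(d["x"] <= 0, 0))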
Workspace Structure (simple starter)
project-root/
├── data/ # large raw data (keep out of git; use .gitignore)
├── notebooks/ # exploratory notebooks
├── src/ # reusable functions and pipelines
├── reports/ # exported charts/tables
└── env/ # environment files (requirements.txt, conda.yml)
- Keep sample data small and versioned (like the CSVs here); keep production-scale data in object storage or warehouses.
- Add a requirements.txt or poetry.lock to freeze dependencies; pin exact versions when collaborating.
- Name notebooks with prefixes like 01-eda.ipynb, 02-model.ipynb to show flow; add a short one-line purpose at the top.
- Drop a .gitignore entry for data/ (unless you are keeping only tiny samples) and for notebook checkpoints.
- Consider a Makefile or simple shell scripts for repeatable tasks (make lint, make test, make notebook).
Practical Checklist
- ✅ Version control your environment (requirements.txt or poetry.lock).
- ✅ Enforce dtypes and date parsing on read; log shape and null counts immediately.
- ✅ Start with asserts and simple profiling (nulls, ranges); fail fast beats silent corruption.
- ✅ Prefer chains over scattered temporary variables for clarity; factor reusable steps into functions.
- ✅ Cache interim results to disk (parquet) when they are reused; keep filenames stage-aware (e.g., orders_clean.parquet); see the sketch after this checklist.
- ✅ Document assumptions in Markdown cells next to the code; future you will thank present you.
- ✅ Before modeling, have a crisp question and a success metric; code follows the question, not the other way around.
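Putting several of these items together, a load-check-cache sketch might look like this; the paths are placeholders, and writing parquet assumes pyarrow or fastparquet is installed:
from pathlib import Path
import pandas as pd
RAW = Path("part2/orders.csv")
CLEAN = Path("data/orders_clean.parquet")  # stage-aware filename (placeholder path)
orders = pd.read_csv(RAW, parse_dates=["order_date"])
# Log shape and null counts immediately; fail fast on obvious problems
print(orders.shape)
print(orders.isna().sum())
assert orders["order_id"].is_unique
# Cache the interim result for reuse in later notebooks
CLEAN.parent.mkdir(parents=True, exist_ok=True)
orders.to_parquet(CLEAN, index=False)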
What’s Next (Preview of Part 3)
- Handling missing values and outliers systematically.
- Encoding categoricals and first feature engineering patterns.
- Practical pandas pipelines for cleaning messy, real-world data.