2/22/2026 · 3 min read
Getting Started with Pandas
A Minimal, Practical Guide for Python Developers
Pandas is the go-to Python library for working with structured, tabular data — think spreadsheets, CSVs, or SQL tables, but in code. It sits at the heart of most data science and machine learning workflows.
Installation
pip install pandas
Core Data Structures
Pandas has two primary building blocks:
| Structure | Dimensions | Analogy |
|---|---|---|
Series | 1D | A single column |
DataFrame | 2D | A full table |
Series
import pandas as pd
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print(s['b']) # 20
DataFrame
df = pd.DataFrame({
'name': ['Alice', 'Bob', 'Charlie'],
'age': [25, 30, 35],
'score': [88.5, 92.0, 79.5]
})
Selecting Data
df['name'] # single column → Series
df[['name', 'age']] # multiple columns → DataFrame
df.loc[0] # row by label / index
df.iloc[1] # row by integer position
df[df['age'] > 28] # filter rows by condition
Exploring a DataFrame
df.head(2) # first N rows
df.tail(2) # last N rows
df.shape # (rows, cols)
df.dtypes # column data types
df.describe() # statistical summary
Common Operations
df['age'].mean()
df['score'].max()
# Group and aggregate
df.groupby('name')['score'].mean()
Adding & Modifying Columns
# Derive a new column
df['grade'] = df['score'].apply(lambda x: 'A' if x >= 90 else 'B')
# Modify in place
df['age'] = df['age'] + 1
Handling Missing Data
df.isnull().sum() # count NaNs per column
df.dropna() # drop rows containing any NaN
df.fillna(0) # replace NaN with a default value
Reading & Writing Files
# Reading
df = pd.read_csv('data.csv')
df = pd.read_json('data.json')
df = pd.read_excel('data.xlsx')
# Writing
df.to_csv('output.csv', index=False)
Sorting & Merging
# Sort
df.sort_values('score', ascending=False)
# SQL-style JOIN
merged = pd.merge(df1, df2, on='id', how='inner') # also: left, right, outer
# Stack rows
combined = pd.concat([df1, df2], ignore_index=True)
Mental Model (coming from C / plain Python)
| C / Python | Pandas Equivalent |
|---|---|
array[i] | df.iloc[i] |
struct.field | df['column'] |
for loop filter | df[condition] |
| manual stats | df.describe() |
Key Takeaway
Think column-first and vectorized. Avoid Python
forloops over rows — instead, let pandas operate on entire columns at once. It’s faster, more readable, and more idiomatic.
Next step: pair pandas with NumPy for numerical computing, or feed your cleaned DataFrames into PyTorch / scikit-learn for machine learning.