Master DataFrames, array operations, and real-world data analysis
April 8, 2025
11 min read
NumPy and Pandas are the backbone of Python data science. NumPy gives you fast multi-dimensional arrays; Pandas builds on top with DataFrames that feel like spreadsheets but scale to millions of rows. Together they cover the vast majority of real-world data wrangling.
```bash
pip install numpy pandas
```
NumPy arrays are faster than Python lists because they store data in contiguous memory and operate with C-speed vectorization:
```python
import numpy as np

# Creating arrays
a = np.array([1, 2, 3, 4, 5])
b = np.zeros((3, 3))     # 3x3 matrix of zeros
c = np.arange(0, 10, 2)  # [0, 2, 4, 6, 8]
d = np.linspace(0, 1, 5) # 5 evenly spaced values from 0 to 1

# Vectorized operations (no loops needed)
print(a * 2)       # [2, 4, 6, 8, 10]
print(a ** 2)      # [1, 4, 9, 16, 25]
print(np.sqrt(a))  # [1.0, 1.41, 1.73, 2.0, 2.24]

# Array stats
print(a.mean())  # 3.0
print(a.std())   # 1.41
print(a.sum())   # 15
```
NumPy operations run on entire arrays without Python loops — typically 10–100x faster than equivalent list comprehensions. Always prefer array operations over loops when working with data.
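You can verify the speedup yourself with a quick, unscientific timing sketch (exact numbers depend on your machine and array size, so treat the ratio as illustrative):

```python
import time
import numpy as np

n = 1_000_000
values = list(range(n))
arr = np.arange(n)

# Pure-Python loop (list comprehension)
start = time.perf_counter()
squared_list = [v * v for v in values]
loop_time = time.perf_counter() - start

# Vectorized NumPy equivalent
start = time.perf_counter()
squared_arr = arr * arr
numpy_time = time.perf_counter() - start

print(f"loop: {loop_time:.4f}s, numpy: {numpy_time:.4f}s")
```

Both produce the same squares; the NumPy version just does the loop in compiled C instead of the Python interpreter.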
```python
import numpy as np

matrix = np.arange(1, 13).reshape(3, 4)
# array([[ 1,  2,  3,  4],
#        [ 5,  6,  7,  8],
#        [ 9, 10, 11, 12]])

# Indexing
print(matrix[0, 1])    # 2 (row 0, col 1)
print(matrix[:, 2])    # [3, 7, 11] (all rows, col 2)
print(matrix[1:, :2])  # [[5, 6], [9, 10]]

# Boolean masking
data = np.array([10, 25, 3, 47, 8, 60])
print(data[data > 20])  # [25, 47, 60]
```
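Boolean masks can also drive conditional replacement rather than just filtering. A small sketch using `np.where` (not covered above, but standard NumPy) that caps values at a threshold:

```python
import numpy as np

data = np.array([10, 25, 3, 47, 8, 60])

# np.where(condition, value_if_true, value_if_false)
capped = np.where(data > 20, 20, data)
print(capped)  # [10 20  3 20  8 20]

# A mask is just an array of booleans, so summing it counts matches
print((data > 20).sum())  # 3
```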
A DataFrame is a 2D table with labeled rows and columns. Think of it as a spreadsheet you can manipulate with code:
```python
import pandas as pd

# Creating a DataFrame
data = {
    "name": ["Alice", "Bob", "Charlie", "Diana"],
    "age": [25, 30, 35, 28],
    "salary": [70000, 85000, 95000, 78000],
    "dept": ["HR", "Eng", "Eng", "Marketing"],
}
df = pd.DataFrame(data)

print(df.head())
print(df.info())      # data types and non-null counts
print(df.describe())  # count, mean, std, min, max per column
```
```python
# Select a column
print(df["name"])

# Select multiple columns
print(df[["name", "salary"]])

# Filter rows
engineers = df[df["dept"] == "Eng"]
high_earners = df[df["salary"] > 80000]

# Multiple conditions
senior_engineers = df[(df["dept"] == "Eng") & (df["age"] > 30)]

# loc (label-based) vs iloc (position-based)
print(df.loc[0, "name"])  # Alice
print(df.iloc[0, 0])      # Alice
print(df.iloc[1:3, :2])   # rows 1-2, cols 0-1
```
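Two more everyday operations round out selection and filtering: adding a derived column and sorting. A sketch using the same example DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "Diana"],
    "age": [25, 30, 35, 28],
    "salary": [70000, 85000, 95000, 78000],
    "dept": ["HR", "Eng", "Eng", "Marketing"],
})

# Add a derived column computed from an existing one
df["monthly"] = df["salary"] / 12

# Sort by salary, highest first
top = df.sort_values("salary", ascending=False)
print(top[["name", "salary"]].head(2))
```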
```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, None, 35],
    "salary": [70000, 85000, None],
})

# Detect missing values
print(df.isnull())
print(df.isnull().sum())  # count per column

# Drop rows with any missing value
df_clean = df.dropna()

# Fill missing values (assign back instead of inplace=True,
# which is deprecated for chained calls in modern pandas)
df["age"] = df["age"].fillna(df["age"].mean())  # fill with mean
df["salary"] = df["salary"].fillna(0)           # fill with 0
```
Always inspect missing data before deciding how to handle it. Dropping rows loses information; filling with the mean or median is safer for numerical columns, while filling with the mode works better for categorical ones.
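Putting that advice into code: a sketch that fills a numeric column with its median and a categorical column with its mode (the `city` and `score` columns here are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Paris", "Paris", None, "Lyon", None],
    "score": [1.0, None, 3.0, 4.0, 5.0],
})

# Median for the numeric column (robust to outliers)
df["score"] = df["score"].fillna(df["score"].median())

# Mode (most frequent value) for the categorical column;
# mode() returns a Series, so take the first entry
df["city"] = df["city"].fillna(df["city"].mode()[0])
print(df)
```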
```python
# df is the employee DataFrame from earlier

# Average salary by department
dept_stats = df.groupby("dept")["salary"].mean()
print(dept_stats)

# Multiple aggregations
summary = df.groupby("dept").agg(
    avg_salary=("salary", "mean"),
    max_age=("age", "max"),
    headcount=("name", "count"),
)
print(summary)

# Pivot table
pivot = df.pivot_table(
    values="salary",
    index="dept",
    aggfunc=["mean", "max", "count"],
)
print(pivot)
```
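Closely related to `agg` is `transform`, which returns a result aligned to the original rows instead of one row per group. That makes per-group comparisons easy; a sketch reusing the employee data:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "Diana"],
    "salary": [70000, 85000, 95000, 78000],
    "dept": ["HR", "Eng", "Eng", "Marketing"],
})

# Each row gets its own department's average salary
df["dept_avg"] = df.groupby("dept")["salary"].transform("mean")

# Flag who earns strictly more than their department average
df["above_avg"] = df["salary"] > df["dept_avg"]
print(df[["name", "dept_avg", "above_avg"]])
```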
```python
employees = pd.DataFrame({
    "emp_id": [1, 2, 3],
    "name": ["Alice", "Bob", "Charlie"],
    "dept_id": [10, 20, 20],
})
departments = pd.DataFrame({
    "dept_id": [10, 20, 30],
    "dept_name": ["HR", "Engineering", "Marketing"],
})

# Inner join (only matching rows)
merged = pd.merge(employees, departments, on="dept_id", how="inner")

# Left join (keep all employees)
merged_left = pd.merge(employees, departments, on="dept_id", how="left")

print(merged)
```
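An outer join with `indicator=True` is a quick way to audit a merge: it keeps unmatched rows from both sides and labels where each row came from. A sketch built on the same two tables:

```python
import pandas as pd

employees = pd.DataFrame({
    "emp_id": [1, 2, 3],
    "name": ["Alice", "Bob", "Charlie"],
    "dept_id": [10, 20, 20],
})
departments = pd.DataFrame({
    "dept_id": [10, 20, 30],
    "dept_name": ["HR", "Engineering", "Marketing"],
})

# Outer join keeps unmatched rows from both sides;
# the _merge column labels each row: left_only, right_only, or both
audit = pd.merge(employees, departments, on="dept_id",
                 how="outer", indicator=True)
print(audit[["dept_id", "name", "dept_name", "_merge"]])
```

Here the Marketing department (dept_id 30) shows up as `right_only`, flagging that no employee references it.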
```python
# Read CSV
df = pd.read_csv("data.csv")
df = pd.read_csv("data.csv", index_col=0, parse_dates=["date"])

# Read Excel
df = pd.read_excel("data.xlsx", sheet_name="Sheet1")

# Write to CSV
df.to_csv("output.csv", index=False)

# Read from URL
df = pd.read_csv("https://example.com/data.csv")

# Quick preview
print(df.shape)    # (rows, columns)
print(df.columns)  # column names
print(df.dtypes)   # data types
```
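If you want to try the read/write API without any file on disk, the same functions accept in-memory buffers. A self-contained sketch using `io.StringIO`:

```python
import io
import pandas as pd

df = pd.DataFrame({"name": ["Alice", "Bob"], "salary": [70000, 85000]})

# Write to an in-memory CSV buffer instead of a file
buf = io.StringIO()
df.to_csv(buf, index=False)

# Rewind the buffer and read it back with the same read_csv API
buf.seek(0)
round_trip = pd.read_csv(buf)
print(round_trip.equals(df))  # True
```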
What's Next?