Python

Python Data Science with Pandas & NumPy

Master DataFrames, array operations, and real-world data analysis

April 8, 2025

11 min read

NumPy and Pandas are the backbone of Python data science. NumPy gives you fast multi-dimensional arrays; Pandas builds on top with DataFrames that feel like spreadsheets but scale to millions of rows. Between them they cover the vast majority of everyday data-wrangling work.

1. Installing the Libraries

bash
pip install numpy pandas

2. NumPy Arrays

NumPy arrays are faster than Python lists because they store data in contiguous memory and operate with C-speed vectorization:

python
import numpy as np

# Creating arrays
a = np.array([1, 2, 3, 4, 5])
b = np.zeros((3, 3))     # 3x3 matrix of zeros
c = np.arange(0, 10, 2)  # [0, 2, 4, 6, 8]
d = np.linspace(0, 1, 5) # 5 evenly spaced values from 0 to 1

# Vectorized operations (no loops needed)
print(a * 2)       # [2, 4, 6, 8, 10]
print(a ** 2)      # [1, 4, 9, 16, 25]
print(np.sqrt(a))  # [1.0, 1.41, 1.73, 2.0, 2.24]

# Array stats
print(a.mean())  # 3.0
print(a.std())   # 1.41
print(a.sum())   # 15

NumPy operations run on entire arrays without Python loops — typically 10–100x faster than equivalent list comprehensions. Always prefer array operations over loops when working with data.
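As a rough illustration of that speedup (exact numbers vary by machine and array size; the one-million-element array here is just an example), a quick `timeit` comparison:

```python
import timeit

import numpy as np

n = 1_000_000
arr = np.arange(n, dtype=np.float64)
lst = list(range(n))

# Vectorized NumPy: a single C-level pass over contiguous memory
vec_time = timeit.timeit(lambda: arr * 2, number=10)

# Pure-Python list comprehension: the interpreter touches every element
loop_time = timeit.timeit(lambda: [x * 2 for x in lst], number=10)

print(f"NumPy:              {vec_time:.4f}s")
print(f"List comprehension: {loop_time:.4f}s")
print(f"Speedup: ~{loop_time / vec_time:.0f}x")
```

Both versions compute the same values; only the NumPy one avoids the per-element Python overhead.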

3. Reshaping and Indexing

python
import numpy as np

matrix = np.arange(1, 13).reshape(3, 4)
# array([[ 1,  2,  3,  4],
#        [ 5,  6,  7,  8],
#        [ 9, 10, 11, 12]])

# Indexing
print(matrix[0, 1])    # 2 (row 0, col 1)
print(matrix[:, 2])    # [3, 7, 11] (all rows, col 2)
print(matrix[1:, :2])  # [[5, 6], [9, 10]]

# Boolean masking
data = np.array([10, 25, 3, 47, 8, 60])
print(data[data > 20])  # [25, 47, 60]

4. Introduction to Pandas

A DataFrame is a 2D table with labeled rows and columns. Think of it as a spreadsheet you can manipulate with code:

python
import pandas as pd

# Creating a DataFrame
data = {
    "name": ["Alice", "Bob", "Charlie", "Diana"],
    "age": [25, 30, 35, 28],
    "salary": [70000, 85000, 95000, 78000],
    "dept": ["HR", "Eng", "Eng", "Marketing"],
}
df = pd.DataFrame(data)

print(df.head())
df.info()             # prints data types and non-null counts
print(df.describe())  # count, mean, std, min, max per column

5. Selecting and Filtering Data

python
# Select a column
print(df["name"])

# Select multiple columns
print(df[["name", "salary"]])

# Filter rows
engineers = df[df["dept"] == "Eng"]
high_earners = df[df["salary"] > 80000]

# Multiple conditions
senior_engineers = df[(df["dept"] == "Eng") & (df["age"] > 30)]

# loc (label-based) vs iloc (position-based)
print(df.loc[0, "name"])  # Alice
print(df.iloc[0, 0])      # Alice
print(df.iloc[1:3, :2])   # rows 1-2, cols 0-1

6. Handling Missing Data

python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, None, 35],
    "salary": [70000, 85000, None],
})

# Detect missing values
print(df.isnull())
print(df.isnull().sum())  # count per column

# Drop rows with any missing value
df_clean = df.dropna()

# Fill missing values (assign back instead of inplace=True on a
# column, which triggers chained-assignment warnings in modern pandas)
df["age"] = df["age"].fillna(df["age"].mean())  # fill with mean
df["salary"] = df["salary"].fillna(0)           # fill with 0

Always inspect missing data before deciding how to handle it. Dropping rows loses information; filling with the mean or median is safer for numerical columns, while filling with the mode works better for categorical ones.
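To make that concrete, here is a small sketch (the column names and values are illustrative) filling a numeric column with its median and a categorical column with its mode:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [25, None, 35, 41],           # numeric column with a gap
    "dept": ["HR", "Eng", None, "Eng"],  # categorical column with a gap
})

# Median is robust to outliers, making it a safer default than the mean
df["age"] = df["age"].fillna(df["age"].median())

# Mode (most frequent value) is the usual choice for categorical data;
# mode() returns a Series, so take the first entry
df["dept"] = df["dept"].fillna(df["dept"].mode()[0])

print(df)  # no NaNs remain
```

Here the missing age becomes 35 (the median of 25, 35, 41) and the missing department becomes "Eng" (the most frequent value).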

7. GroupBy and Aggregation

python
# Average salary by department
dept_stats = df.groupby("dept")["salary"].mean()
print(dept_stats)

# Multiple aggregations
summary = df.groupby("dept").agg(
    avg_salary=("salary", "mean"),
    max_age=("age", "max"),
    headcount=("name", "count"),
)
print(summary)

# Pivot table
pivot = df.pivot_table(
    values="salary",
    index="dept",
    aggfunc=["mean", "max", "count"],
)

8. Merging DataFrames

python
employees = pd.DataFrame({
    "emp_id": [1, 2, 3],
    "name": ["Alice", "Bob", "Charlie"],
    "dept_id": [10, 20, 20],
})
departments = pd.DataFrame({
    "dept_id": [10, 20, 30],
    "dept_name": ["HR", "Engineering", "Marketing"],
})

# Inner join (only matching rows)
merged = pd.merge(employees, departments, on="dept_id", how="inner")

# Left join (keep all employees)
merged_left = pd.merge(employees, departments, on="dept_id", how="left")

print(merged)

9. Reading & Writing Files

python
# Read CSV
df = pd.read_csv("data.csv")
df = pd.read_csv("data.csv", index_col=0, parse_dates=["date"])

# Read Excel
df = pd.read_excel("data.xlsx", sheet_name="Sheet1")

# Write to CSV
df.to_csv("output.csv", index=False)

# Read from URL
df = pd.read_csv("https://example.com/data.csv")

# Quick preview
print(df.shape)    # (rows, columns)
print(df.columns)  # column names
print(df.dtypes)   # data types

What's Next?

  • Data visualization with Matplotlib and Seaborn
  • Machine learning with scikit-learn
  • Working with large datasets using Dask or Polars
  • Time series analysis with Pandas DatetimeIndex