Master DataFrames, array operations, and real-world data analysis
April 8, 2025
11 min read
NumPy and Pandas are the backbone of Python data science. NumPy gives you fast multi-dimensional arrays; Pandas builds on top with DataFrames that feel like spreadsheets but scale to millions of rows. Together they cover the vast majority of real-world data wrangling.
```bash
pip install numpy pandas
```
NumPy arrays are faster than Python lists because they store data in contiguous memory and operate with C-speed vectorization:
```python
import numpy as np

# Creating arrays
a = np.array([1, 2, 3, 4, 5])
b = np.zeros((3, 3))     # 3x3 matrix of zeros
c = np.arange(0, 10, 2)  # [0, 2, 4, 6, 8]
d = np.linspace(0, 1, 5) # 5 evenly spaced values from 0 to 1

# Vectorized operations (no loops needed)
print(a * 2)       # [2, 4, 6, 8, 10]
print(a ** 2)      # [1, 4, 9, 16, 25]
print(np.sqrt(a))  # [1.0, 1.41, 1.73, 2.0, 2.24]

# Array stats
print(a.mean())  # 3.0
print(a.std())   # 1.41
print(a.sum())   # 15
```
NumPy operations run on entire arrays without Python loops — typically 10–100x faster than equivalent list comprehensions. Always prefer array operations over loops when working with data.
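You can verify the speedup yourself with a quick, unscientific timing sketch (exact numbers depend on your machine and array size, so treat the ratio as illustrative):

```python
import time
import numpy as np

n = 1_000_000
values = list(range(n))
arr = np.arange(n)

# Pure-Python loop (list comprehension)
start = time.perf_counter()
squared_list = [v * v for v in values]
loop_time = time.perf_counter() - start

# Vectorized NumPy equivalent
start = time.perf_counter()
squared_arr = arr * arr
numpy_time = time.perf_counter() - start

print(f"loop: {loop_time:.4f}s, numpy: {numpy_time:.4f}s")
```

Both produce the same squares; the NumPy version just does the loop in compiled C instead of the Python interpreter.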
```python
import numpy as np

matrix = np.arange(1, 13).reshape(3, 4)
# array([[ 1,  2,  3,  4],
#        [ 5,  6,  7,  8],
#        [ 9, 10, 11, 12]])

# Indexing
print(matrix[0, 1])    # 2 (row 0, col 1)
print(matrix[:, 2])    # [3, 7, 11] (all rows, col 2)
print(matrix[1:, :2])  # [[5, 6], [9, 10]]

# Boolean masking
data = np.array([10, 25, 3, 47, 8, 60])
print(data[data > 20])  # [25, 47, 60]
```
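Boolean masks can also drive conditional replacement rather than just filtering. A small sketch using `np.where` (not covered above, but standard NumPy) that caps values at a threshold:

```python
import numpy as np

data = np.array([10, 25, 3, 47, 8, 60])

# np.where(condition, value_if_true, value_if_false)
capped = np.where(data > 20, 20, data)
print(capped)  # [10 20  3 20  8 20]

# A mask is just an array of booleans, so summing it counts matches
print((data > 20).sum())  # 3
```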
A DataFrame is a 2D table with labeled rows and columns. Think of it as a spreadsheet you can manipulate with code:
```python
import pandas as pd

# Creating a DataFrame
data = {
    "name": ["Alice", "Bob", "Charlie", "Diana"],
    "age": [25, 30, 35, 28],
    "salary": [70000, 85000, 95000, 78000],
    "dept": ["HR", "Eng", "Eng", "Marketing"],
}
df = pd.DataFrame(data)

print(df.head())
print(df.info())      # data types and non-null counts
print(df.describe())  # count, mean, std, min, max per column
```
```python
# Select a column
print(df["name"])

# Select multiple columns
print(df[["name", "salary"]])

# Filter rows
engineers = df[df["dept"] == "Eng"]
high_earners = df[df["salary"] > 80000]

# Multiple conditions
senior_engineers = df[(df["dept"] == "Eng") & (df["age"] > 30)]

# loc (label-based) vs iloc (position-based)
print(df.loc[0, "name"])  # Alice
print(df.iloc[0, 0])      # Alice
print(df.iloc[1:3, :2])   # rows 1-2, cols 0-1
```
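Two more everyday operations round out selection and filtering: adding a derived column and sorting. A sketch using the same example DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "Diana"],
    "age": [25, 30, 35, 28],
    "salary": [70000, 85000, 95000, 78000],
    "dept": ["HR", "Eng", "Eng", "Marketing"],
})

# Add a derived column computed from an existing one
df["monthly"] = df["salary"] / 12

# Sort by salary, highest first
top = df.sort_values("salary", ascending=False)
print(top[["name", "salary"]].head(2))
```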
```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, None, 35],
    "salary": [70000, 85000, None],
})

# Detect missing values
print(df.isnull())
print(df.isnull().sum())  # count per column

# Drop rows with any missing value
df_clean = df.dropna()

# Fill missing values (assign back instead of inplace=True,
# which is deprecated for chained calls in modern pandas)
df["age"] = df["age"].fillna(df["age"].mean())  # fill with mean
df["salary"] = df["salary"].fillna(0)           # fill with 0
```
Always inspect missing data before deciding how to handle it. Dropping rows loses information; filling with the mean or median is safer for numerical columns, while filling with the mode works better for categorical ones.
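Putting that advice into code: a sketch that fills a numeric column with its median and a categorical column with its mode (the `city` and `score` columns here are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Paris", "Paris", None, "Lyon", None],
    "score": [1.0, None, 3.0, 4.0, 5.0],
})

# Median for the numeric column (robust to outliers)
df["score"] = df["score"].fillna(df["score"].median())

# Mode (most frequent value) for the categorical column;
# mode() returns a Series, so take the first entry
df["city"] = df["city"].fillna(df["city"].mode()[0])
print(df)
```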
```python
# df is the employee DataFrame from earlier

# Average salary by department
dept_stats = df.groupby("dept")["salary"].mean()
print(dept_stats)

# Multiple aggregations
summary = df.groupby("dept").agg(
    avg_salary=("salary", "mean"),
    max_age=("age", "max"),
    headcount=("name", "count"),
)
print(summary)

# Pivot table
pivot = df.pivot_table(
    values="salary",
    index="dept",
    aggfunc=["mean", "max", "count"],
)
print(pivot)
```
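Closely related to `agg` is `transform`, which returns a result aligned to the original rows instead of one row per group. That makes per-group comparisons easy; a sketch reusing the employee data:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "Diana"],
    "salary": [70000, 85000, 95000, 78000],
    "dept": ["HR", "Eng", "Eng", "Marketing"],
})

# Each row gets its own department's average salary
df["dept_avg"] = df.groupby("dept")["salary"].transform("mean")

# Flag who earns strictly more than their department average
df["above_avg"] = df["salary"] > df["dept_avg"]
print(df[["name", "dept_avg", "above_avg"]])
```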
```python
employees = pd.DataFrame({
    "emp_id": [1, 2, 3],
    "name": ["Alice", "Bob", "Charlie"],
    "dept_id": [10, 20, 20],
})
departments = pd.DataFrame({
    "dept_id": [10, 20, 30],
    "dept_name": ["HR", "Engineering", "Marketing"],
})

# Inner join (only matching rows)
merged = pd.merge(employees, departments, on="dept_id", how="inner")

# Left join (keep all employees)
merged_left = pd.merge(employees, departments, on="dept_id", how="left")

print(merged)
```
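An outer join with `indicator=True` is a quick way to audit a merge: it keeps unmatched rows from both sides and labels where each row came from. A sketch built on the same two tables:

```python
import pandas as pd

employees = pd.DataFrame({
    "emp_id": [1, 2, 3],
    "name": ["Alice", "Bob", "Charlie"],
    "dept_id": [10, 20, 20],
})
departments = pd.DataFrame({
    "dept_id": [10, 20, 30],
    "dept_name": ["HR", "Engineering", "Marketing"],
})

# Outer join keeps unmatched rows from both sides;
# the _merge column labels each row: left_only, right_only, or both
audit = pd.merge(employees, departments, on="dept_id",
                 how="outer", indicator=True)
print(audit[["dept_id", "name", "dept_name", "_merge"]])
```

Here the Marketing department (dept_id 30) shows up as `right_only`, flagging that no employee references it.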
```python
# Read CSV
df = pd.read_csv("data.csv")
df = pd.read_csv("data.csv", index_col=0, parse_dates=["date"])

# Read Excel
df = pd.read_excel("data.xlsx", sheet_name="Sheet1")

# Write to CSV
df.to_csv("output.csv", index=False)

# Read from URL
df = pd.read_csv("https://example.com/data.csv")

# Quick preview
print(df.shape)    # (rows, columns)
print(df.columns)  # column names
print(df.dtypes)   # data types
```
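If you want to try the read/write API without any file on disk, the same functions accept in-memory buffers. A self-contained sketch using `io.StringIO`:

```python
import io
import pandas as pd

df = pd.DataFrame({"name": ["Alice", "Bob"], "salary": [70000, 85000]})

# Write to an in-memory CSV buffer instead of a file
buf = io.StringIO()
df.to_csv(buf, index=False)

# Rewind the buffer and read it back with the same read_csv API
buf.seek(0)
round_trip = pd.read_csv(buf)
print(round_trip.equals(df))  # True
```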
What's Next?