1.4 — Loading and Cleaning Market Data

📅 February 8, 2025

1.4 — Loading and Cleaning Market Data#

A backtester needs historical OHLCV (Open, High, Low, Close, Volume) in a clean DataFrame with a DatetimeIndex and no surprises from missing data or timezones. This lesson shows how to load data (e.g. with yfinance), handle missing values, and align to a single timezone so it’s ready for Project 1 — Backtesting Engine.

Installing yfinance#

pip install yfinance pandas

In Python:

import yfinance as yf
import pandas as pd

Downloading OHLCV#

# Single symbol, period (e.g. 1y, 2y, 5y) or start/end dates
ticker = yf.Ticker('AAPL')
df = ticker.history(period='2y')  # or start='2022-01-01', end='2024-01-01'

history() returns a DataFrame with a DatetimeIndex and columns such as Open, High, Low, Close, Volume. Column names are capitalized; you can rename for consistency: df = df.rename(columns=str.lower).

Handling missing values#

Check: df.isna().sum() shows missing counts per column.
Drop rows with any NaN: df = df.dropna() — use if you need complete bars only.
Forward-fill then drop rest: df = df.ffill().dropna() — fills then drops rows that are still NaN (e.g. at the start). Use with care so you don’t fill large gaps in a way that distorts returns.
Drop columns you don’t need: e.g. if you only use close and volume, dropping others can simplify and avoid some NaNs.

For a simple daily backtest, dropna() or a short ffill() for minor gaps is usually enough. Avoid filling long stretches of missing data without a clear rule.

Single timezone#

yfinance often returns timezone-aware indices (e.g. UTC or exchange time). Normalize to one timezone so dates align with your strategy and calendar:

if df.index.tz is None:
    df.index = df.index.tz_localize('UTC')
df.index = df.index.tz_convert('America/New_York')
# Optional: make index date-only (drop time) for daily data
df.index = df.index.normalize()

Use the same timezone everywhere so “one row per day” is unambiguous.

Resulting shape for backtesting#

After loading and cleaning you want:

Index: DatetimeIndex, sorted, one row per trading day, single timezone.
Columns: at least open, high, low, close, volume (lowercase if you prefer).
No NaNs in the columns you use (or an explicit rule for them).

Example helper:

def load_ohlcv(symbol: str, period: str = '2y') -> pd.DataFrame:
    df = yf.Ticker(symbol).history(period=period)
    df = df.rename(columns=str.lower)
    df = df[['open', 'high', 'low', 'close', 'volume']]
    df = df.dropna()
    df.index = df.index.tz_localize('UTC') if df.index.tz is None else df.index
    df.index = df.index.tz_convert('America/New_York')
    df = df.sort_index()
    return df

Quiz time#

Question 1

Write a small script that downloads 1 year of OHLCV for a symbol, renames columns to lowercase, keeps only open, high, low, close, volume, drops rows with any NaN, and sorts by index.

Show Solution

import yfinance as yf
import pandas as pd

symbol = 'AAPL'
df = yf.Ticker(symbol).history(period='1y')
df = df.rename(columns=str.lower)
df = df[['open', 'high', 'low', 'close', 'volume']]
df = df.dropna()
df = df.sort_index()
print(df.head())

Question 2

How do you check for missing values before and after cleaning? Show the code.

Show Solution

import yfinance as yf
import pandas as pd

df = yf.Ticker('AAPL').history(period='1y')
df = df.rename(columns=str.lower)
df = df[['open', 'high', 'low', 'close', 'volume']]

print('Missing per column (before):')
print(df.isna().sum())

df = df.dropna()
print('Missing per column (after):')
print(df.isna().sum())

Question 3

Normalize the DataFrame index to US/Eastern when it is timezone-naive. Assume the source data is in UTC. Give the full code.

Show Solution

import yfinance as yf
import pandas as pd

df = yf.Ticker('AAPL').history(period='1y')
df = df.rename(columns=str.lower)
df = df[['open', 'high', 'low', 'close', 'volume']].dropna()

if df.index.tz is None:
    df.index = df.index.tz_localize('UTC')
df.index = df.index.tz_convert('America/New_York')
df = df.sort_index()
print(df.index.tz)

Next steps#

With clean OHLCV in a DataFrame (from 1.1–1.3), you can add indicator columns, generate signals, and run your backtest. For performance metrics (returns, volatility, Sharpe), see Applied Statistics — Returns and volatility.