1.4 β€” Loading and Cleaning Market Data

1.4 β€” Loading and Cleaning Market Data#

A backtester needs historical OHLCV (Open, High, Low, Close, Volume) in a clean DataFrame with a DatetimeIndex and no surprises from missing data or timezones. This lesson shows how to load data (e.g. with yfinance), handle missing values, and align to a single timezone so it’s ready for Project 1 β€” Backtesting Engine.

Installing yfinance#

pip install yfinance pandas

In Python:

import yfinance as yf
import pandas as pd

Downloading OHLCV#

# Single symbol, period (e.g. 1y, 2y, 5y) or start/end dates
ticker = yf.Ticker('AAPL')
df = ticker.history(period='2y')  # or start='2022-01-01', end='2024-01-01'

history() returns a DataFrame with a DatetimeIndex and columns such as Open, High, Low, Close, Volume. Column names are capitalized; you can rename for consistency: df = df.rename(columns=str.lower).

Handling missing values#

For a simple daily backtest, dropna() or a short ffill() for minor gaps is usually enough. Avoid filling long stretches of missing data without a clear rule.

Single timezone#

yfinance often returns timezone-aware indices (e.g. UTC or exchange time). Normalize to one timezone so dates align with your strategy and calendar:

if df.index.tz is None:
    df.index = df.index.tz_localize('UTC')
df.index = df.index.tz_convert('America/New_York')
# Optional: make index date-only (drop time) for daily data
df.index = df.index.normalize()

Use the same timezone everywhere so β€œone row per day” is unambiguous.

Resulting shape for backtesting#

After loading and cleaning you want:

Example helper:

def load_ohlcv(symbol: str, period: str = '2y') -> pd.DataFrame:
    df = yf.Ticker(symbol).history(period=period)
    df = df.rename(columns=str.lower)
    df = df[['open', 'high', 'low', 'close', 'volume']]
    df = df.dropna()
    df.index = df.index.tz_localize('UTC') if df.index.tz is None else df.index
    df.index = df.index.tz_convert('America/New_York')
    df = df.sort_index()
    return df

Quiz time#

Question 1

Write a small script that downloads 1 year of OHLCV for a symbol, renames columns to lowercase, keeps only open, high, low, close, volume, drops rows with any NaN, and sorts by index.

Show Solution
import yfinance as yf
import pandas as pd

symbol = 'AAPL'
df = yf.Ticker(symbol).history(period='1y')
df = df.rename(columns=str.lower)
df = df[['open', 'high', 'low', 'close', 'volume']]
df = df.dropna()
df = df.sort_index()
print(df.head())

Question 2

How do you check for missing values before and after cleaning? Show the code.

Show Solution
import yfinance as yf
import pandas as pd

df = yf.Ticker('AAPL').history(period='1y')
df = df.rename(columns=str.lower)
df = df[['open', 'high', 'low', 'close', 'volume']]

print('Missing per column (before):')
print(df.isna().sum())

df = df.dropna()
print('Missing per column (after):')
print(df.isna().sum())

Question 3

Normalize the DataFrame index to US/Eastern when it is timezone-naive. Assume the source data is in UTC. Give the full code.

Show Solution
import yfinance as yf
import pandas as pd

df = yf.Ticker('AAPL').history(period='1y')
df = df.rename(columns=str.lower)
df = df[['open', 'high', 'low', 'close', 'volume']].dropna()

if df.index.tz is None:
    df.index = df.index.tz_localize('UTC')
df.index = df.index.tz_convert('America/New_York')
df = df.sort_index()
print(df.index.tz)

Next steps#

With clean OHLCV in a DataFrame (from 1.1–1.3), you can add indicator columns, generate signals, and run your backtest. For performance metrics (returns, volatility, Sharpe), see Applied Statistics β€” Returns and volatility.