1.4 β Loading and Cleaning Market Data
1.4 β Loading and Cleaning Market Data#
A backtester needs historical OHLCV (Open, High, Low, Close, Volume) in a clean DataFrame with a DatetimeIndex and no surprises from missing data or timezones. This lesson shows how to load data (e.g. with yfinance), handle missing values, and align to a single timezone so itβs ready for Project 1 β Backtesting Engine.
Installing yfinance#
pip install yfinance pandas
In Python:
import yfinance as yf
import pandas as pd
Downloading OHLCV#
# Single symbol, period (e.g. 1y, 2y, 5y) or start/end dates
ticker = yf.Ticker('AAPL')
df = ticker.history(period='2y') # or start='2022-01-01', end='2024-01-01'
history() returns a DataFrame with a DatetimeIndex and columns such as Open, High, Low, Close, Volume. Column names are capitalized; you can rename for consistency: df = df.rename(columns=str.lower).
Handling missing values#
- Check:
df.isna().sum()shows missing counts per column. - Drop rows with any NaN:
df = df.dropna()β use if you need complete bars only. - Forward-fill then drop rest:
df = df.ffill().dropna()β fills then drops rows that are still NaN (e.g. at the start). Use with care so you donβt fill large gaps in a way that distorts returns. - Drop columns you donβt need: e.g. if you only use
closeandvolume, dropping others can simplify and avoid some NaNs.
For a simple daily backtest, dropna() or a short ffill() for minor gaps is usually enough. Avoid filling long stretches of missing data without a clear rule.
Single timezone#
yfinance often returns timezone-aware indices (e.g. UTC or exchange time). Normalize to one timezone so dates align with your strategy and calendar:
if df.index.tz is None:
df.index = df.index.tz_localize('UTC')
df.index = df.index.tz_convert('America/New_York')
# Optional: make index date-only (drop time) for daily data
df.index = df.index.normalize()
Use the same timezone everywhere so βone row per dayβ is unambiguous.
Resulting shape for backtesting#
After loading and cleaning you want:
- Index: DatetimeIndex, sorted, one row per trading day, single timezone.
- Columns: at least
open,high,low,close,volume(lowercase if you prefer). - No NaNs in the columns you use (or an explicit rule for them).
Example helper:
def load_ohlcv(symbol: str, period: str = '2y') -> pd.DataFrame:
df = yf.Ticker(symbol).history(period=period)
df = df.rename(columns=str.lower)
df = df[['open', 'high', 'low', 'close', 'volume']]
df = df.dropna()
df.index = df.index.tz_localize('UTC') if df.index.tz is None else df.index
df.index = df.index.tz_convert('America/New_York')
df = df.sort_index()
return df
Quiz time#
Question 1
Write a small script that downloads 1 year of OHLCV for a symbol, renames columns to lowercase, keeps only open, high, low, close, volume, drops rows with any NaN, and sorts by index.
Show Solution
import yfinance as yf
import pandas as pd
symbol = 'AAPL'
df = yf.Ticker(symbol).history(period='1y')
df = df.rename(columns=str.lower)
df = df[['open', 'high', 'low', 'close', 'volume']]
df = df.dropna()
df = df.sort_index()
print(df.head())
Question 2
How do you check for missing values before and after cleaning? Show the code.
Show Solution
import yfinance as yf
import pandas as pd
df = yf.Ticker('AAPL').history(period='1y')
df = df.rename(columns=str.lower)
df = df[['open', 'high', 'low', 'close', 'volume']]
print('Missing per column (before):')
print(df.isna().sum())
df = df.dropna()
print('Missing per column (after):')
print(df.isna().sum())
Question 3
Normalize the DataFrame index to US/Eastern when it is timezone-naive. Assume the source data is in UTC. Give the full code.
Show Solution
import yfinance as yf
import pandas as pd
df = yf.Ticker('AAPL').history(period='1y')
df = df.rename(columns=str.lower)
df = df[['open', 'high', 'low', 'close', 'volume']].dropna()
if df.index.tz is None:
df.index = df.index.tz_localize('UTC')
df.index = df.index.tz_convert('America/New_York')
df = df.sort_index()
print(df.index.tz)
Next steps#
With clean OHLCV in a DataFrame (from 1.1β1.3), you can add indicator columns, generate signals, and run your backtest. For performance metrics (returns, volatility, Sharpe), see Applied Statistics β Returns and volatility.