0.2 — Project 1 — Backtesting Engine (Foundations) Guide

This guide walks you through the “Steps to Complete the Project” from the project overview. Each step is framed as a task with hints to point you in the right direction. Full code and “raw” answers are hidden in Show Solution accordions—try the step yourself first, then expand only when you need to.

The goal is not to give you a giant finished script, but to help you build it step‑by‑step with support when you’re stuck.


1. Set up the environment

Goal: Create a clean Python environment and install the core libraries so you can run Pandas, NumPy, yfinance, and (optionally) matplotlib.

Why it matters: A dedicated virtual environment keeps this project’s dependencies separate from other work and makes it easy to reproduce later (e.g. with a requirements.txt).

Hints:

Try it: Create and activate a venv, then install the core dependencies. Optionally add a requirements.txt with pinned or minimum versions.

Show Solution

Recommended layout: project folder (e.g. learning-library-projects/) → inside it a package (e.g. backtesting/) and a main script run_backtest.py.

Create and activate a virtual environment:

python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

Install core dependencies:

pip install numpy pandas yfinance matplotlib

Optional requirements.txt:

numpy>=1.24,<3.0
pandas>=2.0,<3.0
yfinance>=0.2,<0.3
matplotlib>=3.8,<4.0

Then: pip install -r requirements.txt
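
Once the install finishes, a quick check like the one below can confirm that the packages are visible from the active environment. This is a minimal sketch using only the standard library; the helper name installed_versions is illustrative and not part of the project.

```python
from __future__ import annotations

from importlib import metadata


def installed_versions(packages: list[str]) -> dict[str, str | None]:
    """Map each distribution name to its installed version (None if missing)."""
    versions: dict[str, str | None] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    core = ["numpy", "pandas", "yfinance", "matplotlib"]
    for name, version in installed_versions(core).items():
        print(f"{name:<12} {version or 'NOT INSTALLED'}")
```

If any line reports NOT INSTALLED, the venv is probably not activated or the install ran against a different interpreter.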


2. Load and clean historical data

Goal: Download daily OHLCV data for at least one symbol and end up with a clean pandas.DataFrame that has a DatetimeIndex, no duplicate dates, and a consistent way of handling missing values.

Why it matters: The rest of the backtest (indicators, signals, P&L) assumes one row per date and aligned columns. Gaps, duplicates, or timezone mix‑ups will cause subtle bugs.

Hints:

Try it: Write a small script that downloads a symbol for a date range and a helper (e.g. clean_ohlcv(df)) that returns a cleaned DataFrame with the properties above.

Show Solution

Minimal download with yfinance:

import yfinance as yf
import pandas as pd

ticker = "SPY"
start = "2015-01-01"
end = "2025-01-01"

df = yf.download(
    ticker,
    start=start,
    end=end,
    interval="1d",
    auto_adjust=False,
    progress=False,
)
print(df.head())

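After this, df typically has columns like Open, High, Low, Close, Adj Close, Volume. Some newer yfinance releases return these as a two-level (field, ticker) column index even for a single symbol; if you see that, flatten the columns to the field names before the later steps, which expect plain names such as "Close".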

Basic cleaning helper:

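def clean_ohlcv(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()  # do not mutate the caller's frame
    # Single-symbol downloads may arrive with (field, ticker) columns; keep the field level.
    if isinstance(df.columns, pd.MultiIndex):
        df.columns = df.columns.get_level_values(0)
    if not isinstance(df.index, pd.DatetimeIndex):
        df.index = pd.to_datetime(df.index)
    df = df.sort_index()
    df = df[~df.index.duplicated(keep="first")]  # one row per date
    # Forward-fill gaps; bfill only covers leading NaNs (it borrows later values).
    df = df.ffill().bfill()
    return df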

df = clean_ohlcv(df)
print(df.info())

Sanity checks: df.index is a DatetimeIndex; no duplicate index entries; no obvious big gaps except weekends/holidays.
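
The sanity checks above can be collected into one small helper. The following is a sketch under the assumption that the frame has already been cleaned; the function name ohlcv_sanity_report and the toy data are hypothetical and stand in for a real download.

```python
import pandas as pd


def ohlcv_sanity_report(df: pd.DataFrame) -> dict:
    """Summarize the properties the rest of the backtest relies on."""
    is_dt = isinstance(df.index, pd.DatetimeIndex)
    max_gap_days = 0
    if is_dt and len(df.index) > 1:
        # Calendar-day gap between consecutive rows.
        gaps = df.index.to_series().diff().dropna()
        max_gap_days = int(gaps.max().days)
    return {
        "is_datetime_index": is_dt,
        "is_sorted": bool(df.index.is_monotonic_increasing),
        "duplicate_dates": int(df.index.duplicated().sum()),
        "missing_values": int(df.isna().sum().sum()),
        "max_gap_days": max_gap_days,
    }


# Toy frame standing in for cleaned daily data (a Friday, then Monday, then Tuesday).
toy = pd.DataFrame(
    {"Close": [100.0, 101.0, 102.0]},
    index=pd.to_datetime(["2024-01-05", "2024-01-08", "2024-01-09"]),
)
report = ohlcv_sanity_report(toy)
print(report)
```

For daily equity data, a gap of three calendar days is a normal weekend and four usually means a holiday next to a weekend; anything much larger is worth inspecting by hand.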


3. Compute indicators (e.g., Simple Moving Average)

Goal: Add at least one indicator as a new column to your DataFrame. A good first choice is a Simple Moving Average (SMA) so you can later build a moving‑average crossover strategy.

Why it matters: Strategies are driven by indicators. Getting one right (and validated) gives you a template for more.

Hints:

Try it: Implement an SMA and add two columns (e.g. SMA_50 and SMA_200). Check a few values by hand or with a known reference.

Show Solution

from __future__ import annotations  # lets the `str | None` hint parse on Python < 3.10

import pandas as pd

def add_sma(
    df: pd.DataFrame,
    window: int,
    price_col: str = "Close",
    column_name: str | None = None,
) -> pd.DataFrame:
    if price_col not in df.columns:
        raise KeyError(f"Column {price_col!r} not found in DataFrame")
    if window <= 0:
        raise ValueError("window must be positive")
    col_name = column_name or f"SMA_{window}"
    df[col_name] = (
        df[price_col]
        .rolling(window=window, min_periods=window)
        .mean()
    )
    return df

Usage:

df = add_sma(df, window=50, price_col="Close")
df = add_sma(df, window=200, price_col="Close")
print(df[["Close", "SMA_50", "SMA_200"]].head(10))

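Validation: for a given date, the SMA value should equal the mean of the most recent window closing prices, including that date's own close (check by hand or with a small list in Python). The first window − 1 rows are NaN because min_periods=window requires a full window.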
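
A hand check of the rolling mean can be sketched on a tiny series. The numbers below are made up for illustration; the rolling call mirrors the one used in add_sma.

```python
import numpy as np
import pandas as pd

closes = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0], name="Close")
window = 3

sma = closes.rolling(window=window, min_periods=window).mean()

# Manual value for the last row: mean of the 3 most recent closes,
# including the last row itself: (12 + 13 + 14) / 3.
manual_last = closes.iloc[-window:].mean()

print(sma.tolist())
print(manual_last)
```

The first two entries of sma are NaN because a full three-bar window is not yet available there.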


4. Define and implement signal logic

Goal: Turn your indicators into a signal series (e.g. +1 = long, 0 = flat, −1 = short) aligned with your DataFrame index. For a moving‑average crossover: long when the short MA is above the long MA; flat (or short) when it’s below.

Why it matters: Signals are the bridge between indicators and trading. They must use only information available at each bar (no look‑ahead). The position you use for P&L will be derived from the signal with a one‑bar lag in the next step.

Hints:

Try it: Implement a function that takes the DataFrame and the short/long SMA column names and writes a signal column (+1 / 0 / −1). Inspect a few rows where the MAs cross to confirm the signal flips as expected.

Show Solution

Concept: long when short MA > long MA; flat (or short) otherwise. Stay flat where either MA is NaN.

import numpy as np
import pandas as pd

def moving_average_crossover_signals(
    df: pd.DataFrame,
    short_col: str = "SMA_50",
    long_col: str = "SMA_200",
    long_only: bool = True,
) -> pd.DataFrame:
    if short_col not in df or long_col not in df:
        raise KeyError("Missing SMA columns in DataFrame")
    short_ma = df[short_col]
    long_ma = df[long_col]
    signal = np.where(short_ma > long_ma, 1, 0 if long_only else -1)
    signal = np.where(short_ma.isna() | long_ma.isna(), 0, signal)
    df["signal"] = signal
    return df

Usage after computing SMAs:

df = moving_average_crossover_signals(df, short_col="SMA_50", long_col="SMA_200")
print(df[["Close", "SMA_50", "SMA_200", "signal"]].head(20))

Look‑ahead note: The signal at day t is known at the close of day t. The position used for P&L on day t+1 should be the signal from the previous day (shift by 1), which we do in the next step.
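
To inspect the rows where the signal flips, one option is to compare each bar's signal with the previous bar's. The sketch below uses a toy signal series in place of df["signal"]; the variable names are illustrative.

```python
import pandas as pd

# Toy signal standing in for df["signal"] (long-only: 1 = long, 0 = flat).
signal = pd.Series(
    [0, 0, 1, 1, 0, 1],
    index=pd.date_range("2024-01-01", periods=6, freq="D"),
    name="signal",
)

# A flip is any bar whose signal differs from the previous bar's.
# The first bar has no predecessor, so it is never counted as a flip.
previous = signal.shift(1)
flips = signal.ne(previous) & previous.notna()
flip_dates = signal.index[flips]

# The position lags the signal by one bar: a flip on day t is traded on day t+1.
position = signal.shift(1).fillna(0)

print(pd.DataFrame({"signal": signal, "position": position, "flip": flips}))
```

On real data, printing the Close and SMA columns for a few rows around each flip date shows whether the short MA actually crossed the long MA there.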


5. Simulate trades (positions, P&L, equity curve)

Goal: Convert signals into positions, then into strategy returns and an equity curve. Avoid look‑ahead: the position on day t+1 should be based on the signal at the close of day t.

Why it matters: This is where the strategy actually “trades.” Using the signal from the same bar to compute that bar’s return would be look‑ahead bias. Shifting the signal by one period fixes that.

P&L and slippage: We use close‑to‑close returns and no slippage here. For these assumptions and how to add a simple slippage model later (e.g., fixed bps per trade), see P&L and slippage in the project overview.

Hints:

Try it: Implement helpers that (1) add a position column from signal with a one‑period lag, and (2) add asset_return, strategy_return, and equity_curve columns. Run a quick check with “always long” to compare to buy‑and‑hold.

Show Solution

Positions (shift by 1 to avoid using today’s close to trade today):

def compute_positions_from_signals(df: pd.DataFrame) -> pd.DataFrame:
    if "signal" not in df.columns:
        raise KeyError("DataFrame must contain a 'signal' column")
    df["position"] = df["signal"].shift(1).fillna(0)
    return df

P&L and equity (close‑to‑close returns; strategy return = position × asset return):

def compute_pnl_and_equity(
    df: pd.DataFrame,
    price_col: str = "Close",
    initial_capital: float = 100_000.0,
) -> pd.DataFrame:
    if price_col not in df.columns:
        raise KeyError(f"Column {price_col!r} not found")
    if "position" not in df.columns:
        raise KeyError("DataFrame must contain a 'position' column")
    df["asset_return"] = df[price_col].pct_change().fillna(0.0)
    df["strategy_return"] = df["position"] * df["asset_return"]
    df["equity_curve"] = (1 + df["strategy_return"]).cumprod() * initial_capital
    return df

Usage:

df = compute_positions_from_signals(df)
df = compute_pnl_and_equity(df, price_col="Close", initial_capital=100_000.0)
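
The P&L above assumes no trading costs. As one possible extension, a fixed cost in basis points can be charged whenever the position changes. This is a rough sketch, not the model from the project overview: the function name, the 5 bps default, and the choice to charge the cost on the bar where the position changes are all assumptions.

```python
import pandas as pd


def apply_fixed_bps_slippage(
    position: pd.Series,
    asset_return: pd.Series,
    slippage_bps: float = 5.0,  # hypothetical default, not from the project overview
) -> pd.Series:
    """Strategy returns net of a fixed cost charged on every change in position."""
    # Turnover is the absolute position change: 0 -> 1 costs one unit,
    # -1 -> +1 costs two. The first bar is treated as a move from flat.
    turnover = position.diff().abs().fillna(position.abs())
    cost = turnover * slippage_bps / 10_000.0
    return position * asset_return - cost


position = pd.Series([0.0, 1.0, 1.0, 0.0])
asset_return = pd.Series([0.0, 0.01, -0.02, 0.03])
net = apply_fixed_bps_slippage(position, asset_return, slippage_bps=10.0)
print(net.tolist())
```

With 10 bps per unit of turnover, the entry on the second bar and the exit on the fourth bar each cost 0.001, so the net series sits slightly below the gross position × asset_return on those two bars.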

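Sanity check: set df["signal"] = 1 on every row, rerun both helpers, and confirm the equity curve matches buy‑and‑hold, i.e. initial_capital × Close / first Close. The one‑bar lag leaves the position at 0 on the first date, but the first asset return is also 0, so the two curves should still agree.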
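
The always-long check can be sketched on a few made-up prices. The steps repeat the shift, pct_change, and cumprod logic from the helpers above so the block runs on its own.

```python
import numpy as np
import pandas as pd

close = pd.Series(
    [100.0, 102.0, 101.0, 105.0],
    index=pd.date_range("2024-01-01", periods=4, freq="D"),
)
initial_capital = 100_000.0

signal = pd.Series(1, index=close.index)   # always long
position = signal.shift(1).fillna(0)       # one-bar lag, flat on the first bar
asset_return = close.pct_change().fillna(0.0)
equity = (1 + position * asset_return).cumprod() * initial_capital

# Buy-and-hold reference: capital scaled by the price relative to the first close.
buy_and_hold = initial_capital * close / close.iloc[0]

print(np.allclose(equity, buy_and_hold))
```

If the two curves disagree on real data, the usual suspects are a missing shift, a misaligned index, or NaNs left in the price column.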


6. Compute performance metrics

Goal: From your strategy return series (and optionally the equity curve), compute at least: total return, annualized return, annualized volatility, and Sharpe ratio. Optionally add max drawdown, win rate, and number of trades.

Why it matters: These metrics let you compare strategies and interpret backtest results. Use the same definitions as in your statistics lessons (e.g. Returns and volatility for backtesting) so they’re comparable.

Hints:

Try it: Write small functions for total return, annualized return, annualized volatility, and Sharpe ratio. Then add max drawdown from the equity curve. Call them on df["strategy_return"] and df["equity_curve"] and print the results.

Show Solution

import numpy as np
import pandas as pd

def total_return(returns: pd.Series) -> float:
    return float((1.0 + returns).prod() - 1.0)

def annualized_return(returns: pd.Series, periods_per_year: int = 252) -> float:
    if returns.empty:
        return 0.0
    cumulative = 1.0 + total_return(returns)
    n_periods = returns.shape[0]
    years = n_periods / periods_per_year
    if years <= 0:
        return 0.0
    return float(cumulative ** (1.0 / years) - 1.0)

def annualized_volatility(returns: pd.Series, periods_per_year: int = 252) -> float:
    return float(returns.std(ddof=1) * np.sqrt(periods_per_year))

def sharpe_ratio(
    returns: pd.Series,
    risk_free_rate: float = 0.0,
    periods_per_year: int = 252,
) -> float:
    if returns.empty:
        return 0.0
    rf_per_period = (1 + risk_free_rate) ** (1 / periods_per_year) - 1
    excess = returns - rf_per_period
    vol = excess.std(ddof=1)
    if vol == 0 or np.isnan(vol):  # flat series, or too few observations for a sample std
        return 0.0
    return float(excess.mean() / vol * np.sqrt(periods_per_year))

def max_drawdown(equity_curve: pd.Series) -> float:
    if equity_curve.empty:
        return 0.0
    running_max = equity_curve.cummax()
    drawdowns = equity_curve / running_max - 1.0
    return float(drawdowns.min())

Using them:

metrics = {}
metrics["total_return"] = total_return(df["strategy_return"])
metrics["annualized_return"] = annualized_return(df["strategy_return"])
metrics["annualized_volatility"] = annualized_volatility(df["strategy_return"])
metrics["sharpe_ratio"] = sharpe_ratio(df["strategy_return"], risk_free_rate=0.0)
metrics["max_drawdown"] = max_drawdown(df["equity_curve"])
for name, value in metrics.items():
    print(name, ":", value)
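
Before trusting the numbers on real data, it can help to run the same formulas on inputs small enough to verify by hand. The toy values below are invented for that purpose; the expressions mirror the bodies of total_return, max_drawdown, and annualized_volatility.

```python
import numpy as np
import pandas as pd

returns = pd.Series([0.10, -0.10])               # +10% then -10%
equity = pd.Series([100.0, 120.0, 90.0, 130.0])

# Total return compounds: 1.10 * 0.90 - 1 = -0.01, not 0.
total = float((1.0 + returns).prod() - 1.0)

# Max drawdown: worst drop from a running peak, here 120 -> 90 = -25%.
drawdown = float((equity / equity.cummax() - 1.0).min())

# Annualized volatility scales the per-period sample std by sqrt(252).
ann_vol = float(returns.std(ddof=1) * np.sqrt(252))

print(total, drawdown, ann_vol)
```

The first result is a useful reminder that a gain followed by an equal percentage loss does not return you to the starting value.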

Optional win rate (fraction of periods with positive strategy return when position was non-zero):

def win_rate(returns: pd.Series, position: pd.Series) -> float:
    traded = position != 0
    if traded.sum() == 0:
        return 0.0
    return float((returns[traded] > 0).mean())

Use the formulas from the statistics lessons to interpret each metric.
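
The goal also lists number of trades as an optional metric. One simple definition, assumed here, counts every change in position (each entry or exit) as one trade; other conventions count round trips instead. The helper name count_trades is illustrative.

```python
import pandas as pd


def count_trades(position: pd.Series) -> int:
    """Count position changes, treating the series as starting from flat."""
    # diff() is NaN on the first bar; fill it with the first position so that
    # starting already long (or short) counts as one entry.
    changes = position.diff().fillna(position)
    return int((changes != 0).sum())


example = pd.Series([0, 0, 1, 1, 0, 1])
print(count_trades(example))
```

For the example series there are three changes: an entry, an exit, and a second entry.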


7. Produce a report

Goal: Print or display a concise summary of the backtest: configuration (ticker, date range, strategy name, parameters) and the main performance metrics. Optionally plot the equity curve or drawdown.

Why it matters: A single, readable output makes it easy to rerun the backtest with different parameters and compare results.

Hints:

Try it: Write a print_summary(...) (or similar) that prints configuration and metrics. Run it after computing metrics. Optionally add one plot of the equity curve.

Show Solution

Simple text report:

def format_pct(x: float, decimals: int = 2) -> str:
    return f"{x * 100:.{decimals}f}%"

def print_summary(
    ticker: str,
    start_date: str,
    end_date: str,
    strategy_name: str,
    strategy_params: dict,
    metrics: dict,
) -> None:
    print("=== Backtest Summary ===\n")
    print("Configuration:")
    print(f"  Ticker:       {ticker}")
    print(f"  Date range:   {start_date} -> {end_date}")
    print(f"  Strategy:     {strategy_name}")
    print(f"  Parameters:   {strategy_params}")
    print("\nPerformance:")
    print(f"  Total return:          {format_pct(metrics['total_return'])}")
    print(f"  Annualized return:     {format_pct(metrics['annualized_return'])}")
    print(f"  Annualized volatility: {format_pct(metrics['annualized_volatility'])}")
    print(f"  Sharpe ratio:          {metrics['sharpe_ratio']:.2f}")
    print(f"  Max drawdown:          {format_pct(metrics['max_drawdown'])}")

Example call:

print_summary(
    ticker="SPY",
    start_date="2015-01-01",
    end_date="2025-01-01",
    strategy_name="Moving Average Crossover",
    strategy_params={"short_window": 50, "long_window": 200},
    metrics=metrics,
)

Optional equity curve plot (e.g. in a notebook):

import matplotlib.pyplot as plt
df["equity_curve"].plot(figsize=(10, 4), title="Equity Curve")
plt.xlabel("Date")
plt.ylabel("Equity")
plt.show()
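
For the optional drawdown plot, the drawdown series can be derived from the equity curve with the same running-peak formula used in max_drawdown. The sketch below keeps the plotting in a separate function with a lazy matplotlib import, so the calculation can be checked without opening a figure; both function names are illustrative.

```python
import pandas as pd


def drawdown_series(equity_curve: pd.Series) -> pd.Series:
    """Fractional decline from the running peak (0.0 at a new high)."""
    return equity_curve / equity_curve.cummax() - 1.0


def plot_drawdown(equity_curve: pd.Series) -> None:
    """Draw the drawdown as a filled area below zero."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting

    dd = drawdown_series(equity_curve)
    fig, ax = plt.subplots(figsize=(10, 3))
    ax.fill_between(dd.index, dd.to_numpy(), 0.0, alpha=0.3)
    ax.set_title("Drawdown")
    ax.set_xlabel("Date")
    ax.set_ylabel("Drawdown")
    plt.show()


toy_equity = pd.Series(
    [100.0, 110.0, 99.0, 120.0],
    index=pd.date_range("2024-01-01", periods=4, freq="D"),
)
dd = drawdown_series(toy_equity)
print(dd.round(4).tolist())
```

In a notebook, calling plot_drawdown(df["equity_curve"]) after the backtest would draw the chart for the real run.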

8. Document and refactor

Goal: End up with clean, reusable code and a short README so that you (or someone else) can run the backtest and understand what the strategy does and how to interpret the output.

Why it matters: Good structure (e.g. separate modules for data, indicators, strategy, backtester, metrics, report) makes it easier to add a second strategy (e.g. momentum) or reuse the engine for the next project (e.g. paper‑trading bot).

Hints:

Try it: Organize your code into at least two or three modules and a main script. Write a README that covers setup, run instructions, and a brief strategy + metrics interpretation.

Show Solution

Example module layout:

  • backtesting/data.py — download & clean OHLCV
  • backtesting/indicators.py — SMA and other indicators
  • backtesting/strategy.py — signal logic (e.g. MA crossover)
  • backtesting/backtester.py — positions, returns, equity
  • backtesting/metrics.py — performance metrics
  • backtesting/report.py — text/plot reporting
  • run_backtest.py — wires everything and runs one backtest

Example run_backtest.py skeleton:

from datetime import date
import yfinance as yf
from backtesting.data import clean_ohlcv
from backtesting.indicators import add_sma
from backtesting.strategy import moving_average_crossover_signals
from backtesting.backtester import compute_positions_from_signals, compute_pnl_and_equity
from backtesting.metrics import total_return, annualized_return, annualized_volatility, sharpe_ratio, max_drawdown
from backtesting.report import print_summary

def main() -> None:
    ticker = "SPY"
    start_date = "2015-01-01"
    end_date = date.today().strftime("%Y-%m-%d")
    raw = yf.download(ticker, start=start_date, end=end_date, interval="1d",
                      auto_adjust=False, progress=False)  # same setting as in step 2
    df = clean_ohlcv(raw)
    df = add_sma(df, window=50, price_col="Close")
    df = add_sma(df, window=200, price_col="Close")
    df = moving_average_crossover_signals(df, short_col="SMA_50", long_col="SMA_200")
    df = compute_positions_from_signals(df)
    df = compute_pnl_and_equity(df, price_col="Close", initial_capital=100_000.0)
    metrics = {
        "total_return": total_return(df["strategy_return"]),
        "annualized_return": annualized_return(df["strategy_return"]),
        "annualized_volatility": annualized_volatility(df["strategy_return"]),
        "sharpe_ratio": sharpe_ratio(df["strategy_return"], risk_free_rate=0.0),
        "max_drawdown": max_drawdown(df["equity_curve"]),
    }
    print_summary(ticker=ticker, start_date=start_date, end_date=end_date,
                  strategy_name="Moving Average Crossover",
                  strategy_params={"short_window": 50, "long_window": 200},
                  metrics=metrics)

if __name__ == "__main__":
    main()

Using separate modules and small functions makes it easier to add new strategies and reuse the engine later.
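
Once the pipeline lives in main(), rerunning it with different parameters is easier if the settings come from the command line rather than being hard-coded. The following is one possible sketch using argparse from the standard library; the flag names and defaults are hypothetical and simply mirror the values used in the skeleton above.

```python
from __future__ import annotations

import argparse


def parse_args(argv: list[str] | None = None) -> argparse.Namespace:
    """Parse backtest settings; the flag names here are illustrative."""
    parser = argparse.ArgumentParser(description="Run one MA-crossover backtest.")
    parser.add_argument("--ticker", default="SPY")
    parser.add_argument("--start", default="2015-01-01", help="YYYY-MM-DD")
    parser.add_argument("--end", default=None, help="YYYY-MM-DD; None means today")
    parser.add_argument("--short-window", type=int, default=50)
    parser.add_argument("--long-window", type=int, default=200)
    args = parser.parse_args(argv)
    # A crossover only makes sense if the short window is the shorter one.
    if args.short_window >= args.long_window:
        parser.error("--short-window must be smaller than --long-window")
    return args


args = parse_args(["--ticker", "QQQ", "--short-window", "20", "--long-window", "100"])
print(args.ticker, args.short_window, args.long_window)
```

In run_backtest.py, main() could call parse_args() with no arguments to read sys.argv and then pass args.ticker, args.short_window, and so on into the existing functions.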


9. Final tips for beginners

If you can run one script (or one notebook) that completes this full pipeline and prints a sensible summary, you’ve achieved the core objective of this project.