1.2 — Indexing and Selection

1.2 — Indexing and Selection#

To implement a backtester you need to select specific rows (e.g. by date or date range) and columns (e.g. close, or your signal column). Pandas provides label-based indexing (.loc), position-based indexing (.iloc), and boolean indexing. Use these to walk through time and read only the data you need.

Used in: Project 1 — Backtesting Engine: iterating over dates, selecting current and past data without look-ahead.

Selecting columns#

Selecting rows: .loc and .iloc#

For a backtest you often loop by date (using .loc and the DatetimeIndex) or by integer position (.iloc) so you can use df.iloc[:i] for “all data up to (but not including) current bar” and avoid look-ahead.

Boolean indexing (filtering rows)#

Select rows where a condition is true:

# Rows where close > 100
df[df['close'] > 100]

# Rows where short_ma > long_ma (e.g. for crossover)
df[df['sma_short'] > df['sma_long']]

You can combine conditions with & (and), | (or), and parentheses. Result is a DataFrame of matching rows.

Slicing with a DatetimeIndex#

Once your index is dates (see 1.3 — Time series and DatetimeIndex):

Avoiding look-ahead in backtests#

When you compute indicators or signals at time , use only data from times . In practice:

Quiz time#

Question 1

What is the difference between .loc and .iloc? When would you use each in a backtest?

Show Solution

.loc = by label (e.g. date). .iloc = by integer position. In a backtest, use .loc for date ranges; use .iloc for “first N rows” so you never use future data. Run this:

import pandas as pd

df = pd.DataFrame(
    {'close': [98, 101, 102, 99, 100]},
    index=pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-03', '2024-01-04', '2024-01-05'])
)

# .loc: by label (date)
row = df.loc['2024-01-03']
slice_by_date = df.loc['2024-01-02':'2024-01-04']
print('By date slice:\n', slice_by_date)

# .iloc: by position (for backtest loop: "past and current only")
i = 2  # current bar
past_and_current = df.iloc[:i+1]
print('Past and current (i=2):\n', past_and_current)

Question 2

How would you select all rows where close is greater than 100 and volume is greater than 1 million?

Show Solution

Use boolean indexing with & and parentheses. Full example:

import pandas as pd

df = pd.DataFrame({
    'close':  [99, 101, 102, 98, 103],
    'volume': [2e6, 1.5e6, 0.8e6, 1.1e6, 1.2e6],
})

# Correct: combine conditions with & and parentheses
subset = df[(df['close'] > 100) & (df['volume'] > 1_000_000)]
print(subset)
#    close    volume
# 1    101  1500000.0
# 4    103  1200000.0   (row 2 has volume 0.8e6, so excluded when volume > 1e6)

# Wrong (operator precedence): df['close'] > 100 & df['volume'] > 1_000_000

Question 3

Why is df.iloc[:i+1] (and not df.iloc[:i] or df.iloc[:i+2]) the right slice for “all data up to and including the current bar” at iteration i? Demonstrate with a small loop.

Show Solution

:i+1 gives rows 0..i (includes current bar). :i excludes current bar; :i+2 includes the next bar (look-ahead). Run this:

import pandas as pd

df = pd.DataFrame({'close': [10, 11, 12, 13, 14]})

for i in range(2, len(df)):
    # Correct: data up to and including bar i
    past_and_current = df.iloc[:i+1]
    # Wrong: df.iloc[:i]     -> excludes row i
    # Wrong: df.iloc[:i+2]   -> includes row i+1 (future)
    print(f"Bar i={i}, rows used (indices 0..{i}): {past_and_current['close'].tolist()}")
# Bar i=2, rows used (indices 0..2): [10, 11, 12]
# Bar i=3, rows used (indices 0..3): [10, 11, 12, 13]
# Bar i=4, rows used (indices 0..4): [10, 11, 12, 13, 14]

Next steps#

1.3 — Time series and DatetimeIndex shows how to set a DatetimeIndex and align data to a single timezone—required for reliable backtests.