1.1 — Series and DataFrame Fundamentals
1.1 — Series and DataFrame Fundamentals#
Pandas builds on two core structures: Series (one-dimensional) and DataFrame (two-dimensional). For backtesting and market data, you will almost always work with DataFrames where each row is a timestamp and each column is a price series, indicator, or signal. This lesson gives you what you need to create and manipulate these structures.
Used in: Project 1 — Backtesting Engine: storing OHLCV data and indicator columns in a DataFrame.
Series#
A Series is a one-dimensional labeled array. Labels form the index; the values can be any type (often numeric).
import pandas as pd
# Series from a list (default integer index)
prices = pd.Series([100.0, 101.5, 99.2, 102.1])
print(prices)
# 0 100.0
# 1 101.5
# 2 99.2
# 3 102.1
# Series with explicit index (e.g. dates or labels)
prices = pd.Series([100.0, 101.5, 99.2], index=['Mon', 'Tue', 'Wed'])
print(prices)
You can do element-wise math and use .mean(), .std(), .sum() etc. Series align by index when you combine them—important when we add indicator columns to a price Series.
DataFrame#
A DataFrame is a table: rows and columns, each column a Series sharing the same index. For market data, think one row per date, one column per series (Open, High, Low, Close, Volume, then your indicators).
# DataFrame from a dict of lists (each key = column name)
df = pd.DataFrame({
'open': [100, 101, 99],
'high': [102, 103, 100],
'low': [98, 100, 97],
'close': [101, 102, 99],
})
# open high low close
# 0 100 102 98 101
# 1 101 103 100 102
# 2 99 100 97 99
df
- Rows: default integer index 0, 1, 2. For backtesting you will replace this with a DatetimeIndex (see 1.3 — Time series and DatetimeIndex).
- Columns: access with
df['close']ordf.close(returns a Series). Add a new column by assignment:df['sma_10'] = ....
OHLC and OHLCV: why we care#
OHLC stands for Open, High, Low, Close—the four standard prices for a single bar (e.g. one day or one hour):
- Open: price at the start of the period
- High: highest price during the period
- Low: lowest price during the period
- Close: price at the end of the period
OHLCV adds Volume (number of shares or contracts traded). In pandas we store these as columns: one row per bar, columns open, high, low, close, and optionally volume.
Why it matters: almost all market data and backtesting assumes this shape. Exchanges and data vendors provide OHLC(V); indicators (e.g. moving averages) usually use close; strategies often compare high/low to open/close. Keeping this layout keeps your code compatible with the rest of the quant ecosystem.
SMA: what it is and why we care#
SMA means Simple Moving Average: the average of the last (n) values. For example, a 10-period SMA of close is the mean of the last 10 closing prices.
- What it does: smooths out noise so you can see trend. Short SMAs (e.g. 10) react quickly; long SMAs (e.g. 50) react slowly.
- Why we care in quant: many strategies use SMAs for signals (e.g. “buy when price crosses above the 50-day SMA”) or as a baseline for volatility and momentum. In pandas you compute it with a rolling window—see below.
In general pandas terms, “SMA” is just one use of a rolling mean; the same rolling machinery is used for rolling standard deviation, min, max, etc.
Adding and overwriting columns#
You add indicator or signal columns by assigning to a new column name. Pandas aligns by index, so if you assign a Series with a matching index, values line up by date.
Three functions show up constantly for time-series and backtesting. It’s worth knowing what each does and when to use it.
.shift(periods) — lag (or lead) values#
- What it does: moves values forward or backward along the index.
s.shift(1)puts the previous row’s value in the current row; the first row gets NaN (no previous value). - Why we care (quant): “previous close,” “yesterday’s signal,” or any lagged feature. Essential to avoid look-ahead: you never use future data at the current bar.
- When to use (general): any time you need “value from N rows ago” (e.g. compare to last week, last month). Positive
periods= lag (past); negative = lead (future—use only for targets, not features).
df['close_lag1'] = df['close'].shift(1) # previous bar's close
.rolling(window) — rolling windows#
- What it does: groups consecutive rows into a window of size
windowand lets you apply an aggregation (.mean(),.std(),.min(), etc.). The firstwindow - 1rows are NaN (not enough history yet). - Why we care (quant): SMAs (
rolling(n).mean()), rolling volatility (rolling(n).std()), rolling high/low, and many indicators are “something over the last N bars.” - When to use (general): any “over the last N rows” calculation—smoothing, local trends, or running statistics.
df['sma_10'] = df['close'].rolling(10).mean()
.pct_change(periods=1) — percent change (returns)#
- What it does: computes (current − previous) / previous, i.e. the proportional change from the previous row. With default
periods=1, the first row is NaN. - Why we care (quant): daily (or bar) returns are the standard for performance, risk, and many signals. Returns are usually in decimal form (0.01 = 1%).
- When to use (general): any “how much did this change compared to last time?” metric—prices, volumes, or other series.
df['returns'] = df['close'].pct_change()
Putting them together:
df['close_lag1'] = df['close'].shift(1) # previous day's close
df['returns'] = df['close'].pct_change() # simple daily returns
df['sma_2'] = df['close'].rolling(2).mean() # 2-bar SMA (for illustration)
This pattern—one row per date, new columns for indicators and signals—is exactly what you use in a backtest: load OHLCV, then add columns for moving averages, signals, etc.
Key takeaways for backtesting#
- Use a DataFrame for your main data: one row per bar (e.g. per day), columns for Open, High, Low, Close, Volume, and any indicators or signals.
- Use a Series for a single series (e.g. one symbol’s close, or one indicator) when you need to pass it to a function or compute on it.
- Add columns with assignment (
df['new_col'] = ...). Keep the same index (dates) so everything stays aligned.
Quiz time#
Practice by writing code that satisfies the given inputs and test cases. Try it yourself before opening the solution or hints.
Question 1
Task: Given a list of closing prices, create a Series with labels 'Mon', 'Tue', 'Wed', 'Thu', then create a DataFrame with columns open, high, low, close for the same four days, using the same index. Add a column that is the mean of the close Series (single number broadcast). Your code should use the exact inputs below.
Input:
close_prices = [101.0, 102.0, 99.0, 100.0]
index = ['Mon', 'Tue', 'Wed', 'Thu']
# open, high, low for each day (you can define these as lists)
open_ = [100, 101, 102, 99]
high = [102, 103, 104, 101]
low = [98, 100, 97, 100]
Test cases:
- The Series has index
['Mon','Tue','Wed','Thu']and values[101, 102, 99, 100]. - The DataFrame has columns
open,high,low,closeand the same index. - A new column
close_meanequals the scalar mean ofclose(e.g. 100.5) in every row.
Expected output (shape): A DataFrame with 4 rows, 5 columns (open, high, low, close, close_mean).
Hint
Use pd.Series(close_prices, index=index). Build the DataFrame from a dict of lists and pass index=.... For close_mean, assign the scalar df['close'].mean().
Show Solution
import pandas as pd
close_prices = [101.0, 102.0, 99.0, 100.0]
index = ['Mon', 'Tue', 'Wed', 'Thu']
open_ = [100, 101, 102, 99]
high = [102, 103, 104, 101]
low = [98, 100, 97, 100]
s = pd.Series(close_prices, index=index)
df = pd.DataFrame({'open': open_, 'high': high, 'low': low, 'close': close_prices}, index=index)
df['close_mean'] = df['close'].mean()
print(df)
Question 2
Task: Given a DataFrame with a close column and a datetime index, add a column close_lag1 that contains the previous bar’s close. First row should have NaN. Use the exact input below.
Input:
import pandas as pd
df = pd.DataFrame(
{'close': [100, 101, 102, 99, 100]},
index=pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-03', '2024-01-04', '2024-01-05'])
)
Test cases:
close_lag1on the first date is NaN.- On
2024-01-02,close_lag1is 100. - On
2024-01-05,close_lag1is 99.
Expected output: Same DataFrame with an extra column close_lag1; values aligned by date.
Hint
Use a method that shifts values down by one row so the “previous” value appears in the current row. No loops needed.
Show Solution
df['close_lag1'] = df['close'].shift(1)
print(df)
# close close_lag1
# 2024-01-01 100 NaN
# 2024-01-02 101 100.0
# 2024-01-03 102 101.0
# 2024-01-04 99 102.0
# 2024-01-05 100 99.0
Question 3
Task: Using the same df as in Question 2 (with close and datetime index), add two columns: returns (percent change of close) and sma_2 (2-period simple moving average of close). Match the test cases below.
Input: The df from Question 2 after adding close_lag1 (or start fresh with the same close and index).
Test cases:
- First row:
returnsis NaN,sma_2is NaN. - Second row:
returns≈ (101−100)/100 = 0.01;sma_2= (100+101)/2 = 100.5. - Third row:
sma_2= (101+102)/2 = 101.5.
Expected output: DataFrame with columns close, close_lag1 (optional), returns, sma_2.
Hint
Percent change from previous row: use the method that computes (current − previous) / previous. For a 2-period average over close, use a rolling window of size 2 and take the mean.
Show Solution
df['returns'] = df['close'].pct_change()
df['sma_2'] = df['close'].rolling(2).mean()
print(df)
Next steps#
In 1.2 — Indexing and selection we cover how to select rows by date, slice ranges, and select columns. In 1.3 — Time series and DatetimeIndex we set up a proper date index for backtesting.