FactorQX

Fetching OHLCV Data with Python

Pull clean historical OHLCV candles in Python, handle pagination and rate limits, and store the result for reproducible research.


Reliable research starts with reliable data. This guide covers fetching historical OHLCV (open, high, low, close, volume) candles in Python, the practical issues you'll hit, and how to store data so your work is reproducible.

Educational software content only — not investment advice or a trading signal.

A minimal fetch

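Many market-data APIs return candles as arrays of [timestamp, o, h, l, c, v]. We'll normalize them into a pandas DataFrame indexed by time. The example assumes timestamps are Unix epoch milliseconds; if your API returns seconds, change unit="ms" to unit="s".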

fetch.py
import pandas as pd
import requests
 
def fetch_ohlcv(url: str, params: dict) -> pd.DataFrame:
    resp = requests.get(url, params=params, timeout=10)
    resp.raise_for_status()
    rows = resp.json()  # e.g. [[ts, o, h, l, c, v], ...]
 
    df = pd.DataFrame(rows, columns=["ts", "open", "high", "low", "close", "volume"])
    df["ts"] = pd.to_datetime(df["ts"], unit="ms", utc=True)
    return df.set_index("ts").astype(float).sort_index()

Always store timestamps in UTC. Mixing local time zones is a frequent and hard-to-debug source of off-by-one-bar errors.
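As an illustration of why UTC matters, here is a minimal sketch that converts naive local timestamps to UTC. The timestamps and the America/New_York source zone are made up for the example; the point is that the local-to-UTC offset shifts by an hour across a daylight-saving change, which is exactly where off-by-one-bar errors tend to appear.

```python
import pandas as pd

# Hypothetical bar timestamps recorded as naive New York wall-clock time.
naive = pd.to_datetime(["2024-03-08 23:00", "2024-03-09 23:00", "2024-03-10 23:00"])

# Attach the source zone first, then convert to UTC.
# US daylight saving time began on 2024-03-10, so the last bar uses a different offset.
local = naive.tz_localize("America/New_York")
utc = local.tz_convert("UTC")

print(utc)
```

The first two bars land at 04:00 UTC and the third at 03:00 UTC. Treating all three as a fixed offset from UTC would misplace the last one by an hour.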

Pagination and rate limits

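APIs cap how many candles they return per request. To build a long history, you page forward (or backward) from the last timestamp you received, pausing between requests to respect rate limits. The function below pages forward; fetch is any callable that accepts since and limit and returns a list of rows in ascending time order: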

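import time
 
def fetch_range(fetch, start_ms: int, end_ms: int, step_limit: int = 1000, pause_s: float = 0.25):
    """Collect [ts, o, h, l, c, v] rows with start_ms <= ts < end_ms, paging forward."""
    out, cursor = [], start_ms
    while cursor < end_ms:
        batch = fetch(since=cursor, limit=step_limit)
        if not batch:
            break
        out.extend(row for row in batch if row[0] < end_ms)  # drop rows past the requested end
        next_cursor = batch[-1][0] + 1  # advance past the last timestamp
        if next_cursor <= cursor:       # no progress: stop rather than loop forever
            break
        cursor = next_cursor
        time.sleep(pause_s)             # be polite; honor documented rate limits
    return out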
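A fixed pause is not always enough: some APIs reject requests once a quota is exceeded (commonly with HTTP 429). One way to handle that is to retry with exponential backoff. The sketch below is an assumption-laden illustration, not any particular API's client: RateLimited and call_with_backoff are hypothetical names, and how you detect a rate-limit response depends on your provider.

```python
import time


class RateLimited(Exception):
    """Hypothetical signal that the API asked the client to slow down (e.g. HTTP 429)."""

    def __init__(self, retry_after=None):
        super().__init__("rate limited")
        self.retry_after = retry_after  # seconds suggested by the server, if any


def call_with_backoff(fn, max_retries=5, base_delay=0.5, sleep=time.sleep):
    """Call fn(); on RateLimited, wait and retry with exponentially growing delays."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RateLimited as exc:
            if attempt == max_retries:
                raise  # out of retries: surface the error to the caller
            # Prefer the server's hint; otherwise use base_delay * 2**attempt.
            if exc.retry_after is not None:
                delay = exc.retry_after
            else:
                delay = base_delay * 2 ** attempt
            sleep(delay)
```

To use it, the function you pass in would raise RateLimited when the response indicates throttling, and each page request would be wrapped in call_with_backoff. The sleep parameter is injectable so the logic can be tested without actually waiting.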

Validate before you trust

Before any analysis, check the data:

  • Gaps: reindex to the expected frequency and inspect missing bars.
  • Duplicates: df.index.duplicated() should be all False.
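  • Monotonic time: df.index.is_monotonic_increasing should be True.

# Fail fast if the index is out of order or contains repeated timestamps.
assert df.index.is_monotonic_increasing
assert not df.index.duplicated().any()
# Count the spacing between consecutive bars; any value other than the bar interval indicates a gap.
gaps = df.index.to_series().diff().value_counts()
print(gaps.head())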
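The gap check in the list above can be made explicit by building the expected time grid and comparing it with the index you actually have. This is a minimal sketch: missing_bars is a hypothetical helper, and it assumes bars are evenly spaced and aligned to the first timestamp in the data.

```python
import pandas as pd


def missing_bars(index: pd.DatetimeIndex, step: pd.Timedelta) -> pd.DatetimeIndex:
    """Return the timestamps expected every `step` that are absent from `index`."""
    expected = pd.date_range(index.min(), index.max(), freq=step)
    return expected.difference(index)


# Hypothetical hourly index with the 02:00 bar missing.
idx = pd.to_datetime(
    ["2024-01-01 00:00", "2024-01-01 01:00", "2024-01-01 03:00"], utc=True
)
print(missing_bars(idx, pd.Timedelta(hours=1)))
```

Calling df.reindex on the same expected grid gives a frame with NaN rows at those timestamps, which is convenient if you want to inspect or fill gaps in place.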

Store it for reproducibility

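Cache raw data to disk so that re-running your research uses identical inputs. Parquet preserves column types and is compact; pandas needs pyarrow or fastparquet installed to write it, and the target directory must already exist: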

df.to_parquet("data/BTCUSD_1h.parquet")
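To confirm that a later run really uses identical inputs, one option is to record a fingerprint of the frame alongside the file. The sketch below is one possible approach, not a standard: frame_fingerprint and save_snapshot are hypothetical helpers, and the fingerprint covers values and index only, not column names.

```python
import hashlib
from pathlib import Path

import pandas as pd


def frame_fingerprint(df: pd.DataFrame) -> str:
    """SHA-256 over pandas' per-row hashes (index included)."""
    row_hashes = pd.util.hash_pandas_object(df, index=True)
    return hashlib.sha256(row_hashes.to_numpy().tobytes()).hexdigest()


def save_snapshot(df: pd.DataFrame, path: str) -> str:
    """Write df to Parquet, creating the directory if needed, and return its fingerprint."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    df.to_parquet(target)  # requires pyarrow or fastparquet
    return frame_fingerprint(df)
```

A typical use would be to store the string returned by save_snapshot in your research notes, then compare it with frame_fingerprint(pd.read_parquet(path)) before a backtest.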

Reproducible inputs are the foundation of trustworthy backtests. Once your data pipeline is solid, you can layer indicators and analysis on top with confidence.
