Architecture
Overview
The Finance Data Pipeline consists of two main stages:
Active Tickers Check — Validates 106K tickers against Yahoo Finance API to determine which are active
Financial Data ETL — Fetches financial statements for active tickers and stores them in PostgreSQL
Pipeline Flow
┌─────────────────────────────────────────────────────────────────────┐
│ STAGE 1: Ticker Validation │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ tickers_list.xlsx ──▶ ThreadPoolExecutor ──▶ active_tickers table │
│ (106K tickers) (30 threads) (is_active: bool) │
│ │
│ • Skips already-checked tickers │
│ • Short-circuits on first valid API response │
│ • Batches of 100 with parallel HTTP calls │
│ │
└─────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ STAGE 2: Financial Data ETL │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ active_tickers ──▶ Priority Query ──▶ ThreadPoolExecutor ──▶ DB │
│ (is_active=true) (new first, (10 threads) │
│ then oldest) │
│ │
│ 4 Independent Jobs: │
│ • income_stmt (yahoo_api_income_statement) │
│ • cash_flow (yahoo_api_cash_flow) │
│ • balance_sheet (yahoo_api_balance_sheet) │
│ • financials (yahoo_api_financials) │
│ │
└─────────────────────────────────────────────────────────────────────┘
Concurrency Model
The pipeline uses Python’s concurrent.futures for parallelism:
- Single Mode (default):
One process with a
ThreadPoolExecutor. Ideal for CI runners and small machines. Threads are perfect here because the workload is I/O-bound (HTTP requests to Yahoo).- Distributed Mode:
ProcessPoolExecutorspawns multiple processes, each with its own thread pool. Useful for machines with many CPU cores to maximize network throughput.
Single Mode:
┌──────────────────────────────────────┐
│ Process │
│ ├── Thread 1 ──▶ check ticker A │
│ ├── Thread 2 ──▶ check ticker B │
│ ├── Thread 3 ──▶ check ticker C │
│ └── ... (up to N threads) │
└──────────────────────────────────────┘
Distributed Mode:
┌────────────────┐ ┌────────────────┐ ┌────────────────┐
│ Process 1 │ │ Process 2 │ │ Process 3 │
│ ├── Thread 1 │ │ ├── Thread 1 │ │ ├── Thread 1 │
│ ├── Thread 2 │ │ ├── Thread 2 │ │ ├── Thread 2 │
│ └── ... │ │ └── ... │ │ └── ... │
└────────────────┘ └────────────────┘ └────────────────┘
Technology Stack
Python 3.11+ — Core language
stockdex — Yahoo Finance API wrapper (PyPI)
SQLAlchemy — Database ORM and connection pooling
pandas — Data transformation (wide → long format)
PostgreSQL (Neon) — Cloud-hosted database with connection pooling
GitHub Actions — CI/CD and scheduled job execution