ETL Jobs ======== The pipeline has two types of jobs: 1. Active Tickers Check ----------------------- **Purpose**: Determine which of the 106K tickers in the Excel file are active on Yahoo Finance. **Module**: ``finance.src.etl_job.ETLJob`` **Runner**: ``finance.src.run_active_tickers_check`` **Logic**: 1. Read all tickers from ``tickers_list.xlsx`` 2. Query DB for already-checked tickers → skip them 3. For each remaining ticker (in parallel): - Create a ``stockdex.Ticker`` object - Try each of the 4 Yahoo API methods - Short-circuit: return ``True`` on first valid response 4. Upsert batch results into ``active_tickers`` table **CLI Usage**: .. code-block:: bash # Run with defaults (100 batch, 30 threads) python -m finance.src.run_active_tickers_check # Custom settings python -m finance.src.run_active_tickers_check --mode single --threads 20 --max-batches 10 # Distributed mode (multi-process) python -m finance.src.run_active_tickers_check --mode distributed --threads 20 **Arguments**: .. list-table:: :header-rows: 1 * - Argument - Default - Description * - ``--mode`` - single - Execution mode: "single" or "distributed" * - ``--threads`` - 20 - Max concurrent threads per process * - ``--max-batches`` - None (all) - Stop after N batches 2. Financial Data ETL --------------------- **Purpose**: Fetch financial statement data for active tickers and store in long format. **Module**: ``finance.src.financial_data_etl.FinancialDataETL`` **Runner**: ``finance.src.run_financial_etl`` **Logic**: 1. Query ``active_tickers`` for priority tickers: - **First priority**: Active tickers NOT yet in the target table - **Second priority**: Active tickers with oldest ``insert_datetime`` 2. Fetch financial data in parallel using stockdex 3. Melt wide-format data into long format (handle B/M/K/T suffixes) 4. Delete + Insert (upsert) into target table **CLI Usage**: .. code-block:: bash # Run for a specific table python -m finance.src.run_financial_etl --table income_stmt python -m finance.src.run_financial_etl --table cash_flow python -m finance.src.run_financial_etl --table balance_sheet python -m finance.src.run_financial_etl --table financials # With custom settings python -m finance.src.run_financial_etl --table income_stmt --batch-size 100 --threads 15 --max-batches 5 **Arguments**: .. list-table:: :header-rows: 1 * - Argument - Default - Description * - ``--table`` - (required) - Target table: income_stmt, cash_flow, balance_sheet, financials * - ``--batch-size`` - 50 - Number of tickers per batch * - ``--threads`` - 10 - Concurrent threads for Yahoo API calls * - ``--max-batches`` - None (all) - Stop after N batches **Priority Query**: .. code-block:: sql WITH active AS ( SELECT ticker FROM finance.active_tickers WHERE is_active = true ), existing AS ( SELECT ticker, MAX(insert_datetime) as last_insert FROM finance. GROUP BY ticker ) SELECT a.ticker FROM active a LEFT JOIN existing e ON a.ticker = e.ticker ORDER BY CASE WHEN e.ticker IS NULL THEN 0 ELSE 1 END, e.last_insert ASC NULLS FIRST LIMIT :batch_size Data Transformation ------------------- stockdex returns data in wide format: .. code-block:: text Index (dates) | annualTotalRevenue | annualNetIncome | ... -----------------|--------------------|-----------------|---- 2022-09-30 | 394.33B | 99.80B | ... 2023-09-30 | 383.29B | 97.00B | ... The pipeline melts this into long format: .. code-block:: text ticker | frequency | report_date | metric | value -------|-----------|-------------|-----------------------|--------------- AAPL | annual | 2022-09-30 | annualTotalRevenue | 394330000000 AAPL | annual | 2022-09-30 | annualNetIncome | 99800000000 AAPL | annual | 2023-09-30 | annualTotalRevenue | 383290000000 ...