Hybrid Forecasting Service v3.1
Version: 3.1
Date: October 16, 2025
Changelog (v3.0 → v3.1):
- Added Data Quality Scoring methodology (FR-01)
- Expanded location-level context to include warehouses, plants, and distribution centers (FR-02)
- Added on-demand retraining with GraphRAG KB baseline updates (FR-03)
- Added Golden Dataset specification for validation and regression testing (Section 5)
1. Introduction & Objective
This document describes the functionality of the ChainAlign Hybrid Forecasting Service. The primary objective of this service is to generate demand and supply forecasts that are both statistically robust and contextually aware of real-world market dynamics. It achieves this by combining traditional time-series forecasting models with a Large Language Model (LLM) that is selectively engaged for complex scenarios.
This specification incorporates advanced methods for handling diverse demand patterns and edge cases, ensuring the standard ChainAlign platform can autonomously solve complex inventory optimization problems like the VN2 challenge.
2. User Roles & Personas
- S&OP Planner/Analyst: The primary user who will trigger forecast generation and use the detailed output to build scenarios in the Reconciliation Workbench. They require granular, uncertainty-aware forecasts with clear reasoning.
- Executive/Leader: The consumer of the final, summarized forecast and its reasoning in the Executive Decision Dashboard. They require high-level insights, key assumptions, and risk quantification.
3. Core Design Principles
- Segmented Hybrid Approach: The engine will not use a one-size-fits-all model. It will first segment SKUs by demand pattern (regular, sparse, irregular, ramp-up, etc.) and apply the most appropriate statistical forecasting technique for each.
- Selective Intelligence: The LLM is a powerful but expensive resource. It will not be used for every forecast. Instead, it will be engaged selectively for SKUs flagged with "edge case" characteristics (e.g., structural breaks, event sensitivity, data quality issues) where statistical methods alone are likely to fail.
- Data Quality First: A dedicated data validation and preprocessing pipeline will ensure all data is clean, reliable, and properly engineered. This pipeline will also be responsible for identifying and flagging the edge cases that trigger the LLM.
- Uncertainty Quantification: Forecasts will always include confidence intervals to provide a realistic understanding of potential outcomes and associated risks.
- Continuous Learning & Feedback Loops: The service will track the accuracy of both statistical and hybrid methods against actuals, storing results in a dedicated evaluation table to enable continuous model improvement.
- Robustness & Fallback: The system will be designed with comprehensive error handling and graceful degradation paths to manage scenarios like irrelevant RAG results, malformed LLM output, or insufficient historical data.
4. Functional Requirements
The service will generate a single, unified forecast by orchestrating multiple components.
FR-01: Data Validation & Preprocessing Pipeline
- Description: The service shall include a robust pipeline to clean, validate, and prepare all historical and real-time data inputs for forecasting. A key function of this pipeline is to identify and flag SKUs that require advanced contextual analysis.
- Process Flow:
- Receive raw historical data (e.g., sales orders, shipment history).
- Identify and handle outliers, missing values, and data inconsistencies.
- SKU Segmentation & Edge Case Flagging:
- Automatically classify each SKU into a demand segment (e.g., sparse, regular, irregular, ramp-up, phase-out).
- Analyze the time series to detect and flag edge cases such as: has_structural_break, event_sensitive, has_dependency, lifecycle_complexity, and data_quality_issue.
- Data Quality Scoring:
  - Calculate a composite data quality score (0-1) for each SKU based on:
    - Completeness: (1 - fraction_of_missing_weeks)
    - Freshness: max(0, 1 - days_since_last_update / 30)
    - Consistency: 1 - min(1, coefficient_of_variation / 3)
    - Outlier Score: (1 - fraction_of_weeks_with_outliers), where outliers are values more than 3 standard deviations from the mean
  - Final score is a weighted average: 0.3 * completeness + 0.2 * freshness + 0.3 * consistency + 0.2 * outlier_score
  - Store in sku_context_metadata.data_quality_score (DECIMAL(3,2))
- Context Generation: For each SKU, generate and store a rich textual and JSON context in a sku_context_metadata table. This context will be used by the LLM for reasoning.
- Perform feature engineering (seasonality, promotions, stockouts).
- Output: Clean, validated, segmented, and feature-engineered data ready for the forecasting engine.
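The FR-01 segmentation heuristics and scoring formulas above can be sketched as follows. The component formulas and the 0.3/0.2/0.3/0.2 weights come from this specification; the segmentation thresholds (30% non-zero weeks, 5% trend slope, CV of 0.5) are illustrative assumptions, not values the spec fixes.

```python
import numpy as np

def classify_demand_segment(weekly_demand) -> str:
    """Assign a demand segment per FR-01. Thresholds are illustrative
    assumptions, not values mandated by the specification."""
    demand = np.asarray(weekly_demand, dtype=float)
    if np.count_nonzero(demand) / len(demand) < 0.3:
        return "sparse"                      # mostly zero-demand weeks
    # Trend over the last quarter of history signals ramp-up / phase-out.
    recent = demand[-max(4, len(demand) // 4):]
    slope = np.polyfit(np.arange(len(recent)), recent, 1)[0]
    mean = demand.mean()
    if slope > 0.05 * mean:
        return "ramp-up"
    if slope < -0.05 * mean:
        return "phase-out"
    cv = demand.std() / mean                 # coefficient of variation
    return "regular" if cv < 0.5 else "irregular"

def data_quality_score(fraction_missing_weeks, days_since_last_update,
                       coefficient_of_variation, fraction_outlier_weeks):
    """Composite 0-1 score using the FR-01 component formulas and the
    0.3 / 0.2 / 0.3 / 0.2 weights."""
    completeness = 1 - fraction_missing_weeks
    freshness = max(0.0, 1 - days_since_last_update / 30)
    consistency = 1 - min(1.0, coefficient_of_variation / 3)
    outlier_score = 1 - fraction_outlier_weeks
    score = (0.3 * completeness + 0.2 * freshness
             + 0.3 * consistency + 0.2 * outlier_score)
    return round(score, 2)                   # matches DECIMAL(3,2) storage
```

For example, a SKU with 10% missing weeks, data 6 days old, a CV of 0.9, and 5% outlier weeks scores 0.83.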
FR-02: Hybrid Forecasting Engine
- Description: The service shall execute a multi-step forecasting process that combines a statistical baseline with selective LLM adjustments for complex SKUs.
- Process Flow:
- Automated SKU Segmentation: The engine first classifies the target SKU based on its historical demand pattern (as determined in FR-01).
- Statistical Baseline Generation: Based on the segment, an appropriate statistical model is chosen:
- Sparse: Heuristic model (e.g., mean of last N non-zero sales).
- Regular: Time-series models like ARIMA.
- Irregular: Monte Carlo simulation combined with Newsvendor optimization.
- Ramp-up/Phase-out: Exponential smoothing with trend components.
The chosen model generates a baseline numerical forecast and confidence intervals.
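The segment-to-model dispatch can be sketched as below. The smoothing constants are assumed values, and the "regular" (ARIMA) and "irregular" (Monte Carlo + Newsvendor) models are stubbed with a naive mean to keep the sketch dependency-free.

```python
import numpy as np

def sparse_baseline(demand: np.ndarray, n: int = 6) -> float:
    """Sparse segment: mean of the last n non-zero sales (FR-02 heuristic)."""
    nonzero = demand[demand > 0]
    return float(nonzero[-n:].mean()) if nonzero.size else 0.0

def trend_baseline(demand: np.ndarray, alpha: float = 0.5,
                   beta: float = 0.3) -> float:
    """Ramp-up/phase-out: Holt's exponential smoothing with a trend
    component, returning the one-step-ahead forecast."""
    level, trend = demand[0], demand[1] - demand[0]
    for y in demand[1:]:
        prev_level = level
        level = alpha * y + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
    return float(level + trend)

def baseline_forecast(segment: str, demand) -> float:
    """Dispatch to the segment-appropriate baseline model."""
    demand = np.asarray(demand, dtype=float)
    if segment == "sparse":
        return sparse_baseline(demand)
    if segment in ("ramp-up", "phase-out"):
        return trend_baseline(demand)
    # "regular" (ARIMA) and "irregular" (Monte Carlo + Newsvendor)
    # are stubbed with a naive mean in this sketch.
    return float(demand.mean())
```

In production each branch would also return confidence intervals; the sketch shows only the point-forecast dispatch.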
- Selective LLM Engagement:
- The engine checks if the SKU is flagged for any edge cases from FR-01.
- If no edge cases are flagged, the statistical forecast is used as the final forecast (method: statistical_only).
- If edge cases are flagged, the engine proceeds to the LLM step.
- Contextual LLM Adjustment:
- The statistical forecast, along with the rich context from the sku_context_metadata table, RAG-retrieved historical decisions, and location-level context (for stores, warehouses, plants, or distribution centers), is fed into a structured prompt for the Gemini LLM (via the M25 AI Compliance Gateway).
- Location-Level Context includes:
- Receiving/storage capacity and current utilization
- Recent sales of dependent/complementary SKUs at this location (last 4 weeks)
- Location-specific promotional calendar and events (next 2 weeks)
- Location size category and historical demand multipliers
- Localized budget constraints and inventory policies
- The LLM does not forecast from scratch; it provides an adjustment factor and reasoning based on contextual information that the statistical model cannot see.
- Hybrid Blending: The final forecast is calculated by blending the statistical baseline and the LLM's adjusted forecast, potentially weighted by the LLM's stated confidence.
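The spec leaves the blending scheme open ("potentially weighted by the LLM's stated confidence"). One minimal interpretation, shown here as an assumption rather than the definitive method, is a linear confidence-weighted mix:

```python
def blend_forecasts(statistical: float, llm_adjusted: float,
                    llm_confidence: float) -> float:
    """Linear, confidence-weighted blend of the statistical baseline and
    the LLM-adjusted forecast. The linear form and the clamping of the
    confidence to [0, 1] are assumed conventions."""
    w = max(0.0, min(1.0, llm_confidence))
    return (1 - w) * statistical + w * llm_adjusted
```

At confidence 0 the statistical baseline passes through untouched; at confidence 1 the LLM-adjusted value dominates entirely.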
- Constraint Application (Constraint Intelligence Engine): The final forecast is passed to a dedicated Constraint Intelligence Engine that applies hard and soft business rules (e.g., min/max inventory, supplier MOQ, shelf life, budget limits) to produce the final, executable order quantity. This engine is also responsible for running Newsvendor optimization and Monte Carlo simulation where applicable, separating this optimization logic from the core statistical engine and the LLM.
Dynamic Cost Parameter Management (C_overage, C_underage)
To accurately perform Newsvendor optimization, the service will dynamically manage C_overage (cost of overstocking) and C_underage (cost of understocking/lost sales). These parameters are critical and can vary by product category, region, and time.
- Derivation & Storage: C_overage and C_underage values will be derived from financial data and business policies. They will be stored in a dedicated, simplified Cost RAG (a small database table) that allows for granular cost assumptions by product category, region, and other relevant dimensions.
- LLM Access: Before running Newsvendor optimization, the LLM will access and query this Cost RAG to retrieve the current, approved cost values relevant to the specific SKU and forecasting context. This ensures that the optimization is based on the most up-to-date and contextually appropriate cost parameters.
- Capacity Constraint Guardrail: After the final forecast is generated and constraints are applied, a final guardrail check will enforce hard limits based on location receiving/storage capacity (for stores, warehouses, plants, or distribution centers) and inventory policy constraints. The blended forecast must be clipped to ensure feasibility (e.g., Final_Forecast <= Location_Capacity_Limit), preventing recommendations that exceed physical or policy-driven limitations.
- Output: A unified forecast including point estimate, confidence intervals, the method used, LLM reasoning, and a final constrained order quantity.
FR-03: Forecast Accuracy & Model Improvement Feedback Loop
- Description: The service shall track forecast accuracy over time to enable continuous model improvement.
- Process Flow:
- After an execution cycle, compare the generated forecasts (both statistical and hybrid) against actual performance data.
- Calculate key accuracy metrics (e.g., MAPE, bias, total cost).
- Store these results in a dedicated forecast_evaluation table, explicitly logging the method used (statistical_only, hybrid_balanced, etc.), the forecast value, the actual value, and the resulting error.
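The accuracy metrics logged to forecast_evaluation can be computed as below. Skipping zero-actual weeks in MAPE is an assumed convention (the spec does not prescribe one), and bias is reported as a signed mean error in units:

```python
def forecast_metrics(forecasts: list[float], actuals: list[float]) -> dict:
    """MAPE (percent) and signed bias for an evaluation cycle.
    Weeks with zero actuals are excluded from MAPE to avoid division
    by zero -- an assumed, common convention."""
    pairs = list(zip(forecasts, actuals))
    mape_terms = [abs(f - a) / a for f, a in pairs if a != 0]
    mape = 100 * sum(mape_terms) / len(mape_terms) if mape_terms else None
    bias = sum(f - a for f, a in pairs) / len(pairs)  # > 0: over-forecasting
    return {"mape": mape, "bias": bias}
```

A cycle that over-forecasts one week by 10% and under-forecasts another by 10% yields a MAPE of 10% but zero bias, which is why FR-03 logs both.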
Continuous Learning & Model Refinement
To ensure continuous improvement, the service will implement a multi-layered learning strategy:
- On-Demand Retraining: Each time a forecast job is executed, the system will:
- Look back at recent forecast accuracy metrics (last N periods)
- Detect if model performance has degraded (e.g., Bias > 20% or WAPE > 30% for 3 consecutive periods)
- If degradation is detected, trigger automatic model retraining with updated historical data
- Write the resulting learnings back into the GraphRAG knowledge base as an updated baseline across all three layers (structured data, embeddings, graph relationships)
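The degradation check above (Bias > 20% or WAPE > 30% for 3 consecutive periods) can be sketched as follows; the shape of the per-period history records is an assumption for illustration:

```python
def should_retrain(history: list[dict], bias_limit: float = 0.20,
                   wape_limit: float = 0.30, window: int = 3) -> bool:
    """Return True when every one of the last `window` periods breaches
    either the bias or the WAPE threshold, per the FR-03 trigger.
    `history` holds dicts with 'bias_pct' and 'wape', oldest first --
    an assumed record shape."""
    if len(history) < window:
        return False
    return all(abs(p["bias_pct"]) > bias_limit or p["wape"] > wape_limit
               for p in history[-window:])
```

Requiring consecutive breaches rather than a single bad period keeps one-off demand shocks from triggering needless retraining.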
- GraphRAG Knowledge Baseline Updates: After each retraining cycle, insights are stored in the GraphRAG KB:
- Structured Layer: Update SKU segmentation, edge case flags, data quality scores
- Embedding Layer: Generate and store embeddings of updated demand patterns and forecast reasoning
- Graph Layer: Update relationships between SKUs (dependencies), locations, and events
- Scheduled Retraining (Optional): For load balancing in production environments, optional weekly/monthly batch retraining can be scheduled during off-peak hours (e.g., weekends) to process large-scale updates across all SKUs.
LLM Reasoning Backtesting
To validate the contextual insights provided by the LLM, a backtesting mechanism will be implemented:
- After the evaluation cycle, the LLM's stated reasoning (e.g., "Expected competitor promotion") will be correlated with the actual sales outcome and relevant market events.
- A flag (reasoning_accurate: boolean) will be stored in the forecast_evaluation table, indicating whether the LLM's contextual insight proved accurate. This provides a crucial metric for assessing the credibility and value of the hybrid approach over time.
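The spec leaves the correlation method for the reasoning_accurate flag open. One simple proxy, offered here purely as an assumption, is whether the LLM's adjustment moved the forecast closer to the eventual actual than the statistical baseline was:

```python
def reasoning_accurate(statistical_forecast: float,
                       hybrid_forecast: float,
                       actual: float) -> bool:
    """Assumed operationalisation of reasoning_accurate: the LLM's
    insight counts as accurate when its adjustment reduced the absolute
    forecast error relative to the statistical baseline."""
    return abs(hybrid_forecast - actual) < abs(statistical_forecast - actual)
```

A richer implementation would also verify the cited market event actually occurred, which this error-based proxy cannot capture.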
- Output: A rich performance dataset for periodic model retraining and comparative analysis of different forecasting methods.
FR-04: Error Handling & Fallback Strategies
- Description: The service shall implement robust error handling and define fallback strategies for various failure scenarios.
- Process Flow:
- Handle cases where RAG retrieval returns irrelevant or insufficient documents.
- Manage malformed or unexpected LLM output.
- Provide graceful degradation paths when historical data is insufficient for statistical modeling.
- Log all errors with sufficient detail for debugging and analysis.
- Output: Stable service operation even under adverse conditions, with clear error reporting.
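A minimal sketch of the FR-04 graceful-degradation paths; all function names, the exception mapping, and the list-based logger are illustrative assumptions, not the service's actual interfaces:

```python
def forecast_with_fallback(sku: str, statistical_fn, llm_fn,
                           logger: list) -> dict:
    """Fall back to the statistical baseline when the LLM step fails
    (malformed output, irrelevant RAG results), and to a naive default
    when even the statistical model cannot run."""
    try:
        baseline = statistical_fn(sku)
    except ValueError as exc:  # e.g. insufficient historical data
        logger.append(f"{sku}: statistical model failed ({exc}); naive default")
        return {"forecast": 0.0, "method": "naive_fallback"}
    try:
        adjusted = llm_fn(sku, baseline)
        return {"forecast": adjusted, "method": "hybrid"}
    except Exception as exc:  # malformed LLM output, RAG miss, timeout, ...
        logger.append(f"{sku}: LLM step failed ({exc}); statistical_only")
        return {"forecast": baseline, "method": "statistical_only"}
```

Every degraded path still returns a usable forecast plus a logged reason, satisfying the "stable operation with clear error reporting" output requirement.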
5. Sandbox Validation Framework
To safely test, validate, and demonstrate the capabilities of the hybrid engine, a sandbox simulation framework is required. This is a core part of the service's development and maintenance lifecycle.
- Synthetic Data Generator: A robust data generator will create a large-scale, realistic dataset (~11,000 SKUs, 150+ weeks) complete with the complex edge cases (structural breaks, event sensitivity, cross-SKU dependencies, lifecycle complexity, data quality issues) needed to stress-test the engine. This synthetic dataset will be generated programmatically as part of the extended test suite.
- Golden Dataset: A curated benchmark dataset for validation and regression testing:
- Initial Generation: Synthetically generated with 100-200 representative SKUs across all segments (sparse, regular, irregular, ramp-up, phase-out)
- Manual Curation (Future): Planned manual curation with real-world scenarios and known-good outcomes
- Composition:
- 20% edge cases (structural breaks, event-driven spikes, dependencies)
- 30% difficult-to-forecast SKUs (high CV, lumpy demand, sparse patterns)
- 50% typical SKUs (regular patterns, low CV, stable demand)
- Storage: Dedicated table golden_dataset_forecasts with columns: sku_id, forecast_week, method, forecast_value, actual_value, mape, bias
- Usage:
- Benchmark comparison (new model vs. golden baseline)
- Regression testing (ensure new models don't degrade performance)
- Calibration data for LLM adjustment factor weights
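One possible regression gate over golden_dataset_forecasts-style rows is sketched below; the per-SKU comparison and the 2-percentage-point MAPE tolerance are assumed, configurable choices rather than spec requirements:

```python
from collections import defaultdict

def regression_check(golden_rows: list[dict], candidate_rows: list[dict],
                     tolerance: float = 2.0) -> dict:
    """Fail the candidate model if its mean MAPE exceeds the golden
    baseline's by more than `tolerance` percentage points on any SKU.
    Rows are dicts with 'sku_id' and 'mape', mirroring the
    golden_dataset_forecasts columns."""
    def mean_mape_by_sku(rows):
        acc = defaultdict(list)
        for row in rows:
            acc[row["sku_id"]].append(row["mape"])
        return {sku: sum(v) / len(v) for sku, v in acc.items()}

    golden = mean_mape_by_sku(golden_rows)
    candidate = mean_mape_by_sku(candidate_rows)
    regressions = [sku for sku, m in candidate.items()
                   if sku in golden and m > golden[sku] + tolerance]
    return {"passed": not regressions, "regressed_skus": regressions}
```

Listing the regressed SKUs, rather than returning a bare pass/fail, lets the comparative-analysis step drill into exactly where a new model degrades.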
- Simulation Harness: A testing script (run_full_evaluation) will orchestrate the end-to-end process on the sandbox data: load data, run forecasts using the hybrid engine, and log results.
- Comparative Analysis: The framework will compare the performance (accuracy and cost) of the Hybrid Approach vs. a Statistical-Only baseline across different segments and edge cases.
- Visualization Dashboard: An interactive dashboard will be used to analyze the results, providing visual proof of where and why the hybrid method provides lift.
6. API Specification
Endpoint: POST /api/forecasts/generate
- Description: Initiates the hybrid forecasting process for a given product or business unit, allowing for detailed query parameters.
- Request Body:
{
"product_hierarchy": ["electronics", "smartphones", "iPhone_15"],
"geographic_scope": ["US-WEST", "US-CENTRAL"],
"forecast_horizon": "12_months",
"confidence_levels": [50, 80, 95],
"scenario_assumptions": {
"promotion_calendar": true,
"competitor_actions": true,
"supply_constraints": true
}
}
- Successful Response (200 OK):
{
"forecast_summary": {
"point_forecast": 1650000,
"confidence_intervals": {
"50%": [1580000, 1720000],
"80%": [1520000, 1780000],
"95%": [1450000, 1850000]
}
},
"final_order_recommendation": {
"order_qty": 1650000,
"constraint_violations": []
},
"methodology": {
"sku_segment": "irregular",
"forecasting_method": "hybrid_llm_heavy",
"statistical_model": "MonteCarlo_Newsvendor",
"llm_reasoning": "Adjustment based on detected structural break and upcoming promotional event.",
"key_factors": ["structural_break", "promotion"],
"data_quality_score": 0.87
},
"narrative": "The baseline statistical forecast of 1.5M units has been adjusted to 1.65M units..."
}
7. Non-Functional Requirements
- Performance: The end-to-end forecast generation (statistical + AI synthesis) should aim for completion within a reasonable timeframe (e.g., under 60-90 seconds for complex requests) to support interactive scenario planning, with asynchronous processing for longer-running tasks.
- Security: All database queries must be scoped to the authenticated user's organization to ensure strict data multi-tenancy. Access to the forecasting service API must be secured with appropriate authentication and authorization mechanisms.
- Scalability: The service, running on Google Cloud Run, must be designed to handle multiple concurrent forecast requests efficiently, leveraging serverless scaling capabilities.
- Reliability: The service must be highly reliable, providing consistent and accurate forecasts.
- Maintainability: The forecasting models and AI prompts should be easily configurable and updateable.
8. Assumptions & Dependencies
- The customer's PostgreSQL database contains sufficient (e.g., >24 months) and clean historical data for statistical forecasting.
- The RAG knowledge base has been populated with relevant, high-quality documents through the KB Curation Process.
- The Google Gemini API and database connections are configured and available.
- External data sources (e.g., news feeds, market data APIs) are accessible and provide timely information.