Functional Specification: M63 - Observability and Event-Driven Architecture

Version: 1.0
Date: November 5, 2025
Status: Draft

1.0 Executive Summary

1.1 Purpose

This document provides the functional and technical specifications for Milestone 63: Observability and Event-Driven Architecture (EDA). This initiative aims to refactor the ChainAlign system to be more robust, scalable, and observable.

1.2 Problem Statement

The current architecture, while functional, has tight coupling between services. Synchronous, direct API calls create potential single points of failure and make the system difficult to scale and debug. Furthermore, a lack of standardized, end-to-end observability makes it challenging to trace requests across services, diagnose performance bottlenecks, and understand the system's behavior under load.

1.3 Solution Overview

The solution is a three-phase approach that progressively introduces OpenTelemetry for foundational observability and an event bus (Google Cloud Pub/Sub) to move the system towards a more resilient, event-driven architecture.

  1. Phase 1 (Foundational Observability): Implement OpenTelemetry across the Node.js backend and key Python microservices to establish baseline tracing and metrics.
  2. Phase 2 (Introduce the Event Bus): Decouple services by introducing Google Cloud Pub/Sub for asynchronous communication, piloted with a non-critical workflow.
  3. Phase 3 (Deeper Integration & Resilience): Refactor a core business workflow to be fully event-driven and implement advanced resilience patterns such as circuit breakers.

2.0 System Architecture

2.1 Target Architecture

In the target state, the Node.js backend and key Python microservices emit traces and metrics through OpenTelemetry to a central collector, which forwards metrics to Prometheus and traces to Jaeger/Tempo. Selected synchronous service-to-service calls are replaced by events published to Google Cloud Pub/Sub, with OpenTelemetry trace context carried in message attributes so requests remain traceable end to end.

3.0 Phase 1: Foundational Observability with OpenTelemetry

Goal: Instrument the primary backend services to gain baseline tracing and metrics.

3.1 FR-1.1: OpenTelemetry in Node.js Backend

  • FR-1.1.1: The system SHALL install the following OpenTelemetry npm packages in the backend service: @opentelemetry/api, @opentelemetry/sdk-node, @opentelemetry/auto-instrumentations-node, @opentelemetry/exporter-trace-otlp-http, @opentelemetry/exporter-metrics-otlp-http.
  • FR-1.1.2: A new configuration file, backend/src/config/telemetry.js, SHALL be created to handle the initialization and configuration of the OpenTelemetry SDK.
  • FR-1.1.3: The OTLP (OpenTelemetry Protocol) exporter SHALL be configured to send telemetry data to a central collector.
  • FR-1.1.4: Auto-instrumentation SHALL be enabled for express, pg (via Knex), and http to automatically trace requests, database queries, and outgoing API calls.
  • FR-1.1.5: The telemetry initialization script SHALL be the first module imported in backend/server.js to ensure all subsequent modules are correctly instrumented.
  • FR-1.1.6: Custom spans SHALL be added to key business logic functions (e.g., ScenarioOrchestrationService.run, DecisionService.record) to provide application-specific context in traces; a sketch follows this list.
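
The sketch below shows how FR-1.1.2 through FR-1.1.6 could fit together. It is illustrative only: the service name chainalign-backend, the localhost collector endpoint, and the ScenarioOrchestrationService wiring are assumptions, not decided values. (PeriodicExportingMetricReader ships in @opentelemetry/sdk-metrics, a dependency of @opentelemetry/sdk-node.)

```js
// backend/server.js — telemetry MUST be loaded before anything else (FR-1.1.5):
// require('./src/config/telemetry');

// backend/src/config/telemetry.js (FR-1.1.2 / FR-1.1.3 / FR-1.1.4)
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');
const { OTLPMetricExporter } = require('@opentelemetry/exporter-metrics-otlp-http');
const { PeriodicExportingMetricReader } = require('@opentelemetry/sdk-metrics');

const sdk = new NodeSDK({
  serviceName: 'chainalign-backend', // assumed service name
  traceExporter: new OTLPTraceExporter({
    url: 'http://localhost:4318/v1/traces', // assumed local collector endpoint
  }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({
      url: 'http://localhost:4318/v1/metrics',
    }),
  }),
  // express, http, and pg are all covered by the auto-instrumentations bundle (FR-1.1.4).
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();

// Custom span example (FR-1.1.6), e.g. inside ScenarioOrchestrationService.run:
const { trace } = require('@opentelemetry/api');
const tracer = trace.getTracer('chainalign-backend');

async function run(scenarioId) {
  return tracer.startActiveSpan('ScenarioOrchestrationService.run', async (span) => {
    try {
      span.setAttribute('scenario.id', scenarioId); // application-specific context
      // ... existing orchestration logic ...
    } finally {
      span.end();
    }
  });
}
```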

3.2 FR-1.2: OpenTelemetry in Python Services

  • FR-1.2.1: The requirements.txt for key Python services (starting with forecasting_service) SHALL be updated to include opentelemetry-api, opentelemetry-sdk, opentelemetry-instrumentation-fastapi, and opentelemetry-exporter-otlp.
  • FR-1.2.2: A telemetry.py module SHALL be created within each instrumented Python service to configure and initialize the OpenTelemetry SDK.
  • FR-1.2.3: The FastAPI application instance SHALL be instrumented to automatically trace all incoming web requests.
  • FR-1.2.4: Custom spans SHALL be added to performance-critical sections, such as model inference or complex data transformations, to allow for detailed performance analysis.

3.3 FR-1.3: Telemetry Collector

  • FR-1.3.1: A telemetry collector (e.g., Grafana Agent or a standard OpenTelemetry Collector) SHALL be added to the project's Docker Compose configuration for local development.
  • FR-1.3.2: The collector SHALL be configured to receive OTLP data from the Node.js and Python services and export it to appropriate backends (e.g., Prometheus for metrics, Jaeger/Tempo for traces), as sketched below.
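
For local development, the wiring might look like the excerpt below, shown with a standard OpenTelemetry Collector. The image tag, ports, and the Prometheus/Tempo backends are assumptions to be confirmed during implementation.

```yaml
# docker-compose.yml (excerpt): run a collector alongside the app services (FR-1.3.1).
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config=/etc/otel/config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otel/config.yaml
    ports:
      - "4318:4318" # OTLP HTTP, used by the Node.js and Python exporters
      - "4317:4317" # OTLP gRPC
---
# otel-collector-config.yaml: receive OTLP, fan out to backends (FR-1.3.2).
receivers:
  otlp:
    protocols:
      http: {}
      grpc: {}
exporters:
  prometheus:
    endpoint: "0.0.0.0:8889" # Prometheus scrapes this endpoint
  otlp/tempo:
    endpoint: tempo:4317 # traces to Tempo (Jaeger would work similarly)
    tls:
      insecure: true
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      exporters: [prometheus]
```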

4.0 Phase 2: Introduce the Event Bus

Goal: Decouple services and improve resilience by introducing an event bus for non-critical, asynchronous workflows.

4.1 FR-2.1: Shared Event Bus Library

  • FR-2.1.1: A new shared library SHALL be created to provide a standardized interface for interacting with Google Cloud Pub/Sub.
  • FR-2.1.2: This library SHALL expose simple publish(topic, payload) and subscribe(topic, handler) functions, abstracting the underlying Google Cloud Pub/Sub client.
  • FR-2.1.3: The library MUST automatically handle the injection and extraction of OpenTelemetry trace context into and from message attributes to ensure end-to-end trace continuity; see the sketch after this list.
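
A sketch of how the library could look in Node.js, using the @google-cloud/pubsub client. The file path and the topic-to-subscription naming convention are illustrative assumptions; the final conventions are to be decided.

```js
// shared/event-bus.js — illustrative sketch of FR-2.1.2 / FR-2.1.3
const { PubSub } = require('@google-cloud/pubsub');
const { context, propagation } = require('@opentelemetry/api');

const pubsub = new PubSub();

async function publish(topic, payload) {
  const attributes = {};
  // FR-2.1.3: inject the active trace context (W3C traceparent) into message attributes.
  propagation.inject(context.active(), attributes);
  return pubsub.topic(topic).publishMessage({
    data: Buffer.from(JSON.stringify(payload)),
    attributes,
  });
}

function subscribe(topic, handler) {
  // Assumption: each consumer reads from a subscription named after the topic;
  // the real naming convention is an open design decision.
  const subscription = pubsub.subscription(`${topic}.subscription`);
  subscription.on('message', async (message) => {
    // FR-2.1.3: extract the upstream context so the handler's spans join the same trace.
    const parent = propagation.extract(context.active(), message.attributes);
    await context.with(parent, async () => {
      try {
        await handler(JSON.parse(message.data.toString()));
        message.ack();
      } catch (err) {
        message.nack(); // redelivery supports at-least-once processing
      }
    });
  });
}

module.exports = { publish, subscribe };
```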

4.2 FR-2.2: Pilot Workflow Refactoring

  • FR-2.2.1: A non-critical, asynchronous workflow SHALL be selected for the pilot integration (e.g., "New Document Ingested" notification, "User Signup Welcome Email").
  • FR-2.2.2: The originating service SHALL be refactored to publish an event (e.g., document.ingested) to a Pub/Sub topic instead of making a direct synchronous call.
  • FR-2.2.3: The consuming service(s) SHALL be refactored to subscribe to the topic and execute their logic upon receiving an event, as illustrated below.
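
With the shared library in place, the pilot refactor could look roughly as follows. The payload fields, the sendIngestionNotification handler, and the require path are hypothetical.

```js
const { publish, subscribe } = require('../shared/event-bus'); // illustrative path

// Originating service (FR-2.2.2): publish an event instead of calling the consumer.
async function onDocumentIngested(doc) {
  await publish('document.ingested', { documentId: doc.id, source: doc.source });
}

// Consuming service (FR-2.2.3): react whenever the event arrives.
subscribe('document.ingested', async (event) => {
  await sendIngestionNotification(event.documentId); // hypothetical notification handler
});
```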

5.0 Phase 3: Deeper Integration & Resilience

Goal: Apply the EDA and observability patterns to a core business workflow to enhance modularity and resilience.

5.1 FR-3.1: Core Workflow Event-Driven Refactoring

  • FR-3.1.1: A core, multi-step business workflow SHALL be selected for refactoring (e.g., "Forecast Generation and Validation").
  • FR-3.1.2: The workflow SHALL be decomposed into a series of distinct events (e.g., forecast.requested, forecast.generated, forecast.validated, forecast.published).
  • FR-3.1.3: The monolithic logic SHALL be broken down into smaller, independent services or functions that are triggered by these events, as sketched below.
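
One way to picture the decomposition (FR-3.1.2/FR-3.1.3): each step consumes one event and emits the next. Here generateForecast and validateForecast stand in for the existing monolithic logic.

```js
const { publish, subscribe } = require('../shared/event-bus'); // illustrative path

// Generation step: consumes forecast.requested, emits forecast.generated.
subscribe('forecast.requested', async (event) => {
  const forecast = await generateForecast(event.scenarioId); // existing model logic
  await publish('forecast.generated', { forecastId: forecast.id });
});

// Validation step: consumes forecast.generated, emits forecast.validated.
// (Failure and retry handling are omitted from this sketch.)
subscribe('forecast.generated', async (event) => {
  await validateForecast(event.forecastId); // existing validation logic
  await publish('forecast.validated', { forecastId: event.forecastId });
});
```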

5.2 FR-3.2: Circuit Breaker Implementation

  • FR-3.2.1: For critical synchronous API calls that remain in the architecture (e.g., within the GraphQL BFF), a Circuit Breaker pattern SHALL be implemented.
  • FR-3.2.2: A library such as opossum (for Node.js) SHALL be used to wrap outgoing API calls to downstream services.
  • FR-3.2.3: The circuit breaker SHALL be configured with appropriate thresholds for failure rate and reset timeout. When the circuit is "open," the service MUST return a cached or fallback response to the client, preventing cascading failures; a sketch follows.
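
A sketch of FR-3.2 with opossum. The thresholds are starting points to be tuned, and getCachedForecast and the downstream URL are hypothetical.

```js
const CircuitBreaker = require('opossum');

// Existing downstream call: fetch a forecast from the forecasting service.
async function fetchForecast(scenarioId) {
  const res = await fetch(`http://forecasting-service/forecasts/${scenarioId}`); // illustrative URL
  if (!res.ok) throw new Error(`forecasting-service responded ${res.status}`);
  return res.json();
}

const breaker = new CircuitBreaker(fetchForecast, {
  timeout: 3000,                // calls slower than 3s count as failures
  errorThresholdPercentage: 50, // open the circuit at a 50% failure rate
  resetTimeout: 30000,          // attempt a half-open probe after 30s
});

// FR-3.2.3: serve a cached or fallback response while the circuit is open.
breaker.fallback((scenarioId) => getCachedForecast(scenarioId)); // hypothetical cache helper

// Callers (e.g. GraphQL resolvers) go through the breaker, never fetchForecast directly.
async function resolveForecast(scenarioId) {
  return breaker.fire(scenarioId);
}
```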

6.0 Non-Functional Requirements

  • Performance: The introduction of OpenTelemetry instrumentation should not add more than 5% overhead to the average API response time.
  • Reliability: The event bus implementation must guarantee "at-least-once" delivery for critical events.
  • Security: All communication with Google Cloud Pub/Sub must be authenticated and encrypted.
  • Maintainability: The shared event bus library must be well-documented and easy to use across different services.

7.0 Success Metrics

  • Observability: 100% of API requests and database queries in the backend and forecasting_service are captured in traces.
  • Decoupling: The pilot workflow (FR-2.2) operates fully asynchronously, and the originating service remains functional even if the consuming service is down.
  • Resilience: The circuit breaker (FR-3.2) successfully prevents cascading failures during simulated downstream service outages.
  • Traceability: A single user action can be traced end-to-end across both synchronous and asynchronous service boundaries in the tracing backend.