Evals and Guardrails Functional Specification Document (FSD)

1. Introduction

This document outlines the functional specifications for implementing Evals (Evaluation Systems) and Guardrails within the ChainAlign platform. The primary goal is to ensure the reliability, accuracy, safety, and ethical use of AI models, particularly the platform's Large Language Model (LLM) capabilities, and to maintain data integrity and compliance with enterprise standards (SOX, GDPR).

2. Goals

  • Establish robust mechanisms for evaluating the performance, accuracy, and safety of AI models.
  • Implement guardrails to prevent undesirable AI outputs (e.g., hallucinations, toxicity, bias).
  • Ensure financial data integrity and compliance with SOX regulations.
  • Enforce data privacy and access control in line with GDPR principles.
  • Provide a structured approach for continuous improvement of AI systems through feedback loops.

3. Scope

This FSD covers the implementation of foundational, automated, and advanced evaluation and guardrail mechanisms across the ChainAlign backend and frontend components. It includes:

  • Financial accuracy checks.
  • Immutable audit trails.
  • Role-Based Access Control (RBAC).
  • PII data minimization.
  • Curation feedback loop for AI output.
  • Automated RAG pipeline evaluations.
  • LLM output content moderation.
  • Automated business logic consistency tests.
  • Business-level evaluations.
  • Domain-specific S&OP knowledge validation.
  • Performance and integration testing for AI components.

4. Functional Requirements

Milestone M6 - Foundational Guardrails & Evals (Phase 1)

Goal: Establish core guardrails for financial accuracy and data privacy, and implement foundational AI evaluation mechanisms.

6.1. Financial Accuracy Guardrails (SOX Compliance)

  • 6.1.1. Implement Mathematical Integrity Check:
    • FR-6.1.1.1: The system SHALL include a validatePlanIntegrity(plan_id) function in the Node.js backend (a sketch follows this subsection).
    • FR-6.1.1.2: This function SHALL implement logic to verify the core S&OP equation: Previous_Period_Backlog + Current_Period_Bookings - Current_Period_Shipments = Current_Period_Backlog.
    • FR-6.1.1.3: The system SHALL integrate this check with the "Consensus Lock-In Protocol" to prevent a plan from being locked if the integrity check fails, i.e., the variance exceeds a configurable tolerance (default: 0.01%).
    • FR-6.1.1.4: The system SHALL trigger a high-priority alert in the Exception Dashboard with a descriptive message (e.g., "Data Integrity Error: The core plan does not reconcile. Please validate data sources.") if the check fails.
  • 6.1.2. Implement Immutable Audit Trail:
    • FR-6.1.2.1: The system SHALL create a dedicated audit_log table in PostgreSQL with fields for timestamp, user_id, action_type, and details (JSONB).
    • FR-6.1.2.2: The system SHALL develop a centralized logAuditEvent(user, action, details) function in the Node.js backend.
    • FR-6.1.2.3: The system SHALL integrate logAuditEvent for all critical user actions, including but not limited to plan creation, scenario updates, lock decisions, and executive overrides, to ensure a complete and immutable record for SOX compliance.
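
To make these requirements concrete, the following is a minimal Node.js sketch of the integrity check (FR-6.1.1.1 through FR-6.1.1.3) and the centralized audit logger (FR-6.1.2.2). The plan_periods table, its column names, and the pg pool wiring are illustrative assumptions, not the final schema.

```js
// Minimal sketch of the plan integrity check (FR-6.1.1.1 through FR-6.1.1.3)
// and the centralized audit logger (FR-6.1.2.2). Table and column names are
// assumed, not the final schema.
const { Pool } = require('pg');

const pool = new Pool(); // configured via standard PG* environment variables
const DEFAULT_TOLERANCE = 0.0001; // 0.01%, configurable per FR-6.1.1.3

async function validatePlanIntegrity(planId, tolerance = DEFAULT_TOLERANCE) {
  // Hypothetical query: one row per period with the four S&OP quantities.
  const { rows } = await pool.query(
    `SELECT period, prev_backlog, bookings, shipments, backlog
       FROM plan_periods
      WHERE plan_id = $1`,
    [planId]
  );
  const violations = rows.filter((r) => {
    const expected = Number(r.prev_backlog) + Number(r.bookings) - Number(r.shipments);
    // Relative variance; the denominator is clamped to avoid divide-by-zero.
    const variance = Math.abs(expected - Number(r.backlog)) / Math.max(Math.abs(expected), 1);
    return variance > tolerance;
  });
  return { ok: violations.length === 0, violations };
}

// Centralized audit logger (FR-6.1.2.2); `details` lands in a JSONB column.
async function logAuditEvent(user, action, details) {
  await pool.query(
    `INSERT INTO audit_log (timestamp, user_id, action_type, details)
     VALUES (NOW(), $1, $2, $3)`,
    [user.id, action, JSON.stringify(details)]
  );
}
```

The Consensus Lock-In Protocol handler would call validatePlanIntegrity before committing a lock; on failure it would raise the Exception Dashboard alert (FR-6.1.1.4) and record the attempt via logAuditEvent.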

6.2. Data Privacy Guardrails (GDPR Compliance)

  • 6.2.1. Implement Full Role-Based Access Control (RBAC):
    • FR-6.2.1.1: The system SHALL create or ensure the existence of users, roles, and permissions tables in PostgreSQL.
    • FR-6.2.1.2: The system SHALL define initial roles: S&OP Participant, S&OP Orchestrator, Executive, and Admin.
    • FR-6.2.1.3: The system SHALL implement Node.js/Express middleware to enforce role- and permission-based access control for all API endpoints (see the middleware sketch after this subsection).
    • FR-6.2.1.4: The system SHALL restrict access to sensitive API endpoints (e.g., "Consensus Lock-In Protocol") based on user roles, returning a 403 Forbidden error for unauthorized attempts.
  • 6.2.2. Implement PII Data Minimization Feature:
    • FR-6.2.2.1: The Intelligent Data Ingestion Engine SHALL be enhanced to recognize and flag fields likely to contain Personally Identifiable Information (PII) during data processing.
    • FR-6.2.2.2: The CSV Onboarding Wizard UI SHALL present a warning to the user when PII fields are detected, offering an explicit option to exclude these fields from import.
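
A minimal Express sketch for FR-6.2.1.3 and FR-6.2.1.4. The requirePermission helper, the plan:lock permission name, and the shape of req.user are assumptions; the real implementation would resolve a user's permissions from the users/roles/permissions tables.

```js
// Sketch of permission-based access control middleware (FR-6.2.1.3/FR-6.2.1.4).
// Assumes an upstream authentication step has attached `req.user`, including a
// `permissions` array resolved from the users/roles/permissions tables.
function requirePermission(permission) {
  return (req, res, next) => {
    if (!req.user) {
      return res.status(401).json({ error: 'Authentication required' });
    }
    if (!req.user.permissions.includes(permission)) {
      return res.status(403).json({ error: 'Forbidden' }); // FR-6.2.1.4
    }
    next();
  };
}

// Example route guard: only roles granted the (hypothetical) `plan:lock`
// permission may invoke the Consensus Lock-In Protocol endpoint.
// app.post('/api/plans/:id/lock', requirePermission('plan:lock'), lockPlanHandler);
```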

6.3. Foundational AI Evals (Curation Feedback Loop)

  • 6.3.1. Log Curation Decisions:
    • FR-6.3.1.1: The "Curation Queue" functionality SHALL be modified to log all admin decisions (Approve, Reject, Approve with Edits).
    • FR-6.3.1.2: Logged decisions SHALL include relevant metadata such as item ID, original AI output, human feedback/edits, timestamp, and the user who performed the action.
    • FR-6.3.1.3: Logged decisions SHALL be stored in a suitable data store (e.g., Cloud Firestore or a new PostgreSQL table); a PostgreSQL sketch follows this subsection.
  • 6.3.2. Basic Triage & Tagging Quality Metrics:
    • FR-6.3.2.1: The system SHALL provide initial reports or dashboards to visualize triage accuracy and tagging quality based on the logged curation decisions.
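
The following is one possible persistence sketch for FR-6.3.1.1 through FR-6.3.1.3, assuming the PostgreSQL option; the curation_decisions table and its columns are placeholders.

```js
// Illustrative persistence for curation decisions (FR-6.3.1.1 to FR-6.3.1.3).
// DECISIONS mirrors the three admin actions in the Curation Queue.
const DECISIONS = ['approve', 'reject', 'approve_with_edits'];

async function logCurationDecision(pool, { itemId, decision, aiOutput, humanEdits, userId }) {
  if (!DECISIONS.includes(decision)) {
    throw new Error(`Unknown curation decision: ${decision}`);
  }
  await pool.query(
    `INSERT INTO curation_decisions
       (item_id, decision, original_ai_output, human_feedback, decided_at, decided_by)
     VALUES ($1, $2, $3, $4, NOW(), $5)`,
    [itemId, decision, aiOutput, humanEdits ?? null, userId]
  );
}
```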

Milestone M7 - Automated RAG Evals & Advanced Guardrails (Phase 2)

Goal: Implement automated evaluation for the RAG pipeline and introduce advanced guardrails for LLM output quality and safety.

7.1. Core RAG Metrics (Automated Pipeline Evals)

  • 7.1.1. Create Evaluation Dataset:
    • FR-7.1.1.1: The system SHALL utilize a small, high-quality dataset of (question, ground truth answer, relevant context) triples for RAG evaluation.
  • 7.1.2. Integrate Ragas Framework:
    • FR-7.1.2.1: The system SHALL integrate the Ragas framework for automated evaluation of the RAG pipeline.
    • FR-7.1.2.2: Ragas SHALL be used to implement retrieval evaluation (e.g., Hit Rate, Context Recall, Context Precision) for assessing the retriever component.
    • FR-7.1.2.3: Ragas SHALL be used to implement generation evaluation (e.g., Faithfulness, Answer Relevancy) for assessing the LLM generator component, utilizing an LLM-as-a-judge approach.
  • 7.1.3. Automated RAG Evaluation Script:
    • FR-7.1.3.1: The system SHALL include a script to run RAG evaluations daily or weekly using the defined dataset and Ragas, as sketched after this subsection.
    • FR-7.1.3.2: The script SHALL log evaluation results and metrics to a persistent store, including Faithfulness, Answer Relevancy, Context Recall, and Context Precision.
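
Ragas is a Python framework, so one plausible integration from the Node.js backend is to shell out to a Python evaluation script on a schedule and persist the returned metrics (FR-7.1.3.1/FR-7.1.3.2). The script name run_ragas_eval.py, its JSON output contract, and the rag_eval_runs table are all assumptions.

```js
// Sketch of a scheduled RAG evaluation runner (FR-7.1.3.1/FR-7.1.3.2).
// Assumes a Python script `run_ragas_eval.py` exists and prints JSON such as:
// { "faithfulness": 0.91, "answer_relevancy": 0.88,
//   "context_recall": 0.84, "context_precision": 0.80 }
const { execFile } = require('node:child_process');

function runRagEvaluation() {
  return new Promise((resolve, reject) => {
    execFile('python3', ['run_ragas_eval.py'], (err, stdout) => {
      if (err) return reject(err);
      resolve(JSON.parse(stdout));
    });
  });
}

// Persist each run so dashboards can trend the metrics over time.
async function persistRagMetrics(pool) {
  const metrics = await runRagEvaluation();
  await pool.query(
    `INSERT INTO rag_eval_runs (run_at, metrics) VALUES (NOW(), $1)`,
    [JSON.stringify(metrics)]
  );
  return metrics;
}
```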

7.2. LLM Output Guardrails (Content Moderation & Safety)

  • 7.2.1. Enhance Text Quality System:
    • FR-7.2.1.1: The existing TextStyleValidator (backend/src/services/textQualityService.js) SHALL be extended to include checks for LLM-specific issues (e.g., indicators of toxicity, bias, hallucination).
    • FR-7.2.1.2: The system SHALL integrate with Google Gemini's safety features/APIs for real-time content moderation of LLM outputs.
  • 7.2.2. Implement Content Policy Enforcement:
    • FR-7.2.2.1: The system SHALL allow content policies for LLM-generated output to be defined (e.g., no profanity, no sensitive PII, no harmful content).
    • FR-7.2.2.2: The system SHALL provide a mechanism to filter or flag LLM outputs that violate these policies before they are displayed to the user (see the moderation sketch after this list).
  • 7.2.3. Implement LLM-as-a-Judge for Nuanced Guardrails:
    • FR-7.2.3.1: The system SHALL implement an LLM-as-a-judge mechanism (e.g., using Google Gemini) to detect nuanced guardrail violations, such as subtle biases, inappropriate tone, or non-compliance with specific business communication policies, that are difficult to capture with predefined rules.
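
A sketch of layered output moderation covering FR-7.2.2.1, FR-7.2.2.2, and FR-7.2.3.1: cheap rule-based policy checks run first, with an LLM-as-a-judge call for nuanced violations. The example policies are toy placeholders (real policies are listed under Open Questions), and llmJudge stands in for a Gemini-backed grading call.

```js
// Layered LLM-output guardrail (FR-7.2.2.1/FR-7.2.2.2/FR-7.2.3.1), illustrative only.
// The policies below are toy examples, not the production policy set.
const RULE_POLICIES = [
  { name: 'no_profanity', test: (text) => !/\b(damn|hell)\b/i.test(text) },
  { name: 'no_email_pii', test: (text) => !/[\w.+-]+@[\w-]+\.[\w.]+/.test(text) },
];

async function moderateLlmOutput(text, llmJudge) {
  // Cheap, deterministic rule checks run first.
  const ruleViolations = RULE_POLICIES.filter((p) => !p.test(text)).map((p) => p.name);
  if (ruleViolations.length > 0) {
    return { allowed: false, violations: ruleViolations };
  }
  // Nuanced checks (tone, subtle bias) are delegated to an LLM judge.
  // `llmJudge` is a placeholder for an async wrapper around a Gemini call
  // that returns a string beginning with PASS or FAIL.
  const verdict = await llmJudge(
    'Does the following output violate our business communication policy? ' +
    'Answer PASS or FAIL with a brief reason.\n\n' + text
  );
  return verdict.startsWith('PASS')
    ? { allowed: true, violations: [] }
    : { allowed: false, violations: ['llm_judge'] };
}
```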

7.3. Business Logic Consistency Tests (Automated)

  • 7.3.1. Automate Constraint Tests:
    • FR-7.3.1.1: Automated tests for business logic consistency SHALL be developed, leveraging the CONSTRAINT_TESTS concept from qwen-zep-automated-evals.md (an example follows this subsection).
    • FR-7.3.1.2: These automated tests SHALL be integrated into the CI/CD pipeline or executed as a scheduled backend job.
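
An illustrative constraint test for FR-7.3.1.1, using the built-in node:test runner so it can execute in CI/CD via node --test. The choice of runner and the sample constraint are assumptions; the real constraint set would come from qwen-zep-automated-evals.md, and this sample mirrors the S&OP equation from FR-6.1.1.2.

```js
// Illustrative business-logic constraint test (FR-7.3.1.1), runnable in CI/CD
// with `node --test`. The constraint shown is a sample, not the full set.
const test = require('node:test');
const assert = require('node:assert');

// Hypothetical pure helper implementing the core S&OP equation.
function computeBacklog({ prevBacklog, bookings, shipments }) {
  return prevBacklog + bookings - shipments;
}

test('backlog reconciles and stays non-negative', () => {
  const period = { prevBacklog: 120, bookings: 80, shipments: 150 };
  const backlog = computeBacklog(period);
  assert.strictEqual(backlog, 50); // the S&OP equation holds
  assert.ok(backlog >= 0, 'backlog must not be negative'); // business constraint
});
```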

Milestone M8 - Advanced Evals & Performance (Phase 3)

Goal: Implement business-level evaluations, domain-specific checks, and performance monitoring for the AI system.

8.1. Business-Level Evals (Overall Task Success)

  • 8.1.1. Define Golden Dataset for S&OP Problems:
    • FR-8.1.1.1: The system SHALL use a "golden dataset" of 10-20 realistic S&OP business problems, each with defined "successful" AI suggestions (an illustrative entry is sketched after this subsection).
    • FR-8.1.1.2: A clear process SHALL be established for the creation, maintenance, and periodic review of the golden dataset to ensure its continued representativeness of real-world S&OP challenges.
  • 8.1.2. Manual/Semi-Automated Task Success Evaluation:
    • FR-8.1.2.1: The system SHALL provide a process for S&OP leaders to review AI suggestions against the golden dataset and provide pass/fail grades for "Overall Task Success."
    • FR-8.1.2.2: The system SHALL track and visualize the "Overall Task Success Score" over time.
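
One possible shape for a golden-dataset entry (FR-8.1.1.1) and the derived "Overall Task Success Score" (FR-8.1.2.2); all field names and the example content are illustrative.

```js
// Illustrative golden-dataset entry (FR-8.1.1.1) and the derived
// "Overall Task Success Score" (FR-8.1.2.2). All names are placeholders.
const goldenCase = {
  id: 'sop-007',
  problem: 'Backlog spike in EMEA after a supplier delay',
  successfulSuggestions: [
    'Reallocate weeks 3-4 capacity from the APAC line',
    'Expedite the five most delayed purchase orders',
  ],
};

// Score = fraction of golden cases that S&OP leaders graded "pass".
function taskSuccessScore(grades /* [{ caseId, pass: boolean }, ...] */) {
  if (grades.length === 0) return 0;
  return grades.filter((g) => g.pass).length / grades.length;
}
```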

8.2. Domain-Specific Evaluation

  • 8.2.1. S&OP Knowledge Validation:
    • FR-8.2.1.1: The system SHALL include tests to validate the LLM's understanding of S&OP terminology and concepts (e.g., using question-answering pairs against a defined S&OP knowledge base).
  • 8.2.2. Forecast Accuracy Baseline Testing:
    • FR-8.2.2.1: The system SHALL implement historical-data backtesting to evaluate the LLM's impact on forecast accuracy (e.g., measuring MAPE improvement against a baseline), as sketched below.
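
For reference, MAPE is conventionally defined as MAPE = (100/n) * sum over t of |actual_t - forecast_t| / |actual_t|. A minimal backtest sketch comparing an LLM-assisted forecast against the baseline, assuming aligned arrays and non-zero actuals:

```js
// MAPE backtest sketch (FR-8.2.2.1). Inputs are aligned arrays of historical
// actuals and forecasts; actuals are assumed to be non-zero.
function mape(actuals, forecasts) {
  const sum = actuals.reduce(
    (acc, actual, i) => acc + Math.abs((actual - forecasts[i]) / actual),
    0
  );
  return (100 / actuals.length) * sum;
}

// A positive result means the LLM-assisted forecast beat the baseline.
function mapeImprovement(actuals, baselineForecast, llmForecast) {
  return mape(actuals, baselineForecast) - mape(actuals, llmForecast);
}
```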

8.3. Performance & Integration Testing

  • 8.3.1. Memory System Validation (Zep):
    • FR-8.3.1.1: The system SHALL include tests verifying that Zep memory preserves context across sessions and retrieves relevant information at or above a defined accuracy threshold.
  • 8.3.2. Response Time Performance:
    • FR-8.3.2.1: The system SHALL implement performance tests to measure LLM response times for various scenario complexities (e.g., simple vs. complex queries).
    • FR-8.3.2.2: The system SHALL define and monitor Service Level Agreements (SLAs) for LLM response times (a measurement sketch follows).
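
A minimal latency probe for FR-8.3.2.1 and FR-8.3.2.2, measuring wall-clock response times over repeated calls and reporting the 95th percentile against an example SLA. Here callLlm is a placeholder for the real LLM client, and the 5000 ms threshold is illustrative, not a committed SLA.

```js
// Response-time probe sketch (FR-8.3.2.1/FR-8.3.2.2). `callLlm` is a
// placeholder for the real LLM client; the 5000 ms SLA is an example value.
async function measureLatencies(callLlm, prompt, runs = 20, slaMs = 5000) {
  const samples = [];
  for (let i = 0; i < runs; i++) {
    const start = process.hrtime.bigint();
    await callLlm(prompt);
    samples.push(Number(process.hrtime.bigint() - start) / 1e6); // ns -> ms
  }
  samples.sort((a, b) => a - b);
  const p95 = samples[Math.max(0, Math.ceil(samples.length * 0.95) - 1)];
  return { p95, withinSla: p95 <= slaMs };
}
```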

5. Non-Functional Requirements (NFRs)

  • Performance:
    • NFR-P-1: All guardrail checks SHALL execute within a defined latency budget so that they do not degrade the user experience.
    • NFR-P-2: Automated RAG evaluations SHALL be configurable to run on a schedule (e.g., daily, weekly) without impacting production performance.
  • Scalability:
    • NFR-S-1: The evaluation and guardrail systems SHALL be scalable to handle increasing volumes of data and AI model interactions.
  • Security:
    • NFR-SEC-1: All audit logs SHALL be immutable and protected against unauthorized access or modification.
    • NFR-SEC-2: RBAC SHALL be enforced both at the API gateway level and within the application logic.
  • Reliability:
    • NFR-R-1: Guardrail mechanisms SHALL be highly reliable and robust, preventing critical failures due to AI misbehavior or data integrity issues.
    • NFR-R-2: Evaluation results SHALL be consistently logged and accessible for analysis.
  • Maintainability:
    • NFR-M-1: Guardrail rules and evaluation criteria SHALL be configurable and easily maintainable.
    • NFR-M-2: The code for evals and guardrails SHALL adhere to established coding standards and best practices.
  • Usability:
    • NFR-U-1: Alerts and warnings generated by guardrails SHALL be clear, actionable, and provide sufficient context for resolution.
    • NFR-U-2: Evaluation reports and dashboards SHALL be intuitive and provide meaningful insights into AI model performance.
    • NFR-U-3: The system SHALL provide consolidated views of alerts and metrics through a centralized dashboard, with defined notification protocols for high-priority alerts (e.g., email, PagerDuty, Slack) to ensure timely awareness and action.

6. Enhancements to Consider (Future Work)

  1. Explainable AI (XAI) for Guardrails: For critical guardrail violations, provide clear explanations to the user about why a guardrail was triggered and what specific data points led to the violation.
  2. Adaptive Guardrails: Explore mechanisms where guardrails can adapt or be fine-tuned based on user feedback and the evolving context of S&OP meetings.
  3. Proactive Guardrail Suggestions: The AI could proactively suggest ways to avoid triggering guardrails (e.g., "Consider these data points to ensure your plan maintains mathematical integrity").
  4. Integration with Existing Monitoring Tools: Ensure all eval and guardrail metrics are integrated into a centralized monitoring dashboard (e.g., Google Cloud Monitoring).
  5. Version Control for Evals & Guardrails: Treat eval datasets, guardrail rules, and evaluation scripts as code, and version control them for reproducibility and traceability.
  6. Synthetic Data Generation for Evals: Leverage LLMs to synthetically generate diverse and challenging test cases for evals, especially for edge cases.

7. Open Questions / Dependencies

  • Detailed definition of content policies for LLM output moderation.
  • Selection of specific tools/libraries for implementing LLM-as-a-judge if adopted.
  • Integration points with existing monitoring and alerting infrastructure.
  • Availability of historical S&OP data for forecast accuracy baseline testing.