Evals and Guardrails Implementation Guide
Overview
This document describes the Evals and Guardrails framework implemented in ChainAlign to ensure financial accuracy, data integrity, and RAG pipeline quality. These systems are foundational to the platform's enterprise readiness and SOX compliance.
Core Principle: Guardrails prevent bad outcomes; Evals measure quality.
Architecture
┌─────────────────────────────────────────────────────────────┐
│ Guardrails Layer (Prevention) │
│ │
│ ├─ Financial Accuracy Guardrails │
│ │ └─ P&L Reconciliation, Ratio Sanity, Temporal │
│ │ Consistency validation │
│ │ │
│ ├─ Scenario Financial Guardrails │
│ │ └─ Parameter Validation, Outcome Sanity, Coherence, │
│ │ Feasibility assessment │
│ │ │
│ └─ Data Quality Guardrails (Phase 2) │
│ └─ Input validation, PII detection, Schema checks │
└─────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ Evals Layer (Measurement) │
│ │
│ ├─ Layer 1: Foundational Evals (Curation Feedback) │
│ │ └─ Human KB review decisions logged and tracked │
│ │ │
│ ├─ Layer 2: Core RAG Metrics (Automated Pipeline) │
│ │ ├─ Retrieval Evals: Hit Rate, Precision@K, Recall@K │
│ │ └─ Generation Evals: Faithfulness, Relevancy │
│ │ │
│ └─ Layer 3: Business-Level Evals (End-to-End Success) │
│ └─ Human judgment of solution quality on real problems│
└─────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ Audit Trail (SOX Compliance) │
│ │
│ - Immutable logging of all evaluation and validation events │
│ - Traceability for financial decisions │
│ - Evidence for compliance audits │
└─────────────────────────────────────────────────────────────┘
Guardrails Implementation
1. Financial Accuracy Evaluation Service
File: backend/src/services/FinancialAccuracyEvaluationService.js
Purpose: Ensure financial data is mathematically sound and auditable.
Check 1: P&L Reconciliation (Mathematical Integrity)
Validates core accounting equations:
Revenue - COGS = Gross Profit
Gross Profit - OpEx = EBITDA
EBITDA - D&A = EBIT
EBIT - Interest = EBT
EBT - Tax = Net Income
Tolerance: 0.01% variance (configurable)
const result = FinancialAccuracyEvaluationService.validatePLReconciliation(
financialData,
0.0001 // 0.01% tolerance
);
if (!result.isValid) {
// Prevent plan from being locked
// Create alert in Exception Dashboard
}
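The check above can be sketched as follows. This is a minimal illustration, not the service's actual implementation: the field names (`revenue`, `cogs`, `grossProfit`, and so on) and the helper `checkPLReconciliation` are hypothetical, and the tolerance is assumed to be relative to the expected value.

```javascript
// Hypothetical field names; the real service's data schema may differ.
const PL_EQUATIONS = [
  { minuend: 'revenue', subtrahend: 'cogs', result: 'grossProfit' },
  { minuend: 'grossProfit', subtrahend: 'opex', result: 'ebitda' },
  { minuend: 'ebitda', subtrahend: 'depreciationAmortization', result: 'ebit' },
  { minuend: 'ebit', subtrahend: 'interest', result: 'ebt' },
  { minuend: 'ebt', subtrahend: 'tax', result: 'netIncome' },
];

function checkPLReconciliation(data, tolerance = 0.0001) {
  const issues = [];
  for (const { minuend, subtrahend, result } of PL_EQUATIONS) {
    const equation = `${minuend} - ${subtrahend} = ${result}`;
    const expected = data[minuend] - data[subtrahend];
    const actual = data[result];
    // Missing or non-numeric inputs cannot be reconciled, so report them.
    if (!Number.isFinite(expected) || !Number.isFinite(actual)) {
      issues.push({ equation, reason: 'MISSING_VALUE' });
      continue;
    }
    // Relative variance; use 1 as the base when the expected value is zero.
    const base = Math.abs(expected) || 1;
    const variance = Math.abs(actual - expected) / base;
    if (variance > tolerance) {
      issues.push({ equation, reason: 'VARIANCE', expected, actual, variance });
    }
  }
  return { isValid: issues.length === 0, issues };
}
```

Because each line of the P&L feeds the next, a single misstated subtotal typically surfaces as two issues: the equation that produces it and the equation that consumes it.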
Check 2: Ratio Sanity Checks
Detects outliers using industry benchmarks:
| Ratio | Benchmark | Warning Threshold |
|---|---|---|
| Gross Margin | 15-70% | <5% or >90% |
| Operating Margin | 5-30% | <-20% or >50% |
| Net Margin | 2-20% | <-30% |
| Debt-to-Equity | <1.0 | >2.0 |
| Interest Coverage | >2.5 | <1.5 |
| Current Ratio | 1.5-2.5 | <1.0 or >5.0 |
const ratioValidation = FinancialAccuracyEvaluationService.validateRatioSanity(
financialData
);
for (const issue of ratioValidation.issues) {
if (issue.type === 'ERROR') {
// Block data ingestion
} else if (issue.type === 'WARNING') {
// Flag for manual review
}
}
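As a rough sketch of how the warning thresholds in the table could be applied, the snippet below computes each ratio and compares it against the bounds. The input field names, the ratio definitions (for example, operating margin as EBIT over revenue), and the choice to report an uncomputable ratio as an `ERROR` are assumptions made for illustration.

```javascript
// Warning bounds copied from the table above; field names are hypothetical.
const RATIO_WARNING_BOUNDS = {
  grossMargin: { min: 0.05, max: 0.9 },
  operatingMargin: { min: -0.2, max: 0.5 },
  netMargin: { min: -0.3 },
  debtToEquity: { max: 2.0 },
  interestCoverage: { min: 1.5 },
  currentRatio: { min: 1.0, max: 5.0 },
};

function computeRatios(d) {
  return {
    grossMargin: d.grossProfit / d.revenue,
    operatingMargin: d.ebit / d.revenue,
    netMargin: d.netIncome / d.revenue,
    debtToEquity: d.totalDebt / d.totalEquity,
    interestCoverage: d.ebit / d.interest,
    currentRatio: d.currentAssets / d.currentLiabilities,
  };
}

function findRatioIssues(d) {
  const ratios = computeRatios(d);
  const issues = [];
  for (const [ratio, { min, max }] of Object.entries(RATIO_WARNING_BOUNDS)) {
    const value = ratios[ratio];
    if (!Number.isFinite(value)) {
      // Assumption: a ratio that cannot be computed is treated as an error.
      issues.push({ type: 'ERROR', ratio, message: `${ratio} could not be computed` });
    } else if ((min !== undefined && value < min) || (max !== undefined && value > max)) {
      issues.push({ type: 'WARNING', ratio, value });
    }
  }
  return issues;
}
```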
Check 3: Temporal Consistency
Detects anomalies across periods:
- Revenue Growth Rate: Flags growth above 50% or a decline steeper than 30%
- Margin Shift: Detects >10% period-over-period change
- OpEx Growth Outpacing Revenue: Early warning of operating leverage loss
- FCF Volatility: Alerts to working capital swings
const temporalValidation = FinancialAccuracyEvaluationService.validateTemporalConsistency(
financialDataSeries
);
// Review anomalies
for (const anomaly of temporalValidation.anomalies) {
console.log(`${anomaly.type}: ${anomaly.message}`);
}
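The first two checks in the list above could look roughly like this. The sketch assumes a chronologically ordered series with positive revenue in every period, hypothetical field names (`period`, `revenue`, `grossProfit`), and reads the 10% margin threshold as 10 percentage points of gross margin; the service may define these differently.

```javascript
// Hypothetical helper; thresholds taken from the list above.
function findTemporalAnomalies(series) {
  const anomalies = [];
  for (let i = 1; i < series.length; i++) {
    const prev = series[i - 1];
    const curr = series[i];

    // Period-over-period revenue growth.
    const growth = (curr.revenue - prev.revenue) / prev.revenue;
    if (growth > 0.5 || growth < -0.3) {
      anomalies.push({
        type: 'REVENUE_GROWTH',
        period: curr.period,
        message: `Revenue changed by ${(growth * 100).toFixed(1)}%`,
      });
    }

    // Gross margin shift, measured in percentage points (an assumption).
    const shift = curr.grossProfit / curr.revenue - prev.grossProfit / prev.revenue;
    if (Math.abs(shift) > 0.1) {
      anomalies.push({
        type: 'MARGIN_SHIFT',
        period: curr.period,
        message: `Gross margin moved by ${(shift * 100).toFixed(1)} points`,
      });
    }
  }
  return anomalies;
}
```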
Usage Example
// In financial data ingestion pipeline
const validationReport = await FinancialAccuracyEvaluationService.validateFinancialData(
tenantId,
userId,
financialDataArray
);
// Get quality score (0-100)
const qualityScore = FinancialAccuracyEvaluationService.calculateDataQualityScore(
validationReport
);
if (qualityScore < 70) {
// Flag for manual review
await notificationService.alert({
severity: 'WARNING',
message: `Financial data quality score ${qualityScore}/100 is below threshold`,
});
}
// Audit log
await logAuditEvent(tenantId, userId, 'FINANCIAL_VALIDATION', {
status: validationReport.overallStatus,
qualityScore,
issueCount: validationReport.totalIssues,
});
2. Scenario Financial Validation Guardrails
File: backend/src/services/ScenarioFinancialGuardrails.js
Purpose: Ensure financial scenarios are realistic and supportable by decision-makers.
Guardrail 1: Parameter Validation
Validates scenario parameters before projection:
const paramValidation = ScenarioFinancialGuardrails.validateScenarioParameters(
scenario.predicted_impact
);
// Revenue growth: -50% to +100%
// COGS %: 15-95%
// OpEx growth: -50% to +50%
// R&D spend: 0-20% of revenue
// CapEx: 0-30% of revenue
if (paramValidation.requiresApproval) {
// Require C-suite sign-off
scenario.status = 'PENDING_EXECUTIVE_APPROVAL';
}
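A range check of this kind might be sketched as below. The ranges come from the comments above. `revenue_growth_rate` and `capex_pct_revenue` appear elsewhere in this guide, while the other parameter names and the helper `checkParameterRanges` are hypothetical. How out-of-range values map to `requiresApproval` is not shown here.

```javascript
// [min, max] per parameter, expressed as fractions (0.15 = 15%).
const PARAMETER_RANGES = {
  revenue_growth_rate: [-0.5, 1.0],
  cogs_pct_revenue: [0.15, 0.95], // hypothetical name
  opex_growth_rate: [-0.5, 0.5], // hypothetical name
  rd_pct_revenue: [0.0, 0.2], // hypothetical name
  capex_pct_revenue: [0.0, 0.3],
};

function checkParameterRanges(predictedImpact) {
  const violations = [];
  for (const [parameter, [min, max]] of Object.entries(PARAMETER_RANGES)) {
    const value = predictedImpact[parameter];
    if (value === undefined) continue; // parameter not set in this scenario
    const outOfRange = typeof value !== 'number' || Number.isNaN(value) || value < min || value > max;
    if (outOfRange) {
      violations.push({ parameter, value, min, max });
    }
  }
  return { isValid: violations.length === 0, violations };
}
```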
Guardrail 2: Outcome Sanity Checks
Validates projections are plausible:
const outcomeValidation = ScenarioFinancialGuardrails.validateProjectionOutcomes(
projections
);
// Checks:
// - No negative revenue
// - EBITDA margin between -50% and 80%
// - No sustained negative FCF (3+ years)
// - Debt <3x revenue
// - Effective tax rate <100%
for (const issue of outcomeValidation.issues) {
if (issue.type === 'GUARDRAIL') {
console.log(`⚠️ ${issue.message}`);
}
}
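The listed checks could be implemented along these lines. This is a sketch under assumptions: projections are taken to be an array of yearly records with hypothetical fields (`year`, `revenue`, `ebitda`, `fcf`, `debt`, `ebt`, `tax`), and "sustained" negative FCF is read as three or more consecutive years.

```javascript
// Length of the longest run of consecutive negative values.
function longestNegativeRun(values) {
  let longest = 0;
  let current = 0;
  for (const v of values) {
    current = v < 0 ? current + 1 : 0;
    longest = Math.max(longest, current);
  }
  return longest;
}

function checkProjectionOutcomes(projections) {
  const issues = [];
  const flag = (check, message) => issues.push({ type: 'GUARDRAIL', check, message });

  for (const p of projections) {
    if (p.revenue < 0) {
      flag('NEGATIVE_REVENUE', `Year ${p.year}: revenue is negative`);
      continue; // revenue-based ratios are meaningless for this year
    }
    if (p.revenue > 0) {
      const ebitdaMargin = p.ebitda / p.revenue;
      if (ebitdaMargin < -0.5 || ebitdaMargin > 0.8) {
        flag('EBITDA_MARGIN', `Year ${p.year}: EBITDA margin outside -50% to 80%`);
      }
      if (p.debt >= 3 * p.revenue) {
        flag('DEBT_TO_REVENUE', `Year ${p.year}: debt is at least 3x revenue`);
      }
    }
    if (p.ebt > 0 && p.tax / p.ebt >= 1) {
      flag('EFFECTIVE_TAX_RATE', `Year ${p.year}: effective tax rate is 100% or more`);
    }
  }

  if (longestNegativeRun(projections.map((p) => p.fcf)) >= 3) {
    flag('SUSTAINED_NEGATIVE_FCF', 'Free cash flow is negative for 3+ consecutive years');
  }
  return { isValid: issues.length === 0, issues };
}
```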
Guardrail 3: Coherence Validation
Ensures scenario logic is internally consistent:
const coherence = ScenarioFinancialGuardrails.validateScenarioCoherence(
scenario.predicted_impact,
projections
);
// Detects:
// - Declining revenue but improving margins (unusual)
// - OpEx growing faster than revenue but margins improving (inconsistent)
// - Other logical contradictions
Guardrail 4: Feasibility & Risk Assessment
High-level risk scoring:
const feasibility = ScenarioFinancialGuardrails.validateScenarioFeasibility(
scenario,
projections
);
// Risk Levels:
// - LOW: No significant risks
// - MEDIUM: One medium risk or some warnings
// - HIGH: Multiple medium risks or significant risks
// - CRITICAL: High risks that prevent approval
console.log(`Risk Level: ${feasibility.riskLevel}`);
for (const risk of feasibility.risks) {
console.log(` ${risk.level}: ${risk.description}`);
console.log(` Mitigation: ${risk.mitigation}`);
}
Comprehensive Guardrail Check
const guardrailReport = await ScenarioFinancialGuardrails.evaluateScenarioGuardrails(
tenantId,
userId,
scenario,
projections
);
// Status: PASS, REQUIRES_APPROVAL, or FAIL
if (guardrailReport.status === 'REQUIRES_APPROVAL') {
// Route to Executive Approval Workflow
await approvalWorkflow.create({
scenarioId: scenario.id,
requiredApprovers: ['CFO', 'COO'],
guardrailReport,
});
}
// Audit trail for SOX compliance
await logAuditEvent(tenantId, userId, 'SCENARIO_GUARDRAIL_CHECK', guardrailReport);
Evals Implementation
Layer 1: Foundational Evals (Curation Feedback Loop)
File: backend/src/services/RAGEvaluationService.js
Purpose: Log and track human evaluation decisions from KB curation.
// When admin approves, rejects, or edits in Curation Queue
const feedback = await RAGEvaluationService.logCurationFeedback({
tenantId,
userId,
documentId,
curationAction: 'APPROVE', // or 'REJECT', 'EDIT'
originalExtraction, // extraction as produced by the pipeline
curatedExtraction, // extraction after the admin's review
rationale: 'Corrected supply chain risk categorization',
});
// This feedback becomes training data for future fine-tuning
// Tracks triage accuracy and tagging quality
Metrics Captured:
- Approval rate: % of triaged documents approved as-is
- Rejection rate: % of triaged documents rejected
- Edit rate: % requiring manual correction
- Quality trend: IMPROVING, STABLE, or DECLINING
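The three rates above can be derived directly from the logged `curationAction` values. The helper below is a hypothetical sketch that assumes each feedback record carries one of `'APPROVE'`, `'REJECT'`, or `'EDIT'`. Computing the quality trend is left out, since the guide does not say how it is derived.

```javascript
function summarizeCurationFeedback(feedback) {
  const counts = { APPROVE: 0, REJECT: 0, EDIT: 0 };
  for (const { curationAction } of feedback) {
    // Ignore unknown actions rather than counting them.
    if (Object.prototype.hasOwnProperty.call(counts, curationAction)) {
      counts[curationAction] += 1;
    }
  }
  const total = counts.APPROVE + counts.REJECT + counts.EDIT;
  const rate = (n) => (total === 0 ? 0 : n / total);
  return {
    total,
    approvalRate: rate(counts.APPROVE),
    rejectionRate: rate(counts.REJECT),
    editRate: rate(counts.EDIT),
  };
}
```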
Layer 2: Core RAG Metrics (Automated Pipeline Evals)
Retrieval Evaluation
Measures how well documents are being retrieved:
const retrievalTest = {
query: 'What are the supply constraints for electronics in Q4?',
relevantDocIds: ['doc-123', 'doc-456', 'doc-789'],
retrievedDocs: [
{ id: 'doc-123', relevanceScore: 0.95 },
{ id: 'doc-999', relevanceScore: 0.88 },
{ id: 'doc-456', relevanceScore: 0.82 },
// ... more results
],
k: 5,
};
const retrievalMetrics = RAGEvaluationService.evaluateRetrieval(retrievalTest);
// Results:
// - hitRate: 1.0 (correct doc in top-5)
// - precision: 0.4 (2 of the top 5 results are relevant, if the elided results add no further matches)
// - recall: 0.67 (2 of 3 relevant docs retrieved)
// - mrr: 1.0 (first correct doc at position 1)
// - ndcg: 0.89 (normalized discounted cumulative gain)
Interpreting Metrics:
| Metric | Meaning | Good Score |
|---|---|---|
| Hit Rate | Did at least one relevant doc appear in the top K? | 0.9+ |
| Precision@K | Share of top-K results that are relevant | 0.7+ |
| Recall@K | % of relevant docs retrieved | 0.8+ |
| MRR | How quickly was first relevant doc found | 0.9+ |
| nDCG@K | Ranking quality (penalizes low ranks) | 0.8+ |
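For reference, the five metrics can be computed with their standard binary-relevance definitions as sketched below. This is an illustration, not the code behind `evaluateRetrieval`: it assumes retrieved document IDs are unique, divides precision by `k` rather than by the number of results returned, and ignores the `relevanceScore` values.

```javascript
function computeRetrievalMetrics({ relevantDocIds, retrievedDocs, k }) {
  const relevant = new Set(relevantDocIds);
  const topK = retrievedDocs.slice(0, k);
  // hits[i] is true when the result at rank i + 1 is relevant.
  const hits = topK.map((doc) => relevant.has(doc.id));
  const hitCount = hits.filter(Boolean).length;
  const firstHit = hits.indexOf(true);

  // DCG with binary gains: 1 / log2(rank + 1) for each relevant result.
  const dcg = hits.reduce((sum, hit, i) => sum + (hit ? 1 / Math.log2(i + 2) : 0), 0);
  // Ideal DCG: all relevant docs (up to k) ranked first.
  let idcg = 0;
  for (let i = 0; i < Math.min(relevant.size, k); i++) {
    idcg += 1 / Math.log2(i + 2);
  }

  return {
    hitRate: hitCount > 0 ? 1 : 0,
    precision: k > 0 ? hitCount / k : 0,
    recall: relevant.size > 0 ? hitCount / relevant.size : 0,
    mrr: firstHit === -1 ? 0 : 1 / (firstHit + 1),
    ndcg: idcg > 0 ? dcg / idcg : 0,
  };
}
```

With relevant documents at ranks 1 and 3 out of three relevant in total and `k = 5`, this binary formulation gives a precision of 0.4, a recall of about 0.67, and an nDCG of about 0.70. An implementation that uses graded relevance scores would produce different nDCG values.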
Generation Evaluation
Measures quality of AI-generated insights:
const generationTest = {
query: 'What immediate actions should we take to address Q4 supply risks?',
context: [
'Document 1: Supply chain analysis...',
'Document 2: Supplier capacity planning...',
],
generatedAnswer: 'We recommend: 1) Activate alternate suppliers... 2) Increase safety stock...',
groundTruth: 'Expected answer for validation (optional)',
};
const generationMetrics = RAGEvaluationService.evaluateGeneration(generationTest);
// Results:
// - faithfulness: 0.87 (answer terms appear in context)
// - answerRelevancy: 0.92 (answer addresses query)
// - contextRelevancy: 0.85 (context is useful)
// - correctness: 0.78 (matches ground truth, if provided)
// - overallScore: 0.86 (average across metrics)
Interpreting Scores:
| Metric | Meaning | Good Score |
|---|---|---|
| Faithfulness | Is answer grounded in context? | 0.8+ |
| Answer Relevancy | Does answer address query? | 0.8+ |
| Context Relevancy | Is context useful? | 0.8+ |
| Correctness | Is answer factually correct? | 0.8+ |
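The faithfulness result is described above as "answer terms appear in context", which suggests a term-overlap heuristic. A minimal sketch of such a heuristic follows. The tokenizer, the stopword list, and the function name are assumptions, and the real service may score faithfulness differently.

```javascript
// Small illustrative stopword list; a real one would be larger.
const STOPWORDS = new Set(['the', 'a', 'an', 'and', 'or', 'to', 'of', 'in', 'for', 'we', 'is', 'are', 'with', 'on', 'by']);

function tokenize(text) {
  const tokens = text.toLowerCase().match(/[a-z0-9]+/g) || [];
  return tokens.filter((t) => !STOPWORDS.has(t));
}

// Fraction of distinct answer terms that also occur somewhere in the context.
function termOverlapFaithfulness(generatedAnswer, context) {
  const contextTerms = new Set(context.flatMap(tokenize));
  const answerTerms = [...new Set(tokenize(generatedAnswer))];
  if (answerTerms.length === 0) return 0;
  const supported = answerTerms.filter((t) => contextTerms.has(t)).length;
  return supported / answerTerms.length;
}
```

Term overlap is cheap but shallow: it rewards an answer for reusing context vocabulary even when the claim itself is unsupported, which is one reason an LLM-as-judge approach is often used instead.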
Layer 3: Business-Level Evals (End-to-End Task Success)
Measures whether AI suggestions actually help solve business problems:
const businessTest = {
problemId: 'OERLIKON-Q4-2025-001',
problemDescription: 'Balzers division facing 15% supply shortfall for HVOF powders. ' +
'Lead time from suppliers is 6 months. Customer order deadline is 4 months away. ' +
'What actions should we take?',
systemSuggestions: [
'Activate emergency supplier for interim supply',
'Negotiate order postponement with key customer',
'Increase safety stock by 25%',
],
humanScore: 8, // 0-10 scale
rationale: 'Suggestions were actionable and addressed the core bottleneck. ' +
'However, missed opportunity to discuss premium pricing option.',
successCriteria: {
addressesCoreIssue: true,
isActionable: true,
considersRiskMitigation: true,
completeness: false, // Missed premium pricing angle
},
};
const businessEval = RAGEvaluationService.evaluateBusinessTaskSuccess(businessTest);
// Results:
// - humanScore: 8
// - scoreCategory: 'EXCELLENT' (8-10); lower bands are 'GOOD' (6 to <8), 'FAIR' (4 to <6), etc.
// - requiresImprovement: false (score >= 6)
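The mapping from score to category could be sketched as below, treating each band's lower bound as inclusive so that a score of 8 counts as `'EXCELLENT'`. The function name is hypothetical, and so is the `'POOR'` label for scores under 4, since the guide does not name the lowest bands.

```javascript
function categorizeHumanScore(humanScore) {
  if (!Number.isFinite(humanScore) || humanScore < 0 || humanScore > 10) {
    throw new RangeError('humanScore must be a number between 0 and 10');
  }
  let scoreCategory;
  if (humanScore >= 8) {
    scoreCategory = 'EXCELLENT';
  } else if (humanScore >= 6) {
    scoreCategory = 'GOOD';
  } else if (humanScore >= 4) {
    scoreCategory = 'FAIR';
  } else {
    scoreCategory = 'POOR'; // hypothetical label for scores below 4
  }
  // A score of 6 or higher passes, matching the threshold noted above.
  return { humanScore, scoreCategory, requiresImprovement: humanScore < 6 };
}
```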
Aggregated Evaluation Report
Combine all three layers into a comprehensive quality score:
const evalResults = {
layer1Feedback: [...curationFeedback],
layer2Retrieval: [...retrievalMetrics],
layer2Generation: [...generationMetrics],
layer3Business: [...businessTestResults],
};
const ragReport = RAGEvaluationService.aggregateEvaluationResults(evalResults);
// Report includes:
// - Layer 1: Approval rate (80%), rejection rate (15%), quality trend (IMPROVING)
// - Layer 2: Retrieval score (0.82 nDCG), Generation score (0.85 overall)
// - Layer 3: Business success rate (78% pass), average score (7.8/10)
// - Overall RAG Score: 82/100
// - Recommendations: Focus on improving retrieval quality...
Integration Points
1. Financial Data Ingestion Pipeline
// In CSV upload or ERP sync
const financialData = await parseFinancialData(uploadedFile);
// Validate before ingestion
const validationReport = await FinancialAccuracyEvaluationService.validateFinancialData(
tenantId,
userId,
financialData
);
if (validationReport.overallStatus === 'INVALID') {
throw new Error('Financial data validation failed: ' + JSON.stringify(validationReport.errors));
}
// Proceed with ingestion
await FinancialTimeSeriesRepository.batchInsert(financialData);
2. Scenario Creation & Approval
// When scenario is created with financial parameters
const scenario = await ScenariosService.createScenario(user, decisionProblemId, {
name: '15% Revenue Growth',
predicted_impact: {
revenue_growth_rate: 0.15,
capex_pct_revenue: 0.08,
},
});
// Calculate financial projections
const projections = await ScenarioFinancialService.calculateScenarioFinancialImpact(
user,
scenario.id
);
// Run guardrail checks
const guardrailReport = await ScenarioFinancialGuardrails.evaluateScenarioGuardrails(
user.tenant_id,
user.id,
scenario,
projections
);
if (guardrailReport.status === 'REQUIRES_APPROVAL') {
scenario.status = 'PENDING_EXECUTIVE_APPROVAL';
await approvalWorkflow.createApprovalTask(scenario, guardrailReport);
} else if (guardrailReport.status === 'FAIL') {
throw new Error('Scenario failed financial guardrails');
}
3. KB Curation & Evaluation
// When admin curates a document in the Curation Queue
await RAGEvaluationService.logCurationFeedback({
tenantId,
userId,
documentId,
curationAction: 'APPROVE',
originalExtraction,
curatedExtraction,
rationale: 'Correct risk categorization',
});
// Periodically assess RAG quality
const evalReport = RAGEvaluationService.aggregateEvaluationResults(allEvals);
if (evalReport.overallRAGScore < 70) {
// Notify KB team to focus on quality improvements
await notificationService.sendToTeam('KB_QUALITY_ALERT', {
score: evalReport.overallRAGScore,
recommendations: evalReport.recommendations,
});
}
Audit Trail & Compliance
All evals and guardrail checks are logged for SOX compliance:
// Every validation and evaluation is logged
await logAuditEvent(tenantId, userId, 'FINANCIAL_VALIDATION_RUN', {
dataType: 'P&L_IMPORT',
status: 'PASS',
qualityScore: 92,
timestamp: new Date(),
}, 'Financial data validation completed successfully');
await logAuditEvent(tenantId, userId, 'SCENARIO_GUARDRAIL_CHECK', {
scenarioId: 'scenario-123',
status: 'REQUIRES_APPROVAL',
riskLevel: 'MEDIUM',
guardrailsRun: ['PARAMETER_VALIDATION', 'OUTCOME_SANITY', 'COHERENCE'],
}, 'Scenario routed for executive review');
// Immutable audit log
// SELECT * FROM audit_log WHERE action_type LIKE '%GUARDRAIL%';
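This guide does not describe how immutability of the audit log is enforced. One common technique, shown here purely as an illustration and not as ChainAlign's mechanism, is a hash chain: each entry stores the hash of the previous entry, so any later modification breaks verification. The function names and the `'GENESIS'` marker are hypothetical.

```javascript
const crypto = require('crypto');

// Hash of an entry's contents together with the previous entry's hash.
function hashEntry(prevHash, event) {
  const payload = JSON.stringify({ prevHash, event });
  return crypto.createHash('sha256').update(payload).digest('hex');
}

// Returns a new log array with the event appended; the input is not mutated.
function appendAuditEntry(log, event) {
  const prevHash = log.length > 0 ? log[log.length - 1].hash : 'GENESIS';
  const entry = Object.freeze({ event, prevHash, hash: hashEntry(prevHash, event) });
  return [...log, entry];
}

// True only if every entry links to its predecessor and its hash still matches.
function verifyAuditChain(log) {
  return log.every((entry, i) => {
    const expectedPrev = i === 0 ? 'GENESIS' : log[i - 1].hash;
    return entry.prevHash === expectedPrev && entry.hash === hashEntry(entry.prevHash, entry.event);
  });
}
```

A hash chain makes tampering detectable but does not prevent it. In practice it would be combined with database-level controls such as append-only permissions.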
Configuration & Thresholds
Financial Validation Thresholds
// In FinancialAccuracyEvaluationService.js
const VALIDATION_THRESHOLDS = {
PL_RECONCILIATION_TOLERANCE: 0.0001, // 0.01%
GROSS_MARGIN_MIN: 0.05, // 5%
GROSS_MARGIN_MAX: 0.95, // 95%
OPERATING_MARGIN_WARNING_MIN: -0.2, // -20%
DEBT_TO_EQUITY_CONCERN: 2.0, // 2.0x
INTEREST_COVERAGE_CONCERN: 1.5, // 1.5x
CURRENT_RATIO_MIN: 1.0, // 1.0x
};
// Adjust based on industry, company size, etc.
Scenario Guardrail Thresholds
// In ScenarioFinancialGuardrails.js
const GUARDRAIL_THRESHOLDS = {
REVENUE_GROWTH_MAX: 1.0, // +100%
REVENUE_GROWTH_MIN: -0.5, // -50%
OPEX_GROWTH_MAX: 0.5, // +50%
RD_SPEND_MAX: 0.2, // 20% of revenue
CAPEX_MAX: 0.3, // 30% of revenue
EBITDA_MARGIN_MIN: -0.5, // -50%
DEBT_TO_REVENUE_MAX: 3.0, // 3.0x
};
RAG Eval Score Thresholds
// In RAGEvaluationService.js
const EVAL_THRESHOLDS = {
RETRIEVAL_HIT_RATE_MIN: 0.9, // 90%
RETRIEVAL_PRECISION_MIN: 0.7, // 70%
GENERATION_FAITHFULNESS_MIN: 0.8, // 80%
BUSINESS_TASK_PASS_THRESHOLD: 6, // 6/10 score
OVERALL_RAG_SCORE_MIN: 70, // 70/100
};
Monitoring & Dashboards
Financial Health Dashboard
Display:
- Data quality scores over time
- Common validation failures
- Temporal consistency anomalies
- Alert frequency and resolution
Scenario Quality Dashboard
Display:
- Scenario approvals vs. rejections
- Risk level distribution
- Guardrail violation trends
- Executive approval workflow metrics
RAG Quality Dashboard
Display:
- Layer 1: Curation approval rate by document type
- Layer 2: Retrieval and generation scores by query category
- Layer 3: Business task success rate
- Overall RAG score trend
- Recommendations implementation status
Next Steps (Phase 2)
Phase 2 Enhancements
- Data Quality Guardrails - Input validation, PII detection, schema enforcement
- Self-Healing Guardrails - Automatic retry/repair of failed validations
- Pre-Processing Compliance Filters - Early-stage regulatory risk detection
- Advanced RAG Evals - Integration with Ragas framework for LLM-as-judge
- Sensitivity Analysis - Scenario parameter sensitivity evaluation
- Custom Guardrails - Allow organizations to define custom validation rules
References
- Guardrails Overview: /docs/architecture/evals-and-guardrails/guardrails-overview.md
- Evals Framework: /docs/architecture/evals-and-guardrails/evals-overview.md
- Audit Service: backend/src/services/auditService.js
- Financial Intelligence: docs/FINANCIAL_INTELLIGENCE_README.md
- Scenario Financial Service: docs/SCENARIO_FINANCIAL_SERVICE_README.md
Last Updated: 2025-10-23
Status: Phase 1 Implementation Complete
Version: 1.0