MILESTONE 79 - ERP Integration System

Status: 📋 PLANNED (Design Complete) Design Date: 2025-11-17 Estimated Duration: 16 weeks (4 phases) Strategic Impact: Transform customer integration from 6-12 months to days-weeks

Executive Summary

M79 establishes ChainAlign as an intelligent iPaaS (Integration Platform as a Service) that normalizes disparate customer ERP systems (SAP, Oracle, Adler, Dynamics, custom) to ChainAlign's canonical S&OP data model. This removes the largest barrier to customer adoption: integration complexity.

Key Innovation:

"Upload CSV today, sync in real-time tomorrow" - gradual sophistication without rebuilding
LLM-powered semantic mapping - understand customer fields automatically
Self-improving ontology - every validation makes future integrations smarter
Agentic automation - minimize human intervention over time

Competitive Differentiation:

Traditional competitors require customers to build data lakes (18-24 months, millions of dollars)
ChainAlign assumes the integration burden, enabling 2-day CSV onboarding → 2-week API integration

What Will Be Built

1. External Layer: Integration Gateway

Component: gateway.chainalign.com (GCP API Gateway)

Purpose: Single, stable, version-agnostic endpoint for all customer integrations

Features:

Multi-protocol support (REST, GraphQL, SOAP, VPN/SQL, CSV)
Environment routing (dev/staging/prod via subdomain or path)
Stripe-style dated versioning (ChainAlign-Version: 2025-11-17 header)
Security (OAuth 2.0, rate limiting, DDoS protection)
Protocol-agnostic (internal refactoring invisible to customers)

Files:

/infrastructure/gateway-config/openapi.yaml - API Gateway configuration
/infrastructure/gateway-config/api-gateway-deploy.sh - Deployment script

2. Ingestion Layer: Protocol Orchestrator

Component: Ingestion Orchestrator (Node.js/Cloud Run)

Purpose: Parse all protocols into canonical queue format

Responsibilities:

Protocol Detection: Read Content-Type header, route to appropriate parser
- application/json → REST parser
- application/graphql → GraphQL parser
- application/soap+xml → SOAP parser
Schema Extraction: Identify field names, types, sample values
Normalization: Convert all formats to standard queue message
Queue Writing: Write to pgmq ingestion_queue

Message Format:

{
  source: 'adler_erp',
  tenant_id: 'uuid',
  protocol: 'REST' | 'GraphQL' | 'SOAP' | 'SQL' | 'CSV',
  raw_data: {...},  // Original payload
  schema_hints: {
    fields: [
      {name: 'proj_stage_bill', type: 'numeric', samples: [1250, 3400]}
    ]
  },
  version: '2025-11-17',
  timestamp: '2025-11-17T10:30:00Z'
}

Files:

backend/services/ingestion-orchestrator/ - New microservice
- server.js - Express server
- parsers/RESTParser.js - REST/JSON parser
- parsers/GraphQLParser.js - GraphQL parser
- parsers/SOAPParser.js - SOAP/XML parser
- parsers/SQLParser.js - Database query result parser
- utils/schemaExtractor.js - Field detection logic
- utils/queueWriter.js - pgmq integration
- Dockerfile - Cloud Run deployment
- package.json

3. Intelligence Layer: Semantic Mapper Agent

Component: Semantic Mapper Agent (Python/LangGraph/Cloud Run)

Purpose: Intelligently map customer fields to ChainAlign canonical schema

Architecture: LangGraph cyclic workflow with 6 nodes:

Node 1: Schema Extraction

Input: Queue message with raw data Output: Structured field analysis Logic:

Parse field names from JSON/XML/CSV
Infer data types (string, numeric, date, boolean)
Collect sample values (first 5-10 records)
Generate fingerprints (e.g., "10-15 digits, prefix 'INV-'" → invoice_number)

Node 2: Exact Match Lookup

Input: Field name Output: 100% confidence mapping OR null Logic:

SELECT concept_id, canonical_field
FROM ontology_concepts
WHERE tenant_id = $1
  AND (concept_name = $2 OR $2 = ANY(synonyms))

Success Path: Auto-map → Node 6 (Execute Mapping)

Node 3: Semantic Search

Input: Unknown field Output: Top 3 similar concepts with scores Logic:

Generate embedding (Gemini embedding-001)
Query pgvector for nearest neighbors
Calculate cosine similarity
If similarity > 0.85: High-confidence match → Node 6
If 0.60-0.85: Medium confidence → Node 4 (LLM)
If < 0.60: Low confidence → Node 4 (LLM)

SELECT concept_id, canonical_field,
       embedding <=> $1 AS distance
FROM ontology_concepts
WHERE tenant_id = $2
ORDER BY embedding <=> $1
LIMIT 3

Node 4: LLM Reasoning

Input: Field + top 3 semantic matches Output: Confidence score (0-100) + reasoning Prompt:

You are a data integration specialist for S&OP systems.

Context:
- Customer field: "proj_stage_bill"
- Data type: numeric
- Sample values: [1250.00, 3400.50, 890.25]
- Customer industry: Manufacturing

Top 3 ontology matches:
1. billing_milestone - Project stage-based billing (similarity: 0.72)
2. revenue - Total sales revenue (similarity: 0.68)
3. stage_payment - Payment by delivery stage (similarity: 0.65)

Question: Is this customer field semantically equivalent to one of these concepts?

Respond in JSON:
{
  "is_match": true/false,
  "matched_concept_id": "uuid or null",
  "confidence": 0-100,
  "reasoning": "Explain your decision"
}

Decision Tree:

Confidence > 95%: Auto-map → Node 6
Confidence 60-95%: Human validation → Node 5
Confidence < 60%: Retry with refined query → Loop back to Node 3 (max 2 cycles)

Refinement Strategies:

Decompose compound names: "proj_stage_bill" → ["project", "stage", "billing"]
Use broader domain terms: "financial" instead of "billing"
Query industry-specific synonyms

Node 5: Human Validation Queue

Input: Low-confidence mapping Output: Pending review record Logic:

INSERT INTO pending_field_mappings (
  tenant_id, source_field, suggested_concept_id,
  llm_confidence, llm_reasoning, sample_data, status
) VALUES ($1, $2, $3, $4, $5, $6, 'pending')

Workflow exits, waits for human approval
Admin reviews in UI, makes decision
On approval, workflow resumes → Node 6

Node 6: Execute Mapping

Input: Validated mapping decision Output: Data written to PostgreSQL Actions:

Case 1: Synonym (most common)

-- Add synonym to ontology
UPDATE ontology_concepts
SET synonyms = array_append(synonyms, $1)
WHERE concept_id = $2;

-- Move data from JSONB to proper column
UPDATE core_entities
SET billing_milestone = (custom_attributes->>'proj_stage_bill')::numeric
WHERE tenant_id = $1;

Case 2: New core concept (rare, requires approval)

// Generate Knex migration
const migration = `
exports.up = function(knex) {
  return knex.schema.table('core_entities', (table) => {
    table.decimal('carbon_footprint_per_unit', 15, 4);
  });
};
`;

// Add to ontology
await ontologyRepository.create({
  concept_name: 'carbon_footprint_per_unit',
  canonical_field: 'carbon_footprint_per_unit',
  domain: 'sustainability'
});

Case 3: Tenant-specific custom field

-- Stays in JSONB permanently
CREATE INDEX idx_custom_tenant_field
ON core_entities USING GIN ((custom_attributes->'tenant_field'))
WHERE tenant_id = $1;

Files:

python-services/semantic-mapper-agent/ - New Python service
- main.py - FastAPI server
- langgraph_workflow.py - LangGraph state machine
- nodes/schema_extractor.py - Node 1
- nodes/exact_matcher.py - Node 2
- nodes/semantic_searcher.py - Node 3
- nodes/llm_reasoner.py - Node 4 (Gemini integration)
- nodes/validation_queue.py - Node 5
- nodes/mapping_executor.py - Node 6
- tools/ontology_client.py - PostgreSQL/pgvector client
- tools/embedding_generator.py - Gemini embedding API
- Dockerfile
- requirements.txt - LangGraph, pgvector, Pydantic, etc.

4. Knowledge Layer: Core Ontology

Component: Ontology database (PostgreSQL + pgvector)

Purpose: Store canonical S&OP concepts and learned synonyms

Schema:

-- Core ontology storage
CREATE TABLE ontology_concepts (
  concept_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  tenant_id UUID, -- NULL for global concepts

  -- Concept definition
  concept_name VARCHAR(100) NOT NULL,
  canonical_field VARCHAR(100) NOT NULL, -- DB column name
  domain VARCHAR(50), -- 'sales', 'inventory', 'finance', 'production'
  description TEXT,

  -- Mappings
  synonyms TEXT[], -- Known aliases
  embedding VECTOR(768), -- Gemini embedding for semantic search

  -- Metadata
  data_type VARCHAR(20), -- 'string', 'numeric', 'date', 'boolean'
  is_required BOOLEAN DEFAULT false,
  validation_rules JSONB, -- {min, max, regex, enum_values}

  -- Evolution tracking
  usage_count INTEGER DEFAULT 0,
  last_mapped_at TIMESTAMP,
  created_at TIMESTAMP DEFAULT NOW(),
  updated_at TIMESTAMP DEFAULT NOW(),

  UNIQUE(tenant_id, concept_name)
);

CREATE INDEX idx_ontology_tenant ON ontology_concepts(tenant_id);
CREATE INDEX idx_ontology_domain ON ontology_concepts(domain);
CREATE INDEX idx_ontology_embedding ON ontology_concepts USING ivfflat (embedding vector_cosine_ops);

-- Pending mappings awaiting human review
CREATE TABLE pending_field_mappings (
  mapping_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  tenant_id UUID NOT NULL,

  -- Source field info
  source_field VARCHAR(200) NOT NULL,
  source_type VARCHAR(20),
  sample_data JSONB,

  -- Suggested mapping
  suggested_concept_id UUID,
  llm_confidence INTEGER, -- 0-100
  llm_reasoning TEXT,

  -- Review tracking
  status VARCHAR(20) DEFAULT 'pending', -- 'pending', 'approved', 'rejected'
  reviewed_by UUID,
  reviewed_at TIMESTAMP,
  final_concept_id UUID, -- May differ from suggested

  created_at TIMESTAMP DEFAULT NOW(),

  FOREIGN KEY(tenant_id) REFERENCES tenants(tenant_id),
  FOREIGN KEY(suggested_concept_id) REFERENCES ontology_concepts(concept_id),
  FOREIGN KEY(final_concept_id) REFERENCES ontology_concepts(concept_id)
);

-- ERP version tracking for proactive monitoring
CREATE TABLE erp_version_registry (
  registry_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  erp_system VARCHAR(50) NOT NULL, -- 'SAP_S4HANA', 'ORACLE_EBS', 'ADLER'
  version VARCHAR(50) NOT NULL,
  release_date DATE,

  -- Schema changes in this version
  schema_changes JSONB, -- [{type: 'added', field: 'new_field', description}]

  -- Affected tenants
  affected_tenant_ids UUID[],

  -- Monitoring
  detected_at TIMESTAMP DEFAULT NOW(),
  review_status VARCHAR(20) DEFAULT 'pending',

  UNIQUE(erp_system, version)
);

Initial Seed Data (Standard S&OP Concepts):

-- Products
INSERT INTO ontology_concepts (concept_name, canonical_field, domain, synonyms, data_type, is_required) VALUES
  ('product_sku', 'sku', 'product', ARRAY['SKU', 'Item', 'ItemCode', 'ProductCode', 'Material'], 'string', true),
  ('unit_cost', 'unit_cost', 'product', ARRAY['COGS', 'CostPerUnit', 'StandardCost'], 'numeric', false);

-- Sales
INSERT INTO ontology_concepts (concept_name, canonical_field, domain, synonyms, data_type, is_required) VALUES
  ('quantity_sold', 'quantity_sold', 'sales', ARRAY['SalesQty', 'UnitsSold', 'Volume'], 'numeric', true),
  ('revenue', 'revenue', 'sales', ARRAY['Sales', 'SalesAmount', 'Income'], 'numeric', false);

-- Inventory
INSERT INTO ontology_concepts (concept_name, canonical_field, domain, synonyms, data_type, is_required) VALUES
  ('inventory_level', 'quantity_on_hand', 'inventory', ARRAY['OnHand', 'Stock', 'Available'], 'numeric', false);

-- Locations
INSERT INTO ontology_concepts (concept_name, canonical_field, domain, synonyms, data_type, is_required) VALUES
  ('location_code', 'location_code', 'location', ARRAY['SiteCode', 'PlantCode', 'WarehouseID'], 'string', false);

-- Financial
INSERT INTO ontology_concepts (concept_name, canonical_field, domain, synonyms, data_type, is_required) VALUES
  ('billing_milestone', 'billing_milestone', 'finance', ARRAY['ProjectBilling', 'StageBilling'], 'numeric', false);

Files:

backend/migrations/20251117000001_create_ontology_concepts.cjs - Ontology table
backend/migrations/20251117000002_create_pending_field_mappings.cjs - Review queue
backend/migrations/20251117000003_create_erp_version_registry.cjs - Version tracking
backend/seeds/001_seed_core_ontology.cjs - Initial S&OP concepts
backend/src/dal/OntologyRepository.js - Data access
backend/src/services/OntologyService.js - Business logic

5. Governance Layer: Admin Validation UI

Component: Extended CSV Admin UI (React)

Purpose: Human-in-the-loop review for ambiguous mappings

Features:

Unified View: Show pending mappings from all sources (CSV, REST, GraphQL, SOAP, SQL)
Review Workflow:
- Display: Source field, data type, samples, suggested concept, confidence, LLM reasoning
- Actions: Approve, Correct, Mark as New Concept, Reject
Batch Operations: Select multiple similar fields, apply same action
Learning Feedback: Approvals immediately update ontology, train agent

UI Components:

// frontend/src/pages/Admin/FieldMappingQueue.jsx
<Card>
  <CardHeader>
    <CardTitle>Pending Field Mappings</CardTitle>
    <CardDescription>Review AI-suggested mappings from all integration sources</CardDescription>
  </CardHeader>
  <CardContent>
    <DataTable>
      <Column>Source</Column>
      <Column>Field Name</Column>
      <Column>Type</Column>
      <Column>Samples</Column>
      <Column>Suggested Concept</Column>
      <Column>Confidence</Column>
      <Column>LLM Reasoning</Column>
      <Column>Actions</Column>
    </DataTable>
  </CardContent>
</Card>

API Endpoints:

// backend/src/routes/mappingRoutes.js
router.get('/api/admin/pending-mappings', getMappings);
router.post('/api/admin/approve-mapping', approveMapping);
router.post('/api/admin/reject-mapping', rejectMapping);
router.post('/api/admin/correct-mapping', correctMapping);
router.post('/api/admin/mark-as-new-concept', markAsNewConcept);
router.get('/api/admin/ontology', getOntology);
router.put('/api/admin/ontology/:conceptId', updateConcept);

Files:

frontend/src/pages/Admin/FieldMappingQueue.jsx - Main UI
frontend/src/components/Admin/MappingReviewCard.jsx - Individual mapping card
frontend/src/components/Admin/OntologyManager.jsx - Ontology browser
backend/src/routes/mappingRoutes.js - API routes
backend/src/services/MappingValidationService.js - Approval logic

6. Customer-Facing APIs: GraphQL Integration

Component: GraphQL API for customer writes/subscriptions

Purpose: Modern B2B integration option (mutations + subscriptions)

Schema:

# Customer writes data to ChainAlign
type Mutation {
  syncSalesOrders(orders: [SalesOrderInput!]!): SyncResponse!
  syncInventory(items: [InventoryItemInput!]!): SyncResponse!
  syncProducts(products: [ProductInput!]!): SyncResponse!
  syncForecasts(forecasts: [ForecastInput!]!): SyncResponse!
}

input SalesOrderInput {
  orderNumber: String!
  productSKU: String!
  quantity: Float!
  revenue: Float
  orderDate: DateTime!
  locationCode: String
  # Unknown fields go to custom_attributes for semantic mapping
  customFields: JSON
}

type SyncResponse {
  status: SyncStatus!
  itemsProcessed: Int!
  itemsFailed: Int!
  errors: [SyncError!]
  queueId: ID! # Track async processing
}

# Customer subscribes to real-time requests
type Subscription {
  dataRequested: DataRequest!
  syncStatusUpdate(queueId: ID!): SyncStatusUpdate!
}

type DataRequest {
  requestId: ID!
  dataType: DataType!
  fields: [String!]!
  dateRange: DateRange
}

Example Customer Usage:

// Customer writes via mutation
const result = await client.mutate({
  mutation: gql`
    mutation SyncOrders($orders: [SalesOrderInput!]!) {
      syncSalesOrders(orders: $orders) {
        status
        itemsProcessed
        queueId
      }
    }
  `,
  variables: {
    orders: [
      {
        orderNumber: "SO-12345",
        productSKU: "WIDGET-001",
        quantity: 100,
        revenue: 5000.00,
        orderDate: "2025-11-15T10:30:00Z",
        customFields: {
          proj_stage_bill: 1250.00  // Unknown field → semantic mapper
        }
      }
    ]
  }
});

// Customer subscribes for real-time requests
client.subscribe({
  query: gql`
    subscription OnDataRequested {
      dataRequested {
        requestId
        dataType
        fields
        dateRange { start end }
      }
    }
  `
}).subscribe({
  next(data) {
    console.log('ChainAlign needs fresh data:', data);
    // Customer responds with mutation
  }
});

Files:

backend/src/graphql/integration-schema.graphql - Schema definition
backend/src/graphql/integrationResolvers.js - Mutation/subscription resolvers
backend/src/services/GraphQLIngestionService.js - Business logic
docs/api/graphql-integration-guide.md - Customer documentation

Architecture Summary

Customer ERP → gateway.chainalign.com → Ingestion Orchestrator → pgmq
                                                                    ↓
Admin UI ← pending_field_mappings ← Semantic Mapper Agent (LangGraph)
                                              ↓
                    PostgreSQL ← Execute Mapping (ontology-driven)

Key Architectural Decisions:

Layered Separation:
- Gateway = Infrastructure (auth, routing, DDoS)
- Orchestrator = I/O (protocol parsing, queue writing)
- Mapper = Intelligence (LLM, ontology, mapping logic)
- Governance = Human oversight (UI, approvals)
Protocol Agnosticism:
- All protocols converge to single queue format
- Customer-facing API never changes
- Internal refactoring transparent to customers
Hybrid Mapping Strategy:
- Fast path: Exact match → auto-map (100% confidence)
- Semantic path: pgvector → auto-map if >85% similarity
- LLM path: Gemini reasoning → auto-map if >95% confidence
- Human path: Admin review for <95% confidence
Self-Improving System:
- Every approval adds synonym to ontology
- Future occurrences auto-map immediately
- System accuracy increases over time
Three-Tier Schema Evolution:
- Tier 1: Core ChainAlign schema (rare, deliberate changes)
- Tier 2: Ontology synonyms (frequent, automatic)
- Tier 3: Tenant custom fields (JSONB, per-customer)

Implementation Phases

Phase 1: Foundation (Weeks 1-4)

Goal: Universal ingestion infrastructure operational

Deliverables:

gateway.chainalign.com live with SSL
Ingestion Orchestrator deployed (Cloud Run)
REST/JSON parser working
pgmq ingestion_queue operational
Monitoring/logging configured

Success Criteria:

Customer can POST JSON to gateway
Data lands in queue with metadata
Gateway handles 1000 req/sec
Zero downtime during deployments

Tasks:

GCP API Gateway setup:
- Create SSL certificates
- Configure openapi.yaml
- Deploy to gateway.chainalign.com
- Test with curl/Postman
Build Ingestion Orchestrator:
- Express.js server
- REST parser (JSON)
- Schema extraction logic
- pgmq client integration
- Error handling
Database setup:
- Enable pgmq extension
- Create ingestion_queue
- Configure retention policy
Monitoring:
- Cloud Logging integration
- Alert on queue depth >1000
- Dashboard for ingestion metrics

Files Created:

/infrastructure/gateway-config/
backend/services/ingestion-orchestrator/
Database: ingestion_queue (pgmq)

Phase 2: Semantic Intelligence (Weeks 5-8)

Goal: LangGraph mapper agent processing queue

Deliverables:

Python semantic mapper deployed
LangGraph workflow operational (6 nodes)
Core ontology seeded (50+ S&OP concepts)
pgvector semantic search working
LLM reasoning integrated (Gemini)

Success Criteria:

Known fields auto-map with 100% accuracy
Unknown fields route to pending_field_mappings
LLM provides clear reasoning
Agent processes 100 fields/minute
Cyclic refinement working (retry logic)

Tasks:

Build Python service:
- FastAPI server
- LangGraph state machine (6 nodes)
- PostgreSQL client (psycopg3)
- pgvector queries
Implement nodes:
- Node 1: Schema extraction
- Node 2: Exact match lookup
- Node 3: Semantic search (pgvector)
- Node 4: LLM reasoning (Gemini)
- Node 5: Validation queue writer
- Node 6: Mapping executor
Database migrations:
- ontology_concepts table
- pending_field_mappings table
- erp_version_registry table
- Enable pgvector extension
- Create IVFFlat index
Seed ontology:
- Load 50+ standard S&OP concepts
- Generate embeddings (Gemini)
- Insert with synonyms
LLM integration:
- Gemini API client
- Prompt engineering
- Confidence scoring logic
- Retry/backoff handling

Files Created:

python-services/semantic-mapper-agent/
backend/migrations/2025111700000*.cjs (3 migrations)
backend/seeds/001_seed_core_ontology.cjs

Phase 3: Validation UI & CSV Unification (Weeks 9-10)

Goal: Admin can review/approve mappings, CSV uses same engine

Deliverables:

Admin UI shows pending mappings
Approve/reject/correct workflow
CSV upload refactored to use semantic mapper
Batch operations working

Success Criteria:

Admin reviews 50 fields in 10 minutes
Approvals update ontology immediately
CSV templates stored in ontology
Batch approve works for similar fields

Tasks:

Build Admin UI:
- FieldMappingQueue.jsx page
- MappingReviewCard.jsx component
- OntologyManager.jsx browser
- Integrate with shadcn/ui DataTable
API routes:
- GET /api/admin/pending-mappings
- POST /api/admin/approve-mapping
- POST /api/admin/reject-mapping
- POST /api/admin/correct-mapping
- GET /api/admin/ontology
Services:
- MappingValidationService.js (approval logic)
- OntologyService.js (ontology management)
Repositories:
- OntologyRepository.js
- PendingMappingRepository.js
Refactor CSV:
- Update CSV upload to write to ingestion_queue
- Remove old mapping logic
- Store CSV templates as ontology synonyms

Files Created:

frontend/src/pages/Admin/FieldMappingQueue.jsx
backend/src/routes/mappingRoutes.js
backend/src/services/MappingValidationService.js
backend/src/dal/OntologyRepository.js

Phase 4: GraphQL & Advanced Protocols (Weeks 11-13)

Goal: Multi-protocol ingestion fully operational

Deliverables:

GraphQL mutation API published
GraphQL subscription for real-time events
SOAP parser added to Orchestrator
VPN/SQL adapter working
Customer documentation complete

Success Criteria:

Customers can write GraphQL mutations
Customers can subscribe to events
SOAP integration tested with sample ERP
SQL direct connection proven secure
API docs published with examples

Tasks:

GraphQL schema:
- Define mutation types (SyncSalesOrders, etc.)
- Define subscription types (dataRequested, etc.)
- Input types with customFields support
GraphQL server:
- Extend Orchestrator with Apollo Server
- Schema validation
- Subscription pub/sub (Redis or Pub/Sub)
SOAP adapter:
- WSDL parser
- XML to JSON conversion
- Add to Orchestrator
SQL adapter:
- Read-only connection pool
- Query executor
- Scheduled job runner (Cloud Scheduler)
Documentation:
- GraphQL Playground
- Code examples (JS, Python, Java)
- Auth guide
- Versioning guide

Files Created:

backend/src/graphql/integration-schema.graphql
backend/src/graphql/integrationResolvers.js
backend/services/ingestion-orchestrator/parsers/SOAPParser.js
backend/services/ingestion-orchestrator/adapters/SQLAdapter.js
docs/api/graphql-integration-guide.md

Phase 5: Adler Pilot & Iteration (Weeks 14-16)

Goal: First production ERP integration validated

Deliverables:

Adler ERP connected (whatever protocol they offer)
Data flowing into ChainAlign
Mappings validated with customer
Learnings documented for next integration

Success Criteria:

95%+ fields auto-mapped
Customer satisfied with data quality
Zero manual CSV exports needed
Architecture proven for scale

Tasks:

Adler onboarding:
- Discover their protocol (REST/SOAP/SQL/GraphQL)
- Set up credentials/VPN if needed
- Configure gateway routing for their environment
Initial sync:
- Run semantic mapper on sample data
- Review all pending mappings with admin
- Validate data accuracy against source
Iterative tuning:
- Refine LLM prompts based on Adler-specific fields
- Add industry synonyms to ontology
- Adjust confidence thresholds if needed
Monitoring:
- Set up alerts for Adler integration
- Track success/failure rates
- Monitor data freshness
Documentation:
- Capture Adler-specific patterns
- Update runbook for similar ERPs
- Create case study for sales

Files Created:

docs/integrations/adler-integration-guide.md
docs/integrations/erp-onboarding-runbook.md
backend/seeds/adler_ontology_extensions.cjs (if needed)

Key Technical Decisions

1. LangGraph Over Simple Pipeline

Rationale: Cyclic refinement critical for handling ambiguous mappings. LangGraph enables agent to retry semantic search with refined strategies (decompose field names, use broader terms) before giving up to human review.

2. Hybrid Ontology (SQL + pgvector)

Rationale: Fast exact matching (PostgreSQL) for known fields, semantic similarity (pgvector) for fuzzy matching. No external vector DB needed, leverages existing infrastructure.

3. Protocol-Agnostic Orchestrator

Rationale: Customer-facing API (gateway.chainalign.com) never changes. Internal refactoring (add new parsers, change queue format) invisible to customers. Enables rapid iteration.

4. Three-Tier Schema Evolution

Rationale: Prevents schema bloat while maintaining flexibility:

Core schema: Rarely changes, benefits all customers
Ontology synonyms: Frequent, automatic, no migrations
Tenant JSONB: Per-customer, no pollution

5. Human-in-the-Loop Governance

Rationale: LLM confidence <95% requires human approval to prevent schema pollution. Approvals train the agent, future occurrences auto-map. Balances automation with quality control.

6. Stripe-Style Versioning

Rationale: Dated versions (ChainAlign-Version: 2025-11-17) allow internal refactoring without breaking customers. No v1/v2 URL changes, customers opt into new features via header.

Success Metrics

Technical KPIs

Ingestion Layer:

Uptime: 99.9%
Latency: p99 <500ms (gateway → queue)
Throughput: 1000 req/sec
Queue depth: <1000 messages (steady state)

Semantic Mapper:

Auto-mapping rate: >80% (fields mapped without human review)
LLM accuracy: >90% (when human reviews, LLM was correct)
Processing time: <5 sec/field (including LLM calls)
Retry rate: <10% (cyclic refinement needed)

Ontology Growth:

New synonyms/week: 20-50 (early), 5-10 (mature)
New core concepts/quarter: <5 (deliberate evolution)
Tenant custom fields: <20% of total fields

Business KPIs

Customer Onboarding:

CSV → value: <2 days
API integration → value: <2 weeks
First sync → validated data: <1 week

Customer Satisfaction:

Manual mapping effort: <10% of fields
Data quality issues: <1% of records
Integration reliability: >99.5% uptime

Platform Differentiation:

Supported ERP systems: SAP, Oracle, Adler, Dynamics, +custom
Protocols supported: REST, GraphQL, SOAP, SQL, CSV
Self-service mapping: 80% of fields (target by Q2 2026)

Security & Compliance

Data Protection

Encryption in transit: TLS 1.3 for all external connections
Encryption at rest: GCP default encryption
API credentials: Rotated every 90 days
Tenant isolation: All queries scoped by tenant_id
Audit logging: All mapping approvals logged

Access Control

Admin UI: Requires special role (admin:field_mappings)
API Gateway: OAuth 2.0 token validation
Rate limiting: Per-tenant quotas
PII handling: Sample values truncated/masked in UI, never logged

Compliance

GDPR: Customer owns data, can delete anytime
SOC 2: Change management for ontology updates
Data Processing Agreement: Required for all integrations

Risks & Mitigations

Risk 1: LLM Hallucination

Impact: Incorrect field mappings corrupt data Mitigation:

Confidence threshold (95%) before auto-mapping
Human review for ambiguous cases
Audit log of all LLM reasoning
Rollback capability (revert to JSONB)

Risk 2: Ontology Pollution

Impact: Too many similar concepts, schema bloat Mitigation:

Three-tier schema strategy
Human approval for new core concepts
Quarterly ontology review/cleanup
Usage tracking (unused concepts deprecated)

Risk 3: Customer Schema Drift

Impact: ERP upgrade breaks integration silently Mitigation:

Proactive version monitoring (weekly checks)
Anomaly detection on incoming data
Alerting for new/removed fields
Graceful degradation (unknown fields → JSONB)

Risk 4: Scale (High Volume Tenants)

Impact: Queue depth grows, latency increases Mitigation:

Horizontal scaling (Cloud Run auto-scales)
Batch processing for large syncs
Priority queue (interactive > scheduled)
Circuit breaker (fail fast on overload)

Dependencies

External Services

GCP API Gateway: Gateway infrastructure
Cloud Run: Orchestrator + Mapper hosting
pgmq: Queue management (PostgreSQL extension)
pgvector: Semantic search (PostgreSQL extension)
Gemini API: LLM reasoning + embeddings

Internal Dependencies

Existing CSV UI: Refactor to use semantic mapper
PostgreSQL: Ontology storage, pgvector
Authentication: OAuth tokens for gateway
Notification system: Alert admins on pending mappings

Documentation Deliverables

Technical Docs

✅ docs/architecture/functional-specifications/erp-integration-fsd.md - This FSD
⏳ docs/architecture/integration-gateway-setup.md - Gateway deployment
⏳ docs/architecture/semantic-mapper-design.md - LangGraph workflow
⏳ docs/architecture/ontology-management.md - Schema evolution guide

API Docs

⏳ docs/api/graphql-integration-guide.md - Customer GraphQL API
⏳ docs/api/rest-integration-guide.md - Customer REST API
⏳ docs/api/authentication-guide.md - OAuth setup

Operational Docs

⏳ docs/integrations/erp-onboarding-runbook.md - New customer checklist
⏳ docs/integrations/adler-case-study.md - First pilot learnings
⏳ docs/integrations/troubleshooting.md - Common issues

Admin Docs

⏳ docs/admin/field-mapping-approval-guide.md - UI walkthrough
⏳ docs/admin/ontology-curation.md - When to create new concepts

File Structure Summary

# New directories
/infrastructure/gateway-config/            # GCP API Gateway
backend/services/ingestion-orchestrator/   # Node.js microservice
python-services/semantic-mapper-agent/     # Python/LangGraph

# Backend additions
backend/migrations/
  20251117000001_create_ontology_concepts.cjs
  20251117000002_create_pending_field_mappings.cjs
  20251117000003_create_erp_version_registry.cjs

backend/seeds/
  001_seed_core_ontology.cjs

backend/src/dal/
  OntologyRepository.js
  PendingMappingRepository.js

backend/src/services/
  OntologyService.js
  MappingValidationService.js
  GraphQLIngestionService.js

backend/src/routes/
  mappingRoutes.js

backend/src/graphql/
  integration-schema.graphql
  integrationResolvers.js

# Frontend additions
frontend/src/pages/Admin/
  FieldMappingQueue.jsx
  OntologyManager.jsx

frontend/src/components/Admin/
  MappingReviewCard.jsx
  BatchMappingActions.jsx

# Documentation
docs/architecture/functional-specifications/
  erp-integration-fsd.md (✅ created)

docs/project-management/milestones/
  m79-erp-integration-system.md (✅ this file)

docs/api/
  graphql-integration-guide.md (⏳)

docs/integrations/
  erp-onboarding-runbook.md (⏳)
  adler-case-study.md (⏳)

Total Estimate:

Backend: 8 migrations/seeds, 10 services/repos, 3 route files, 2 GraphQL files
Frontend: 4 new pages/components
Python: 1 new microservice (10+ files)
Infrastructure: 1 API Gateway config
Total New Code: ~5,000 lines (production) + tests

Next Steps

Immediate (Post-Design)

Review FSD with stakeholders
Prioritize Adler pilot timeline
Allocate engineering resources (1 backend, 1 frontend, 1 Python/ML)
Set up Phase 1 infrastructure (GCP API Gateway)

Phase 1 Kickoff (Week 1)

Create infrastructure repos
Set up GCP API Gateway (gateway.chainalign.com)
Scaffold Ingestion Orchestrator (Node.js)
Enable pgmq extension, create queue

Weekly Milestones

Week 4: Phase 1 complete (ingestion working)
Week 8: Phase 2 complete (semantic mapper deployed)
Week 10: Phase 3 complete (admin UI operational)
Week 13: Phase 4 complete (GraphQL + multi-protocol)
Week 16: Phase 5 complete (Adler pilot validated)

Status: 📋 Design Complete, Ready for Phase 1 Strategic Value: Removes #1 barrier to customer adoption (integration complexity) Competitive Advantage: Transforms 6-12 month integration into 2-day onboarding Innovation: LLM-powered semantic mapping + self-improving ontology

Author: Pramod Prasanth Design Date: 2025-11-17 Estimated Completion: Q2 2026 (16 weeks from kickoff)

Executive Summary​

What Will Be Built​

1. External Layer: Integration Gateway​

2. Ingestion Layer: Protocol Orchestrator​

3. Intelligence Layer: Semantic Mapper Agent​

Node 1: Schema Extraction​

Node 2: Exact Match Lookup​

Node 3: Semantic Search​

Node 4: LLM Reasoning​

Node 5: Human Validation Queue​

Node 6: Execute Mapping​

4. Knowledge Layer: Core Ontology​

5. Governance Layer: Admin Validation UI​

6. Customer-Facing APIs: GraphQL Integration​

Architecture Summary​

Implementation Phases​

Phase 1: Foundation (Weeks 1-4)​

Phase 2: Semantic Intelligence (Weeks 5-8)​

Phase 3: Validation UI & CSV Unification (Weeks 9-10)​

Phase 4: GraphQL & Advanced Protocols (Weeks 11-13)​

Phase 5: Adler Pilot & Iteration (Weeks 14-16)​

Key Technical Decisions​

1. LangGraph Over Simple Pipeline​

2. Hybrid Ontology (SQL + pgvector)​

3. Protocol-Agnostic Orchestrator​

4. Three-Tier Schema Evolution​

5. Human-in-the-Loop Governance​

6. Stripe-Style Versioning​

Success Metrics​

Technical KPIs​

Business KPIs​

Security & Compliance​

Data Protection​

Access Control​

Compliance​

Risks & Mitigations​

Risk 1: LLM Hallucination​

Risk 2: Ontology Pollution​

Risk 3: Customer Schema Drift​

Risk 4: Scale (High Volume Tenants)​

Dependencies​

External Services​

Internal Dependencies​

Documentation Deliverables​

Technical Docs​

API Docs​

Operational Docs​

Admin Docs​

File Structure Summary​

Next Steps​

Immediate (Post-Design)​

Phase 1 Kickoff (Week 1)​

Weekly Milestones​

Executive Summary

What Will Be Built

1. External Layer: Integration Gateway

2. Ingestion Layer: Protocol Orchestrator

3. Intelligence Layer: Semantic Mapper Agent

Node 1: Schema Extraction

Node 2: Exact Match Lookup

Node 3: Semantic Search

Node 4: LLM Reasoning

Node 5: Human Validation Queue

Node 6: Execute Mapping

4. Knowledge Layer: Core Ontology

5. Governance Layer: Admin Validation UI

6. Customer-Facing APIs: GraphQL Integration

Architecture Summary

Implementation Phases

Phase 1: Foundation (Weeks 1-4)

Phase 2: Semantic Intelligence (Weeks 5-8)

Phase 3: Validation UI & CSV Unification (Weeks 9-10)

Phase 4: GraphQL & Advanced Protocols (Weeks 11-13)

Phase 5: Adler Pilot & Iteration (Weeks 14-16)

Key Technical Decisions

1. LangGraph Over Simple Pipeline

2. Hybrid Ontology (SQL + pgvector)

3. Protocol-Agnostic Orchestrator

4. Three-Tier Schema Evolution

5. Human-in-the-Loop Governance

6. Stripe-Style Versioning

Success Metrics

Technical KPIs

Business KPIs

Security & Compliance

Data Protection

Access Control

Compliance

Risks & Mitigations

Risk 1: LLM Hallucination

Risk 2: Ontology Pollution

Risk 3: Customer Schema Drift

Risk 4: Scale (High Volume Tenants)

Dependencies

External Services

Internal Dependencies

Documentation Deliverables

Technical Docs

API Docs

Operational Docs

Admin Docs

File Structure Summary

Next Steps

Immediate (Post-Design)

Phase 1 Kickoff (Week 1)

Weekly Milestones