MILESTONE 79 - ERP Integration System
Status: 📋 PLANNED (Design Complete) Design Date: 2025-11-17 Estimated Duration: 16 weeks (4 phases) Strategic Impact: Transform customer integration from 6-12 months to days-weeks
Executive Summary
M79 establishes ChainAlign as an intelligent iPaaS (Integration Platform as a Service) that normalizes disparate customer ERP systems (SAP, Oracle, Adler, Dynamics, custom) to ChainAlign's canonical S&OP data model. This removes the largest barrier to customer adoption: integration complexity.
Key Innovation:
- "Upload CSV today, sync in real-time tomorrow" - gradual sophistication without rebuilding
- LLM-powered semantic mapping - understand customer fields automatically
- Self-improving ontology - every validation makes future integrations smarter
- Agentic automation - minimize human intervention over time
Competitive Differentiation:
- Traditional competitors require customers to build data lakes (18-24 months, millions of dollars)
- ChainAlign assumes the integration burden, enabling 2-day CSV onboarding → 2-week API integration
What Will Be Built
1. External Layer: Integration Gateway
Component: gateway.chainalign.com (GCP API Gateway)
Purpose: Single, stable, version-agnostic endpoint for all customer integrations
Features:
- Multi-protocol support (REST, GraphQL, SOAP, VPN/SQL, CSV)
- Environment routing (dev/staging/prod via subdomain or path)
- Stripe-style dated versioning (
ChainAlign-Version: 2025-11-17header) - Security (OAuth 2.0, rate limiting, DDoS protection)
- Protocol-agnostic (internal refactoring invisible to customers)
Files:
/infrastructure/gateway-config/openapi.yaml- API Gateway configuration/infrastructure/gateway-config/api-gateway-deploy.sh- Deployment script
2. Ingestion Layer: Protocol Orchestrator
Component: Ingestion Orchestrator (Node.js/Cloud Run)
Purpose: Parse all protocols into canonical queue format
Responsibilities:
- Protocol Detection: Read
Content-Typeheader, route to appropriate parserapplication/json→ REST parserapplication/graphql→ GraphQL parserapplication/soap+xml→ SOAP parser
- Schema Extraction: Identify field names, types, sample values
- Normalization: Convert all formats to standard queue message
- Queue Writing: Write to pgmq
ingestion_queue
Message Format:
{
source: 'adler_erp',
tenant_id: 'uuid',
protocol: 'REST' | 'GraphQL' | 'SOAP' | 'SQL' | 'CSV',
raw_data: {...}, // Original payload
schema_hints: {
fields: [
{name: 'proj_stage_bill', type: 'numeric', samples: [1250, 3400]}
]
},
version: '2025-11-17',
timestamp: '2025-11-17T10:30:00Z'
}
Files:
backend/services/ingestion-orchestrator/- New microserviceserver.js- Express serverparsers/RESTParser.js- REST/JSON parserparsers/GraphQLParser.js- GraphQL parserparsers/SOAPParser.js- SOAP/XML parserparsers/SQLParser.js- Database query result parserutils/schemaExtractor.js- Field detection logicutils/queueWriter.js- pgmq integrationDockerfile- Cloud Run deploymentpackage.json
3. Intelligence Layer: Semantic Mapper Agent
Component: Semantic Mapper Agent (Python/LangGraph/Cloud Run)
Purpose: Intelligently map customer fields to ChainAlign canonical schema
Architecture: LangGraph cyclic workflow with 6 nodes:
Node 1: Schema Extraction
Input: Queue message with raw data Output: Structured field analysis Logic:
- Parse field names from JSON/XML/CSV
- Infer data types (string, numeric, date, boolean)
- Collect sample values (first 5-10 records)
- Generate fingerprints (e.g., "10-15 digits, prefix 'INV-'" → invoice_number)
Node 2: Exact Match Lookup
Input: Field name Output: 100% confidence mapping OR null Logic:
SELECT concept_id, canonical_field
FROM ontology_concepts
WHERE tenant_id = $1
AND (concept_name = $2 OR $2 = ANY(synonyms))
Success Path: Auto-map → Node 6 (Execute Mapping)
Node 3: Semantic Search
Input: Unknown field Output: Top 3 similar concepts with scores Logic:
- Generate embedding (Gemini embedding-001)
- Query pgvector for nearest neighbors
- Calculate cosine similarity
- If similarity > 0.85: High-confidence match → Node 6
- If 0.60-0.85: Medium confidence → Node 4 (LLM)
- If < 0.60: Low confidence → Node 4 (LLM)
SELECT concept_id, canonical_field,
embedding <=> $1 AS distance
FROM ontology_concepts
WHERE tenant_id = $2
ORDER BY embedding <=> $1
LIMIT 3
Node 4: LLM Reasoning
Input: Field + top 3 semantic matches Output: Confidence score (0-100) + reasoning Prompt:
You are a data integration specialist for S&OP systems.
Context:
- Customer field: "proj_stage_bill"
- Data type: numeric
- Sample values: [1250.00, 3400.50, 890.25]
- Customer industry: Manufacturing
Top 3 ontology matches:
1. billing_milestone - Project stage-based billing (similarity: 0.72)
2. revenue - Total sales revenue (similarity: 0.68)
3. stage_payment - Payment by delivery stage (similarity: 0.65)
Question: Is this customer field semantically equivalent to one of these concepts?
Respond in JSON:
{
"is_match": true/false,
"matched_concept_id": "uuid or null",
"confidence": 0-100,
"reasoning": "Explain your decision"
}
Decision Tree:
- Confidence > 95%: Auto-map → Node 6
- Confidence 60-95%: Human validation → Node 5
- Confidence < 60%: Retry with refined query → Loop back to Node 3 (max 2 cycles)
Refinement Strategies:
- Decompose compound names: "proj_stage_bill" → ["project", "stage", "billing"]
- Use broader domain terms: "financial" instead of "billing"
- Query industry-specific synonyms
Node 5: Human Validation Queue
Input: Low-confidence mapping Output: Pending review record Logic:
INSERT INTO pending_field_mappings (
tenant_id, source_field, suggested_concept_id,
llm_confidence, llm_reasoning, sample_data, status
) VALUES ($1, $2, $3, $4, $5, $6, 'pending')
- Workflow exits, waits for human approval
- Admin reviews in UI, makes decision
- On approval, workflow resumes → Node 6
Node 6: Execute Mapping
Input: Validated mapping decision Output: Data written to PostgreSQL Actions:
Case 1: Synonym (most common)
-- Add synonym to ontology
UPDATE ontology_concepts
SET synonyms = array_append(synonyms, $1)
WHERE concept_id = $2;
-- Move data from JSONB to proper column
UPDATE core_entities
SET billing_milestone = (custom_attributes->>'proj_stage_bill')::numeric
WHERE tenant_id = $1;
Case 2: New core concept (rare, requires approval)
// Generate Knex migration
const migration = `
exports.up = function(knex) {
return knex.schema.table('core_entities', (table) => {
table.decimal('carbon_footprint_per_unit', 15, 4);
});
};
`;
// Add to ontology
await ontologyRepository.create({
concept_name: 'carbon_footprint_per_unit',
canonical_field: 'carbon_footprint_per_unit',
domain: 'sustainability'
});
Case 3: Tenant-specific custom field
-- Stays in JSONB permanently
CREATE INDEX idx_custom_tenant_field
ON core_entities USING GIN ((custom_attributes->'tenant_field'))
WHERE tenant_id = $1;
Files:
python-services/semantic-mapper-agent/- New Python servicemain.py- FastAPI serverlanggraph_workflow.py- LangGraph state machinenodes/schema_extractor.py- Node 1nodes/exact_matcher.py- Node 2nodes/semantic_searcher.py- Node 3nodes/llm_reasoner.py- Node 4 (Gemini integration)nodes/validation_queue.py- Node 5nodes/mapping_executor.py- Node 6tools/ontology_client.py- PostgreSQL/pgvector clienttools/embedding_generator.py- Gemini embedding APIDockerfilerequirements.txt- LangGraph, pgvector, Pydantic, etc.
4. Knowledge Layer: Core Ontology
Component: Ontology database (PostgreSQL + pgvector)
Purpose: Store canonical S&OP concepts and learned synonyms
Schema:
-- Core ontology storage
CREATE TABLE ontology_concepts (
concept_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
tenant_id UUID, -- NULL for global concepts
-- Concept definition
concept_name VARCHAR(100) NOT NULL,
canonical_field VARCHAR(100) NOT NULL, -- DB column name
domain VARCHAR(50), -- 'sales', 'inventory', 'finance', 'production'
description TEXT,
-- Mappings
synonyms TEXT[], -- Known aliases
embedding VECTOR(768), -- Gemini embedding for semantic search
-- Metadata
data_type VARCHAR(20), -- 'string', 'numeric', 'date', 'boolean'
is_required BOOLEAN DEFAULT false,
validation_rules JSONB, -- {min, max, regex, enum_values}
-- Evolution tracking
usage_count INTEGER DEFAULT 0,
last_mapped_at TIMESTAMP,
created_at TIMESTAMP DEFAULT NOW(),
updated_at TIMESTAMP DEFAULT NOW(),
UNIQUE(tenant_id, concept_name)
);
CREATE INDEX idx_ontology_tenant ON ontology_concepts(tenant_id);
CREATE INDEX idx_ontology_domain ON ontology_concepts(domain);
CREATE INDEX idx_ontology_embedding ON ontology_concepts USING ivfflat (embedding vector_cosine_ops);
-- Pending mappings awaiting human review
CREATE TABLE pending_field_mappings (
mapping_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
tenant_id UUID NOT NULL,
-- Source field info
source_field VARCHAR(200) NOT NULL,
source_type VARCHAR(20),
sample_data JSONB,
-- Suggested mapping
suggested_concept_id UUID,
llm_confidence INTEGER, -- 0-100
llm_reasoning TEXT,
-- Review tracking
status VARCHAR(20) DEFAULT 'pending', -- 'pending', 'approved', 'rejected'
reviewed_by UUID,
reviewed_at TIMESTAMP,
final_concept_id UUID, -- May differ from suggested
created_at TIMESTAMP DEFAULT NOW(),
FOREIGN KEY(tenant_id) REFERENCES tenants(tenant_id),
FOREIGN KEY(suggested_concept_id) REFERENCES ontology_concepts(concept_id),
FOREIGN KEY(final_concept_id) REFERENCES ontology_concepts(concept_id)
);
-- ERP version tracking for proactive monitoring
CREATE TABLE erp_version_registry (
registry_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
erp_system VARCHAR(50) NOT NULL, -- 'SAP_S4HANA', 'ORACLE_EBS', 'ADLER'
version VARCHAR(50) NOT NULL,
release_date DATE,
-- Schema changes in this version
schema_changes JSONB, -- [{type: 'added', field: 'new_field', description}]
-- Affected tenants
affected_tenant_ids UUID[],
-- Monitoring
detected_at TIMESTAMP DEFAULT NOW(),
review_status VARCHAR(20) DEFAULT 'pending',
UNIQUE(erp_system, version)
);
Initial Seed Data (Standard S&OP Concepts):
-- Products
INSERT INTO ontology_concepts (concept_name, canonical_field, domain, synonyms, data_type, is_required) VALUES
('product_sku', 'sku', 'product', ARRAY['SKU', 'Item', 'ItemCode', 'ProductCode', 'Material'], 'string', true),
('unit_cost', 'unit_cost', 'product', ARRAY['COGS', 'CostPerUnit', 'StandardCost'], 'numeric', false);
-- Sales
INSERT INTO ontology_concepts (concept_name, canonical_field, domain, synonyms, data_type, is_required) VALUES
('quantity_sold', 'quantity_sold', 'sales', ARRAY['SalesQty', 'UnitsSold', 'Volume'], 'numeric', true),
('revenue', 'revenue', 'sales', ARRAY['Sales', 'SalesAmount', 'Income'], 'numeric', false);
-- Inventory
INSERT INTO ontology_concepts (concept_name, canonical_field, domain, synonyms, data_type, is_required) VALUES
('inventory_level', 'quantity_on_hand', 'inventory', ARRAY['OnHand', 'Stock', 'Available'], 'numeric', false);
-- Locations
INSERT INTO ontology_concepts (concept_name, canonical_field, domain, synonyms, data_type, is_required) VALUES
('location_code', 'location_code', 'location', ARRAY['SiteCode', 'PlantCode', 'WarehouseID'], 'string', false);
-- Financial
INSERT INTO ontology_concepts (concept_name, canonical_field, domain, synonyms, data_type, is_required) VALUES
('billing_milestone', 'billing_milestone', 'finance', ARRAY['ProjectBilling', 'StageBilling'], 'numeric', false);
Files:
backend/migrations/20251117000001_create_ontology_concepts.cjs- Ontology tablebackend/migrations/20251117000002_create_pending_field_mappings.cjs- Review queuebackend/migrations/20251117000003_create_erp_version_registry.cjs- Version trackingbackend/seeds/001_seed_core_ontology.cjs- Initial S&OP conceptsbackend/src/dal/OntologyRepository.js- Data accessbackend/src/services/OntologyService.js- Business logic
5. Governance Layer: Admin Validation UI
Component: Extended CSV Admin UI (React)
Purpose: Human-in-the-loop review for ambiguous mappings
Features:
- Unified View: Show pending mappings from all sources (CSV, REST, GraphQL, SOAP, SQL)
- Review Workflow:
- Display: Source field, data type, samples, suggested concept, confidence, LLM reasoning
- Actions: Approve, Correct, Mark as New Concept, Reject
- Batch Operations: Select multiple similar fields, apply same action
- Learning Feedback: Approvals immediately update ontology, train agent
UI Components:
// frontend/src/pages/Admin/FieldMappingQueue.jsx
<Card>
<CardHeader>
<CardTitle>Pending Field Mappings</CardTitle>
<CardDescription>Review AI-suggested mappings from all integration sources</CardDescription>
</CardHeader>
<CardContent>
<DataTable>
<Column>Source</Column>
<Column>Field Name</Column>
<Column>Type</Column>
<Column>Samples</Column>
<Column>Suggested Concept</Column>
<Column>Confidence</Column>
<Column>LLM Reasoning</Column>
<Column>Actions</Column>
</DataTable>
</CardContent>
</Card>
API Endpoints:
// backend/src/routes/mappingRoutes.js
router.get('/api/admin/pending-mappings', getMappings);
router.post('/api/admin/approve-mapping', approveMapping);
router.post('/api/admin/reject-mapping', rejectMapping);
router.post('/api/admin/correct-mapping', correctMapping);
router.post('/api/admin/mark-as-new-concept', markAsNewConcept);
router.get('/api/admin/ontology', getOntology);
router.put('/api/admin/ontology/:conceptId', updateConcept);
Files:
frontend/src/pages/Admin/FieldMappingQueue.jsx- Main UIfrontend/src/components/Admin/MappingReviewCard.jsx- Individual mapping cardfrontend/src/components/Admin/OntologyManager.jsx- Ontology browserbackend/src/routes/mappingRoutes.js- API routesbackend/src/services/MappingValidationService.js- Approval logic
6. Customer-Facing APIs: GraphQL Integration
Component: GraphQL API for customer writes/subscriptions
Purpose: Modern B2B integration option (mutations + subscriptions)
Schema:
# Customer writes data to ChainAlign
type Mutation {
syncSalesOrders(orders: [SalesOrderInput!]!): SyncResponse!
syncInventory(items: [InventoryItemInput!]!): SyncResponse!
syncProducts(products: [ProductInput!]!): SyncResponse!
syncForecasts(forecasts: [ForecastInput!]!): SyncResponse!
}
input SalesOrderInput {
orderNumber: String!
productSKU: String!
quantity: Float!
revenue: Float
orderDate: DateTime!
locationCode: String
# Unknown fields go to custom_attributes for semantic mapping
customFields: JSON
}
type SyncResponse {
status: SyncStatus!
itemsProcessed: Int!
itemsFailed: Int!
errors: [SyncError!]
queueId: ID! # Track async processing
}
# Customer subscribes to real-time requests
type Subscription {
dataRequested: DataRequest!
syncStatusUpdate(queueId: ID!): SyncStatusUpdate!
}
type DataRequest {
requestId: ID!
dataType: DataType!
fields: [String!]!
dateRange: DateRange
}
Example Customer Usage:
// Customer writes via mutation
const result = await client.mutate({
mutation: gql`
mutation SyncOrders($orders: [SalesOrderInput!]!) {
syncSalesOrders(orders: $orders) {
status
itemsProcessed
queueId
}
}
`,
variables: {
orders: [
{
orderNumber: "SO-12345",
productSKU: "WIDGET-001",
quantity: 100,
revenue: 5000.00,
orderDate: "2025-11-15T10:30:00Z",
customFields: {
proj_stage_bill: 1250.00 // Unknown field → semantic mapper
}
}
]
}
});
// Customer subscribes for real-time requests
client.subscribe({
query: gql`
subscription OnDataRequested {
dataRequested {
requestId
dataType
fields
dateRange { start end }
}
}
`
}).subscribe({
next(data) {
console.log('ChainAlign needs fresh data:', data);
// Customer responds with mutation
}
});
Files:
backend/src/graphql/integration-schema.graphql- Schema definitionbackend/src/graphql/integrationResolvers.js- Mutation/subscription resolversbackend/src/services/GraphQLIngestionService.js- Business logicdocs/api/graphql-integration-guide.md- Customer documentation
Architecture Summary
Customer ERP → gateway.chainalign.com → Ingestion Orchestrator → pgmq
↓
Admin UI ← pending_field_mappings ← Semantic Mapper Agent (LangGraph)
↓
PostgreSQL ← Execute Mapping (ontology-driven)
Key Architectural Decisions:
-
Layered Separation:
- Gateway = Infrastructure (auth, routing, DDoS)
- Orchestrator = I/O (protocol parsing, queue writing)
- Mapper = Intelligence (LLM, ontology, mapping logic)
- Governance = Human oversight (UI, approvals)
-
Protocol Agnosticism:
- All protocols converge to single queue format
- Customer-facing API never changes
- Internal refactoring transparent to customers
-
Hybrid Mapping Strategy:
- Fast path: Exact match → auto-map (100% confidence)
- Semantic path: pgvector → auto-map if >85% similarity
- LLM path: Gemini reasoning → auto-map if >95% confidence
- Human path: Admin review for <95% confidence
-
Self-Improving System:
- Every approval adds synonym to ontology
- Future occurrences auto-map immediately
- System accuracy increases over time
-
Three-Tier Schema Evolution:
- Tier 1: Core ChainAlign schema (rare, deliberate changes)
- Tier 2: Ontology synonyms (frequent, automatic)
- Tier 3: Tenant custom fields (JSONB, per-customer)
Implementation Phases
Phase 1: Foundation (Weeks 1-4)
Goal: Universal ingestion infrastructure operational
Deliverables:
gateway.chainalign.comlive with SSL- Ingestion Orchestrator deployed (Cloud Run)
- REST/JSON parser working
- pgmq
ingestion_queueoperational - Monitoring/logging configured
Success Criteria:
- Customer can POST JSON to gateway
- Data lands in queue with metadata
- Gateway handles 1000 req/sec
- Zero downtime during deployments
Tasks:
-
GCP API Gateway setup:
- Create SSL certificates
- Configure openapi.yaml
- Deploy to gateway.chainalign.com
- Test with curl/Postman
-
Build Ingestion Orchestrator:
- Express.js server
- REST parser (JSON)
- Schema extraction logic
- pgmq client integration
- Error handling
-
Database setup:
- Enable pgmq extension
- Create
ingestion_queue - Configure retention policy
-
Monitoring:
- Cloud Logging integration
- Alert on queue depth >1000
- Dashboard for ingestion metrics
Files Created:
/infrastructure/gateway-config/backend/services/ingestion-orchestrator/- Database:
ingestion_queue(pgmq)
Phase 2: Semantic Intelligence (Weeks 5-8)
Goal: LangGraph mapper agent processing queue
Deliverables:
- Python semantic mapper deployed
- LangGraph workflow operational (6 nodes)
- Core ontology seeded (50+ S&OP concepts)
- pgvector semantic search working
- LLM reasoning integrated (Gemini)
Success Criteria:
- Known fields auto-map with 100% accuracy
- Unknown fields route to pending_field_mappings
- LLM provides clear reasoning
- Agent processes 100 fields/minute
- Cyclic refinement working (retry logic)
Tasks:
-
Build Python service:
- FastAPI server
- LangGraph state machine (6 nodes)
- PostgreSQL client (psycopg3)
- pgvector queries
-
Implement nodes:
- Node 1: Schema extraction
- Node 2: Exact match lookup
- Node 3: Semantic search (pgvector)
- Node 4: LLM reasoning (Gemini)
- Node 5: Validation queue writer
- Node 6: Mapping executor
-
Database migrations:
ontology_conceptstablepending_field_mappingstableerp_version_registrytable- Enable pgvector extension
- Create IVFFlat index
-
Seed ontology:
- Load 50+ standard S&OP concepts
- Generate embeddings (Gemini)
- Insert with synonyms
-
LLM integration:
- Gemini API client
- Prompt engineering
- Confidence scoring logic
- Retry/backoff handling
Files Created:
python-services/semantic-mapper-agent/backend/migrations/2025111700000*.cjs(3 migrations)backend/seeds/001_seed_core_ontology.cjs
Phase 3: Validation UI & CSV Unification (Weeks 9-10)
Goal: Admin can review/approve mappings, CSV uses same engine
Deliverables:
- Admin UI shows pending mappings
- Approve/reject/correct workflow
- CSV upload refactored to use semantic mapper
- Batch operations working
Success Criteria:
- Admin reviews 50 fields in 10 minutes
- Approvals update ontology immediately
- CSV templates stored in ontology
- Batch approve works for similar fields
Tasks:
-
Build Admin UI:
FieldMappingQueue.jsxpageMappingReviewCard.jsxcomponentOntologyManager.jsxbrowser- Integrate with shadcn/ui DataTable
-
API routes:
GET /api/admin/pending-mappingsPOST /api/admin/approve-mappingPOST /api/admin/reject-mappingPOST /api/admin/correct-mappingGET /api/admin/ontology
-
Services:
MappingValidationService.js(approval logic)OntologyService.js(ontology management)
-
Repositories:
OntologyRepository.jsPendingMappingRepository.js
-
Refactor CSV:
- Update CSV upload to write to
ingestion_queue - Remove old mapping logic
- Store CSV templates as ontology synonyms
- Update CSV upload to write to
Files Created:
frontend/src/pages/Admin/FieldMappingQueue.jsxbackend/src/routes/mappingRoutes.jsbackend/src/services/MappingValidationService.jsbackend/src/dal/OntologyRepository.js
Phase 4: GraphQL & Advanced Protocols (Weeks 11-13)
Goal: Multi-protocol ingestion fully operational
Deliverables:
- GraphQL mutation API published
- GraphQL subscription for real-time events
- SOAP parser added to Orchestrator
- VPN/SQL adapter working
- Customer documentation complete
Success Criteria:
- Customers can write GraphQL mutations
- Customers can subscribe to events
- SOAP integration tested with sample ERP
- SQL direct connection proven secure
- API docs published with examples
Tasks:
-
GraphQL schema:
- Define mutation types (SyncSalesOrders, etc.)
- Define subscription types (dataRequested, etc.)
- Input types with customFields support
-
GraphQL server:
- Extend Orchestrator with Apollo Server
- Schema validation
- Subscription pub/sub (Redis or Pub/Sub)
-
SOAP adapter:
- WSDL parser
- XML to JSON conversion
- Add to Orchestrator
-
SQL adapter:
- Read-only connection pool
- Query executor
- Scheduled job runner (Cloud Scheduler)
-
Documentation:
- GraphQL Playground
- Code examples (JS, Python, Java)
- Auth guide
- Versioning guide
Files Created:
backend/src/graphql/integration-schema.graphqlbackend/src/graphql/integrationResolvers.jsbackend/services/ingestion-orchestrator/parsers/SOAPParser.jsbackend/services/ingestion-orchestrator/adapters/SQLAdapter.jsdocs/api/graphql-integration-guide.md
Phase 5: Adler Pilot & Iteration (Weeks 14-16)
Goal: First production ERP integration validated
Deliverables:
- Adler ERP connected (whatever protocol they offer)
- Data flowing into ChainAlign
- Mappings validated with customer
- Learnings documented for next integration
Success Criteria:
- 95%+ fields auto-mapped
- Customer satisfied with data quality
- Zero manual CSV exports needed
- Architecture proven for scale
Tasks:
-
Adler onboarding:
- Discover their protocol (REST/SOAP/SQL/GraphQL)
- Set up credentials/VPN if needed
- Configure gateway routing for their environment
-
Initial sync:
- Run semantic mapper on sample data
- Review all pending mappings with admin
- Validate data accuracy against source
-
Iterative tuning:
- Refine LLM prompts based on Adler-specific fields
- Add industry synonyms to ontology
- Adjust confidence thresholds if needed
-
Monitoring:
- Set up alerts for Adler integration
- Track success/failure rates
- Monitor data freshness
-
Documentation:
- Capture Adler-specific patterns
- Update runbook for similar ERPs
- Create case study for sales
Files Created:
docs/integrations/adler-integration-guide.mddocs/integrations/erp-onboarding-runbook.mdbackend/seeds/adler_ontology_extensions.cjs(if needed)
Key Technical Decisions
1. LangGraph Over Simple Pipeline
Rationale: Cyclic refinement critical for handling ambiguous mappings. LangGraph enables agent to retry semantic search with refined strategies (decompose field names, use broader terms) before giving up to human review.
2. Hybrid Ontology (SQL + pgvector)
Rationale: Fast exact matching (PostgreSQL) for known fields, semantic similarity (pgvector) for fuzzy matching. No external vector DB needed, leverages existing infrastructure.
3. Protocol-Agnostic Orchestrator
Rationale: Customer-facing API (gateway.chainalign.com) never changes. Internal refactoring (add new parsers, change queue format) invisible to customers. Enables rapid iteration.
4. Three-Tier Schema Evolution
Rationale: Prevents schema bloat while maintaining flexibility:
- Core schema: Rarely changes, benefits all customers
- Ontology synonyms: Frequent, automatic, no migrations
- Tenant JSONB: Per-customer, no pollution
5. Human-in-the-Loop Governance
Rationale: LLM confidence <95% requires human approval to prevent schema pollution. Approvals train the agent, future occurrences auto-map. Balances automation with quality control.
6. Stripe-Style Versioning
Rationale: Dated versions (ChainAlign-Version: 2025-11-17) allow internal refactoring without breaking customers. No v1/v2 URL changes, customers opt into new features via header.
Success Metrics
Technical KPIs
Ingestion Layer:
- Uptime: 99.9%
- Latency: p99 <500ms (gateway → queue)
- Throughput: 1000 req/sec
- Queue depth: <1000 messages (steady state)
Semantic Mapper:
- Auto-mapping rate: >80% (fields mapped without human review)
- LLM accuracy: >90% (when human reviews, LLM was correct)
- Processing time: <5 sec/field (including LLM calls)
- Retry rate: <10% (cyclic refinement needed)
Ontology Growth:
- New synonyms/week: 20-50 (early), 5-10 (mature)
- New core concepts/quarter: <5 (deliberate evolution)
- Tenant custom fields: <20% of total fields
Business KPIs
Customer Onboarding:
- CSV → value: <2 days
- API integration → value: <2 weeks
- First sync → validated data: <1 week
Customer Satisfaction:
- Manual mapping effort: <10% of fields
- Data quality issues: <1% of records
- Integration reliability: >99.5% uptime
Platform Differentiation:
- Supported ERP systems: SAP, Oracle, Adler, Dynamics, +custom
- Protocols supported: REST, GraphQL, SOAP, SQL, CSV
- Self-service mapping: 80% of fields (target by Q2 2026)
Security & Compliance
Data Protection
- Encryption in transit: TLS 1.3 for all external connections
- Encryption at rest: GCP default encryption
- API credentials: Rotated every 90 days
- Tenant isolation: All queries scoped by tenant_id
- Audit logging: All mapping approvals logged
Access Control
- Admin UI: Requires special role (
admin:field_mappings) - API Gateway: OAuth 2.0 token validation
- Rate limiting: Per-tenant quotas
- PII handling: Sample values truncated/masked in UI, never logged
Compliance
- GDPR: Customer owns data, can delete anytime
- SOC 2: Change management for ontology updates
- Data Processing Agreement: Required for all integrations
Risks & Mitigations
Risk 1: LLM Hallucination
Impact: Incorrect field mappings corrupt data Mitigation:
- Confidence threshold (95%) before auto-mapping
- Human review for ambiguous cases
- Audit log of all LLM reasoning
- Rollback capability (revert to JSONB)
Risk 2: Ontology Pollution
Impact: Too many similar concepts, schema bloat Mitigation:
- Three-tier schema strategy
- Human approval for new core concepts
- Quarterly ontology review/cleanup
- Usage tracking (unused concepts deprecated)
Risk 3: Customer Schema Drift
Impact: ERP upgrade breaks integration silently Mitigation:
- Proactive version monitoring (weekly checks)
- Anomaly detection on incoming data
- Alerting for new/removed fields
- Graceful degradation (unknown fields → JSONB)
Risk 4: Scale (High Volume Tenants)
Impact: Queue depth grows, latency increases Mitigation:
- Horizontal scaling (Cloud Run auto-scales)
- Batch processing for large syncs
- Priority queue (interactive > scheduled)
- Circuit breaker (fail fast on overload)
Dependencies
External Services
- GCP API Gateway: Gateway infrastructure
- Cloud Run: Orchestrator + Mapper hosting
- pgmq: Queue management (PostgreSQL extension)
- pgvector: Semantic search (PostgreSQL extension)
- Gemini API: LLM reasoning + embeddings
Internal Dependencies
- Existing CSV UI: Refactor to use semantic mapper
- PostgreSQL: Ontology storage, pgvector
- Authentication: OAuth tokens for gateway
- Notification system: Alert admins on pending mappings
Documentation Deliverables
Technical Docs
- ✅
docs/architecture/functional-specifications/erp-integration-fsd.md- This FSD - ⏳
docs/architecture/integration-gateway-setup.md- Gateway deployment - ⏳
docs/architecture/semantic-mapper-design.md- LangGraph workflow - ⏳
docs/architecture/ontology-management.md- Schema evolution guide
API Docs
- ⏳
docs/api/graphql-integration-guide.md- Customer GraphQL API - ⏳
docs/api/rest-integration-guide.md- Customer REST API - ⏳
docs/api/authentication-guide.md- OAuth setup
Operational Docs
- ⏳
docs/integrations/erp-onboarding-runbook.md- New customer checklist - ⏳
docs/integrations/adler-case-study.md- First pilot learnings - ⏳
docs/integrations/troubleshooting.md- Common issues
Admin Docs
- ⏳
docs/admin/field-mapping-approval-guide.md- UI walkthrough - ⏳
docs/admin/ontology-curation.md- When to create new concepts
File Structure Summary
# New directories
/infrastructure/gateway-config/ # GCP API Gateway
backend/services/ingestion-orchestrator/ # Node.js microservice
python-services/semantic-mapper-agent/ # Python/LangGraph
# Backend additions
backend/migrations/
20251117000001_create_ontology_concepts.cjs
20251117000002_create_pending_field_mappings.cjs
20251117000003_create_erp_version_registry.cjs
backend/seeds/
001_seed_core_ontology.cjs
backend/src/dal/
OntologyRepository.js
PendingMappingRepository.js
backend/src/services/
OntologyService.js
MappingValidationService.js
GraphQLIngestionService.js
backend/src/routes/
mappingRoutes.js
backend/src/graphql/
integration-schema.graphql
integrationResolvers.js
# Frontend additions
frontend/src/pages/Admin/
FieldMappingQueue.jsx
OntologyManager.jsx
frontend/src/components/Admin/
MappingReviewCard.jsx
BatchMappingActions.jsx
# Documentation
docs/architecture/functional-specifications/
erp-integration-fsd.md (✅ created)
docs/project-management/milestones/
m79-erp-integration-system.md (✅ this file)
docs/api/
graphql-integration-guide.md (⏳)
docs/integrations/
erp-onboarding-runbook.md (⏳)
adler-case-study.md (⏳)
Total Estimate:
- Backend: 8 migrations/seeds, 10 services/repos, 3 route files, 2 GraphQL files
- Frontend: 4 new pages/components
- Python: 1 new microservice (10+ files)
- Infrastructure: 1 API Gateway config
- Total New Code: ~5,000 lines (production) + tests
Next Steps
Immediate (Post-Design)
- Review FSD with stakeholders
- Prioritize Adler pilot timeline
- Allocate engineering resources (1 backend, 1 frontend, 1 Python/ML)
- Set up Phase 1 infrastructure (GCP API Gateway)
Phase 1 Kickoff (Week 1)
- Create infrastructure repos
- Set up GCP API Gateway (
gateway.chainalign.com) - Scaffold Ingestion Orchestrator (Node.js)
- Enable pgmq extension, create queue
Weekly Milestones
- Week 4: Phase 1 complete (ingestion working)
- Week 8: Phase 2 complete (semantic mapper deployed)
- Week 10: Phase 3 complete (admin UI operational)
- Week 13: Phase 4 complete (GraphQL + multi-protocol)
- Week 16: Phase 5 complete (Adler pilot validated)
Status: 📋 Design Complete, Ready for Phase 1 Strategic Value: Removes #1 barrier to customer adoption (integration complexity) Competitive Advantage: Transforms 6-12 month integration into 2-day onboarding Innovation: LLM-powered semantic mapping + self-improving ontology
Author: Pramod Prasanth Design Date: 2025-11-17 Estimated Completion: Q2 2026 (16 weeks from kickoff)