VN2 Dataset Integration Guide
Overview
This guide explains how to map the VN2-competition-datasets into ChainAlign's optimized sales table.
What Exists
Infrastructure
- ✅ Retail Tenant: MegaRetail Inc. (seed: `002_add_megaretail_tenant.cjs`)
- ✅ Locations Table: Stores dimension (`locations`)
- ✅ Products Table: Core entities (`core_entities`, `entity_type='PRODUCT'`)
- ✅ Sales Table: Optimized fact table (`sales`) - NEW
- ✅ Import Script: VN2 data importer (`backend/scripts/import_vn2_dataset.js`)
Optimized Sales Table Schema
The sales table is designed for time-series forecasting, financial auditing, and Judgment Engine retrospection:
Core Dimensions
| Field | Type | Purpose |
|---|---|---|
| sale_id | UUID | Primary key |
| tenant_id | UUID | Multi-tenancy isolation |
| sale_date | TIMESTAMPTZ | Critical for forecasting - indexed for time-series queries |
| location_id | UUID | Store dimension (FK to locations) |
| entity_id | UUID | Product dimension (FK to core_entities) |
Financial Facts
| Field | Type | Purpose |
|---|---|---|
| quantity_sold | NUMERIC(15,4) | Core forecasting metric - the demand signal |
| revenue | NUMERIC(19,4) | Total selling price (financial impact) |
| cost_of_goods_sold | NUMERIC(19,4) | Direct cost (gross margin calculation) |
| unit_price | NUMERIC(19,4) | Price at time of sale (scenario backtesting) |
Feature Engineering (ML Ready)
| Field | Type | Purpose |
|---|---|---|
| promo_flag | BOOLEAN | External covariate for forecasting models |
| stockout_flag | BOOLEAN | Demand censoring indicator (data cleaning) |
| event_id | UUID | External event context (FK, nullable) |
| s_and_op_cycle_id | UUID | Links to S&OP cycles for Judgment Engine retrospection |
Metadata
| Field | Type | Purpose |
|---|---|---|
| custom_attributes | JSONB | Flexible domain extensions |
| notes | TEXT | Audit trail |
| created_at | TIMESTAMP | Record creation timestamp |
| updated_at | TIMESTAMP | Last-update timestamp |
Indexes
- `idx_sales_tenant_loc_ent_date` - Core time-series queries
- `idx_sales_tenant_date` - Tenant-level trend analysis
- `idx_sales_loc_ent_date` - Location-product analysis
- `idx_sales_date` - Time-based partitioning
- `idx_sales_promo` - Promotional impact analysis
- `idx_sales_stockout` - Inventory constraint analysis
- `idx_sales_cycle_entity` - Judgment Engine retrospection
- UNIQUE constraint: `(tenant_id, location_id, entity_id, sale_date)`
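For orientation, the schema above can be expressed as a condensed knex sketch. This is illustrative only: the authoritative definition lives in `backend/migrations/20251028000002_create_sales_table.cjs`, the index list is abbreviated, and in the real `.cjs` file these functions would be wired up as `exports.up` / `exports.down`.

```javascript
// Condensed sketch of the sales-table schema (not the actual migration file).
function up(knex) {
  return knex.schema.createTable('sales', (t) => {
    t.uuid('sale_id').primary().defaultTo(knex.raw('gen_random_uuid()'));
    t.uuid('tenant_id').notNullable();                       // multi-tenancy isolation
    t.timestamp('sale_date', { useTz: true }).notNullable(); // TIMESTAMPTZ for time series
    t.uuid('location_id').notNullable();                     // FK to locations
    t.uuid('entity_id').notNullable();                       // FK to core_entities
    t.decimal('quantity_sold', 15, 4).notNullable();         // the demand signal
    t.decimal('revenue', 19, 4);
    t.decimal('cost_of_goods_sold', 19, 4);
    t.decimal('unit_price', 19, 4);                          // price at time of sale
    t.boolean('promo_flag').defaultTo(false);
    t.boolean('stockout_flag').defaultTo(false);
    t.uuid('event_id').nullable();
    t.uuid('s_and_op_cycle_id').nullable();
    t.jsonb('custom_attributes');
    t.text('notes');
    t.timestamps(true, true);                                // created_at / updated_at
    t.unique(['tenant_id', 'location_id', 'entity_id', 'sale_date']);
    t.index(['tenant_id', 'location_id', 'entity_id', 'sale_date'],
            'idx_sales_tenant_loc_ent_date');
    // ...remaining indexes from the list above omitted for brevity
  });
}

function down(knex) {
  return knex.schema.dropTable('sales');
}
```

Note that `gen_random_uuid()` requires PostgreSQL 13+ (or the `pgcrypto` extension on older versions).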
VN2 Dataset Structure
Master.csv (Store & Product Metadata)
- Store IDs: 0-599 (600 locations)
- Product IDs: 100+ (~599 unique products)
- Hierarchy: ProductGroup → Division → Department → DepartmentGroup
Weekly Sales CSVs
- Format: `Week N - YYYY-MM-DD - Sales.csv`
- Time Range: 157 weeks (2021-04-12 through 2024-04-08)
- Data: Units sold per store-product per week
- Sparsity: ~16% of combinations have sales
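The weekly filename convention above can be parsed mechanically. A small sketch (the pattern is taken from this guide; adjust if the real files differ):

```javascript
// Sketch: derive the week number and week-end date from a VN2 weekly
// sales filename such as "Week 3 - 2021-04-26 - Sales.csv".
function parseWeeklyFilename(filename) {
  const match = filename.match(/^Week (\d+) - (\d{4}-\d{2}-\d{2}) - Sales\.csv$/);
  if (!match) return null; // e.g. Master.csv or other non-weekly files
  return {
    weekNumber: Number(match[1]),
    weekEndDate: match[2], // becomes sale_date on import
  };
}

console.log(parseWeeklyFilename('Week 3 - 2021-04-26 - Sales.csv'));
// { weekNumber: 3, weekEndDate: '2021-04-26' }
```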
Field Mappings: VN2 → ChainAlign Sales Table
Phase 1: Create Locations
| VN2 Source | ChainAlign Field | Notes |
|---|---|---|
| Store ID (0-599) | location_code | Format: STORE-000 to STORE-599 |
| Store ID | location_name | Generated: "Store 0", "Store 1", etc. |
| Store ID | custom_attributes.vn2_store_id | For traceability |
| StoreFormat, Format | custom_attributes | Preserved as JSONB |
Result: 600 retail locations created
Phase 2: Create Products
| VN2 Source | ChainAlign Field | Notes |
|---|---|---|
| Product ID | sku | Format: SKU-00126 (padded) |
| Product ID | entity_name | Generated: "Product 126" |
| ProductGroup | domain_attributes.product_group | Hierarchy preserved |
| Division | domain_attributes.division | Hierarchy preserved |
| Department | domain_attributes.department | Hierarchy preserved |
| DepartmentGroup | domain_attributes.department_group | Hierarchy preserved |
| (Estimated) | unit_cost | $25.00 (default) |
| (Estimated) | unit_price | $45.00 (default, 80% markup) |
Result: ~599 product entities created
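The code formats shown in the Phase 1 and Phase 2 tables imply simple zero-padding conventions; a minimal sketch (the exact helper names are assumptions, not taken from the import script):

```javascript
// Sketch of the assumed code-generation conventions from the mapping tables:
// STORE-000..STORE-599 for locations, SKU-00126-style padded SKUs for products.
const locationCode = (storeId) => `STORE-${String(storeId).padStart(3, '0')}`;
const sku = (productId) => `SKU-${String(productId).padStart(5, '0')}`;

console.log(locationCode(0));   // STORE-000
console.log(locationCode(599)); // STORE-599
console.log(sku(126));          // SKU-00126
```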
Phase 3: Import Sales
| VN2 Source | ChainAlign Field | Notes |
|---|---|---|
| Store ID | location_id (lookup) | Joins to locations table |
| Product ID | entity_id (lookup) | Joins to core_entities table |
| Week-end Date | sale_date | Converted to TIMESTAMP |
| Sales Quantity | quantity_sold | Direct mapping |
| (Calculated) | revenue | quantity × unit_price |
| (Calculated) | cost_of_goods_sold | quantity × unit_cost |
| unit_price | unit_price | Snapshot from entity |
| (Default) | promo_flag | false (VN2 doesn't provide) |
| (Default) | stockout_flag | false (VN2 doesn't provide) |
| (Default) | s_and_op_cycle_id | NULL (can be linked later) |
| (Default) | event_id | NULL (can be linked later) |
| Week metadata | custom_attributes | {vn2_week_number, vn2_store_id, vn2_product_id} |
Result: ~57,000 sales records (sparse data - only non-zero sales)
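The Phase 3 mapping can be sketched as a single row-transformation function. The lookup maps and field names below are assumptions for illustration; the actual logic lives in `backend/scripts/import_vn2_dataset.js`.

```javascript
// Sketch of the Phase 3 mapping: one VN2 store-product-week observation
// becomes one sales record. Lookup-map shapes are assumed.
function toSalesRecord({ storeId, productId, weekEndDate, quantity, weekNumber },
                       { tenantId, locationIdByStore, entityByProduct }) {
  const entity = entityByProduct.get(productId); // { entityId, unitPrice, unitCost }
  return {
    tenant_id: tenantId,
    location_id: locationIdByStore.get(storeId),
    entity_id: entity.entityId,
    sale_date: weekEndDate,
    quantity_sold: quantity,
    unit_price: entity.unitPrice,                 // snapshot from the entity
    revenue: quantity * entity.unitPrice,         // quantity x unit_price
    cost_of_goods_sold: quantity * entity.unitCost,
    promo_flag: false,                            // VN2 doesn't provide
    stockout_flag: false,                         // VN2 doesn't provide
    s_and_op_cycle_id: null,                      // linked later
    event_id: null,
    custom_attributes: {
      vn2_week_number: weekNumber,
      vn2_store_id: storeId,
      vn2_product_id: productId,
    },
  };
}
```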
How to Import
Prerequisites
```bash
# Ensure migrations are run
cd backend
npx knex migrate:latest --env development

# The sales table is created by:
# backend/migrations/20251028000002_create_sales_table.cjs
```
Test Import (Recommended First)
```bash
# See what would be imported without touching the DB
node scripts/import_vn2_dataset.js --dry-run
```
Full Import
```bash
# Create locations, products, and sales
node scripts/import_vn2_dataset.js

# Expected output:
# ✅ Created 600 retail locations
# ✅ Created ~599 product entities
# ✅ Imported ~57,000 sales records
```
Limited Import (for testing)
```bash
# Test with only the first 10 locations and products
node scripts/import_vn2_dataset.js --limit 10
```
Validation Queries
After Import
```sql
-- Count locations
SELECT COUNT(*) AS location_count
FROM locations
WHERE custom_attributes->>'vn2_store_id' IS NOT NULL;
-- Expected: 600

-- Count products
SELECT COUNT(*) AS product_count
FROM core_entities
WHERE entity_type = 'PRODUCT'
  AND domain_attributes->>'vn2_product_id' IS NOT NULL;
-- Expected: ~599

-- Count sales
SELECT COUNT(*) AS sales_count,
       MIN(sale_date) AS earliest,
       MAX(sale_date) AS latest
FROM sales;
-- Expected: ~57,000 records, 2021-04-12 to 2024-04-08
```
Sales Analysis
```sql
-- Top 10 best-selling products
SELECT
  e.entity_name,
  e.sku,
  SUM(s.quantity_sold) AS total_qty,
  SUM(s.revenue) AS total_revenue,
  COUNT(*) AS weeks_active
FROM sales s
JOIN core_entities e ON s.entity_id = e.entity_id
GROUP BY e.entity_id, e.entity_name, e.sku
ORDER BY total_qty DESC
LIMIT 10;

-- Monthly sales trend
SELECT
  DATE_TRUNC('month', s.sale_date) AS month,
  SUM(s.quantity_sold) AS units,
  SUM(s.revenue) AS revenue,
  AVG(s.revenue / NULLIF(s.quantity_sold, 0)) AS avg_price
FROM sales s
GROUP BY DATE_TRUNC('month', s.sale_date)
ORDER BY month;

-- Store performance ranking
SELECT
  l.location_name,
  COUNT(DISTINCT s.entity_id) AS products_sold,
  SUM(s.quantity_sold) AS total_units,
  SUM(s.revenue) AS total_revenue,
  AVG(s.revenue / NULLIF(s.quantity_sold, 0)) AS avg_price
FROM sales s
JOIN locations l ON s.location_id = l.location_id
GROUP BY l.location_id, l.location_name
ORDER BY total_revenue DESC
LIMIT 20;
```
Design Decisions & Tradeoffs
Why One Sales Table (Not sales_history)?
- Time-stamped facts: Each sale is a time-defined event (when did it happen?)
- Multi-granularity support: Can support daily, weekly, monthly, or any time granularity
- Forecasting: ML models consume all historical sales data from a single table
- Simplicity: No archival/rotation complexity - append-only natural growth
Why Sparse Data Import (Non-Zero Sales Only)?
- Storage: ~57K records vs. the full 600 × 599 × 157 store-product-week grid (~56M cells, >99.8% sparse)
- Performance: Query faster on sparse data with good indexes
- Realistic: Real retail data is also sparse (not every product sells in every store every week)
Why Estimated Pricing ($25 cost, $45 retail)?
- VN2 doesn't provide pricing: the original dataset contains no price data
- 80% markup: $45 / $25 = 1.8, a reasonable default consistent with typical retail product-group margins
- Can be updated: set `unit_cost` and `unit_price` in `core_entities` after import
Why Feature Flags Default to False?
- VN2 doesn't provide: No promo_flag or stockout_flag in original data
- Can be enriched later: Update sales records with actual event flags after import
- Forward compatible: Schema ready for promo and stockout analysis
Why s_and_op_cycle_id is Nullable?
- Async linkage: Cycles created after sales import
- Retrospective analysis: Link sales to which planning cycle they were executed under
- Judgment Engine: Supports "what actually happened vs. what we planned?"
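The retrospective linkage can be sketched as a single set-based update. The cycles table and column names below (`s_and_op_cycles`, `cycle_id`, `period_start`, `period_end`) are assumptions for illustration, not confirmed schema:

```javascript
// Sketch: stamp imported sales rows with the S&OP cycle whose planning
// window covers the sale date. Cycle table/column names are assumed.
function linkSalesToCycles(knex, tenantId) {
  return knex.raw(
    `UPDATE sales s
        SET s_and_op_cycle_id = c.cycle_id
       FROM s_and_op_cycles c
      WHERE s.tenant_id = c.tenant_id
        AND s.tenant_id = ?
        AND s.s_and_op_cycle_id IS NULL
        AND s.sale_date >= c.period_start
        AND s.sale_date <  c.period_end`,
    [tenantId]
  );
}
```

Because `s_and_op_cycle_id` starts as NULL, this update is safe to run any time after cycles are created, and re-running it only touches still-unlinked rows.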
Next Steps
- Run Migration: `npx knex migrate:latest --env development`
- Test Import: `node backend/scripts/import_vn2_dataset.js --dry-run`
- Import Data: `node backend/scripts/import_vn2_dataset.js`
- Validate: run the SQL queries above to confirm data integrity
- Enrich: update `promo_flag`, `stockout_flag`, and `s_and_op_cycle_id` with actual data
- Analyze: use for forecasting, scenario analysis, and Judgment Engine retrospection
Script Details
Script: backend/scripts/import_vn2_dataset.js
- Autonomous 3-phase import (locations → products → sales)
- Idempotent (safe to re-run, skips existing records)
- Dry-run mode for validation
- Comprehensive error handling and statistics
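The idempotency property can be achieved by leaning on the table's unique key with knex's `onConflict` support. This is one possible approach, not necessarily what the script does:

```javascript
// Sketch: insert a batch of sales rows, silently skipping any row that
// already exists under the unique (tenant, location, entity, date) key,
// so the import is safe to re-run.
function insertSalesBatch(knex, rows) {
  return knex('sales')
    .insert(rows)
    .onConflict(['tenant_id', 'location_id', 'entity_id', 'sale_date'])
    .ignore(); // ON CONFLICT DO NOTHING on PostgreSQL
}
```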
Created: Oct 28, 2025
Migration: backend/migrations/20251028000002_create_sales_table.cjs
Dataset: /backend/VN2-competition-datasets/ (157 weeks, 600 stores, ~599 products)