Skip to main content

VN2 Dataset Integration Guide

Overview

This guide explains how to map the VN2-competition-datasets into ChainAlign's optimized sales table.

What Exists

Infrastructure

  • Retail Tenant: MegaRetail Inc. (seed: 002_add_megaretail_tenant.cjs)
  • Locations Table: Stores dimension (locations)
  • Products Table: Core entities (core_entities, type='PRODUCT')
  • Sales Table: Optimized fact table (sales) - NEW
  • Import Script: VN2 data importer (backend/scripts/import_vn2_dataset.js)

Optimized Sales Table Schema

The sales table is designed for time-series forecasting, financial auditing, and Judgment Engine retrospection:

Core Dimensions

FieldTypePurpose
sale_idUUIDPrimary key
tenant_idUUIDMulti-tenancy isolation
sale_dateTIMESTAMPTZCritical for forecasting - indexed for time-series queries
location_idUUIDStore dimension (FK to locations)
entity_idUUIDProduct dimension (FK to core_entities)

Financial Facts

FieldTypePurpose
quantity_soldNUMERIC(15,4)Core forecasting metric - the demand signal
revenueNUMERIC(19,4)Total selling price (financial impact)
cost_of_goods_soldNUMERIC(19,4)Direct cost (gross margin calculation)
unit_priceNUMERIC(19,4)Price at time of sale (scenario backtesting)

Feature Engineering (ML Ready)

FieldTypePurpose
promo_flagBOOLEANExternal covariate for forecasting models
stockout_flagBOOLEANDemand censoring indicator (data cleaning)
event_idUUIDExternal event context (FK, nullable)
s_and_op_cycle_idUUIDLinks to S&OP cycles for Judgment Engine retrospection

Metadata

FieldTypePurpose
custom_attributesJSONBFlexible domain extensions
notesTEXTAudit trail
created_atTIMESTAMPAudit timestamps
updated_atTIMESTAMPAudit timestamps

Indexes

idx_sales_tenant_loc_ent_date      - Core time-series queries
idx_sales_tenant_date - Tenant-level trend analysis
idx_sales_loc_ent_date - Location-product analysis
idx_sales_date - Time-based partitioning
idx_sales_promo - Promotional impact analysis
idx_sales_stockout - Inventory constraint analysis
idx_sales_cycle_entity - Judgment Engine retrospection
UNIQUE: (tenant_id, location_id, entity_id, sale_date)

VN2 Dataset Structure

Master.csv (Store & Product Metadata)

  • Store IDs: 0-599 (600 locations)
  • Product IDs: 100+ (~599 unique products)
  • Hierarchy: ProductGroup → Division → Department → DepartmentGroup

Weekly Sales CSVs

  • Format: Week N - YYYY-MM-DD - Sales.csv
  • Time Range: 157 weeks (2021-04-12 through 2024-04-08)
  • Data: Units sold per store-product per week
  • Sparsity: ~16% of combinations have sales

Field Mappings: VN2 → ChainAlign Sales Table

Phase 1: Create Locations

VN2 SourceChainAlign FieldNotes
Store ID (0-599)location_codeFormat: STORE-000 to STORE-599
Store IDlocation_nameGenerated: "Store 0", "Store 1", etc.
Store IDcustom_attributes.vn2_store_idFor traceability
StoreFormat, Formatcustom_attributesPreserved as JSONB

Result: 600 retail locations created

Phase 2: Create Products

VN2 SourceChainAlign FieldNotes
Product IDskuFormat: SKU-00126 (padded)
Product IDentity_nameGenerated: "Product 126"
ProductGroupdomain_attributes.product_groupHierarchy preserved
Divisiondomain_attributes.divisionHierarchy preserved
Departmentdomain_attributes.departmentHierarchy preserved
DepartmentGroupdomain_attributes.department_groupHierarchy preserved
(Estimated)unit_cost$25.00 (default)
(Estimated)unit_price$45.00 (default, 80% markup)

Result: ~599 product entities created

Phase 3: Import Sales

VN2 SourceChainAlign FieldNotes
Store IDlocation_id (lookup)Joins to locations table
Product IDentity_id (lookup)Joins to core_entities table
Week-end Datesale_dateConverted to TIMESTAMP
Sales Quantityquantity_soldDirect mapping
(Calculated)revenuequantity × unit_price
(Calculated)cost_of_goods_soldquantity × unit_cost
unit_priceunit_priceSnapshot from entity
(Default)promo_flagfalse (VN2 doesn't provide)
(Default)stockout_flagfalse (VN2 doesn't provide)
(Default)s_and_op_cycle_idNULL (can be linked later)
(Default)event_idNULL (can be linked later)
Week metadatacustom_attributes{vn2_week_number, vn2_store_id, vn2_product_id}

Result: ~57,000 sales records (sparse data - only non-zero sales)


How to Import

Prerequisites

# Ensure migrations are run
cd backend
npx knex migrate:latest --env development

# Check migration includes sales table creation
# Migration: backend/migrations/20251028000002_create_sales_table.cjs
# See what would be imported without touching DB
node scripts/import_vn2_dataset.js --dry-run

Full Import

# Create locations, products, and sales
node scripts/import_vn2_dataset.js

# Expected output:
# ✅ Created 600 retail locations
# ✅ Created ~599 product entities
# ✅ Imported ~57,000 sales records

Limited Import (for testing)

# Test with only first 10 locations and products
node scripts/import_vn2_dataset.js --limit 10

Validation Queries

After Import

-- Count locations
SELECT COUNT(*) as location_count
FROM locations
WHERE custom_attributes->>'vn2_store_id' IS NOT NULL;
-- Expected: 600

-- Count products
SELECT COUNT(*) as product_count
FROM core_entities
WHERE entity_type='PRODUCT'
AND domain_attributes->>'vn2_product_id' IS NOT NULL;
-- Expected: ~599

-- Count sales
SELECT COUNT(*) as sales_count,
MIN(sale_date) as earliest,
MAX(sale_date) as latest
FROM sales;
-- Expected: ~57,000 records, 2021-04-12 to 2024-04-08

Sales Analysis

-- Top 10 best-selling products
SELECT
e.entity_name,
e.sku,
SUM(s.quantity_sold) as total_qty,
SUM(s.revenue) as total_revenue,
COUNT(*) as weeks_active
FROM sales s
JOIN core_entities e ON s.entity_id = e.entity_id
GROUP BY e.entity_id, e.entity_name, e.sku
ORDER BY total_qty DESC
LIMIT 10;

-- Monthly sales trend
SELECT
DATE_TRUNC('month', s.sale_date) as month,
SUM(s.quantity_sold) as units,
SUM(s.revenue) as revenue,
AVG(s.revenue / NULLIF(s.quantity_sold, 0)) as avg_price
FROM sales s
GROUP BY DATE_TRUNC('month', s.sale_date)
ORDER BY month;

-- Store performance ranking
SELECT
l.location_name,
COUNT(DISTINCT s.entity_id) as products_sold,
SUM(s.quantity_sold) as total_units,
SUM(s.revenue) as total_revenue,
AVG(s.revenue / NULLIF(s.quantity_sold, 0)) as avg_price
FROM sales s
JOIN locations l ON s.location_id = l.location_id
GROUP BY l.location_id, l.location_name
ORDER BY total_revenue DESC
LIMIT 20;

Design Decisions & Tradeoffs

Why One Sales Table (Not sales_history)?

  • Time-stamped facts: Each sale is a time-defined event (when did it happen?)
  • Multi-granularity support: Can support daily, weekly, monthly, or any time granularity
  • Forecasting: ML models consume all historical sales data from a single table
  • Simplicity: No archival/rotation complexity - append-only natural growth

Why Sparse Data Import (Non-Zero Sales Only)?

  • Storage: 57K records vs. potential 359M combinations (99.98% sparse)
  • Performance: Query faster on sparse data with good indexes
  • Realistic: Real retail data is also sparse (not every product sells in every store every week)

Why Estimated Pricing ($25 cost, $45 retail)?

  • VN2 doesn't provide: Original dataset has no price data
  • 80% markup: Reasonable for retail (similar to your product group logic)
  • Can be updated: Update unit_cost, unit_price in core_entities after import

Why Feature Flags Default to False?

  • VN2 doesn't provide: No promo_flag or stockout_flag in original data
  • Can be enriched later: Update sales records with actual event flags after import
  • Forward compatible: Schema ready for promo and stockout analysis

Why s_and_op_cycle_id is Nullable?

  • Async linkage: Cycles created after sales import
  • Retrospective analysis: Link sales to which planning cycle they were executed under
  • Judgment Engine: Supports "what actually happened vs. what we planned?"

Next Steps

  1. Run Migration: npx knex migrate:latest --env development
  2. Test Import: node backend/scripts/import_vn2_dataset.js --dry-run
  3. Import Data: node backend/scripts/import_vn2_dataset.js
  4. Validate: Run SQL queries above to confirm data integrity
  5. Enrich: Update promo_flag, stockout_flag, s_and_op_cycle_id with actual data
  6. Analyze: Use for forecasting, scenario analysis, Judgment Engine retrospection

Script Details

Script: backend/scripts/import_vn2_dataset.js

  • Autonomous 3-phase import (locations → products → sales)
  • Idempotent (safe to re-run, skips existing records)
  • Dry-run mode for validation
  • Comprehensive error handling and statistics

Created: Oct 28, 2025 Migration: backend/migrations/20251028000002_create_sales_table.cjs Dataset: /backend/VN2-competition-datasets/ (157 weeks, 599 stores, ~599 products)