Demo Data Auto-generation Script

OVERVIEW:
We are developing a fictitious relational dataset for use in demonstrating our business intelligence and financial modeling solution. To ensure the dataset supports our demo scenario, we have established targets for several key measures, including sales for each product, sales for each customer, # invoices for each customer and invoices per each date. We now need automate the generation of granular data to fill the database – and the granular data must sum to the targets we’ve established.

OBJECTIVE:
The goal is to create a series of invoice line items such that the sum of the NET revenue of all the line items equals the Invoice Total in the seed data. The Invoice Total is a target – the actual can vary by a couple percent.

INVOICE TOTAL NET REVENUE: The Invoice Total is NET OF DISCOUNTS. The Customer Discount Percentage must be applied to the product list price to arrive at the net revenue. The definition:

[List Price] x [Quantity] x [Discount] = [Net Revenue]
Sum of [Net Revenue] for the line items = Invoice Total

CONTRAINT:
The products data set includes the target revenue – net of discounts – for each product. We want to distribute the target NET revenue for each product across the invoices. The Invoice target revenues EQUAL the Products target revenue in the seed data, and they will be equal by definition in the result set. We need to ensure that the revenue per customer, revenue per product and total invoices remain proportional.

APPROACH:

We presume that the script will iterate through the invoices, starting with the largest, and add products, also starting with the largest, until the invoice is full (e.g. reached the targeted revenue). Once all the Target Product Revenue is a consumed (within a couple percent) the product should not be included on any additional invoices. There are many products with low list prices, so those products can be used to ‘top-off’ invoices and get them to the target.

Note that this approach is not a requirement. If there’s a better approach – have at it.

DATA DEFINITION:

Table #1 (Seed): Invoices (161K Rows)
Customer ID (Unique ID for Customer)
Site # (Unique identifier for Customer Site)
Customer Site # (Each Customer has 1 or More Sites)
Invoice ID (Unique ID for Invoice)
Invoice Date
Invoice Net Revenue – Target (This is throwaway data. Try to get the line items to sum as closely to this number as possible)
Customer Discount % (Applied to ALL the line items on the invoice)

Table #2 (Seed): Products (10K Rows)
Product Category (Qualitative; Not Important)
Product Line (Qualitative; Not Important)
Product ID (Unique Product ID)
Product Description (Qualitative; Not Important)
List Price (Qualitative; Not Important)
Target Net Revenue (Target

Table #3 (Result Set – To Be Generated): Invoice Line Items
Line Item ID (Auto-generated ID)
Invoice ID (Parent Object)
Product ID (Each Line Item has one Product)
Quantity (The quantity of product sold in the invoice)
Calculated: Each Line Item has a Net Revenue value: [Products.List Price] x [Quantity] x [Invoice.Discount] = [Net Revenue]

VARIABLES:
While the total revenues by product and customer must remain largely intact, some of the data can be manipulated to get the generator script to work:

1- The total net revenue can be off by +/- 1%;
2- The distribution of revenue across customer or products can be +/- 3% of the target;
3- The distribution of revenue across customer sites is not very important; +/- 10% is acceptable;
4- The total number of invoices can be +/- 5%, although the distribution across customers and dates should be proportional;
5- List prices can be adjusted to remove fractional quantities.

DELIVERABLE:
1- Complete Data Set of ‘Invoice Line Items’ (Likely 1 Million Rows)
2- PHP Script (or MySQL Procedure/Query) to Populate the Data, should we need to run it again in the future.

Leave a Reply

Your email address will not be published. Required fields are marked *