Building a 5-Phase CSV Upload Pipeline That Handles 30K Rows
A single-pass CSV loader failed in three ways at 30K rows: duplicates, partial commits on crashes, and 120K extra queries from inline inventory sync. A 5-phase pipeline inside one transaction fixed all three and now processes 30K rows in 8 seconds.
A CSV upload sounds simple. Read the file, insert the rows, done. When the file has 30K rows, involves deduplication logic, syncs to 4 related inventory tables, and needs to handle partial failure safely, it is a different problem.
The upload pipeline went through a few designs before landing on a 5-phase structure that handles all of these requirements reliably.
Problem
The first version was a single pass: read the CSV, loop through rows, insert each one. Fast to write, obvious to reason about. It broke in production in three ways.
Duplicates. The CSV files sometimes had duplicate rows for the same product within the same upload. A single-pass insert without deduplication would create duplicate records. Reports would show inflated totals.
Partial failure. A crash at row 15000 left 14999 committed records and 15001 unprocessed. There was no clean way to retry without double-writing the first half.
Inventory sync. After inserting delivery records, the system needed to update inv_location, sales_inventory, and parent aggregate totals. Doing this inside the row loop meant 4 additional queries per row. At 30K rows, that is 120K extra queries in the insert loop.

Solution
Five sequential phases inside a single database transaction.
Phase 1: Validate incoming data before any writes. Check required fields, data types, reference integrity. If anything fails, reject the entire upload with clear error messages before touching the database.
Phase 2: Deduplicate using a JavaScript Map keyed by the natural identifier. When a duplicate appears, merge the quantities instead of creating two entries. O(1) lookup per record.
Phase 3: Batch-fetch existing records from the database in groups of 3000. Build a lookup Map from the results. For each incoming record, check the Map to decide insert or update. One database read phase instead of per-row lookups.
Phase 4: Execute bulk operations. Collect all inserts into a multi-row VALUES statement. Collect all updates into a CASE WHEN statement. Two queries for the entire batch instead of one per row.
Phase 5: Sync inventory after all delivery records are written. Recalculate inv_location, sales_inventory, and parent totals in targeted SQL queries. Run these after the main batch, not inline with each row.
The entire pipeline runs inside one transaction. Any failure at any phase triggers a full rollback. The database is unchanged. The upload can be retried safely.

Result
30K row uploads complete in about 8 seconds end to end.
Duplicate records are eliminated before any writes happen.
Inventory sync runs as targeted SQL after the batch, not as row-by-row queries. This keeps the main batch fast and the sync accurate.
The 5-phase structure also made debugging straightforward. When a failure occurs, the phase number in the log tells you exactly where to look.
Found this useful?
Share it with someone who'd appreciate it.
https://wardvisual.com/blogs/csv-upload-pipeline-nodejs

@wardvisual · 🇵🇭 Dasmarinas City, Cavite PH
Full-stack engineer. Business systems, database optimization, and operations software.