Client · Delivered
Archive automation, LPG distributorship
Fourteen hours of manual triage to under one minute. 3,905 records reconciled, 116 fuzzy duplicates merged, zero rework after handoff.
Python 3.12 · pandas · openpyxl · rapidfuzz · Google Drive API · OAuth 2.0
Client · Delivered · 2026-03-15
The brief
An LPG distributorship in Odisha kept its consumer archive as 4,000 scanned KYC images in a flat folder, named only by internal consumer numbers. Staff cross referenced an inconsistent Excel database to find a file. A decade of manual data entry had created over a hundred fuzzy duplicate locality names: transliteration variants, suffix inconsistencies, plain typos. Finding a customer record meant fourteen hours of triage a month.
The brief was a one paragraph note: rebuild this into something staff can navigate, without paying for a SaaS, and make it rerunnable when the source data changes.
What we shipped
A Python pipeline in three independent stages, each idempotent and resumable.
Stage one: reconcile and sort. Read the Excel source, build a consumer number lookup, copy matching images into area then locality subfolders. Pre create destination paths. Log every move with the original key, the destination, and a timestamp.
Stage two: normalise localities. Detect fuzzy duplicate locality names across delivery areas using rapidfuzz. 116 distinct merges consolidated 796 folders to about 680 actual localities. Every merge logged with the original keys, the canonical reference, the score, and the policy that produced the decision.
Stage three: publish and link. Upload the normalised archive to Google Drive over OAuth, then rewrite the Excel consumer numbers as clickable hyperlinks pointing at the KYC images. Staff now click a number and the KYC photo opens.
What we measure
| Records reconciled | 3,905 |
| Folders generated | 796 (collapsed to ~680 after stage two) |
| Fuzzy duplicates merged | 116 |
| Manual work replaced | ~14 hours per cycle |
| Full rerun time | under one minute |
Staff now access records through a two level hierarchy (delivery area, then locality) with one click hyperlinks. The pipeline reruns in under a minute when the source data changes.
Notes for engineers
This is a worked example of the AI powered workflows engagement shape. We use it on the home page because the numbers are the kind a buyer can keep in their head. The approach is documented in more depth in fuzzy record linkage in Python.