Archive automation, LPG distributorship

The brief

An LPG distributorship in Odisha kept its consumer archive as 4,000 scanned KYC images in a flat folder, named only by internal consumer numbers. To find a file, staff cross-referenced an inconsistent Excel database. A decade of manual data entry had created over a hundred fuzzy duplicate locality names: transliteration variants, suffix inconsistencies, plain typos. Looking up customer records cost about fourteen hours of triage a month.

The brief was a one-paragraph note: rebuild this into something staff can navigate, without paying for a SaaS product, and make it rerunnable when the source data changes.

What we shipped

A Python pipeline in three independent stages, each idempotent and resumable.

Stage one: reconcile and sort. Read the Excel source, build a consumer-number lookup, and copy matching images into area, then locality, subfolders. Pre-create destination paths. Log every copy with the original key, the destination, and a timestamp.
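The copy step can be sketched as below. This is a minimal illustration, not the shipped code: it assumes the Excel source has already been read into a dictionary mapping each consumer number to an (area, locality) pair, and the function name `sort_images`, the `lookup` shape, and the log field names are all hypothetical.

```python
import shutil
from datetime import datetime, timezone
from pathlib import Path


def sort_images(source_dir: Path, dest_root: Path,
                lookup: dict[str, tuple[str, str]]) -> list[dict]:
    """Copy images into dest_root/<area>/<locality>/ and return a log of new copies.

    `lookup` maps a consumer number (the image file stem) to (area, locality).
    Hypothetical shape: the real pipeline builds its lookup from the Excel source.
    """
    log = []
    for image in sorted(source_dir.iterdir()):
        if not image.is_file():
            continue
        placement = lookup.get(image.stem)
        if placement is None:
            continue  # no matching row in the source data; leave the file alone
        area, locality = placement
        target_dir = dest_root / area / locality
        target_dir.mkdir(parents=True, exist_ok=True)  # pre-create the destination path
        target = target_dir / image.name
        if target.exists():
            continue  # copied on an earlier run; skipping keeps the stage idempotent
        shutil.copy2(image, target)
        log.append({
            "key": image.stem,
            "destination": str(target),
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })
    return log
```

Because existing targets are skipped, a second run over unchanged inputs copies nothing and returns an empty log, which is one simple way to get the rerunnable behaviour the brief asked for.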

Stage two: normalise localities. Detect fuzzy duplicate locality names across delivery areas using rapidfuzz. 116 distinct merges consolidated 796 folders to about 680 actual localities. Log every merge with the original keys, the canonical reference, the score, and the policy that produced the decision.
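A minimal sketch of how such a merge plan might be built. The threshold of 85, the "first name seen becomes canonical" policy, and the names `plan_merges` and `normalise` are assumptions for illustration; the source does not state the scorer or policy actually used. The sketch calls `rapidfuzz.fuzz.ratio` when rapidfuzz is installed and otherwise falls back to the standard library's `difflib`, which is a stand-in and can score some pairs differently.

```python
try:
    from rapidfuzz import fuzz

    def similarity(a: str, b: str) -> float:
        return fuzz.ratio(a, b)  # similarity on a 0-100 scale
except ImportError:
    from difflib import SequenceMatcher

    def similarity(a: str, b: str) -> float:
        # Stand-in scorer, rescaled to 0-100 to match rapidfuzz.
        return 100 * SequenceMatcher(None, a, b).ratio()


def normalise(name: str) -> str:
    """Lowercase and collapse whitespace before comparing."""
    return " ".join(name.lower().split())


def plan_merges(names: list[str], threshold: float = 85.0):
    """Return (canonical names, merge log). Hypothetical policy: first seen wins."""
    canonicals: list[str] = []
    merges: list[dict] = []
    for name in names:
        key = normalise(name)
        best, best_score = None, 0.0
        for canon in canonicals:
            score = similarity(key, normalise(canon))
            if score > best_score:
                best, best_score = canon, score
        if best is not None and best_score >= threshold:
            merges.append({
                "original": name,
                "canonical": best,
                "score": round(best_score, 1),
                "policy": f"first-seen canonical, ratio>={threshold:g}",
            })
        else:
            canonicals.append(name)  # no close match: treat as a new locality
    return canonicals, merges
```

For example, `plan_merges(["Saheed Nagar", "Sahid Nagar", "Patia"])` keeps "Saheed Nagar" and "Patia" as canonical names and logs "Sahid Nagar" as a merge into "Saheed Nagar". Recording the score and policy with each merge is what lets someone audit or reverse a decision later.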

Stage three: publish and link. Upload the normalised archive to Google Drive over OAuth, then rewrite the Excel consumer numbers as clickable hyperlinks pointing at the KYC images. Staff now click a number and the KYC photo opens.
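The linking half of this stage can be sketched as follows. The Drive upload itself is left out; the sketch assumes it has already produced a mapping from consumer number to Drive file ID. The names `link_rows` and `hyperlink_formula`, the row shape, and the choice of an Excel `HYPERLINK` formula (rather than a cell-level hyperlink) are assumptions, as is the use of the common `https://drive.google.com/file/d/<id>/view` link form.

```python
def drive_view_url(file_id: str) -> str:
    """Build a browser link for a Drive file ID (assumed link form)."""
    return f"https://drive.google.com/file/d/{file_id}/view"


def hyperlink_formula(consumer_number: str, file_id: str) -> str:
    """Return an Excel HYPERLINK formula that shows the consumer number as the label."""
    label = str(consumer_number).replace('"', '""')  # Excel escapes quotes by doubling
    return f'=HYPERLINK("{drive_view_url(file_id)}","{label}")'


def link_rows(rows: list[dict], file_ids: dict[str, str]) -> list[dict]:
    """Return copies of rows with consumer numbers replaced by formulas where possible.

    `rows` is a hypothetical in-memory form of the spreadsheet; `file_ids` maps
    consumer number -> Drive file ID from the upload step.
    """
    linked = []
    for row in rows:
        new_row = dict(row)  # leave the input untouched so reruns start clean
        number = str(row["consumer_number"])
        if number in file_ids:
            new_row["consumer_number"] = hyperlink_formula(number, file_ids[number])
        linked.append(new_row)
    return linked
```

Rows with no uploaded image keep their plain consumer number, so a partially uploaded archive still yields a usable sheet.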

What we measure


Metric	Value
Records reconciled	3,905
Folders generated	796 (collapsed to ~680 after stage two)
Fuzzy duplicates merged	116
Manual work replaced	~14 hours per cycle
Full rerun time	under one minute

Staff now access records through a two-level hierarchy (delivery area, then locality) with one-click hyperlinks. The pipeline reruns in under a minute when the source data changes.

Notes for engineers

This is a worked example of the "AI-powered workflows" engagement shape. We use it on the home page because the numbers are the kind a buyer can keep in their head. The approach is documented in more depth in "Fuzzy record linkage in Python".

