Skip to main content

Archive automation, LPG distributorship

Fourteen hours of manual triage to under one minute. 3,905 records reconciled, 116 fuzzy duplicates merged, zero rework after handoff.

// stack

Python 3.12 · pandas · openpyxl · rapidfuzz · Google Drive API · OAuth 2.0

Client · Delivered · 2026-03-15

triage time
14h → <1min
records
3,905

Reconciled

rework
0

After handoff

Verified figures from this engagementcase study · 2026-03-15 · Delivered
The shape of the engagement, stage by stagecase map · 2026-03-15 · Delivered

The brief

An LPG distributorship in Odisha kept its consumer archive as 4,000 scanned KYC images in a flat folder, named only by internal consumer numbers. Staff cross referenced an inconsistent Excel database to find a file. A decade of manual data entry had created over a hundred fuzzy duplicate locality names: transliteration variants, suffix inconsistencies, plain typos. Finding a customer record meant fourteen hours of triage a month.

The brief was a one paragraph note: rebuild this into something staff can navigate, without paying for a SaaS, and make it rerunnable when the source data changes.

What we shipped

A Python pipeline in three independent stages, each idempotent and resumable.

Stage one: reconcile and sort. Read the Excel source, build a consumer number lookup, copy matching images into area then locality subfolders. Pre create destination paths. Log every move with the original key, the destination, and a timestamp.

Stage two: normalise localities. Detect fuzzy duplicate locality names across delivery areas using rapidfuzz. 116 distinct merges consolidated 796 folders to about 680 actual localities. Every merge logged with the original keys, the canonical reference, the score, and the policy that produced the decision.

Stage three: publish and link. Upload the normalised archive to Google Drive over OAuth, then rewrite the Excel consumer numbers as clickable hyperlinks pointing at the KYC images. Staff now click a number and the KYC photo opens.

What we measure

Records reconciled3,905
Folders generated796 (collapsed to ~680 after stage two)
Fuzzy duplicates merged116
Manual work replaced~14 hours per cycle
Full rerun timeunder one minute

Staff now access records through a two level hierarchy (delivery area, then locality) with one click hyperlinks. The pipeline reruns in under a minute when the source data changes.

Notes for engineers

This is a worked example of the AI powered workflows engagement shape. We use it on the home page because the numbers are the kind a buyer can keep in their head. The approach is documented in more depth in fuzzy record linkage in Python.

Ankit Sahu, founder of Kinyoubi Atelier & Co., attests to the authenticity of this engagement. Where the client is redacted, identity is held under NDA and can be confirmed under mutual non disclosure. Technical artefacts, unredacted, are available to qualified buyers under NDA on request.

Want a case study like this one?

Send a note. We respond within one business day.