
Fuzzy record linkage in Python: when rapidfuzz is enough

Most reconciliations need a normalisation pass and a string distance, not a machine learning system. Until they do.

By the studio · 3 min read · Python · Fuzzy matching · Record linkage

"Fuzzy matching" gets used to describe two different problems. They look similar and need different tools.

Problem one: messy variations of the same entity. A locality called Ramanagar in one row, Raamnagar in another, Ramnagar Phase II in a third. Same place. Different transliteration choices. This is a normalisation problem with a string distance fallback.

Problem two: two different entities mistaken for one. Two customers in the same locality, both named A.K. Singh, who are not the same person. This is an entity resolution problem and demands a human in the loop.

Most engineering teams reach for a heavy tool (dedupe.io, recordlinkage, Splink) before they have characterised which problem they have. We do the opposite: we pretend it is problem one until evidence forces us up the ladder.

The four step pipeline

For a problem one workload, this is the pattern.

Normalise first. Standardise case. Strip extra whitespace. Replace common diacritics. Normalise unit suffixes (Ph II, Phase 2, Phase II all collapse to "phase 2"). Strip honorifics if you are matching names. This step removes 80% of the noise and the cost is a hundred lines of Python.
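A minimal sketch of such a normalisation pass, using only the standard library. The honorific list, the roman numeral table, and the function name normalise are illustrative assumptions; a real rule set comes from inspecting your own data.

```python
import re
import unicodedata

# Hypothetical rule tables for illustration only.
HONORIFICS = {"mr", "mrs", "ms", "dr", "shri", "smt"}
ROMAN = {"i": "1", "ii": "2", "iii": "3", "iv": "4", "v": "5"}


def normalise(text: str, strip_honorifics: bool = False) -> str:
    # Decompose accented characters, then drop the combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Standardise case, turn punctuation into spaces, collapse whitespace.
    text = re.sub(r"[^\w\s]", " ", text.casefold())
    text = re.sub(r"\s+", " ", text).strip()
    # Expand the "ph" abbreviation, then convert a roman numeral after "phase".
    text = re.sub(r"\bph\b", "phase", text)
    text = re.sub(
        r"\bphase (i{1,3}|iv|v)\b",
        lambda m: "phase " + ROMAN[m.group(1)],
        text,
    )
    if strip_honorifics:
        text = " ".join(t for t in text.split() if t not in HONORIFICS)
    return text


variants = ["Ramnagar Ph II", "  Ramnagar   Phase 2 ", "Ramnagar Phase II"]
normalised = {normalise(v) for v in variants}
```

Under these assumed rules, all three variants collapse to one string, so they never reach the string distance step at all.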

Block second. Partition by a high cardinality field (delivery area, district, year of birth) so you compare within blocks instead of across the cartesian product. Without this, the number of comparisons grows as O(n²); with it, the count grows roughly linearly in n, as long as individual blocks stay small.
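To make the saving concrete, here is a small sketch of blocking. The record layout, the district field, and the candidate_pairs helper are hypothetical; the point is only that pairs are generated inside each block rather than across the whole dataset.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(records, block_key):
    """Yield record pairs that share the same value for block_key."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[rec[block_key]].append(rec)
    for members in blocks.values():
        # Pairs are only formed inside a block, never across blocks.
        yield from combinations(members, 2)


# Made-up sample data.
records = [
    {"id": 1, "district": "north", "name": "ramanagar"},
    {"id": 2, "district": "north", "name": "raamnagar"},
    {"id": 3, "district": "south", "name": "ramnagar"},
    {"id": 4, "district": "south", "name": "shanti nagar"},
    {"id": 5, "district": "north", "name": "ramnagar phase 2"},
]

blocked = [(a["id"], b["id"]) for a, b in candidate_pairs(records, "district")]
unblocked = list(combinations(records, 2))
```

With five records the cartesian approach produces 10 pairs and the blocked approach produces 4. The trade-off is that records 1 and 3 are never compared, so a wrong value in the blocking field costs recall.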

Match third. Apply a string distance within each block. We use rapidfuzz, a fast, MIT licensed Python library for Levenshtein distance and its variants. Pick a threshold; everything above it is a candidate merge.
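A sketch of the matching step inside a single block. It uses rapidfuzz's fuzz.ratio, which returns a similarity between 0 and 100. The threshold of 85 and the helper names are assumptions for illustration, not the values used on any real engagement. The difflib fallback is there only so the sketch runs where rapidfuzz is not installed; its ratio is a close but not identical measure.

```python
from itertools import combinations

try:
    from rapidfuzz import fuzz

    def similarity(a: str, b: str) -> float:
        # Normalised similarity in the range 0..100.
        return fuzz.ratio(a, b)

except ImportError:
    from difflib import SequenceMatcher

    def similarity(a: str, b: str) -> float:
        # Standard library stand-in, scaled to the same 0..100 range.
        return 100.0 * SequenceMatcher(None, a, b).ratio()


THRESHOLD = 85.0  # illustrative; tune against reviewed examples


def candidate_merges(names, threshold=THRESHOLD):
    """Return (a, b, score) for every pair scoring at or above threshold."""
    merges = []
    for a, b in combinations(names, 2):
        score = similarity(a, b)
        if score >= threshold:
            merges.append((a, b, round(score, 1)))
    return merges


# One block of already normalised names (made-up data).
block = ["ramanagar", "raamnagar", "shanti nagar"]
merges = candidate_merges(block)
```

Here only the ramanagar / raamnagar pair clears the threshold. The output is a list of candidates, not decisions: what happens to a candidate is still up to the rule or the reviewer.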

Audit fourth. Every merge writes a log row with the original keys, the chosen canonical reference, the score, the rule that produced the decision, and a timestamp. This makes the work reversible. Two years later when someone asks why two records collapsed, you have the answer.
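One possible shape for that log row, sketched as a dataclass serialised to a JSON line. The field names, the folder keys, and the rule string are hypothetical; only the list of things to record comes from the description above.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class MergeLogRow:
    original_keys: list  # keys of the records being collapsed
    canonical_key: str   # the record that survives
    score: float         # similarity score behind the decision
    rule: str            # which rule produced the decision
    logged_at: str       # ISO 8601 timestamp in UTC


def make_log_row(original_keys, canonical_key, score, rule) -> MergeLogRow:
    return MergeLogRow(
        original_keys=list(original_keys),
        canonical_key=canonical_key,
        score=score,
        rule=rule,
        logged_at=datetime.now(timezone.utc).isoformat(),
    )


def to_json_line(row: MergeLogRow) -> str:
    # One JSON object per line keeps the log append-only and easy to grep.
    return json.dumps(asdict(row))


row = make_log_row(
    ["folder_017", "folder_342"], "folder_017", 88.9, "fuzz.ratio >= 85"
)
line = to_json_line(row)
```

Appending each line to a file or inserting it into a table are both reasonable; what matters is that the row is written once and never edited.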

When rapidfuzz is enough

On the archive automation engagement the pipeline reduced 796 folders to about 680 real localities through 116 distinct merges. No machine learning. The normalisation step did most of the work; rapidfuzz handled the residue.

When to upgrade

Three signals say "you are not on problem one any more":

  • The blocking key is unstable. If a 5% change in source data shifts which records land in which blocks, the matcher's recall is shaky, and you need a more sophisticated blocking strategy (canopy clustering, sorted neighbourhood, learned blocking).
  • Ground truth is contested. If two informed reviewers disagree on more than ~15% of candidate merges, the system needs more structure than a similarity score.
  • Schema churn. If your source data adds or renames fields each quarter, you need declarative, versioned matching logic. That is when dedupe.io or Splink earn their cost.
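The second signal is the easiest to measure. A rough sketch, assuming each reviewer's decisions are stored as a mapping from candidate pair to a merge / do not merge boolean; the function name and the sample data are made up.

```python
def disagreement_rate(reviewer_a: dict, reviewer_b: dict) -> float:
    """Fraction of jointly reviewed candidate pairs where the decisions differ."""
    shared = reviewer_a.keys() & reviewer_b.keys()
    if not shared:
        raise ValueError("the reviewers have no candidate pairs in common")
    disagreements = sum(reviewer_a[p] != reviewer_b[p] for p in shared)
    return disagreements / len(shared)


# Made-up review of 20 candidate merges: reviewer B rejects four that A accepted.
reviewer_a = {pair_id: True for pair_id in range(20)}
reviewer_b = dict(reviewer_a)
for pair_id in range(4):
    reviewer_b[pair_id] = False

rate = disagreement_rate(reviewer_a, reviewer_b)
contested = rate > 0.15  # the ~15% rule of thumb from the list above
```

In this made-up sample the reviewers disagree on 4 of 20 pairs, a rate of 20%, which would count as contested ground truth.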

The audit step is non negotiable

Whatever tool you reach for, log the merge before you execute it. We have undone a wrong merge from three months ago by reading the log, restoring the source row, and rerunning the pipeline with an adjusted threshold. Without the log, that work would have been irrecoverable. With it, it was a one hour fix.
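The ordering can be sketched with an in-memory store. Everything here is an illustrative assumption: the store layout, the aliases field, and the choice to snapshot the source row inside the log entry so that an undo needs nothing but the log.

```python
import copy
from datetime import datetime, timezone


def merge_with_log(store, log, source_key, canonical_key, score, rule):
    # Write the log entry first, with a snapshot of the row about to disappear.
    log.append({
        "source_key": source_key,
        "canonical_key": canonical_key,
        "source_row": copy.deepcopy(store[source_key]),
        "score": score,
        "rule": rule,
        "logged_at": datetime.now(timezone.utc).isoformat(),
    })
    # Only then execute the merge.
    row = store.pop(source_key)
    store[canonical_key]["aliases"].append(row["name"])


def undo_merge(store, log, source_key):
    # Find the most recent merge of this key and reverse it from the snapshot.
    for entry in reversed(log):
        if entry["source_key"] == source_key:
            store[source_key] = copy.deepcopy(entry["source_row"])
            alias = entry["source_row"]["name"]
            store[entry["canonical_key"]]["aliases"].remove(alias)
            return entry
    raise KeyError(source_key)


store = {
    "f1": {"name": "Ramanagar", "aliases": []},
    "f2": {"name": "Raamnagar", "aliases": []},
}
original = copy.deepcopy(store)
log = []

merge_with_log(store, log, "f2", "f1", 88.9, "ratio >= 85")
merged_state = copy.deepcopy(store)
undo_merge(store, log, "f2")
```

After the undo the store matches its original state and the log still holds the entry, so the history of the wrong merge is kept rather than erased.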
