
Matching pipeline

How PullFirst joins public contractor records across licenses, permits, enforcement, and OSHA, and what the confidence fields on each link mean.

Context: Joining public contractor records is the product — the thesis behind the approach described here.

Every contractor record in PullFirst is a join across separate public systems that were never designed to line up. This page is the reference for how the join runs, what the resulting confidence fields mean, and how to consume them from the API.

Normalization

Every name string — license, permit, enforcement order, OSHA establishment — passes through the same normalizer before any comparison runs. The goal of normalization is to collapse cosmetic formatting differences without destroying identity.

The normalizer applies the following transformations in order:

  • Lowercases the full string.
  • Strips corporate suffixes: LLC, INC, CORP, and common variants.
  • Collapses repeated whitespace to a single space.
  • Removes punctuation other than intra-word hyphens and ampersands (the next step handles ampersands).
  • Replaces ampersands with the word "and".
  • Folds singular and plural forms of the same word (so "services" and "service" resolve to the same canonical token).

After normalization, an UPPERCASE license record and a title-case permit record for the same business reduce to one canonical string, even when one uses "and" and the other uses "&".
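The steps above can be sketched in a few lines of Python. This is an illustrative approximation, not PullFirst's normalizer: the suffix list, the regular expressions, and the deliberately naive plural folding are all assumptions, and the punctuation step is assumed to leave ampersands in place so the ampersand step can still see them.

```python
import re

# Hypothetical suffix list; the guide names LLC, INC, CORP "and common variants".
_SUFFIX = re.compile(r"[\s,]+(?:l\.?l\.?c|inc|incorporated|corp|corporation)\.?\s*$")
# Anything that is not a word character, whitespace, "&", or "-".
_PUNCT = re.compile(r"[^\w\s&-]")
# Hyphens that do not sit between two word characters.
_LOOSE_HYPHEN = re.compile(r"(?<!\w)-|-(?!\w)")


def fold_plural(token):
    """Naive singular/plural folding: drop one trailing 's' (assumption)."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def normalize(name):
    s = name.lower()                           # lowercase
    s = _SUFFIX.sub("", s)                     # strip a trailing corporate suffix
    s = re.sub(r"\s+", " ", s).strip()         # collapse whitespace
    s = _PUNCT.sub("", s)                      # remove punctuation, keep & and -
    s = _LOOSE_HYPHEN.sub("", s)               # keep only intra-word hyphens
    s = s.replace("&", " and ")                # ampersand becomes the word "and"
    s = re.sub(r"\s+", " ", s).strip()         # tidy spacing left by earlier steps
    return " ".join(fold_plural(t) for t in s.split(" "))


license_name = normalize("SMITH & SONS PLUMBING SERVICES, LLC")
permit_name = normalize("Smith and Sons Plumbing Service Inc.")
print(license_name == permit_name)
```

A real plural folder would need a dictionary or stemmer; dropping a trailing "s" mangles names such as "texas", which is exactly the kind of policy decision the next section discusses.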

What normalization cannot do

Normalization does not pull apart a string that contains more than the business itself. Permit systems in several jurisdictions prepend a contact person's name to the business-name field; others append an email address or a trailing tag like CONTRACTOR LICENSE. These shapes survive normalization unchanged.

PullFirst treats those patterns as first-class and strips them before scoring, not after. Stripping-after looks tempting on any single row — it becomes dangerous the moment a contact name overlaps with a word in the business name. A person named Carpenter on a permit for a business starting with Carpenter is the canonical trap: strip the contact by position and the business name loses a real word. Every normalization rule is a policy decision about what is identity and what is noise, and it has to hold across millions of permits without a human in the loop.
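As a rough illustration of stripping before scoring, the sketch below removes two of the shapes named above, a trailing email address and a trailing CONTRACTOR LICENSE tag, by pattern rather than by position. The patterns and the function name are hypothetical. Contact-name prefixes are intentionally left out: the guide does not describe how PullFirst detects them, and a position-based rule is the trap described above.

```python
import re

# Hypothetical patterns for two of the field shapes the guide mentions.
_TRAILING_EMAIL = re.compile(r"\s*\S+@\S+\.\S+\s*$")
_TRAILING_TAG = re.compile(r"\s*\bcontractor license\b\s*$", re.IGNORECASE)


def strip_field_noise(raw):
    """Remove trailing emails and tags until the string stops changing."""
    s = raw.strip()
    while True:
        before = s
        s = _TRAILING_EMAIL.sub("", s)
        s = _TRAILING_TAG.sub("", s)
        if s == before:
            return s


print(strip_field_noise("Acme Roofing LLC CONTRACTOR LICENSE office@acme.example"))
```

The loop runs until nothing changes so that a tag followed by an email (or the reverse) is fully removed regardless of order.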

Match stages

Each candidate link runs through the stages in order and stops at the first one that produces a decision.

Exact

If the normalized permit string equals a normalized license string, the match is verified and the pipeline does not search further.

Fuzzy (Jaro-Winkler)

When exact fails, the matcher scores candidates with Jaro-Winkler similarity, where a higher score means a closer match. Jaro-Winkler weights prefix agreement and tolerates character transpositions, which fits company names better than generic edit distance.

  • Higher scores become likely matches.
  • Lower scores become possible matches.
  • When several candidates score close to one another on the same permit, the match is marked ambiguous.
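The exact and fuzzy stages can be sketched as follows. The Jaro-Winkler code is the textbook formulation with a prefix scale of 0.1 and a four-character prefix cap. The three thresholds and the `classify` function are hypothetical: the guide does not publish PullFirst's cutoffs, so the numbers below exist only to make the example run.

```python
LIKELY_MIN = 0.92        # hypothetical cutoff, not a PullFirst value
POSSIBLE_MIN = 0.85      # hypothetical cutoff, not a PullFirst value
AMBIGUITY_MARGIN = 0.02  # hypothetical "scored close to one another" window


def jaro(a, b):
    """Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_hit = [False] * len(a)
    b_hit = [False] * len(b)
    for i, ch in enumerate(a):
        for j in range(max(0, i - window), min(len(b), i + window + 1)):
            if not b_hit[j] and b[j] == ch:
                a_hit[i] = b_hit[j] = True
                break
    a_seq = [c for c, hit in zip(a, a_hit) if hit]
    b_seq = [c for c, hit in zip(b, b_hit) if hit]
    m = len(a_seq)
    if m == 0:
        return 0.0
    t = sum(x != y for x, y in zip(a_seq, b_seq)) / 2  # transpositions
    return (m / len(a) + m / len(b) + (m - t) / m) / 3


def jaro_winkler(a, b, prefix_scale=0.1):
    """Boost the Jaro score for a shared prefix of up to four characters."""
    j = jaro(a, b)
    prefix = 0
    for x, y in zip(a[:4], b[:4]):
        if x != y:
            break
        prefix += 1
    return j + prefix * prefix_scale * (1 - j)


def classify(permit_name, license_names):
    """Return (tier, candidates) for one normalized permit name."""
    if permit_name in license_names:                 # exact stage
        return "VERIFIED", [permit_name]
    scored = sorted(((jaro_winkler(permit_name, n), n) for n in license_names),
                    reverse=True)
    scored = [(s, n) for s, n in scored if s >= POSSIBLE_MIN]
    if not scored:
        return None, []                              # stays unlinked
    top = scored[0][0]
    close = [n for s, n in scored if top - s <= AMBIGUITY_MARGIN]
    if len(close) > 1:
        return "AMBIGUOUS", close
    return ("LIKELY" if top >= LIKELY_MIN else "POSSIBLE"), close


print(classify("john a smith", ["john b smith", "john c smith"]))
```

Two licenses that differ from the permit by a single middle initial score identically, so the sketch reports both rather than picking one.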

Geographic tiebreaker

Geographic agreement promotes a fuzzy match to the verified tier:

  • A permit and a license that share a city corroborate each other.
  • When the city strings diverge but the permit property and the contractor's registered address sit in the same county, the match still promotes.

County-level agreement exists because the common case is a contractor based in one metro suburb working on a permit in the next suburb over. City-only matching would drop that case; county-level catches it.
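A minimal sketch of the promotion rule, under stated assumptions: the `Place` type, the function names, and the place names are hypothetical, city and county strings are compared case-insensitively, and only single-candidate fuzzy matches are promoted. The guide does not say how this stage treats ambiguous matches, so the sketch leaves them untouched.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Place:
    city: str
    county: str


def _same(a, b):
    """Case-insensitive equality that never treats two blanks as agreement."""
    a, b = a.strip().lower(), b.strip().lower()
    return bool(a) and a == b


def geo_agrees(permit_site, contractor_addr):
    """City agreement corroborates; county agreement covers the next-suburb case."""
    return (_same(permit_site.city, contractor_addr.city)
            or _same(permit_site.county, contractor_addr.county))


def promote(tier, method, permit_site, contractor_addr):
    """Promote a fuzzy LIKELY or POSSIBLE match when geography agrees."""
    if (method == "fuzzy" and tier in ("LIKELY", "POSSIBLE")
            and geo_agrees(permit_site, contractor_addr)):
        return "VERIFIED", "fuzzy+geo"
    return tier, method


# Different suburbs, same county: still promoted.
print(promote("LIKELY", "fuzzy",
              Place("Eastvale", "Example County"),
              Place("Westvale", "Example County")))
```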

Confidence tiers

The tier travels with every contractor link in the API response.

Tier        Meaning
VERIFIED    Exact normalized match, or fuzzy match plus geographic agreement. Safe to treat as a conclusion.
LIKELY      Strong fuzzy match without geographic corroboration. Manually auditable; safe for aggregate reporting.
POSSIBLE    Weaker fuzzy match. Use with judgment; not recommended for aggregate reporting.
AMBIGUOUS   Multiple candidates scored close to one another. Treat the match as a pointer, not a conclusion.

Filter to VERIFIED and LIKELY for aggregate reporting unless you have a specific reason to include the rest.
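On the consumer side, that filter is a one-liner. The sketch below assumes contractor links have already been decoded into dictionaries; the `confidence` key comes from this guide, while `license_id` and the sample values are made up for illustration.

```python
REPORTING_TIERS = frozenset({"VERIFIED", "LIKELY"})


def for_aggregate_reporting(links, tiers=REPORTING_TIERS):
    """Keep only links whose confidence tier is in `tiers`."""
    return [link for link in links if link.get("confidence") in tiers]


# Hypothetical sample data.
links = [
    {"license_id": "L-001", "confidence": "VERIFIED"},
    {"license_id": "L-002", "confidence": "LIKELY"},
    {"license_id": "L-003", "confidence": "POSSIBLE"},
    {"license_id": "L-004", "confidence": "AMBIGUOUS"},
]
print([link["license_id"] for link in for_aggregate_reporting(links)])
```

Passing a different `tiers` set is the place to encode a specific reason for including POSSIBLE or AMBIGUOUS rows.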

Ambiguity

Most of the corpus lives in the ambiguous tier. The weight of ambiguity is structural, not incidental: tens of thousands of groups of licensed contractors normalize to the same string after suffixes and formatting come off. The largest such group contains sixteen distinct licenses that reduce to the same first-name, middle-initial, last-name pattern. Attaching one of those sixteen to a permit without additional signal is a guess dressed up as certainty.

PullFirst surfaces ambiguity rather than resolving it silently. A permit in the ambiguous tier exposes every candidate license with its score and method. Downstream consumers choose what to do:

  • Filter to verified matches only.
  • Accept likely matches.
  • Pull the full candidate list and rank it on your own criteria.
  • Surface the raw ambiguity to your own user.

The API never pretends a guess is a fact.
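One way to act on the third option, ranking the candidate list on your own criteria, is sketched below. The shape of the candidate objects is an assumption: `score` and `method` are described in this guide, but `license_id`, `trade`, and the sample values are hypothetical stand-ins for whatever extra evidence your domain has.

```python
def rerank(candidates, preferred):
    """Put candidates that satisfy `preferred` first, then sort by descending score."""
    return sorted(candidates, key=lambda c: (not preferred(c), -c["score"]))


# Hypothetical candidates for one ambiguous plumbing permit.
candidates = [
    {"license_id": "L-001", "score": 0.97, "method": "fuzzy", "trade": "electrical"},
    {"license_id": "L-002", "score": 0.96, "method": "fuzzy", "trade": "plumbing"},
    {"license_id": "L-003", "score": 0.96, "method": "fuzzy", "trade": "roofing"},
]

ranked = rerank(candidates, lambda c: c["trade"] == "plumbing")
print([c["license_id"] for c in ranked])
```

The reranked list is still a list of candidates. Whether the top entry is good enough to treat as a link remains the consumer's decision.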

Import-boundary discipline

Every identity claim the pipeline makes is stamped at import time with four fields: the normalization pass applied, the fuzzy score, the confidence tier, and the method label that produced it. Rows that cannot be matched with confidence stay unlinked and visible, with a reason.
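The import-time stamp can be pictured as a record that refuses to exist in a half-labeled state. The class below is a hypothetical sketch: the four stamped fields and the unlinked reason come from the paragraph above, while the class name, the id fields, and the validation rules are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ImportedLink:
    permit_id: str
    license_id: Optional[str]      # None means the row stays unlinked
    normalization_pass: str        # which normalizer version was applied
    score: Optional[float]         # fuzzy score, when a fuzzy stage ran
    confidence: Optional[str]      # VERIFIED, LIKELY, POSSIBLE, or AMBIGUOUS
    method: Optional[str]          # e.g. exact, fuzzy, fuzzy+geo
    unlinked_reason: Optional[str] = None

    def __post_init__(self):
        linked = self.license_id is not None
        if linked and (self.confidence is None or self.method is None):
            raise ValueError("linked rows must carry a confidence tier and a method")
        if not linked and not self.unlinked_reason:
            raise ValueError("unlinked rows must carry a reason")


# An unmatched permit stays visible, with a reason (values are made up).
row = ImportedLink("P-100", None, "v1", None, None, None,
                   unlinked_reason="no candidate above threshold")
print(row.unlinked_reason)
```

Validating in the constructor means a row with an unlabeled identity claim cannot be written in the first place, which is the point of holding the line at import.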

The alternative — pull everything in, fix it later — breaks at corpus scale. Once a permit row is written with a particular reading of its contractor field, every downstream cache, search index, and aggregation inherits that reading. Unwinding it later is far more expensive than holding the line at import.

Reading match fields in the API

Contractor links returned by the API carry:

  • confidence: one of VERIFIED, LIKELY, POSSIBLE, AMBIGUOUS.
  • method: the stage that produced the match (exact, fuzzy, fuzzy+geo, etc.).
  • score: the raw Jaro-Winkler similarity from the fuzzy stage, when applicable.

Treat confidence as the primary filter. Use method and score for audit and for custom reranking when your domain has evidence beyond what PullFirst uses.
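A small typed reader for these three fields might look like the sketch below. The JSON layout is an assumption (this guide names the fields but not the envelope around them), and `score` is treated as optional because it only applies to fuzzy matches.

```python
import json
from dataclasses import dataclass
from typing import Optional

TIERS = ("VERIFIED", "LIKELY", "POSSIBLE", "AMBIGUOUS")


@dataclass(frozen=True)
class ContractorLink:
    confidence: str
    method: str
    score: Optional[float]


def parse_link(obj):
    """Validate one decoded contractor link and return a typed record."""
    confidence = obj["confidence"]
    if confidence not in TIERS:
        raise ValueError(f"unknown confidence tier: {confidence!r}")
    score = obj.get("score")
    return ContractorLink(confidence, obj["method"],
                          None if score is None else float(score))


# Hypothetical payload fragment.
payload = json.loads('{"confidence": "VERIFIED", "method": "exact", "score": null}')
link = parse_link(payload)
print(link)
```

Rejecting unknown tiers at the boundary keeps a future tier name from silently passing through a filter written for today's four.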