Collecting public contractor records from Minnesota is straightforward. Reconciling them into one contractor across every source is not.
Four public sources can all reference the same licensed contractor and spell the name four different ways. The state license file uses strict UPPERCASE with a formal INC or LLC suffix. In some cities, the permit's business-name field holds the contact person's name followed by the business name. An enforcement order on the same contractor can use a different corporate suffix, like LLC instead of INC, and an ampersand where the license file had the word and. A consumer-complaint filing drops the corporate suffix altogether.
Four strings, one contractor, no shared key.
The PullFirst matching stack runs in stages. Each stage attaches a confidence score to the match, and the scores travel with the record into the output. The corpus breaks down into these tiers:
- VERIFIED MATCHES: 436,000
- LIKELY MATCHES: 489,000
- POSSIBLE MATCHES: 223,000
- AMBIGUOUS CANDIDATES: 2,540,000
- NAMED, UNMATCHED: 347,000
Normalization has to decide what is noise and what is identity.
The normalizer lowercases every string, strips business suffixes like LLC and INC and CORP, collapses whitespace, removes punctuation, and converts ampersands to the word and. It also folds singular and plural forms of the same word together, so a name ending in services and one ending in service don't read as two different businesses. After all of that runs, the UPPERCASE license version of a business name and a title-case version of the same business both reduce to one matching canonical string, even when one uses the word and while the other uses an ampersand.
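The steps in that paragraph can be sketched in a few lines of Python. This is a minimal illustration, not PullFirst's actual normalizer: the suffix list holds only the three suffixes named above, the plural rule is a deliberately crude stand-in, and the business names are made up.

```python
import re

# Assumption: only the three suffixes the article names. A real list would be longer.
SUFFIXES = {"llc", "inc", "corp"}

def fold_plural(token):
    """Crude singular/plural folding: drop one trailing 's' from longer words."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

def normalize(name):
    """Reduce a raw business name to a canonical comparison string."""
    s = name.lower().replace("&", " and ")   # lowercase; ampersand becomes the word "and"
    s = s.replace(".", "")                   # "INC." and "L.L.C." collapse to plain tokens
    s = re.sub(r"[^a-z0-9\s]", " ", s)       # remaining punctuation becomes whitespace
    tokens = s.split()                       # split() also collapses runs of whitespace
    while tokens and tokens[-1] in SUFFIXES: # strip trailing business suffixes
        tokens.pop()
    return " ".join(fold_plural(t) for t in tokens)

# Hypothetical names: an UPPERCASE license string and a title-case permit string.
a = normalize("NORTH STAR HEATING AND COOLING SERVICES, INC.")
b = normalize("North Star Heating & Cooling Service LLC")
print(a)       # north star heating and cooling service
print(a == b)  # True
```

The plural rule folds both sides the same way, so it stays consistent even when it mangles a word. That is the property that matters for comparison, though not for display.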
Normalization cannot pull apart a name that contains more than the business itself. A permit string with a contact person's name prepended to the business name still has the contact welded onto the front after the normalizer runs. Stripping the prepended words looks obvious on any single row and becomes dangerous when the contact's name overlaps with a word in the actual business name. A person with the last name Carpenter on a permit for a business whose name starts with Carpenter is the trap: strip the contact and the business name loses a real word. Every normalization rule is a decision about which parts of a string are name and which parts are noise, and that decision has to hold up across a few million permits without a human in the loop.
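The trap is easy to reproduce. The sketch below implements the naive rule (remove every token of the contact's name from the field) and runs it on two invented permit strings. The function and the names are hypothetical and exist only to show the failure.

```python
def strip_contact_naive(field, contact):
    """Risky rule: drop every token in the field that appears in the contact's name."""
    contact_tokens = set(contact.lower().split())
    return " ".join(t for t in field.lower().split() if t not in contact_tokens)

# No overlap between contact and business name: the rule looks fine.
safe = strip_contact_naive("dana olson northside roofing", "dana olson")
print(safe)  # northside roofing

# The contact's surname is also a word in the business name: the rule eats it.
trap = strip_contact_naive("lee carpenter carpenter brothers construction", "lee carpenter")
print(trap)  # brothers construction
```

The second result no longer matches a license filed under the full business name, and nothing in the output signals that a real word was removed.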
The match runs in stages, each with a confidence score.
Exact match goes first. If the normalized permit string equals a normalized license string, the match is verified and the pipeline stops searching. If exact fails, the matcher drops into fuzzy mode using Jaro-Winkler similarity, which rewards agreement at the start of the string and tolerates transposed characters, both of which suit company names. Higher scores become likely matches. Lower scores become possible matches. When several candidates score close to one another on the same permit, the match is marked ambiguous.
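A sketch of the staged matcher, assuming Jaro-Winkler similarity with the standard prefix weight of 0.1 over at most four characters. The thresholds (0.92 for likely, 0.85 for possible) and the 0.02 margin that defines "close to one another" are illustrative assumptions. The article does not publish PullFirst's actual cutoffs.

```python
def jaro(s1, s2):
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    hit1, hit2 = [False] * len(s1), [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not hit2[j] and s2[j] == ch:
                hit1[i] = hit2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = out_of_order = 0
    for i in range(len(s1)):
        if hit1[i]:
            while not hit2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (matches / len(s1) + matches / len(s2) + (matches - t) / matches) / 3

def jaro_winkler(s1, s2, p=0.1):
    """Jaro similarity boosted by the length of the shared prefix (max 4 chars)."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * p * (1 - j)

def match(permit_name, license_names, likely=0.92, possible=0.85, margin=0.02):
    """Staged match on already-normalized strings. Thresholds are assumptions."""
    if permit_name in license_names:  # stage 1: exact match, stop searching
        return {"tier": "verified", "method": "exact", "candidates": [(permit_name, 1.0)]}
    scored = sorted(((jaro_winkler(permit_name, n), n) for n in license_names), reverse=True)
    scored = [(s, n) for s, n in scored if s >= possible]
    if not scored:
        return {"tier": "unmatched", "method": "fuzzy", "candidates": []}
    top = scored[0][0]
    close = [(n, s) for s, n in scored if top - s <= margin]
    if len(close) > 1:                # several candidates score close together
        return {"tier": "ambiguous", "method": "fuzzy", "candidates": close}
    tier = "likely" if top >= likely else "possible"
    return {"tier": tier, "method": "fuzzy", "candidates": close}

# Textbook pair: one transposition, shared three-letter prefix.
print(round(jaro_winkler("martha", "marhta"), 4))        # 0.9611
print(match("martha", ["marhta"])["tier"])               # likely
print(match("martha", ["marhta", "matrha"])["tier"])     # ambiguous
```

Every return value carries the tier, the method, and the scored candidate list, so the confidence travels with the record instead of being collapsed into a yes-or-no link.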
Geographic agreement is the tiebreaker. A permit and a license that share a city corroborate each other and clear the match into the verified tier. When the city strings diverge but the permit property and the contractor's registered address sit in the same county, that still counts as agreement. A contractor based in one metro suburb working on a permit in the next suburb over is the common case, and county-level agreement is the right strength of signal for it. The tiebreaker exists because common business names collide across the state. A short name built from common trade words can cover dozens of distinct licensed businesses, and location is what picks between them.
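One way to express the city-then-county rule, as a sketch. The record fields (`city`, `county`, `license_id`) and the license numbers are hypothetical, and the policy of resolving a tie only when exactly one candidate agrees on location is an assumption, not a documented PullFirst rule.

```python
def same(a, b):
    """Case- and whitespace-insensitive string comparison."""
    return a.strip().lower() == b.strip().lower()

def geo_agreement(permit, lic):
    """Return the level at which permit and license agree: 'city', 'county', or None."""
    if same(permit["city"], lic["city"]):
        return "city"
    if same(permit["county"], lic["county"]):
        return "county"
    return None

def break_tie(permit, candidates):
    """Pick a candidate only when exactly one agrees geographically."""
    agreeing = [(c, geo_agreement(permit, c)) for c in candidates]
    agreeing = [(c, level) for c, level in agreeing if level]
    return agreeing[0] if len(agreeing) == 1 else None

# Two licenses that normalize to the same name (made-up license numbers).
permit = {"city": "Bloomington", "county": "Hennepin"}
candidates = [
    {"license_id": "LIC-A", "city": "Edina", "county": "Hennepin"},
    {"license_id": "LIC-B", "city": "Duluth", "county": "St. Louis"},
]
winner, level = break_tie(permit, candidates)
print(winner["license_id"], level)  # LIC-A county
```

If both candidates sat in Hennepin County, `break_tie` would return `None` and the permit would stay ambiguous rather than be assigned by guess.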
Ambiguity is part of the output.
Most of the corpus lives in the ambiguous tier. The 2.5 million ambiguous candidate rows cover roughly 870,000 distinct permits, which means most permits in that tier have more than one plausible license attached. Ambiguity is heavy because common names collide hard. Roughly 39,000 groups of licensed contractors normalize to the same string after suffixes and formatting come off. The largest of those groups contains sixteen distinct licenses that reduce to the same first-name, middle-initial, last-name pattern. Picking one of those sixteen and attaching it to a permit with no other signal is a guess dressed up as certainty.
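Collision groups of this kind fall out of a simple grouping pass over normalized license names. The sketch below assumes the names are already normalized, and the license IDs and names are invented.

```python
from collections import defaultdict

def collision_groups(licenses):
    """Group license IDs by normalized name; keep only names shared by 2+ licenses."""
    groups = defaultdict(list)
    for license_id, normalized_name in licenses:
        groups[normalized_name].append(license_id)
    return {name: ids for name, ids in groups.items() if len(ids) > 1}

licenses = [
    ("LIC-1", "john a smith"),
    ("LIC-2", "john a smith"),
    ("LIC-3", "john a smith"),
    ("LIC-4", "northside roofing"),
]
groups = collision_groups(licenses)
print(groups)  # {'john a smith': ['LIC-1', 'LIC-2', 'LIC-3']}
```

The same arithmetic explains the tier's size: 2,540,000 candidate rows over roughly 870,000 permits works out to about 2.9 plausible licenses per ambiguous permit.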
PullFirst publishes match confidence on every link. A permit shows every license it plausibly belongs to, with a score and a method. A consumer of the API can filter to verified matches only, accept likely matches, pull the full candidate list and rank it themselves, or surface the raw ambiguity to their own user. The system does not pretend a guess is a fact.
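On the consumer side, those options reduce to a filter and a sort. The link shape below (`license_id`, `tier`, `score`, `method`) is a hypothetical stand-in for whatever the API actually returns, and the scores are invented.

```python
def filter_links(links, accept=("verified",)):
    """Keep only links whose tier the caller is willing to trust."""
    return [link for link in links if link["tier"] in accept]

def rank_candidates(links):
    """Return the full candidate list, best score first."""
    return sorted(links, key=lambda link: link["score"], reverse=True)

# Hypothetical links attached to one permit.
links = [
    {"license_id": "LIC-2", "tier": "ambiguous", "score": 0.93, "method": "fuzzy"},
    {"license_id": "LIC-1", "tier": "ambiguous", "score": 0.95, "method": "fuzzy"},
    {"license_id": "LIC-9", "tier": "likely", "score": 0.97, "method": "fuzzy"},
]

print(filter_links(links))                                              # []
print([l["license_id"] for l in filter_links(links, ("verified", "likely"))])  # ['LIC-9']
print([l["license_id"] for l in rank_candidates(links)])                # ['LIC-9', 'LIC-1', 'LIC-2']
```

A strict caller gets an empty list here, not a forced pick, which is the behavior the tiers are meant to make possible.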
Contractor strings arrive padded with contact information.
Some permit systems mix extra content into the business-name field. A company name can arrive with a parenthesized contact name glued to the end, an email address welded in alongside it, or a string like CONTRACTOR LICENSE dangling off the back. These shapes repeat often enough that the normalizer has to treat them as first-class patterns rather than edge cases.
Left in the string, those shapes break the match. A business-name field still carrying a contact person or an embedded email scores against license records that carry neither, and the depressed fuzzy score drops what is really the same contractor into the possible or ambiguous tier. The normalizer strips these patterns before scoring, not after, because the score decides the tier. Match rate on affected feeds climbs only once the known patterns come off.
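The three shapes described above can each be handled by a small regular expression applied before scoring. This is a sketch with simplified, assumed patterns. Real feeds would need rules tuned per source, and the sample strings are invented.

```python
import re

EMAIL = re.compile(r"\S+@\S+\.\S+")                       # embedded email address
TRAILING_PAREN = re.compile(r"\s*\([^)]*\)\s*$")          # "(CONTACT NAME)" at the end
TRAILING_LABEL = re.compile(r"\s*contractor license\s*$", re.IGNORECASE)

def strip_known_patterns(field):
    """Remove known non-name patterns from a business-name field before scoring."""
    s = EMAIL.sub(" ", field)
    previous = None
    while previous != s:            # repeat until stable, so stacked patterns all come off
        previous = s
        s = TRAILING_PAREN.sub("", s)
        s = TRAILING_LABEL.sub("", s)
    return " ".join(s.split())      # collapse leftover whitespace

print(strip_known_patterns("NORTHSIDE ROOFING LLC (DANA OLSON)"))
# NORTHSIDE ROOFING LLC
print(strip_known_patterns("Northside Roofing LLC office@northside.example"))
# Northside Roofing LLC
print(strip_known_patterns("Northside Roofing (Dana Olson) CONTRACTOR LICENSE"))
# Northside Roofing
```

Only anchored or unmistakable patterns are removed here. A contact name prepended without delimiters is left alone, because stripping it safely needs more than a pattern.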
Quality checks belong at the import boundary.
The temptation with messy source data is to treat normalization as a post-hoc clean-up job. Pull everything in and fix it later. That breaks once the corpus gets large enough to matter. Once a permit row is written with a particular reading of its contractor field, every downstream consumer of that record inherits the same reading. Unwinding it later means reaching into caches and search indexes that already depend on the wrong answer.
PullFirst's rule is that every identity claim the pipeline makes carries a normalization pass, a fuzzy score, a confidence tier, and a method label at import time. A row that cannot be matched with confidence stays unlinked and visible, with a reason. That is cheaper than trying to unwind a bad join later.
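That rule can be enforced structurally by making the four fields mandatory on the record type itself. The sketch below is one possible shape. The class name, field names, and validation rules are assumptions for illustration, not PullFirst's schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class IdentityClaim:
    """One permit-to-license claim, fully described at import time."""
    permit_id: str
    raw_name: str
    normalized_name: str           # output of the normalization pass
    license_id: Optional[str]      # None when the row stays unlinked
    score: Optional[float]         # fuzzy score, or 1.0 for an exact match
    tier: str                      # e.g. "verified", "likely", "unmatched"
    method: str                    # e.g. "exact", "fuzzy", "none"
    reason: Optional[str] = None   # required when the row is unlinked

    def __post_init__(self):
        if self.license_id is None and not self.reason:
            raise ValueError("an unlinked row must carry a reason")
        if self.license_id is not None and self.score is None:
            raise ValueError("a linked row must carry a score")

# An unmatched row stays visible, with the reason attached (hypothetical values).
row = IdentityClaim(
    permit_id="P-001", raw_name="NORTHSIDE ROOFNG", normalized_name="northside roofng",
    license_id=None, score=None, tier="unmatched", method="none",
    reason="no license candidate scored above threshold",
)
print(row.tier, "-", row.reason)
```

Because the instance is frozen, a later stage cannot quietly rewrite the reading. Changing a claim means creating a new record, which keeps the history of each decision explicit.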
The join is the product.
The public record is free. Anyone can pull the licensing file or request the enforcement documents. A PullFirst record carries something the raw sources do not: a finished read on which license a permit belongs to, with a tier on every link. When the evidence is thin, the record stays in the ambiguous tier with every plausible license still attached, and the collision stays legible to the caller. The output tells you what is known, what is likely, and what is genuinely in contention. It never pretends a guess is an answer.
That discipline is the product. Ambiguity ships with the data.
SOURCE: Contractor license records from the Minnesota Department of Labor and Industry, building permit records from Minnesota city and county systems, state disciplinary and enforcement orders against licensed contractors, and federal OSHA inspection exports. Match counts are rounded to the nearest thousand and reflect a point-in-time snapshot of the PullFirst matching corpus; they shift as new permits import and matching reruns.