PULLFIRST [THE RECORD]
LOG / 003 · APRIL 19, 2026

Musings of an ETL operator.

The pipeline churns every day. Data models and diagrams litter the desk. The scheduler chirps in the background. Ingest, geocode, match, normalize, import, quality guard: the chain grows and grows.

Operator work on contractor data is unglamorous. The earlier posts describe the sources and the join. These are notes from the in-between. None of it is novel. All of it is what makes the output the record.

  • MN PERMIT ROWS MANAGED: 4,095,417
  • MN CITIES + COUNTIES: 102
  • DLI LICENSES: 274,532
  • OSHA INSPECTIONS: 122,688
  • CONFIDENCE TIERS: 4 published

Normalization keeps biting.

The matching post described the hard cases at a high level. In practice they arrive one at a time. A contact person's name suffixed to a business name. An ampersand where the license file used the word "and". A trailing email address in the contractor field. A trade acronym appended in parentheses. None of these look like edge cases on their own. They start to matter when the contractor showing up in two confidence tiers instead of one is a contractor the operator recognizes from the day before.
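A minimal sketch of how three of those cases might be folded away before matching. The function name and the regular expressions are illustrative assumptions, not the pipeline's actual rules, and the contact-person suffix is deliberately left out because it needs more than a pattern.

```python
import re

# Hypothetical patterns, chosen for illustration only.
_TRAILING_EMAIL = re.compile(r"\s*\S+@\S+\.\S+\s*$")
_TRAILING_PAREN = re.compile(r"\s*\([A-Za-z/&\s]{1,12}\)\s*$")


def normalize_contractor_name(raw: str) -> str:
    """Return a comparison key for a contractor name."""
    name = raw.strip()
    name = _TRAILING_EMAIL.sub("", name)   # drop a trailing email address
    name = _TRAILING_PAREN.sub("", name)   # drop a short trailing "(HVAC)"-style tag
    name = name.replace("&", " and ")      # ampersand and the word "and" compare equal
    name = re.sub(r"[^\w\s]", " ", name)   # remaining punctuation becomes whitespace
    return re.sub(r"\s+", " ", name).strip().upper()
```

Two spellings of the same business that differ only in these ways reduce to one key, so they can land in the same confidence tier.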

Geocoding is a pipeline of its own.

A permit rarely arrives with coordinates. Some jurisdictions publish a parcel identifier that carries coordinates by join. Some publish an address string and nothing else. Some publish a lot and block that only a local geocoder can make sense of. Turning any of those into a latitude and longitude is a second pipeline. Address strings are parsed, normalized, and submitted to the Census geocoder in batches. The result carries its own match tier: a rooftop match, an interpolated match along a street segment, a ZIP-centroid fallback, or nothing. A rooftop match and a ZIP centroid are not the same fact, and every coordinate in the PullFirst record carries the tier it was resolved at.
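One way to keep the tier attached to the coordinate is to make them a single value. The sketch below assumes hypothetical names (MatchTier, GeocodeResult, batches) and does not call the Census geocoder; it only shows the shape of the result and the batching step.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterator, Optional, Sequence


class MatchTier(Enum):
    ROOFTOP = "rooftop"
    INTERPOLATED = "interpolated"
    ZIP_CENTROID = "zip_centroid"
    NONE = "none"


@dataclass(frozen=True)
class GeocodeResult:
    tier: MatchTier
    lat: Optional[float] = None
    lon: Optional[float] = None

    def __post_init__(self) -> None:
        # A coordinate never travels without a tier, and "nothing" has no coordinate.
        if self.tier is MatchTier.NONE:
            if self.lat is not None or self.lon is not None:
                raise ValueError("an unmatched result must not carry coordinates")
        elif self.lat is None or self.lon is None:
            raise ValueError("a matched result needs both lat and lon")


def batches(addresses: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` addresses for submission."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(addresses), size):
        yield addresses[start:start + size]
```

Because the tier is a required field, a caller cannot read a latitude without also being handed how it was resolved.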

Parcels are a second join, not a guaranteed one.

Not every jurisdiction publishes a parcel identifier with the permit. The ones that do make the join trivial. The record attaches to a known parcel, pulls the parcel's address, acreage, and ownership lineage, and downstream consumers can filter by property directly. The ones that do not leave every permit dependent on address-string inference. Inference works often enough to be useful and fails often enough to stay honest. A permit without a parcel is not half a record, but it is a record with a different shape. A lot fewer downstream queries can be answered about it, and the record has to be legible about that.
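A small sketch of what "legible" could mean in practice: the joined record says how, or whether, it reached a parcel. The field names (parcel_id, parcel, parcel_join) are assumptions for illustration, and the address-string inference path is omitted.

```python
from typing import Any, Dict, Mapping


def attach_parcel(permit: Mapping[str, Any],
                  parcels_by_id: Mapping[str, Mapping[str, Any]]) -> Dict[str, Any]:
    """Return a copy of the permit with its parcel attached and the join labeled."""
    parcel_id = permit.get("parcel_id")
    parcel = parcels_by_id.get(parcel_id) if parcel_id else None
    joined = dict(permit)  # leave the input untouched
    joined["parcel"] = parcel
    # The label tells downstream queries which shape of record this is.
    joined["parcel_join"] = "published_id" if parcel is not None else "none"
    return joined
```

A consumer filtering by property can then select on parcel_join rather than guessing from a missing field.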

Freshness is a field, not a footnote.

Every record in the corpus carries a last-pulled timestamp. The operator discipline is to keep those timestamps moving. When a source has not been refreshed in days, that is a fact the caller can read directly off the record, not a caveat buried in documentation. Freshness is one of the few contractor-data claims that can be verified without a lawyer: either the record was pulled today or it was not, and the timestamp settles it.
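A sketch of the check a caller could run against that timestamp. The function names and the idea of a caller-chosen max_age are assumptions; the record's actual field name is not given here.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def record_age(last_pulled: datetime, now: Optional[datetime] = None) -> timedelta:
    """How long ago the record was last pulled."""
    if last_pulled.tzinfo is None:
        raise ValueError("last_pulled must be timezone-aware")
    now = now or datetime.now(timezone.utc)
    return now - last_pulled


def is_stale(last_pulled: datetime, max_age: timedelta,
             now: Optional[datetime] = None) -> bool:
    """True when the record is older than the caller's tolerance."""
    return record_age(last_pulled, now) > max_age
```

The tolerance stays with the caller: the record states when it was pulled, and each consumer decides what counts as too old.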

This is why the pipeline keeps humming.

Scheduling is where jurisdictions collide.

The pipeline runs hundreds of source pulls per day, plus geocode backfills, match reruns, and import chains that depend on all of the above. Courtesy to public endpoints requires spacing them out. So does the database. Any one pull can starve the others if it runs unbounded. Any one import can deadlock with the next if it holds a lock too long. Two concurrent normalizes against the same jurisdiction become an argument about which one's view of the truth wins. The scheduler's job is to keep all of that from happening, and to notice quickly when it does.
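The constraints above can be sketched with three asyncio primitives: a semaphore for the global bound, one lock per jurisdiction, and a timeout so no job runs unbounded. PullScheduler and its parameters are hypothetical; the real scheduler is not described in this post.

```python
import asyncio
from collections import defaultdict
from typing import Awaitable, Callable


class PullScheduler:
    """Illustrative scheduler; construct it inside a running event loop."""

    def __init__(self, max_concurrent: int, timeout_s: float) -> None:
        self._slots = asyncio.Semaphore(max_concurrent)  # global concurrency bound
        self._locks = defaultdict(asyncio.Lock)          # one lock per jurisdiction
        self._timeout_s = timeout_s

    async def run(self, jurisdiction: str,
                  job: Callable[[], Awaitable[object]]) -> object:
        # Take the jurisdiction lock first so queued jobs for one jurisdiction
        # do not sit on global slots while they wait for each other.
        async with self._locks[jurisdiction]:
            async with self._slots:
                # wait_for cancels the job and raises TimeoutError if it overruns.
                return await asyncio.wait_for(job(), timeout=self._timeout_s)
```

Two normalizes against the same jurisdiction are serialized by the lock, while pulls for different jurisdictions still share the bounded pool.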

Silent failure is the failure that matters.

Every serious incident in this pipeline has been a silent one. Loud failures are easy. A job throws, a retry fires, an alert lands on a dashboard. A job that succeeds with the wrong data is the one that costs a week of unwinding downstream. The operator discipline goes into making silent failures loud: row-count deltas per source, shape assertions at import, per-day diffs that refuse to merge when something looks wrong, and alerts that fire on a jurisdiction dropping to zero.
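A sketch of three of those guards as one pre-import check. The function name, the 50% drop threshold, and the field names in the usage are assumptions made for illustration, not the pipeline's actual settings.

```python
from typing import Iterable, List, Mapping, Sequence


def import_problems(source: str,
                    rows: Sequence[Mapping[str, object]],
                    previous_count: int,
                    required_fields: Iterable[str],
                    max_drop: float = 0.5) -> List[str]:
    """Return reasons to refuse this import; an empty list means it may proceed."""
    problems: List[str] = []
    count = len(rows)
    if previous_count > 0 and count == 0:
        # A source that had rows yesterday and has none today.
        problems.append(f"{source}: row count dropped to zero (was {previous_count})")
    elif previous_count > 0 and (previous_count - count) / previous_count > max_drop:
        # Row-count delta beyond the allowed fraction.
        problems.append(f"{source}: row count fell from {previous_count} to {count}")
    required = set(required_fields)
    for index, row in enumerate(rows):
        # Shape assertion: report the first row missing a required field.
        missing = sorted(required.difference(row))
        if missing:
            problems.append(f"{source}: row {index} is missing {missing}")
            break
    return problems
```

A job that "succeeds" with an empty or misshapen pull now produces a non-empty list, which is something an alert can fire on.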

The work behind each field is why it is the record.

None of this shows up in the API. The caller sees a permit, a confidence tier, a timestamp, a coordinate with its match tier, a parcel join where the jurisdiction publishes one. What the caller does not see is the work behind each of those fields. That work is why the output is called the record. You may want to pull it first.

REFERENCE: The per-source freshness, refresh cadence, and field inventory this post alludes to are documented in the data sources guide.

SOURCE: Operational snapshot of the PullFirst pipeline as of April 19, 2026. Record counts are point-in-time and shift continuously as new filings import.

[END OF FILE]