Mentat · curated
Artificial Intelligence high · independent

Low-background steel

A public catalogue points to text and images made before late 2022 — the moment machine-written content began seeping into everything scraped from the open web.

Battleships that sank before 1945 are worth salvaging for their steel. Every furnace fired since the first atomic tests draws in air laced with fallout, so all new steel carries a faint radioactive signature, enough to make it useless for the most sensitive Geiger counters and medical sensors. The clean metal has to come from wrecks that went down before the bomb. It is called low-background steel, and there is a fixed amount of it left on the sea floor.

Generative AI has polluted the data — Robyn Speer, retiring wordfreq

A small catalogue at lowbackgroundsteel.ai borrows the analogy for text. Since late 2022, when ChatGPT arrived, the open web has filled with machine-generated writing — and anything scraped from it now is a blend of human and synthetic prose that can't be cleanly separated. So the site collects pointers to corpora frozen before the contamination: the Wikipedia snapshot from August 2022, Project Gutenberg, the Library of Congress photo archive, the code GitHub sealed into an Arctic vault in 2020. Uncontaminated human text, in this framing, is a finite resource that stopped being made.

The site itself is modest — a community-submission blog, not the audited, fingerprinted registry the pitch sometimes implies; treat it as an index, not a vault. The sharper evidence sits one link away. In 2024 the researcher behind wordfreq, a tool that tracked how often words are used across forty languages, stopped updating it. Part of the reason was pollution — the post-2022 web no longer samples human language cleanly. Part was money: Reddit and others stopped giving their archives away and began selling them at prices only the largest labs can pay.

Those are the quiet stakes of the analogy. The supply of clean human data is being fenced off from two directions at once, spoiled by machine output and locked behind paywalls, while the models that caused the spoiling are the ones that can still afford what's left.

The lenses

Novelty 2
Impact · breadth 2
Impact · depth 2
Actionable 2
Substance 3
Hype 3

The facts

What it is: A public, community-curated catalogue of text/image sources created before the late-2022 generative-AI era
Cost: Free to browse; open submissions
The catch: A blog of pointers, not a verified or audited dataset; the sources are frozen snapshots, not a cleaned corpus
Open lowbackgroundsteel.ai →
