Uncanny Atlas

Run book

The hands-on, terminal-facing companion to How it works - the same pipeline, but as commands you can run and counts you can watch. Each step links back to the idea it implements.

Running the pipeline

Steps run in order; each is resumable and safe to re-run. Steps 3–7 need a local Ollama server with gemma3:4b and nomic-embed-text pulled. Commands assume uv sync --all-extras has been run.

1 · Collect raw Reddit data

912,187 comments collected

uv run isthisai-collect submissions
uv run isthisai-collect comments
# other subreddit:
uv run isthisai-collect comments --subreddit RealOrAI

shortcut: make collect

Fetches every submission and comment from the subreddits via the PullPush API into the submissions and comments tables. No labels yet.

→ How it works: the problem

2 · Fill gaps from Arctic Shift (optional)

30,996 submissions

uv run isthisai-import api submissions
uv run isthisai-import api comments
# or from downloaded dumps:
uv run isthisai-import file comments data/RC_*.zst

shortcut: make import-arctic-api

PullPush can miss windows of history. Arctic Shift backfills missing submissions/comments into the same tables (deduplicated on id).
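Deduplicating on id can be sketched as follows. This is a minimal illustration that assumes a SQLite database with id as the primary key; the two-column table and the backfill helper are hypothetical and not taken from the project's schema.

```python
import sqlite3


def backfill(conn: sqlite3.Connection, rows: list[tuple[str, str]]) -> int:
    """Insert (id, body) rows, skipping ids already present. Returns rows added."""
    before = conn.total_changes
    # OR IGNORE makes a primary-key collision a no-op instead of an error,
    # so the first source to supply a row wins.
    conn.executemany("INSERT OR IGNORE INTO comments (id, body) VALUES (?, ?)", rows)
    conn.commit()
    return conn.total_changes - before


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (id TEXT PRIMARY KEY, body TEXT)")

added_first = backfill(conn, [("c1", "from PullPush"), ("c2", "from PullPush")])
added_second = backfill(conn, [("c2", "from Arctic Shift"), ("c3", "from Arctic Shift")])
total = conn.execute("SELECT COUNT(*) FROM comments").fetchone()[0]
c2_body = conn.execute("SELECT body FROM comments WHERE id = 'c2'").fetchone()[0]
```

Because collisions are ignored rather than overwritten, re-running the import is harmless, which matches the "safe to re-run" property of the steps.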

3 · Extract indicators with the LLM (sample)

18,474 LLM-extracted rows

uv run isthisai-extract sample

Filters to opinion comments (keyword + length, minus bots/deleted), randomly samples a few thousand (default 2,500), and asks gemma3:4b (Ollama) for the indicators cited in each comment - one comment per call, so every indicator is tied to the right comment. Each indicator becomes a row in comment_indicators with category = NULL.

→ How it works: reading the comments
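The candidate filter and random sample in this step might look roughly like the sketch below. The keyword set and the 20-character gate come from "The two upstream filters"; the whole-word matching, the bot list, the comment dict layout and the function names are assumptions made for illustration, and the real logic lives in extract.py.

```python
import random
import re

# Keywords as listed under "The two upstream filters"; the authoritative list is in extract.py.
OPINION_KEYWORDS = frozenset({"ai", "real", "fake", "generated", "obvious", "look"})
MIN_CHARS = 20
BOT_AUTHORS = frozenset({"AutoModerator"})  # hypothetical bot list
REMOVED_BODIES = frozenset({"[deleted]", "[removed]"})


def is_candidate(comment: dict) -> bool:
    """Keyword + length gate, minus bots and deleted comments."""
    body = comment["body"].strip()
    if body in REMOVED_BODIES or comment["author"] in BOT_AUTHORS:
        return False
    if len(body) < MIN_CHARS:
        return False
    # Assumption: keywords are matched as whole words, case-insensitively.
    words = set(re.findall(r"[a-z]+", body.lower()))
    return not OPINION_KEYWORDS.isdisjoint(words)


def sample_candidates(comments: list[dict], k: int = 2500, seed: int = 0) -> list[dict]:
    """Randomly pick up to k eligible comments; the seed keeps the demo repeatable."""
    candidates = [c for c in comments if is_candidate(c)]
    return random.Random(seed).sample(candidates, min(k, len(candidates)))


comments = [
    {"id": "a", "author": "alice", "body": "This looks fake, the hands are melted together."},
    {"id": "b", "author": "AutoModerator", "body": "Please remember this is an AI detection subreddit."},
    {"id": "c", "author": "bob", "body": "fake"},
    {"id": "d", "author": "carol", "body": "[deleted]"},
    {"id": "e", "author": "dave", "body": "The lighting on the left side is inconsistent."},
    {"id": "f", "author": "erin", "body": "Definitely AI generated, look at the background text."},
]
picked = sample_candidates(comments)
picked_ids = sorted(c["id"] for c in picked)
```

Each sampled comment would then be sent to the model on its own, one call per comment, so that returned indicators cannot be attributed to the wrong comment.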

4 · Build the taxonomy (the seeds)

189 taxonomy indicators (seeds)

uv run isthisai-extract taxonomy

Takes the ~200 most frequent indicator phrases and asks the LLM to sort each into one of the categories. These become the seeds that semantic expansion hunts from. Writes indicator_taxonomy, then backfills the category on every matching comment_indicators row.

→ How it works: seeds
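The seed selection and category backfill can be pictured with the sketch below. It works on in-memory rows rather than the real tables, the LLM call is replaced by a stub lookup, and the category names ("Anatomy", "Lighting") are placeholders; none of these names come from the project.

```python
from collections import Counter


def top_phrases(rows: list[dict], n: int = 200) -> list[str]:
    """Return the n most frequent indicator phrases, normalised to lower case."""
    counts = Counter(r["indicator"].strip().lower() for r in rows)
    return [phrase for phrase, _ in counts.most_common(n)]


def stub_categorise(phrase: str) -> str:
    """Stand-in for the LLM call that sorts a phrase into a category."""
    return {"extra fingers": "Anatomy", "flat lighting": "Lighting"}[phrase]


def backfill_categories(rows: list[dict], taxonomy: dict[str, str]) -> int:
    """Set category on every uncategorised row whose phrase is in the taxonomy."""
    updated = 0
    for row in rows:
        category = taxonomy.get(row["indicator"].strip().lower())
        if category is not None and row["category"] is None:
            row["category"] = category
            updated += 1
    return updated


phrases = ["extra fingers", "Extra fingers", "flat lighting",
           "weird vibe", "extra fingers", "flat lighting"]
rows = [{"comment_id": f"c{i}", "indicator": p, "category": None}
        for i, p in enumerate(phrases, start=1)]

seeds = top_phrases(rows, n=2)  # stands in for the ~200 most frequent phrases
taxonomy = {phrase: stub_categorise(phrase) for phrase in seeds}
updated = backfill_categories(rows, taxonomy)
```

Rows whose phrase did not make the cut ("weird vibe" here) keep category = NULL until a later step or a curator assigns one.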

5 · Embed taxonomy + comments

912,187 embedded comments

uv run isthisai-embed indicators
uv run isthisai-embed comments --all

Generates 768-dim nomic-embed-text vectors for taxonomy indicators (indicator_embeddings) and comment bodies (comment_embeddings). Add --all to embed the whole corpus, not just indicator-bearing comments - the single biggest lever on semantic coverage. Expensive but resumable.

→ How it works: the map of meaning
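"Resumable" here can be read as "only embed what has no vector yet". The sketch below illustrates that idea under several assumptions: a SQLite database, a simplified two-table layout, vectors stored as JSON text, and a fake embedder so the demo runs without a server. The ollama_embed helper shows how a local Ollama server could be called through its /api/embeddings endpoint; it is not called in the demo and is not the project's own client code.

```python
import json
import sqlite3
import urllib.request
from typing import Callable


def ollama_embed(text: str, model: str = "nomic-embed-text") -> list[float]:
    """Request one embedding from a local Ollama server (default port 11434)."""
    request = urllib.request.Request(
        "http://localhost:11434/api/embeddings",
        data=json.dumps({"model": model, "prompt": text}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["embedding"]


def embed_pending(conn: sqlite3.Connection, embed: Callable[[str], list[float]]) -> int:
    """Embed only comments that have no stored vector yet. Returns how many were embedded."""
    pending = conn.execute(
        "SELECT c.id, c.body FROM comments c "
        "LEFT JOIN comment_embeddings e ON e.comment_id = c.id "
        "WHERE e.comment_id IS NULL"
    ).fetchall()
    for comment_id, body in pending:
        conn.execute(
            "INSERT INTO comment_embeddings (comment_id, vector) VALUES (?, ?)",
            (comment_id, json.dumps(embed(body))),
        )
        conn.commit()  # commit per row so an interrupted run keeps its progress
    return len(pending)


def fake_embed(text: str) -> list[float]:
    """Deterministic stand-in for the model, used so the demo needs no server."""
    return [float(len(text)), 0.0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (id TEXT PRIMARY KEY, body TEXT)")
conn.execute("CREATE TABLE comment_embeddings (comment_id TEXT PRIMARY KEY, vector TEXT)")
conn.executemany("INSERT INTO comments VALUES (?, ?)",
                 [("c1", "six fingers on the left hand"), ("c2", "the text is gibberish")])

first_run = embed_pending(conn, fake_embed)
second_run = embed_pending(conn, fake_embed)  # nothing left to do
conn.execute("INSERT INTO comments VALUES ('c3', 'shadows point two ways')")
third_run = embed_pending(conn, fake_embed)   # only the new comment
```

Swapping fake_embed for ollama_embed would produce real 768-dimensional vectors, provided the server is running and the model has been pulled.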

6 · Ground (drop hallucinated indicators)

18,474 LLM-extracted rows

uv run isthisai-embed ground

The text-only model sometimes invents an indicator for a comment that just reacts ("it's obviously AI"). This compares each LLM indicator's embedding to its comment's embedding and deletes the ones below the grounding threshold (default 0.45). Semantic and keyword rows are left alone.

→ How it works: why it hallucinates
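The grounding check can be sketched as a similarity comparison between each indicator vector and its comment vector. The sketch assumes cosine similarity, uses toy 3-dimensional vectors in place of the 768-dimensional ones, and marks each row's origin with a hypothetical source field; only the 0.45 default comes from the text above.

```python
import math

GROUNDING_THRESHOLD = 0.45  # default quoted in this step


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def ground(rows, indicator_vecs, comment_vecs, threshold=GROUNDING_THRESHOLD):
    """Split rows into (kept, dropped). Only LLM rows are ever dropped."""
    kept, dropped = [], []
    for row in rows:
        if row["source"] != "llm":  # semantic and keyword rows are left alone
            kept.append(row)
            continue
        similarity = cosine(indicator_vecs[row["indicator"]], comment_vecs[row["comment_id"]])
        (kept if similarity >= threshold else dropped).append(row)
    return kept, dropped


comment_vecs = {"c1": [1.0, 0.0, 0.0], "c2": [0.0, 1.0, 0.0]}
indicator_vecs = {"extra fingers": [0.9, 0.1, 0.0], "plastic skin": [1.0, 0.2, 0.0]}
rows = [
    {"comment_id": "c1", "indicator": "extra fingers", "source": "llm"},       # close to its comment
    {"comment_id": "c2", "indicator": "plastic skin", "source": "llm"},        # far from its comment
    {"comment_id": "c2", "indicator": "extra fingers", "source": "semantic"},  # not an LLM row
]
kept, dropped = ground(rows, indicator_vecs, comment_vecs)
```

In this toy data the "plastic skin" row is dropped because its vector points away from the comment it was attributed to, which is the signature of an invented indicator.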

7 · Semantic expansion

24,042 semantic matches

uv run isthisai-embed semantic

Compares every comment embedding to every seed (taxonomy-indicator) embedding; above the similarity threshold (default 0.73) it inserts a new comment_indicators row (batch_id semantic_*). Only comments ≥20 chars (non-bot, non-[deleted]) are matched - the same length gate as the LLM sample, so one-word/emoji reactions are excluded. Seeds a curator marked Noise are skipped. This is how coverage grows beyond the LLM sample - re-run it after embedding more comments.

→ How it works: finding the neighbours
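A brute-force version of the expansion loop is sketched below, again assuming cosine similarity and toy 3-dimensional vectors. The dict layouts, the expand function and the batch id value are illustrative; the 0.73 default, the 20-character gate and the Noise skip are the parts taken from the description above.

```python
import math

SIMILARITY_THRESHOLD = 0.73  # default quoted in this step
MIN_CHARS = 20


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def expand(comments, seeds, threshold=SIMILARITY_THRESHOLD, batch_id="semantic_demo"):
    """Return a new indicator row for every (comment, seed) pair at or above the threshold."""
    new_rows = []
    for comment in comments:
        if len(comment["body"]) < MIN_CHARS or comment["body"] == "[deleted]":
            continue  # same length gate as the LLM sample
        for seed in seeds:
            if seed["category"] == "Noise":
                continue  # curator-marked Noise seeds are never expanded
            if cosine(comment["vec"], seed["vec"]) >= threshold:
                new_rows.append({"comment_id": comment["id"], "indicator": seed["phrase"],
                                 "category": seed["category"], "batch_id": batch_id})
    return new_rows


comments = [
    {"id": "c1", "body": "the hand has six fingers and no thumb", "vec": [0.9, 0.1, 0.0]},
    {"id": "c2", "body": "lol", "vec": [1.0, 0.0, 0.0]},
    {"id": "c3", "body": "the shadows fall in two different directions", "vec": [0.0, 1.0, 0.1]},
]
seeds = [
    {"phrase": "extra fingers", "category": "Anatomy", "vec": [1.0, 0.0, 0.0]},
    {"phrase": "inconsistent shadows", "category": "Lighting", "vec": [0.0, 1.0, 0.0]},
    {"phrase": "looks off", "category": "Noise", "vec": [0.7, 0.7, 0.0]},
]
matches = expand(comments, seeds)
loose_matches = expand(comments, seeds, threshold=0.1)
```

Lowering the threshold in the last line admits a weak c1 / "inconsistent shadows" pair, a small-scale picture of the coverage versus false-positive trade-off. At corpus scale the pairwise loop would normally be replaced by a matrix product or a vector index.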

8 · Inspect anytime

671 indicator aliases (merges)

uv run isthisai-stats

shortcut: make stats

Prints corpus counts and date ranges to the terminal. Or just use the Explore tabs in this app - they read the same database live.

The two upstream filters

Before the model reads anything, two hand-maintained lists in extract.py shape the input - and both materially affect the rankings, so they're worth understanding before trusting the numbers.

Filter	What it does & why	Current value
Candidate keywords `OPINION_KEYWORDS`	Selects which comments are eligible for the LLM sample - a comment must contain at least one (and be ≥20 chars, non-bot). Deliberately topical, not visual-indicator words: filtering for “finger”/“shadow” would pre-decide the findings (you'd only “discover” the indicators you searched for). Broad on purpose - semantic expansion handles recall.	AI, real, fake, generated, obvious, look
Stop-list `STOP_INDICATORS`	Drops a returned “indicator” if it's never a property of the image. Two kinds get dropped: where an image was posted, and pure verdicts ("definitely AI", "not AI") that are judgements, not evidence, and otherwise swamp the rankings. Generation tools, bare subjects and watermarks are deliberately kept in: a SynthID watermark is strong evidence of AI, not noise. Exact, case-insensitive match; the full list lives in extract.py.	facebook, tiktok, reddit, definitely ai, not ai, obviously ai, 100% ai, ai generated, ai slop
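The stop-list's exact, case-insensitive match can be sketched as below. The entries are the values shown in the table, read as nine separate phrases (an interpretation of the table cell); the whitespace stripping and the drop_stopped helper are assumptions, and the full list in extract.py may differ.

```python
# Values from the table above; the complete list lives in extract.py.
STOP_INDICATORS = frozenset({
    "facebook", "tiktok", "reddit",
    "definitely ai", "not ai", "obviously ai", "100% ai", "ai generated", "ai slop",
})


def drop_stopped(indicators: list[str]) -> list[str]:
    """Remove indicators that exactly match a stop-list entry, ignoring case."""
    return [i for i in indicators if i.strip().lower() not in STOP_INDICATORS]


returned = [
    "Definitely AI",                # pure verdict: dropped
    "Facebook",                     # where the image was posted: dropped
    "SynthID watermark",            # watermark: deliberately kept
    "extra fingers",                # visual evidence: kept
    "definitely AI looking hands",  # not an exact match: kept
]
surviving = drop_stopped(returned)
```

Because the match is exact rather than substring-based, a longer phrase that merely contains a verdict survives the filter, so the stop-list stays narrow by design.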

Improving accuracy

Automated extraction is imperfect - the LLM mislabels, the keyword filter lets noise through, and semantic expansion has no notion of “correct”. These are the levers, cheapest first (the ideas behind them are in How it works → cleaning up).

Lever	What it does	Where
Mark Noise	Tag phrases that aren't real indicators (vague judgments, meta-commentary). Cascades to every comment using the phrase and writes through to the taxonomy, so it's durable - a later semantic re-expansion won't reintroduce it (Noise phrases stop being expanded).	Curate → Indicators
Re-categorise phrases	Move an indicator to the right category. One change backfills all rows sharing that phrase - highest leverage per click.	Curate → Indicators
Fix the taxonomy	Edit the source-of-truth indicator/category. Backfills existing rows and steers all future semantic expansion - fixes the root cause, not just symptoms.	Curate → Indicators
Merge near-duplicates	Consolidate scattered phrasings (“wrong hands”, “funny hands”, “hands look messed up”) into one canonical indicator (a merged group), so frequency counts reflect the real concept rather than splitting across synonyms.	Curate → Merge
Tune the similarity threshold	Lower the 0.73 default for more coverage (more false positives); raise it for higher precision (fewer matches). Re-run `isthisai-embed semantic` after changing `ISTHISAI_EMBED_THRESHOLD`.	pipeline, step 7
Extract a larger sample	Run LLM extraction over more comments for broader, higher-confidence coverage before relying on semantic expansion to fill the rest.	pipeline, step 3
Widen embedding coverage	By default only indicator-bearing comments are embedded, so semantic expansion searches a tiny pool. Embed the whole corpus, then re-expand, to surface indicator mentions in comments the sample never touched - the biggest lever on how representative the counts are.	`isthisai-embed comments --all`
Rename an indicator / drop a bad comment	Pick any indicator, see every comment that cites it, then rename the canonical indicator or remove an obviously mis-attributed comment from it.	Explore: Inspect indicator
Remove bad indicators	Deleting a taxonomy indicator stops it being re-created on the next semantic run - the only way to permanently suppress a bad expansion source.	Curate → Indicators
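The effect of merging near-duplicates on frequency counts can be shown with a small alias map. The alias phrasings are the ones quoted in the table; the canonical name "malformed hands" and the helper functions are hypothetical.

```python
from collections import Counter

# alias -> canonical indicator; the canonical name is a hypothetical example.
ALIASES = {
    "wrong hands": "malformed hands",
    "funny hands": "malformed hands",
    "hands look messed up": "malformed hands",
}


def canonical(phrase: str, aliases: dict[str, str] = ALIASES) -> str:
    """Map a phrase to its canonical indicator, or return it unchanged."""
    key = phrase.strip().lower()
    return aliases.get(key, key)


def merged_counts(phrases: list[str]) -> Counter:
    """Count indicators after collapsing aliases into their merged group."""
    return Counter(canonical(p) for p in phrases)


cited = ["wrong hands", "funny hands", "hands look messed up", "Wrong hands", "plastic skin"]
raw = Counter(p.strip().lower() for p in cited)
merged = merged_counts(cited)
```

In the raw counts no hand-related phrase appears more than twice; after merging, the single canonical indicator carries all four mentions and ranks above "plastic skin" as the underlying concept deserves.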

Beyond the pipeline

The refined comment_indicators data (verified indicators + categories) can be exported as JSONL to fine-tune a replacement for gemma3:4b. Point the pipeline at the new model via ISTHISAI_OLLAMA_MODEL; see the project README for the export and training steps.