Run book

The hands-on, terminal-facing companion to How it works - the same pipeline, but as commands you run and counts you watch. Each step links back to the idea it implements.

Your data right now

Comments: 912,187
Indicator rows: 61,493
Taxonomy indicators: 195
Embedded comments: 912,187
Semantic matches: 43,018
Active merges: 232

Running the pipeline

Steps run in order; each is resumable and safe to re-run. Steps 3–7 need a local Ollama server with gemma3:4b and nomic-embed-text pulled. Commands assume uv sync --all-extras has been run.

1 · Collect raw Reddit data

912,187 comments collected
uv run isthisai-collect submissions
uv run isthisai-collect comments
# other subreddit:
uv run isthisai-collect comments --subreddit RealOrAI
shortcut: make collect

Fetches every submission and comment from the subreddits via the PullPush API into the submissions and comments tables. No labels yet.

→ How it works: the problem

2 · Fill gaps from Arctic Shift (optional)

30,996 submissions
uv run isthisai-import api submissions
uv run isthisai-import api comments
# or from downloaded dumps:
uv run isthisai-import file comments data/RC_*.zst
shortcut: make import-arctic-api

PullPush can miss windows of history. Arctic Shift backfills missing submissions/comments into the same tables (deduplicated on id).
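Deduplication on id can be sketched with an insert that skips existing keys. This is a minimal illustration only: it assumes a SQLite-style store, and the two-column comments schema and the backfill helper are hypothetical, not the project's importer.

```python
import sqlite3


def backfill(conn, rows):
    """Insert (id, body) rows, skipping ids already present.

    Returns the number of rows that were actually inserted.
    """
    before = conn.total_changes
    conn.executemany(
        "INSERT OR IGNORE INTO comments (id, body) VALUES (?, ?)", rows
    )
    conn.commit()
    return conn.total_changes - before


# Hypothetical, simplified schema: the primary key on id does the deduplication.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (id TEXT PRIMARY KEY, body TEXT)")

first = backfill(conn, [("c1", "from PullPush"), ("c2", "from PullPush")])
second = backfill(conn, [("c2", "from Arctic Shift"), ("c3", "from Arctic Shift")])
print(first, second)
```

Because existing ids are ignored rather than overwritten, re-running an import leaves rows collected earlier untouched.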

3 · Extract indicators with the LLM (sample)

18,475 LLM-extracted rows
uv run isthisai-extract sample

Filters to opinion comments (keyword + length, minus bots/deleted), randomly samples a few thousand (default 2,500), and asks gemma3:4b (Ollama) for the indicators cited in each comment - one comment per call, so every indicator is tied to the right comment. Each indicator becomes a row in comment_indicators with category = NULL.
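The candidate filter and random sample can be sketched as follows. The keyword set is only the subset shown on this page, and the whole-word matching, the bot list, and the helper names are assumptions for illustration, not the code in extract.py.

```python
import random
import re

OPINION_KEYWORDS = {"ai", "real", "fake", "generated", "obvious", "look"}  # subset
MIN_CHARS = 20
BOT_AUTHORS = {"AutoModerator"}            # hypothetical bot list
DELETED_BODIES = {"[deleted]", "[removed]"}


def is_candidate(comment):
    """True if the comment is eligible for the LLM sample."""
    body = comment["body"]
    if body in DELETED_BODIES or comment["author"] in BOT_AUTHORS:
        return False
    if len(body) < MIN_CHARS:
        return False
    # Assumption: keywords are matched as whole lowercase words.
    words = set(re.findall(r"[a-z0-9]+", body.lower()))
    return not OPINION_KEYWORDS.isdisjoint(words)


def draw_sample(comments, n=2500, seed=None):
    """Randomly pick up to n eligible comments."""
    candidates = [c for c in comments if is_candidate(c)]
    rng = random.Random(seed)
    return rng.sample(candidates, min(n, len(candidates)))


comments = [
    {"author": "a", "body": "The hands look wrong, six fingers on the left one"},
    {"author": "AutoModerator", "body": "This looks like a real photo to me, honestly"},
    {"author": "b", "body": "fake"},
    {"author": "c", "body": "[deleted]"},
    {"author": "d", "body": "Where was this picture taken, anyone know?"},
]
picked = draw_sample(comments, seed=0)
print([c["author"] for c in picked])
```

Only the first comment survives: the others are a bot, too short, deleted, or contain no keyword.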

→ How it works: reading the comments

4 · Build the taxonomy (the seeds)

195 taxonomy indicators (seeds)
uv run isthisai-extract taxonomy

Takes the ~200 most frequent indicator phrases and asks the LLM to sort each into one of the categories. These become the seeds that semantic expansion hunts from. Writes indicator_taxonomy, then backfills the category on every matching comment_indicators row.
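The frequency ranking and the category backfill can be sketched like this. The LLM call is replaced by a hand-written dictionary, and the lowercase normalisation, the category names, and the row layout are assumptions made for the example.

```python
from collections import Counter


def normalise(phrase):
    # Assumption: phrases are compared after trimming and lowercasing.
    return phrase.strip().lower()


def top_phrases(rows, n=200):
    """Return the n most frequent indicator phrases."""
    counts = Counter(normalise(r["indicator"]) for r in rows)
    return [phrase for phrase, _ in counts.most_common(n)]


def backfill_categories(rows, taxonomy):
    """Copy the taxonomy category onto every matching uncategorised row."""
    for row in rows:
        key = normalise(row["indicator"])
        if row["category"] is None and key in taxonomy:
            row["category"] = taxonomy[key]


rows = [
    {"indicator": "extra fingers", "category": None},
    {"indicator": "Extra fingers", "category": None},
    {"indicator": "warped text", "category": None},
    {"indicator": "weird shadows", "category": None},
]
seeds = top_phrases(rows, n=2)

# Stand-in for the LLM's answer; the category names are hypothetical.
taxonomy = {"extra fingers": "anatomy", "warped text": "text"}
backfill_categories(rows, taxonomy)
print(seeds, [r["category"] for r in rows])
```

Rows whose phrase did not make the cut keep category = NULL (None here) until a later step or a curator assigns one.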

→ How it works: seeds

5 · Embed taxonomy + comments

912,187 embedded comments
uv run isthisai-embed indicators
uv run isthisai-embed comments --all

Generates 768-dim nomic-embed-text vectors for taxonomy indicators (indicator_embeddings) and comment bodies (comment_embeddings). Add --all to embed the whole corpus, not just indicator-bearing comments - the single biggest lever on semantic coverage. Expensive but resumable.

→ How it works: the map of meaning

6 · Ground (drop hallucinated indicators)

18,475 LLM-extracted rows
uv run isthisai-embed ground

The text-only model sometimes invents an indicator for a comment that just reacts ("it's obviously AI"). This compares each LLM indicator's embedding to its comment's embedding and deletes the ones below the grounding threshold (default 0.45). Semantic and keyword rows are left alone.
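The grounding check can be sketched as a similarity filter. Cosine similarity is assumed as the comparison, the source field is a hypothetical stand-in for however the pipeline tells LLM rows from semantic and keyword rows, and the 3-dimensional vectors are toy values in place of real 768-dim embeddings.

```python
import math

GROUNDING_THRESHOLD = 0.45


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def ground(rows, indicator_vecs, comment_vecs, threshold=GROUNDING_THRESHOLD):
    """Keep non-LLM rows as-is; keep LLM rows only if grounded in the comment."""
    kept = []
    for row in rows:
        if row["source"] != "llm":
            kept.append(row)  # semantic and keyword rows are left alone
            continue
        sim = cosine(indicator_vecs[row["indicator"]], comment_vecs[row["comment_id"]])
        if sim >= threshold:
            kept.append(row)
    return kept


comment_vecs = {"c1": [1.0, 0.0, 0.0], "c2": [0.0, 1.0, 0.0]}
indicator_vecs = {"extra fingers": [0.9, 0.1, 0.0], "blurry background": [1.0, 0.2, 0.0]}
rows = [
    {"comment_id": "c1", "indicator": "extra fingers", "source": "llm"},
    {"comment_id": "c2", "indicator": "blurry background", "source": "llm"},
    {"comment_id": "c2", "indicator": "blurry background", "source": "semantic"},
]
kept = ground(rows, indicator_vecs, comment_vecs)
print([(r["comment_id"], r["source"]) for r in kept])
```

The second row is dropped because its indicator points in a different direction from its comment; the identical semantic row survives because grounding only applies to LLM output.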

→ How it works: why it hallucinates

7 · Semantic expansion

43,018 semantic matches
uv run isthisai-embed semantic

Compares every comment embedding to every seed (taxonomy-indicator) embedding; above the similarity threshold (default 0.73) it inserts a new comment_indicators row (batch_id semantic_*). Only comments ≥20 chars (non-bot, non-[deleted]) are matched - the same length gate as the LLM sample, so one-word/emoji reactions are excluded. Seeds a curator marked Noise are skipped. This is how coverage grows beyond the LLM sample - re-run it after embedding more comments.
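The all-pairs comparison can be sketched as a nested loop. Cosine similarity, the "at or above" reading of the threshold, and the data layout are assumptions; the vectors are toy values, and the bot and [deleted] gates are omitted for brevity.

```python
import math

SIMILARITY_THRESHOLD = 0.73
MIN_CHARS = 20


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def expand(comments, seeds, noise=frozenset(), threshold=SIMILARITY_THRESHOLD):
    """Return (comment_id, seed) pairs whose similarity reaches the threshold.

    comments maps id -> (body, vector); seeds maps phrase -> vector.
    """
    matches = []
    for comment_id, (body, vec) in comments.items():
        if len(body) < MIN_CHARS:
            continue  # same length gate as the LLM sample
        for phrase, seed_vec in seeds.items():
            if phrase in noise:
                continue  # seeds marked Noise are never expanded
            if cosine(vec, seed_vec) >= threshold:
                matches.append((comment_id, phrase))
    return matches


comments = {
    "c1": ("the fingers are fused together here", [1.0, 0.0, 0.0]),
    "c2": ("lol", [1.0, 0.0, 0.0]),
    "c3": ("this lighting seems off to me honestly", [0.0, 1.0, 0.0]),
}
seeds = {"extra fingers": [0.95, 0.05, 0.0], "definitely fake vibes": [0.0, 1.0, 0.0]}
matches = expand(comments, seeds, noise={"definitely fake vibes"})
print(matches)
```

A real run over the full corpus would more likely batch this as a matrix product than loop in Python, but the matching rule is the same.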

→ How it works: finding the neighbours

8 · Inspect anytime

232 indicator aliases (merges)
uv run isthisai-stats
shortcut: make stats

Prints corpus counts and date ranges to the terminal. Or just use the Explore tabs in this app - they read the same database live.

The two upstream filters

Before the model reads anything, two hand-maintained lists in extract.py shape the input - and both materially affect the rankings, so they're worth understanding before trusting the numbers.

Filter: Candidate keywords (OPINION_KEYWORDS)
What it does & why: Selects which comments are eligible for the LLM sample - a comment must contain at least one (and be ≥20 chars, non-bot). Deliberately topical, not visual-indicator words: filtering for “finger”/“shadow” would pre-decide the findings (you'd only “discover” the indicators you searched for). Broad on purpose - semantic expansion handles recall.
Current value: AI, real, fake, generated, obvious, look

Filter: Stop-list (STOP_INDICATORS)
What it does & why: Drops a returned “indicator” if it's never a property of the image. Two kinds get dropped: where an image was posted, and pure verdicts ("definitely AI", "not AI") that are judgements, not evidence, and otherwise swamp the rankings. Generation tools, bare subjects and watermarks are deliberately kept in: a SynthID watermark is strong evidence of AI, not noise. Exact, case-insensitive match; the full list lives in extract.py.
Current value: facebook, tiktok, reddit, definitely ai, not ai, obviously ai, 100% ai, ai generated, ai slop
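The stop-list's exact, case-insensitive match can be sketched as below. The set holds only the entries shown on this page (the full list lives in extract.py), and the whitespace trimming and the drop_stopped helper are assumptions.

```python
# Subset of the stop-list as displayed on this page, not the full list.
STOP_INDICATORS = {
    "facebook", "tiktok", "reddit",
    "definitely ai", "not ai", "obviously ai", "100% ai",
    "ai generated", "ai slop",
}


def drop_stopped(indicators):
    """Remove indicators that exactly match a stop entry, ignoring case."""
    return [i for i in indicators if i.strip().casefold() not in STOP_INDICATORS]


returned = ["Definitely AI", "SynthID watermark", "facebook", "facebook logo in corner"]
print(drop_stopped(returned))
```

Because the match is exact rather than substring-based, "facebook logo in corner" is kept even though "facebook" alone is dropped.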

Improving accuracy

Automated extraction is imperfect - the LLM mislabels, the keyword filter lets noise through, and semantic expansion has no notion of “correct”. These are the levers, cheapest first (the ideas behind them are in How it works → cleaning up).

Lever: Mark Noise
What it does: Tag phrases that aren't real indicators (vague judgments, meta-commentary). Cascades to every comment using the phrase and writes through to the taxonomy, so it's durable - a later semantic re-expansion won't reintroduce it (Noise phrases stop being expanded).
Where: Curate → Indicators

Lever: Re-categorise phrases
What it does: Move an indicator to the right category. One change backfills all rows sharing that phrase - highest leverage per click.
Where: Curate → Indicators

Lever: Fix the taxonomy
What it does: Edit the source-of-truth indicator/category. Backfills existing rows and steers all future semantic expansion - fixes the root cause, not just symptoms.
Where: Curate → Indicators

Lever: Merge near-duplicates
What it does: Consolidate scattered phrasings (“wrong hands”, “funny hands”, “hands look messed up”) into one canonical indicator (a merged group), so frequency counts reflect the real concept rather than splitting across synonyms.
Where: Curate → Merge

Lever: Tune the similarity threshold
What it does: Lower the 0.73 default for more coverage (more false positives); raise it for higher precision (fewer matches). Re-run isthisai-embed semantic after changing ISTHISAI_EMBED_THRESHOLD.
Where: pipeline, step 7

Lever: Extract a larger sample
What it does: Run LLM extraction over more comments for broader, higher-confidence coverage before relying on semantic expansion to fill the rest.
Where: pipeline, step 3

Lever: Widen embedding coverage
What it does: By default only indicator-bearing comments are embedded, so semantic expansion searches a tiny pool. Embed the whole corpus, then re-expand, to surface indicator mentions in comments the sample never touched - the biggest lever on how representative the counts are.
Where: isthisai-embed comments --all

Lever: Rename an indicator / drop a bad comment
What it does: Pick any indicator, see every comment that cites it, then rename the canonical or remove an obviously mis-attributed comment from it.
Where: Explore: Inspect indicator

Lever: Remove bad indicators
What it does: Deleting a taxonomy indicator stops it being re-created on the next semantic run - the only way to permanently suppress a bad expansion source.
Where: Curate → Indicators
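The effect of merging near-duplicates on frequency counts can be sketched with an alias map. The alias dictionary and the canonical_counts helper are hypothetical; how the app actually stores merges is not shown here.

```python
from collections import Counter


def canonical_counts(phrases, aliases):
    """Count phrases after mapping each alias to its canonical indicator."""
    return Counter(aliases.get(phrase, phrase) for phrase in phrases)


# Hypothetical merge group: two aliases folded into one canonical phrase.
aliases = {
    "funny hands": "wrong hands",
    "hands look messed up": "wrong hands",
}
phrases = ["wrong hands", "funny hands", "hands look messed up", "warped text"]

unmerged = Counter(phrases)
merged = canonical_counts(phrases, aliases)
print(unmerged.most_common(1), merged.most_common(1))
```

Before the merge each phrasing counts once; afterwards the canonical indicator carries all three mentions, so it ranks where the underlying concept belongs.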

Beyond the pipeline

The refined comment_indicators data (verified indicators + categories) can be exported as JSONL to fine-tune a replacement for gemma3:4b. Point the pipeline at the new model via ISTHISAI_OLLAMA_MODEL; see the project README for the export and training steps.