Run book
The hands-on, terminal-facing companion to How it works - the same pipeline, but as commands you run and counts you watch. Each step links back to the idea it implements.
Your data right now
Running the pipeline
Steps run in order; each is resumable and safe to re-run. Steps 3–7 need a local Ollama server with gemma3:4b and nomic-embed-text pulled. Commands assume uv sync --all-extras has been run.
1 · Collect raw Reddit data
uv run isthisai-collect submissions
uv run isthisai-collect comments
# other subreddit: uv run isthisai-collect comments --subreddit RealOrAI
make collect
Fetches every submission and comment from the subreddits via the PullPush API into the submissions and comments tables. No labels yet.
→ How it works: the problem
2 · Fill gaps from Arctic Shift (optional)
uv run isthisai-import api submissions
uv run isthisai-import api comments
# or from downloaded dumps: uv run isthisai-import file comments data/RC_*.zst
make import-arctic-api
PullPush can miss windows of history. Arctic Shift backfills missing submissions/comments into the same tables (deduplicated on id).
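To picture what "deduplicated on id" buys you, here is a minimal sketch. It assumes a SQLite-style store with id as the primary key; the two-column schema and the insert_missing helper are hypothetical, not the project's actual code. The point is only that a backfill keyed on id is safe to re-run.

```python
import sqlite3

# Assumed, minimal schema: the real comments table has more columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (id TEXT PRIMARY KEY, body TEXT)")


def count_comments(conn):
    return conn.execute("SELECT COUNT(*) FROM comments").fetchone()[0]


def insert_missing(conn, rows):
    """Insert rows whose id is not stored yet; return how many were new."""
    before = count_comments(conn)
    # OR IGNORE skips any row that would violate the primary key on id.
    conn.executemany(
        "INSERT OR IGNORE INTO comments (id, body) VALUES (:id, :body)", rows
    )
    conn.commit()
    return count_comments(conn) - before


pullpush_rows = [
    {"id": "c1", "body": "looks fake"},
    {"id": "c2", "body": "real photo"},
]
arctic_rows = [
    {"id": "c2", "body": "real photo"},  # already collected, ignored
    {"id": "c3", "body": "AI hands"},    # the gap being filled
]

added_first = insert_missing(conn, pullpush_rows)
added_second = insert_missing(conn, arctic_rows)
added_rerun = insert_missing(conn, arctic_rows)  # re-running adds nothing
```

Running the Arctic Shift import twice leaves the table unchanged the second time, which is what makes the step resumable.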
3 · Extract indicators with the LLM (sample)
uv run isthisai-extract sample
Filters to opinion comments (keyword + length, minus bots/deleted), randomly samples a few thousand (default 2,500), and asks gemma3:4b (Ollama) for the indicators cited in each comment - one comment per call, so every indicator is tied to the right comment. Each indicator becomes a row in comment_indicators with category = NULL.
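The candidate filter and sampling described above can be sketched as follows. This is an illustration under assumptions: the keyword values are the ones listed in this run book, but the whole-word matching rule, the BOT_AUTHORS set, the field names and both helper functions are hypothetical stand-ins for whatever extract.py really does.

```python
import random
import re

# Keyword values as listed in this run book; the matching rule is an assumption.
OPINION_KEYWORDS = ["AI", "real", "fake", "generated", "obvious", "look"]
MIN_CHARS = 20
BOT_AUTHORS = {"AutoModerator"}  # hypothetical bot list

KEYWORD_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in OPINION_KEYWORDS) + r")\b",
    re.IGNORECASE,
)


def is_candidate(comment):
    """Keyword + length gate, minus bots and deleted comments."""
    body = comment["body"].strip()
    if body in ("[deleted]", "[removed]"):
        return False
    if comment["author"] in BOT_AUTHORS:
        return False
    if len(body) < MIN_CHARS:
        return False
    return KEYWORD_RE.search(body) is not None


def sample_candidates(comments, n=2500, seed=0):
    """Randomly pick up to n eligible comments (seeded for repeatability)."""
    pool = [c for c in comments if is_candidate(c)]
    return random.Random(seed).sample(pool, min(n, len(pool)))


comments = [
    {"id": "a", "author": "alice", "body": "The shadows look wrong, this is fake."},
    {"id": "b", "author": "AutoModerator", "body": "Please remember this is AI discussion."},
    {"id": "c", "author": "bob", "body": "fake"},
    {"id": "d", "author": "carol", "body": "[deleted]"},
    {"id": "e", "author": "dave", "body": "Nice composition and lovely colours here."},
]
picked = sample_candidates(comments)
```

Only comment a survives here: b is a bot, c is too short, d is deleted and e contains no keyword.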
→ How it works: reading the comments
4 · Build the taxonomy (the seeds)
uv run isthisai-extract taxonomy
Takes the ~200 most frequent indicator phrases and asks the LLM to sort each into one of the categories. These become the seeds that semantic expansion hunts from. Writes indicator_taxonomy, then backfills the category on every matching comment_indicators row.
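The "most frequent phrases, then backfill" mechanics might look like the sketch below. The lower-casing, the row shape, the category names and the hand-written taxonomy dict (standing in for the LLM's answer) are all assumptions for illustration.

```python
from collections import Counter


def top_phrases(indicator_rows, n=200):
    """Return the n most frequent indicator phrases (normalised to lower case)."""
    counts = Counter(row["indicator"].strip().lower() for row in indicator_rows)
    return [phrase for phrase, _ in counts.most_common(n)]


def backfill(indicator_rows, taxonomy):
    """Copy the taxonomy category onto every row whose phrase matches a seed."""
    for row in indicator_rows:
        key = row["indicator"].strip().lower()
        if key in taxonomy:
            row["category"] = taxonomy[key]
    return indicator_rows


phrases = ["extra fingers", "Extra fingers", "extra fingers",
           "odd shadows", "odd shadows", "blurry text"]
rows = [{"indicator": p, "category": None} for p in phrases]

seeds = top_phrases(rows, n=2)

# Stand-in for the LLM's sorting step; category names are hypothetical.
taxonomy = {"extra fingers": "Anatomy", "odd shadows": "Lighting"}
backfill(rows, taxonomy)
```

Rows whose phrase did not make the cut ("blurry text" here) keep category = NULL until a later step or a curator assigns one.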
→ How it works: seeds
5 · Embed taxonomy + comments
uv run isthisai-embed indicators
uv run isthisai-embed comments --all
Generates 768-dim nomic-embed-text vectors for taxonomy indicators (indicator_embeddings) and comment bodies (comment_embeddings). Add --all to embed the whole corpus, not just indicator-bearing comments - the single biggest lever on semantic coverage. Expensive but resumable.
→ How it works: the map of meaning
6 · Ground (drop hallucinated indicators)
uv run isthisai-embed ground
The text-only model sometimes invents an indicator for a comment that just reacts ("it's obviously AI"). This compares each LLM indicator's embedding to its comment's embedding and deletes the ones below the grounding threshold (default 0.45). Semantic and keyword rows are left alone.
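A minimal sketch of the grounding check, assuming the comparison is cosine similarity and that each row carries a source tag; the toy 3-dimensional vectors, the field names and the ground helper are hypothetical (real vectors are 768-dim).

```python
import math

GROUNDING_THRESHOLD = 0.45  # default quoted in this run book


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def ground(rows, threshold=GROUNDING_THRESHOLD):
    """Keep LLM rows whose indicator is close enough to its own comment."""
    kept = []
    for row in rows:
        if row["source"] != "llm":
            kept.append(row)  # semantic and keyword rows are left alone
        elif cosine(row["indicator_vec"], row["comment_vec"]) >= threshold:
            kept.append(row)
    return kept


rows = [
    # Indicator matches what the comment talks about: kept.
    {"id": "r1", "source": "llm", "indicator_vec": [1, 0, 0], "comment_vec": [0.9, 0.1, 0]},
    # Indicator unrelated to the comment text: treated as hallucinated.
    {"id": "r2", "source": "llm", "indicator_vec": [1, 0, 0], "comment_vec": [0.1, 1, 0]},
    # Not an LLM row, so never dropped by this step.
    {"id": "r3", "source": "semantic", "indicator_vec": [1, 0, 0], "comment_vec": [0, 1, 0]},
]
kept_ids = [r["id"] for r in ground(rows)]
```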
→ How it works: why it hallucinates
7 · Semantic expansion
uv run isthisai-embed semantic
Compares every comment embedding to every seed (taxonomy-indicator) embedding; above the similarity threshold (default 0.73) it inserts a new comment_indicators row (batch_id semantic_*). Only comments ≥20 chars (non-bot, non-[deleted]) are matched - the same length gate as the LLM sample, so one-word/emoji reactions are excluded. Seeds a curator marked Noise are skipped. This is how coverage grows beyond the LLM sample - re-run it after embedding more comments.
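The all-pairs comparison can be sketched like this. It assumes cosine similarity (computed as a dot product of unit vectors) and an inclusive threshold; the row shapes, the "semantic_demo" batch id and the expand helper are hypothetical, and the bot/[deleted] filter is left out for brevity.

```python
import math

SIMILARITY_THRESHOLD = 0.73  # default quoted in this run book
MIN_CHARS = 20


def unit(vec):
    """Scale a vector to length 1 so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def expand(comments, seeds, threshold=SIMILARITY_THRESHOLD):
    """Return new indicator rows for every comment/seed pair above threshold."""
    # Seeds a curator marked Noise never take part in expansion.
    live_seeds = [(s["indicator"], unit(s["vec"])) for s in seeds
                  if s["category"] != "Noise"]
    new_rows = []
    for comment in comments:
        if len(comment["body"]) < MIN_CHARS:
            continue  # one-word and emoji reactions are excluded
        comment_unit = unit(comment["vec"])
        for indicator, seed_unit in live_seeds:
            score = sum(x * y for x, y in zip(comment_unit, seed_unit))
            if score >= threshold:
                new_rows.append({"comment_id": comment["id"],
                                 "indicator": indicator,
                                 "batch_id": "semantic_demo"})
    return new_rows


seeds = [
    {"indicator": "extra fingers", "category": "Anatomy", "vec": [1, 0, 0]},
    {"indicator": "definitely fake", "category": "Noise", "vec": [0, 1, 0]},
]
comments = [
    {"id": "c1", "body": "Count the fingers on the left hand", "vec": [0.9, 0.2, 0]},
    {"id": "c2", "body": "lol", "vec": [1, 0, 0]},
    {"id": "c3", "body": "This one is definitely fake to me", "vec": [0, 1, 0]},
]
new_rows = expand(comments, seeds)
```

Only c1 gains a row: c2 fails the length gate, and c3 is close only to a Noise seed.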
→ How it works: finding the neighbours
8 · Inspect anytime
uv run isthisai-stats
make stats
Prints corpus counts and date ranges to the terminal. Or just use the Explore tabs in this app - they read the same database live.
The two upstream filters
Before the model reads anything, two hand-maintained lists in extract.py shape the
input - and both materially affect the rankings, so they're worth understanding before trusting the
numbers.
| Filter | What it does & why | Current value |
|---|---|---|
| Candidate keywords (OPINION_KEYWORDS) | Selects which comments are eligible for the LLM sample - a comment must contain at least one (and be ≥20 chars, non-bot). Deliberately topical, not visual-indicator words: filtering for “finger”/“shadow” would pre-decide the findings (you'd only “discover” the indicators you searched for). Broad on purpose - semantic expansion handles recall. | AI, real, fake, generated, obvious, look |
| Stop-list (STOP_INDICATORS) | Drops a returned “indicator” if it's never a property of the image. Two kinds get dropped: where an image was posted, and pure verdicts ("definitely AI", "not AI") that are judgements, not evidence, and otherwise swamp the rankings. Generation tools, bare subjects and watermarks are deliberately kept in: a SynthID watermark is strong evidence of AI, not noise. Exact, case-insensitive match; the full list lives in extract.py. | facebook, tiktok, reddit, definitely ai, not ai, obviously ai, 100% ai, ai generated, ai slop |
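To make "exact, case-insensitive match" concrete, here is a small sketch of the stop-list check. The entries are the ones shown in the table (the full list lives in extract.py); the whitespace trimming and the drop_stopped helper are assumptions.

```python
# Entries as shown in the table above; not the complete list.
STOP_INDICATORS = {
    "facebook", "tiktok", "reddit",
    "definitely ai", "not ai", "obviously ai",
    "100% ai", "ai generated", "ai slop",
}


def drop_stopped(indicators):
    """Remove indicators that exactly match a stop-list entry, ignoring case."""
    return [i for i in indicators
            if i.strip().casefold() not in STOP_INDICATORS]


returned = [
    "Definitely AI",        # pure verdict: dropped
    "TikTok",               # where it was posted: dropped
    "SynthID watermark",    # evidence about the image: kept
    "definitely AI hands",  # not an exact match, so it is kept
]
kept = drop_stopped(returned)
```

Because the match is exact rather than substring-based, a phrase that merely contains a stop-list entry still gets through, which is why the Noise lever under Improving accuracy exists.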
Improving accuracy
Automated extraction is imperfect - the LLM mislabels, the keyword filter lets noise through, and semantic expansion has no notion of “correct”. These are the levers, cheapest first (the ideas behind them are in How it works → cleaning up).
| Lever | What it does | Where |
|---|---|---|
| Mark Noise | Tag phrases that aren't real indicators (vague judgments, meta-commentary). Cascades to every comment using the phrase and writes through to the taxonomy, so it's durable - a later semantic re-expansion won't reintroduce it (Noise phrases stop being expanded). | Curate → Indicators |
| Re-categorise phrases | Move an indicator to the right category. One change backfills all rows sharing that phrase - highest leverage per click. | Curate → Indicators |
| Fix the taxonomy | Edit the source-of-truth indicator/category. Backfills existing rows and steers all future semantic expansion - fixes the root cause, not just symptoms. | Curate → Indicators |
| Merge near-duplicates | Consolidate scattered phrasings (“wrong hands”, “funny hands”, “hands look messed up”) into one canonical indicator (a merged group), so frequency counts reflect the real concept rather than splitting across synonyms. | Curate → Merge |
| Tune the similarity threshold | Lower the 0.73 default for more coverage (more false positives); raise it for higher precision (fewer matches). Re-run isthisai-embed semantic after changing ISTHISAI_EMBED_THRESHOLD. | pipeline, step 7 |
| Extract a larger sample | Run LLM extraction over more comments for broader, higher-confidence coverage before relying on semantic expansion to fill the rest. | pipeline, step 3 |
| Widen embedding coverage | By default only indicator-bearing comments are embedded, so semantic expansion searches a tiny pool. Embed the whole corpus, then re-expand, to surface indicator mentions in comments the sample never touched - the biggest lever on how representative the counts are. | isthisai-embed comments --all |
| Rename an indicator / drop a bad comment | Pick any indicator, see every comment that cites it, then rename the canonical or remove an obviously mis-attributed comment from it. | Explore: Inspect indicator |
| Remove bad indicators | Deleting a taxonomy indicator stops it being re-created on the next semantic run - the only way to permanently suppress a bad expansion source. | Curate → Indicators |
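Why merging near-duplicates changes the rankings is easiest to see with counts. In this sketch the merge map, the canonical name "malformed hands" and the sample mentions are all hypothetical; the app stores merged groups in its own way.

```python
from collections import Counter

# Hypothetical merged group: variant phrasing -> canonical indicator.
MERGED = {
    "wrong hands": "malformed hands",
    "funny hands": "malformed hands",
    "hands look messed up": "malformed hands",
}


def canonical(phrase):
    key = phrase.strip().lower()
    return MERGED.get(key, key)


mentions = ["wrong hands", "funny hands", "hands look messed up",
            "odd shadows", "odd shadows"]

raw_counts = Counter(mentions)
merged_counts = Counter(canonical(m) for m in mentions)

top_raw = raw_counts.most_common(1)[0]
top_merged = merged_counts.most_common(1)[0]
```

Before the merge, "odd shadows" tops the list with 2 mentions because the hand complaints are split three ways; after it, the hands concept leads with 3.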
Beyond the pipeline
The refined comment_indicators data (verified indicators + categories) can be
exported as JSONL to fine-tune a replacement for gemma3:4b. Point the pipeline at
the new model via ISTHISAI_OLLAMA_MODEL; see the project README for the export and
training steps.
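For orientation only, a JSONL export could be shaped roughly like the sketch below: one JSON object per line, one line per comment. The record layout ("input"/"output"), the field names and the to_jsonl helper are assumptions, not the project's actual export format; the README remains the reference.

```python
import json
from collections import defaultdict


def to_jsonl(rows):
    """Group verified indicators by comment and emit one JSON object per line."""
    bodies = {}
    indicators = defaultdict(list)
    for row in rows:
        bodies[row["comment_id"]] = row["body"]
        indicators[row["comment_id"]].append(
            {"indicator": row["indicator"], "category": row["category"]}
        )
    lines = []
    for comment_id, found in indicators.items():
        # Hypothetical training record: comment text in, indicators out.
        record = {"input": bodies[comment_id], "output": found}
        lines.append(json.dumps(record, ensure_ascii=False))
    return "".join(line + "\n" for line in lines)


rows = [
    {"comment_id": "c1", "body": "Six fingers and the shadows are off",
     "indicator": "extra fingers", "category": "Anatomy"},
    {"comment_id": "c1", "body": "Six fingers and the shadows are off",
     "indicator": "odd shadows", "category": "Lighting"},
    {"comment_id": "c2", "body": "The sign text is gibberish",
     "indicator": "garbled text", "category": "Text"},
]
exported = to_jsonl(rows)
records = [json.loads(line) for line in exported.splitlines()]
```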