Pipeline status

How the corpus narrows from every collected comment down to the ones that name a specific visual indicator — and how those indicators were found. A read-only snapshot of extraction, embedding, and expansion.

Coverage funnel

Collected 912,187 · 100.0% the whole corpus
Embedded 912,187 · 100.0% of collected
Candidate comments 473,708 · 51.9% of collected
Read by the model 15,990 · 3.4% of candidates
Comments citing an indicator 41,982 · 8.9% of candidates, 4.6% of all
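
The percentages in the funnel can be reproduced from the raw counts. This is a minimal sketch: the variable and function names are illustrative, and rounding to one decimal place is an assumption about how the page formats its shares.

```python
# Counts copied from the coverage funnel above.
collected = 912_187
candidates = 473_708
read_by_model = 15_990
citing_indicator = 41_982


def pct(part: int, whole: int) -> float:
    """Return part/whole as a percentage, rounded to one decimal place."""
    return round(100 * part / whole, 1)


funnel = {
    "candidates_of_collected": pct(candidates, collected),
    "read_of_candidates": pct(read_by_model, candidates),
    "citing_of_candidates": pct(citing_indicator, candidates),
    "citing_of_all": pct(citing_indicator, collected),
}

for stage, share in funnel.items():
    print(f"{stage}: {share}%")
```

Each stage states its own denominator, so shares measured against candidates and shares measured against the whole corpus are not directly comparable.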

Candidate keyword filter

The keyword pre-filter that defines the candidate comments stage above, which is the pool the LLM samples from. A comment must mention at least one of the keywords below and pass the ≥20-character and non-bot checks. Semantic expansion uses a broader gate: the same ≥20-character and non-bot checks but no keyword requirement, so it can reach comments that describe a tell without using these words. That gate leaves 777,779 eligible comments. The ≥20-character floor stops expansion from matching one-word and emoji reactions; without it, a generic seed like "AI voice" would vacuum up thousands of them.

AI · real · fake · generated · obvious · look
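
The two gates described above might look roughly like the following. This is a sketch under stated assumptions, not the pipeline's actual code: the `Comment` type and its `is_bot` flag are hypothetical, the keyword list is read off the line above, and case-insensitive whole-word matching is a guess (the page does not say whether stems or substrings count).

```python
import re
from dataclasses import dataclass

# Keywords as read from the list above (assumed split of the chips).
KEYWORDS = ("AI", "real", "fake", "generated", "obvious", "look")
MIN_CHARS = 20  # the >=20-character floor

# Assumption: case-insensitive whole-word match, so "said" does not hit "AI".
_KEYWORD_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in KEYWORDS) + r")\b",
    re.IGNORECASE,
)


@dataclass
class Comment:
    text: str
    is_bot: bool = False  # hypothetical flag; bot detection is not described


def is_expansion_eligible(comment: Comment) -> bool:
    """Broader gate for semantic expansion: length and non-bot checks only."""
    return len(comment.text.strip()) >= MIN_CHARS and not comment.is_bot


def is_candidate(comment: Comment) -> bool:
    """Narrower gate for LLM sampling: eligible and mentions a keyword."""
    return is_expansion_eligible(comment) and bool(_KEYWORD_RE.search(comment.text))
```

Under these assumptions, a comment such as "That voice sounds so robotic and flat" passes the expansion gate but not the candidate filter, which is the kind of comment semantic expansion is meant to reach.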

How the indicators were found

LLM-extracted 18,475
Semantic matches 43,018
Keyword expansion 0

Taxonomy & curation

Taxonomy indicators 195
Embedded indicators 191
Indicator aliases 232
Pending re-expansion 0

Recent extraction runs

Batch     Model      Started              Completed            Sample  Processed
d2cb1034  gemma3:4b  2026-06-04T09:15:28  2026-06-04T09:49:42  8000    7995
215c54ba  gemma3:4b  2026-06-03T11:53:42  2026-06-03T12:28:35  8000    7995
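
The two runs listed above add up to the "Read by the model" figure in the funnel (7,995 + 7,995 = 15,990). The sketch below checks that and derives each run's duration and throughput. The `Run` type and its field names are illustrative, and the page does not say why five comments per sample went unprocessed.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Run:
    batch: str
    model: str
    started: datetime
    completed: datetime
    sample: int
    processed: int

    @property
    def duration_s(self) -> float:
        """Wall-clock duration of the run in seconds."""
        return (self.completed - self.started).total_seconds()

    @property
    def unprocessed(self) -> int:
        """Sampled comments that were not processed (reason not stated)."""
        return self.sample - self.processed


# Values copied from the runs listed above.
RUNS = [
    Run("d2cb1034", "gemma3:4b",
        datetime.fromisoformat("2026-06-04T09:15:28"),
        datetime.fromisoformat("2026-06-04T09:49:42"), 8000, 7995),
    Run("215c54ba", "gemma3:4b",
        datetime.fromisoformat("2026-06-03T11:53:42"),
        datetime.fromisoformat("2026-06-03T12:28:35"), 8000, 7995),
]

total_processed = sum(run.processed for run in RUNS)

for run in RUNS:
    rate = run.processed / run.duration_s
    print(f"{run.batch}: {run.duration_s / 60:.1f} min, "
          f"{rate:.2f} comments/s, {run.unprocessed} unprocessed")
print(f"total processed: {total_processed}")
```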