Pipeline status
How the corpus narrows from every collected comment down to the ones that name a specific visual indicator — and how those indicators were found. A read-only snapshot of extraction, embedding, and expansion.
Coverage funnel
Collected 912,187 · 100.0% (the whole corpus)
Embedded 912,187 · 100.0% of collected
Candidate comments 473,708 · 51.9% of collected
Read by the model 15,990 · 3.4% of candidates
Comments citing an indicator 41,982 · 8.9% of candidates, 4.6% of all
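The percentages in the funnel can be reproduced from the raw counts. The sketch below is only a sanity check on the arithmetic, not the dashboard's own code; the variable names and the `pct` helper are illustrative assumptions.

```python
# Counts copied from the coverage funnel above.
collected = 912_187
candidates = 473_708
read_by_model = 15_990
citing = 41_982


def pct(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Each funnel percentage, keyed by "<stage>_of_<denominator>".
funnel = {
    "candidates_of_collected": pct(candidates, collected),
    "read_of_candidates": pct(read_by_model, candidates),
    "citing_of_candidates": pct(citing, candidates),
    "citing_of_collected": pct(citing, collected),
}

for name, value in funnel.items():
    print(f"{name}: {value}%")
```

Note that the denominator changes between stages: "of candidates" figures divide by 473,708, while "of all" divides by the full 912,187.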
Candidate keyword filter
The keyword pre-filter that defines the candidate comments stage above, which is the pool the LLM samples from. A comment qualifies only if it mentions at least one of the keywords listed below, is at least 20 characters long, and was not posted by a bot. Semantic expansion uses a broader gate: the same ≥20-character and non-bot checks but no keyword requirement, so it can reach comments that describe a tell without using any of these words. That gate leaves 777,779 eligible comments. The ≥20-character floor keeps semantic expansion from matching one-word and emoji reactions; without it, a generic seed like "AI voice" would vacuum up thousands of them.
AI · real · fake · generated · obvious · look
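The two gates described above can be sketched as a pair of predicates. This is a minimal illustration under stated assumptions, not the pipeline's actual filter: the keyword list is read as six separate terms, matching is assumed to be case-insensitive and whole-word, bot detection is reduced to a caller-supplied `is_bot` flag, and all function names are hypothetical.

```python
import re

# Assumption: the keyword list splits into these six terms.
KEYWORDS = ("AI", "real", "fake", "generated", "obvious", "look")
MIN_CHARS = 20  # the >=20-character floor shared by both gates

# Assumption: case-insensitive whole-word matching. The real filter may
# match substrings or stems instead (e.g. "looks", "obviously").
_KEYWORD_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in KEYWORDS) + r")\b",
    re.IGNORECASE,
)


def passes_base_checks(text: str, is_bot: bool = False) -> bool:
    """Shared gate: long enough and not posted by a bot."""
    return len(text.strip()) >= MIN_CHARS and not is_bot


def is_candidate(text: str, is_bot: bool = False) -> bool:
    """Candidate stage: base checks plus at least one keyword."""
    return passes_base_checks(text, is_bot) and bool(_KEYWORD_RE.search(text))


def is_expansion_eligible(text: str, is_bot: bool = False) -> bool:
    """Semantic-expansion gate: base checks only, no keyword requirement."""
    return passes_base_checks(text, is_bot)
```

Under these assumptions, a comment such as "Six fingers on the left hand again" is eligible for semantic expansion but is not a candidate, because it describes a tell without using any of the keywords.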
How the indicators were found
LLM-extracted 18,475
Semantic matches 43,018
Keyword expansion 0
Taxonomy & curation
Taxonomy indicators 195
Embedded indicators 191
Indicator aliases 232
Pending re-expansion 0
Recent extraction runs
| Batch | Model | Started | Completed | Sample | Processed |
|---|---|---|---|---|---|
| d2cb1034 | gemma3:4b | 2026-06-04T09:15:28 | 2026-06-04T09:49:42 | 8000 | 7995 |
| 215c54ba | gemma3:4b | 2026-06-03T11:53:42 | 2026-06-03T12:28:35 | 8000 | 7995 |