Uncanny Atlas
This project is a personal study. I wanted to explore when people started worrying about photos and video content being AI-generated, as well as what signs people were using to try to spot AI-generated images.
This website is an overview of my results so far. It is also the dashboard for the tool I built to conduct this exploration. You can run this tool locally yourself.
When run, the tool downloads the comments from r/isthisAI and r/RealOrAI, pulls out the “indicators” people cite - wrong hands, impossible shadows, garbled text - and maps them with language-model embeddings into a tally of the indicators people actually rely on. I then curate and group these by hand to turn them into an understandable dataset.
It is currently based on 912,187 comments. Of the 777,779 that potentially contain an indicator, 29,406 (3.8%) have been found to contain at least one so far.
In the current dataset the most-cited indicator is Hands with 2,628 comments.
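The percentages above can be reproduced from the raw counts. This is a minimal arithmetic sketch, not code from the tool; the variable names are illustrative, and the last figure assumes the Hands comments are a subset of the comments found to contain an indicator.

```python
# Counts quoted on this page.
total_comments = 912_187   # all collected comments
filtered = 777_779         # comments that potentially contain an indicator
with_indicator = 29_406    # comments found to contain at least one indicator
hands = 2_628              # comments citing the Hands indicator


def share(part: int, whole: int) -> float:
    """Return part as a percentage of whole, rounded to one decimal place."""
    return round(100 * part / whole, 1)


print(share(filtered, total_comments))   # 85.3
print(share(with_indicator, filtered))   # 3.8
print(share(hands, with_indicator))      # 8.9 (assumes Hands is a subset)
```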
What you can do here
How it works is a from-first-principles, interactive explainer of the whole method: keyword matching's failure, the language-model reader, the “map of meaning”, and semantic expansion.
The Explore views show the live results from my run and curation of the data: the most-cited indicators, how they trend over time, and generally how the subreddits have grown.
The Run book documents the pipeline end to end so you can rebuild it yourself.
This site vs. running it yourself
You're viewing the public, read-only edition of my run at exploring this data. To keep within Reddit's content terms and data-protection law, it serves only aggregate results — counts, trends and breakdowns — and deliberately leaves a few things out:
| Capability | This site | Self-hosted |
|---|---|---|
| Example comments (Indicators, Inspect, Semantic matches) | Hidden | Full comment text |
| Curate workflow (categorise indicators, merge) | Off | Available |
| Underlying data | Frozen aggregate snapshot | Live database + pipeline |
| Pipeline (collect → extract → embed) | Not runnable | Runnable (needs Ollama) |
Why: the verbatim comments belong to their Reddit authors, not to this project, and a frozen public copy couldn't honour later deletions. This public, personal data exploration can therefore contain only the derived statistics, never the raw text or usernames. Running it yourself, against the comments and data you have collected using the tool, removes that constraint. See the Run book to get started.
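To illustrate what "aggregate only" can mean in practice, here is a minimal sketch using Python's built-in sqlite3 module. The `mentions` table, its columns, and the sample rows are hypothetical and made up for the example; the project's real schema is not described on this page.

```python
import sqlite3

# Hypothetical schema and toy rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mentions (comment_id TEXT, author TEXT, body TEXT, "
    "indicator TEXT, month TEXT)"
)
conn.executemany(
    "INSERT INTO mentions VALUES (?, ?, ?, ?, ?)",
    [
        ("c1", "user_a", "six fingers on the left hand", "Hands", "2025-03"),
        ("c2", "user_b", "the hands melt into the cup", "Hands", "2025-03"),
        ("c3", "user_c", "sign text is gibberish", "Garbled text", "2025-04"),
    ],
)


def aggregate_snapshot(db: sqlite3.Connection) -> list:
    """Return (indicator, month, comment_count) rows.

    The query never selects body or author, so the result carries
    counts only, with no comment text or usernames.
    """
    return db.execute(
        "SELECT indicator, month, COUNT(DISTINCT comment_id) "
        "FROM mentions GROUP BY indicator, month ORDER BY indicator, month"
    ).fetchall()


snapshot = aggregate_snapshot(conn)
print(snapshot)  # [('Garbled text', '2025-04', 1), ('Hands', '2025-03', 2)]
```

A snapshot built this way can be published as a frozen file while the raw rows stay in the local database.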
Credits & attribution
- Data. Public comments from Reddit (r/isthisAI, r/RealOrAI), retrieved via the PullPush and Arctic Shift public archives. Reddit and the comment authors retain all rights to the original content.
- Models. For this run, I used gemma3:4b for extraction and nomic-embed-text for embeddings, both run locally via Ollama. If running locally, these can be configured.
- Built with. SvelteKit, Observable Plot, and better-sqlite3.
- Type. Display face Captain Edward by SimpleBits (live site only); body text in Inter.
- Code. Open source under the MIT license — github.com/ryanbateman/uncanny_atlas.
Uncanny Atlas is a non-commercial research showcase and intentionally does not include any individual user's data. If you are a Reddit user and somehow find a contribution of yours included, contact me and it will be removed from the next rebuild. The public snapshot is periodically rebuilt from the upstream archives, which propagate deletions.
Methodology & limitations
The headline method is on the How it works page. Read the numbers with these caveats in mind:
- Individual perspective. Creating this dataset involves curating data into categories, picking similarity values, and making personal choices about what constitutes a datapoint. Other people will pick other categories and make other choices. This data is not intended to be regarded as objectively correct, and the project is intentionally open-source so that others can run it themselves with their own approach and choices.
- Recency. The corpus skews to recent activity (~2025 onward), so trends say more about the present than the early history of AI imagery.
- Selection. Only two subreddits feed it (r/isthisAI, r/RealOrAI); their audiences and norms shape which indicators surface.
- Pre-and-post filters. I made personal judgements regarding which comments were worth sampling. There is a keyword filter and a ~20-character length floor, which together drop very short or off-topic reactions.
- Sample + expansion. Only a few thousand comments are read by the language model; the rest are reached by embedding similarity. The expansion threshold (0.73) and the grounding threshold (0.45) trade coverage against precision.
- Model bias. Both the extractor and the embedder carry their own biases. Different models will draw the map differently.
- Human-in-the-loop. Indicator merging and noise removal are curated by (my) hand, which adds judgement (and the possibility of error) to the canonical labels.
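The pre-filter and the expansion threshold described in the caveats above can be sketched roughly as follows. This is an illustration under stated assumptions, not the tool's code: the keyword list is hypothetical, the vectors are toy two-dimensional stand-ins for real embeddings, and cosine similarity is assumed as the similarity measure. The grounding threshold is left out because this page does not describe how it is applied.

```python
import math

EXPANSION_THRESHOLD = 0.73  # value quoted in the caveats above
MIN_LENGTH = 20             # approximate length floor quoted above

# Hypothetical keyword list; the real filter's terms are not given on this page.
KEYWORDS = ("hand", "finger", "shadow", "text")


def passes_prefilter(comment: str) -> bool:
    """Keep comments that are long enough and mention at least one keyword."""
    text = comment.strip().lower()
    return len(text) >= MIN_LENGTH and any(k in text for k in KEYWORDS)


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def expand(seed_vectors, candidates, threshold=EXPANSION_THRESHOLD):
    """Return ids of candidates whose best similarity to any seed meets the threshold."""
    matched = []
    for comment_id, vector in candidates.items():
        best = max(
            (cosine_similarity(vector, seed) for seed in seed_vectors),
            default=0.0,
        )
        if best >= threshold:
            matched.append(comment_id)
    return matched


# Toy data: one seed vector from a model-read comment, three unread candidates.
seeds = [[1.0, 0.0]]
candidates = {"a": [0.9, 0.1], "b": [0.5, 0.5], "c": [0.0, 1.0]}
print(expand(seeds, candidates))  # ['a']
```

Candidate "b" scores about 0.71 and narrowly misses the 0.73 cut-off. Lowering the threshold would pull it in, which raises coverage at the cost of more false matches.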