Uncanny Atlas

This project is a personal study. I wanted to explore when people started worrying about photos and video content being AI-generated, as well as which signs people were using to try to spot AI-generated images.

This website is an overview of my results so far. It is also the dashboard for the tool I built to conduct this exploration. You can run this tool locally yourself.

When run, the tool downloads the comments from r/isthisAI and r/RealOrAI, pulls out the “indicators” people cite - wrong hands, impossible shadows, garbled text - and maps them with language-model embeddings into a tally of the indicators people actually rely on. I then curate these to group them and turn them into an understandable dataset.
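The mapping step can be pictured with a minimal Python sketch. It assumes each extracted phrase already has an embedding vector and is assigned to whichever canonical indicator it sits closest to. The three-dimensional vectors, the label set and the `nearest_label` helper are all made up for illustration; they are not the project's actual code or data.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings; a real run would get these from an embedding model.
CANONICAL = {
    "Hands": [1.0, 0.1, 0.0],
    "Shadows": [0.0, 1.0, 0.1],
    "Text": [0.1, 0.0, 1.0],
}


def nearest_label(vector):
    """Return the canonical indicator whose embedding is most similar to `vector`."""
    return max(CANONICAL, key=lambda label: cosine(vector, CANONICAL[label]))


# Hypothetical extracted phrases with toy embeddings.
extracted = {
    "six fingers": [0.9, 0.2, 0.1],
    "wrong hands": [0.95, 0.05, 0.0],
    "impossible shadows": [0.1, 0.9, 0.2],
    "garbled text": [0.0, 0.1, 0.9],
}

# Count how many phrases land on each canonical indicator.
tally = Counter(nearest_label(vec) for vec in extracted.values())
print(tally.most_common())
```

In this toy data both hand-related phrases land on "Hands", which is the sense in which differently worded comments end up in one tally row.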

The dataset is currently based on 912,187 comments. Of the 777,779 substantive comments, the pipeline has flagged 29,479 (3.8%) as citing a possible indicator; after curation, 18,658 (2.4%) cite a genuine visual tell.
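Both percentages are shares of the substantive comments, not of the full corpus. A short Python check using only the figures quoted above (the variable names and the `share` helper are illustrative):

```python
substantive = 777_779  # substantive comments
flagged = 29_479       # flagged as citing a possible indicator
curated = 18_658       # cite a genuine visual tell after curation


def share(part, whole):
    """Percentage of `whole` represented by `part`, rounded to one decimal place."""
    return round(100 * part / whole, 1)


flagged_pct = share(flagged, substantive)  # 3.8
curated_pct = share(curated, substantive)  # 2.4
print(flagged_pct, curated_pct)
```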

In the current dataset the most-cited indicator is Hands with 2,637 comments.

Explore the results → How it works → Run book →

What you can do here

How it works is a from-first-principles, interactive explainer of the whole method — keyword matching's failure, the language-model reader, the “map of meaning”, and semantic expansion.

The Explore views show the live results from my run and curation of the data: the most-cited indicators, how they trend over time, and generally how the subreddits have grown.

The Run book documents the pipeline end to end so you can rebuild it yourself.

This site vs. running it yourself

You're viewing the public, read-only edition of my run at exploring this data. To keep within Reddit's content terms and data-protection law, it serves only aggregate results — counts, trends and breakdowns — and deliberately leaves a few things out:

Capability	This site	Self-hosted
Example comments (Indicators, Inspect, Semantic matches)	Hidden	Full comment text
Curate workflow (categorise indicators, merge)	Off	Available
Underlying data	Frozen aggregate snapshot	Live database + pipeline
Pipeline (collect → extract → embed)	Not runnable	Runnable (needs Ollama)

Why: the verbatim comments belong to their Reddit authors, not to this project, and a frozen public copy couldn't honour later deletions. This public, personal data exploration can therefore contain only the derived statistics, never the raw text or usernames. Running it yourself, against the comments and data you have collected using the tool, removes that constraint. See the Run book to get started.

Credits & attribution

Data. Public comments from Reddit (r/isthisAI, r/RealOrAI), retrieved via the PullPush and Arctic Shift public archives. Reddit and the comment authors retain all rights to the original content.
Models. For this run, I used gemma3:4b for extraction and nomic-embed-text for embeddings, both run locally via Ollama. If you run the tool yourself, both can be configured.
Built with. SvelteKit, Observable Plot, and better-sqlite3.
Type. Display face Captain Edward by SimpleBits (live site only); body text in Inter.
Code. Open source under the MIT license — github.com/ryanbateman/uncanny_atlas.

Uncanny Atlas is a non-commercial research showcase and intentionally does not include any individual user's data. If you are a Reddit user and somehow find a contribution of yours included, contact me and it will be removed from the dataset and from the next published rebuild.

Methodology & limitations

The headline method is on the How it works page. Read the numbers with these caveats in mind:

Individual perspective. Creating this dataset involves curating data into categories, picking similarity values, and making personal choices about what constitutes a datapoint. Other people will pick other categories and make other choices. This data is not intended to be regarded as objectively correct, and the project is intentionally open-source so that others can run it themselves with their own approach and choices.
Recency. The corpus skews to recent activity (~2025 onward), so trends say more about the present than the early history of AI imagery.
Selection. Only two subreddits feed it (r/isthisAI, r/RealOrAI); their audiences and norms shape which indicators surface.
Pre- and post-filters. I made personal judgements about which comments were worth sampling. There is a keyword filter and a ~20-character length floor, which together drop very short or off-topic reactions.
Sample + expansion. Only a few thousand comments are read by the language model; the rest are reached by embedding similarity. The expansion threshold (0.73) and the grounding threshold (0.45) trade coverage against precision.
Model bias. Both the extractor and the embedder carry their own biases. Different models will draw the map differently.
Human-in-the-loop. Indicator merging and noise removal are curated by (my) hand, which adds judgement (and the possibility of error) to the canonical labels.
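The filter and threshold caveats above can be made concrete with a minimal Python sketch. It makes several assumptions: the keyword filter is modelled as "keep comments that mention at least one keyword", the keyword list is hypothetical, the similarity scores are invented, and a comment counts as matched when its score is at or above the threshold. The grounding threshold (0.45) is not modelled here.

```python
MIN_LENGTH = 20             # the ~20-character length floor
EXPANSION_THRESHOLD = 0.73  # the expansion threshold quoted above

# Hypothetical keyword list; the project's real list is not shown on this page.
KEYWORDS = ("hand", "finger", "shadow", "text")


def passes_prefilter(comment):
    """Keep comments that are long enough and mention at least one keyword (assumed rule)."""
    text = comment.strip().lower()
    return len(text) >= MIN_LENGTH and any(word in text for word in KEYWORDS)


def expand(scores, threshold=EXPANSION_THRESHOLD):
    """Return the ids of comments whose similarity score meets the threshold."""
    return [cid for cid, score in scores.items() if score >= threshold]


comments = [
    "AI",
    "Look at the fingers on the left hand, there are six",
    "This is such a lovely photo of the harbour",
]
kept = [c for c in comments if passes_prefilter(c)]

# Invented similarity scores between unread comments and one known indicator.
scores = {"c1": 0.91, "c2": 0.74, "c3": 0.70, "c4": 0.52}
strict = expand(scores)       # fewer matches, each a closer one
loose = expand(scores, 0.50)  # more matches, including weaker ones
print(kept, strict, loose)
```

Lowering the threshold pulls in more comments (coverage) at the cost of admitting weaker matches (precision), which is the trade-off the Sample + expansion caveat describes.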