Uncanny Atlas

This project is a personal study. I wanted to explore when people started worrying about photos and videos being AI-generated, and which signs they use to try to spot AI-generated images.

This website is an overview of my results so far. It is also the dashboard for the tool I built to conduct this exploration. You can run this tool locally yourself.

When run, the tool downloads the comments from r/isthisAI and r/RealOrAI, pulls out the “indicators” people cite - wrong hands, impossible shadows, garbled text - and maps them with language-model embeddings into a tally of the indicators people actually rely on. I then curate these to group them and turn them into an understandable dataset.

It is currently based on 912,187 comments. Of the 777,779 that potentially contain an indicator, 29,217 (3.8%) have been found to contain at least one so far.
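As a quick check on these figures, the 3.8% is a share of the 777,779 candidate comments, not of the full corpus. A minimal sketch of the arithmetic, using only the counts quoted above (the variable names are illustrative):

```python
total_comments = 912_187   # all collected comments
candidates = 777_779       # comments that potentially contain an indicator
with_indicator = 29_217    # comments found to contain at least one so far

# Share of the candidate pool versus share of the whole corpus
share_of_candidates = 100 * with_indicator / candidates
share_of_corpus = 100 * with_indicator / total_comments

print(f"{share_of_candidates:.1f}% of candidate comments")
print(f"{share_of_corpus:.1f}% of all comments")
```

Measured against all 912,187 comments, the same count comes to roughly 3.2%.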

In the current dataset the most-cited indicator is Hands with 2,628 comments.

What you can do here

How it works is a from-first-principles, interactive explainer of the whole method — keyword matching's failure, the language-model reader, the “map of meaning”, and semantic expansion.

The Explore views show the live results from my run and curation of the data: the most-cited indicators, how they trend over time, and generally how the subreddits have grown.

The Run book documents the pipeline end to end so you can rebuild it yourself.

This site vs. running it yourself

You're viewing the public, read-only edition of my run at exploring this data. To keep within Reddit's content terms and data-protection law, it serves only aggregate results — counts, trends and breakdowns — and deliberately leaves a few things out:

Capability | This site | Self-hosted
Example comments (Indicators, Inspect, Semantic matches) | Hidden | Full comment text
Curate workflow (categorise indicators, merge) | Off | Available
Underlying data | Frozen aggregate snapshot | Live database + pipeline
Pipeline (collect → extract → embed) | Not runnable | Runnable (needs Ollama)

Why: the verbatim comments belong to their Reddit authors, not to this project, and a frozen public copy couldn't honour later deletions. This public edition can therefore contain only the derived statistics, never the raw text or usernames. Running it yourself, against the comments and data you have collected using the tool, removes that constraint. See the Run book to get started.

Credits & attribution

Uncanny Atlas is a non-commercial research showcase and intentionally does not include any individual user's data. If you are a Reddit user and somehow find a contribution of yours included, contact me and it will be removed from the next rebuild. The public snapshot is periodically rebuilt from the upstream archives, which propagate deletions.

Methodology & limitations

The headline method is on the How it works page. Read the numbers with these caveats in mind:

  • Individual perspective. Creating this dataset involves curating data into categories, picking similarity values, and making personal choices about what constitutes a data point. Other people will pick other categories and make other choices. This data is not intended to be regarded as objectively correct, and the project is intentionally open-source so that others can run it themselves with their own approach and choices.
  • Recency. The corpus skews to recent activity (~2025 onward), so trends say more about the present than the early history of AI imagery.
  • Selection. Only two subreddits feed it (r/isthisAI, r/RealOrAI); their audiences and norms shape which indicators surface.
  • Pre-and-post filters. I made personal judgements about which comments were worth sampling. A keyword filter and a ~20-character length floor drop very short or off-topic reactions.
  • Sample + expansion. Only a few thousand comments are read by the language model; the rest are reached by embedding similarity. The expansion threshold (0.73) and the grounding threshold (0.45) trade coverage against precision.
  • Model bias. Both the extractor and the embedder carry their own biases. Different models will draw the map differently.
  • Human-in-the-loop. Indicator merging and noise removal are curated by (my) hand, which adds judgement (and the possibility of error) to the canonical labels.
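
The filter and expansion caveats above can be made concrete with a small sketch. This is not the project's code: the keyword list, the function names, and the toy 3-dimensional vectors are hypothetical, and cosine similarity is assumed as the similarity measure, since this page does not say which one is used. Only the ~20-character floor and the 0.73 expansion threshold come from the text, and the 0.45 grounding threshold is not modelled here.

```python
import math

MIN_LENGTH = 20             # length floor quoted in the text (~20 characters)
EXPANSION_THRESHOLD = 0.73  # expansion threshold quoted in the text
KEYWORDS = ("hand", "shadow", "text", "finger")  # hypothetical keyword list


def passes_prefilter(comment: str) -> bool:
    """Keep comments that are long enough and mention at least one keyword."""
    text = comment.strip().lower()
    return len(text) >= MIN_LENGTH and any(k in text for k in KEYWORDS)


def cosine_similarity(a, b) -> float:
    """Cosine similarity of two equal-length vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def expand(seed_vectors, candidates, threshold=EXPANSION_THRESHOLD):
    """Return ids of candidates whose embedding is close enough to any seed."""
    matched = []
    for comment_id, vector in candidates.items():
        best = max(
            (cosine_similarity(vector, seed) for seed in seed_vectors),
            default=0.0,
        )
        if best >= threshold:
            matched.append(comment_id)
    return matched


# Toy vectors standing in for real embeddings
seeds = [[1.0, 0.0, 0.0]]
candidates = {
    "near": [0.9, 0.1, 0.0],
    "far": [0.0, 1.0, 0.0],
}

print(passes_prefilter("The hands have six fingers on them"))
print(passes_prefilter("fake"))
print(expand(seeds, candidates))
```

In this toy setup, raising the threshold (for example to 0.999) excludes the "near" comment as well, which is the coverage-versus-precision trade-off described in the list: a stricter threshold reaches fewer comments but with closer matches.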