The 8 Signals That Decide Whether AI Cites Your Site (And the Free Tool That Checks All of Them)

Organic click-through rate on queries with a Google AI Overview fell from 1.76% to 0.61% between June 2024 and September 2025, a decline Seer Interactive reports as 61% in its analysis of 3,119 informational queries (the two endpoints quoted here work out closer to 65%). Princeton researchers found that adding authoritative citations to a page produced a 115.1% visibility lift for rank-5 results inside generative engines, with quotations and statistics each contributing up to 40% on top of that (Aggarwal et al., 2023). And the visitors AI does send tend to convert at several times the rate of standard organic traffic.

The takeaway sits in the middle of those three numbers. Fewer people are clicking, but the ones who do are worth a lot more, and the door to the room is now an AI citation rather than a ranked link. So the new audit question is not “where do I rank?” but “what is the AI engine actually looking at on my page, and am I giving it what it needs?”

That is the question I built XEOscan to answer. It launched today, 20 May 2026, and it is free with no signup. Before I get to it, here is the field guide: the eight signals that decide whether an AI engine cites you, what the evidence says about each one, and how to check them without taking my word for it.

Why an AI search audit is a different job

Classic SEO crawlers render JavaScript, follow internal links, and care about backlink graphs. AI crawlers are blunter. They fetch raw HTML, often skip JS execution entirely, and grab a small sample of pages rather than the whole site. Cloudflare’s 2025 crawl analysis shows AI bot traffic has exploded: GPTBot requests grew 305% year on year and PerplexityBot traffic grew by more than three orders of magnitude (Cloudflare blog). The behaviour matters too. Most of these crawlers behave more like plain HTTP fetchers than headless browsers, which means if your homepage hides its content behind a client-side framework, the AI sees a near-empty page. If your author byline is rendered by React after page load, the AI never sees it.

So an AI audit has to look at what a non-JavaScript bot would see when it fetches your URL. That alone rules out most browser-based SEO tools, which simulate a full user session and report back on the rendered DOM. They tell you what Googlebot sees on a good day, not what ClaudeBot sees on any day.
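To make "what a non-JavaScript bot would see" concrete, here is a minimal Python sketch that counts the words left in raw HTML once script and style blocks are ignored. The class and function names are my own for illustration, and the user-agent string in fetch_raw is made up, so treat it as a rough probe and not a replica of how any particular AI crawler behaves.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class VisibleText(HTMLParser):
    """Collects text present in the raw HTML, skipping script and style bodies."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_words(html: str) -> int:
    """Number of words a bot that never runs JavaScript could read."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return len(" ".join(parser.chunks).split())


def fetch_raw(url: str, user_agent: str = "example-audit-bot/0.1") -> str:
    """Fetch the unrendered HTML. The user-agent value is a placeholder."""
    request = Request(url, headers={"User-Agent": user_agent})
    with urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")
```

A client-side app shell such as `<div id="root"></div>` plus a script tag comes back with zero visible words, which is the near-empty page described above. Calling visible_words(fetch_raw(url)) on your own homepage is a quick sanity check.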

A quick aside on what I mean by XEO. Search engine optimisation (SEO) is the old game. Generative engine optimisation (GEO) is getting cited inside ChatGPT, Claude and Perplexity answers. Answer engine optimisation (AEO) is the older sibling, mostly about featured snippets and voice search. AI Overview optimisation (AIO) is the Google-specific slice. XEO is shorthand for the lot. You need all four. Picking one is leaving money on the table.

Signal 1: AI crawler access

If GPTBot, ClaudeBot or PerplexityBot can’t fetch your page, nothing else on this list matters. And a surprising number of sites have accidentally blocked them.

A longitudinal study tracking robots.txt changes from September 2023 to May 2025 found that AI-blocking by reputable sites grew from 23% to nearly 60% (Originality.ai data, summarised in arXiv 2510.10315). News publishers are the most aggressive: 79% of top news sites block AI training bots, and 67% block PerplexityBot specifically (BuzzStream’s 2025 study). Some of that is deliberate. A lot of it isn’t.

I’ve seen plenty of small-business sites that copied a robots.txt snippet from a forum post in 2023, included a blanket disallow for GPTBot because someone on Twitter said it was a privacy risk, then quietly disappeared from ChatGPT citations a year later. They were not making a strategic choice. They were following someone else’s panic.

The fix is one line. Allow the bots you want crawling. If you genuinely want to opt out of training data while still being citable, that is a more nuanced robots.txt, and the rules differ between OpenAI’s training bot (GPTBot) and its search bot (OAI-SearchBot). The two are not the same.
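If you want to see how a given robots.txt treats each bot before you publish it, Python's standard library can parse the rules for you. The sketch below assumes a hypothetical policy that blocks GPTBot while allowing OAI-SearchBot and everyone else. It only shows how the rules are read, and whether that split gives you the training opt-out you want depends on each vendor's own documentation.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: opt out of GPTBot, stay open to search and other bots.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: *
Allow: /
"""


def crawler_access(robots_txt: str, url: str, agents: list[str]) -> dict[str, bool]:
    """Return, for each user agent, whether the rules allow fetching the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    agents = ["GPTBot", "OAI-SearchBot", "ClaudeBot", "PerplexityBot"]
    for agent, allowed in crawler_access(
        ROBOTS_TXT, "https://example.com/blog/post", agents
    ).items():
        print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

Paste your live robots.txt into ROBOTS_TXT and any bot you did not mean to block shows up straight away.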

Signal 2: Discoverability files

Four boring items matter here: two files (sitemap.xml and llms.txt), plus canonical tags and HTTPS.

The first three are settled. AI crawlers respect a clean sitemap, get confused by missing canonicals, and treat HTTP-only pages as second-class. That is not news.

The interesting one is llms.txt. Answer.AI’s Jeremy Howard proposed it on 3 September 2024 as a way for sites to publish a tidy, LLM-friendly summary of their content. Adoption has been quietly large: hundreds of thousands of sites had published one by late 2025, including Anthropic, Cloudflare and Stripe. The catch? No major LLM provider has formally confirmed that it reads the file in production (ppc.land, 2025; Indexlab analysis).

So is it pointless? I don’t think so. The cost of publishing one is roughly nil if your site is built on Astro, Next.js or any static framework. It is cheap insurance against an emerging standard that some engines will likely honour soon, and it forces you to write a clean, structured summary of your site, which is a useful exercise on its own. Skipping it because the big four haven’t blessed it yet feels short-sighted.
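For a sense of how small the file is, here is a rough Python sketch that assembles one in the shape the proposal describes as I understand it: a title, a one-line summary in a blockquote, then sections of annotated links. The function name and the example site are invented for illustration, so check the current proposal before relying on the exact layout.

```python
def build_llms_txt(title: str, summary: str,
                   sections: dict[str, list[tuple[str, str, str]]]) -> str:
    """Assemble an llms.txt body from a title, a summary and link sections.

    Each section maps a heading to (link text, URL, short note) tuples.
    """
    lines = [f"# {title}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for name, url, note in links:
            lines.append(f"- [{name}]({url}): {note}")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


if __name__ == "__main__":
    # Hypothetical site content, purely for illustration.
    print(build_llms_txt(
        "Example Co",
        "Plain-language guides to AI search audits.",
        {"Guides": [("Audit basics", "https://example.com/audit.md", "Start here")]},
    ))
```

Write the output to /llms.txt at build time and the file stays in step with the site.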

Signal 3: Structured data (JSON-LD)

This is the one most worth getting right. A data.world benchmark study found that LLMs grounded with structured knowledge graphs delivered roughly 3 times the accuracy of equivalent models working from unstructured text. The exact mechanism on the open web is less settled, and direct-fetch evidence is mixed (Searchviu’s 2025 analysis). But Google AI Overviews and Bing-backed engines pull from search indexes that do weight structured data, so a page with clean JSON-LD has a second route into the answer even when the AI crawler itself ignores the markup.

The schema types that move the needle:

Article schema with a populated author and datePublished field. This is the one most audit tools skip.
Organization schema on the homepage with sameAs links to your social profiles. This is how AI engines connect “Chris Ungureanu” the author to “chrisungureanu.com” the brand.
FAQPage schema on genuine FAQ content. Do not slap it on a sales page, or you will get penalised.
HowTo schema on tutorials, again only where it is genuinely a step-by-step.
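As an illustration of the first item, here is a small Python sketch that emits an Article block with the author and date fields populated. The property names are standard schema.org vocabulary. The function name and every value in the example are placeholders, so swap in your own.

```python
import json


def article_jsonld(headline: str, author_name: str, author_url: str,
                   published: str, modified: str) -> str:
    """Return a script tag holding a minimal Article JSON-LD block."""
    data = {
        "@context": "https://schema.org",
        "@type": "Article",
        "headline": headline,
        "author": {"@type": "Person", "name": author_name, "url": author_url},
        "datePublished": published,
        "dateModified": modified,
    }
    # Escape "</" so a stray "</script>" inside a value cannot close the tag.
    body = json.dumps(data, indent=2).replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{body}\n</script>'


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(article_jsonld(
        "How to audit a site for AI search",
        "Jane Example",
        "https://example.com/about",
        "2026-05-20",
        "2026-05-20",
    ))
```

One block like this per article, generated from the same data that renders the visible byline and date, keeps the markup and the page in agreement.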

A small contrarian take, because every other AI SEO post on the internet pushes the same line: more schema is not always better. I’ve audited sites with seven nested schema types on a single blog post, half of which were invalid. AI engines treat broken schema as a trust signal in the wrong direction. One correct Article block beats four bloated ones.

Signal 4: Citability

This is where AI engines actually decide whether your page is worth quoting. Three things matter: a named author, visible dates, and outbound links to credible sources.

The byline finding is striking. Content with named author bylines receives roughly 1.9 times more citations from ChatGPT, Perplexity and Google AI Overviews than anonymous or “by the team” content (Am I Cited’s 2025 analysis). Perplexity in particular displays the author name in the response, so the visibility flows back into user trust as well.

Dates matter for a different reason. AI engines weight recency aggressively, especially on time-sensitive queries. If a page lacks a clear datePublished and dateModified, the AI has to guess from the URL, the copyright footer, or worst of all, nothing. Anonymous, undated content is, to a citation engine, indistinguishable from filler.

Outbound links are the surprise. Princeton’s GEO research found that adding authoritative citations to a page produced the single largest citation lift in the study, 115% for rank-5 pages (arXiv 2311.09735). The intuition is simple. A page that cites credible sources looks, to an LLM, like a page worth citing. Linking out is not bleeding link equity. It is signalling that you did the homework.
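The author and date checks are easy to automate. The Python sketch below pulls JSON-LD out of raw HTML and lists which of those fields are missing from the first Article block it finds. It is deliberately narrow: it assumes a single top-level Article object with a Person-style author, and it ignores @graph wrappers and visible bylines. The names are mine, and this is not how any particular engine scores a page.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects parsed JSON-LD blocks; invalid JSON is recorded as None."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.blocks.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                self.blocks.append(None)


def citability_gaps(html: str) -> list[str]:
    """List the author and date fields missing from the first Article block."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    articles = [b for b in parser.blocks
                if isinstance(b, dict) and b.get("@type") == "Article"]
    if not articles:
        return ["no valid Article JSON-LD found"]
    article = articles[0]
    gaps = []
    author = article.get("author")
    if not (isinstance(author, dict) and author.get("name")):
        gaps.append("author.name")
    for field in ("datePublished", "dateModified"):
        if not article.get(field):
            gaps.append(field)
    return gaps
```

An empty list means the basics are present. Anything else is a field to fill in before worrying about the finer points.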

Signal 5: Answer-shape formatting

AI engines extract answers. They love content that is already shaped like one. Princeton’s same paper found that adding statistics improved performance most on factual content, and quotations did the heavy lifting on opinion-heavy topics, each contributing roughly 30 to 40% to citation lift depending on the domain.

Practically, that means:

A clear, single-sentence answer near the top of the page, before any preamble.
Short paragraphs. Three to five sentences max.
Genuine Q&A blocks with the question as an H2 or H3.
Statistics and quotes attributed inline, not buried in a footer.
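The paragraph-length rule is the easiest of these to check mechanically. Here is a crude Python sketch that flags paragraphs running past five sentences. The sentence splitter is a naive regex that will miscount abbreviations and decimals, so treat the result as a prompt to reread a paragraph, not a verdict.

```python
import re


def sentence_count(paragraph: str) -> int:
    """Naive count: split after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return len([part for part in parts if part])


def long_paragraphs(text: str, max_sentences: int = 5) -> list[int]:
    """Return the indexes of paragraphs with more than max_sentences sentences.

    Paragraphs are assumed to be separated by blank lines.
    """
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    return [index for index, paragraph in enumerate(paragraphs)
            if sentence_count(paragraph) > max_sentences]
```

Run it over a draft and the indexes that come back are the paragraphs worth splitting.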

Here is a thing nobody will tell you. Long, flowing 400-word blog intros are AI-citation poison. The engine wants the answer at the top, not a wind-up. If your post opens with “In today’s fast-paced digital age…” and then takes six paragraphs to reach the point, you are writing for a human reader of 2014. An AI engine of 2026 will skip you for a Reddit thread that gets to the point in two sentences. Fix the intros. The rest is detail.

Signal 6: Content extractability

This is the technical hygiene layer. Does the AI crawler actually get something useful when it fetches your HTML?

The checks are unglamorous and they matter:

A single H1 per page, matching the topic.
Logical heading hierarchy (no H2 jumping to H4).
Body text rendered server-side, not injected by client JavaScript.
Semantic HTML elements (article, section, main) rather than div soup.
Alt text on images that describes the image, not “image1.jpg”.

Most of this is dull SEO 101 that any audit tool should catch. What most tools miss is the JavaScript trap. If your CMS or marketing platform renders the main content via React, Vue or a similar framework, and your server-rendered HTML is mostly empty, AI crawlers see a blank page. I have audited two sites this month that scored 95+ on Lighthouse and were invisible to ClaudeBot because of exactly this. Fixing it usually means turning on SSR or pre-rendering.
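The heading checks from the list above can be run against raw HTML in a few lines. This Python sketch, with names of my own choosing, flags pages without exactly one H1 and any jump that skips a level. It looks only at tag order and does not judge whether the H1 matches the topic.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Records heading levels (1 to 6) in document order."""

    def __init__(self):
        super().__init__()
        self.levels = []

    def handle_starttag(self, tag, attrs):
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self.levels.append(int(tag[1]))


def heading_issues(html: str) -> list[str]:
    """Report a wrong H1 count and any heading that skips a level."""
    collector = HeadingCollector()
    collector.feed(html)
    collector.close()
    issues = []
    h1_count = collector.levels.count(1)
    if h1_count != 1:
        issues.append(f"expected exactly one h1, found {h1_count}")
    for previous, current in zip(collector.levels, collector.levels[1:]):
        if current > previous + 1:
            issues.append(f"h{previous} jumps to h{current}")
    return issues
```

Feed it the server response, not the rendered DOM, or a JavaScript-built page will look healthier than it is.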

Signal 7: Classic SEO basics

The boring ones still matter, because AI engines pull from Bing and Google indexes for most of their fact-grounding. If you are not in the classic index, you are not in the citation pool.

So: title tags under 60 characters, meta descriptions under 155, exactly one H1, hreflang tags if you serve multiple languages, no duplicate content, working internal links. None of this is new. Most sites still get one or two of these wrong.
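The two length limits are simple to verify. The Python sketch below reads the title and meta description out of raw HTML and compares them with the 60 and 155 character targets mentioned above. The helper names are illustrative, and the limits are rules of thumb, not hard cut-offs enforced by any engine.

```python
from html.parser import HTMLParser


class HeadCollector(HTMLParser):
    """Captures the title text and the meta description content."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""
        self.description = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attributes.get("name") or "").lower() == "description":
            self.description = attributes.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def head_issues(html: str, title_max: int = 60, description_max: int = 155) -> list[str]:
    """Flag a missing or over-long title and meta description."""
    collector = HeadCollector()
    collector.feed(html)
    collector.close()
    issues = []
    title = collector.title.strip()
    if not title:
        issues.append("missing title")
    elif len(title) >= title_max:
        issues.append(f"title is {len(title)} characters (aim for under {title_max})")
    if collector.description is None:
        issues.append("missing meta description")
    elif len(collector.description) >= description_max:
        issues.append(
            f"meta description is {len(collector.description)} characters "
            f"(aim for under {description_max})"
        )
    return issues
```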

The mistake I see most often is title tags written for an algorithm that retired five years ago. Keyword stuffing in 2026 is just noise. AI engines pick up titles, read them as labels, and weight them against the page content. A title that promises one thing and delivers another loses trust fast.

Signal 8: Core Web Vitals

The honest version of this signal is more nuanced than most AI SEO advice admits.

Search Engine Land’s 2025 analysis of 107,352 AI-visible pages found no strong positive correlation between Core Web Vitals scores and AI citation rates. Improving CWV beyond baseline thresholds did not reliably improve AI visibility. So why is it on the list at all?

Because of the indirect path. Google uses Core Web Vitals as a ranking signal, and pages that fail badly enough can slide out of the results that Gemini and Google AI Overviews draw on. Under mobile-first indexing, mobile scores are what get measured. A site that loads in nine seconds on a phone is unlikely to make it into the AI’s source pool.

So treat CWV as a hygiene threshold, not a growth lever. Pass the thresholds (LCP under 2.5s, INP under 200ms, CLS under 0.1) and stop optimising. You will get diminishing returns past that point.
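Treating CWV as pass or fail comes down to three comparisons. A tiny Python sketch, assuming you already have field measurements from whatever tool you use (the dictionary keys are my own naming):

```python
# Upper bounds of the "good" band: LCP in seconds, INP in milliseconds, CLS unitless.
THRESHOLDS = {"lcp_s": 2.5, "inp_ms": 200, "cls": 0.1}


def failing_vitals(measured: dict[str, float]) -> list[str]:
    """Return the metrics whose measured value exceeds its threshold."""
    return [name for name, limit in THRESHOLDS.items() if measured[name] > limit]


if __name__ == "__main__":
    # Hypothetical mobile measurements for illustration.
    print(failing_vitals({"lcp_s": 9.0, "inp_ms": 180, "cls": 0.1}))
```

An empty list means you have cleared the bar and can spend the effort elsewhere.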

How to scan all eight in 60 seconds

The reason I built XEOscan is that I spent three months trying to audit these signals across client sites using existing tools, and the results were, to be polite, a right faff.

Half the “free” tools turned out to be email-gated. The ones that weren’t gave unreliable results. I saw scanners report missing llms.txt files on sites that had them. I saw schema markup detectors miss valid Article blocks. I saw tools demand FAQ schema on a landing page that didn’t have an FAQ on it. The recommendations were often louder than the diagnostics, and the diagnostics were sometimes plain wrong.

So I built XEOscan. It scans approximately 40 signals across the eight categories above, returns a prioritised report ranked by severity, and includes the exact code snippets to fix what it finds. No signup, no email gate, no upsell. It fetches raw HTML the way an AI crawler does, samples roughly 10 strategically selected pages from your site, and checks your robots.txt rules against 13+ AI agents. Results are shareable for seven days.

It is in beta. There will be bugs. If you find one, the contact form on the site works and so does my email. The next update will add an AI-powered deeper-content check, which is the one bit you genuinely can’t do with static rules.

If you want a longer read on the citation side specifically, my GEO playbook for 2026 goes deeper into the Princeton research and the tactical 30-day plan. The XEOscan piece sits alongside it: one tells you the playbook, the other tells you which plays your site is currently failing at.

What to do this week

Scan your site. Fix whatever comes back in red. Then scan two competitors and see where they are weaker than you. That is the field-position equivalent of looking at the scoreboard before deciding whether to pass or run.

XEO is not coming. It is here. The audit question is whether your site is shaped for it, and you have about a year before the brands who got there first own the citation real estate in your niche. Free scan, no signup, results in roughly a minute. The link is xeoscan.ai.