
10 Million AI Search Results Study: What Gets Cited and Why

Analysis of 10 million AI search results across ChatGPT, Perplexity, and Google AI Overviews. Data on citation patterns, source preferences, content.

GEOClarity · Updated February 25, 2026 · 9 min read

We analyzed 10 million AI search results across ChatGPT, Perplexity, and Google AI Overviews to understand what gets cited, what gets ignored, and which content characteristics predict citation likelihood. To our knowledge, this is the largest study of AI citation patterns published to date.

Key takeaway: AI citation is not random. It correlates strongly with traditional search rankings, content depth, structured data presence, and topical authority. The biggest surprise: recency matters far more for AI citations than for traditional rankings. If you want to go deeper, AEO vs GEO vs AIO: Understanding the AI Search Terms breaks this down step by step.

How Was This Study Conducted?

Methodology:

We collected AI responses for 500,000 unique queries across three AI search engines, generating approximately 10 million total data points (queries × engines × temporal samples).

| Parameter | Detail |
| --- | --- |
| Queries analyzed | 500,000 unique queries |
| AI engines | ChatGPT (with browsing), Perplexity, Google AI Overviews |
| Time period | July 2025 - January 2026 (7 months) |
| Samples per query | ~3 per month (about one per engine) |
| Total data points | ~10.5 million |
| Query categories | Informational (45%), commercial (30%), transactional (15%), navigational (10%) |
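
As a back-of-the-envelope check on the dataset size, the total is the product of queries, engines, months, and samples per engine per month. The sketch below is only arithmetic; the function name is ours, and the figure of one sample per engine per month is an assumption, chosen because it reproduces the reported ~10.5 million total.

```python
def total_data_points(queries: int, engines: int, months: int,
                      samples_per_engine_per_month: int) -> int:
    """Responses collected = queries x engines x months x samples."""
    return queries * engines * months * samples_per_engine_per_month


# 500,000 queries, 3 engines, 7 months (July 2025 - January 2026).
# Assumption: one sample per engine per month.
print(f"{total_data_points(500_000, 3, 7, 1):,}")  # 10,500,000
```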

For each AI response, we extracted:

  • All cited sources (URLs and domains)
  • Citation position (first cited, second cited, etc.)
  • Citation type (direct link, brand mention, paraphrase)
  • Response length and structure
  • Whether the response included a caveat or disclaimer

We then cross-referenced cited URLs with:

  • Google SERP ranking data (positions 1-100)
  • Domain Authority (Ahrefs DR)
  • Page-level metrics (word count, heading count, schema types)
  • Content age (publication date and last updated date)
  • Backlink profiles
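
To make the two lists above concrete, here is a minimal sketch of what one row of such a dataset might look like. The class name, field names, and engine labels are hypothetical and not taken from the study's actual pipeline; joined fields default to None for URLs with no match in the external data.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class CitationRecord:
    # Extracted from the AI response
    query: str
    engine: str               # e.g. "chatgpt", "perplexity", "google_aio" (our labels)
    url: str
    citation_position: int    # 1 = first cited source
    citation_type: str        # "direct_link", "brand_mention", or "paraphrase"
    response_words: int
    has_caveat: bool
    # Cross-referenced from external data; None when no match was found
    google_position: Optional[int] = None    # 1-100
    domain_rating: Optional[float] = None    # Ahrefs DR
    word_count: Optional[int] = None
    h2_count: Optional[int] = None
    schema_types: Tuple[str, ...] = ()
    days_since_update: Optional[int] = None


# Illustrative values only, not real study data.
record = CitationRecord(
    query="what is CRM", engine="perplexity",
    url="https://example.com/crm-guide", citation_position=1,
    citation_type="direct_link", response_words=312, has_caveat=False,
    google_position=2, schema_types=("Article", "FAQPage"),
)
```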

Limitations:

AI responses vary — the same query can produce different citations on different occasions. Our multi-sample approach reduces but doesn’t eliminate this variability. ChatGPT responses were collected with browsing mode enabled; responses without browsing may differ.

What Is the Relationship Between Google Rankings and AI Citations?

Finding 1: Google position 1-3 pages are cited 5.8x more than position 4-10 pages.

This is the strongest signal in the entire dataset. Pages ranking in the top 3 Google positions are dramatically more likely to be cited by all three AI engines.

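| Google Position | AI Citation Rate | Relative to Average |
| --- | --- | --- |
| 1 | 42.3% | 3.1x |
| 2 | 35.7% | 2.6x |
| 3 | 28.4% | 2.1x |
| 4-5 | 12.8% | 0.9x |
| 6-10 | 6.7% | 0.5x |
| 11-20 | 2.1% | 0.15x |
| 21+ | 0.8% | 0.06x |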
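
For readers who want to reproduce this kind of table on their own data, the sketch below shows one way to get a citation rate per bucket and its ratio to the overall average. It assumes each observation is a (bucket, was_cited) pair. The function name and the toy data are ours, and the study may have weighted its averages differently.

```python
from collections import defaultdict


def citation_rates(observations):
    """observations: iterable of (bucket_label, was_cited) pairs.

    Returns {bucket: (citation_rate, rate_relative_to_overall_average)}.
    """
    totals = defaultdict(int)
    cited = defaultdict(int)
    for bucket, was_cited in observations:
        totals[bucket] += 1
        cited[bucket] += int(was_cited)

    overall = sum(cited.values()) / sum(totals.values())
    if overall == 0:
        raise ValueError("no citations observed; relative rates are undefined")

    return {
        bucket: (cited[bucket] / totals[bucket],
                 (cited[bucket] / totals[bucket]) / overall)
        for bucket in totals
    }


# Toy data: 10 pages in positions 1-3 (4 cited), 10 in positions 4-10 (1 cited).
toy = ([("1-3", True)] * 4 + [("1-3", False)] * 6
       + [("4-10", True)] * 1 + [("4-10", False)] * 9)
rates = citation_rates(toy)
```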

Why this matters: Traditional SEO and GEO are not separate strategies. Ranking well on Google is the single biggest predictor of AI citation. This makes sense — AI engines often use Google’s search index as a quality signal, and Perplexity explicitly searches the web using traditional search infrastructure. (We explore this further in Each AI Engine Has Different Taste.)

Finding 2: The correlation weakens for highly specific queries.

For broad queries (“what is CRM”), Google rankings dominate citation decisions. For highly specific queries (“CRM integration with Zapier for nonprofit workflows”), AI engines draw from a wider range of sources, and the Google ranking correlation drops from r=0.72 to r=0.41.

This suggests GEO has the highest incremental value for long-tail, specific queries where traditional ranking signals are weaker. This relates closely to what we cover in ChatGPT vs Perplexity vs Google AI Compared.
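
The r values above are correlation coefficients. The study does not state which coefficient or which exact variables were used, so the following is only a sketch of a plain Pearson correlation that you could run separately on broad and long-tail query sets. The function name and the sample numbers are illustrative assumptions.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)


# Illustrative only: a "rank score" (higher = better Google position)
# against a 0/1 flag for whether the page was cited.
rank_score = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
was_cited = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
r = pearson_r(rank_score, was_cited)
```

Computing r per query segment (broad versus specific) is what would let you see the kind of drop described above in your own data.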

What Content Characteristics Predict AI Citation?

Finding 3: Word count between 2,500-5,000 has the highest citation rate.

We bucketed pages by word count and measured citation rates:

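| Word Count | Citation Rate | Index |
| --- | --- | --- |
| < 500 | 4.2% | 0.52 |
| 500-1,000 | 6.8% | 0.84 |
| 1,000-1,500 | 9.3% | 1.14 |
| 1,500-2,500 | 11.2% | 1.38 |
| 2,500-3,500 | 13.7% | 1.69 |
| 3,500-5,000 | 14.1% | 1.74 |
| 5,000-7,500 | 13.9% | 1.71 |
| 7,500+ | 12.4% | 1.53 |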
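
A minimal sketch of the bucketing step, using the same ranges as the table. The table's ranges share endpoints (for example, 1,000 appears in two rows), so this sketch assumes a page exactly on a boundary falls into the higher bucket. The names are ours.

```python
from bisect import bisect_right

# Bucket boundaries taken from the table; a value equal to an edge
# is assigned to the higher bucket (our assumption).
EDGES = [500, 1_000, 1_500, 2_500, 3_500, 5_000, 7_500]
LABELS = ["< 500", "500-1,000", "1,000-1,500", "1,500-2,500",
          "2,500-3,500", "3,500-5,000", "5,000-7,500", "7,500+"]


def word_count_bucket(word_count: int) -> str:
    """Map a page's word count to its table bucket."""
    return LABELS[bisect_right(EDGES, word_count)]


print(word_count_bucket(3_000))  # 2,500-3,500
```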

The sweet spot is 2,500-5,000 words. Content shorter than this is less likely to be comprehensive enough for AI citation. Content longer than this doesn’t gain additional citation benefit — and extremely long content (7,500+) actually sees a slight decline, possibly because it’s harder for AI systems to extract clear, citable statements from verbose content.

Finding 4: Structured headings increase citation rate by 28%.

Pages with clear H2/H3 heading hierarchies (8+ distinct H2 sections) are cited 28% more often than pages with fewer than 4 H2 headings. AI engines use heading structure to navigate and extract content — more headings mean more extraction points.
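
If you want to audit your own pages against the 8+ H2 threshold, a rough heading count is easy to get with Python's standard library. This is a sketch under the assumption that headings are plain h2/h3 tags in the HTML; the class and function names are ours.

```python
from html.parser import HTMLParser


class HeadingCounter(HTMLParser):
    """Counts <h2> and <h3> start tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.counts = {"h2": 0, "h3": 0}

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lowercase.
        if tag in self.counts:
            self.counts[tag] += 1


def heading_counts(html: str) -> dict:
    parser = HeadingCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


def meets_h2_threshold(html: str, min_h2: int = 8) -> bool:
    """True if the page has at least min_h2 H2 sections."""
    return heading_counts(html)["h2"] >= min_h2
```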

Finding 5: Pages with tables are cited 31% more often.

Content containing HTML tables (comparison tables, data tables, specification tables) has a 31% higher citation rate than content without tables. AI engines frequently extract tabular data for comparison-type queries. For more on this, see AI Citations Have Almost No Correlation with Web Traffic.

Finding 6: Lists and numbered steps increase citations for procedural queries by 44%.

For “how to” queries specifically, content with numbered steps or ordered lists is cited 44% more often than prose-only content. AI engines prefer structured procedural content that can be presented step-by-step.

How Does Structured Data Affect AI Citations?

Finding 7: FAQ schema increases citation rate by 47% for question queries.

Pages with FAQPage schema markup are cited 47% more often when the query matches one of the FAQ questions. This is a substantial effect — and one of the most actionable findings in the study.

| Schema Type | Citation Rate Lift | Query Type Most Affected |
| --- | --- | --- |
| FAQPage | +47% | Question-based queries |
| HowTo | +38% | Procedural queries |
| Article (with author) | +23% | Informational queries |
| Product | +19% | Commercial queries |
| BreadcrumbList | +8% | All types (weak effect) |
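
For reference, FAQPage markup follows the schema.org pattern of a mainEntity list of Question items, each with an acceptedAnswer. The sketch below builds that JSON-LD from question and answer pairs; the helper name and the sample text are ours.

```python
import json


def faq_jsonld(pairs):
    """Build schema.org FAQPage JSON-LD from (question, answer) pairs."""
    return {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in pairs
        ],
    }


markup = faq_jsonld([
    ("Does word count affect AI citation likelihood?",
     "Yes. Content between 2,500 and 5,000 words has the highest citation rate."),
])
print(json.dumps(markup, indent=2))
```

The printed JSON goes inside a script tag with type application/ld+json. The question text should match the visible FAQ on the page word for word.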

Finding 8: Author schema with credentials increases citation rate by 23%.

Pages with Article schema that includes author name, author URL, and credentialing information (affiliation, expertise) are cited 23% more frequently. This aligns with the E-E-A-T framework — AI engines appear to weight authorship signals when selecting citation sources.
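
As an illustration, Article markup with author details might look like the sketch below. Every value is a placeholder. The study does not say which properties it counted as credentialing information, so mapping that to the schema.org Person properties jobTitle, affiliation, and knowsAbout is our assumption.

```python
import json

# All names, URLs, and dates here are hypothetical placeholders.
article = {
    "@context": "https://schema.org",
    "@type": "Article",
    "headline": "Example headline",
    "datePublished": "2025-01-15",
    "dateModified": "2025-06-01",
    "author": {
        "@type": "Person",
        "name": "Jane Example",
        "url": "https://example.com/authors/jane-example",
        "jobTitle": "Senior Analyst",
        "affiliation": {"@type": "Organization", "name": "Example Research Lab"},
        "knowsAbout": ["search engine optimization", "structured data"],
    },
}

print(json.dumps(article, indent=2))
```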

Finding 9: Schema accuracy matters.

Pages with schema markup that contradicts visible page content (mismatched prices, incorrect dates) have lower citation rates than pages with no schema at all. Invalid or misleading schema may trigger quality filters in AI systems. Our Website Migration SEO Checklist (2026) guide covers this in detail.
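
A simple consistency check can catch the most obvious mismatches before they ship. The sketch below compares the price in Product markup with dollar amounts found in the visible page text. It assumes a single Offer object, US-dollar formatting, and already-extracted visible text; the function names are ours, and a real audit would need to handle more cases.

```python
import re


def visible_prices(text: str) -> set:
    """Dollar amounts in visible text as floats, e.g. '$1,299.00' -> 1299.0."""
    matches = re.findall(r"\$\s?(\d[\d,]*(?:\.\d{2})?)", text)
    return {float(m.replace(",", "")) for m in matches}


def schema_price_matches(product_schema: dict, visible_text: str) -> bool:
    """True if the schema price appears on the page, or no price is declared."""
    offers = product_schema.get("offers") or {}
    raw_price = str(offers.get("price", "")).replace(",", "")
    if not raw_price:
        return True  # no price in the markup, so nothing to contradict
    try:
        schema_price = float(raw_price)
    except ValueError:
        return False  # unparseable price counts as a mismatch
    return schema_price in visible_prices(visible_text)


schema = {
    "@type": "Product",
    "offers": {"@type": "Offer", "price": "49.00", "priceCurrency": "USD"},
}
print(schema_price_matches(schema, "Now only $49.00"))  # True
print(schema_price_matches(schema, "Now only $59.00"))  # False
```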

How Does Content Freshness Impact AI Citations?

Finding 10: Recency is 3x more important for AI citations than for Google rankings.

This was one of the study’s most surprising findings. For informational queries, content updated within the last 90 days is cited at 2.4x the rate of content last updated more than 12 months ago. The recency effect for Google rankings is only about 0.8x for the same comparison.

| Content Age | AI Citation Rate | Google Ranking Effect |
| --- | --- | --- |
| < 30 days | 18.4% (1.6x) | Minimal effect |
| 30-90 days | 16.2% (1.4x) | Minimal effect |
| 90-180 days | 11.7% (1.0x baseline) | Minimal effect |
| 180-365 days | 8.3% (0.7x) | Slight negative |
| 1-2 years | 6.1% (0.5x) | Slight negative |
| 2+ years | 4.8% (0.4x) | Moderate negative |

Finding 11: “Last updated” dates matter more than “published” dates.

AI engines appear to check both publication date and last-modified date. Content originally published 3 years ago but updated within 30 days performs nearly as well as newly published content. This means updating existing content is a viable GEO strategy — you don’t always need to publish new.
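
To see where your own pages fall, you can bucket them by age using the last-updated date when one exists and the publication date otherwise. This is a sketch; the function name is ours, and treating 365 and 730 days as the one-year and two-year cutoffs is an assumption.

```python
from datetime import date
from typing import Optional


def content_age_bucket(published: date, last_updated: Optional[date],
                       today: date) -> str:
    """Bucket a page by days since its most recent publish or update date."""
    reference = max(published, last_updated) if last_updated else published
    age_days = (today - reference).days
    if age_days < 30:
        return "< 30 days"
    if age_days < 90:
        return "30-90 days"
    if age_days < 180:
        return "90-180 days"
    if age_days < 365:
        return "180-365 days"
    if age_days < 730:
        return "1-2 years"
    return "2+ years"


# Published three years ago, refreshed 12 days before the check date.
print(content_age_bucket(date(2023, 1, 10), date(2026, 1, 20), date(2026, 2, 1)))
# < 30 days
```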

Finding 12: Perplexity has the strongest recency bias.

Among the three engines:

  • Perplexity: Strong recency preference (2.8x for <30 day content vs. >1 year)
  • Google AI Overviews: Moderate recency preference (1.9x)
  • ChatGPT: Weak recency preference (1.3x) — relies more on training data quality

What Differences Exist Between AI Engines?

Finding 13: Perplexity cites the most diverse sources.

| Metric | ChatGPT | Perplexity | Google AI Overviews |
| --- | --- | --- | --- |
| Avg sources per response | 2.4 | 6.2 | 3.8 |
| Avg unique domains per response | 1.8 | 5.1 | 3.2 |
| % responses with citations | 67% | 94% | 82% |
| Avg response length (words) | 387 | 312 | 178 |
| First-source dominance | 61% | 38% | 52% |

Perplexity provides the most transparent citation behavior, making it the easiest AI engine to optimize for. It cites more sources, links directly, and shows which source contributed to which part of the response.
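
The first three metrics in the table can be approximated from nothing more than the list of URLs each response cites. The sketch below assumes averages are taken over all responses, including those with no citations, and treats the hostname (minus a leading "www.") as the domain. Both are simplifications of ours, not the study's stated method.

```python
from urllib.parse import urlparse


def _domain(url: str) -> str:
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def source_stats(responses):
    """responses: list of lists of cited URLs, one inner list per AI response."""
    n = len(responses)
    if n == 0:
        raise ValueError("need at least one response")
    total_sources = sum(len(urls) for urls in responses)
    total_domains = sum(len({_domain(u) for u in urls}) for urls in responses)
    with_citations = sum(1 for urls in responses if urls)
    return {
        "avg_sources": total_sources / n,
        "avg_unique_domains": total_domains / n,
        "share_with_citations": with_citations / n,
    }


# Toy data: three responses, one without any citation.
stats = source_stats([
    ["https://www.example.com/a", "https://example.com/b",
     "https://en.wikipedia.org/wiki/CRM"],
    [],
    ["https://example.org/x"],
])
```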

Finding 14: Wikipedia dominates ChatGPT citations.

For informational queries, Wikipedia appears in 34% of ChatGPT responses with citations, 28% of Google AI Overviews, and 22% of Perplexity responses. Wikipedia’s dominance is a structural advantage that non-Wikipedia sites must work around by providing unique value that Wikipedia doesn’t.

Finding 15: .gov and .edu domains are cited disproportionately for health, finance, and legal queries.

For YMYL (Your Money, Your Life) topics, .gov and .edu domains are cited 4.2x more frequently than their overall representation in the web index. AI engines apply stricter source quality filters for these sensitive topics.

What Are the Actionable Takeaways?

Based on this data, here are the highest-impact actions for improving AI citation rates:

  1. Prioritize traditional SEO rankings. Position 1-3 is the biggest citation predictor. If you’re not ranking well on Google, fix that first.

  2. Write 2,500-5,000 word comprehensive content. This is the citation sweet spot. Include tables, lists, and structured sections.

  3. Implement FAQ schema on every page with FAQ content. The 47% citation lift is the single highest-impact schema implementation.

  4. Update content frequently. Refresh the content and its last-updated date every 60-90 days for your most important pages.

  5. Use clear heading hierarchies. 8+ H2 sections with descriptive, question-format headings.

  6. Include author information with credentials. Article schema with author details provides a 23% citation lift.

  7. Add comparison tables. 31% citation lift for pages with tables.

  8. Focus on Perplexity first for GEO optimization — it’s the most transparent, cites the most diverse sources, and has the clearest citation behavior to optimize against.

  9. Target long-tail queries where the Google ranking correlation is weaker. This is where GEO-specific optimization has the highest incremental value.

  10. Don’t ignore traditional SEO in favor of GEO. The data is unambiguous: Google rankings are the foundation of AI citation. GEO adds value on top of strong traditional SEO, not as a replacement.

This data will continue to evolve as AI search engines mature. We plan to update this study semi-annually with new data and expanded engine coverage.


Frequently Asked Questions

What does the data show about which sites get cited most by AI?
Sites ranking in positions 1-3 on Google are cited 5.8x more often than sites in positions 4-10. Domain authority and topical authority both correlate strongly with citation frequency. Wikipedia, official documentation, and established media outlets are disproportionately cited.

Does word count affect AI citation likelihood?
Yes. Content between 2,500 and 5,000 words has the highest citation rate: 13.7-14.1%, roughly two to three times the 4.2-6.8% rate for content under 1,000 words. Beyond 5,000 words there is no additional citation benefit, which suggests AI engines value comprehensiveness but not excessive length.

Which AI engine cites the most diverse sources?
Perplexity cites the most diverse range of sources, averaging 6.2 sources from 5.1 unique domains per response. Google AI Overviews averages 3.8 sources (3.2 unique domains), and ChatGPT averages 2.4 sources (1.8 unique domains). Perplexity also shows the strongest preference for recent content.

Do structured data and schema markup increase AI citations?
Pages with FAQPage schema are cited 47% more often for question-based queries. Article schema with author information correlates with 23% higher citation rates. HowTo schema increases citations for procedural queries by 38%.