https://t.co/WMwidLX6H6

Tw93(@HiTw93)

Tw93(@HiTw93)2026年5月3日

https://t.co/WMwidLX6H6

9.2Score

TL;DR · AI 摘要

文章系统阐述面向AI搜索的可见性策略（GEO），指出AI检索逻辑与传统SEO根本不同，强调结构化内容、精准robots.txt控制和新兴llms.txt标准对提升AI可见性的关键作用。

核心要点

AI搜索结果83%来自传统搜索TOP10之外的页面，核心依赖内容清晰度与可信源而非PageRank
应差异化允许search/retrieval类爬虫（如OAI-SearchBot），但阻止训练类爬虫（如GPTBot）以平衡曝光与数据控制
llms.txt是面向AI的元数据标准，已获Anthropic/Cloudflare等采用，低普及率带来早期实践红利

结构提纲

按章节快速跳转。

§引言：意外走红引发的AI可见性思考
作者因开源项目被AI频繁引用而系统梳理AI搜索可见性原理，强调非‘黑帽’而是内容优化策略。
·AI搜索 vs 传统SEO的本质差异
AI检索不依赖排名权重，83%引用来自非TOP10页面，更看重结构化、高信噪比的文档内容。
·robots.txt的精细化治理策略
区分训练/检索/用户触发三类AI爬虫，针对性Allow/Disallow，避免一刀切阻断或放行。
·llms.txt：面向AI的元数据新标准
在根目录部署Markdown格式llms.txt，声明项目定位、关键链接与背景，已被头部技术公司采纳。
›实践技巧：跨项目mesh式llms.txt互链
作者将多个个人项目llms.txt相互引用，构建可发现性网络，提升AI爬虫深度抓取效率。

思维导图

用一张图看清主题之间的关系。

查看大纲文本（无障碍 / 无 JS 友好）

AI可见性策略（GEO）
- 核心差异
  - AI不依赖PageRank
  - 83%引用来自非TOP10页面
  - 偏好结构化、高信噪比内容
- 技术实施
  - robots.txt分层控制
  - llms.txt元数据标准
  - 跨项目mesh互链
- 价值定位
  - 品牌可见性策略
  - 1小时可落地，非流量工程
  - 产品力才是终极护城河

金句 / Highlights

值得收藏与分享的关键句。

83% of AI Overview citations come from pages outside the top 10 — AI rewards clear structure and reliable sourcing, not PageRank.
— 第2段
⬇︎ 下载 PNG 𝕏 分享到 X
Block training crawlers (GPTBot) to keep content out of model training; allow search crawlers (OAI-SearchBot) to stay visible in AI answers.
— 第3段
⬇︎ 下载 PNG 𝕏 分享到 X
llms.txt is a new standard, similar to robots.txt but designed for AI consumption — it’s prioritized by AI systems during crawling.
— 第4段
⬇︎ 下载 PNG 𝕏 分享到 X
BuiltWith tracks over 840,000 websites with llms.txt, yet SE Ranking finds only 10% adoption among 300k domains — being early is an advantage.
— 第4段
⬇︎ 下载 PNG 𝕏 分享到 X
I made each site’s llms.txt reference the others… forming a mesh. An AI crawler entering through any one site can follow the links and discover everything else.
— 第5段
⬇︎ 下载 PNG 𝕏 分享到 X

#AI Search#SEO#robots.txt#llms.txt#Developer Experience

打开原文

A few friends pinged me recently saying my open source tools were showing up when they asked AI questions. I hadn’t done anything deliberate, so I figured: why not spend an hour structuring things properly? After doing it, I fired off a quick tweet, but the notes were messy. People seemed genuinely interested, so I decided to write it up as a proper article for reference.

I hate gaming rankings or generating junk content. This article won’t teach you shortcuts. It’s about helping AI better understand the content you already have.

I looked into why it happened and found that AI search runs on entirely different logic. Traditional SEO is about fighting into the top 10, but 83% of AI Overview citations come from pages outside the top 10. AI rewards clear structure and reliable sourcing, not PageRank. My projects aren’t big, but the READMEs and docs are well-written enough that AI picks them up where bigger sites have thin content. That’s probably why friends were seeing Pake and MiaoYan in their AI results.

AI search is growing fast: 527% year-over-year in the first half of 2025, ChatGPT hit 900 million weekly active users by February 2026, and referral traffic converts at roughly 5x the rate of traditional search. But it still accounts for less than 1% of total referral traffic. This is a brand visibility strategy, not a traffic strategy. Worth an hour of setup, not a week, because your product is your real competitive advantage, not this.

Most people treat robots.txt as a switch: either block AI crawlers or allow them all. But AI crawlers come in several types, and they do different things.

Training crawlers (GPTBot, ClaudeBot, Meta-ExternalAgent, CCBot) take your content to train models. Blocking them keeps your content out of training data but doesn’t affect current AI search results.

Search and retrieval crawlers (OAI-SearchBot, Claude-SearchBot, PerplexityBot) fetch content in real time to answer user queries. Block these and you vanish from AI search.

User-triggered fetchers (ChatGPT-User, Claude-User, Perplexity-User, Google-Agent) only fire when someone pastes your URL into a chat window. Block them and users asking “summarize this page” get nothing.

Opt-out tokens (Google-Extended, Applebot-Extended) aren’t real crawlers. They’re signals you declare in robots.txt to opt out of AI training.

Undeclared crawlers (Bytespider, xAI’s Grok bot) don’t identify themselves and don’t necessarily follow the rules.

My approach: allow search/retrieval and user-triggered crawlers, block training and undeclared ones.

code

# Search & retrieval: allow
User-agent: OAI-SearchBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

# User-triggered: allow
User-agent: ChatGPT-User
Allow: /

User-agent: Claude-User
Allow: /

# Training: block
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

# Opt-out tokens
User-agent: Google-Extended
Disallow: /

# Undeclared: block
User-agent: Bytespider
Disallow: /

llms.txt is a new standard, similar to robots.txt but designed for AI consumption. You place a Markdown file at your site root describing what your site does, its key pages, and who’s behind it. AI systems prioritize this file when they crawl your content.

BuiltWith tracks over 840,000 websites that have deployed llms.txt, including Anthropic, Cloudflare, Stripe, and Vercel. But in SE Ranking’s survey of 300,000 domains, adoption is only 10%. It’s still early, and being early is an advantage.

The format is simple:

markdown

code

# Your Project Name

> One-line description of what this is.

## Links

- [Documentation](https://yoursite.com/docs)
- [GitHub](https://github.com/you/project)
- [Blog](https://yoursite.com/blog)

## About

Short paragraph explaining the project, its purpose, 
key features, and what makes it different.

After creating yours, submit it to

,

, and the llms-txt-hub repository on GitHub via PR.

I also did something interesting: I made each site’s llms.txt reference the others. I maintain

,

, and

. Each site’s llms.txt links to the rest, forming a mesh. An AI crawler entering through any one site can follow the links and discover everything else.

These changes take effect after crawlers revisit your site, usually within a few days. After that, try searching for your project name in ChatGPT. The citation sources and description accuracy should improve.

llms.txt is the summary. llms-full.txt is the complete version, typically 30-60KB, containing project descriptions, FAQs, usage scenarios, competitive comparisons, and README excerpts. Mintlify’s CDN analysis shows llms-full.txt gets 3-4x more traffic than llms.txt. AI systems that find the summary want the full version.

Markdown routes go further. Evil Martians recommend providing a .md version of every page on your site. A 15,000-token HTML page becomes a 3,000-token Markdown document, 80% less noise.

The simplest way to tell AI you have a Markdown version is adding this to your page’s <head>:

htmlbars

code

<link rel="alternate" type="text/markdown" href="/page.md" />

Claude Code and Cursor already send Accept: text/markdown headers when fetching docs. This is standard HTTP/1.1 content negotiation, around since 1997.

The robots.txt and llms.txt work from the previous sections makes your content readable to AI, but AI has to find you first. ChatGPT’s search runs on Bing, Google AI Overview uses Google’s own index, and Perplexity also relies on search APIs. If your pages aren’t indexed by search engines, none of the structuring work above matters. So the first step is making sure Google and Bing have indexed your site.

The setup is straightforward: go to

, verify your domain via DNS or HTML file upload, then submit your sitemap URL (usually

). Check the “Pages” indexing report to see which pages are indexed and which have issues. If an important page isn’t indexed, use the URL Inspection tool to manually request indexing.

You might think Bing doesn’t matter, but Copilot, DuckDuckGo, and Yahoo all run on Bing’s index underneath. Register with Bing Webmaster Tools, submit your sitemap, and check the AI Performance panel to see how often AI cites your content. While you’re there, set up IndexNow so Bing gets notified immediately when you publish new content instead of waiting for a crawler to find it.

Setting up IndexNow means placing an API key file at your site root, then sending a POST request to

with the list of changed URLs whenever you publish. Bing picks them up within minutes. Many static site generators and CMS platforms have IndexNow plugins available.

Google Search Console doesn’t have an AI-specific panel yet, but submitting your sitemap and monitoring indexing status is still worth doing. Google’s AI Overviews pull from a wider range than traditional results, so even pages that don’t rank in the top 10 can appear in AI-generated answers.

Perplexity has more users than you’d expect. They run a publisher program at

. Once approved, you get an 80/20 revenue share and access to citation analytics data.

Instead of waiting for AI to scrape information from scattered sources, give it a single entry point with everything organized.

A knowledge site should provide three layers: an overview (llms.txt), a full version (llms-full.txt, 30-60KB), and standalone knowledge pages for each core project. Add structured JSON APIs so AI tools can fetch data programmatically. Pull data live from upstream sources like the GitHub API with caching that refreshes periodically, keeping maintenance cost near zero.

One thing that’s easy to miss: give AI a narrative structure, not just a list of projects. If you have multiple projects, write a description that connects them, how they relate, your technical direction, the overall picture. When AI answers “who is this person” or “what does this team do,” a coherent narrative works much better than a flat list.

My implementation is called Yobi (from the Japanese 呼び / よび, meaning “to call” or “to summon”). It serves an llms.txt overview, a 50KB llms-full.txt, per-project pages, and four JSON endpoints (/api/profile, /api/projects, /api/blog, /api/weekly) that pull live data from the GitHub API with ISR caching that refreshes hourly. Stack: Next.js + TypeScript on Vercel.

The JSON API returns structured project data with live GitHub stats:

Each project needs its own standalone page, not a row in a list, but a self-contained Markdown document with a citable summary, core features, competitive comparisons, use cases, and install commands. Ahrefs found that cited pages have titles with higher semantic similarity to user queries, and natural-language URL slugs (like /projects/pake) get cited more than opaque IDs (like /page?id=47).

URL structure matters. /projects/pake tells the model what the page is about before it reads a single line. /page?id=47 tells it nothing.

Subdomains carry less authority than root domains. AI crawlers that discover

won’t automatically find

or

. If your llms.txt, project pages, and API data are spread across subdomains, AI may only see part of the picture.

The fix is to mirror key structured data onto your main domain so that

,

, and

all live under one roof. AI crawlers discover your main site through search indexing and find everything without leaving. Implementation options include scheduled CI sync, build-time fetching, or reverse proxying. I use a GitHub Action that syncs subdomain data to the blog repo nightly.

When launching a new site, use a checklist to avoid gaps. Core items: robots.txt (categorized crawler permissions), llms.txt (site summary with cross-references), sitemap (submitted to search engines), Bing Webmaster Tools (enable IndexNow), Google Search Console (monitor indexing). Each site’s llms.txt should reference the others, forming a discovery mesh.

The easiest trap when doing this work is getting carried away with every GEO technique you come across and trying to add them all, creating a mess and losing sight of what matters.

/.well-known/ai.txt: multiple competing proposals, no real adoption. Wait for a winner.

HTML comments with AI hints: parsers strip comments before AI sees the content.

User-Agent sniffing to serve Markdown: returning different content to bots versus humans is cloaking. Google will penalize you.

Unofficial AI meta tags: unless a major AI provider explicitly documents support, it’s just noise.

I initially thought JSON-LD would be powerful for AI visibility. Deeper research showed a more complicated picture. SearchVIU ran an experiment where they put data only in JSON-LD without showing it on the page. All five AI systems they tested failed to find it. Mark Williams-Cook’s follow-up experiment showed that LLMs treat <script type="application/ld+json"> as plain text, reading whatever words are inside without understanding the structured semantics.

The one confirmed exception is Bing/Copilot, which uses JSON-LD to enrich its search index. Keep existing JSON-LD (it helps Bing/Copilot and traditional rich results), but don’t add it expecting ChatGPT or Claude to cite you more.

The GEO paper from Princeton and IIT Delhi, published at KDD 2024, found that adding authoritative citations improves AI visibility by 115%, relevant statistics by 33%, and direct quotations from credible sources by 43%.

My friend

has been doing serious research on GEO. His

ran 602 prompts across three platforms and scraped tens of thousands of pages for feature analysis. His

is worth reading. Here are the patterns most useful for content creators.

Specificity. Pages with real data, clear definitions, and side-by-side comparisons see over 50% higher impact than vague, general pages. Step-by-step structure also helps noticeably. Pure FAQ format actually hurts. Those GEO tools that tell you “add FAQ to boost your score” are giving advice the data contradicts, which also validates my decision to drop FAQ sections from my own pages.

Content depth. AI doesn’t favor short summaries. It favors long content it can slice into reusable segments. High-impact pages average nearly 2,000 words with 10+ headings. Low-impact pages average just 170 words, over a 10x gap. The sweet spot is 1,000 to 3,000 words.

Relevance. All mechanical SEO metrics (heading hierarchy, meta descriptions, keyword density) predict less than one single variable: whether your page content actually answers the question the user asked.

Platform differences. ChatGPT cites fewer sources but uses each deeply; its per-citation impact is over 5x that of Google. Perplexity casts a wider net, citing more than twice as many sources. To get cited by ChatGPT, go deep on individual pages. To get cited by Perplexity, go wide.

Content type. Official websites, news, and industry verticals account for roughly 80% of citation sources. But encyclopedia-style and explainer pages have 3x the impact of news pages. English content accounts for over 83% of global citation samples, so projects targeting an international audience need English versions.

Of all the pages ChatGPT retrieves during a session, only 15% end up in the final answer. The other 85% are never cited. Getting into the retrieval pool is just the first hurdle. The model still has to decide which pages are worth citing.

Ahrefs found that cited pages have titles with noticeably higher semantic similarity to user queries, and pages with descriptive natural-language URL slugs get cited more than those with opaque IDs. This is why llms.txt and Markdown routes help: they give the model a clean, unambiguous signal about what your page covers.

Brands get cited 6.5x more often through third-party sources than through their own domains. Someone praising your project on Reddit or Hacker News carries more weight than your own marketing copy. That’s exactly why having a well-structured llms.txt matters: it gives the model a citable anchor to point to, even when the conversation that triggered the query happened somewhere else.

There are AI SEO audit tools that score your site and tell you to add FAQ sections, trust pages, or more text. Don’t let scores drive your decisions. The test is simple: does every paragraph you add contain information that isn’t already on the page? If not, don’t add it. I once added a FAQ to Yobi that just restated what the About section already said, purely to push the score up. That’s padding. I removed it.

Everything here is about helping AI understand what you have accurately, giving it a clean working environment. That lasts longer than any shortcut.

The basic configuration takes about an hour. The knowledge endpoint and per-project pages take longer, but once the data structure is in place, maintenance is easy. The daily sync runs on its own.

Give it a few days for crawlers to pick up the changes, then try searching for your name or project in ChatGPT, Perplexity, or Claude. The citations should be more accurate.

AI citation attribution is still unreliable. CJR and Tow Center tested 200 AI-generated citations and found 153 with partial or complete errors. Do the structural work because it makes your content easier to access accurately, but don’t treat an AI citation as proof that users saw your exact words. The mechanism is still improving

If you have your own products, blog, or website, give it a try. You can also hand this article to Claude Code and let it handle most of the setup.