CCBot — Common Crawl's Web Crawler
CCBot powers Common Crawl's open web dataset, used as training data by many AI models. Learn how blocking it works and its limitations.
What is CCBot?
CCBot is the crawler for Common Crawl, a non-profit that maintains a massive open dataset of web pages. Many AI companies — including those building smaller or open-source models — use Common Crawl snapshots as training data. Blocking CCBot stops future snapshots from including your site, but older snapshots are already publicly available and used by downstream models.
How to Block CCBot
Add the following to your robots.txt file (located at the root of your website):
User-agent: CCBot
Disallow: /
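One way to sanity-check the rule before deploying it is to parse it with Python's standard-library urllib.robotparser module. The following is a minimal sketch, not a definitive test: the helper name is_allowed, the example.com URL, and the second user-agent used for comparison are all illustrative assumptions.

```python
from urllib.robotparser import RobotFileParser

# The two-line rule from above, as it would appear in robots.txt.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if user_agent may fetch url under the given robots.txt text.

    Hypothetical helper for illustration; it only wraps the standard parser.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # example.com is a placeholder domain, not a real target.
    page = "https://example.com/any/page"
    print("CCBot allowed:", is_allowed(ROBOTS_TXT, "CCBot", page))
    # A crawler with no matching group is unaffected by this rule.
    print("Googlebot allowed:", is_allowed(ROBOTS_TXT, "Googlebot", page))
```

Because the rule names only CCBot, other crawlers keep their default access unless you add groups for them as well.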
What Happens When You Block CCBot
Future Common Crawl snapshots will not include your content. Past snapshots are already public and cannot be retracted.
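If you want to confirm that the block is being honored, one option is to scan your server's access log for requests whose user-agent contains "CCBot". The sketch below assumes a combined-format access log, where the request appears as a quoted "METHOD PATH PROTOCOL" string; the function name count_ccbot_hits and the sample log lines are hypothetical and would need adapting to your own server.

```python
import re
from typing import Iterable

# Matches the crawler's name anywhere in a log line, regardless of case.
CCBOT_PATTERN = re.compile(r"CCBot", re.IGNORECASE)

# Assumes combined log format: the request is quoted as "METHOD PATH PROTOCOL".
REQUEST_PATTERN = re.compile(r'"[A-Z]+ (\S+) [^"]*"')


def count_ccbot_hits(log_lines: Iterable[str]) -> dict[str, int]:
    """Count CCBot requests, split into robots.txt fetches and everything else."""
    counts = {"robots_txt": 0, "other": 0}
    for line in log_lines:
        if not CCBOT_PATTERN.search(line):
            continue
        match = REQUEST_PATTERN.search(line)
        path = match.group(1) if match else ""
        # Ignore any query string when checking for the robots.txt path.
        if path.split("?")[0] == "/robots.txt":
            counts["robots_txt"] += 1
        else:
            counts["other"] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample lines; real logs would be read from a file.
    sample = [
        '203.0.113.5 - - [01/Jan/2025:00:00:00 +0000] "GET /robots.txt HTTP/1.1" 200 45 "-" "CCBot/2.0"',
        '198.51.100.7 - - [01/Jan/2025:00:00:05 +0000] "GET /blog/post HTTP/1.1" 200 1234 "-" "Mozilla/5.0"',
    ]
    print(count_ccbot_hits(sample))
```

After the rule is live, you would expect a crawler that honors robots.txt to keep requesting /robots.txt while the "other" count stops growing.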
Should You Block CCBot?
Blocking CCBot is a broad opt-out. Because many downstream AI companies train on Common Crawl's open dataset, a single rule keeps your content out of future snapshots for all of them at once. It does not remove content from past snapshots, which are already public.
CCBot vs Other Common Crawl Crawlers
Common Crawl currently operates CCBot as a standalone crawler. Unlike companies such as OpenAI and Anthropic, which split functionality across multiple user-agents, Common Crawl uses a single user-agent identifier, CCBot, for all of its crawling.