The internet’s reliance on trust is being challenged by the rise of AI-powered answer engines employing stealthy crawling techniques. Cloudflare recently observed such behavior from Perplexity, an AI answer engine that, when blocked, masked its identity and ignored robots.txt directives to access website content. This article details Cloudflare’s response, highlighting the importance of transparent crawling practices and outlining steps website owners can take to protect their content.
Observed Stealth Crawling Behavior
Cloudflare detected Perplexity employing stealth crawling techniques to circumvent website preferences:
- User-Agent Spoofing: Perplexity used a generic macOS Chrome user-agent to disguise its identity when its declared crawlers (PerplexityBot, Perplexity-User) were blocked.
- IP Rotation and ASN Masking: Perplexity used multiple undisclosed IP addresses and rotated across ASNs to evade detection and website blocks.
- Robots.txt Disregard: The crawler ignored robots.txt files, or did not fetch them at all, and accessed content that website owners had explicitly disallowed.
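As a rough illustration of why user-agent spoofing combined with undisclosed IPs is hard to counter, the sketch below cross-checks a request's user-agent against a crawler's published IP ranges. It is not Cloudflare's detection logic. The function name `classify_request`, the three labels, and the IP range (a reserved documentation block, not a real Perplexity range) are assumptions made for this example.

```python
import ipaddress

# Assumption: placeholder range from the RFC 5737 documentation block,
# standing in for whatever ranges a crawler operator actually publishes.
DECLARED_RANGES = [ipaddress.ip_network("192.0.2.0/24")]

# Declared crawler names taken from the article.
DECLARED_TOKENS = ("PerplexityBot", "Perplexity-User")


def classify_request(user_agent: str, remote_ip: str) -> str:
    """Label a request by comparing its user-agent with its source IP.

    Returns one of three hypothetical labels:
      "declared"              - declared user-agent from a published range
      "impersonation-suspect" - declared user-agent from an unlisted IP
      "undeclared"            - no declared crawler token in the user-agent
    """
    ip = ipaddress.ip_address(remote_ip)
    from_declared_range = any(ip in net for net in DECLARED_RANGES)
    ua = user_agent.lower()
    claims_declared = any(token.lower() in ua for token in DECLARED_TOKENS)

    if claims_declared and from_declared_range:
        return "declared"
    if claims_declared:
        return "impersonation-suspect"
    return "undeclared"


if __name__ == "__main__":
    generic_chrome = (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
    )
    print(classify_request("PerplexityBot/1.0", "192.0.2.10"))
    print(classify_request(generic_chrome, "203.0.113.5"))
```

A request with a generic Chrome user-agent from an unlisted address lands in the `undeclared` bucket, the same bucket as an ordinary browser. A check like this can confirm a crawler that identifies itself, but on its own it cannot single out one that hides.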

Cloudflare’s Response
Based on this behavior, which violates both established web crawling norms (RFC 9309) and Cloudflare’s Verified Bots Policy, Cloudflare took the following actions:
- De-listing as a Verified Bot: Perplexity was removed from Cloudflare’s list of verified bots.
- Heuristics-Based Blocking: Cloudflare added heuristics to its managed rules to detect and block Perplexity’s stealth crawling activity.
- Signature Matching: Cloudflare added signature matches for Perplexity’s stealth crawler to its managed rule that blocks AI crawling activity. This rule is available to all Cloudflare customers.

Testing Methodology
Cloudflare’s findings were based on:
- Customer Complaints: Customers reported continued access by Perplexity despite blocking its declared crawlers via robots.txt and WAF rules.
- Controlled Experiments: Cloudflare tested newly created domains that had explicit robots.txt disallow directives and WAF rules blocking Perplexity’s declared crawlers. Perplexity still provided detailed information about the content of these restricted domains.

Best Practices for Responsible Web Crawling
Cloudflare advocates for responsible web crawling practices:
- Transparency: Clearly identify crawlers using unique user-agents, declare IP ranges, use Web Bot Auth, and provide contact information.
- Respect for Website Preferences: Adhere to robots.txt directives, rate limits, and security measures.
- Clear Purpose: Define the crawler’s purpose clearly and publish that description where site owners can find it.
- Dedicated Crawlers for Specific Tasks: Use a separate crawler for each distinct task, so site owners can decide which activities to allow.
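On the crawler side, the transparency and robots.txt points above could look something like the following minimal sketch, which uses Python's standard `urllib.robotparser`. The crawler name `ExampleBot`, its contact URL, and the helper `may_fetch` are hypothetical and chosen only for illustration. The robots.txt text is passed in as a string so the example runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Assumption: a made-up crawler identity. A transparent crawler sends a
# unique product token plus a URL where operators can learn more.
UA_TOKEN = "ExampleBot"
USER_AGENT = f"{UA_TOKEN}/1.0 (+https://example.com/bot-info)"


def may_fetch(robots_txt: str, url: str, ua_token: str = UA_TOKEN) -> bool:
    """Return True if robots_txt permits ua_token to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(ua_token, url)


if __name__ == "__main__":
    sample_robots = """\
User-agent: ExampleBot
Disallow: /private/

User-agent: *
Disallow:
"""
    for url in ("https://example.org/public", "https://example.org/private/page"):
        verdict = "fetch" if may_fetch(sample_robots, url) else "skip"
        print(f"{verdict}: {url} (as {USER_AGENT})")
```

A real crawler would download robots.txt from the target site first and call `may_fetch` before every request, skipping any URL for which it returns False.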
Comparison with OpenAI’s Crawling Practices
Cloudflare highlights OpenAI as an example of responsible AI crawling. OpenAI’s crawlers are clearly identified, respect robots.txt directives, and use Web Bot Auth for enhanced security.
Protecting Your Website from Stealth Crawling
Cloudflare’s bot management system already blocks Perplexity’s stealth crawling attempts. Customers can further enhance protection by:
- Implementing Robots.txt Rules: Use robots.txt to explicitly disallow access to specific content or sections of your website. Cloudflare’s managed robots.txt feature simplifies this process.
- Utilizing Bot Management Challenges: Set up challenge rules to differentiate between bots and humans, allowing legitimate users to access your site.
- Leveraging AI Crawling Block Rules: Enable Cloudflare’s managed rule to block AI crawling activity.
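For the robots.txt step, a site owner might want to confirm that a rule set blocks the declared crawlers named in this article while leaving other agents alone. The sketch below is one way to check that locally with Python's standard `urllib.robotparser`. The rule set, the helper `blocked_agents`, and the `example.com` URL are illustrative assumptions, not a Cloudflare-generated file.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules: disallow the two declared Perplexity crawlers
# everywhere, allow everyone else.
ROBOTS_TXT = """\
User-agent: PerplexityBot
User-agent: Perplexity-User
Disallow: /

User-agent: *
Allow: /
"""


def blocked_agents(robots_txt: str, agents: list[str],
                   url: str = "https://example.com/") -> list[str]:
    """Return the subset of agents that robots_txt forbids from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


if __name__ == "__main__":
    agents = ["PerplexityBot", "Perplexity-User", "Googlebot"]
    print(blocked_agents(ROBOTS_TXT, agents))
```

Keep in mind that robots.txt only restrains crawlers that choose to honor it. The stealth behavior described in this article bypassed it, which is why the challenge rules and the managed AI-crawler block rule matter as enforcement layers.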
Conclusion
Perplexity’s stealth crawling tactics demonstrate the challenges posed by increasingly sophisticated AI-powered web crawlers. Cloudflare’s response emphasizes the need for transparency and adherence to established web crawling best practices. Website owners can protect their content and keep control over their data by combining robots.txt directives, bot management tools, and Cloudflare’s managed rules. As crawling techniques and detection methods both continue to evolve, site owners and bot operators will need to stay vigilant and keep collaborating.