Perplexity AI Accused of Scraping Websites
Cloudflare Accuses Perplexity AI of Scraping Websites Against Explicit Instructions
We haven’t talked about this newer AI yet, but I know a number of techs who prefer it over mainstream AI models. It’s said to provide much more data to users, without many of the guardrails currently in place on the better-known AIs.
AI startup Perplexity has been accused of scraping content from websites, which is the same way other AI tools have built the data sets behind their LLMs (large language models).
According to internet infrastructure provider Cloudflare, the problem is that Perplexity is apparently scraping data from websites that have explicitly opted out of such activity.
In a blog post published yesterday, Cloudflare revealed research indicating that Perplexity has been bypassing restrictions and concealing its scraping behavior. The company claims Perplexity masked its identity while accessing web pages, allegedly to circumvent site owners’ preferences.
AI models like those developed by Perplexity require vast amounts of data (text, images, and videos), often sourced from the internet. While scraping has long been a common practice among AI startups, many websites have pushed back by implementing the robots.txt protocol, a web standard that signals which pages crawlers should or shouldn’t access. However, compliance with these rules is voluntary, and enforcement has yielded mixed results.
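For readers curious how a robots.txt opt-out works in practice, here is a minimal sketch using Python's standard library. The robots.txt contents and the example.com URL are made up for illustration, and the "PerplexityBot" token is used here only as an example of a crawler name a site might block; a well-behaved crawler would run a check like this before fetching a page.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an example site: one named crawler is
# told to stay out entirely, everyone else is allowed in.
ROBOTS_TXT = """\
User-agent: PerplexityBot
Disallow: /

User-agent: *
Allow: /
"""


def can_fetch(user_agent: str, url: str) -> bool:
    """Return True if the rules above permit this user agent to fetch the URL."""
    parser = RobotFileParser()
    # parse() takes the file's lines directly, so no network request is made.
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/article"
    print("PerplexityBot allowed:", can_fetch("PerplexityBot", page))
    print("SomeOtherBot allowed:", can_fetch("SomeOtherBot", page))
```

Note that nothing in this check stops a crawler from skipping it. The file only states the site owner's wishes, which is why the dispute here is about whether those wishes were honored.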
Cloudflare alleges that Perplexity deliberately circumvented these blocks by altering its bots’ user-agent strings—which identify the type of device and browser accessing a site—and switching autonomous system numbers (ASNs), which identify large networks on the internet.
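To see why a changed user-agent string matters, consider a deliberately simplified sketch of a name-based block. This is not Cloudflare's method; the function name, the blocked tokens, and both user-agent strings below are illustrative assumptions. The point is only that a filter which trusts the user-agent header can be sidestepped by sending a different one.

```python
# Hypothetical list of crawler names a site owner wants to block.
BLOCKED_TOKENS = ("perplexitybot", "perplexity-user")

# Illustrative user-agent strings (assumed, not captured traffic).
DECLARED_UA = "PerplexityBot/1.0"
GENERIC_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/124.0.0.0 Safari/537.36"
)


def is_blocked_by_user_agent(user_agent: str) -> bool:
    """Block a request if its user-agent header names a blocked crawler."""
    ua = user_agent.lower()
    return any(token in ua for token in BLOCKED_TOKENS)


if __name__ == "__main__":
    # A crawler that announces itself is caught by the filter...
    print("declared crawler blocked:", is_blocked_by_user_agent(DECLARED_UA))
    # ...but the same crawler posing as Chrome on macOS is not.
    print("generic browser blocked:", is_blocked_by_user_agent(GENERIC_UA))
```

Because the header is supplied by the client, a filter like this cannot tell a real browser from a crawler imitating one, which is why detection has to rely on other signals.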
“This activity was observed across tens of thousands of domains and millions of requests per day. We were able to fingerprint this crawler using a combination of machine learning and network signals,” Cloudflare stated.
In response, Perplexity spokesperson Jesse Dwyer dismissed Cloudflare’s claims, calling the blog post a “sales pitch.” Dwyer also asserted that the screenshots shared by Cloudflare “show that no content was accessed,” and further claimed that the bot identified in the post “isn’t even ours.”
Cloudflare said it began investigating after receiving complaints from customers who noticed Perplexity scraping their sites despite having implemented robots.txt rules and blocks targeting Perplexity’s known bots. Cloudflare conducted tests and confirmed that the startup was bypassing these restrictions.
“We observed that Perplexity uses not only their declared user-agent, but also a generic browser intended to impersonate Google Chrome on macOS when their declared crawler was blocked,” Cloudflare added.
As a result, Cloudflare has removed Perplexity’s bots from its verified list and introduced new techniques to block them.
This isn’t the first time Perplexity has faced allegations of unauthorized scraping. In 2024, media outlets including Wired accused the company of plagiarizing content. During an interview at TechCrunch Disrupt 2024, CEO Aravind Srinivas struggled to define plagiarism when questioned about the controversy.
Perplexity vs ChatGPT: Which AI tool is better?
AI chatbots pretty much all feel the same. Sure, they use different models under the hood, but whether you’re using ChatGPT, Meta AI, or Google Gemini, the experience is pretty similar. You enter your question and a generated AI response comes out—which is why Perplexity AI is so interesting.
Instead of just being another chatbot, Perplexity is billed as an alternative to traditional search engines. Yes, it works kind of like a typical conversational AI chatbot, but it’s designed to be more accurate and up to date. So how does this compare to ChatGPT, which has also been held up as a possible replacement for search engines?
You decide… Give Perplexity a try and maybe compare the results with ChatGPT, Meta AI, or Google’s Gemini.
Thanks to TechCrunch for compiling and reporting on this.
Side Note:
In case you don't know, according to Google, Cloudflare is a global network and security company that provides a range of services to improve the security, performance, and reliability of websites and applications. It acts as a middleman between website visitors and the server, offering features like DDoS protection, content delivery, and web application firewall. Essentially, Cloudflare helps make the internet faster, safer, and more reliable for businesses and individuals alike.
David Snell, Rob Hakala and Beth Foster at 95.9 WATD Studio
David Snell joins Rob Hakala and Beth Foster of the South Shore’s Morning News on 95.9 WATD FM every Tuesday at 8:11.
You can listen to this broadcast here: https://actsmartit.com/perplexity-accused-scraping-websites/