Web Scraper Bots and Gen AI Threats

Web scrapers are now a serious bot threat in the age of Gen AI.

Web scraper bots, also known as spider or crawler bots, covertly crawl web pages and extract rich media and other data. Web scraping bots are often seen as more of a nuisance than a real problem, and tend to get noticed only when they cause a server overload. However, the rise of generative AI has turned web scraping into a real threat with serious financial implications that cannot be ignored.

Why Were Web Scrapers Largely Ignored?

Web scrapers mostly steal website content - often marketing content that has already been approved and crafted for public release. If hackers are determined to steal web content, they will, even if they have to copy it page by page. Commercial scraping services also make copying entire websites trivially easy, and these evade most bot detection software. The reasoning goes: if you can’t do anything about it, best to just ignore the whole thing. That fatalistic ‘so what’ attitude is exactly what you should not adopt, because web scrapers have evolved into a multi-vector risk, as outlined below.

The major risks of web scrapers are summarized below:

  1. The advent of generative AI services has turned web scraping into a much more serious threat to businesses, and few have grasped the financial impact of this threat.

  2. Scrapers are used for platform and vulnerability reconnaissance, probing your tech stack for potential weaknesses that will be exploited by yet more bots in future attacks.

  3. Scrapers, combined with social media and bot-based distribution, enable sophisticated real-time offers and promotions that exploit discrepancies in stock, pricing and delivery, particularly for travel, ecommerce, white goods and other commodity products.

Let’s examine these threat vectors in detail.

Gen AI Scraping Bot Threat - What Copyright?

What are the major threats from Gen AI scraping?

Gen AI Scraping Bots, Copyright and SEO

ChatGPT and its derivatives have changed the world of SEO forever. We can argue all day about the quality of ChatGPT-derived content, or whether it can be distinguished from real human writing, but what isn’t up for debate is that ChatGPT content can be optimized for SEO. The search engine algorithms don’t care whether it’s human or not. They just care whether it’s ‘good’ in algorithmic terms.

Bots and SEO Skyscraper Strategy

How is this being exploited? It’s a new twist on the older SEO skyscraper strategy. You can think of this as the McDonald’s / Burger King strategy. If people are already going to a hamburger destination, it pays dividends to have your competing brand right across the street. Your competitor has done all the hard work to establish footfall. You can then hijack some of the traffic.

Find a competitor with strong SEO rankings for keywords you covet. Analyse the backlinks and copy, then build a better, taller skyscraper to attract more footfall. This is a very common technique, but it has traditionally needed skilled copywriters and a knowledgeable SEO team or agency to apply it. It also takes time. A lot of time. Writing hundreds of 2,000-word SEO-optimized articles is a major work effort. In the skyscraper SEO model, the original copyrighted content is not stolen outright. It is effectively harvested, re-written, improved and optimized with more current information, videos, infographics and improved layout and presentation.

Today, it’s possible to do the exact same thing with ChatGPT prompts and a small army of dedicated bots - and build an entire library of optimized content sucked from competitor sites with no human intervention. That’s a game changer. Think of all the time and effort it took to create a serious content marketing strategy, and the value of your page-one keywords.

Gen AI Scraping Data Exploitation

Gen AI can take scraped data and turn it into other content formats or derivative services. For example, it can take pricing or statistical data and turn it into a new lookup service. Essentially, the threat is the re-purposing of existing content, using automation tools to exploit the core IP that was created in the first place. With subtle variations, tweaks and data consolidations, the copyright protection is gone, and something ‘new’ has been created.

Gen AI Language Model Creation

Many language models use corpora such as those from https://commoncrawl.org/, a truly vast open repository of data crawled from years of web content. Most sites haven’t excluded data-aggregating crawlers such as Common Crawl. Common Crawl is a fantastic resource for data scientists, researchers and ML model creators who need vast datasets to train their models and enrich our understanding. But most site owners don’t know what these crawlers are doing, or what the potential negative consequences are. Now that you do know, inclusion should be an informed choice rather than a default. You can opt out simply in robots.txt - but of course, you’ve now signposted to hackers that you have content you want to protect, and its exact location.
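As an illustration, a minimal robots.txt along these lines blocks Common Crawl’s crawler, which identifies itself as CCBot, and (assuming you also want to exclude OpenAI’s training crawler) GPTBot, while leaving ordinary search engines untouched. Bear in mind robots.txt is purely advisory: compliant crawlers honour it, but illegitimate scrapers are free to ignore it.

```
# Block Common Crawl's crawler (user agent: CCBot)
User-agent: CCBot
Disallow: /

# Block OpenAI's training crawler (user agent: GPTBot)
User-agent: GPTBot
Disallow: /

# Everyone else may crawl as normal
User-agent: *
Disallow:
```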

Scrapers Use Mobile and ISP Proxies and Opt-in Botnets to Avoid Detection

Scraping tools and packages have evolved to make data scraping easier than ever. A wide range of scraping tools is available depending on your language preference and skill level, all the way from point-and-click SaaS packages such as Brightdata, to Python libraries, Puppeteer for Node.js, and OpenBullet for the .NET crowd.
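To illustrate just how low the barrier is, here is a minimal sketch using the widely available requests and BeautifulSoup Python libraries. The URL and CSS selectors are hypothetical placeholders; a real scraper would layer proxy rotation and retries on top of this.

```python
# Minimal scraping sketch: fetch a page and pull out product names and prices.
# The URL and CSS classes below are hypothetical placeholders.
import requests
from bs4 import BeautifulSoup

HEADERS = {
    # Scrapers routinely borrow a mainstream browser user agent to blend in.
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
}

def scrape_prices(url: str) -> list[dict]:
    response = requests.get(url, headers=HEADERS, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    products = []
    # Assumes each product sits in a <div class="product"> with name/price children.
    for item in soup.select("div.product"):
        products.append({
            "name": item.select_one(".product-name").get_text(strip=True),
            "price": item.select_one(".product-price").get_text(strip=True),
        })
    return products

if __name__ == "__main__":
    for product in scrape_prices("https://example.com/catalogue"):
        print(product)
```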

It’s important to note that some of these platforms have been built to avoid detection, and a scraper using a custom script can easily strip out any identifiable signature data from the platform. For example, Brightdata uses a very large database of millions of residential IPs and actively seeks to avoid detection. VerifiedVisitors recommends blocking any known scraping tools and services, as well as using our automated detectors to catch scrapers and the services trying to hide their platform origins.
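As a naive first pass, and purely as an illustration rather than VerifiedVisitors’ own detection logic, a site can at least refuse requests whose user agent openly admits to being a known scraping tool. The signatures below are common self-declared scraper user agents; determined scrapers will spoof a browser string, which is exactly why behavioural detection is still needed.

```python
# Naive user-agent blocklist: rejects clients that openly identify as scraping
# tools. This only stops honest bots; spoofed user agents sail straight through.
KNOWN_SCRAPER_SIGNATURES = (
    "python-requests",   # default user agent of the requests library
    "scrapy",            # Scrapy framework
    "headlesschrome",    # headless Chrome, Puppeteer's default
    "curl",
    "wget",
)

def is_known_scraper(user_agent: str) -> bool:
    ua = user_agent.lower()
    return any(signature in ua for signature in KNOWN_SCRAPER_SIGNATURES)

if __name__ == "__main__":
    print(is_known_scraper("python-requests/2.31.0"))         # True
    print(is_known_scraper("Mozilla/5.0 (Windows NT 10.0)"))  # False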

Scrapers Are Used for Platform and Vulnerability Reconnaissance

Many bots target IT infrastructure to map the full tech stack and all of the components used. In many cases this is harmless data. Which web server, Content Distribution Network (CDN), or e-commerce platform you use isn’t exactly a state secret. So what, right?

Legitimate commercial services such as BuiltWith then package the data up, allowing sales and marketing teams to precisely target domains with the exact spec and build they have solutions for.

However, on the illegitimate side, you can easily see the opportunity for hackers, who can target known infrastructure vulnerabilities. They can launch illegal bots to quickly and easily find vulnerable versions and weak tech stacks across the web. Of course, this is just another reason to ensure we’re always updating software and have robust version controls in place, but we all know that’s not always the reality.

Bots can extract very detailed information, right down to specific releases and versions. Often these generic crawlers will hijack an existing common user agent string, pretending to be a legitimate search or media crawler.
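As an illustration of how little effort this reconnaissance takes, the sketch below sends a single request while masquerading as a well-known search crawler, then reads off whatever version information the response headers volunteer. The spoofed string follows the format Googlebot publishes; the target URL is a placeholder.

```python
# Reconnaissance sketch: one request, spoofing a search-engine user agent,
# can reveal server software and framework versions from response headers.
import requests

# Format copied from Googlebot's published user agent string.
SPOOFED_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

# Response headers that commonly leak tech-stack details.
REVEALING_HEADERS = ("Server", "X-Powered-By", "X-AspNet-Version", "Via")

def fingerprint(url: str) -> dict:
    response = requests.get(url, headers={"User-Agent": SPOOFED_UA}, timeout=10)
    return {h: response.headers[h] for h in REVEALING_HEADERS if h in response.headers}

if __name__ == "__main__":
    # Placeholder target; a real recon bot would sweep thousands of domains.
    print(fingerprint("https://example.com"))
```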

Scrapers Combined with Social Media and Bot-Based Distribution

This is not a new attack, but again, the rise of Gen AI bot creation makes it super easy for hackers and fraudsters to spot an arbitrage gap in the market. Tickets for resale, and high-value branded goods in scarce supply, can be purchased with custom bots combined with social media campaigns to create instant revenue streams that exploit the gaps in resale price. See the Ticketmaster Taylor Swift bots article here. Another classic is market arbitrage on sports betting. For example, fans will always support and bet on their national team, and bots can take advantage of the variations in odds between nations. See the detailed guide on price scraping bots here.
