AI For Bot Detection
December 12, 2023

How effective is Bot Management Software?

Bot Management platforms will tell you their bot mitigation is 100% successful, or sometimes, to be more ‘realistic’, the magic 99.99% success rate is quoted. “Trust us, trust us”, they say, “we’ve got this one.”

Some vendors work the math the other way, and show that their False Positive rate (the proportion of humans mistaken for bots) magically comes out to less than 0.01%, but somehow completely fail to mention their False Negative rate (the proportion of bots mistaken for humans), which is the one figure you really should care about. For a discussion of False Positive and False Negative rates, please see the Bot Detection Accuracy Metrics section below.

Bots as a Service (BaaS) boasts 99.99% success rate

Meanwhile, the latest Bots as a Service (BaaS) providers boast a 99.99% success rate at avoiding this bot detection in the first place.

They can’t both be right. So what on earth is going on?

Our response is quite simple. Please don’t trust us.

We have a zero trust model for a reason.

Our “playback” feature allows our customers to see exactly which visitors were blocked and why, so they can validate and independently check the results in their SOC, or with a SIEM or other analysis tools they may have.

Headline figures such as 99.99% effectiveness rates, or 0.01% false positive rates, really don't mean anything. The fact that we are 99.99% effective for all bot detection across all customers is meaningless in isolation: that 0.01% could be the really malicious bot that’s currently exfiltrating your data.

Adopting a zero trust model means we offer our customers a systematic way of measuring and validating bot traffic.

One of the ways we do this is to constantly measure our performance against the latest threats.

The Bots as a Service (BaaS) Provider Threat

We reviewed 10 Bots as a Service (BaaS) providers and chose some of the best at avoiding bot detection. While we can’t go into each and every one, we chose to use Brightdata, as their platform seems robust and claims to be the most effective at avoiding detection. Brightdata claims a healthy 99.99% success rate against web sites, and not only that, they specifically claim they have the highest success rate in the ‘industry’.

We set up a real live test to see if we could bypass our own bot defenses using Brightdata, in a Red Team versus Blue Team live bot video (for the video, please see here).

99.99% Effective at Avoiding Bot Protection?

You can see in the screenshot the 99.99% success rate claimed by Brightdata. They have a set of templates for extremely well-known websites, such as Amazon, Linkedin, Zara, Hermes, Ikea, Google, Yelp, TrustPilot, and AirB&B - hundreds of sites organized by category.

Problems with JavaScript Fingerprinting for Bots

Old school bot detection works primarily with a JavaScript fingerprint that IDs each incoming request. Just like a club bouncer checking drinkers for ID, the fingerprint launches on each incoming client, takes a snapshot of the platform, and sets an ID for that visitor.
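
To make the idea concrete, here is a toy sketch of how such a platform snapshot can be reduced to a stable visitor ID. The attribute names, values and hashing choice are illustrative assumptions only; real fingerprint scripts collect far more client-side signals, and no vendor's actual implementation is shown here.

```python
# Toy sketch only: reducing a snapshot of client attributes to a stable visitor ID.
# Attribute names and values are hypothetical; real fingerprint scripts gather far
# more signals (canvas, WebGL, fonts, timing) in the browser before reporting back.
import hashlib
import json

def fingerprint_id(client_attributes: dict) -> str:
    """Hash a snapshot of client attributes into a repeatable visitor ID."""
    snapshot = json.dumps(client_attributes, sort_keys=True)  # stable key ordering
    return hashlib.sha256(snapshot.encode("utf-8")).hexdigest()[:16]

visitor_id = fingerprint_id({
    "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",
    "screen": "1920x1080",
    "timezone": "Europe/London",
    "languages": ["en-GB", "en"],
    "canvas_hash": "a3f9c1",  # hypothetical canvas rendering hash
})
print(visitor_id)  # the same attribute snapshot yields the same ID on a repeat visit
```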

There are four immediate problems with this Fingerprinting approach. 

  1. The JavaScript is publicly available and can be reverse engineered. Although the JS is obfuscated, given enough patience it can be decoded to reveal the range of values it requires to pass the fingerprint tests. 
  2. The visitor has to be fingerprint-checked at least once before the signature rules can be applied. If you simply rotate each visitor after the first visit, you never show up as a repeat visitor in the first place, and just bypass the fingerprint. This requires a very high degree of rotation, and necessitates a large pool of proxies.
  3. Using real devices that actually have valid fingerprints can again neatly bypass the fingerprint detection. All the associated canvas, mouse and platform checks will pass.
  4. Instead of hitting the actual site where the fingerprint runs, hitting a cached CDN server will bypass the fingerprint entirely.

So How Does Brightdata Avoid Bot Detection?

Brightdata has a wide range of proxies and a large pool of IP addresses, so it can be set to rotate agents very quickly, making each visit a first-time visit. As we have seen, this is an effective bypass for the fingerprint JS agents - it gets past them every time. This is much easier than JavaScript deobfuscation, and means you don’t have to reverse engineer the fingerprint’s expected values. Traditionally, this meant having your own botnet, or acquiring access to one, so that you could launch millions of attacks from a new IP each time. This was expensive, time consuming, and not entirely effective, as botnet IPs are quickly picked up by IP reputation services if they are used in large-scale attacks over time.
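
As a minimal sketch of the rotation idea, the snippet below sends every request through a different exit IP so each visit appears to be a first-time visitor. The proxy URLs and pool size are hypothetical, and this is not Brightdata's actual client - just an illustration of the technique using the Python requests library.

```python
# Minimal sketch of per-request proxy rotation: every request goes out through a
# different exit IP, so each visit looks like a first-time visitor and the
# repeat-visitor fingerprint rules never fire. Proxy URLs are hypothetical.
import random
import requests

PROXY_POOL = [
    "http://user:pass@proxy-1.example.net:8080",
    "http://user:pass@proxy-2.example.net:8080",
    "http://user:pass@proxy-3.example.net:8080",
    # ...a real rotating pool would hold thousands of residential/mobile exits
]

def fetch_with_rotation(url: str) -> str:
    proxy = random.choice(PROXY_POOL)  # new exit IP for every single request
    response = requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)
    response.raise_for_status()
    return response.text
```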

Proxy Bot Infrastructure as a Service

Brightdata uses many proxy types, allowing you to choose the right combination for your selected target. For example, you can buy a package of mobile or residential ISP proxies, and set the pool large enough that you can rotate the bots each time. As you can see in the screenshots, just select the package that you need, and you’re all set to go. With millions of domestic IPs, or mobile IPs behind large carrier ASN gateways, it's next to impossible to use IP reputation services to stop these bot attacks.

Mobile Proxies for Bot Attacks

Mobile proxies are amongst the most effective, and also the most expensive, as you can see on the price list. These use real mobiles, organized into click farms, and usually zip-tied to a moving bar that triggers the accelerometer to fool the fingerprint into believing they are being actively used by humans. E-com sites often find their most valuable customers shop on mobile devices, and prioritize mobile visitors accordingly. The proxies use real devices, so again, the fingerprinting in all likelihood will fail. Worse, a bot has now been classified as a human visitor.

Residential IP Proxies for Bot Attacks

Cheaper, but still very effective, are residential proxies using real devices. The real device passes the fingerprint checks, and residential IPs can’t be blocked with old school IP reputation without causing many false positives.

Included in the list are data center IPs, which at first seems counter-intuitive. Humans don’t live in data centers, so why have this option? Data center IPs can be used, for example, for an API data mining attack: the API is expecting bots from data centers, and may block residential IPs.

Once the proxies are set according to the target victim’s vulnerabilities, the next stage is to deploy the bot scripts.

Bot Scripts

Brightdata has a series of templates to make targeting websites much easier. These are organized by category as you can see below, and include some of the largest ecom and general data sets in the world. The scripts have been customized for each site, for example for Single Page Application (SPA) sites, or other more complex applications, where a simple crawl of each URL isn’t possible. Brightdata also claims to bypass CAPTCHAs.

For our testing, we deployed a simple scraping script and just edited the fields to start the scraping process. 
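
For illustration only, here is roughly what such a simple scraping script looks like, with the editable fields pulled out as configuration. The URLs and CSS selectors are hypothetical, and this is not Brightdata's actual template code.

```python
# Illustrative only: a bare-bones scraping loop with the "fields" pulled out as
# configuration, in the spirit of editing a template. Selectors and URLs are
# hypothetical placeholders, not a real collector definition.
import requests
from bs4 import BeautifulSoup

TARGET_URLS = [
    "https://shop.example.com/product/1",
    "https://shop.example.com/product/2",
]
FIELDS = {"title": "h1.product-title", "price": "span.price"}

def scrape(url: str) -> dict:
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    record = {"url": url}
    for name, selector in FIELDS.items():
        element = soup.select_one(selector)
        record[name] = element.get_text(strip=True) if element else None
    return record

for url in TARGET_URLS:
    print(scrape(url))
```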

Bot Attack

Armed with our bot script and the proxy infrastructure, we can now launch our bot attack. Although we have picked scraping, the bots can run any custom script against whatever you like on the target’s infrastructure. Just to recap, the bot attack is now going to bypass the following old school bot detection techniques:

❌ IP Reputation fails with millions of residential and mobile proxies

❌ JS signature fails as the bots rotate each and every time

❌ WAF rate limiting is bypassed by slowing the bots to mimic human visits with a custom script (a short pacing sketch follows this list). The bots don’t care, they can go slow and low.

❌ Throwing CAPTCHAs at all visits fails - the bots bypass the CAPTCHA.

❌ Issuing a challenge page for every request to further fingerprint the client is going to make the site unusable, and the proxy clients using real devices may well pass the fingerprinting test anyway.
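
As referenced in the rate-limiting item above, "slow and low" pacing is trivial to script. The sketch below is purely illustrative - the delay values are assumptions, not thresholds taken from any real WAF configuration.

```python
# Sketch of "slow and low" pacing: randomized, human-scale delays keep the request
# rate well under typical WAF rate-limit thresholds. Delay values are illustrative.
import random
import time

def paced_urls(urls, min_delay=8.0, max_delay=45.0):
    """Yield URLs one at a time, sleeping a random human-like interval between them."""
    for url in urls:
        yield url
        time.sleep(random.uniform(min_delay, max_delay))

# Usage sketch: for url in paced_urls(target_urls): scrape(url)
```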

How Does VerifiedVisitors work?

VerifiedVisitors learns from your traffic with our AI platform, so that we can not only help you manage bot threats, but also ensure your customers are prioritized and treated like VIPs, rather than as something less than human, as with current CAPTCHA methods.

The VerifiedVisitors AI platform identifies visitors and places them into cohorts, as set out in the screenshot. You can clearly see, by risk type, each cohort broken down by actual threats, which are dynamically verified over time. This allows us to trust but verify: for repeat visitors and known good bots, for example, we are able to use the ML to track behavior over time and ensure they are legitimate verified visitors that we actually want.

To stop the Brightdata attacks, we then need two dynamic rules to be in place (a simplified sketch follows the list):

  1. Rule 1 selects the first-time visitor cohort - visitors that we have never seen before. Inevitably this will include human visitors as well as the bots.
  2. Rule 2 serves a challenge page to just this new visitor cohort, and performs a footprint check of the client to determine if it’s human or bot. At the first-visit stage, these checks allow us to look for the tell-tale signs of the bot platform itself used to launch the bot attacks. We use hundreds of signals to look for these signs. The Bots as a Service platforms have to get every signal value correct - we just have to detect one or two errors and inconsistencies in the platform footprint.
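
The snippet below is a highly simplified sketch of that two-rule logic, assuming hypothetical visitor records and signal names; it is not the actual VerifiedVisitors rule engine or its syntax.

```python
# Simplified sketch of the two rules. Visitor fields and signal names are
# hypothetical assumptions, not the real VerifiedVisitors rule configuration.

def rule_1_first_time_cohort(visitor: dict) -> bool:
    """Rule 1: select visitors we have never seen before."""
    return visitor.get("prior_visits", 0) == 0

def rule_2_challenge_and_footprint(visitor: dict, signals: dict) -> str:
    """Rule 2: challenge the first-time cohort and inspect the client footprint."""
    if not rule_1_first_time_cohort(visitor):
        return "allow"  # repeat visitors never see the challenge
    # One or two inconsistencies are enough to expose a bot platform, e.g. a
    # "mobile" user agent reporting a desktop screen, or a headless WebGL stack.
    inconsistencies = sum(
        1 for check in (
            signals.get("ua_matches_platform"),
            signals.get("webgl_vendor_plausible"),
            signals.get("timezone_matches_ip"),
        ) if check is False
    )
    return "block" if inconsistencies > 0 else "allow"
```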

This attack type is quite extreme. It’s rare for bot attacks to just send one request and rotate each and every time. However, even in this extreme case, we are able to identify the threat cohort, and then successfully mitigate the attack without affecting the legitimate repeat visitors and regular users of the service.

Benefits of AI Cohort Protection

✅ You treat your customers like VIPs and ensure they are not affected by any rules

✅ Bot traffic can be blocked before it hits the website, so you don't suffer any spikes or additional CPU and bandwidth, and the bot simply fails.

✅ The holding page is quick, usually taking 1-2 seconds, and doesn’t require a CAPTCHA or other challenge. It can also be custom designed with messaging, product pages, service updates or other valuable information that you want to give to the client.

✅ Filtering out the bots makes it much easier to see the real visitors and understand your analytics to help you convert. For example, every site has a proportion of quick abandonments, usually under 30 seconds. How much of this traffic is bot traffic, and how much is a tell-tale sign that customers really don’t like your website design? Understanding the real verified visitors hitting your site also allows the AI to spot anomalies. For example, a large spike in first-time visitors that never convert, but are simply distributed across the site, crawling pages sequentially over time, is a sure sign that this kind of bot scraping attack is taking place (a toy heuristic for this pattern follows below).
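
Purely to illustrate that last pattern, here is a toy heuristic for spotting a scraping spike; the thresholds and field names are assumptions, not the signals or values the VerifiedVisitors platform actually uses.

```python
# Toy heuristic for the scraping pattern described above: a spike of first-time
# visitors that never convert and walk the site sequentially. Thresholds are
# illustrative assumptions only.
def looks_like_scraping_spike(first_time_visits: int, baseline: int,
                              conversions: int, sequential_ratio: float) -> bool:
    spike = first_time_visits > 3 * max(baseline, 1)  # well above the normal rate
    no_conversions = conversions == 0                  # none of them ever convert
    crawling = sequential_ratio > 0.8                  # mostly sequential page paths
    return spike and no_conversions and crawling
```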

How does the Bots as a Service (BaaS) 99.99% success rate stack up?

Now that we’ve shown you the detailed walk-through of Brightdata, how does its 99.99% rate stack up? Without definitive measurement across a benchmarked set of target endpoints, it's hard to say for sure. As we have seen, it's definitely detectable, but we can certainly say it would defeat the old school bot detection methods pretty handily, as shown above. Serving a CAPTCHA or a challenge page for every visitor would make the site unusable.

Many companies simply regard their website content as marketing content that has been approved for release to the public. If it’s scraped, then so be it. However, what this fails to take into consideration is the systematic data mining of the entire dataset involved. For example, AirB&B may not worry about web marketing or a few listings being scraped, but the systematic data mining of every single listing in a specific country or region certainly represents a serious threat to its business model and IP.

Bot Detection Accuracy Metrics

The vast majority of bot detection vendors do not have a robust confusion matrix model, as they aren’t using Machine Learning at the heart of their detection model. Their models aren’t looking at the whole picture. 

When VerifiedVisitors develops our models, we prioritize minimizing false negatives over minimizing false positives.

Why? The reason is simple: if we challenge a small additional percentage of humans, we don’t breach our zero trust. Labeling a bot as a human - creating a false negative - is what we need to avoid at all costs. False negatives create the real problem, as we’ve breached our zero trust principles and allowed a bot access to our protected space.
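
To make those metrics concrete, here is a small sketch of how false positive and false negative rates fall out of a confusion matrix. The counts are invented for illustration and are not measured results from any vendor.

```python
# Sketch of the accuracy metrics discussed above: false positive rate (humans
# challenged as bots) and false negative rate (bots allowed through as humans).
# The example counts are made up for illustration, not measured results.
def rates(true_pos, false_pos, true_neg, false_neg):
    fpr = false_pos / (false_pos + true_neg)   # humans wrongly flagged as bots
    fnr = false_neg / (false_neg + true_pos)   # bots wrongly passed as humans
    return fpr, fnr

# A model can quote a tiny FPR while its FNR - the figure that actually matters
# for zero trust - is far worse.
fpr, fnr = rates(true_pos=900, false_pos=10, true_neg=99_000, false_neg=100)
print(f"FPR: {fpr:.4%}  FNR: {fnr:.2%}")  # ~0.01% FPR, but 10% FNR
```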

Trust the Playback Mode


VerifiedVisitors has a playback mode which allows you to set up the rules and cohorts you want, and then verify the results of the AI platform detectors. This allows you to measure the quantifiable effect of the bot mitigation and ensure the quality of the detectors. The only measure that truly counts is how effective the bot prevention is on your endpoints. Focusing on a structured, analytical framework for measuring that is what matters. All the rest is marketing fluff.
