AI For Bot Detection
November 30, 2023

Human or Bot? The role of AI in bot detection.

Hearing the newly minted voice of John Lennon on the ‘latest’ Beatles single, “Now and Then”, sounding as if it were recorded yesterday, has proved not only a massive hit with the fans, but something quite different, almost transcendent in nature.

Blending the human performance with the AI-extracted vocal track, plucked off an old cassette tossed into a drawer, has given us something new: an undeniably powerful and emotional release.

In less than a month the official video had garnered 34 million views on YouTube alone. Oh the irony, that it has taken a machine to bring the incredible warmth and humanity of one of the true heroes of the people back to us.

Human or Bot?

AI is blurring the lines everywhere. From deepfakes, to AI-generated faces, to ChatGPT-generated homework, it keeps getting harder and harder to tell the difference between human and bot.

At VerifiedVisitors we use AI to distinguish bots from humans, and we specialise in detecting the most customised, human-like bots that avoid traditional detection methods.

In the diagram below we can see the matrix of detection methods: simple detection based on IP reputation and signatures, moving towards fingerprinting, then detecting real botnet devices, and finally customised bots, which pass CAPTCHA and have mouse trails and other human-like attributes.

Human or Bot Detection Matrix

Legacy Bot Detection Software

Signature-Based Detection and IP Reputation Fail

Most bot detection software uses signature-based and IP reputation methods to discover known bot attacks. This works very well for basic bots with static signatures that hit servers constantly, but fails for sophisticated bots that hide their origins, mimic human-like behaviour, and change their signatures dynamically.
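To see why this approach breaks down, here is a minimal sketch of signature and IP reputation filtering. The signature list, blocklist, and request fields are invented for illustration, not any vendor's actual rule set:

```python
# Minimal sketch of legacy signature + IP reputation filtering.
# Signatures and blocklist entries are illustrative placeholders.

KNOWN_BOT_SIGNATURES = ["python-requests", "curl/", "scrapy"]   # static UA substrings
BAD_IP_REPUTATION = {"203.0.113.7", "198.51.100.42"}            # example blocklist (TEST-NET IPs)

def is_known_bot(ip: str, user_agent: str) -> bool:
    """Flag a request if it matches a static signature or a listed IP."""
    if ip in BAD_IP_REPUTATION:
        return True
    ua = user_agent.lower()
    return any(sig in ua for sig in KNOWN_BOT_SIGNATURES)

# A basic bot is caught; a bot that rotates residential IPs and spoofs
# a browser user agent sails straight past both checks.
print(is_known_bot("203.0.113.7", "python-requests/2.31"))                        # True
print(is_known_bot("192.0.2.10", "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"))    # False
```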

Fingerprint Detection & Heuristics Only

Other bot detection solutions make extensive use of fingerprints and heuristics, which again work very well against basic bots, but don't work so well against botnets and sophisticated ISP and mobile proxies using real devices that will pass the fingerprint test.

Fingerprinting also has a fundamental flaw. Client-side scripts can be reverse engineered and de-obfuscated to reveal exactly what they are doing. Although de-obfuscation is painful and time consuming, the exact parameters of the fingerprint request will be revealed eventually with enough patience. De-obfuscation allows the hacker to work out the expected parameters, and then stuff the fingerprint with the appropriate responses to pass the checks every time.
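A minimal sketch of what such a check amounts to once the collection script has been de-obfuscated. The field names and expected values here are illustrative only, though the tells themselves (the webdriver flag, empty plugin and language lists) are well-known headless browser giveaways:

```python
# Sketch of a static fingerprint consistency check.
# Fields and thresholds are illustrative, not any vendor's real fingerprint.

def fingerprint_looks_human(fp: dict) -> bool:
    """Naive check: headless automation often betrays itself in these fields."""
    if fp.get("webdriver"):                  # automation flag set by Selenium/Puppeteer
        return False
    if fp.get("plugins_length", 0) == 0:     # headless Chrome reports no plugins
        return False
    if fp.get("languages") == []:            # empty language list is a red flag
        return False
    return True

# Once the attacker knows exactly which fields are read, they simply
# "stuff" the fingerprint with passing values:
stuffed = {"webdriver": False, "plugins_length": 3, "languages": ["en-US", "en"]}
print(fingerprint_looks_human(stuffed))  # True - passes every time
```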

CAPTCHA, Designed to Distinguish Bot from Human, Fails

Machines Learn While Humans Slave

Finally, CAPTCHA, designed as the ultimate test of human or bot, also has a fundamental flaw. Most people don't realise that CAPTCHA results are used as massive training sets for machine learning image recognition models. The countless responses provide very accurate labeled training data that the models use to constantly learn and improve, and it's one of the main reasons ML image recognition algorithms have improved so dramatically over the last few years.

While humans have been wasting billions of hours completing frustrating CAPTCHAs, the machines have been busy learning how to build extremely accurate image recognition models, which are now used by generative AI companies to power some of the latest sophisticated imagery. However, they can also be used by hackers to solve the CAPTCHAs themselves, a massive own-goal.

Text-based CAPTCHAs

As early as 2017, text-based CAPTCHAs were being routinely solved by bots using OCR techniques or, in rarer cases, Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). That's why CAPTCHAs started to introduce noise, such as image distortion, to make life more difficult for the bots, but this also made them significantly more challenging for users, in particular the elderly. Many people genuinely struggle to solve the ones with extreme noise and complex backgrounds.
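A rough sketch of that 2017-era OCR approach, assuming the open-source Tesseract engine via pytesseract and a hypothetical captcha.png input; real attacks add far more pre-processing, but the principle is this simple:

```python
# Sketch of OCR-based text CAPTCHA solving with Tesseract.
# "captcha.png" is a hypothetical input image.

from PIL import Image, ImageFilter
import pytesseract

img = Image.open("captcha.png").convert("L")      # greyscale
img = img.point(lambda p: 255 if p > 128 else 0)  # threshold away background noise
img = img.filter(ImageFilter.MedianFilter(3))     # smooth remaining speckle

# --psm 8: treat the image as a single word
guess = pytesseract.image_to_string(img, config="--psm 8").strip()
print(guess)  # clean text CAPTCHAs fall to this routinely
```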

reCAPTCHA Bypass

Once speech recognition started to improve in quality, bots were able to bypass Google reCAPTCHA by using speech recognition APIs to solve the audio accessibility challenge.

Bots also use a combination of ML-based techniques and Google's own reverse image search to bypass Google's reCAPTCHA. Reverse image search lets them obtain higher quality images for more effective tagging. This very clever reverse engineering results in a still higher level of accuracy, and the methods for improving image tagging are constantly evolving.

Google had to evolve more challenging versions, such as reCAPTCHA v3. Each new method is initially successful, as the bots have to be programmed around each new challenge. reCAPTCHA v3 is much easier for the end user, and its reliance on fingerprint telemetry and mouse movements made it much harder for bots to bypass, so initially it really was a double win.

reCAPTCHA v3

However, this too became subject to ML techniques, in particular the use of Reinforcement Learning (RL) to generate human-like mouse movements, learnt from analysis of web logs across vast numbers of training attempts. The RL didn't have to do much: it didn't need to realistically emulate the mouse movements of a real user, just enough to pass the thresholds of reCAPTCHA v3, which have to be set reasonably broadly to avoid too many false positives. That gap is wide enough for reverse-engineered ML models to slip through.
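The RL pipeline itself is complex, but the target it learns is easy to picture. The sketch below fakes a plausible trajectory with a quadratic Bézier curve plus noise; it illustrates the kind of path a generator must produce, and is not the attackers' actual models:

```python
# Sketch of a "human-like" mouse path: a curved Bézier arc with
# hand-tremor jitter and uneven sampling intervals.

import random

def humanlike_path(start, end, steps=50):
    """Return (x, y, dt) samples along a noisy curved path from start to end."""
    cx = (start[0] + end[0]) / 2 + random.uniform(-100, 100)  # random control point
    cy = (start[1] + end[1]) / 2 + random.uniform(-100, 100)  # bends the path
    points = []
    for i in range(steps + 1):
        t = i / steps
        x = (1 - t) ** 2 * start[0] + 2 * (1 - t) * t * cx + t ** 2 * end[0]
        y = (1 - t) ** 2 * start[1] + 2 * (1 - t) * t * cy + t ** 2 * end[1]
        x += random.gauss(0, 1.5)                 # hand tremor
        y += random.gauss(0, 1.5)
        dt = random.uniform(0.008, 0.03)          # uneven intervals between samples
        points.append((x, y, dt))
    return points

path = humanlike_path((12, 300), (640, 410))
```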

If this weren't enough, human CAPTCHA farms quickly evolved to pass even the most human of challenges, such as puzzles and visual images that need to be rotated or moved into position. These CAPTCHA farms, combined with stealth browsers and residential and mobile proxy servers, became packaged up as Bots-as-a-Service (BaaS) providers.

We've seen how bots make widespread use of AI to defeat CAPTCHA. The benefit of breaking CAPTCHA services is so great that it is clearly worth the considerable time and effort needed.

But do bots use AI to avoid human behavioural detection as well? 

Bots can of course be trained to evade detection on the page. Rate-limiting WAFs have already forced bots to go low and slow to avoid detection. Bots can be programmed with intermittent rest points to mimic human reading times, and with the sequence of pages they should follow, so they appear more 'natural' and match the timings of a human.
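A minimal sketch of that pacing trick, with the page list and pause parameters invented for illustration; lognormal pauses roughly mimic the heavy-tailed distribution of real reading times:

```python
# Sketch of "low and slow" pacing between page fetches.
# Pages and distribution parameters are illustrative placeholders.

import random
import time

PAGES = ["/", "/products", "/products/widget-1", "/pricing"]  # plausible human path

for page in PAGES:
    # fetch(page) would go here
    pause = random.lognormvariate(1.5, 0.6)   # median ~4.5s, occasional long "reads"
    time.sleep(pause)                          # rest point between pages
```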

Typically, evading capture is just a question of trial and error: the hacker simply amends the bot script until it succeeds. What they are not doing is using web log data to reverse engineer the overall patterns of human traffic, and then calculating how to hide in the middle of the traffic distribution. They really don't need to: simple trial and error is usually enough to get the bot through eventually.

At VerifiedVisitors we use a combination of the user behavior from the web logs and the mouse or pointer trajectories as the user navigates through the website. These are then analyzed by the AI engine to look for the telltale signs of bot or human behavior.

Of course, all the other telemetry, such as IP and user agent, is analyzed as well, but the critical detection of advanced bots v. humans is often found in the behavioral signals combined with forensic analysis of mouse movements. Mouse movements are the most human indicator we have, as they are driven directly by the user's hand on the pointer, and vary according to the user and device. And if a bot is programmed so perfectly that it passes all behavioral detection, it is in effect behaving like a benign human, which means it's likely not doing anything malicious or harmful - unless it has truly hidden and obfuscated its origins perfectly.
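As an illustration of the kind of signals involved, here is a simplified sketch of extracting behavioral features from a pointer trajectory. The feature choices are illustrative only, not the actual production model:

```python
# Sketch of behavioral feature extraction from pointermove samples.
# Assumes at least three (x, y, dt) samples per trajectory.

import math

def trajectory_features(points):
    """points: list of (x, y, dt) samples; returns features for a classifier."""
    speeds, angles = [], []
    for (x0, y0, _), (x1, y1, dt) in zip(points, points[1:]):
        dist = math.hypot(x1 - x0, y1 - y0)
        speeds.append(dist / dt if dt > 0 else 0.0)
        angles.append(math.atan2(y1 - y0, x1 - x0))
    turns = [abs(a1 - a0) for a0, a1 in zip(angles, angles[1:])]
    mean_speed = sum(speeds) / len(speeds)
    return {
        "mean_speed": mean_speed,
        # scripted bots often move with suspiciously uniform speed
        "speed_variance": sum((s - mean_speed) ** 2 for s in speeds) / len(speeds),
        # perfectly straight, machine-drawn lines score near zero curvature
        "mean_curvature": sum(turns) / len(turns),
        # humans pause and hesitate; many bots never do
        "pause_ratio": sum(1 for s in speeds if s < 1e-3) / len(speeds),
    }
```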

Generative Adversarial Networks (GANs) to Detect Human or Bot

Looking to the future, Generative Adversarial Networks (GANs) use two ‘competing’ neural networks, the Generator and the Discriminator, in what is effectively a virtual Red Team v. Blue Team training session, each learning from the movements of its adversary.

During training, the Generator attempts to create bots that display more and more human-like behaviors, trying to trick the Discriminator into labeling them as human rather than bot. This adversarial training loop is constantly self-reinforcing: the Discriminator gets better at finding bots, while the Generator gets better at re-creating human-like behaviors.
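A minimal PyTorch sketch of that loop, assuming trajectories are flattened into fixed-length vectors; real_batch() stands in for a loader of recorded human trajectories, and all shapes and sizes are illustrative:

```python
# Sketch of a GAN over mouse trajectories: the Generator emits fake
# trajectories, the Discriminator scores trajectories human (1) or bot (0).

import torch
import torch.nn as nn

TRAJ_DIM, NOISE_DIM = 100 * 3, 32   # 100 (x, y, dt) samples, flattened

G = nn.Sequential(nn.Linear(NOISE_DIM, 128), nn.ReLU(), nn.Linear(128, TRAJ_DIM))
D = nn.Sequential(nn.Linear(TRAJ_DIM, 128), nn.ReLU(), nn.Linear(128, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
loss = nn.BCEWithLogitsLoss()

def real_batch(n):
    # stand-in for recorded human trajectories (random data keeps this runnable)
    return torch.randn(n, TRAJ_DIM)

for step in range(1000):
    real = real_batch(64)
    fake = G(torch.randn(64, NOISE_DIM))

    # Discriminator step: learn to score human=1, generated=0
    d_loss = loss(D(real), torch.ones(64, 1)) + loss(D(fake.detach()), torch.zeros(64, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: learn to make the Discriminator call fakes human
    g_loss = loss(D(fake), torch.ones(64, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```

The same loop that trains ever-better bot mimics also trains an ever-better detector, which is exactly why we expect adversarial training to become central to distinguishing human from bot.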
