At spider.io we’re in the business of catching automated web traffic. This is a short post introducing some of the clues we analyse and why.
In its simplest form a bot has two components: a priority queue of web pages to crawl, and a loop that pops the highest-priority item off the front of the queue and downloads it. A trivial example, which downloads the Alexa top 1 million web pages as images using Paul Hammond's webkit2png project, is shown below.
#!/bin/bash
cat top-1m.csv | while read f; do
  ./webkit2png -s 1 -C -D ./out "http://$(echo "$f" | cut -d, -f2)"
done
More complex bots add some processing either in or after the loop and periodically update their priority queue.
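The shape described above can be sketched in a few lines. This is a hypothetical skeleton, not any particular bot: the `fetch` and `extract_links` callables stand in for the downloader and the post-processing step, and `heapq` provides the priority queue.

```python
import heapq

def crawl(seed_urls, fetch, extract_links, max_pages=10):
    """Minimal bot skeleton: a priority queue of (priority, url) pairs,
    and a loop that pops the best-ranked page, downloads it, and feeds
    newly discovered links back into the queue."""
    frontier = list(seed_urls)
    heapq.heapify(frontier)
    seen = {url for _, url in frontier}
    visited = []
    while frontier and len(visited) < max_pages:
        priority, url = heapq.heappop(frontier)  # best-ranked page first
        page = fetch(url)                        # the "downloader" component
        visited.append(url)
        # Post-processing: rank outgoing links and update the queue.
        for link_priority, link in extract_links(page):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (link_priority, link))
    return visited
```

With a stubbed-out `fetch`, the crawl order is driven entirely by the link priorities, which is exactly the systematic (rather than purposeful) navigation discussed below.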
By analysing several clues it is possible to identify these two bot components.
Catching the priority queue
Bots don’t navigate sites like people do. Typically they make more requests, at a higher speed, and rather than navigating a site with purpose they select pages systematically. This makes the click trace (the list of pages a user has visited) of a bot substantially different from a typical user.
Click-trace analysis is the traditional approach to detecting bots. The advantage of catching a bot by its click trace is that the method is independent of the technology the assailant uses. The disadvantage is that you need to have seen enough clicks from the same bot before you can classify it, and you need to be able to identify those clicks as all coming from the same bot.
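To make the idea concrete, here is a toy click-trace classifier. The features (request rate, how systematically the pages are ordered) reflect the differences described above, but the thresholds and the lexicographic-order measure are illustrative assumptions, not spider.io's actual model.

```python
def looks_like_bot(trace, max_rate=1.0, order_threshold=0.9):
    """Toy click-trace classifier. `trace` is a list of
    (timestamp_seconds, url) pairs for one session. Flags sessions
    whose request rate is implausibly high, or whose pages are visited
    in near-perfect systematic (here, lexicographic) order."""
    if len(trace) < 2:
        return False  # too few clicks to classify -- the key weakness
    times = [t for t, _ in trace]
    urls = [u for _, u in trace]
    duration = times[-1] - times[0]
    rate = (len(trace) - 1) / duration if duration > 0 else float("inf")
    # Fraction of consecutive page pairs visited in sorted order:
    ordered = sum(a <= b for a, b in zip(urls, urls[1:])) / (len(urls) - 1)
    return rate > max_rate or ordered >= order_threshold
```

Note that both features only become meaningful after several clicks, which is precisely the disadvantage mentioned above: a single request tells this classifier nothing.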
Catching the downloader
Real users download web pages with web browsers, onto a computer or mobile device. There are several distinctive activity streams that accompany such a download at the different levels of the OSI model, and by identifying these activity streams we can check that a normal download is taking place.
We catch the majority of bots based on how they download individual pages. This has a number of advantages: we only need to see a single page request from a bot to be able to catch it; we can catch bots that distribute themselves across multiple IPs; and we can recognise bot requests hidden amongst legitimate user traffic from the same IP.
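One simple per-request signal of this kind can be sketched as follows. This is a deliberately simplified stand-in for the real activity-stream analysis: after requesting an HTML page, a genuine browser immediately fetches that page's subresources (stylesheets, scripts, images), while a bare downloader typically does not. The extension list and the HTML-vs-subresource split are illustrative assumptions.

```python
SUBRESOURCE_EXTENSIONS = (".css", ".js", ".png", ".jpg", ".gif", ".ico")

def fetched_like_a_browser(requests):
    """Sketch of a page-level download check. `requests` is the ordered
    list of URL paths seen from one client for one page view. A real
    browser fetches both the page itself and its subresources; a bare
    downloader usually requests only the page."""
    fetched_page = any(not p.endswith(SUBRESOURCE_EXTENSIONS) for p in requests)
    fetched_subresources = any(p.endswith(SUBRESOURCE_EXTENSIONS) for p in requests)
    return fetched_page and fetched_subresources
```

Because this operates on a single page view, it needs only one request from a bot to fire, and it keeps working when the bot's requests are spread across many IPs or mixed in with legitimate traffic from the same IP.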
Find the motive, find the perpetrator
This line, quoted at least once in any decent cop show, applies to bots too. Once you know what a villain is likely to do, they are much easier to spot. Unfortunately this is an arms race: once the bot creators know where you're looking, the bots become harder to find.