Bad Robot!** Sneaky Bots in mPulse Data – For Retailers

** With apologies to JJ Abrams

mPulse collects massive amounts of RUM data every day: navigations, SPA transitions, XHR calls, and page resources, to name just some of what gets included. And all of it is collected from within real browsers.

However, a real browser can be light years removed from a real visitor. mPulse tries to eliminate as many automated visits as possible, but blacklists can’t move as fast as the automators: the entities that develop and release scripted browsers into the world for purposes ranging from the innocuous to the malevolent.

The hardest things to detect, unless you have a system such as Akamai Bot Manager, are the patterns that indicate unusual traffic, especially when there isn’t enough of it to trigger a massive spike in the number of beacons collected by mPulse. Those patterns do exist, though, and they can be spotted easily if you focus on a few key areas.

Watch Your Product Pages

A favorite target for one species of bot is the product page (or PDP). This species, the price scraper, is designed to collect volumes of pricing data for competitors looking for an advantage in a hyper-competitive world. While relatively innocuous, the volume of these requests can slow the experience of real visitors to the site, especially if the bots don’t work as expected.

An example of this happened a few years ago, when a retailer saw a massive spike in requests (to the tune of thousands per day!) for a single product from a single location and browser version. It negatively affected the performance of the site as a whole.
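A pattern like this is straightforward to surface once the beacon data has been exported. Here is a minimal sketch in Python, assuming a hypothetical CSV export with product_id, geo, user_agent, and timestamp columns; the column names and the threshold multiplier are illustrative, not part of mPulse itself.

```python
# Minimal sketch: flag (product, location, browser) combinations with an
# abnormally high daily beacon count. Column names are hypothetical;
# adjust them to match your own mPulse data export.
import pandas as pd

beacons = pd.read_csv("mpulse_beacons.csv", parse_dates=["timestamp"])

daily = (
    beacons
    .assign(day=beacons["timestamp"].dt.date)
    .groupby(["day", "product_id", "geo", "user_agent"])
    .size()
    .rename("hits")
    .reset_index()
)

# Treat anything far above the typical daily count for a single
# product/location/browser combination as suspicious. The 50x multiplier
# is an illustrative starting point, not a tuned threshold.
typical = daily["hits"].median()
suspects = daily[daily["hits"] > 50 * typical]

print(suspects.sort_values("hits", ascending=False).head(20))
```

Grouping at daily granularity by product, location, and browser mirrors the incident above: thousands of requests per day for one product, from one place, on one browser version.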

Watch for Linux

Linux is a popular operating system…for servers. Except in some very specific cases, however, it should not appear as one of the top 5 operating systems on your site. When it does, treat it as a red flag.
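Checking for this takes only a few lines once beacon counts are available by operating system. Another minimal sketch, again assuming a hypothetical export with an os column:

```python
# Minimal sketch: warn when Linux shows up among the top 5 operating
# systems by beacon volume. The "os" column name is hypothetical.
import pandas as pd

beacons = pd.read_csv("mpulse_beacons.csv")
top5 = beacons["os"].value_counts().head(5)

print(top5)
if "Linux" in top5.index:
    print("Red flag: Linux is a top-5 OS; investigate for bot traffic.")
```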

In the instance used here, filtering the data to Linux only quickly showed that this was a scraper bot, targeting this customer’s PDP content. And while this bot accounted for only 4% of the PDP beacons for that day, the performance of these requests slightly skewed the median of the page group, increasing it from 2.03s (no Linux) to 2.09s (with Linux). This was due to a 20.66s median for the Linux bots.
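It is worth seeing why such a small population moves the median at all. The simulation below is illustrative rather than the customer’s actual data: it assumes lognormal load-time distributions centered on the medians reported above, with a made-up spread. With those assumptions, the combined median lands a few hundredths of a second higher, close to the 2.03s-to-2.09s shift described here.

```python
# Illustrative simulation: 96% real beacons (median ~2.03s) plus 4% bot
# beacons (median ~20.66s). The lognormal spreads (sigma) are assumptions
# chosen for illustration, not values derived from real data.
import numpy as np

rng = np.random.default_rng(42)
real = rng.lognormal(mean=np.log(2.03), sigma=0.6, size=96_000)
bots = rng.lognormal(mean=np.log(20.66), sigma=0.5, size=4_000)

print(f"median, real only:   {np.median(real):.2f}s")
print(f"median, bots only:   {np.median(bots):.2f}s")
print(f"median, combined:    {np.median(np.concatenate([real, bots])):.2f}s")

# Because nearly every bot beacon is slower than the real-user median,
# the combined 50th percentile sits at roughly the 52nd percentile of
# the real distribution: a small but visible upward bump.
```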

These bots were using contemporary versions of popular browsers (Chrome/70 and Firefox/66, the release versions at the time the data was collected), making them nearly indistinguishable from real traffic. The only dimension that flagged them as bots was the Linux OS.

In this example from another retailer, the Linux presence on the PDP is even greater, comprising 27% of the overall traffic: a share orders of magnitude greater than Linux holds in the real user population.

Watch for Old Browser Versions

Another factor is that bots don’t always upgrade to the latest browser version at the same rate as real users, so finding a population of older browser versions in the data is a clear indicator of one of two things:

  1. A population of real users stuck on old browsers due to corporate restrictions on software, or
  2. A population of bots that have not used the auto-upgrade feature to move to the latest version.

In the example at the right, the retailer sees its visitors upgrading Chrome on a regular cycle, with older versions aging out as expected, thanks to Chrome’s auto-upgrade feature.

In the next example, the pattern is very different: Chrome/71 did not age out as the auto-upgrade feature would lead you to expect. Where Retailer 1 saw Chrome/71 age out by February 19, Retailer 2 saw its Chrome/71 population stabilize at a level much higher than residual, non-upgraded visitors can explain.
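This failure to age out can be checked mechanically. A minimal sketch, assuming a hypothetical weekly export of beacon counts per browser version: compute each version’s share of traffic per week, then flag versions that still hold a large fraction of their peak share in recent weeks.

```python
# Minimal sketch: flag browser versions whose traffic share plateaus
# instead of decaying after the next version ships. The input format
# (week, browser_version, beacons) is hypothetical.
import pandas as pd

counts = pd.read_csv("weekly_browser_counts.csv")

share = (
    counts.pivot_table(index="week", columns="browser_version",
                       values="beacons", aggfunc="sum")
    .pipe(lambda df: df.div(df.sum(axis=1), axis=0))
    .sort_index()
)

# Compare each version's share over the most recent 4 weeks with its
# peak share. A version that still carries a meaningful share of traffic
# and holds more than half its peak is a candidate bot population. Both
# thresholds are illustrative.
recent = share.tail(4).mean()
peak = share.max()
stale = recent[(recent > 0.02) & (recent / peak > 0.5)]

print("Versions that never aged out:")
print(stale.sort_values(ascending=False))
```

Note that this heuristic will also flag a version that is still current, so it works best on versions that are at least one release behind.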

But it’s not just Chrome that is affected; Firefox can also be used to create bot traffic. In the example below, the largest real-user Firefox populations should be versions 65 and 68. In the data, however, Firefox/60 and Firefox/38 are present in numbers far exceeding those of the real user visitors, a clear indicator of bot traffic.

These bots also negatively affected the recorded performance of Firefox during this period: the median load time for both Firefox/38 and Firefox/60 was above 80 seconds.
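Combining the two signals, an outdated version and an implausibly slow median, makes for a much sharper filter. One more minimal sketch with hypothetical column names, assuming a numeric major-version field:

```python
# Minimal sketch: flag (browser, version) combinations that are both
# several major versions behind current and far slower than the site
# median. Column names and thresholds are illustrative; "version" is
# assumed to be a numeric major version.
import pandas as pd

beacons = pd.read_csv("mpulse_beacons.csv")  # browser, version, load_time_s

current = beacons.groupby("browser")["version"].max()
stats = (
    beacons.groupby(["browser", "version"])["load_time_s"]
    .agg(median="median", count="count")
    .reset_index()
)
stats["versions_behind"] = stats.apply(
    lambda row: current[row["browser"]] - row["version"], axis=1
)

site_median = beacons["load_time_s"].median()
suspects = stats[(stats["versions_behind"] >= 5) &
                 (stats["median"] > 10 * site_median)]

print(suspects)
```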

Bots Matter in RUM

As the data from the retailers above shows, bots matter in a number of ways:

  • They can skew your mPulse performance metrics in ways that lead to incorrect conclusions about the performance of key page groups.
  • They can inflate the metrics for certain OS and browser families in ways that lead to incorrect assumptions about the composition of the visitor population.
  • They can cost the customer money, not just in inflated mPulse beacon counts, but in higher CDN and bandwidth bills.

While it is impossible to isolate and eliminate all of them from mPulse data, watching for these signals can help organizations realize that bots may be a larger issue than they think, one requiring more effective remediation than simple blacklists and filter rules.