Bad Robot!** Sneaky Bots in mPulse Data – For Retailers

**With apologies to J.J. Abrams

mPulse collects massive amounts of RUM data every day: navigations, SPA transitions, XHR calls, and page resources, to name just some of what gets included. And all of it is collected from within real browsers.

However, a real browser can be light years removed from a real visitor. mPulse tries to eliminate as many automated visits as possible, but blacklists can’t move as fast as the automators: the entities that develop and release scripted browsers into the world for purposes that range from innocuous to malevolent.

The hardest thing to detect, unless you have a system such as Akamai Bot Manager, is the set of patterns that indicate traffic is unusual, especially when the volume is not enough to trigger a massive spike in the number of beacons collected by mPulse. Those patterns do exist, though, and they can be spotted easily if you focus on a few key areas.

Watch your product pages

A favorite target for one species of bot is the product page (or PDP). This species, the price scraper, is designed to collect volumes of pricing data to provide to competitors looking for an advantage in a hyper-competitive world. While relatively innocuous, these bots can arrive in volumes that slow the experience of real visitors to the site, especially when the bots don’t work as expected.

An example of this happened a few years ago, when a retailer saw a massive spike in requests (to the tune of thousands of requests per day!) for a single product from a single location and browser version. This degraded the performance of the site as a whole.

Watch for Linux

Linux is a popular operating system…for servers. However, except in some very specific cases, it should not appear as one of the top 5 operating systems on your site. When it does, treat it as a red flag.

In the instance used here, filtering the data to Linux only quickly showed that this was a scraper bot, targeting this customer’s PDP content. And while this bot accounted for only 4% of the PDP beacons for that day, the performance of these requests slightly skewed the median of the page group, increasing it from 2.03s (no Linux) to 2.09s (with Linux). This was due to a 20.66s median for the Linux bots.
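
To make that skew concrete, here is a minimal Python sketch using simulated beacon data. The distributions and counts below are assumptions modeled loosely on the figures above (roughly 4% of beacons with a ~20s median), not the customer’s actual mPulse dataset:

```python
from statistics import median
import random

random.seed(42)

# Simulated PDP beacon load times in seconds; the lognormal parameters and
# traffic split are assumptions chosen to mirror the figures in the text.
real_users = [random.lognormvariate(0.708, 0.5) for _ in range(9600)]  # median ~2.03s
linux_bots = [random.lognormvariate(3.028, 0.3) for _ in range(400)]   # median ~20.7s

print(f"Median without Linux beacons: {median(real_users):.2f}s")
print(f"Median with Linux beacons:    {median(real_users + linux_bots):.2f}s")
```

The point of the exercise: even a 4% bot share with an extreme median is enough to nudge a page group’s median upward and mislead anyone trending that page group week over week.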

These bots were using contemporary versions of popular browsers – Chrome/70 and Firefox/66, the release versions at the time the data was collected – making them nearly indistinguishable from real traffic. The only dimension that flagged these as bots was the Linux OS.

In this example from another retailer, the Linux presence on the PDP is even greater, comprising 27% of the overall traffic, a share orders of magnitude greater than Linux’s presence in the real user population.

Watch for Old Browser Versions

Another signal is that bots don’t always upgrade to the latest version of a browser at the same rate as real users, so finding a population of older browser versions in the data is a clear indicator of one of two things (see the sketch after this list):

  1. A population of real users who run old browsers due to corporate restrictions on software, or
  2. A population of bots that have not used the auto-upgrade feature to move to the latest version.
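
Here is a minimal sketch of the aging-out check this section describes. The weekly share numbers, version names, and thresholds are invented for illustration:

```python
# Weekly share (%) of all beacons per Chrome version; values are hypothetical,
# invented to illustrate the two patterns described in the text.
weekly_share = {
    "Chrome/71": [62.0, 34.0, 18.0, 17.5, 17.0, 17.2],  # flattens out: suspicious
    "Chrome/70": [30.0, 12.0, 4.0, 1.5, 0.8, 0.4],      # ages out: normal
}

AGE_OUT_FLOOR = 2.0   # assumed: share (%) an old version should fall below
STABLE_DELTA = 1.0    # assumed: max recent change to call the curve "flat"

for version, shares in weekly_share.items():
    tail = shares[-3:]  # the three most recent weeks
    stabilized = max(tail) - min(tail) < STABLE_DELTA and min(tail) > AGE_OUT_FLOOR
    if stabilized:
        print(f"{version}: share flattened at ~{tail[-1]:.1f}% -- possible bot population")
    else:
        print(f"{version}: aging out as expected")
```

A real user population on an auto-upgrading browser decays toward zero; a scripted population pinned to one binary does not, which is exactly the divergence the two retailers below exhibit.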

In the example at the right, the retailer sees its visitors upgrading Chrome on a regular cycle, with older versions aging out as expected thanks to the browser’s auto-upgrade feature.

In the next example, the pattern is very different: Chrome/71 did not age out as the auto-upgrade feature would predict. Where Retailer 1 saw Chrome/71 age out by February 19, Retailer 2 saw its Chrome/71 population stabilize at a level much higher than residual, non-upgraded visitors can explain.

But it’s not just Chrome that is affected; Firefox can also be used to create bot traffic. In the example below, the only Firefox versions that should appear in large numbers among real users are 65 and 68. In the data, however, Firefox/60 and Firefox/38 are present in numbers far exceeding those of the real user versions, a clear indicator of bot traffic.

These bots also negatively affect the recorded performance of Firefox during this period, as the median load time for both Firefox/38 and Firefox/60 when visiting the site was above 80 seconds.
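
One way to keep such populations from polluting the trend lines is a simple beacon-level filter rule applied before computing medians. Below is a minimal sketch; the suspect OS/browser pairs and the beacon values are hypothetical, not a recommendation of specific versions to block:

```python
from statistics import median

# Assumed suspect populations, identified by the signals described above.
SUSPECT_AGENTS = {("Linux", "Firefox/38"), ("Linux", "Firefox/60")}

# Hypothetical beacons: (os, browser, load time in seconds).
beacons = [
    {"os": "Windows", "browser": "Firefox/65", "load_s": 2.1},
    {"os": "Mac",     "browser": "Firefox/68", "load_s": 1.9},
    {"os": "Linux",   "browser": "Firefox/38", "load_s": 83.0},
    {"os": "Linux",   "browser": "Firefox/60", "load_s": 91.5},
]

clean = [b["load_s"] for b in beacons
         if (b["os"], b["browser"]) not in SUSPECT_AGENTS]

print(f"Median (all beacons):   {median(b['load_s'] for b in beacons):.1f}s")
print(f"Median (bots filtered): {median(clean):.1f}s")
```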

Bots Matter in RUM

As shown by the customer data above, bots matter in a number of ways:

  • They can skew your mPulse performance metrics, leading to incorrect conclusions about the performance of key page groups
  • They can inflate the metrics for certain OS and browser families, leading to incorrect assumptions about the composition of the visitor population
  • They can cost the customer money, not just in inflated mPulse beacon counts, but in higher CDN and bandwidth bills.

While it is impossible to isolate and block or trim all of them from mPulse data, watching for these signals can help organizations realize that bots may be a larger issue than they think, one requiring more effective remediation than simple blacklists and filter rules.

Real User Measurement – A tool for the whole business

The latest trend in web performance measurement is the drive to implement Real User Measurement (RUM) as a component of a broader measurement strategy. As someone who cut my teeth on synthetic measurements using distributed robots and repeatable scripts, I took a long time to see the light of RUM, but I am now a complete convert – the richness and completeness of RUM provides data that synthetic measurements simply could not show me.
The key for organizations now is to realize that RUM is not a replacement for synthetic measurements. In fact, the two are integral to each other for identifying and solving tricky external web performance issues that can be missed from a single measurement perspective.
My view is that the best way to drive RUM collection is to shape the metrics the same way you have already chosen to segment and analyze your visitors in traditional web analytics. The time and effort already invested there can inform RUM configuration by determining the following (a sketch of this kind of segmentation appears after the list):

  • Unique customer populations – registered users, loyalty program levels, etc.
  • Geography
  • Browser and Device
  • Pages and site categories visited
  • Etc.
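
As a sketch of what that segmentation might look like once the analytics dimensions ride along on each beacon, here is a minimal Python example. The field names, segment labels, and values are assumptions for illustration, not mPulse’s actual schema:

```python
from collections import defaultdict
from statistics import median

# Hypothetical beacons carrying analytics-style dimensions alongside timings.
beacons = [
    {"segment": "loyalty-gold", "geo": "US", "page_group": "PDP",      "load_s": 1.8},
    {"segment": "loyalty-gold", "geo": "US", "page_group": "Checkout", "load_s": 2.4},
    {"segment": "guest",        "geo": "DE", "page_group": "PDP",      "load_s": 3.1},
    {"segment": "guest",        "geo": "US", "page_group": "PDP",      "load_s": 2.7},
]

# Group timings by (customer segment, page group), the same cut the
# business already uses in its web analytics reporting.
by_segment = defaultdict(list)
for b in beacons:
    by_segment[(b["segment"], b["page_group"])].append(b["load_s"])

for (segment, page_group), times in sorted(by_segment.items()):
    print(f"{segment:>14} / {page_group:<9} median: {median(times):.2f}s")
```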

This information needs to bleed through into RUM so that it can be linked directly to the components of the infrastructure and codebase in play when the customer made their visit. But limiting this vast new data pool to identifying and solving infrastructure, application, and operations issues isolates the information from a potentially huge population of hungry RUM consumers – the business side of the organization.
This side of the company, the side that fed its web analytics data into the setup of RUM, now needs to see the benefit of that effort. By sharing RUM with the teams that use web analytics and aligning the two strategies, companies can directly tie detailed performance data to existing customer analytics. With this combination, they can begin to truly understand the effects of A/B testing, marketing campaigns, and performance changes on business success and health. But business users need a different language to understand the data that web performance professionals consume so naturally.
I don’t know yet what that language is, but developing it means taking the data to business teams and seeing how it works for them. What companies will likely find is that the data one group uses won’t be the same as the other’s, but there will be enough shared characteristics for the groups to speak a common dialect of performance.
This new audience presents the challenge of presenting the data in a form that business teams can easily consume alongside their existing analytics data. Providing yet another tool or interface will not drive adoption. Adoption will be driven by attaching RUM to the multi-billion dollar analytics industry, so that the value of these critical metrics is easily understood by, and made actionable for, the business side of any organization.
So, as the proponents of RUM in web performance, the question we need to ask is not “Should we do this?”, but rather “Why aren’t we doing this already?”.