There comes a time in the lifespan of every large website when bot traffic becomes a problem on some scale or another. Sometimes you get bombarded with scrapers and your servers can’t handle the load. Sometimes malicious users attempt to brute-force security-related endpoints. Sometimes bots drop spam content into input fields. Regardless of the use case, eventually the problem grows enough that it needs to be addressed somehow.

This happened to us, and here’s the long road we traveled.

The Attack Begins

First of all, observability and alerting are important, and that can’t be stressed enough. It’s possible that the bot traffic was always there and we never noticed, but with the introduction of proper monitoring and alerting, it seemed as though we were suddenly being bombarded out of nowhere.

We started to see attacks on some of our endpoints protecting customer data, using techniques like credential stuffing and pure brute forcing. At first we combatted this with manual labor, resetting passwords and blocking individual IPs, but it quickly became a never-ending struggle.

We also started seeing massive spikes on some of our slower endpoints from (presumably) scrapers. Here the load would jump, from one second to the next, to over 30 times our normal peak traffic. To combat this, we spent a lot of time optimizing endpoints to make them faster and more stable, but this too became an endless fight: the more we sped things up, the more traffic could be thrown our way.

Rate Limiting

Sometimes scaling your services up is not possible or is too expensive, and for security-related cases it isn’t even a valid solution. So the first thing you would normally attempt is rate limiting. When you have a clear session or user connected to your requests, rate limiting is easy, but if you allow “guest” traffic, it becomes much more challenging.
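To make the general technique concrete, here is a minimal sketch of a keyed rate limiter built on golang.org/x/time/rate. This is an illustration, not our actual implementation: the key can be anything that identifies a client (a session ID, a user ID, an IP), and the limits below are placeholders.

import (
    "net/http"
    "sync"

    "golang.org/x/time/rate"
)

// keyedLimiter hands out one token bucket per client key.
type keyedLimiter struct {
    mu       sync.Mutex
    limiters map[string]*rate.Limiter
}

func newKeyedLimiter() *keyedLimiter {
    return &keyedLimiter{limiters: make(map[string]*rate.Limiter)}
}

// allow reports whether the client identified by key may proceed,
// creating a limiter with the given rate and burst on first sight.
func (k *keyedLimiter) allow(key string, limit rate.Limit, burst int) bool {
    k.mu.Lock()
    l, ok := k.limiters[key]
    if !ok {
        l = rate.NewLimiter(limit, burst)
        k.limiters[key] = l
    }
    k.mu.Unlock()
    return l.Allow()
}

// rateLimit rejects requests from clients that exceed their bucket.
func rateLimit(k *keyedLimiter, keyFn func(*http.Request) string, next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        if !k.allow(keyFn(r), 5, 10) { // placeholder: 5 req/s, bursts of 10
            http.Error(w, "too many requests", http.StatusTooManyRequests)
            return
        }
        next.ServeHTTP(w, r)
    })
}

The interesting part is the keyFn: everything that follows is really about how to pick a key you can trust.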

Probably the most common way to rate limit is using the client’s IP. Sounds easy, but when you have traffic that comes through a CDN like Cloudflare or Akamai, it’s not as simple as just grabbing the RemoteAddr of the request. When the traffic comes from your CDN, the remote address will always be the address of an edge server, not the actual client, so using this IP to limit would be pretty disastrous (you’ll end up blocking all your traffic). But CDNs usually pass along the real IP in the X-Forwarded-For header. This header can contain a list of IPs, and Cloudflare sticks the client IP to the front.

func getClientIpFromRequest(r *http.Request) string {
    forwardedFor := r.Header.Get("X-Forwarded-For")
    if forwardedFor == "" {
        return r.RemoteAddr
    }
    ips := strings.Split(strings.ReplaceAll(forwardedFor, " ", ""), ",")
    // Naive approach: assume the client IP is the first entry in the list.
    return ips[0]
}

Hulksmash

Our rate limiting worked pretty well until I realized how easy it was to bypass. While trying to break our new rate limiting and searching for places where limits were missing, I built a lot of test scripts, and all the duplicated code getting copied around eventually grew into a QA tool for brute-forcing and load-testing APIs called Hulksmash. It quickly occurred to me that headers like X-Forwarded-For can also be set by the client, so what would happen if I sent a random IP in that header? It turns out Cloudflare sends the header structured like this:

[original-x-forwarded-for, client-ip, cloudflare-hop1, cloudflare-hop2]

This means you can bypass any IP-based limiting that uses the above algorithm by simply providing your own X-Forwarded-For header (it doesn’t even need to contain a real IP address; Cloudflare accepts any string). So here’s a smarter algorithm:

func getClientIpFromRequest(r *http.Request) string {
    forwardedFor := r.Header.Get("X-Forwarded-For")
    ips := strings.Split(strings.ReplaceAll(forwardedFor, " ", ""), ",")
    ipCount := len(ips)
    switch {
    case forwardedFor == "":
        return r.RemoteAddr
    case ipCount < 3:
        return ips[0]
    default:
        // When there are three or more entries, always take the third-to-last
        // IP: the last two are Cloudflare's own hops, and the entry just
        // before them is the client IP Cloudflare actually saw. Anything the
        // client spoofed sits further to the left, which mitigates
        // X-Forwarded-For spoofing.
        return ips[ipCount-3]
    }
}

Hulksmash was soon adapted to send randomized headers as well, like User-Agent and Cache-Control, and it soon became clear that it was fairly easy to produce traffic that looked like it might be real. But at least for now, it looked like most of the bots were thwarted.
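To make the bypass concrete, here’s a rough sketch of the kind of spoofed request involved. The header values are made up for illustration; Hulksmash randomizes far more than this.

import (
    "fmt"
    "math/rand"
    "net/http"
)

// sendSpoofedRequest fires a request with a fabricated X-Forwarded-For.
// Against the naive ips[0] algorithm above, every request appears to come
// from a brand-new client, so IP-based limits never trigger.
func sendSpoofedRequest(url string) (*http.Response, error) {
    req, err := http.NewRequest(http.MethodGet, url, nil)
    if err != nil {
        return nil, err
    }
    req.Header.Set("X-Forwarded-For", fmt.Sprintf("%d.%d.%d.%d",
        rand.Intn(256), rand.Intn(256), rand.Intn(256), rand.Intn(256)))
    req.Header.Set("User-Agent", "Mozilla/5.0 (definitely a real browser)")
    return http.DefaultClient.Do(req)
}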

The Battle of the Bots

It didn’t take long for the bots to be replaced by more sophisticated killing machines which could adapt like the T-1000. These new bots delivered ten times the load, with rotating IPs, so we were back to the drawing board.

At this point we began tweaking the thresholds for the rate limiter, and thus began the Battle of the Bots. Every time we lowered the thresholds, the traffic improved for a few days until the bots adapted and the whole cycle repeated. Eventually we reached the point where each IP would only be used a single time before rotating to a new one, and the IPs even came from private IP ranges.

After that, we spent some time looking into other methods for correlating traffic between users, such as browser fingerprints or device IDs, and we examined various headers whose presence or absence could identify a user as fake. After some time playing with Hulksmash, it became clear that if it could bypass our filter, the bots would be able to as well. The conclusion from all this experimentation: any client-supplied value that the server can’t independently verify (unlike, say, an auth token) is useless as an identifier.

Don’t get me wrong, I like robots. Robots are cool. But these particular bots were annoying and they needed to be stopped. So I kept digging and soon realized that the problem was in trying to use just one signal to judge the legitimacy of traffic. No single attribute can do it consistently; however, you can look at the entire request holistically, taking the IP, the headers, the TCP handshake, packet sizes, and request timings all into account to form a sort of confidence rating.
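As a toy illustration of the idea (the signals and weights below are invented, and far simpler than anything a real bot-detection pipeline uses):

// requestSignals is a toy set of holistic signals about one request.
type requestSignals struct {
    suspiciousIP   bool // e.g. private range or known proxy exit
    missingHeaders bool // e.g. a "browser" sending no Accept-Language
    oddTiming      bool // e.g. perfectly regular request intervals
    unusualTLS     bool // e.g. handshake doesn't match the claimed browser
}

// confidence returns a score from 1 (definitely a bot) to 99 (definitely
// a human), mirroring the scale discussed in the next section.
func confidence(s requestSignals) int {
    score := 99
    if s.suspiciousIP {
        score -= 30
    }
    if s.missingHeaders {
        score -= 25
    }
    if s.oddTiming {
        score -= 25
    }
    if s.unusualTLS {
        score -= 30
    }
    if score < 1 {
        score = 1
    }
    return score
}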

The BotScore

Cloudflare provides a botscore, a number ranging from 1 (definitely a bot) to 99 (definitely a human). We soon observed from the logs that this botscore was quite consistently able to differentiate between real traffic and “fake” traffic. It’s easy to test: take any valid request made with a standard browser, open the developer console and watch the network tab, copy the request as cURL, and then send it from your terminal. The browser request consistently gets a botscore of 99, while the cURL request consistently gets a botscore of 1 despite having exactly the same parameters, headers, and origin IP.

The next task was to make this botscore available to our APIs, which can be done with a simple Cloudflare Worker:

addEventListener('fetch', event => {
    event.respondWith(handleRequest(event.request))
})

async function handleRequest(request) {
    // Clone the request so its headers become mutable.
    request = new Request(request);
    let score = "N/A";
    let verifiedBot = "N/A";
    // botManagement is only present on zones with Bot Management enabled.
    if (request.cf.botManagement !== undefined) {
        score = request.cf.botManagement.score.toString();
        verifiedBot = request.cf.botManagement.verifiedBot.toString();
    }
    // Pass the scores along to the origin as headers.
    request.headers.set("Cf-Bot-Score", score);
    request.headers.set("Cf-Bot-Verified", verifiedBot);
    return fetch(request.url, request);
}

Now, with the botscore in place, instead of rate limiting single offending users, it became possible to block all non-human traffic once the load exceeded our thresholds.

func getClientIpFromRequest(r *http.Request) string {
    forwardedFor := r.Header.Get("X-Forwarded-For")
    ips := strings.Split(strings.ReplaceAll(forwardedFor, " ", ""), ",")
    // A missing or unparsable header leaves botScore at 0, skipping the check.
    botScore, _ := strconv.ParseInt(r.Header.Get("Cf-Bot-Score"), 10, 32)
    if botScore > 0 && botScore <= BotThreshold &&
        r.Header.Get("Cf-Bot-Verified") != "true" {
        return "bot" // group all bots together into one bucket
    }
    ipCount := len(ips)
    switch {
    case forwardedFor == "":
        return r.RemoteAddr
    case ipCount < 3:
        return ips[0]
    default:
        // As before, take the third-to-last IP to mitigate
        // X-Forwarded-For spoofing.
        return ips[ipCount-3]
    }
}

The beauty of this algorithm is that it still allows some traffic from bots which may be harmless, and only begins blocking once the load becomes too high: because all unverified bots share a single rate-limit bucket, no individual bot is blocked outright, but the group as a whole gets throttled as soon as its combined traffic exceeds the bucket’s threshold.
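Tied back to the keyed limiter sketched earlier, the wiring might look something like this (the limits are placeholders, not our real thresholds):

// Hypothetical wiring: every unverified bot lands in the shared "bot"
// bucket, so bots collectively compete for one generous allowance while
// each human client gets an ordinary per-IP bucket.
func allowRequest(k *keyedLimiter, r *http.Request) bool {
    key := getClientIpFromRequest(r)
    if key == "bot" {
        return k.allow("bot", 100, 200) // placeholder group limit
    }
    return k.allow(key, 5, 10) // placeholder per-client limit
}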