Sometimes you want to review bot traffic. For example, if you suspect that bots are consuming too much of your computing resources (e.g. CPU cycles), you may want to check whether a particular bot is visiting your site excessively. You might also want to answer questions such as "How many pages on this website does Google's crawler crawl in a day?"
In this tutorial, we'll see how to filter your site's visit history to review visits from bots.
There are two broad categories to consider when reviewing bot traffic: bots that identify themselves and bots that do not.
We'll look at the easier category first: bots that identify themselves. Go to the main page to view the visit history for your site. On the left, there should be a form that allows you to apply filters. Change the filter type to Tag and select self-identified bot for the tag. Click the Filter button to update the table.
When the visit history table updates, all rows should correspond to visits where the visitor used a user-agent string that Gatekeeper thinks is a bot (e.g. "Googlebot"). If you see a user-agent string that you think Gatekeeper should recognize as a bot but does not, please let us know.
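To make the idea of a self-identified bot concrete, here is a minimal sketch of a user-agent check in Python. It is not Gatekeeper's actual detection logic: the token list and the function name looks_like_self_identified_bot are assumptions made up for illustration, and a real list of bot user agents would be much longer.

```python
import re

# Hypothetical token list for illustration only; not Gatekeeper's actual rules.
BOT_TOKENS = ("bot", "crawler", "spider", "slurp")

# One case-insensitive pattern that matches any of the tokens as a substring.
_BOT_PATTERN = re.compile(
    "|".join(re.escape(token) for token in BOT_TOKENS),
    re.IGNORECASE,
)


def looks_like_self_identified_bot(user_agent: str) -> bool:
    """Return True if the user-agent string contains a known bot token."""
    return bool(_BOT_PATTERN.search(user_agent or ""))


googlebot = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
browser = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)
print(looks_like_self_identified_bot(googlebot))
print(looks_like_self_identified_bot(browser))
```

Substring matching like this is deliberately simple. It only catches bots that announce themselves, which is exactly the limitation the rest of this tutorial addresses.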
Self-identified bots tend not to be malicious. After all, why would a bot that is trying to access pages reserved for humans announce that it is a bot? Malicious bots instead try to cover their tracks and appear as normal human users, which leads us to the second category: bots that do not identify themselves.
There is no perfect way to detect such bots automatically, but intelligence about IP addresses can give us clues. For example, if a user is visiting your website from a data center, there is a good chance that the user is a bot. This isn't always true, since some people use virtual private networks (VPNs), which might route traffic through data centers; in those cases, CAPTCHA challenges are useful for distinguishing bots from humans. Fortunately, Gatekeeper makes it easier to focus on such visits by providing the data center tag.
A trickier category consists of IP addresses that belong to organizations that operate both a data center and an ISP (internet service provider). In that case, Gatekeeper cannot tell whether a given IP address is used by the data center or by the ISP, so those IP addresses are treated differently from the ones that are identified as data center. When looking for bot traffic, you might want to filter for data center and ISP as well.
If you come across IP addresses that you think are data centers, but we have not marked them as such, please let us know.
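The IP-based tagging described above can be pictured as a lookup against known network ranges. The sketch below is only an illustration under stated assumptions: the ranges are reserved documentation blocks used as placeholders rather than real data center or ISP networks, the function name classify_ip is hypothetical, and Gatekeeper's own data sources and logic are not shown here.

```python
import ipaddress

# Placeholder ranges (reserved documentation blocks), NOT real provider data.
DATA_CENTER_RANGES = [ipaddress.ip_network("192.0.2.0/24")]
DATA_CENTER_AND_ISP_RANGES = [ipaddress.ip_network("198.51.100.0/24")]


def classify_ip(ip: str) -> str:
    """Return a tag-like label for an IP address based on the sample ranges."""
    addr = ipaddress.ip_address(ip)
    if any(addr in network for network in DATA_CENTER_RANGES):
        return "data center"
    if any(addr in network for network in DATA_CENTER_AND_ISP_RANGES):
        return "data center and ISP"
    return "other"


for ip in ("192.0.2.15", "198.51.100.7", "203.0.113.9"):
    print(ip, "->", classify_ip(ip))
```

Keeping the mixed ranges in a separate list mirrors the distinction made above: an address owned by an organization that runs both a data center and an ISP is a weaker signal than one that is known to be purely data center.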
Lastly, bots can be run on residential networks, where they can be very difficult to detect. One way to discover new scraping attacks is to review which IP addresses have visited your site most frequently in a day. Gatekeeper makes this easier by listing those IP addresses in the "Top visitors" section right underneath the visits graph ("Top visitors" is collapsed by default).
You can review the list and look for unusually high counts from IP addresses that do not belong to bots you expect. Clicking on an IP address applies it as a filter, so you can more easily review the URLs that the IP address is visiting. If you decide that an IP address is unwelcome, you can blacklist it or add it to a watchlist (a custom visitor group that you can filter for).
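If you export visit data or work from your own server logs, the same kind of per-day count can be sketched in a few lines of Python. This is an illustration only: the (ip, day, url) record layout and the function name top_visitors are assumptions, not a Gatekeeper export format or API.

```python
from collections import Counter
from datetime import date


def top_visitors(visits, day, limit=10):
    """Count visits per IP address on one day, most frequent first.

    `visits` is an iterable of (ip, day, url) tuples; this layout is a
    hypothetical one chosen for the example.
    """
    counts = Counter(ip for ip, visit_day, _url in visits if visit_day == day)
    return counts.most_common(limit)


sample_visits = [
    ("203.0.113.7", date(2024, 1, 1), "/products/1"),
    ("203.0.113.7", date(2024, 1, 1), "/products/2"),
    ("203.0.113.7", date(2024, 1, 1), "/products/3"),
    ("198.51.100.4", date(2024, 1, 1), "/"),
    ("203.0.113.7", date(2024, 1, 2), "/products/4"),
]

for ip, count in top_visitors(sample_visits, date(2024, 1, 1)):
    print(ip, count)
```

An address that requests many distinct URLs in a short period, as the first one does in this sample, is the kind of pattern worth a closer look.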