Commentary  |  Ido Safruti  |  9/27/2017 10:30 AM

How to Live by the Code of Good Bots

Following these four tenets will show the world that your bot means no harm.

Although my company fights problems caused by malicious bots on the Internet, many bots are doing good things. These beneficial bots may help a site get better exposure, provide better product recommendations, or monitor critical online services. The most famous good bot is the Googlebot, which crawls links to build the search engine many of us use.

To keep their access to the Web open, the makers of good bots must understand how to tell the world about their bots' intentions. At PerimeterX, we defined a "Code of Good Bots" that provides basic rules of good behavior. If legitimate bot makers follow this code, then websites and security services (like PerimeterX) can easily identify such bots.  

If you're a bot developer, we recommend following the Code of Good Bots:

1. Declare who you are.
The Internet is awash in spoofed and poorly identified traffic, including bad bots that seek to harm sites. To avoid suspicion, bot developers should have their bots declare their identity in the user-agent HTTP header when communicating with a site. We also recommend providing a link in the user-agent header to a page describing the bot: what it does, why a site owner should grant it access, and the methods a site owner can use to control it.

Googlebot, for example, will always include the word "googlebot" in the user-agent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
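If your bot uses a standard HTTP library, declaring itself takes a single header. The short Python sketch below shows the idea using the requests library; the "ExampleBot" name and its info URL are hypothetical placeholders, not a real bot.

import requests

# Hypothetical bot identity; the name and the descriptive URL are placeholders.
USER_AGENT = "Mozilla/5.0 (compatible; ExampleBot/1.0; +https://example.com/bot.html)"

response = requests.get(
    "https://site-being-crawled.example/some/page",
    headers={"User-Agent": USER_AGENT},  # declare who you are on every request
    timeout=10,
)
print(response.status_code)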

2. Provide a method to accurately identify your bot.

Good bot builders should provide a robust method to verify that a bot is what it declares itself to be. While declaring a specific user-agent is important, malicious bots can pretend to be legitimate by copying the user-agent header of a beneficial bot (also known as user-agent spoofing). For this reason, validating the good bot's source IP address is crucial.

The bot maker should provide a list of IP ranges as an XML or JSON file on its website. This list can also be provided as a DNS TXT record in the bot owner's domain. We recommend including an expiration time with the list to indicate how often it should be re-retrieved.
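For illustration, here is a minimal sketch of how a site owner might consume such a list in Python. The feed URL and JSON layout are assumptions for this example, not a real published feed.

import ipaddress
import requests

def load_bot_networks(url="https://example.com/examplebot-ips.json"):
    # Assumed layout: {"prefixes": [{"ipv4Prefix": "192.0.2.0/24"}, ...]}
    data = requests.get(url, timeout=10).json()
    return [ipaddress.ip_network(p["ipv4Prefix"]) for p in data["prefixes"]]

def ip_in_published_ranges(client_ip, networks):
    # True if the client IP falls inside any of the bot maker's published ranges
    ip = ipaddress.ip_address(client_ip)
    return any(ip in net for net in networks)

networks = load_bot_networks()
print(ip_in_published_ranges("192.0.2.15", networks))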

Another method, used by many crawlers, was introduced by Google and calls for a sequence of reverse DNS and forward DNS lookups to validate the source IP address; Google documents it in its guide to verifying Googlebot. Although Google's verification method is common and sufficiently secure, it is inefficient for the validating site. Providing a list of IP ranges enables a much more efficient validation process.
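As a rough sketch, the reverse-DNS check works like this: reverse-resolve the client IP, confirm the resulting hostname falls under the crawler operator's domain, then forward-resolve that hostname and confirm it maps back to the same IP. The domain names below are the ones Google publishes for Googlebot; adapt them to whichever crawler you are verifying.

import socket

def verify_by_reverse_dns(client_ip, allowed_domains=("googlebot.com", "google.com")):
    try:
        hostname, _, _ = socket.gethostbyaddr(client_ip)      # reverse DNS lookup
    except socket.herror:
        return False
    if not hostname.endswith(tuple("." + d for d in allowed_domains)):
        return False                                          # not under the crawler's domain
    try:
        forward_ips = socket.gethostbyname_ex(hostname)[2]    # forward DNS lookup
    except socket.gaierror:
        return False
    return client_ip in forward_ips                           # must map back to the same IP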

We recommend that bot makers specify the verification method in the URL provided in the user-agent string.

The validation method should be strong enough that bad bots can't pass this test. Specifically, the method should restrict the IP address ranges to those controlled or owned by the bot operator. For example, suggesting that a site owner verify that the IP address is in the Amazon Web Services (AWS) IP ranges isn't a good idea. Anyone can purchase an AWS virtual server and use it to send requests across the Web.

3. Follow robots.txt.
The robots.txt file is the de facto standard used by websites to communicate their general access policies to bots and crawlers. The standard specifies how to tell a bot which areas of the website shouldn't be crawled or scraped, the rate and frequency at which a bot may access the site, and more. Good bots are required to download the robots.txt file from the site before accessing it, parse it, and follow its instructions.
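Python's standard library already handles the parsing, so honoring robots.txt can be as simple as the sketch below; the bot name and URLs are hypothetical.

from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleBot"  # hypothetical bot name

robots = RobotFileParser("https://site-being-crawled.example/robots.txt")
robots.read()  # fetch and parse robots.txt before making any other request

url = "https://site-being-crawled.example/catalog/item-42"
if robots.can_fetch(USER_AGENT, url):
    print("allowed to crawl", url)
else:
    print("robots.txt disallows", url, "- skipping")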

4. Don't be too aggressive.
This is related to robots.txt and requires some common sense. Overly aggressive behavior can slow down a site or even take it offline. Different websites have different capacities to handle bot traffic. Some are set up to scale up quickly should massive traffic appear; others are not and will choke on even a small amount of additional traffic. While a bot may want to collect data quickly, this desire must be tempered against the realities of what the site can handle. We've seen cases where good bots contribute over 90% of the requests coming to a site.

Bots should respect the "Crawl-delay" instruction if it's specified in robots.txt; for example, a site owner could use it to request a 10-second delay between requests. Some bots, such as Googlebot and Bingbot, offer additional methods for controlling their crawlers, specifically their crawling rates.

If a site owner doesn't provide instructions on crawl speed and crawl access, the bot maker should default to a moderate, generally acceptable crawling rate.
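A minimal throttling loop might look like the following, again assuming the hypothetical ExampleBot; the five-second fallback is an arbitrary "moderate" default, not a standard.

import time
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleBot"     # hypothetical bot name
DEFAULT_DELAY_SECONDS = 5     # assumed moderate fallback when no Crawl-delay is set

robots = RobotFileParser("https://site-being-crawled.example/robots.txt")
robots.read()
delay = robots.crawl_delay(USER_AGENT) or DEFAULT_DELAY_SECONDS

for url in ("https://site-being-crawled.example/page-1",
            "https://site-being-crawled.example/page-2"):
    if robots.can_fetch(USER_AGENT, url):
        # ... fetch and process the page here ...
        time.sleep(delay)     # pause between requests to avoid overloading the site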

The Code of Good Bots Is Good for Us All
Following the Code of Good Bots grows more urgent with the rise of sophisticated bot attacks that piggyback on users via malicious browser extensions or on poorly secured Internet of Things devices. Site owners may default to more aggressive anti-bot policies in an effort to defend their user experience, site performance, and site integrity. The Code of Good Bots is critical for making sure that, even then, we continue to benefit from what good bots offer users and businesses.


Ido Safruti is a co-founder and CTO at PerimeterX, provider of application security solutions that keep businesses safe in the digital world, detecting risks to web and mobile applications and proactively managing them. Previously, Ido headed a product group in Akamai focused ...