robots.txt Pen Test extension

Given that I’m seeing an increase in unauthorised, and essentially illegal, Pen Tests against business production instances, which only serve to:

  • Test the firewalls around your production instance (rather than testing the application code directly),
  • Put the availability of the web service under ‘test’ at serious risk by consuming resources that should only be used by actual customers, whilst also risking data integrity through corruption, potentially harming the business’s bottom line and reputation,
  • Create a lot of ‘noise’ in the logs, which only helps to hide the bad guys.

(for additional reasons why this is a bad idea all around see https://www.aykira.com.au/2020/09/3rd-party-pen-testing-make-sure-you-are-doing-it-right/)

I think it’s time something was done to let site owners signal whether they allow or specifically restrict such Pen Tests. Therefore I propose an extension to the robots.txt standard to cover robots used to perform Pen Tests.

The robots.txt exclusion standard

Wikipedia has a good write-up of the current standard here. In essence, you specify the User Agents of the bots of interest and then indicate whether you allow or deny them access to areas of your site. Quite simple, but for us, the problem is that it has no way to flag a whole set or group of bots you want to exclude or include. Let’s fix that.

The @ category user agent extension

What I propose first is adding into the User-agent line spec the ability to include a ‘@<CATEGORY>’ form of a user agent. In essence, a category refers to a group or set of bots which perform a particular function. So to ban any pen test bots from scanning your website, go:

User-agent: @pentest
Disallow: /

In order for this to work, pen test bots should first pick up the robots.txt file and parse it looking for either reference to their specific User-agent OR usage of ‘@pentest’ and respect the restrictions placed upon them.

Prior to this, Pen Test crawlers typically ignored the robots.txt file entirely; with this extension, they MUST parse the robots.txt file looking for rules that include either their specific User-agent or ‘@pentest’.
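
To make that check concrete, here is a minimal Python sketch of how a pen test tool might decide whether a path is off-limits. The function names and the ‘ExampleScanner’ agent are hypothetical, and the sketch makes simplifying assumptions: only plain prefix matching on Disallow is handled (no wildcards, no Allow lines), and the generic ‘*’ agent is not treated as applying to pen test tools.

```python
def parse_groups(robots_txt):
    """Split robots.txt text into (agents, rules) groups.

    agents is a list of lower-cased User-agent values; rules is a list of
    (field, value) pairs that follow those User-agent lines.
    """
    groups = []
    agents, rules = [], []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # '#' starts a comment
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if rules:  # a User-agent line after rules starts a new group
                groups.append((agents, rules))
                agents, rules = [], []
            agents.append(value.lower())
        elif agents:
            rules.append((field, value))
    if agents:
        groups.append((agents, rules))
    return groups


def pentest_allowed(robots_txt, user_agent, path, category="@pentest"):
    """Return False if a group naming this tool or @pentest disallows path."""
    names = {user_agent.lower(), category}
    for agents, rules in parse_groups(robots_txt):
        if not names.intersection(agents):
            continue  # this group is addressed to other bots
        for field, value in rules:
            if field == "disallow" and value and path.startswith(value):
                return False
    return True
```

A tool would call pentest_allowed(robots_txt, "ExampleScanner", "/login") before each request and skip the request when it returns False.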

Note: I use ‘@’ quite deliberately here as ‘#’ is already used for comments.

The Pentest verb extension

Additionally, if you want to allow pentests to occur but only with certain restrictions, you can use the proposed ‘Pentest’ verb, which has the following format:

Pentest: [allow|owner] [rate=R] [during=START-FINISH] [clean]

The attributes mean the following:

  • ‘allow’ indicates that any pentest is permitted: you give express permission to all Pen Test tools to go ahead.
  • ‘owner’ indicates that a pentest is permitted only if permission has been expressly and specifically gained prior to the scan via proof of ownership or control (it cannot be implied or inferred, it has to be evidenced).
  • ‘rate=R’ indicates the maximum number of requests per minute permitted; the bot is expected to keep its average request load per minute under this limit.
  • ‘during=START-FINISH’ indicates the only time period, from (hour:minute) to (hour:minute) in GMT, during which pen tests are permitted. If it is not set, scans can occur at any time. The period may run past midnight.
  • ‘clean’ indicates that only scanning which leaves no ‘remnants’ is permitted, i.e. you cannot do blind XSS and other resident vulnerability tests, as these could either deface the site or make it easy for hackers to search for such vulnerabilities later on. You must leave the site as clean as you found it.

Note: the presence of the Pentest verb on its own is valid. If @pentest is also used, whichever comes first wins.
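
As an illustration of how a tool might read the Pentest verb, here is a minimal Python sketch. The PentestPolicy class and both function names are hypothetical. The proposal does not pin down the exact time syntax, so the sketch assumes START and FINISH are written as HH:MM (for example during=22:00-06:00). It also assumes a line without ‘allow’ or ‘owner’ simply leaves the mode unset.

```python
import re
from dataclasses import dataclass
from datetime import time
from typing import Optional, Tuple


@dataclass
class PentestPolicy:
    mode: Optional[str] = None                  # "allow", "owner" or unset
    rate: Optional[int] = None                  # max requests per minute
    during: Optional[Tuple[time, time]] = None  # (start, finish) in GMT
    clean: bool = False


def parse_pentest(value):
    """Parse the text that follows 'Pentest:' into a PentestPolicy."""
    policy = PentestPolicy()
    for token in value.split():
        lower = token.lower()
        if lower in ("allow", "owner"):
            policy.mode = lower
        elif lower == "clean":
            policy.clean = True
        elif lower.startswith("rate="):
            policy.rate = int(lower[len("rate="):])
        elif lower.startswith("during="):
            # Assumed syntax: HH:MM-HH:MM
            match = re.fullmatch(r"(\d{1,2}):(\d{2})-(\d{1,2}):(\d{2})",
                                 lower[len("during="):])
            if match is None:
                raise ValueError("bad during value: " + token)
            h1, m1, h2, m2 = map(int, match.groups())
            policy.during = (time(h1, m1), time(h2, m2))
        else:
            raise ValueError("unknown Pentest attribute: " + token)
    return policy


def in_window(policy, now_gmt):
    """True if now_gmt (a datetime.time in GMT) falls inside the window."""
    if policy.during is None:
        return True  # no window set: scans can occur at any time
    start, finish = policy.during
    if start <= finish:
        return start <= now_gmt <= finish
    # The window runs past midnight, e.g. 22:00-06:00
    return now_gmt >= start or now_gmt <= finish
```

A scanner could parse the line once at start-up, then check in_window before each batch of requests and throttle itself to policy.rate requests per minute.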

Why the Strict Proof of Ownership Test?

The test of ownership is strict for a good reason: it’s a clear indication to those operating or providing pen test services that they need specific approval from the actual site owners beforehand. Just writing clauses into your T&C’s saying that your customers can only scan sites they own or operate is not enough with this robots.txt extension. You need to actually perform a specific check, for instance checking DNS records, or requiring the site owner to upload a magic file or add a magic meta tag to the site’s home page. Nothing else is sufficient to fulfil this ownership requirement.
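
As one example of such a check, here is a minimal Python sketch of the magic meta tag approach. It assumes the tool provider has issued the customer a secret token and has already fetched the home page HTML. The meta tag name ‘pentest-verification’ and the function name are hypothetical; the proposal does not define them.

```python
import hmac
from html.parser import HTMLParser


class _MetaFinder(HTMLParser):
    """Collect the content of every <meta> tag with a given name."""

    def __init__(self, name):
        super().__init__()
        self.name = name
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == self.name:
            self.values.append(attr.get("content") or "")


def home_page_proves_ownership(html, expected_token,
                               meta_name="pentest-verification"):
    """True if the page carries the meta tag with the token we issued."""
    finder = _MetaFinder(meta_name)
    finder.feed(html)
    expected = expected_token.encode("utf-8")
    # compare_digest avoids leaking the token through timing differences
    return any(hmac.compare_digest(value.encode("utf-8"), expected)
               for value in finder.values)
```

Only the site owner can place the issued token on the home page, so a match is evidence of control rather than a mere claim in a sign-up form.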

Examples

So if you want no Pen Tests at all against your site, put at the top of your robots.txt file:

User-agent: @pentest
Disallow: /

If you want only owner-verified pentests, put:

Pentest: owner
User-agent: @pentest
Disallow: /

This way if the pentest bot only honours the @pentest but doesn’t process the Pentest verb, it will get blocked. Quite neat I think.

If you want only owner-verified pen tests, but a certain directory is always off-limits, put:

User-agent: @pentest
Disallow: /noscan
Pentest: owner
User-agent: @pentest
Disallow: /

This way you can add in specific rules for the Pen Test tools without affecting the other programs that make use of robots.txt.

Isn’t this helping the hackers?

No, not at all – you should always expect hackers to not honour what is given in the robots.txt AND your website should be hardened enough that what is given in the robots.txt provides them with no advantage. In fact, having such rules in your robots.txt indicates that you are on the ball and set up correctly to identify and deal with hackers.

Won’t this be hard to implement?

For a website owner, it should take a few minutes to update the robots.txt file. For the Pen Test tool provider, most tools ‘scan’ the robots.txt file looking for directories to scan – so they are already reading the file, they just need to add code to honour this extension – which should not be that hard to do. For them, the hard bit will be the proof of ownership, which most already require in their T&C’s of usage – this just gives them a mechanism to enforce it and avoid legal ‘grey areas’.

What now?

I suggest you do three things:

  • Update your robots.txt file using the new extension. It’s designed not to break anything existing and will provide clean comms on how you want to deal with Pen Tests going forward.
  • If you have T&C’s on your site that refer to Pen testing, link to this and say you utilise it to signal your requirements to tools.
  • If you like this, please ‘Like’ below and share with your friends. The more this is taken up, the better for all.