Step away from the Web site!

I had a bot banning session today.

I have noticed more and more new bots hitting my site lately and decided to investigate.

The bots I found and wanted to ban roughly fell into two categories:

  • Nice bots.

    I don’t have anything against them. I just don’t benefit from having them looking at my site.

    They are “nice” because they provide useful info on how to ban them from my site.

    I banned them in my robots.txt anyway.


    User-agent: BlogPulse
    Disallow: /

    User-agent: Nutch
    Disallow: /
  • Nasty bots.

    Bots that can’t be blocked by robots.txt.

    This is usually because they don’t provide a link or any other info on how to block them.

    Some are downright evil and ignore robots.txt and access URLs I have blocked.

    To ban them I had to add the following mod_rewrite rules to my Apache config.


    RewriteEngine on
    RewriteCond %{HTTP_USER_AGENT} ^ia_archiver [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Mail.Sweeper [OR]
    RewriteCond %{HTTP_USER_AGENT} ^newknow-larbin_2.6.2 [OR]
    RewriteCond %{HTTP_USER_AGENT} ^LARBIN-EXPERIMENTAL [OR]
    RewriteCond %{HTTP_USER_AGENT} ^grub-client [OR]
    RewriteCond %{HTTP_USER_AGENT} Indy.Library
    RewriteRule ^.* - [F]

    I got the lowdown on how to do this from a thread on WebmasterWorld.

    Any bot with one of those User Agents will get a 403 error when accessing my site. I tested this using wget -U to fake the User Agent.

    Hopefully I won't have to think about this again for a few months.

    Update

    The regex for grub is wrong. It should read:


    RewriteCond %{HTTP_USER_AGENT} grub-client [OR]

    Its user agent is Mozilla/4.0 (compatible; grub-client-1.3.7; Crawl your own stuff with http://grub.org)

    Interesting thread about the broken robots.txt “support” in grub. Definitely sufficient reason to permanently ban it.
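
To see why the anchored grub pattern never fired, here is a rough Python sketch that runs the same patterns against the grub user agent quoted above. It only approximates what Apache does: the `is_blocked` helper and the `BLOCKED_PATTERNS` list are names made up for this illustration, and Python's `re` module is standing in for the regex engine mod_rewrite uses, which should behave the same for patterns this simple.

```python
import re

# Patterns copied from the RewriteCond lines in the post. The grub entry
# is the corrected, unanchored version from the update.
BLOCKED_PATTERNS = [
    r"^ia_archiver",
    r"^Mail.Sweeper",
    r"^newknow-larbin_2.6.2",
    r"^LARBIN-EXPERIMENTAL",
    r"grub-client",
    r"Indy.Library",
]


def is_blocked(user_agent, patterns=BLOCKED_PATTERNS):
    """Return True if any pattern matches, mimicking the [OR] chain."""
    return any(re.search(p, user_agent) for p in patterns)


GRUB_UA = ("Mozilla/4.0 (compatible; grub-client-1.3.7; "
           "Crawl your own stuff with http://grub.org)")

# The original pattern was anchored with ^, but the string starts with
# "Mozilla", so it can never match.
print(bool(re.search(r"^grub-client", GRUB_UA)))  # False

# Without the anchor the pattern matches anywhere in the string.
print(is_blocked(GRUB_UA))                        # True

# An ordinary browser-style user agent is left alone.
print(is_blocked("Mozilla/5.0 (X11; Linux)"))     # False
```

The same reasoning explains why Indy.Library was written without the ^ in the first place: the name can appear in the middle of the user agent string.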
