Monday, September 04, 2006

A Simple PHP based Bad-Bot Trap

This is a very simple bad-bot trap that catches both bots that ignore your robots.txt and site rippers that never read it. There are many versions of this trap out there. This one is not particularly sophisticated, but it works.
Use with care to make sure you don't shut out visitors that you do want, or worse, shut down your site.
If you don't understand the following code, don't use it!
P.S. You can use it, but you cannot post it elsewhere. Copyright 2006
What you need:
  1. PHP enabled site
  2. Ability to incorporate robots.txt
  3. Ability to incorporate .htaccess files on your site
  4. Ability to send email via PHP
  5. Stamina to monitor your logs and .htaccess file
The following files are created or edited under your HTML directory (the web root), typically public_html/ on shared web hosting services.
robots.txt
.htaccess
badbots.php
bad-bot-script.php
index.php (or index.html)
  1. Add the following lines (or appropriate version) to your robots.txt:

    User-agent: *
    Disallow: /badbots.php

  2. Create the following file: badbots.php
    <?php
    header("Content-type: text/html; charset=utf-8");
    echo '
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
    ';
    ?>

    <html xmlns="http://www.w3.org/1999/xhtml">

    <head>

    <title>Bad-Bots and Rippers Denied</title>
    <meta name="author" content="seven-3-five.blogspot.com 2006-09-04" />

    </head>

    <body>

    <p>whatever message you would like the scum to see</p>

    <?php
    include 'bad-bot-script.php';
    ?>

    </body>
    </html>
  3. Create the following file: bad-bot-script.php
    <?php
    // author: seven-3-five, 2006-09-04, seven-3-five.blogspot.com
    // This script is the meat and potatoes of the bot-trap.
    // 1. It sends you an email when the page /badbots.php is visited.
    //    The email contains various info about the visitor.
    // 2. It adds the directive
    //    'deny from $ip' ($ip being the visitor's ip address)
    //    to the bottom of your .htaccess file.

    // SERVER VARIABLES USED TO IDENTIFY THE OFFENDING BOT

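    $ip = $_SERVER['REMOTE_ADDR'];
    // the user-agent and referer headers are optional, so fall back to ''
    $agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
    $request = $_SERVER['REQUEST_URI'];
    $referer = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '';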

    // CONSTRUCT THE EMAIL MESSAGE

    $subject = 'bad-bots';
    $email = 'your_email@your_site.com'; //edit accordingly
    $to = $email;
    $message ='ip: ' . $ip . "\r\n" .
    'user-agent string: ' . $agent . "\r\n" .
    'requested url: ' . $request . "\r\n" .
    'referer: ' . $referer . "\r\n"; // often is blank

    $message = wordwrap($message, 70);

    $headers = 'From: ' . $email . "\r\n" .
    'Reply-To: ' . $email . "\r\n" .
    'X-Mailer: PHP/' . phpversion();

    // SEND THE MESSAGE

    mail($to, $subject, $message, $headers);

    // ADD 'deny from $ip' TO THE BOTTOM OF YOUR MAIN .htaccess FILE

    $text = 'deny from ' . $ip . "\n";
    $file = '.htaccess';

    add_badbot($text, $file);

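    // Function add_badbot($text, $file_name): appends $text to $file_name
    // make sure PHP has permission to write to $file_name

    function add_badbot($text, $file_name) {
        $handle = fopen($file_name, 'a');
        if ($handle === false) {
            return; // could not open the file, e.g. missing write permission
        }
        flock($handle, LOCK_EX); // avoid interleaved writes from concurrent hits
        fwrite($handle, $text);
        fclose($handle); // closing the handle also releases the lock
    }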

    ?>
  4. Add the following html soon after the <body> tag of your main /index.php (or index.html) page:

    <p style="color:white;background:white;height:0;visibility:collapse;">
    <a href="badbots.php" >.</a>
    </p>
  5. Test thoroughly
Possible issues depending on your server ...
  • You may have to create an empty .htaccess file if your site does not already have one
  • You may have to adjust the permissions for .htaccess so that bad-bot-script.php can write to it. If so, try:
    touch .htaccess
    chgrp www .htaccess
    chmod 664 .htaccess
  • Your mail server may not like PHP generated mail
  • Your mail server may need to be configured
  • If you are locking out everyone, try adding the following two lines near the top of your main .htaccess file:
    order allow,deny
    allow from all
    though they should appear in the httpd.conf file on any public Apache Server and shouldn't be necessary here (I believe...)
  • You have locked yourself out. This will happen every time you test the system, so be prepared to remove your IP from your .htaccess file.
  • Your test adds your IP to the .htaccess file, but you still have access. Your server may not be configured to allow use of .htaccess files.
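
Since every test run locks you out, a small helper for undoing the block can save some hand editing. The following is only a sketch: remove_deny_line() is a hypothetical name, not part of the original trap, and the addresses are documentation-only placeholders. It works on a string rather than on the real file so it can be tried safely.

```php
<?php
// Hypothetical helper (not part of the original trap): returns $contents
// with every exact "deny from $ip" line removed, leaving other lines alone.
function remove_deny_line($contents, $ip) {
    $kept = array();
    foreach (explode("\n", $contents) as $line) {
        // Compare the whole line so 192.0.2.10 does not match 192.0.2.100.
        if (trim($line) !== 'deny from ' . $ip) {
            $kept[] = $line;
        }
    }
    return implode("\n", $kept);
}

// Example with an in-memory copy of an .htaccess file. 192.0.2.10 and
// 203.0.113.5 are placeholder addresses.
$before = "order allow,deny\nallow from all\n"
        . "deny from 192.0.2.10\ndeny from 203.0.113.5\n";
$after = remove_deny_line($before, '192.0.2.10');
echo $after;
```

To apply this to the real file you would read .htaccess into a string, pass it through the helper, and write the result back, ideally from a script that is not itself blocked (for example over SSH or FTP).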

What Happens:

A bad-bot grabs your robots.txt file and either ignores the file's directives, or uses that information to find stuff. If the bot follows the link to /badbots.php, the bad-bot-script.php fires, writing the visitor's IP to your .htaccess file and sending you an email to that effect. The bad-bot can no longer traverse your site.

Alternatively, a ripper visits your site and starts to download everything it can find. It will quickly stumble upon the /badbots.php link on your /index.*. Once it has visited /badbots.php, it will be unable to download any more of your stuff, just like in the previous example.

Once the bot or ripper discovers it is locked out, it may thrash about a bit, trying to retrieve any largish file it may have a URL for, but of course it will just be denied access, getting a 403 code and nothing else, and quickly move on.
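
The step that writes the visitor's IP to .htaccess is also the step most likely to shut down a site, because a malformed line in .htaccess can make the server refuse every request. The sketch below shows one slightly more defensive way to build the new file contents. It is an illustration, not the original script: append_deny_line() is a hypothetical name, and it works on a string so it can be tried without touching a real file.

```php
<?php
// Hypothetical helper: returns the new .htaccess contents after adding
// "deny from $ip".
function append_deny_line($contents, $ip) {
    // Refuse anything that is not a well-formed IPv4 or IPv6 address.
    if (filter_var($ip, FILTER_VALIDATE_IP) === false) {
        return $contents;
    }
    $line = 'deny from ' . $ip;
    // Skip addresses that are already blocked.
    foreach (explode("\n", $contents) as $existing) {
        if (trim($existing) === $line) {
            return $contents;
        }
    }
    // If the file does not end with a newline, add one first so the new
    // directive is not glued onto the previous line.
    if ($contents !== '' && substr($contents, -1) !== "\n") {
        $contents .= "\n";
    }
    return $contents . $line . "\n";
}

// 192.0.2.10 is a documentation-only placeholder address.
$updated = append_deny_line("order allow,deny\nallow from all", '192.0.2.10');
echo $updated;
```

The duplicate check keeps the file from growing when a blocked visitor somehow reaches the trap again, and the newline check covers an .htaccess whose last line was saved without a line ending.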

Variations: endless

10 comments:

Anonymous said...

This trap is good and easy, but when I go to badbots.php, I get this error message:
Fatal error: Call to unsupported or undefined function wordwrap() in bad-bot-script.php on line 27

What should I do?

Jdy

seven said...

Jdy's error
call to unsupported function wordwrap() ...

The function wordwrap() should be available starting with PHP 4.0.2. It may be disabled on public servers due to a potential heap-based buffer overflow problem. Check disable_functions in your php.ini file or with phpinfo(INFO_CONFIGURATION); to see if that is the case.
The problem was addressed in PHP 4.4.3 and 5.1.3.
See secunia.com/advisories/19803/ for details.

The statement
$message = wordwrap($message, 70);
is purely aesthetic and not strictly necessary for the script to work.
You could just comment it out or even remove it.

If you really want a wordwrap function, but won't or can't use the included wordwrap(), I believe there are some alternative scripts at php.net, search wordwrap.
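
As a middle ground between keeping and removing the call, the wrap can be made conditional. This is a minimal sketch, assuming the only problem is that wordwrap() is unavailable on the server; wrap_if_available() is a hypothetical name and the sample message uses placeholder values.

```php
<?php
// Hypothetical wrapper: wraps $text at $width columns when wordwrap() is
// available, and otherwise returns it unchanged.
function wrap_if_available($text, $width = 70) {
    if (function_exists('wordwrap')) {
        return wordwrap($text, $width);
    }
    return $text;
}

// Placeholder values standing in for the real visitor details.
$message = wrap_if_available(
    "ip: 192.0.2.10\r\nuser-agent string: ExampleBot/1.0\r\n"
);
echo $message;
```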

If you want to flesh out the original bad-bot-script, perhaps write checks that limit/sanitize the $_SERVER variables before constructing $message, and so on.
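
One possible shape for such a check is sketched below. It is an assumption about what "limit/sanitize" could mean here, not code from the original script: clean_for_email() is a hypothetical name, and the 200 character cap is an arbitrary choice.

```php
<?php
// Hypothetical helper: makes a request value safe to drop into a
// plain-text email body.
function clean_for_email($value, $max_length = 200) {
    // Replace runs of control characters (including CR and LF) with a
    // single space so a crafted value cannot add extra lines to the message.
    $value = preg_replace('/[\x00-\x1F\x7F]+/', ' ', (string) $value);
    // Cap the length so an oversized user-agent cannot bloat the email.
    return substr(trim($value), 0, $max_length);
}

// Example with a made-up user-agent string that tries to smuggle in a line.
$agent = clean_for_email("EvilBot/1.0\r\nBcc: someone@example.com");
echo $agent;
```

In the script, the same call could wrap $agent, $request and $referer just before $message is built.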

seven said...

This script has proved to be surprisingly popular. It seems to be popping up on various PHP sites all over the place.

Most posters have provided links back here, thanks, but have also quoted the post in full. As a courtesy, perhaps you can just provide a link to here with a short description, and please make sure the link works! You could also let me know about your post, rather than me finding out through Google.

If you are using the trap and are having success (or not) with it let me know here and I will publish your comments.

Also, if someone wants to translate the script's commentary into another language (it is already in French), you can add the translation as a comment.

Thanks

Jabi said...

This is a great tip and has solved a problem for me, so thank you for that!

The only thing I would add is to make sure you change the filenames because as this script becomes more commonly used, the scumbags will just configure their scanner to ignore any link to badbots.php.

Anonymous said...

This is great and thank you.

It would be even better if you could somehow integrate a reverse DNS lookup to confirm that search engine bots are genuine and not spoofed.
Then it would be a gun system
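
A reverse DNS check of this kind is usually done in two steps: look up the host name for the IP, then look up that host name again and confirm it points back to the same IP. The sketch below is one way to do it, not something from the original script. is_verified_crawler() is a hypothetical name, the suffix list is supplied by the caller and should be taken from the search engine's own documentation, and the two resolver parameters exist only so the logic can be exercised without network access.

```php
<?php
// Hypothetical helper: true only if $ip reverse-resolves to a host under
// one of $allowed_suffixes AND that host resolves forward to the same $ip.
// $reverse and $forward default to PHP's own DNS functions.
function is_verified_crawler($ip, $allowed_suffixes,
                             $reverse = 'gethostbyaddr',
                             $forward = 'gethostbyname') {
    $host = call_user_func($reverse, $ip);
    // gethostbyaddr() returns the unmodified IP when the lookup fails,
    // and false on malformed input.
    if ($host === false || $host === $ip) {
        return false;
    }
    $host = strtolower($host);
    foreach ($allowed_suffixes as $suffix) {
        // Require the suffix to match at the very end of the host name.
        if (substr($host, -strlen($suffix)) === $suffix) {
            // The forward lookup must point back to the original address.
            return call_user_func($forward, $host) === $ip;
        }
    }
    return false;
}
```

In bad-bot-script.php, a call such as is_verified_crawler($ip, array('.googlebot.com')) could then be used to skip the .htaccess write for a verified crawler; the suffix shown is an assumption to check against Google's current guidance. Note that gethostbyname() only returns IPv4 addresses, and that DNS lookups can add noticeable delay to the request.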

www.TropHort.com said...

This script blocks Google's bot!!
----------

After implementing this, I got the following message:

ip: 66.249.72.147 user-agent string: Mozilla/5.0 (compatible; Googlebot/2.1;
+http://www.google.com/bot.html) requested url: /badbots.php

How do I UNBLOCK wanted bots???

seven said...

1. Go into your .htaccess file and remove the line blocking Google's IP.

2. Make sure your robots.txt exclusion file is configured properly as described in the text above.

Combining PHP, .htaccess and robots.txt can create powerful tools. It is probably better not to use them on live sites until you are comfortable with these tools, have tested them thoroughly, and fully understand the code/markup.

regards,
7

R Reid said...

What do you do about crawlers that cached a version of your robots.txt file before you implemented the bot trap? If they do not reload the file, some okay bots may fall victim to the trap.

Heinzmann said...

This comment has been removed by a blog administrator.

Nico said...

It's an awesome script, but it shouldn't be bad-bots-script.php, it should be bad-bot-script.php.