Saturday, September 30, 2006

Missigua Spam Bot

Host: 70.86.142.210
Agent: Missigua Locator 1.9

This bot reads links from top to bottom on the page. After visiting the site's document root, it quickly ran into a bot trap, which was in fact the first link it followed. It then tried all the links on the page but, of course, was denied access. It did not read robots.txt before trying to crawl the site.

Whois Record

OrgName: ThePlanet.com Internet Services, Inc.

OrgID: TPCM
Address: 1333 North Stemmons Freeway
Address: Suite 110
City: Dallas
StateProv: TX
PostalCode: 75207
Country: US

Thursday, September 21, 2006

Setting Bad-Bot Traps

In A Simple PHP Based Bot Trap, I presented a fairly simple script for trapping, well, really excluding, robots or site rippers that either ignore or surreptitiously use a site's robots.txt file.

The trap was set by adding a link on the main index page, pointing to a file excluded via robots.txt, as bait for the bad bot. The trap is hidden from regular users through CSS styles and by placing a single dot (.) between the anchor tags.

The weakness in this trap's method of camouflage is that the trap can still be tripped by users of text-based or other non-visual browsers.

There are two variations on this trap that can be used in conjunction with or instead of the original.

The first is simply to replace the dot with a 1px by 1px image that is the same colour as the page's background. The alt attribute can help identify the trap to legitimate users of text browsers. For example:

<p>
<a href="/badbots.php">
<img src="small_image.gif" alt="do not follow" />
</a>
</p>


The second variation is to use HTML comments, hiding the link from everyone except that particular species of bot that will try to follow anything even remotely resembling a link:

<!-- <a href="/badbots.php">look, a link!</a> -->

Placement of the traps can also vary. Bots do not necessarily follow links in the order they appear on the page, nor do they necessarily enter a site through the main index page. Traps can be placed soon after the <body> tag, near the bottom of the page, or within a list of links, such as a navigation bar. If the site has a complicated hierarchy of nested folders, laying traps at different depths may also yield results.
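For instance, the trap link can sit inside an ordinary navigation list, hidden with the same CSS trick used earlier (the surrounding list items here are illustrative):

```html
<ul>
<li><a href="/">Home</a></li>
<li style="color:white;background:white;height:0;visibility:collapse;">
<a href="/badbots.php">.</a>
</li>
<li><a href="/about.html">About</a></li>
</ul>
```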

Tuesday, September 19, 2006

Performance Systems International Inc. Bot

Agent: Java/1.6.0-beta2 and Mozilla/5.0 (compatible; MSIE 6.0; Windows NT 5.0)
Host: 38.99.203.110

Whois Record
OrgName: Performance Systems International Inc.
OrgID: PSI
Address: 1015 31st St NW
City: Washington
StateProv: DC
PostalCode: 20007
Country: US

This bot visited an unprotected site last July. It grabbed robots.txt and then proceeded to download every link on the site, including JavaScript files.
It is now blocked by ip and by user-agent (^Java).
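In .htaccess terms, the block looks something like this (a sketch; it assumes the server allows mod_rewrite directives in .htaccess):

```
# deny the host outright
deny from 38.99.203.110

# deny any client announcing a Java user-agent
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^Java
RewriteRule ^.* - [F,L]
```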

It has since revisited the site twice. Today it tried to grab robots.txt as Java/1.6.0-beta2 and was sent a 403 code (access denied). It then changed its user-agent string, after a four-second delay, to Mozilla/5.0 (compatible; MSIE 6.0; Windows NT 5.0) and tried to grab the main index page. Denied again, it seems to have moved on.
It appears to be hosted by Cogent Communications.

Saturday, September 16, 2006

An Elegant Way to Trap Misbehaving Robots

Hi Spitfire of forum.phpbb-fr.com! Thanks for the translation!

A very simple bad-bot trap that catches both robots that ignore robots.txt and site rippers that never read it.
There are many versions of this trap out there. This one is not particularly sophisticated, but it works.
Use it with care to make sure you don't shut out visitors you do want, or worse, shut down your own site.
If you don't understand the code below, don't use it.

What you need:

  1. PHP-enabled hosting
  2. Ability to incorporate robots.txt
  3. Ability to incorporate .htaccess files on your site
  4. Ability to send email via PHP
  5. Stamina to monitor your logs and .htaccess file
Create or edit the following files and upload them to your site's document root:

  1. robots.txt
  2. .htaccess
  3. badbots.php
  4. bad-bot-script.php
  5. index.php (or index.html)
1. Add the following lines to your robots.txt file:

User-agent: *
Disallow: /badbots.php

2. Create the following file: badbots.php

<?php
header("Content-type: text/html; charset=utf-8");
echo ' <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> ';
?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Bad-Bots and Rippers Denied</title>
<meta name="author" content="seven-3-five.blogspot.com 2006-09-04" />
</head>
<body>
<p>whatever message you would like the scum to see</p>
<?php
include 'bad-bot-script.php';
?>
</body>
</html>

3. Create the following file: bad-bot-script.php

<?php
/* author: seven-3-five, 2006-09-04, seven-3-five.blogspot.com
* Thanks to Spitfire for the French translation.
* This script is the meat and potatoes of this bot trap:
* 1. It sends you an email when the page /badbots.php is visited.
* The email contains various details about the visitor.
* 2. It appends the directive
* 'deny from $ip' ($ip being the visitor's ip address)
* to the bottom of your .htaccess file. */

/* SERVER VARIABLES USED
* TO IDENTIFY THE OFFENDING BOT */

$ip = $_SERVER['REMOTE_ADDR'];
$agent = $_SERVER['HTTP_USER_AGENT'];
$request = $_SERVER['REQUEST_URI'];
$referer = $_SERVER['HTTP_REFERER'];

// CONSTRUCT THE EMAIL MESSAGE

$subject = 'bad-bots';
$email = 'your_email@your_site.com';
$to = $email;
$message = 'ip: ' . $ip . "\r\n" .
'user-agent string: ' . $agent . "\r\n" .
'requested url: ' . $request . "\r\n" .
'referer: ' . $referer . "\r\n";
// referer is often blank

$message = wordwrap($message, 70);
$headers = 'From: ' . $email . "\r\n" .
'Reply-To: ' . $email . "\r\n" .
'X-Mailer: PHP/' . phpversion();

// SEND THE MESSAGE

mail($to, $subject, $message, $headers);

/* APPEND 'deny from $ip'
* TO THE BOTTOM OF YOUR .htaccess FILE */

$text = 'deny from ' . $ip . "\n";
$file = '.htaccess';
add_badbot($text, $file);

/* Function
* add_badbot($text, $file_name): appends $text to $file_name
* Make sure PHP has permission to write to $file_name */

function add_badbot($text, $file_name)
{
$handle = fopen($file_name, 'a');
fwrite($handle, $text);
fclose($handle);
}
?>

4. Add the following code soon after the <body> tag of your index page, index.php or index.html:

<p style="color:white;background:white;height:0;visibility:collapse;">
<a href="badbots.php" >.</a>
</p>


5. Test it thoroughly



What happens?

A bad bot reads your robots.txt file and either ignores its directives or uses that information. If the bot follows the link to /badbots.php, the bad-bot-script.php script fires, writes the visitor's ip address to your .htaccess file, and notifies you of the fact by email. The bad bot can no longer crawl the site.

Another possibility: a site ripper visits your site and starts downloading everything it finds. It will quickly hit the /badbots.php link on your index page. Once it has visited that link, it will be unable to download anything else, as in the previous example.

Possible issues, depending on your server:

  • You may have to create an empty .htaccess file if your site does not already have one
  • You may have to set the permissions on .htaccess so that bad-bot-script.php can write to it. If so, try:
    touch .htaccess
    chgrp www .htaccess
    chmod 664 .htaccess
  • Your mail server may not accept PHP-generated mail
  • It may need to be configured
  • If you have locked everyone out, try adding the following lines near the top of your .htaccess file:

    order allow,deny
    allow from all

    though they should already be present in the httpd.conf of any public Apache server and probably won't be necessary here (at least I think so)
  • You have locked yourself out: this will happen every time you test the system, so be prepared to remove your own ip address from your .htaccess file
  • Your tests add your ip address to the .htaccess file, but you are not locked out: your server probably does not allow the use of .htaccess files

Friday, September 15, 2006

Probable Spam-Bot 59.26.150.110

Host: 59.26.150.110
Agent: - (empty string)
This host visited this week. Note the lack of a user-agent string. Initial searches suggest that this is a spam bot trolling for email addresses.

Friday, September 08, 2006

Another Bad-Bot Falls into Trap

Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows XP)
Host: 63.100.163.70
This bot, disguised as MSIE, tried to rip through one of my sites and ran right into a bot trap. It started off looking like a regular browser: it loaded the site's root index and the page's .css. It didn't load robots.txt.

Nevertheless, several things in combination gave it away as a disrespectful bot or ripper:
  1. It wasn't loading the page's associated binaries, i.e. images and so on
  2. It wasn't loading javascript (everyone knows that MSIE can't do much without it)
  3. It was crawling pages at three pages per second
  4. On closer inspection, it only loaded one of the two .css files on the index page
  5. It tried to follow links that were commented out in the page's mark-up
  6. It ran into a bot trap that a normal user wouldn't see.
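Giveaway number 3, the crawl rate, is easy to check mechanically. A rough sketch in PHP, assuming a combined-format Apache log with the ip in the first field and the timestamp in the fourth; the helper name flag_fast_ips and the sample data are mine:

```php
<?php
// Flag ips that request several pages within the same second of a
// combined-format Apache access log.
function flag_fast_ips($log, $threshold)
{
    $count = array();
    foreach (explode("\n", $log) as $line) {
        // field 1 = ip, field 4 = [day/month/year:hh:mm:ss
        $fields = preg_split('/\s+/', trim($line));
        if (count($fields) < 4) {
            continue;
        }
        $key = $fields[0] . ' ' . $fields[3];
        $count[$key] = isset($count[$key]) ? $count[$key] + 1 : 1;
    }
    $flagged = array();
    foreach ($count as $key => $n) {
        if ($n >= $threshold) {
            list($ip) = explode(' ', $key);
            $flagged[] = $ip;
        }
    }
    return $flagged;
}

// Made-up sample log: one host grabbing three pages in one second.
$log = <<<EOF
63.100.163.70 - - [08/Sep/2006:10:00:01 -0500] "GET /a.html HTTP/1.1" 200 512
63.100.163.70 - - [08/Sep/2006:10:00:01 -0500] "GET /b.html HTTP/1.1" 200 512
63.100.163.70 - - [08/Sep/2006:10:00:01 -0500] "GET /c.html HTTP/1.1" 200 512
10.0.0.5 - - [08/Sep/2006:10:00:01 -0500] "GET /index.html HTTP/1.1" 200 1024
EOF;

print_r(flag_fast_ips($log, 3)); // prints an array containing 63.100.163.70
?>
```

Against a real log, anything this prints is a candidate for closer inspection, not an automatic ban.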

Whois says 63.100.163.70 belongs to:

UUNET Technologies, Inc.
22001 Loudoun County Parkway
Ashburn, VA, 20147, US

Spammer Jeremy Jaynes' Conviction Upheld

Jaynes' nine-year sentence for violating the US state of Virginia's anti-spam law was upheld by the Court of Appeals of Virginia on September 5, 2006.

The North Carolina man was originally convicted of illegally flooding AOL customers with bulk email ads.

Jaynes is out on a million dollar bond, but the Virginia Attorney General's office would like the bond revoked so that Jaynes can start to serve his sentence.

Let's hope so.

Tuesday, September 05, 2006

My Favorite Emacs Feature

Currently, my favorite Emacs feature has got to be its ability to edit remote files. This feature is so handy for editing web sites, a daily task for me.

Most web hosting companies do not give their clients command-line access (i.e. ssh) unless they buy one of the more expensive packages, something I cannot afford. The basic packages do, however, usually allow ftp access. And there it is: Emacs can use ftp! No more crappy control panels and even crappier online text editors!

All one does is open a buffer in the usual way:
C-x C-f
The file path then needs to begin with the ftp account, in the format:
/user_name@your_site.com: followed by:
path/to/your/file

The complete minibuffer looks something like:
/user_name@your_site.com:path/to/your/file

After pressing the enter key, you will be prompted for a password if needed, and you're off to the races. Editing files on the remote location is now seamlessly integrated into your current Emacs session.

FYI, Emacs is using 'ange-ftp', which keeps most of the ftp machinery hidden in the background.
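The same path syntax also handles anonymous ftp (the host below is just an example); ange-ftp logs in as 'anonymous' and supplies an email address as the password:

```
C-x C-f /anonymous@ftp.gnu.org:/gnu/
```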

Way too easy!!

For some documentation, try in Emacs:

C-h i
m Emacs
m Files
m Remote Files

Monday, September 04, 2006

A Simple PHP based Bad-Bot Trap

A very simple bad-bot trap that catches both bots that ignore your robots.txt, and site rippers who don't read the robots.txt. There are many versions of this trap out there. This one is not particularly sophisticated, but it works.
Use with care to make sure you don't shut out visitors that you do want, or worse, shut down your site.
If you don't understand the following code, don't use it!
P.S. You can use it, but you cannot post it elsewhere. Copyright 2006.
What you need:
  1. PHP enabled site
  2. Ability to incorporate robots.txt
  3. Ability to incorporate .htaccess files on your site
  4. Ability to send email via PHP
  5. Stamina to monitor your logs and .htaccess file
The following files are created or edited under your web root directory, typically public_html/ on shared web hosting services:
robots.txt
.htaccess
badbots.php
bad-bot-script.php
index.php (or index.html)
  1. Add the following lines (or appropriate version) to your robots.txt:

    User-agent: *
    Disallow: /badbots.php

  2. Create the following file: badbots.php
    <?php
    header("Content-type: text/html; charset=utf-8");
    echo '
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
    ';
    ?>

    <html xmlns="http://www.w3.org/1999/xhtml">

    <head>

    <title>Bad-Bots and Rippers Denied</title>
    <meta name="author" content="seven-3-five.blogspot.com 2006-09-04" />

    </head>

    <body>

    <p>whatever message you would like the scum to see</p>

    <?php
    include 'bad-bot-script.php';
    ?>

    </body>
    </html>
  3. Create the following file: bad-bot-script.php
    <?php
    // author: seven-3-five, 2006-09-04, seven-3-five.blogspot.com
    //this script is the meat and potatoes of the bot-trap
    // 1. It sends you an email when the page /badbots.php is visited.
    //The email contains various info about the visitor.
    //2. It adds the directive
    //'deny from $ip' ($ip being the visitor's ip address)
    //to the bottom of your .htaccess file.

    // SERVER VARIABLES USED TO IDENTIFY THE OFFENDING BOT

    $ip = $_SERVER['REMOTE_ADDR'];
    $agent = $_SERVER['HTTP_USER_AGENT'];
    $request = $_SERVER['REQUEST_URI'];
    $referer = $_SERVER['HTTP_REFERER'];

    // CONSTRUCT THE EMAIL MESSAGE

    $subject = 'bad-bots';
    $email = 'your_email@your_site.com'; //edit accordingly
    $to = $email;
    $message ='ip: ' . $ip . "\r\n" .
    'user-agent string: ' . $agent . "\r\n" .
    'requested url: ' . $request . "\r\n" .
    'referer: ' . $referer . "\r\n"; // often is blank

    $message = wordwrap($message, 70);

    $headers = 'From: ' . $email . "\r\n" .
    'Reply-To: ' . $email . "\r\n" .
    'X-Mailer: PHP/' . phpversion();

    // SEND THE MESSAGE

    mail($to, $subject, $message, $headers);

    // ADD 'deny from $ip' TO THE BOTTOM OF YOUR MAIN .htaccess FILE

    $text = 'deny from ' . $ip . "\n";
    $file = '.htaccess';

    add_badbot($text, $file);

    // Function add_badbot($text, $file_name): appends $text to $file_name
    // make sure PHP has permission to write to $file_name

    function add_badbot($text, $file_name) {
    $handle = fopen($file_name, 'a');
    fwrite($handle, $text);
    fclose($handle);
    }

    ?>
  4. Add the following html soon after the <body> tag of your main /index.php (or index.html) page:

    <p style="color:white;background:white;height:0;visibility:collapse;">
    <a href="badbots.php" >.</a>
    </p>
  5. Test thoroughly
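A hardening note on step 3: the script appends $_SERVER['REMOTE_ADDR'] to .htaccess without any checking. A defensive variant of the append step (my sketch, not part of the original; the name add_badbot_safe is invented here) refuses anything that does not look like a plain ip address and checks that the file is writable:

```php
<?php
// Append 'deny from $ip' to $file_name, but only when $ip parses as
// a dotted-quad ipv4 address, so an unexpected value can never inject
// arbitrary .htaccess directives.
function add_badbot_safe($ip, $file_name)
{
    if (!preg_match('/^\d{1,3}(\.\d{1,3}){3}$/', $ip)) {
        return false; // not a plain ip address; write nothing
    }
    $handle = @fopen($file_name, 'a');
    if ($handle === false) {
        return false; // file missing or not writable
    }
    fwrite($handle, 'deny from ' . $ip . "\n");
    fclose($handle);
    return true;
}
?>
```

The main script would call add_badbot_safe($ip, '.htaccess') in place of the add_badbot call; either way you still get the email, so nothing is silently lost.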
Possible issues depending on your server ...
  • You may have to create an empty .htaccess file if your site does not already have one
  • You may have to adjust the permissions for .htaccess so that bad-bot-script.php can write to it. If so, try:
    touch .htaccess
    chgrp www .htaccess
    chmod 664 .htaccess
  • Your mail server may not like PHP generated mail
  • Your mail server may need to be configured
  • If you are locking out everyone, try adding the following two lines near the top of your main .htaccess file:
    order allow,deny
    allow from all
    though they should appear in the httpd.conf file on any public Apache Server and shouldn't be necessary here (I believe...)
  • You have locked yourself out: this will happen every time you test the system, so be prepared to remove your ip from your .htaccess file.
  • Your test adds your ip to the .htaccess file, but you still have access: your server may not be configured to allow use of .htaccess files.

What Happens:

A bad-bot grabs your robots.txt file and either ignores the file's directives or uses that information to find stuff. If the bot follows the link to /badbots.php, the bad-bot-script.php fires, writing the visitor's ip to your .htaccess file and sending you an email to that effect. The bad-bot can no longer traverse your site.

Alternately, a ripper visits your site and starts to download everything it can find. It will quickly stumble upon the /badbots.php link on your /index.*. Once it visits /badbots.php, it will be unable to download any more of your stuff, just like in the previous example.

Once the bot or ripper discovers it is locked out, it may thrash about a bit, trying to retrieve any largish file it has a URL for, but it will just be denied access, getting a 403 code and nothing else, and will quickly move on.

Variations: endless

Sunday, September 03, 2006

Matthew Garrett Moves to Ubuntu

Too bad that Garrett has left the Debian project ... but to Ubuntu? Sure, there is a lot of old-school flaming at Debian, but Ubuntu is kinda creepy in a sort of scientology way.

And what's with the Mandela connection?

I tried out Ubuntu 6.06.1 desktop on my old iBook (G4). The live CD worked fine, actually really well. But for some reason, after booting back into Debian, I no longer need to use the "fn" key to access F1-F12. Weird, huh?

I then tried the Ubuntu CD on a Mac mini (G4), and Dapper didn't recognize the bluetooth keyboard or mouse, so no go.

Saturday, September 02, 2006

Java Site Ripper

A Java browser,
Host: 64.105.113.204
Agent: Java/1.5.0_04
tried to rip through one of my sites last week and ran right into a bot trap. Now both the user-agent and the ip are banned:

deny from 64.105.113.204

RewriteCond %{HTTP_USER_AGENT} ^Java
RewriteRule ^.* - [F,L]

psycheclone as MSIE

The bad bot that originally appeared as psycheclone last June had another go at one of my sites, now disguised as MSIE.
Host: 208.66.195.3
Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.0.3705; .NET CLR 1.1.4322).
Too bad you tried to grab robots.txt: 403. Then a video file: still 403!