Thursday, May 31, 2007

TMCrawler

Host: 128.241.20.206
Agent: TMCrawler

This bad-bot visited a site, grabbed the robots.txt twice and started to follow links at a very leisurely pace: just sub-directories over a couple of hours.

Then it started to explore a directory in a systematic fashion. It picked a sub-directory and began a numerical search: /directory/0, then /directory/1, and so on. I caught this bad-bot quickly because I have any 404 errors emailed to me immediately. I suspect it would have tripped a trap soon enough though.
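For anyone who would rather spot this pattern in a log than wait for the 404 emails, here is a rough sketch in Python. It assumes you have already pulled the list of requested paths that returned 404 out of your access log; the function name and the `min_run` threshold are my own inventions, not part of any tool.

```python
import re
from collections import defaultdict

# Matches paths whose last segment is a bare integer, e.g. /directory/7
NUMERIC_TAIL = re.compile(r"^(?P<prefix>.*/)(?P<num>\d+)$")


def find_numeric_probes(paths, min_run=3):
    """Return {prefix: sorted numbers} for prefixes probed with at least
    `min_run` consecutive integers (0, 1, 2, ...)."""
    seen = defaultdict(set)
    for path in paths:
        match = NUMERIC_TAIL.match(path)
        if match:
            seen[match.group("prefix")].add(int(match.group("num")))

    probes = {}
    for prefix, numbers in seen.items():
        ordered = sorted(numbers)
        run = longest = 1
        # Track the longest streak of consecutive integers under this prefix
        for previous, current in zip(ordered, ordered[1:]):
            run = run + 1 if current == previous + 1 else 1
            longest = max(longest, run)
        if longest >= min_run:
            probes[prefix] = ordered
    return probes


if __name__ == "__main__":
    not_found = ["/directory/0", "/directory/1", "/directory/2", "/favicon.ico"]
    print(find_numeric_probes(not_found))
```

A run of three is an arbitrary cut-off; raise it if your site legitimately has numbered URLs.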

Others have reported this bot with a "WTF is this thing doing?"

Based on its activities, I have simply decided to deny it access to my sites.

Wednesday, May 23, 2007

Beware of Telus

The future is not friendly. At least if you have an account with Telus, based in Alberta, Canada.

Telus offers telephone, cell-phone and internet services for individuals and businesses in Alberta and British Columbia, Canada; and web-hosting for anyone who will sign up with a credit card.

Just try to close your account at the end of your service agreement, however. Telus keeps on charging your credit card, sometimes cancelling your service, sometimes not.

I've had three different accounts with Telus, web-hosting, telephone, and internet, and when I've tried to close them it was the same each time:

Each time the service agreement was at an end, I gave them appropriate notice that I would be cancelling service. In each case, Telus either continued to charge my credit card monthly for the canceled and disconnected service, or sent me a monthly bill for advance charges for services.

With the web-hosting account, Telus eventually did reverse the charges to my credit card: it took several emails on my part to draw their attention to their extra-billing.

When I canceled my phone service, they disconnected the service right on time. Nevertheless, the billing department didn't get the news: they continued to send me a monthly bill for the service. It took a letter of complaint to get the phone charges reversed.

I canceled my internet over the phone. They said no problem and sent me another bill for the upcoming month. I sent a notice in writing describing the extra billing, and re-affirming my notice of cancellation. They sent me a bill showing the outstanding amount from the previous bill and new charges for the upcoming month, with a threat to disconnect service if I didn't pay up soon! Egads!


If this extra-billing is happening to me, I would expect that it is happening to many other Telus customers. It is too much of a coincidence that they would continue to bill on a closed account on three separate occasions.

Canceling a phone or internet service should not be this much work. Service agreement ends. Customer calls and cancels service. Company says thank you for your business, rather than continuing to bill you.

Telus, your future will go down the tubes. Ask anyone on the street, your customer service is abhorrent and offensive.

Sunday, March 25, 2007

Bad-Bot Trap Revisited

There seems to be an increase of late in bad-bot activity with new ips, and new user-agents. So I thought I would add a couple of ideas to flesh out my original A Simple PHP based Bad-Bot Trap that seems to be rather popular.

I'm flattered that people are posting and collecting the links to this blog. But some scum are stealing the articles and posting them on their own sites: you who do so will receive instant and debilitating bad karma as a result. Furthermore, you do not have permission to do so. May you experience endless server and php errors.

Just link, it's nicer. If you have comments or suggestions, go for it.

I offer the following with the standard disclaimer: If you don't understand the code, don't use it!

We can notice that the bots tend to follow the links on a page in one of three fairly predictable ways: top down, alphabetically ascending, and alphabetically descending. If we wish to trap a bad-bot early on in its travels through our site, we can easily set traps for each possibility using the original bad-bot trap and a little .htaccess magic.

First add the following rules to the robots.txt under User-agent: *

Disallow: /afile.html
Disallow: /zfile.html
Disallow: /nofile.html

Then add to .htaccess:

# set 'RewriteEngine On' if you haven't already
# redirect badbots
RewriteRule ^afile.* /badbots.php [L]
RewriteRule ^zfile.* /badbots.php [L]
RewriteRule ^nofile.* /badbots.php [L]
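To see which URLs these three rules would actually catch, here is a small Python sketch that replays the same patterns. It assumes the rules sit in the .htaccess at the document root, where mod_rewrite matches against the path without its leading slash; `hits_trap` is a made-up helper name.

```python
import re

# The three patterns copied from the RewriteRule lines above
TRAP_PATTERNS = [re.compile(p) for p in (r"^afile.*", r"^zfile.*", r"^nofile.*")]


def hits_trap(url_path):
    """True if url_path would be rewritten to /badbots.php.

    Assumption: per-directory (.htaccess) context at the document root,
    so the leading slash is stripped before matching.
    """
    relative = url_path[1:] if url_path.startswith("/") else url_path
    return any(pattern.match(relative) for pattern in TRAP_PATTERNS)


if __name__ == "__main__":
    for path in ("/afile.html", "/index.html", "/afiles-and-more.html"):
        print(path, hits_trap(path))
```

Note that the patterns are loose: anything at the top level that merely starts with afile, zfile or nofile is caught, so pick trap names that do not collide with real pages.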


Now we have three different traps to embed in our pages:

<p style="color:white;background:white;height:0;visibility:collapse;"
onclick="return false" >
<a href="/badbots.php" >.</a>
</p>

which can go at the top of the page

<p style="color:white;background:white;height:0;visibility:collapse;"
onclick="return false" >
<a href="/afile.html" >.</a>
</p>

and

<p style="color:white;background:white;height:0;visibility:collapse;"
onclick="return false" >
<a href="/zfile.html" >.</a>
</p>

which can go pretty much anywhere on a page.

The traps should be self-evident as to their use. The Disallow: /nofile.html exclusion was added for that particular species of bad-bot that uses the robots.txt to find links.
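Before going live, it is worth checking that a well-behaved bot really is told to stay away from every trap, otherwise you will ban the good guys too. A minimal Python sketch using the standard library's robots.txt parser follows; the `ROBOTS_TXT` text is assembled from the rules in this post plus the original /badbots.php rule, and `unprotected_traps` is a hypothetical helper.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /badbots.php
Disallow: /afile.html
Disallow: /zfile.html
Disallow: /nofile.html
"""

TRAP_URLS = ["/badbots.php", "/afile.html", "/zfile.html", "/nofile.html"]


def unprotected_traps(robots_txt, trap_urls, agent="*"):
    """Return the trap URLs that a rule-following bot would still be allowed to fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in trap_urls if parser.can_fetch(agent, url)]


if __name__ == "__main__":
    # An empty list means every trap is covered by a Disallow rule
    print(unprotected_traps(ROBOTS_TXT, TRAP_URLS))
```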

Happy Trapping!

Tuesday, March 20, 2007

Internet Explorer 7.0 (MSIE 7.0)

So, after spending some time with Internet Explorer 7.0 (MSIE 7.0), I can't decide if it's a move sideways or backwards. From a user's perspective, it certainly is prettier than MSIE 6. Unfortunately, its 'security' features get annoying really fast. Try it and you'll see what I mean.

From a web developer's perspective, it's a pain in the butt. Though some of the problems of MSIE 6 have been addressed in 7, a whole new set of problems needs to be dealt with: web pages that have hacks to get 6 to behave have to be re-hacked now to get 7 to behave.

Hey Microsoft: can't you guys figure out how to handle 'float' properly? And javascript, sorry JScript, don't get me started. What the hell's the problem? Mozilla figured this stuff out long ago. Were you not at the table when the standards were developed? My god, CSS 2 is almost 10 years old! It's as if you purposely undermine standards that you were involved in establishing, by releasing broken software like MSIE 7, in order to undermine your competition. Truly evil, Microsoft.

Do some bug comparisons between 6 and 7 if you like. A good place to start is http://www.gtalbot.org/BrowserBugsSection/

So now we, as developers, have to maintain 3 sets of pages: one for browsers that actually work the way they are supposed to (or at least try to address their bugs and short-comings in an open and timely manner), one for MSIE 6, and one for MSIE 7.

Egads, I feel dirty every time I have to use something created by Microsoft.

Sunday, March 18, 2007

Trapped Bad-Bots

2007-10-22
Host: 74.53.249.34
Agent: Mozilla/5.0 (compatible; LiteFinder/1.0; +http://www.litefinder.net/about.html)

2007-10-18
Host: 99.238.107.208
Agent: Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)

2007-10-08
Host: 213.189.25.182
Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET

2007-10-06
Host: 82.99.30.27
Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)

2007-10-05
Host: 82.99.30.32
Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)

2007-10-05
Host: 131.107.0.95
Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; WOW64; SV1)

2007-10-03
Host: 67.19.250.26
Agent: Mozilla/5.0 (compatible; Gigamega.bot/1.0; +http://www.gigamega.net/bot.html)

2007-10-02
Host: 82.99.30.10
Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)

2007-09-10
Host: 207.46.55.27/30
Agent: MSNPTC/1.0 (stupid ms bot can't parse robots.txt properly)

2007-07-13
Host: 218.231.136.5
Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows XP; DigExt)

2007-07-10
Host: 38.100.41.112
2007-07-06
Host: 209.85.94.164
Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows XP)

Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; +http://process4.com) Gecko/20070508 Firefox/1.5.0.12
stupid bot grabbed the robots.txt, then the first link listed in its exclusion list


2007-07-03
Host: 74.208.71.84
Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)

2007-06-27
Host: 63.251.174.4
Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)


2007-06-26
Host: 24.87.89.186
Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)

2007-06-22
Host: 64.92.199.41 and 64.92.199.41 (they have the whole block actually, I'm just banning the agent for a while)
Agent: libwww-perl/5.805

Host: 64.92.199.42
Agent: libwww-perl/5.805

2007-06-18
Host: 81.223.254.34
Agent: Mozilla/4.0 (compatible; MSIE 5.01; Windows NT)

2007-06-15
Host: 201.5.229.201
Agent: Mozilla/3.0 (compatible; WebCapture 1.0; Auto; Windows)

2007-06-14
Host: 208.99.195.54
Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727)

2007-06-05
Host: 202.179.180.42
Agent: Mozilla/4.0 (compatible; NaverBot/1.0; http://help.naver.com/delete_main.asp)

2007-05-26
Host: 24.242.34.213
Agent: MJ12bot/v1.2.0 (http://majestic12.co.uk/bot.php?+)

2007-05-25
Host: 84.88.32.199
Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0) Opera 7.23[ca]

2007-05-10
Host: 220.181.34.177
Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; QihooBot 1.0 qihoobot@qihoo.net)

2007-04-26
Host: 65.222.176.124
Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows XP)

2007-04-24
Host: 212.219.190.178
Agent: Mozilla/4.0 (compatible; MSIE 5.0; Windows NT)

Host: 207.115.69.215
Agent: Mozilla/4.0/ (compatible- MSIE 6.0- Windows NT 5.1- SV1- .NET CLR 1.1.4322; ; )

Host: 65.222.176.125
Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows XP)

Host: 203.162.3.157
Agent: -

Host: 222.254.232.24
Agent: -

Host: 66.199.236.50
Agent: Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 6.0)

Host: 69.84.207.39
Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50215)

Host: 208.223.208.181
Agent: Python-urllib/1.16

Host: 208.53.147.89
Agent: Mozilla/3.0 (compatible; NetPositive/2.2)

Host: 70.87.196.242
Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.10) Gecko/20050716 Firefox/1.0.6


Host: 65.222.176.122
Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows XP)


Host: 84.69.146.235
Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)

Host: 84.70.209.45
Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)


Host: 38.100.41.105
Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows XP)

Host: 38.100.41.102
Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows XP)

The related block of IPs hosting bad-bots that I've seen so far is in the range 38.100.41.100 - 38.100.41.107
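If you want to deny that whole range rather than one address at a time, the range can be turned into CIDR blocks. A quick Python sketch with the standard `ipaddress` module follows; the `deny_lines` format mirrors the 'deny from' directive used by the trap script, and whether to block the range at all is your call.

```python
import ipaddress

first = ipaddress.ip_address("38.100.41.100")
last = ipaddress.ip_address("38.100.41.107")

# summarize_address_range yields the CIDR blocks that exactly cover the range
blocks = list(ipaddress.summarize_address_range(first, last))

# One 'deny from' line per block, in the style the trap script writes
deny_lines = [f"deny from {block}" for block in blocks]

if __name__ == "__main__":
    print("\n".join(deny_lines))
```

The range does not start on an 8-address boundary, so it comes out as two /30 blocks rather than a single /29.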

Host: 88.198.7.39
Agent: findfiles.org/0.9 (Robot;robot@findfiles.org)

Host: 72.21.50.202
Agent: Mozilla/4.0 (compatible; MSIE 5.01; MSNIA; Windows 98)

Host: 65.222.176.123
Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows XP)

Host: 88.151.114.39
Agent: Mozilla/5.0 (compatible; Webbot/0.1; http://www.webbot.ru/bot.html)

Saturday, December 16, 2006

BadVista.org

Before buying a new computer or moving to Vista, read what many techies already know and believe about the system at badvista.fsf.org

Thursday, October 19, 2006

MSIE 7 More Security Problems

Secunia.com has identified a security problem with the latest release of MSIE 7 which "can be exploited by malicious people to disclose potentially sensitive information." See details and sample test at http://secunia.com/advisories/22477.

Best to use the better browsers like Firefox and Opera.

Monday, October 09, 2006

BBC HoneyPot

The recent BBC article by Mark Ward describing an unprotected computer/honeypot set-up is nothing but a piece produced to create F.U.D. He describes how an unprotected XP computer is attacked repeatedly when connected to the internet. Of course, as with most tech articles produced by the BBC, the only operating system that seems to exist is Microsoft Windows.

The weaknesses in his article are explored on slashdot.org, so I won't rehearse them here.

Perhaps more interesting is the BBC/Microsoft memorandum of understanding "that aims to identify 'common interests' between the BBC and Microsoft. Areas for collaboration include search and navigation, distribution, and content enablement."

To speculate purely on the relationship between BBC tech articles and the MS/BBC agreement:

Microsoft is going to have a hard time selling its upcoming release of the Vista system, specifically getting users of XP to upgrade and bringing ex-Microsoft users back to the fold (for example, all the college kids that bought new Apple laptops this year). MS will probably market the new system's "security" features as a main selling point.

Articles like the one produced by the BBC, that begin to explore the all too well known security problems in current Microsoft software, help prepare the marketplace for a new "secure" system, and condition consumers to see security as a need. The new Vista OS will then present itself as the only viable solution to the problem.

Again, pure speculation. Nevertheless, when visiting the Vista site on microsoft.com, there rarely is a page that does not mention security in some context. BBC articles on computer technology focus very heavily on the MS OS, almost to the exclusion of others.

Saturday, October 07, 2006

PSI/Cogent yet again

Host: 38.100.41.107
Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows XP)

It's getting tempting to just block everything in the range from 38.0.0.0 to 38.255.255.255, because the only visitors I've ever seen from this range are scum spam bots. Typical user-agents include Snapbot, voyager, cfetch, Java, as well as MSIE poser bots. They always run into a trap though, as did the one listed here, and that keeps it fun.

Whois:

OrgName: Performance Systems International Inc.
OrgID: PSI
Address: 1015 31st St NW
City: Washington
StateProv: DC
PostalCode: 20007
Country: US

The New iPOD WOW!! (or not)

Red Hot Chili Peppers, U2, and now . . . wait for it . . .
Tetris!!

Sorry Steve, iPod has officially lost its cool.

Perhaps the movie business will save ya.

Tuesday, October 03, 2006

Nusearch Spider

Agent: Nusearch Spider (www.nusearch.com)
Host: 84.9.136.223

The Nusearch Spider dropped by for a visit. It only followed the first ten top-level html links. It was not interested in going to other directories, and even tried to load directory-names as files by dropping the trailing "/" then ignoring the resulting redirect.

The Spider obeyed some of the directives in the robots.txt, but not all. My guess is that it's a configuration issue with the bot at this time. We will be watching them to see what's up.

I dropped by their site, nusearch.com, and it's yet another search engine promising to be better than the other guys. Ya, whatever, but they need to get their bot under better control and clean up its blacklist status if they want us to allow them to crawl our sites.

Saturday, September 30, 2006

Missigua Spam Bot

Host: 70.86.142.210
Agent: Missigua Locator 1.9

This bot reads links from top to bottom on the page. After visiting the document root on a site, it quickly ran into a bot trap: in fact, the first link it followed. It then tried all the links on the page, but of course, was denied access. It did not read the robots.txt before trying to crawl the site.

Whois Record

OrgName: ThePlanet.com Internet Services, Inc.

OrgID: TPCM
Address: 1333 North Stemmons Freeway
Address: Suite 110
City: Dallas
StateProv: TX
PostalCode: 75207
Country: US

Thursday, September 21, 2006

Setting Bad-Bot Traps

In A Simple PHP Based Bot Trap, I presented a fairly simple script for trapping, well really excluding, robots or site rippers that either ignore or surreptitiously use a site's robots.txt file.

The trap was set by adding a link to an excluded file (via robots.txt) on the main index page as bait for the bad-bot. The trap is hidden from the regular user through CSS styles and by using a dot (.) in-between the anchor tags.

The weakness in this trap's method of camouflage is that the trap could still be tripped by users of text or non-visual browsers.

There are two variations on this trap that can be used in conjunction with or instead of the original.

The first is simply replacing the dot with an image file, 1px by 1px, that is the same colour as the page's background. Use of the alt attribute can help identify the trap to legitimate users of text-browsers. For example:

<p>
<a href="/badbots.php">
<img src="small_image.gif" alt="do not follow" />
</a>
</p>


The second variation is to use html comments, hiding the link from everybody except that particular species of bot that will try to follow anything that even remotely resembles a link:

<!-- <a href="/badbots.php">look, a link!</a> -->
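To illustrate why the commented-out link only catches the crude bots, here is a Python sketch comparing a real HTML parser with a naive scan for href attributes. The `PAGE` markup and both function names are made up for the example; real bots will of course vary.

```python
import re
from html.parser import HTMLParser

PAGE = """<body>
<a href="/about.html">About</a>
<!-- <a href="/badbots.php">look, a link!</a> -->
</body>"""


class LinkCollector(HTMLParser):
    """Collects href values from real <a> tags; comment contents are not parsed as markup."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(value for name, value in attrs if name == "href")


def links_seen_by_parser(html):
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


def links_seen_by_naive_scan(html):
    # A crude bot that greps for href="..." without parsing the markup
    return re.findall(r'href="([^"]+)"', html)


if __name__ == "__main__":
    print(links_seen_by_parser(PAGE))
    print(links_seen_by_naive_scan(PAGE))
```

A browser or a properly written crawler behaves like the parser and never sees the trap; the grep-style scan walks straight into it.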

Placement of the traps can vary also. Bots do not necessarily follow links in the order found on the page, nor do they necessarily enter a site through the main index page. Traps can be placed soon after the <body> tag, near the bottom of the page, or within a list of links, such as a navigation bar. If the site has a complicated hierarchy of nested folders, laying traps at different depths may also yield results.

Tuesday, September 19, 2006

Performance Systems International Inc. Bot

Agent: Java/1.6.0-beta2 and Mozilla/5.0 (compatible; MSIE 6.0; Windows NT 5.0)
Host: 38.99.203.110

Whois Record
OrgName: Performance Systems International Inc.
OrgID: PSI
Address: 1015 31st St NW
City: Washington
StateProv: DC
PostalCode: 20007
Country: US

This bot visited an unprotected site last July. It grabbed the robots.txt and then proceeded to download every link on the site, including javascript files.
It is now blocked by ip and user-agent (^Java).

It has since revisited the site twice. Today it tried to grab robots.txt and was sent a 403 code (denied access) as Java/1.6.0-beta2. After a four-second delay, it changed its user-agent string to Mozilla/5.0 (compatible; MSIE 6.0; Windows NT 5.0) and tried to grab the main index page. Again denied, it seems to have moved on.
It appears to be hosted by Cogent Communications.
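
The quick change of user-agent from one ip is an easy tell to look for in a log. Here is a minimal Python sketch, assuming the log has already been parsed into (unix timestamp, ip, user-agent) tuples; the function name and the 10 second window are assumptions of mine.

```python
from collections import defaultdict


def agent_switchers(requests, window_seconds=10):
    """Return the IPs that sent two different user-agent strings within window_seconds.

    `requests` is an iterable of (unix_timestamp, ip, user_agent) tuples.
    """
    by_ip = defaultdict(list)
    for timestamp, ip, agent in requests:
        by_ip[ip].append((timestamp, agent))

    flagged = set()
    for ip, hits in by_ip.items():
        hits.sort()
        # Compare each request with the next one from the same ip
        for (t1, agent1), (t2, agent2) in zip(hits, hits[1:]):
            if agent1 != agent2 and t2 - t1 <= window_seconds:
                flagged.add(ip)
                break
    return flagged


if __name__ == "__main__":
    sample = [
        (0, "38.99.203.110", "Java/1.6.0-beta2"),
        (4, "38.99.203.110", "Mozilla/5.0 (compatible; MSIE 6.0; Windows NT 5.0)"),
    ]
    print(agent_switchers(sample))
```

Shared proxies can legitimately show several agents from one address, so treat a hit as a reason to look closer rather than an automatic ban.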

Saturday, September 16, 2006

An Elegant Way to Trap Ill-Mannered Robots

Hello Spitfire of forum.phpbb-fr.com! Thanks for the translation!

A very simple bad-bot trap that catches both the robots that ignore robots.txt and the site rippers that do not read robots.txt.
There are many versions of this trap. This one is not particularly sophisticated, but it works.
Use it with care to be sure you do not shut out visitors you want or, worse, crash your site.
If you do not understand the code below, do not use it.

Requirements:

  1. Hosting that supports PHP
  2. Ability to incorporate robots.txt
  3. Ability to incorporate .htaccess on your site
  4. Ability to send emails via PHP
  5. Stamina to monitor your logs and .htaccess file

The following files are to be created or edited and uploaded to the root of your site:

  1. robots.txt
  2. .htaccess
  3. badbots.php
  4. bad-bot-script.php
  5. index.php (or index.html)

1. Add the following lines to your robots.txt file:

User-agent: *
Disallow: /badbots.php

2. Create the following file: badbots.php

<?php
header("Content-type: text/html; charset=utf-8");
echo ' <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> ';
?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Bad-Bots and Rippers Denied</title>
<meta name="author" content="seven-3-five.blogspot.com 2006-09-04" />
</head>
<body>
<p>whatever message you would like the scum to see</p>
<?php
include 'bad-bot-script.php';
?>
</body>
</html>

3. Create the following file: bad-bot-script.php

<?php
/* author: seven-3-five, 2006-09-04, seven-3-five.blogspot.com
* Thanks to Spitfire for the French translation
* This script is the meat and potatoes of this bot trap
* 1. It sends you an email when the page /badbots.php is visited.
* The email contains various info about the visitor
* 2. It adds the directive
* 'deny from $ip' ($ip being the visitor's ip address)
* to the bottom of your .htaccess file */

/* SERVER VARIABLES USED
* TO IDENTIFY THE OFFENDING BOT */

$ip = $_SERVER['REMOTE_ADDR'];
// these two headers are optional, so fall back to an empty string
$agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
$request = $_SERVER['REQUEST_URI'];
$referer = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '';

// CONSTRUCT THE EMAIL MESSAGE

$subject = 'bad-bots';
$email = 'your_email@your_site.com';
$to = $email;
$message ='ip: ' . $ip . "\r\n" .
'user-agent string: ' . $agent . "\r\n" .
'requested url: ' . $request . "\r\n" .
'referer: ' . $referer . "\r\n";
// referer is often blank

$message = wordwrap($message, 70);
$headers = 'From: ' . $email . "\r\n" .
'Reply-To: ' . $email . "\r\n" .
'X-Mailer: PHP/' . phpversion();


// SEND THE MESSAGE


mail($to, $subject, $message, $headers);

/* ADD 'deny from $ip'
* TO THE BOTTOM OF YOUR .htaccess FILE */

$text = 'deny from ' . $ip . "\n";
$file = '.htaccess';
add_badbot($text, $file);

/* Function
* add_badbot($text, $file_name): appends $text to $file_name
* Make sure PHP has permission to write to $file_name */

function add_badbot($text, $file_name)
{
$handle = fopen($file_name, 'a');
fwrite($handle, $text);
fclose($handle);
}
?>

4. Add the following code after the <body> tag of your index page, index.php or index.html:

<p style="color:white;background:white;height:0;visibility:collapse;">
<a href="badbots.php" >.</a>
</p>


5. Test it thoroughly



What will happen?

A bad bot reads the robots.txt file and either ignores the directives or uses that information. If the bot follows the link to /badbots.php, the bad-bot-script.php script fires, writes the visitor's IP address to your .htaccess file and notifies you by email. The bad bot can no longer crawl the site.

Another possibility: a site ripper visits your site and starts downloading everything it finds. It will quickly stumble upon the /badbots.php link on your index page. Once it has visited that link, it will be unable to download anything else, as in the previous example.

Possible issues, depending on your server:

  • You may have to create an empty .htaccess if your site does not already have one
  • You may have to set the permissions on .htaccess so that bad-bot-script.php can write to it. If so, try:
    touch .htaccess
    chgrp www .htaccess
    chmod 664 .htaccess
  • Your mail server may not accept mail generated by PHP
  • It may need to be configured
  • If you have locked everyone out, try adding the following lines to the top of your .htaccess file

    order allow,deny
    allow from all

    though they should be present in the httpd.conf file of any public Apache server and will probably not be necessary here (at least I think so)
  • You have locked yourself out: this will happen every time you test the system, so be prepared to remove your IP address from your .htaccess file
  • Your tests add your IP address to the .htaccess file, but you are not locked out: your server probably does not allow the use of .htaccess files

Friday, September 15, 2006

Probable Spam-Bot 59.26.150.110

Host: 59.26.150.110
Agent: - (empty string)
This host visited this week. Note the lack of a user-agent string. Initial searches suggest that this is a spam-bot trolling for email addresses.

Friday, September 08, 2006

Another Bad-Bot Falls into Trap

Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows XP)
Host: 63.100.163.70
This bot, disguised as MSIE, tried to rip through one of my sites, and ran right into a bot trap. It started off looking like a regular browser: it loaded the site's root index and the page's .css. It didn't load the robots.txt.

Nevertheless, several things in combination gave it away as a disrespectful bot or ripper:
  1. It wasn't loading the page's associated binaries, i.e. images and so on
  2. It wasn't loading javascript (everyone knows that MSIE can't do much without it)
  3. It was crawling pages at three pages per second
  4. On closer inspection, it only loaded one of the two .css files on the index page
  5. It tried to follow links that were commented out in the page's mark-up
  6. It ran into a bot trap that a normal user wouldn't see.
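
A few of these signals are easy to check mechanically. The Python sketch below combines the crawl rate with the missing images and javascript; it assumes you have already collected one visitor's page-request timestamps and know whether it fetched any images or scripts. The 1 page per second cut-off and both function names are hypothetical.

```python
def pages_per_second(timestamps):
    """Average request rate over the span of the given Unix timestamps."""
    if len(timestamps) < 2:
        return 0.0
    span = max(timestamps) - min(timestamps)
    if span == 0:
        # Everything arrived in the same instant; report the raw count
        return float(len(timestamps))
    return (len(timestamps) - 1) / span


def looks_like_ripper(timestamps, fetched_images, fetched_scripts, max_rate=1.0):
    """Combine three of the signals listed above into one yes/no guess."""
    too_fast = pages_per_second(timestamps) > max_rate
    return too_fast and not fetched_images and not fetched_scripts


if __name__ == "__main__":
    hits = [0.0, 1 / 3, 2 / 3, 1.0]  # four pages in one second
    print(pages_per_second(hits), looks_like_ripper(hits, False, False))
```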

Whois says 63.100.163.70 belongs to:

UUNET Technologies, Inc.
22001 Loudoun County Parkway
Ashburn, VA, 20147, US

Spammer Jeremy Jaynes' Conviction Upheld

Jaynes' nine-year conviction for violating US state of Virginia's anti-spam law was upheld by The Court of Appeals of Virginia on Sept. 05-06.

The N. Carolina man was originally convicted of illegally flooding A.O.L. customers with bulk email ads.

Jaynes is out on a million dollar bond, but the Virginia Attorney General's office would like the bond revoked so that Jaynes can start to serve his sentence.

Let's hope so.

Tuesday, September 05, 2006

My Favorite Emacs Feature

Currently, my favorite Emacs feature has got to be its ability to edit remote files. This feature is so handy for editing web-sites, a daily event for me.

Most web hosting companies do not give their clients access to a command line interface, i.e. ssh access, without them buying one of the more expensive packages, something I cannot afford. The basic packages do, however, usually allow ftp access. And there it is. Emacs can use ftp! No more crappy control panels and crappier online text editors!

All one does is open a buffer in the usual way:
C-x C-f
The file path needs then to begin with the ftp account in the format:
/user_name@your_site.com: followed by:
path/to/your/file

The complete minibuffer looks something like:
/user_name@your_site.com:path/to/your/file

After pressing the enter key, you will be prompted for a password if needed, and you're off to the races. Editing files on the remote location is now seamlessly integrated into your current Emacs session.

FYI, Emacs is using 'ange-FTP,' which keeps most of the ftp stuff hidden in the background.

Way too easy!!

For some documentation, try in Emacs:

C-h i
m Emacs
m Files
m Remote Files

Monday, September 04, 2006

A Simple PHP based Bad-Bot Trap

A very simple bad-bot trap that catches both bots that ignore your robots.txt, and site rippers who don't read the robots.txt. There are many versions of this trap out there. This one is not particularly sophisticated, but it works.
Use with care to make sure you don't shut out visitors that you do want, or worse, shut down your site.
If you don't understand the following code, don't use it!
P.S. you can use it but you cannot post it elsewhere. Copyright 2006
What you need:
  1. PHP enabled site
  2. Ability to incorporate robots.txt
  3. Ability to incorporate .htaccess files on your site
  4. Ability to send email via PHP
  5. Stamina to monitor your logs and .htaccess file
The following files are created or edited under your html directory, typically public_html/ on shared web hosting services:
robots.txt
.htaccess
badbots.php
bad-bot-script.php
index.php (or index.html)
  1. Add the following lines (or appropriate version) to your robots.txt:

    User-agent: *
    Disallow: /badbots.php

  2. Create the following file: badbots.php
    <?php
    header("Content-type: text/html; charset=utf-8");
    echo '
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
    ';
    ?>

    <html xmlns="http://www.w3.org/1999/xhtml">

    <head>

    <title>Bad-Bots and Rippers Denied</title>
    <meta name="author" content="seven-3-five.blogspot.com 2006-09-04" />

    </head>

    <body>

    <p>whatever message you would like the scum to see</p>

    <?php
    include 'bad-bot-script.php';
    ?>

    </body>
    </html>
  3. Create the following file: bad-bot-script.php
    <?php
    // author: seven-3-five, 2006-09-04, seven-3-five.blogspot.com
    //this script is the meat and potatoes of the bot-trap
    // 1. It sends you an email when the page /badbots.php is visited.
    //The email contains various info about the visitor.
    //2. It adds the directive
    //'deny from $ip' ($ip being the visitor's ip address)
    //to the bottom of your .htaccess file.

    // SERVER VARIABLES USED TO IDENTIFY THE OFFENDING BOT

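    $ip = $_SERVER['REMOTE_ADDR'];
    // these two headers are optional, so fall back to an empty string
    $agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
    $request = $_SERVER['REQUEST_URI'];
    $referer = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '';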

    // CONSTRUCT THE EMAIL MESSAGE

    $subject = 'bad-bots';
    $email = 'your_email@your_site.com'; //edit accordingly
    $to = $email;
    $message ='ip: ' . $ip . "\r\n" .
    'user-agent string: ' . $agent . "\r\n" .
    'requested url: ' . $request . "\r\n" .
    'referer: ' . $referer . "\r\n"; // often is blank

    $message = wordwrap($message, 70);

    $headers = 'From: ' . $email . "\r\n" .
    'Reply-To: ' . $email . "\r\n" .
    'X-Mailer: PHP/' . phpversion();

    // SEND THE MESSAGE

    mail($to, $subject, $message, $headers);

    // ADD 'deny from $ip' TO THE BOTTOM OF YOUR MAIN .htaccess FILE

    $text = 'deny from ' . $ip . "\n";
    $file = '.htaccess';

    add_badbot($text, $file);

    // Function add_badbot($text, $file_name): appends $text to $file_name
    // make sure PHP has permission to write to $file_name

    function add_badbot($text, $file_name) {
    $handle = fopen($file_name, 'a');
    fwrite($handle, $text);
    fclose($handle);
    }

    ?>
  4. Add the following html soon after the <body> tag of your main /index.php (or index.html) page:

    <p style="color:white;background:white;height:0;visibility:collapse;">
    <a href="badbots.php" >.</a>
    </p>
  5. Test thoroughly
Possible issues depending on your server ...
  • You may have to create an empty .htaccess file if your site does not already have one
  • You may have to adjust the permissions for .htaccess so that bad-bot-script.php can write to it. If so, try:
    touch .htaccess
    chgrp www .htaccess
    chmod 664 .htaccess
  • Your mail server may not like PHP generated mail
  • Your mail server may need to be configured
  • If you are locking out everyone, try adding the following two lines near the top of your main .htaccess file:
    order allow,deny
    allow from all
    though they should appear in the httpd.conf file on any public Apache Server and shouldn't be necessary here (I believe...)
  • You have locked yourself out: this will happen every time you test the system, so be prepared to remove your ip from your .htaccess file.
  • Your test adds your ip to the .htaccess file, but you still have access - your server may not be configured to allow use of .htaccess files.
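
Since you will be locking yourself out during testing, here is a small Python sketch for taking your own ip back out of the .htaccess text. It works on the file's contents as a string, so reading and writing the file is left to you; `remove_deny_line` is a hypothetical helper, and it only matches the exact 'deny from <ip>' lines that the trap script writes.

```python
def remove_deny_line(htaccess_text, ip):
    """Return htaccess_text without any 'deny from <ip>' line for the given ip."""
    target = f"deny from {ip}".lower()
    kept = [
        line
        for line in htaccess_text.splitlines(keepends=True)
        # Exact match on the whole line, so 192.0.2.10 does not remove 192.0.2.100
        if line.strip().lower() != target
    ]
    return "".join(kept)


if __name__ == "__main__":
    text = "order allow,deny\nallow from all\ndeny from 192.0.2.10\n"
    print(remove_deny_line(text, "192.0.2.10"))
```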

What Happens:

A bad-bot grabs your robots.txt file and either ignores the file's directives, or uses that information to find stuff. If the bot follows the link to /badbots.php, the bad-bot-script.php fires, writing the visitor's ip to your .htaccess file and sending you an email to that effect. The bad-bot can no longer traverse your site.

Alternatively, a ripper visits your site and starts to download everything it can find. It will quickly stumble upon the /badbots.php link on your /index.*. Once it has visited /badbots.php, it will be unable to download any more of your stuff, just like in the previous example.

Once the bot or ripper discovers it is locked out, it may thrash about a bit, trying to retrieve any largish file it may have a URL for, but of course it will just be denied access, getting a 403 code and nothing else, and quickly move on.
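
If you want to confirm that a given visitor is on the list without hitting the site yourself, the check can be approximated offline. This Python sketch reads the 'deny from' lines out of the .htaccess text and tests an address against them; it handles only full addresses and CIDR blocks, ignores the other forms Apache accepts as well as the Order directive, and `is_denied` is a made-up name.

```python
import ipaddress


def is_denied(htaccess_text, ip):
    """True if ip matches a 'deny from' line holding a full address or CIDR block.

    Other forms Apache accepts (host names, partial addresses, 'all') are skipped.
    """
    address = ipaddress.ip_address(ip)
    for line in htaccess_text.splitlines():
        words = line.split()
        if len(words) < 3 or [w.lower() for w in words[:2]] != ["deny", "from"]:
            continue
        for value in words[2:]:
            try:
                network = ipaddress.ip_network(value, strict=False)
            except ValueError:
                continue  # not an address or CIDR block; ignore it
            if address in network:
                return True
    return False


if __name__ == "__main__":
    text = "order allow,deny\nallow from all\ndeny from 192.0.2.10\n"
    print(is_denied(text, "192.0.2.10"), is_denied(text, "192.0.2.11"))
```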

Variations: endless