
How to stop spiders?


Luis Muñoz

Friday 18 March 2005 7:28:27 am

Any idea about how to stop bad spiders from entering the site? By bad spiders I refer to the ones which collect email addresses. The problem is that some of them go so fast that they can bring the server down if they strike during peak hours. It would also be good to protect people's email addresses, but masking them is a bit useless: spiders learn so fast that masking soon stops working.

Any idea would be appreciated.
Thanks,
Luis.

Lex 007

Friday 18 March 2005 7:38:28 am

Hello

You can use the wash operator on your emails to obfuscate them.

See this post : http://www.ez.no/community/forum/setup_design/obfuscate_email_addresses
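If you ever need to render addresses outside the templates, here is a rough sketch of the same idea in plain PHP (this is not the wash operator itself, just an illustration): every character is turned into an HTML entity, which defeats harvesters that scan for plain "user@host" text.

// Encode an e-mail address as HTML character entities so that
// harvesters looking for plain "user@host" text do not find it.
function obfuscate_email( $email )
{
  $encoded = '';
  for( $i = 0; $i < strlen( $email ); $i++ )
  {
    $encoded .= '&#' . ord( $email[$i] ) . ';';
  }
  return $encoded;
}

// Example usage (the address is made up):
echo '<a href="mailto:' . obfuscate_email( 'someone@example.com' ) . '">'
     . obfuscate_email( 'someone@example.com' ) . '</a>';

As Luis already noted, the smarter harvesters decode entities, so treat this as a first line of defence only.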

Lex

Luis Muñoz

Friday 18 March 2005 8:03:10 am

My main problem isn't obfuscating email addresses. My main problem is the spider crawling the site at the maximum speed the server supports, which makes the site slow or even blocks the server.

Łukasz Serwatka

Friday 18 March 2005 8:16:54 am

Hi Luis,

Here you can find some info about spider control:
http://www.searchengineworld.com/robots/robots_tutorial.htm

Personal website -> http://serwatka.net
Blog (about eZ Publish) -> http://serwatka.net/blog

Jonathan Dillon-Hayes

Wednesday 23 March 2005 2:57:42 am

Have to second the robots.txt file idea. Basically, you're left with either:

1) hope that they stop
2) put a robots.txt file in there and hope that they stop.

I would do 2.
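For what it's worth, a minimal robots.txt in the site's document root could look something like this (the rule below simply asks every robot to stay out of the whole site):

# Ask all robots to stay away from the entire site
User-agent: *
Disallow: /

You can also list individual folders in separate Disallow lines instead of banning everything.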

Jonathan

---------
FireBright provides advanced eZ deployment with root access
http://www.FireBright.com/

Lex 007

Wednesday 23 March 2005 5:42:58 am

Unfortunately, I don't think the robots.txt would do anything.

If I were a spider programmer, the first steps my program would take would be:
- check if there is a robots.txt
- then immediately visit the "forbidden" folders, because they must be the most interesting ones ...

If you obfuscate all your e-mail addresses, the spiders probably won't come back, because they won't see anything interesting on your site.

Tony Wood

Wednesday 23 March 2005 9:32:44 am

Hi Luis,

You might want to log the IP addresses of the bad spiders and then block them. This will only work, however, on spiders that have a fixed IP.
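A sketch of how that could look at the PHP level (the addresses in the list are only examples); put it at the very top of index.php so blocked spiders never reach the heavier parts of the system:

// IP addresses identified as bad spiders - example values only.
$blocked_ips = array(
  '192.0.2.10',
  '192.0.2.11'
);

if( in_array( $_SERVER['REMOTE_ADDR'], $blocked_ips ) )
{
  header( 'HTTP/1.0 403 Forbidden' );
  die( 'Go away' );
}

Blocking the address in the firewall or in Apache is of course cheaper, since the request never reaches PHP at all.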

I hope this helps

Tony

Tony Wood : twitter.com/tonywood
Vision with Technology
Experts in eZ Publish consulting & development

Power to the Editor!

Free eZ Training : http://www.VisionWT.com/training
eZ Future Podcast : http://www.VisionWT.com/eZ-Future

Harry Oosterveen

Thursday 24 March 2005 2:02:57 pm

How do you recognize a bad spider? If you can recognize it from the HTTP information, add the following lines to your .htaccess file, or in the Apache httpd.conf, if you can access that one.

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (badspider)
RewriteRule !^nocrawl.html /nocrawl.html [L]

'badspider' is a regular expression matching the reported user agent of the bad spider. 'nocrawl.html' is simply a short HTML page with no links.

Alternatively, you can add the following code in the beginning of the index.php file:

if( preg_match( '/badspider/', $_SERVER['HTTP_USER_AGENT']))
  die( 'Go away' );

Harry Oosterveen

Friday 25 March 2005 4:44:01 am

You can find more info at http://ezinearticles.com/?Invasion-of-the-Email-Snatchers&id=20846. That page also has a list of bad spiders, applied to the .htaccess method I mentioned above. To apply the list to the PHP code for index.php, use this:

$badspiders = array( 
  'EmailSiphon',
  'EmailWolf',
  'ExtractorPro',
  'Mozilla.*NEWT',
  'Crescent',
  'CherryPicker',
  '[Ww]eb[Bb]andit',
  'WebEMailExtrac.*',
  'NICErsPRO',
  'Telesoft',
  'Zeus.*Webster',
  'Microsoft.URL',
  'Mozilla/3.Mozilla/2.01',
  'EmailCollector' );

// Use '#' as the pattern delimiter, because some entries contain '/'.
$regex = '#^(' . join( '|', $badspiders ) . ')#';

if( preg_match( $regex, $_SERVER['HTTP_USER_AGENT'])) {
  die( 'Go away' );
}

Note that new robots keep appearing, so you will have to update this list regularly.

Jonathan Dillon-Hayes

Monday 28 March 2005 2:14:07 am

There is a much easier way...

Just list a trap directory at the top of your robots.txt file, and have the URL behind it run code that adds whoever visits it to a ban list. That way, if a spider doesn't obey robots.txt, it gets locked out as soon as it starts to investigate. Your human traffic will be unaffected.

You could easily adapt that PHP code into a simple script to do it, as sketched below. Just add a database handler and a three-column table with id, name, and ip.
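Something along these lines could sit behind the trap URL that robots.txt forbids (a rough sketch only; the database name, table name and credentials are made up):

// trap.php - the script behind the directory that robots.txt forbids.
// Anything requesting it has ignored robots.txt, so record its IP.
$db = mysql_connect( 'localhost', 'dbuser', 'dbpass' );
mysql_select_db( 'spidertrap', $db );

$ip    = mysql_real_escape_string( $_SERVER['REMOTE_ADDR'] );
$agent = mysql_real_escape_string( $_SERVER['HTTP_USER_AGENT'] );

// Table "banned" has columns id (auto increment), name, ip.
mysql_query( "INSERT INTO banned ( name, ip ) VALUES ( '$agent', '$ip' )", $db );

die( 'Go away' );

Then a similar lookup at the top of index.php (after connecting to the same database) refuses any request whose IP is already in the banned table.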

J

---------
FireBright provides advanced eZ deployment with root access
http://www.FireBright.com/

Eivind Marienborg

Monday 28 March 2005 4:04:28 am

Your problem is that spiders visit your site at the wrong hours of the day, draining your system of resources. How about a script that replaces the robots.txt file at different times of day? Letting them search your site at night and banning all robots during the daytime, for example.
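A sketch of that idea (the paths and hours are only examples), run every hour from cron with the PHP command-line client:

// swap_robots.php - copy the appropriate file over robots.txt.
// robots_day.txt disallows everything, robots_night.txt allows crawling.
$docroot = '/var/www/mysite';        // adjust to your document root
$hour    = (int) date( 'G' );        // current hour, 0-23, server time

if( $hour >= 8 && $hour < 20 )
  copy( $docroot . '/robots_day.txt', $docroot . '/robots.txt' );
else
  copy( $docroot . '/robots_night.txt', $docroot . '/robots.txt' );

Keep in mind that well-behaved crawlers often cache robots.txt for a while, so the switch will not take effect immediately.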