Is there an index of the IP addresses used by indexing bots?

Solution 1

http://www.user-agents.org/ might be what you are looking for.

Solution 2

All of the search engines use a huge number of IP addresses, so you'll want to look at the user-agent string instead. Check this page for a good list of crawlers.

In PHP, something like this would work:

// Substrings of common crawler user-agent strings
$bots = array( 'googlebot', 'msnbot', 'slurp', 'mediapartners-google' );
$isRobot = false;
// Guard against requests that send no User-Agent header at all
$ua = isset( $_SERVER['HTTP_USER_AGENT'] ) ? strtolower( $_SERVER['HTTP_USER_AGENT'] ) : '';

foreach ( $bots as $bot ) {
  if ( strpos( $ua, $bot ) !== false ) {
    $isRobot = true;
    break; // one match is enough
  }
}

if ( !$isRobot ) {
  // do your thing
}

Solution 3

One way or another, if you are serious about filtering out bots, you will need to maintain a local list as well. Sometimes seemingly random IPs get obsessed with a website I administer: university projects, poorly implemented bots that seem experimental but are not generally recognized, that sort of thing.
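
A minimal sketch of what such a local list might look like, layered on top of the user-agent check from Solution 2 (the IPs below are documentation placeholders, not a vetted list):

$local_blocklist = array(
    '192.0.2.17',   // experimental university crawler (placeholder IP)
    '198.51.100.4', // unidentified scraper (placeholder IP)
);

$ip = isset( $_SERVER['REMOTE_ADDR'] ) ? $_SERVER['REMOTE_ADDR'] : '';

// Treat anything on the local list as a robot, in addition to UA matches
if ( in_array( $ip, $local_blocklist, true ) ) {
    $isRobot = true;
}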

Also: the Cuil bot (Twiceler) is the devil.

Solution 4

Why don't you just put this in your robots.txt file?

User-agent: *
Disallow: /path/page-you-dont-want-crawled.html

That way you won't need to keep hunting for bots. I would bet anything that Google, Yahoo, and MSN run hundreds of bots, probably from different IP addresses, with new ones appearing all the time. Adding the above should achieve the same thing for your page without all of the hassle.

Solution 5

There's code to recognize bots at http://ekstreme.com/phplabs/search-engine-authentication (see also the Google Help Center article at http://www.google.com/support/webmasters/bin/answer.py?answer=80553 on verifying Googlebot). There's also code at http://ekstreme.com/phplabs/crawlercontroller.php for recognizing crawlers, which you could easily extend to flag "good" crawlers as well as the spammy ones it recognizes now.
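
The verification technique that help article describes is a reverse-DNS check: resolve the visiting IP to a hostname, confirm the domain, then resolve the hostname back and make sure it matches the original IP. A minimal PHP sketch of the idea (the function name is mine, not from the linked code):

function isVerifiedGooglebot( $ip ) {
    $host = gethostbyaddr( $ip ); // reverse lookup; returns the IP unchanged on failure
    if ( $host === false || $host === $ip ) {
        return false; // no usable PTR record
    }
    // Genuine Googlebot hosts end in googlebot.com or google.com
    if ( !preg_match( '/\.(googlebot|google)\.com$/i', $host ) ) {
        return false;
    }
    // Forward-confirm: the hostname must resolve back to the original IP
    return gethostbyname( $host ) === $ip;
}

The same pattern works for other major crawlers; only the expected domain suffix changes.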

In general, it's important not to rely on either user-agent name or IP address alone, since some user-agents may be used by normal users and some IP addresses may be shared.

That said, if you're only using this for email notifications, I'd probably just ignore simple known patterns in the user-agent and live with the false positives & false negatives. Check your log files for the most common crawlers that are active on your site and just check for a unique part of the user-agent name (it might be enough to just use "googlebot|slurp|msnbot|bingbot").
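
That check is a one-liner in PHP; a sketch, assuming the alternation above covers the crawlers you actually see in your logs:

$ua = isset( $_SERVER['HTTP_USER_AGENT'] ) ? $_SERVER['HTTP_USER_AGENT'] : '';
$isKnownBot = (bool) preg_match( '/googlebot|slurp|msnbot|bingbot/i', $ua );

if ( !$isKnownBot ) {
    // send the notification email
}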

Comments

  • artlung
    artlung almost 2 years

    I have a page that gets minimal traffic, but I set up static notifications for when it gets hit. Now I want bots to be ignored, so what I'm doing now is adding bots I see to a "no notify" list.

    Is there a reference listing of the IP addresses used by indexing robots?

    e.g., a list like:

    $no_mail = array(
        '67.195.115.105', // yahoo bot
        '207.46.199.50',  // msn bot
        '61.135.249.246', // youdao bot
        '207.46.199.32',  // msn bot
    );
    
    • Admin
      Admin almost 14 years
      I don't see a problem with filtering out the bots (if it doesn't change anything they see), but why would you send yourself an email instead of just checking the existing web server logs, creating your own log file, or adding records to a db table? My thought is it'd be a lot of hassle when you get hammered by something; with logging, your mail server won't be crushed by the load. It would also save a lot of page load time, since mail generation is quite time consuming.
    • Admin
      Admin almost 14 years
      I would send myself an email because that's the mechanism I want to use to be notified. I could do this by tailing the existing logs, but then it would not be instant, and I don't have access to "live" logs with my shared host. It's not a hassle, and this is not a page with huge amounts of traffic. I also execute this script after the page has loaded, so there is no load-time impact on the user. If it became a resource problem, that would be a different matter.
  • Cebjyre
    Cebjyre almost 14 years
    Presumably he still wants the page to be searchable, just without the email being sent unless the user is a real person.
  • Virtuosi Media
    Virtuosi Media almost 14 years
    Just a note: There are plenty of bots that don't respect the robots.txt file. It would be a start for filtering out legitimate bots, though.
  • nedruod
    nedruod almost 14 years
    This is probably the best way to do it. I use the same technique to harvest lists of bots that don't respect robots.txt. @Cebjyre - the page @RandomBen is describing would have no content whatsoever; it just sends mail when a rogue bot ignores robots.txt. Users would not even see it.
  • Nathan Ridley
    Nathan Ridley almost 14 years
    Scrapers and some other crawlers mask the user agent string and pretend to be a real browser. Just something to watch out for.
  • artlung
    artlung almost 14 years
    Good resource, though the IP addresses included in their data are incomplete. Still, I think adding the UA strings as something I check is a win. So I'll be watching for bots and IP addresses as needed. Thanks!
  • artlung
    artlung almost 14 years
    Good point. I'll be going for a hybrid approach, using the UA string, and adding IP addresses as needed.
  • artlung
    artlung almost 14 years
    Right, not a useful solution. I don't mind bots visiting; I just don't want to care about them at that moment.
  • Ben Hoffman
    Ben Hoffman almost 14 years
    @artlung - Now I understand what you are trying to do. It isn't that you don't want bots to crawl it. You just don't want to be notified when they do.
  • Admin
    Admin almost 14 years
    Meh. So much for my new web browser: bottoms-up.