How to detect search engine bots with PHP?


Solution 1

Here's a Search Engine Directory of Spider names

Then use $_SERVER['HTTP_USER_AGENT'] to check whether the user agent matches one of those spiders.

if(strstr(strtolower($_SERVER['HTTP_USER_AGENT']), "googlebot"))
{
    // what to do
}
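As the comments below discuss, strpos() is a faster alternative to strstr() here, as long as you compare the result strictly against false (a match at position 0 would otherwise be treated as "not found"). A minimal sketch using stripos(), which also removes the separate strtolower() call:

```php
<?php
// Case-insensitive substring check without a separate strtolower() call.
// stripos() can legitimately return 0 (match at position 0), so the
// result must be compared strictly against false.
$userAgent = $_SERVER['HTTP_USER_AGENT'] ?? '';

if (stripos($userAgent, 'googlebot') !== false) {
    // what to do
}
```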

Solution 2

I use the following code which seems to be working fine:

function _bot_detected() {

  return (
    isset($_SERVER['HTTP_USER_AGENT'])
    && preg_match('/bot|crawl|slurp|spider|mediapartners/i', $_SERVER['HTTP_USER_AGENT'])
  );
}

update 16-06-2017 https://support.google.com/webmasters/answer/1061943?hl=en

added mediapartners
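A hypothetical usage of the function above, e.g. to skip an analytics snippet for crawlers (the simulated user agent in the last lines is only there to make the sketch self-contained):

```php
<?php
// The detection function from the answer above.
function _bot_detected() {
    return (
        isset($_SERVER['HTTP_USER_AGENT'])
        && preg_match('/bot|crawl|slurp|spider|mediapartners/i', $_SERVER['HTTP_USER_AGENT'])
    );
}

// Simulated request from Google's AdSense crawler (Mediapartners).
$_SERVER['HTTP_USER_AGENT'] = 'Mediapartners-Google';

if (!_bot_detected()) {
    // render the analytics snippet for human visitors only
}
```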

Solution 3

Check the $_SERVER['HTTP_USER_AGENT'] for some of the strings listed here:

http://www.useragentstring.com/pages/useragentstring.php

Or more specifically for crawlers:

http://www.useragentstring.com/pages/useragentstring.php?typ=Crawler

If you want to, say, log the number of visits from the most common search engine crawlers, you could use:

$interestingCrawlers = array('google', 'yahoo');
$pattern = '/(' . implode('|', $interestingCrawlers) . ')/i'; // 'i' makes the match case-insensitive
$matches = array();
// Note: preg_match()'s fourth parameter is an integer flags bitmask,
// not a string, so the 'i' modifier belongs in the pattern itself.
$numMatches = preg_match($pattern, $_SERVER['HTTP_USER_AGENT'], $matches);
if ($numMatches > 0) // Found a match
{
  // $matches[1] contains the first matched crawler name, either 'google' or 'yahoo'
}
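A minimal sketch of what the logging could look like; the log file path 'crawler_visits.log' and the append-to-file approach are assumptions for illustration, not part of the original answer:

```php
<?php
// Log each visit from an interesting crawler as a timestamped line.
// The file name 'crawler_visits.log' is a hypothetical choice.
$interestingCrawlers = array('google', 'yahoo');
$pattern = '/(' . implode('|', $interestingCrawlers) . ')/i';

$userAgent = $_SERVER['HTTP_USER_AGENT'] ?? '';
if (preg_match($pattern, $userAgent, $matches)) {
    // $matches[1] holds which crawler name matched, e.g. "Google"
    file_put_contents(
        'crawler_visits.log',
        date('c') . ' ' . strtolower($matches[1]) . PHP_EOL,
        FILE_APPEND
    );
}
```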

Solution 4

You can check whether the visitor is a search engine with this function:

<?php
function crawlerDetect($USER_AGENT)
{
    $crawlers = array(
        'Google' => 'Google',
        'MSN' => 'msnbot',
        'Rambler' => 'Rambler',
        'Yahoo' => 'Yahoo',
        'AbachoBOT' => 'AbachoBOT',
        'accoona' => 'Accoona',
        'AcoiRobot' => 'AcoiRobot',
        'ASPSeek' => 'ASPSeek',
        'CrocCrawler' => 'CrocCrawler',
        'Dumbot' => 'Dumbot',
        'FAST-WebCrawler' => 'FAST-WebCrawler',
        'GeonaBot' => 'GeonaBot',
        'Gigabot' => 'Gigabot',
        'Lycos spider' => 'Lycos',
        'MSRBOT' => 'MSRBOT',
        'Altavista robot' => 'Scooter',
        'AltaVista robot' => 'Altavista',
        'ID-Search Bot' => 'IDBot',
        'eStyle Bot' => 'eStyle',
        'Scrubby robot' => 'Scrubby',
        'Facebook' => 'facebookexternalhit',
    );
    // it is better to build this string once and cache it than to
    // run implode() on every request
    $crawlers_agents = implode('|', $crawlers);

    // Search for any crawler token inside the user-agent string.
    // (The original compared the other way around — looking for the full
    // user-agent string inside the token list — which never matches a
    // real user-agent string.)
    return preg_match('/' . $crawlers_agents . '/i', $USER_AGENT) === 1;
}
?>

Then you can use it like:

<?php
$USER_AGENT = $_SERVER['HTTP_USER_AGENT'];
if (crawlerDetect($USER_AGENT)) return "no need for language redirection";
?>

Solution 5

I'm using this to detect bots:

if (preg_match('/bot|crawl|curl|dataprovider|search|get|spider|find|java|majesticsEO|google|yahoo|teoma|contaxe|yandex|libwww-perl|facebookexternalhit/i', $_SERVER['HTTP_USER_AGENT'])) {
    // is bot
}

In addition I use a whitelist of allowed bots; any detected bot that does not match it gets blocked:

if (preg_match('/apple|baidu|bingbot|facebookexternalhit|googlebot|-google|ia_archiver|msnbot|naverbot|pingdom|seznambot|slurp|teoma|twitter|yandex|yeti/i', $_SERVER['HTTP_USER_AGENT'])) {
    // allowed bot
}

An unwanted bot (i.e., a potential false-positive user) is then able to solve a CAPTCHA to unblock itself for 24 hours. Since no one actually solves this CAPTCHA, I know it does not produce false positives, so the bot detection seems to work perfectly.

Note: My whitelist is based on Facebook's robots.txt.
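The two checks above can be combined into a single classifier. This is a sketch under our own naming (`classifyAgent` is not from the original answer), reusing the answer's two patterns unchanged:

```php
<?php
// Classify a user-agent string as 'human', 'allowed bot', or
// 'unwanted bot', combining the broad bot pattern with the whitelist.
function classifyAgent($userAgent) {
    // Broad bot detection (many false positives by design).
    $isBot = preg_match('/bot|crawl|curl|dataprovider|search|get|spider|find|java|majesticsEO|google|yahoo|teoma|contaxe|yandex|libwww-perl|facebookexternalhit/i', $userAgent);
    if (!$isBot) {
        return 'human';
    }
    // Whitelist of bots that are allowed through.
    $isAllowed = preg_match('/apple|baidu|bingbot|facebookexternalhit|googlebot|-google|ia_archiver|msnbot|naverbot|pingdom|seznambot|slurp|teoma|twitter|yandex|yeti/i', $userAgent);
    return $isAllowed ? 'allowed bot' : 'unwanted bot';
}
```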

Author by terrific

Updated on July 08, 2022

Comments

  • terrific
    terrific almost 2 years

    How can one detect the search engine bots using php?

  • terrific
    terrific about 15 years
    if ((eregi("yahoo",$this->USER_AGENT)) && (eregi("slurp",$this->USER_AGENT))) { $this->Browser = "Yahoo! Slurp"; $this->Type = "robot"; } will this work fine??
  • rinchik
    rinchik about 11 years
    why strstr and not strpos?
  • Ólafur Waage
    Ólafur Waage about 11 years
    Because strpos can return 0 (the position), strstr returns FALSE on failure, you can use strpos if you add a !== false check at the end.
  • Damon
    Damon about 10 years
    Erm, strpos returns FALSE on failure, too. It's faster and more efficient, though (no preprocessing, and no O(m) storage).
  • Jeromie Devera
    Jeromie Devera almost 10 years
    Does this assume that bots reveal themselves as such?
  • Admin
    Admin over 9 years
    What about fake useragents?!
  • barwnikk
    barwnikk over 9 years
    I can change user agent in Chrome.
  • barwnikk
    barwnikk over 9 years
    Vote down, user agent can be changed in chrome settings, firefox,
  • JonShipman
    JonShipman about 9 years
    Yes the useragent can be changed, but if someone is changing it to contain "bot","crawl","slurp", or "spider" knows whats coming to them. It also depends on utility. I wouldn't use this to strip out all CSS, but I would use this to not store cookies, ignore location logging, or skip a landing page.
  • The Onin
    The Onin about 9 years
    I think strpos is better. I do it like this: (strpos(strtolower($_SERVER['HTTP_USER_AGENT']), 'google') === false). I don't do googlebot cause i also wanna detect google insights tests.
  • Daan
    Daan almost 9 years
    Doesn't anyone agree with me that this is a way to wide range to match?
  • Daan
    Daan almost 9 years
    I think this list is outdated, I don't see "slurp" for example which is Yahoo it's spider help.yahoo.com/kb/SLN22600.html
  • Mojtaba Rezaeian
    Mojtaba Rezaeian almost 9 years
    And what if someone could change his user agent with fake name and name it like "Googlebot"? I think checking ip range is more trustworthy!
  • Mojtaba Rezaeian
    Mojtaba Rezaeian almost 9 years
    IP list is more secure if you want to make sure about user agent name is really a search engine bot, because it is possible to create fake user-agents by name.
  • Philipp
    Philipp almost 9 years
    Note: This library only analyzes the user agent to decide if visitor is a bot.
  • Joel James
    Joel James almost 8 years
    Too heavy, just to check a verify bot.
  • FarrisFahad
    FarrisFahad almost 8 years
    I used your function for more than 1 day now and it seems to be working. But I am not sure. How can I send testing bots to test if it works?
  • Robert Sinclair
    Robert Sinclair almost 8 years
    The answer is good but I wouldn't rely on the resource that's being linked to. 'Yahoo' is not even in the list.
  • Gregory
    Gregory about 7 years
    The regex in this answer is nice for being simple and wide-spanning. For my purpose I want to be quick but I don't care if there's a few false positives or false negatives.
  • Ludo - Off the record
    Ludo - Off the record about 7 years
    you forgot a closing ) in your first piece of code.
  • nikksan
    nikksan almost 7 years
    Good solution, I would just add 'Google Page Speed Insights' to the regex - '/bot|crawl|slurp|spider|mediapartners|Google Page Speed Insights/i'
  • mlissner
    mlissner over 6 years
    All the other answers using user-agent strings are only halfway there. Wow.
  • mlissner
    mlissner over 6 years
    This is only half of verifying, if you want to do it right. The other half is to use DNS to verify the IP. See the answer below: stackoverflow.com/a/29457983/64911
  • Brady Emerson
    Brady Emerson over 5 years
    There are many comments about user-agent checking only being half the check. This is true, but keep in mind, there's a huge performance impact to doing the full DNS and reverse DNS lookup. It all depends on the level of certainty you need to obtain to support your use case. This is for 100% certainty at the expense of performance. You have to decide what the right balance is (and therefore best solution) for your situation.
  • Average Joe
    Average Joe over 5 years
    What would be your (if_clause ) string piece for this? mozilla/5.0 (iphone; cpu iphone os 9_3 like mac os x) applewebkit/601.1.46 (khtml, like gecko) version/9.0 mobile/13e198 safari/601.1
  • Fabian Kessler
    Fabian Kessler over 5 years
    There's no "huge performance impact". First, the reverse dns lookup is only performed on visitors that identify as search engine. All humans are not affected at all. Then, this lookup is only performed once per IP. The result is cached. Search engines keep using the same IP ranges for a very long time, and usually hit one site with one or few IPs only. Also: you could perform the validation delayed. Let the first request through, then background-validate. And if negative, prevent successive requests. (I would advise against this because harvesters have large IP pools now ...)
  • userlond
    userlond almost 5 years
    Is there some simular library written in PHP?
  • Frodik
    Frodik about 4 years
    This is good answer, but one note from PHP documentation for preg_match: Do not use preg_match() if you only want to check if one string is contained in another string. Use strpos() or strstr() instead as they will be faster.
  • MrPHP
    MrPHP almost 4 years
if (preg_match('/http|bot|bingbot|googlebot|robot|spider|slurp|crawler|curl|^$/i', $userAgent))
  • boppy
    boppy about 3 years
    Please do not use this method to identify a google bot! Even on a small scale site we have 403 Agent-IP combinations with "googlebot" in it, while only 126 are real google bots (as of access logs from Feb 2021)! Please use طراحی سایت تهران answer below and see the linked document about verifying a real google bot!
  • Sjoerd Linders
    Sjoerd Linders almost 3 years
    This is the only right answer, when you absolutely need to be sure the request is from Google or Googlebot. See the Google documentation Verifying Googlebot.
  • Sergio Abreu
    Sergio Abreu almost 3 years
stristr() is the case-insensitive version of strstr().
  • Randy Lam
    Randy Lam almost 3 years
    For those people trying to verify the Google bot by UA, you guys are fooling yourselves ( and your partners ). Like Sjoerd said, verifying the host is the ONLY correct solution.
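Several comments above stress that the only reliable way to verify Googlebot is to check the host, not the user agent. A minimal sketch of the forward-confirmed reverse DNS check that Google documents (the function name is ours, and a production version would cache results per IP, as one commenter suggests):

```php
<?php
// Verify a claimed Googlebot by DNS: reverse-resolve the IP, check the
// hostname suffix, then forward-resolve the hostname and confirm it
// maps back to the same IP. Requires network access at runtime.
function isRealGooglebot($ip) {
    $host = gethostbyaddr($ip);  // reverse DNS; returns the IP unchanged on failure
    if ($host === false || !preg_match('/\.(googlebot|google)\.com$/', $host)) {
        return false;
    }
    $forward = gethostbyname($host);  // forward DNS lookup
    return $forward === $ip;          // must map back to the caller's IP
}
```

The same scheme works for other major crawlers (Bingbot, etc.) with their documented hostname suffixes.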