How to detect search engine bots with PHP?


Solution 1

Here's a Search Engine Directory of Spider names

Then use $_SERVER['HTTP_USER_AGENT'] to check whether the user agent matches one of those spiders.

if(strstr(strtolower($_SERVER['HTTP_USER_AGENT']), "googlebot"))
{
    // what to do
}
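As the comments below discuss, strpos() is a faster alternative to strstr() here, as long as you compare the result strictly against false (a match at position 0 would otherwise be treated as "not found"). A minimal sketch using stripos(), which also removes the separate strtolower() call:

```php
<?php
// Case-insensitive substring check without a separate strtolower() call.
// stripos() can legitimately return 0 (match at position 0), so the
// result must be compared strictly against false.
$userAgent = $_SERVER['HTTP_USER_AGENT'] ?? '';

if (stripos($userAgent, 'googlebot') !== false) {
    // what to do
}
```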

Solution 2

I use the following code which seems to be working fine:

function _bot_detected() {

  return (
    isset($_SERVER['HTTP_USER_AGENT'])
    && preg_match('/bot|crawl|slurp|spider|mediapartners/i', $_SERVER['HTTP_USER_AGENT'])
  );
}

update 16-06-2017 https://support.google.com/webmasters/answer/1061943?hl=en

added mediapartners
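A hypothetical usage of the function above, e.g. to skip an analytics snippet for crawlers (the simulated user agent in the last lines is only there to make the sketch self-contained):

```php
<?php
// The detection function from the answer above.
function _bot_detected() {
    return (
        isset($_SERVER['HTTP_USER_AGENT'])
        && preg_match('/bot|crawl|slurp|spider|mediapartners/i', $_SERVER['HTTP_USER_AGENT'])
    );
}

// Simulated request from Google's AdSense crawler (Mediapartners).
$_SERVER['HTTP_USER_AGENT'] = 'Mediapartners-Google';

if (!_bot_detected()) {
    // render the analytics snippet for human visitors only
}
```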

Solution 3

Check the $_SERVER['HTTP_USER_AGENT'] for some of the strings listed here:

http://www.useragentstring.com/pages/useragentstring.php

Or more specifically for crawlers:

http://www.useragentstring.com/pages/useragentstring.php?typ=Crawler

If you want to, say, log the number of visits from the most common search engine crawlers, you could use:

$interestingCrawlers = array('google', 'yahoo');
$pattern = '/(' . implode('|', $interestingCrawlers) . ')/i'; // 'i' makes the match case-insensitive
$matches = array();
// Note: preg_match()'s fourth parameter is an integer flags bitmask,
// not a string, so the 'i' modifier belongs in the pattern itself.
$numMatches = preg_match($pattern, $_SERVER['HTTP_USER_AGENT'], $matches);
if ($numMatches > 0) // Found a match
{
  // $matches[1] contains the first matched crawler name, either 'google' or 'yahoo'
}
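A minimal sketch of what the logging could look like; the log file path 'crawler_visits.log' and the append-to-file approach are assumptions for illustration, not part of the original answer:

```php
<?php
// Log each visit from an interesting crawler as a timestamped line.
// The file name 'crawler_visits.log' is a hypothetical choice.
$interestingCrawlers = array('google', 'yahoo');
$pattern = '/(' . implode('|', $interestingCrawlers) . ')/i';

$userAgent = $_SERVER['HTTP_USER_AGENT'] ?? '';
if (preg_match($pattern, $userAgent, $matches)) {
    // $matches[1] holds which crawler name matched, e.g. "Google"
    file_put_contents(
        'crawler_visits.log',
        date('c') . ' ' . strtolower($matches[1]) . PHP_EOL,
        FILE_APPEND
    );
}
```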

Solution 4

You can check whether the visitor is a search engine with this function:

<?php
function crawlerDetect($USER_AGENT)
{
    $crawlers = array(
        'Google' => 'Google',
        'MSN' => 'msnbot',
        'Rambler' => 'Rambler',
        'Yahoo' => 'Yahoo',
        'AbachoBOT' => 'AbachoBOT',
        'accoona' => 'Accoona',
        'AcoiRobot' => 'AcoiRobot',
        'ASPSeek' => 'ASPSeek',
        'CrocCrawler' => 'CrocCrawler',
        'Dumbot' => 'Dumbot',
        'FAST-WebCrawler' => 'FAST-WebCrawler',
        'GeonaBot' => 'GeonaBot',
        'Gigabot' => 'Gigabot',
        'Lycos spider' => 'Lycos',
        'MSRBOT' => 'MSRBOT',
        'Altavista robot' => 'Scooter',
        'AltaVista robot' => 'Altavista',
        'ID-Search Bot' => 'IDBot',
        'eStyle Bot' => 'eStyle',
        'Scrubby robot' => 'Scrubby',
        'Facebook' => 'facebookexternalhit',
    );
    // it is better to build this string once and cache it than to
    // run implode() on every request
    $crawlers_agents = implode('|', $crawlers);

    // Search for any crawler token inside the user-agent string.
    // (The original compared the other way around — looking for the full
    // user-agent string inside the token list — which never matches a
    // real user-agent string.)
    return preg_match('/' . $crawlers_agents . '/i', $USER_AGENT) === 1;
}
?>

Then you can use it like:

<?php
$USER_AGENT = $_SERVER['HTTP_USER_AGENT'];
if (crawlerDetect($USER_AGENT)) return "no need for language redirection";
?>

Solution 5

I'm using this to detect bots:

if (preg_match('/bot|crawl|curl|dataprovider|search|get|spider|find|java|majesticsEO|google|yahoo|teoma|contaxe|yandex|libwww-perl|facebookexternalhit/i', $_SERVER['HTTP_USER_AGENT'])) {
    // is bot
}

In addition I use a whitelist of allowed bots; any detected bot that does not match it gets blocked:

if (preg_match('/apple|baidu|bingbot|facebookexternalhit|googlebot|-google|ia_archiver|msnbot|naverbot|pingdom|seznambot|slurp|teoma|twitter|yandex|yeti/i', $_SERVER['HTTP_USER_AGENT'])) {
    // allowed bot
}

An unwanted bot (i.e., a potential false-positive user) is then able to solve a CAPTCHA to unblock itself for 24 hours. Since no one actually solves this CAPTCHA, I know it does not produce false positives, so the bot detection seems to work perfectly.

Note: My whitelist is based on Facebook's robots.txt.
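The two checks above can be combined into a single classifier. This is a sketch under our own naming (`classifyAgent` is not from the original answer), reusing the answer's two patterns unchanged:

```php
<?php
// Classify a user-agent string as 'human', 'allowed bot', or
// 'unwanted bot', combining the broad bot pattern with the whitelist.
function classifyAgent($userAgent) {
    // Broad bot detection (many false positives by design).
    $isBot = preg_match('/bot|crawl|curl|dataprovider|search|get|spider|find|java|majesticsEO|google|yahoo|teoma|contaxe|yandex|libwww-perl|facebookexternalhit/i', $userAgent);
    if (!$isBot) {
        return 'human';
    }
    // Whitelist of bots that are allowed through.
    $isAllowed = preg_match('/apple|baidu|bingbot|facebookexternalhit|googlebot|-google|ia_archiver|msnbot|naverbot|pingdom|seznambot|slurp|teoma|twitter|yandex|yeti/i', $userAgent);
    return $isAllowed ? 'allowed bot' : 'unwanted bot';
}
```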

Author by terrific

Updated on July 08, 2022

Comments

  • terrific
    terrific almost 2 years

    How can one detect the search engine bots using php?

  • terrific
    terrific about 15 years
    if ((eregi("yahoo",$this->USER_AGENT)) && (eregi("slurp",$this->USER_AGENT))) { $this->Browser = "Yahoo! Slurp"; $this->Type = "robot"; } will this work fine??
  • rinchik
    rinchik about 11 years
    why strstr and not strpos?
  • Ólafur Waage
    Ólafur Waage about 11 years
    Because strpos can return 0 (the position), strstr returns FALSE on failure, you can use strpos if you add a !== false check at the end.
  • Damon
    Damon about 10 years
    Erm, strpos returns FALSE on failure, too. It's faster and more efficient, though (no preprocessing, and no O(m) storage).
  • Jeromie Devera
    Jeromie Devera almost 10 years
    Does this assume that bots reveal themselves as such?
  • Admin
    Admin over 9 years
    What about fake useragents?!
  • barwnikk
    barwnikk over 9 years
    I can change user agent in Chrome.
  • barwnikk
    barwnikk over 9 years
    Vote down, user agent can be changed in chrome settings, firefox,
  • JonShipman
    JonShipman about 9 years
    Yes the useragent can be changed, but if someone is changing it to contain "bot","crawl","slurp", or "spider" knows whats coming to them. It also depends on utility. I wouldn't use this to strip out all CSS, but I would use this to not store cookies, ignore location logging, or skip a landing page.
  • The Onin
    The Onin about 9 years
    I think strpos is better. I do it like this: (strpos(strtolower($_SERVER['HTTP_USER_AGENT']), 'google') === false). I don't do googlebot cause i also wanna detect google insights tests.
  • Daan
    Daan almost 9 years
    Doesn't anyone agree with me that this is a way to wide range to match?
  • Daan
    Daan almost 9 years
    I think this list is outdated, I don't see "slurp" for example which is Yahoo it's spider help.yahoo.com/kb/SLN22600.html
  • Mojtaba Rezaeian
    Mojtaba Rezaeian almost 9 years
    And what if someone could change his user agent with fake name and name it like "Googlebot"? I think checking ip range is more trustworthy!
  • Mojtaba Rezaeian
    Mojtaba Rezaeian almost 9 years
    IP list is more secure if you want to make sure about user agent name is really a search engine bot, because it is possible to create fake user-agents by name.
  • Philipp
    Philipp almost 9 years
    Note: This library only analyzes the user agent to decide if visitor is a bot.
  • Joel James
    Joel James almost 8 years
    Too heavy, just to check a verify bot.
  • FarrisFahad
    FarrisFahad almost 8 years
    I used your function for more than 1 day now and it seems to be working. But I am not sure. How can I send testing bots to test if it works?
  • Robert Sinclair
    Robert Sinclair almost 8 years
    The answer is good but I wouldn't rely on the resource that's being linked to. 'Yahoo' is not even in the list.
  • Gregory
    Gregory about 7 years
    The regex in this answer is nice for being simple and wide-spanning. For my purpose I want to be quick but I don't care if there's a few false positives or false negatives.
  • Ludo - Off the record
    Ludo - Off the record about 7 years
    you forgot a closing ) in your first piece of code.
  • nikksan
    nikksan almost 7 years
    Good solution, I would just add 'Google Page Speed Insights' to the regex - '/bot|crawl|slurp|spider|mediapartners|Google Page Speed Insights/i'
  • mlissner
    mlissner over 6 years
    All the other answers using user-agent strings are only halfway there. Wow.
  • mlissner
    mlissner over 6 years
    This is only half of verifying, if you want to do it right. The other half is to use DNS to verify the IP. See the answer below: stackoverflow.com/a/29457983/64911
  • Brady Emerson
    Brady Emerson over 5 years
    There are many comments about user-agent checking only being half the check. This is true, but keep in mind, there's a huge performance impact to doing the full DNS and reverse DNS lookup. It all depends on the level of certainty you need to obtain to support your use case. This is for 100% certainty at the expense of performance. You have to decide what the right balance is (and therefore best solution) for your situation.
  • Average Joe
    Average Joe over 5 years
    What would be your (if_clause ) string piece for this? mozilla/5.0 (iphone; cpu iphone os 9_3 like mac os x) applewebkit/601.1.46 (khtml, like gecko) version/9.0 mobile/13e198 safari/601.1
  • Fabian Kessler
    Fabian Kessler over 5 years
    There's no "huge performance impact". First, the reverse dns lookup is only performed on visitors that identify as search engine. All humans are not affected at all. Then, this lookup is only performed once per IP. The result is cached. Search engines keep using the same IP ranges for a very long time, and usually hit one site with one or few IPs only. Also: you could perform the validation delayed. Let the first request through, then background-validate. And if negative, prevent successive requests. (I would advise against this because harvesters have large IP pools now ...)
  • userlond
    userlond almost 5 years
    Is there some simular library written in PHP?
  • Frodik
    Frodik about 4 years
    This is good answer, but one note from PHP documentation for preg_match: Do not use preg_match() if you only want to check if one string is contained in another string. Use strpos() or strstr() instead as they will be faster.
  • MrPHP
    MrPHP almost 4 years
if (preg_match('/http|bot|bingbot|googlebot|robot|spider|slurp|crawler|curl|^$/i', $userAgent))
  • boppy
    boppy about 3 years
    Please do not use this method to identify a google bot! Even on a small scale site we have 403 Agent-IP combinations with "googlebot" in it, while only 126 are real google bots (as of access logs from Feb 2021)! Please use طراحی سایت تهران answer below and see the linked document about verifying a real google bot!
  • Sjoerd Linders
    Sjoerd Linders almost 3 years
    This is the only right answer, when you absolutely need to be sure the request is from Google or Googlebot. See the Google documentation Verifying Googlebot.
  • Sergio Abreu
    Sergio Abreu almost 3 years
stristr() is the case-insensitive version of strstr().
  • Randy Lam
    Randy Lam almost 3 years
    For those people trying to verify the Google bot by UA, you guys are fooling yourselves ( and your partners ). Like Sjoerd said, verifying the host is the ONLY correct solution.
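Several comments above stress that the only reliable way to verify Googlebot is to check the host, not the user agent. A minimal sketch of the forward-confirmed reverse DNS check that Google documents (the function name is ours, and a production version would cache results per IP, as one commenter suggests):

```php
<?php
// Verify a claimed Googlebot by DNS: reverse-resolve the IP, check the
// hostname suffix, then forward-resolve the hostname and confirm it
// maps back to the same IP. Requires network access at runtime.
function isRealGooglebot($ip) {
    $host = gethostbyaddr($ip);  // reverse DNS; returns the IP unchanged on failure
    if ($host === false || !preg_match('/\.(googlebot|google)\.com$/', $host)) {
        return false;
    }
    $forward = gethostbyname($host);  // forward DNS lookup
    return $forward === $ip;          // must map back to the caller's IP
}
```

The same scheme works for other major crawlers (Bingbot, etc.) with their documented hostname suffixes.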