How to detect search engine bots with PHP?
Solution 1
Here's a Search Engine Directory of spider names.
You can then check $_SERVER['HTTP_USER_AGENT'] to see whether the agent is one of those spiders.
if (strstr(strtolower($_SERVER['HTTP_USER_AGENT']), "googlebot"))
{
    // what to do
}
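A minimal sketch of the same check using stripos(), which is case-insensitive and so avoids building a lowercased copy of the string. The helper name userAgentContains is made up for illustration, and the ?? operator assumes PHP 7 or later. Note the !== false comparison: a match at position 0 would otherwise be treated as a failure.

```php
<?php
// Hypothetical helper (not from the original answer): true if the user
// agent contains $needle, ignoring case. A missing header counts as no match.
function userAgentContains(string $needle, ?string $userAgent): bool
{
    // stripos() returns 0 for a match at the very start of the string,
    // so the result must be compared with !== false, not used as a boolean.
    return $userAgent !== null && stripos($userAgent, $needle) !== false;
}

$userAgent = $_SERVER['HTTP_USER_AGENT'] ?? null;
if (userAgentContains('googlebot', $userAgent)) {
    // what to do
}
```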
Solution 2
I use the following code which seems to be working fine:
function _bot_detected() {
    return (
        isset($_SERVER['HTTP_USER_AGENT'])
        && preg_match('/bot|crawl|slurp|spider|mediapartners/i', $_SERVER['HTTP_USER_AGENT'])
    );
}
Update 16-06-2017: added mediapartners (see https://support.google.com/webmasters/answer/1061943?hl=en).
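To try the pattern without waiting for a real crawler, one option is a variant that takes the user agent as an argument, so sample strings can be fed in directly. This is only a sketch: the name looksLikeBot and the two sample strings are illustrative, not part of the original answer.

```php
<?php
// Illustrative variant of the check above that accepts the user agent as
// a parameter, which makes it easy to exercise with sample strings.
function looksLikeBot(?string $userAgent): bool
{
    return $userAgent !== null
        && preg_match('/bot|crawl|slurp|spider|mediapartners/i', $userAgent) === 1;
}

// Sample user agents (assumed examples): one crawler, one desktop browser.
$samples = [
    'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/115.0',
];
foreach ($samples as $ua) {
    echo (looksLikeBot($ua) ? 'bot   ' : 'human ') . $ua . PHP_EOL;
}
```

The live page can be exercised the same way by sending a request with a custom User-Agent header, for example with curl's -A option.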
Solution 3
Check $_SERVER['HTTP_USER_AGENT'] for some of the strings listed here:
http://www.useragentstring.com/pages/useragentstring.php
Or more specifically for crawlers:
http://www.useragentstring.com/pages/useragentstring.php?typ=Crawler
If you want to, say, log the number of visits by the most common search engine crawlers, you could use:
$interestingCrawlers = array('google', 'yahoo');
$pattern = '/(' . implode('|', $interestingCrawlers) . ')/';
$matches = array();
// The fourth argument of preg_match() is an integer flags value, not a
// modifier string. The user agent is lowercased, so no /i modifier is needed.
$userAgent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
$numMatches = preg_match($pattern, strtolower($userAgent), $matches);
if ($numMatches > 0) // Found a match
{
    // $matches[1] contains the matched text: either 'google' or 'yahoo'
}
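As a rough sketch of the counting idea, the function below tallies matches over a list of user agents, such as lines pulled from an access log. The name countCrawlerVisits is hypothetical, the crawler names are assumed to be lowercase, and the arrow function needs PHP 7.4 or later.

```php
<?php
// Hypothetical helper: count visits per crawler name in a list of user agents.
// Assumes every entry in $crawlers is lowercase.
function countCrawlerVisits(array $userAgents, array $crawlers): array
{
    // Escape each name so it is matched literally inside the pattern.
    $quoted  = array_map(fn($name) => preg_quote($name, '/'), $crawlers);
    $pattern = '/(' . implode('|', $quoted) . ')/i';

    // Start every counter at zero so crawlers with no visits still appear.
    $counts = array_fill_keys($crawlers, 0);
    foreach ($userAgents as $ua) {
        if (preg_match($pattern, $ua, $matches)) {
            // $matches[1] is the matched text; lowercase it to find its counter.
            $counts[strtolower($matches[1])]++;
        }
    }
    return $counts;
}
```

Only the first crawler name found in each user agent is counted, which is usually enough for a simple visit log.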
Solution 4
You can check whether it's a search engine with this function:
<?php
function crawlerDetect($USER_AGENT)
{
    $crawlers = array(
        'Google' => 'Google',
        'MSN' => 'msnbot',
        'Rambler' => 'Rambler',
        'Yahoo' => 'Yahoo',
        'AbachoBOT' => 'AbachoBOT',
        'accoona' => 'Accoona',
        'AcoiRobot' => 'AcoiRobot',
        'ASPSeek' => 'ASPSeek',
        'CrocCrawler' => 'CrocCrawler',
        'Dumbot' => 'Dumbot',
        'FAST-WebCrawler' => 'FAST-WebCrawler',
        'GeonaBot' => 'GeonaBot',
        'Gigabot' => 'Gigabot',
        'Lycos spider' => 'Lycos',
        'MSRBOT' => 'MSRBOT',
        'Altavista robot' => 'Scooter',
        'AltaVista robot' => 'Altavista',
        'ID-Search Bot' => 'IDBot',
        'eStyle Bot' => 'eStyle',
        'Scrubby robot' => 'Scrubby',
        'Facebook' => 'facebookexternalhit',
    );
    // Build one case-insensitive pattern from the signatures. If this runs
    // often, it is better to store the pattern string than to rebuild it
    // with implode() on every call.
    $quoted = array_map(function ($s) { return preg_quote($s, '/'); }, array_values($crawlers));
    $pattern = '/' . implode('|', $quoted) . '/i';
    // Look for a crawler signature inside the user agent, not the other way round.
    return preg_match($pattern, (string) $USER_AGENT) === 1;
}
?>
Then you can use it like this:
<?php
$USER_AGENT = $_SERVER['HTTP_USER_AGENT'];
if (crawlerDetect($USER_AGENT)) return "no need to lang redirection";
?>
Solution 5
I'm using this to detect bots:
if (preg_match('/bot|crawl|curl|dataprovider|search|get|spider|find|java|majesticsEO|google|yahoo|teoma|contaxe|yandex|libwww-perl|facebookexternalhit/i', $_SERVER['HTTP_USER_AGENT'])) {
    // is bot
}
In addition I use a whitelist to block unwanted bots:
if (preg_match('/apple|baidu|bingbot|facebookexternalhit|googlebot|-google|ia_archiver|msnbot|naverbot|pingdom|seznambot|slurp|teoma|twitter|yandex|yeti/i', $_SERVER['HTTP_USER_AGENT'])) {
    // allowed bot
}
An unwanted bot (= false-positive user) is then able to solve a captcha to unblock himself for 24 hours. As no one ever solves this captcha, I know it does not produce false positives. So the bot detection seems to work perfectly.
Note: My whitelist is based on Facebook's robots.txt.
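The two checks can be combined into a single classifier, roughly as sketched below. The function name classifyVisitor and its three return values are assumptions for illustration, and both patterns are shortened versions of the ones above.

```php
<?php
// Hypothetical classifier combining a broad bot pattern with a whitelist.
// Both patterns are abbreviated for illustration only.
function classifyVisitor(?string $userAgent): string
{
    $botPattern     = '/bot|crawl|curl|spider|slurp|facebookexternalhit/i';
    $allowedPattern = '/bingbot|facebookexternalhit|googlebot|msnbot|slurp|yandex/i';

    // No bot keyword (or no user agent at all): treat the visitor as human.
    if ($userAgent === null || !preg_match($botPattern, $userAgent)) {
        return 'human';
    }
    // A bot keyword was found: whitelisted bots pass, the rest get the captcha.
    return preg_match($allowedPattern, $userAgent) ? 'allowed_bot' : 'unwanted_bot';
}
```

Treating a missing user agent as human is a design choice in this sketch. A stricter site might send such requests to the captcha as well.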
terrific
Updated on July 08, 2022
Comments
-
terrific almost 2 years
How can one detect the search engine bots using php?
-
terrific about 15 years
if ((eregi("yahoo",$this->USER_AGENT)) && (eregi("slurp",$this->USER_AGENT))) { $this->Browser = "Yahoo! Slurp"; $this->Type = "robot"; } Will this work fine?
-
rinchik about 11 years
Why strstr and not strpos?
-
Ólafur Waage about 11 years
Because strpos can return 0 (the position), while strstr returns FALSE on failure. You can use strpos if you add a !== false check at the end.
-
Damon about 10 years
Erm, strpos returns FALSE on failure, too. It's faster and more efficient, though (no preprocessing, and no O(m) storage).
-
Jeromie Devera almost 10 years
Does this assume that bots reveal themselves as such?
-
Admin over 9 years
What about fake user agents?!
-
barwnikk over 9 years
I can change the user agent in Chrome.
-
barwnikk over 9 years
Vote down, user agent can be changed in chrome settings, firefox,
-
JonShipman about 9 years
Yes, the user agent can be changed, but anyone who changes it to contain "bot", "crawl", "slurp", or "spider" knows what's coming to them. It also depends on the use. I wouldn't use this to strip out all CSS, but I would use it to skip storing cookies, ignore location logging, or skip a landing page.
-
The Onin about 9 years
I think strpos is better. I do it like this: (strpos(strtolower($_SERVER['HTTP_USER_AGENT']), 'google') === false). I don't check for googlebot because I also want to detect Google Insights tests.
-
Daan almost 9 years
Doesn't anyone agree with me that this is way too wide a range to match?
-
Daan almost 9 years
I think this list is outdated. I don't see "slurp", for example, which is Yahoo's spider: help.yahoo.com/kb/SLN22600.html
-
Mojtaba Rezaeian almost 9 years
And what if someone changes their user agent to a fake name like "Googlebot"? I think checking the IP range is more trustworthy!
-
Mojtaba Rezaeian almost 9 years
An IP list is more secure if you want to make sure that the user agent really belongs to a search engine bot, because it is possible to fake a user agent by name.
-
Philipp almost 9 years
Note: This library only analyzes the user agent to decide if the visitor is a bot.
-
Joel James almost 8 years
Too heavy just to check and verify a bot.
-
FarrisFahad almost 8 years
I have used your function for more than a day now and it seems to be working, but I am not sure. How can I send test bots to check that it works?
-
Robert Sinclair almost 8 years
The answer is good but I wouldn't rely on the resource that's being linked to. 'Yahoo' is not even in the list.
-
Gregory about 7 years
The regex in this answer is nice for being simple and wide-spanning. For my purpose I want to be quick, and I don't care if there are a few false positives or false negatives.
-
Ludo - Off the record about 7 years
You forgot a closing ) in your first piece of code.
-
nikksan almost 7 years
Good solution, I would just add 'Google Page Speed Insights' to the regex: '/bot|crawl|slurp|spider|mediapartners|Google Page Speed Insights/i'
-
mlissner over 6 years
All the other answers using user-agent strings are only halfway there. Wow.
-
mlissner over 6 years
This is only half of verifying, if you want to do it right. The other half is to use DNS to verify the IP. See the answer below: stackoverflow.com/a/29457983/64911
-
Brady Emerson over 5 years
There are many comments about user-agent checking only being half the check. This is true, but keep in mind that there's a huge performance impact to doing the full DNS and reverse DNS lookup. It all depends on the level of certainty you need to support your use case. This is for 100% certainty at the expense of performance. You have to decide what the right balance is (and therefore the best solution) for your situation.
-
Average Joe over 5 years
What would be your (if_clause ) string piece for this? mozilla/5.0 (iphone; cpu iphone os 9_3 like mac os x) applewebkit/601.1.46 (khtml, like gecko) version/9.0 mobile/13e198 safari/601.1
-
Fabian Kessler over 5 years
There's no "huge performance impact". First, the reverse DNS lookup is only performed on visitors that identify as a search engine. Humans are not affected at all. Then, this lookup is only performed once per IP. The result is cached. Search engines keep using the same IP ranges for a very long time, and usually hit one site with one or a few IPs only. Also: you could perform the validation delayed. Let the first request through, then background-validate. And if negative, prevent successive requests. (I would advise against this because harvesters have large IP pools now ...)
-
userlond almost 5 years
Is there a similar library written in PHP?
-
Frodik about 4 years
This is a good answer, but one note from the PHP documentation for preg_match: Do not use preg_match() if you only want to check if one string is contained in another string. Use strpos() or strstr() instead as they will be faster.
-
MrPHP almost 4 years
if (preg_match('/http|bot|bingbot|googlebot|robot|spider|slurp|crawler|curl|^$/i', $userAgent))
-
boppy about 3 years
Please do not use this method to identify a Google bot! Even on a small-scale site we have 403 agent-IP combinations with "googlebot" in them, while only 126 are real Google bots (according to access logs from Feb 2021)! Please use طراحی سایت تهران's answer below and see the linked document about verifying a real Google bot!
-
Sjoerd Linders almost 3 years
This is the only right answer when you absolutely need to be sure the request is from Google or Googlebot. See the Google documentation Verifying Googlebot.
-
Sergio Abreu almost 3 years
stristr() is the case-insensitive version of strstr().
-
Randy Lam almost 3 years
For those people trying to verify the Google bot by UA, you guys are fooling yourselves (and your partners). Like Sjoerd said, verifying the host is the ONLY correct solution.
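Several comments point out that the user agent can be spoofed and that the other half of the check is a DNS lookup. Below is a minimal sketch of that reverse-then-forward check for Googlebot. The function name isVerifiedGooglebot is hypothetical, and the accepted domains (googlebot.com and google.com) are an assumption to confirm against Google's current verification guidance. The two resolver parameters exist only so the logic can be tested without network access. By default they are PHP's gethostbyaddr() and gethostbynamel(), and since the latter resolves IPv4 only, this sketch does not cover IPv6 visitors.

```php
<?php
// Sketch: verify a visitor that claims to be Googlebot via reverse + forward DNS.
// $reverse and $forward are injectable for offline testing (an assumption of
// this example); they default to gethostbyaddr() and gethostbynamel().
function isVerifiedGooglebot(string $ip, ?callable $reverse = null, ?callable $forward = null): bool
{
    $reverse = $reverse ?? 'gethostbyaddr';
    $forward = $forward ?? 'gethostbynamel';

    // Step 1: reverse lookup. gethostbyaddr() returns the IP unchanged on
    // failure and false on malformed input.
    $host = $reverse($ip);
    if ($host === false || $host === $ip) {
        return false;
    }

    // Step 2: the host name must end in an accepted domain (assumed list).
    if (!preg_match('/\.(googlebot|google)\.com$/i', $host)) {
        return false;
    }

    // Step 3: the forward lookup of that host must lead back to the same IP.
    $ips = $forward($host);
    return is_array($ips) && in_array($ip, $ips, true);
}
```

As suggested in the comments, this only needs to run for visitors whose user agent already claims to be a search engine, and the result can be cached per IP so the lookup cost is paid once.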