Detect Search Crawlers via JavaScript

Solution 1

This is the regex the Ruby agent_orange library uses to test whether a userAgent looks like a bot (you can narrow it down to specific bots by consulting a published list of bot userAgents):

/bot|crawler|spider|crawling/i

For example, given some object util.browser, you can store what type of device a user is on:

util.browser = {
   bot: /bot|googlebot|crawler|spider|robot|crawling/i.test(navigator.userAgent),
   mobile: ...,
   desktop: ...
}
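
A hedged sketch of how that flag might be used afterwards; trackPageView() here is a hypothetical placeholder for whatever call you want to suppress:

// Only fire analytics for human visitors.
if (!util.browser.bot) {
   trackPageView(); // hypothetical - substitute the call you want to skip
}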

Solution 2

Try this. It's based on the crawler list available at https://github.com/monperrus/crawler-user-agents

var botPattern = "(googlebot\/|bot|Googlebot-Mobile|Googlebot-Image|Google favicon|Mediapartners-Google|bingbot|slurp|java|wget|curl|Commons-HttpClient|Python-urllib|libwww|httpunit|nutch|phpcrawl|msnbot|jyxobot|FAST-WebCrawler|FAST Enterprise Crawler|biglotron|teoma|convera|seekbot|gigablast|exabot|ngbot|ia_archiver|GingerCrawler|webmon |httrack|webcrawler|grub.org|UsineNouvelleCrawler|antibot|netresearchserver|speedy|fluffy|bibnum.bnf|findlink|msrbot|panscient|yacybot|AISearchBot|IOI|ips-agent|tagoobot|MJ12bot|dotbot|woriobot|yanga|buzzbot|mlbot|yandexbot|purebot|Linguee Bot|Voyager|CyberPatrol|voilabot|baiduspider|citeseerxbot|spbot|twengabot|postrank|turnitinbot|scribdbot|page2rss|sitebot|linkdex|Adidxbot|blekkobot|ezooms|dotbot|Mail.RU_Bot|discobot|heritrix|findthatfile|europarchive.org|NerdByNature.Bot|sistrix crawler|ahrefsbot|Aboundex|domaincrawler|wbsearchbot|summify|ccbot|edisterbot|seznambot|ec2linkfinder|gslfbot|aihitbot|intelium_bot|facebookexternalhit|yeti|RetrevoPageAnalyzer|lb-spider|sogou|lssbot|careerbot|wotbox|wocbot|ichiro|DuckDuckBot|lssrocketcrawler|drupact|webcompanycrawler|acoonbot|openindexspider|gnam gnam spider|web-archive-net.com.bot|backlinkcrawler|coccoc|integromedb|content crawler spider|toplistbot|seokicks-robot|it2media-domain-crawler|ip-web-crawler.com|siteexplorer.info|elisabot|proximic|changedetection|blexbot|arabot|WeSEE:Search|niki-bot|CrystalSemanticsBot|rogerbot|360Spider|psbot|InterfaxScanBot|Lipperhey SEO Service|CC Metadata Scaper|g00g1e.net|GrapeshotCrawler|urlappendbot|brainobot|fr-crawler|binlar|SimpleCrawler|Livelapbot|Twitterbot|cXensebot|smtbot|bnf.fr_bot|A6-Indexer|ADmantX|Facebot|Twitterbot|OrangeBot|memorybot|AdvBot|MegaIndex|SemanticScholarBot|ltx71|nerdybot|xovibot|BUbiNG|Qwantify|archive.org_bot|Applebot|TweetmemeBot|crawler4j|findxbot|SemrushBot|yoozBot|lipperhey|y!j-asr|Domain Re-Animator Bot|AddThis)";
var re = new RegExp(botPattern, 'i');
var userAgent = navigator.userAgent; 
if (re.test(userAgent)) {
    console.log('the user agent is a crawler!');
}
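
As a quick sanity check, you can run the pattern against a known crawler userAgent (the Googlebot string below comes from the question):

var googlebotUA = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)';
console.log(re.test(googlebotUA)); // true ("googlebot/" matches)
console.log(re.test('Mozilla/5.0 (Windows NT 10.0; Win64; x64)')); // false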

Solution 3

The following regex will match the biggest search engines according to this post.

/bot|google|baidu|bing|msn|teoma|slurp|yandex/i
    .test(navigator.userAgent)

The matched search engines are:

  • Baidu
  • Bingbot/MSN
  • DuckDuckGo (duckduckbot)
  • Google
  • Teoma
  • Yahoo!
  • Yandex

Additionally, I've added bot as a catchall for smaller crawlers/bots.
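
For illustration, here is how the pattern behaves against a few sample userAgent strings (the strings are examples, trimmed for brevity):

var searchEngines = /bot|google|baidu|bing|msn|teoma|slurp|yandex/i;

searchEngines.test('Baiduspider+(+http://www.baidu.com/search/spider.htm)');      // true ("baidu")
searchEngines.test('Mozilla/5.0 (compatible; Yahoo! Slurp)');                      // true ("slurp")
searchEngines.test('DuckDuckBot/1.0; (+http://duckduckgo.com/duckduckbot.html)');  // true ("bot")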

Solution 4

This might help to detect robot user agents while also keeping things more organized:

JavaScript

const detectRobot = (userAgent) => {
  const robots = new RegExp([
    /bot/,/spider/,/crawl/,                            // GENERAL TERMS
    /APIs-Google/,/AdsBot/,/Googlebot/,                // GOOGLE ROBOTS
    /mediapartners/,/Google Favicon/,
    /FeedFetcher/,/Google-Read-Aloud/,
    /DuplexWeb-Google/,/googleweblight/,
    /bing/,/yandex/,/baidu/,/duckduck/,/yahoo/,        // OTHER ENGINES
    /ecosia/,/ia_archiver/,
    /facebook/,/instagram/,/pinterest/,/reddit/,       // SOCIAL MEDIA
    /slack/,/twitter/,/whatsapp/,/youtube/,
    /semrush/,                                         // OTHER
  ].map((r) => r.source).join("|"),"i");               // BUILD REGEXP + "i" FLAG

  return robots.test(userAgent);
};

TypeScript

const detectRobot = (userAgent: string): boolean => {
  const robots = new RegExp(([
    /bot/,/spider/,/crawl/,                               // GENERAL TERMS
    /APIs-Google/,/AdsBot/,/Googlebot/,                   // GOOGLE ROBOTS
    /mediapartners/,/Google Favicon/,
    /FeedFetcher/,/Google-Read-Aloud/,
    /DuplexWeb-Google/,/googleweblight/,
    /bing/,/yandex/,/baidu/,/duckduck/,/yahoo/,           // OTHER ENGINES
    /ecosia/,/ia_archiver/,
    /facebook/,/instagram/,/pinterest/,/reddit/,          // SOCIAL MEDIA
    /slack/,/twitter/,/whatsapp/,/youtube/,
    /semrush/,                                            // OTHER
  ] as RegExp[]).map((r) => r.source).join("|"),"i");     // BUILD REGEXP + "i" FLAG

  return robots.test(userAgent);
};

Use on server:

const userAgent = req.get('user-agent');
const isRobot = detectRobot(userAgent);
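
For example, assuming an Express server (the req.get call above suggests Express), this could be wrapped in a small middleware; the sketch below is illustrative, not part of the original answer:

// Flag bot requests so later handlers can branch on req.isRobot.
app.use((req, res, next) => {
  req.isRobot = detectRobot(req.get('user-agent') || '');
  next();
});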

Use on the "client", or in some headless "phantom" browser a bot might be using:

const userAgent = navigator.userAgent;
const isRobot = detectRobot(userAgent);

Overview of Google crawlers:

https://developers.google.com/search/docs/advanced/crawling/overview-google-crawlers

Solution 5

The isTrusted property could help you.

The isTrusted read-only property of the Event interface is a Boolean that is true when the event was generated by a user action, and false when the event was created or modified by a script or dispatched via EventTarget.dispatchEvent().

For example:

function isCrawler(event) {
  // A trusted event comes from a real user action, so a
  // script-generated (untrusted) event suggests a bot.
  return !event.isTrusted;
}

⚠ Note that IE isn't compatible.

Read more from doc: https://developer.mozilla.org/en-US/docs/Web/API/Event/isTrusted
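
A minimal sketch of how this check might be wired to real events (assuming click events are a reasonable signal for your page):

// Script-dispatched events (which some bots generate) have isTrusted === false.
document.addEventListener('click', function (event) {
  if (!event.isTrusted) {
    console.log('Synthetic click detected - possibly a bot or script');
  }
});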

Comments

  • Jon
    Jon almost 2 years

    I am wondering how I would go about detecting search crawlers? The reason I ask is that I want to suppress certain JavaScript calls if the user agent is a bot.

    I have found an example of how to detect a certain browser, but I am unable to find examples of how to detect a search crawler:

    /MSIE (\d+\.\d+);/.test(navigator.userAgent); //test for MSIE x.x

    Example of search crawlers I want to block:

    Google 
    Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) 
    Googlebot/2.1 (+http://www.googlebot.com/bot.html) 
    Googlebot/2.1 (+http://www.google.com/bot.html) 
    
    Baidu 
    Baiduspider+(+http://www.baidu.com/search/spider_jp.html) 
    Baiduspider+(+http://www.baidu.com/search/spider.htm) 
    BaiDuSpider 
    
  • Jon
    Jon over 10 years
    Cool, thank you. I am curious about my requirements for Google. On my second line, I am to block out Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html). I am wondering what that means? Shouldn't Mozilla be one of the Regexp I should be including in my code?
  • megawac
    megawac over 10 years
    @icu222much see stackoverflow.com/questions/5125438/… You should just match if the string contains bot/spider/etc to check if a ua is a bot
  • morten.c
    morten.c over 10 years
    I thought you just didn't know how to match the user agent against your list, so stick to the answer/comment of megawac; I don't have much experience identifying bots/crawlers. So +1 for his answer.
  • Jon
    Jon over 10 years
    I tried if (/YahooSeeker|/.test(navigator.userAgent)) {console.log('yahoo')} and I left my user-agent as default (Mozilla) but the if statement returned true. Am I doing something incorrectly?
  • megawac
    megawac over 10 years
    you have an extraneous | (or statement) in your regex so that test will always pass. Try /YahooSeeker/
  • Jon
    Jon over 10 years
    I have removed the extra pipe so my statement now says if (/Googlebot/.test(navigator.userAgent)) {...} but is now reporting false even when I am using Googlebot as my UA.
  • megawac
    megawac over 10 years
    The googlebot ua is Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) so try /Googlebot/i.test("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"). You were missing the i flag
  • Jon
    Jon over 10 years
    Sorry, I don't mean to sound noob-ish, but it is still not working. I have if ( /Googlebot/i.test("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)") ) which is always returning true even though I have disabled my UA. Can we move this into a chat?
  • morten.c
    morten.c over 10 years
    There is again one pipe too many at the end of your RegEx; changing it to /YahooSeeker/ should solve this issue.
  • tiernanx
    tiernanx almost 8 years
    googlebot and robot are redundant in the regex string used since bot will match first. /bot|crawler|spider|crawling/i would be much simpler.
  • Hariom Balhara
    Hariom Balhara about 7 years
    Now that navigator.userAgent is deprecated, what would be the preferred way to do this in JavaScript?
  • rocky
    rocky almost 7 years
    aolbuild is not a bot. We removed it from our regex today because multiple customers called and complained about being flagged as a bot. perishablepress.com is incorrect about aolbuild.
  • Edo
    Edo almost 7 years
    Thanks @rocky, I've removed aolbuild from the answer
  • Amir Bar
    Amir Bar almost 7 years
    There are also the Facebook crawler bots facebookexternalhit|facebot: developers.facebook.com/docs/sharing/webmasters/crawler
  • dave
    dave about 6 years
    duckduckgo should be: duckduckbot (see: duckduckgo.com/duckduckbot)
  • Edo
    Edo about 6 years
    Thanks @dave, edited. Funnily enough, perishablepress.com lists the correct user agent string, but the regex they suggest is wrong.
  • Omri
    Omri almost 4 years
    duckduckbot is already covered by "bot" in /bot|google|baidu|bing|msn|teoma|slurp|yandex/i
  • tzazo
    tzazo almost 4 years
    You can simplify it even further by combining crawler and crawling into crawl: /bot|crawl|spider/i