MS Bing web crawler out of control causing our site to go down

Solution 1

Sign up with Bing Webmaster Tools and fill out their crawl speed chart. Set it to the fastest crawling during your off hours and a much reduced rate during your busiest times.

If Bing is knocking over your website, you need to rethink your web server capacity. The best test is to see if you can survive Google, Bing, Yahoo and Baidu all hitting your system at once. If it remains in service during the onslaught, then you're ready for a live customer load.
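
One rough way to gauge that kind of concurrent load is a quick ApacheBench run against a representative page (a sketch only: the URL and numbers are placeholders, and it assumes ab is installed):

# 10,000 requests at a concurrency of 50 against a placeholder URL
ab -n 10000 -c 50 http://www.example.com/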

Yes, Bing can hit you pretty hard if you haven't given them a limit. It was causing me serious issues here two months ago. I just tuned the system up to handle it and it was a good thing, otherwise Black Friday would have resulted in a very Blue Monday after viewing the server stats.

Solution 2

Use PHP plus a regex. Forget robots.txt; several bad bots don't respect it anyway...

if (isset($_SERVER['HTTP_USER_AGENT'])
    && preg_match('/bingbot/i', $_SERVER['HTTP_USER_AGENT'])) // case-insensitive match on the User-Agent header
{
    http_response_code(403); // tell the bot it is not welcome
    exit();
}

And that tells Bing: the door is closed for you!
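
If you'd rather not do this in PHP, a roughly equivalent rule can live in the web server configuration instead; a minimal nginx sketch (it would sit inside your server block, and the details are only illustrative) looks like this:

# Return 403 to any client whose User-Agent contains "bingbot" (case-insensitive)
if ($http_user_agent ~* bingbot) {
    return 403;
}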

Solution 3

There are two ways of controlling the Bingbot; see http://www.bing.com/webmaster/help/crawl-control-55a30302 for details.

If you don't want to use their control panel just use a robots.txt file.

"If we find a crawl-delay: directive in your robots.txt file then it will take always precedence over the information from this feature."


Comments

  • akaDanPaul
    akaDanPaul almost 2 years

    Here is a weird one that I am not sure what to do about. Today our company's e-commerce site went down. I tailed the production log and saw that we were receiving a ton of requests from the IP range 157.55.98.0 - 157.55.100.0. I googled around and came to find out that it is an MSN web crawler.

    So essentially the MS web crawler overloaded our site and caused it to stop responding, even though in our robots.txt file we have the following:

    Crawl-delay: 10 
    

    So what I did was just ban the IP range in iptables (roughly as sketched after this comment).

    But what I am not sure about is how to follow up from here. I can't find anywhere to contact Bing about this issue, and I don't want to keep those IPs blocked because I am sure we will eventually get de-indexed from Bing. And it doesn't really seem like this has happened to anyone else before.

    Any Suggestions?

    Update: My Server / Web Stats

    Our web server is running Nginx, Rails 3, and 5 Unicorn workers. We have 4 GB of memory and 2 virtual cores. We have been running this setup for over 9 months now and never had an issue; 95% of the time our system is under very little load. On average we receive 800,000 page views a month, and this never comes close to bringing down or slowing our web server.

    Taking a look at the logs, we were receiving anywhere from 5 up to 40 requests per second from this IP range.

    In all my years of web development I have never seen a crawler hit a website so many times.

    Is this new with Bing?
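
    (For reference, the sort of iptables rule meant above is sketched here; the range is the one from the logs and is only illustrative:)

    # Drop all traffic from the reported crawler range (uses the iprange match module)
    iptables -A INPUT -m iprange --src-range 157.55.98.0-157.55.100.0 -j DROP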

  • Fiasco Labs
    Fiasco Labs over 11 years
    Good choice if you don't depend on Bing/Live/MSNSearch for incoming traffic. This will completely deindex your website with them and do a pretty good job of reducing web server loading.
  • Mike Niner
    Mike Niner over 11 years
    Thanks Fiasco. In my opinion, BingBot is an evil bot; it works like a web ripper. If a webmaster depends on Bing to make revenue, then he needs to consider buying more and more resources to handle it. Bing was banned on all my 95 sites. Good luck to you all.
  • Fiasco Labs
    Fiasco Labs over 11 years
    My comment was almost, but not quite, tongue in cheek. I've had both Yahoo and Bing hit my site at once and nearly take the site to its knees. The loading was worse than Yandex, which in the past has caused me grief. Yandex has actually upgraded their internal operations to work more like Google and not strain stuff so terribly. Baidu and Bing are on equal terms now for being overaggressive and requiring server tuning to handle the extra traffic.
  • Aristos
    Aristos almost 11 years
    I have done that - and it did not work at all...
  • Fiasco Labs
    Fiasco Labs almost 11 years
    Did you install the file that identifies your website to BWT and check that they've verified it? If Bing can't id the site, the crawl rate histogram will do nothing at all for limiting traffic.
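
    (For context, the file referred to is Bing's site-verification file, usually BingSiteAuth.xml at the site root; the code below is only a placeholder, and a meta tag or DNS record can be used instead:)

    <?xml version="1.0"?>
    <users>
      <user>YOUR_BING_VERIFICATION_CODE</user>
    </users>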
  • Aristos
    Aristos almost 11 years
    Yes, my sites are verified; I just checked. I have cut them off with the firewall for now... to calm things down. However, Bing support are very friendly. I have been in contact with them; they suggested I add the line crawl-delay: 10 to robots.txt, which did not work either, and now they have asked me for the logs, which I have already sent them, so they can look into it.
  • Koen.
    Koen. over 10 years
    If denying based on user agent is desired, you'd better deny them in your server configuration.
  • Fiasco Labs
    Fiasco Labs about 10 years
    @blunders - and since Bing, Google and Yandex are the majority traffic sources on our website, we have to survive all of them scanning our website simultaneously. Guess what happens to Baidu here --> Scrapheap. Heh, the statement still holds true that your website will need to withstand being indexed by the web crawlers you choose to let in, or it is no website at all.