One out of every 5 hits to my site is from a bot.
I spent a fair bit of time this weekend analysing my log file for last month (which runs to gigabytes, and I ended up learning a few things about file system optimisation, but more on that later). 80% of the hits were from regular browsers. 20% were from robots. Here’s a sample of the user-agents:
    Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
    Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
    Mediapartners-Google
    DotBot/1.0.1 (http://www.dotnetdotcom.org/#info, email@example.com)
    Mozilla/5.0 (Twiceler-0.9 http://www.cuill.com/twiceler/robot.html)
    msnbot/1.1 (+http://search.msn.com/msnbot.htm)
    FeedBurner/1.0 (http://www.FeedBurner.com)
    Mozilla/5.0 (compatible; attributor/1.13.2 +http://www.attributor.com)
    WebAlta Crawler/2.0 (http://www.webalta.net/ru/about_webmaster.html) (Windows; U; Windows NT 5.1; ru-RU)
    Yandex/1.01.001 (compatible; Win16; I)
    ...
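Separating bots from browsers came down to matching on the user-agent field. Here's a minimal sketch of the kind of tally I ran, assuming an Apache-style combined log where the user-agent is the last quoted field (the marker list and sample lines are illustrative, not my actual script):

```python
import re
from collections import Counter

# In an Apache "combined" log line, the user-agent is the last quoted field.
UA_RE = re.compile(r'"([^"]*)"\s*$')

# Substrings that mark the crawlers seen above (illustrative, not exhaustive).
BOT_MARKERS = ("slurp", "googlebot", "mediapartners", "dotbot", "twiceler",
               "msnbot", "feedburner", "attributor", "webalta", "yandex")

def tally(lines):
    counts = Counter()
    for line in lines:
        m = UA_RE.search(line)
        if not m:
            continue
        ua = m.group(1).lower()
        kind = "bot" if any(marker in ua for marker in BOT_MARKERS) else "browser"
        counts[kind] += 1
    return counts

# Two made-up log lines, one bot and one browser.
sample = [
    '1.2.3.4 - - [01/Jun/2008:00:00:01 +0000] "GET / HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"',
    '5.6.7.8 - - [01/Jun/2008:00:00:02 +0000] "GET / HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 (Windows NT 5.1) Firefox/2.0"',
]
print(dict(tally(sample)))  # {'bot': 1, 'browser': 1}
```

Matching on substrings like this misses unfamiliar crawlers, but it's enough to get the 80/20 split.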
You get the idea. The bulk of these are search engines. Over two-thirds of the bot requests were from Yahoo Slurp. Now, this struck me as weird. If I take the top 3 search engines that are sending traffic my way:
| Search engine | Referral % | Crawl % |
|---|---|---|
The search engine that sends me the most traffic is being reasonably conservative, while Yahoo is just eating up the bandwidth on my site. Actually, this shouldn't bother me too much. It's not taking up much bandwidth or CPU, given that all the bots put together make up only 20% of my traffic. But somehow… it's sub-optimal. Inelegant, even.
So I decided to take a closer look. Just how often are they crawling my site?
| Bot | Crawl frequency |
|---|---|
| Yahoo | Every 5 seconds |
| Google | Every 13 seconds |
| DotBot | Every 9 minutes |
| Cuill | Every 9 minutes |
| Microsoft | Every 18 minutes |
| Feedburner | Every 18 minutes |
| Attributor | Every 23 minutes |
| Yandex | Every 27 minutes |
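The frequencies above are just the span between a crawler's first and last hit, divided by the number of gaps between hits. A rough sketch of that calculation (the timestamps here are made up, not from my logs):

```python
from datetime import datetime, timedelta

def average_interval(timestamps):
    """Mean gap between consecutive hits from one crawler."""
    ts = sorted(timestamps)
    if len(ts) < 2:
        return None
    # Total span divided by the number of gaps between hits.
    return (ts[-1] - ts[0]) / (len(ts) - 1)

# Simulated Slurp hits, five seconds apart.
base = datetime(2008, 6, 1)
hits = [base + timedelta(seconds=5 * i) for i in range(100)]
print(average_interval(hits))  # 0:00:05
```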
Look at those numbers. Yahoo is hitting my site once every 5 seconds. No wonder there's a help page at Yahoo titled “How can I reduce the number of requests you make on my web site?” I followed their advice and set the crawl-delay to 60, so at least it slows down to once a minute.
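The change itself is just two lines in robots.txt, along these lines (Slurp is the user-agent token Yahoo's help page gives for its crawler):

```
User-agent: Slurp
Crawl-delay: 60
```

Crawl-delay isn't part of the original robots.txt standard, but Yahoo's crawler honours it.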
Just that one little line change should (hopefully) reduce the load on my site by around 15%.
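A back-of-the-envelope check on that figure: bots are 20% of all hits, Slurp is about two-thirds of the bots, and a crawl-delay of 60 cuts its request rate to a twelfth of what it was. The numbers below are the rough shares from this post, not precise measurements:

```python
bot_share = 0.20             # bots are 20% of all hits
yahoo_share_of_bots = 2 / 3  # over two-thirds of bot requests are Slurp
old_interval, new_interval = 5, 60  # seconds between Slurp hits, before and after

yahoo_share_of_total = bot_share * yahoo_share_of_bots           # ~13.3% of all hits
remaining = yahoo_share_of_total * (old_interval / new_interval) # ~1.1% after the change
saving = yahoo_share_of_total - remaining
print(f"{saving:.1%}")  # 12.2%
```

So ~12% by this estimate, in the same ballpark as the 15% figure.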
As for the other bots, their load doesn't bother me much.
- Google, for all that it crawls every 13 seconds, has faithfully reported that it has only 11% of my site under its index, so I’ve no idea what they’re doing, but I’m not complaining about the traffic that’s coming my way.
- DotBot. Today was the first I’d heard of them. Visited the site, and smiled. These guys can do all the crawling of my site that they like, and I hope something interesting comes out of their work.
- Cuill sends me only 0.2% of my traffic, but it's a new search engine, so I'm happy to give it time.
- Microsoft's OK; it sends me a tiny stream of traffic.
- Feedburner is just pinging my RSS feed every 18 minutes.
- Attributor and Yandex are also names I'm hearing for the first time. Neither puts much load on the system, so that's OK.
What’s amazing is the sheer number of bots out there. Last month, I counted over 600 distinct user-agent strings just representing bots. So it’s true. The Web is no longer just for humans. We do need a Semantic Web.