
Blog

I ran my own blog from November 2006 till October 2014. All posts are still online, but I don't have time to update it anymore. Please note that all images and media files were removed when the backup was moved to a new host in early 2016. Enjoy!

Why I hate Cuil

31. July 2008

On Monday a new search engine was launched. It's named Cuil but pronounced "cool" (huh?), and it got a lot of really bad reviews, which made me really happy.

So why do I hate Cuil?

Well, a few months ago we started noticing more and more traffic on our company websites, caused by a crawler called Twiceler. Twiceler was run by a company called "Cuil" and claimed to be some kind of experimental search engine robot. A few days later the same crawler also started hitting my personal websites.

The Twiceler bot is probably the most stupid crawler I've ever seen: it just downloads everything it can find, and it seems it won't ever stop. If a page takes dynamic input in its URL (a calendar, for example), it will download the same page 100,000 times and more, simply by following every dynamic link it can find without any kind of intelligent limit.

By downloading thousands of pages per hour on each website it can cause an incredible amount of traffic on a server, and dynamic scripts (written in Perl, Python or PHP, for example) start producing an immense CPU load that may even take your entire server down (as reported by several webmasters). Twiceler is really harmful and can cost both money and downtime. A well-written crawler such as Googlebot or Slurp (Yahoo) would never affect a website in such a malicious way.

After googling for Twiceler we found out that many webmasters had experienced such problems with Cuil. Of course we thought that such a crappy crawler - which doesn't seem to care about duplicate content, website performance, bandwidth or traffic costs - had to be some kind of malicious spam bot.

Since the stupid Cuil/Twiceler bot just won't stop, the first thing you'll do as a webmaster or system administrator is set up a robots.txt file that tells Twiceler not to index any more pages (or at least blocks some of the directories that should not be indexed, such as dynamic scripts).
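For anyone who hasn't done this before, the directives are simple. A minimal robots.txt along these lines could look roughly like this (the directory names are just placeholders, adjust them to your own site):

  # Keep the Twiceler crawler out of the dynamic parts of the site
  User-agent: Twiceler
  Disallow: /calendar/
  Disallow: /cgi-bin/

  # Or, to shut it out of the whole site:
  # User-agent: Twiceler
  # Disallow: /

The file goes into the document root of the website, and any crawler that plays by the rules will read it before fetching anything else.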

Cuil claims that their Twiceler crawler respects the robots.txt file, but even days after setting it up nothing changed: the damn bot continued indexing anything it could get and completely ignored all robots.txt rules (google for Twiceler and you'll see that other webmasters are experiencing the same thing).

So finally we blocked the entire Cuil bot on our servers, just as many other people recommend in webmaster forums. On our company servers we blocked all incoming connections that could be identified as a Cuil/Twiceler bot; on my personal websites I blocked all of Cuil's IP addresses using .htaccess files.
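In case someone wants to do the same, here is roughly what such an .htaccess block can look like on an Apache server. The IP addresses below are just placeholders - use the ones you actually see in your own logs:

  # Refuse requests from the crawler's IP addresses (example addresses only,
  # replace them with the ones from your access_log)
  Order Allow,Deny
  Allow from all
  Deny from 192.0.2.0/24
  Deny from 198.51.100.17

  # Alternatively, match the user-agent string instead of fixed IPs:
  # SetEnvIfNoCase User-Agent "Twiceler" bad_bot
  # Deny from env=bad_bot

The user-agent variant only works as long as the bot identifies itself honestly, which is why blocking the IP addresses is the safer choice.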

It was a funny moment when the Cuil search engine went live on Monday and they claimed to have the world's biggest index. Of course they have! Their damn bot seems to index each dynamic web page a million times, no matter if it's always the same content or if you're clearly saying that the page should not be indexed at all (via robots.txt).

Maybe this also explains the poor quality of their search results - their index may be the largest on this planet, but it's probably full of crap and duplicates. 

If you're a webmaster or website owner and you're currently experiencing high bandwidth or traffic problems, you should check your access_log, because there's a good chance that your problems are caused by Cuil. If this is the case I can only recommend blocking all of Cuil's IP addresses on your server, because that seems to be the only thing that really works.
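A quick way to check is a couple of shell one-liners on the server (the log path is just an example, yours may differ):

  # Count the requests whose user-agent contains "Twiceler"
  grep -c Twiceler /var/log/apache2/access_log

  # List the IP addresses the bot comes from, most active first,
  # so you know what to put into your block list
  grep Twiceler /var/log/apache2/access_log | awk '{print $1}' | sort | uniq -c | sort -rn

If Twiceler shows up thousands of times, you know where your bandwidth went.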

To finish, I'd like to say that I think Cuil should start focusing on the quality of their algorithms and their content instead of relying completely on the marketing of doubtful numbers.