Sunday, June 6, 2010

Snooping on search engines with hex tricks.

The big hex number you see in the title of the blog is there for a reason: it's an experiment on the search engines out there.

What's this hex number? It's a SHA-1 hash of the string "Andrew Yourtchenko" - this gives a nice token that is unique in the whole world (because no one else has yet come up with the idea to hash my name and put the result online). Searching for this number in the search engines gives some very entertaining results.


  • Google: 2 results after folding the duplicates - main page + one post. Expanding them gives 25 results. Cool.

  • Bing: 0 results. Boo. I could not find any page on this blog using it. 2xBoo.

  • Yahoo: 2 results - main page and one post, different from the post that Google shows.

  • duckduckgo: One hit to the same post as Yahoo shows, not to the main page.

  • altavista: 2 results, same as Yahoo.

  • ask.com: 2 results, same as Google.

  • cuil.com: 0 results.

  • baidu: 0 results. Somehow not surprising at all.

  • kosmix: 2 results in google web search (same as google), 4 results in the google blog search, 0 results in yahoo web search. Entertaining how they disagree with Yahoo.

  • yandex.ru: 0 results.

  • yebol.com: 2 results, same as yahoo, + 1 site result. Very interesting form of presentation. I have got to try this one for daily searches, even though it does not appear to be very fast compared to google.com.
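If you want to run the same experiment with your own name, the token is easy to produce. Here is a minimal sketch in Python; the function name `name_token` is made up for this example, and it assumes the name is hashed as UTF-8 bytes with no trailing newline (a different encoding or a stray newline would give a completely different number).

```python
import hashlib


def name_token(name: str) -> str:
    """Return the SHA-1 digest of `name` as 40 lowercase hex characters."""
    # Assumption: UTF-8 encoding, no trailing newline.
    return hashlib.sha1(name.encode("utf-8")).hexdigest()


token = name_token("Andrew Yourtchenko")
print(token)  # paste this into a page title, wait for the crawlers, then search for it
```

The output is always 40 hex digits, so it is short enough to sit in a title and long enough that an accidental match is very unlikely.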



Conclusions:


  1. I don't have many inbound links here - probably about two :)

  2. google.com has a better reach towards the "long tail" (EDIT: 'long tail' in this case being blogger.com - would be interesting to test e.g. typepad.com or other blog sites)

  3. There are many more than 127 billion pages on the web.

  4. yebol.com is a new toy to play with



Though some of the above are fairly obvious, it's not a bad result for a single SHA-1 hash, I think. It would be interesting to see what results this method gives for more popular blogs/websites.

The debatable point in this method is to what extent the search engines treat the "oddball hex stuff" differently from the "common words". From all I know about the search engines, they should not. If anything it's the other way around: it's the stopwords (the ones too frequent to be useful) that are usually filtered.
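To illustrate that last point, here is a toy sketch in Python of what stopword filtering looks like at indexing time. It is an assumption about how an indexer might behave, not a description of any real search engine: the stopword list is made up and tiny, and the hex string is a 40-character placeholder, not the actual hash from the title.

```python
import re
from typing import List

# Hypothetical, deliberately tiny stopword list, for illustration only.
STOPWORDS = {"a", "an", "and", "is", "of", "on", "the", "to"}


def index_terms(text: str) -> List[str]:
    """Lowercase the text, split it into alphanumeric tokens, drop stopwords."""
    tokens = re.findall(r"[0-9a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


# Placeholder 40-character hex token, not the real hash.
sample = "The title of the blog is 0123456789abcdef0123456789abcdef01234567"
terms = index_terms(sample)
print(terms)
```

In this sketch the frequent words are thrown away and the hex token survives as an ordinary, very rare term, which is exactly why it should be easy to find once a page containing it has been crawled.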
