October 10, 2002
Yahoo debuts new Ratings feature

Yahoo! News - Reader Ratings is a new feature, and it did seem to weed out some so-so stories, like that "sex with a twist" lemon-as-contraceptive story.

Posted by Heretic at 09:58 PM
Google RSS NewsFeed

Now that Mark has publicized it, this should last a good 5 more minutes before Google shuts it down.

Posted by Heretic at 11:53 AM
October 08, 2002
Google Analysis

OK, I got the Google Toolbar. Through it, I found out that my blog has a PageRank of 4. This got me thinking about Google some more and I thought I'd do some analysis. Let's start with the most recent terms people have used to find me.

Biggest assumptions in physics. Well, not much to say there because I'm not in the top 2 pages anymore I guess and I'm not going to look any deeper...

Under Forbidden Glasses, not including the sponsored link, I have the 5th position. Above me are such luminaries as So you'd like to ... Borrow a pair of reading glasses for these super books! A guide by mickey1114, Nearsighted Reader of Great Books of Fiction!. Why? Because "glasses" is in the title and one of the books he lists is titled "Forbidden". There's also some karate site which has the phrase "Glasses are forbidden" in it. Good guess I suppose. I also just beat out Edible undies.

But let's get serious in our analysis of these various search requests. The first question we, as humans, need to answer is: what is the person using these search terms actually looking for? In the case of "Forbidden Glasses", nothing immediately springs to mind. One would tend to think this is a reference to some historical artifact or the like... So for all these terms, let's scan the Google results and see if we can A. discover what the searcher was looking for, B. see how well Google did in coming up with the result and C. determine if there was a way that Google could have found it despite being a machine...

Alas, for Forbidden Glasses, I could find nothing to explain what the searcher was looking for. Refining the search by putting quotes around it provides one potential response here, where it's part of some lyrics, but somehow I doubt that was the intention. So did Google do a good job? A human would probably conclude that there were no relevant results, rather than the 57,300 that Google came up with. Even so, I think it is reasonable for Google to provide every page containing those two terms.
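To make the difference between the plain search and the quoted search concrete, here is a minimal sketch of both behaviors. It assumes a toy in-memory index; the page names and texts are made up for illustration (loosely modeled on the results described above), and none of this is meant to describe how Google actually stores or matches pages.

```python
from collections import defaultdict


def build_index(pages):
    """Map each lowercase word to the set of page names containing it."""
    index = defaultdict(set)
    for name, text in pages.items():
        for word in text.lower().split():
            index[word.strip('.,!?"\'')].add(name)
    return index


def search_all_terms(index, query):
    """Return pages containing every term in the query (boolean AND)."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


def search_phrase(pages, phrase):
    """Return pages containing the exact phrase (like a quoted search)."""
    phrase = phrase.lower()
    return {name for name, text in pages.items() if phrase in text.lower()}


# Hypothetical pages, invented for this example.
PAGES = {
    "book-guide": "Borrow a pair of reading glasses for these books: Forbidden",
    "karate-site": "Glasses are forbidden during sparring.",
    "lyrics": "She wore forbidden glasses in the rain",
    "unrelated": "Edible things and other novelties",
}

INDEX = build_index(PAGES)
matches = search_all_terms(INDEX, "Forbidden Glasses")
phrase_matches = search_phrase(PAGES, "Forbidden Glasses")
print(sorted(matches))
print(sorted(phrase_matches))
```

The AND search returns every page that happens to contain both words, in any order and any context, which is why the book guide and the karate site qualify. The phrase search narrows that down to pages where the words sit side by side.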

Next up is a Googlified Yahoo search for robin richards tutankhamen. I no longer come up in the first page, but I was pretty early in commenting on the story so I think it was appropriate for my post to be there initially. So what would be the ideal search result for this query? As a human, I'd rather see the original research page than someone's account of it. My post referenced where I first saw the story, which was on CNN. From reading that CNN story it's clear that the searcher is indeed looking for more original references, which is why "Robin Richards", who was quoted in the story, is part of the search terms. Let's see how Google did here.

One would suspect that Google, with its PageRank algorithm, is more likely to rank a popular news site above the university or organization that released the data in the first place. But first let's find the original reference. It seems that Discovery Channel has a lot of info on this topic. But there are two other leads: the story mentioned that the face is on exhibit at the London Museum, and it also mentions that Robin Richards is from University College London. Looking at those leads, I'd say that if we as humans had to rank the best results for the intent of this search, it would be London Museum's King Tut Page first, Discovery Channel's second and all the news channels below them. You may argue that a specific page with Robin Richards on it might be more relevant, but when you look at Google's results, they don't fare too well in that department either.

The first position went to MSNBC's King Tut story, which has a PageRank of 7. The CNN story has a PageRank of 2 for some reason and doesn't show up in the result set. Third is Gene Expression, which has a link to that CNN news story.

OK, I have figured it out. Gene Expression has a PageRank of 6. Why? Because blogs, with all the cross-linking, tend to have high PageRanks. So he probably happened to be indexed when that story was at the top of his page, just as probably happened to me. So the key is a high PageRank plus having the story on the front page of the site. That is why blogs do so well. They have topical stories which tend to be searched on a lot, and the newest stories are at the top of the page. If you drill down to Gene Expression's individual post, the PageRank goes down to 5 (still not bad).
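The cross-linking effect can be illustrated with a small sketch of the simplified, published form of PageRank, computed by power iteration. Everything specific here is an assumption for illustration: the link graph is hypothetical, the damping factor of 0.85 is the commonly cited value, and the scores are raw probabilities that do not map onto the toolbar's 0 to 10 scale. It is not a claim about what Google actually runs.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    links maps each page to the list of pages it links to.
    Pages with no outgoing links spread their rank evenly over all pages.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the small "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            targets = outgoing if outgoing else pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical link graph: three blogs that all link to each other,
# and a museum page that only one of the blogs links to.
LINKS = {
    "blog_a": ["blog_b", "blog_c", "museum"],
    "blog_b": ["blog_a", "blog_c"],
    "blog_c": ["blog_a", "blog_b"],
    "museum": [],
}

ranks = pagerank(LINKS)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page}: {score:.3f}")
```

In this toy graph the three blogs keep passing rank back and forth among themselves, so each of them ends up scoring above the museum page, even though the museum page is arguably the more authoritative source.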

So even though the best links are the London Museum and Discovery Channel, those pages have PageRanks of only 5 and 3 respectively.

Coming back to the results, the second one is some Saigon site. That looks like an appropriate result but I can't understand a word being said there. Its PageRank seems to be messed up because the toolbar says 0 of 10, but Google's toolbar does that sometimes. More baffling is why this result about Richard Rockwood did so well. It has a PageRank of 1. The rest of the results are rather poor. So the two best links didn't even show up in the results. How could Google have done better? I'm not sure.

I was planning to do more searches but I didn't expect this to be so intensive. So let me conclude by speculating a bit. Isn't it interesting that only one news site showed up? Part of it seems to be the inclusion of "Robin Richards" in the search query, since not all the news stories bothered mentioning him (an oversight which is a good topic for another discussion). However, this Yahoo news story has a PageRank of 8 out of 10, yet I couldn't get a query to bring up that page. I wonder whether Google has restrictions on some news sites.

Here are some pages for further reading:

The Google Weblog

Dive into Mark

High Search Engine Rankings

Posted by Heretic at 03:07 AM