Google And Ye Shall Find

By Andrew Price, 2008-01-30 23:51:59 in General.

I've just had an odd Google experience while scouring the web for relevant papers for my computer science project. It all started at the Google Scholar site which I was using to search for papers relating to Linux file systems and storage, particularly performance analysis, block I/O and anything to do with blktrace or GFS2.

After a while I stumbled across a highly relevant-looking abstract of a paper entitled Evaluating Block-level Optimization through the IO Path [Riska, Larkby-Lahet & Riedel] so I clicked on the link. A username/password dialog appeared. It turned out I needed a USENIX account to get at this potentially juicy morsel of knowledge. So I clicked on the Web Search link that Google Scholar places next to search results, hoping to find the paper offered freely elsewhere. As expected, the first page of results showed links to pages on the USENIX site, including a link to the actual PDF, which again popped up a username/password dialog (Google's crawler must have a USENIX account...?).

Then I noticed something else - a View as HTML link. Surely it couldn't work. It shouldn't. Could Google search really be that nifty (or subversive)? I clicked on the link. Before my eyes the once-imprisoned-as-a-password-protected-PDF paper loaded in all its oddly formatted glory. A brief skim through reveals that some figures have missing information but I think I can live with that. I'll put it with the others and read it in the morning. Thanks, Google.

[Edit: Good news - since I wrote this entry USENIX has announced that it is making all the proceedings of its conferences available to everyone.]

Comments

Auzon writes:

Many websites cater to Google's bot. If you set it as your user agent, you see a very different web. Sometimes you can see content that ordinarily requires you to log in, as in this case.

2008-01-31 00:20:57
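
For readers wondering what "putting it as your user agent" means in practice, here is a minimal sketch using Python's standard library. It only builds a request object and never sends it. The helper name build_request and the example.com URL are made up for illustration, and whether any given site responds differently is an assumption: many sites verify crawlers by other means, so the header alone may change nothing.

```python
from urllib.request import Request

# User-agent string that Google's crawler identifies itself with.
GOOGLEBOT_UA = (
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
)


def build_request(url, user_agent=GOOGLEBOT_UA):
    """Hypothetical helper: build (but do not send) a request with a custom User-Agent."""
    return Request(url, headers={"User-Agent": user_agent})


# Placeholder URL, not a real paper.
req = build_request("http://example.com/paper.pdf")

# urllib stores header names capitalized as "User-agent".
print(req.get_header("User-agent"))
```

The user agent is just one HTTP header supplied by the client, which is why a server that keys its behaviour on it alone can be shown a different identity.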

nixternal writes:

I did the same recently as well. I think it was Google, but I had used the university's database browser and found a paper relevant to what I was writing about. Come to find out, they wanted something like $25 for the paper, but I clicked the "view as html" link and, sure enough, I got that same badly formatted paper for absolutely free. Glad it worked out that way too, because I had thought about purchasing the paper, but after reading it I found it wouldn't work for my case.

2008-01-31 00:21:39

Anon writes:

This is remarkably common on the web. However, there are all sorts of headers and robots.txt rules you can set that will prevent bot caching ( http://googleblog.blogspot.com/2007/07/robots-exclusion-protocol-now-with-even.html ). Most places (like portal.acm.org) that may ask for logins but let Google crawl seem to have arranged with Google at some point in the past to not allow caching...

2008-01-31 10:03:57
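
As a rough illustration of the header approach mentioned in the comment above, here is a minimal sketch of how a site might build an X-Robots-Tag response header that lets a crawler index a page but not keep a cached copy. The function name robots_headers and its parameters are hypothetical, and wiring the header into a real web server is left out.

```python
def robots_headers(allow_index=True, allow_cache=False):
    """Hypothetical helper: build an X-Robots-Tag header for a protected document.

    "noindex" keeps the page out of search results; "noarchive" asks the
    search engine not to offer a cached copy of it.
    """
    directives = []
    if not allow_index:
        directives.append("noindex")
    if not allow_cache:
        directives.append("noarchive")
    if not directives:
        return {}  # no restrictions, so no header is needed
    return {"X-Robots-Tag": ", ".join(directives)}


# Indexable but not cacheable, as a paywalled PDF might be served.
print(robots_headers())
```

For HTML pages, the same directive can be given in the page itself with a robots meta tag whose content is "noarchive"; the header form is what makes it usable for non-HTML files such as PDFs.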

Jason Brower writes:

Actually, I have noticed that, for some reason, some PDF docs ask for passwords when opened in our open-source-friendly readers. Try opening that passworded document with the real deal and you may see a perfectly formatted document.

2008-01-31 12:29:09

Dave A writes:

I've had a similar experience, but I got around it by Googling for the paper's title. Sure enough, there was a PDF link in the search results that worked, no password required.

2008-01-31 13:40:26
