
The assertion that this is because "indexing the whole Web is crushingly expensive, and getting more so every day" is a bit flawed. Old content is very unlikely to change, so it doesn't have to be re-crawled often; I'm certain Google keeps a score of how frequently a given site's content tends to change. The cost argument becomes even less convincing when you consider that DuckDuckGo, a company with an infinitesimal fraction of Google's resources, manages to keep that kind of content in its database.

I agree with the observation that this is about shifting everything to current data, because people overwhelmingly care about things that happened a few days ago. There used to be a long tail of users searching for old data and references, but I suspect they're fading away. Biasing the index towards recency also has legal advantages for Google, because delisting old content means fewer takedown requests under "right to be forgotten" legislation.



Crawling isn't the real problem, nor is the bulk storage for the crawled pages.

What do you do with these pages after you've crawled them? You need to build an index out of them, and serve that index out of some kind of low latency storage (DRAM, Flash). That makes increasing the index size very expensive. The index size has to be limited, and selecting the right pages to include in the index is thus a core quality feature for a search engine.
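
To make that concrete, here's a toy sketch (Python, made-up data) of why every page you admit into the serving index adds postings that have to live on fast storage, even though the raw crawl can sit on cheap disk:

    # Toy inverted index: every admitted page adds one posting per distinct
    # term, and all of those postings must sit on fast storage (DRAM/flash)
    # to answer queries with low latency.
    from collections import defaultdict

    index = defaultdict(list)                # term -> list of doc ids (postings)

    def admit(doc_id, text):
        """Add a crawled page to the serving index."""
        for term in set(text.lower().split()):
            index[term].append(doc_id)

    def search(query):
        """Intersect the posting lists of all query terms."""
        postings = [set(index.get(t, ())) for t in query.lower().split()]
        return set.intersection(*postings) if postings else set()

    admit(1, "old forum post about printer drivers")
    admit(2, "news article about printer recalls")
    print(search("printer drivers"))         # {1}

    # Crawled pages can sit on cheap disk indefinitely; it's only the pages
    # you *admit* here that grow the posting lists on expensive storage,
    # which is why index selection is the costly decision.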


I'm having trouble imagining that Google is more constrained by the ratio of hardware capacity to data size today than it was in the early days. If keeping the whole index in DRAM is now a requirement, then yes, I'd expect a hugely reduced overall dataset - but wouldn't that affect far more sites and pages than the comparatively few dropped historical records?

I still suspect that this whole thing is more about bias (and personalization, be it correct or incorrect) in the results.


Google's index has been in memory for most of its life now: http://glinden.blogspot.com/2009/02/jeff-dean-keynote-at-wsd...

It's actually more complicated than just a single static index, which is also why it's unrealistic to expect a search engine to be deterministic at scale.


It was a limit back in the day as well. Remember Google's "supplemental results"[0]? This has always happened. The only thing that's different is that a blogger was personally insulted by his output not being fully indexed, and decided to pitch it as history being erased.

[0] https://searchengineland.com/google-dumps-the-supplemental-r...


The index only has to be on "low latency storage" if low-latency results to any query are required. While that's definitely true of the modal "Google Search", most of these queries for "long-tail, old content" as discussed in the OP don't really need that sort of quick response.


Interview questions:

How would a search engine distinguish between the two kinds of queries, tens of thousands of times a second?

And how would one architect such a two-tiered system, particularly with an eye toward cascading failures?


Make it opt-in. Instead of requiring every search to be finished in less than 0.5 seconds, allow users to tick a box that says "Take your time" and pull indices from slow storage in that case. If I know I want something niche, I am willing to wait the extra few seconds or even a minute.


Hell, even waiting an entire day (e.g. with results sent via email) might be reasonable for some searches.
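
Roughly what I'm imagining, as a toy sketch (all names and data invented, not a claim about how any real engine does it): the fast path answers from the hot index immediately, and the opt-in path is just a queued job whose result gets delivered later.

    # Opt-in "take your time" search: fast queries hit the hot index
    # synchronously; slow ones go into a job queue and the answer is
    # delivered whenever the cold scan finishes (e.g. by email).
    import queue, threading, time

    FAST_INDEX = {"printer drivers": ["recent support page"]}        # hot tier
    COLD_ARCHIVE = {"obscure 2004 forum post": ["archived thread"]}  # cheap, slow tier

    slow_jobs = queue.Queue()

    def search(query, take_your_time=False, email=None):
        if not take_your_time:
            return FAST_INDEX.get(query, [])     # must answer in milliseconds
        slow_jobs.put((query, email))            # no latency promise at all
        return "We'll email you when it's ready."

    def slow_worker():
        while True:
            query, email = slow_jobs.get()
            time.sleep(1)                          # stands in for a long cold scan
            results = COLD_ARCHIVE.get(query, [])
            print(f"emailing {email}: {results}")  # stand-in for delivery
            slow_jobs.task_done()

    threading.Thread(target=slow_worker, daemon=True).start()

    print(search("printer drivers"))                                  # instant
    search("obscure 2004 forum post", take_your_time=True, email="me@example.com")
    slow_jobs.join()                                                  # wait for delivery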


Behind the scenes, a search on Google involves at least a thousand machines.

How much extra state (internal connections, memory for partial results, etc.) would such a new search type create?

How do you deal with the new kinds of hot spots this creates?

What if millions of people suddenly activate such an option?

What if a botnet does it?


All good points :)


I don't work at Google so I'm probably way off base, but if I was designing it I wouldn't bother telling the difference between the two types of queries.

I'd break up the indices into digestible chunks, perhaps chronologically by year/month crawled, and then run all queries simultaneously (in parallel) against all those index chunks and combine the results at the end. Infinitely scalable and can be tweaked to ensure specific response times.
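
Purely as a sketch of that fan-out (toy shards, nothing to do with Google's real setup):

    # Scatter-gather over chronologically sharded index chunks: every query
    # fans out to all shards in parallel and the hits are merged at the end.
    from concurrent.futures import ThreadPoolExecutor

    SHARDS = {                                   # crawl year -> tiny index chunk
        2014: {"printer drivers": ["old forum thread"]},
        2019: {"printer drivers": ["vendor KB article"]},
        2024: {"printer drivers": ["current support page"]},
    }

    def query_shard(year, query):
        return [(year, hit) for hit in SHARDS[year].get(query, [])]

    def search(query):
        with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
            futures = [pool.submit(query_shard, year, query) for year in SHARDS]
            hits = [hit for f in futures for hit in f.result()]
        return sorted(hits, reverse=True)        # e.g. newest shards first

    print(search("printer drivers"))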

And there'd definitely be no need to set some arbitrary date cut-off; just add a few more virtual machines. I'd bet that's what Google was doing, and then scaled back those machines to save money and boost profits.


That's kind of how Google works, with multiple index tiers. Look up patents by Anna Patterson to get a few clues, assuming your lawyers won't bark at you.

Still, you can't keep partial results around forever, unless you want to make searches a lot more expensive, having to add a lot of capacity just to deal with the buffer bloat. Each query touches at least a thousand machines. Adding "a few more virtual machines" isn't going to cut it, especially if you have to handle tens of thousands of requests per second.
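
A crude picture of the tier idea (invented tiers and data, not what the patents actually describe):

    # Tiered serving: try the small, fast tier first and only fall back to
    # bigger, slower tiers when it comes up short. Each extra tier touched
    # means more machines held open and more partial state per query.
    TIERS = [
        ("hot / in-memory", {"iphone battery": ["recent support doc", "official battery page"]}),
        ("warm / flash",    {"iphone battery": ["2021 teardown"],
                             "win98 drivers":  ["archived vendor page"]}),
        ("cold / disk",     {"win98 drivers":  ["1999 usenet thread"]}),
    ]

    def search(query, want=2):
        hits = []
        for name, index in TIERS:
            hits += index.get(query, [])
            if len(hits) >= want:                # stop as soon as we have enough
                break
        return hits

    print(search("iphone battery"))   # served from the hot tier alone
    print(search("win98 drivers"))    # has to reach into the warm and cold tiers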


Progressive loading is a thing.


Yeah, it seems a natural consequence of the combination of vast amounts of recent content with the fact that people mostly want recent content. To pick one trivial example from yesterday, if I'm looking for help with an interface issue with some current version of a program, forum posts from 10 years ago are probably not useful.

Information that people regularly access for whatever reason will tend to remain relatively visible. But, yeah, relatively obscure older content is just going to get drowned out unless you know exactly where and how to look. One might argue with Google's criteria around relevance. However, that older information is going to get harder and harder to find just in the natural course of things.



