
The assertion that this is because "indexing the whole Web is crushingly expensive, and getting more so every day" is a bit flawed. Old content is very unlikely to change, so it doesn't have to be re-crawled often; I'm certain Google keeps a score of how frequently a given site's content tends to change. The cost argument becomes even less convincing when you consider that DuckDuckGo, a company with an infinitesimal fraction of Google's resources, manages to keep that kind of content in its database.

I agree with the observation that this is about shifting everything to current data, because people overwhelmingly care about things that happened a few days ago. There used to be a long tail of users searching for old data and references, but I suspect they're fading away. Biasing the index towards recency also has legal advantages for Google, because delisting old content means fewer takedown requests under "right to be forgotten" legislation.



Crawling isn't the real problem, nor is the bulk storage for the crawled pages.

What do you do with these pages after you've crawled them? You need to build an index out of them, and serve that index out of some kind of low latency storage (DRAM, Flash). That makes increasing the index size very expensive. The index size has to be limited, and selecting the right pages to include in the index is thus a core quality feature for a search engine.
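
To make that concrete, here's a toy sketch (Python, made-up data) of why every page you admit into the serving index adds postings that have to live on fast storage, even though the raw crawl can sit on cheap disk:

    # Toy inverted index: every admitted page adds one posting per distinct
    # term, and all of those postings must sit on fast storage (DRAM/flash)
    # to answer queries with low latency.
    from collections import defaultdict

    index = defaultdict(list)                # term -> list of doc ids (postings)

    def admit(doc_id, text):
        """Add a crawled page to the serving index."""
        for term in set(text.lower().split()):
            index[term].append(doc_id)

    def search(query):
        """Intersect the posting lists of all query terms."""
        postings = [set(index.get(t, ())) for t in query.lower().split()]
        return set.intersection(*postings) if postings else set()

    admit(1, "old forum post about printer drivers")
    admit(2, "news article about printer recalls")
    print(search("printer drivers"))         # {1}

    # Crawled pages can sit on cheap disk indefinitely; it's only the pages
    # you *admit* here that grow the posting lists on expensive storage,
    # which is why index selection is the costly decision.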


I'm having trouble imagining that Google is more constrained by the ratio of hardware capacity to data size today than it was in the early days. If keeping the whole index in DRAM is now a requirement, then yes, I'd expect a hugely reduced overall dataset - but wouldn't that affect far more sites and pages than the comparatively few dropped historical records?

I still suspect that this whole thing is more about bias (and personalization, be it correct or incorrect) in the results.


Google's index has been in memory for most of its life now: http://glinden.blogspot.com/2009/02/jeff-dean-keynote-at-wsd...

It's actually more complicated than just a single static index, which is also why it's unrealistic to expect a search engine to be deterministic at scale.


It was a limit back in the day as well. Remember Google's "supplemental results"[0]? This has always happened. The only thing that's different is that a blogger was personally insulted by his output not being fully indexed, and decided to pitch it as history being erased.

[0] https://searchengineland.com/google-dumps-the-supplemental-r...


The index only has to be on "low latency storage" if low-latency results to any query are required. While that's definitely true of the modal "Google Search", most of these queries for "long-tail, old content" as discussed in the OP don't really need that sort of quick response.


Interview questions:

How would a search engine distinguish between the two kinds of queries, tens of thousands of times a second?

And how would one architect such a two-tiered system, particularly with an eye toward cascading failures?


Make it opt-in. Instead of requiring every search to be finished in less than 0.5 seconds, allow users to tick a box that says "Take your time" and pull indices from slow storage in that case. If I know I want something niche, I am willing to wait the extra few seconds or even a minute.


Hell, even waiting an entire day (e.g. with results sent via email) might be reasonable for some searches.
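
Roughly what I'm imagining, as a toy sketch (all names and data invented, not a claim about how any real engine does it): the fast path answers from the hot index immediately, and the opt-in path is just a queued job whose result gets delivered later.

    # Opt-in "take your time" search: fast queries hit the hot index
    # synchronously; slow ones go into a job queue and the answer is
    # delivered whenever the cold scan finishes (e.g. by email).
    import queue, threading, time

    FAST_INDEX = {"printer drivers": ["recent support page"]}        # hot tier
    COLD_ARCHIVE = {"obscure 2004 forum post": ["archived thread"]}  # cheap, slow tier

    slow_jobs = queue.Queue()

    def search(query, take_your_time=False, email=None):
        if not take_your_time:
            return FAST_INDEX.get(query, [])     # must answer in milliseconds
        slow_jobs.put((query, email))            # no latency promise at all
        return "We'll email you when it's ready."

    def slow_worker():
        while True:
            query, email = slow_jobs.get()
            time.sleep(1)                          # stands in for a long cold scan
            results = COLD_ARCHIVE.get(query, [])
            print(f"emailing {email}: {results}")  # stand-in for delivery
            slow_jobs.task_done()

    threading.Thread(target=slow_worker, daemon=True).start()

    print(search("printer drivers"))                                  # instant
    search("obscure 2004 forum post", take_your_time=True, email="me@example.com")
    slow_jobs.join()                                                  # wait for delivery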


Behind the scenes, a search on Google involves at least a thousand machines.

How much extra state (internal connections, memory for partial results, etc.) would such a new search type create?

How do you deal with the new kinds of hot spots this creates?

What if millions of people suddenly activate such an option?

What if a botnet does it?


All good points :)


I don't work at Google so I'm probably way off base, but if I was designing it I wouldn't bother telling the difference between the two types of queries.

I'd break up the indices into digestible chunks, perhaps chronologically by year/month crawled, and then run all queries simultaneously (in parallel) against all those index chunks and combine the results at the end. Infinitely scalable and can be tweaked to ensure specific response times.
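
Purely as a sketch of that fan-out (toy shards, nothing to do with Google's real setup):

    # Scatter-gather over chronologically sharded index chunks: every query
    # fans out to all shards in parallel and the hits are merged at the end.
    from concurrent.futures import ThreadPoolExecutor

    SHARDS = {                                   # crawl year -> tiny index chunk
        2014: {"printer drivers": ["old forum thread"]},
        2019: {"printer drivers": ["vendor KB article"]},
        2024: {"printer drivers": ["current support page"]},
    }

    def query_shard(year, query):
        return [(year, hit) for hit in SHARDS[year].get(query, [])]

    def search(query):
        with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
            futures = [pool.submit(query_shard, year, query) for year in SHARDS]
            hits = [hit for f in futures for hit in f.result()]
        return sorted(hits, reverse=True)        # e.g. newest shards first

    print(search("printer drivers"))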

And there'd definitely be no need to set some arbitrary date cut-off; just add a few more virtual machines. I'd bet that's what Google was doing, and then scaled back those machines to save money and boost profits.


That's kind of how Google works, with multiple index tiers. Look up patents by Anna Patterson to get a few clues, assuming your lawyers won't bark at you.

Still, you can't keep partial results around forever, unless you want to make searches a lot more expensive, having to add a lot of capacity just to deal with the buffer bloat. Each query touches at least a thousand machines. Adding "a few more virtual machines" isn't going to cut it, especially if you have to handle tens of thousands of requests per second.
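
A crude picture of the tier idea (invented tiers and data, not what the patents actually describe):

    # Tiered serving: try the small, fast tier first and only fall back to
    # bigger, slower tiers when it comes up short. Each extra tier touched
    # means more machines held open and more partial state per query.
    TIERS = [
        ("hot / in-memory", {"iphone battery": ["recent support doc", "official battery page"]}),
        ("warm / flash",    {"iphone battery": ["2021 teardown"],
                             "win98 drivers":  ["archived vendor page"]}),
        ("cold / disk",     {"win98 drivers":  ["1999 usenet thread"]}),
    ]

    def search(query, want=2):
        hits = []
        for name, index in TIERS:
            hits += index.get(query, [])
            if len(hits) >= want:                # stop as soon as we have enough
                break
        return hits

    print(search("iphone battery"))   # served from the hot tier alone
    print(search("win98 drivers"))    # has to reach into the warm and cold tiers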


Progressive loading is a thing.


Yeah, it seems a natural consequence of the combination of vast amounts of recent content with the fact that people mostly want recent content. To pick one trivial example from yesterday, if I'm looking for help with an interface issue with some current version of a program, forum posts from 10 years ago are probably not useful.

Information that people regularly access for whatever reason will tend to remain relatively visible. But, yeah, relatively obscure older content is just going to get drowned out unless you know exactly where and how to look. One might argue with Google's criteria around relevance. However, that older information is going to get harder and harder to find just in the natural course of things.



