I wonder what they’re doing to combat the growth rate of their data. A 13x speedup or an 82% cost reduction is great, but it doesn’t seem significant enough compared to the growth of the business and (my assumption) the demand for new data sources and for more data in existing ones.
Like, if the current latency is ~60 minutes for 90% of updates, will it ever be better than that? Won’t it just slowly degrade until the next multi-year migration?
PS: this article was infuriating to read on iPad - it kept jumping back to the top of the page and I couldn’t figure out why.
As noted at the bottom of the design doc at https://github.com/ray-project/deltacat/blob/main/deltacat/c..., we also improved the runtime efficiency of compaction from O(n log n) to O(n). However, a lot of this also comes down to making intentional data engineering decisions to control how physical data is laid out (and retained) across files, so that reads and writes stay as localized as possible. For example, we found that grouping records by the date they were last updated was very helpful, as outlined in our 2022 Ray Summit talk: https://youtu.be/u1XqELIRabI?t=1589.
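To make the two ideas concrete, here is a minimal Python sketch. It is not DeltaCAT's actual code: the record layout (`pk`, `updated`, `value`) and the function names `compact` and `group_by_update_date` are made up for illustration. It shows hash-based deduplication that keeps the latest version of each key in one pass (expected O(n), versus O(n log n) for a sort-based approach), followed by bucketing the survivors by last-updated date so that a later update only needs to touch a few buckets.

```python
from collections import defaultdict
from datetime import date


def compact(records):
    """Keep only the newest version of each primary key.

    One pass over the input with a dict lookup per record, so the
    expected cost is O(n) rather than the O(n log n) of sorting first.
    """
    latest = {}
    for rec in records:
        key = rec["pk"]
        # Later records win ties, mimicking "last write wins".
        if key not in latest or rec["updated"] >= latest[key]["updated"]:
            latest[key] = rec
    return latest


def group_by_update_date(latest):
    """Bucket compacted records by the date they were last updated.

    Each bucket stands in for a file (or group of files); this layout
    is an assumption made for the sake of the example.
    """
    buckets = defaultdict(list)
    for rec in latest.values():
        buckets[rec["updated"]].append(rec)
    return dict(buckets)


records = [
    {"pk": 1, "updated": date(2024, 1, 1), "value": "a"},
    {"pk": 2, "updated": date(2024, 1, 1), "value": "b"},
    {"pk": 1, "updated": date(2024, 1, 2), "value": "c"},  # newer version of pk 1
]

buckets = group_by_update_date(compact(records))
```

In this toy run, the stale version of `pk` 1 is dropped and the surviving records land in two date buckets, so a subsequent update to `pk` 1 would only rewrite the bucket it currently lives in plus the bucket for its new date.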