I wonder what they’re doing to combat the growth rate of their data. A 13x speed...

thedood · on Aug 1, 2024

As noted at the bottom of the design doc at https://github.com/ray-project/deltacat/blob/main/deltacat/c..., we also improved the runtime efficiency of compaction from O(n log n) to O(n). However, a lot of this also comes down to making intentional data engineering decisions about how physical data is laid out (and retained) across files, so that reads and writes stay as localized as possible. For example, we found grouping records by the date they were last updated to be very helpful, as outlined in our 2022 Ray Summit talk: https://youtu.be/u1XqELIRabI?t=1589.
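
To illustrate the idea of grouping by last-updated date, here is a minimal Python sketch. It is not DeltaCAT code: the record shape and the helpers `group_by_last_updated` and `buckets_touched` are hypothetical, and each in-memory bucket just stands in for a physical file. The point is only that when records are bucketed this way, an update to recently changed keys touches a small number of buckets rather than the whole table.

```python
from collections import defaultdict
from datetime import date


def group_by_last_updated(records):
    """Bucket records by last-updated date (one bucket ~ one file, by assumption)."""
    buckets = defaultdict(list)
    for rec in records:
        buckets[rec["last_updated"]].append(rec)
    return dict(buckets)


def buckets_touched(buckets, updated_keys):
    """Return the sorted dates of buckets that hold any of the updated keys."""
    return sorted(
        day
        for day, recs in buckets.items()
        if any(rec["pk"] in updated_keys for rec in recs)
    )


# Hypothetical records: two old rows and two recently updated rows.
records = [
    {"pk": 1, "last_updated": date(2024, 7, 1)},
    {"pk": 2, "last_updated": date(2024, 7, 1)},
    {"pk": 3, "last_updated": date(2024, 7, 30)},
    {"pk": 4, "last_updated": date(2024, 7, 31)},
]

buckets = group_by_last_updated(records)

# An update to keys 3 and 4 only needs the two recent buckets;
# the 2024-07-01 bucket is left alone.
touched = buckets_touched(buckets, {3, 4})
```

In a real table the lookup from key to bucket would come from an index or metadata rather than a scan, which this sketch does not model.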

e28eta · on Aug 1, 2024

That’s a nice durable improvement! Thanks