Do you use it over many machines (RAMs)? I don't care so much about the memory an...

cpcloud · on Aug 29, 2024

Polars' support for SQL is pretty nascent and missing a lot of functionality.

If it were better, we'd use it internally in Ibis for the Polars backend implementation.

If you're going down the mixed SQL, DataFrame API route then Ibis is probably the best solution out there for that.

I work on Ibis, so take what I say with a grain of salt. There may yet be other libraries out there that have similar functionality.

magnio · on Aug 30, 2024

> Do you use it over many machines (RAMs)?

If you mean whether I run it in a distributed fashion à la Spark, then no. If you mean whether I test it on various machines with different RAM sizes, then yes.

> I don't care so much about the memory and CPU stuff, I mostly leave the heavy lifting to an SQL engine.

Well, I care. Both pandas and polars are, in my view, single-machine dataframe libraries, so the memory and CPU constraints are rather stringent.

My comparison is based solely on my experience: reading CSV files that are 20% to 50% the size of RAM, pandas takes (or errors out after) 2 to 10 minutes, while polars finishes in 20 seconds. Queries in pandas are almost always slower than in polars.

But reading your comment, it seems you and I have different use cases for dataframe libraries, which is fine. I mostly use them for exploratory analysis, so the SQL API is not that much of a plus to me, but the performance is.

fifilura · on Aug 31, 2024

My point is that it is still not an order-of-magnitude change. And it (probably?) introduces bugs and incompatibilities.

Many cloud providers now offer serverless SQL and Spark capacities (serverless = no setup for you). This is the order-of-magnitude change for me.

With pandas you can maybe process 10 million rows, with polars maybe 50 million. But with a distributed service maybe 100 times more?

oreilles · on Aug 29, 2024

When using Pandas appropriately, that is, with method chaining, lambda expressions (instead of intermediate assignments) and pyarrow datatypes, you also get much better speed and null-value handling.

fifilura · on Aug 30, 2024

I know.

And by now I know that very well.

Like someone-screaming-in-my-ears-know.

I am starting to think that Polars is showing all the signs of hype or a cult.

I am still not convinced, particularly since the community feels more like a marketing department than someone who wants to genuinely help.

I can do that thing you describe with SQL.
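As a rough illustration of that point, a derive, filter and aggregate step can be written as a single SQL query. The sketch below uses Python's built-in `sqlite3` module only so that it runs anywhere. The table, columns, sample rows, and the `summarize_sql` helper are hypothetical, and a serverless or distributed SQL engine would use its own connection API.

```python
import sqlite3

# Hypothetical sample rows: (city, temp_c, visits); None becomes SQL NULL.
ROWS = [
    ("Oslo", 3.5, 10),
    ("Lima", None, 7),
    ("Oslo", 4.5, None),
    ("Lima", 19.0, 5),
]


def summarize_sql(rows):
    """Derive a column, drop rows with NULL visits, and aggregate per city."""
    con = sqlite3.connect(":memory:")
    try:
        con.execute(
            "CREATE TABLE readings (city TEXT, temp_c REAL, visits INTEGER)"
        )
        con.executemany("INSERT INTO readings VALUES (?, ?, ?)", rows)
        query = """
            SELECT city,
                   AVG(temp_c * 9.0 / 5 + 32) AS mean_temp_f,  -- AVG skips NULLs
                   SUM(visits)                AS visits
            FROM readings
            WHERE visits IS NOT NULL
            GROUP BY city
            ORDER BY city
        """
        return con.execute(query).fetchall()
    finally:
        con.close()


result = summarize_sql(ROWS)
print(result)
```

NULL handling comes from the engine here: `AVG` ignores NULL inputs and the `WHERE` clause drops rows with no visit count.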