
Do you use it over many machines (RAMs)?

I don't care so much about the memory and CPU stuff; I mostly leave the heavy lifting to an SQL engine.

The null handling seems very compelling, but I guess it comes at the cost of incompatibility with existing libraries; otherwise pandas would have implemented it as well?

I am curious about the SQL api though.



Polars' support for SQL is pretty nascent and missing a lot of functionality.

If it were better, we'd use it internally in Ibis for the Polars backend implementation.

If you're going down the mixed SQL, DataFrame API route then Ibis is probably the best solution out there for that.

I work on Ibis, so take what I say with a grain of salt. There may yet be other libraries out there that have similar functionality.


> Do you use it over many machines (RAMs)?

If you mean whether I run it in a distributed fashion à la Spark, then no. If you mean whether I test it on various machines with different RAM sizes, then yes.

> I don't care so much about the memory and CPU stuff; I mostly leave the heavy lifting to an SQL engine.

Well, I care. Both pandas and polars are, in my view, single-machine dataframe libraries, so the memory and CPU constraints are rather stringent.

My comparison is based solely on my experience: reading csv files that are 20% to 50% of the size of RAM, pandas takes 2 to 10 minutes (or errors out after that long), while polars finishes in 20 seconds. Queries in pandas are almost always slower than in polars.

But reading your comment, it seems you and I have different use cases for dataframe libraries, which is fine. I mostly use them for exploratory analysis, so the SQL api is not that much of a plus to me, but the performance is.


My point is that it is still not an order-of-magnitude change. And it (probably?) introduces bugs and incompatibilities.

Many cloud providers now offer serverless SQL and Spark capacity (serverless = no setup on your part). This is the order-of-magnitude change for me.

With pandas you can maybe process 10 million rows, with polars maybe 50 million. But with a distributed service maybe 100 times more?


When using pandas appropriately, that is, with method chaining, lambda expressions (instead of intermediate assignments) and pyarrow datatypes, you also get much better speed and null-value handling.
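To make that concrete, here is a minimal sketch of the style I mean, assuming pandas 2.0 or later. The tiny inline CSV and the column names (`city`, `temp`, `temp_k`) are made up for illustration, and the fallback to the `numpy_nullable` backend when pyarrow is not installed is my own addition so the snippet runs either way.

```python
import io

import pandas as pd

# Prefer pyarrow-backed dtypes; fall back to pandas' nullable dtypes
# if pyarrow is not installed (assumption made for this sketch).
try:
    import pyarrow  # noqa: F401
    BACKEND = "pyarrow"
except ImportError:
    BACKEND = "numpy_nullable"

# Hypothetical data: one temperature reading is missing.
CSV = "city,temp\nOslo,3\nRome,\nOslo,5\nRome,18\n"

# With a nullable backend the integer column stays integer and the
# missing value becomes <NA> instead of forcing the column to float.
raw = pd.read_csv(io.StringIO(CSV), dtype_backend=BACKEND)

# One chain, no intermediate variables: each lambda receives the
# dataframe as it exists at that step of the chain.
result = (
    raw
    .loc[lambda df: df["temp"].notna()]
    .assign(temp_k=lambda df: df["temp"] + 273)
    .groupby("city", as_index=False)
    .agg(mean_temp_k=("temp_k", "mean"))
)

print(raw.dtypes)
print(result)
```

Whether this closes the speed gap depends on the workload; the sketch only shows the style, not a benchmark.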


I know.

And by now I know that very well.

Like someone-screaming-in-my-ears-know.

I am starting to think that Polars is showing all the signs of hype or a cult.

I am still not convinced, particularly since the community feels more like a marketing department than people who genuinely want to help.

I can do that thing you describe with SQL.



