What is the virtual thread / event loop pattern seeking to optimize? Is it context switching?
A number of years ago I remember trying to have a sane discussion about “non blocking” and I remember saying “something” will block eventually no matter what… anything from the buffer being full on the NIC to your cpu being at anything less than 100%. Does it shake out to any real advantage?
It's a brave attempt to release the programmer from worrying or even thinking about thread pools and blocking code. Java has gone all in - they even cancelled a non-blocking rewrite of their database driver architecture because why have that if you won't have to worry about blocking code? And the JVM really is a marvel of engineering, it's really really good at what it does, so what team to better pull this off?
So far, they're not quite there yet: the issue of "thread pinning" is something developers still have to be aware of. I hear the newest JVM version has removed a few more cases where it happens, but will we ever truly 100% not have to care about all that anymore?
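For anyone wondering what "being aware of pinning" looks like in code, here is a minimal sketch, assuming Java 21 or later. The class `PinningSketch` and the `slowCall` helper are made up for illustration. The usual advice was to guard blocking sections with a `ReentrantLock` instead of `synchronized`; as far as I know, JDK 24 (JEP 491) removed the `synchronized` case, so treat this as relevant mainly to older JDKs.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.locks.ReentrantLock;

class PinningSketch {
    private static final ReentrantLock LOCK = new ReentrantLock();
    private static int counter = 0; // guarded by LOCK

    // Hypothetical stand-in for a blocking call (JDBC query, HTTP request, ...).
    static void slowCall() {
        try {
            Thread.sleep(5);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Blocking while holding a ReentrantLock lets the virtual thread unmount.
    // On JDK 21, the same body inside a synchronized block would pin the carrier thread.
    static void guardedWork() {
        LOCK.lock();
        try {
            slowCall();
            counter++;
        } finally {
            LOCK.unlock();
        }
    }

    // Runs `tasks` guarded tasks on virtual threads and returns the final counter.
    static int run(int tasks) {
        LOCK.lock();
        try {
            counter = 0;
        } finally {
            LOCK.unlock();
        }
        try (var executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < tasks; i++) {
                executor.execute(PinningSketch::guardedWork);
            }
        } // close() waits for all submitted tasks to finish
        LOCK.lock();
        try {
            return counter;
        } finally {
            LOCK.unlock();
        }
    }

    public static void main(String[] args) {
        System.out.println("completed: " + run(20));
    }
}
```

Running with `-Djdk.tracePinnedThreads=full` on JDK 21 was the usual way to find the remaining pinning spots.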
I have to say things are already pretty awesome however. If you avoid the few thread pinning causes (and can avoid libraries that use them - although most if not all modern libraries have already adapted), you can write really clean code. We had to rewrite an old app that made a huge mess tracking a process where multiple event sources can act independently, and virtual threads seemed the perfect thing for it. Now our business logic looks more like a game loop and not the complicated mix of pollers, request handlers, intermediate state persisters (with their endless thirst for various mappers) and whatnot that it was before (granted, all those things weren't there just because of threading... the previous version was really really shittily written).
It's true that virtual threads sometimes hurt performance (since their main benefit is cleaner simpler code). Not by much, usually, but a precisely written and carefully tuned piece of performance critical code can often still do things better than automatic threading code. And as a fun aside, some very popular libraries assumed the developer is using thread pools (before virtual threads, which non trivial Java app didn't? - ok nobody answer that, I'm sure there are cases :D) so these libraries had performance tricks (ab)using thread pool code specifics. So that's another possible performance issue with virtual threads - like always with performance of course: don't just assume, try it and measure! :P
Just a side note, async JDBC was a thing way before Loom came about, and it failed miserably. I'm not sure why, but my guess would be that most enterprise software is not web-scale, so JDBC worked well as it was.
Also, all the database vendors provided their drivers implementing the JDBC API - good luck getting Oracle or IBM to contribute to R2DBC... (Actually, I stand corrected: there is an Oracle R2DBC driver now - it was released fairly recently though.)
EDIT: "failed miserably" is maybe too strong - but R2DBC certainly doesn't have the support and acceptance of JDBC.
R2DBC lets you efficiently maintain millions of connections to the database. But what database supports millions of connections? Not postgres for sure, and probably no other conventional database. So using a reactive JDBC driver makes little sense: if you're going to use 1000 connections, 1000 threads will do just fine and bring little overhead. Those who use Java don't care about spending 100 more MB of RAM when their service already eats 60GB.
Reactive drivers were not about 1000 connections, they were about reusing a single connection better, by queuing a little bit more efficiently over a single connection. Reactive programming is not about parallelism, it’s about concurrency.
It is not possible to reuse a single connection better, if we're talking about postgres. You must conduct the transaction over a single connection and you cannot mix different transactions simultaneously over a single connection. That's the way postgres wire protocol works. I think that there's some rudimentary async capabilities, but they don't change anything fundamentally.
It might be different for some exotic databases, but I don't see any reason why an ordinary JDBC driver couldn't reuse a single TCP connection for multiple logical JDBC connections in this case.
It could also be that there just isn’t enough demand for a non-blocking JDBC. For example, the Postgresql server does not cope very well with lots of simultaneous connections, due to (among other things) its process-per-connection model. From the client side (JDBC), a small thread pool would be enough to max out the Postgresql server. And there is almost no benefit to using non-blocking vs a small thread pool.
I would argue the main benefit would be that the threadpool that the developer would create anyway would instead be created by the async database driver, which has more intimate knowledge about the server's capabilities. Maybe it knows the limits to the number of connections, or can do other smart optimizations. In any case, for the developer it would be a more streamlined experience, with less code needed, and better defaults.
I think we’re confusing async and non-blocking? Non-blocking is the part that makes virtual threads more efficient than threads. Async is the programming style; e.g. do things concurrently. Async can be implemented with threads or non-blocking, if the API supports it. I was merely arguing that a non-blocking JDBC has little merit as the connections to a DB are limited. Non-blocking APIs are only beneficial when there are lots of connections (> 10k).
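To make the distinction concrete, here is a rough sketch of "async implemented with threads": the caller gets futures (async style), but underneath a small pool of platform threads does plain blocking calls. `AsyncOverThreads` and `blockingQuery` are invented names standing in for a real JDBC call, not any driver's API.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class AsyncOverThreads {
    // Hypothetical blocking lookup standing in for a JDBC query.
    static String blockingQuery(int id) {
        try {
            Thread.sleep(20);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "row-" + id;
    }

    // Async style on top of blocking calls: the caller works with futures,
    // while a small pool of platform threads does the actual blocking.
    static List<String> queryAll(List<Integer> ids, int poolSize) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            List<CompletableFuture<String>> futures = ids.stream()
                    .map(id -> CompletableFuture.supplyAsync(() -> blockingQuery(id), pool))
                    .toList();
            // join() waits for each future; results keep the order of `ids`.
            return futures.stream().map(CompletableFuture::join).toList();
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(queryAll(List.of(1, 2, 3, 4), 2));
    }
}
```

With a database that tops out at a few hundred connections, a pool like this is already enough to saturate it, which is the point being made above.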
JDBC knows nothing about the number of connections a server can handle, other than to try opening connections until it won’t connect any more.
> In any case, for the developer it would be a more streamlined experience, with less code needed, and better defaults.
I agree it would be best not to bother the dev with what is going on under the hood.
The short answer is that blocking is expensive due to the overhead of the implied context switch and poor locality. As computers become faster, a larger percentage of the CPU time is dedicated to context-switching overhead and non-blocking architectures eliminate that. For applications like databases where this problem is more severe, the difference in throughput between a blocking architecture and a non-blocking architecture can be 10x on the same hardware, so it is a very important optimization if you want your software to have performance that is competitive.
A modern thread-per-core shared-nothing architecture takes this even further and tries to eliminate blocking at the hardware level for the same basic reason.
So... What is it seeking to optimize? Why did you need a thread pool before but not any more? What resource was exhausted to prevent you from putting every request on a thread?
The goal is to maximize the number of tasks you can run concurrently, while imposing on the developers a low cognitive load to write and maintain the code.
> Why did you need a thread pool before but not any more?
You still need a thread pool. Except with virtual threads you are no longer bound to run a single task per thread. This is especially desirable when workloads are IO-bound and are expected to idle while waiting for external events. If you have a never-ending queue of tasks waiting to run, why should you block a thread consuming that task queue by running a task that stays idle while waiting for something to happen? You're better off starting the task and setting it aside the moment it starts waiting for something to happen.
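A minimal sketch of that idea, assuming Java 21 or later (`IdleTasksSketch` is just an illustrative name): thousands of tasks that mostly idle, one virtual thread each, all multiplexed over the JVM's small pool of carrier threads.

```java
import java.time.Duration;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

class IdleTasksSketch {
    // Runs `tasks` tasks that each idle for `idleMillis`, one virtual thread per task,
    // and returns how many of them finished.
    static int runIdleTasks(int tasks, long idleMillis) {
        AtomicInteger finished = new AtomicInteger();
        try (var executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < tasks; i++) {
                executor.execute(() -> {
                    try {
                        // While sleeping, the virtual thread is unmounted and
                        // its carrier thread is free to run other tasks.
                        Thread.sleep(Duration.ofMillis(idleMillis));
                        finished.incrementAndGet();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
        } // close() waits for all tasks to complete
        return finished.get();
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        int done = runIdleTasks(10_000, 100);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(done + " tasks finished in " + elapsedMs + " ms");
    }
}
```

If each task held a platform thread for its whole sleep, you would need either 10,000 OS threads or a much longer wall-clock time; measure on your own machine rather than trusting my word for the numbers.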
> What resource was exhausted to prevent you from putting every request on a thread?
if creating gazillion threads on modern hardware is super cheap why not? I have transparency and debuggability what threads are running, can check stacktrace of each and what are they blocked on.
virtual threads add lots of magic under the hood, and if there is some bug, or a lib in your infra with no vthreads support, it is absolutely not clear how to debug it.
> if creating gazillion threads on modern hardware is super cheap why not?
Virtual threads are a performance improvement over threads, no matter how cheap threads are to create. Virtual threads run on threads. If threads become cheaper to create, so do virtual threads. They are not mutually exclusive.
Virtual threads are on top of that a developer experience improvement. Code is easier to write and maintain.
Virtual threads improve throughput because the moment a task is waiting for anything like IO, the thread is able to service any other task in the queue.
> Virtual threads are on top of that a developer experience improvement. Code is easier to write and maintain.
except now you need to prove somehow that all 100 libs in your project support virtual threads.
> Virtual threads improve throughput because the moment a task is waiting for anything like IO, the thread is able to service any other task in the queue.
from reading similar discussions, linux for example doesn't have a true async IO API, so you just push the blocking of the Java thread down to the blocking of a thread in the kernel
And my frustration is that java had that API for 20 years, it is used everywhere and absolutely battle tested, and now they are adding those virtual threads which break third party libs and make JVM more complicated with various degradations in exchange of benefits most will not notice..
It's mainly trying to make you not worry about how many threads you create (and not worry about the caveats that come with optimising how many threads you create, which is something you are very often forced to do).
You can create a thread in your code and not worry whether that thing will then be some day run in a huge loop or receive thousands of requests and therefore spend all your memory on thread overhead. Go and other languages (in Java's ecosystem there's Kotlin for example) employ similar mechanisms to avoid native thread overhead, but you have to think about them. Like, there's tutorial code where everything is nice & simple, and then there's real world code where a lot of it must run in these special constructs that may have little to do with what you saw in those first "Hello, world" samples.
Java's approach tries to erase the difference between virtual and real threads. The programmer should have to employ no special techniques when using virtual threads and should be able to use everything the language has to offer (this isn't true in many languages' virtual/green threads implementations). Old libraries should continue working and perhaps not even be aware they're being run on virtual threads (although, caveats do apply for low level/high performance stuff, see above posts). And libraries that you interact with don't have to care what "model" of green threading you're using or specifically expose "red" and "blue" functions.
You will still have to worry: too many virtual threads will imply too much context switching. However, virtual threads will always be interruptible on I/O, as they are not mapped to actual OS threads, but rather simulated by the JVM, which will execute a number of instructions for each virtual thread.
This gives the chance to the JVM to use real threads more efficiently, avoiding that threads remain unused while waiting on I/O (e.g. a response from a stream). As soon as the JVM detects that a physical thread is blocked on I/O, a semaphore, a lock or anything, it will reallocate that physical thread to running a new virtual thread. This will reduce latency, context switch time (the switching is done by the JVM that already globally manages the memory of the Java process in its heap) and will avoid or at least largely reduce the chance that a real thread remains allocated but idle as it's blocked on I/O or something else.
My understanding is that virtual threads mostly eliminate context switching - for N CPUs JVM creates N platform threads and they run virtual threads as needed. There is no real context switching apart from GC and other JVM internal threads.
A platform thread picking another virtual thread to run after its current virtual thread is blocked on IO is not a context switch in the usual sense, which is an expensive OS-level operation.
The JVM will need to do context switching when reallocating the real thread that is running a blocked virtual thread to the next available virtual thread. It won't be CPU context switching, but context switching happens at the JVM level and represents an effort.
Ok. This JVM-level switching is called mounting/un-mounting of the virtual thread and is supposed to be several orders of magnitude cheaper compared to normal context switch. You should be fine with millions of virtual threads.
Does Java's implementation of virtual threads perform any kind of work stealing when a particular physical thread has no virtual threads to run (e.g. they are all blocked on I/O)?
It's been a really long time since I dealt with anything this low level, but in my very limited and ancient experience when people talk about context switching they're talking specifically about the userland process yielding execution back to the kernel so that the processor can be reassigned to a different process/thread. Naively, if the JVM isn't actually yielding control back to the kernel, it has the freedom to do things in a much more lightweight manner than the kernel would have to.
So I think it's meaningful to define what we mean by context switch here.
When a real thread is allocated from a virtual thread to another, the JVM needs to save into the heap the stack of the first virtual thread and restore from the heap the stack of the second virtual thread, see slide 13 of [1]. This is in fact called mounting/unmounting as already pointed out, and occurs via Java Continuation, but from the JVM perspective this is a context switch. It's called JVM and the V stands for Virtual, so yes, it's not the kernel doing it, but it's happening, and it's more frequent the more virtual threads you have in your application.
It seems that the answer to the question was "memory". Stack allocations, presumably. You have answered by telling us that virtual threads are better than real threads because real threads suck, but you didn't say why they suck or why virtual threads don't suck in the same way.
Real threads don't suck but they pay a price for generality. The kernel doesn't know what software you're going to run, and there's no standards for how that software might use the stack. So the kernel can't optimize by making any assumptions.
Virtual threads are less general than kernel threads. If you use a virtual thread to call out of the JVM you lose their benefits, because the JVM becomes like the kernel and can't make any assumptions about the stack.
But if you are running code controlled by the JVM, then it becomes possible to do optimizations (mostly stack related) that otherwise can't be done, because the GC and the compiler and the threads runtime are all developed together and work together.
Specifically, what HotSpot can do is move stack frames to and from the heap very fast, which interacts better with the GC. For instance if a virtual thread resumes, iterates in a loop and suspends again, then the stack frames are never copied out of the heap onto the kernel stack at all. HotSpot can incrementally "page" stack frames out of the heap. Additionally, the storage space used for a suspended virtual thread stack is a lot smaller than a suspended kernel stack because a lot of administrative goop doesn't need to be saved at all.
OS Threads do not suck, they're great. But they are expensive to create as they require a syscall, and they're expensive to maintain as they consume quite a bit of memory just to exist, even if you don't need it (due to how they must pre-allocate a stack which apparently is around 2MB initially, and can't be made smaller as in most cases you will need even more, so it would make most cases worse).
Virtual Threads are very fast to create and allocate only the memory needed by the actual call stack, which can be much less than for OS Threads.
Also, blocking code is very simple compared to the equivalent async code. So using blocking code makes your code much easier to follow. Check out examples of reactive frameworks for Java and you will quickly understand why.
> and they're expensive to maintain as they consume quite a bit of memory just to exist, even if you don't need it (due to how they must pre-allocate a stack which apparently is around 2MB initially,
I'm not familiar with Windows, but this certainly isn't the case on Linux. It only costs 2MB-8MB of virtual address space, not actual physical memory. And there's no particular reason to believe the JVM can have a list of threads and their states more efficiently than the kernel can.
All you really save is the syscall to create it and some context switching costs as the JVM doesn't need to deal with saving/restoring registers as there's no preemption.
The downside though is you don't have any preemption, which depending on your usage is a really fucking massive downside.
> And there's no particular reason to believe the JVM can have a list of threads and their states more efficiently than the kernel can.
Of course there is. The JVM is able to store the current stack for the Thread efficiently in the pre-allocated heap. Switching execution between Virtual Threads is very cheap. Experiments show you can have millions of VTs, but only a few thousand OS Threads.
I don't know why you think the lack of preemption is a big downside?! The JVM only suspends a Thread at safe points and those are points where it knows exactly when to resume. I don't believe there's any downsides at all.
Briefly: The cost of spawning schedulable entities, memory and the time to execution. Virtual threads, i.e., fibers, have lightweight stacks. You can spawn as many as you like immediately. Your runtime system won’t go out of memory as easily. In addition, the spawning happens much faster in user space. You’re not creating kernel threads, which are a limited and not cheap resource, hence the pooling you’re comparing it to. With virtual threads you can do thread per request explicitly. It makes most sense for IO-bound tasks.
A thread per request has a high risk of overcommitting on CPU use, leading to a different set of problems. Virtual threads are scheduled on a fixed-size (based on number of cores) underlying (non-virtual) thread pool to avoid this problem.
Why can't virtual threads overcommit CPU use? If I have 4 CPUs and 4000 virtual threads running CPU-bound code, is that not overcommit? A system without overcommit would refuse to create the 5th thread.
Real threads are extremely expensive both in terms of memory and CPU time compared to virtual threads. I think the main issue is not even that but context switching when switching threads which is also very expensive.
Virtual threads usually require significantly fewer resources to spawn and run. And, if the underlying system is implemented with them in mind, they can use fewer context switches, and possibly even fewer cache misses etc.
To put it shortly: Writing single-threaded blocking code is far easier for most people and has many other benefits, like more understandable and readable programs: https://www.youtube.com/watch?v=449j7oKQVkc
The main reason why non-blocking IO with its style of intertwining concurrency and algorithms came along is that starting a thread for every request was too expensive. With virtual threads that problem is eliminated so we can go back to writing blocking code.
I’d say that writing single-threaded code is far easier for _all_ people, even async code experts :)
Also, single-threaded code is supported by programming language facilities: you have a proper call stack, thread-local vars, exceptions bubbling up, structured concurrency, simple resource management (RAII, try-with-resources, defer). Easy to reason and debug on language level.
Async runtimes are always complicated, filled with leaky abstractions, it’s like another language that one has to learn in addition, but with a less thought-out, ad-hoc design. Difficult to reason and debug, especially in edge cases
> Async runtimes are always complicated, filled with leaky abstractions, it’s like another language that one has to learn in addition, but with a less thought-out, ad-hoc design. Difficult to reason and debug, especially in edge cases
Async runtimes themselves are simply attempts to bolt-on green threads on top of a language that doesn't support them on a language level. In JavaScript, async/await uses Promises to enable callback-code to interact with key language features like try/catch, for/while/break, return, etc. In Python, async/await is just syntax sugar for coroutines, which are again just syntax sugar for CPS-style classes with methods split at each "yield". Not sure about Rust, but it probably also uses some Rust macro magic to do something similar.
Indeed. Async runtimes/styles are attempts to provide a more readable/usable syntax for CPS[1]. CPS originally had nothing to do with blocking/non-blocking or multi-threading but arose as a technique to structure compiler code.
Its attraction for non-blocking coding is that it allows hiding the multi-threaded event dispatching loop. But as the parent comment suggests, this abstraction is extremely leaky. And in addition, CPS in non-functional languages or without syntactic sugar has poor readability. Improving the readability requires compiler changes in the host language - so many languages have added compiler support to further hide the CPS underpinnings of their async model.
I've always felt this was a big mistake in our industry - all this effort not only in compilers but also in debuggers/IDE - building on a leaky abstraction. Adding more layers of leaky abstractions has only made the issue worse. Async code, at first glance, looks simple but is a minefield for inexperienced/non-professional software engineers.
It's annoying that Rust switched to async style - the abstraction leakiness immediately hits you, as the "hidden event dispatching loop" remains a real dependency even if it's not explicit in the code. Thus libraries using async cannot generally be used together, although last time I looked, tokio seems to have become the de-facto standard.
I absolutely agree that the virtual/green thread style is much better, more ergonomic, more likely to be correct, etc, but I can’t fault Rust’s choice, given it being a low-level language without a fat runtime, making it possible to be called into from other runtimes. What the JVM does is simply not possible that way.
>Async runtimes themselves are simply attempts to bolt-on green threads on top of a language that doesn't support them on a language level.
Haskell supports async code while also supporting green threads on a language level, and the async code has most of the same issues as async code in any other languages.
What problems exactly? Haskell has a few things that imo it does better than most languages in this area:
- All IO is non-blocking by default.
- FFI support for interruptible foreign calls.
- Haskell threads can be preempted externally - this allows you to ensure they never leak. Vs a goroutine that can just spin forever if it doesn't explicitly yield.
- There are various stdlib abstractions for building concurrent programs in a compositional way.
> Haskell threads can be preempted externally - this allows you to ensure they never leak. Vs a goroutine that can just spin forever if it doesn't explicitly yield.
Goroutines are preemptible by the runtime (since https://go.dev/doc/go1.14#runtime) but they're still not addressable or killable through the language itself.
>I’d say that writing single-threaded code is far easier for _all_ people, even async code experts :)
While 'async' is just a name, underneath it's epoll - and the virtual threads would not perform better than a proper NIO (epoll) server. I don't consider myself an 'async expert' but I have my share of writing NIO code (dare I say, not terrible at all).
> To put it shortly: Writing single-threaded blocking code is far easier for most people and has many other benefits, like more understandable and readable programs:
I think you're missing the whole point.
The reason why so many smart people invest their time on "virtual threads" is developer experience. The goal is to turn writing event-driven concurrent code into something that's as easy as writing single-threaded blocking code.
Check why C#'s async/await implementation is such a huge success and replaced all past approaches overnight. Check why node.js is such a huge success. Check why Rust's async support is such a hot mess. It's all about developer experience.
As someone who has written multiple production services with Async Rust that are under constant load, I disagree. I've had team members who have only written in C pick up and start building very comprehensive and performant services in Rust in a matter of days.
How do you developers spew such strong opinions without taking a moment to think about what you're about to say. Rust cannot be directly compared to C#, Java or even Go.
You don't get a runtime or a GC with rust. The developer experience is excellent, you get a lot of control over everything you're building with it. Yes it's not as magical as languages and runtimes like you've mentioned, but the fact that I can at anytime rip those abstractions off and make my service extremely lightweight and performant is not something those languages will allow you to do.
And this is coming from someone who's written non blocking services before Async rust was a thing with just MIO.
The very fact Rust gets mentioned between these languages should be a tribute to the efforts of its maintainers and core team. The amount of tooling and features they've added into the language gives developers of every realm liberty to try and build what they want.
Honestly, you can hold whatever opinion you want on any language but your comparison really doesn't make sense.
> To put it shortly: Writing single-threaded blocking code is far easier for most people. [snip] With virtual threads that problem is eliminated so we can go back to writing blocking code.
This is the core misunderstanding/dishonesty behind the Loom/Virtual Threads hype. Single-threaded blocking code is easy, yes. But that ease comes from being single-threaded, not from not having to await a few Futures.
But Loom doesn't magically solve the threading problem. It hides the Futures, but that just means that you're now writing a multi-threaded program, without the guardrails that modern Future-aware APIs provide. It's the worst of all worlds. It's the scenario that gave multi-threading such a bad reputation for inscrutable failures in the first place.
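A small sketch of what that means in practice, assuming Java 21 or later (`SharedStateSketch` is an illustrative name): the code reads like single-threaded blocking code, but shared mutable state is still accessed from many threads at once and still needs the usual care.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

class SharedStateSketch {
    static int plainCounter = 0; // deliberately unsynchronized

    // Returns {plainCounter, atomicCounter} after `tasks` concurrent increments.
    static int[] count(int tasks) {
        plainCounter = 0;
        AtomicInteger atomicCounter = new AtomicInteger();
        try (var executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < tasks; i++) {
                executor.execute(() -> {
                    plainCounter++;                  // data race: increments may be lost
                    atomicCounter.incrementAndGet(); // atomic: never loses an increment
                });
            }
        } // close() waits for all tasks to complete
        return new int[] { plainCounter, atomicCounter.get() };
    }

    public static void main(String[] args) {
        int[] result = count(100_000);
        System.out.println("plain=" + result[0] + " atomic=" + result[1]);
    }
}
```

The plain counter may or may not come out short on any given run, which is exactly the kind of intermittent failure being described; the atomic one is always exact.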
> What is the virtual thread / event loop pattern seeking to optimize? Is it context switching?
Throughput.
Some workloads are not CPU-bound or memory-bound, and spend the bulk of their time waiting for external processes to make data available.
If your workloads are expected to stay idle while waiting for external events, you can switch to other tasks while you wait for those external events to trigger.
This is particularly convenient if the other tasks you're hoping to run are also tasks that are bound to stay idle while waiting for external events.
One of the textbook scenarios that suits this pattern well is making HTTP requests. Another one is request handlers, such as the controller pattern used so often in HTTP servers.
Perhaps the poster child of this pattern is Node.js. It might not be the performance king and might be single-threaded, but it features in the top spots in performance benchmarks such as TechEmpower's. Node.js is also highly favoured in function-as-a-service applications, as its event-driven architecture is well suited for applications involving a hefty dose of network calls running on memory- and CPU-constrained systems.
One of the main reasons to do virtual threads is that it allows you to write naive "thread per request" code and still scale up significantly without hitting the kind of scaling limits you would with OS threads.
The problem with the naïve design is that even with virtual threads, you risk running out of (heap) memory if the threads ever block. Each task makes a bit of progress, allocates some objects, and then lets another one do the same thing.
With virtual threads, you can limit the damage by using a semaphore, but you still need to tune the size. This isn't much different than sizing a traditional thread pool, and so I'm not sure what benefit virtual threads will really have in practice. You're swapping one config for another.
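For reference, the semaphore approach looks roughly like this (a sketch assuming Java 21 or later; `BoundedConcurrencySketch` and the permit count are illustrative, and the permit count is the knob you end up tuning):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

class BoundedConcurrencySketch {
    // Starts one virtual thread per task but lets at most `permits` of them into the
    // guarded section at a time. Returns the highest concurrency observed there.
    static int maxInFlight(int tasks, int permits) {
        Semaphore limit = new Semaphore(permits);
        AtomicInteger inFlight = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        try (var executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < tasks; i++) {
                executor.execute(() -> {
                    try {
                        limit.acquire(); // cheap to wait here: only the virtual thread parks
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                    try {
                        int now = inFlight.incrementAndGet();
                        peak.accumulateAndGet(now, Math::max);
                        Thread.sleep(10); // stand-in for the memory-hungry blocking work
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    } finally {
                        inFlight.decrementAndGet();
                        limit.release();
                    }
                });
            }
        } // close() waits for all tasks to complete
        return peak.get();
    }

    public static void main(String[] args) {
        System.out.println("peak concurrency: " + maxInFlight(200, 8));
    }
}
```

One difference from a fixed thread pool is that the limit can wrap just the expensive section rather than the whole task, but the sizing question itself does not go away.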
> The problem with the naïve design is that even with virtual threads, you risk running out of (heap) memory if the threads ever block.
The key with virtual threads is they are so lightweight that you can have thousands of them running concurrently: even when they block for I/O, it doesn't matter. It's similar to lightweight coroutines in other languages like Go or Kotlin.
What you are complaining about has nothing to do with thread pools or virtual threads. You're pointing out the fact that more parallelism will also need more hardware and that a finite hardware budget will need a back pressure strategy to keep resource consumption within a limit. While you might be correct that "sizing a traditional thread pool" is a back pressure strategy that can be applied to virtual threads, the problem with it is that IO bound threads will prevent CPU bound threads from making progress. You don't want to apply back pressure based on the number of tasks. You want back pressure to be in response to resource utilization, so that enough tasks get scheduled to max out the hardware.
This is a common problem with people using Java parallel streams, because they by default share a single global thread pool and the way to use your own thread pool is also extremely counterintuitive, because it essentially relies on some implicit thread local magic to choose to distribute the stream in the thread pool that the parallel stream was launched on, instead of passing it as a parameter.
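For anyone who hasn't seen the trick, this is roughly what it looks like. It is a sketch of the commonly used workaround, which relies on implementation behaviour rather than a documented contract, so treat it as an assumption; `ParallelStreamPoolSketch` is an illustrative name.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ForkJoinPool;
import java.util.stream.IntStream;

class ParallelStreamPoolSketch {
    // Sums the squares of 1..n with a parallel stream that runs in a private pool
    // instead of the shared common pool.
    static long sumOfSquares(int n, int parallelism) {
        ForkJoinPool pool = new ForkJoinPool(parallelism);
        try {
            // The pool is never passed to the stream. It is picked up implicitly because
            // the terminal operation is invoked from one of the pool's worker threads.
            Callable<Long> task = () ->
                    IntStream.rangeClosed(1, n).parallel().mapToLong(i -> (long) i * i).sum();
            return pool.submit(task).get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(sumOfSquares(1000, 4));
    }
}
```

Nothing in the stream pipeline itself says which pool it will use, which is why people get bitten by this.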
It would be best if people came up with more dynamic back pressure strategies, because this is a more general problem that goes way beyond thread pools. In fact, one of the key problems of automatic parallelization is deciding at what point there is too much parallelization.
But that same benefit was always available with platform threads -- a simple API. What is the real gain by using virtual threads? It's either going to be performance or memory utilization.
It's combining the benefits from async models (state machines separated from os threads, thus more optimal for I/O bound workload), with the benefits from proper threading models (namely the simpler human interface).
Memory utilization & performance is going to be similar to the async callback mess.
Why is an async model better than using OS threads for an I/O bound workload? The OS is doing async stuff internally and shielding the complexity with threads. With virtual threads this work has shifted to the JVM. Can the JVM do threads better than the OS?
"Why is an async model better than using OS threads for an I/O bound workload?"
Because evented/callback-driven code is a nightmare to reason about and breaks lots of very basic tools, like the humble stack trace.
Another big thing for me is resource management - try/finally don't work across callback boundaries, but do work within a virtual thread. I recently ported a netty-based evented system to virtual threads and a very long-standing issue - resource leakage - turned into one very nice try/finally block.
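Not the actual code from that port, just a minimal sketch of the shape it takes, assuming Java 21 or later. `ResourceScopeSketch` and its `Resource` class are hypothetical stand-ins for a connection or buffer. The blocking calls sit inside one try-with-resources scope, so the resource is released on both the normal and the failure path.

```java
import java.util.concurrent.atomic.AtomicInteger;

class ResourceScopeSketch {
    static final AtomicInteger OPEN = new AtomicInteger(); // resources currently open

    // Hypothetical resource standing in for a connection, buffer, file handle, ...
    static final class Resource implements AutoCloseable {
        Resource() { OPEN.incrementAndGet(); }
        @Override public void close() { OPEN.decrementAndGet(); }
    }

    // Blocking steps inside a single lexical scope; close() runs however we leave it.
    static void handle(boolean fail) throws InterruptedException {
        try (Resource resource = new Resource()) {
            Thread.sleep(10); // first blocking step
            if (fail) {
                throw new IllegalStateException("simulated failure");
            }
            Thread.sleep(10); // second blocking step
        }
    }

    // Runs `tasks` handlers on virtual threads (every other one fails) and
    // returns how many resources are still open afterwards.
    static int openAfter(int tasks) {
        Thread[] threads = new Thread[tasks];
        for (int i = 0; i < tasks; i++) {
            boolean fail = (i % 2 == 0);
            threads[i] = Thread.ofVirtual().start(() -> {
                try {
                    handle(fail);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } catch (IllegalStateException expected) {
                    // simulated failure; the resource was still closed
                }
            });
        }
        for (Thread thread : threads) {
            try {
                thread.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return OPEN.get();
    }

    public static void main(String[] args) {
        System.out.println("still open: " + openAfter(100));
    }
}
```

In callback style, the open and the close end up in different callbacks, and every error path has to remember to release the resource by hand.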
Yes. The JVM has far more opportunities for optimizing threads because it doesn't need to uphold 50 years of accumulated invariants and compatibility that current OSes do, and the JVM has more visibility into the application internals.
it can do a much better job because there isn't a security boundary. OS thread scheduling requires syscalls and invalidates a bunch of caches to prevent timing leaks
Throughput. The code can be "suspended" on a blocking call (I/O, where the platform thread usually is wasted, as the CPU has nothing to do during this time). So, the platform thread can do other work in the meantime.
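A small sketch of what that looks like, assuming Java 21+ (the class and method names are made up). Every task blocks in `Thread.sleep`, which stands in for an I/O wait; while a task is blocked, its carrier platform thread is free to run other virtual threads.

```java
import java.time.Duration;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

final class BlockingTasksDemo {
    // Starts one virtual thread per task, each of which blocks for `blockFor`,
    // and returns how many tasks completed.
    static int runBlockingTasks(int taskCount, Duration blockFor) {
        AtomicInteger completed = new AtomicInteger();
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < taskCount; i++) {
                executor.submit(() -> {
                    Thread.sleep(blockFor); // stand-in for a blocking I/O call
                    return completed.incrementAndGet();
                });
            }
        } // close() waits for all submitted tasks to finish
        return completed.get();
    }
}
```

Something like `runBlockingTasks(10_000, Duration.ofMillis(100))` should take far less than the 1000 seconds the sleeps add up to, since the waits overlap, but measure it on your own machine rather than trusting that.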
Yeah, and it's generally good to be RAM limited instead of CPU, no? The alternative is blowing a bunch of time on syscalls and OS scheduler overhead.
Also the virtual threads run on a "traditional" thread pool to my understanding, so you can just tweak the number of worker threads to cap the total concurrency.
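One caveat, as I understand it: the carrier pool size limits how many virtual threads are running on a CPU at once, not how many are parked in a blocking call. If the goal is to cap how many tasks hit some resource at the same time, a plain `Semaphore` is one common approach. A sketch assuming Java 21+, with hypothetical names:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

final class BoundedConcurrencyDemo {
    // Runs `taskCount` virtual threads but lets at most `limit` of them be
    // inside the guarded section at once. Returns the highest number of
    // tasks observed inside that section at the same time.
    static int maxObservedConcurrency(int taskCount, int limit) {
        Semaphore permits = new Semaphore(limit);
        AtomicInteger inFlight = new AtomicInteger();
        AtomicInteger maxInFlight = new AtomicInteger();
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < taskCount; i++) {
                executor.submit(() -> {
                    permits.acquire(); // blocks only this virtual thread
                    try {
                        int now = inFlight.incrementAndGet();
                        maxInFlight.accumulateAndGet(now, Math::max);
                        Thread.sleep(5); // stand-in for the guarded I/O call
                    } finally {
                        inFlight.decrementAndGet();
                        permits.release();
                    }
                    return null;
                });
            }
        }
        return maxInFlight.get();
    }
}
```

Waiting on the semaphore is cheap here because a blocked virtual thread doesn't hold a platform thread while it waits.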
The benefit is it's overall more efficient (in the general case) and lets you write linear blocking code (as opposed to function coloring). You don't have to use it, but it's nice that it's there. Now hopefully Valhalla actually makes it in eventually.
The OS scheduler is still there (for the carrier threads), but now you've added on top of that FJ pool based scheduler overhead. Although virtual threads don't have the syscall overhead when they block, there's a new cost caused by allocating the internal continuation object, and copying state into it. This puts more pressure on the garbage collector. Context switching cost due to CPU cache thrashing doesn't go away regardless of which type of thread you're using.
I've not yet seen a study that shows that virtual threads offer a huge benefit. The Open Liberty study suggests that they're worse than the existing platform threads.
For quick results, check figures 11 and 15 from the (preprint) paper. Userland threads ("fred") have ~50% higher throughput while having orders of magnitude better latency at high load levels, in a real-world application (memcached).
The study says there are surprising performance problems with Java's virtual thread implementation. Their throughput test was also hilarious: they pitted 2000 OS threads against 2000 virtual threads, when most of the time OS threads don't start falling apart until 100k+ threads. You can architect an application so that it handles 200k simultaneous connections using platform-thread-per-core, but it's harder to reason about than the linear, blocking code that virtual threads and async allow for.
> Context switching cost due to CPU cache thrashing doesn't go away regardless of which type of thread you're using.
Except it's not a context switch? You're jumping to another instruction in the program, one that should be very predictable. You might lose your cache, but it will depend on a ton of factors.
> there's a new cost caused by allocating the internal continuation object, and copying state into it.
This is more of a problem with the implementation (not every virtual thread language does it this way), but yeah, this is more overhead on the application. I assume there are improvements that can be made to ease GC pressure, like using object pools.
Usually virtual threads are a memory vs CPU tradeoff that you typically use in massively concurrent IO-bound applications. Total throughput should overtake platform threads at hundreds of thousands of connections, but below that they probably perform worse. I'm not that surprised by that.
> Except it's not a context switch? You're jumping to another instruction in the program, one that should be very predictable. You might lose your cache, but it will depend on a ton of factors.
Java virtual threads are stackful; they have to save and restore the stack every time they mount a different virtual thread to the platform thread. They do this by naive[0] copying of the stack out to a heap allocation and then back again, every time. That's clearly a context switch that you're paying for; it's just not in the kernel. I believe this is what the person you're replying to is talking about.
[0] Not totally naive. They do take some effort to copy only subsets of the stack if they can get away with it. But it's still all done by copies. I don't know enough to understand why they need to copy and can't just swap stack pointers. I think it's related to the need to dynamically grow the stack when the thread is active vs. having a fixed size heap allocation to store the stack copy.
No, it optimises hardware utilisation by simply allowing more tasks to concurrently make progress. This allows throughput to reach the maximum the hardware allows. See https://youtu.be/07V08SB1l8c.
imo the biggest difference between "virtual" threads in a managed runtime and "os" threads is that the latter use a fixed-size stack whereas the former are allowed to resize: the stack can grow on demand and shrink under pressure.
When you spawn an OS thread you are paying at worst the full cost of it, and at best the max depth seen so far in the program, and stack overflows can happen even if the program is written correctly. Whereas a virtual thread can grow the stack to be exactly the size it needs at any point, and when GC runs it can rewrite pointers to any data on the stack safely.
Virtual/green/user space threads aka stackful coroutines have proven to be an excellent tool for scaling concurrency in real programs, while threads and processes have always played catchup.
> “something” will block eventually no matter what…
The point is to allow everything else to make progress while that resource is busy.
---
At a broader scale, as a programming model it lets you architect programs that are designed to scale horizontally. With the commoditization of compute in the cloud, that means it's very easy to write a program that can be distributed as i/o demand increases. In principle, a "virtual" thread could be spawned on a different machine entirely.
They indeed optimize thread context switching. Taking a thread on and off the CPU becomes expensive when there are thousands of threads.
You are right that everything blocks; even when going to L1 cache you have to wait about a nanosecond. But blocking in this context means waiting for “real” IO like a network request or spinning disk access. Virtual threads take away the problem that the thread sits there doing nothing for a while as it is waiting for data, before it is context switched.
Virtual threads won’t improve CPU-bound blocking. There the thread is actually occupying the CPU, so there is no problem of the thread doing nothing as with IO-bound blocking.
The hardware now is just as concurrent/parallel as the software. High-end NVMe SSDs and server-grade NICs can do hundreds to thousands of things simultaneously. Even if one lane does get blocked, there are other lanes which are open.
> I remember saying “something” will block eventually no matter what… anything from the buffer being full on the NIC to your cpu being at anything less than 100%.
Nope. You can go async all the way down, right to the electrical signals if you want. We usually impose some amount of synchronous clocking/polling for sanity, at various levels, but you don't have to; the world is not synchronised, the fastest way to respond to a stimulus will always be to respond when it happens.
> Does it shake out to any real advantage?
Of course it does - did you miss the whole C10K discussions 20+ years ago? Whether it matters for your business is another question, but you can absolutely get a lot more throughput by being nonblocking, and if you're doing request-response across the Internet you generally can't afford not to.