An EEVDF CPU scheduler for Linux

By Jonathan Corbet
March 9, 2023
The kernel's completely fair scheduler (CFS) has the job of managing the allocation of CPU time for most of the processes running on most Linux systems. CFS was merged for the 2.6.23 release in 2007 and has, with numerous ongoing tweaks, handled the job reasonably well ever since. CFS is not perfect, though, and there are some situations it does not handle as well as it should. The EEVDF scheduler, posted by Peter Zijlstra, offers the possibility of improving on CFS while reducing its dependence on often-fragile heuristics.

CFS and scheduling constraints

One of the key design goals of CFS was, as might be understood from its name, fairness — ensuring that every process in the system gets its fair share of CPU time. This goal is achieved by tracking how much time each process has received and running those that have gotten less CPU time than the others, with each process's run time scaled by its "nice" priority. CFS is, in other words, a weighted fair queuing scheduler at its core.

Fairness, it turns out, is enough to solve many CPU-scheduling problems. There are, however, many constraints beyond the fair allocation of CPU time that are placed on the scheduler. It should, for example, maximize the benefit of the system's memory caches, which requires minimizing the movement of processes between CPUs. At the same time, though, it should keep all CPUs busy if there is work for them to do. Power management is a complication as well; sometimes the optimal decisions for system throughput must take a back seat to preserving battery life. Hybrid systems (where not all CPUs are the same) add more complications. And so on.

One place where there is a desire for improvement is in the handling of latency requirements. Some processes may not need a lot of CPU time but, when they do need that time, they need it quickly. Others might need more CPU time but can wait for it if need be. CFS does not give processes a way to express their latency requirements; nice values (priorities) can be used to give a process more CPU time, but that is not the same thing. The realtime scheduling classes can be used for latency-sensitive work, but running in a realtime class is a privileged operation, and realtime processes can badly affect the operation of the rest of the system.

What is lacking is a way to ensure that some processes can get access to a CPU quickly without necessarily giving those processes the ability to obtain more than their fair share of CPU time. The latency nice patches have been circulating for some time as an attempt to solve this problem; they allow CFS processes with tight latency requirements to jump the queue for the CPU when they want to run. These patches appear to work, but Zijlstra thinks that there might be a better approach to the problem.

Introducing EEVDF

The "Earliest Eligible Virtual Deadline First" (EEVDF) scheduling algorithm is not new; it was described in this 1995 paper by Ion Stoica and Hussein Abdel-Wahab. Its name suggests something similar to the Earliest Deadline First algorithm used by the kernel's deadline scheduler but, unlike that scheduler, EEVDF is not a realtime scheduler, so it works in different ways. Understanding EEVDF requires getting a handle on a few (relatively) simple concepts.

Like CFS, EEVDF tries to divide the available CPU time fairly among the processes that are contending for it. If, for example, there are five processes trying to run on a single CPU, each of those processes should get 20% of the available time. A given process's nice value can be used to adjust the calculation of what its fair time is; a process with a lower nice value (and thus a higher priority) is entitled to more CPU time at the expense of those with higher nice values. To this point, there is nothing new here.
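As a rough illustration (not kernel code), the weighted share each runnable process is entitled to can be sketched as below. The weight values are made up for the example, though they follow the kernel's convention that nice 0 corresponds to a weight of 1024 and each nice step scales the weight by roughly 1.25.

    /*
     * Sketch only: each runnable task's share is weight / sum-of-weights.
     * The weights are illustrative; nice 0 maps to 1024 and each nice
     * step scales the weight by roughly 1.25.
     */
    #include <stdio.h>

    struct task { const char *label; unsigned int weight; };

    int main(void)
    {
        struct task rq[] = {
            { "nice 0",  1024 },
            { "nice 0",  1024 },
            { "nice -5", 3121 },    /* about 1.25^5 * 1024 */
        };
        unsigned int total = 0;

        for (int i = 0; i < 3; i++)
            total += rq[i].weight;
        for (int i = 0; i < 3; i++)
            printf("%-8s gets %.1f%% of the CPU\n", rq[i].label,
                   100.0 * rq[i].weight / total);
        return 0;
    }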

Imagine a time period of one second; during that time, in our five-process scenario, each process should have gotten 200ms of CPU time. For a number of reasons, things never turn out exactly that way; some processes will have gotten too much time, while others will have been shortchanged. For each process, EEVDF calculates the difference between the time that process should have gotten and how much it actually got; that difference is called "lag". A process with a positive lag value has not received its fair share and should be scheduled sooner than one with a negative lag value.
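In code form, that definition might be sketched like this; the names are hypothetical and not those used in the patch set:

    /*
     * Sketch of the lag definition above: lag is the CPU time a process
     * was entitled to over some interval minus the time it actually got.
     * All names here are hypothetical.
     */
    struct sketch_entity {
        unsigned long weight;           /* derived from the nice value */
        unsigned long long received_ns; /* CPU time actually consumed */
    };

    static long long lag_ns(const struct sketch_entity *se,
                            unsigned long long interval_ns,
                            unsigned long total_weight)
    {
        /* Entitled time is the task's weighted share of the interval. */
        long long entitled =
            (long long)(interval_ns * se->weight / total_weight);

        return entitled - (long long)se->received_ns;
    }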

In fact, a process is deemed to be "eligible" if — and only if — its calculated lag is greater than or equal to zero; any process with a negative lag will not be eligible to run. For any ineligible process, there will be a time in the future where the time it is entitled to catches up to the time it has actually gotten and it will become eligible again; that time is deemed the "eligible time".
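Continuing the sketch: eligibility is just the sign of the lag, and the eligible time for an ineligible process falls out of the rate at which its entitlement accrues, assuming the set of competing processes does not change in the meantime. This is purely illustrative:

    /* A process is eligible only when its lag is non-negative. */
    static int is_eligible(long long lag_ns)
    {
        return lag_ns >= 0;
    }

    /*
     * Illustrative only: while a process is off the CPU, its entitlement
     * accrues at (weight / total_weight) of wall-clock time, so a deficit
     * of -lag is repaid after (-lag) * total_weight / weight nanoseconds;
     * that point in the future is its eligible time.
     */
    static unsigned long long ns_until_eligible(long long lag_ns,
                                                unsigned long weight,
                                                unsigned long total_weight)
    {
        if (lag_ns >= 0)
            return 0;
        return (unsigned long long)(-lag_ns) * total_weight / weight;
    }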

The calculation of lag is, thus, a key part of the EEVDF scheduler, and much of the patch set is dedicated to finding this value correctly. Even in the absence of the full EEVDF algorithm, a process's lag can be used to place it fairly in the run queue; processes with higher lag should be run first in an attempt to even out lag values across the system.

The other factor that comes into play is the "virtual deadline", which is the earliest time by which a process should have received its due CPU time. This deadline is calculated by adding a process's allocated time slice to its eligible time. A process with a 10ms time slice, and whose eligible time is 20ms in the future, will have a virtual deadline that is 30ms in the future.

The core of EEVDF, as can be seen in its name, is that it will run the process with the earliest virtual deadline first. The scheduling choice is thus driven by a combination of fairness (the lag value that is used to calculate the eligible time) and the amount of time that each process currently has due to it.
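Putting the pieces together, the pick itself can be sketched as a scan for the eligible process with the earliest virtual deadline. A real implementation would use a more efficient structure than a linear scan, and the names here are again hypothetical:

    /*
     * Sketch of the EEVDF pick: among eligible tasks (lag >= 0), run the
     * one with the earliest virtual deadline.
     */
    #include <stddef.h>

    struct eevdf_sketch {
        long long lag_ns;               /* entitled minus received */
        unsigned long long deadline_ns; /* eligible time + time slice */
    };

    static struct eevdf_sketch *pick_next(struct eevdf_sketch *tasks, size_t n)
    {
        struct eevdf_sketch *best = NULL;

        for (size_t i = 0; i < n; i++) {
            if (tasks[i].lag_ns < 0)    /* not yet eligible */
                continue;
            if (!best || tasks[i].deadline_ns < best->deadline_ns)
                best = &tasks[i];
        }
        return best;                    /* NULL if nothing is eligible */
    }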

Addressing the latency problem

With this framework in place, the implementation of quicker access for latency-sensitive processes happens naturally. When the scheduler is calculating the time slice for each process, it factors in that process's assigned latency-nice value; a process with a lower latency-nice setting (and, thus, tighter latency requirements) will get a shorter time slice. Processes that are relatively indifferent to latency will receive longer slices. Note that the amount of CPU time given to any two processes (with the same nice value) will be the same, but the low-latency process will get it in a larger number of shorter slices.

Remember that the virtual deadline is calculated by adding the time slice to the eligible time. That will cause processes with shorter time slices to have closer virtual deadlines and, as a result, to be executed first. Latency-sensitive processes, which normally don't need large amounts of CPU time, will be able to respond quickly to events, while processes without latency requirements will be given longer time slices, which can help to improve throughput. No tricky scheduler heuristics are needed to get this result.
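A small worked example, using made-up slice lengths, shows the effect: two processes with equal weight become eligible at the same moment, but the one with the tighter latency-nice value has been given a shorter slice, so its virtual deadline arrives sooner and it is picked first.

    /* Illustrative numbers only; both tasks are eligible at t = 0. */
    #include <stdio.h>

    int main(void)
    {
        unsigned long long eligible_ns      = 0;
        unsigned long long latency_slice_ns = 1000000;   /*  1 ms slice */
        unsigned long long batch_slice_ns   = 10000000;  /* 10 ms slice */

        unsigned long long latency_deadline = eligible_ns + latency_slice_ns;
        unsigned long long batch_deadline   = eligible_ns + batch_slice_ns;

        printf("latency-sensitive deadline: %llu ms\n",
               latency_deadline / 1000000);
        printf("batch deadline:             %llu ms\n",
               batch_deadline / 1000000);
        printf("runs first: %s\n",
               latency_deadline < batch_deadline ? "latency-sensitive" : "batch");
        return 0;
    }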

There is a big distance, though, between an academic paper and an implementation that can perform well in the Linux kernel. Zijlstra has only begun to run benchmarks on his EEVDF scheduler; his initial conclusion is that "there's a bunch of wins and losses, but nothing that indicates a total fail". Some of the results, he said, "seem to indicate EEVDF schedules a lot more consistently than CFS and has a bunch of latency wins".

While this is clearly a reasonable starting point, Zijlstra acknowledges that there is still quite a bit of work to be done. But, he said, "if we can pull this off we can delete a whole [bunch] of icky heuristics code", replacing it with a better-defined policy. This is not a small change, he added: "It completely reworks the base scheduler, placement, preemption, picking -- everything. The only thing they have in common is that they're both a virtual time based scheduler."

Needless to say, such a fundamental change is unlikely to be rushed into the kernel. Helpfully, the current patches implement EEVDF as an option alongside CFS, which will enable wider testing without actually replacing the current scheduler. The CPU scheduler has to do the right thing for almost any conceivable workload on the wide range of systems supported by the kernel; that leaves a lot of room for unwelcome regressions resulting from even small changes — which this is not. So a lot of that testing will have to happen before consideration might be given to replacing CFS with EEVDF; there is no deadline, virtual or otherwise, for when that might happen.

Index entries for this article
Kernel: Releases/6.6
Kernel: Scheduler/Completely fair scheduler
Kernel: Scheduler/EEVDF



An EEVDF CPU scheduler for Linux

Posted Mar 9, 2023 15:55 UTC (Thu) by dullfire (subscriber, #111432) [Link]

Assuming I understand correctly: I wonder what happens in the case where a process gets a "large" negative lag, and then all other previously runnable processes leave the queue. Does that force the system idle until the negative lag expires (or another process becomes eligible), or is the lag "debt" forgiven?

An EEVDF CPU scheduler for Linux

Posted Mar 9, 2023 16:06 UTC (Thu) by Wol (subscriber, #4433) [Link]

I guess you can only get a large negative lag if you've been asleep ...

So if you've got a process that needs large chunks of CPU at random intervals you might let it accumulate entitlement while sleeping, but then it'll hog the CPU when it wakes; or more sensibly processes can only accumulate entitlement while they're actually waiting to run. Once they drop off the run queue of their own accord (sleep or whatever), they should be deleted from the scheduling algorithm until they wake up again.

Cheers,
Wol

An EEVDF CPU scheduler for Linux

Posted Mar 9, 2023 16:12 UTC (Thu) by corbet (editor, #1) [Link]

I wasn't really able to go into the details of this...but: the scheduler (both CFS and EEVDF) works with virtual time that can flow faster or slower than clock time. If there are no other processes in the queue, virtual time will speed up — overcoming the negative lag — and the remaining process's fair share of CPU time will increase. So no, a process will not be forced idle if there are no others with a better claim to the CPU.

An EEVDF CPU scheduler for Linux

Posted Mar 9, 2023 18:32 UTC (Thu) by dullfire (subscriber, #111432) [Link]

Ah interesting. Thanks for clarifying that.

An EEVDF CPU scheduler for Linux

Posted Apr 2, 2024 8:44 UTC (Tue) by Ayxan (guest, #169409) [Link]

> virtual time will speed up
But will the system idle until the sped up virtual time catches up? Or does it speed up enough to catch up immediately?

An EEVDF CPU scheduler for Linux

Posted Apr 2, 2024 9:38 UTC (Tue) by farnz (subscriber, #17727) [Link]

The core of virtual time is that it's OK to skip virtual time in which the system is idle; when you're about to go idle, you first look to see if there are any tasks blocked because their lag is negative, and if there are, you can skip virtual time until you reach the nearest "eligible time". You thus never go idle when you have tasks waiting to run; if you would go idle, you skip the virtual time to the nearest eligible time, and now at least one task is ready to run.
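A sketch of that idea, with hypothetical names: before letting the CPU go idle, jump virtual time forward to the nearest eligible time so that at least one queued task becomes runnable.

    /*
     * Purely illustrative: if every queued task still has a future
     * eligible time, advance (skip) virtual time to the nearest one
     * instead of idling.
     */
    #include <stddef.h>

    static unsigned long long skip_idle_vtime(unsigned long long vnow,
                                              const unsigned long long *eligible,
                                              size_t n)
    {
        unsigned long long nearest = 0;
        int found = 0;

        for (size_t i = 0; i < n; i++) {
            if (eligible[i] <= vnow)
                return vnow;            /* something is already runnable */
            if (!found || eligible[i] < nearest) {
                nearest = eligible[i];
                found = 1;
            }
        }
        return found ? nearest : vnow;  /* never idle with tasks queued */
    }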

An EEVDF CPU scheduler for Linux

Posted Mar 9, 2023 19:04 UTC (Thu) by geofft (subscriber, #59789) [Link]

This would actually be a useful feature: it's what's implemented by CFS's "quota" mechanism, more or less. A process that has a quota of, say, 100 ms per second is ineligible to run once it's consumed 100 ms of CPU time, even if the system is otherwise idle. The advantage of this is that it means the process's behavior is more predictable/deterministic, which helps you figure out how many system resources you need, especially on something like a large heterogeneous fleet (think Kubernetes/Borg/etc.). If I'm running a little monitoring daemon that I think needs half a CPU, or a transcoding pipeline for customers that I think needs four CPUs, or something, but the system makes a habit of giving me more than that because the CPUs are free, then I'm not very confident that my process indeed only needs half a CPU or four CPUs. So when someone else comes along and schedules a computationally-intensive job, I might see significant slowdowns.
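For reference, the quota mechanism described here is the CFS bandwidth controller; the concept boils down to a runtime budget per period, as in this conceptual sketch (with cgroup v2, roughly the same policy is expressed by writing a quota and a period, in microseconds, to the cpu.max file).

    /*
     * Conceptual sketch only (not the kernel's implementation): a group
     * gets runtime_us of CPU per period_us and is throttled (removed
     * from contention) once that budget is used, even if the CPU would
     * otherwise be idle. Field names are hypothetical.
     */
    struct quota_sketch {
        unsigned long long period_us;   /* e.g. 1000000: one second    */
        unsigned long long runtime_us;  /* e.g. 100000: 100 ms/second  */
        unsigned long long used_us;     /* consumed in current period  */
    };

    static int throttled(const struct quota_sketch *q)
    {
        return q->used_us >= q->runtime_us;
    }

    static void refill(struct quota_sketch *q)
    {
        q->used_us = 0;                 /* budget refilled each period */
    }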

Unfortunately, the CFS quota mechanism tends to result in a lot of weird runtime behavior; the high-level problem, I think, is that you can use a lot of CPU right after being scheduled without the scheduler stopping you, especially in a multi-threaded process, and once you get descheduled, the scheduler will realize that you're so deeply in the red for quota that it won't reschedule you for seconds or even minutes afterwards, which isn't really what users - or the TCP services they talk to - expect. So even though Kubernetes turns it on by default, lots of operators turn it off in practice. It's gotten better recently (see https://lwn.net/Articles/844976/, which also does a better job of explaining the overall mechanism than I'm doing) but I haven't gotten a chance to try it yet.

I'd be curious to know whether EEVDF can implement the quota concept in a way that is less jittery.

An EEVDF CPU scheduler for Linux

Posted Mar 9, 2023 22:33 UTC (Thu) by Wol (subscriber, #4433) [Link]

And what happens if you have 11 processes each entitled to 100ms/s :-)

I'd be surprised if CPU was actually allocated as "so much each second"; I'd allocate it per cycle. So effectively you should allocate a bunch of CPU to all your processes, and then when it runs out you go round the cycle and allocate again. That way your CPU is not idle if you have processes waiting.

Of course, there's plenty of other things to take into account - what happens if a process fork-bombs or something? And this idea of smaller chunks increasing your likelihood of scheduling might not be so easy to implement this way.

Actually, it's given me an idea for a fork-bomb-proof scheduler :-) What if, instead of forked processes each getting a fresh new slice of CPU, you set a limit on how much is available each cycle? Let's say, for example, that we want a cycle to last a maximum of 2 seconds, and the default allocation is 100ms. That gives us a maximum of 20 processes getting a full timeslice allocation. So when process 20 forks, the parent loses half its allocation to its child, giving them 50ms each. Effectively, as it forks, the "nice" value goes up. Then, as processes die, their slices get given to other processes.

So a fork bomb can't DoS processes that are already running. And if say you had a terminal session already running it at least stands a chance of getting going and letting you kill the fork-bomb ...

Cheers,
Wol

An EEVDF CPU scheduler for Linux

Posted Mar 11, 2023 19:30 UTC (Sat) by NYKevin (subscriber, #129325) [Link]

> And what happens if you have 11 processes each entitled to 100ms/s :-)

I imagine this is resolved by the "virtual time" that corbet mentioned in another comment. If you have too many processes, your virtual clock runs fast, so now there are more than 1000 (virtual) ms in a (real) second, and everybody gets scheduled as allocated. It's just that the 100 (virtual) ms that they get is significantly less than 100 real milliseconds. This is mathematically equivalent to multiplying everyone's quota by 10/11, but you don't have to actually go around doing all those multiplies and divides, nor do you have to deal with the rounding errors failing to line up with each other.
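To make that concrete with the numbers above (illustrative arithmetic only): with 11 equal tasks each nominally entitled to 100 virtual milliseconds, a real second stretches to cover 1100 virtual milliseconds, so each task actually receives 1000/11, or about 90.9, real milliseconds.

    /* Illustrative arithmetic for the 11-task example above. */
    #include <stdio.h>

    int main(void)
    {
        const int ntasks = 11;
        const double virtual_ms_each = 100.0;   /* nominal entitlement */
        const double real_ms_per_sec = 1000.0;

        /* One real second covers ntasks * 100 = 1100 virtual ms... */
        double virtual_ms_per_sec = ntasks * virtual_ms_each;
        /* ...so each virtual ms is worth 1000/1100 real ms. */
        double real_ms_each = virtual_ms_each *
                              (real_ms_per_sec / virtual_ms_per_sec);

        printf("each task gets %.1f real ms per second\n", real_ms_each);
        return 0;
    }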

Of course, that wouldn't work for realtime scheduling (where you have actually given each process a contractual guarantee of 100 ms/s, and the process likely expects that to be 100 *real* milliseconds), but we're not talking about that. If you try to configure SCHED_DEADLINE in such a way, it will simply refuse the request as impossible to fulfill.

An EEVDF CPU scheduler for Linux

Posted Mar 17, 2023 20:49 UTC (Fri) by prauld (subscriber, #39414) [Link]

This is not a replacement for the quota system. It only prevents negative-lag tasks from running in the face of contention for the resource. EEVDF is work-conserving, as is CFS. Don't be caught up by that sentence, which could have been clearer, I suppose. Tasks throttled due to CPU limits are removed from the run queue; they are not runnable and are not contending.

An EEVDF CPU scheduler for Linux

Posted Mar 9, 2023 17:16 UTC (Thu) by flussence (subscriber, #85566) [Link]

Sounds like it would've been an ideal first user of the BPF pluggable scheduler feature.

An EEVDF CPU scheduler for Linux

Posted Mar 10, 2023 7:13 UTC (Fri) by jason.rahman (subscriber, #120510) [Link]

Part of me wonders if this was posted precisely to head off BPF sched_ext by addressing the (correct) argument that CFS doesn't address latency-sensitive, application-specific scheduling needs.

An EEVDF CPU scheduler for Linux

Posted Mar 10, 2023 9:40 UTC (Fri) by fredrik (subscriber, #232) [Link]

A scheduler api for BPF would reasonably also limit new scheduler designs. The article mentions that the eevdf scheduler changed things all over the map.

> "It completely reworks the base scheduler, placement, preemption, picking -- everything."

A BPF scheduler api would have to expose all those concepts and either allow modification of structs in the api or limit what is possible to express. Other scheduler experiments would potentially require other components and concepts in the BPF-api.

And then there's the question of maintenance. Are kernel BPF APIs part of the kernel's public APIs, and hence expected to be maintained more or less forever?

Of course, one counter argument is that anything related to software is possible in theory.

It would be interesting to get Peter Zijlstra's and other scheduler developers take on this topic.

An EEVDF CPU scheduler for Linux

Posted Mar 11, 2023 4:57 UTC (Sat) by flussence (subscriber, #85566) [Link]

> A scheduler api for BPF would reasonably also limit new scheduler designs. The article mentions that the eevdf scheduler changed things all over the map.
>
> > "It completely reworks the base scheduler, placement, preemption, picking -- everything."

That's only *more* reason to make this work via BPF, where people can try it out in a sandbox without risking bringing the entire system down. Otherwise something this intrusive ought to spend a good 10-15 years stewing in an external tree like every other attempt to overthrow CFS.

An EEVDF CPU scheduler for Linux

Posted Mar 10, 2023 0:31 UTC (Fri) by josh (subscriber, #17465) [Link]

I look forward to seeing performance numbers for `make -j$(nproc)` on large codebases.

An EEVDF CPU scheduler for Linux

Posted Mar 12, 2023 21:17 UTC (Sun) by dullfire (subscriber, #111432) [Link]

This use case is so common that I highly doubt there is any room to improve (significantly) against CFS on this. Especially since the scheduler devs have a direct incentive themselves to make this use case work well.

I suspect that most cases of large code base build times are going to be most fruitfully improved by fixing any lingering bugs in their build system (like abuse of recursive make).

Furthermore, it is also well out of scope of the areas the EEVDF scheduler is attempting to improve upon.

An EEVDF CPU scheduler for Linux

Posted Mar 26, 2023 6:36 UTC (Sun) by dottedmag (subscriber, #18590) [Link]

From the description I'd expect a more responsive system under `make -j$(nproc)` rather than a reduction in build time.

An EEVDF CPU scheduler for Linux

Posted Mar 10, 2023 14:46 UTC (Fri) by HenrikH (subscriber, #31152) [Link]

"It should, for example, maximize the benefit of the system's memory caches, which requires minimizing the movement of processes between CPUs"

This is an area where I feel CFS falls short quite a bit. On the old O(1) scheduler you could launch a CPU-heavy thread and it would more or less remain running on the same core, while that same thread moves around a lot under CFS (not as bad as the Windows scheduler, though, which seems to move such a thread around constantly, so that a thread taking 100% CPU on e.g. a 4-core system shows up as all four cores at 25%).

So I hope that EEVDF will somehow be better in this regard. And then we have the added complexity of the E-core/P-core mix from Intel, CCDs with and without 3D V-Cache from AMD, and big.LITTLE from ARM.

An EEVDF CPU scheduler for Linux

Posted Mar 15, 2023 0:01 UTC (Wed) by intgr (subscriber, #39733) [Link]

> on the old O(1) scheduler you could launch a cpu heavy thread and it would more or less remain running on the same core

Note that this is no longer optimal for modern CPUs. There are lots of thermal sensors, one or more per core, and boost clocks for the entire CPU are usually limited by the hottest reading. So keeping a heavy task on the same core leads to it developing a hotspot and limiting boost -- whereas boost is most useful when most cores are idle.

So some bouncing can actually lead to better single-thread performance. I'm not sure what the ideal bounce interval is; I doubt CFS is particularly tuned for it.

Also, it's common that some cores are better cooled due to their positioning on the chip, uneven cooler contact, etc. It would be even better if the scheduler had that thermal information and would prefer the coolest cores.

An EEVDF CPU scheduler for Linux

Posted Mar 15, 2023 14:52 UTC (Wed) by farnz (subscriber, #17727) [Link]

In theory, at least, AMD offers the information you want via amd-pstate. It tells you what the highest performance is of a given CPU thread if all other threads are idle and cooling is adequate, what performance you can get if the cooling subsystem meets AMD specs for that CPU, and where you transition from "maximum perf/joule" to "fewer joules but lower perf".

An EEVDF CPU scheduler for Linux

Posted Mar 24, 2023 12:59 UTC (Fri) by nim (guest, #102653) [Link]

I really doubt that a single active core can heat up to throttling levels given adequate cooling (i.e. one that can maintain the nominal load on all cores). Plus, AFAIK, modern CPUs from both Intel and AMD expose the concept of best core, which is meant to be used exclusively when you need the highest clock. Doesn't that contradict your claim?

OTOH, cold caches are a real phenomenon. Sadly, I cannot point to a figure for how much that affects execution speed, especially if the thread can pull its data from the shared L3 cache instead of going to RAM, but it seems an easier nut to crack. It would be nice to at least have the option to say "don't move threads to other cores unless their current one is needed for something else."

An EEVDF CPU scheduler for Linux

Posted Mar 24, 2023 15:19 UTC (Fri) by intgr (subscriber, #39733) [Link]

> I really doubt that a single active core can heat up to throttling levels given adequate cooling (i.e. one that can maintain the nominal load on all cores).

This is two sides of the same coin -- modern CPUs dynamically adjust their clocks to always operate at thermal limits (or sometimes power limits or other constraints) -- but CPU manufacturers call it "boost" rather than "throttling".

For each CPU, AMD individually lists "Base Clock" and "Max. Boost Clock"; e.g. the 5950X can range from 3.4 to 4.9GHz, the latter being the clock under ideal conditions for a single-threaded workload. But that's usually not achievable for an extended steady workload. This article has some graphs showing clock behavior during extended single-thread workloads: https://www.tomshardware.com/news/amd-ryzen-3000-boost-cl...

> modern CPUs from both Intel and AMD expose the concept of best core, which is meant to be used exclusively when you need the highest clock. Doesn't that contradict your claim?

This is true as well; there may be only one or a few cores on a package capable of reaching the max boost clocks.

But as stated above, under longer workloads, it's common for modern CPUs to be thermally limited. If due to thermals, the "best" core's clock can no longer boost higher than the 2nd best core, you don't lose anything by running on the 2nd best core.

> OTOH, cold caches are a real phenomenon.

Agreed. It all depends on how often this bouncing happens. If it's just once every few seconds, it's likely that the performance penalty due to cold caches is small enough to be unmeasurable.

But I admit my analysis is hypothetical at this point: I have no information whether possible performance gains from core bouncing are any higher than the losses.

An EEVDF CPU scheduler for Linux

Posted Mar 24, 2023 15:28 UTC (Fri) by Wol (subscriber, #4433) [Link]

> I really doubt that a single active core can heat up to throttling levels given adequate cooling (i.e. one that can maintain the nominal load on all cores). Plus, AFAIK, modern CPUs from both Intel and AMD expose the concept of best core, which is meant to be used exclusively when you need the highest clock. Doesn't that contradict your claim?

"Given adequate cooling" - THIS IS THE PROBLEM. In order for cores to run faster, they need to get smaller. If they get smaller, cooling gets harder, and "adequate cooling" becomes impossible.

So as your core heats up, you have a choice - reduce the VA going in (and hence the heat that needs to be dissipated), or move the task to a different core to give the first one a chance to cool down. Do neither, and you'll kill the core. If you could easily kill older, larger chips (I had problems with a 686 ages ago with inadequate cooling), it's going to be a LOT easier to kill today's faster, more powerful cores.

Cheers,
Wol


Copyright © 2023, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds