TSO sizing and the FQ scheduler


By Jonathan Corbet
August 28, 2013
One might think that, by now, we would have a pretty good idea of how to optimally manage data streams with the TCP protocol. In truth, there still seems to be substantial room for improvement. Larger problems like bufferbloat have received a lot of attention recently, but there is ongoing work in other aspects of real-world networking as well. A couple of patches posted recently by Eric Dumazet show the kind of work that is being done.

TSO sizing

TCP segmentation offload (TSO) is a hardware-assisted technique for improving the performance of outbound data streams. As a stream of data (a "flow") is transmitted, it must be broken up into smaller segments, each of which fits into one packet. A network interface that supports TSO can accept a large buffer of data and do this segmentation work within the hardware. That reduces the number of operations performed and interrupts taken by the host CPU, making the transmission process more efficient. The use of techniques like TSO makes it possible for Linux systems to run high-end network interfaces at their full speed.
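
To get a sense of how much work is being offloaded, and of how many packets a single TSO transmission can turn into, the following userspace sketch simply counts the wire-sized segments in one maximally-sized buffer. The sizes are typical, assumed values, not anything taken from the patches:

    #include <stdio.h>

    /* Illustrative only: the segmentation work that TSO moves into the
     * network interface.  The sizes are typical, assumed values. */
    int main(void)
    {
        const unsigned int buffer_len = 64 * 1024;  /* one full-sized TSO buffer */
        const unsigned int mss = 1448;              /* typical TCP segment payload */
        unsigned int offset, segments = 0;

        for (offset = 0; offset < buffer_len; offset += mss)
            segments++;

        printf("%u-byte buffer -> %u on-the-wire packets\n", buffer_len, segments);
        return 0;
    }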

A potential problem with TSO is that it can cause the interface to dump a large number of packets associated with a single stream onto the wire in a short period of time. In other words, data transmission can be bursty, especially if the overall flow rate for the connection is not all that high. All of those packets will just end up sitting in a buffer somewhere, contributing to bufferbloat and increasing the chances that some of those packets will be dropped. If those packets were transmitted at a more steady pace, the stress on the net as a whole would be reduced and throughput could well increase.

The simple TSO automatic sizing patch posted by Eric (with a Cc to Van Jacobson at a new google.com address) tries to spread out transmissions in just that way. There are two changes involved, the first of which is to make intelligent choices about how much data should be handed to the interface in a single TSO transmission. Current kernels will provide a maximally-sized buffer — usually 64KB — to be transmitted all at once. With the automatic sizing patch, that buffer size is reduced to an amount that, it is estimated, will take roughly 1ms to transmit at the current flow rate. In this way, each transmission will produce a smaller burst of data if the flow rate is not high.
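
In rough terms, the sizing logic amounts to something like the sketch below. The 1ms target comes from the patch description, but the clamping bounds and the names used here are illustrative assumptions rather than the values found in the kernel code:

    #include <stdio.h>

    /* A sketch of automatic TSO sizing: hand the interface roughly 1ms
     * worth of data at the flow's current rate.  The clamping bounds are
     * assumptions, not the exact limits used in the patch. */
    static unsigned int tso_autosize(unsigned long long rate_bytes_per_sec,
                                     unsigned int mss)
    {
        unsigned long long bytes = rate_bytes_per_sec / 1000;  /* ~1ms worth */

        if (bytes > 64 * 1024)      /* never more than the old 64KB maximum */
            bytes = 64 * 1024;
        if (bytes < 2ULL * mss)     /* always hand over at least two segments */
            bytes = 2ULL * mss;
        return (unsigned int)bytes;
    }

    int main(void)
    {
        /* A 100Mb/s flow moves about 12500 bytes per millisecond... */
        printf("100Mb/s flow -> %u-byte TSO buffers\n",
               tso_autosize(12500000ULL, 1448));
        /* ...while a 10Gb/s flow is capped at the old 64KB maximum. */
        printf("10Gb/s flow  -> %u-byte TSO buffers\n",
               tso_autosize(1250000000ULL, 1448));
        return 0;
    }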

The other piece of the puzzle is called "TCP pacing"; it is a TCP implementation change intended to set the pace at which packets are transmitted to approximately match the pace at which they can get through the network. The existing TCP flow control mechanisms tell each endpoint how much data it is allowed to transmit, but they don't provide a time period over which that transmission should be spread, so, once again, the result tends to be bursts of packets. TCP pacing ensures that packets are transmitted at a rate that doesn't cause them to pile up in buffers somewhere between the two endpoints. It can, of course, also be used to restrict the data rate of a given flow to something lower than what the network could handle, but that is not the objective of this patch.
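
A simplified model of the calculation is: pace the flow so that one congestion window's worth of data leaves over roughly one round-trip time. The sketch below shows that model; it is an illustration of the idea, not the exact formula used in the patch:

    #include <stdio.h>

    /* A simplified pacing-rate model: one congestion window per RTT.
     * This illustrates the idea, not the patch's actual formula. */
    static unsigned long long pacing_rate(unsigned int cwnd_segments,
                                          unsigned int mss,
                                          unsigned int srtt_us)
    {
        unsigned long long window_bytes =
            (unsigned long long)cwnd_segments * mss;

        /* bytes per second = window / round-trip time */
        return window_bytes * 1000000ULL / srtt_us;
    }

    int main(void)
    {
        /* 40 segments in flight, 1448-byte MSS, 20ms RTT: about 23Mb/s */
        printf("pacing rate: %llu bytes/sec\n", pacing_rate(40, 1448, 20000));
        return 0;
    }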

Interestingly, the patch does not actually implement pacing, which would add some significant complexity to the TCP stack — code that does not really need more complexity. All it does is to calculate the rate that should be used, in the hope that some other level of the stack can then enforce that rate. For now, that other part would appear to be the new "fair queue" packet scheduler.

The FQ scheduler

A packet scheduler is charged with organizing the flow of packets through the network stack to meet a set of policy objectives. The kernel has quite a few of them, including CBQ for fancy class-based routing, CHOKe for routers, and a couple of variants on the CoDel queue management algorithm. FQ joins this list as a relatively simple scheduler designed to implement fair access across large numbers of flows with local endpoints while keeping buffer sizes down; it also happens to implement TCP pacing.

FQ keeps track of every flow it sees passing through the system. To do so, it calculates an eight-bit hash based on the socket associated with the flow, then uses the result as an index into an array of red-black trees. The data structure is designed, according to Eric, to scale well up to millions of concurrent flows. A number of parameters are associated with each flow, including its current transmission quota and, optionally, the time at which the next packet can be transmitted.
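
In userspace terms, that per-flow state and the hashed array of trees might be sketched as below. The 256-entry array follows from the eight-bit hash, but the field names and the hash function are placeholders, not taken from the patch:

    #include <stdint.h>
    #include <stdio.h>

    /* A userspace sketch of FQ's per-flow bookkeeping.  The field names
     * and the hash function are illustrative, not the ones in the patch. */
    struct flow {
        uint64_t     sock_id;            /* identifies the local socket */
        int          credit;             /* transmission quota remaining */
        uint64_t     time_next_packet;   /* 0 if the flow may send now */
        struct flow *left, *right;       /* child links in the bucket's tree
                                          * (a red-black tree in the real code) */
    };

    /* The eight-bit hash selects one of 256 trees; the tree then sorts
     * out the flows that happen to share a bucket. */
    #define FQ_BUCKETS 256
    struct flow *buckets[FQ_BUCKETS];

    static unsigned int flow_bucket(uint64_t sock_id)
    {
        /* Any reasonable hash folded down to eight bits will do here. */
        return (unsigned int)((sock_id ^ (sock_id >> 8) ^ (sock_id >> 16)) & 0xff);
    }

    int main(void)
    {
        printf("socket 0x1234 lands in bucket %u\n", flow_bucket(0x1234));
        return 0;
    }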

That transmission time is used to implement the TCP pacing support. If a given socket has a pace specified for it, FQ will calculate how far the packets should be spaced in time to conform to that pace. If a flow's next transmission time is in the future, that flow is added to another red-black tree with the transmission time used as the key; that tree, thus, allows the kernel to track delayed flows and quickly find the one whose next packet is due to go out the soonest. A single timer is then used, if needed, to ensure that said packet is transmitted at the right time.
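
The spacing calculation itself is simple arithmetic: each packet pushes the flow's next allowed transmission time forward by its length divided by the pacing rate. A sketch, with assumed names and units:

    #include <stdint.h>
    #include <stdio.h>

    /* Sketch of per-packet spacing under pacing: a packet of pkt_len
     * bytes "occupies" pkt_len / rate seconds of the paced flow. */
    static uint64_t next_transmit_ns(uint64_t now_ns, unsigned int pkt_len,
                                     unsigned long long rate_bytes_per_sec)
    {
        uint64_t delay_ns =
            (uint64_t)pkt_len * 1000000000ULL / rate_bytes_per_sec;

        return now_ns + delay_ns;
    }

    int main(void)
    {
        /* A 1500-byte packet on a flow paced at ~2.9MB/s is due roughly
         * half a millisecond after the previous one. */
        printf("next packet due %llu ns from now\n",
               (unsigned long long)next_transmit_ns(0, 1500, 2896000ULL));
        return 0;
    }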

The scheduler maintains two linked lists of active flows, the "new" and "old" lists. When a flow is first encountered, it is placed on the new list. The packet dispatcher services flows on the new list first; once a flow uses up its quota, that flow is moved to the old list. The idea here appears to be to give preferential treatment to new, short-lived connections — a DNS lookup or HTTP "GET" command, for example — and not let those connections be buried underneath larger, longer-lasting flows. Eventually the scheduler works its way through all active flows, sending a quota of data from each; then the process starts over.
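
That round-robin behavior can be sketched in heavily simplified userspace form as follows; the real scheduler works on kernel lists and queues of sk_buffs, so the names and structure here are illustrative only:

    #include <stdio.h>

    /* A schematic of FQ's dispatch order, not the actual kernel code:
     * flows on the "new" list are served before those on the "old" list,
     * and a flow that spends its credit is refilled and sent to the back
     * of the old list, so every active flow gets one quota per round. */
    struct sched_flow {
        struct sched_flow *next;
        const char *name;
        int credit;          /* bytes this flow may still send this round */
        int packets_queued;  /* toy stand-in for the flow's packet queue */
    };

    struct flow_list { struct sched_flow *head, *tail; };

    static struct sched_flow *pop(struct flow_list *l)
    {
        struct sched_flow *f = l->head;
        if (f) {
            l->head = f->next;
            if (!l->head)
                l->tail = NULL;
        }
        return f;
    }

    static void push(struct flow_list *l, struct sched_flow *f)
    {
        f->next = NULL;
        if (l->tail)
            l->tail->next = f;
        else
            l->head = f;
        l->tail = f;
    }

    static void dispatch(struct flow_list *newl, struct flow_list *oldl,
                         int quantum)
    {
        struct sched_flow *f;

        while ((f = pop(newl)) || (f = pop(oldl))) {
            if (f->credit <= 0) {       /* quota spent: refill, go to the back */
                f->credit += quantum;
                push(oldl, f);
                continue;
            }
            while (f->credit > 0 && f->packets_queued > 0) {
                printf("sending 1500 bytes from %s\n", f->name);
                f->credit -= 1500;
                f->packets_queued--;
            }
            if (f->packets_queued > 0)
                push(oldl, f);          /* more to send on a later round */
            /* a flow with nothing left queued drops off the active lists */
        }
    }

    int main(void)
    {
        struct sched_flow dns  = { NULL, "short DNS flow", 3000, 1 };
        struct sched_flow bulk = { NULL, "bulk transfer",  3000, 4 };
        struct flow_list newl = { &dns, &dns }, oldl = { &bulk, &bulk };

        /* The new DNS flow is served first; the bulk flow then gets a
         * quota at a time until its queue drains. */
        dispatch(&newl, &oldl, 3000);
        return 0;
    }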

There are a number of additional details, of course. There are limits on the amount of data queued for each flow, as well as a limit on the amount of data buffered within the scheduler as a whole; any packet that would exceed one of those limits is dropped. A special "internal" queue exists for high-priority traffic, allowing it to reach the wire more quickly. And so on.

One other detail is garbage collection. One problem with this kind of flow tracking is that nothing tells the scheduler when a particular flow is shut down; indeed, nothing can tell the scheduler for flows without local endpoints or for non-connection-oriented protocols. So the scheduler must figure out on its own when it can stop tracking any given flow. One way to do that would be to drop the flow as soon as there are no packets associated with it, but that would cause some thrashing as the queues empty and refill; it is better to keep flow data around for a little while in anticipation of more traffic. FQ handles this by putting idle flows into a special "detached" state, off the lists of active flows. Whenever a new flow is added, a pass is made over the associated red-black tree to clean out flows that have been detached for a sufficiently long time — three seconds in the current patch.
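
The aging test made during that cleanup pass is simple; here is a sketch of the check, with illustrative names:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Sketch of the garbage-collection check: a flow that has sat in the
     * "detached" state for longer than the timeout (three seconds in the
     * current patch) can be reclaimed when a new flow is added to the
     * same tree.  Names here are illustrative. */
    #define FQ_GC_AGE_NS (3ULL * 1000000000ULL)

    struct idle_flow {
        bool     detached;         /* true once the flow's queue went empty */
        uint64_t detach_time_ns;   /* when that happened */
    };

    static bool can_collect(const struct idle_flow *f, uint64_t now_ns)
    {
        return f->detached && now_ns - f->detach_time_ns > FQ_GC_AGE_NS;
    }

    int main(void)
    {
        struct idle_flow f = { true, 1000000000ULL };   /* went idle at t = 1s */

        printf("collect at t = 2s? %d\n", can_collect(&f, 2000000000ULL));  /* 0 */
        printf("collect at t = 5s? %d\n", can_collect(&f, 5000000000ULL));  /* 1 */
        return 0;
    }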

The result, says Eric, is fair scheduling of packets from any number of flows with nice spacing in situations where pacing is being used. Given that the comments so far have been mostly concerned with issues like white space, it seems unlikely that anybody is going to disagree with its merging. So TCP pacing and the FQ scheduler seem like candidates for the mainline in the near future — though the upcoming 3.12 cycle may be just a bit too near at this point.

Index entries for this article
Kernel: Networking



TSO sizing and the FQ scheduler

Posted Aug 29, 2013 12:01 UTC (Thu) by k3ninho (subscriber, #50375) [Link]

>Whenever a new flow is added, a pass is made over the associated red-black tree to clean out flows that have been detached for a sufficiently long time — three seconds in the current patch.

Given that the discussion is of flow rates and queues which reflect known information about the state of the network, it seems crazy that this is a fixed number. Maybe the maths has been done offline to find the 99th percentile of an appropriate lognormal for TCP flow duration, but the approach of sampling and making predictions used in the other components of this patch suggests that calculating an estimate of the mean at clean-up time could inform the clean-out expiry timer. That might be best done with some floating-point maths, which is unfortunately a kernel no-no. But as many statisticians say, 'we can make an approximation'!

K3n.

TSO sizing and the FQ scheduler

Posted Sep 3, 2013 9:11 UTC (Tue) by DavidS (guest, #84675) [Link]

I'd rather guess that the three seconds come from "if a connection hasn't had packets in three seconds, the scheduler might as well treat the next packet like a new one."

TSO sizing and the FQ scheduler

Posted Sep 3, 2013 19:50 UTC (Tue) by robbe (guest, #16131) [Link]

> nothing can tell the scheduler [when a particular flow is shut down] for
> flows without local endpoints

What about connection tracking that lives right next door in the same kernel?

TSO sizing and the FQ scheduler

Posted Sep 6, 2013 21:35 UTC (Fri) by sandymac (guest, #22424) [Link]

Connection tracking is likely fine for home or office routers and gateways performing NAT. At the ISP level it may not be cost-effective.

How to configure?

Posted Nov 4, 2013 17:26 UTC (Mon) by BrucePerens (guest, #2510) [Link]

I have a Comcast Business link with native IPv6, which has a highly variable speed. It can get up to 100 megabit/second bursts when their consumer users aren't home, and around 15 megabits/second (the rate I pay for) in the evening. Obviously, the various QoS scripts that presume to know your connection's bandwidth aren't effective.

Can anyone tell me how to configure FQ packet scheduling? The HOWTO on packet scheduling configuration is more than a year old.

Thanks

Bruce


Copyright © 2013, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds