Christian Huitema's blog

Optimizing QUIC performance

12 Dec 2023

I am following the discussions on the IETF mailing lists, and I got engaged in a review of a proposal by Fred Templin to better implement packet fragmentation and reassembly at the IPv6 layer. I am not a great fan of that proposal, largely because of the many arguments against IP level fragmentation explained in RFC 8900 and the security issues described in RFC 7739. I think it is a bad idea to try to perform segmentation and reassembly outside the protection of encryption. But then, what about the argument that using larger packet sizes, with or without network level fragmentation and reassembly, would improve performance?

Fred Templin reports in his draft that tests with UDP and TCP performance tools show improvements with larger packet sizes, but that the “qperf” tool does not show any such gain for QUIC, because “‘qperf’ limits its packet sizes to 1280 octets.” Fred then concludes that this is the reason why in his benchmarks QUIC is much slower than TCP or raw UDP. For the record, the QUIC Perf protocol specification does not mandate a packet size. Many QUIC implementations do implement path MTU discovery, a.k.a. DPLPMTUD, and could use large packet sizes. If they did so, would that improve performance?

Some QUIC implementations have tried larger packet sizes. A prominent example is the “litespeed” implementation, whose authors described how to improve QUIC performance with DPLPMTUD. They found that increasing the packet size from 1280 to 4096 bytes results in a sizeable performance gain, from 400 Mbps to over 600 Mbps. But we should qualify: these gains were observed in their specific implementation. Other implementations use different approaches.

Processing of packets goes through multiple steps such as preparing QUIC headers, processing and sending control frames, copying data frames, encrypting and decrypting data. We analyze the performance of implementations using tools like “flame graph” to measure which components of the software consume the most CPU. In 2019, everybody pretty much assumed that the most onerous component was encryption. We were wrong, and somewhat shocked to observe that 70 to 80% of the CPU load was consumed in the socket layer, in calls to ‘sendto’ or ‘recvfrom’, or maybe ‘sendmsg’ and ‘recvmsg’.

The study by Alessandro Ghedini documented in a [Cloudflare blog](https://blog.cloudflare.com/accelerating-udp-packet-transmission-for-quic/) was one of the first to show the importance of the UDP Socket API. The blog discusses three successive improvements:

using ‘sendmmsg’ to send multiple messages across the user/kernel boundary in a single call,
using Generic Segmentation Offload (GSO) for UDP in order to send a ‘train’ of packets in a single call to ‘sendmsg’,
combining ‘sendmmsg’ and GSO to send multiple trains of packets in a single call.

This series of improvements allowed them to drive the performance of their implementation from 640 Mbps to 1.6 Gbps. (The litespeed values were well below that.)
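
To see why these three steps matter, it helps to count how many times the sender has to cross the user/kernel boundary. The sketch below is only a counting model, not a benchmark and not Cloudflare's code: the function name `syscalls_needed`, the batch size of 32 messages per ‘sendmmsg’ call and the train length of 16 packets are all assumptions chosen for illustration.

```python
import math


def syscalls_needed(n_packets, batch=1, train=1):
    """Count the send system calls needed to transmit n_packets.

    batch: messages handed to the kernel per call (1 models plain sendmsg,
           more than 1 models sendmmsg).
    train: packets coalesced into one GSO buffer (1 means GSO is not used).
    """
    if n_packets < 0 or batch < 1 or train < 1:
        raise ValueError("n_packets must be >= 0, batch and train must be >= 1")
    trains = math.ceil(n_packets / train)   # buffers handed to the kernel
    return math.ceil(trains / batch)        # system calls carrying those buffers


if __name__ == "__main__":
    n = 10_000  # hypothetical number of QUIC packets to send
    print("one sendmsg per packet:", syscalls_needed(n))                      # 10000
    print("sendmmsg, 32 per call: ", syscalls_needed(n, batch=32))            # 313
    print("GSO, trains of 16:     ", syscalls_needed(n, train=16))            # 625
    print("sendmmsg + GSO:        ", syscalls_needed(n, batch=32, train=16))  # 20
```

The model ignores everything that happens inside the kernel, so it says nothing about the absolute data rates quoted above. It only shows how quickly the number of calls shrinks when the two techniques are combined.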

The implementation of QUIC with the highest performance is probably msquic, by a team at Microsoft. They publish performance reports in a variety of environments, routinely showing data rates of 6 to 7 Gbps. On top of the improvements described by Alessandro Ghedini, the msquic implementation uses multiple CPU threads, and pays great attention to details of memory allocation, flow control parameters, or cryptographic APIs, as detailed by Nick Banks in this report.

I followed pretty much the same recipes to drive the performance of the picoquic implementation to between 1.5 and 2 Gbps. Contrary to Cloudflare, the picoquic implementation relies only on sendmsg and GSO, not sendmmsg, but I spent a good amount of time studying the interaction between congestion control, flow control, pacing, and the formation of packet trains. A difference with msquic is that the picoquic implementation uses a single thread, the idea being that deployments that want higher performance can run several servers in parallel, each in its own thread, and balance the incoming load.
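
For readers who have not used UDP GSO, here is a minimal sketch of what forming and sending a packet train can look like. It is not picoquic's code. The helper names `form_trains` and `send_train` and the limit of 16 packets per train are hypothetical, and the option number 103 for `UDP_SEGMENT` is taken from the Linux header `<linux/udp.h>`, because Python's socket module may not export it. The sketch assumes non-empty packets and relies on the GSO rule that all segments in a buffer have the same size, except the last one, which may be shorter.

```python
import socket
import struct

# Linux option number from <linux/udp.h>; Python's socket module may not export it.
UDP_SEGMENT = getattr(socket, "UDP_SEGMENT", 103)


def form_trains(packets, max_segments=16):
    """Group consecutive packets into trains that one GSO buffer can carry.

    Returns a list of (segment_size, joined_bytes). Inside a train every
    packet has the length of the first one, except the last, which may be
    shorter.
    """
    trains = []
    current = []
    for packet in packets:
        if current:
            full = len(current) >= max_segments
            closed = len(current[-1]) < len(current[0])  # a short packet ended the train
            too_big = len(packet) > len(current[0])
            if full or closed or too_big:
                trains.append((len(current[0]), b"".join(current)))
                current = []
        current.append(packet)
    if current:
        trains.append((len(current[0]), b"".join(current)))
    return trains


def send_train(sock, destination, segment_size, data):
    """Send one train with a single sendmsg call (Linux 4.18 or later).

    The kernel splits `data` into UDP datagrams of `segment_size` bytes.
    """
    ancillary = [(socket.IPPROTO_UDP, UDP_SEGMENT, struct.pack("H", segment_size))]
    return sock.sendmsg([data], ancillary, 0, destination)
```

Given three 1200-byte packets, one 500-byte packet and then two more 1200-byte packets, `form_trains` returns two trains: one of 4100 bytes and one of 2400 bytes, both with a segment size of 1200. `send_train` is Linux-specific; on systems without UDP GSO the ancillary data may be rejected or ignored, so a real sender needs a fallback that sends packets one at a time.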

Neither Cloudflare nor Microsoft conditions their performance on changing the packet size. One reason is that even with DPLPMTUD, the packet size is limited by the maximum packet size that the receiver accepts – apparently 1476 bytes when the receiver is a Chromium browser. Instead, they rely on sending “packet trains” and minimizing the cost of using the UDP socket API.
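
The cap imposed by the receiver can be illustrated with a simplified search in the spirit of DPLPMTUD. In QUIC the receiver advertises its limit in the `max_udp_payload_size` transport parameter. Everything else here is an assumption: the function name `search_max_payload`, the `probe_ok` callback that stands in for sending a padded probe and waiting for its acknowledgement, and the use of a plain binary search starting from the 1200-byte datagram size that QUIC requires every path to support.

```python
def search_max_payload(probe_ok, path_ceiling, peer_limit, base=1200):
    """Return the largest UDP payload size confirmed by probing.

    probe_ok(size) -> bool: hypothetical callback, True if a probe of
        `size` bytes was acknowledged by the peer.
    path_ceiling: largest payload the local interface could carry.
    peer_limit: largest payload the receiver is willing to accept.
    base: size assumed to work without probing.
    """
    low = base                              # largest size known to work
    high = min(path_ceiling, peer_limit)    # never probe beyond the receiver's limit
    while low < high:
        mid = (low + high + 1) // 2
        if probe_ok(mid):
            low = mid                       # probe acknowledged, search higher
        else:
            high = mid - 1                  # probe lost, search lower
    return low


if __name__ == "__main__":
    # Hypothetical path that carries UDP payloads of up to 4048 bytes,
    # with a receiver that accepts at most 1476 bytes.
    print(search_max_payload(lambda size: size <= 4048, 8952, 1476))  # 1476
```

Even though the hypothetical path could carry payloads of 4048 bytes, the search stops at 1476 because of the receiver's limit. A real implementation also has to cope with probes lost for reasons other than size, typically by retrying a probe a few times before giving up on that size.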

Once the implementation starts using packet trains, the size of individual packets matters much less. You do get some benefit from not running the packet composition tasks so often, but these are relatively minor components of the CPU load. There is a small per-packet overhead for encryption, but the bulk of the CPU load is proportional to the number of bytes being encrypted, which means that larger packets only bring modest benefits. The main value is the reduction of per-packet overhead: 40 bytes for IPv6, 8 bytes for UDP, maybe 12 bytes for QUIC, 16 bytes for the AEAD authentication tag. That is 76 bytes out of a 1500-byte packet, about 5%. It would be only 1.9% of a 4096-byte packet, a gain in performance of about 3%.
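
The overhead arithmetic is easy to reproduce. The sketch below assumes a 40-byte IPv6 header, the standard 8-byte UDP header, a 12-byte QUIC header and a 16-byte AEAD tag. The QUIC figure in particular is only a guess, since the header length depends on the connection ID and packet number lengths, and the result moves with whatever sizes you plug in. The function names are made up for this example.

```python
IPV6_HEADER = 40   # fixed IPv6 header, no extension headers
UDP_HEADER = 8     # standard UDP header
QUIC_HEADER = 12   # assumption: short header with a small connection ID
AEAD_TAG = 16      # authentication tag appended by the AEAD

PER_PACKET_OVERHEAD = IPV6_HEADER + UDP_HEADER + QUIC_HEADER + AEAD_TAG


def overhead_fraction(packet_size, overhead=PER_PACKET_OVERHEAD):
    """Fraction of an IP packet of `packet_size` bytes spent on overhead."""
    return overhead / packet_size


def goodput_gain(small, large, overhead=PER_PACKET_OVERHEAD):
    """Relative gain in payload bytes per byte sent when moving from
    packets of `small` bytes to packets of `large` bytes."""
    return (1 - overhead_fraction(large, overhead)) / (1 - overhead_fraction(small, overhead)) - 1


if __name__ == "__main__":
    print(f"overhead at 1500 bytes: {overhead_fraction(1500):.1%}")    # 5.1%
    print(f"overhead at 4096 bytes: {overhead_fraction(4096):.1%}")    # 1.9%
    print(f"gain from 1500 to 4096: {goodput_gain(1500, 4096):.1%}")   # 3.4%
```

With a larger assumed overhead, for example by passing `overhead=84`, the percentages rise slightly, but the gain stays in the range of a few percent.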

To answer the initial question, I do not believe that a modest performance gain of a few percent is sufficient to justify the security risk of running fragmentation and reassembly at the IPv6 layer. Fragmentation not protected by encryption opens the path for denial of service attacks, because spoofed fragments cannot be easily distinguished from valid fragments. The performance loss from these DoS attacks would be much higher than that gain!
