Christian Huitema's blog


The Retire Connection ID stuffing Attack against QUIC

12 Mar 2024

Back in December, Marten Seemann found an attack against QUIC. Malevolent clients could exploit the “path validation” mechanism to create large queues of “Path Response” messages, eventually saturating the server's memory. Back then, it turned out that the picoquic implementation was not vulnerable, because it only responded to the last challenge received. Marten has since found another, similar issue, this time exploiting the handling of “NEW CONNECTION ID” frames. And this time, I did have to fix the picoquic code.

QUIC was designed with security in mind. There were lots of reviews, lots of discussions, and the development of formal proofs that the security of QUIC is equivalent to that of TLS 1.3. There are known limits to that security: on-path observers can cause the initial handshake to fail before the session keys are fully negotiated; they can “fingerprint” the encrypted traffic and match it to known patterns; on-path attackers can drop packets or mess with IP headers. These are very hard issues, the kind that will need serious efforts to resolve. QUIC is no better than TLS 1.3 in that regard. But outside of that, most attacks against QUIC are attacks against implementations that did not correctly implement the specification. Marten’s attacks are different, because they are based on the QUIC specification itself, and thus potentially affect every implementation.

QUIC packet headers include a “connection identifier” that allows nodes to link packets to connection contexts. The connection IDs are created by the QUIC node that will eventually decode them in packet headers: the client uses connection IDs provided by the server in the packets it sends to the server, and vice versa. To facilitate that, each node sends “NEW CONNECTION ID” frames to its peer. And to control the use of resources, both nodes tell their peer how many connection IDs they are willing to receive; trying to send more than that triggers a protocol violation. When a node does not need an old connection ID anymore, it sends a “RETIRE CONNECTION ID” frame, and the peer will provide a new one to replace it.
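The mechanism can be sketched as follows. This is a minimal illustration based on RFC 9000, section 19.15; the struct and function names are mine, not picoquic's actual code.

```c
#include <stdint.h>

/* Illustrative sketch of the NEW_CONNECTION_ID frame fields (RFC 9000,
 * section 19.15). Names are hypothetical, not picoquic's structures. */
typedef struct {
    uint64_t sequence_number;  /* assigned by the issuer, increases with each CID */
    uint64_t retire_prior_to;  /* receiver must retire CIDs with lower sequence numbers */
    uint8_t  cid_len;          /* 1 to 20 bytes */
    uint8_t  cid[20];
    uint8_t  reset_token[16];  /* stateless reset token tied to this CID */
} new_cid_frame_t;

/* Enforce the advertised limit: pushing the number of active (not yet
 * retired) CIDs above what the peer advertised is a protocol violation. */
int cid_limit_ok(uint64_t active_cids, uint64_t advertised_limit)
{
    return active_cids <= advertised_limit;
}
```

The key point for what follows is that the limit applies to *active* connection IDs only: a retired connection ID no longer counts against it.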

But there is a catch. Servers and clients need to maintain a list of valid connection identifiers so they can successfully direct packets to connection contexts. In big server farms, this is done by the load balancers, and it is often done using encryption and decryption: the connection ID typically contains bytes encrypting a server identifier and a random number, with a key known to the load balancer; see for example the QUIC-LB draft. These keys are rotated regularly, and when an old key is discarded, the corresponding connection IDs need to be discarded and new ones need to be provided. The new connection IDs will be carried in “NEW CONNECTION ID” frames, with an attribute asking the peer to retire connection IDs with a sequence number lower than a “retire prior to” value.

This is the mechanism exploited in the attack. The malevolent client sends a series of “NEW CONNECTION ID” frames saying something like “this is connection ID number N, please retire connection ID number N-1”. The server will accept the frame because, once the previous connection ID has been retired, the total number of connection IDs provided by the peer remains at or below the maximum number allowed. The server will also send a “RETIRE CONNECTION ID” frame confirming that the old connection ID number “N-1” has been retired.
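The server-side accounting under this attack can be modeled with a toy sketch; the names here are hypothetical, and a real implementation tracks the actual connection ID values rather than just counters.

```c
#include <stdint.h>

/* Toy model of the server's state during the stuffing attack. */
typedef struct {
    uint64_t active_cids;          /* CIDs currently usable; must stay at or below the limit */
    uint64_t queued_retire_frames; /* RETIRE_CONNECTION_ID frames waiting to be sent */
} server_state_t;

/* Process one malicious frame "this is CID N, retire CID N-1".
 * The active count is unchanged (one CID retired, one added), so the
 * limit check passes, yet one more RETIRE_CONNECTION_ID frame is
 * queued on every iteration. */
void on_new_cid_frame(server_state_t *s)
{
    s->active_cids -= 1;           /* retire CID N-1 ... */
    s->queued_retire_frames += 1;  /* ... and queue the confirmation frame */
    s->active_cids += 1;           /* accept CID N */
}
```

After 10,000 such frames the active count is exactly what it was at the start, so no limit is ever violated, but 10,000 confirmation frames are sitting in the send queue.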

That, by itself, does not sound too bad, but wait. The client sends a series of “NEW CONNECTION ID” frames to force the server to send as many “RETIRE CONNECTION ID” frames. Clients have many other ways to cause the server to send traffic, such as requesting web pages. But all those other types of traffic are regulated by flow control mechanisms, which limit how much data will be queued. In contrast, there is no limit to the number of “RETIRE CONNECTION ID” frames that can be queued.

Still, these frames are very small, so they would normally not be queued for very long. The attack only works if the client manages to slow down the server, for example by only acknowledging a fraction of the packets that the server sends. This causes the congestion control mechanisms to reduce the congestion window, eventually dropping to a minimum size of 1 or 2 packets per RTT. Each such packet can carry several hundred “RETIRE CONNECTION ID” frames. The queues will start to build up if the client sends thousands of “NEW CONNECTION ID” frames per RTT. There are certainly network scenarios in which that is doable. In those scenarios, the queues will build up, and the process memory will keep growing. Do that long enough and the server will run out of memory, especially if it is a small server, such as an embedded system.
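A back-of-the-envelope calculation shows how fast the backlog grows. The numbers below are illustrative assumptions, not measurements: a RETIRE_CONNECTION_ID frame is a type byte plus a variable-length sequence number, so roughly 3 bytes for small sequence numbers.

```c
#include <stdint.h>

/* Estimate how much the RETIRE_CONNECTION_ID backlog grows per RTT.
 * All parameters are illustrative assumptions. */
uint64_t backlog_growth_per_rtt(uint64_t frames_injected_per_rtt,
                                uint64_t cwnd_packets,
                                uint64_t packet_payload_bytes,
                                uint64_t retire_frame_bytes)
{
    /* How many RETIRE_CONNECTION_ID frames the throttled server can drain:
     * frames per packet times packets per RTT. */
    uint64_t drained = cwnd_packets * (packet_payload_bytes / retire_frame_bytes);
    return frames_injected_per_rtt > drained
        ? frames_injected_per_rtt - drained : 0;
}
```

With a congestion window of 2 packets, 1200-byte payloads, and 3-byte frames, the server drains at most 800 frames per RTT; a client injecting 10,000 “NEW CONNECTION ID” frames per RTT then grows the queue by 9,200 frames every RTT.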

The discussion above only considers the size of the queued frames, which is the only effect mandated by the specification. But implementation choices may make servers more sensitive to the attack. For example, some servers may keep the memory allocation for a connection ID until the “RETIRE CONNECTION ID” frame has been not just sent, but also acknowledged by the peer. In that case, the attack will cause much larger memory allocations. And the code handling connection IDs may be designed for the small number of connection IDs normally expected; the attack could grow those tables, making the handling inefficient and increasing the CPU load.

The attack is not hard to mitigate: just limit the number of “RETIRE CONNECTION ID” frames that the server is willing to queue, and treat a client that sends more than that as an anomaly, breaking the connection. But it is yet another example that designing network protocols is hard. Dozens of engineers have reviewed the QUIC specification, yet Marten found this issue three years after the specification was complete. We have since reviewed the specification again and could not find other similar issues. But I guess that we have to keep looking!
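The mitigation amounts to a single check on the send queue. A minimal sketch, with a hypothetical threshold; the constant name and value are mine, and an implementation should pick a value comfortably above anything a legitimate client would cause.

```c
#include <stdint.h>

/* Hypothetical cap on queued RETIRE_CONNECTION_ID frames. A legitimate
 * client retires only a handful of CIDs at a time, so any generous
 * bound distinguishes normal use from the stuffing attack. */
#define MAX_QUEUED_RETIRE_FRAMES 128

/* Returns 1 if the connection should be closed as misbehaving, i.e. the
 * peer has forced more retire confirmations than the cap allows. */
int retire_queue_overflow(uint64_t queued_retire_frames)
{
    return queued_retire_frames > MAX_QUEUED_RETIRE_FRAMES;
}
```

The check runs when a “NEW CONNECTION ID” frame is processed, before queuing yet another confirmation, so the queue can never grow past the cap.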

Oh, and if you are building a project using picoquic, the fix for that issue was merged on March 12 at 18:00 UTC. Please update to a recent version!


If you want to start or join a discussion on this post, the simplest way is to send a toot on the Fediverse/Mastodon to