Christian Huitema's blog

QUIC timeouts and Handshake Interop

19 Jan 2024

Marten Seemann made a great contribution to QUIC interoperability by setting up the QUIC interop runner. The site runs a series of interoperability tests between participating QUIC implementations (17 of them as I write this) and reports the results in a large matrix. It is a nice complement to the internal tests of the implementations, and it was flagging an interesting issue: the test L1 was failing between the ngtcp2 client and the picoquic server.

The test codenamed L1 verifies that implementations can successfully establish connections in the presence of high packet loss. The test consists of 50 successive connection attempts, each followed by the download of a short 1KB document. The connections are run over a network simulation programmed to drop 30% of packets. The test succeeds if all connections succeed and all 50 documents are retrieved.

In the “ngtcp2 to picoquic” tests, all documents were properly downloaded, but the analysis of traffic showed 51 connection attempts instead of the expected 50, and thus the test was marked failing. It took me a while to parse the various logs and understand why this was happening, but it turned out to be a timeout issue. One of the 50 tests ran like this:

Nobody is really at fault here: ngtcp2 behaves exactly as the standard mandates, and it is perfectly legal for the Picoquic server to drop contexts after an absence of activity for some period. In fact, servers should do just that in case of DoS attacks. But explaining to the testers that “we are failing your test because it is too picky” is kind of hard. There was a simpler fix: just configure Picoquic to use longer timers, 180 seconds instead of 30. With that, the context is still present when the finally successful repeat packet arrives. Picoquic creates just one connection, and everybody is happy.
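
To make the timing issue concrete, here is a minimal sketch of a server that drops a connection context after a period of inactivity. It is a simplified model and not Picoquic's code: the function name `count_contexts`, the arrival times, and the rule that any packet arriving after the timer has expired starts a new context are all assumptions made for illustration.

```python
def count_contexts(arrival_times_ms, handshake_timer_ms):
    """Count the contexts a simplified server creates for one client.

    Assumption: the server keeps a context as long as packets arrive within
    handshake_timer_ms of the previous one. Otherwise the context is dropped,
    and the next packet that arrives starts a new one.
    """
    contexts = 0
    last_activity_ms = None
    for arrival_ms in sorted(arrival_times_ms):
        expired = (last_activity_ms is None
                   or arrival_ms - last_activity_ms > handshake_timer_ms)
        if expired:
            contexts += 1
        last_activity_ms = arrival_ms
    return contexts


# Hypothetical arrivals: one early packet, then a long gap caused by losses
# before a retransmission finally gets through.
arrivals_ms = [0, 38_000]
print(count_contexts(arrivals_ms, 30_000))   # 2
print(count_contexts(arrivals_ms, 180_000))  # 1
```

With the 30-second timer the late packet finds no context and a second connection is created. With the 180-second timer the same packet lands on the existing context.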

But still, Picoquic was using a short handshake timer for a reason: if connections are failing, it makes sense to clean them up quickly. The L1 test between Picoquic client and server was passing despite the short timers, because Picoquic’s loss recovery process is more aggressive than what the standard specifies. The standard specifies a conservative strategy that uses “exponential backoff”, doubling the value of the timer after each failure, for the following timeline:

Time (ms, standard)  Number  Timeout (ms)
0                    1       300
300                  2       600
900                  3       1200
2100                 4       2400
4500                 5       4800
9300                 6       9600
18900                7       19200
38100                8       38400
76500                9       76800
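
The exponential backoff rule can be sketched in a few lines of Python. This is only an illustration of the arithmetic, assuming an initial timeout of 300 ms as in the first row; the function name `backoff_timeline` is made up for this example. Each packet is sent at the previous send time plus the previous timeout, and the timeout doubles after every expiry.

```python
def backoff_timeline(initial_timeout_ms, attempts):
    """Return (send_time_ms, number, timeout_ms) rows for plain exponential backoff."""
    rows = []
    time_ms = 0
    timeout_ms = initial_timeout_ms
    for number in range(1, attempts + 1):
        rows.append((time_ms, number, timeout_ms))
        time_ms += timeout_ms   # the next packet goes out when this timer expires
        timeout_ms *= 2         # and the timer doubles after each failure
    return rows


for time_ms, number, timeout_ms in backoff_timeline(300, 9):
    print(time_ms, number, timeout_ms)
```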

Picoquic deviates from that strategy, as discussed in “Suspending the Exponential Backoff”. The timeline is much more aggressive:

Time (ms, picoquic)  Number  Timeout (ms)  Notes
0                    1       250
250                  2       250           Not doubling on first PTO
500                  3       500
1000                 4       1000
2000                 5       1875          Cap timer to 1/16th of 30s timer
3875                 6       1875
5750                 7       1875
7625                 8       1875
9500                 9       1875
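
The two rules noted in the table (no doubling on the first PTO, and a cap at 1/16th of the handshake timer) can be sketched as follows. This is a reconstruction from the table, not Picoquic's actual code: the function name `capped_timeline` and the exact place where the cap is applied are assumptions.

```python
def capped_timeline(initial_timeout_ms, handshake_timer_ms, attempts):
    """Return (send_time_ms, number, timeout_ms) rows for a capped backoff."""
    cap_ms = handshake_timer_ms // 16   # assumed cap: 1/16th of the handshake timer
    rows = []
    time_ms = 0
    timeout_ms = initial_timeout_ms
    for number in range(1, attempts + 1):
        rows.append((time_ms, number, timeout_ms))
        time_ms += timeout_ms
        if number >= 2:
            # Doubling only starts after the second expiry, and never
            # goes beyond the cap.
            timeout_ms = min(timeout_ms * 2, cap_ms)
    return rows


for time_ms, number, timeout_ms in capped_timeline(250, 30_000, 9):
    print(time_ms, number, timeout_ms)
```

With a 30-second handshake timer the cap is 1875 ms, so the ninth packet goes out at 9500 ms. Passing 180_000 instead raises the cap to 11250 ms.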

After configuring the handshake timer to 180 seconds, the Picoquic sequence is still more aggressive than the standard, but the difference is smaller:

Time (ms, picoquic)  Number  Timeout (ms)  Notes
0                    1       250
250                  2       250           Not doubling on first PTO
500                  3       500
1000                 4       1000
2000                 5       2000
4000                 6       4000
8000                 7       8000
16000                8       11250         Cap timer to 1/16th of 180s timer
27250                9       11250

In our test, it seems that not being much more aggressive than the peer did result in the behavior that the testers expected. In real life, I think that the intuitions developed in the previous blog post still hold. It is just that for the test, we have to please the protocol police…
