27 Nov 2023
I just spent some time revisiting how the picoquic code handles failing links. The QUIC specification in RFC 9002 makes reference to exponential backoff. If a node does not obtain any ACK for some time, the “Probe timeout” (PTO) fires. The node sends an extra packet to “probe” the state of the path, and solicit an ACK. If the ACK arrive, it should acknowledge all the pending packets that have been received, and the stack can start repeating those that were not. If the ACK does not arrive, the node sends a new probe and doubles the timeout. The node will keep waiting and doubling the timeout until it gives up, and consider the connection broken. This is known as the “exponential backoff” algorithm.
The exponential backoff algorithm is considered essential for protecting against congestion. If a path is congested, we want all connection sharing the bottleneck to quickly slow down in order to ease the congestion. That’s fine, but there is an underlying hypothesis that the packet losses are caused by either congestion or line cuts, and that in both cases sending feer packets is good. But we very fequently see other kind of cuts in wireless, for example when a mobile device looses wireless signal temporarily.
Exponential backoff is a bad way to deal with suspension of service. The timeout grows ever larger, and thus the probes for connectivity ever less frequent. On average, the first probe to go through will arrive half a timeout after the suspension stops. Take the example in which the suspension lasts 10 seconds, and the initial timeout is 1s. We will get the following sequence showing a 6 second delay between the repair of the connectivity and the first successful probe.
|Probe #1, fails
|Probe #2, fails
|Probe #3, fails
|Probe #4, fails
|Probe #5, succeeds 6 seconds “too late”
Intuitively, it seems that the first steps of doubling the timeout make sense. At that point the nodes don’t know whether connectivity has failed, or whether packet losses are due to a congestion event that is causing very large delays. Waiting twice of four times longer makes sense, because if there is congestion slowing traffic is the right response. But if no response comes after several trials, the path is probably experiencing a cut or a suspension. In that case, the timeout should be kept reasonably short, so the repair of the path is detected quickly. The question of course is, how short? What would be the reasonable value of a maximum PTO? I argue here that it should be a fraction of the “idle timeout”.
QUIC sessions start by the negotiation of an “idle timeout” between client and server. The idle timeout determines how long each endpoint waits before considering that the connection has failed. It determines how long the endpoints will wait before giving up on the connection. There is no point in computing a PTO that would extend after the idle timeout fires. But the idle timeout also determines the longest suspension that the application will tolerate. Setting the maximum PTO to a fraction of the idle timeout limits the “wasted time” between connectivity repair and first successful probe to a value in line with the maximum suspension that the application will accept.
The inverse of the fraction determines the number of probes sent before the node gives up. Sending too many probes is not desirable, if only because it uses too much resource like battery power. Plausible values could be something like 1/8th or 1/16th, maybe with some safety to guarantee that the timeout is always greater than the original PTO. In the latest version of Picoquic, I picked 1/16th, and all the tests seem to be passing fine, including the tests checking detection of connectivity loss and repair.
I don’t think this going away from full exponential backoff will cause the congestion collapse of the Internet.
If you want to start or join a discussion on this post, the simplest way is to send a toot on the Fediverse/Mastodon to @firstname.lastname@example.org.