22 Jun 2023
In a previous post I observed that the classic way to compute RTT statistics and retransmission timers does not work well. When acknowledgements are too frequent, the correlation between successive RTT measurements causes the smoothed RTT estimate to track closely the last values, and the RTT variations to be widely underestimated. This imprecision impacts both the effectiveness of loss recovery and the implementation of delay based congestion algorithms like BBR. I think this can be fixed by updating the variables just “once per RTT”. In this paper, I first describe the once per RTT update algorithm, then report on promising measurements using the simulations in the Picoquic test suite.
The goal of the algorithm is to update the
and PTO variables just once per RTT.
The once per RTT update algorithm operates as follow:
smoothed_rtt = kInitialRtt rttvar = kInitialRtt / 2
rtt_sample_number_in_epoch = 0 rtt_epoch_pn = last_packet_number_sent + 1 rtt_epoch_nb_samples = 0 rtt_epoch_sum_samples = 0 rtt_epoch_sample_max = 0 rtt_epoch_sample_min = UINT64_MAX;
smoothed_rtt = latest_rtt rttvar = latest_rtt / 2 min_rtt = latest_rtt
adjusted_rtt = latest_rtt if (latest_rtt >= min_rtt + ack_delay): adjusted_rtt = latest_rtt - ack_delay rtt_epoch_nb_samples += 1 rtt_epoch_sum_samples += adjusted_rtt rtt_epoch_sample_max = max(adjusted_rtt, rtt_epoch_sample_max) rtt_epoch_sample_min = min(adjusted_rtt, rtt_epoch_sample_min) min_rtt = min(min_rtt, adjusted_rtt)
rttvar_sampleis set to the maximum value of the difference between the
smoothed_rttand the maximum or minimum RTT seen during the epoch.
rttvar_sample = max(abs(rtt_epoch_sample_max - smoothed_rtt), abs(rtt_epoch_sample_min - smoothed_rtt)) rttvar = 3/4 * rttvar + 1/4 * rttvar_sample rtt_epoch_estimate = rtt_epoch_sum_samples / rtt_epoch_nb_samples smoothed_rtt = 7/8 * smoothed_rtt + 1/8 * rtt_epoch_estimate PTO = smoothed_rtt + 3*rttvar + max_ack_delay
Then, reinialize the per-epoch variables.
After coding the once RTT algorithm in picoquic I ran the picoquic test suite to evaluate the results. The test suite include a couple hundred simulations of QUIC connections under various conditions. Each of these tests verifies result such as expected simulated time for completing the smulated transfers, or sometimes maximum number of retransmission errors or maximum number of spurious retransmissions. All these tests passed, which means that the new algorithm was well within the tolerances set for the new algorithms. Looking at the details, I saw that:
The big differences happened with tests simulating connection events such as link breaks in multipath scenarios. In these cases, having a better estimate of the RTT helps react sooner. In single path scenarios link breaks result in connection failures and the measurements are not very interesting, but the test suite includes “MTU Drop” tests that simulate a change in the path MTU size, and measure how long it takes to ract to the event and continue the session with the new MTU. This reaction is also very dependent on RTT estimates, and I analyzed it in details.
The “MTU Drop” test simulates the download of a 1MB file over a 1Mbps link, with the transmission delay set at 100ms in each direction, for a nominal RTT of 200ms. Initially, the path MTU is set to 1500 bytes. The code simulates the establishment of the connection, then waiting for 1 second, and verifying that path MTU discovery was performed, and that the connection is using a packet size of 1440 bytes. At that point, the path MTU is reset to 1394 bytes. The test verifies that the code notices the drop in size, and renegotiates an appropriate path MTU. The test verifies that the download completes in the expected time.
The goal of the test is to verify the PMTU adaptation, but that adaptation depends on the packet loss recovery algorithm, which itself depends on the computation of the RTT and the retransmission timer. We evaluate two ways to computing these timers:
When using the once per RTT computation, we evaluate tree variants:
Once 4V, in which the retransmission timer is computed as specified in RFC 9002, adding 4 times “RTT_VAR” and the ACK delay to the smoothed RTT value
Once MaxRTT, in which the retransmission timer is set to the MaxRTT plus RTT_VAR plus the ACK delay
Once 3V, which is the same as Once 4V, but only adding 3 times “RTT_VAR” instead of 4.
The results of the four test runs are summarized in the following table:
We see that the “Once 3V” performs better than the classic implementation: the download completes faster, and there are significantly fewer packet losses.
It is hard to understand exactly why one specific tuning works better than others. When looking at “Once 3V” versus “Classic”, it seems that most of the difference happened during the initial “start up” phase. The log of the “Once 3V” trial shows it exiting start-up 631ms after the beginning of the connection, with the “smoothed RTT” set at 224ms. The “Classic” log shows it exiting start up 991ms after the start, with the smoothed RTT set at 353ms. By the time the MTU drop is simulated, the “classic” connection has the smoothed RTT set at 407ms, while the “once 3V” connection has the RTT at 217ms. The larger RTT means that the “classic” connection takes almost twice as long to detect the packet losses caused by the MTU drop and recovers almost twice slower.
That same reasoning explains why setting the retransmission timer from the Max RTT does not work so well. In our samples, the maximum observed RTT happens at the end of start up, before stabilizing to lower values. The retransmission timer set from Max RTT is too large, and losses are not detected quickly enough.
How many deviations in the timer?
We thus arrive to the notion that “large retransmission timers slow down recovery”. That’s the observation that pushes us to deviate from the original formula:
retransmission timer = smoothed_RTT + 4 * rtt_var + ack_delay
That formula seems pretty arbitrary. It is related to the classic observations that for a Gaussian distribution, observations over 4 standard deviations away from the average happen less than 0.01% of the time, which seems a good way to minimize “spurious packet loss detection”. But it is far from clear that the distribution of RTT samples is Gaussian, and the computation of rtt_var is only a vague approximation of computing a standard deviation. In any case, the use of N times rtt_var is a tradeoff, fewer spurious detection versus longer recovery time from errors.
In QUIC, most loss detections are based on packet number comparisons, not timers. This, alone, renders spurious loss detections fairly rare. In addition, congestion control algorithms like BBR recover much better from spurious loss detection than old algorithms like RENO. It is probably time to revisit the tradeoff. Our tests show that using a coefficent 3 rather than 4 speeds up error detection without visible side effect.
The “once per RTT” algorithm design appears sound, but the evaluations are based on simulations. The next step is to works with other QUIC developers and see whether the algorithm can be validated in large scale deployments.
If you want to start or join a discussion on this post, the simplest way is to send a toot on the Fediverse/Mastodon to @firstname.lastname@example.org.