18 May 2023
About two weeks ago, developers of “Media over QUIC” told me that there was an issue when running over Wi-Fi. After a few seconds, there would be some kind of event that triggered the congestion control implemented in Picoquic to reduce the bandwidth, resulting in pretty bad performance. It seems to be due to issues with the Wi-Fi driver on the Mac, as I wrote in a toot on Mastodon. Now that I am less busy with other projects, I have the time to measure the issue in detail.
The figure above shows the evolution of the round trip time (RTT) between two computers in my office: an iMac running macOS Ventura 13.3.1, and a Dell laptop running Windows 11. The measurements were taken with a simple program that generated a UDP packet every 20 ms on the iMac, sent it over Wi-Fi to the laptop, and then received an echo from the laptop. The program logged the time at which the packet was sent, the time at which the laptop sent the echo, and the time at which the echo was received. The RTT is of course measured as the difference between the time the packet was sent and the time the echo was received.
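For readers who want to try something similar, here is a minimal sketch of such a probe in Python. It is not the program I used: the packet layout, the function names `echo_server` and `probe`, and the timeouts are all assumptions made for illustration.

```python
import socket
import struct
import threading
import time

PROBE = struct.Struct("!IQ")   # sequence number, send time (ns)
ECHO = struct.Struct("!IQQ")   # sequence number, send time, echo time (ns)


def echo_server(sock, stop):
    """Echo every probe back, stamped with the local time of the echo."""
    sock.settimeout(0.1)  # wake up regularly to check the stop flag
    while not stop.is_set():
        try:
            data, peer = sock.recvfrom(64)
        except socket.timeout:
            continue
        if len(data) != PROBE.size:
            continue  # not one of our probes
        seq, sent_ns = PROBE.unpack(data)
        sock.sendto(ECHO.pack(seq, sent_ns, time.time_ns()), peer)


def probe(server_addr, count=10, interval=0.02, wait=0.5):
    """Send one probe every `interval` seconds and collect the echoes.

    Returns {seq: (sent_ns, echoed_ns, received_ns)}; a sequence number
    missing from the result means the probe or its echo was lost.
    """
    log = {}
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        start = time.monotonic()
        deadline = start + count * interval + wait
        next_seq = 0
        while len(log) < count and time.monotonic() < deadline:
            now = time.monotonic()
            next_send = start + next_seq * interval
            if next_seq < count and now >= next_send:
                sock.sendto(PROBE.pack(next_seq, time.time_ns()), server_addr)
                next_seq += 1
                continue
            # Wait for an echo until the next probe is due (or the deadline).
            limit = next_send if next_seq < count else deadline
            sock.settimeout(max(limit - now, 0.0005))
            try:
                data, _ = sock.recvfrom(64)
            except socket.timeout:
                continue
            received_ns = time.time_ns()
            if len(data) == ECHO.size:
                seq, sent_ns, echoed_ns = ECHO.unpack(data)
                log[seq] = (sent_ns, echoed_ns, received_ns)
    return log
```

In this sketch, `echo_server` would run on the laptop and `probe` on the iMac. The RTT of a sample is `received_ns - sent_ns`, which only involves the sender's clock; the echo timestamp comes from the other machine's clock and is only useful for one-way delay analysis.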
The RTT versus Time graph shows that most RTT samples are rather short, a few milliseconds — the median RTT is 4.04 milliseconds, and 95% of samples are echoed in less than 8ms. Out of 30,000 packets sent in 10 minutes, 38 were lost, about 0.12%. Some packets take a bit longer, with the 99th percentile at 50.7ms, which is somewhat concerning. But the obvious issues are the 18 spikes on the graph, 18 separate events during which the RTT exceeded 100ms, including 12 events with an RTT above 200ms.
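The sketch below shows one way to compute this kind of summary from a list of RTT samples. The helper names, the nearest-rank percentile definition, and the rule for grouping consecutive slow samples into a single event are assumptions for illustration, not necessarily the exact method behind the numbers above.

```python
import math
import statistics


def percentile(sorted_values, p):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(p / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def count_events(rtts_ms, threshold_ms):
    """Count maximal runs of consecutive samples above the threshold.

    Lost packets (None) neither start nor end a run (an assumption).
    """
    events, in_event = 0, False
    for rtt in rtts_ms:
        if rtt is None:
            continue
        if rtt > threshold_ms and not in_event:
            events += 1
        in_event = rtt > threshold_ms
    return events


def summarize(rtts_ms, spike_ms=100.0):
    """Summarize RTT samples in ms, in send order; None marks a lost packet."""
    received = sorted(r for r in rtts_ms if r is not None)
    return {
        "loss_rate": 1 - len(received) / len(rtts_ms),
        "median": statistics.median(received),
        "p95": percentile(received, 95),
        "p99": percentile(received, 99),
        "events": count_events(rtts_ms, spike_ms),
    }
```

Calling `count_events` a second time with a threshold of 200 would give the number of events with an RTT above 200 ms.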
The close-up graph shows a detailed view of a single spike. 14 packets were affected. The first one was lost, the second one was echoed after 250 ms, and we see the RTT of the next 12 packets decreasing linearly from 250 ms to 4 ms. Looking at the raw data shows that these 13 packets were received just microseconds apart. Everything happens as if Wi-Fi transmission had been suspended for 250 ms, with packets queued during the suspension and delivered quickly when transmission resumes.
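A toy model makes the linear decrease easy to see. The sketch below assumes an idealized link: packets sent during the pause are queued, all delivered at the instant transmission resumes, and none are lost. The function name and parameters are made up for the example.

```python
def simulate_rtts(n_packets, interval_ms, base_rtt_ms, pause_start_ms, pause_ms):
    """RTT of each packet when the link is paused for `pause_ms`.

    Idealized assumption: packets sent during the pause wait in a queue
    and all leave at the moment the pause ends; nothing is dropped.
    """
    resume_ms = pause_start_ms + pause_ms
    rtts = []
    for i in range(n_packets):
        sent_ms = i * interval_ms
        if pause_start_ms <= sent_ms < resume_ms:
            # Time spent waiting in the queue, plus the normal RTT.
            rtts.append(resume_ms - sent_ms + base_rtt_ms)
        else:
            rtts.append(base_rtt_ms)
    return rtts


# One packet every 20 ms, 4 ms base RTT, link paused for 250 ms at t = 100 ms.
rtts = simulate_rtts(20, 20, 4, 100, 250)
affected = [r for r in rtts if r > 4]
print(len(affected), affected[0], affected[-1])
```

With these values, 13 packets are queued, and each one waits 20 ms less than the previous one, which is the sawtooth shape seen in the close-up graph.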
The previous graph looked at a “simple” spike happening 23 seconds after the start of the measurements. Simple events appear as narrow spikes in the “time line” graph. Some events are more complex. They appear on the graph as a combination of adjacent lines.
The next graph shows a close-up of a series of spikes happening at short intervals. There are 14 such spikes, spread over a 3-second interval. Each spike has the same structure as the single spike described above: the network transmission appears to stop for an interval, and then packets are delivered. In one case, two spikes overlap. Spikes may have different durations, between 50 ms and 280 ms.
The RTT is the sum of two one-way delays: from the Mac to the PC, and back. The previous analysis concludes that the spikes happen when transmission stops, but that could be transmission from the Mac or from the PC. The one-way delay graph shows that it actually happens in both directions. Out of the 18 spikes in the RTT timeline graph, 11 happen because transmission stopped on the Mac, 3 because it stopped on the PC, and 4 because it stopped on both. It seems that the PC and the Mac have similar Wi-Fi drivers, both creating occasional spikes, but that this happens about twice as often on the Mac (15 events against 7).
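Splitting the RTT into one-way delays requires some care, because the clocks of the two machines are not synchronized. One possible approach, sketched below, assumes that the clock offset is constant during the test and that the smallest delay seen in each direction is the "normal" delay; only the excess over that baseline is examined. The function names and the 50 ms threshold are assumptions for the example.

```python
def one_way_excess(log):
    """Excess one-way delays for (sent, echoed, received) samples in ms.

    `sent` and `received` use the sender's clock, `echoed` the echo side's.
    Subtracting the minimum seen in each direction cancels a constant clock
    offset (assumption: the offset does not drift during the test).
    """
    outbound = [echoed - sent for sent, echoed, received in log]
    inbound = [received - echoed for sent, echoed, received in log]
    out_base, in_base = min(outbound), min(inbound)
    return [(o - out_base, i - in_base) for o, i in zip(outbound, inbound)]


def classify(out_excess_ms, in_excess_ms, threshold_ms=50):
    """Tell which direction of the path was delayed for one sample."""
    out_slow = out_excess_ms > threshold_ms
    in_slow = in_excess_ms > threshold_ms
    if out_slow and in_slow:
        return "both"
    if out_slow:
        return "outbound"   # delay between sending the probe and the echo
    if in_slow:
        return "return"     # delay between the echo and its reception
    return "none"
```

In the setup described here, an "outbound" delay would point at the Mac holding its packets, and a "return" delay at the PC holding the echoes.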
At this stage, we don’t know exactly what causes the Wi-Fi drivers to stop transmission. There are two plausible ideas: wireless drivers sometimes stop in order to save energy; or, wireless drivers sometimes stop operation on one frequency band in order to scan the other bands and locate alternative Wi-Fi routers. Of the two, the scanning hypothesis is the more likely. It would explain the “series of spikes” pattern, with the Wi-Fi radio briefly returning to the nominal frequency band between scans of multiple bands.
My next task will be to see how the QUIC stack in Picoquic can be adapted to mitigate the effects of this Wi-Fi behavior, for example by returning quickly to nominal conditions after the end of a spike. But even the best mitigation won’t change the fact that shutting down radios for a quarter of a second does nothing good for end-to-end latency. VoIP over Wi-Fi is not going to sound very good. The issue is for our colleagues at Apple and Microsoft to fix!
If you want to start or join a discussion on this post, the simplest way is to send a toot on the Fediverse/Mastodon to @email@example.com.