RACK makes Pion SCTP 71% faster with 27% less latency

In the RACK profile, SCP sustained 316 Mbps using ~0.044 CPU seconds, compared to 234 Mbps at ~0.056 CPU seconds before RACK. When normalized for CPU usage, this corresponds to a ~71% improvement in throughput per CPU, while max-burst CPU profiles remain comparable.

R Chiu, Joe Turki -- 2025-12-21


What is SCTP?

SCTP stands for Stream Control Transmission Protocol. At a basic level, SCTP is designed to be reliable, handle de-duplication of packets, and support packets that may be delivered in order or out of order. Beyond transporting messages, SCTP also handles setting up the connection (an "association") between endpoints. On a deeper level, SCTP includes native support for multiplexing: multiple applications can share a single transport connection. SCTP also supports multi-homing, which enables automatic failover from a primary network path to a secondary one.

In short, it lets you reliably send information from one computer to another without any complications.
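For the curious, here's roughly what that multiplexing looks like with Pion's sctp package. This is a minimal sketch, not a complete program: the peer address is made up, the server side and most error handling are elided, and in WebRTC the underlying connection would be a DTLS transport rather than plain UDP.

  package main

  import (
      "net"

      "github.com/pion/logging"
      "github.com/pion/sctp"
  )

  // dial stands in for however you reach the peer; a real program would
  // use a connected UDP socket (or, in WebRTC, the DTLS transport).
  func dial() net.Conn {
      conn, err := net.Dial("udp", "192.0.2.1:5000") // hypothetical peer
      if err != nil {
          panic(err)
      }
      return conn
  }

  func main() {
      // One SCTP association over a single underlying connection...
      assoc, err := sctp.Client(sctp.Config{
          NetConn:       dial(),
          LoggerFactory: logging.NewDefaultLoggerFactory(),
      })
      if err != nil {
          panic(err)
      }
      defer assoc.Close()

      // ...multiplexing independent streams: the large transfer on one
      // stream doesn't hold up small messages on the other.
      chat, _ := assoc.OpenStream(1, sctp.PayloadTypeWebRTCString)
      file, _ := assoc.OpenStream(2, sctp.PayloadTypeWebRTCBinary)

      _, _ = chat.Write([]byte("hi!"))
      _, _ = file.Write(make([]byte, 1<<20))
  }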

What is SCTP used for?

SCTP’s uses can generally fit into two cases:

1. Sending some amount of data.

Imagine a scenario where two people are texting when one person remembers a picture that they want to send. As they text back and forth, an image gets uploaded, which takes some time to get sent. SCTP can handle multiple things going on at the same time and doesn’t delay any messages from being sent just because an image is being uploaded! Thanks to SCTP, text messages can be safely delivered to each person and nothing in their conversation is lost in transit or delayed just because something else is being transferred at the same time as their messages.

Building on this idea, users can share larger files with each other. This includes anything: birthday videos, audio recordings, even boring paperwork; anything that’s a file can be sent!

2. Sending small amounts of data with a purpose.

In a new scenario, imagine two people who are texting back and forth when one person gets hungry. They send a message saying, “I want a pizza!” When the other person receives the text, they think, “Maybe I should do something about that!” The recipient can choose to do something useful for the sender with that information.

This is the blueprint for many awesome technologies today, as it opens up the possibility of controlling one computer from a different computer. Consider a surgeon who performs an operation involving a remote-controlled device that needs to respond with as little latency as possible. Similarly, real-time navigation systems also need to respond to changes in traffic conditions quickly in order to avoid congested or unsafe areas due to accidents or weather conditions.

Other uses:

SCTP can be used for online multiplayer games where every frame counts, including first-person shooters and fighting games. Taking the remote surgery example in this direction leads to the idea of cloud gaming, as players can have their inputs sent to a different device than the one that they’re using while still being able to play the game!

SCTP is also used inside web browsers via WebRTC and has found use in AI applications and cryptocurrency-related technologies. Additionally, payment verification can similarly benefit from secure and fast communication.

Why SCTP for WebRTC?

SCTP is used for WebRTC because of its ability to send information via reliable and unreliable datachannels. For example, you can send messages or files in a chat with SCTP. Other uses include being able to know when users toggle their microphone or video in a video call. In some special cases, SCTP can even be used to transmit video between users, but that’s significantly less common.

In WebRTC, the ICE protocol connects users and the DTLS protocol establishes a secure connection, at which point SCTP is then used to securely transfer data. In an ideal setup, data that’s sent should “just work”. Unfortunately, that isn’t how things tend to pan out, as issues eventually crop up. Packets get dropped, the network jitters, the computer stutters, or the coffee machine doesn’t start when you thought it had. That’s why it’s important to have a backup plan for when things go wrong.
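Here's a minimal sketch of that layering from the application's point of view, using Pion's webrtc package (the v4 import path is an assumption, and the offer/answer signaling that actually connects the peers is omitted). It also shows the reliable/unreliable split from above: nil options give an ordered, reliable channel, while DataChannelInit opts into unreliable delivery.

  package main

  import (
      "fmt"

      "github.com/pion/webrtc/v4"
  )

  func main() {
      // NewPeerConnection wires up ICE and DTLS under the hood;
      // SCTP rides on top of the DTLS transport.
      pc, err := webrtc.NewPeerConnection(webrtc.Configuration{})
      if err != nil {
          panic(err)
      }
      defer pc.Close()

      // A reliable, ordered data channel (the default).
      chat, err := pc.CreateDataChannel("chat", nil)
      if err != nil {
          panic(err)
      }

      // An unreliable, unordered channel: no retransmissions at all.
      ordered, maxRetransmits := false, uint16(0)
      if _, err = pc.CreateDataChannel("telemetry", &webrtc.DataChannelInit{
          Ordered:        &ordered,
          MaxRetransmits: &maxRetransmits,
      }); err != nil {
          panic(err)
      }

      chat.OnOpen(func() { _ = chat.SendText("hello over SCTP") })
      chat.OnMessage(func(msg webrtc.DataChannelMessage) {
          fmt.Printf("got: %s\n", string(msg.Data))
      })

      // Offer/answer signaling is omitted; the channels only open once
      // the remote peer completes ICE and DTLS.
      select {}
  }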

How SCTP Deals With Loss

SCTP was designed with this in mind and has two built-in recovery strategies for when networking goes wrong.

The first is called “fast retransmission.” The receiver detects if a chunk of data is missing in the transmission. If so, the receiver notifies the sender that a specific chunk ID is missing. If the sender receives three reports of a missing chunk where all three reports are referring to the same chunk ID, then the sender will assume that the chunk has been lost and resend it.

The second is timer-based retransmission: if the sender doesn't receive an acknowledgment for all outstanding packets within a specific window of time (the retransmission timeout, or RTO), it retransmits the unacknowledged data.
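As a rough illustration (not Pion's actual code), the two triggers boil down to a counter and a timer check:

  package main

  import "time"

  // chunkState tracks what the sender knows about one outstanding chunk.
  type chunkState struct {
      missingReports int       // times a SACK has gap-reported this chunk
      sentAt         time.Time // when the chunk was last (re)transmitted
  }

  const fastRetransmitThreshold = 3

  // onGapReport models the fast-retransmission trigger: after three
  // reports naming the same chunk, the sender assumes it was lost.
  func (c *chunkState) onGapReport() (lost bool) {
      c.missingReports++
      return c.missingReports >= fastRetransmitThreshold
  }

  // rtoExpired models the timer-based trigger: no acknowledgment within
  // the retransmission timeout (RTO) window.
  func (c *chunkState) rtoExpired(now time.Time, rto time.Duration) bool {
      return now.Sub(c.sentAt) >= rto
  }

  func main() {
      c := &chunkState{sentAt: time.Now()}
      c.onGapReport()
      c.onGapReport()
      if c.onGapReport() { // third report: fast retransmit fires
          // resend the chunk here
      }
      _ = c.rtoExpired(time.Now(), 200*time.Millisecond)
  }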

Both of these loss recovery strategies are used by SCTP to try to ensure that any lost data is detected and retransmitted as quickly as possible. At the time of writing, Pion's implementation of SCTP uses these two mechanisms for loss recovery.

These strategies are also used by TCP, which has prompted engineers to see if there’s an even better strategy to detect and mitigate lost data.

Introducing RACK

In February 2021, RFC 8985: The RACK-TLP Loss Detection Algorithm for TCP was published. This was a completely new loss detection algorithm that actively tracks network statistics and uses timer-based signals to stay adaptive to ever-changing network conditions. RACK's improvements over SACK and fast retransmission were enticing enough for Linux, Windows, and FreeBSD to all implement it in their TCP stacks.

While RACK was originally intended to be implemented for TCP, it is noted in the RFC that it can be implemented in other transport protocols, including SCTP.

The implementation for SCTP was formally analyzed in Felix Weinrank's dissertation and other publications. Weinrank's deep dive provides an extremely comprehensive review of SCTP and improvements regarding usage in various scenarios, including WebRTC. At the moment, we're more concerned with Weinrank's analysis and implementation notes regarding RACK in SCTP. In Chapter 7 of the dissertation, Weinrank goes over how SCTP handles loss and how RACK can be implemented for SCTP, including extra details regarding how the implementation interacts with various SCTP extensions.

RACK’s motivations:

The authors of RFC 8985 provide examples of situations where RACK improves loss recovery in TCP and SCTP.

How RACK’s Tail Loss Probing (TLP) Works

  sequenceDiagram
    participant S as Sender
    participant R as Receiver

    Note over S,R: The two sends are normally<br>combined but are separated<br>here for visual clarity.
    S->>R: Send 🍎
    S-->>R: Send 🍌, πŸ₯•, πŸ₯”
    Note right of R: Only 🍎 arrived, so it's ACK'd.
    R->>S: ACK 🍎

    Note over S,R: 2 RTTs later, TLP fires
    S->>R: TLP triggers a retransmit πŸ₯”
    Note right of R: πŸ₯” arrived, so it's SACK'd.
    R->>S: SACK πŸ₯”

    Note left of S: Mark 🍌 and πŸ₯• as lost
    Note over S,R: The two sends are normally<br>combined but are separated<br>here for visual clarity.
    S-->>R: Retransmit 🍌
    S->>R: Retransmit πŸ₯•
    Note right of R: Only πŸ₯• arrived, so it's SACK'd<br>alongside πŸ₯”.
    R-->>S: SACK πŸ₯• and πŸ₯”

    Note left of S: 🍌 retransmission marked as lost.<br>Note that this is the second time<br>that 🍌 has been marked as lost.
    S->>R: Retransmit 🍌 again
    Note right of R: 🍌 finally arrived!
    R-->>S: ACK πŸ₯”
    Note over S,R: The ACK πŸ₯” is a cumulative ACK<br>for the 🍎🍌πŸ₯•πŸ₯” sequence

In the above scenario, Tail Loss Probing enables the sender to quickly learn whether 🍎🍌πŸ₯•πŸ₯” were successfully received. Since the sender only receives an acknowledgment (ACK) for 🍎, the TLP timer eventually fires because no ACK arrived for 🍌, πŸ₯•, and πŸ₯”. The probe then resends the last packet in the segment as an efficient way to:

  1. Retransmit data that the receiver would have to receive down the line anyway. The alternative would be to send an empty packet, receive an ACK, then send a missing packet. This handles both at once and can potentially save an RTT if only the last packet is missing in a segment.

  2. Allow the receiver to ACK any earlier missing packets in the sequence if there were other issues due to networking, temporary stutters or freezes, etc.

  3. Check receiver responsiveness and detect if there’s a network issue. Note that this is different from (2), as the receiver could potentially never respond with an ACK.

The receiver then confirms that it has πŸ₯” by sending a selective acknowledgment (SACK) to the sender, which tells the sender that it’s received one of the packets from the segment but not all of them. At this point, the sender has an ACK for 🍎 and a SACK for πŸ₯”, which means that it can determine that 🍌 and πŸ₯• must be missing. The sender then notes that 🍌 and πŸ₯• have been lost once. It then retransmits 🍌 and πŸ₯•. In the example, 🍌 happens to get dropped by the network whereas πŸ₯• is sent and received successfully. The receiver then replies with a SACK for πŸ₯• and πŸ₯”, at which point the sender can determine that 🍌 was lost a second time. Finally, the sender retransmits 🍌, which fortunately doesn’t get dropped, and the receiver sends an ACK for πŸ₯” (note the ACK is for the last packet in the segment, which implies that all previous packets have been received).

Side note: keeping track of the number of times that a packet has been lost is important, as the CUBIC congestion control algorithm described in RFC 9438 (which is mentioned in RFC 8985) relies on this information.
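A toy sketch of the probe timer itself (the sender type and helpers here are invented for illustration; RFC 8985 computes the probe timeout, PTO, as roughly two smoothed round-trip times, which matches the "2 RTTs later" in the diagram):

  package main

  import (
      "sync"
      "time"
  )

  type sender struct {
      mu          sync.Mutex
      outstanding [][]byte // unacknowledged chunks, oldest first
      tlpTimer    *time.Timer
  }

  // armTLP schedules a tail loss probe roughly two smoothed RTTs after
  // the last transmission. If it fires, the newest outstanding chunk is
  // retransmitted: the probe is useful data in its own right, and it
  // solicits a (S)ACK that reveals what else, if anything, is missing.
  func (s *sender) armTLP(srtt time.Duration) {
      pto := 2 * srtt
      s.tlpTimer = time.AfterFunc(pto, func() {
          s.mu.Lock()
          defer s.mu.Unlock()
          if n := len(s.outstanding); n > 0 {
              s.retransmit(s.outstanding[n-1]) // e.g. the πŸ₯” in the diagram
          }
      })
  }

  // onAck cancels the pending probe; a new send would re-arm it.
  func (s *sender) onAck() {
      if s.tlpTimer != nil {
          s.tlpTimer.Stop()
      }
  }

  func (s *sender) retransmit(chunk []byte) { /* put it back on the wire */ }

  func main() {
      s := &sender{outstanding: [][]byte{[]byte("🍌"), []byte("πŸ₯•"), []byte("πŸ₯”")}}
      s.armTLP(50 * time.Millisecond)    // PTO = 100ms in this toy run
      time.Sleep(200 * time.Millisecond) // let the probe fire
      s.onAck()
  }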

SCTP RTO vs RACK RTO

In this example, the authors of RFC 8985 show how, without RACK, SCTP and TCP can suffer from spurious retransmissions when there are retransmission timeouts (RTOs). In this case, each food icon represents an entire segment instead of a packet.

  sequenceDiagram
    participant S as Sender
    participant R as Receiver

    S->>R: Send πŸ₯
    Note over S,R: Then right before<br>the end of the RTO...
    Note right of R: The receiver has a network hiccup!
    S->>R: Send 🍳πŸ₯­

    Note right of R: Received πŸ₯
    Note over S,R: End of the RTO
    R->>S: ACK πŸ₯

In this scenario, πŸ₯ is sent, and right before the end of the RTO, 🍳 and πŸ₯­ are sent. The receiver gets πŸ₯, 🍳, and πŸ₯­, but only manages to send an ACK for πŸ₯ right after the RTO. Let’s see how SCTP handles this without RACK versus with RACK.

  sequenceDiagram
    participant S as Sender
    participant R as Receiver

    S->>R: Send πŸ₯
    Note over S,R: Then right before<br>the end of the RTO...
    Note right of R: The receiver has a network hiccup!
    S->>R: Send 🍳πŸ₯­

    Note right of R: Received πŸ₯
    Note over S,R: End of the RTO
    R->>S: ACK πŸ₯

    Note over S,R: Without RACK...

    %% without rack
    Note left of S: Mark πŸ₯πŸ³πŸ₯­ as lost<br>since RTO expired

    Note right of R: Received 🍳πŸ₯­
    R->>S: ACK 🍳πŸ₯­
    Note over S,R: The Sender incorrectly<br>ignores the ACK for 🍳πŸ₯­!

    Note left of S: Prepare to retransmit πŸ₯πŸ³πŸ₯­
    S->>R: Retransmit πŸ₯πŸ³πŸ₯­

We can see here that the receiver eventually sends an ACK for 🍳 and πŸ₯­, but the sender ignores it and believes that πŸ₯, 🍳, and πŸ₯­ are all missing instead of just πŸ₯. While it's reasonable to assume that πŸ₯ is missing, the sender is overzealous in retransmitting the rest, which increases network traffic during recovery, especially when it's entirely possible for the sender to wait for the acknowledgments of 🍳 and πŸ₯­ from the receiver.

Let’s see what RACK does!

  sequenceDiagram
    participant S as Sender
    participant R as Receiver

    S->>R: Send πŸ₯
    Note over S,R: Then right before<br>the end of the RTO...
    Note right of R: The receiver has a network hiccup!
    S->>R: Send 🍳πŸ₯­

    Note right of R: Received πŸ₯
    Note over S,R: End of the RTO
    R->>S: ACK πŸ₯
    Note over S,R: With RACK...

    %% with rack
    Note left of S: Mark πŸ₯ as lost<br>since RTO expired

    Note right of R: Received 🍳πŸ₯­
    R->>S: ACK 🍳πŸ₯­

    Note left of S: Prepare to retransmit πŸ₯
    S->>R: Retransmit πŸ₯

Here, RACK makes it so only πŸ₯ is marked as lost when the RTO expires. 🍳 and πŸ₯­ aren't marked as lost because their ACKs arrive before their own timers expire, so those timers are simply re-armed or cleared. Therefore, only πŸ₯ is retransmitted.

In this example, even though both non-RACK and RACK end up retransmitting πŸ₯ despite the receiver already having it, the focus is on minimizing spurious retransmissions. This can save on the amount of data sent over the network, which naturally speeds up any retransmissions that might occur.
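The difference between the two diagrams fits in a few lines of illustrative Go (made-up types, not Pion's code): the classic RTO presumes everything outstanding is lost, while a RACK-style check only marks chunks whose own time budget has expired.

  package main

  import (
      "fmt"
      "time"
  )

  type sentChunk struct {
      name     string
      xmitTime time.Time
  }

  // onRTOClassic: the traditional response when the timer fires is to
  // presume that everything still outstanding has been lost.
  func onRTOClassic(outstanding []sentChunk) []sentChunk {
      return outstanding
  }

  // onRTORack: only chunks past their own loss deadline (send time plus
  // RTT plus a small reordering window) are marked lost.
  func onRTORack(outstanding []sentChunk, rtt, reoWnd time.Duration, now time.Time) []sentChunk {
      var lost []sentChunk
      for _, c := range outstanding {
          if now.After(c.xmitTime.Add(rtt + reoWnd)) {
              lost = append(lost, c)
          }
      }
      return lost
  }

  func main() {
      now := time.Now()
      outstanding := []sentChunk{
          {"πŸ₯", now.Add(-300 * time.Millisecond)}, // old: past its deadline
          {"🍳", now.Add(-20 * time.Millisecond)},  // recent: still within budget
          {"πŸ₯­", now.Add(-20 * time.Millisecond)},
      }
      fmt.Println(len(onRTOClassic(outstanding))) // 3: retransmit everything
      fmt.Println(len(onRTORack(outstanding,
          100*time.Millisecond, 40*time.Millisecond, now))) // 1: only πŸ₯
  }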

RACK’s strategy

In summary, RACK’s strategy generally has two main parts:

  1. Detect packet losses as quickly as possible by using the timing of acknowledgments, rather than just duplicate counts, together with inferences from measured network statistics.

  2. Use Tail Loss Probing (TLP), which sends a probe at the end of a burst to flush out acknowledgments and gather more network statistics.

The combination of these two strategies allows RACK to quickly identify issues and properly rectify them. It also provides better resilience in some tricky edge cases! If you're interested in seeing how RACK could perform in SCTP, along with other SCTP-specific improvements, check out Chapter 7 of Felix Weinrank's thesis.
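Strategy (1) boils down to one comparison per chunk. Here's an illustrative sketch of the rule from RFC 8985 section 6.2 (simplified; the real algorithm also adapts the reordering window over time):

  package main

  import (
      "fmt"
      "time"
  )

  type chunk struct{ xmitTime time.Time }

  // rackLost sketches RACK's core rule: a chunk is deemed lost once
  // something sent AFTER it has already been SACKed, and the chunk's
  // send time plus the measured RTT plus a reordering window is in
  // the past.
  func rackLost(c, newestAcked chunk, rtt, reoWnd time.Duration, now time.Time) bool {
      sentBeforeNewestAck := c.xmitTime.Before(newestAcked.xmitTime)
      deadline := c.xmitTime.Add(rtt + reoWnd)
      return sentBeforeNewestAck && now.After(deadline)
  }

  func main() {
      now := time.Now()
      banana := chunk{xmitTime: now.Add(-250 * time.Millisecond)} // never SACKed
      potato := chunk{xmitTime: now.Add(-240 * time.Millisecond)} // SACKed via the probe
      fmt.Println(rackLost(banana, potato,
          100*time.Millisecond, 50*time.Millisecond, now)) // true: mark 🍌 lost
  }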

A quick look at the results (why this matters if you don’t live in SCTP land)

SCP is a Go test harness that runs two Pion SCTP stacks against each other inside a deterministic, in-process "virtual network" (from pion/transport). It pins exact commits on each side, replays scenarios with a fixed seed, validates packets on the wire (CRC32c + basic SCTP parsing), and writes artifacts (results.json, packet logs, and pprof) so you can reproduce the numbers.
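SCP itself isn't the focus of this post, but the deterministic part is ordinary pion/transport vnet plumbing. Here's a hedged sketch of the kind of setup it enables (the field names and /v3 import path are to the best of our recollection; check your transport version):

  package main

  import (
      "math/rand"

      "github.com/pion/logging"
      "github.com/pion/transport/v3/vnet"
  )

  func main() {
      // An in-process router: every packet between the two stacks goes
      // through here, so runs are fully reproducible.
      router, err := vnet.NewRouter(&vnet.RouterConfig{
          CIDR:          "0.0.0.0/0",
          LoggerFactory: logging.NewDefaultLoggerFactory(),
      })
      if err != nil {
          panic(err)
      }

      // A fixed seed makes "random" loss identical on every run.
      rng := rand.New(rand.NewSource(42))
      router.AddChunkFilter(func(c vnet.Chunk) bool {
          return rng.Float64() >= 0.02 // drop ~2% of packets, deterministically
      })

      if err := router.Start(); err != nil {
          panic(err)
      }
      defer router.Stop() // each SCTP endpoint would attach its own vnet.Net here
  }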

The headline (max-burst): more throughput, less CPU, lower latency

This is the cleanest microbench in the suite: no loss, no delay, no jitter. Just a burst of messages in both directions. Comparing non-rack<->non-rack vs rack<->rack (there are similar improvements even when comparing rack<->non-rack):

  metric                  main (baseline)  RACK         delta
  goodput                 234.55 Mbps      316.42 Mbps  +34.9%
  CPU time (cpu_seconds)  0.0560 s         0.0441 s     βˆ’21.3%
  goodput / CPU-second    4,189            7,177        +71.3%
  latency p50             16.37 ms         11.86 ms     βˆ’27.5%
  latency p99             36.95 ms         27.84 ms     βˆ’24.6%

That +71% throughput-per-CPU is simply the measured goodput (Mbps) divided by the run's cpu_seconds. Non-RACK cruised at ~234 Mbps using ~0.056 CPU seconds (~4,189 Mbps/CPU-s), while RACK sustained 316 Mbps with ~0.044 CPU seconds (~7,177 Mbps/CPU-s). That gap is the proof that RACK delivers ~71% more work per unit of CPU.

Test setup

To test RACK, we ran these test profiles to compare how RACK performs against main (baseline):

  • max-burst - “how fast can we go” with no delay, loss, or reordering; it targets the raw transport path:

Goodput jumps +34.9% (234 -> 316 Mbps) while CPU seconds drop by 21% (0.056 -> 0.044 s) and p50/p99 both fall by ~25%. RACK now delivers ~71% more Mbps per CPU-second.

  • handshake - same burst pattern, this time including the COOKIE/SHUTDOWN handshake, so we exercise setup timers:

Goodput climbs +15% (237 -> 272 Mbps) while latency stays basically flat (15.65 -> 15.99 ms for p50, 35.27 -> 33.25 ms for p99), which confirms that the faster throughput comes without slower ACK paths.

  • unordered-late-low-rtt - minor delay/jitter (10β€―ms) but unordered delivery to simulate packet trains with mild disorder:

Minor latency and throughput noise. Both branches still pass, but RACK keeps delivery steady despite small unordered bursts.

  • unordered-late-high-rtt - large RTT/jitter (180β€―ms / 60β€―ms) with unordered delivery, so we can watch how the stack copes with latency spikes:

Very high latency due to the profile, but RACK keeps throughput comparable with no regressions.

  • unordered-late-dynamic-rtt - fluctuating RTT (40 ms base Β±180 ms jitter) with unordered delivery to mimic bursty network dynamics:

Both branches pass with no noticeable regressions from RACK, which shows that both handle jitter swings fine.

  • congestion - ordered delivery with 2% loss and modest delay/jitter to stress-test congestion control and SACK-driven recovery:

The loss-handling path stays green, and RACK doesn't use extra CPU compared to main, which shows the +35% clean-case gain doesn't cost anything in the loss profile.

  • retransmission - ordered with 5% loss and 20 ms jitter to force fast-retransmit/TLP scenarios:

The fault case still hits retries, and RACK's CPU profile actually shows more JSON/packet-logging work, but that's what we expect during retransmit storms.

  • reorder-low - unordered with 1.5% loss plus deliberate reordering to exercise scheduler/queue behavior under loss + reordering:

Goodput improves +44% (1.79 Mbps -> 2.58β€―Mbps) and the run finishes faster (~3.70β€―s vs 5.34β€―s), so RACK dominates the low-rate reordering scenario.

  • burst-loss - unordered with 4% loss and 50β€―ms jitter to push retransmit/recovery under heavy loss bursts.
  • fragmentation - oversized payloads that require chunk fragmentation/reassembly, to verify large-message handling:

Nothing improved or got worse.

  • media-hevc - a video profile: one-way stream, paced HEVC frames (~25 fps), 3% loss, ~1200-byte max payload, across a ~13-14 Mbps link (taken from a real-world use case of sending DRM media over WebRTC datachannels) to ensure sustained media delivery works.

RACK hits 12.90β€―Mbps goodput in 2.14β€―s (100% delivery) while the main branch streaming to the RACK branch sits at 11.34β€―Mbps in 4.66β€―s. That’s a 2x faster finish!

We also have 3 negative tests to ensure that corruption and errors are still being caught:

  • fault-checksum - corrupts every 7th DATA chunk’s checksum so receivers must drop it and log the error.
  • fault-bad-chunk-len - mangles the chunk length field every 7th chunk to validate length checks/parsing.
  • fault-nonzero-padding - corrupts padding bytes every 7th chunk so padding validation and chunk isolation logic are exercised.

In both branches, these cases fail (as desired), which confirms that both branches detect the corruption and that there is no regression in behavior.
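For context, the checksum that fault-checksum corrupts is CRC32c computed over the packet with the 4-byte checksum field zeroed. Here's a small validation sketch (the little-endian placement follows pion/sctp's marshaling as we understand it; double-check RFC 4960 Appendix B before reusing this):

  package main

  import (
      "encoding/binary"
      "fmt"
      "hash/crc32"
  )

  var crc32c = crc32.MakeTable(crc32.Castagnoli)

  // checksumOK validates an SCTP packet checksum: CRC32c over the whole
  // packet with the 4-byte checksum field (bytes 8..11 of the common
  // header) zeroed out. The little-endian placement is an assumption
  // based on pion/sctp's marshaling.
  func checksumOK(packet []byte) bool {
      if len(packet) < 12 {
          return false // shorter than an SCTP common header
      }
      want := binary.LittleEndian.Uint32(packet[8:12])
      scratch := append([]byte(nil), packet...)
      scratch[8], scratch[9], scratch[10], scratch[11] = 0, 0, 0, 0
      return crc32.Checksum(scratch, crc32c) == want
  }

  func main() {
      pkt := make([]byte, 16) // toy all-zero packet: header + 4 bytes
      binary.LittleEndian.PutUint32(pkt[8:12], crc32.Checksum(make([]byte, 16), crc32c))
      fmt.Println(checksumOK(pkt)) // true

      pkt[8] ^= 0xFF               // corrupt it, like fault-checksum does
      fmt.Println(checksumOK(pkt)) // false: receiver must drop the packet
  }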

CPU flamegraphs

The flamegraphs below show the CPU profiles for max-burst runs. You can see in the metadata that the RACK profile captured 20ms of samples (9.95% of 201.08ms duration) versus the master profile’s 40ms of samples (19.86% of 201.45ms duration) – exactly half the sample count for a similar duration, which directly supports the efficiency claims.

CPU Profile - Master

CPU Profile - Rack

Why RACK behaves better (not just “goes faster”)

RACK changes how SCTP decides that something is lost and when it sends probes, so it wastes less work fixing problems that never really happened:

  • Instead of keying almost everything off “three missing reports or an RTO fired”, RACK uses time-based loss detection: it looks at when chunks were last SACKed/ACKed and infers loss from elapsed time and the pattern of tail acknowledgments.
  • Tail Loss Probes (TLP) send a cheap "sample" chunk at the end of a burst to flush out late ACKs. If the receiver really did get the data, it answers and the sender avoids a full retransmission storm; otherwise, the probe doubles as the retransmission you needed anyway.

In practical terms, that's what the profiles and metrics are showing:

  • We still see the same hot stack (vnet.(*chunkUDP).UserData, runtime.memmove, a thin layer of runtime/type helpers) in both master and RACK. It’s the normal packet I/O path.
  • With RACK, that stack is exercised fewer times per unit of useful data because there are fewer spurious retransmits and fewer “just in case” timer expirations.
  • That’s exactly how we get more goodput, lower latency, and smaller CPU profiles at the same time: RACK spends less CPU “arguing with the network” and more CPU pushing real user data through SCTP.

Spec-aligned ACK behavior and testing

Using the SCP testing tool, we were able to find several issues, including:

  • In the initial RACK implementation, handling of transitions from high to low RTT was suboptimal due to the implementation using a global minimum for recent RTT measurements instead of a windowed minimum (the latter approach is only a “SHOULD” in RFC 8985 section 6.2.1). Atsushi Watanabe quickly identified it and we resolved the issue.

  • An earlier version of the RACK implementation also handled packet reordering poorly and consumed more CPU than non-RACK. This was corrected by implementing improved active RTT measurement, following the approach described in Weinrank's work (see p. 120).

  • A minor bug was discovered (and fixed) in the initial RACK implementation where the latest RTT was not measured for every packet.

  • We also found that Pion SCTP did not send a SACK immediately after a TSN gap, causing RACK to perform worse under moderate reordering. After fixing this behavior to align with RFC 4960 section 6.7 (surprisingly only a "SHOULD"), reordering test cases showed a ~30% improvement. A sketch of the idea follows below.
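Here's that sketch (hypothetical receiver logic, not the actual Pion patch): send the SACK right away when an arriving TSN opens a gap, instead of waiting on the delayed-SACK timer.

  package main

  import "fmt"

  // receiver is a toy model of the SACK-side decision.
  type receiver struct {
      cumulativeTSN uint32
  }

  // onData decides how to acknowledge an arriving DATA chunk. RFC 4960
  // section 6.7 says a SACK SHOULD be sent immediately when a gap in the
  // TSN sequence is detected; delaying it starves RACK of the timing
  // signal it needs to tell reordering apart from loss.
  func (r *receiver) onData(tsn uint32) (sackNow bool) {
      if tsn != r.cumulativeTSN+1 {
          return true // gap detected: SACK immediately
      }
      r.cumulativeTSN = tsn
      return false // in order: the delayed-SACK timer can batch this
  }

  func main() {
      r := &receiver{cumulativeTSN: 100}
      fmt.Println(r.onData(101)) // false: in order, delaying is fine
      fmt.Println(r.onData(103)) // true: 102 is missing, SACK right away
  }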

Looking forward

Keep an eye out for an upcoming blog post with even more improvements and benchmarks from our improved SCTP implementation using real-world data, as well as details on how we're doing it!

Thanks for reading!

Credits

Huge thanks to the following for making this possible:

  • Joe Turki for introducing me to Pion, making SCP, answering countless questions, and so much more.
  • Sean DuBois for making Pion, finding Felix Weinrank’s thesis, and endless encouragement.
  • Srayan Jana for helping to bounce around many ideas.
  • Atsushi Watanabe for reviewing and catching the global minimum vs windowed minimum issue in the RACK PR.
  • And many more people along the way!