Fast Notification for tunnel-based lossless RDMA transmission in WAN

Internet-Draft	Fast Notification for tunnel-based lossl	June 2026
Hu, et al.	Expires 29 December 2026

Abstract

With the rapid development of Large Language Models (LLMs), many emerging AI services require lossless transmission of RDMA traffic over tunnels in Wide Area Networks (WANs). Existing network mechanisms were not designed for the responsiveness and scale that these dynamic services require. WANs should support real-time, lightweight network notifications to improve responsiveness for traffic engineering, congestion mitigation, and failure protection.¶

This document analyzes typical scenarios in which RDMA traffic needs to be tunneled across a WAN, and proposes fast network notification solutions based on ICMPv6 or UDP.¶

1. Introduction

For modern AI services such as distributed LLM training or inference, the WAN needs to support the tunneling of RDMA traffic between data centers (DCs). RDMA is widely used in high-performance computing and AI clusters, where it achieves low latency, reduced CPU overhead, and high network throughput. Currently, mainstream RDMA protocols (e.g., IB, RoCE) operate over best-effort forwarding, where even a small number of packet losses can dramatically reduce the effective throughput. Therefore, the WAN requires FAst Notification for Traffic Engineering and Load balancing (FANTEL) to ensure reliable and congestion-free data transfer.¶

[I-D.geng-fantel-fantel-gap-analysis] points out that existing TE mechanisms face limitations in responsiveness, coverage, and operational overhead, especially in high-speed, large-scale environments. ECN [RFC3168] is a widely deployed congestion control mechanism that enables a forwarding element to signal congestion to the sender without having to drop packets. However, it still relies on end-to-end signaling, which makes real-time feedback challenging over long-distance WANs. BFD [RFC5880] is designed for rapid fault detection by sending frequent control packets between peers, but a higher probe frequency increases CPU and bandwidth usage, making it difficult to balance detection speed against system overhead.¶

[I-D.ietf-rtgwg-net-notif-ps] is the IETF problem statement for Fast Network Notification (FANN). Based on an analysis of gaps in current network mechanisms and the operational requirements of modern applications (e.g., AI/ML training), it formally defines the scope and core requirements for fast network notifications. It further specifies what information such notifications carry, who the intended recipients are, how they should be delivered, and what kinds of timely actions they may enable.¶

To enable lossless data transmission, several drafts have been proposed to support FANN. [I-D.wh-rtgwg-adaptive-routing-arn] proposes Adaptive Routing Notification (ARN), a proactive notification mechanism for adaptive routing, and describes the information carried in ARN to notify remote nodes for re-routing. [I-D.liu-rtgwg-adaptive-routing-notification] describes the mechanisms for delivering ARN messages.¶

This document specifies the FANN mechanism for scenarios where service traffic is carried over tunnels in a WAN. It first introduces the typical scenarios, then specifies the fast notification process for key TE areas such as congestion control, load balancing, and failure protection, and finally defines the protocol implementation.¶

3. Scenarios

3.1. Scenario 1: distributed model training across DCs

The computing power of a single DC is limited by space and power supply, making it difficult to meet the fast-growing computing resource demands of LLM training. Distributed model training across multiple DCs therefore provides a more efficient and cost-effective way to aggregate computing resources. In this scenario, TB-scale training parameters need to be rapidly synchronized over the WAN.¶

3.2. Scenario 2: distributed model inference between on-premises and third-party DCs

Some customers deploy LLMs by building on-premises AI facilities, but as inference concurrency increases, scaling out these facilities requires significant investment. To address this, distributed model inference between the customer's on-premises facilities and a third-party DC provides a more agile and cost-effective solution. In this scenario, data such as the KV cache and model parameters need to be rapidly synchronized over the WAN.¶

3.3. Scenario abstraction

In the above scenarios, a large volume of data needs to be synchronized between DCs using RDMA. RDMA traffic generated by LLM training or inference is highly concurrent, bursty, and extremely latency-sensitive. Therefore, operators typically encapsulate it in tunnels over the WAN to enable flexible steering and end-to-end service isolation. In these scenarios, the framework for RDMA traffic transmission over WAN tunnels is as follows:¶

                +--------------------------------------------------+
                |                       DC1                        |
                |                                                  |
                | +-----------+  +-----------+       +-----------+ |
                | |AI server 1|  |AI server 2|  ...  |AI server n| |
                | +-----------+  +-----------+       +-----------+ |
                +------------------------+-------------------------+
                                         |
                +------------------------+-------------------------+
                |   WAN            +-----+----+                    |
                |           +------+ingress PE+------+             |
                |           |      +----------+      |             |
                |           |                        |             |
                |        +--+---+                 +--+---+         |
                |        |  R1  +                 +  R2  |         |
                |        +--+---+\               /+--+---+         |
                |           |     \             /    |             |
                |           |      \+---------+/     |             |
                |           |       +   R5    +      |             |
                |           |      /+---------+\     |             |
                |           |     /             \    |             |
                |        +--+---+/               \+--+---+         |
                |        |  R3  +                 +  R4  |         |
                |        +--+---+                 +--+---+         |
                |           |                        |             |
                |           |       +---------+      |             |
                |           +-------+egress PE+------+             |
                |                   +----+----+                    |
                +------------------------+-------------------------+
                                         |
                +------------------------+-------------------------+
                | +-----------+  +-----------+       +-----------+ |
                | |AI server 1|  |AI server 2|  ...  |AI server m| |
                | +-----------+  +-----------+       +-----------+ |
                |                                                  |
                |                       DC2                        |
                +--------------------------------------------------+
                            Figure 1: Network diagram

The AI servers in DC1 send RDMA traffic to the WAN's ingress PE.¶
At the WAN's ingress PE, the RDMA traffic is encapsulated according to the tunnel protocol and forwarded across the WAN to the egress PE.¶
The WAN's P nodes (R1-R5) forward the encapsulated traffic from the ingress PE to the egress PE via tunnels.¶
At the WAN's egress PE, the payload is decapsulated into RDMA packets and transmitted to the AI servers in DC2.¶

4. Process Analysis

Tunneling technologies include various protocols, such as GRE, VXLAN, MPLS, and SRv6. Moreover, AI workloads are highly sensitive to packet loss, latency, and throughput. Network failures, congestion, or underutilization can all lead to significant waste of compute resources. When transmitting RDMA traffic over tunnels, the WAN should support FANN capability to respond rapidly to network conditions. Specifically, WAN devices should support a fast notification mechanism to improve three key TE scenarios: failure protection, congestion control, and load balancing.¶

4.1. Failure protection

For large-scale and dynamic networks, protection mechanisms need to ensure service continuity in case of failures. According to [I-D.geng-fantel-fantel-gap-analysis], existing failure handling methods, such as BFD and FRR, lack flexibility and responsiveness in complex topologies. Therefore, the WAN should support fast notification of failures, allowing near-instantaneous and dynamic protection responses that minimize failure impact.¶

Upon network failure, the ingress PE should immediately adapt its forwarding policy to steer traffic away from faulty links or nodes. Therefore, the fast-notification-based failure protection process is as follows:¶

        notification
      +--------------+
      |              |
      |          +---+--+    +------+
      |          |  R1  +--x-+  R2  |
      |         /+------+  ^ +------+\
      |        /           |          \
      v       /         failure        \
+----------+ /                          \ +---------+
|          |/                            \|         |
|ingress PE|\                            /|egress PE|
|          | \                          / |         |
+----------+  \                        /  +---------+
               \ +------+    +------+ /
                \|  R3  +----+  R4  |/
                 +------+    +------+
                        Figure 2: Failure protection process

When a P node detects a local link or node failure, it collects failure information about the affected link or flow.¶
The P node sends a notification with the failure information to the ingress PE. In addition to the identity of the failed link or node, the notification must also include information about the affected traffic.¶
Ingress PE receives the notification and reroutes the traffic based on its content to exclude the failed link or node:¶
* If a backup path is available, ingress PE should switch the service traffic to the backup path.¶
* If multiple feasible paths exist, ingress PE should update its load-balancing policy to utilize all available paths.¶
Ingress PE should send a corresponding notification to the sender and controller.¶
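The ingress PE reaction described above can be sketched as follows. This is a minimal illustration, not a normative procedure: it assumes that each tunnel path is known to the ingress PE as an ordered list of node names and that the notification identifies the failed link as a pair of nodes. The names IngressPE and on_failure, and the equal-split rebalancing policy, are hypothetical and are not defined by this document.

```python
from dataclasses import dataclass, field


@dataclass
class IngressPE:
    """Hypothetical ingress PE state: candidate paths and load-share weights."""
    paths: dict                       # path name -> ordered list of node names
    weights: dict = field(default_factory=dict)

    def __post_init__(self):
        if not self.weights:
            self.weights = self._equal_split(self.paths)

    @staticmethod
    def _equal_split(names):
        names = list(names)
        return {n: 1.0 / len(names) for n in names} if names else {}

    @staticmethod
    def _uses_link(nodes, link):
        a, b = link
        hops = set(zip(nodes, nodes[1:]))   # consecutive node pairs on the path
        return (a, b) in hops or (b, a) in hops

    def on_failure(self, failed_link):
        """Exclude every path crossing the failed link and rebalance the rest."""
        healthy = [name for name, nodes in self.paths.items()
                   if not self._uses_link(nodes, failed_link)]
        # Paths over the failed link get weight 0; survivors share traffic equally.
        self.weights = {name: 0.0 for name in self.paths}
        self.weights.update(self._equal_split(healthy))
        return healthy


# Topology of Figure 2: the R1-R2 link fails, so only the lower path remains.
pe = IngressPE(paths={
    "upper": ["ingress PE", "R1", "R2", "egress PE"],
    "lower": ["ingress PE", "R3", "R4", "egress PE"],
})
pe.on_failure(("R1", "R2"))
print(pe.weights)  # {'upper': 0.0, 'lower': 1.0}
```

With a single surviving path the sketch degenerates to a switch to the backup path; with several survivors it spreads traffic over all of them, matching the two cases listed above.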

4.2. Congestion control

RDMA traffic is bursty and highly sensitive to packet loss, so the WAN requires proactive congestion control mechanisms. [RFC6040] redefines how the explicit congestion notification (ECN) field of the IP header should be constructed on entry to and exit from any IP-in-IP tunnel, in order to achieve ECN-based congestion control across WANs between DCs. However, [I-D.geng-fantel-fantel-gap-analysis] observes that ECN/TCP-based methods still rely on end-to-end signaling and lack precise real-time feedback.¶

Currently, PFC is widely used in data centers to prevent data loss due to congestion. PFC uses a hop-by-hop back-pressure mechanism that tells the upstream node to stop or resume transmitting traffic. PFC achieves link-layer priority-based flow control, but its coarse control granularity still leads to problems such as head-of-line blocking and deadlock.¶

When network congestion occurs, the ingress PE should immediately adapt its forwarding policy to reduce the traffic sent to congested nodes. Therefore, the fast-notification-based congestion control process is as follows:¶

               notification
      +---------------------------+
      |                           |
      |          +------+    +-+--+-+
      |          |  R1  +----+  R2  |
      |         /+------+    +------+\
      |        /                      x<---congestion
      v       /                        \
+----------+ /                          \ +---------+
|          |/                            \|         |
|ingress PE|\                            /|egress PE|
|          | \                          / |         |
+----------+  \                        /  +---------+
               \ +------+    +------+ /
                \|  R3  +----+  R4  |/
                 +------+    +------+
                        Figure 3: Congestion control process

When a P node detects congestion, it collects congestion information about the congested link or flow.¶
The P node sends a notification with the congestion information to the ingress PE.¶
Ingress PE receives the notification and reroutes the traffic based on its content to exclude the congested link:¶
* If a backup path is available, ingress PE should switch the service traffic to the backup path.¶
* If multiple feasible paths exist, ingress PE should update its load-balancing policy to utilize all available paths.¶
Ingress PE should reduce the transmission rate of the corresponding traffic, and send a notification to the sender and controller.¶
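One way to combine the rerouting and rate-reduction steps above is sketched below. This is an illustration under stated assumptions only: per-path rates and capacities are tracked in Mbit/s, the severity names follow the example Congestion sub-types, and the KEEP_PERCENT values and the on_congestion function are hypothetical and not defined by this document.

```python
# Hypothetical policy: share of traffic kept on the congested path per severity.
# The percentages are illustrative and are not defined by this document.
KEEP_PERCENT = {"mild": 90, "moderate": 70, "severe": 50}


def on_congestion(rates, capacity, congested, severity):
    """Shift load off a congested path.

    rates and capacity map path name -> Mbit/s. Returns (new_rates, excess),
    where excess is traffic that fits on no other path, so the sender has to
    be asked to slow down by that amount.
    """
    new_rates = dict(rates)
    shed = rates[congested] * (100 - KEEP_PERCENT[severity]) // 100
    new_rates[congested] -= shed
    for path in rates:
        if path == congested:
            continue
        # Move as much of the shed traffic as this path has spare capacity for.
        moved = min(shed, max(0, capacity[path] - new_rates[path]))
        new_rates[path] += moved
        shed -= moved
    return new_rates, shed


capacity = {"upper": 100_000, "lower": 100_000}

# Enough headroom on the lower path: all shed traffic is rerouted.
print(on_congestion({"upper": 80_000, "lower": 40_000}, capacity, "upper", "severe"))
# ({'upper': 40000, 'lower': 80000}, 0)

# Lower path nearly full: 30000 Mbit/s cannot be placed and must be throttled.
print(on_congestion({"upper": 80_000, "lower": 90_000}, capacity, "upper", "severe"))
# ({'upper': 40000, 'lower': 100000}, 30000)
```

A non-zero second return value corresponds to the case where rerouting alone is insufficient and the ingress PE also has to notify the sender to reduce its rate.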

4.3. Load balancing for network state changes

Devices and links in a WAN often carry multiple services simultaneously. In addition to handling failures and congestion, dynamic load balancing based on network state changes can effectively improve network resource utilization.¶

When significant changes occur in the network state, the ingress PE should dynamically adjust its forwarding strategy to maximize network resource utilization. Therefore, the fast-notification-based load balancing process is as follows:¶

        notification
      +--------------+
      |              |
      |          +---+--+    +------+
      |          |  R1  +----+  R2  |
      |         /+------+  ^ +------+\
      |        /           |          \
      v       /     link utilization   \
+----------+ /           change         \ +---------+
|          |/                            \|         |
|ingress PE|\                            /|egress PE|
|          | \          node load change/ |         |
+----------+  \                 |      /  +---------+
      ^        \                v     /
      |         \+------+    +------+/
      |          |  R3  +----+  R4  |
      |          +------+    +---+--+
      |                          |
      +--------------------------+
              notification
                        Figure 4: Load balancing for network state changes

When a node detects a network state change, it collects the relevant state information, such as link utilization or queue buildup.¶
The node sends a fast notification to the ingress PE with information about the network state change.¶
Ingress PE receives the fast notification and updates its load-balancing policy to maximize the utilization of network resources.¶
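As an illustration of the last step, the sketch below recomputes load-balancing weights from reported link utilization. It shows one possible policy only, chosen for this example and not mandated by this document: each path is weighted by the headroom of its most utilized link. The function name rebalance and the data layout (links as node pairs, utilization in percent) are assumptions.

```python
def rebalance(path_links, link_util_percent):
    """Weight each path by the spare capacity of its bottleneck link.

    path_links: path name -> list of links (node pairs) on that path.
    link_util_percent: link -> last reported utilization in percent.
    Links without a report are treated as idle (0%).
    """
    headroom = {}
    for path, links in path_links.items():
        bottleneck = max(link_util_percent.get(link, 0) for link in links)
        headroom[path] = max(0, 100 - bottleneck)
    total = sum(headroom.values())
    if total == 0:
        # Every path is saturated: fall back to an equal split.
        return {path: 1 / len(headroom) for path in headroom}
    return {path: h / total for path, h in headroom.items()}


# Figure 4: R1-R2 reports 80% utilization, R3-R4 reports 40%.
paths = {"upper": [("R1", "R2")], "lower": [("R3", "R4")]}
print(rebalance(paths, {("R1", "R2"): 80, ("R3", "R4"): 40}))
# {'upper': 0.25, 'lower': 0.75}
```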

5. Solutions

Based on the framework analysis of fast notification in key TE areas, a unified protocol implementation for fast notification should be established, with explicit forwarding procedures to realize tunnel-based lossless transmission of RDMA packets in WAN.¶

5.1. ICMPv6-based solution

The ICMP Source Quench mechanism has been deprecated, and ICMPv6 defines no equivalent message, because TCP's built-in congestion control is more effective and Source Quench may interfere with its normal operation. However, fast network notification is a network-layer mechanism confined to the forwarding plane, designed for event-driven generation and consumption without involving endpoints. This avoids the conflict that led to the deprecation of Source Quench, making ICMPv6 suitable as a carrier for fast notifications.¶

5.1.1. Overall Structure

This document specifies a new ICMPv6 message to realize rapid notification in key traffic engineering areas including failure protection, congestion control, and load balancing. This ICMPv6 message consists of a fixed ICMPv6 header, and a variable-length metadata stack controlled by a 32-bit bitmap:¶

            +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
            |     Type      |     Code      |          Checksum             |
            +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
            |    Version    |   Reserved    |   Hop Limit   |   Event Type  |
            +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
            |   Event Sub-Type   |        Event Identifier (4 bytes)        |
            +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
            |                     Timestamp (8 bytes)                       |
            +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
            |                                                               |
            |              Originating Node IPv6 Address (16 bytes)         |
            |                                                               |
            +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
            |                       Bitmap (32 bits)                        |
            +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
            |                                                               |
            |                   Metadata Stack (variable)                   |
            |                                                               |
            +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                        Figure 5: new ICMPv6 message for fast notification

5.1.2. Fixed Field Definitions

Type (1 byte): ICMPv6 type for FANN. IANA allocation required (suggested value: TBD).¶
Code (1 byte): 0 for event notification, 1 for recovery notification.¶
Checksum (2 bytes): Standard ICMPv6 checksum.¶
Version (1 byte): Set to 1 for this specification.¶
Reserved (1 byte): Set to 0 on transmission, ignored on reception.¶
Hop Limit (1 byte): Controls propagation scope. Decremented by each forwarding node; discarded when reaching zero.¶
Event Type (1 byte): Primary category of the event. 0x01: Link failure; 0x02: Congestion; 0x03: Performance degradation; 0x04: Microburst; 0x05: Signal degradation / link errors; 0x06: Queue buildup; 0x07: Recovery (condition cleared).¶
Event Sub-Type (1 byte): More granular classification within an event type. For example, for Congestion (Event Type = 0x02): 0x01 for mild (> 50% utilization), 0x02 for moderate (> 70%), 0x03 for severe (> 90%).¶
Event Identifier (4 bytes): Unique identifier for this event instance, used for deduplication and correlation between multiple notifications.¶
Timestamp (8 bytes): Time when the event was detected, in microseconds since Unix epoch (UTC).¶
Originating Node IPv6 Address (16 bytes): IPv6 address of the node that detected the event. This serves as the node identifier, eliminating the need for a separate Node ID in the bitmap.¶
Bitmap (32 bits): Each bit indicates whether the corresponding metadata is present in the metadata stack.¶
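The fixed fields above can be serialized as in the following sketch. Several points are assumptions made for illustration: the fields are packed back to back in network byte order with no padding (Figure 5 does not fix the exact alignment), the Type value 200 is only a placeholder taken from the ICMPv6 private-experimentation range because the real value is TBD, the default Hop Limit of 8 is arbitrary, and the checksum is left as zero rather than computed. The names FIXED, pack_fixed, unpack_fixed, and congestion_sub_type are hypothetical.

```python
import ipaddress
import struct
import time

# ICMPv6 header (Type, Code, Checksum) followed by the fixed fields listed
# above, in network byte order. Assumption: no padding (41 bytes in total).
FIXED = struct.Struct("!BBH BBBB B I Q 16s I")

# Placeholder only: 200 is an ICMPv6 private-experimentation type (RFC 4443).
# The real Type is TBD and requires IANA allocation.
FANN_TYPE = 200


def congestion_sub_type(utilization_percent):
    """Map link utilization to the example Congestion sub-types (0x01-0x03)."""
    if utilization_percent > 90:
        return 0x03  # severe
    if utilization_percent > 70:
        return 0x02  # moderate
    if utilization_percent > 50:
        return 0x01  # mild
    return None      # below the lowest example threshold


def pack_fixed(code, event_type, sub_type, event_id, origin, bitmap,
               hop_limit=8, timestamp_us=None):
    """Build the fixed part of a notification; the checksum is left as 0."""
    if timestamp_us is None:
        timestamp_us = time.time_ns() // 1000  # microseconds since Unix epoch
    return FIXED.pack(
        FANN_TYPE, code, 0,             # Type, Code, Checksum (not computed)
        1, 0, hop_limit, event_type,    # Version, Reserved, Hop Limit, Event Type
        sub_type, event_id, timestamp_us,
        ipaddress.IPv6Address(origin).packed, bitmap)


def unpack_fixed(data):
    """Parse the fixed part; the metadata stack starts at offset FIXED.size."""
    (_type, code, _checksum, version, _reserved, hop_limit, event_type,
     sub_type, event_id, timestamp_us, origin, bitmap) = FIXED.unpack_from(data)
    return {"code": code, "version": version, "hop_limit": hop_limit,
            "event_type": event_type, "sub_type": sub_type,
            "event_id": event_id, "timestamp_us": timestamp_us,
            "origin": str(ipaddress.IPv6Address(origin)), "bitmap": bitmap}


msg = pack_fixed(code=0, event_type=0x02, sub_type=congestion_sub_type(95),
                 event_id=1, origin="2001:db8::1", bitmap=0)
fields = unpack_fixed(msg)
print(len(msg), fields["sub_type"], fields["origin"])  # 41 3 2001:db8::1
```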

5.1.3. Bitmap Definition

The 32-bit bitmap field indicates which metadata is present in the metadata stack. Each bit corresponds to a specific metadata type with a fixed length, enabling efficient parsing without TLV overhead. Bits are listed below by their index.¶

Bit 0: Ingress Port ID (4 bytes) Interface identifier where the event was observed on the ingress side.¶
Bit 1: Egress Port ID (4 bytes) Interface identifier where the event was observed on the egress side. Together with Ingress Port ID, uniquely identifies the location of the event.¶
Bit 2: Ingress Timestamp (8 bytes) Time when the event was observed at the ingress interface.¶
Bit 3: Egress Timestamp (8 bytes) Time when the event was observed at the egress interface.¶
Bit 4: Egress TX Link Utilization (4 bytes) Real-time bandwidth utilization of the egress interface (e.g., 95% = 0x0000005F).¶
Bit 5: Packet Loss (4 bytes) Packet loss count or loss rate at the interface.¶
Bit 6: Latency / Delay (4 bytes) Current latency of the interface or path in microseconds.¶
Bit 7: Jitter (4 bytes) Latency variation in microseconds.¶
Bit 8: Queue Occupancy (4 bytes) Current queue depth in bytes.¶
Bit 9: Buffer Occupancy (4 bytes) Overall buffer occupancy for shared buffer pools.¶
Bit 10: Signal Degradation / Link Errors (4 bytes) Bit error rate, CRC error count, or signal quality metric (0-100%).¶
Bit 11: Hard Failure / Link Down (4 bytes) Boolean flag (least significant bit = 1 indicates link down).¶
Bit 12: Microburst Detected (4 bytes) Boolean flag (least significant bit = 1 indicates microburst detected).¶
Bit 13: Flow ID (5-tuple) (37 bytes) Source IPv6 (16) + Destination IPv6 (16) + Source Port (2) + Destination Port (2) + Protocol (1).¶
Bit 14: Path ID (16 bytes) Identifier of the affected path (e.g., SRv6 SID list or MPLS label stack).¶
Bits 15-31: Reserved for future use.¶

Bitmap bits 0 and 1 (Ingress Port ID and Egress Port ID) serve as the location identifier, eliminating the need for a separate Event Location field. The port IDs uniquely indicate where in the network the event occurred, corresponding to "Location of Event: This can be used to indicate the location where the event occurred in the network."¶

Although bits 11 and 12 are logically single-bit flags, they each occupy 4 bytes for alignment purposes, with the upper 31 bits set to zero.¶
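Because every metadata type has a fixed length, a receiver can derive the size of the metadata stack from the bitmap alone. The sketch below illustrates this. It assumes that bit index i corresponds to the value 1 << i (this document does not state whether bit 0 is the most or least significant bit), and the names METADATA, bitmap_for, and stack_length are hypothetical.

```python
# (bit index, metadata name, length in bytes), as listed in this section.
METADATA = [
    (0, "ingress_port_id", 4),
    (1, "egress_port_id", 4),
    (2, "ingress_timestamp", 8),
    (3, "egress_timestamp", 8),
    (4, "egress_tx_link_utilization", 4),
    (5, "packet_loss", 4),
    (6, "latency", 4),
    (7, "jitter", 4),
    (8, "queue_occupancy", 4),
    (9, "buffer_occupancy", 4),
    (10, "signal_degradation", 4),
    (11, "hard_failure", 4),
    (12, "microburst_detected", 4),
    (13, "flow_id", 37),
    (14, "path_id", 16),
]
BIT_OF = {name: bit for bit, name, _ in METADATA}
RESERVED_MASK = 0xFFFFFFFF & ~((1 << 15) - 1)  # bits 15-31


def bitmap_for(names):
    """Return the bitmap with one bit set per requested metadata name."""
    bitmap = 0
    for name in names:
        bitmap |= 1 << BIT_OF[name]  # assumption: bit i has value 1 << i
    return bitmap


def stack_length(bitmap):
    """Total metadata stack length in bytes implied by a bitmap."""
    if bitmap & RESERVED_MASK:
        raise ValueError("reserved bitmap bit set")
    return sum(length for bit, _, length in METADATA if bitmap & (1 << bit))


bm = bitmap_for(["egress_port_id", "egress_tx_link_utilization",
                 "latency", "signal_degradation"])
print(hex(bm), stack_length(bm))  # 0x452 16
```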

5.1.4. Metadata Stack Structure

The metadata stack is constructed by concatenating metadata fields in ascending bit order. For each bit that is set to 1 in the bitmap, the corresponding metadata field of fixed length is appended to the stack. Bits set to 0 are skipped.¶

For example, if the bitmap has bits 1 (Egress Port ID), 4 (Egress TX Link Utilization), 6 (Latency), and 10 (Signal Degradation) set to 1, the metadata stack will be:¶

Egress Port ID (4 bytes): 0x00000008 (port 8)¶
Egress TX Link Utilization (4 bytes): 0x0000005A (90% utilization)¶
Latency (4 bytes): 0x000003E8 (1000 microseconds)¶
Signal Degradation (4 bytes): 0x0000000A (10% error rate)¶
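The example above can be reproduced with the following sketch, which builds and parses a metadata stack in ascending bit order. It is an illustration under assumptions: bit index i is taken to have the value 1 << i, multi-byte values are encoded in network byte order, and the names LENGTHS, encode_stack, and decode_stack are hypothetical.

```python
# Metadata lengths in bytes for bits 0..14, from Section 5.1.3.
LENGTHS = [4, 4, 8, 8, 4, 4, 4, 4, 4, 4, 4, 4, 4, 37, 16]


def encode_stack(fields):
    """fields: {bit index: int or bytes}. Returns (bitmap, stack bytes)."""
    bitmap, stack = 0, b""
    for bit in sorted(fields):              # ascending bit order
        if not 0 <= bit < len(LENGTHS):
            raise ValueError(f"bit {bit} is reserved or invalid")
        value = fields[bit]
        if isinstance(value, int):
            value = value.to_bytes(LENGTHS[bit], "big")
        if len(value) != LENGTHS[bit]:
            raise ValueError(f"bit {bit} needs {LENGTHS[bit]} bytes")
        bitmap |= 1 << bit                  # assumption: bit i has value 1 << i
        stack += value
    return bitmap, stack


def decode_stack(bitmap, stack):
    """Split a metadata stack into {bit index: raw bytes} using the bitmap."""
    if bitmap >> len(LENGTHS):
        raise ValueError("reserved bitmap bit set")
    fields, offset = {}, 0
    for bit, length in enumerate(LENGTHS):
        if bitmap & (1 << bit):
            chunk = stack[offset:offset + length]
            if len(chunk) != length:
                raise ValueError("metadata stack truncated")
            fields[bit] = chunk
            offset += length
    return fields


# Bits 1, 4, 6, and 10 with the values from the example above.
bitmap, stack = encode_stack({1: 8, 4: 90, 6: 1000, 10: 10})
print(hex(bitmap), stack.hex())
# 0x452 000000080000005a000003e80000000a
print(int.from_bytes(decode_stack(bitmap, stack)[6], "big"))  # 1000
```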

5.2. UDP-based solution

While the preceding sections define the Fast Network Notification (FANN) message format using ICMPv6, the same fixed header and 32-bit bitmap structure can be directly carried over UDP. When encapsulated in UDP, the ICMPv6 Type/Code/Checksum fields are simply replaced by the standard UDP header (source port, destination port, length, checksum), while the FANN fixed fields, the 32-bit bitmap, and the metadata stack remain unchanged. The receiving node identifies the FANN message by a well-known UDP destination port (to be allocated by IANA) and processes it identically to the ICMPv6 variant.¶

This approach is consistent with the principle that the solution may reuse existing protocols where appropriate. Moreover, the use of UDP offers practical deployment advantages in environments where ICMPv6 traffic is filtered or where NAT traversal is required, without compromising the notification's timeliness or forwarding-plane efficiency.¶
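The relationship between the two encodings can be sketched as follows. This is a minimal illustration, assuming the ICMPv6 variant starts with the 4-byte Type/Code/Checksum header and that everything after it is carried unchanged as the UDP payload. The port number is a hypothetical placeholder from the dynamic port range, since the well-known port is still to be allocated by IANA, and the function names are assumptions.

```python
import socket

# Hypothetical placeholder; the real destination port is to be allocated by IANA.
FANN_UDP_PORT = 49152

ICMPV6_HEADER_LEN = 4  # Type (1) + Code (1) + Checksum (2)


def to_udp_payload(icmpv6_message):
    """Drop the ICMPv6 Type/Code/Checksum; the remaining bytes are unchanged."""
    if len(icmpv6_message) < ICMPV6_HEADER_LEN:
        raise ValueError("message shorter than the ICMPv6 header")
    return icmpv6_message[ICMPV6_HEADER_LEN:]


def send_fann_udp(payload, ingress_pe_address, port=FANN_UDP_PORT):
    """Send one notification to the ingress PE over UDP/IPv6."""
    with socket.socket(socket.AF_INET6, socket.SOCK_DGRAM) as sock:
        # The UDP header (ports, length, checksum) is added by the host stack.
        sock.sendto(payload, (ingress_pe_address, port))


icmpv6_message = bytes([200, 0, 0, 0]) + b"fixed fields, bitmap, metadata"
print(to_udp_payload(icmpv6_message))  # b'fixed fields, bitmap, metadata'
```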

9. References

9.1. Normative References

[RFC2119]: Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1997, <https://www.rfc-editor.org/info/rfc2119>.
[RFC3688]: Mealling, M., "The IETF XML Registry", BCP 81, RFC 3688, DOI 10.17487/RFC3688, January 2004, <https://www.rfc-editor.org/info/rfc3688>.
[RFC8174]: Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, May 2017, <https://www.rfc-editor.org/info/rfc8174>.

9.2. Informative References

[RFC3168]: Ramakrishnan, K., Floyd, S., and D. Black, "The Addition of Explicit Congestion Notification (ECN) to IP", RFC 3168, DOI 10.17487/RFC3168, September 2001, <https://www.rfc-editor.org/info/rfc3168>.
[RFC6040]: Briscoe, B., "Tunnelling of Explicit Congestion Notification", RFC 6040, DOI 10.17487/RFC6040, November 2010, <https://www.rfc-editor.org/info/rfc6040>.
[RFC7514]: Luckie, M., "Really Explicit Congestion Notification (RECN)", RFC 7514, DOI 10.17487/RFC7514, April 2015, <https://www.rfc-editor.org/info/rfc7514>.
[RFC4443]: Conta, A., Deering, S., and M. Gupta, Ed., "Internet Control Message Protocol (ICMPv6) for the Internet Protocol Version 6 (IPv6) Specification", RFC 4443, DOI 10.17487/RFC4443, March 2006, <https://www.rfc-editor.org/info/rfc4443>.
[RFC5880]: Katz, D. and D. Ward, "Bidirectional Forwarding Detection (BFD)", RFC 5880, DOI 10.17487/RFC5880, June 2010, <https://www.rfc-editor.org/info/rfc5880>.
[I-D.wh-rtgwg-adaptive-routing-arn]: Wang, H., Huang, H., Geng, X., Xu, X., and Y. Xia, "Adaptive Routing Notification", Work in Progress, Internet-Draft, draft-wh-rtgwg-adaptive-routing-arn-03, 13 September 2024, <https://datatracker.ietf.org/doc/html/draft-wh-rtgwg-adaptive-routing-arn-03>.
[I-D.liu-rtgwg-adaptive-routing-notification]: Liu, Y., lihesong, and W. Duan, "Adaptive Routing Notification for Load-balancing", Work in Progress, Internet-Draft, draft-liu-rtgwg-adaptive-routing-notification-02, 12 June 2025, <https://datatracker.ietf.org/doc/html/draft-liu-rtgwg-adaptive-routing-notification-02>.
[I-D.xiao-rtgwg-rocev2-fast-cnp]: Min, X. and lihesong, "Fast Congestion Notification Packet (CNP) in RoCEv2 Networks", Work in Progress, Internet-Draft, draft-xiao-rtgwg-rocev2-fast-cnp-03, 9 June 2025, <https://datatracker.ietf.org/doc/html/draft-xiao-rtgwg-rocev2-fast-cnp-03>.
[I-D.geng-fantel-fantel-gap-analysis]: Geng, X., Huo, P., Cheng, W., Li, D., Zhu, Y., and H. Zhengxin, "Gap Analysis of Fast Notification for Traffic Engineering and Load Balancing", Work in Progress, Internet-Draft, draft-geng-fantel-fantel-gap-analysis-01, 7 July 2025, <https://datatracker.ietf.org/doc/html/draft-geng-fantel-fantel-gap-analysis-01>.
[I-D.ietf-rtgwg-net-notif-ps]: Dong, J., McBride, M., Clad, F., Zhang, Z. J., Zhu, Y., Xu, X., Zhuang, R., Pang, R., Lu, H., Liu, Y., Contreras, L. M., Mehmet, D., and R. Rahman, "Fast Network Notifications Problem Statement", Work in Progress, Internet-Draft, draft-ietf-rtgwg-net-notif-ps-00, 11 February 2026, <https://datatracker.ietf.org/doc/html/draft-ietf-rtgwg-net-notif-ps-00>.
