08 - Load Balancing

Congestion control vs Load Balancing

  • Endpoint Congestion: can’t be avoided by load balancing (incast)

  • Fabric Congestion: can be avoided by load balancing

  • A static network topology lets us configure hosts statically. Because the topology is rarely subject to change and everything is decided in advance, this can be exploited when connecting multiple sources to each other.

There are several load-balancing “technologies”.

ECMP (Congestion Oblivious) - Distributed in-network

Equal-cost multipath (ECMP) is a network routing strategy that allows traffic toward the same destination to be spread across multiple paths of equal cost. Packets of the same session, or flow (that is, traffic with the same source and destination), are normally hashed onto the same path, so the load is balanced across flows rather than across the packets of a single flow. It is a mechanism that allows you to load balance traffic and increase bandwidth by fully utilizing otherwise unused bandwidth on links to the same destination.

When forwarding a packet, the routing technology must decide which next-hop path to use. In making a determination, the device takes into account the packet header fields that identify a flow. When ECMP is used, next-hop paths of equal cost are identified based on routing metric calculations and hash algorithms. That is, routes of equal cost have the same preference and metric values, and the same cost to the network. The ECMP process identifies a set of routers, each of which is a legitimate equal cost next hop towards the destination. The routes that are identified are referred to as an ECMP set. Because it addresses only the next hop destination, ECMP can be used with most routing protocols.


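  • Issue: flow collisions (several large flows can hash onto the same link while other links stay underused)
  • Issue: link faults (a failed link makes the topology asymmetric, which congestion-oblivious hashing does not take into account)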
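The hash-based next-hop choice and the collision issue can be illustrated with a minimal sketch. The uplink names, the flow tuples, and the use of CRC32 as the hash are assumptions made for illustration; real switches use vendor-specific hash functions.

```python
import zlib
from collections import Counter

def ecmp_next_hop(flow, next_hops):
    """Pick a next hop by hashing the flow identifier.

    The same flow always maps to the same next hop, so its packets
    stay in order; different flows may still collide on one link.
    """
    key = "|".join(str(field) for field in flow).encode()
    return next_hops[zlib.crc32(key) % len(next_hops)]

# Hypothetical ECMP set and flows: (src IP, dst IP, protocol, src port, dst port).
uplinks = ["spine-1", "spine-2", "spine-3", "spine-4"]
flows = [("10.0.0.1", "10.0.1.1", 6, 40000 + i, 80) for i in range(8)]

placement = {flow: ecmp_next_hop(flow, uplinks) for flow in flows}
load = Counter(placement.values())  # number of flows per uplink

for uplink in uplinks:
    print(uplink, load[uplink])
```

With 8 flows and 4 uplinks at least one uplink must carry two or more flows. Because the hash ignores flow sizes and link load, nothing prevents two large flows from sharing a link while another link is idle.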

CONGA (Congestion aware) - Distributed in-network

Hedera (Centralized)

  • Detect Large Flows

    • Scheduler continually polls edge switches for flow byte-counts
      • Flows exceeding a B/s threshold are “large”
        • e.g., > 10% of a host’s link capacity (i.e., > 100 Mbps on a 1 Gbps link)
    • Small flows are left to ECMP, which is efficient for them
  • Estimate Flow Demands

  • Place Flows

    • Use estimated demands to heuristically find better placement of large flows on the ECMP paths, to maximize bisection bandwidth
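The detect and place steps above can be sketched as follows. This is a simplified illustration, not Hedera's actual scheduler: it assumes 1 Gbps links, uses the measured rate as a stand-in for the estimated demand, and places flows with a greedy first-fit heuristic. The flow names and path names are hypothetical.

```python
LINK_CAPACITY_BPS = 1_000_000_000                  # assumed 1 Gbps links
LARGE_FLOW_THRESHOLD = LINK_CAPACITY_BPS // 10     # 10% of link capacity

def detect_large_flows(rates_bps):
    """Keep only flows whose measured rate exceeds the threshold."""
    return {f: r for f, r in rates_bps.items() if r > LARGE_FLOW_THRESHOLD}

def place_flows(demands_bps, paths, capacity_bps=LINK_CAPACITY_BPS):
    """Greedy first fit: largest flows first, first path with enough room.

    Flows that fit nowhere are left out of the result, i.e. they would
    stay on their default ECMP path.
    """
    reserved = {path: 0 for path in paths}
    placement = {}
    for flow, demand in sorted(demands_bps.items(), key=lambda kv: -kv[1]):
        for path in paths:
            if reserved[path] + demand <= capacity_bps:
                reserved[path] += demand
                placement[flow] = path
                break
    return placement

# Hypothetical per-flow rates polled from the edge switches (bits per second).
rates = {"A": 600_000_000, "B": 500_000_000, "C": 50_000_000, "D": 300_000_000}
large = detect_large_flows(rates)                  # C is small and stays on ECMP
placement = place_flows(large, ["core-1", "core-2"])
print(placement)  # {'A': 'core-1', 'B': 'core-2', 'D': 'core-1'}
```

Flows A and B would saturate a single path if hashed together, so the sketch separates them and then fits D into the remaining capacity of the first path.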

In-network issues

  • For centralized schemes: reactiveness. How fast can the scheduler react to flows requiring less or more bandwidth, to new flows appearing, and to existing flows disappearing?
  • In general: they might require changes to the switch hardware or to the transport protocol

Presto (Cong. Oblivious) - Host-based

  • Granularity: flowcells (i.e., blocks of consecutive bytes, of fixed size)
  • round-robin
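A minimal sketch of flowcell-based round-robin, to make the granularity concrete. The 64 KiB cell size and the path names are assumptions made for illustration, and the last cell of a flow may be shorter than the fixed size.

```python
from itertools import cycle

FLOWCELL_BYTES = 64 * 1024  # assumed flowcell size

def split_into_flowcells(total_bytes, cell_size=FLOWCELL_BYTES):
    """Return (offset, length) pairs covering the flow in byte order."""
    return [(offset, min(cell_size, total_bytes - offset))
            for offset in range(0, total_bytes, cell_size)]

def assign_round_robin(cells, paths):
    """Send consecutive flowcells over consecutive paths, ignoring congestion."""
    path_iter = cycle(paths)
    return [(cell, next(path_iter)) for cell in cells]

cells = split_into_flowcells(200 * 1024)            # a 200 KiB flow
schedule = assign_round_robin(cells, ["p1", "p2", "p3"])
for (offset, length), path in schedule:
    print(offset, length, path)
```

Since consecutive flowcells of one flow travel on different paths, the receiver has to tolerate or repair reordering between cells.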

FlowBender (Cong. Aware) - Host-based

  • granularity: flows (avoids reordering)
  • forces ECMP rehashing (moving the flow onto a different path) when it detects congestion
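The rehash-on-congestion idea can be sketched as below. This is an illustration under stated assumptions, not FlowBender's implementation: congestion is assumed to be signalled by ECN-marked ACKs, the 5% threshold and the one-RTT trigger are placeholder values, and the `salt` field stands in for whatever header field the switches include in their ECMP hash.

```python
import zlib

class FlowBenderFlow:
    """Host-side state for one flow (hypothetical class for illustration)."""

    def __init__(self, five_tuple, paths, mark_threshold=0.05, rtts_before_reroute=1):
        self.five_tuple = five_tuple
        self.paths = paths
        self.mark_threshold = mark_threshold          # assumed value
        self.rtts_before_reroute = rtts_before_reroute  # assumed value
        self.salt = 0                                 # extra hashed field
        self.congested_rtts = 0

    def current_path(self):
        """Path the network would pick: hash of the 5-tuple plus the salt."""
        key = f"{self.five_tuple}|{self.salt}".encode()
        return self.paths[zlib.crc32(key) % len(self.paths)]

    def on_rtt_end(self, acked, ecn_marked):
        """Update congestion state once per RTT; return True if rehashed."""
        fraction = ecn_marked / acked if acked else 0.0
        if fraction > self.mark_threshold:
            self.congested_rtts += 1
        else:
            self.congested_rtts = 0
        if self.congested_rtts >= self.rtts_before_reroute:
            self.salt += 1            # changing the hashed field forces a rehash
            self.congested_rtts = 0
            return True
        return False

flow = FlowBenderFlow(("10.0.0.1", "10.0.1.1", 6, 40000, 80), ["p1", "p2", "p3", "p4"])
print(flow.current_path())
print(flow.on_rtt_end(acked=100, ecn_marked=1))    # below threshold, no rehash
print(flow.on_rtt_end(acked=100, ecn_marked=20))   # above threshold, rehash
print(flow.current_path())
```

A rehash does not guarantee a different path, since the new hash can land on the same one by chance. The flow stays on a single path between rehashes, which is why reordering is limited.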

Important concepts


  • Flowlets are bursts of packets from the same flow separated by at least a time delta (pre-determined time gap)

  • If the gap is larger than the delay difference across the paths, no packets will be reordered.
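Flowlet detection from packet arrival times can be sketched as follows. The timestamps and the 500 microsecond gap are made-up values for illustration; in practice the gap is chosen larger than the delay difference between paths.

```python
def split_into_flowlets(arrival_times, gap):
    """Group packet timestamps of one flow into flowlets.

    A new flowlet starts whenever the idle time since the previous
    packet is at least `gap`.
    """
    flowlets = []
    for t in arrival_times:
        if flowlets and t - flowlets[-1][-1] < gap:
            flowlets[-1].append(t)    # still inside the current burst
        else:
            flowlets.append([t])      # idle gap reached: start a new flowlet
    return flowlets

# Packet timestamps of a single flow, in microseconds (hypothetical).
times_us = [0, 10, 20, 600, 610, 1500]
print(split_into_flowlets(times_us, gap=500))
# [[0, 10, 20], [600, 610], [1500]]
```

Each flowlet can then be sent on a different path: the idle gap gives the previous burst time to drain before the next one can overtake it.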

Random Packet Spraying

  • Also known as random packet load balancing

  • Selects a random path for each packet (rather than a specific path for each message)

  • next-hops are picked randomly for each packet

  • Issue: packets might arrive out of order. This is a problem for TCP (reordering can trigger spurious retransmissions), but also for InfiniBand/RoCE

  • In-network compute with programmable switches can greatly improve the performance of some Big Data applications (MapReduce, distributed ML training) but must be designed carefully

    • MapReduce
    • allreduce
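Returning to the reordering issue of random packet spraying, the effect can be illustrated with a small simulation. The per-path delays, the send interval, and the helper names are assumptions made for illustration.

```python
import random

def spray(num_packets, path_delays, send_interval=1.0, rng=None):
    """Send packets 0..n-1, each on a randomly chosen path.

    Returns the sequence numbers in the order the receiver sees them.
    """
    rng = rng or random.Random()
    arrivals = []
    for seq in range(num_packets):
        path = rng.randrange(len(path_delays))   # independent choice per packet
        arrivals.append((seq * send_interval + path_delays[path], seq))
    return [seq for _, seq in sorted(arrivals)]

def count_out_of_order(received):
    """Count packets that arrive after a packet with a higher sequence number."""
    highest_seen, late = -1, 0
    for seq in received:
        if seq < highest_seen:
            late += 1
        else:
            highest_seen = seq
    return late

# Two paths whose delays differ by more than the send interval.
received = spray(20, path_delays=[5.0, 9.0], rng=random.Random(1))
print(received, count_out_of_order(received))

# Equal delays: spraying never reorders.
print(spray(10, path_delays=[5.0, 5.0]))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Reordering appears only when the delay difference between paths exceeds the spacing between packets, which is the same condition that the flowlet gap is designed to rule out.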