03 - TCP Offloading, multicore scalability

Overview

  • Description:: TCP Offloading, multicore scalability

TCP offload means that part of the TCP packet processing is moved from the CPU to the network card: incoming packets are unpacked on the NIC and coalesced into larger segments before the kernel sees them. This work is traditionally done by the CPU, but on modern network set-ups it can be done to a large extent in hardware.
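
As a rough illustration of the receive-side idea (and only that: the real thing lives in the NIC/driver, e.g. LRO/GRO), here is a toy C sketch that merges two TCP segments of the same connection when the second one directly continues the first one's byte stream. All names and the fixed-size payload buffer are made up for the example.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Toy model of a received TCP segment (illustration only). */
struct seg {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint32_t seq;           /* sequence number of the first payload byte */
    uint32_t len;           /* payload length */
    char     payload[4096]; /* oversized on purpose so we can merge into it */
};

static bool same_flow(const struct seg *a, const struct seg *b)
{
    return a->src_ip == b->src_ip && a->dst_ip == b->dst_ip &&
           a->src_port == b->src_port && a->dst_port == b->dst_port;
}

/* Append b to a if b directly continues a's byte stream. */
static bool coalesce(struct seg *a, const struct seg *b)
{
    if (!same_flow(a, b) || b->seq != a->seq + a->len)
        return false;
    memcpy(a->payload + a->len, b->payload, b->len);
    a->len += b->len;
    return true;
}

int main(void)
{
    struct seg a = { 1, 2, 1234, 80, 1000, 5, "hello" };
    struct seg b = { 1, 2, 1234, 80, 1005, 6, " world" };
    if (coalesce(&a, &b))
        printf("merged: seq=%u len=%u \"%s\"\n",
               (unsigned)a.seq, (unsigned)a.len, a.payload);
    return 0;
}
```

The point is simply that many small per-packet operations collapse into one operation on a larger segment, which is where the CPU cycles are saved.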

  • doing this per packet in software has high overhead
    • and the general-purpose stack is built around extremely general, by now outdated constraints

Dennard scaling

Despite these scaling rules, CPU performance has not increased much in recent years. Performance is now delivered by specialization instead.

Multicore scalability

  • minimize synchronization, shared data structures and cache pollution

    • shared data are bad
      • but lock/unlock also wastes a lot of time and doesn’t take advantage of the cores left waiting (see the per-CPU sketch right after this list)
  • cpu0 usually takes charge of most of the interrupts

    • so that one CPU gets interrupted all the time
  • by default the NIC’s packets go through a single queue

    • when several CPU cores try to process that queue they have to synchronize on it (mutexes, semaphores, …) and scalability breaks
  • solution? Multi-queue

    • policies
      • random
        • but it has no cache locality and no per-connection management: with two connections the packets get scattered across cores and it becomes a mess
      • receive side scaling
        • packets belonging to the same connection are processed on the same core
        • a tcp flow (connection) is identified by the 4-tuple: src IP, dst IP, src port, dst port
        • cache locality
          • the connection’s data is likely already in that core’s cache
        • this is a good solution
      • RPS
      • RFS
      • XPS
  • RSS: Receive Side Scaling - is implemented in hardware and hashes some bytes of each packet (“hash function over the network and/or transport layer headers— for example, a 4-tuple hash over IP addresses and TCP ports of a packet”). Implementations differ: some may not hash the most useful bytes or may be limited in other ways. This classification and queue distribution is fast (only a few additional cycles are needed in hardware to classify a packet), but it is not portable across network cards, and it may not work with tunneled packets or some rare protocols. And sometimes the hardware simply doesn’t support enough queues to give one queue per logical CPU core. (A toy version of the 4-tuple hashing is sketched below.)
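
The per-CPU sketch promised in the first bullet of this list: instead of one shared, lock-protected counter that every core fights over, each core writes to its own cache-line-padded slot, and only the occasional read-out walks all slots. This is a toy user-space illustration of the pattern (compile with -pthread); NCPU, the 64-byte padding and the pretend “pinning” of threads to CPUs are assumptions of the example, not how the kernel implements its per-CPU state.

```c
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define NCPU  4
#define ITERS 1000000

/* One counter per core, padded to its own cache line so cores never
 * write to the same line: no lock, no cache-line bouncing. */
struct percpu { uint64_t packets; char pad[64 - sizeof(uint64_t)]; };
static struct percpu stats[NCPU];

static void *worker(void *arg)
{
    int cpu = (int)(long)arg;          /* pretend this thread is pinned to "cpu" */
    for (int i = 0; i < ITERS; i++)
        stats[cpu].packets++;          /* private slot: no synchronization needed */
    return NULL;
}

int main(void)
{
    pthread_t t[NCPU];
    for (long c = 0; c < NCPU; c++)
        pthread_create(&t[c], NULL, worker, (void *)c);
    for (int c = 0; c < NCPU; c++)
        pthread_join(t[c], NULL);

    uint64_t total = 0;                /* the only "shared" step is the rare read-out */
    for (int c = 0; c < NCPU; c++)
        total += stats[c].packets;
    printf("total packets counted: %llu\n", (unsigned long long)total);
    return 0;
}
```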

RSS should be enabled when latency is a concern or whenever receive interrupt processing forms a bottleneck. Spreading load between CPUs decreases queue length.
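
The toy hash sketch mentioned in the RSS bullet above: hash the 4-tuple and take it modulo the number of queues, so every packet of a connection lands on the same queue (and hence the same core). Real RSS uses a Toeplitz hash with a configurable key and an indirection table inside the NIC; the FNV-1a hash and NQUEUES here are assumptions purely for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define NQUEUES 4   /* assumed number of RX queues, one per core */

struct flow { uint32_t src_ip, dst_ip; uint16_t src_port, dst_port; };

/* Toy stand-in for the NIC's hash: FNV-1a over the 4-tuple bytes. */
static uint32_t hash_flow(const struct flow *f)
{
    uint32_t h = 2166136261u;
    const uint8_t *p = (const uint8_t *)f;
    for (size_t i = 0; i < sizeof(*f); i++) {
        h ^= p[i];
        h *= 16777619u;
    }
    return h;
}

int main(void)
{
    /* Two packets of the same connection, one of a different connection. */
    struct flow a1 = { 0x0a000001, 0x0a000002, 40000, 80 };
    struct flow a2 = a1;                      /* same 4-tuple, later packet */
    struct flow b  = { 0x0a000001, 0x0a000002, 40001, 80 };

    printf("flow a, packet 1 -> queue %u\n", (unsigned)(hash_flow(&a1) % NQUEUES));
    printf("flow a, packet 2 -> queue %u\n", (unsigned)(hash_flow(&a2) % NQUEUES));
    printf("flow b           -> queue %u\n", (unsigned)(hash_flow(&b)  % NQUEUES));
    return 0;
}
```

Running it, the two packets of flow a print the same queue id, while flow b (different source port) may land on another queue.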

  • Receive Packet Steering (RPS) “is logically a software implementation of RSS. Being in software, it is necessarily called later in the datapath.” So this is the software alternative to hardware RSS (it still parses some bytes and hashes them into a queue id), useful when the hardware has no RSS, when you want to classify on more complex rules than the hardware can, or when the protocol can’t be parsed by the HW RSS classifier. But RPS uses more CPU resources and adds inter-CPU traffic.

RPS has some advantages over RSS: 1) it can be used with any NIC, 2) software filters can easily be added to hash over new protocols, 3) it does not increase hardware device interrupt rate (although it does introduce inter-processor interrupts (IPIs)).
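
A minimal sketch of how RPS is turned on in practice: the kernel exposes a per-receive-queue CPU bitmap under sysfs, and writing a hex mask to it selects which CPUs may process packets from that queue. The rps_cpus path is the standard one from the kernel scaling documentation; the device name eth0 and the mask value 0xf (CPUs 0-3) are assumptions for the example, and writing the file requires root.

```c
#include <stdio.h>

int main(void)
{
    /* Per-RX-queue RPS CPU mask (documented sysfs knob). */
    const char *path = "/sys/class/net/eth0/queues/rx-0/rps_cpus";
    FILE *f = fopen(path, "w");
    if (!f) {
        perror(path);
        return 1;
    }
    fprintf(f, "f\n");   /* hex bitmap of CPUs allowed to process this queue */
    fclose(f);
    return 0;
}
```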

  • RFS: Receive Flow Steering builds on RPS (so it is also a software mechanism with extra CPU overhead), but instead of just hashing into a pseudo-random queue id it takes “into account application locality”, so packet processing will likely be faster thanks to better locality. The flow is steered towards the CPU where the thread that will consume the received data is running, and packets are delivered to that core.

The goal of RFS is to increase datacache hitrate by steering kernel processing of packets to the CPU where the application thread consuming the packet is running. RFS relies on the same RPS mechanisms to enqueue packets onto the backlog of another CPU and to wake up that CPU. … In RFS, packets are not forwarded directly by the value of their hash, but the hash is used as index into a flow lookup table. This table maps flows to the CPUs where those flows are being processed.
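
A toy model of the “hash used as index into a flow lookup table” idea, with made-up names throughout: when the application thread reads from its socket we record which CPU it ran on, and when a packet of that flow arrives we look that CPU up instead of using the hash directly as a queue id. The real kernel table is sized by rps_sock_flow_entries (with per-queue rps_flow_cnt) and also has to handle thread migration and out-of-order delivery, which this sketch ignores.

```c
#include <stdint.h>
#include <stdio.h>

#define TABLE_SIZE 256   /* assumed size for the sketch */

/* flow hash -> CPU where the consuming application thread last ran */
static int flow_table[TABLE_SIZE];

/* Conceptually called when an application thread reads from a socket:
 * remember which CPU that thread is on. */
static void record_flow(uint32_t flow_hash, int cpu)
{
    flow_table[flow_hash % TABLE_SIZE] = cpu;
}

/* Conceptually called when a packet arrives: look up the CPU that
 * wants this flow instead of hashing straight to a queue (RPS). */
static int steer_packet(uint32_t flow_hash)
{
    return flow_table[flow_hash % TABLE_SIZE];
}

int main(void)
{
    uint32_t some_flow = 0xdeadbeef;
    record_flow(some_flow, 3);                 /* app thread runs on CPU 3 */
    printf("packet of this flow goes to CPU %d\n",
           steer_packet(some_flow));           /* -> 3 */
    return 0;
}
```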

  • Accelerated RFS - RFS with hardware support (check whether your network driver implements ndo_rx_flow_steer). “Accelerated RFS is to RFS what RSS is to RPS: a hardware-accelerated load balancing mechanism that uses soft state to steer flows based on where the application thread consuming the packets of each flow is running.”

There is a similar method for the transmit side (the packet is already generated and ready to be sent; the question is just which queue to send it on, which also makes post-processing such as freeing the skb easier):

  • XPS: Transmit Packet Steering: “a mapping from CPU to hardware queue(s) is recorded.

The goal of this mapping is usually to assign queues exclusively to a subset of CPUs, where the transmit completions for these queues are processed on a CPU within this set”
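
As with RPS, XPS is configured by writing a CPU bitmap to sysfs, this time per transmit queue. A minimal sketch, assuming the device is eth0, that queue tx-0 should be used by CPUs 0 and 1, and that the program runs as root; the xps_cpus path itself is the documented knob.

```c
#include <stdio.h>

int main(void)
{
    /* Per-TX-queue XPS CPU mask (documented sysfs knob). */
    const char *path = "/sys/class/net/eth0/queues/tx-0/xps_cpus";
    FILE *f = fopen(path, "w");
    if (!f) {
        perror(path);
        return 1;
    }
    fprintf(f, "3\n");   /* hex bitmap: 0x3 = CPU0 | CPU1 */
    fclose(f);
    return 0;
}
```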