Increasing Performance and Scalability of Linux and Unix-Like OS Networking Stacks

Kernel Networking Subsystems

The Linux and Unix networking stack consists of several key components in the kernel that handle the sending, receiving, forwarding, and routing of network packets. Optimizing the performance of these subsystems is critical for achieving high throughput and low latency network communication.

TCP/IP Stack

The Linux TCP/IP stack implements core protocols such as TCP, UDP, IP, ICMP, and IGMP in kernel space for high-performance packet processing. Important TCP parameters like window scaling, selective acknowledgements, and the congestion control algorithm impact throughput and latency. TCP buffer sizes, backlog queues, and timeout values also require tuning for optimal data transfer given network and application characteristics.
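
As a quick illustration, the current and available congestion control algorithms and per-connection TCP state can be inspected with standard tools (output details will vary by kernel version and system):

# Show the active and available congestion control algorithms
sysctl net.ipv4.tcp_congestion_control
sysctl net.ipv4.tcp_available_congestion_control

# Inspect per-connection TCP state, including window sizes and retransmit counters
ss -ti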

Network Drivers

Network drivers enable the operating system to communicate with NIC hardware like Ethernet adapters. Performance depends on efficiently transferring packets between the NIC and system memory and minimizing interrupts and kernel transitions. Features like TCP Segmentation Offload (TSO), Generic Receive Offload (GRO), multi-queue, and busy polling reduce CPU usage. Driver parameter tuning may be required depending on bandwidth, connectivity, and workload patterns.
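
For example, offload features can be inspected and toggled per interface with ethtool (eth0 below is a placeholder for the actual device name, and support depends on the driver):

# List current offload settings for an interface
ethtool -k eth0

# Enable generic receive offload if the driver supports it
ethtool -K eth0 gro on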

Packet Forwarding

Routers and load balancers must forward packets quickly between network interfaces. Table-based software techniques such as routing caches use rule sets and lookup algorithms to swiftly determine the outbound interface for each packet. Flow-based hashing distributes load evenly across paths and CPUs. Caching routes and Network Address Translation (NAT) entries also improves forwarding speed. Optimal forwarding configurations prevent compute resources from becoming bottlenecks under heavy traffic loads.
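
A minimal sketch of enabling and inspecting forwarding state on a Linux router (the conntrack utility is assumed to be installed for viewing NAT entries):

# Enable IPv4 packet forwarding
sysctl -w net.ipv4.ip_forward=1

# Inspect the routing table used for forwarding decisions
ip route show

# List connection tracking entries used by NAT
conntrack -L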

Common Bottlenecks

While there are many possible network performance bottlenecks in Linux and Unix, three prevalent issues include high CPU usage, insufficient memory, and disk I/O contention.

High CPU Usage

Excessive kernel and user space CPU usage during network transfers leads to variable latency and limited throughput as packets are processed and forwarded more slowly than they arrive. Common culprits include suboptimal TCP socket parameters, inefficient application data parsing, expensive cryptographic protocols, and the overhead of unnecessary kernel network features.
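
Standard tools can confirm whether packet processing is CPU bound; for example (assuming the sysstat package is installed):

# Per-CPU utilization, including software interrupt (%soft) time
mpstat -P ALL 1

# Count of network receive/transmit softirqs handled per CPU
grep -E 'NET_RX|NET_TX' /proc/softirqs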

Insufficient Memory

Network data processing requires memory for socket buffers, kernel queues, protocol metadata, and packet payloads. Running out of memory increases kernel memory reclamation frequency which adds heavyweight processing. Inadequate memory also forces packets into slower datapaths. Optimizing memory allocation across networking subsystems prevents transient memory starvation or outages.
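
Socket and TCP memory consumption can be checked against configured limits, for example:

# Current socket memory usage (including pages consumed by TCP)
cat /proc/net/sockstat

# Kernel-wide TCP memory thresholds: low, pressure, and maximum (in pages)
sysctl net.ipv4.tcp_mem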

Disk I/O Contention

Kernel logging, monitoring, storage networking, and virtual machine hypervisors can trigger disk access during packet processing and forwarding routines. Contending with disk I/O lengthens networking data paths dramatically. Separating disk workloads from packet processing wherever possible avoids injecting storage latency into networking operations.
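
Disk pressure during network activity can be spotted with the sysstat utilities, for example:

# Per-device I/O utilization and wait times
iostat -x 1

# Per-process disk read/write rates, to identify competing writers such as loggers
pidstat -d 1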

Optimizing the Network Stack

The Linux and Unix networking stack exposes many adjustable parameters and features that can be tuned and customized to improve performance for specific connectivity environments and workloads.

Tune TCP Parameters

Increasing Linux TCP socket buffers, backlog sizes, and various protocol timers allows for higher throughput and more concurrent connections. TCP window scaling and selective acknowledgement capabilities should be enabled for long-distance and high-latency links. Congestion control algorithms like BBR seek to maximize bandwidth and minimize latency simultaneously.
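
A minimal sketch of enabling these options with sysctl (BBR requires the tcp_bbr module to be available in the running kernel):

# Enable window scaling and selective acknowledgements
sysctl -w net.ipv4.tcp_window_scaling=1
sysctl -w net.ipv4.tcp_sack=1

# Switch the congestion control algorithm to BBR
sysctl -w net.ipv4.tcp_congestion_control=bbr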

Enable Busy Polling

Busy polling mode reduces context switching and interrupt overhead by allowing network device drivers and the kernel stack to poll and spin-wait for incoming packets. This achieves lower latency at the cost of higher CPU usage. Adaptive busy polling toggles this mode on and off based on traffic patterns.
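
Busy polling can be enabled globally through sysctl; the values below are illustrative microsecond budgets, not recommendations:

# Microseconds to busy-poll on blocking socket reads
sysctl -w net.core.busy_read=50

# Microseconds to busy-poll in poll() and select()
sysctl -w net.core.busy_poll=50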

Increase Socket Buffers

Expanding TCP socket buffers and queues allows more data to be buffered, enabling fast senders to transmit more before stalling on acknowledgements. This is critical for long fat pipes, where high bandwidth combines with long distance and high latency. Receive and send buffer sizes can be increased independently as required.
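
Buffer requirements can be estimated from the bandwidth-delay product; for example, a 1 Gbit/s path with a 100 ms round-trip time needs roughly 12.5 MB of buffering to keep the pipe full:

# Bandwidth-delay product: 1 Gbit/s x 0.1 s = 100 Mbit = 12.5 MB
# Raise the maximum receive and send buffer sizes to cover it (values in bytes)
sysctl -w net.core.rmem_max=13107200
sysctl -w net.core.wmem_max=13107200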

Disable Unnecessary Features

Many default Linux kernel features are not required by all networking users and can impose unnecessary overhead at high packet rates. IPv6, IPsec, and multicast can be disabled selectively in many environments. Firewall rules, traffic shaping, and virtual network functions similarly introduce overhead.
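
For instance, IPv6 can be turned off via sysctl on hosts that do not use it, and active firewall rules can be reviewed before pruning them (requirements vary widely per environment):

# Disable IPv6 on all interfaces
sysctl -w net.ipv6.conf.all.disable_ipv6=1

# Review currently loaded firewall rules before removing unneeded ones
nft list ruleset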

Load Balancing and Scaling Out

Distributing networking load across cores, sockets, servers, and clusters allows for improved performance, resilience, and scale. Multiple techniques exist to increase networking capacity using common hardware and software capabilities.

Add More Network Interfaces

Adding network cards and ports increases potential bandwidth, I/O concurrency, and availability using familiar NICs and drivers. Bonding groups these interfaces into a single logical interface while still tolerating individual interface failures.

Bond NICs for Increased Throughput

Ethernet channel bonding aggregates multiple physical NICs into a larger logical interface multiplying available network bandwidth. This also provides high availability with failover and load balancing across member ports. Bonding modes optimize for throughput, fault tolerance, or load distribution.
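
A minimal bonding sketch using iproute2, assuming two member interfaces named eth0 and eth1 and a switch that supports 802.3ad link aggregation (persistent configuration is normally done through the distribution's network configuration files):

# Create the bond and set the aggregation mode
ip link add bond0 type bond mode 802.3ad

# Enslave the physical interfaces (they must be down first)
ip link set eth0 down && ip link set eth0 master bond0
ip link set eth1 down && ip link set eth1 master bond0

# Bring up the bond and assign an address
ip link set bond0 up
ip addr add 192.168.1.10/24 dev bond0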

Distribute Load with IPVS/LVS

IPVS provides in-kernel load balancing using direct routing, IP tunneling, and NAT forwarding methods. This enables even spreading of flows across many real servers that provide back-end application services. Scheduling algorithms and persistence rules allow flexible distribution.

Horizontal Scaling with Containers

Docker, Kubernetes, and OpenShift container orchestration clusters run replicated application instances across many servers. Containers encapsulate network services for portability across physical hosts while integrating with native OS networking. Auto-scaling adds or removes container replicas automatically based on load.
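
As a brief example, Kubernetes can scale a hypothetical deployment named web automatically based on CPU utilization:

# Scale the web deployment between 3 and 20 replicas, targeting 70% CPU
kubectl autoscale deployment web --cpu-percent=70 --min=3 --max=20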

Monitoring and Benchmarking

Carefully tracking networking performance metrics and analyzing packet flows allows informed optimization of Linux and Unix networking stacks. Benchmarking using traffic generators validates changes at scale.

Track Network Metrics

Operating system and application monitoring tools record metrics like bandwidth, latency, jitter, errors, drops, CPU usage, memory, and more. Trending these KPIs locates emerging bottlenecks during installations and upgrades. Dashboards quickly highlight deviations from baselines.
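
Commonly available tools capture many of these metrics directly (assuming the sysstat and iproute2 packages are installed; eth0 is a placeholder):

# Per-interface throughput and packet rates
sar -n DEV 1

# Cumulative interface counters including errors and drops
ip -s link show eth0

# Socket-level summary of open connections
ss -s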

Profile Packet Flows

Kernel probes and packet sniffers offer visibility into data paths showing software and hardware packet handling in detail. This pinpoints non-performant code paths, drivers, tunables, and hardware. Static and dynamic tracing interprets complex control flow through kernels and applications during processing.
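
A simple starting point is packet capture combined with system-wide CPU profiling; the interface name and duration below are placeholders:

# Capture packets on one interface to a file for offline analysis
tcpdump -i eth0 -w capture.pcap

# Sample kernel and user stacks system-wide for 10 seconds, then inspect hot paths
perf record -a -g -- sleep 10
perf report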

Stress Test Configuration Changes

Applying controlled overload to a modified stack validates its headroom and limits using saturated load patterns. Testing reveals breaking points before production deployment, while changes can still be rolled back easily. Tests determine peak capacity and resource constraints to specify safety margins or over-provisioning requirements.
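
iperf3 is a common way to saturate a link between two hosts; the server address and stream count here are examples only:

# On the server under test
iperf3 -s

# On the load generator: 8 parallel streams for 60 seconds
iperf3 -c 192.168.1.100 -P 8 -t 60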

Simulate Real-World Load

Traffic generators and load injectors mimic heterogeneous client behavior by modelling application protocols and usage profiles at large scale. This approximates real workloads so that metrics can be analyzed under realistic conditions. Crowd-sourced data provides statistical distributions for accurate simulations.
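
For HTTP services, a load generator such as wrk can approximate many concurrent clients (the thread count, connection count, and URL below are illustrative):

# 8 threads, 400 open connections, for 60 seconds against a test endpoint
wrk -t8 -c400 -d60s http://192.168.1.100/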

Example Configuration Code Snippets

Linux and Unix provide many tunables and options for customizing network performance. Here are some common examples.

Sample sysctl.conf Settings

# Increase TCP max buffer size
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216

# Increase Linux autotuning TCP buffer limits
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216

# Increase backlog queues 
net.core.netdev_max_backlog = 50000
net.ipv4.tcp_max_syn_backlog = 30000
net.ipv4.tcp_max_tw_buckets = 2000000
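
Once added to /etc/sysctl.conf, the settings can be applied without a reboot:

# Reload settings from /etc/sysctl.conf
sysctl -p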

Network Driver Module Parameters

# Enable TCP segmentation offload
ethtool -K interface-name tso on

# Expand ring buffer sizes
ethtool -G interface-name rx 4096 tx 4096

# Increase number of RSS queues
ethtool -L interface-name combined 8

IPVS Load Balancer Example

# Create load balancer
ipvsadm -A -t 192.168.1.100:80 -s wrr

# Add two real servers
ipvsadm -a -t 192.168.1.100:80 -r 192.168.1.200:80 -m -w 1
ipvsadm -a -t 192.168.1.100:80 -r 192.168.1.201:80 -m -w 1

# Verify config
ipvsadm -ln
