Optimizing Linux and Unix-Like OS for Next-Generation Cloud-Native Workloads

Enabling Kernel Features for Cloud Workloads

The Linux kernel and other Unix-like kernels offer advanced control group (cgroup) functionality for limiting, prioritizing, accounting for, and isolating CPU, memory, disk I/O, network, and other resources per workload. Configuring cgroups allows setting resource limits, guaranteeing resource minimums, dividing resources proportionally, and defining hard and soft caps per workload. Namespaces provide workload isolation across PID, network, mount, user, IPC, UTS, and other resource views. Tuning schedulers such as CFS, deadline, and round-robin, together with optimized timers, allows timing-sensitive workloads to achieve low latency.

Configuring control groups (cgroups) for resource limitation

Control groups allow admins to allocate resources such as CPU time, system memory, network bandwidth, or combinations of these resources for user-defined groups of tasks running on a system. Limiting resources per cgroup prevents any single workload from overutilizing resources and affecting other workloads on the shared infrastructure. Cgroups enable use cases like prioritizing latency-sensitive workloads, minimizing noisy neighbor issues, and cost attribution for shared clusters.

Key cgroup subsystems and capabilities include:

  • cpu – Limit CPU utilization per cgroup.
  • cpuset – Assign individual CPUs and memory nodes to workloads.
  • devices – Allow or deny access to devices like GPUs.
  • memory – Limit memory usage per workload.
  • blkio – Limit block device I/O bandwidth per cgroup.

Cgroups apply resource limits and controls recursively to child cgroups and processes. Workloads can be organized logically into a hierarchical tree aligning to org structures and resource requirements.
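
For example, on a host using the unified cgroup v2 hierarchy, a workload group can be created and capped directly through the filesystem interface. The group name webapp and the limit values below are illustrative, not recommendations:

mkdir /sys/fs/cgroup/webapp
echo "+cpu +memory +cpuset" > /sys/fs/cgroup/cgroup.subtree_control   # enable controllers for child groups
echo "200000 100000" > /sys/fs/cgroup/webapp/cpu.max     # at most 2 CPUs worth of time per 100ms period
echo "4G" > /sys/fs/cgroup/webapp/memory.max             # hard memory cap
echo "2G" > /sys/fs/cgroup/webapp/memory.high            # soft cap where reclaim throttling begins
echo "0-3" > /sys/fs/cgroup/webapp/cpuset.cpus           # restrict the group to CPUs 0-3
echo $$ > /sys/fs/cgroup/webapp/cgroup.procs             # move the current shell into the group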

Enabling namespaces for workload isolation

Namespaces isolate and virtualize system resources per process, creating logical “containers”. They partition resources so that each workload runs as if on a separate system, oblivious to other workloads. Key namespaces include:

  • PID – Isolated process ID number space.
  • Network – Unique network devices, ports, IPs, routes, and firewalls.
  • Mount – Separate mount points and filesystems.
  • User – Maps UIDs differently per namespace.
  • IPC – Own IPC resources like System V IPC and POSIX message queues.

For cloud workloads running in Docker containers, Kubernetes pods, and other isolation technologies, namespaces are integral to containing workloads and preventing them from interfering with each other and with the underlying host OS. Together with cgroups, namespaces provide the logical isolation that containers depend on.
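
As a quick illustration, the unshare utility from util-linux can launch a shell inside fresh namespaces without any container runtime (root privileges or user namespaces are assumed):

unshare --pid --fork --mount-proc --net --uts --ipc bash
# Inside the new namespaces, only this shell's process tree is visible
ps aux
# The new network namespace starts with only an isolated loopback device
ip link show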

Tuning schedulers and timers for low latency

The Linux kernel supports multiple scheduling policies: CFS (the default SCHED_OTHER class), deadline (SCHED_DEADLINE), round-robin (SCHED_RR), and FIFO (SCHED_FIFO). CFS is designed for fair resource sharing, while the real-time and deadline policies prioritize latency-sensitive workloads.

Key low latency tuning techniques include:

  • Assigning separate cpusets for latency-sensitive workloads.
  • Using the deadline scheduler to guarantee execution windows.
  • Increasing the timer frequency (CONFIG_HZ) from the common 100 or 250 Hz defaults to 1000 Hz.
  • Pinning interrupt handlers to housekeeping cores.
  • Binding NIC Rx/Tx queues to separate cores.
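
A minimal sketch combining several of these techniques is shown below; the core numbers, priority, IRQ number, and program names are illustrative and assume cores 2-3 were reserved with the isolcpus=2,3 kernel parameter:

taskset -c 2-3 chrt --fifo 50 ./latency-critical-app    # pin to the isolated cores with SCHED_FIFO priority 50
chrt --deadline --sched-runtime 5000000 \
     --sched-deadline 10000000 --sched-period 10000000 0 ./periodic-task   # 5ms of CPU every 10ms via SCHED_DEADLINE
echo 0-1 > /proc/irq/24/smp_affinity_list               # keep IRQ 24 on the housekeeping cores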

Production deployments should isolate scheduling domains, reserve capacity, and carefully tune timers to ensure consistent performance for telemetry, databases, distributed queues, web services, and other low-latency workloads running on shared infrastructure.

Deploying Optimized Filesystems

The filesystem provides critical data access and storage capabilities to cloud-native workloads. Choosing the right filesystem and tuning its mount options optimizes performance for the expected workload patterns. Most latency-sensitive workloads benefit from filesystems with faster metadata operations.

Leveraging copy-on-write filesystems like Btrfs and ZFS

Copy-on-write (or COW) filesystems like Btrfs and ZFS minimize locking overheads by directing writes to unused block areas instead of overwriting existing blocks. This optimizes concurrent updates common for server workloads like databases, media processing, web assets, etc. Additionally, native capabilities like multi-disk pools, snapshots, compression, and checksums make these next-generation filesystems ideal for cloud workloads.
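
For instance, a mirrored ZFS pool with transparent compression, or a Btrfs snapshot taken before a deployment, takes only a few commands; the device names, pool name, and paths below are illustrative:

zpool create datapool mirror /dev/sdb /dev/sdc          # two-disk mirrored ZFS pool
zfs set compression=lz4 datapool                        # enable transparent LZ4 compression
btrfs subvolume snapshot /data /data/.snap-pre-deploy   # instant copy-on-write snapshot of a subvolume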

Mounting with noatime and nodiratime

File access and modification timestamps impose overhead on filesystems for production workloads that constantly create, update, and delete files. Mount options like noatime and nodiratime disable access-time updates, which avoids unnecessary metadata writes (on Linux, noatime implies nodiratime). Additionally, turning off the barrier mount option reduces journal flush operations and increases throughput, at the cost of weaker crash consistency.

Setting optimized mount options

Common optimized mount options include:

  • nobarrier – Disable write barriers (forced cache flushes on journal commits).
  • noatime – Disable updating inode access times.
  • nodiratime – Disable access times for directories.
  • largeio – Optimize for large I/O sequential access.

For spinning media on ext4, the inode_readahead_blks mount option raises the inode table readahead window from its small default up to 2048 blocks, as shown in the fstab example below. For applications opening many small files, like web servers, increasing inode cache sizes and hash table sizes prevents inode contention.
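
These options can also be applied persistently in /etc/fstab; the device, mount point, and values below are illustrative:

/dev/vdb1  /data  ext4  noatime,nodiratime,inode_readahead_blks=2048  0 2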

Tuning Networking Stack for High Throughput

The rapid growth of east-west network traffic between cloud-native workloads running within and across data centers has spotlighted bottlenecks and deficiencies in legacy networking stacks designed decades ago. Improvements across socket buffers, the TCP stack, and userspace networking substantially increase cloud workload network throughput and support massive scale.

Increasing backlog sizes for high connection counts

Applications like load balancers, reverse proxies, and application frameworks accepting thousands of connections run into backlog ceilings. The historical listen backlog default of 128 (raised to 4096 in recent kernels) proves inadequate and results in dropped connections. Raising net.core.somaxconn via sysctl to 32768 or higher, along with the application's own listen() backlog, prevents TCP listen queue exhaustion, as shown below.
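
The relevant limits can be raised at runtime with sysctl and persisted under /etc/sysctl.d/; the values below are common starting points rather than universal recommendations:

sysctl -w net.core.somaxconn=32768             # maximum accept queue length per listening socket
sysctl -w net.ipv4.tcp_max_syn_backlog=32768   # pending half-open (SYN) connections
sysctl -w net.core.netdev_max_backlog=65536    # packets queued per CPU before the kernel drops them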

Enabling TCP optimizations like TSO and LRO

Offloading techniques including TCP segmentation offload (TSO) and large receive offload (LRO) move packet processing from the kernel to NIC hardware. This saves precious CPU cycles for the enormous bi-directional data flows typical of east-west traffic between cloud workloads. Jumbo frames further reduce transport overhead and increase throughput.
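These offloads and jumbo frames are toggled per interface with ethtool and ip; the interface name eth0 is illustrative, and supported features should be checked first with ethtool -k eth0:

ethtool -K eth0 tso on gso on gro on lro on   # enable segmentation and receive offloads
ip link set dev eth0 mtu 9000                 # jumbo frames; the switch fabric must also carry 9000-byte MTUs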

Switching to high-performance userspace networking

Kernel-bypass technologies like DPDK, netmap, and AF_XDP (with SPDK playing the same role for storage) pass packets directly from the NIC to userspace networking stacks implemented inside applications such as databases, distributed queues, SDN controllers, and virtual switches. This avoids kernel bottlenecks altogether and achieves extremely low-latency packet handling not possible in kernel space.
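
A typical setup step, sketched below, detaches the NIC from its kernel driver and binds it to vfio-pci so a DPDK application can poll it directly; the PCI address and hugepage count are illustrative, and dpdk-devbind.py ships with DPDK:

modprobe vfio-pci
echo 1024 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages   # reserve 2MB hugepages for packet buffers
dpdk-devbind.py --bind=vfio-pci 0000:03:00.0                         # hand the NIC over to userspace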

Monitoring Resource Usage and Contention

The ephemeral and fluctuating nature of cloud workloads running on shared infrastructure demands robust monitoring to track resource usage, identify hotspots, diagnose performance issues, and alert on contention. Modern tools built for cloud environments provide deep visibility that unlocks optimization opportunities.

Setting up metrics with Prometheus

As a popular open-source monitoring solution, Prometheus uses a time-series database optimized for aggregating metrics scraped from instrumented jobs. It integrates cleanly with Kubernetes and other cloud-native tooling, making visualization and alerting on resource usage metrics accessible through its query language, PromQL.
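
A minimal workflow, assuming node_exporter and a Prometheus server running on their default ports, exposes host metrics and queries CPU usage through the HTTP API; the PromQL expression is just an example:

./node_exporter &       # exposes host metrics at http://localhost:9100/metrics
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=rate(node_cpu_seconds_total{mode!="idle"}[5m])'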

Tracing code with eBPF

Extended BPF (eBPF) provides dynamic instrumentation embedded in the Linux kernel to trace events including scheduling, networking, and syscalls. Attaching eBPF programs to cgroups, TCP sockets, and applications traces detailed code execution flows across the software stack. eBPF safely executes sandboxed programs in-kernel with little overhead.
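
Tools such as bpftrace make this accessible as one-liners; both examples below require root and a kernel with BPF tracing support:

bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'   # count syscalls per process until Ctrl-C
bpftrace -e 'kprobe:tcp_retransmit_skb { @retransmits = count(); }'      # count TCP retransmissions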

Detecting bottlenecks with flamegraphs

Flamegraphs provide visualization of profiled software, helping identify hot code paths and resource contention down to the stack frame level. Common profilers like perf and eBPF attach to live programs keeping overhead negligible. The folded stack output gets rendered as interactive flamegraphs making performance analysis intuitive even for complex software.
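
A typical CPU flamegraph workflow uses perf together with the stackcollapse-perf.pl and flamegraph.pl scripts from Brendan Gregg's FlameGraph repository (assumed to be in the working directory; the PID is illustrative):

perf record -F 99 -g -p 1234 -- sleep 30     # sample stacks at 99 Hz for 30 seconds
perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > cpu-flamegraph.svg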

Securing Workloads Across Dev and Production

Restricting access, hardening configurations, addressing vulnerabilities, and protecting secrets prove challenging for workloads spanning development, test, staging, and production environments. Security automation and policy engines simplify applying consistent controls to provide a robust foundation for cloud workloads.

Integrating SELinux policies

SELinux provides fine-grained, policy-driven mandatory access control for hardening Linux against compromised or malicious workloads. Well-defined policies confine workloads following least-privilege principles, reducing their blast radius. Distributions ship predefined policies tuned for common containers, web apps, databases, and similar services, maximizing workload security.
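
For example, with the container-selinux policy package installed, a host directory can be relabeled for container access and a common boolean toggled; the path is illustrative:

semanage fcontext -a -t container_file_t "/srv/appdata(/.*)?"   # allow confined containers to use this path
restorecon -Rv /srv/appdata                                     # apply the new label
setsebool -P container_manage_cgroup on                         # let container processes manage cgroup settings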

Managing secrets with HashiCorp Vault

HashiCorp Vault centrally secures, stores, and tightly controls access to tokens, passwords, certificates, encryption keys, and other secrets across heterogeneous environments. Dynamic secrets integrate with apps and configuration management tools to restrict access with short-lived leases. Policy engines control secret access while features like audit logs provide visibility.
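
The Vault CLI illustrates the basic write/read flow against the KV secrets engine; this assumes VAULT_ADDR and a valid token are already set, and the paths and values are illustrative:

vault kv put secret/myapp/db username=appuser password='s3cr3t'
vault kv get -field=password secret/myapp/db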

Scanning images for vulnerabilities

Container base images and application image registries require continuous scanning to detect vulnerabilities like out-of-date packages, insecure configurations, malware, and other risks. Combining multiple scan tools broadens coverage for known and zero-day threats. Automating scan workflows as part of the CI/CD pipeline finds issues early before reaching production.
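
A scanner such as Trivy (one of several options) can gate a CI job on high-severity findings; the image reference below is illustrative:

trivy image --exit-code 1 --severity HIGH,CRITICAL registry.example.com/myapp:1.4.2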

Example Code for Mounting a Filesystem with Optimized Options


# ext4 data volume tuned for throughput; nobarrier trades crash consistency for speed
mount -t ext4 -o noatime,nodiratime,nobarrier,inode_readahead_blks=2048 /dev/vdb1 /data
