/*
* Pressure stall information for CPU, memory and IO
*
* Copyright (c) 2018 Facebook, Inc.
* Author: Johannes Weiner <hannes@cmpxchg.org>
*
* Polling support by Suren Baghdasaryan <surenb@google.com>
* Copyright (c) 2018 Google, Inc.
*
* When CPU, memory and IO are contended, tasks experience delays that
* reduce throughput and introduce latencies into the workload. Memory
* and IO contention, in addition, can cause a full loss of forward
* progress in which the CPU goes idle.
*
* This code aggregates individual task delays into resource pressure
* metrics that indicate problems with both workload health and
* resource utilization.
*
* Model
*
* The time in which a task can execute on a CPU is our baseline for
* productivity. Pressure expresses the amount of time in which this
* potential cannot be realized due to resource contention.
*
* This concept of productivity has two components: the workload and
* the CPU. To measure the impact of pressure on both, we define two
* contention states for a resource: SOME and FULL.
*
* In the SOME state of a given resource, one or more tasks are
* delayed on that resource. This affects the workload's ability to
* perform work, but the CPU may still be executing other tasks.
*
* In the FULL state of a given resource, all non-idle tasks are
* delayed on that resource such that nobody is advancing and the CPU
* goes idle. This leaves both workload and CPU unproductive.
*
* Naturally, the FULL state doesn't exist for the CPU resource at the
* system level, but exist at the cgroup level, means all non-idle tasks
* in a cgroup are delayed on the CPU resource which used by others outside
* of the cgroup or throttled by the cgroup cpu.max configuration.
*
* SOME = nr_delayed_tasks != 0
* FULL = nr_delayed_tasks != 0 && nr_running_tasks == 0
*
* The percentage of wallclock time spent in those compound stall
* states gives pressure numbers between 0 and 100 for each resource,
* where the SOME percentage indicates workload slowdowns and the FULL
* percentage indicates reduced CPU utilization:
*
* %SOME = time(SOME) / period
* %FULL = time(FULL) / period
*
* Multiple CPUs
*
* The more tasks and available CPUs there are, the more work can be
* performed concurrently. This means that the potential that can go
* unrealized due to resource contention *also* scales with non-idle
* tasks and CPUs.
*
* Consider a scenario where 257 number crunching tasks are trying to
* run concurrently on 256 CPUs. If we simply aggregated the task
* states, we would have to conclude a CPU SOME pressure number of
* 100%, since *somebody* is waiting on a runqueue at all
* times. However, that is clearly not the amount of contention the
* workload is experiencing: only one out of 256 possible execution
* threads will be contended at any given time, or about 0.4%.
*
* Conversely, consider a scenario of 4 tasks and 4 CPUs where at any
* given time *one* of the tasks is delayed due to a lack of memory.
* Again, looking purely at the task state would yield a memory FULL
* pressure number of 0%, since *somebody* is always making forward
* progress. But again this wouldn't capture the amount of execution
* potential lost, which is 1 out of 4 CPUs, or 25%.
*
* To calculate wasted potential (pressure) with multiple processors,
* we have to base our calculation on the number of non-idle tasks in
* conjunction with the number of available CPUs, which is the number
* of potential execution threads. SOME becomes then the proportion of
* delayed tasks to possible threads, and FULL is the share of possible
* threads that are unproductive due to delays:
*
* threads = min(nr_nonidle_tasks, nr_cpus)
* SOME = min(nr_delayed_tasks / threads, 1)
* FULL = (threads - min(nr_running_tasks, threads)) / threads
*
* For the 257 number crunchers on 256 CPUs, this yields:
*
* threads = min(257, 256)
* SOME = min(1 / 256, 1) = 0.4%
* FULL = (256 - min(257, 256)) / 256 = 0%
*
* For the 1 out of 4 memory-delayed tasks, this yields:
*
* threads = min(4, 4)
* SOME = min(1 / 4, 1) = 25%
* FULL = (4 - min(3, 4)) / 4 = 25%
*
* [ Substitute nr_cpus with 1, and you can see that it's a natural
* extension of the single-CPU model. ]
*
* Implementation
*
* To assess the precise time spent in each such state, we would have
* to freeze the system on task changes and start/stop the state
* clocks accordingly. Obviously that doesn't scale in practice.
*
* Because the scheduler aims to distribute the compute load evenly
* among the available CPUs, we can track task state locally to each
* CPU and, at much lower frequency, extrapolate the global state for
* the cumulative stall times and the running averages.
*
* For each runqueue, we track:
*
* tSOME[cpu] = time(nr_delayed_tasks[cpu] != 0)
* tFULL[cpu] = time(nr_delayed_tasks[cpu] && !nr_running_tasks[cpu])
* tNONIDLE[cpu] = time(nr_nonidle_tasks[cpu] != 0)
*
* and then periodically aggregate:
*
* tNONIDLE = sum(tNONIDLE[i])
*
* tSOME = sum(tSOME[i] * tNONIDLE[i]) / tNONIDLE
* tFULL = sum(tFULL[i] * tNONIDLE[i]) / tNONIDLE
*
* %SOME = tSOME / period
* %FULL = tFULL / period