summaryrefslogtreecommitdiff
path: root/Documentation/admin-guide/pm
diff options
context:
space:
mode:
authorRafael J. Wysocki <rafael.j.wysocki@intel.com>2025-02-25 12:13:52 +0100
committerRafael J. Wysocki <rafael.j.wysocki@intel.com>2025-02-25 12:13:52 +0100
commitde585eac08b972c86faa060a77263af3d209ed24 (patch)
treef003a300b55ec190e1ee315c5e592972cb4e7f34 /Documentation/admin-guide/pm
parent5e7e39ae15b0ea370e783a9326fdd1d91357fc3e (diff)
parent5c350410999653dff8d2975d794088e4c166e8b5 (diff)
downloadlinux-de585eac08b972c86faa060a77263af3d209ed24.tar.gz
linux-de585eac08b972c86faa060a77263af3d209ed24.tar.bz2
linux-de585eac08b972c86faa060a77263af3d209ed24.zip
Merge branch 'cpuidle-menu'
This work had been triggered by a report that commit 0611a640e60a ("eventpoll: prefer kfree_rcu() in __ep_remove()") had caused the critical-jOPS metric of the SPECjbb 2015 benchmark [1] to drop by around 50% even though it generally reduced kernel overhead. Indeed, it was found during further investigation that the total interrupt rate while running the SPECjbb workload had fallen as a result of that commit by 55% and the local timer interrupt rate had fallen by almost 80%. That turned out to cause the menu cpuidle governor to select the deepest idle state supplied by the cpuidle driver (intel_idle) much more often which added significantly more idle state latency to the workload and that led to the decrease of the critical-jOPS score. Interestingly enough, this problem was not visible when the teo cpuidle governor was used instead of menu, so it appeared to be specific to the latter. CPU wakeup event statistics collected while running the workload indicated that the menu governor was effectively ignoring non- timer wakeup information and all of its idle state selection decisions appeared to be based on timer wakeups only. Thus, it appeared that the reduction of the local timer interrupt rate caused the governor to predict a idle duration much more often while running the workload and the deepest idle state was selected significantly more often as a result of that. A subsequent inspection of the get_typical_interval() function in the menu governor indicated that it might return UINT_MAX too often which then caused the governor's decisions to be based entirely on information related to timers. Generally speaking, UINT_MAX is returned by get_typical_interval() if it cannot make a prediction based on the most recent idle intervals data with sufficiently high confidence, but at least in some cases this means that useful information is not taken into account at all which may lead to significant idle state selection mistakes. Moreover, this is not really unlikely to happen. One issue with get_typical_interval() is that, when it eliminates outliers from the sample set in an attempt to reduce the standard deviation (and so improve the prediction confidence), it does that by dropping high-end samples only, while samples at the low end of the set are retained. However, the samples at the low end very well may be the outliers and they should be eliminated from the sample set instead of the high-end samples. Accordingly, the likelihood of making a meaningful idle duration prediction can be improved by making it also eliminate low-end samples if they are farther from the average than high-end samples. Another issue is that get_typical_interval() gives up after eliminating 1/4 of the samples if the standard deviation is still not as low as desired (within 1/6 of the average or within 20 us if the average is close to 0), but the remaining samples in the set still represent useful information at that point and discarding them altogether may lead to suboptimal idle state selection. For instance, the largest idle duration value in the get_typical_interval() data set is the maximum idle duration observed recently and it is likely that the upcoming idle duration will not exceed it. Therefore, in the absence of a better choice, this value can be used as an upper bound on the target residency of the idle state to select. * cpuidle-menu: cpuidle: menu: Update documentation after get_typical_interval() changes cpuidle: menu: Avoid discarding useful information cpuidle: menu: Eliminate outliers on both ends of the sample set cpuidle: menu: Tweak threshold use in get_typical_interval() cpuidle: menu: Use one loop for average and variance computations cpuidle: menu: Drop a redundant local variable
Diffstat (limited to 'Documentation/admin-guide/pm')
-rw-r--r--Documentation/admin-guide/pm/cpuidle.rst29
1 files changed, 17 insertions, 12 deletions
diff --git a/Documentation/admin-guide/pm/cpuidle.rst b/Documentation/admin-guide/pm/cpuidle.rst
index eb58d7a5affd..0c090b076224 100644
--- a/Documentation/admin-guide/pm/cpuidle.rst
+++ b/Documentation/admin-guide/pm/cpuidle.rst
@@ -275,20 +275,25 @@ values and, when predicting the idle duration next time, it computes the average
and variance of them. If the variance is small (smaller than 400 square
milliseconds) or it is small relative to the average (the average is greater
that 6 times the standard deviation), the average is regarded as the "typical
-interval" value. Otherwise, the longest of the saved observed idle duration
+interval" value. Otherwise, either the longest or the shortest (depending on
+which one is farther from the average) of the saved observed idle duration
values is discarded and the computation is repeated for the remaining ones.
+
Again, if the variance of them is small (in the above sense), the average is
taken as the "typical interval" value and so on, until either the "typical
-interval" is determined or too many data points are disregarded, in which case
-the "typical interval" is assumed to equal "infinity" (the maximum unsigned
-integer value).
-
-If the "typical interval" computed this way is long enough, the governor obtains
-the time until the closest timer event with the assumption that the scheduler
-tick will be stopped. That time, referred to as the *sleep length* in what follows,
-is the upper bound on the time before the next CPU wakeup. It is used to determine
-the sleep length range, which in turn is needed to get the sleep length correction
-factor.
+interval" is determined or too many data points are disregarded. In the latter
+case, if the size of the set of data points still under consideration is
+sufficiently large, the next idle duration is not likely to be above the largest
+idle duration value still in that set, so that value is taken as the predicted
+next idle duration. Finally, if the set of data points still under
+consideration is too small, no prediction is made.
+
+If the preliminary prediction of the next idle duration computed this way is
+long enough, the governor obtains the time until the closest timer event with
+the assumption that the scheduler tick will be stopped. That time, referred to
+as the *sleep length* in what follows, is the upper bound on the time before the
+next CPU wakeup. It is used to determine the sleep length range, which in turn
+is needed to get the sleep length correction factor.
The ``menu`` governor maintains an array containing several correction factor
values that correspond to different sleep length ranges organized so that each
@@ -302,7 +307,7 @@ to 1 the correction factor becomes (it must fall between 0 and 1 inclusive).
The sleep length is multiplied by the correction factor for the range that it
falls into to obtain an approximation of the predicted idle duration that is
compared to the "typical interval" determined previously and the minimum of
-the two is taken as the idle duration prediction.
+the two is taken as the final idle duration prediction.
If the "typical interval" value is small, which means that the CPU is likely
to be woken up soon enough, the sleep length computation is skipped as it may