| field | value | date |
|---|---|---|
| author | Linus Torvalds <torvalds@linux-foundation.org> | 2020-03-30 17:01:51 -0700 |
| committer | Linus Torvalds <torvalds@linux-foundation.org> | 2020-03-30 17:01:51 -0700 |
| commit | 642e53ead6aea8740a219ede509a5d138fd4f780 | |
| tree | 5c4680d0c07315dab24fe7333c62f56bc19ec4e4 | |
| parent | 9b82f05f869a823d43ea4186f5f732f2924d3693 | |
| parent | 313f16e2e35abb833eab5bdebc6ae30699adca18 | |
Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
"The main changes in this cycle are:
- Various NUMA scheduling updates: harmonize the load-balancer and
NUMA placement logic to not work against each other. The intended
result is better locality, better utilization and fewer migrations.
- Introduce Thermal Pressure tracking and optimizations, to improve
task placement on thermally overloaded systems.
- Implement frequency invariant scheduler accounting on (some) x86
CPUs. This is done by observing and sampling the 'recent' CPU
frequency average at ~tick boundaries. The CPU provides this data
via the APERF/MPERF MSRs. This hopefully makes our capacity
estimates more precise and keeps tasks on the same CPU better even
if it might seem overloaded at a lower momentary frequency. (As
usual, turbo mode is a complication that we resolve by observing
the maximum frequency and renormalizing to it.)
- Add asymmetric CPU capacity wakeup scan to improve capacity
utilization on asymmetric topologies. (big.LITTLE systems)
- PSI fixes and optimizations.
- RT scheduling capacity awareness fixes & improvements.
- Optimize the CONFIG_RT_GROUP_SCHED constraints code.
- Misc fixes, cleanups and optimizations - see the changelog for
details"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (62 commits)
threads: Update PID limit comment according to futex UAPI change
sched/fair: Fix condition of avg_load calculation
sched/rt: cpupri_find: Trigger a full search as fallback
kthread: Do not preempt current task if it is going to call schedule()
sched/fair: Improve spreading of utilization
sched: Avoid scale real weight down to zero
psi: Move PF_MEMSTALL out of task->flags
MAINTAINERS: Add maintenance information for psi
psi: Optimize switching tasks inside shared cgroups
psi: Fix cpu.pressure for cpu.max and competing cgroups
sched/core: Distribute tasks within affinity masks
sched/fair: Fix enqueue_task_fair warning
thermal/cpu-cooling, sched/core: Move the arch_set_thermal_pressure() API to generic scheduler code
sched/rt: Remove unnecessary push for unfit tasks
sched/rt: Allow pulling unfitting task
sched/rt: Optimize cpupri_find() on non-heterogenous systems
sched/rt: Re-instate old behavior in select_task_rq_rt()
sched/rt: cpupri_find: Implement fallback mechanism for !fit case
sched/fair: Fix reordering of enqueue/dequeue_task_fair()
sched/fair: Fix runnable_avg for throttled cfs
...
37 files changed, 1552 insertions, 513 deletions
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 74e75a7b4b24..9024fb2dc418 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4428,6 +4428,22 @@
 			incurs a small amount of overhead in the scheduler
 			but is useful for debugging and performance tuning.
 
+	sched_thermal_decay_shift=
+			[KNL, SMP] Set a decay shift for scheduler thermal
+			pressure signal. Thermal pressure signal follows the
+			default decay period of other scheduler pelt
+			signals(usually 32 ms but configurable). Setting
+			sched_thermal_decay_shift will left shift the decay
+			period for the thermal pressure signal by the shift
+			value.
+			i.e. with the default pelt decay period of 32 ms
+			sched_thermal_decay_shift   thermal pressure decay pr
+				1			64 ms
+				2			128 ms
+			and so on.
+			Format: integer between 0 and 10
+			Default is 0.
+
 	skew_tick=	[KNL] Offset the periodic timer tick per cpu to mitigate
 			xtime_lock contention on larger systems, and/or RCU lock
 			contention on all systems with CONFIG_MAXSMP set.
diff --git a/Documentation/robust-futex-ABI.txt b/Documentation/robust-futex-ABI.txt
index 8a5d34abf726..f24904f1c16f 100644
--- a/Documentation/robust-futex-ABI.txt
+++ b/Documentation/robust-futex-ABI.txt
@@ -61,8 +61,8 @@ setup that list.
   address of the associated 'lock entry', plus or minus, of what will
   be called the 'lock word', from that 'lock entry'.  The 'lock word'
   is always a 32 bit word, unlike the other words above.  The 'lock
-  word' holds 3 flag bits in the upper 3 bits, and the thread id (TID)
-  of the thread holding the lock in the bottom 29 bits.  See further
+  word' holds 2 flag bits in the upper 2 bits, and the thread id (TID)
+  of the thread holding the lock in the bottom 30 bits.  See further
   below for a description of the flag bits.
 
 The third word, called 'list_op_pending', contains transient copy of
@@ -128,7 +128,7 @@ that thread's robust_futex linked lock list a given time.
 A given futex lock structure in a user shared memory region may be held
 at different times by any of the threads with access to that region.
 The thread currently holding such a lock, if any, is marked with the threads
-TID in the lower 29 bits of the 'lock word'.
+TID in the lower 30 bits of the 'lock word'.
 
 When adding or removing a lock from its list of held locks, in order
 for the kernel to correctly handle lock cleanup regardless of when the task
@@ -141,7 +141,7 @@ On insertion:
  1) set the 'list_op_pending' word to the address of the 'lock entry'
    to be inserted,
  2) acquire the futex lock,
- 3) add the lock entry, with its thread id (TID) in the bottom 29 bits
+ 3) add the lock entry, with its thread id (TID) in the bottom 30 bits
    of the 'lock word', to the linked list starting at 'head', and
  4) clear the 'list_op_pending' word.
 
@@ -155,7 +155,7 @@ On removal:
 
 On exit, the kernel will consider the address stored in
 'list_op_pending' and the address of each 'lock word' found by walking
-the list starting at 'head'.  For each such address, if the bottom 29
+the list starting at 'head'.  For each such address, if the bottom 30
 bits of the 'lock word' at offset 'offset' from that address equals the
 exiting threads TID, then the kernel will do two things:
 
@@ -180,7 +180,5 @@ any point:
    future kernel configuration changes) elements.
 
 When the kernel sees a list entry whose 'lock word' doesn't have the
-current threads TID in the lower 29 bits, it does nothing with that
+current threads TID in the lower 30 bits, it does nothing with that
 entry, and goes on to the next entry.
-
-Bit 29 (0x20000000) of the 'lock word' is reserved for future use.
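For readers following the ABI text above: the two flag bits that remain are the futex UAPI's FUTEX_WAITERS and FUTEX_OWNER_DIED bits, and the TID now occupies the low 30 bits (FUTEX_TID_MASK is 0x3fffffff). A minimal user-space sketch of decoding a 'lock word' under that layout — the sample value is made up for illustration:

```c
#include <stdint.h>
#include <stdio.h>
#include <linux/futex.h>	/* FUTEX_WAITERS, FUTEX_OWNER_DIED, FUTEX_TID_MASK */

/* Split a robust-futex 'lock word': 2 flag bits on top, 30-bit TID below. */
static void decode_lock_word(uint32_t word)
{
	uint32_t tid   = word & FUTEX_TID_MASK;		/* 0x3fffffff */
	int waiters    = !!(word & FUTEX_WAITERS);	/* 0x80000000 */
	int owner_died = !!(word & FUTEX_OWNER_DIED);	/* 0x40000000 */

	printf("tid=%u waiters=%d owner_died=%d\n", tid, waiters, owner_died);
}

int main(void)
{
	decode_lock_word(0x80000457u);	/* hypothetical: TID 1111, waiters queued */
	return 0;
}
```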
diff --git a/MAINTAINERS b/MAINTAINERS
index 251e9c3475e5..f114a90a23f8 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -13552,6 +13552,12 @@ F:	net/psample
 F:	include/net/psample.h
 F:	include/uapi/linux/psample.h
 
+PRESSURE STALL INFORMATION (PSI)
+M:	Johannes Weiner <hannes@cmpxchg.org>
+S:	Maintained
+F:	kernel/sched/psi.c
+F:	include/linux/psi*
+
 PSTORE FILESYSTEM
 M:	Kees Cook <keescook@chromium.org>
 M:	Anton Vorontsov <anton@enomsg.org>
diff --git a/arch/arm/include/asm/topology.h b/arch/arm/include/asm/topology.h
index 8a0fae94d45e..435aba289fc5 100644
--- a/arch/arm/include/asm/topology.h
+++ b/arch/arm/include/asm/topology.h
@@ -16,6 +16,9 @@
 /* Enable topology flag updates */
 #define arch_update_cpu_topology topology_update_cpu_topology
 
+/* Replace task scheduler's default thermal pressure retrieve API */
+#define arch_scale_thermal_pressure topology_get_thermal_pressure
+
 #else
 
 static inline void init_cpu_topology(void) { }
diff --git a/arch/arm64/configs/defconfig b/arch/arm64/configs/defconfig
index 4db223dbc549..e7573289a66f 100644
--- a/arch/arm64/configs/defconfig
+++ b/arch/arm64/configs/defconfig
@@ -62,6 +62,7 @@ CONFIG_ARCH_ZX=y
 CONFIG_ARCH_ZYNQMP=y
 CONFIG_ARM64_VA_BITS_48=y
 CONFIG_SCHED_MC=y
+CONFIG_SCHED_SMT=y
 CONFIG_NUMA=y
 CONFIG_SECCOMP=y
 CONFIG_KEXEC=y
diff --git a/arch/arm64/include/asm/topology.h b/arch/arm64/include/asm/topology.h
index a4d945db95a2..cbd70d78ef15 100644
--- a/arch/arm64/include/asm/topology.h
+++ b/arch/arm64/include/asm/topology.h
@@ -25,6 +25,9 @@ int pcibus_to_node(struct pci_bus *bus);
 /* Enable topology flag updates */
 #define arch_update_cpu_topology topology_update_cpu_topology
 
+/* Replace task scheduler's default thermal pressure retrieve API */
+#define arch_scale_thermal_pressure topology_get_thermal_pressure
+
 #include <asm-generic/topology.h>
 
 #endif /* _ASM_ARM_TOPOLOGY_H */
diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index 4b14d2318251..79d8d5496330 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -193,4 +193,29 @@ static inline void sched_clear_itmt_support(void)
 }
 #endif /* CONFIG_SCHED_MC_PRIO */
 
+#ifdef CONFIG_SMP
+#include <asm/cpufeature.h>
+
+DECLARE_STATIC_KEY_FALSE(arch_scale_freq_key);
+
+#define arch_scale_freq_invariant() static_branch_likely(&arch_scale_freq_key)
+
+DECLARE_PER_CPU(unsigned long, arch_freq_scale);
+
+static inline long arch_scale_freq_capacity(int cpu)
+{
+	return per_cpu(arch_freq_scale, cpu);
+}
+#define arch_scale_freq_capacity arch_scale_freq_capacity
+
+extern void arch_scale_freq_tick(void);
+#define arch_scale_freq_tick arch_scale_freq_tick
+
+extern void arch_set_max_freq_ratio(bool turbo_disabled);
+#else
+static inline void arch_set_max_freq_ratio(bool turbo_disabled)
+{
+}
+#endif
+
 #endif /* _ASM_X86_TOPOLOGY_H */
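The arch_scale_freq_capacity() hook defined above is what the generic scheduler multiplies into its time-based signals: SCHED_CAPACITY_SCALE (1024) means "running at the reference maximum frequency", and smaller values proportionally discount runtime. A stand-alone model of that fixed-point rescaling, not kernel code, with made-up numbers:

```c
#include <stdio.h>

/* Stand-alone model (not kernel code) of how a frequency scale factor in
 * [0, 1024] discounts a runtime delta, as the scheduler does with the
 * value returned by arch_scale_freq_capacity(). */
#define SCHED_CAPACITY_SHIFT	10
#define SCHED_CAPACITY_SCALE	(1UL << SCHED_CAPACITY_SHIFT)

static unsigned long scale_delta(unsigned long delta_exec, unsigned long freq_scale)
{
	return (delta_exec * freq_scale) >> SCHED_CAPACITY_SHIFT;
}

int main(void)
{
	/* 10 ms of runtime on a CPU observed at ~50% of its maximum
	 * frequency contributes like ~5 ms at full speed. */
	printf("%lu\n", scale_delta(10000, 512));			/* -> 5000  */
	printf("%lu\n", scale_delta(10000, SCHED_CAPACITY_SCALE));	/* -> 10000 */
	return 0;
}
```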
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 3076ef0864dd..d59525d2ead3 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -147,6 +147,8 @@ static inline void smpboot_restore_warm_reset_vector(void)
 	*((volatile u32 *)phys_to_virt(TRAMPOLINE_PHYS_LOW)) = 0;
 }
 
+static void init_freq_invariance(void);
+
 /*
  * Report back to the Boot Processor during boot time or to the caller processor
  * during CPU online.
@@ -183,6 +185,8 @@ static void smp_callin(void)
 	 */
 	set_cpu_sibling_map(raw_smp_processor_id());
 
+	init_freq_invariance();
+
 	/*
 	 * Get our bogomips.
 	 * Update loops_per_jiffy in cpu_data. Previous call to
@@ -1337,7 +1341,7 @@ void __init native_smp_prepare_cpus(unsigned int max_cpus)
 	set_sched_topology(x86_topology);
 
 	set_cpu_sibling_map(0);
-
+	init_freq_invariance();
 	smp_sanity_check();
 
 	switch (apic_intr_mode) {
@@ -1764,3 +1768,287 @@ void native_play_dead(void)
 }
 
 #endif
+
+/*
+ * APERF/MPERF frequency ratio computation.
+ *
+ * The scheduler wants to do frequency invariant accounting and needs a <1
+ * ratio to account for the 'current' frequency, corresponding to
+ * freq_curr / freq_max.
+ *
+ * Since the frequency freq_curr on x86 is controlled by micro-controller and
+ * our P-state setting is little more than a request/hint, we need to observe
+ * the effective frequency 'BusyMHz', i.e. the average frequency over a time
+ * interval after discarding idle time. This is given by:
+ *
+ *            BusyMHz = delta_APERF / delta_MPERF * freq_base
+ *
+ * where freq_base is the max non-turbo P-state.
+ *
+ * The freq_max term has to be set to a somewhat arbitrary value, because we
+ * can't know which turbo states will be available at a given point in time:
+ * it all depends on the thermal headroom of the entire package. We set it to
+ * the turbo level with 4 cores active.
+ *
+ * Benchmarks show that's a good compromise between the 1C turbo ratio
+ * (freq_curr/freq_max would rarely reach 1) and something close to freq_base,
+ * which would ignore the entire turbo range (a conspicuous part, making
+ * freq_curr/freq_max always maxed out).
+ *
+ * An exception to the heuristic above is the Atom uarch, where we choose the
+ * highest turbo level for freq_max since Atom's are generally oriented towards
+ * power efficiency.
+ *
+ * Setting freq_max to anything less than the 1C turbo ratio makes the ratio
+ * freq_curr / freq_max to eventually grow >1, in which case we clip it to 1.
+ */
+
+DEFINE_STATIC_KEY_FALSE(arch_scale_freq_key);
+
+static DEFINE_PER_CPU(u64, arch_prev_aperf);
+static DEFINE_PER_CPU(u64, arch_prev_mperf);
+static u64 arch_turbo_freq_ratio = SCHED_CAPACITY_SCALE;
+static u64 arch_max_freq_ratio = SCHED_CAPACITY_SCALE;
+
+void arch_set_max_freq_ratio(bool turbo_disabled)
+{
+	arch_max_freq_ratio = turbo_disabled ? SCHED_CAPACITY_SCALE :
+					arch_turbo_freq_ratio;
+}
+
+static bool turbo_disabled(void)
+{
+	u64 misc_en;
+	int err;
+
+	err = rdmsrl_safe(MSR_IA32_MISC_ENABLE, &misc_en);
+	if (err)
+		return false;
+
+	return (misc_en & MSR_IA32_MISC_ENABLE_TURBO_DISABLE);
+}
+
+static bool slv_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq)
+{
+	int err;
+
+	err = rdmsrl_safe(MSR_ATOM_CORE_RATIOS, base_freq);
+	if (err)
+		return false;
+
+	err = rdmsrl_safe(MSR_ATOM_CORE_TURBO_RATIOS, turbo_freq);
+	if (err)
+		return false;
+
+	*base_freq = (*base_freq >> 16) & 0x3F;	/* max P state */
+	*turbo_freq = *turbo_freq & 0x3F;	/* 1C turbo    */
+
+	return true;
+}
+
+#include <asm/cpu_device_id.h>
+#include <asm/intel-family.h>
+
+#define ICPU(model) \
+	{X86_VENDOR_INTEL, 6, model, X86_FEATURE_APERFMPERF, 0}
+
+static const struct x86_cpu_id has_knl_turbo_ratio_limits[] = {
+	ICPU(INTEL_FAM6_XEON_PHI_KNL),
+	ICPU(INTEL_FAM6_XEON_PHI_KNM),
+	{}
+};
+
+static const struct x86_cpu_id has_skx_turbo_ratio_limits[] = {
+	ICPU(INTEL_FAM6_SKYLAKE_X),
+	{}
+};
+
+static const struct x86_cpu_id has_glm_turbo_ratio_limits[] = {
+	ICPU(INTEL_FAM6_ATOM_GOLDMONT),
+	ICPU(INTEL_FAM6_ATOM_GOLDMONT_D),
+	ICPU(INTEL_FAM6_ATOM_GOLDMONT_PLUS),
+	{}
+};
+
+static bool knl_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq,
+				int num_delta_fratio)
+{
+	int fratio, delta_fratio, found;
+	int err, i;
+	u64 msr;
+
+	if (!x86_match_cpu(has_knl_turbo_ratio_limits))
+		return false;
+
+	err = rdmsrl_safe(MSR_PLATFORM_INFO, base_freq);
+	if (err)
+		return false;
+
+	*base_freq = (*base_freq >> 8) & 0xFF;	/* max P state */
+
+	err = rdmsrl_safe(MSR_TURBO_RATIO_LIMIT, &msr);
+	if (err)
+		return false;
+
+	fratio = (msr >> 8) & 0xFF;
+	i = 16;
+	found = 0;
+	do {
+		if (found >= num_delta_fratio) {
+			*turbo_freq = fratio;
+			return true;
+		}
+
+		delta_fratio = (msr >> (i + 5)) & 0x7;
+
+		if (delta_fratio) {
+			found += 1;
+			fratio -= delta_fratio;
+		}
+
+		i += 8;
+	} while (i < 64);
+
+	return true;
+}
+
+static bool skx_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq, int size)
+{
+	u64 ratios, counts;
+	u32 group_size;
+	int err, i;
+
+	err = rdmsrl_safe(MSR_PLATFORM_INFO, base_freq);
+	if (err)
+		return false;
+
+	*base_freq = (*base_freq >> 8) & 0xFF;	/* max P state */
+
+	err = rdmsrl_safe(MSR_TURBO_RATIO_LIMIT, &ratios);
+	if (err)
+		return false;
+
+	err = rdmsrl_safe(MSR_TURBO_RATIO_LIMIT1, &counts);
+	if (err)
+		return false;
+
+	for (i = 0; i < 64; i += 8) {
+		group_size = (counts >> i) & 0xFF;
+		if (group_size >= size) {
+			*turbo_freq = (ratios >> i) & 0xFF;
+			return true;
+		}
+	}
+
+	return false;
+}
+
+static bool core_set_max_freq_ratio(u64 *base_freq, u64 *turbo_freq)
+{
+	int err;
+
+	err = rdmsrl_safe(MSR_PLATFORM_INFO, base_freq);
+	if (err)
+		return false;
+
+	err = rdmsrl_safe(MSR_TURBO_RATIO_LIMIT, turbo_freq);
+	if (err)
+		return false;
+
+	*base_freq = (*base_freq >> 8) & 0xFF;	/* max P state */
+	*turbo_freq = (*turbo_freq >> 24) & 0xFF;	/* 4C turbo    */
+
+	return true;
+}
+
+static bool intel_set_max_freq_ratio(void)
+{
+	u64 base_freq, turbo_freq;
+
+	if (slv_set_max_freq_ratio(&base_freq, &turbo_freq))
+		goto out;
+
+	if (x86_match_cpu(has_glm_turbo_ratio_limits) &&
+	    skx_set_max_freq_ratio(&base_freq, &turbo_freq, 1))
+		goto out;
+
+	if (knl_set_max_freq_ratio(&base_freq, &turbo_freq, 1))
+		goto out;
+
+	if (x86_match_cpu(has_skx_turbo_ratio_limits) &&
+	    skx_set_max_freq_ratio(&base_freq, &turbo_freq, 4))
+		goto out;
+
+	if (core_set_max_freq_ratio(&base_freq, &turbo_freq))
+		goto out;
+
+	return false;
+
+out:
+	arch_turbo_freq_ratio = div_u64(turbo_freq * SCHED_CAPACITY_SCALE,
+					base_freq);
+	arch_set_max_freq_ratio(turbo_disabled());
+	return true;
+}
+
+static void init_counter_refs(void *arg)
+{
+	u64 aperf, mperf;
+
+	rdmsrl(MSR_IA32_APERF, aperf);
+	rdmsrl(MSR_IA32_MPERF, mperf);
+
+	this_cpu_write(arch_prev_aperf, aperf);
+	this_cpu_write(arch_prev_mperf, mperf);
+}
+
+static void init_freq_invariance(void)
+{
+	bool ret = false;
+
+	if (smp_processor_id() != 0 || !boot_cpu_has(X86_FEATURE_APERFMPERF))
+		return;
+
+	if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL)
+		ret = intel_set_max_freq_ratio();
+
+	if (ret) {
+		on_each_cpu(init_counter_refs, NULL, 1);
+		static_branch_enable(&arch_scale_freq_key);
+	} else {
+		pr_debug("Couldn't determine max cpu frequency, necessary for scale-invariant accounting.\n");
+	}
+}
+
+DEFINE_PER_CPU(unsigned long, arch_freq_scale) = SCHED_CAPACITY_SCALE;
+
+void arch_scale_freq_tick(void)
+{
+	u64 freq_scale;
+	u64 aperf, mperf;
+	u64 acnt, mcnt;
+
+	if (!arch_scale_freq_invariant())
+		return;
+
+	rdmsrl(MSR_IA32_APERF, aperf);
+	rdmsrl(MSR_IA32_MPERF, mperf);
+
+	acnt = aperf - this_cpu_read(arch_prev_aperf);
+	mcnt = mperf - this_cpu_read(arch_prev_mperf);
+	if (!mcnt)
+		return;
+
+	this_cpu_write(arch_prev_aperf, aperf);
+	this_cpu_write(arch_prev_mperf, mperf);
+
+	acnt <<= 2*SCHED_CAPACITY_SHIFT;
+	mcnt *= arch_max_freq_ratio;
+
+	freq_scale = div64_u64(acnt, mcnt);
+
+	if (freq_scale > SCHED_CAPACITY_SCALE)
+		freq_scale = SCHED_CAPACITY_SCALE;
+
+	this_cpu_write(arch_freq_scale, freq_scale);
+}
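To see the arch_scale_freq_tick() arithmetic above with concrete numbers, here is a stand-alone model of the same computation; the base and turbo frequencies are hypothetical:

```c
#include <stdint.h>
#include <stdio.h>

#define SCHED_CAPACITY_SHIFT	10
#define SCHED_CAPACITY_SCALE	(1ULL << SCHED_CAPACITY_SHIFT)

/* User-space model of the tick-time computation above:
 * freq_scale ~= (delta_APERF / delta_MPERF) / (freq_max / freq_base),
 * kept in 1024-based fixed point and clipped at SCHED_CAPACITY_SCALE. */
static uint64_t model_freq_scale(uint64_t acnt, uint64_t mcnt, uint64_t max_freq_ratio)
{
	uint64_t freq_scale;

	if (!mcnt)
		return SCHED_CAPACITY_SCALE;

	acnt <<= 2 * SCHED_CAPACITY_SHIFT;
	mcnt *= max_freq_ratio;

	freq_scale = acnt / mcnt;
	return freq_scale > SCHED_CAPACITY_SCALE ? SCHED_CAPACITY_SCALE : freq_scale;
}

int main(void)
{
	/* Hypothetical part: 2.0 GHz base, 3.0 GHz 4C turbo, so
	 * arch_turbo_freq_ratio = 3.0 / 2.0 * 1024 = 1536. */
	uint64_t ratio = 1536;

	/* Over one tick the core ran busy at an effective 2.5 GHz:
	 * APERF advanced 2.5M cycles, MPERF (base clock) 2.0M cycles. */
	printf("%llu\n",
	       (unsigned long long)model_freq_scale(2500000, 2000000, ratio));
	/* prints 853, i.e. 2.5 GHz / 3.0 GHz in 1024ths */
	return 0;
}
```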
diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index 95abbf66e95e..4d1e25d1ced1 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -922,6 +922,7 @@ static void intel_pstate_update_limits(unsigned int cpu)
 	 */
 	if (global.turbo_disabled_mf != global.turbo_disabled) {
 		global.turbo_disabled_mf = global.turbo_disabled;
+		arch_set_max_freq_ratio(global.turbo_disabled);
 		for_each_possible_cpu(cpu)
 			intel_pstate_update_max_freq(cpu);
 	} else {
diff --git a/drivers/thermal/cpufreq_cooling.c b/drivers/thermal/cpufreq_cooling.c
index fe83d7a210d4..4ae8c856c88e 100644
--- a/drivers/thermal/cpufreq_cooling.c
+++ b/drivers/thermal/cpufreq_cooling.c
@@ -431,6 +431,10 @@ static int cpufreq_set_cur_state(struct thermal_cooling_device *cdev,
 				 unsigned long state)
 {
 	struct cpufreq_cooling_device *cpufreq_cdev = cdev->devdata;
+	struct cpumask *cpus;
+	unsigned int frequency;
+	unsigned long max_capacity, capacity;
+	int ret;
 
 	/* Request state should be less than max_level */
 	if (WARN_ON(state > cpufreq_cdev->max_level))
@@ -442,8 +446,19 @@ static int cpufreq_set_cur_state(struct thermal_cooling_device *cdev,
 
 	cpufreq_cdev->cpufreq_state = state;
 
-	return freq_qos_update_request(&cpufreq_cdev->qos_req,
-				get_state_freq(cpufreq_cdev, state));
+	frequency = get_state_freq(cpufreq_cdev, state);
+
+	ret = freq_qos_update_request(&cpufreq_cdev->qos_req, frequency);
+
+	if (ret > 0) {
+		cpus = cpufreq_cdev->policy->cpus;
+		max_capacity = arch_scale_cpu_capacity(cpumask_first(cpus));
+		capacity = frequency * max_capacity;
+		capacity /= cpufreq_cdev->policy->cpuinfo.max_freq;
+		arch_set_thermal_pressure(cpus, max_capacity - capacity);
+	}
+
+	return ret;
 }
 
 /* Bind cpufreq callbacks to thermal cooling device ops */
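The thermal-pressure update in cpufreq_set_cur_state() above is a simple proportion: the capacity lost to the thermal frequency cap, expressed on the same 1024-based scale as CPU capacity. A small stand-alone model with made-up numbers:

```c
#include <stdio.h>

/* Model (not kernel code) of the computation in cpufreq_set_cur_state():
 *   capacity = capped_freq * max_capacity / max_freq
 *   pressure = max_capacity - capacity */
static unsigned long thermal_pressure(unsigned long capped_freq,
				      unsigned long max_freq,
				      unsigned long max_capacity)
{
	unsigned long capacity = capped_freq * max_capacity / max_freq;

	return max_capacity - capacity;
}

int main(void)
{
	/* Hypothetical CPU: capacity 1024 at 2000 MHz, thermally capped to 1500 MHz. */
	printf("%lu\n", thermal_pressure(1500, 2000, 1024));	/* -> 256 */
	return 0;
}
```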
diff --git a/include/linux/arch_topology.h b/include/linux/arch_topology.h
index c507e9ddd909..8dd9aeb6cdeb 100644
--- a/include/linux/arch_topology.h
+++ b/include/linux/arch_topology.h
@@ -30,6 +30,16 @@ static inline
 unsigned long topology_get_freq_scale(int cpu)
 {
 	return per_cpu(freq_scale, cpu);
 }
 
+DECLARE_PER_CPU(unsigned long, thermal_pressure);
+
+static inline unsigned long topology_get_thermal_pressure(int cpu)
+{
+	return per_cpu(thermal_pressure, cpu);
+}
+
+void arch_set_thermal_pressure(struct cpumask *cpus,
+			       unsigned long th_pressure);
+
 struct cpu_topology {
 	int thread_id;
 	int core_id;
diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index d5cc88514aee..f0d895d6ac39 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -194,6 +194,11 @@ static inline unsigned int cpumask_local_spread(unsigned int i, int node)
 	return 0;
 }
 
+static inline int cpumask_any_and_distribute(const struct cpumask *src1p,
+					     const struct cpumask *src2p) {
+	return cpumask_next_and(-1, src1p, src2p);
+}
+
 #define for_each_cpu(cpu, mask)			\
 	for ((cpu) = 0; (cpu) < 1; (cpu)++, (void)mask)
 #define for_each_cpu_not(cpu, mask)		\
@@ -245,6 +250,8 @@ static inline unsigned int cpumask_next_zero(int n, const struct cpumask *srcp)
 int cpumask_next_and(int n, const struct cpumask *, const struct cpumask *);
 int cpumask_any_but(const struct cpumask *mask, unsigned int cpu);
 unsigned int cpumask_local_spread(unsigned int i, int node);
+int cpumask_any_and_distribute(const struct cpumask *src1p,
+			       const struct cpumask *src2p);
 
 /**
  * for_each_cpu - iterate over every cpu in a mask
diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 0d9db2a14f44..9b7a8d74a9d6 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -257,6 +257,13 @@ extern void __cant_sleep(const char *file, int line, int preempt_offset);
 
 #define might_sleep_if(cond) do { if (cond) might_sleep(); } while (0)
 
+#ifndef CONFIG_PREEMPT_RT
+# define cant_migrate()		cant_sleep()
+#else
+  /* Placeholder for now */
+# define cant_migrate()		do { } while (0)
+#endif
+
 /**
  * abs - return absolute value of an argument
  * @x: the value.  If it is unsigned type, it is converted to signed type first.
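The UP stub above simply returns the first CPU of the intersection; the point of cpumask_any_and_distribute() on SMP (implemented in lib/cpumask.c, not shown in this excerpt) is to rotate the starting point so that repeated placements spread across the allowed CPUs. A simplified user-space sketch of that idea on a plain bitmask — not the kernel implementation:

```c
#include <stdio.h>

/* Simplified sketch of the "distribute" idea: remember where the previous
 * pick landed and continue the search after it, wrapping around. */
static unsigned int any_and_distribute(unsigned long src1, unsigned long src2,
				       unsigned int nbits)
{
	static unsigned int prev;	/* rotating start position */
	unsigned long both = src1 & src2;
	unsigned int i;

	for (i = 1; i <= nbits; i++) {
		unsigned int cpu = (prev + i) % nbits;

		if (both & (1UL << cpu)) {
			prev = cpu;
			return cpu;
		}
	}
	return nbits;	/* empty intersection, like returning >= nr_cpu_ids */
}

int main(void)
{
	/* Affinity mask 0-3 against an online mask 0-3: successive calls
	 * yield 1 2 3 0 1 instead of always CPU 0. */
	for (int i = 0; i < 5; i++)
		printf("%u ", any_and_distribute(0xf, 0xf, 4));
	printf("\n");
	return 0;
}
```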
diff --git a/include/linux/preempt.h b/include/linux/preempt.h
index bbb68dba37cc..bc3f1aecaa19 100644
--- a/include/linux/preempt.h
+++ b/include/linux/preempt.h
@@ -322,4 +322,34 @@ static inline void preempt_notifier_init(struct preempt_notifier *notifier,
 
 #endif
 
+/**
+ * migrate_disable - Prevent migration of the current task
+ *
+ * Maps to preempt_disable() which also disables preemption. Use
+ * migrate_disable() to annotate that the intent is to prevent migration,
+ * but not necessarily preemption.
+ *
+ * Can be invoked nested like preempt_disable() and needs the corresponding
+ * number of migrate_enable() invocations.
+ */
+static __always_inline void migrate_disable(void)
+{
+	preempt_disable();
+}
+
+/**
+ * migrate_enable - Allow migration of the current task
+ *
+ * Counterpart to migrate_disable().
+ *
+ * As migrate_disable() can be invoked nested, only the outermost invocation
+ * reenables migration.
+ *
+ * Currently mapped to preempt_enable().
+ */
+static __always_inline void migrate_enable(void)
+{
+	preempt_enable();
+}
+
 #endif /* __LINUX_PREEMPT_H */
diff --git a/include/linux/psi.h b/include/linux/psi.h
index 7b3de7321219..7361023f3fdd 100644
--- a/include/linux/psi.h
+++ b/include/linux/psi.h
@@ -17,6 +17,8 @@ extern struct psi_group psi_system;
 void psi_init(void);
 
 void psi_task_change(struct task_struct *task, int clear, int set);
+void psi_task_switch(struct task_struct *prev, struct task_struct *next,
+		     bool sleep);
 
 void psi_memstall_tick(struct task_struct *task, int cpu);
 void psi_memstall_enter(unsigned long *flags);
diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
index 07aaf9b82241..4b7258495a04 100644
--- a/include/linux/psi_types.h
+++ b/include/linux/psi_types.h
@@ -14,13 +14,21 @@ enum psi_task_count {
 	NR_IOWAIT,
 	NR_MEMSTALL,
 	NR_RUNNING,
-	NR_PSI_TASK_COUNTS = 3,
+	/*
+	 * This can't have values other than 0 or 1 and could be
+	 * implemented as a bit flag. But for now we still have room
+	 * in the first cacheline of psi_group_cpu, and this way we
+	 * don't have to special case any state tracking for it.
+	 */
+	NR_ONCPU,
+	NR_PSI_TASK_COUNTS = 4,
 };
 
 /* Task state bitmasks */
 #define TSK_IOWAIT	(1 << NR_IOWAIT)
 #define TSK_MEMSTALL	(1 << NR_MEMSTALL)
 #define TSK_RUNNING	(1 << NR_RUNNING)
+#define TSK_ONCPU	(1 << NR_ONCPU)
 
 /* Resources that workloads could be stalled on */
 enum psi_res {
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 933914cdf219..5d0d2fc018e9 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -356,28 +356,30 @@ struct util_est {
 } __attribute__((__aligned__(sizeof(u64))));
 
 /*
- * The load_avg/util_avg accumulates an infinite geometric series
- * (see __update_load_avg() in kernel/sched/fair.c).
+ * The load/runnable/util_avg accumulates an infinite geometric series
+ * (see __update_load_avg_cfs_rq() in kernel/sched/pelt.c).
  *
  * [load_avg definition]
  *
  *   load_avg = runnable% * scale_load_down(load)
  *
- * where runnable% is the time ratio that a sched_entity is runnable.
- * For cfs_rq, it is the aggregated load_avg of all runnable and
- * blocked sched_entities.
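Relating to the psi_types.h hunk above: NR_ONCPU adds a fourth per-CPU task counter, and TSK_ONCPU the matching state bit, which lets PSI track which task is actually executing when tasks switch inside shared cgroups. A tiny user-space illustration of the resulting bit layout, mirroring the header's definitions:

```c
#include <stdio.h>

/* Mirror of the psi_types.h enum/bitmask layout after the change above. */
enum psi_task_count { NR_IOWAIT, NR_MEMSTALL, NR_RUNNING, NR_ONCPU,
		      NR_PSI_TASK_COUNTS };

#define TSK_IOWAIT	(1 << NR_IOWAIT)	/* 0x1 */
#define TSK_MEMSTALL	(1 << NR_MEMSTALL)	/* 0x2 */
#define TSK_RUNNING	(1 << NR_RUNNING)	/* 0x4 */
#define TSK_ONCPU	(1 << NR_ONCPU)		/* 0x8 */

int main(void)
{
	/* A task that is runnable and currently on the CPU. */
	int state = TSK_RUNNING | TSK_ONCPU;

	printf("counts=%d state=0x%x oncpu=%d\n",
	       NR_PSI_TASK_COUNTS, state, !!(state & TSK_ONCPU));
	return 0;
}
```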
