<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux.git/kernel/sched/debug.c, branch v6.6.131</title>
<subtitle>Clone of https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git</subtitle>
<link rel='alternate' type='text/html' href='https://git.exis.tech/linux.git/'/>
<entry>
<title>sched/debug: Provide slice length for fair tasks</title>
<updated>2025-03-22T19:50:40+00:00</updated>
<author>
<name>Christian Loehle</name>
<email>christian.loehle@arm.com</email>
</author>
<published>2025-01-29T17:59:44+00:00</published>
<link rel='alternate' type='text/html' href='https://git.exis.tech/linux.git/commit/?id=6c8b1efdc487d363f2bfdc91fedd87bb70f7c7eb'/>
<id>6c8b1efdc487d363f2bfdc91fedd87bb70f7c7eb</id>
<content type='text'>
[ Upstream commit 9065ce69754dece78606c8bbb3821449272e56bf ]

Since commit:

  857b158dc5e8 ("sched/eevdf: Use sched_attr::sched_runtime to set request/slice suggestion")

... userspace can tune the per-task slice length. This is a key
scheduling parameter that is otherwise difficult to obtain, so provide
it in /proc/$PID/sched.
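
For illustration, a minimal userspace sketch of reading the new value
(assumptions: the value is printed as an "se.slice" line, and the exact
label and units follow the usual /proc/$PID/sched formatting):

  #include &lt;stdio.h&gt;
  #include &lt;string.h&gt;

  /* Return the slice value parsed from /proc/&lt;pid&gt;/sched, or -1 on error. */
  static double read_slice(int pid)
  {
          char path[64], line[256];
          double val = -1.0;
          FILE *f;

          snprintf(path, sizeof(path), "/proc/%d/sched", pid);
          f = fopen(path, "r");
          if (!f)
                  return -1.0;
          while (fgets(line, sizeof(line), f)) {
                  if (!strncmp(line, "se.slice", 8)) {
                          sscanf(line, "%*[^:]: %lf", &amp;val);
                          break;
                  }
          }
          fclose(f);
          return val;
  }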

[ mingo: Clarified the changelog. ]

Signed-off-by: Christian Loehle &lt;christian.loehle@arm.com&gt;
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Link: https://lore.kernel.org/r/453349b1-1637-42f5-a7b2-2385392b5956@arm.com
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
</entry>
<entry>
<title>sched/debug: Rename sysctl_sched_min_granularity to sysctl_sched_base_slice</title>
<updated>2023-07-19T07:43:59+00:00</updated>
<author>
<name>Peter Zijlstra</name>
<email>peterz@infradead.org</email>
</author>
<published>2023-05-31T11:58:48+00:00</published>
<link rel='alternate' type='text/html' href='https://git.exis.tech/linux.git/commit/?id=e4ec3318a17f5dcf11bc23b2d2c1da4c1c5bb507'/>
<id>e4ec3318a17f5dcf11bc23b2d2c1da4c1c5bb507</id>
<content type='text'>
EEVDF uses this tunable as the base request/slice -- make sure the
name reflects this.

Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Link: https://lore.kernel.org/r/20230531124604.205287511@infradead.org
</content>
</entry>
<entry>
<title>sched/fair: Commit to EEVDF</title>
<updated>2023-07-19T07:43:58+00:00</updated>
<author>
<name>Peter Zijlstra</name>
<email>peterz@infradead.org</email>
</author>
<published>2023-05-31T11:58:47+00:00</published>
<link rel='alternate' type='text/html' href='https://git.exis.tech/linux.git/commit/?id=5e963f2bd4654a202a8a05aa3a86cb0300b10e6c'/>
<id>5e963f2bd4654a202a8a05aa3a86cb0300b10e6c</id>
<content type='text'>
EEVDF is a better-defined scheduling policy and, as a result, has fewer
heuristics/tunables. There is no compelling reason to keep CFS around.

Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Link: https://lore.kernel.org/r/20230531124604.137187212@infradead.org
</content>
</entry>
<entry>
<title>sched/fair: Implement an EEVDF-like scheduling policy</title>
<updated>2023-07-19T07:43:58+00:00</updated>
<author>
<name>Peter Zijlstra</name>
<email>peterz@infradead.org</email>
</author>
<published>2023-05-31T11:58:44+00:00</published>
<link rel='alternate' type='text/html' href='https://git.exis.tech/linux.git/commit/?id=147f3efaa24182a21706bca15eab2f3f4630b5fe'/>
<id>147f3efaa24182a21706bca15eab2f3f4630b5fe</id>
<content type='text'>
CFS is currently a WFQ-based scheduler with only a single knob: the
weight. The addition of a second, latency-oriented parameter makes
something like WF2Q or EEVDF a much better fit.

Specifically, EEVDF does EDF-like scheduling in the left half of the
tree -- those entities that are owed service. Except that, because this
is a virtual time scheduler, the deadlines are in virtual time as well,
which is what allows over-subscription.

EEVDF has two parameters:

 - weight, or time-slope: which is mapped to nice just as before

 - request size, or slice length: which is used to compute
   the virtual deadline as: vd_i = ve_i + r_i/w_i

Basically, by setting a smaller slice, the deadline will be earlier
and the task will be more eligible and run earlier.
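
As a plain-numbers illustration (values made up, equal weights), a smaller
request directly yields an earlier virtual deadline:

  /* vd_i = ve_i + r_i / w_i */
  static long long vdeadline(long long ve, long long r, long long w)
  {
          return ve + r / w;
  }

  /*
   * With ve = 100 and w = 1:
   *   vdeadline(100,  3, 1) == 103   (small slice, earlier deadline)
   *   vdeadline(100, 10, 1) == 110   (large slice, later deadline)
   */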

Tick-driven preemption is driven by request/slice completion, while
wakeup preemption is driven by the deadline.

Because the tree is now effectively an interval tree, and the
selection is no longer 'leftmost', over-scheduling is less of a
problem.

Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Link: https://lore.kernel.org/r/20230531124603.931005524@infradead.org
</content>
</entry>
<entry>
<title>sched/fair: Add cfs_rq::avg_vruntime</title>
<updated>2023-07-19T07:43:58+00:00</updated>
<author>
<name>Peter Zijlstra</name>
<email>peterz@infradead.org</email>
</author>
<published>2023-05-31T11:58:40+00:00</published>
<link rel='alternate' type='text/html' href='https://git.exis.tech/linux.git/commit/?id=af4cf40470c22efa3987200fd19478199e08e103'/>
<id>af4cf40470c22efa3987200fd19478199e08e103</id>
<content type='text'>
In order to move to an eligibility based scheduling policy, we need
to have a better approximation of the ideal scheduler.

Specifically, for a virtual time weighted fair queueing based
scheduler the ideal scheduler will be the weighted average of the
individual virtual runtimes (math in the comment).

As such, compute the weighted average to approximate the ideal
scheduler -- note that the approximation is in the individual task
behaviour, which isn't strictly conformant.

Specifically, consider adding a task with a vruntime left of center; in
this case the average will move backwards in time -- something the
ideal scheduler would of course never do.
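
As a minimal illustration (plain C, not the kernel code, which maintains
these sums incrementally and relative to min_vruntime to limit overflow):

  /* Weighted average of virtual runtimes: V = sum(w_i * v_i) / sum(w_i) */
  static unsigned long long avg_vruntime(const unsigned long long *v,
                                         const unsigned long long *w,
                                         int n)
  {
          unsigned long long sum = 0, wsum = 0;
          int i;

          for (i = 0; i &lt; n; i++) {
                  sum  += w[i] * v[i];
                  wsum += w[i];
          }
          return wsum ? sum / wsum : 0;
  }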

Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
Link: https://lore.kernel.org/r/20230531124603.654144274@infradead.org
</content>
</entry>
<entry>
<title>sched/debug: Dump domains' sched group flags</title>
<updated>2023-07-13T13:21:53+00:00</updated>
<author>
<name>Peter Zijlstra</name>
<email>peterz@infradead.org</email>
</author>
<published>2023-07-07T22:57:05+00:00</published>
<link rel='alternate' type='text/html' href='https://git.exis.tech/linux.git/commit/?id=ed74cc4995d314ea6cbf406caf978c442f451fa5'/>
<id>ed74cc4995d314ea6cbf406caf978c442f451fa5</id>
<content type='text'>
There has been a case where the SD_SHARE_CPUCAPACITY sched group flag
in a parent domain was not set and propagated properly when a degenerate
domain is removed.

Add a dump of a CPU's domain sched group flags to make debugging easier
in the future.

Usage:
cat /debug/sched/domains/cpu0/domain1/groups_flags
to dump cpu0 domain1's sched group flags.

Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Signed-off-by: Tim Chen &lt;tim.c.chen@linux.intel.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Reviewed-by: Valentin Schneider &lt;vschneid@redhat.com&gt;
Link: https://lore.kernel.org/r/ed1749262d94d95a8296c86a415999eda90bcfe3.1688770494.git.tim.c.chen@linux.intel.com
</content>
</entry>
<entry>
<title>sched/debug: Correct printing for rq-&gt;nr_uninterruptible</title>
<updated>2023-05-08T08:58:39+00:00</updated>
<author>
<name>晏艳(采苓)</name>
<email>yanyan.yan@antgroup.com</email>
</author>
<published>2023-05-06T07:42:53+00:00</published>
<link rel='alternate' type='text/html' href='https://git.exis.tech/linux.git/commit/?id=a6fcdd8d95f7486150b3faadfea119fc3dfc3b74'/>
<id>a6fcdd8d95f7486150b3faadfea119fc3dfc3b74</id>
<content type='text'>
Commit e6fe3f422be1 ("sched: Make multiple runqueue task counters
32-bit") changed the type for rq-&gt;nr_uninterruptible from "unsigned
long" to "unsigned int", but left wrong cast print to
/sys/kernel/debug/sched/debug and to the console.

For example, when nr_uninterruptible's value is fffffff7 with type
"unsigned int", (long)nr_uninterruptible shows 4294967287 while
(int)nr_uninterruptible prints -9. So using an int cast fixes the wrong
printing.
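
For reference, a standalone sketch of the two casts (assuming a 64-bit
"long" and two's-complement int):

  #include &lt;stdio.h&gt;

  int main(void)
  {
          unsigned int nr_uninterruptible = 0xfffffff7;

          /* Value-preserving conversion: prints 4294967287. */
          printf("(long) cast: %ld\n", (long)nr_uninterruptible);
          /* Same bit pattern reinterpreted: prints -9. */
          printf("(int) cast:  %d\n", (int)nr_uninterruptible);
          return 0;
  }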

Signed-off-by: Yan Yan &lt;yanyan.yan@antgroup.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Link: https://lkml.kernel.org/r/20230506074253.44526-1-yanyan.yan@antgroup.com
</content>
</entry>
<entry>
<title>sched/debug: Put sched/domains files under the verbose flag</title>
<updated>2023-03-17T14:24:19+00:00</updated>
<author>
<name>Phil Auld</name>
<email>pauld@redhat.com</email>
</author>
<published>2023-03-03T18:37:54+00:00</published>
<link rel='alternate' type='text/html' href='https://git.exis.tech/linux.git/commit/?id=34320745dfc9b71234139720bc8b602cdc8c086a'/>
<id>34320745dfc9b71234139720bc8b602cdc8c086a</id>
<content type='text'>
The debug files under sched/domains can take a long time to regenerate,
especially when updates are done one at a time. Move these files under
the sched verbose debug flag. Allow changes to verbose to trigger
generation of the files. This lets a user batch the updates but still
have the information available.  The detailed topology printk messages
are also under verbose.
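
For example (assuming debugfs is mounted at /sys/kernel/debug), a batch of
topology updates can be done with the files disabled and then regenerated
by toggling the flag afterwards:

  echo N &gt; /sys/kernel/debug/sched/verbose
  ... apply the topology changes ...
  echo Y &gt; /sys/kernel/debug/sched/verbose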

Discussion that led to this approach can be found in the link below.

The code was simplified to maintain use of the debugfs bool routines, as
suggested by Michael Ellerman &lt;mpe@ellerman.id.au&gt;.

Signed-off-by: Phil Auld &lt;pauld@redhat.com&gt;
Acked-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
Reviewed-by: Valentin Schneider &lt;vschneid@redhat.com&gt;
Reviewed-by: Srikar Dronamraju &lt;srikar@linux.vnet.ibm.com&gt;
Tested-by: Vishal Chourasia &lt;vishalc@linux.vnet.ibm.com&gt;
Cc: Michael Ellerman &lt;mpe@ellerman.id.au&gt;
Cc: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
Cc: Ingo Molnar &lt;mingo@redhat.com&gt;
Cc: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Cc: Srikar Dronamraju &lt;srikar@linux.vnet.ibm.com&gt;
Cc: Valentin Schneider &lt;vschneid@redhat.com&gt;
Cc: Vishal Chourasia &lt;vishalc@linux.vnet.ibm.com&gt;
Cc: Vincent Guittot &lt;vincent.guittot@linaro.org&gt;
Link: https://lore.kernel.org/all/Y01UWQL2y2r69sBX@li-05afa54c-330e-11b2-a85c-e3f3aa0db1e9.ibm.com/
Link: https://lore.kernel.org/r/20230303183754.3076321-1-pauld@redhat.com
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
</entry>
<entry>
<title>Merge tag 'mm-stable-2022-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm</title>
<updated>2022-10-11T00:53:04+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2022-10-11T00:53:04+00:00</published>
<link rel='alternate' type='text/html' href='https://git.exis.tech/linux.git/commit/?id=27bc50fc90647bbf7b734c3fc306a5e61350da53'/>
<id>27bc50fc90647bbf7b734c3fc306a5e61350da53</id>
<content type='text'>
Pull MM updates from Andrew Morton:

 - Yu Zhao's Multi-Gen LRU patches are here. They've been under test in
   linux-next for a couple of months without, to my knowledge, any
   negative reports (or any positive ones, come to that).

 - Also the Maple Tree from Liam Howlett. An overlapping range-based
   tree for vmas. It is apparently slightly more efficient in its own
   right, but is mainly targeted at enabling work to reduce mmap_lock
   contention.

   Liam has identified a number of other tree users in the kernel which
   could be beneficially converted to maple trees.

   Yu Zhao has identified a hard-to-hit but "easy to fix" lockdep splat
   at [1]. This has yet to be addressed due to Liam's unfortunately
   timed vacation. He is now back and we'll get this fixed up.

 - Dmitry Vyukov introduces KMSAN: the Kernel Memory Sanitizer. It uses
   clang-generated instrumentation to detect used-uninitialized bugs down
   to the single bit level.

   KMSAN keeps finding bugs. New ones, as well as the legacy ones.

 - Yang Shi adds a userspace mechanism (madvise) to induce a collapse of
   memory into THPs.

 - Zach O'Keefe has expanded Yang Shi's madvise(MADV_COLLAPSE) to
   support file/shmem-backed pages.

 - userfaultfd updates from Axel Rasmussen

 - zsmalloc cleanups from Alexey Romanov

 - cleanups from Miaohe Lin: vmscan, hugetlb_cgroup, hugetlb and
   memory-failure

 - Huang Ying adds enhancements to NUMA balancing memory tiering mode's
   page promotion, with a new way of detecting hot pages.

 - memcg updates from Shakeel Butt: charging optimizations and reduced
   memory consumption.

 - memcg cleanups from Kairui Song.

 - memcg fixes and cleanups from Johannes Weiner.

 - Vishal Moola provides more folio conversions

 - Zhang Yi removed ll_rw_block() :(

 - migration enhancements from Peter Xu

 - migration error-path bugfixes from Huang Ying

 - Aneesh Kumar added ability for a device driver to alter the memory
   tiering promotion paths. For optimizations by PMEM drivers, DRM
   drivers, etc.

 - vma merging improvements from Jakub Matěn.

 - NUMA hinting cleanups from David Hildenbrand.

 - xu xin added additional userspace visibility into KSM merging
   activity.

 - THP &amp; KSM code consolidation from Qi Zheng.

 - more folio work from Matthew Wilcox.

 - KASAN updates from Andrey Konovalov.

 - DAMON cleanups from Kaixu Xia.

 - DAMON work from SeongJae Park: fixes, cleanups.

 - hugetlb sysfs cleanups from Muchun Song.

 - Mike Kravetz fixes locking issues in hugetlbfs and in hugetlb core.

Link: https://lkml.kernel.org/r/CAOUHufZabH85CeUN-MEMgL8gJGzJEWUrkiM58JkTbBhh-jew0Q@mail.gmail.com [1]

* tag 'mm-stable-2022-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (555 commits)
  hugetlb: allocate vma lock for all sharable vmas
  hugetlb: take hugetlb vma_lock when clearing vma_lock-&gt;vma pointer
  hugetlb: fix vma lock handling during split vma and range unmapping
  mglru: mm/vmscan.c: fix imprecise comments
  mm/mglru: don't sync disk for each aging cycle
  mm: memcontrol: drop dead CONFIG_MEMCG_SWAP config symbol
  mm: memcontrol: use do_memsw_account() in a few more places
  mm: memcontrol: deprecate swapaccounting=0 mode
  mm: memcontrol: don't allocate cgroup swap arrays when memcg is disabled
  mm/secretmem: remove reduntant return value
  mm/hugetlb: add available_huge_pages() func
  mm: remove unused inline functions from include/linux/mm_inline.h
  selftests/vm: add selftest for MADV_COLLAPSE of uffd-minor memory
  selftests/vm: add file/shmem MADV_COLLAPSE selftest for cleared pmd
  selftests/vm: add thp collapse shmem testing
  selftests/vm: add thp collapse file and tmpfs testing
  selftests/vm: modularize thp collapse memory operations
  selftests/vm: dedup THP helpers
  mm/khugepaged: add tracepoint to hpage_collapse_scan_file()
  mm/madvise: add file and shmem support to MADV_COLLAPSE
  ...
</content>
</entry>
<entry>
<title>memory tiering: hot page selection with hint page fault latency</title>
<updated>2022-09-12T03:25:54+00:00</updated>
<author>
<name>Huang Ying</name>
<email>ying.huang@intel.com</email>
</author>
<published>2022-07-13T08:39:51+00:00</published>
<link rel='alternate' type='text/html' href='https://git.exis.tech/linux.git/commit/?id=33024536bafd9129f1d16ade0974671c648700ac'/>
<id>33024536bafd9129f1d16ade0974671c648700ac</id>
<content type='text'>
Patch series "memory tiering: hot page selection", v4.

To optimize page placement in a memory tiering system with NUMA balancing,
the hot pages in the slow memory nodes need to be identified. 
Essentially, the original NUMA balancing implementation selects the most
recently accessed (MRU) pages to promote.  But this isn't a perfect
algorithm to identify the hot pages, because pages with quite low access
frequency may still be accessed eventually, given that the NUMA balancing
page table scanning period can be quite long (e.g. 60 seconds).  So in
this patchset, we implement a new hot page identification algorithm based
on the latency between NUMA balancing page table scanning and the hint
page fault, which is a kind of most frequently accessed (MFU) algorithm.

In NUMA balancing memory tiering mode, if there are hot pages in the slow
memory node and cold pages in the fast memory node, we need to
promote/demote hot/cold pages between the fast and slow memory nodes.

One choice is to promote/demote as fast as possible.  But the CPU cycles
and memory bandwidth consumed by the high promoting/demoting throughput
will hurt the latency of some workloads, because of inflated access
latency and contention for slow memory bandwidth.

A way to resolve this issue is to restrict the max promoting/demoting
throughput.  It will take longer to finish the promoting/demoting.  But
the workload latency will be better.  This is implemented in this patchset
as the page promotion rate limit mechanism.

The promotion hot threshold is workload and system configuration
dependent.  So in this patchset, a method to adjust the hot threshold
automatically is implemented.  The basic idea is to control the number of
the candidate promotion pages to match the promotion rate limit.

We used the pmbench memory accessing benchmark to test the patchset on a
2-socket server system with DRAM and PMEM installed.  The test results are
as follows:

                 pmbench score    promote rate
                  (accesses/s)          (MB/s)
                 -------------    ------------
base               146887704.1           725.6
hot selection      165695601.2           544.0
rate limit         162814569.8           165.2
auto adjustment    170495294.0           136.9

From the results above,

With hot page selection patch [1/3], the pmbench score increases about
12.8%, and promote rate (overhead) decreases about 25.0%, compared with
base kernel.

With rate limit patch [2/3], pmbench score decreases about 1.7%, and
promote rate decreases about 69.6%, compared with hot page selection
patch.

With threshold auto adjustment patch [3/3], pmbench score increases about
4.7%, and promote rate decreases about 17.1%, compared with rate limit
patch.

Baolin helped to test the patchset with MySQL on a machine which contains
1 DRAM node (30G) and 1 PMEM node (126G).

sysbench /usr/share/sysbench/oltp_read_write.lua \
......
--tables=200 \
--table-size=1000000 \
--report-interval=10 \
--threads=16 \
--time=120

The tps can be improved by about 5%.


This patch (of 3):

To optimize page placement in a memory tiering system with NUMA balancing,
the hot pages in the slow memory node need to be identified.  Essentially,
the original NUMA balancing implementation selects the most recently
accessed (MRU) pages to promote.  But this isn't a perfect algorithm to
identify the hot pages, because pages with quite low access frequency may
still be accessed eventually, given that the NUMA balancing page table
scanning period can be quite long (e.g. 60 seconds).  The most frequently
accessed (MFU) algorithm is better.

So, in this patch we implemented a better hot page selection algorithm,
which is based on NUMA balancing page table scanning and the hint page
fault, as follows:

- When the page tables of the processes are scanned to change PTE/PMD
  to be PROT_NONE, the current time is recorded in struct page as scan
  time.

- When the page is accessed, a hint page fault will occur.  The scan
  time is read from the struct page, and the hint page fault
  latency is defined as:

    hint page fault time - scan time

The shorter the hint page fault latency of a page, the higher the
probability that its access frequency is high.  So the hint page
fault latency is a better estimation of whether a page is hot or cold.

It's hard to find some extra space in struct page to hold the scan time. 
Fortunately, we can reuse some bits used by the original NUMA balancing.

NUMA balancing uses some bits in struct page to store the last accessing
CPU and PID (see page_cpupid_xchg_last()), which are used by the
multi-stage node selection algorithm to avoid migrating pages that are
accessed by multiple NUMA nodes back and forth.  But for pages in the slow
memory node, even if they are shared between multiple NUMA nodes, as long
as the pages are hot, they need to be promoted to the fast memory node.
So the accessing CPU and PID information is unnecessary for the slow
memory pages, and we can reuse these bits in struct page to record the
scan time.  For the fast memory pages, these bits are used as before.

For the hot threshold, the default value is 1 second, which works well in
our performance test.  All pages with hint page fault latency &lt; hot
threshold will be considered hot.
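
As an illustration only (a hypothetical standalone helper, not the kernel
function, which folds this check into the NUMA hint fault handling path):

  #include &lt;stdbool.h&gt;

  /* A slow-memory page is considered hot when its hint page fault
   * latency (fault time - scan time) is below the threshold, which
   * defaults to 1 second.  Times are in nanoseconds here.
   */
  static bool page_is_hot(unsigned long long scan_time_ns,
                          unsigned long long fault_time_ns,
                          unsigned long long hot_threshold_ns)
  {
          return fault_time_ns - scan_time_ns &lt; hot_threshold_ns;
  }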

It's hard for users to determine the hot threshold.  So we don't provide a
kernel ABI to set it, just provide a debugfs interface for advanced users
to experiment.  We will continue to work on a hot threshold automatic
adjustment mechanism.

The downside of the above method is that the response time to the workload
hot spot changing may be much longer.  For example,

- A previous cold memory area becomes hot

- The hint page fault will be triggered.  But the hint page fault
  latency isn't shorter than the hot threshold.  So the pages will
  not be promoted.

- When the memory area is scanned again, maybe after a scan period,
  the hint page fault latency measured will be shorter than the hot
  threshold and the pages will be promoted.

To mitigate this, if there is enough free space in the fast memory node,
the hot threshold will not be used; all pages will be promoted upon the
hint page fault for a fast response.

Thanks to Zhong Jiang, who reported and tested the fix for a bug when
disabling memory tiering mode dynamically.

Link: https://lkml.kernel.org/r/20220713083954.34196-1-ying.huang@intel.com
Link: https://lkml.kernel.org/r/20220713083954.34196-2-ying.huang@intel.com
Signed-off-by: "Huang, Ying" &lt;ying.huang@intel.com&gt;
Reviewed-by: Baolin Wang &lt;baolin.wang@linux.alibaba.com&gt;
Tested-by: Baolin Wang &lt;baolin.wang@linux.alibaba.com&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Rik van Riel &lt;riel@surriel.com&gt;
Cc: Mel Gorman &lt;mgorman@techsingularity.net&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Dave Hansen &lt;dave.hansen@linux.intel.com&gt;
Cc: Yang Shi &lt;shy828301@gmail.com&gt;
Cc: Zi Yan &lt;ziy@nvidia.com&gt;
Cc: Wei Xu &lt;weixugc@google.com&gt;
Cc: osalvador &lt;osalvador@suse.de&gt;
Cc: Shakeel Butt &lt;shakeelb@google.com&gt;
Cc: Zhong Jiang &lt;zhongjiang-ali@linux.alibaba.com&gt;
Cc: Oscar Salvador &lt;osalvador@suse.de&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
</feed>
