linux.git/kernel/crash_core.c, branch v6.6.132

mm: turn folio_test_hugetlb into a PageType

2024-05-02T14:32:47+00:00

commit d99e3140a4d33e26066183ff727d8f02f56bec64 upstream.

The current folio_test_hugetlb() can be fooled by a concurrent folio split
into returning true for a folio which has never belonged to hugetlbfs.
This can't happen if the caller holds a refcount on it, but we have a few
places (memory-failure, compaction, procfs) which do not and should not
take a speculative reference.

Since hugetlb pages do not use individual page mapcounts (they are always
fully mapped and use the entire_mapcount field to record the number of
mappings), the PageType field is available now that page_mapcount()
ignores the value in this field.
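
As a rough userspace illustration of the page-type idea described above (a sketch, not the kernel's code): a type bit shares a word with what would otherwise be a mapcount, a type is "set" by clearing its bit, and the test is a single masked comparison of that one word. The struct, the function names and the constant values are assumptions made for this sketch; the kernel's real definitions may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative values only; assumed for this sketch. */
#define DEMO_PAGE_TYPE_BASE 0xf0000000u
#define DEMO_PG_HUGETLB     0x00000800u

/* Hypothetical stand-in for the word that holds the page type. */
struct demo_page {
	uint32_t page_type;
};

/* A page with no type has every bit set (a mapcount of -1). */
static void demo_page_init(struct demo_page *p)
{
	p->page_type = UINT32_MAX;
}

/* A type is present when the base bits are set and the type bit is clear. */
static bool demo_page_has_type(const struct demo_page *p, uint32_t flag)
{
	return (p->page_type & (DEMO_PAGE_TYPE_BASE | flag)) == DEMO_PAGE_TYPE_BASE;
}

/* Setting a type clears its bit; the flag must not overlap the base bits. */
static void demo_page_set_type(struct demo_page *p, uint32_t flag)
{
	assert((flag & DEMO_PAGE_TYPE_BASE) == 0);
	p->page_type &= ~flag;
}

/* Clearing a type sets its bit again. */
static void demo_page_clear_type(struct demo_page *p, uint32_t flag)
{
	assert((flag & DEMO_PAGE_TYPE_BASE) == 0);
	p->page_type |= flag;
}
```

Because the check reads one word, this model needs no reference on the page to give a consistent answer.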

In compaction and with CONFIG_DEBUG_VM enabled, the current implementation
can result in an oops, as reported by Luis. This happens since 9c5ccf2db04b
("mm: remove HUGETLB_PAGE_DTOR") effectively added some VM_BUG_ON() checks
in the PageHuge() testing path.

[willy@infradead.org: update vmcoreinfo]
  Link: https://lkml.kernel.org/r/ZgGZUvsdhaT1Va-T@casper.infradead.org
Link: https://lkml.kernel.org/r/20240321142448.1645400-6-willy@infradead.org
Fixes: 9c5ccf2db04b ("mm: remove HUGETLB_PAGE_DTOR")
Signed-off-by: Matthew Wilcox (Oracle) 
Reviewed-by: David Hildenbrand 
Acked-by: Vlastimil Babka 
Reported-by: Luis Chamberlain 
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218227
Cc: Miaohe Lin 
Cc: Muchun Song 
Cc: Oscar Salvador 
Signed-off-by: Andrew Morton 
Signed-off-by: Greg Kroah-Hartman

mm, treewide: introduce NR_PAGE_ORDERS

2024-05-02T14:32:41+00:00

[ Upstream commit fd37721803c6e73619108f76ad2e12a9aa5fafaf ]

NR_PAGE_ORDERS defines the number of page orders supported by the page
allocator.  Orders range from 0 to MAX_ORDER, so there are MAX_ORDER + 1
in total.

NR_PAGE_ORDERS assists in defining arrays of page orders and allows for
more natural iteration over them.
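
A minimal userspace sketch of the pattern this enables: an array sized by the number of orders and a loop bounded by the same constant. The DEMO_ names and the value chosen for the maximum order are assumptions for illustration; the kernel's MAX_ORDER depends on configuration.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed value for this sketch only. */
#define DEMO_MAX_ORDER      10
#define DEMO_NR_PAGE_ORDERS (DEMO_MAX_ORDER + 1)

/*
 * Sum the base pages held in per-order free lists, where
 * nr_free[order] counts free blocks of 2^order pages.
 */
static unsigned long demo_total_free_pages(const unsigned long *nr_free,
					   size_t n)
{
	unsigned long total = 0;

	assert(n == DEMO_NR_PAGE_ORDERS);
	/* "order < NR" replaces the easy-to-misread "order <= MAX". */
	for (size_t order = 0; order < DEMO_NR_PAGE_ORDERS; order++)
		total += nr_free[order] << order;
	return total;
}
```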

[kirill.shutemov@linux.intel.com: fixup for kerneldoc warning]
  Link: https://lkml.kernel.org/r/20240101111512.7empzyifq7kxtzk3@box
Link: https://lkml.kernel.org/r/20231228144704.14033-1-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov 
Reviewed-by: Zi Yan 
Cc: Linus Torvalds 
Signed-off-by: Andrew Morton 
Stable-dep-of: b6976f323a86 ("drm/ttm: stop pooling cached NUMA pages v2")
Signed-off-by: Sasha Levin

Crash: add lock to serialize crash hotplug handling

2023-09-30T00:20:48+00:00

Eric reported that handling of the corresponding crash hotplug event can
easily fail when many memory hotplug events are notified in a short
period.  The failures are caused by failing to take __kexec_lock.

=======
[   78.714569] Fallback order for Node 0: 0
[   78.714575] Built 1 zonelists, mobility grouping on.  Total pages: 1817886
[   78.717133] Policy zone: Normal
[   78.724423] crash hp: kexec_trylock() failed, elfcorehdr may be inaccurate
[   78.727207] crash hp: kexec_trylock() failed, elfcorehdr may be inaccurate
[   80.056643] PEFILE: Unsigned PE binary
=======

Memory hotplug events arrive in large numbers and in quick succession,
while crash hotplug handling is comparatively slow.  So the atomic
variable __kexec_lock and kexec_trylock() can't guarantee the
serialization of crash hotplug handling.

Here, add a new mutex lock __crash_hotplug_lock to serialize crash hotplug
handling specifically.  This doesn't impact the usage of __kexec_lock.
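
The failure mode can be modeled in userspace with C11 atomics. This is only a sketch of why a try-lock alone drops events; it does not reproduce the kernel's mutex, and every demo_ name is hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Model of a try-only lock: 0 means free, 1 means held. */
static atomic_int demo_kexec_lock;

static int demo_events_handled;
static int demo_events_dropped;

/* Succeeds only if the lock was free; never waits. */
static bool demo_kexec_trylock(void)
{
	int expected = 0;

	return atomic_compare_exchange_strong(&demo_kexec_lock, &expected, 1);
}

static void demo_kexec_unlock(void)
{
	assert(atomic_load(&demo_kexec_lock) == 1);
	atomic_store(&demo_kexec_lock, 0);
}

/*
 * With only a try-lock, an event that arrives while an earlier one is
 * still being handled is dropped. A blocking mutex taken first would
 * make the later event wait for its turn instead.
 */
static void demo_handle_hotplug_event(void)
{
	if (!demo_kexec_trylock()) {
		demo_events_dropped++;
		return;
	}
	demo_events_handled++;
	demo_kexec_unlock();
}
```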

Link: https://lkml.kernel.org/r/20230926120905.392903-1-bhe@redhat.com
Fixes: 247262756121 ("crash: add generic infrastructure for crash hotplug support")
Signed-off-by: Baoquan He 
Tested-by: Eric DeVolder 
Reviewed-by: Eric DeVolder 
Reviewed-by: Valentin Schneider 
Cc: Sourabh Jain 
Signed-off-by: Andrew Morton

Merge tag 'mm-nonmm-stable-2023-08-28-22-48' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

2023-08-29T21:53:51+00:00

Pull non-MM updates from Andrew Morton:

 - An extensive rework of kexec and crash Kconfig from Eric DeVolder
   ("refactor Kconfig to consolidate KEXEC and CRASH options")

 - kernel.h slimming work from Andy Shevchenko ("kernel.h: Split out a
   couple of macros to args.h")

 - gdb feature work from Kuan-Ying Lee ("Add GDB memory helper
   commands")

 - vsprintf inclusion rationalization from Andy Shevchenko
   ("lib/vsprintf: Rework header inclusions")

 - Switch the handling of kdump from a udev scheme to in-kernel
   handling, by Eric DeVolder ("crash: Kernel handling of CPU and memory
   hot un/plug")

 - Many singleton patches to various parts of the tree

* tag 'mm-nonmm-stable-2023-08-28-22-48' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (81 commits)
  document while_each_thread(), change first_tid() to use for_each_thread()
  drivers/char/mem.c: shrink character device's devlist[] array
  x86/crash: optimize CPU changes
  crash: change crash_prepare_elf64_headers() to for_each_possible_cpu()
  crash: hotplug support for kexec_load()
  x86/crash: add x86 crash hotplug support
  crash: memory and CPU hotplug sysfs attributes
  kexec: exclude elfcorehdr from the segment digest
  crash: add generic infrastructure for crash hotplug support
  crash: move a few code bits to setup support of crash hotplug
  kstrtox: consistently use _tolower()
  kill do_each_thread()
  nilfs2: fix WARNING in mark_buffer_dirty due to discarded buffer reuse
  scripts/bloat-o-meter: count weak symbol sizes
  treewide: drop CONFIG_EMBEDDED
  lockdep: fix static memory detection even more
  lib/vsprintf: declare no_hash_pointers in sprintf.h
  lib/vsprintf: split out sprintf() and friends
  kernel/fork: stop playing lockless games for exe_file replacement
  adfs: delete unused "union adfs_dirtail" definition
  ...

crash: change crash_prepare_elf64_headers() to for_each_possible_cpu()

2023-08-24T23:25:14+00:00

The function crash_prepare_elf64_headers() generates the elfcorehdr which
describes the CPUs and memory in the system for the crash kernel.  In
particular, it writes out ELF PT_NOTEs for memory regions and the CPUs in
the system.

With respect to the CPUs, the current implementation utilizes
for_each_present_cpu() which means that as CPUs are added and removed, the
elfcorehdr must again be updated to reflect the new set of CPUs.

The reasoning behind the move to for_each_possible_cpu() is:

- At kernel boot time, all percpu crash_notes are allocated for all
  possible CPUs; that is, crash_notes are not allocated dynamically
  when CPUs are plugged/unplugged. Thus the crash_notes for each
  possible CPU are always available.

- The crash_prepare_elf64_headers() creates an ELF PT_NOTE per CPU.
  Changing to for_each_possible_cpu() is valid as the crash_notes
  pointed to by each CPU PT_NOTE are present and always valid.

Furthermore, examining a common crash processing path of:

 kernel panic -> crash kernel -> makedumpfile -> 'crash' analyzer
           elfcorehdr      /proc/vmcore     vmcore

reveals how the ELF CPU PT_NOTEs are utilized:

- Upon panic, each CPU is sent an IPI and shuts itself down, recording
 its state in its crash_notes. When all CPUs are shut down, the
 crash kernel is launched with a pointer to the elfcorehdr.

- The crash kernel via linux/fs/proc/vmcore.c does not examine or
 use the contents of the PT_NOTEs; it only exposes them via /proc/vmcore.

- The makedumpfile utility uses /proc/vmcore and reads the CPU
 PT_NOTEs to craft a nr_cpus variable, which is reported in a
 header but otherwise generally unused. Makedumpfile creates the
 vmcore.

- The 'crash' dump analyzer does not appear to reference the CPU
 PT_NOTEs. Instead it looks up the cpu_[possible|present|online]_mask
 symbols and directly examines those structure contents from vmcore
 memory. From that information it is able to determine which CPUs
 are present and online, and locate the corresponding crash_notes.
 Said differently, it appears that 'crash' analyzer does not rely
 on the ELF PT_NOTEs for CPUs; rather it obtains the information
 directly via kernel symbols and the memory within the vmcore.

(There may be other vmcore generating and analysis tools that do use these
PT_NOTEs, but 'makedumpfile' and 'crash' seem to be the most common
solution.)

This results in the benefit of having all CPUs described in the
elfcorehdr, and therefore reducing the need to re-generate the elfcorehdr
on CPU changes, at the small expense of an additional 56 bytes per PT_NOTE
for not-present-but-possible CPUs.
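
The 56-byte figure matches the size of one ELF64 program header. A small sketch of the overhead arithmetic, assuming a Linux userspace with <elf.h> available; the function name is hypothetical and the CPU counts in the usage note are made-up inputs.

```c
#include <assert.h>
#include <elf.h>
#include <stddef.h>

/*
 * Extra elfcorehdr bytes needed to describe CPUs that are possible
 * but not present, at one ELF64 program header per CPU.
 */
static size_t demo_extra_elfcorehdr_bytes(unsigned int possible,
					  unsigned int present)
{
	assert(possible >= present);
	return (size_t)(possible - present) * sizeof(Elf64_Phdr);
}
```

For example, with 64 possible and 16 present CPUs the function returns 48 * 56 = 2688 bytes.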

On systems where kexec_file_load() syscall is utilized, all the above is
valid.  On systems where kexec_load() syscall is utilized, there may be
the need for the elfcorehdr to be regenerated once.  The reason being that
some archs only populate the 'present' CPUs from the
/sys/devices/system/cpu entries, which the userspace 'kexec' utility uses
to generate the userspace-supplied elfcorehdr.  In this situation, one
memory or CPU change will rewrite the elfcorehdr via the
crash_prepare_elf64_headers() function and now all possible CPUs will be
described, just as with kexec_file_load() syscall.

Link: https://lkml.kernel.org/r/20230814214446.6659-8-eric.devolder@oracle.com
Signed-off-by: Eric DeVolder 
Suggested-by: Sourabh Jain 
Reviewed-by: Sourabh Jain 
Acked-by: Hari Bathini 
Acked-by: Baoquan He 
Cc: Akhil Raj 
Cc: Bjorn Helgaas 
Cc: Borislav Petkov (AMD) 
Cc: Boris Ostrovsky 
Cc: Dave Hansen 
Cc: Dave Young 
Cc: David Hildenbrand 
Cc: Eric W. Biederman 
Cc: Greg Kroah-Hartman 
Cc: "H. Peter Anvin" 
Cc: Ingo Molnar 
Cc: Jonathan Corbet 
Cc: Konrad Rzeszutek Wilk 
Cc: Mimi Zohar 
Cc: Naveen N. Rao 
Cc: Oscar Salvador 
Cc: "Rafael J. Wysocki" 
Cc: Sean Christopherson 
Cc: Takashi Iwai 
Cc: Thomas Gleixner 
Cc: Thomas Weißschuh 
Cc: Valentin Schneider 
Cc: Vivek Goyal 
Cc: Vlastimil Babka 
Signed-off-by: Andrew Morton

crash: hotplug support for kexec_load()

2023-08-24T23:25:14+00:00

The hotplug support for kexec_load() requires changes to the userspace
kexec-tools and a little extra help from the kernel.

Given a kdump capture kernel loaded via kexec_load(), and a subsequent
hotplug event, the crash hotplug handler finds the elfcorehdr and rewrites
it to reflect the hotplug change.  That is the desired outcome, however,
at kernel panic time, the purgatory integrity check fails (because the
elfcorehdr changed), and the capture kernel does not boot and no vmcore is
generated.

Therefore, the userspace kexec-tools/kexec must indicate to the kernel
that the elfcorehdr can be modified (because the kexec excluded the
elfcorehdr from the digest, and sized the elfcorehdr memory buffer
appropriately).

To facilitate hotplug support with kexec_load():
 - a new kexec flag KEXEC_UPDATE_ELFCOREHDR indicates that it is
   safe for the kernel to modify the kexec_load()'d elfcorehdr
 - the /sys/kernel/crash_elfcorehdr_size node communicates the
   preferred size of the elfcorehdr memory buffer
 - the sysfs crash_hotplug nodes (i.e.
   /sys/devices/system/[cpu|memory]/crash_hotplug) dynamically
   take into account kexec_file_load() vs kexec_load() and
   KEXEC_UPDATE_ELFCOREHDR.
   This is critical so that the udev rule processing of crash_hotplug
   is all that is needed to determine if the userspace unload-then-reload
   of the kdump image is to be skipped, or not. The proposed udev
   rule change looks like:

   # The kernel updates the crash elfcorehdr for CPU and memory changes
   SUBSYSTEM=="cpu", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end"
   SUBSYSTEM=="memory", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end"
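
For tooling that wants to make the same check outside of udev, a minimal C sketch of reading a crash_hotplug attribute follows. The function names are hypothetical, and the sketch assumes the attribute contains a decimal "0" or "1", as the rule's =="1" comparison suggests.

```c
#include <assert.h>
#include <stdio.h>

/* Parse attribute text: 1 if enabled, 0 if disabled, -1 if unparsable. */
static int demo_parse_crash_hotplug(const char *text)
{
	int value;

	assert(text != NULL);
	if (sscanf(text, "%d", &value) != 1)
		return -1;
	return value != 0;
}

/* Read the attribute at path; -1 if it is absent or unreadable. */
static int demo_crash_hotplug_enabled(const char *path)
{
	char buf[16];
	int ret = -1;
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	if (fgets(buf, sizeof(buf), f))
		ret = demo_parse_crash_hotplug(buf);
	fclose(f);
	return ret;
}
```

A caller would pass /sys/devices/system/cpu/crash_hotplug or /sys/devices/system/memory/crash_hotplug and treat -1 (node absent) the same as 0, that is, fall back to the unload-then-reload path.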

The table below indicates the behavior of kexec_load()'d kdump image
updates (with the new udev crash_hotplug rule in place):

 Kernel | Kexec
        | Old | New
 -------+-----+-----
 Old    |  a  |  a
 -------+-----+-----
 New    |  a  |  b
 -------+-----+-----

where kexec 'old' and 'new' indicate whether kexec-tools has the needed
modifications for the crash hotplug feature, and kernel 'old' and 'new'
indicate whether the kernel supports this crash hotplug feature.

Behavior 'a' indicates the unload-then-reload of the entire kdump image. 
For the kexec 'old' column, the unload-then-reload occurs due to the
missing flag KEXEC_UPDATE_ELFCOREHDR.  An 'old' kernel (with 'new' kexec)
does not present the crash_hotplug sysfs node, which leads to the
unload-then-reload of the kdump image.

Behavior 'b' indicates the desired optimized behavior of the kernel
directly modifying the elfcorehdr and avoiding the unload-then-reload of
the kdump image.

If the udev rule is not updated with the crash_hotplug node check, then
regardless of whether the kernel or kexec is new or old, the kdump image
continues to be unloaded and then reloaded on hotplug changes.
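
The behavior matrix, including the udev rule condition, can be restated as a small decision function. This is a sketch of the description above, not code from the patch; the names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Returns 'a' (unload-then-reload of the kdump image) or
 * 'b' (the kernel modifies the elfcorehdr directly).
 */
static char demo_kdump_update_behavior(bool kernel_new, bool kexec_new,
				       bool udev_rule_updated)
{
	/* Without the crash_hotplug rule, udev always reloads. */
	if (!udev_rule_updated)
		return 'a';
	/* Only a new kernel together with a new kexec reaches 'b'. */
	return (kernel_new && kexec_new) ? 'b' : 'a';
}

/* Human-readable form of a behavior code. */
static const char *demo_behavior_description(char behavior)
{
	assert(behavior == 'a' || behavior == 'b');
	return behavior == 'a' ?
		"unload-then-reload of the kdump image" :
		"kernel updates the elfcorehdr directly";
}
```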

To fully support the crash hotplug feature, there needs to be a rollout of
kernel, kexec-tools and udev rule changes.  However, the order of the
rollout of these pieces does not matter; kexec_load()'d kdump images still
function for hotplug as-is.

Link: https://lkml.kernel.org/r/20230814214446.6659-7-eric.devolder@oracle.com
Signed-off-by: Eric DeVolder 
Suggested-by: Hari Bathini 
Acked-by: Hari Bathini 
Acked-by: Baoquan He 
Cc: Akhil Raj 
Cc: Bjorn Helgaas 
Cc: Borislav Petkov (AMD) 
Cc: Boris Ostrovsky 
Cc: Dave Hansen 
Cc: Dave Young 
Cc: David Hildenbrand 
Cc: Eric W. Biederman 
Cc: Greg Kroah-Hartman 
Cc: "H. Peter Anvin" 
Cc: Ingo Molnar 
Cc: Jonathan Corbet 
Cc: Konrad Rzeszutek Wilk 
Cc: Mimi Zohar 
Cc: Naveen N. Rao 
Cc: Oscar Salvador 
Cc: "Rafael J. Wysocki" 
Cc: Sean Christopherson 
Cc: Sourabh Jain 
Cc: Takashi Iwai 
Cc: Thomas Gleixner 
Cc: Thomas Weißschuh 
Cc: Valentin Schneider 
Cc: Vivek Goyal 
Cc: Vlastimil Babka 
Signed-off-by: Andrew Morton

crash: add generic infrastructure for crash hotplug support

2023-08-24T23:25:13+00:00

To support crash hotplug, a mechanism is needed to update the crash
elfcorehdr upon CPU or memory changes (e.g. hot un/plug or off/onlining).
The crash elfcorehdr describes the CPUs and memory to be written into the
vmcore.

To track CPU changes, callbacks are registered with the cpuhp mechanism
via cpuhp_setup_state_nocalls(CPUHP_BP_PREPARE_DYN).  The crash hotplug
elfcorehdr update has no explicit ordering requirement (relative to other
cpuhp states), so meets the criteria for utilizing CPUHP_BP_PREPARE_DYN. 
CPUHP_BP_PREPARE_DYN is a dynamic state and avoids the need to introduce a
new state for crash hotplug.  Also, CPUHP_BP_PREPARE_DYN is the last state
in the PREPARE group, just prior to the STARTING group, which is very
close to the CPU starting up in a plug/online situation, or stopping in an
unplug/offline situation.  This minimizes the window of time during an
actual plug/online or unplug/offline situation in which the elfcorehdr
would be inaccurate.  Note that for a CPU being unplugged or offlined, the
CPU will still be present in the list of CPUs generated by
crash_prepare_elf64_headers().  However, there is no need to explicitly
omit the CPU, see justification in 'crash: change
crash_prepare_elf64_headers() to for_each_possible_cpu()'.

To track memory changes, a notifier is registered to capture the memblock
MEM_ONLINE and MEM_OFFLINE events via register_memory_notifier().

The CPU callbacks and memory notifiers invoke crash_handle_hotplug_event()
which performs needed tasks and then dispatches the event to the
architecture specific arch_crash_handle_hotplug_event() to update the
elfcorehdr with the current state of CPUs and memory.  During the process,
the kexec_lock is held.
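
The call flow described above (CPU callbacks and a memory notifier funnel into one generic handler, which holds a lock while calling the architecture handler) can be sketched in plain C. This is a userspace model under stated assumptions: all demo_ names, the enum values and the boolean "lock" are hypothetical, and the real kernel interfaces have different signatures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum demo_hp_action {
	DEMO_HP_ADD_CPU,
	DEMO_HP_REMOVE_CPU,
	DEMO_HP_ADD_MEMORY,
	DEMO_HP_REMOVE_MEMORY,
};

enum demo_mem_event { DEMO_MEM_ONLINE, DEMO_MEM_OFFLINE };

typedef void (*demo_arch_handler_t)(enum demo_hp_action action);

static demo_arch_handler_t demo_arch_handler; /* architecture hook */
static bool demo_lock_held;                   /* stands in for the lock */
static enum demo_hp_action demo_last_action;
static int demo_updates;

/* Example arch handler: records what it was asked to do. */
static void demo_arch_update_elfcorehdr(enum demo_hp_action action)
{
	demo_last_action = action;
	demo_updates++;
}

/* Generic handler: take the lock, dispatch to the arch hook, release. */
static void demo_crash_handle_hotplug_event(enum demo_hp_action action)
{
	assert(!demo_lock_held);
	demo_lock_held = true;
	if (demo_arch_handler != NULL)
		demo_arch_handler(action);
	demo_lock_held = false;
}

/* CPU hotplug callbacks. */
static int demo_cpu_online_cb(unsigned int cpu)
{
	(void)cpu;
	demo_crash_handle_hotplug_event(DEMO_HP_ADD_CPU);
	return 0;
}

static int demo_cpu_offline_cb(unsigned int cpu)
{
	(void)cpu;
	demo_crash_handle_hotplug_event(DEMO_HP_REMOVE_CPU);
	return 0;
}

/* Memory notifier: map online/offline events to add/remove actions. */
static int demo_memory_notifier(enum demo_mem_event event)
{
	demo_crash_handle_hotplug_event(event == DEMO_MEM_ONLINE ?
					DEMO_HP_ADD_MEMORY :
					DEMO_HP_REMOVE_MEMORY);
	return 0;
}
```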

Link: https://lkml.kernel.org/r/20230814214446.6659-3-eric.devolder@oracle.com
Signed-off-by: Eric DeVolder 
Reviewed-by: Sourabh Jain 
Acked-by: Hari Bathini 
Acked-by: Baoquan He 
Cc: Akhil Raj 
Cc: Bjorn Helgaas 
Cc: Borislav Petkov (AMD) 
Cc: Boris Ostrovsky 
Cc: Dave Hansen 
Cc: Dave Young 
Cc: David Hildenbrand 
Cc: Eric W. Biederman 
Cc: Greg Kroah-Hartman 
Cc: "H. Peter Anvin" 
Cc: Ingo Molnar 
Cc: Jonathan Corbet 
Cc: Konrad Rzeszutek Wilk 
Cc: Mimi Zohar 
Cc: Naveen N. Rao 
Cc: Oscar Salvador 
Cc: "Rafael J. Wysocki" 
Cc: Sean Christopherson 
Cc: Takashi Iwai 
Cc: Thomas Gleixner 
Cc: Thomas Weißschuh 
Cc: Valentin Schneider 
Cc: Vivek Goyal 
Cc: Vlastimil Babka 
Signed-off-by: Andrew Morton

crash: move a few code bits to setup support of crash hotplug

2023-08-24T23:25:13+00:00

Patch series "crash: Kernel handling of CPU and memory hot un/plug", v28.

Once the kdump service is loaded, if changes to CPUs or memory occur,
either by hot un/plug or off/onlining, the crash elfcorehdr must also be
updated.

The elfcorehdr describes to kdump the CPUs and memory in the system, and
any inaccuracies can result in a vmcore with missing CPU context or memory
regions.

The current solution utilizes udev to initiate an unload-then-reload of
the kdump image (eg.  kernel, initrd, boot_params, purgatory and
elfcorehdr) by the userspace kexec utility.  In the original post I
outlined the significant performance problems related to offloading this
activity to userspace.

This patchset introduces a generic crash handler that registers with the
CPU and memory notifiers.  Upon CPU or memory changes, from either hot
un/plug or off/onlining, this generic handler is invoked and performs
important housekeeping, for example obtaining the appropriate lock, and
then invokes an architecture specific handler to do the appropriate
elfcorehdr update.

Note the description in patch 'crash: change crash_prepare_elf64_headers()
to for_each_possible_cpu()' and 'x86/crash: optimize CPU changes' that
enables further optimizations related to CPU plug/unplug/online/offline
performance of elfcorehdr updates.

In the case of x86_64, the arch specific handler generates a new
elfcorehdr, and overwrites the old one in memory; thus no involvement with
userspace needed.

To realize the benefits of this patchset, or to test it, one must make a
couple of minor changes to userspace:

 - Prevent udev from updating kdump crash kernel on hot un/plug changes.
   Add the following as the first lines to the RHEL udev rule file
   /usr/lib/udev/rules.d/98-kexec.rules:

   # The kernel updates the crash elfcorehdr for CPU and memory changes
   SUBSYSTEM=="cpu", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end"
   SUBSYSTEM=="memory", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end"

   With this changeset applied, the two rules match for CPU and memory
   change events and jump to kdump_reload_end, thus skipping the
   userspace unload-then-reload of kdump.

 - Change to the kexec_file_load() syscall for loading the kdump kernel.
   E.g. on RHEL: in /usr/bin/kdumpctl, change to:
    standard_kexec_args="-p -d -s"
   which adds the -s to select kexec_file_load() syscall.

This kernel patchset also supports kexec_load() with a modified kexec
userspace utility.  A working changeset to the kexec userspace utility is
posted to the kexec-tools mailing list here:

 http://lists.infradead.org/pipermail/kexec/2023-May/027049.html

To use the kexec-tools patch, apply, build and install kexec-tools, then
change the kdumpctl's standard_kexec_args to replace the -s with
--hotplug.  The removal of -s reverts to the kexec_load syscall and the
addition of --hotplug invokes the changes put forth in the kexec-tools
patch.


This patch (of 8):

The crash hotplug support leans on the work for the kexec_file_load()
syscall.  To also support the kexec_load() syscall, a few bits of code
need to be moved outside of CONFIG_KEXEC_FILE.  As such, these bits are
moved out of kexec_file.c and into a common location crash_core.c.

In addition, struct crash_mem and crash_notes were moved to new locales so
that PROC_KCORE, which sets CRASH_CORE alone, builds correctly.

No functionality change intended.

Link: https://lkml.kernel.org/r/20230814214446.6659-1-eric.devolder@oracle.com
Link: https://lkml.kernel.org/r/20230814214446.6659-2-eric.devolder@oracle.com
Signed-off-by: Eric DeVolder 
Reviewed-by: Sourabh Jain 
Acked-by: Hari Bathini 
Acked-by: Baoquan He 
Cc: Akhil Raj 
Cc: Bjorn Helgaas 
Cc: Borislav Petkov (AMD) 
Cc: Boris Ostrovsky 
Cc: Dave Hansen 
Cc: Dave Young 
Cc: David Hildenbrand 
Cc: Eric W. Biederman 
Cc: Greg Kroah-Hartman 
Cc: "H. Peter Anvin" 
Cc: Ingo Molnar 
Cc: Jonathan Corbet 
Cc: Konrad Rzeszutek Wilk 
Cc: Mimi Zohar 
Cc: Naveen N. Rao 
Cc: Oscar Salvador 
Cc: "Rafael J. Wysocki" 
Cc: Sean Christopherson 
Cc: Takashi Iwai 
Cc: Thomas Gleixner 
Cc: Thomas Weißschuh 
Cc: Valentin Schneider 
Cc: Vivek Goyal 
Cc: Vlastimil Babka 
Signed-off-by: Andrew Morton

mm: free up a word in the first tail page

2023-08-21T21:28:45+00:00

Store the folio order in the low byte of the flags word in the first tail
page.  This frees up the word that was being used to store the order and
dtor bytes previously.
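
A minimal sketch of packing an order into the low byte of a flags word while leaving the other bits alone. The helper names and the mask macro are assumptions for illustration, not the kernel's accessors.

```c
#include <assert.h>

/* The low byte of the flags word holds the order in this sketch. */
#define DEMO_ORDER_MASK 0xffUL

/* Return flags with the low byte replaced by order. */
static unsigned long demo_flags_set_order(unsigned long flags,
					  unsigned int order)
{
	assert(order <= DEMO_ORDER_MASK);
	return (flags & ~DEMO_ORDER_MASK) | order;
}

/* Extract the order from the low byte. */
static unsigned int demo_flags_get_order(unsigned long flags)
{
	return (unsigned int)(flags & DEMO_ORDER_MASK);
}
```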

Link: https://lkml.kernel.org/r/20230816151201.3655946-11-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) 
Cc: David Hildenbrand 
Cc: Jens Axboe 
Cc: Sidhartha Kumar 
Cc: Yanteng Si 
Signed-off-by: Andrew Morton

mm: add large_rmappable page flag

2023-08-21T21:28:44+00:00

Stored in the first tail page's flags, this flag replaces the destructor. 
That removes the last of the destructors, so remove all references to
folio_dtor and compound_dtor.

Link: https://lkml.kernel.org/r/20230816151201.3655946-9-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) 
Cc: David Hildenbrand 
Cc: Jens Axboe 
Cc: Sidhartha Kumar 
Cc: Yanteng Si 
Signed-off-by: Andrew Morton