| author | Linus Torvalds <torvalds@linux-foundation.org> | 2021-09-08 12:55:35 -0700 |
|---|---|---|
| committer | Linus Torvalds <torvalds@linux-foundation.org> | 2021-09-08 12:55:35 -0700 |
| commit | 2d338201d5311bcd79d42f66df4cecbcbc5f4f2c | |
| tree | 75d87f65c31f4721ba6a5356d2a487af9e2961c3 /Documentation | |
| parent | cc09ee80c3b18ae1a897a30a17fe710b2b2f620a | |
| parent | b285437d1d929785a5bef3603da78d2cd5341893 | |
Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton:
"147 patches, based on 7d2a07b769330c34b4deabeed939325c77a7ec2f.
Subsystems affected by this patch series: mm (memory-hotplug, rmap,
ioremap, highmem, cleanups, secretmem, kfence, damon, and vmscan),
alpha, percpu, procfs, misc, core-kernel, MAINTAINERS, lib,
checkpatch, epoll, init, nilfs2, coredump, fork, pids, criu, kconfig,
selftests, ipc, and scripts"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (94 commits)
scripts: check_extable: fix typo in user error message
mm/workingset: correct kernel-doc notations
ipc: replace costly bailout check in sysvipc_find_ipc()
selftests/memfd: remove unused variable
Kconfig.debug: drop selecting non-existing HARDLOCKUP_DETECTOR_ARCH
configs: remove the obsolete CONFIG_INPUT_POLLDEV
prctl: allow to setup brk for et_dyn executables
pid: cleanup the stale comment mentioning pidmap_init().
kernel/fork.c: unexport get_{mm,task}_exe_file
coredump: fix memleak in dump_vma_snapshot()
fs/coredump.c: log if a core dump is aborted due to changed file permissions
nilfs2: use refcount_dec_and_lock() to fix potential UAF
nilfs2: fix memory leak in nilfs_sysfs_delete_snapshot_group
nilfs2: fix memory leak in nilfs_sysfs_create_snapshot_group
nilfs2: fix memory leak in nilfs_sysfs_delete_##name##_group
nilfs2: fix memory leak in nilfs_sysfs_create_##name##_group
nilfs2: fix NULL pointer in nilfs_##name##_attr_release
nilfs2: fix memory leak in nilfs_sysfs_create_device_group
trap: cleanup trap_init()
init: move usermodehelper_enable() to populate_rootfs()
...
Diffstat (limited to 'Documentation')
| -rw-r--r-- | Documentation/admin-guide/mm/damon/index.rst | 15 | ||||
| -rw-r--r-- | Documentation/admin-guide/mm/damon/start.rst | 114 | ||||
| -rw-r--r-- | Documentation/admin-guide/mm/damon/usage.rst | 112 | ||||
| -rw-r--r-- | Documentation/admin-guide/mm/index.rst | 1 | ||||
| -rw-r--r-- | Documentation/admin-guide/mm/memory-hotplug.rst | 800 | ||||
| -rw-r--r-- | Documentation/dev-tools/kfence.rst | 98 | ||||
| -rw-r--r-- | Documentation/kbuild/llvm.rst | 5 | ||||
| -rw-r--r-- | Documentation/vm/damon/api.rst | 20 | ||||
| -rw-r--r-- | Documentation/vm/damon/design.rst | 166 | ||||
| -rw-r--r-- | Documentation/vm/damon/faq.rst | 51 | ||||
| -rw-r--r-- | Documentation/vm/damon/index.rst | 30 | ||||
| -rw-r--r-- | Documentation/vm/index.rst | 1 |
12 files changed, 1021 insertions, 392 deletions
diff --git a/Documentation/admin-guide/mm/damon/index.rst b/Documentation/admin-guide/mm/damon/index.rst new file mode 100644 index 000000000000..8c5dde3a5754 --- /dev/null +++ b/Documentation/admin-guide/mm/damon/index.rst @@ -0,0 +1,15 @@ +.. SPDX-License-Identifier: GPL-2.0 + +======================== +Monitoring Data Accesses +======================== + +:doc:`DAMON </vm/damon/index>` allows light-weight data access monitoring. +Using DAMON, users can analyze the memory access patterns of their systems and +optimize those. + +.. toctree:: + :maxdepth: 2 + + start + usage diff --git a/Documentation/admin-guide/mm/damon/start.rst b/Documentation/admin-guide/mm/damon/start.rst new file mode 100644 index 000000000000..d5eb89a8fc38 --- /dev/null +++ b/Documentation/admin-guide/mm/damon/start.rst @@ -0,0 +1,114 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=============== +Getting Started +=============== + +This document briefly describes how you can use DAMON by demonstrating its +default user space tool. Please note that this document describes only a part +of its features for brevity. Please refer to :doc:`usage` for more details. + + +TL; DR +====== + +Follow the commands below to monitor and visualize the memory access pattern of +your workload. :: + + # # build the kernel with CONFIG_DAMON_*=y, install it, and reboot + # mount -t debugfs none /sys/kernel/debug/ + # git clone https://github.com/awslabs/damo + # ./damo/damo record $(pidof <your workload>) + # ./damo/damo report heat --plot_ascii + +The final command draws the access heatmap of ``<your workload>``. The heatmap +shows which memory region (x-axis) is accessed when (y-axis) and how frequently +(number; the higher the more accesses have been observed). 
:: + + 111111111111111111111111111111111111111111111111111111110000 + 111121111111111111111111111111211111111111111111111111110000 + 000000000000000000000000000000000000000000000000001555552000 + 000000000000000000000000000000000000000000000222223555552000 + 000000000000000000000000000000000000000011111677775000000000 + 000000000000000000000000000000000000000488888000000000000000 + 000000000000000000000000000000000177888400000000000000000000 + 000000000000000000000000000046666522222100000000000000000000 + 000000000000000000000014444344444300000000000000000000000000 + 000000000000000002222245555510000000000000000000000000000000 + # access_frequency: 0 1 2 3 4 5 6 7 8 9 + # x-axis: space (140286319947776-140286426374096: 101.496 MiB) + # y-axis: time (605442256436361-605479951866441: 37.695430s) + # resolution: 60x10 (1.692 MiB and 3.770s for each character) + + +Prerequisites +============= + +Kernel +------ + +You should first ensure your system is running on a kernel built with +``CONFIG_DAMON_*=y``. + + +User Space Tool +--------------- + +For the demonstration, we will use the default user space tool for DAMON, +called DAMON Operator (DAMO). It is available at +https://github.com/awslabs/damo. The examples below assume that ``damo`` is on +your ``$PATH``. It's not mandatory, though. + +Because DAMO is using the debugfs interface (refer to :doc:`usage` for the +detail) of DAMON, you should ensure debugfs is mounted. Mount it manually as +below:: + + # mount -t debugfs none /sys/kernel/debug/ + +or append the following line to your ``/etc/fstab`` file so that your system +can automatically mount debugfs upon booting:: + + debugfs /sys/kernel/debug debugfs defaults 0 0 + + +Recording Data Access Patterns +============================== + +The commands below record the memory access patterns of a program and save the +monitoring results to a file. 
:: + + $ git clone https://github.com/sjp38/masim + $ cd masim; make; ./masim ./configs/zigzag.cfg & + $ sudo damo record -o damon.data $(pidof masim) + +The first two lines of the commands download an artificial memory access +generator program and run it in the background. The generator will repeatedly +access two 100 MiB sized memory regions one by one. You can substitute this +with your real workload. The last line asks ``damo`` to record the access +pattern in the ``damon.data`` file. + + +Visualizing Recorded Patterns +============================= + +The following three commands visualize the recorded access patterns and save +the results as separate image files. :: + + $ damo report heats --heatmap access_pattern_heatmap.png + $ damo report wss --range 0 101 1 --plot wss_dist.png + $ damo report wss --range 0 101 1 --sortby time --plot wss_chron_change.png + +- ``access_pattern_heatmap.png`` will visualize the data access pattern in a + heatmap, showing which memory region (y-axis) got accessed when (x-axis) + and how frequently (color). +- ``wss_dist.png`` will show the distribution of the working set size. +- ``wss_chron_change.png`` will show how the working set size has + chronologically changed. + +You can view the visualizations of this example workload at [1]_. +Visualizations of other realistic workloads are available at [2]_ [3]_ [4]_. + +.. [1] https://damonitor.github.io/doc/html/v17/admin-guide/mm/damon/start.html#visualizing-recorded-patterns +.. [2] https://damonitor.github.io/test/result/visual/latest/rec.heatmap.1.png.html +.. [3] https://damonitor.github.io/test/result/visual/latest/rec.wss_sz.png.html +.. [4] https://damonitor.github.io/test/result/visual/latest/rec.wss_time.png.html diff --git a/Documentation/admin-guide/mm/damon/usage.rst b/Documentation/admin-guide/mm/damon/usage.rst new file mode 100644 index 000000000000..a72cda374aba --- /dev/null +++ b/Documentation/admin-guide/mm/damon/usage.rst @@ -0,0 +1,112 @@ +.. 
SPDX-License-Identifier: GPL-2.0 + +=============== +Detailed Usages +=============== + +DAMON provides below three interfaces for different users. + +- *DAMON user space tool.* + This is for privileged people such as system administrators who want a + just-working human-friendly interface. Using this, users can use the DAMON’s + major features in a human-friendly way. It may not be highly tuned for + special cases, though. It supports only virtual address spaces monitoring. +- *debugfs interface.* + This is for privileged user space programmers who want more optimized use of + DAMON. Using this, users can use DAMON’s major features by reading + from and writing to special debugfs files. Therefore, you can write and use + your personalized DAMON debugfs wrapper programs that reads/writes the + debugfs files instead of you. The DAMON user space tool is also a reference + implementation of such programs. It supports only virtual address spaces + monitoring. +- *Kernel Space Programming Interface.* + This is for kernel space programmers. Using this, users can utilize every + feature of DAMON most flexibly and efficiently by writing kernel space + DAMON application programs for you. You can even extend DAMON for various + address spaces. + +Nevertheless, you could write your own user space tool using the debugfs +interface. A reference implementation is available at +https://github.com/awslabs/damo. If you are a kernel programmer, you could +refer to :doc:`/vm/damon/api` for the kernel space programming interface. For +the reason, this document describes only the debugfs interface + +debugfs Interface +================= + +DAMON exports three files, ``attrs``, ``target_ids``, and ``monitor_on`` under +its debugfs directory, ``<debugfs>/damon/``. 
+ + +Attributes +---------- + +Users can get and set the ``sampling interval``, ``aggregation interval``, +``regions update interval``, and min/max number of monitoring target regions by +reading from and writing to the ``attrs`` file. To know about the monitoring +attributes in detail, please refer to the :doc:`/vm/damon/design`. For +example, below commands set those values to 5 ms, 100 ms, 1,000 ms, 10 and +1000, and then check it again:: + + # cd <debugfs>/damon + # echo 5000 100000 1000000 10 1000 > attrs + # cat attrs + 5000 100000 1000000 10 1000 + + +Target IDs +---------- + +Some types of address spaces supports multiple monitoring target. For example, +the virtual memory address spaces monitoring can have multiple processes as the +monitoring targets. Users can set the targets by writing relevant id values of +the targets to, and get the ids of the current targets by reading from the +``target_ids`` file. In case of the virtual address spaces monitoring, the +values should be pids of the monitoring target processes. For example, below +commands set processes having pids 42 and 4242 as the monitoring targets and +check it again:: + + # cd <debugfs>/damon + # echo 42 4242 > target_ids + # cat target_ids + 42 4242 + +Note that setting the target ids doesn't start the monitoring. + + +Turning On/Off +-------------- + +Setting the files as described above doesn't incur effect unless you explicitly +start the monitoring. You can start, stop, and check the current status of the +monitoring by writing to and reading from the ``monitor_on`` file. Writing +``on`` to the file starts the monitoring of the targets with the attributes. +Writing ``off`` to the file stops those. DAMON also stops if every target +process is terminated. 
Below example commands turn on, off, and check the +status of DAMON:: + + # cd <debugfs>/damon + # echo on > monitor_on + # echo off > monitor_on + # cat monitor_on + off + +Please note that you cannot write to the above-mentioned debugfs files while +the monitoring is turned on. If you write to the files while DAMON is running, +an error code such as ``-EBUSY`` will be returned. + + +Tracepoint for Monitoring Results +================================= + +DAMON provides the monitoring results via a tracepoint, +``damon:damon_aggregated``. While the monitoring is turned on, you could +record the tracepoint events and show results using tracepoint supporting tools +like ``perf``. For example:: + + # echo on > monitor_on + # perf record -e damon:damon_aggregated & + # sleep 5 + # kill 9 $(pidof perf) + # echo off > monitor_on + # perf script diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst index 4b14d8b50e9e..cbd19d5e625f 100644 --- a/Documentation/admin-guide/mm/index.rst +++ b/Documentation/admin-guide/mm/index.rst @@ -27,6 +27,7 @@ the Linux memory management. concepts cma_debugfs + damon/index hugetlbpage idle_page_tracking ksm diff --git a/Documentation/admin-guide/mm/memory-hotplug.rst b/Documentation/admin-guide/mm/memory-hotplug.rst index c6bae2d77160..03dfbc925252 100644 --- a/Documentation/admin-guide/mm/memory-hotplug.rst +++ b/Documentation/admin-guide/mm/memory-hotplug.rst @@ -1,466 +1,576 @@ .. _admin_guide_memory_hotplug: -============== -Memory Hotplug -============== +================== +Memory Hot(Un)Plug +================== -:Created: Jul 28 2007 -:Updated: Add some details about locking internals: Aug 20 2018 - -This document is about memory hotplug including how-to-use and current status. -Because Memory Hotplug is still under development, contents of this text will -be changed often. 
+This document describes generic Linux support for memory hot(un)plug with +a focus on System RAM, including ZONE_MOVABLE support. .. contents:: :local: -.. note:: +Introduction +============ - (1) x86_64's has special implementation for memory hotplug. - This text does not describe it. - (2) This text assumes that sysfs is mounted at ``/sys``. +Memory hot(un)plug allows for increasing and decreasing the size of physical +memory available to a machine at runtime. In the simplest case, it consists of +physically plugging or unplugging a DIMM at runtime, coordinated with the +operating system. +Memory hot(un)plug is used for various purposes: -Introduction -============ +- The physical memory available to a machine can be adjusted at runtime, up- or + downgrading the memory capacity. This dynamic memory resizing, sometimes + referred to as "capacity on demand", is frequently used with virtual machines + and logical partitions. + +- Replacing hardware, such as DIMMs or whole NUMA nodes, without downtime. One + example is replacing failing memory modules. -Purpose of memory hotplug -------------------------- +- Reducing energy consumption either by physically unplugging memory modules or + by logically unplugging (parts of) memory modules from Linux. -Memory Hotplug allows users to increase/decrease the amount of memory. -Generally, there are two purposes. +Further, the basic memory hot(un)plug infrastructure in Linux is nowadays also +used to expose persistent memory, other performance-differentiated memory and +reserved memory regions as ordinary system RAM to Linux. -(A) For changing the amount of memory. - This is to allow a feature like capacity on demand. -(B) For installing/removing DIMMs or NUMA-nodes physically. - This is to exchange DIMMs/NUMA-nodes, reduce power consumption, etc. +Linux only supports memory hot(un)plug on selected 64 bit architectures, such as +x86_64, arm64, ppc64, s390x and ia64. 
-(A) is required by highly virtualized environments and (B) is required by -hardware which supports memory power management. +Memory Hot(Un)Plug Granularity +------------------------------ -Linux memory hotplug is designed for both purpose. +Memory hot(un)plug in Linux uses the SPARSEMEM memory model, which divides the +physical memory address space into chunks of the same size: memory sections. The +size of a memory section is architecture dependent. For example, x86_64 uses +128 MiB and ppc64 uses 16 MiB. -Phases of memory hotplug +Memory sections are combined into chunks referred to as "memory blocks". The +size of a memory block is architecture dependent and corresponds to the smallest +granularity that can be hot(un)plugged. The default size of a memory block is +the same as memory section size, unless an architecture specifies otherwise. + +All memory blocks have the same size. + +Phases of Memory Hotplug ------------------------ -There are 2 phases in Memory Hotplug: +Memory hotplug consists of two phases: - 1) Physical Memory Hotplug phase - 2) Logical Memory Hotplug phase. +(1) Adding the memory to Linux +(2) Onlining memory blocks -The First phase is to communicate hardware/firmware and make/erase -environment for hotplugged memory. Basically, this phase is necessary -for the purpose (B), but this is good phase for communication between -highly virtualized environments too. +In the first phase, metadata, such as the memory map ("memmap") and page tables +for the direct mapping, is allocated and initialized, and memory blocks are +created; the latter also creates sysfs files for managing newly created memory +blocks. -When memory is hotplugged, the kernel recognizes new memory, makes new memory -management tables, and makes sysfs files for new memory's operation. +In the second phase, added memory is exposed to the page allocator. After this +phase, the memory is visible in memory statistics, such as free and total +memory, of the system. 
-If firmware supports notification of connection of new memory to OS, -this phase is triggered automatically. ACPI can notify this event. If not, -"probe" operation by system administration is used instead. -(see :ref:`memory_hotplug_physical_mem`). +Phases of Memory Hotunplug +-------------------------- -Logical Memory Hotplug phase is to change memory state into -available/unavailable for users. Amount of memory from user's view is -changed by this phase. The kernel makes all memory in it as free pages -when a memory range is available. +Memory hotunplug consists of two phases: -In this document, this phase is described as online/offline. +(1) Offlining memory blocks +(2) Removing the memory from Linux -Logical Memory Hotplug phase is triggered by write of sysfs file by system -administrator. For the hot-add case, it must be executed after Physical Hotplug -phase by hand. -(However, if you writes udev's hotplug scripts for memory hotplug, these -phases can be execute in seamless way.) +In the fist phase, memory is "hidden" from the page allocator again, for +example, by migrating busy memory to other memory locations and removing all +relevant free pages from the page allocator After this phase, the memory is no +longer visible in memory statistics of the system. -Unit of Memory online/offline operation ---------------------------------------- +In the second phase, the memory blocks are removed and metadata is freed. -Memory hotplug uses SPARSEMEM memory model which allows memory to be divided -into chunks of the same size. These chunks are called "sections". The size of -a memory section is architecture dependent. For example, power uses 16MiB, ia64 -uses 1GiB. +Memory Hotplug Notifications +============================ -Memory sections are combined into chunks referred to as "memory blocks". The -size of a memory block is architecture dependent and represents the logical -unit upon which memory online/offline operations are to be performed. 
The -default size of a memory block is the same as memory section size unless an -architecture specifies otherwise. (see :ref:`memory_hotplug_sysfs_files`.) +There are various ways how Linux is notified about memory hotplug events such +that it can start adding hotplugged memory. This description is limited to +systems that support ACPI; mechanisms specific to other firmware interfaces or +virtual machines are not described. -To determine the size (in bytes) of a memory block please read this file:: +ACPI Notifications +------------------ - /sys/devices/system/memory/block_size_bytes +Platforms that support ACPI, such as x86_64, can support memory hotplug +notifications via ACPI. -Kernel Configuration -==================== +In general, a firmware supporting memory hotplug defines a memory class object +HID "PNP0C80". When notified about hotplug of a new memory device, the ACPI +driver will hotplug the memory to Linux. -To use memory hotplug feature, kernel must be compiled with following -config options. +If the firmware supports hotplug of NUMA nodes, it defines an object _HID +"ACPI0004", "PNP0A05", or "PNP0A06". When notified about an hotplug event, all +assigned memory devices are added to Linux by the ACPI driver. -- For all memory hotplug: - - Memory model -> Sparse Memory (``CONFIG_SPARSEMEM``) - - Allow for memory hot-add (``CONFIG_MEMORY_HOTPLUG``) +Similarly, Linux can be notified about requests to hotunplug a memory device or +a NUMA node via ACPI. The ACPI driver will try offlining all relevant memory +blocks, and, if successful, hotunplug the memory from Linux. -- To enable memory removal, the following are also necessary: - - Allow for memory hot remove (``CONFIG_MEMORY_HOTREMOVE``) - - Page Migration (``CONFIG_MIGRATION``) +Manual Probing +-------------- -- For ACPI memory hotplug, the following are also necessary: - - Memory hotplug (under ACPI Support menu) (``CONFIG_ACPI_HOTPLUG_MEMORY``) - - This option can be kernel module. 
+On some architectures, the firmware may not be able to notify the operating +system about a memory hotplug event. Instead, the memory has to be manually +probed from user space. -- As a related configuration, if your box has a feature of NUMA-node hotplug - via ACPI, then this option is necessary too. +The probe interface is located at:: - - ACPI0004,PNP0A05 and PNP0A06 Container Driver (under ACPI Support menu) - (``CONFIG_ACPI_CONTAINER``). + /sys/devices/system/memory/probe - This option can be kernel module too. +Only complete memory blocks can be probed. Individual memory blocks are probed +by providing the physical start address of the memory block:: + % echo addr > /sys/devices/system/memory/probe -.. _memory_hotplug_sysfs_files: +Which results in a memory block for the range [addr, addr + memory_block_size) +being created. -sysfs files for memory hotplug -============================== +.. note:: -All memory blocks have their device information in sysfs. Each memory block -is described under ``/sys/devices/system/memory`` as:: + Using the probe interface is discouraged as it is easy to crash the kernel, + because Linux cannot validate user input; this interface might be removed in + the future. - /sys/devices/system/memory/memoryXXX +Onlining and Offlining Memory Blocks +==================================== -where XXX is the memory block id. +After a memory block has been created, Linux has to be instructed to actually +make use of that memory: the memory block has to be "online". -For the memory block covered by the sysfs directory. It is expected that all -memory sections in this range are present and no memory holes exist in the -range. Currently there is no way to determine if there is a memory hole, but -the existence of one should not affect the hotplug capabilities of the memory -block. +Before a memory block can be removed, Linux has to stop using any memory part of +the memory block: the memory block has to be "offlined". 
-For example, assume 1GiB memory block size. A device for a memory starting at -0x100000000 is ``/sys/device/system/memory/memory4``:: +The Linux kernel can be configured to automatically online added memory blocks +and drivers automatically trigger offlining of memory blocks when trying +hotunplug of memory. Memory blocks can only be removed once offlining succeeded +and drivers may trigger offlining of memory blocks when attempting hotunplug of +memory. - (0x100000000 / 1Gib = 4) +Onlining Memory Blocks Manually +------------------------------- -This device covers address range [0x100000000 ... 0x140000000) +If auto-onlining of memory blocks isn't enabled, user-space has to manually +trigger onlining of memory blocks. Often, udev rules are used to automate this +task in user space. -Under each memory block, you can see 5 files: +Onlining of a memory block can be triggered via:: -- ``/sys/devices/system/memory/memoryXXX/phys_index`` -- ``/sys/devices/system/memory/memoryXXX/phys_device`` -- ``/sys/devices/system/memory/memoryXXX/state`` -- ``/sys/devices/system/memory/memoryXXX/removable`` -- ``/sys/devices/system/memory/memoryXXX/valid_zones`` + % echo online > /sys/devices/system/memory/memoryXXX/state -=================== ============================================================ -``phys_index`` read-only and contains memory block id, same as XXX. -``state`` read-write +Or alternatively:: - - at read: contains online/offline state of memory. - - at write: user can specify "online_kernel", + % echo 1 > /sys/devices/system/memory/memoryXXX/online - "online_movable", "online", "offline" command - which will be performed on all sections in the block. -``phys_device`` read-only: legacy interface only ever used on s390x to - expose the covered storage increment. -``removable`` read-only: legacy interface that indicated whether a memory - block was likely to be offlineable or not. 
Newer kernel - versions return "1" if and only if the kernel supports - memory offlining. -``valid_zones`` read-only: designed to show by which zone memory provided by - a memory block is managed, and to show by which zone memory - provided by an offline memory block could be managed when - onlining. - - The first column shows it`s default zone. - - "memory6/valid_zones: Normal Movable" shows this memoryblock - can be onlined to ZONE_NORMAL by default and to ZONE_MOVABLE - by online_movable. - - "memory7/valid_zones: Movable Normal" shows this memoryblock - can be onlined to ZONE_MOVABLE by default and to ZONE_NORMAL - by online_kernel. -=================== ============================================================ +The kernel will select the target zone automatically, usually defaulting to +``ZONE_NORMAL`` unless ``movablecore=1`` has been specified on the kernel +command line or if the memory block would intersect the ZONE_MOVABLE already. -.. note:: +One can explicitly request to associate an offline memory block with +ZONE_MOVABLE by:: - These directories/files appear after physical memory hotplug phase. + % echo online_movable > /sys/devices/system/memory/memoryXXX/state -If CONFIG_NUMA is enabled the memoryXXX/ directories can also be accessed -via symbolic links located in the ``/sys/devices/system/node/node*`` directories. +Or one can explicitly request a kernel zone (usually ZONE_NORMAL) by:: -For example:: + % echo online_kernel > /sys/devices/system/memory/memoryXXX/state - /sys/devices/system/node/node0/memory9 -> ../../memory/memory9 +In any case, if onlining succeeds, the state of the memory block is changed to +be "online". If it fails, the state of the memory block will remain unchanged +and the above commands will fail. 
-A backlink will also be created:: +Onlining Memory Blocks Automatically +------------------------------------ - /sys/devices/system/memory/memory9/node0 -> ../../node/node0 +The kernel can be configured to try auto-onlining of newly added memory blocks. +If this feature is disabled, the memory blocks will stay offline until +explicitly onlined from user space. -.. _memory_hotplug_physical_mem: +The configured auto-online behavior can be observed via:: -Physical memory hot-add phase -============================= + % cat /sys/devices/system/memory/auto_online_blocks -Hardware(Firmware) Support --------------------------- +Auto-onlining can be enabled by writing ``online``, ``online_kernel`` or +``online_movable`` to that file, like:: -On x86_64/ia64 platform, memory hotplug by ACPI is supported. + % echo online > /sys/devices/system/memory/auto_online_blocks -In general, the firmware (ACPI) which supports memory hotplug defines -memory class object of _HID "PNP0C80". When a notify is asserted to PNP0C80, -Linux's ACPI handler does hot-add memory to the system and calls a hotplug udev -script. This will be done automatically. +Modifying the auto-online behavior will only affect all subsequently added +memory blocks only. -But scripts for memory hotplug are not contained in generic udev package(now). -You may have to write it by yourself or online/offline memory by hand. -Please see :ref:`memory_hotplug_how_to_online_memory` and -:ref:`memory_hotplug_how_to_offline_memory`. +.. note:: -If firmware supports NUMA-node hotplug, and defines an object _HID "ACPI0004", -"PNP0A05", or "PNP0A06", notification is asserted to it, and ACPI handler -calls hotplug code for all of objects which are defined in it. -If memory device is found, memory hotplug code will be called. + In corner cases, auto-onlining can fail. The kernel won't retry. Note that + auto-onlining is not expected to fail in default configurations. 
-Notify memory hot-add event by hand ------------------------------------ +.. note:: -On some architectures, the firmware may not notify the kernel of a memory -hotplug event. Therefore, the memory "probe" interface is supported to -explicitly notify the kernel. This interface depends on -CONFIG_ARCH_MEMORY_PROBE and can be configured on powerpc, sh, and x86 -if hotplug is supported, although for x86 this should be handled by ACPI -notification. + DLPAR on ppc64 ignores the ``offline`` setting and will still online added + memory blocks; if onlining fails, memory blocks are removed again. -Probe interface is located at:: +Offlining Memory Blocks +----------------------- - /sys/devices/system/memory/probe +In the current implementation, Linux's memory offlining will try migrating all +movable pages off the affected memory block. As most kernel allocations, such as +page tables, are unmovable, page migration can fail and, therefore, inhibit +memory offlining from succeeding. -You can tell the physical address of new memory to the kernel by:: +Having the memory provided by memory block managed by ZONE_MOVABLE significantly +increases memory offlining reliability; still, memory offlining can fail in +some corner cases. - % echo start_address_of_new_memory > /sys/devices/system/memory/probe +Further, memory offlining might retry for a long time (or even forever), until +aborted by the user. -Then, [start_address_of_new_memory, start_address_of_new_memory + -memory_block_size] memory range is hot-added. In this case, hotplug script is -not called (in current implementation). You'll have to online memory by -yourself. Please see :ref:`memory_hotplug_how_to_online_memory`. 
+Offlining of a memory block can be triggered via:: -Logical Memory hot-add phase -============================ + % echo offline > /sys/devices/system/memory/memoryXXX/state -State of memory ---------------- +Or alternatively:: -To see (online/offline) state of a memory block, read 'state' file:: + % echo 0 > /sys/devices/system/memory/memoryXXX/online - % cat /sys/device/system/memory/memoryXXX/state +If offlining succeeds, the state of the memory block is changed to be "offline". +If it fails, the state of the memory block will remain unchanged and the above +commands will fail, for example, via:: + bash: echo: write error: Device or resource busy -- If the memory block is online, you'll read "online". -- If the memory block is offline, you'll read "offline". +or via:: + bash: echo: write error: Invalid argument -.. _memory_hotplug_how_to_online_memory: +Observing the State of Memory Blocks +------------------------------------ -How to online memory --------------------- +The state (online/offline/going-offline) of a memory block can be observed +either via:: -When the memory is hot-added, the kernel decides whether or not to "online" -it according to the policy which can be read from "auto_online_blocks" file:: + % cat /sys/device/system/memory/memoryXXX/state - % cat /sys/devices/system/memory/auto_online_blocks +Or alternatively (1/0) via:: -The default depends on the CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE kernel config -option. If it is disabled the default is "offline" which means the newly added -memory is not in a ready-to-use state and you have to "online" the newly added -memory blocks manually. Automatic onlining can be requested by writing "online" -to "auto_online_blocks" file:: + % cat /sys/device/system/memory/memoryXXX/online - % echo online > /sys/devices/system/memory/auto_online_blocks +For an online memory block, the managing zone can be observed via:: -This sets a global policy and impacts all memory blocks that will subsequently -be hotplugged. 
Currently offline blocks keep their state. It is possible, under -certain circumstances, that some memory blocks will be added but will fail to -online. User space tools can check their "state" files -(``/sys/devices/system/memory/memoryXXX/state``) and try to online them manually. + % cat /sys/device/system/memory/memoryXXX/valid_zones -If the automatic onlining wasn't requested, failed, or some memory block was -offlined it is possible to change the individual block's state by writing to the -"state" file:: +Configuring Memory Hot(Un)Plug +============================== - % echo online > /sys/devices/system/memory/memoryXXX/state +There are various ways how system administrators can configure memory +hot(un)plug and interact with memory blocks, especially, to online them. -This onlining will not change the ZONE type of the target memory block, -If the memory block doesn't belong to any zone an appropriate kernel zone -(usually ZONE_NORMAL) will be used unless movable_node kernel command line -option is specified when ZONE_MOVABLE will be used. +Memory Hot(Un)Plug Configuration via Sysfs +------------------------------------------ -You can explicitly request to associate it with ZONE_MOVABLE by:: +Some memory hot(un)plug properties can be configured or inspected via sysfs in:: - % echo online_movable > /sys/devices/system/memory/memoryXXX/state + /sys/devices/system/memory/ -.. note:: current limit: this memory block must be adjacent to ZONE_MOVABLE +The following files are currently defined: -Or you can explicitly request a kernel zone (usually ZONE_NORMAL) by:: +====================== ========================================================= +``auto_online_blocks`` read-write: set or get the default state of new memory + blocks; configure auto-onlining. - % echo online_kernel > /sys/devices/system/memory/memoryXXX/state + The default value depends on the + CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE kernel configuration + option. -.. 
note:: current limit: this memory block must be adjacent to ZONE_NORMAL + See the ``state`` property of memory blocks for details. +``block_size_bytes`` read-only: the size in bytes of a memory block. +``probe |
