Memory Resource Controller
==========================
The Memory Resource Controller has generically been referred to as the
memory controller in this document. Do not confuse the memory controller
used here with the memory controller that is used in hardware.
When we mention a cgroup (cgroupfs's directory) with the memory controller,
we call it a "memory cgroup". In git logs and source code, you'll find that
patch titles and function names tend to use the shorter "memcg".
Benefits and Purpose of the memory controller
=============================================
The memory controller isolates the memory behaviour of a group of tasks
from the rest of the system. The article on LWN [12] mentions some probable
uses of the memory controller. The memory controller can be used to:

a. Isolate an application or a group of applications.
   Memory-hungry applications can be isolated and limited to a smaller
   amount of memory.
b. Create a cgroup with a limited amount of memory; this can be used
   as a good alternative to booting with mem=XXXX.
c. Virtualization solutions can control the amount of memory they want
   to assign to a virtual machine instance.
d. A CD/DVD burner could control the amount of memory used by the
   rest of the system to ensure that burning does not fail due to lack
   of available memory.
e. There are several other use cases; find one or use the controller just
   for fun (to learn and hack on the VM subsystem).
Current Status: linux-2.6.34-mmotm (development version of April 2010)

Features:
- accounting of anonymous pages, file caches and swap caches, and limiting them.
- pages are linked to per-memcg LRU lists exclusively; there is no global LRU.
- optionally, memory+swap usage can be accounted and limited.
- hierarchical accounting.
- soft limits.
- moving (recharging) of accounted charges when a task moves is selectable.
- usage threshold notifier.
- memory pressure notifier.
- oom-killer disable knob and oom-notifier.
- the root cgroup has no limit controls.
The memory controller has a long history. A request for comments for the
memory controller was posted by Balbir Singh [1]. At the time the RFC was
posted, there were several implementations for memory control. The goal of
the RFC was to build consensus and agreement on the minimal features
required for memory control. The first RSS controller was posted by Balbir
Singh [2] in Feb 2007. Pavel Emelianov [3][4][5] has since posted three
versions of the RSS controller. At OLS, at the resource management BoF,
everyone suggested that we handle both page cache and RSS together. Another
request was raised to allow user-space handling of OOM. The current memory
controller is at version 6; it combines both mapped (RSS) and unmapped Page
Cache Control [11].
The memory controller implementation has been divided into phases. These
are:

1. Memory controller
2. mlock(2) controller
3. Kernel user memory accounting and slab control
4. user mappings length controller

The memory controller is the first controller developed.
2.1. Design
-----------

The core of the design is a counter called the page_counter. The
page_counter tracks the current memory usage and limit of the group of
processes associated with the controller. Each cgroup has a memory
controller specific data structure (mem_cgroup) associated with it.
2.2. Accounting
---------------

::

                +--------------------+
                |  mem_cgroup        |
                |  (page_counter)    |
                +--------------------+
                 /            ^      \
                /             |       \
    +---------------+         |        +---------------+
    | mm_struct     |         |....    | mm_struct     |
    |               |         |        |               |
    +---------------+         |        +---------------+
                              |
                              +---------------+
                                              |
    +---------------+          +------+-------+
    | page          +--------->| page_cgroup  |
    |               |          |              |
    +---------------+          +--------------+

         (Figure 1: Hierarchy of Accounting)
Figure 1 shows the important aspects of the controller:

1. Accounting happens per cgroup.
2. Each mm_struct knows about which cgroup it belongs to.
3. Each page has a pointer to the page_cgroup, which in turn knows the
   cgroup it belongs to.

The accounting is done as follows: mem_cgroup_charge_common() is invoked to
set up the necessary data structures and check if the cgroup that is being
charged is over its limit. If it is, then reclaim is invoked on the cgroup.
If everything goes well, a page meta-data structure called page_cgroup is
updated; page_cgroup has its own LRU on the cgroup.

(*) The page_cgroup structure is allocated at boot/memory-hotplug time.
2.2.1 Accounting details
------------------------

All mapped anon pages (RSS) and cache pages (Page Cache) are accounted.
Some pages which are never reclaimable and will not be on the LRU
are not accounted. We just account pages under usual VM management.

RSS pages are accounted at page fault unless they've already been accounted
for earlier. A file page will be accounted for as Page Cache when it's
inserted into the inode (radix-tree). While it's mapped into the page
tables of processes, duplicate accounting is carefully avoided.

An RSS page is unaccounted when it's fully unmapped. A PageCache page is
unaccounted when it's removed from the radix-tree. Even if RSS pages are
fully unmapped (by kswapd), they may exist as SwapCache in the system until
they are really freed. Such SwapCaches are also accounted.
A swapped-in page is accounted after adding into swapcache.

Note: The kernel does swapin-readahead and reads multiple swaps at once.
Since the page's memcg is recorded into swap whenever memsw is enabled, the
page will be accounted after swapout.

Note: we just account pages-on-LRU because our purpose is to control the
amount of used pages; not-on-LRU pages tend to be out of control from the
VM's point of view.
2.3 Shared Page Accounting
--------------------------

Shared pages are accounted on the basis of the first-touch approach: the
cgroup that first touches a page is accounted for the page. The principle
behind this approach is that a cgroup that aggressively uses a shared page
will eventually get charged for it (once it is uncharged from the cgroup
that brought it in -- this will happen on memory pressure).
2.4 Swap Extension (CONFIG_MEMCG_SWAP)
--------------------------------------

When swap is accounted, the following files are added:

- memory.memsw.usage_in_bytes
- memory.memsw.limit_in_bytes

memsw means memory+swap. Usage of memory+swap is limited by
memsw.limit_in_bytes.

Example: Assume a system with 4G of swap. A task which allocates 6G of
memory (by mistake) under a 2G memory limit will use all of the swap. In
this case, setting memsw.limit_in_bytes=3G will prevent this bad use of
swap. By using the memsw limit, you can avoid a system OOM which can be
caused by swap shortage.
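The example above can be configured as in the following sketch. The group
path /sys/fs/cgroup/memory/myjob is an assumed example; the commands require
root and a mounted cgroup v1 memory hierarchy with swap accounting enabled:

```shell
# Assumed example group; requires root and cgroup v1 swap accounting.
CG=/sys/fs/cgroup/memory/myjob
mkdir -p "$CG"

# 2G limit on plain memory; allocations beyond this spill to swap.
echo 2G > "$CG/memory.limit_in_bytes"

# 3G limit on memory+swap, so a runaway 6G allocation can consume at
# most 1G of the 4G of swap instead of all of it.
echo 3G > "$CG/memory.memsw.limit_in_bytes"

# The kernel reports the committed limits back in bytes.
cat "$CG/memory.limit_in_bytes" "$CG/memory.memsw.limit_in_bytes"
```

Because memsw counts memory plus swap together, raising only
memory.limit_in_bytes can never circumvent the combined cap.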
The global LRU (kswapd) can swap out arbitrary pages. Swap-out means moving
an account from memory to swap, so there is no change in the usage of
memory+swap.
When a cgroup hits memory.memsw.limit_in_bytes, it's useless to do swap-out
in this cgroup. Then, swap-out will not be done by the cgroup routine and
file caches are dropped instead. But as mentioned above, the global LRU can
still swap out memory from the cgroup for the sanity of the system's memory
management state. You can't forbid it by cgroup.
2.5 Reclaim
-----------

The reclaim algorithm has not been modified for cgroups, except that the
pages that are selected for reclaiming come from the per-cgroup LRU list.
When panic_on_oom is set to "2", the whole system will panic.
2.6 Locking
-----------

Lock order is as follows::

  Page lock (PG_locked bit of page->flags)
    mm->page_table_lock or split pte_lock
      lock_page_memcg (memcg->move_lock)
        mapping->i_pages lock
          lruvec->lru_lock

The per-node, per-memcgroup LRU (the cgroup's private LRU) is guarded by
lruvec->lru_lock; the PG_lru bit of page->flags is cleared before
isolating a page from its LRU under lruvec->lru_lock.
2.7 Kernel Memory Extension (CONFIG_MEMCG_KMEM)
-----------------------------------------------

With the Kernel memory extension, the Memory Controller is able to limit
the amount of kernel memory used by the system. Kernel memory is
fundamentally different from user memory, since it can't be swapped out,
which makes it possible to DoS the system by consuming too much of this
precious resource.

Kernel memory accounting is enabled for all memory cgroups by default, but
it can be disabled system-wide by passing cgroup.memory=nokmem to the kernel
at boot time. In this case, kernel memory will not be accounted at all.
2.7.1 Current Kernel Memory resources accounted
-----------------------------------------------

slab pages:
  pages allocated by the SLAB or SLUB allocator are tracked. A copy
  of each kmem_cache is created the first time the cache is touched from
  inside the memcg. The creation is done lazily, so some objects can still
  be skipped while the cache is being created. All objects in a slab page
  should belong to the same memcg. This only fails to hold when a task is
  migrated to a different memcg during the page allocation by the cache.

sockets memory pressure:
  some socket protocols have memory pressure
  thresholds. The Memory Controller allows them to be controlled
  individually per cgroup, instead of globally.
2.7.2 Common use cases
----------------------

U != 0, K < U:
    Kernel memory is a subset of the user memory. This setup is useful in
    deployments where the total amount of memory per-cgroup is
    overcommitted. Overcommitting kernel memory limits is definitely not
    recommended, since the box can still run out of non-reclaimable memory.
3.0. Configuration
------------------
3.1. Prepare the cgroups (see cgroups.txt, Why are cgroups needed?)
-------------------------------------------------------------------

::

        # mount -t tmpfs none /sys/fs/cgroup
        # mkdir /sys/fs/cgroup/memory
        # mount -t cgroup none /sys/fs/cgroup/memory -o memory
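Once the hierarchy is mounted, a group can be created and exercised as in
this sketch; the group name "0" and the 4M limit are example values, and the
commands need root:

```shell
# Make a new group and move the current shell into it.
mkdir /sys/fs/cgroup/memory/0
echo $$ > /sys/fs/cgroup/memory/0/tasks

# Set a 4M limit; reading the file back shows the committed value
# in bytes (4M is page-aligned, so it reads back as 4194304).
echo 4M > /sys/fs/cgroup/memory/0/memory.limit_in_bytes
cat /sys/fs/cgroup/memory/0/memory.limit_in_bytes

# Current usage, and the number of times the limit was hit.
cat /sys/fs/cgroup/memory/0/memory.usage_in_bytes
cat /sys/fs/cgroup/memory/0/memory.failcnt
```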
We can write "-1" to reset ``*.limit_in_bytes`` (unlimited).
availability of memory on the system. The user is required to re-read
this file after a write to guarantee the value committed by the kernel.
Performance testing is also important. To see the memory controller's pure
overhead, testing on tmpfs will give you good numbers for the small
overheads involved.

Page-fault scalability is also important. When measuring a parallel
page-fault test, a multi-process test may be better than a multi-thread
test because the latter adds noise from shared objects/status.

Trying your usual tests under the memory controller is always helpful.
4.1 Troubleshooting
-------------------

A sync followed by "echo 1 > /proc/sys/vm/drop_caches" will help get rid of
some of the pages cached in the cgroup (page cache pages).
4.2 Task migration
------------------
4.3 Removing a cgroup
---------------------
5.1 force_empty
---------------

The typical use case for this interface is before calling rmdir(). Though
rmdir() offlines the memcg, the memcg may still stay there due to charged
file caches. Some out-of-use page caches may remain charged until memory
pressure happens. If you want to avoid that, force_empty will be useful.
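In practice that looks like the following sketch; the group path is an
assumed example, and both commands need root:

```shell
# Reclaim/uncharge as many pages as possible from the example group...
echo 0 > /sys/fs/cgroup/memory/myjob/memory.force_empty

# ...so the subsequent removal does not leave charged caches behind.
rmdir /sys/fs/cgroup/memory/myjob
```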
5.2 stat file
-------------

memory.stat includes the following statistics (per-memory cgroup local
status):

  cache           # of bytes of page cache memory.
  rss             # of bytes of anonymous and swap cache memory (includes
                  transparent hugepages).
  pgpgin          # of charging events to the memory cgroup. The charging
                  event happens each time a page is accounted as either a
                  mapped anon page (RSS) or a cache page (Page Cache) to
                  the cgroup.
  writeback       # of bytes of file/anon cache that are queued for syncing
                  to disk.
  inactive_anon   # of bytes of anonymous and swap cache memory on the
                  inactive LRU list.
  active_anon     # of bytes of anonymous and swap cache memory on the
                  active LRU list.
  inactive_file   # of bytes of file-backed memory on the inactive LRU list.
  active_file     # of bytes of file-backed memory on the active LRU list.

Only anonymous and swap cache memory is listed as part of the 'rss' stat.
This should not be confused with the true 'resident set size' or the amount
of physical memory used by the cgroup. 'rss + mapped_file' will give you
the resident set size of the cgroup. (Note: file and shmem may be shared
among other cgroups. In that case, mapped_file is accounted only when the
memory cgroup is the owner of the page cache.)
5.3 swappiness
--------------
5.4 failcnt
-----------
5.5 usage_in_bytes
------------------

usage_in_bytes does not show the 'exact' value of memory (and swap) usage;
it's a fuzz value for efficient access. (Of course, when necessary, it's
synchronized.) If you want to know the more exact memory usage, you should
use the RSS+CACHE(+SWAP) value in memory.stat (see 5.2).
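For example, the RSS+CACHE(+SWAP) figure can be pulled out of memory.stat
with a one-line awk program. The here-document below stands in for a real
/sys/fs/cgroup/memory/<group>/memory.stat, with made-up values:

```shell
# Sum the rss, cache and swap lines of a memory.stat dump.
awk '$1 == "rss" || $1 == "cache" || $1 == "swap" { total += $2 }
     END { print total }' <<'EOF'
cache 1048576
rss 2097152
swap 524288
mapped_file 0
EOF
# prints 3670016 (= 1048576 + 2097152 + 524288)
```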
5.6 numa_stat
-------------

This is similar to numa_maps but operates on a per-memcg basis. This is
useful for providing visibility into the NUMA locality information within
a memcg, since the pages are allowed to be allocated from any physical
node.

Each memcg's numa_stat file includes "total", "file", "anon" and
"unevictable" per-node page counts, including "hierarchical_<counter>"
which sums up all hierarchical children's values in addition to the
memcg's own value.
The memory controller supports a deep hierarchy and hierarchical accounting.
6.1 Hierarchical accounting and reclaim
---------------------------------------
When the system detects memory contention or low memory, control groups
are pushed back to their soft limits. If the soft limit of each control
group is very high, they are pushed back as much as possible to make sure
that one control group does not starve the others of memory.

Please note that soft limits are a best-effort feature; they come with no
guarantees, but they do their best to make sure that when memory is heavily
contended for, memory is allocated based on the soft-limit hints/setup.
7.1 Interface
-------------
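Soft limits are set through memory.soft_limit_in_bytes; a sketch, with an
assumed example group and a 256M value:

```shell
# Set a 256M soft limit on the assumed example group (requires root).
echo 256M > /sys/fs/cgroup/memory/myjob/memory.soft_limit_in_bytes

# Read back the committed value in bytes.
cat /sys/fs/cgroup/memory/myjob/memory.soft_limit_in_bytes
```

Unlike memory.limit_in_bytes, this value is only enforced under memory
pressure, so usage may sit above it on an idle system.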
8.1 Interface
-------------
Charges are moved only when you move mm->owner, in other words, a leader
of a thread group.
8.2 Type of charges which can be moved
--------------------------------------

Each bit in move_charge_at_immigrate has its own meaning about what type of
charges should be moved.

+---+--------------------------------------------------------------------------+
|bit| what type of charges would be moved ?                                    |
+===+==========================================================================+
| 0 | A charge of an anonymous page (or swap of it) used by the target task.   |
|   | You must enable Swap Extension (see 2.4) to enable move of swap charges. |
+---+--------------------------------------------------------------------------+
| 1 | A charge of file pages (normal file, tmpfs file (e.g. ipc shared memory) |
|   | and swaps of tmpfs file) mmapped by the target task. Unlike anonymous    |
|   | pages, file pages (and swaps) in the range mmapped by the task will be   |
|   | moved even if the task hasn't done page fault, i.e. they might not be    |
|   | the task's "RSS", but other task's "RSS" that maps the same file.        |
+---+--------------------------------------------------------------------------+
8.3 TODO
--------

- All of the moving-charge operations are done under cgroup_mutex. It's not
  good behavior to hold the mutex too long, so we may need some trick.
- create an eventfd using eventfd(2);
- open memory.usage_in_bytes or memory.memsw.usage_in_bytes;
- write a string like "<event_fd> <fd of memory.usage_in_bytes> <threshold>"
  to cgroup.event_control.

The application will be notified through the eventfd when memory usage
crosses the threshold in any direction.

This is applicable to both root and non-root cgroups.
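The kernel source tree ships a small helper,
tools/cgroup/cgroup_event_listener.c, that performs exactly these steps; a
sketch of using it (the build step and the group path are assumptions about
your setup):

```shell
# Build the event-listener helper from a kernel source tree (assumed present).
make -C tools/cgroup cgroup_event_listener

# Block until usage of the assumed example group crosses 50M in either
# direction; the helper does the eventfd(2) + cgroup.event_control dance.
./tools/cgroup/cgroup_event_listener \
        /sys/fs/cgroup/memory/myjob/memory.usage_in_bytes 50M
```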
- create an eventfd using eventfd(2);
- open the memory.oom_control file;
- write a string like "<event_fd> <fd of memory.oom_control>" to
  cgroup.event_control.
You can disable the OOM-killer by writing "1" to the memory.oom_control
file, as::

        # echo 1 > memory.oom_control

If the OOM-killer is disabled, tasks under the cgroup will hang/sleep
on the memory cgroup's OOM-waitqueue when they request accountable memory.
- oom_kill_disable 0 or 1
  (if 1, the oom-killer is disabled)
- under_oom 0 or 1
  (if 1, the memory cgroup is under OOM; tasks may be stopped.)
- oom_kill integer counter
  The number of processes belonging to this cgroup killed by any
  kind of OOM killer.
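These fields are plain "name value" pairs, so they are easy to pick apart
with awk. The here-document below stands in for a real memory.oom_control
file, with made-up values:

```shell
# Extract the under_oom field from an oom_control dump.
awk '$1 == "under_oom" { print $2 }' <<'EOF'
oom_kill_disable 1
under_oom 0
oom_kill 3
EOF
# prints 0
```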
The "low" level means that the system is reclaiming memory for new
allocations. Monitoring this reclaiming activity might be useful for
maintaining cache level. Upon notification, the program (typically an
"Activity Manager") might analyze vmstat and act in advance (i.e.
prematurely shut down unimportant services).
The "medium" level means that the system is experiencing medium memory
pressure: the system might be making swap, paging out active file caches,
etc. Upon this event, applications may decide to further analyze
vmstat/zoneinfo/memcg or internal memory usage statistics and free any
resources that can be easily reconstructed or re-read from a disk.
The "critical" level means that the system is actively thrashing; it is
about to run out of memory (OOM), or the in-kernel OOM killer is even on
its way to trigger. Applications should do whatever they can to help the
system. It might be too late to consult with vmstat or any other
statistics, so it's advisable to take an immediate action.
By default, events are propagated upward until the event is handled, i.e.
events are not pass-through. For example, say you have three cgroups:
A->B->C. You set up an event listener on cgroups A, B and C, and group C
experiences some pressure. In this situation, only group C will receive the
notification, i.e. groups A and B will not receive it. This is done to avoid
excessive "broadcasting" of messages, which disturbs the system and which is
especially bad if we are low on memory or thrashing.
- "default": this is the default behavior specified above. This mode is the
  same as omitting the optional mode parameter, preserved by backwards
  compatibility.

- "hierarchy": events always propagate up to the root, similar to the
  default behavior, except that propagation continues regardless of whether
  there are event listeners at each level. In the above example, groups A, B
  and C will all receive notification of memory pressure.

- "local": events are pass-through, i.e. listeners only receive
  notifications when memory pressure is experienced in the memcg for which
  the notification is registered. In the above example, group C will receive
  the notification if registered for "local" notification and the group
  experiences memory pressure; group B, if registered for local
  notification, will not receive it regardless of whether there is a
  listener on group C.

The level and the event notification mode ("hierarchy" or "local", if
necessary) are specified by a comma-delimited string, i.e. "low,hierarchy"
specifies hierarchical notification for all ancestor memcgs. A notification
that uses the default, non pass-through behavior does not specify a mode.
"medium,local" specifies pass-through notification for the medium level.
- create an eventfd using eventfd(2);
- open memory.pressure_level;
- write a string like "<event_fd> <fd of memory.pressure_level> <level[,mode]>"
  to cgroup.event_control.
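A session exercising these notifications might look like the following
sketch; it assumes root, a mounted cgroup v1 memory hierarchy, and the
cgroup_event_listener helper from the kernel's tools/cgroup directory, and
the 8000000-byte limit is an arbitrary small value chosen to trigger
pressure quickly:

```shell
cd /sys/fs/cgroup/memory
mkdir foo
cd foo

# Listen for "low" pressure events, propagated hierarchically.
cgroup_event_listener memory.pressure_level low,hierarchy &

# Give the group a tiny limit and put the current shell into it.
echo 8000000 > memory.limit_in_bytes
echo 8000000 > memory.memsw.limit_in_bytes
echo $$ > tasks

# Generate memory pressure inside the group.
dd if=/dev/zero | read x
```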
(Expect a bunch of notifications, and eventually, the oom-killer will
trigger.)
1. Make the per-cgroup scanner reclaim not-shared pages first
2. Teach the controller to account for shared pages
Overall, the memory controller has been a stable controller and has been
commented on and discussed quite extensively in the community.
References
==========

1. Singh, Balbir. RFC: Memory Controller, http://lwn.net/Articles/206697/
2. Singh, Balbir. Memory Controller (RSS Control),
   http://lwn.net/Articles/222762/
4. Emelianov, Pavel. RSS controller based on process cgroups (v2)
5. Emelianov, Pavel. RSS controller based on process cgroups (v3)
8. Singh, Balbir. RSS controller v2 test results (lmbench)
9. Singh, Balbir. RSS controller v2 AIM9 results
10. Singh, Balbir. Memory controller v6 test results,
    https://lore.kernel.org/r/20070819094658.654.84837.sendpatchset@balbir-laptop
11. Singh, Balbir. Memory controller introduction (v6),
    https://lore.kernel.org/r/20070817084228.26003.12568.sendpatchset@balbir-laptop
12. Corbet, Jonathan. Controlling memory use in cgroups,
    http://lwn.net/Articles/243795/