1 .. _cgroup-v2:
11 conventions of cgroup v2. It describes all userland-visible aspects
14 v1 is available under :ref:`Documentation/admin-guide/cgroup-v1/index.rst <cgroup-v1>`.
19 1-1. Terminology
20 1-2. What is cgroup?
22 2-1. Mounting
23 2-2. Organizing Processes and Threads
24 2-2-1. Processes
25 2-2-2. Threads
26 2-3. [Un]populated Notification
27 2-4. Controlling Controllers
28 2-4-1. Enabling and Disabling
29 2-4-2. Top-down Constraint
30 2-4-3. No Internal Process Constraint
31 2-5. Delegation
32 2-5-1. Model of Delegation
33 2-5-2. Delegation Containment
34 2-6. Guidelines
35 2-6-1. Organize Once and Control
36 2-6-2. Avoid Name Collisions
38 3-1. Weights
39 3-2. Limits
40 3-3. Protections
41 3-4. Allocations
43 4-1. Format
44 4-2. Conventions
45 4-3. Core Interface Files
47 5-1. CPU
48 5-1-1. CPU Interface Files
49 5-2. Memory
50 5-2-1. Memory Interface Files
51 5-2-2. Usage Guidelines
52 5-2-3. Memory Ownership
53 5-3. IO
54 5-3-1. IO Interface Files
55 5-3-2. Writeback
56 5-3-3. IO Latency
57 5-3-3-1. How IO Latency Throttling Works
58 5-3-3-2. IO Latency Interface Files
59 5-3-4. IO Priority
60 5-4. PID
61 5-4-1. PID Interface Files
62 5-5. Cpuset
63 5-5-1. Cpuset Interface Files
64 5-6. Device
65 5-7. RDMA
66 5-7-1. RDMA Interface Files
67 5-8. HugeTLB
68 5-8-1. HugeTLB Interface Files
69 5-9. Misc
70 5-9-1. Miscellaneous cgroup Interface Files
71 5-9-2. Migration and Ownership
72 5-10. Others
73 5-10-1. perf_event
74 5-N. Non-normative information
75 5-N-1. CPU controller root cgroup process behaviour
76 5-N-2. IO controller root cgroup process behaviour
78 6-1. Basics
79 6-2. The Root and Views
80 6-3. Migration and setns(2)
81 6-4. Interaction with Other Namespaces
83 P-1. Filesystem Support for Writeback
86 R-1. Multiple Hierarchies
87 R-2. Thread Granularity
88 R-3. Competition Between Inner Nodes and Threads
89 R-4. Other Interface Issues
90 R-5. Controller Issues and Remedies
91 R-5-1. Memory
98 -----------
107 ---------------
113 cgroup is largely composed of two parts - the core and controllers.
129 hierarchical - if a controller is enabled on a cgroup, it affects all
131 sub-hierarchy of the cgroup. When a controller is enabled on a nested
141 --------
146 # mount -t cgroup2 none $MOUNT_POINT
156 is no longer referenced in its current hierarchy. Because per-cgroup
163 to inter-controller dependencies, other controllers may need to be
184 ignored on non-init namespace mounts. Please refer to the
193 option is ignored on non-init namespace mounts.
201 behavior but is a mount-option to avoid regressing setups
207 --------------------------------
213 A child cgroup can be created by creating a sub-directory::
218 structure. Each cgroup has a read-writable interface file
220 belong to the cgroup one-per-line. The PIDs are not ordered and the
251 0::/test-cgroup/test-cgroup-nested
258 0::/test-cgroup/test-cgroup-nested (deleted)
284 constraint - threaded controllers can be enabled on non-leaf cgroups
308 - As the cgroup will join the parent's resource domain. The parent
311 - When the parent is an unthreaded domain, it must not have any domain
315 Topology-wise, a cgroup can be in an invalid state. Please consider
318 A (threaded domain) - B (threaded) - C (domain, just created)
333 threads in the cgroup. Except that the operations are per-thread
334 instead of per-process, "cgroup.threads" has the same format and
356 between threads in a non-leaf cgroup and its child cgroups. Each
361 --------------------------
363 Each non-root cgroup has a "cgroup.events" file which contains
364 a "populated" field indicating whether the cgroup's sub-hierarchy has
365 live processes in it. Its value is 0 if there is no live process in
367 events are triggered when the value changes. This can be used, for
368 example, to start a clean-up operation after all processes of a given
369 sub-hierarchy have exited. The populated state updates and
370 notifications are recursive. Consider the following sub-hierarchy
374 A(4) - B(0) - C(1)
384 -----------------------
398 # echo "+cpu +memory -io" > cgroup.subtree_control
407 Consider the following sub-hierarchy. The enabled controllers are
410 A(cpu,memory) - B(memory) - C()
424 controller interface files - anything which doesn't start with
428 Top-down Constraint
431 Resources are distributed top-down and a cgroup can further distribute
433 parent. This means that all non-root "cgroup.subtree_control" files
443 Non-root cgroups can distribute domain resources to their children
458 refer to the Non-normative information section in the Controllers
471 ----------
491 delegated, the user can build a sub-hierarchy under the directory,
495 happens in the delegated sub-hierarchy, nothing can escape the
499 cgroups in or nesting depth of a delegated sub-hierarchy; however,
506 A delegated sub-hierarchy is contained in the sense that processes
507 can't be moved into or out of the sub-hierarchy by the delegatee.
510 requiring the following conditions for a process with a non-root euid
514 - The writer must have write access to the "cgroup.procs" file.
516 - The writer must have write access to the "cgroup.procs" file of the
520 processes around freely in the delegated sub-hierarchy it can't pull
521 in from or push out to outside the sub-hierarchy.
527 ~~~~~~~~~~~~~ - C0 - C00
530 ~~~~~~~~~~~~~ - C1 - C10
537 will be denied with -EACCES.
542 is not reachable, the migration is rejected with -ENOENT.
546 ----------
554 inherent trade-offs between migration and various hot paths in terms
560 resource structure once on start-up. Dynamic adjustments to resource
593 -------
599 work-conserving. Due to the dynamic nature, this model is usually
615 ------
618 Limits can be over-committed - the sum of the limits of children can
623 As limits can be over-committed, all configuration combinations are
632 -----------
637 soft boundaries. Protections can also be over-committed in which case
644 As protections can be over-committed, all configuration combinations
648 "memory.low" implements best-effort memory protection and is an
653 -----------
656 resource. Allocations can't be over-committed - the sum of the
663 As allocations can't be over-committed, some configuration
668 "cpu.rt.max" hard-allocates realtime slices and is an example of this
676 ------
681 New-line separated values
682 (when only one value can be written at once)
689 (when read-only or multiple values can be written at once)
715 -----------
717 - Settings for a single feature should be contained in a single file.
719 - The root cgroup should be exempt from resource control and thus
722 - The default time unit is microseconds. If a different unit is ever
725 - A parts-per quantity should use a percentage decimal with at least
726 two digit fractional part - e.g. 13.40.
728 - If a controller implements weight based resource distribution, its
734 - If a controller implements an absolute resource guarantee and/or
743 - If a setting has a configurable default value and keyed specific
747 The default value can be updated by writing either "default $VAL" or
751 the value to indicate removal of the override. Override entries
752 with "default" as the value must not appear when read.
757 # cat cgroup-example-interface-file
761 The default value can be updated by::
763 # echo 125 > cgroup-example-interface-file
767 # echo "default 125" > cgroup-example-interface-file
771 # echo "8:16 170" > cgroup-example-interface-file
775 # echo "8:0 default" > cgroup-example-interface-file
776 # cat cgroup-example-interface-file
780 - For events which are not very high frequency, an interface file
781 "events" should be created which lists event key value pairs.
787 --------------------
792 A read-write single value file which exists on non-root
798 - "domain" : A normal valid domain cgroup.
800 - "domain threaded" : A threaded domain cgroup which is
803 - "domain invalid" : A cgroup which is in an invalid state.
807 - "threaded" : A threaded cgroup which is a member of a
814 A read-write new-line separated values file which exists on
818 the cgroup one-per-line. The PIDs are not ordered and the
827 - It must have write access to the "cgroup.procs" file.
829 - It must have write access to the "cgroup.procs" file of the
832 When delegating a sub-hierarchy, write access to this file
840 A read-write new-line separated values file which exists on
844 the cgroup one-per-line. The TIDs are not ordered and the
853 - It must have write access to the "cgroup.threads" file.
855 - The cgroup that the thread is currently in must be in the
858 - It must have write access to the "cgroup.procs" file of the
861 When delegating a sub-hierarchy, write access to this file
865 A read-only space separated values file which exists on all
872 A read-write space separated values file which exists on all
879 Space separated list of controllers prefixed with '+' or '-'
881 name prefixed with '+' enables the controller and '-'
887 A read-only flat-keyed file which exists on non-root cgroups.
889 otherwise, a value change in this file generates a file
899 A read-write single value file. The default is "max".
906 A read-write single value file. The default is "max".
913 A read-only flat-keyed file with the following entries:
931 A read-write single value file which exists on non-root cgroups.
938 is completed, the "frozen" value in the cgroup.events control file
954 create new sub-cgroups.
957 A write-only single value file which exists in non-root cgroups.
958 The only allowed value is "1".
969 the whole thread-group.
974 .. _cgroup-v2-cpu:
977 ---
1005 A read-only flat-keyed file.
1010 - usage_usec
1011 - user_usec
1012 - system_usec
1016 - nr_periods
1017 - nr_throttled
1018 - throttled_usec
1021 A read-write single value file which exists on non-root
1027 A read-write single value file which exists on non-root
1030 The nice value is in the range [-20, 19].
1035 granularity is coarser for the nice values, the read value is
1039 A read-write two value file which exists on non-root cgroups.
1051 A read-write nested-keyed file.
1057 A read-write single value file which exists on non-root cgroups.
1065 value is used to clamp the task specific minimum utilization clamp.
1068 the current value for the maximum utilization (limit), i.e.
1072 A read-write single value file which exists on non-root cgroups.
1080 value is used to clamp the task specific maximum utilization clamp.
1085 ------
1093 While not completely water-tight, all major memory usages by a given
1098 - Userland memory - page cache and anonymous memory.
1100 - Kernel data structures such as dentries and inodes.
1102 - TCP socket buffers.
1110 All memory amounts are in bytes. If a value which is not aligned to
1111 PAGE_SIZE is written, the value may be rounded up to the closest
1115 A read-only single value file which exists on non-root
1122 A read-write single value file which exists on non-root
1148 A read-write single value file which exists on non-root
1151 Best-effort memory protection. If the memory usage of a
1171 A read-write single value file which exists on non-root
1183 A read-write single value file which exists on non-root
1192 In default configuration regular 0-order allocations always
1197 as -ENOMEM or silently ignore in cases like disk readahead.
1204 A read-write single value file which exists on non-root
1205 cgroups. The default value is "0".
1214 Tasks with the OOM protection (oom_score_adj set to -1000)
1222 A read-only flat-keyed file which exists on non-root cgroups.
1224 otherwise, a value change in this file generates a file
1236 boundary is over-committed.
1256 considered as an option, e.g. for failed high-order
1269 A read-only flat-keyed file which exists on non-root cgroups.
1272 types of memory, type-specific details, and other information
1281 If the entry has no per-node counter (or is not shown in the
1282 memory.numa_stat). We use 'npn' (non-per-node) as the tag
1300 Amount of memory used for storing per-cpu kernel
1307 Amount of cached filesystem data that is swap-backed,
1338 Amount of memory, swap-backed and filesystem-backed,
1344 the value for the foo counter, since the foo counter is type-based, not
1345 list-based.
1356 Amount of memory used for storing in-kernel data
1421 A read-only nested-keyed file which exists on non-root cgroups.
1424 types of memory, type-specific details, and other information
1446 A read-only single value file which exists on non-root
1453 A read-write single value file which exists on non-root
1458 allow userspace to implement custom out-of-memory procedures.
1469 A read-write single value file which exists on non-root
1476 A read-only flat-keyed file which exists on non-root cgroups.
1478 otherwise, a value change in this file generates a file
1492 because of running out of swap system-wide or max
1501 A read-only nested-keyed file.
1511 Over-committing on high limit (sum of high limits > available memory)
1525 pressure - how much the workload is being impacted due to lack of
1526 memory - is necessary to determine whether a workload needs more
1540 To which cgroup the area will be charged is indeterminate; however,
1551 --
1556 only if cfq-iosched is in use and neither scheme is available for
1557 blk-mq devices.
1564 A read-only nested-keyed file.
1584 A read-write nested-keyed file which exists only on the root
1596 enable Weight-based control enable
1597 ctrl "auto" or "user"
1614 8:16 enable=1 ctrl=auto rpct=95.00 rlat=75000 wpct=95.00 wlat=150000 min=50.00 max=150.0
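A minimal sketch of reading one line of a nested-keyed file such as the sample above, assuming each line is a device number followed by space-separated ``key=value`` pairs. The helper name ``parse_nested_keyed_line`` is hypothetical and values are kept as strings rather than interpreted.

```python
# Hypothetical helper, not part of any kernel or library API.
def parse_nested_keyed_line(line):
    """Split "MAJ:MIN key=value ..." into (device, {key: value})."""
    device, *pairs = line.split()
    settings = {}
    for pair in pairs:
        # partition() keeps everything after the first '=' as the value.
        key, _, value = pair.partition("=")
        settings[key] = value
    return device, settings


# Sample line copied from the listing above.
sample = ("8:16 enable=1 ctrl=auto rpct=95.00 rlat=75000 "
          "wpct=95.00 wlat=150000 min=50.00 max=150.0")
device, settings = parse_nested_keyed_line(sample)
print(device, settings["ctrl"], settings["rlat"])
```

Keeping the values as strings avoids guessing at units; a caller that needs numbers can convert individual fields explicitly.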
1628 devices which show wide temporary behavior changes - e.g. a
1632 When "ctrl" is "auto", the parameters are controlled by the
1633 kernel and may change automatically. Setting "ctrl" to "user"
1636 automatic mode can be restored by setting "ctrl" to "auto".
1639 A read-write nested-keyed file which exists only on the root
1651 ctrl "auto" or "user"
1652 model The cost model in use - "linear"
1655 When "ctrl" is "auto", the kernel may change all parameters
1656 dynamically. When "ctrl" is set to "user" or any other
1657 parameters are written to, "ctrl" becomes "user" and the
1678 generate device-specific coefficients.
1681 A read-write flat-keyed file which exists on non-root cgroups.
1701 A read-write nested-keyed file which exists on non-root
1715 When writing, any number of nested key-value pairs can be
1716 specified in any order. "max" can be specified as the value
1740 A read-only nested-keyed file.
1759 writes out dirty pages for the memory domain. Both system-wide and
1760 per-cgroup dirty memory states are examined and the more restrictive
1798 memory controller and system-wide clean memory.
1826 Generally you do not want to set a value lower than the latency your device
1827 supports. Experiment to find the value that works best for your workload.
1829 avg_lat value in io.stat for your workload group to get an idea of the
1830 latency you see during normal operation. Use the avg_lat value as a basis for
1831 your real setting, setting at 10-15% higher than the value in io.stat.
1841 - Queue depth throttling. This is the number of outstanding IOs a group is
1845 - Artificial delay induction. There are certain types of IO that cannot be
1850 fields in io.stat increase. The delay value is how many microseconds that are
1877 calculated by multiplying the win value in io.stat by the
1878 corresponding number of samples based on the win value.
1892 no-change
1895 none-to-rt
1900 restrict-to-be
1911 +-------------+---+
1912 | no-change | 0 |
1913 +-------------+---+
1914 | none-to-rt | 1 |
1915 +-------------+---+
1916 | rt-to-be | 2 |
1917 +-------------+---+
1918 | all-to-idle | 3 |
1919 +-------------+---+
1921 The numerical value that corresponds to each I/O priority class is as follows:
1923 +-------------------------------+---+
1925 +-------------------------------+---+
1926 | IOPRIO_CLASS_RT (real-time) | 1 |
1927 +-------------------------------+---+
1929 +-------------------------------+---+
1931 +-------------------------------+---+
1935 - Translate the I/O priority class policy into a number.
1936 - Change the request I/O priority class into the maximum of the I/O priority
1940 ---
1959 A read-write single value file which exists on non-root
1965 A read-only single value file which exists on all cgroups.
1975 through fork() or clone(). These will return -EAGAIN if the creation
1980 ------
1987 memory placement to reduce cross-node memory access and contention
1998 A read-write multiple values file which exists on non-root
1999 cpuset-enabled cgroups.
2006 The CPU numbers are comma-separated numbers or ranges.
2010 0-4,6,8-10
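A minimal sketch of expanding such a list into individual CPU numbers, assuming the comma-separated numbers-or-ranges format described above. The function name ``parse_cpu_list`` is hypothetical and does no validation beyond integer conversion.

```python
# Hypothetical helper, not part of any kernel or library API.
def parse_cpu_list(text):
    """Expand a list such as "0-4,6,8-10" into sorted CPU numbers."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # an empty value yields an empty list
        if "-" in part:
            first, last = part.split("-", 1)
            cpus.update(range(int(first), int(last) + 1))  # ranges are inclusive
        else:
            cpus.add(int(part))
    return sorted(cpus)


print(parse_cpu_list("0-4,6,8-10"))
```

The same sketch applies to memory node lists such as ``0-1,3``, since they use the same notation.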
2012 An empty value indicates that the cgroup is using the same
2013 setting as the nearest cgroup ancestor with a non-empty
2016 The value of "cpuset.cpus" stays constant until the next update
2020 A read-only multiple values file which exists on all
2021 cpuset-enabled cgroups.
2034 Its value will be affected by CPU hotplug events.
2037 A read-write multiple values file which exists on non-root
2038 cpuset-enabled cgroups.
2045 The memory node numbers are comma-separated numbers or ranges.
2049 0-1,3
2051 An empty value indicates that the cgroup is using the same
2052 setting as the nearest cgroup ancestor with a non-empty
2056 The value of "cpuset.mems" stays constant until the next update
2059 Setting a non-empty value to "cpuset.mems" causes memory of
2071 A read-only multiple values file which exists on all
2072 cpuset-enabled cgroups.
2084 Its value will be affected by memory nodes hotplug events.
2087 A read-write single value file which exists on non-root
2088 cpuset-enabled cgroups. This flag is owned by the parent cgroup
2095 "member" a non-root member of a partition
2129 partition and the new "cpuset.cpus" value is a superset of its
2138 "member" Non-root member of a partition
2165 -----------------
2176 on the return value the attempt will succeed or fail with -EPERM.
2181 If the program returns 0, the attempt fails with -EPERM, otherwise it
2189 ----
2198 A read-write nested-keyed file that exists for all the cgroups
2219 A read-only file that describes current resource usage.
2228 -------
2242 The default value is "max". It exists for all cgroups except root.
2245 A read-only flat-keyed file which exists on non-root cgroups.
2256 ----
2268 Once a capacity is set then the resource usage can be updated using charge and
2278 A read-only flat-keyed file shown only in the root cgroup. It shows
2287 A read-only flat-keyed file shown in the non-root cgroups. It shows
2295 A read-write flat-keyed file shown in the non-root cgroups. Allowed
2310 Limits can be set higher than the capacity value in the misc.capacity
2318 a process to a different cgroup does not move the charge to the destination
2322 ------
2333 Non-normative information
2334 -------------------------
2350 appropriately so the neutral - nice 0 - value is 100 instead of 1024).
2359 weight value of 200.
2366 ------
2385 The path '/batchjobs/container_id1' can be considered as system-data
2390 # ls -l /proc/self/ns/cgroup
2391 lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
2397 # ls -l /proc/self/ns/cgroup
2398 lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
2402 When some thread from a multi-threaded process unshares its cgroup
2414 ------------------
2425 # ~/unshare -c # unshare cgroupns in some cgroup
2433 Each process gets its namespace-specific view of "/proc/$PID/cgroup"
2464 ----------------------
2493 ---------------------------------
2496 running inside a non-init cgroup namespace::
2498 # mount -t cgroup2 none $MOUNT_POINT
2505 the view of cgroup hierarchy by namespace-private cgroupfs mount
2518 --------------------------------
2521 address_space_operations->writepage[s]() to annotate bio's using the
2538 super_block by setting SB_I_CGROUPWB in ->s_iflags. This allows for
2555 - Multiple hierarchies including named ones are not supported.
2557 - All v1 mount options are not supported.
2559 - The "tasks" file is removed and "cgroup.procs" is not sorted.
2561 - "cgroup.clone_children" is removed.
2563 - /proc/cgroups is meaningless for v2. Use "cgroup.controllers" file
2571 --------------------
2624 ------------------
2632 Generally, in-process knowledge is available only to the process
2633 itself; thus, unlike service-level organization of processes,
2640 sub-hierarchies and control resource distributions along them. This
2641 effectively raised cgroup to the status of a syscall-like API exposed
2651 that the process would actually be operating on its own sub-hierarchy.
2655 system-management pseudo filesystem. cgroup ended up with interface
2658 individual applications through the ill-defined delegation mechanism
2668 -------------------------------------------
2679 cycles and the number of internal threads fluctuated - the ratios
2695 clearly defined. There were attempts to add ad-hoc behaviors and
2709 ----------------------
2713 was how an empty cgroup was notified - a userland helper binary was
2716 to in-kernel event delivery filtering mechanism further complicating
2738 ------------------------------
2745 global reclaim prefers is opt-in, rather than opt-out. The costs for
2755 becomes self-defeating.
2757 The memory.low boundary on the other hand is a top-down allocated
2795 new limit is met - or the task writing to memory.max is killed.
2804 groups can sabotage swapping by other means - such as referencing its
2805 anonymous memory in a tight loop - and an admin cannot assume full