.. _hugetlbpage:

=============
HugeTLB Pages
=============

Overview
========

The intent of this file is to give a brief summary of hugetlbpage support in
the Linux kernel. This support is built on top of the multiple page size
support provided by most modern architectures. For example, x86 CPUs normally
support 4K and 2M (and 1G, if architecturally supported) page sizes, the ia64
architecture supports multiple page sizes (4K, 8K, 64K, 256K, 1M, 4M, 16M,
256M), and ppc64 supports 4K and 16M. A TLB is a cache of virtual-to-physical
translations, and is typically a very scarce resource on a processor.
Operating systems try to make the best use of the limited number of TLB
entries. This optimization is more critical now that bigger and bigger
physical memories (several GBs) are more readily available.

Users can make use of huge page support in the Linux kernel either via the
mmap system call or via the standard SYSV shared memory system calls
(shmget, shmat).

First the Linux kernel needs to be built with the CONFIG_HUGETLBFS
(present under "File systems") and CONFIG_HUGETLB_PAGE (selected
automatically when CONFIG_HUGETLBFS is selected) configuration
options.

The ``/proc/meminfo`` file provides information about the total number of
persistent hugetlb pages in the kernel's huge page pool. It also displays
the default huge page size and information about the number of free, reserved
and surplus huge pages in the pool of huge pages of the default size.
The huge page size is needed for generating the proper alignment and
size of the arguments to system calls that map huge page regions.

The output of ``cat /proc/meminfo`` will include lines like::

    HugePages_Total: uuu
    HugePages_Free:  vvv
    HugePages_Rsvd:  www
    HugePages_Surp:  xxx
    Hugepagesize:    yyy kB
    Hugetlb:         zzz kB

where:

HugePages_Total
    is the size of the pool of huge pages.
HugePages_Free
    is the number of huge pages in the pool that are not yet
    allocated.
HugePages_Rsvd
    is short for "reserved," and is the number of huge pages for
    which a commitment to allocate from the pool has been made,
    but no allocation has yet been made. Reserved huge pages
    guarantee that an application will be able to allocate a
    huge page from the pool of huge pages at fault time.
HugePages_Surp
    is short for "surplus," and is the number of huge pages in
    the pool above the value in ``/proc/sys/vm/nr_hugepages``. The
    maximum number of surplus huge pages is controlled by
    ``/proc/sys/vm/nr_overcommit_hugepages``.
Hugepagesize
    is the default huge page size (in kB).
Hugetlb
    is the total amount of memory (in kB) consumed by huge
    pages of all sizes.
    If huge pages of different sizes are in use, this number
    will exceed HugePages_Total \* Hugepagesize. To get more
    detailed information, please refer to
    ``/sys/kernel/mm/hugepages`` (described below).


``/proc/filesystems`` should also show a filesystem of type "hugetlbfs"
configured in the kernel.

``/proc/sys/vm/nr_hugepages`` indicates the current number of "persistent"
huge pages in the kernel's huge page pool. "Persistent" huge pages will be
returned to the huge page pool when freed by a task. A user with root
privileges can dynamically allocate more or free some persistent huge pages
by increasing or decreasing the value of ``nr_hugepages``.

Pages that are used as huge pages are reserved inside the kernel and cannot
be used for other purposes. Huge pages cannot be swapped out under
memory pressure.
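
As noted above, an application needs the default huge page size to generate
properly aligned and sized arguments for the system calls that map huge page
regions. A minimal C sketch (not from the kernel tree; error handling kept
short) that parses it out of ``/proc/meminfo``::

    #include <stdio.h>

    /* Return the default huge page size in bytes, or 0 on failure. */
    static long default_hugepage_size(void)
    {
        char line[128];
        unsigned long kb = 0;
        FILE *f = fopen("/proc/meminfo", "r");

        if (!f)
            return 0;
        while (fgets(line, sizeof(line), f)) {
            /* The line looks like: "Hugepagesize:    2048 kB" */
            if (sscanf(line, "Hugepagesize: %lu kB", &kb) == 1)
                break;
        }
        fclose(f);
        return (long)(kb * 1024);
    }

    int main(void)
    {
        printf("default huge page size: %ld bytes\n",
               default_hugepage_size());
        return 0;
    }
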
Once a number of huge pages have been pre-allocated to the kernel huge page
pool, a user with appropriate privilege can use either the mmap system call
or shared memory system calls to use the huge pages. See the discussion of
:ref:`Using Huge Pages <using_huge_pages>`, below.

The administrator can allocate persistent huge pages on the kernel boot
command line by specifying the "hugepages=N" parameter, where 'N' is the
number of huge pages requested. This is the most reliable method of
allocating huge pages as memory has not yet become fragmented.

Some platforms support multiple huge page sizes. To allocate huge pages
of a specific size, one must precede the huge pages boot command parameters
with a huge page size selection parameter "hugepagesz=<size>". <size> must
be specified in bytes with an optional scale suffix [kKmMgG]. The default
huge page size may be selected with the "default_hugepagesz=<size>" boot
parameter.

When multiple huge page sizes are supported, ``/proc/sys/vm/nr_hugepages``
indicates the current number of pre-allocated huge pages of the default size.
Thus, one can use the following command to dynamically allocate/deallocate
default sized persistent huge pages::

    echo 20 > /proc/sys/vm/nr_hugepages

This command will try to adjust the number of default sized huge pages in the
huge page pool to 20, allocating or freeing huge pages, as required.

On a NUMA platform, the kernel will attempt to distribute the huge page pool
over the set of allowed nodes specified by the NUMA memory policy of the
task that modifies ``nr_hugepages``. The default for the allowed nodes--when
the task has default memory policy--is all on-line nodes with memory. Allowed
nodes with insufficient available, contiguous memory for a huge page will be
silently skipped when allocating persistent huge pages. See the
:ref:`discussion below <mem_policy_and_hp_alloc>`
of the interaction of task memory policy, cpusets and per node attributes
with the allocation and freeing of persistent huge pages.

The success or failure of huge page allocation depends on the amount of
physically contiguous memory that is present in the system at the time of the
allocation attempt. If the kernel is unable to allocate huge pages from
some nodes in a NUMA system, it will attempt to make up the difference by
allocating extra pages on other nodes with sufficient available contiguous
memory, if any.

System administrators may want to put this command in one of the local rc
init files. This will enable the kernel to allocate huge pages early in
the boot process when the possibility of getting physically contiguous pages
is still very high. Administrators can verify the number of huge pages
actually allocated by checking the sysctl or meminfo. To check the per node
distribution of huge pages in a NUMA system, use::

    cat /sys/devices/system/node/node*/meminfo | fgrep Huge

``/proc/sys/vm/nr_overcommit_hugepages`` specifies how large the pool of
huge pages can grow, if more huge pages than ``/proc/sys/vm/nr_hugepages``
are requested by applications. Writing any non-zero value into this file
indicates that the hugetlb subsystem is allowed to try to obtain that
number of "surplus" huge pages from the kernel's normal page pool, when the
persistent huge page pool is exhausted. As these surplus huge pages become
unused, they are freed back to the kernel's normal page pool.
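
These knobs can also be driven from a privileged program rather than the
shell. A small C sketch of the equivalent of writing the two files described
above (the values 20 and 10 are arbitrary examples)::

    #include <stdio.h>

    /* Write a decimal value to a (pseudo) file; returns 0 on success. */
    static int write_ulong(const char *path, unsigned long val)
    {
        FILE *f = fopen(path, "w");

        if (!f)
            return -1;
        fprintf(f, "%lu", val);
        return fclose(f) ? -1 : 0;
    }

    int main(void)
    {
        /* Needs root: 20 persistent pages plus up to 10 surplus pages. */
        if (write_ulong("/proc/sys/vm/nr_hugepages", 20) ||
            write_ulong("/proc/sys/vm/nr_overcommit_hugepages", 10)) {
            perror("hugepage pool setup");
            return 1;
        }
        return 0;
    }
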
When increasing the huge page pool size via ``nr_hugepages``, any existing
surplus pages will first be promoted to persistent huge pages. Then,
additional huge pages will be allocated, if necessary and if possible, to
fulfill the new persistent huge page pool size.

The administrator may shrink the pool of persistent huge pages for
the default huge page size by setting the ``nr_hugepages`` sysctl to a
smaller value. The kernel will attempt to balance the freeing of huge pages
across all nodes in the memory policy of the task modifying ``nr_hugepages``.
Any free huge pages on the selected nodes will be freed back to the kernel's
normal page pool.

Caveat: Shrinking the persistent huge page pool via ``nr_hugepages`` such that
it becomes less than the number of huge pages in use will convert the balance
of the in-use huge pages to surplus huge pages. This will occur even if
the number of surplus pages would exceed the overcommit value. As long as
this condition holds--that is, until ``nr_hugepages+nr_overcommit_hugepages``
is increased sufficiently, or the surplus huge pages go out of use and are
freed--no more surplus huge pages will be allowed to be allocated.

With run-time support for multiple huge page pools, much of the huge page
userspace interface in ``/proc/sys/vm`` has been duplicated in sysfs.
The ``/proc`` interfaces discussed above have been retained for backwards
compatibility. The root huge page control directory in sysfs is::

    /sys/kernel/mm/hugepages

For each huge page size supported by the running kernel, a subdirectory
will exist, of the form::

    hugepages-${size}kB

Inside each of these directories, the same set of files will exist::

    nr_hugepages
    nr_hugepages_mempolicy
    nr_overcommit_hugepages
    free_hugepages
    resv_hugepages
    surplus_hugepages

which function as described above for the default huge page sized case.

.. _mem_policy_and_hp_alloc:

Interaction of Task Memory Policy with Huge Page Allocation/Freeing
====================================================================

Whether huge pages are allocated and freed via the ``/proc`` interface or
the sysfs interface using the ``nr_hugepages_mempolicy`` attribute, the
NUMA nodes from which huge pages are allocated or freed are controlled by the
NUMA memory policy of the task that modifies the ``nr_hugepages_mempolicy``
sysctl or attribute. When the ``nr_hugepages`` attribute is used, mempolicy
is ignored.

The recommended method to allocate or free huge pages to/from the kernel
huge page pool, using the ``nr_hugepages`` example above, is::

    numactl --interleave <node-list> echo 20 \
        >/proc/sys/vm/nr_hugepages_mempolicy

or, more succinctly::

    numactl -m <node-list> echo 20 >/proc/sys/vm/nr_hugepages_mempolicy

This will allocate or free ``abs(20 - nr_hugepages)`` huge pages to or from
the nodes specified in <node-list>, depending on whether the number of
persistent huge pages is initially less than or greater than 20,
respectively. No huge pages will be allocated nor freed on any node not
included in the specified <node-list>.
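
A task can also set up the memory policy itself instead of relying on
``numactl``; this is essentially what ``numactl`` does under the hood. A
hedged C sketch using ``set_mempolicy(2)`` (the node mask covering nodes 0
and 1 is illustrative; compile with ``-lnuma`` for the ``<numaif.h>``
declaration)::

    #include <numaif.h>
    #include <stdio.h>

    int main(void)
    {
        /* Interleave over nodes 0 and 1; adjust for your topology. */
        unsigned long nodemask = (1UL << 0) | (1UL << 1);
        FILE *f;

        if (set_mempolicy(MPOL_INTERLEAVE, &nodemask,
                          8 * sizeof(nodemask))) {
            perror("set_mempolicy");
            return 1;
        }

        /* Huge pages are now allocated/freed only on nodes in the mask. */
        f = fopen("/proc/sys/vm/nr_hugepages_mempolicy", "w");
        if (!f) {
            perror("nr_hugepages_mempolicy");
            return 1;
        }
        fprintf(f, "%d", 20);
        return fclose(f) ? 1 : 0;
    }
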
When adjusting the persistent hugepage count via ``nr_hugepages_mempolicy``,
any memory policy mode--bind, preferred, local or interleave--may be used.
The resulting effect on persistent huge page allocation is as follows:

#. Regardless of mempolicy mode [see
   :ref:`Documentation/admin-guide/mm/numa_memory_policy.rst <numa_memory_policy>`],
   persistent huge pages will be distributed across the node or nodes
   specified in the mempolicy as if "interleave" had been specified.
   However, if a node in the policy does not contain sufficient contiguous
   memory for a huge page, the allocation will not "fallback" to the nearest
   neighbor node with sufficient contiguous memory. To do this would cause
   undesirable imbalance in the distribution of the huge page pool, or
   possibly, allocation of persistent huge pages on nodes not allowed by
   the task's memory policy.

#. One or more nodes may be specified with the bind or interleave policy.
   If more than one node is specified with the preferred policy, only the
   lowest numeric id will be used. Local policy will select the node where
   the task is running at the time the nodes_allowed mask is constructed.
   For local policy to be deterministic, the task must be bound to a cpu or
   cpus in a single node. Otherwise, the task could be migrated to some
   other node at any time after launch and the resulting node will be
   indeterminate. Thus, local policy is not very useful for this purpose.
   Any of the other mempolicy modes may be used to specify a single node.

#. The nodes allowed mask will be derived from any non-default task mempolicy,
   whether this policy was set explicitly by the task itself or one of its
   ancestors, such as numactl. This means that if the task is invoked from a
   shell with non-default policy, that policy will be used. One can specify a
   node list of "all" with numactl --interleave or --membind [-m] to achieve
   interleaving over all nodes in the system or cpuset.

#. Any task mempolicy specified--e.g., using numactl--will be constrained by
   the resource limits of any cpuset in which the task runs. Thus, there will
   be no way for a task with non-default policy running in a cpuset with a
   subset of the system nodes to allocate huge pages outside the cpuset
   without first moving to a cpuset that contains all of the desired nodes.

#. Boot-time huge page allocation attempts to distribute the requested number
   of huge pages over all on-line nodes with memory.

Per Node Hugepages Attributes
==============================

A subset of the contents of the root huge page control directory in sysfs,
described above, will be replicated under the system device directory of
each NUMA node with memory, in::

    /sys/devices/system/node/node[0-9]*/hugepages/

Under this directory, the subdirectory for each supported huge page size
contains the following attribute files::

    nr_hugepages
    free_hugepages
    surplus_hugepages

The ``free_hugepages`` and ``surplus_hugepages`` attribute files are
read-only. They return the number of free and surplus [overcommitted] huge
pages, respectively, on the parent node.

The ``nr_hugepages`` attribute returns the total number of huge pages on the
specified node. When this attribute is written, the number of persistent huge
pages on the parent node will be adjusted to the specified value, if
sufficient resources exist, regardless of the task's mempolicy or cpuset
constraints.

Note that the number of overcommit and reserve pages remain global quantities,
as we don't know until fault time, when the faulting task's mempolicy is
applied, from which node the huge page allocation will be attempted.
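
These per node attributes lend themselves to simple tooling. A C sketch
(the 2048kB page size in the glob pattern is illustrative) that prints the
free huge page count of that size on every node::

    #include <glob.h>
    #include <stdio.h>

    int main(void)
    {
        const char *pat = "/sys/devices/system/node/node*/hugepages/"
                          "hugepages-2048kB/free_hugepages";
        glob_t g;
        size_t i;

        if (glob(pat, 0, NULL, &g))
            return 1;
        for (i = 0; i < g.gl_pathc; i++) {
            FILE *f = fopen(g.gl_pathv[i], "r");
            unsigned long free_pages = 0;

            if (f) {
                fscanf(f, "%lu", &free_pages);
                fclose(f);
            }
            printf("%s: %lu\n", g.gl_pathv[i], free_pages);
        }
        globfree(&g);
        return 0;
    }
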
.. _using_huge_pages:

Using Huge Pages
=================

If user applications are going to request huge pages using the mmap system
call, then it is required that the system administrator mount a file system
of type hugetlbfs::

    mount -t hugetlbfs \
        -o uid=<value>,gid=<value>,mode=<value>,pagesize=<value>,size=<value>,\
        min_size=<value>,nr_inodes=<value> none /mnt/huge

This command mounts a (pseudo) filesystem of type hugetlbfs on the directory
``/mnt/huge``. Any file created on ``/mnt/huge`` uses huge pages.

The ``uid`` and ``gid`` options set the owner and group of the root of the
file system. By default the ``uid`` and ``gid`` of the current process
are taken.

The ``mode`` option sets the mode of the root of the file system to
value & 01777. This value is given in octal. By default the value 0755 is
picked.

If the platform supports multiple huge page sizes, the ``pagesize`` option can
be used to specify the huge page size and associated pool. ``pagesize``
is specified in bytes. If ``pagesize`` is not specified the platform's
default huge page size and associated pool will be used.

The ``size`` option sets the maximum amount of memory (huge pages) allowed
for that filesystem (``/mnt/huge``). The ``size`` option can be specified
in bytes, or as a percentage of the specified huge page pool
(``nr_hugepages``). The size is rounded down to the nearest HPAGE_SIZE
boundary.

The ``min_size`` option sets the minimum amount of memory (huge pages) allowed
for the filesystem. ``min_size`` can be specified in the same way as ``size``,
either bytes or a percentage of the huge page pool.
At mount time, the number of huge pages specified by ``min_size`` are reserved
for use by the filesystem.
If there are not enough free huge pages available, the mount will fail.
As huge pages are allocated to the filesystem and freed, the reserve count
is adjusted so that the sum of allocated and reserved huge pages is always
at least ``min_size``.

The option ``nr_inodes`` sets the maximum number of inodes that ``/mnt/huge``
can use.

If the ``size``, ``min_size`` or ``nr_inodes`` option is not provided on the
command line then no limits are set.

For the ``pagesize``, ``size``, ``min_size`` and ``nr_inodes`` options, you
can use [G|g]/[M|m]/[K|k] to represent giga/mega/kilo.
For example, size=2K has the same meaning as size=2048.

While read system calls are supported on files that reside on hugetlb
file systems, write system calls are not.

Regular chown, chgrp, and chmod commands (with the right permissions) can be
used to change the file attributes on hugetlbfs.

Also, it is important to note that no such mount command is required if
applications are going to use only shmat/shmget system calls or mmap with
MAP_HUGETLB. For an example of how to use mmap with MAP_HUGETLB see
:ref:`map_hugetlb <map_hugetlb>` below.

Users who wish to use hugetlb memory via shared memory segments should be
members of a supplementary group, and the system administrator needs to
configure that gid into ``/proc/sys/vm/hugetlb_shm_group``. It is possible
for the same or different applications to use any combination of mmaps and
shm* calls, though a hugetlbfs mount will be required for mmap calls without
MAP_HUGETLB.
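
In the spirit of the ``hugepage-shm`` selftest referenced below, a condensed
sketch of backing a SYSV shared memory segment with huge pages; the 256MB
length is illustrative and must be a multiple of the huge page size::

    #include <stdio.h>
    #include <string.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>

    #ifndef SHM_HUGETLB
    #define SHM_HUGETLB 04000    /* from <linux/shm.h> */
    #endif

    #define LENGTH (256UL * 1024 * 1024)  /* multiple of huge page size */

    int main(void)
    {
        int shmid = shmget(IPC_PRIVATE, LENGTH,
                           SHM_HUGETLB | IPC_CREAT | SHM_R | SHM_W);
        char *addr;

        if (shmid < 0) {
            perror("shmget");  /* e.g. pool too small, or gid not allowed */
            return 1;
        }
        addr = shmat(shmid, NULL, 0);
        if (addr == (char *)-1) {
            perror("shmat");
            shmctl(shmid, IPC_RMID, NULL);
            return 1;
        }
        memset(addr, 0, LENGTH);          /* touch the huge pages */
        shmdt(addr);
        shmctl(shmid, IPC_RMID, NULL);    /* destroy on last detach */
        return 0;
    }
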
Syscalls that operate on memory backed by hugetlb pages only have their
lengths aligned to the native page size of the processor; they will normally
fail with errno set to EINVAL or exclude hugetlb pages that extend beyond
the length if not hugepage aligned. For example, munmap(2) will fail if
memory is backed by a hugetlb page and the length is smaller than the
hugepage size.


Examples
========

.. _map_hugetlb:

``map_hugetlb``
    see tools/testing/selftests/vm/map_hugetlb.c

``hugepage-shm``
    see tools/testing/selftests/vm/hugepage-shm.c

``hugepage-mmap``
    see tools/testing/selftests/vm/hugepage-mmap.c

The `libhugetlbfs`_ library provides a wide range of userspace tools
to help with huge page usability, environment setup, and control.

.. _libhugetlbfs: https://github.com/libhugetlbfs/libhugetlbfs
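
For quick reference, a condensed sketch in the spirit of the ``map_hugetlb``
selftest above: an anonymous mmap with MAP_HUGETLB, with the munmap length
covering whole huge pages as required by the alignment rule described
earlier (the 256MB length is illustrative)::

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    #ifndef MAP_HUGETLB
    #define MAP_HUGETLB 0x40000    /* from <linux/mman.h> */
    #endif

    #define LENGTH (256UL * 1024 * 1024)  /* multiple of huge page size */

    int main(void)
    {
        void *addr = mmap(NULL, LENGTH, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

        if (addr == MAP_FAILED) {
            perror("mmap");    /* e.g. no free huge pages in the pool */
            return 1;
        }
        memset(addr, 0, LENGTH);    /* fault in the huge pages */

        /* The munmap length must cover whole huge pages (see above). */
        if (munmap(addr, LENGTH))
            perror("munmap");
        return 0;
    }
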