User Interface for Resource Allocation in Intel Resource Director Technology

Copyright (C) 2016 Intel Corporation

Fenghua Yu <fenghua.yu@intel.com>
Tony Luck <tony.luck@intel.com>
Vikas Shivappa <vikas.shivappa@intel.com>

This feature is enabled by the CONFIG_INTEL_RDT Kconfig and the
X86 /proc/cpuinfo flag bits:
RDT (Resource Director Technology) Allocation - "rdt_a"
CAT (Cache Allocation Technology) - "cat_l3", "cat_l2"
CDP (Code and Data Prioritization) - "cdp_l3", "cdp_l2"
CQM (Cache QoS Monitoring) - "cqm_llc", "cqm_occup_llc"
MBM (Memory Bandwidth Monitoring) - "cqm_mbm_total", "cqm_mbm_local"
MBA (Memory Bandwidth Allocation) - "mba"

To use the feature mount the file system:

 # mount -t resctrl resctrl [-o cdp[,cdpl2][,mba_MBps]] /sys/fs/resctrl

mount options are:

"cdp": Enable code/data prioritization in L3 cache allocations.
"cdpl2": Enable code/data prioritization in L2 cache allocations.
"mba_MBps": Enable the MBA Software Controller (mba_sc) to specify MBA
 bandwidth in MBps.

L2 and L3 CDP are controlled separately.

RDT features are orthogonal. A particular system may support only
monitoring, only control, or both monitoring and control. Cache
pseudo-locking is a unique way of using cache control to "pin" or
"lock" data in the cache. Details can be found in
"Cache Pseudo-Locking".


The mount succeeds if either of allocation or monitoring is present, but
only those files and directories supported by the system will be created.
For more details on the behavior of the interface during monitoring
and allocation, see the "Resource alloc and monitor groups" section.

Info directory
--------------

The 'info' directory contains information about the enabled
resources. Each resource has its own subdirectory. The subdirectory
names reflect the resource names.

Each subdirectory contains the following files with respect to
allocation:

Cache resource (L3/L2) subdirectory contains the following files
related to allocation:

"num_closids":  The number of CLOSIDs which are valid for this
		resource. The kernel uses the smallest number of
		CLOSIDs of all enabled resources as limit.

"cbm_mask":     The bitmask which is valid for this resource.
		This mask is equivalent to 100%.

"min_cbm_bits": The minimum number of consecutive bits which
		must be set when writing a mask.

"shareable_bits": Bitmask of shareable resource with other executing
		entities (e.g. I/O). User can use this when
		setting up exclusive cache partitions. Note that
		some platforms support devices that have their
		own settings for cache use which can over-ride
		these bits.

"bit_usage":    Annotated capacity bitmasks showing how all
		instances of the resource are used. The legend is:
		"0" - Corresponding region is unused. When the system's
		      resources have been allocated and a "0" is found
		      in "bit_usage" it is a sign that resources are
		      wasted.
		"H" - Corresponding region is used by hardware only
		      but available for software use. If a resource
		      has bits set in "shareable_bits" but not all
		      of these bits appear in the resource groups'
		      schematas then the bits appearing in
		      "shareable_bits" but in no resource group will
		      be marked as "H".
		"X" - Corresponding region is available for sharing and
		      used by hardware and software. These are the
		      bits that appear in "shareable_bits" as
		      well as a resource group's allocation.
		"S" - Corresponding region is used by software
		      and available for sharing.
		"E" - Corresponding region is used exclusively by
		      one resource group. No sharing allowed.
		"P" - Corresponding region is pseudo-locked. No
		      sharing allowed.
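As an illustration of reading these annotations (not part of the resctrl
interface itself), a small hypothetical shell helper can tally how many
capacity bits in a "bit_usage" string carry a given flag for a given cache
domain:

```shell
# Tally one annotation character in a "bit_usage" style string, e.g.
# "0=SSSSSSEE;1=SSSSSS00". Purely illustrative string processing; the
# helper name and interface are made up for this example.
bit_usage_count() {  # usage: bit_usage_count "<bit_usage line>" <domain> <flag>
    local line=$1 dom=$2 flag=$3 field mask=
    # Pick the "<domain>=<flags>" field out of the ';'-separated line.
    for field in $(printf '%s' "$line" | tr ';' ' '); do
        [ "${field%%=*}" = "$dom" ] && mask=${field#*=}
    done
    # Count occurrences of the flag character in that domain's mask.
    printf '%s' "$mask" | tr -cd "$flag" | wc -c
}
```

For example, on an annotation of "0=SSSSSSEE" the helper counts two "E"
(exclusive) bits in domain 0.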

Memory bandwidth (MB) subdirectory contains the following files
with respect to allocation:

"min_bandwidth": The minimum memory bandwidth percentage which
		user can request.

"bandwidth_gran": The granularity in which the memory bandwidth
		percentage is allocated. The allocated
		b/w percentage is rounded off to the next
		control step available on the hardware. The
		available bandwidth control steps are:
		min_bandwidth + N * bandwidth_gran.

"delay_linear": Indicates if the delay scale is linear or
		non-linear. This field is purely informational.

If RDT monitoring is available there will be an "L3_MON" directory
with the following files:

"num_rmids":    The number of RMIDs available. This is the
		upper bound for how many "CTRL_MON" + "MON"
		groups can be created.

"mon_features": Lists the monitoring events if
		monitoring is enabled for the resource.

"max_threshold_occupancy":
		Read/write file provides the largest value (in
		bytes) at which a previously used LLC_occupancy
		counter can be considered for re-use.

Finally, in the top level of the "info" directory there is a file
named "last_cmd_status". This is reset with every "command" issued
via the file system (making new directories or writing to any of the
control files). If the command was successful, it will read as "ok".
If the command failed, it will provide more information than can be
conveyed in the error returns from file operations. E.g.

 # echo L3:0=f7 > schemata
 bash: echo: write error: Invalid argument
 # cat info/last_cmd_status
 mask f7 has non-consecutive 1-bits

Resource alloc and monitor groups
---------------------------------

Resource groups are represented as directories in the resctrl file
system.
The default group is the root directory which, immediately
after mounting, owns all the tasks and cpus in the system and can make
full use of all resources.

On a system with RDT control features additional directories can be
created in the root directory that specify different amounts of each
resource (see "schemata" below). The root and these additional top level
directories are referred to as "CTRL_MON" groups below.

On a system with RDT monitoring the root directory and other top level
directories contain a directory named "mon_groups" in which additional
directories can be created to monitor subsets of tasks in the CTRL_MON
group that is their ancestor. These are called "MON" groups in the rest
of this document.

Removing a directory will move all tasks and cpus owned by the group it
represents to the parent. Removing one of the created CTRL_MON groups
will automatically remove all MON groups below it.

All groups contain the following files:

"tasks":
	Reading this file shows the list of all tasks that belong to
	this group. Writing a task id to the file will add a task to the
	group. If the group is a CTRL_MON group the task is removed from
	whichever previous CTRL_MON group owned the task and also from
	any MON group that owned the task. If the group is a MON group,
	then the task must already belong to the CTRL_MON parent of this
	group. The task is removed from any previous MON group.


"cpus":
	Reading this file shows a bitmask of the logical CPUs owned by
	this group. Writing a mask to this file will add and remove
	CPUs to/from this group. As with the tasks file a hierarchy is
	maintained where MON groups may only include CPUs owned by the
	parent CTRL_MON group.
	When the resource group is in pseudo-locked mode this file will
	only be readable, reflecting the CPUs associated with the
	pseudo-locked region.

"cpus_list":
	Just like "cpus", only using ranges of CPUs instead of bitmasks.


When control is enabled all CTRL_MON groups will also contain:

"schemata":
	A list of all the resources available to this group.
	Each resource has its own line and format - see below for details.

"size":
	Mirrors the display of the "schemata" file to display the size in
	bytes of each allocation instead of the bits representing the
	allocation.

"mode":
	The "mode" of the resource group dictates the sharing of its
	allocations. A "shareable" resource group allows sharing of its
	allocations while an "exclusive" resource group does not. A
	cache pseudo-locked region is created by first writing
	"pseudo-locksetup" to the "mode" file before writing the cache
	pseudo-locked region's schemata to the resource group's "schemata"
	file. On successful pseudo-locked region creation the mode will
	automatically change to "pseudo-locked".

When monitoring is enabled all MON groups will also contain:

"mon_data":
	This contains a set of files organized by L3 domain and by
	RDT event. E.g. on a system with two L3 domains there will
	be subdirectories "mon_L3_00" and "mon_L3_01". Each of these
	directories has one file per event (e.g. "llc_occupancy",
	"mbm_total_bytes", and "mbm_local_bytes"). In a MON group these
	files provide a read out of the current value of the event for
	all tasks in the group. In CTRL_MON groups these files provide
	the sum for all tasks in the CTRL_MON group and all tasks in
	MON groups. Please see the example section for more details on usage.

Resource allocation rules
-------------------------
When a task is running the following rules define which resources are
available to it:

1) If the task is a member of a non-default group, then the schemata
   for that group is used.

2) Else if the task belongs to the default group, but is running on a
   CPU that is assigned to some specific group, then the schemata for the
   CPU's group is used.

3) Otherwise the schemata for the default group is used.

Resource monitoring rules
-------------------------
1) If a task is a member of a MON group, or non-default CTRL_MON group
   then RDT events for the task will be reported in that group.

2) If a task is a member of the default CTRL_MON group, but is running
   on a CPU that is assigned to some specific group, then the RDT events
   for the task will be reported in that group.

3) Otherwise RDT events for the task will be reported in the root level
   "mon_data" group.


Notes on cache occupancy monitoring and control
-----------------------------------------------
When moving a task from one group to another you should remember that
this only affects *new* cache allocations by the task. E.g. you may have
a task in a monitor group showing 3 MB of cache occupancy. If you move
to a new group and immediately check the occupancy of the old and new
groups you will likely see that the old group is still showing 3 MB and
the new group zero. When the task accesses locations still in cache from
before the move, the h/w does not update any counters. On a busy system
you will likely see the occupancy in the old group go down as cache lines
are evicted and re-used while the occupancy in the new group rises as
the task accesses memory and loads into the cache are counted based on
membership in the new group.

The same applies to cache allocation control. Moving a task to a group
with a smaller cache partition will not evict any cache lines. The
process may continue to use them from the old partition.

Hardware uses a CLOSid (Class of service ID) and an RMID (Resource
monitoring ID) to identify a control group and a monitoring group
respectively.
Each of the resource groups is mapped to these IDs based on the kind of
group. The number of CLOSids and RMIDs is limited by the hardware and hence
the creation of a "CTRL_MON" directory may fail if we run out of either
CLOSID or RMID and creation of a "MON" group may fail if we run out of RMIDs.

max_threshold_occupancy - generic concepts
------------------------------------------

Note that an RMID once freed may not be immediately available for use as
the RMID is still tagged in the cache lines of the previous user of the
RMID. Hence such RMIDs are placed on a limbo list and checked back if the
cache occupancy has gone down. If there is a time when the system has a
lot of limbo RMIDs which are not ready to be used, the user may see an
-EBUSY during mkdir.

max_threshold_occupancy is a user configurable value to determine the
occupancy at which an RMID can be freed.

Schemata files - general concepts
---------------------------------
Each line in the file describes one resource. The line starts with
the name of the resource, followed by specific values to be applied
in each of the instances of that resource on the system.

Cache IDs
---------
On current generation systems there is one L3 cache per socket and L2
caches are generally just shared by the hyperthreads on a core, but this
isn't an architectural requirement. We could have multiple separate L3
caches on a socket, multiple cores could share an L2 cache. So instead
of using "socket" or "core" to define the set of logical cpus sharing
a resource we use a "Cache ID". At a given cache level this will be a
unique number across the whole system (but it isn't guaranteed to be a
contiguous sequence, there may be gaps).
To find the ID for each logical
CPU look in /sys/devices/system/cpu/cpu*/cache/index*/id

Cache Bit Masks (CBM)
---------------------
For cache resources we describe the portion of the cache that is available
for allocation using a bitmask. The maximum value of the mask is defined
by each cpu model (and may be different for different cache levels). It
is found using CPUID, but is also provided in the "info" directory of
the resctrl file system in "info/{resource}/cbm_mask". X86 hardware
requires that these masks have all the '1' bits in a contiguous block. So
0x3, 0x6 and 0xC are legal 4-bit masks with two bits set, but 0x5, 0x9
and 0xA are not. On a system with a 20-bit mask each bit represents 5%
of the capacity of the cache. You could partition the cache into four
equal parts with masks: 0x1f, 0x3e0, 0x7c00, 0xf8000.

Memory bandwidth Allocation and monitoring
------------------------------------------

For Memory bandwidth resource, by default the user controls the resource
by indicating the percentage of total memory bandwidth.

The minimum bandwidth percentage value for each cpu model is predefined
and can be looked up through "info/MB/min_bandwidth". The bandwidth
granularity that is allocated is also dependent on the cpu model and can
be looked up at "info/MB/bandwidth_gran". The available bandwidth
control steps are: min_bw + N * bw_gran. Intermediate values are rounded
to the next control step available on the hardware.

The bandwidth throttling is a core specific mechanism on some of Intel
SKUs. Using a high bandwidth and a low bandwidth setting on two threads
sharing a core will result in both threads being throttled to use the
low bandwidth.
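The two rules above - a CBM must have its '1' bits in one contiguous block,
and bandwidth values land on the control steps min_bw + N * bw_gran - can be
sketched as small shell helpers. This is a hypothetical illustration: the
helper names are made up, and rounding up is an assumption here (the
authoritative value is whatever the schemata file reports back after a write):

```shell
# Return success if the hex CBM has one contiguous block of '1' bits,
# e.g. 0x3c0 passes and 0x5 fails. Helper names are illustrative only.
cbm_is_contiguous() {
    local m=$((16#$1))
    [ "$m" -ne 0 ] || return 1
    # Divide out the lowest set bit; a contiguous mask then becomes
    # 2^k - 1, which satisfies t & (t + 1) == 0.
    local t=$(( m / (m & -m) ))
    [ $(( t & (t + 1) )) -eq 0 ]
}

# Round a requested bandwidth percentage to a valid control step
# (rounding up is assumed). usage: bw_round <request> <min_bw> <bw_gran>
bw_round() {
    local req=$1 min=$2 gran=$3
    [ "$req" -gt "$min" ] || { echo "$min"; return; }
    echo $(( min + (req - min + gran - 1) / gran * gran ))
}
```

With min_bandwidth 10 and bandwidth_gran 10, a request of 15% would land on
the 20% step under this rounding assumption.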
The fact that Memory bandwidth allocation (MBA) is a core
specific mechanism whereas memory bandwidth monitoring (MBM) is done at
the package level may lead to confusion when users try to apply control
via the MBA and then monitor the bandwidth to see if the controls are
effective. Below are such scenarios:

1. User may *not* see increase in actual bandwidth when percentage
   values are increased:

This can occur when aggregate L2 external bandwidth is more than L3
external bandwidth. Consider an SKL SKU with 24 cores on a package and
where L2 external is 10GBps (hence aggregate L2 external bandwidth is
240GBps) and L3 external bandwidth is 100GBps. Now a workload with '20
threads, having 50% bandwidth, each consuming 5GBps' consumes the max L3
bandwidth of 100GBps although the percentage value specified is only 50%
<< 100%. Hence increasing the bandwidth percentage will not yield any
more bandwidth. This is because although the L2 external bandwidth still
has capacity, the L3 external bandwidth is fully used. Also note that
this would be dependent on the number of cores the benchmark is run on.

2. Same bandwidth percentage may mean different actual bandwidth
   depending on # of threads:

For the same SKU in #1, a 'single thread, with 10% bandwidth' and '4
thread, with 10% bandwidth' can consume up to 10GBps and 40GBps although
they have the same percentage bandwidth of 10%. This is simply because as
threads start using more cores in an rdtgroup, the actual bandwidth may
increase or vary although the user specified bandwidth percentage is the
same.

In order to mitigate this and make the interface more user friendly,
resctrl added support for specifying the bandwidth in MBps as well.
The kernel underneath would use a software feedback mechanism or a
"Software Controller (mba_sc)" which reads the actual bandwidth using MBM
counters and adjusts the memory bandwidth percentages to ensure

	"actual bandwidth < user specified bandwidth".

By default, the schemata would take the bandwidth percentage values
whereas the user can switch to the "MBA software controller" mode using
the mount option 'mba_MBps'. The schemata format is specified in the
sections below.

L3 schemata file details (code and data prioritization disabled)
----------------------------------------------------------------
With CDP disabled the L3 schemata format is:

	L3:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...

L3 schemata file details (CDP enabled via mount option to resctrl)
------------------------------------------------------------------
When CDP is enabled L3 control is split into two separate resources
so you can specify independent masks for code and data like this:

	L3data:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
	L3code:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...

L2 schemata file details
------------------------
L2 cache does not support code and data prioritization, so the
schemata format is always:

	L2:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...

Memory bandwidth Allocation (default mode)
------------------------------------------

Memory b/w domain is L3 cache.

	MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;...

Memory bandwidth Allocation specified in MBps
---------------------------------------------

Memory bandwidth domain is L3 cache.

	MB:<cache_id0>=bw_MBps0;<cache_id1>=bw_MBps1;...

Reading/writing the schemata file
---------------------------------
Reading the schemata file will show the state of all resources
on all domains. When writing you only need to specify those values
which you wish to change. E.g.

# cat schemata
L3DATA:0=fffff;1=fffff;2=fffff;3=fffff
L3CODE:0=fffff;1=fffff;2=fffff;3=fffff
# echo "L3DATA:2=3c0;" > schemata
# cat schemata
L3DATA:0=fffff;1=fffff;2=3c0;3=fffff
L3CODE:0=fffff;1=fffff;2=fffff;3=fffff

Cache Pseudo-Locking
--------------------
CAT enables a user to specify the amount of cache space that an
application can fill. Cache pseudo-locking builds on the fact that a
CPU can still read and write data pre-allocated outside its current
allocated area on a cache hit. With cache pseudo-locking, data can be
preloaded into a reserved portion of cache that no application can
fill, and from that point on will only serve cache hits. The cache
pseudo-locked memory is made accessible to user space where an
application can map it into its virtual address space and thus have
a region of memory with reduced average read latency.

The creation of a cache pseudo-locked region is triggered by a request
from the user to do so that is accompanied by a schemata of the region
to be pseudo-locked. The cache pseudo-locked region is created as follows:
- Create a CAT allocation CLOSNEW with a CBM matching the schemata
  from the user of the cache region that will contain the pseudo-locked
  memory. This region must not overlap with any current CAT allocation/CLOS
  on the system and no future overlap with this cache region is allowed
  while the pseudo-locked region exists.
- Create a contiguous region of memory of the same size as the cache
  region.
- Flush the cache, disable hardware prefetchers, disable preemption.
- Make CLOSNEW the active CLOS and touch the allocated memory to load
  it into the cache.
- Set the previous CLOS as active.
- At this point the closid CLOSNEW can be released - the cache
  pseudo-locked region is protected as long as its CBM does not appear in
  any CAT allocation.
  Even though the cache pseudo-locked region will from this point on not
  appear in any CBM of any CLOS an application running with any CLOS will
  be able to access the memory in the pseudo-locked region since the
  region continues to serve cache hits.
- The contiguous region of memory loaded into the cache is exposed to
  user-space as a character device.

Cache pseudo-locking increases the probability that data will remain
in the cache via carefully configuring the CAT feature and controlling
application behavior. There is no guarantee that data is placed in
cache. Instructions like INVD, WBINVD, CLFLUSH, etc. can still evict
"locked" data from cache. Power management C-states may shrink or
power off cache. Deeper C-states will automatically be restricted on
pseudo-locked region creation.

It is required that an application using a pseudo-locked region runs
with affinity to the cores (or a subset of the cores) associated
with the cache on which the pseudo-locked region resides. A sanity check
within the code will not allow an application to map pseudo-locked memory
unless it runs with affinity to cores associated with the cache on which the
pseudo-locked region resides. The sanity check is only done during the
initial mmap() handling; there is no enforcement afterwards and the
application itself needs to ensure it remains affine to the correct cores.

Pseudo-locking is accomplished in two stages:
1) During the first stage the system administrator allocates a portion
   of cache that should be dedicated to pseudo-locking. At this time an
   equivalent portion of memory is allocated, loaded into the allocated
   cache portion, and exposed as a character device.
2) During the second stage a user-space application maps (mmap()) the
   pseudo-locked memory into its address space.

Cache Pseudo-Locking Interface
------------------------------
A pseudo-locked region is created using the resctrl interface as follows:

1) Create a new resource group by creating a new directory in /sys/fs/resctrl.
2) Change the new resource group's mode to "pseudo-locksetup" by writing
   "pseudo-locksetup" to the "mode" file.
3) Write the schemata of the pseudo-locked region to the "schemata" file. All
   bits within the schemata should be "unused" according to the "bit_usage"
   file.

On successful pseudo-locked region creation the "mode" file will contain
"pseudo-locked" and a new character device with the same name as the resource
group will exist in /dev/pseudo_lock. This character device can be mmap()'ed
by user space in order to obtain access to the pseudo-locked memory region.

An example of cache pseudo-locked region creation and usage can be found below.

Cache Pseudo-Locking Debugging Interface
----------------------------------------
The pseudo-locking debugging interface is enabled by default (if
CONFIG_DEBUG_FS is enabled) and can be found in /sys/kernel/debug/resctrl.

There is no explicit way for the kernel to test if a provided memory
location is present in the cache. The pseudo-locking debugging interface uses
the tracing infrastructure to provide two ways to measure cache residency of
the pseudo-locked region:
1) Memory access latency using the pseudo_lock_mem_latency tracepoint. Data
   from these measurements are best visualized using a hist trigger (see
   example below). In this test the pseudo-locked region is traversed at
   a stride of 32 bytes while hardware prefetchers and preemption
   are disabled. This also provides a substitute visualization of cache
   hits and misses.
2) Cache hit and miss measurements using model specific precision counters if
   available.
Depending on the levels of cache on the system the pseudo_lock_l2
   and pseudo_lock_l3 tracepoints are available.
   WARNING: triggering this measurement uses from two (for just L2
   measurements) to four (for L2 and L3 measurements) precision counters on
   the system; if any other measurements are in progress the counters and
   their corresponding event registers will be clobbered.

When a pseudo-locked region is created a new debugfs directory is created for
it in debugfs as /sys/kernel/debug/resctrl/<newdir>. A single
write-only file, pseudo_lock_measure, is present in this directory. The
measurement on the pseudo-locked region depends on the number, 1 or 2,
written to this debugfs file. Since the measurements are recorded with the
tracing infrastructure the relevant tracepoints need to be enabled before the
measurement is triggered.

Example of latency debugging interface:
In this example a pseudo-locked region named "newlock" was created. Here is
how we can measure the latency in cycles of reading from this region and
visualize this data with a histogram that is available if CONFIG_HIST_TRIGGERS
is set:
# :> /sys/kernel/debug/tracing/trace
# echo 'hist:keys=latency' > /sys/kernel/debug/tracing/events/resctrl/pseudo_lock_mem_latency/trigger
# echo 1 > /sys/kernel/debug/tracing/events/resctrl/pseudo_lock_mem_latency/enable
# echo 1 > /sys/kernel/debug/resctrl/newlock/pseudo_lock_measure
# echo 0 > /sys/kernel/debug/tracing/events/resctrl/pseudo_lock_mem_latency/enable
# cat /sys/kernel/debug/tracing/events/resctrl/pseudo_lock_mem_latency/hist

# event histogram
#
# trigger info: hist:keys=latency:vals=hitcount:sort=hitcount:size=2048 [active]
#

{ latency:        456 } hitcount:          1
{ latency:         50 } hitcount:         83
{ latency:         36 } hitcount:         96
{ latency:         44 } hitcount:        174
{ latency:         48 } hitcount:        195
{ latency:         46 } hitcount:        262
{ latency:         42 } hitcount:        693
{ latency:         40 } hitcount:       3204
{ latency:         38 } hitcount:       3484

Totals:
    Hits: 8192
    Entries: 9
    Dropped: 0

Example of cache hits/misses debugging:
In this example a pseudo-locked region named "newlock" was created on the L2
cache of a platform. Here is how we can obtain details of the cache hits
and misses using the platform's precision counters.

# :> /sys/kernel/debug/tracing/trace
# echo 1 > /sys/kernel/debug/tracing/events/resctrl/pseudo_lock_l2/enable
# echo 2 > /sys/kernel/debug/resctrl/newlock/pseudo_lock_measure
# echo 0 > /sys/kernel/debug/tracing/events/resctrl/pseudo_lock_l2/enable
# cat /sys/kernel/debug/tracing/trace

# tracer: nop
#
#                              _-----=> irqs-off
#                             / _----=> need-resched
#                            | / _---=> hardirq/softirq
#                            || / _--=> preempt-depth
#                            ||| /     delay
#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
#              | |       |   ||||       |         |
 pseudo_lock_mea-1672  [002] ....  3132.860500: pseudo_lock_l2: hits=4097 miss=0


Examples for RDT allocation usage:

Example 1
---------
On a two socket machine (one L3 cache per socket) with just four bits
for cache bit masks, minimum b/w of 10% with a memory bandwidth
granularity of 10%.

# mount -t resctrl resctrl /sys/fs/resctrl
# cd /sys/fs/resctrl
# mkdir p0 p1
# echo -e "L3:0=3;1=c\nMB:0=50;1=50" > /sys/fs/resctrl/p0/schemata
# echo -e "L3:0=3;1=3\nMB:0=50;1=50" > /sys/fs/resctrl/p1/schemata

The default resource group is unmodified, so we have access to all parts
of all caches (its schemata file reads "L3:0=f;1=f").

Tasks that are under the control of group "p0" may only allocate from the
"lower" 50% on cache ID 0, and the "upper" 50% of cache ID 1.
Tasks in group "p1" use the "lower" 50% of cache on both sockets.

Similarly, tasks that are under the control of group "p0" may use a
maximum memory b/w of 50% on socket 0 and 50% on socket 1.
Tasks in group "p1" may also use 50% memory b/w on both sockets.
Note that unlike cache masks, memory b/w cannot specify whether these
allocations can overlap or not. The allocations specify the maximum
b/w that the group may be able to use and the system admin can configure
the b/w accordingly.

If the MBA is specified in MB (megabytes) then the user can enter the max
b/w in MB rather than the percentage values.

# echo -e "L3:0=3;1=c\nMB:0=1024;1=500" > /sys/fs/resctrl/p0/schemata
# echo -e "L3:0=3;1=3\nMB:0=1024;1=500" > /sys/fs/resctrl/p1/schemata

In the above example the tasks in "p1" and "p0" on socket 0 would use a max
b/w of 1024MB whereas on socket 1 they would use 500MB.

Example 2
---------
Again two sockets, but this time with a more realistic 20-bit mask.

Two real time tasks pid=1234 running on processor 0 and pid=5678 running on
processor 1 on socket 0 on a 2-socket and dual core machine. To avoid noisy
neighbors, each of the two real-time tasks exclusively occupies one quarter
of L3 cache on socket 0.

# mount -t resctrl resctrl /sys/fs/resctrl
# cd /sys/fs/resctrl

First we reset the schemata for the default group so that the "upper"
50% of the L3 cache on socket 0 and 50% of memory b/w cannot be used by
ordinary tasks:

# echo -e "L3:0=3ff;1=fffff\nMB:0=50;1=100" > schemata

Next we make a resource group for our first real time task and give
it access to the "top" 25% of the cache on socket 0.

# mkdir p0
# echo "L3:0=f8000;1=fffff" > p0/schemata

Finally we move our first real time task into this resource group. We
also use taskset(1) to ensure the task always runs on a dedicated CPU
on socket 0. Most uses of resource groups will also constrain which
processors tasks run on.

# echo 1234 > p0/tasks
# taskset -cp 1 1234

Ditto for the second real time task (with the remaining 25% of cache):

# mkdir p1
# echo "L3:0=7c00;1=fffff" > p1/schemata
# echo 5678 > p1/tasks
# taskset -cp 2 5678

For the same 2 socket system with memory b/w resource and CAT L3 the
schemata would look like (assume min_bandwidth is 10 and bandwidth_gran
is 10):

For our first real time task this would request 20% memory b/w on socket
0.

# echo -e "L3:0=f8000;1=fffff\nMB:0=20;1=100" > p0/schemata

For our second real time task this would request another 20% memory b/w
on socket 0.

# echo -e "L3:0=7c00;1=fffff\nMB:0=20;1=100" > p1/schemata

Example 3
---------

A single socket system which has real-time tasks running on cores 4-7 and
non real-time workload assigned to cores 0-3. The real-time tasks share
text and data, so a per task association is not required and due to
interaction with the kernel it's desired that the kernel on these cores
shares L3 with the tasks.

# mount -t resctrl resctrl /sys/fs/resctrl
# cd /sys/fs/resctrl

First we reset the schemata for the default group so that the "upper"
50% of the L3 cache on socket 0, and 50% of memory bandwidth on socket 0
cannot be used by ordinary tasks:

# echo -e "L3:0=3ff\nMB:0=50" > schemata

Next we make a resource group for our real time cores and give it access
to the "top" 50% of the cache on socket 0 and 50% of memory bandwidth on
socket 0.

# mkdir p0
# echo -e "L3:0=ffc00\nMB:0=50" > p0/schemata

Finally we move cores 4-7 over to the new group and make sure that the
kernel and the tasks running there get 50% of the cache. They should
also get 50% of memory bandwidth assuming that the cores 4-7 are SMT
siblings and only the real time threads are scheduled on the cores 4-7.
# echo F0 > p0/cpus

Example 4
---------

The resource groups in previous examples were all in the default "shareable"
mode allowing sharing of their cache allocations. If one resource group
configures a cache allocation then nothing prevents another resource group
from overlapping with that allocation.

In this example a new exclusive resource group will be created on a L2 CAT
system with two L2 cache instances that can be configured with an 8-bit
capacity bitmask. The new exclusive resource group will be configured to use
25% of each cache instance.

# mount -t resctrl resctrl /sys/fs/resctrl/
# cd /sys/fs/resctrl

First, we observe that the default group is configured to allocate to all L2
cache:

# cat schemata
L2:0=ff;1=ff

We could attempt to create the new resource group at this point, but it will
fail because of the overlap with the schemata of the default group:

# mkdir p0
# echo 'L2:0=0x3;1=0x3' > p0/schemata
# cat p0/mode
shareable
# echo exclusive > p0/mode
-sh: echo: write error: Invalid argument
# cat info/last_cmd_status
schemata overlaps

To ensure that there is no overlap with another resource group the default
resource group's schemata has to change, making it possible for the new
resource group to become exclusive.

# echo 'L2:0=0xfc;1=0xfc' > schemata
# echo exclusive > p0/mode
# grep . p0/*
p0/cpus:0
p0/mode:exclusive
p0/schemata:L2:0=03;1=03
p0/size:L2:0=262144;1=262144

A new resource group will on creation not overlap with an exclusive resource
group:

# mkdir p1
# grep . p1/*
p1/cpus:0
p1/mode:shareable
p1/schemata:L2:0=fc;1=fc
p1/size:L2:0=786432;1=786432

The bit_usage will reflect how the cache is used:

# cat info/L2/bit_usage
0=SSSSSSEE;1=SSSSSSEE

A resource group cannot be forced to overlap with an exclusive resource group:

# echo 'L2:0=0x1;1=0x1' > p1/schemata
-sh: echo: write error: Invalid argument
# cat info/last_cmd_status
overlaps with exclusive group

Example of Cache Pseudo-Locking
-------------------------------
Lock a portion of the L2 cache from cache id 1 using CBM 0x3. The
pseudo-locked region is exposed at /dev/pseudo_lock/newlock and can be
provided to an application as an argument to mmap().

# mount -t resctrl resctrl /sys/fs/resctrl/
# cd /sys/fs/resctrl

Ensure that there are bits available that can be pseudo-locked. Since only
unused bits can be pseudo-locked, the bits to be pseudo-locked need to be
removed from the default resource group's schemata:

# cat info/L2/bit_usage
0=SSSSSSSS;1=SSSSSSSS
# echo 'L2:1=0xfc' > schemata
# cat info/L2/bit_usage
0=SSSSSSSS;1=SSSSSS00

Create a new resource group that will be associated with the pseudo-locked
region, indicate that it will be used for a pseudo-locked region, and
configure the requested pseudo-locked region capacity bitmask:

# mkdir newlock
# echo pseudo-locksetup > newlock/mode
# echo 'L2:1=0x3' > newlock/schemata

On success the resource group's mode will change to pseudo-locked, the
bit_usage will reflect the pseudo-locked region, and the character device
exposing the pseudo-locked region will exist:

# cat newlock/mode
pseudo-locked
# cat info/L2/bit_usage
0=SSSSSSSS;1=SSSSSSPP
# ls -l /dev/pseudo_lock/newlock
crw------- 1 root root 243, 0 Apr  3 05:01 /dev/pseudo_lock/newlock

/*
 * Example code to access one page of pseudo-locked cache region
 * from user space.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>

/*
 * It is required that the application runs with affinity to only
 * cores associated with the pseudo-locked region. Here the cpu
 * is hardcoded for convenience of example.
 */
static int cpuid = 2;

int main(int argc, char *argv[])
{
        cpu_set_t cpuset;
        long page_size;
        void *mapping;
        int dev_fd;
        int ret;

        page_size = sysconf(_SC_PAGESIZE);

        CPU_ZERO(&cpuset);
        CPU_SET(cpuid, &cpuset);
        ret = sched_setaffinity(0, sizeof(cpuset), &cpuset);
        if (ret < 0) {
                perror("sched_setaffinity");
                exit(EXIT_FAILURE);
        }

        dev_fd = open("/dev/pseudo_lock/newlock", O_RDWR);
        if (dev_fd < 0) {
                perror("open");
                exit(EXIT_FAILURE);
        }

        mapping = mmap(0, page_size, PROT_READ | PROT_WRITE, MAP_SHARED,
                       dev_fd, 0);
        if (mapping == MAP_FAILED) {
                perror("mmap");
                close(dev_fd);
                exit(EXIT_FAILURE);
        }

        /* Application interacts with pseudo-locked memory @mapping */

        ret = munmap(mapping, page_size);
        if (ret < 0) {
                perror("munmap");
                close(dev_fd);
                exit(EXIT_FAILURE);
        }

        close(dev_fd);
        exit(EXIT_SUCCESS);
}

Locking between applications
----------------------------

Certain operations on the resctrl filesystem, composed of read/writes
to/from multiple files, must be atomic.

As an example, the allocation of an exclusive reservation of L3 cache
involves:

  1. Read the cbmmasks from each directory or the per-resource "bit_usage"
  2. Find a contiguous set of bits in the global CBM bitmask that is clear
     in any of the directory cbmmasks
  3. Create a new directory
  4. Set the bits found in step 2 to the new directory "schemata" file

If two applications attempt to allocate space concurrently then they can
end up allocating the same bits so the reservations are shared instead of
exclusive.

To coordinate atomic operations on the resctrl filesystem and to avoid the
problem above, the following locking procedure is recommended:

Locking is based on flock, which is available in libc and also as a shell
script command.

Write lock:

 A) Take flock(LOCK_EX) on /sys/fs/resctrl
 B) Read/write the directory structure.
 C) Release the lock with flock(LOCK_UN)

Read lock:

 A) Take flock(LOCK_SH) on /sys/fs/resctrl
 B) If successful, read the directory structure.
 C) Release the lock with flock(LOCK_UN)

Example with bash:

# Atomically read directory structure
$ flock -s /sys/fs/resctrl/ find /sys/fs/resctrl

# Read directory contents and create new subdirectory

$ cat create-dir.sh
find /sys/fs/resctrl/ > output.txt
mask = function-of(output.txt)
mkdir /sys/fs/resctrl/newres/
echo mask > /sys/fs/resctrl/newres/schemata

$ flock /sys/fs/resctrl/ ./create-dir.sh

Example with C:

/*
 * Example code to take advisory locks
 * before accessing resctrl filesystem
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/file.h>

void resctrl_take_shared_lock(int fd)
{
        int ret;

        /* take shared lock on resctrl filesystem */
        ret = flock(fd, LOCK_SH);
        if (ret) {
                perror("flock");
                exit(-1);
        }
}

void resctrl_take_exclusive_lock(int fd)
{
        int ret;

        /* take exclusive lock on resctrl filesystem */
        ret = flock(fd, LOCK_EX);
        if (ret) {
                perror("flock");
                exit(-1);
        }
}

void resctrl_release_lock(int fd)
{
        int ret;

        /* release lock on resctrl filesystem */
        ret = flock(fd, LOCK_UN);
        if (ret) {
                perror("flock");
                exit(-1);
        }
}

int main(void)
{
        int fd;

        fd = open("/sys/fs/resctrl", O_DIRECTORY);
        if (fd == -1) {
                perror("open");
                exit(-1);
        }
        resctrl_take_shared_lock(fd);
        /* code to read directory contents */
        resctrl_release_lock(fd);

        resctrl_take_exclusive_lock(fd);
        /* code to read and write directory contents */
        resctrl_release_lock(fd);

        close(fd);
        return 0;
}

Examples for RDT Monitoring along with allocation usage:

Reading monitored data
----------------------
Reading an event file (for example mon_data/mon_L3_00/llc_occupancy) shows
the current snapshot of LLC occupancy of the corresponding MON
group or CTRL_MON group.


Example 1 (Monitor CTRL_MON group and subset of tasks in CTRL_MON group)
---------
On a two socket machine (one L3 cache per socket) with just four bits
for cache bit masks:

# mount -t resctrl resctrl /sys/fs/resctrl
# cd /sys/fs/resctrl
# mkdir p0 p1
# echo "L3:0=3;1=c" > /sys/fs/resctrl/p0/schemata
# echo "L3:0=3;1=3" > /sys/fs/resctrl/p1/schemata
# echo 5678 > p1/tasks
# echo 5679 > p1/tasks

The default resource group is unmodified, so we have access to all parts
of all caches (its schemata file reads "L3:0=f;1=f").

Tasks that are under the control of group "p0" may only allocate from the
"lower" 50% on cache ID 0, and the "upper" 50% of cache ID 1.
Tasks in group "p1" use the "lower" 50% of cache on both sockets.

Create monitor groups and assign a subset of tasks to each monitor group.

# cd /sys/fs/resctrl/p1/mon_groups
# mkdir m11 m12
# echo 5678 > m11/tasks
# echo 5679 > m12/tasks

Fetch data (data shown in bytes):

# cat m11/mon_data/mon_L3_00/llc_occupancy
16234000
# cat m11/mon_data/mon_L3_01/llc_occupancy
14789000
# cat m12/mon_data/mon_L3_00/llc_occupancy
16789000

The parent CTRL_MON group shows the aggregated data.
# cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy
31234000

Example 2 (Monitor a task from its creation)
---------
On a two socket machine (one L3 cache per socket):

# mount -t resctrl resctrl /sys/fs/resctrl
# cd /sys/fs/resctrl
# mkdir p0 p1

An RMID is allocated to the group once it is created and hence the <cmd>
below is monitored from its creation.

# echo $$ > /sys/fs/resctrl/p1/tasks
# <cmd>

Fetch the data:

# cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy
31789000

Example 3 (Monitor without CAT support or before creating CAT groups)
---------

Assume a system like HSW has only CQM and no CAT support. In this case
resctrl will still mount but cannot create CTRL_MON directories. However,
the user can create different MON groups within the root group and is
thereby able to monitor all tasks including kernel threads.

This can also be used to profile a job's cache size footprint before being
able to allocate it to a different allocation group.

# mount -t resctrl resctrl /sys/fs/resctrl
# cd /sys/fs/resctrl
# mkdir mon_groups/m01
# mkdir mon_groups/m02

# echo 3478 > /sys/fs/resctrl/mon_groups/m01/tasks
# echo 2467 > /sys/fs/resctrl/mon_groups/m02/tasks

Monitor the groups separately and also get per domain data. From the
below it is apparent that the tasks are mostly doing work on
domain(socket) 0.
# cat /sys/fs/resctrl/mon_groups/m01/mon_data/mon_L3_00/llc_occupancy
31234000
# cat /sys/fs/resctrl/mon_groups/m01/mon_data/mon_L3_01/llc_occupancy
34555
# cat /sys/fs/resctrl/mon_groups/m02/mon_data/mon_L3_00/llc_occupancy
31234000
# cat /sys/fs/resctrl/mon_groups/m02/mon_data/mon_L3_01/llc_occupancy
32789


Example 4 (Monitor real time tasks)
-----------------------------------

A single socket system which has real time tasks running on cores 4-7
and non real time tasks on other cpus. We want to monitor the cache
occupancy of the real time threads on these cores.

# mount -t resctrl resctrl /sys/fs/resctrl
# cd /sys/fs/resctrl
# mkdir p1

Move the cpus 4-7 over to p1:

# echo f0 > p1/cpus

View the llc occupancy snapshot:

# cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy
11234000