=============================================================
An ad-hoc collection of notes on IA64 MCA and INIT processing
=============================================================

Feel free to update it with notes about any area that is not clear.

---

MCA/INIT are completely asynchronous. They can occur at any time, when
the OS is in any state. Including when one of the cpus is already
holding a spinlock. Trying to get any lock from MCA/INIT state is
asking for deadlock. Also the state of structures that are protected
by locks is indeterminate, including linked lists.

---

The complicated ia64 MCA process. All of this is mandated by Intel's
specification for ia64 SAL, error recovery and unwind, it is not as
if we have a choice here.

* MCA occurs on one cpu, usually due to a double bit memory error.
  This is the monarch cpu.

* SAL sends an MCA rendezvous interrupt (which is a normal interrupt)
  to all the other cpus, the slaves.

* Slave cpus that receive the MCA interrupt call down into SAL, they
  end up spinning disabled while the MCA is being serviced.

* If any slave cpu was already spinning disabled when the MCA occurred
  then it cannot service the MCA interrupt. SAL waits ~20 seconds then
  sends an unmaskable INIT event to the slave cpus that have not
  already rendezvoused.

* Because MCA/INIT can be delivered at any time, including when the cpu
  is down in PAL in physical mode, the registers at the time of the
  event are _completely_ undefined. In particular the MCA/INIT
  handlers cannot rely on the thread pointer, PAL physical mode can
  (and does) modify TP. It is allowed to do that as long as it resets
  TP on return. However MCA/INIT events expose us to these PAL
  internal TP changes. Hence curr_task().

* If an MCA/INIT event occurs while the kernel was running (not user
  space) and the kernel has called PAL then the MCA/INIT handler cannot
  assume that the kernel stack is in a fit state to be used. Mainly
  because PAL may or may not maintain the stack pointer internally.
  Because the MCA/INIT handlers cannot trust the kernel stack, they
  have to use their own, per-cpu stacks. The MCA/INIT stacks are
  preformatted with just enough task state to let the relevant handlers
  do their job.

* Unlike most other architectures, the ia64 struct task is embedded in
  the kernel stack[1]. So switching to a new kernel stack means that
  we switch to a new task as well. Because various bits of the kernel
  assume that current points into the struct task, switching to a new
  stack also means a new value for current.

* Once all slaves have rendezvoused and are spinning disabled, the
  monarch is entered. The monarch now tries to diagnose the problem
  and decide if it can recover or not.

* Part of the monarch's job is to look at the state of all the other
  tasks. The only way to do that on ia64 is to call the unwinder,
  as mandated by Intel.

* The starting point for the unwind depends on whether a task is
  running or not. That is, whether it is on a cpu or is blocked. The
  monarch has to determine whether or not a task is on a cpu before it
  knows how to start unwinding it. The tasks that received an MCA or
  INIT event are no longer running, they have been converted to blocked
  tasks. But (and it's a big but), the cpus that received the MCA
  rendezvous interrupt are still running on their normal kernel stacks!

* To distinguish between these two cases, the monarch must know which
  tasks are on a cpu and which are not. Hence each slave cpu that
  switches to an MCA/INIT stack registers its new stack using
  set_curr_task(), so the monarch can tell that the _original_ task is
  no longer running on that cpu.
  That gives us a decent chance of
  getting a valid backtrace of the _original_ task.

* MCA/INIT can be nested, to a depth of 2 on any cpu. In the case of a
  nested error, we want diagnostics on the MCA/INIT handler that
  failed, not on the task that was originally running. Again this
  requires set_curr_task() so the MCA/INIT handlers can register their
  own stack as running on that cpu. Then a recursive error gets a
  trace of the failing handler's "task".

[1]
    My (Keith Owens) original design called for ia64 to separate its
    struct task and the kernel stacks. Then the MCA/INIT data would be
    chained stacks like i386 interrupt stacks. But that required
    radical surgery on the rest of ia64, plus extra hard wired TLB
    entries with their associated performance degradation. David
    Mosberger vetoed that approach. Which meant that separate kernel
    stacks meant separate "tasks" for the MCA/INIT handlers.

---

INIT is less complicated than MCA. Pressing the NMI button or using
the equivalent command on the management console sends INIT to all
cpus. SAL picks one of the cpus as the monarch and the rest are
slaves. All the OS INIT handlers are entered at approximately the same
time. The OS monarch prints the state of all tasks and returns, after
which the slaves return and the system resumes.

At least that is what is supposed to happen. Alas there are broken
versions of SAL out there. Some drive all the cpus as monarchs. Some
drive them all as slaves. Some drive one cpu as monarch, wait for that
cpu to return from the OS then drive the rest as slaves. Some versions
of SAL cannot even cope with returning from the OS, they spin inside
SAL on resume. The OS INIT code has workarounds for some of these
broken SAL symptoms, but some simply cannot be fixed from the OS side.

---

The scheduler hooks used by ia64 (curr_task, set_curr_task) are layer
violations.
Unfortunately MCA/INIT start off as massive layer
violations (can occur at _any_ time) and they build from there.

At least ia64 makes an attempt at recovering from hardware errors, but
it is a difficult problem because of the asynchronous nature of these
errors. When processing an unmaskable interrupt we sometimes need
special code to cope with our inability to take any locks.

---

How is ia64 MCA/INIT different from x86 NMI?

* x86 NMI typically gets delivered to one cpu. MCA/INIT gets sent to
  all cpus.

* x86 NMI cannot be nested. MCA/INIT can be nested, to a depth of 2
  per cpu.

* x86 has a separate struct task which points to one of multiple kernel
  stacks. ia64 has the struct task embedded in the single kernel
  stack, so switching stack means switching task.

* x86 does not call the BIOS so the NMI handler does not have to worry
  about any registers having changed. MCA/INIT can occur while the cpu
  is in PAL in physical mode, with undefined registers and an undefined
  kernel stack.

* i386 backtrace is not very sensitive to whether a process is running
  or not. ia64 unwind is very, very sensitive to whether a process is
  running or not.

---

What happens when MCA/INIT is delivered while a cpu is running user
space code?

The user mode registers are stored in the RSE area of the MCA/INIT
stack on entry to the OS and are restored from there on return to SAL,
so user mode registers are preserved across a recoverable MCA/INIT.
Since the OS has no idea what unwind data is available for the user
space stack, MCA/INIT never tries to backtrace user space. Which means
that the OS does not bother making the user space process look like a
blocked task, i.e. the OS does not copy pt_regs and switch_stack to the
user space stack.
Also the OS has no idea how big the user space RSE and memory
stacks are, which makes it too risky to copy the saved state to a user
mode stack.

---

How do we get a backtrace on the tasks that were running when MCA/INIT
was delivered?

See ia64_mca_modify_original_stack() in mca.c. That identifies and
verifies the original kernel stack, copies the dirty registers from
the MCA/INIT stack's RSE to the original stack's RSE, copies the
skeleton struct pt_regs and switch_stack to the original stack, fills
in the skeleton structures from the PAL minstate area and updates the
original stack's thread.ksp. That makes the original stack look
exactly like any other blocked task, i.e. it now appears to be
sleeping. To get a backtrace, just start with thread.ksp for the
original task and unwind like any other sleeping task.

---

How do we identify the tasks that were running when MCA/INIT was
delivered?

If the previous task has been verified and converted to a blocked
state, then sos->prev_task on the MCA/INIT stack is updated to point to
the previous task. You can look at that field in dumps or debuggers.
To help distinguish between the handler and the original tasks,
handlers have _TIF_MCA_INIT set in thread_info.flags.

The sos data is always in the MCA/INIT handler stack, at offset
MCA_SOS_OFFSET. You can get that value from mca_asm.h or calculate it
as KERNEL_STACK_SIZE - sizeof(struct pt_regs) - sizeof(struct
ia64_sal_os_state), with 16-byte alignment for all structures.

Also the comm field of the MCA/INIT task is modified to include the pid
of the original task, for humans to use. For example, a comm field of
'MCA 12159' means that pid 12159 was running when the MCA was
delivered.
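
---

A rough user-space sketch of the two conventions above, for anyone
poking at a dump by hand. This is not kernel code. The helper names
(align_up, sos_offset, original_pid_from_comm) are hypothetical, the
sizes you pass in are whatever your kernel build uses (the
authoritative value of MCA_SOS_OFFSET is the one in mca_asm.h), and
rounding each structure size up to 16 bytes is an assumed reading of
"16-byte alignment for all structures". Only the 'MCA <pid>' comm form
shown in the example above is handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Round n up to the next multiple of align (align must be a power of two). */
static size_t align_up(size_t n, size_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (n + align - 1) & ~(align - 1);
}

/*
 * Hypothetical helper: offset of the sos data within the handler stack,
 * following the formula in the text.  Assumption: each structure size is
 * rounded up to 16 bytes before being subtracted from the stack size.
 */
static size_t sos_offset(size_t stack_size, size_t pt_regs_size,
			 size_t sos_size)
{
	return stack_size - align_up(pt_regs_size, 16) - align_up(sos_size, 16);
}

/*
 * Hypothetical helper: extract the original pid from a handler comm such
 * as "MCA 12159".  Returns -1 if comm does not have that exact form.
 */
static long original_pid_from_comm(const char *comm)
{
	static const char prefix[] = "MCA ";
	char *end;
	long pid;

	if (strncmp(comm, prefix, sizeof(prefix) - 1) != 0)
		return -1;
	comm += sizeof(prefix) - 1;
	if (*comm < '0' || *comm > '9')
		return -1;	/* require at least one digit, no sign */
	pid = strtol(comm, &end, 10);
	return (*end == '\0') ? pid : -1;
}
```

With made-up sizes, sos_offset(32768, 396, 1000) rounds the two
structure sizes up to 400 and 1008 and yields 31360, and
original_pid_from_comm("MCA 12159") yields 12159. Treat any offset
computed this way as a cross-check against mca_asm.h, not a
replacement for it.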