$ alarmtest
alarmtest starting
…………………………………………alarm!
………………………………………………alarm!
……………………………………………………alarm!
……………………………………………………………………………alarm!
………………………………………………alarm!
………………………………………………………………alarm!
…………………………………………alarm!
……………………………………………………………
alarm!
Well, I find this homework pretty interesting, and I think it’s also a great chance to walk through xv6 syscalls, so let me summarize what I’ve learnt from it in this post.
Part 1: xv6 syscall overview
Say, a user program, alarmtest.c
called a system call, alarm()
:
alarm(10, periodic);
What’s actually happening in assembly? Take a look at alarmtest.asm
:
alarm(10, periodic); 5b: 58 pop %eax 5c: 5a pop %edx 5d: 68 20 00 00 00 push $0x20 62: 6a 0a push $0xa 64: e8 81 03 00 00 call 3ea <alarm>
000003ea <alarm>: SYSCALL(alarm) 3ea: b8 17 00 00 00 mov $0x17,%eax 3ef: cd 40 int $0x40 3f1: c3 ret
It calls SYSCALL(alarm)
, mov 0x17
into %eax, and does int $0x40
.
SYSCALL(alarm)
is defined in usys.S
:
#define SYSCALL(name) \ .globl name; \ name: \ movl $SYS_ ## name, %eax; \ int $T_SYSCALL; \ ret
I don’t know all the compiler magic involved here, but for now let’s just don’t worry about it.
The int 0x40
instruction tells the CPU to find the 0x40
th entry in the IDT (Interrupt Descriptor Table), based at IDTR (In my case, 0x80115000
. Check it in QEMU monitor). Let’s see:
gdb-peda$ x/gx 0x80115000+(8*0x40)
0x80115200 <idt+512>: 0x8010ef0000085f17
Note that each entry is 8 bytes long. This data structure is defined in trap.c
:
for(i = 0; i < 256; i++) SETGATE(idt[i], 0, SEG_KCODE<<3, vectors[i], 0); SETGATE(idt[T_SYSCALL], 1, SEG_KCODE<<3, vectors[T_SYSCALL], DPL_USER);
Where T_SYSCALL
equals to precisely 0x40
. It is defined separately here because it’s the only “trap” gate. Other gates are either interruption gates (Dear OS, I just pressed on a key) or exception gates (Dear OS, I just tried to divide by 0).
The SETGATE
macro is further defined in mmu.h
:
#define SETGATE(gate, istrap, sel, off, dpl) \ { \ (gate).gd_off_15_0 = (uint32_t) (off) & 0xffff; \ (gate).gd_sel = (sel); \ (gate).gd_args = 0; \ (gate).gd_rsv1 = 0; \ (gate).gd_type = (istrap) ? STS_TG32 : STS_IG32; \ (gate).gd_s = 0; \ (gate).gd_dpl = (dpl); \ (gate).gd_p = 1; \ (gate).gd_off_31_16 = (uint32_t) (off) >> 16; \ }
Why are these hardware related data structures always look so strange? It reminds me a GDT entry. Frankly, I don’t want to demystify this thing at all. 🙂 Well OK…Let’s take another look at the memory:
gdb-peda$ x/gx 0x80115000+(8*0x40)
0x80115200 <idt+512>: 0x8010ef0000085f17
Let’s first focus on gd_off_15_0
, gd_off_31_16
and gd_sel
:
gdb-peda$ x/hx 0x80115000+(8*0x40) + 6
0x80115206 <idt+518>: 0x8010
gdb-peda$ x/hx 0x80115000+(8*0x40) + 0
0x80115200 <idt+512>: 0x5f17
gdb-peda$ x/hx 0x80115000+(8*0x40) + 2
0x80115202 <idt+514>: 0x0008
Nice. Now we know where to jump: [0x0008:0x80105f17]
. I’ve checked this segment for you: It’s based at 0x0
. Other interesting fields include:
gdb-peda$ x/t 0x80115000+(8*0x40) + 5
0x80115205 <idt+517>: 11101111
These are gd_type
, gd_s
, gd_dpl
and gd_p
. gd_type
0b1110
is STS_IG32
, 32-bit Interrupt Gate. gd_s
is 0b0
, means system. gd_dpl
is 0b11
, meaning that this gate is of user privilege (If a user program tries to trigger a system privilege gate, CPU generates another exception to prevent it). Finally, gd_p
0b1
means that this segment is present…
Another thing, though. We have to change from user stack to kernel stack. How does the CPU find the new %esp? How does it know whether there’s a privilege change or not, anyway?
CPU compares the PL (Privilege Level) of old %cs with the new %cs in IDT entry. It goes without saying we came from userland, but what about the new one, 0x0008
?:
(qemu) info registers
…
GDT= 80112850 0000002f
gdb-peda$ x/gx 0x80112850+0x008
0x80112858 <cpus+120>: 0x00cf9a000000ffff
gdb-peda$ x/t 0x80112850+0x008 + 5
0x8011285d <cpus+125>: 10011010
This means the new %cs is of supervisor privilege, which is pretty obvious since we want to jump into the kernel. Knowing that there’s a “privilege escalation” 🙂 , CPU does a few more things than when the privilege level is unchanged:
- It loads new %esp and %ss from the task segment descriptor;
- It pushes old %esp and %ss values onto (the new) stack;
Note that CPU doesn’t push anything onto user stack, since the user %esp may be corrupted and cannot be trusted anyway. We will see how things may go wrong if kernel simply trust a user %esp in later sections.
TR (Task Register) holds a segment selector for the TSS (Task State Segment):
(qemu) info registers
…
TR =0028 801127e8 00000067 00408900 DPL=0 TSS32-avl
GDT= 80112850 0000002f
OK I’m already getting used to this:
gdb-peda$ x/gx 0x80112850+0x28
0x80112878 <cpus+152>: 0x80408b1127e80067
gdb-peda$ x/hx 0x80112850+0x28 + 2
0x8011287a <cpus+154>: 0x27e8
gdb-peda$ x/bx 0x80112850+0x28 + 4
0x8011287c <cpus+156>: 0x11
gdb-peda$ x/bx 0x80112850+0x28 + 7
0x8011287f <cpus+159>: 0x80
The TSS is located at 0x801127e8
. It’s a very long data structure defined in mmu.h
:
// Task state segment format (as described by the Pentium architecture book) struct Taskstate { uint32_t ts_link; // Old ts selector uintptr_t ts_esp0; // Stack pointers and segment selectors uint16_t ts_ss0; // after an increase in privilege level ... }
gdb-peda$ x/wx 0x801127e8 + 4
0x801127ec <cpus+12>: 0x8dfbf000
gdb-peda$ x/hx 0x801127e8 + 8
0x801127f0 <cpus+16>: 0x0010
So the new %esp should be 0x8dfbf000
, and the new %ss should be 0x0010
. I’ve checked it for you that this new stack segment is located at 0x0
.
These values are initiated in switchuvm()
(See vm.c
):
mycpu()->ts.ss0 = SEG_KDATA << 3; mycpu()->ts.esp0 = (uint)p->kstack + KSTACKSIZE;
Awesome! Now the CPU knows all the new register values it should know: %cs: 0x0008
, %eip: 0x80105f17
, %ss: 0x0010
and %esp: 0x8dfbf000
. It is ready to jump.
gdb-peda$ si
The target architecture is assumed to be i386
=> 0x80105f19 <vector64+2>: push 0x40
vector64 () at vectors.S:320
320 pushl $64
gdb-peda$ x/3i 0x80105f17
0x80105f17 <vector64>: push 0x0
=> 0x80105f19 <vector64+2>: push 0x40
0x80105f1b <vector64+4>: jmp 0x8010570f <alltraps>
My calculation was correct. I don’t know why GDB executed 0x80105f17
for me, but that actually scared the heck out of me. 🙂 The stack is still “under construction”, but in short all these push
instructions are there to build a struct trapframe
defined in x86.h
:
//PAGEBREAK: 36 // Layout of the trap frame built on the stack by the // hardware and by trapasm.S, and passed to trap(). struct trapframe { // registers as pushed by pusha uint edi; uint esi; uint ebp; uint oesp; // useless & ignored uint ebx; uint edx; uint ecx; uint eax; // rest of trap frame ushort gs; ushort padding1; ushort fs; ushort padding2; ushort es; ushort padding3; ushort ds; ushort padding4; uint trapno; // below here defined by x86 hardware uint err; uint eip; ushort cs; ushort padding5; uint eflags; // below here only when crossing rings, such as from user to kernel uint esp; ushort ss; ushort padding6; };
We will examine the stack later, but let’s first focus on the code.
xv6 uses a Perl script (3250) to generate the entry points that the IDT entries point to. Each entry pushes an error code if the processor didn’t, pushes the interrupt number, and then jumps to alltraps. (P42)
xv6 book
In this case the CPU does not push an error code (the err
field in trapframe
), so vector64
does push 0x0
for us instead, to keep the trapframe complete. Then it pushes 0x40
, the trap number (trapno
), otherwise the trap handler can’t tell which kind of trap / interruption / exception it is handling. Finally it jumps to alltrap
.
gdb-peda$ x/10i $pc
=> 0x8010570f <alltraps>: push ds
0x80105710 <alltraps+1>: push es
0x80105711 <alltraps+2>: push fs
0x80105713 <alltraps+4>: push gs
0x80105715 <alltraps+6>: pusha
0x80105716 <alltraps+7>: mov ax,0x10
0x8010571a <alltraps+11>: mov ds,eax
0x8010571c <alltraps+13>: mov es,eax
0x8010571e <alltraps+15>: push esp
0x8010571f <alltraps+16>: call 0x801057f0 <trap>
Basically this code just push a lot of other registers onto stack to complete the trapframe, and set %ds and %es to also 0x10
, same with %ss loaded from the struct Taskstate
. Maybe xv6 just hard-coded these values.
Anyways. Before 0x8010571e
, the kernel stack should look like:
+----------------+----------------+ -> ts.esp0 0x8dfbeffc | padding6 | ss: 0x0023 | -> %ss and %esp are pushed onto stack only because +----------------+----------------+ we are crossing rings from 3 to 0 0x8dfbeff8 | esp: 0x00002fac | +----------------+----------------+ 0x8dfbeff4 | eflags: 0x00000246 | -> other registers pushed by CPU +----------------+----------------+ 0x8dfbeff0 | padding5 | cs: 0x001b | +----------------+----------------+ 0x8dfbefec | eip: 0x000003f1 | +---------------------------------+ 0x8dfbefe8 | err: 0x00000000 | -> pushed by xv6 in vector64, since CPU does not +---------------------------------+ push error code for this case 0x8dfbefe4 | trapno: 0x00000040 | +----------------+----------------+ -> more registers pushed by alltrap 0x8dfbefe0 | padding4 | ds: 0x0023 | +----------------+----------------+ 0x8dfbefdc | padding3 | es: 0x0023 | +----------------+----------------+ 0x8dfbefd8 | padding2 | fs: 0x0000 | +----------------+----------------+ 0x8dfbefd4 | padding1 | gs: 0x0000 | +----------------+----------------+ -> even more general registers pushed by pusha 0x8dfbefd0 | eax: 0x00000017 | +---------------------------------+ 0x8dfbefcc | ecx: 0x0000000a | +---------------------------------+ 0x8dfbefc8 | edx: 0x00000818 | +---------------------------------+ 0x8dfbefc4 | ebx: 0x0000bfa8 | +---------------------------------+ 0x8dfbefc0 | oesp | -> ignored +---------------------------------+ 0x8dfbefbc | ebp: 0x00002fc8 | +---------------------------------+ 0x8dfbefb8 | esi: 0x00000000 | +---------------------------------+ 0x8dfbefb4 | edi: 0x00000000 | +---------------------------------+ -> esp, passed to trap() as argument
Beautiful. Now the last job of trapall
is to push esp
then call trap()
, passing the pointer to this struct trapframe
to trap()
. Well, trap()
then has all the information to handle this event. In our case, it looks at tf->trapno
and it should be T_SYSCALL
, so it knows that this must be a syscall request triggered by int 0x40
. It then checks tf->eax
to know that the user program actually wanted to call syscall number 0x17
, which is SYS_alarm
as defined in syscall.h
. Finally, it transfers control to sys_alarm()
.
Part 2: xv6 CPU alarm
Back to the homework itself, we are supposed to implement the alarm()
syscall for xv6, so when a test user program alarmtest
does things like:
#include "types.h" #include "stat.h" #include "user.h" void periodic(); int main(int argc, char *argv[]) { int i; printf(1, "alarmtest starting\n"); alarm(10, periodic); for(i = 0; i < 25*5000000; i++){ if((i % 250000) == 0) write(2, ".", 1); } printf(1, "\n"); exit(); } void periodic() { printf(1, "alarm!\n"); }
It registers a handler function periodic()
with alarm(10, periodic);
so that the kernel will call periodic()
and print out alarm!
every 10 ticks.
The function signature is defined in user.h
as follows:
int alarm(int ticks, void (*handler)());
A tick is a fairly arbitrary unit of time in xv6, determined by how often a hardware timer generates interrupts.
Homework: xv6 CPU alarm
This involves modifying the following files:
peilin@PWN:~/6.828/2018/xv6-public$ git status
…
modified: Makefile
modified: proc.c
modified: proc.h
modified: syscall.c
modified: syscall.h
modified: sysproc.c
modified: trap.c
modified: user.h
modified: usys.S
But lets only focus on sysproc.c
and trap.c
. In sysproc.c
, we add a sys_alarm()
:
// Homework: xv6 CPU alarm int sys_alarm(void) { int ticks; void (*handler)(); if(argint(0, &ticks) < 0) return -1; if(argptr(1, (char**)&handler, 1) < 0) return -1; myproc()->alarmticks = ticks; myproc()->alarmhandler = handler; myproc()->alarmticksleft = myproc()->alarmticks; return 0; }
Defined in proc.h
, the process descriptor struct proc
now contains three more fields:
// Home work: xv6 CPU alarm uint alarmticks; void (*alarmhandler)(); uint alarmticksleft;
sys_alarm()
reads two parameters, ticks
and handler
from user stack (recall all these information needed about the user process is stored in the trapframe), and set up these 3 fields of myproc()
to setup the clock.
Each time the timer chip generates a tick, control is transferred to trap()
, as we’ve seen in the first part. trap()
has a switch
statement for tf->trapno
. The corresponding trapno
for this timer tick is T_IRQ0 + IRQ_TIMER
, so we need to add some handling code under it. Here’s my solution:
switch(tf->trapno){ case T_IRQ0 + IRQ_TIMER: if(cpuid() == 0){ acquire(&tickslock); ticks++; wakeup(&ticks); release(&tickslock); } // Homework: xv6 CPU alarm if (myproc() != 0 && (tf->cs & 3) == 3) { // If 1) there's a process running, and // 2) the interrupt came from userspace if (myproc()->alarmticks) { // Make sure alarm is set myproc()->alarmticksleft--; if (myproc()->alarmticksleft == 0) { // decrement user esp tf->esp -= 4; // push user eip onto user stack *(uint*)(tf->esp) = tf->eip; // change user eip to handler tf->eip = (uint)(myproc()->alarmhandler); myproc()->alarmticksleft = myproc()->alarmticks; } } } lapiceoi(); break;
If the interrupted process is a user process, decrement myproc->alarmticksleft
by one. If it hits zero, push the user return %eip onto user stack, and set the %eip in trapframe to the address of myproc()->handler
. Now, when trap()
returns, the system will return to myproc()->handler
. When myproc()->handler
returns, it pops off a word from the stack as return address, which is the user return %eip we set up, so that the interrupted user program can resume executing transparently. We can’t just call myproc()->hander
in trap()
, since that will cause it to be executed with kernel privilege, which is a bad thing.
Let’s check it:
$ alarmtest
alarmtest starting
…………………………………………alarm!
………………………………………………alarm!
……………………………………………………alarm!
……………………………………………………………………………alarm!
………………………………………………alarm!
………………………………………………………………alarm!
…………………………………………alarm!
……………………………………………………………
Well, nice.
Notice though. We assumed that myproc()->handler
will ret
and pop off that return address we set up on the stack. What if it does not, or mess up the stack?…Well, in that case, the initially interrupted process won’t resume execution, but the user program itself should be responsible to this mess, since it registered a “bad” handler function itself.
What is more terrible though, is that we are dereferencing the user %esp. What if the %esp is invalid, or even malicious? Let’s do the third optional challenge and attack ourselves. 🙂
Part 3: Optional challenges
Consider this PoC:
#include "types.h" #include "stat.h" #include "user.h" void periodic(); void main() { int i; printf(1, "your computer is gonna crash, muahahah!\n"); alarm(1, periodic); asm ("mov 0xf0100004,%esp"); for(;;) i++; } void periodic() { printf(1, "alarm!\n"); }
As usual we register the handler function periodic()
, but this time we set our %esp to somewhere funny, jump into an infinite loop (no stack operations!) and wait for the countdown. Once the countdown reaches zero, our poor kernel will actually write a return address at 0xf0100000
, which is crazy! Let’s see:
$ alarmtest
your computer is gonna crash, muahahah!
lapicid 1: panic: remap
80106997 80105ab1 80105724 0 0 0 0 0 0 0
Well, nothing too bad, but we crashed the kernel. 🙂
So, that’s it! Now I’m painfully familiar with GDT and IDT entries. 🙂 See you next time!