Select Page
$ alarmtest
alarmtest starting
…………………………………………alarm!
………………………………………………alarm!
……………………………………………………alarm!
……………………………………………………………………………alarm!
………………………………………………alarm!
………………………………………………………………alarm!
…………………………………………alarm!
……………………………………………………………

alarm!

Well, I find this homework pretty interesting, and I think it’s also a great chance to walk through xv6 syscalls, so let me summarize what I’ve learnt from it in this post.

  1. Part 1: xv6 syscall overview
  2. Part 2: xv6 CPU alarm
  3. Part 3: Optional challenges


Part 1: xv6 syscall overview

Say, a user program, alarmtest.c called a system call, alarm():

  alarm(10, periodic);

What’s actually happening in assembly? Take a look at alarmtest.asm:

alarm(10, periodic);
  5b:	58                   	pop    %eax
  5c:	5a                   	pop    %edx
  5d:	68 20 00 00 00       	push   $0x20
  62:	6a 0a                	push   $0xa
  64:	e8 81 03 00 00       	call   3ea <alarm>
000003ea <alarm>:
SYSCALL(alarm)
 3ea:	b8 17 00 00 00       	mov    $0x17,%eax
 3ef:	cd 40                	int    $0x40
 3f1:	c3                   	ret    

It calls SYSCALL(alarm), mov 0x17 into %eax, and does int $0x40.

SYSCALL(alarm) is defined in usys.S:

#define SYSCALL(name) \
  .globl name; \
  name: \
    movl $SYS_ ## name, %eax; \
    int $T_SYSCALL; \
    ret

I don’t know all the compiler magic involved here, but for now let’s just don’t worry about it.

The int 0x40 instruction tells the CPU to find the 0x40th entry in the IDT (Interrupt Descriptor Table), based at IDTR (In my case, 0x80115000. Check it in QEMU monitor). Let’s see:

gdb-peda$ x/gx 0x80115000+(8*0x40)
0x80115200 <idt+512>: 0x8010ef0000085f17

Note that each entry is 8 bytes long. This data structure is defined in trap.c:

for(i = 0; i < 256; i++)
    SETGATE(idt[i], 0, SEG_KCODE<<3, vectors[i], 0);
  SETGATE(idt[T_SYSCALL], 1, SEG_KCODE<<3, vectors[T_SYSCALL], DPL_USER);

Where T_SYSCALL equals to precisely 0x40. It is defined separately here because it’s the only “trap” gate. Other gates are either interruption gates (Dear OS, I just pressed on a key) or exception gates (Dear OS, I just tried to divide by 0).

The SETGATE macro is further defined in mmu.h:

#define SETGATE(gate, istrap, sel, off, dpl)			\
{								\
	(gate).gd_off_15_0 = (uint32_t) (off) & 0xffff;		\
	(gate).gd_sel = (sel);					\
	(gate).gd_args = 0;					\
	(gate).gd_rsv1 = 0;					\
	(gate).gd_type = (istrap) ? STS_TG32 : STS_IG32;	\
	(gate).gd_s = 0;					\
	(gate).gd_dpl = (dpl);					\
	(gate).gd_p = 1;					\
	(gate).gd_off_31_16 = (uint32_t) (off) >> 16;		\
}

Why are these hardware related data structures always look so strange? It reminds me a GDT entry. Frankly, I don’t want to demystify this thing at all. 🙂 Well OK…Let’s take another look at the memory:

gdb-peda$ x/gx 0x80115000+(8*0x40)
0x80115200 <idt+512>: 0x8010ef0000085f17

Let’s first focus on gd_off_15_0, gd_off_31_16 and gd_sel:

gdb-peda$ x/hx 0x80115000+(8*0x40) + 6
0x80115206 <idt+518>: 0x8010
gdb-peda$ x/hx 0x80115000+(8*0x40) + 0
0x80115200 <idt+512>: 0x5f17
gdb-peda$ x/hx 0x80115000+(8*0x40) + 2
0x80115202 <idt+514>: 0x0008

Nice. Now we know where to jump: [0x0008:0x80105f17]. I’ve checked this segment for you: It’s based at 0x0. Other interesting fields include:

gdb-peda$ x/t 0x80115000+(8*0x40) + 5
0x80115205 <idt+517>: 11101111

These are gd_type, gd_s, gd_dpl and gd_p. gd_type 0b1110 is STS_IG32, 32-bit Interrupt Gate. gd_s is 0b0, means system. gd_dpl is 0b11, meaning that this gate is of user privilege (If a user program tries to trigger a system privilege gate, CPU generates another exception to prevent it). Finally, gd_p 0b1 means that this segment is present…

Another thing, though. We have to change from user stack to kernel stack. How does the CPU find the new %esp? How does it know whether there’s a privilege change or not, anyway?

CPU compares the PL (Privilege Level) of old %cs with the new %cs in IDT entry. It goes without saying we came from userland, but what about the new one, 0x0008?:

(qemu) info registers

GDT= 80112850 0000002f
gdb-peda$ x/gx 0x80112850+0x008
0x80112858 <cpus+120>: 0x00cf9a000000ffff
gdb-peda$ x/t 0x80112850+0x008 + 5
0x8011285d <cpus+125>: 10011010

GDT Bits

This means the new %cs is of supervisor privilege, which is pretty obvious since we want to jump into the kernel. Knowing that there’s a “privilege escalation” 🙂 , CPU does a few more things than when the privilege level is unchanged:

  • It loads new %esp and %ss from the task segment descriptor;
  • It pushes old %esp and %ss values onto (the new) stack;

Note that CPU doesn’t push anything onto user stack, since the user %esp may be corrupted and cannot be trusted anyway. We will see how things may go wrong if kernel simply trust a user %esp in later sections.

TR (Task Register) holds a segment selector for the TSS (Task State Segment):

(qemu) info registers

TR =0028 801127e8 00000067 00408900 DPL=0 TSS32-avl
GDT= 80112850 0000002f

OK I’m already getting used to this:

gdb-peda$ x/gx 0x80112850+0x28
0x80112878 <cpus+152>: 0x80408b1127e80067
gdb-peda$ x/hx 0x80112850+0x28 + 2
0x8011287a <cpus+154>: 0x27e8
gdb-peda$ x/bx 0x80112850+0x28 + 4
0x8011287c <cpus+156>: 0x11
gdb-peda$ x/bx 0x80112850+0x28 + 7
0x8011287f <cpus+159>: 0x80

The TSS is located at 0x801127e8. It’s a very long data structure defined in mmu.h:

// Task state segment format (as described by the Pentium architecture book)
struct Taskstate {
	uint32_t ts_link;	// Old ts selector
	uintptr_t ts_esp0;	// Stack pointers and segment selectors
	uint16_t ts_ss0;	//   after an increase in privilege level
...
}
gdb-peda$ x/wx 0x801127e8 + 4
0x801127ec <cpus+12>: 0x8dfbf000
gdb-peda$ x/hx 0x801127e8 + 8
0x801127f0 <cpus+16>: 0x0010

So the new %esp should be 0x8dfbf000, and the new %ss should be 0x0010. I’ve checked it for you that this new stack segment is located at 0x0.

These values are initiated in switchuvm() (See vm.c):

  mycpu()->ts.ss0 = SEG_KDATA << 3;
  mycpu()->ts.esp0 = (uint)p->kstack + KSTACKSIZE;

Awesome! Now the CPU knows all the new register values it should know: %cs: 0x0008, %eip: 0x80105f17, %ss: 0x0010 and %esp: 0x8dfbf000. It is ready to jump.

gdb-peda$ si
The target architecture is assumed to be i386
=> 0x80105f19 <vector64+2>: push 0x40
vector64 () at vectors.S:320
320       pushl $64
gdb-peda$ x/3i 0x80105f17
   0x80105f17 <vector64>:   push 0x0
=> 0x80105f19 <vector64+2>: push 0x40
   0x80105f1b <vector64+4>: jmp 0x8010570f <alltraps>

My calculation was correct. I don’t know why GDB executed 0x80105f17 for me, but that actually scared the heck out of me. 🙂 The stack is still “under construction”, but in short all these push instructions are there to build a struct trapframe defined in x86.h:

//PAGEBREAK: 36
// Layout of the trap frame built on the stack by the
// hardware and by trapasm.S, and passed to trap().
struct trapframe {
  // registers as pushed by pusha
  uint edi;
  uint esi;
  uint ebp;
  uint oesp;      // useless & ignored
  uint ebx;
  uint edx;
  uint ecx;
  uint eax;

  // rest of trap frame
  ushort gs;
  ushort padding1;
  ushort fs;
  ushort padding2;
  ushort es;
  ushort padding3;
  ushort ds;
  ushort padding4;
  uint trapno;

  // below here defined by x86 hardware
  uint err;
  uint eip;
  ushort cs;
  ushort padding5;
  uint eflags;

  // below here only when crossing rings, such as from user to kernel
  uint esp;
  ushort ss;
  ushort padding6;
};

We will examine the stack later, but let’s first focus on the code.

xv6 uses a Perl script (3250) to generate the entry points that the IDT entries point to. Each entry pushes an error code if the processor didn’t, pushes the interrupt number, and then jumps to alltraps. (P42)

xv6 book

In this case the CPU does not push an error code (the err field in trapframe), so vector64 does push 0x0 for us instead, to keep the trapframe complete. Then it pushes 0x40, the trap number (trapno), otherwise the trap handler can’t tell which kind of trap / interruption / exception it is handling. Finally it jumps to alltrap.

gdb-peda$ x/10i $pc
=> 0x8010570f <alltraps>:    push ds
   0x80105710 <alltraps+1>:  push es
   0x80105711 <alltraps+2>:  push fs
   0x80105713 <alltraps+4>:  push gs
   0x80105715 <alltraps+6>:  pusha
   0x80105716 <alltraps+7>:  mov ax,0x10
   0x8010571a <alltraps+11>: mov ds,eax
   0x8010571c <alltraps+13>: mov es,eax
   0x8010571e <alltraps+15>: push esp
   0x8010571f <alltraps+16>: call 0x801057f0 <trap>

Basically this code just push a lot of other registers onto stack to complete the trapframe, and set %ds and %es to also 0x10, same with %ss loaded from the struct Taskstate. Maybe xv6 just hard-coded these values.

Anyways. Before 0x8010571e, the kernel stack should look like:

                  
                  +----------------+----------------+	-> ts.esp0
0x8dfbeffc        |    padding6    |   ss: 0x0023   |	-> %ss and %esp are pushed onto stack only because
                  +----------------+----------------+	   we are crossing rings from 3 to 0
0x8dfbeff8        |         esp: 0x00002fac         |    
                  +----------------+----------------+   
0x8dfbeff4        |       eflags: 0x00000246        |	-> other registers pushed by CPU
                  +----------------+----------------+
0x8dfbeff0        |    padding5    |   cs: 0x001b   |
                  +----------------+----------------+
0x8dfbefec        |         eip: 0x000003f1         |
                  +---------------------------------+
0x8dfbefe8        |         err: 0x00000000         |	-> pushed by xv6 in vector64, since CPU does not 
                  +---------------------------------+	   push error code for this case
0x8dfbefe4        |       trapno: 0x00000040        |
                  +----------------+----------------+    -> more registers pushed by alltrap
0x8dfbefe0        |    padding4    |   ds: 0x0023   |
                  +----------------+----------------+
0x8dfbefdc        |    padding3    |   es: 0x0023   |
                  +----------------+----------------+
0x8dfbefd8        |    padding2    |   fs: 0x0000   |
                  +----------------+----------------+
0x8dfbefd4        |    padding1    |   gs: 0x0000   |
                  +----------------+----------------+    -> even more general registers pushed by pusha
0x8dfbefd0        |         eax: 0x00000017         |
                  +---------------------------------+
0x8dfbefcc        |         ecx: 0x0000000a         |
                  +---------------------------------+
0x8dfbefc8        |         edx: 0x00000818         |
                  +---------------------------------+
0x8dfbefc4        |         ebx: 0x0000bfa8         |
                  +---------------------------------+
0x8dfbefc0        |              oesp               |	-> ignored
                  +---------------------------------+
0x8dfbefbc        |         ebp: 0x00002fc8         |
                  +---------------------------------+
0x8dfbefb8        |         esi: 0x00000000         |
                  +---------------------------------+
0x8dfbefb4        |         edi: 0x00000000         |
                  +---------------------------------+	-> esp, passed to trap() as argument

Beautiful. Now the last job of trapall is to push esp then call trap(), passing the pointer to this struct trapframe to trap(). Well, trap() then has all the information to handle this event. In our case, it looks at tf->trapno and it should be T_SYSCALL, so it knows that this must be a syscall request triggered by int 0x40. It then checks tf->eax to know that the user program actually wanted to call syscall number 0x17, which is SYS_alarm as defined in syscall.h. Finally, it transfers control to sys_alarm().



Part 2: xv6 CPU alarm

Back to the homework itself, we are supposed to implement the alarm() syscall for xv6, so when a test user program alarmtest does things like:

#include "types.h"
#include "stat.h"
#include "user.h"

void periodic();

int
main(int argc, char *argv[])
{
  int i;
  printf(1, "alarmtest starting\n");
  alarm(10, periodic);
  for(i = 0; i < 25*5000000; i++){
    if((i % 250000) == 0)
      write(2, ".", 1);
  }
  printf(1, "\n");
  exit();
}

void
periodic()
{
  printf(1, "alarm!\n");
}

It registers a handler function periodic() with alarm(10, periodic); so that the kernel will call periodic() and print out alarm! every 10 ticks.

The function signature is defined in user.h as follows:

int alarm(int ticks, void (*handler)());

A tick is a fairly arbitrary unit of time in xv6, determined by how often a hardware timer generates interrupts.

Homework: xv6 CPU alarm

This involves modifying the following files:

peilin@PWN:~/6.828/2018/xv6-public$ git status

        modified:    Makefile
        modified:    proc.c
        modified:    proc.h
        modified:    syscall.c
        modified:    syscall.h
        modified:    sysproc.c
        modified:    trap.c
        modified:    user.h
        modified:    usys.S

But lets only focus on sysproc.c and trap.c. In sysproc.c, we add a sys_alarm():

// Homework: xv6 CPU alarm
int
sys_alarm(void)
{
  int ticks;
  void (*handler)();

  if(argint(0, &ticks) < 0)
    return -1;
  if(argptr(1, (char**)&handler, 1) < 0)
    return -1;

  myproc()->alarmticks = ticks;
  myproc()->alarmhandler = handler;
  myproc()->alarmticksleft = myproc()->alarmticks;
  return 0;
}

Defined in proc.h, the process descriptor struct proc now contains three more fields:

  // Home work: xv6 CPU alarm
  uint alarmticks;
  void (*alarmhandler)();
  uint alarmticksleft;

sys_alarm() reads two parameters, ticks and handler from user stack (recall all these information needed about the user process is stored in the trapframe), and set up these 3 fields of myproc() to setup the clock.

Each time the timer chip generates a tick, control is transferred to trap(), as we’ve seen in the first part. trap() has a switch statement for tf->trapno. The corresponding trapno for this timer tick is T_IRQ0 + IRQ_TIMER, so we need to add some handling code under it. Here’s my solution:

  switch(tf->trapno){
  case T_IRQ0 + IRQ_TIMER:
    if(cpuid() == 0){
      acquire(&tickslock);
      ticks++;
      wakeup(&ticks);
      release(&tickslock);
    }

    // Homework: xv6 CPU alarm
    if (myproc() != 0 && (tf->cs & 3) == 3) {
      // If 1) there's a process running, and
      //    2) the interrupt came from userspace
      if (myproc()->alarmticks) { // Make sure alarm is set
        myproc()->alarmticksleft--;
        if (myproc()->alarmticksleft == 0) {

          // decrement user esp
          tf->esp -= 4;
          // push user eip onto user stack
          *(uint*)(tf->esp) = tf->eip;
          // change user eip to handler
          tf->eip = (uint)(myproc()->alarmhandler);

          myproc()->alarmticksleft = myproc()->alarmticks;
        }
      }
    }

    lapiceoi();
    break;

If the interrupted process is a user process, decrement myproc->alarmticksleft by one. If it hits zero, push the user return %eip onto user stack, and set the %eip in trapframe to the address of myproc()->handler. Now, when trap() returns, the system will return to myproc()->handler. When myproc()->handler returns, it pops off a word from the stack as return address, which is the user return %eip we set up, so that the interrupted user program can resume executing transparently. We can’t just call myproc()->hander in trap(), since that will cause it to be executed with kernel privilege, which is a bad thing.

Let’s check it:

$ alarmtest
alarmtest starting
…………………………………………alarm!
………………………………………………alarm!
……………………………………………………alarm!
……………………………………………………………………………alarm!
………………………………………………alarm!
………………………………………………………………alarm!
…………………………………………alarm!
……………………………………………………………

Well, nice.

Notice though. We assumed that myproc()->handler will ret and pop off that return address we set up on the stack. What if it does not, or mess up the stack?…Well, in that case, the initially interrupted process won’t resume execution, but the user program itself should be responsible to this mess, since it registered a “bad” handler function itself.

What is more terrible though, is that we are dereferencing the user %esp. What if the %esp is invalid, or even malicious? Let’s do the third optional challenge and attack ourselves. 🙂



Part 3: Optional challenges

Consider this PoC:

#include "types.h"
#include "stat.h"
#include "user.h"

void periodic();

void
main()
{
  int i;
  printf(1, "your computer is gonna crash, muahahah!\n");
  alarm(1, periodic);
  
  asm ("mov    0xf0100004,%esp");

  for(;;)
    i++;
}

void
periodic()
{
  printf(1, "alarm!\n");
}

As usual we register the handler function periodic(), but this time we set our %esp to somewhere funny, jump into an infinite loop (no stack operations!) and wait for the countdown. Once the countdown reaches zero, our poor kernel will actually write a return address at 0xf0100000, which is crazy! Let’s see:

$ alarmtest
your computer is gonna crash, muahahah!
lapicid 1: panic: remap
80106997 80105ab1 80105724 0 0 0 0 0 0 0

Well, nothing too bad, but we crashed the kernel. 🙂

So, that’s it! Now I’m painfully familiar with GDT and IDT entries. 🙂 See you next time!