dumbfork: OK (1.4s)
Part A score: 5/5

faultread: OK (0.9s)
faultwrite: OK (1.0s)
faultdie: OK (1.0s)
faultregs: OK (1.1s)
faultalloc: OK (0.9s)
faultallocbad: OK (1.0s)
faultnostack: OK (2.0s)
faultbadhandler: OK (1.9s)
faultevilhandler: OK (2.1s)
forktree: OK (2.1s)
Part B score: 50/50

spin: OK (0.9s)
stresssched: OK (1.8s)
sendpage: OK (1.7s)
pingpong: OK (1.4s)
primes: OK (2.7s)
Part C score: 25/25

Score: 80/80

Phew… MUCH tougher than I’d imagined…

Here we go. Lab 4: Preemptive Multitasking, with Parts A, B and C.

As usual we are going to implement a lot of functions, but while the Lab 2 functions were built mostly “from scratch”, most of the Lab 4 functions (if not all of them) build on top of the Lab 2 and Lab 3 code. This actually felt pretty good, but I also wrote some of the hardest bugs I’ve ever created… 🙂

 kern/env.c       |   8 ++++++--
 kern/init.c      |   9 ++++++---
 kern/pmap.c      |  32 +++++++++++++++++++-------
 kern/sched.c     |  14 +++++++++++++-
 kern/syscall.c   | 173 +++++++++++++++++++++++++++++++++++++++++----------
 kern/trap.c      | 172 ++++++++++++++++++++++++++-----------------------
 kern/trapentry.S |  20 ++++++++++++------
 lib/fork.c       |  92 +++++++++++++++++++++++++++++++++++++---------
 lib/ipc.c        |  35 ++++++++++++++++++++++++---
 lib/pfentry.S    |  14 ++++++++++++++
 lib/pgfault.c    |   8 ++++++--
 user/sendpage.c  |   2 +-
12 files changed, 479 insertions(+), 100 deletions(-)

Without further ado:

  1. Part A: Multiprocessor Support and Cooperative Multitasking
  2. Part B: Copy-on-Write Fork
  3. Part C: Preemptive Multitasking and Inter-Process communication (IPC)


Part A: Multiprocessor Support and Cooperative Multitasking

First we would like to make our JOS kernel support “symmetric multiprocessing” (SMP). An SMP system has one bootstrap processor (BSP) and one or more application processors (APs). The APs are activated by the BSP only after the system is up and running. In an SMP system, each processor has its own local APIC (LAPIC) unit. Frankly I don’t know much about them yet. From the perspective of this lab, we need these LAPIC units because:

First. Per-CPU information for all the CPUs is stored in an array struct CpuInfo cpus[NCPU], initialized in mpconfig.c. Whenever a CPU wants to access its own CpuInfo, it first needs to figure out its index into this array, which is its LAPIC identifier (APIC ID). cpunum() in lapic.c does this for us (see the sketch below).

Second. During system startup, the BSP sends the STARTUP interprocessor interrupt (IPI) to the APs through LAPICs.

Third. We need to program the built-in timer of the LAPIC in order to trigger clock interrupts, which support preemptive multitasking.
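As a concrete example of the first point: kern/cpu.h (if I remember the lab skeleton correctly) wraps cpunum() in a tiny macro, so kernel code can just say thiscpu:

	// From kern/cpu.h: index cpus[] with the APIC ID returned by cpunum().
	extern struct CpuInfo cpus[NCPU];
	#define thiscpu (&cpus[cpunum()])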

That’s the idea. Let’s not worry too much about the h/w stuff and jump straight into the exercises:

Exercise 1. Implement mmio_map_region in kern/pmap.c.

void *
mmio_map_region(physaddr_t pa, size_t size)
{
	static uintptr_t base = MMIOBASE;

	size = ROUNDUP(size, PGSIZE);
	if (base + size > MMIOLIM)
		panic("mmio_map_region: cannot go higher than MMIOLIM!\n");

	boot_map_region(kern_pgdir, base, size, pa, (PTE_PCD|PTE_PWT|PTE_W));
	base += size;
	return (void *)(base - size);
}

So the CPUs interact with their LAPICs using MMIO. The function itself is nothing new. The base variable is declared as static so its value will be preserved between calls to mmio_map_region().
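As a usage example, kern/lapic.c maps the LAPIC’s 4KB register region through this function before touching any LAPIC registers (roughly, from memory):

	// In lapic_init(): lapicaddr is the physical address of the LAPIC
	// MMIO region, discovered earlier by mp_init().
	lapic = mmio_map_region(lapicaddr, 4096);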

Exercise 2. Read boot_aps() and mp_main() in kern/init.c, and the assembly code in kern/mpentry.S. Make sure you understand the control flow transfer during the bootstrap of APs. Then modify your implementation of page_init() in kern/pmap.c to avoid adding the page at MPENTRY_PADDR to the free list, so that we can safely copy and run AP bootstrap code at that physical address.

void
page_init(void)
{	
	// 1) Real-mode IDT and BIOS structures.
	size_t i = 0;
	pages[i].pp_ref = 1;
	pages[i].pp_link = NULL;
	i++;
	// 2) The rest of base memory.
	for (; i < npages_basemem; i++) {
		if (i == MPENTRY_PADDR / PGSIZE) {
			pages[i].pp_ref = 1;
			pages[i].pp_link = NULL;
			continue;
		}
		pages[i].pp_ref = 0;
		pages[i].pp_link = page_free_list;
		page_free_list = &pages[i];
	}
	cprintf("page_init: npages_basemem: %x\n", npages_basemem);
	// 3) The IO hole.
	for (; i*PGSIZE < EXTPHYSMEM; i++) {
		pages[i].pp_ref = 1;
		pages[i].pp_link = NULL;
	}
	// 4) Kernel data structures allocated by boot_alloc().
	for (; i*PGSIZE < PADDR((char *)boot_alloc(0)); i++) {
		pages[i].pp_ref = 1;
		pages[i].pp_link = NULL;
	}
	// 5) Everything else.
	for (; i < npages; i++) {
		pages[i].pp_ref = 0;
		pages[i].pp_link = page_free_list;
		page_free_list = &pages[i];
	}
}

We will copy the AP entry code to MPENTRY_PADDR, so we definitely don’t want this physical page to be allocated and overwritten for other purposes.
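For reference, this is how boot_aps() uses that page (a trimmed sketch of kern/init.c):

static void
boot_aps(void)
{
	extern unsigned char mpentry_start[], mpentry_end[];
	void *code;
	struct CpuInfo *c;

	// Copy the AP entry code to the (now reserved) page at MPENTRY_PADDR.
	code = KADDR(MPENTRY_PADDR);
	memmove(code, mpentry_start, mpentry_end - mpentry_start);

	// Boot the APs one at a time.
	for (c = cpus; c < cpus + ncpu; c++) {
		if (c == cpus + cpunum())	// the BSP is already running
			continue;
		// Tell mpentry.S what stack to use, then fire the STARTUP IPI.
		mpentry_kstack = percpu_kstacks[c - cpus] + KSTKSIZE;
		lapic_startap(c->cpu_id, PADDR(code));
		// Wait for the AP to finish its basic setup in mp_main().
		while (c->cpu_status != CPU_STARTED)
			;
	}
}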

Exercise 3. Modify mem_init_mp() (in kern/pmap.c) to map per-CPU stacks starting at KSTACKTOP, as shown in inc/memlayout.h. The size of each stack is KSTKSIZE bytes plus KSTKGAP bytes of unmapped guard pages.

mem_init_mp() is a function called inside mem_init().

static void
mem_init_mp(void)
{
	for (int i = 0; i < NCPU; i++) {
		uintptr_t kstacktop_i = KSTACKTOP - i * (KSTKSIZE + KSTKGAP);
		boot_map_region(kern_pgdir, kstacktop_i - KSTKSIZE, KSTKSIZE, PADDR(percpu_kstacks[i]), PTE_W);
	}
}

Just map a kernel stack for each CPU. Kernel RW, user NONE.
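The resulting layout, paraphrasing the diagram in inc/memlayout.h:

	KSTACKTOP ----> +------------------------------+
	                |     CPU0's kernel stack      |  KSTKSIZE (kernel RW)
	                +------------------------------+
	                |     invalid guard memory     |  KSTKGAP (unmapped)
	                +------------------------------+
	                |     CPU1's kernel stack      |  KSTKSIZE
	                +------------------------------+
	                |     invalid guard memory     |  KSTKGAP
	                +------------------------------+
	                |             ...              |

If a CPU overflows its stack, it hits the unmapped guard gap and faults, instead of silently corrupting the next CPU’s stack.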

Exercise 4. The code in trap_init_percpu() (kern/trap.c) initializes the TSS and TSS descriptor for the BSP. It worked in Lab 3, but is incorrect when running on other CPUs. Change the code so that it can work on all CPUs.

As mentioned, the BSP wakes the APs up using IPIs and tells them to execute mpentry.S. mpentry.S calls mp_main(), which does the per-AP setup, including calling this trap_init_percpu().

void
trap_init_percpu(void)
{
    // Set per-CPU TSS
    thiscpu->cpu_ts.ts_esp0 = (uintptr_t) percpu_kstacks[thiscpu->cpu_id] + KSTKSIZE;	// stack top: stacks grow down
    thiscpu->cpu_ts.ts_ss0 = GD_KD;
    thiscpu->cpu_ts.ts_iomb = sizeof(struct Taskstate);

    // Set per-CPU TSS descriptor
    gdt[(GD_TSS0 >> 3) + thiscpu->cpu_id] = SEG16(STS_T32A, (uint32_t) (&thiscpu->cpu_ts), sizeof(struct Taskstate) - 1, 0);
    gdt[(GD_TSS0 >> 3) + thiscpu->cpu_id].sd_s = 0;

    // Load per-CPU TSS selector
    ltr(GD_TSS0 + (thiscpu->cpu_id * sizeof(struct Segdesc)));	// GDT entries are 8 bytes each

    // Load IDT
    lidt(&idt_pd);
}

As seen in Lab 3, we only use the ts_esp0 and ts_ss0 fields of the TSS, for context switching from user space to kernel space. So what is that ts_iomb?

OSDev calls this field “IOPB offset”. The Intel386 DX MICROPROCESSOR datasheet calls this field “BIT_MAP_OFFSET(15:0)” (See Figure 4-15a on page 49). It says, on page 58:

Due to the use of a pointer to the base of the I/O Permission Bitmap, the bitmap may be located anywhere within the TSS, or may be ignored completely by pointing the Bit_Map_Offset(15:0) beyond the limit of the TSS segment.

Whatever the IOPB is, if we set the IOPB offset “beyond the limit of the TSS segment”, it will be ignored and we can totally forget about it. Recall that the TSS limit is the length of the TSS minus one (see the - 1 in the SEG16() call above, where we set the TSS descriptor). So if we set the IOPB offset to the full length of the TSS, as we do in trap_init_percpu(), the whole IOPB will always be ignored by the CPU.

That’s the reason behind this “recommended practice” (if one wants no IOPB, of course). I wonder if I will need this knowledge ever again. 🙂

Anyways, back to C! Each CPU then sets its own TSS descriptor, with the sd_s field set to zero, which stands for “system”. Finally it loads the %tr and %idtr registers.

Now, if we run QEMU with multiple processors:

peilin@PWN:~/6.828/2018/lab$ make qemu-nox CPUS=4

check_page_free_list() succeeded!
check_page_alloc() succeeded!
check_page() succeeded!
check_kern_pgdir() succeeded!
check_page_free_list() succeeded!
check_page_installed_pgdir() succeeded!
SMP: CPU 0 found 4 CPU(s)
enabled interrupts: 1 2
SMP: CPU 1 starting
SMP: CPU 2 starting
SMP: CPU 3 starting

Nice.

Exercise 5. Apply the big kernel lock as described above, by calling lock_kernel() and unlock_kernel() at the proper locations.

I’m gonna skip this part… The idea is: 1) let the BSP acquire the kernel lock before waking the APs; 2) let each AP acquire the kernel lock before calling sched_yield() to run user environments; 3) make sure each CPU acquires the lock as soon as it enters the kernel (trap()), and releases it right before switching back to user mode (env_run(); I actually chose env_pop_tf()).
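For the record, here is roughly where my calls ended up (a sketch; other placements are possible):

	// kern/init.c, i386_init(): the BSP takes the lock before waking the APs.
	lock_kernel();
	boot_aps();

	// kern/init.c, mp_main(): each AP takes the lock before scheduling.
	lock_kernel();
	sched_yield();

	// kern/trap.c, trap(): take the lock whenever we trapped from user mode.
	if ((tf->tf_cs & 3) == 3) {
		lock_kernel();
		assert(curenv);
		......
	}

	// kern/env.c, env_pop_tf(): release the lock right before popping the
	// trapframe to return to user mode.
	unlock_kernel();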

Exercise 6. Implement round-robin scheduling in sched_yield() as described above. Don’t forget to modify syscall() to dispatch sys_yield(). Make sure to invoke sched_yield() in mp_main().

void
sched_yield(void)
{
	struct Env *idle;
	int i, j;

	j = (curenv) ? (ENVX(curenv->env_id) + 1) : 0;

	for (i = 0; i < NENV; i++) {
		idle = &envs[(j + i) % NENV];
		if (idle->env_status == ENV_RUNNABLE)
			env_run(idle);
	}

	// If we get here, no other environment was RUNNABLE (env_run() never
	// returns), so it is OK to keep running the current one.
	if (curenv && curenv->env_status == ENV_RUNNING)
		env_run(curenv);

	// sched_halt() never returns
	sched_halt();
}

We are doing “round-robin” here. If this is the first time a CPU runs an environment, the for loop starts from zero; otherwise, it starts just after the environment this CPU was last running.

If sched_yield() finds a different RUNNABLE environment, it env_run()s it. If no other environment is RUNNABLE, but the one previously running on this CPU is still RUNNING, it is OK to env_run() the same environment again. If for some reason we fall through all these cases, this CPU sched_halt()s.

Now we have a simple scheduler. In the next exercise we will be implementing some syscalls: sys_exofork(), sys_env_set_status(), sys_page_alloc(), sys_page_map() and sys_page_unmap(). Nah, just five… 🙂

Exercise 7. Implement the system calls described above in kern/syscall.c and make sure syscall() calls them.

sys_exofork() forks a new environment, without any mappings in the user portion of its address space.

static envid_t
sys_exofork(void)
{
	struct Env *e;
	int rc;

	if ((rc = env_alloc(&e, curenv->env_id)) < 0)
		return rc;	// -E_NO_FREE_ENV or -E_NO_MEM

	e->env_status = ENV_NOT_RUNNABLE;
	e->env_tf = curenv->env_tf;
	e->env_tf.tf_regs.reg_eax = 0;

	return e->env_id;
}

Obviously this child is not ready to be scheduled yet, so sys_exofork() marks it as NOT_RUNNABLE, giving the parent time to set up the child environment. It also copies the parent’s trapframe. Later, the syscall path will store the child’s env_id into the parent’s trapframe as the parent’s return value, but nobody else is going to set the return value for the child (which must be zero, as we all know), so sys_exofork() does it here by zeroing reg_eax in the copied trapframe.
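The calling pattern (in the style of user/dumbfork.c from the lab) shows the two return values in action:

	envid_t envid = sys_exofork();
	if (envid < 0)
		panic("sys_exofork: %e", envid);
	if (envid == 0) {
		// We are the child: sys_exofork() zeroed our saved %eax.
		thisenv = &envs[ENVX(sys_getenvid())];
		return 0;
	}
	// We are the parent: envid is the child's id, stored by the syscall path.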

sys_env_set_status()

static int
sys_env_set_status(envid_t envid, int status)
{
	if ((status != ENV_RUNNABLE) && (status != ENV_NOT_RUNNABLE))
		return -E_INVAL;
	
	struct Env *e;

	if (envid2env(envid, &e, 1) < 0)
		return -E_BAD_ENV;

	e->env_status = status;
	return 0;
}

envid2env() performs the permission checks for us; just remember to set its checkperm argument. Same with sys_page_alloc():

static int
sys_page_alloc(envid_t envid, void *va, int perm)
{
	if (((uintptr_t)va >= UTOP) || ((uintptr_t)va % PGSIZE != 0))
		return -E_INVAL;
	if ((perm & PTE_U) != PTE_U)
		return -E_INVAL;
	if ((perm & ~PTE_SYSCALL) != 0)
		return -E_INVAL;

	struct Env *e;
	if (envid2env(envid, &e, 1) < 0)
		return -E_BAD_ENV;

	struct PageInfo *pp = page_alloc(ALLOC_ZERO);
	if (!pp)
		return -E_NO_MEM;

	if (page_insert(e->env_pgdir, pp, va, perm) < 0) {
		page_free(pp);	// don't leak the page if the mapping fails
		return -E_NO_MEM;
	}

	return 0;
}

You can’t really map a page without PTE_P set in its PTE (page_insert() sets it for us), so I don’t bother to check it here…

Finally, sys_page_map() and sys_page_unmap():

static int
sys_page_map(envid_t srcenvid, void *srcva,
	     envid_t dstenvid, void *dstva, int perm)
{
	if (((uintptr_t)srcva >= UTOP) || ((uintptr_t)srcva % PGSIZE != 0))
		return -E_INVAL;
	if (((uintptr_t)dstva >= UTOP) || ((uintptr_t)dstva % PGSIZE != 0))
		return -E_INVAL;
	if ((perm & PTE_U) != PTE_U)
		return -E_INVAL;
	if ((perm & ~PTE_SYSCALL) != 0)
		return -E_INVAL;

	pte_t *pte;
	struct Env *srce, *dste;
	if (envid2env(srcenvid, &srce, 1) < 0)
		return -E_BAD_ENV;
	if (envid2env(dstenvid, &dste, 1) < 0)
		return -E_BAD_ENV;
	
	struct PageInfo *pp = page_lookup(srce->env_pgdir, srcva, &pte);
	if (!pp)	
		return -E_INVAL;	// srcva is not mapped in srcenvid's address space
	if ((perm & PTE_W) && ((*pte & PTE_W) == 0))
		return -E_INVAL;	// must not grant write access to a read-only page
	if (page_insert(dste->env_pgdir, pp, dstva, perm) < 0)
		return -E_NO_MEM;
	return 0;
}

As the kernel, we basically can’t trust anything coming from userland, including the arguments. That’s why we have all these checks. I feel uneasy, since I don’t think I could come up with such a checklist myself. Maybe the only way to make this list comprehensive is by trial and error.

static int
sys_page_unmap(envid_t envid, void *va)
{
	if (((uintptr_t)va >= UTOP) || ((uintptr_t)va % PGSIZE != 0))
		return -E_INVAL;
	
	struct Env *e;
	if (envid2env(envid, &e, 1) < 0)
		return -E_BAD_ENV;

	page_remove(e->env_pgdir, va);
	return 0;
}

Now we have all the syscalls we need to implement a user library fork() function, which is exactly what we are going to do in Part B.



Part B: Copy-on-Write Fork

The plan is, whenever the parent or the child writes to a COW page, the page fault exception will be handled by our user-level page fault handler, pgfault(). But how does fork() tell the kernel which handler it wants to use? The answer is another syscall, sys_env_set_pgfault_upcall():

Exercise 8. Implement the sys_env_set_pgfault_upcall() system call.

static int
sys_env_set_pgfault_upcall(envid_t envid, void *func)
{
	struct Env *e;

	if (envid2env(envid, &e, 1) < 0)
		return -E_BAD_ENV;

	e->env_pgfault_upcall = func;
	return 0;
}

Recall that we already have a kernel-side page fault handler, page_fault_handler(). Up to now it simply destroyed any user environment that faulted; we now extend it to dispatch user page faults to the registered user-level handler:

void
page_fault_handler(struct Trapframe *tf)
{
	uint32_t fault_va;

	// Read the processor's CR2 register to find the faulting address.
	fault_va = rcr2();

	if ((tf->tf_cs & 3) == 0)
		panic("page_fault_handler: page fault in kernel mode, va %08x!\n", fault_va);

	// We've already handled kernel-mode exceptions, so if we get here,
	// the page fault happened in user mode.

	if (!curenv->env_pgfault_upcall)
		goto bad;
	
	int recursive = (tf->tf_esp >= UXSTACKTOP - PGSIZE) && (tf->tf_esp < UXSTACKTOP);
	struct UTrapframe *utf;

	if (recursive) {
		utf = (struct UTrapframe*)(tf->tf_esp - sizeof(uint32_t) - sizeof(struct UTrapframe));
		user_mem_assert(curenv, (void *)utf, tf->tf_esp - (uintptr_t)utf, PTE_W);
	} else {
		utf = (struct UTrapframe*)(UXSTACKTOP - sizeof(struct UTrapframe));
		user_mem_assert(curenv, (void *)utf, UXSTACKTOP - (uintptr_t)utf, PTE_W);
	}

	utf->utf_esp = tf->tf_esp;	// Will be changed in _pgfault_upcall
	utf->utf_eflags = tf->tf_eflags;
	utf->utf_eip = tf->tf_eip;
	utf->utf_regs = tf->tf_regs;
	utf->utf_err = tf->tf_err;
	utf->utf_fault_va = fault_va;

	// Prepare to call env_run()
	tf->tf_eip = (uintptr_t)curenv->env_pgfault_upcall;
	tf->tf_esp = (uintptr_t)utf;
	env_run(curenv);

	bad:
		// Destroy the environment that caused the fault.
		cprintf("[%08x] user fault va %08x ip %08x\n",
			curenv->env_id, fault_va, tf->tf_eip);
		print_trapframe(tf);
		env_destroy(curenv);
}

If a user environment causes a page fault without having registered a page fault handler, page_fault_handler() destroys that environment. Otherwise, it pushes a user trapframe (struct UTrapframe) onto the user exception stack and returns to the user-level entry point running on that exception stack, in our case _pgfault_upcall. _pgfault_upcall is an assembly routine similar to xv6‘s trapasm.S: it pushes the user trapframe pointer onto the exception stack, then calls the “real” C handler, _pgfault_handler(). After _pgfault_handler() has done its job, _pgfault_upcall is also responsible for cleaning up the stack argument, then popping off the user trapframe to jump back to wherever the fault happened and continue executing.

We call the stack that was being used at the time the page fault was generated the “trap-time” stack. So there are three stacks in total: the exception brings us from the trap-time stack to the per-CPU kernel stack; the kernel’s page_fault_handler() switches us from the kernel stack to the user exception stack; finally, _pgfault_upcall sends us from the user exception stack back to the trap-time stack. It turns out the last step is pretty tricky: _pgfault_upcall has to ret to the trap-time %eip after restoring the trap-time %esp, which means someone has to push the trap-time %eip onto the trap-time stack first. I do this in _pgfault_upcall. The reason I mention this here is that it complicates the case where page_fault_handler() is handling a “recursive” page fault, a page fault generated while already running on the exception stack. In that case, page_fault_handler() can’t place the user trapframe immediately below the trap-time stack top (which is on the exception stack). It has to reserve a blank word between the trap-time stack top and the user trapframe; see the sizeof(uint32_t) gap in the recursive case above.
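Here is my mental picture of the exception stack in the recursive case (compare the struct UTrapframe definition in inc/trap.h):

	                    (higher addresses)
	+---------------------------+
	| trap-time stack contents  |  what the faulting handler was using
	+---------------------------+  <- trap-time %esp (tf->tf_esp)
	| blank 32-bit word         |  reserved: _pgfault_upcall later writes
	+---------------------------+  the trap-time %eip here
	| utf_esp                   |
	| utf_eflags                |
	| utf_eip                   |
	| utf_regs (8 registers)    |
	| utf_err                   |
	| utf_fault_va              |
	+---------------------------+  <- %esp when _pgfault_upcall starts
	                    (lower addresses)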

user_mem_assert() does an elegant check for us: it not only checks whether the user has actually allocated an exception stack; it also covers the case where a recursive fault overflows the exception stack.

Exercise 10. Implement the _pgfault_upcall routine in lib/pfentry.S. The interesting part is returning to the original point in the user code that caused the page fault. You’ll return directly there, without going back through the kernel. The hard part is simultaneously switching stacks and re-loading the EIP.

#include <inc/mmu.h>
#include <inc/memlayout.h>

.text
.globl _pgfault_upcall
_pgfault_upcall:
    pushl %esp               // function argument: pointer to UTF
    movl _pgfault_handler, %eax
    call *%eax
    addl $4, %esp            // pop function argument
    
    movl 40(%esp), %ebx      // Load trap-time %eip into %ebx
    subl $4, 48(%esp)        // *Important*: decrement the trap-time %esp
                             // in the trapframe first, since we want to
                             // push onto the trap-time stack
    movl 48(%esp), %eax      // Load (adjusted) trap-time %esp into %eax
    movl %ebx, (%eax)        // "Push" trap-time %eip onto trap-time stack

    // Restore the trap-time registers.  After you do this, you
    // can no longer modify any general-purpose registers.
    addl $8, %esp            // Ignore fault_va and tf_err
    popal

    // Restore eflags from the stack.  After you do this, you can
    // no longer use arithmetic operations or anything else that
    // modifies eflags.
    addl $4, %esp            // Jump over trap-time %eip
    popf                     // There's no "popl %eflags"!

    // Switch back to the adjusted trap-time stack.
    popl %esp

    // Return to re-execute the instruction that faulted.
    ret

Those two lines (the subl and the movl right after it) hide one of the worst bugs I encountered in this lab. In order to push the trap-time %eip onto the trap-time stack, at first I simply did: 1) load the trap-time %esp from the trapframe into a temp register; 2) decrement the temp register by 4; 3) dereference the temp register and write the trap-time %eip to that address. THIS IS VERY WRONG! I wrote the return address to %esp - 4, but the trap-time program would still resume on the old %esp, since I never changed the value saved in the trapframe… Instead, I have to decrement it in the trapframe first, and only then load it into a temp register, as done above.

The weirdest thing is that my original code worked just fine in the non-recursive scenarios, which made the debugging even harder.

That kind of feeling when you know your critical assembly code path failed.jpg

Now life is sane. Next we need a user library wrapper to set up the user page fault handler:

Exercise 11. Finish set_pgfault_handler() in lib/pgfault.c.

Makes sense. Imagine I’m a JOS developer: I definitely don’t want to manually allocate a user exception stack myself every time. And I certainly don’t want to type man 2 sys_env_set_pgfault_upcall. 🙂

void
set_pgfault_handler(void (*handler)(struct UTrapframe *utf))
{
	int rc;

	if (_pgfault_handler == 0) {
		// First time through!
		envid_t eid = sys_getenvid();
		if ((rc = sys_page_alloc(eid, (void *)(UXSTACKTOP-PGSIZE), (PTE_U|PTE_W))) < 0)
			panic("set_pgfault_handler: sys_page_alloc() failed!\n");
		if ((rc = sys_env_set_pgfault_upcall(eid, _pgfault_upcall)) < 0)
			panic("set_pgfault_handler: sys_env_set_pgfault_upcall() failed!\n");
	}

	// Save handler pointer for assembly to call.
	_pgfault_handler = handler;
}

The first time a user environment calls it, it allocates an exception stack and registers _pgfault_upcall. Since _pgfault_handler is a global function pointer, we then simply update it.
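With this wrapper in place, a user program can handle its own page faults in a few lines. A minimal sketch in the spirit of user/faultalloc.c (this exact program is hypothetical):

#include <inc/lib.h>

static void
handler(struct UTrapframe *utf)
{
	void *addr = (void *)utf->utf_fault_va;
	// Map a fresh page at the faulting address; envid 0 means "myself".
	if (sys_page_alloc(0, ROUNDDOWN(addr, PGSIZE), PTE_P|PTE_U|PTE_W) < 0)
		panic("sys_page_alloc in handler failed");
}

void
umain(int argc, char **argv)
{
	set_pgfault_handler(handler);
	*(int *)0xDeadBeef = 610;	// faults once; handler maps the page
}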

OK finally it’s time to implement that COW fork().

Exercise 12. Implement fork(), duppage() and pgfault() in lib/fork.c.

static void
pgfault(struct UTrapframe *utf)
{
	void *addr = (void *) utf->utf_fault_va;
	uint32_t err = utf->utf_err;
	extern volatile pte_t uvpt[];
	int rc;

	if ((err & FEC_WR) == 0)
		panic("pgfault: the faulting access was not a write!\n");
	if ((uvpt[(uintptr_t)addr >> PTXSHIFT] & PTE_COW) == 0)
		panic("pgfault: not a copy-on-write page!\n");

	envid_t eid = sys_getenvid();
	if ((rc = sys_page_alloc(eid, PFTEMP, (PTE_U|PTE_W))) < 0)
		panic("pgfault: sys_page_alloc() failed: %e\n", rc);

	memcpy((void *)PFTEMP, (void *)ROUNDDOWN(addr, PGSIZE), PGSIZE);
	if ((rc = sys_page_map(eid, (void *)PFTEMP, eid, (void *)ROUNDDOWN(addr, PGSIZE), (PTE_U|PTE_W))) < 0)
		panic("pgfault: sys_page_map() failed: %e\n", rc);
	if ((rc = sys_page_unmap(eid, (void *)PFTEMP)) < 0)
		panic("pgfault: sys_page_unmap() failed: %e\n", rc);
}

First let’s take a look at pgfault(). The CPU pushes an error code when a page fault exception happens; our handler checks it to tell whether the faulting access was a write. It then consults uvpt[] to check whether the page is COW. If it really was a write to a COW page, the handler allocates a new page for the faulting environment, temporarily maps it at PFTEMP, copies the contents of the old page into it, installs the new page at the faulting address, then unmaps the temporary mapping.

Then fork() and duppage().

static int
duppage(envid_t ceid, unsigned pn)
{
	int rc;
	extern volatile pte_t uvpt[];
	envid_t peid = sys_getenvid();
	intptr_t va = (intptr_t)(pn * PGSIZE);

	if (uvpt[pn] & (PTE_COW|PTE_W)) {
		if ((rc = sys_page_map(peid, (void *)va, ceid, (void *)va, (PTE_COW|PTE_U))) < 0)
			return rc;
		if ((rc = sys_page_map(peid, (void *)va, peid, (void *)va, (PTE_COW|PTE_U))) < 0)
			return rc;
	} else {
		if ((rc = sys_page_map(peid, (void *)va, ceid, (void *)va, PTE_U)) < 0)
			return rc;
	}
	return 0;
}

envid_t
fork(void)
{
	int rc;

	set_pgfault_handler(pgfault);
	
	envid_t ceid = sys_exofork();	// child envid

	extern volatile pde_t uvpd[];
	extern volatile pte_t uvpt[];

	if (ceid < 0)
		return ceid;
	if (ceid == 0) {
		// we are the child.
		ceid = sys_getenvid();

		// update thisenv.
		thisenv = &envs[ENVX(ceid)];
		return 0;
	}

	for (uintptr_t va = 0; va < UTOP;) {
		if ((uvpd[va >> PDXSHIFT] & PTE_P) == 0) {	// page table page not found.
			va += NPTENTRIES * PGSIZE;
			continue;
		}
		if ((uvpt[va >> PTXSHIFT] & PTE_P) == 0) {	// page table entry not found.
			va += PGSIZE;
			continue;
		}
		if (va == UXSTACKTOP - PGSIZE) {	// the exception stack is not remapped!
			va += PGSIZE;
			continue;
		}
		// this page should be duppage()d.
		if ((rc = duppage(ceid, (unsigned)(va/PGSIZE))) < 0)
			return rc;
		va += PGSIZE;
	}

	// The parent is responsible for allocating an exception stack for the
	// child. What if we let the child do it for itself? When the child
	// starts running, it runs on its normal stack. However, its normal
	// stack is COW now, so this causes a page fault immediately, even
	// before the child allocates its exception stack, and the kernel
	// page_fault_handler() will simply destroy the child.
	if ((rc = sys_page_alloc(ceid, (void *)(UXSTACKTOP-PGSIZE), (PTE_U|PTE_W))) < 0)
		return rc;

	// For a similar reason, the parent is also responsible for setting up
	// the page fault upcall for the child. However, we can't call
	// set_pgfault_handler() here since it only works for the current
	// environment.
	extern void _pgfault_upcall(void);
	sys_env_set_pgfault_upcall(ceid, _pgfault_upcall);

	// Finally, mark the child RUNNABLE.
	sys_env_set_status(ceid, ENV_RUNNABLE);
	
	return ceid;
}

As mentioned in the comments, we can’t really let the child do anything “important” for itself (other than updating its thisenv variable). As soon as the child starts running, it will immediately page-fault by touching its own COW stack, and the kernel’s page_fault_handler() will destroy it, since the child never got a chance to allocate an exception stack or register a user handler. Therefore the parent does both things on the child’s behalf (and of course we don’t mark the parent’s exception stack as COW; we don’t copy it to the child at all).

The second thing I want to mention is the for loop that walks the caller’s virtual address space. I have to be cautious when consulting uvpt[]: if the corresponding page table page does not exist at all, the read itself faults. Therefore I check uvpd[] first, and advance va by a whole page table’s worth of pages (1024) when the page table does not exist.
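For reference, the same walk written with the inc/mmu.h macros (behavior identical to the shifts used above):

	for (uintptr_t va = 0; va < UTOP; ) {
		if (!(uvpd[PDX(va)] & PTE_P)) {		// no page table page here
			va += PTSIZE;			// skip the whole 4MB region
			continue;
		}
		if (!(uvpt[PGNUM(va)] & PTE_P)) {	// page not mapped
			va += PGSIZE;
			continue;
		}
		// ... va is mapped; duppage() it ...
		va += PGSIZE;
	}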

Finally, I want to talk about the questions raised in the comments around duppage().

First: For writable or COW pages in the parent’s address space, we first map the page as COW into the child’s address space, then remap the same page as COW in the parent’s own address space again. Can we reverse this order?

The answer is NO. Consider: what happens if we map the stack page as COW for the parent first, then map it as COW for the child? Since the parent is currently running on this stack, it generates a page fault as soon as it marks its own stack as COW. Then pgfault() takes over and maps a fresh, writable stack page for the parent. So what happens next is that we map this new stack page as COW for the child! This page will keep being written to by the parent, while the very same page is marked COW in the child. This cannot be right.

The second question is similar: If a parent page is already COW, why do we have to remap it as COW again?

Because who knows whether the parent will write to this COW page before mapping it into the child? The two sys_page_map() calls in duppage() are separate C statements; a write may happen right between them (say, the function prologue’s push %ebp when calling sys_page_map()). Then that page, as in the previous question, would end up writable in the parent yet COW in the child, which is very wrong.

That was quite a lot to reason about. Thankfully, we now have a not-so-dumb fork()! This concludes the cooperative multitasking part of this lab. In Part C, we will finally enforce preemptive multitasking!



Part C: Preemptive Multitasking and Inter-Process communication (IPC)

Now, if a user environment somehow feels like looping forever, there’s nothing we can do about it. That’s why we need preemptive multitasking, which is to say, forcing an environment to give up the CPU and giving other environments a chance to be scheduled.

JOS does this by programming the built-in timer chip of LAPIC to generate timer interrupts, so that after some amount of time our kernel gets a chance to decide whether to switch to another environment or not.

Exercise 13. Modify kern/trapentry.S and kern/trap.c to initialize the appropriate entries in the IDT and provide handlers for IRQs 0 through 15. Then modify the code in env_alloc() in kern/env.c to ensure that user environments are always run with interrupts enabled. Also uncomment the sti instruction in sched_halt() so that idle CPUs unmask interrupts.

Just do it.

The processor never pushes an error code for a hardware interrupt, so it’s safe to use TRAPHANDLER_NOEC for all of them.

I do want to mention the SETGATE macro once more. Say, for the timer interrupt:

	void vector32();
	SETGATE(idt[32], 0, GD_KT, vector32, 0);	// IRQ_OFFSET + IRQ_TIMER

The first 0 is the istrap field. When it is zero, the CPU clears the FL_IF bit in %eflags when switching into the kernel. This is how we enforce the rule that interrupts are always disabled while in the JOS kernel.

The second 0 is the dpl field. If it were 3, a user program could raise this interrupt with the int instruction, which we definitely don’t want for hardware IRQs. Remember to set it to 3 for the syscall gate, though.
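For comparison, the syscall gate looks roughly like this (a sketch; vector48 is whatever your trapentry.S names the T_SYSCALL stub):

	void vector48();
	SETGATE(idt[T_SYSCALL], 0, GD_KT, vector48, 3);	// dpl = 3: user code may "int $0x30"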

Exercise 14. Modify the kernel’s trap_dispatch() function so that it calls sched_yield() to find and run a different environment whenever a clock interrupt takes place.

static void
trap_dispatch(struct Trapframe *tf)
{
	switch (tf->tf_trapno) {
	......
	case IRQ_OFFSET + IRQ_TIMER:
		lapic_eoi();
		sched_yield();
	......

OK. So it seems that in JOS, an environment must give up the CPU whenever it receives a timer interrupt.

Exercise 15. Implement sys_ipc_recv() and sys_ipc_try_send() in kern/syscall.c. Read the comments on both before implementing them, since they have to work together.
Then implement the ipc_recv() and ipc_send() functions in lib/ipc.c.

Finally, we implement a simple IPC mechanism that lets environments send values (as well as pages, if they want) to each other.

First the two syscalls:

static int
sys_ipc_try_send(envid_t envid, uint32_t value, void *srcva, unsigned perm)
{
	struct Env *e;
	if (envid2env(envid, &e, 0) < 0)
		return -E_BAD_ENV;

	if ((e->env_status != ENV_NOT_RUNNABLE) || (e->env_ipc_recving == 0))
		return -E_IPC_NOT_RECV;
	
	// If both side want to share a page:
	bool transferring_page = ((uintptr_t)srcva != -1) && ((uintptr_t)e->env_ipc_dstva != -1);
	if (transferring_page) {
		// Can't call sys_page_map() here, since it does not allow mapping
		// to arbitrary environments.
		if (((uintptr_t)srcva >= UTOP) || ((uintptr_t)srcva % PGSIZE != 0))
			return -E_INVAL;
		if (((perm & PTE_U) != PTE_U) || ((perm & ~PTE_SYSCALL) != 0))
			return -E_INVAL;

		pte_t *pte;
		struct PageInfo *pp = page_lookup(curenv->env_pgdir, srcva, &pte);
		if (!pp)
			return -E_INVAL;	// srcva is not mapped in the sender's address space.
		if ((perm & PTE_W) && ((*pte & PTE_W) == 0))
			return -E_INVAL;	// Must not grant write access to a read-only page!
		if (page_insert(e->env_pgdir, pp, e->env_ipc_dstva, perm) < 0)
			return -E_NO_MEM;
	}

	e->env_ipc_recving = 0;
	e->env_ipc_from = curenv->env_id;
	e->env_ipc_value = value;
	e->env_ipc_perm = (transferring_page) ? perm : 0;
	
	e->env_status = ENV_RUNNABLE;
	return 0;
}

Don’t call envid2env() with the checkperm flag set here: any environment is allowed to send IPC to any other environment. This also means we can’t simply reuse sys_page_map().

static int
sys_ipc_recv(void *dstva)
{
	if (dstva != (void *)-1) {
		if (((uintptr_t)dstva >= UTOP) || ((uintptr_t)dstva % PGSIZE != 0))
			return -E_INVAL;
	}

	curenv->env_ipc_recving = 1;
	curenv->env_ipc_dstva = dstva;	// -1 means not receiving a page

	curenv->env_status = ENV_NOT_RUNNABLE;
	
	// This is the real return 0.
	curenv->env_tf.tf_regs.reg_eax = 0;
	sys_yield();
	// This is the fake return 0.
	return 0;
}

sys_ipc_recv() is pretty tricky, though. If it fails, the normal syscall return path takes care of the return value and hands -E_INVAL back as usual. If it successfully yields the CPU, however, this function never actually returns: when this environment gets scheduled again, its register context is restored from its trapframe, which means we have to store the return value of zero into that trapframe now, inside sys_ipc_recv().
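To see why writing reg_eax works, recall how the syscall dispatch stores its return value (roughly, from my Lab 3 trap_dispatch()):

	case T_SYSCALL:
		// The return value lands in the trapframe's %eax, which
		// env_run() restores when the environment is next scheduled.
		tf->tf_regs.reg_eax = syscall(tf->tf_regs.reg_eax,
					      tf->tf_regs.reg_edx,
					      tf->tf_regs.reg_ecx,
					      tf->tf_regs.reg_ebx,
					      tf->tf_regs.reg_edi,
					      tf->tf_regs.reg_esi);
		break;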

Then the two user library wrappers, ipc_recv() and ipc_send().

int32_t
ipc_recv(envid_t *from_env_store, void *pg, int *perm_store)
{
	void *dstva = (pg) ? pg : (void *)(-1);	// -1 if we don't want to receive a page
	int rc;
	
	if ((rc = sys_ipc_recv(dstva)) < 0) {
		if (from_env_store)
			*from_env_store = 0;
		if (perm_store)
			*perm_store = 0;
		return rc;
	}

	if (from_env_store)
		*from_env_store = thisenv->env_ipc_from;
	if (perm_store)
		*perm_store = thisenv->env_ipc_perm;
	return (thisenv->env_ipc_value);
}

void
ipc_send(envid_t to_env, uint32_t val, void *pg, int perm)
{
	void *srcva = (pg) ? pg : (void *)(-1);	// -1 if we don't want to send a page
	int rc;

	for (;;) {
		rc = sys_ipc_try_send(to_env, val, srcva, perm);
		if (rc == 0) {
			break;	// Successful!
		}
		if (rc == -E_IPC_NOT_RECV) {
			sys_yield();	// Keep on waiting.
			continue;
		}
		panic("ipc_send: received unexpected return code from sys_ipc_try_send: %e\n", rc);
	}
}

ipc_send() keeps trying until a sys_ipc_try_send() succeeds. I do have a question though: with the current API, how would an environment send or receive a page mapped at virtual address zero? A NULL pg is interpreted as not sending / receiving a page.

OK, now we can do all the fun stuff!

peilin@PWN:~/6.828/2018/lab$ make run-primes-nox CPUS=4

SMP: CPU 0 found 4 CPU(s)
enabled interrupts: 1 2
SMP: CPU 1 starting
SMP: CPU 2 starting
SMP: CPU 3 starting
[00000000] new env 00001000
[00001000] new env 00001001
CPU 0: 2 [00001001] new env 00001002
CPU 0: 3 [00001002] new env 00001003
CPU 0: 5 [00001003] new env 00001004
CPU 0: 7 [00001004] new env 00001005
CPU 1: 11 [00001005] new env 00001006
CPU 0: 13 [00001006] new env 00001007
CPU 0: 17 [00001007] new env 00001008
CPU 0: 19 [00001008] new env 00001009
CPU 2: 23 [00001009] new env 0000100a
CPU 3: 29 [0000100a] new env 0000100b
CPU 2: 31 [0000100b] new env 0000100c
CPU 3: 37 [0000100c] new env 0000100d
CPU 0: 41 [0000100d] new env 0000100e
CPU 3: 43 [0000100e] new env 0000100f
CPU 1: 47 [0000100f] new env 00001010
CPU 2: 53 [00001010] new env 00001011
CPU 0: 59 [00001011] new env 00001012
CPU 2: 61 [00001012] new env 00001013
CPU 2: 67 [00001013] new env 00001014
CPU 0: 71 [00001014] new env 00001015
CPU 2: 73 [00001015] new env 00001016
CPU 0: 79 [00001016] new env 00001017
CPU 1: 83 [00001017] new env 00001018
CPU 2: 89 [00001018] new env 00001019
CPU 1: 97 [00001019] new env 0000101a

CPU 3: 8123 [000013fe] new env 000013ff

Aww, this is so beautiful.

So that’s it! That was quite a lab. Trying to do a three-week lab in four days turned out to be very tough, but I have to make time for WPICTF 2020 this weekend. 😉 Next we will be dealing with file systems! Time to take a rest and move on!