Select Page

This post will mainly focus on Part 2: The Boot Loader, and Part 3: The Kernel. While Part 1: PC Bootstrap, which covers the BIOS stuff, also sounds really interesting to me, I decide not to be distracted too much by all of these H/W details, for now.

  1. Part 1: PC Bootstrap
  2. Part 2: The Boot Loader
  3. Part 3: The Kernel


Part 1: PC Bootstrap

To quickly summarize: The 64 KB from 0x000F0000 to 0x00100000 is reserved for ROM BIOS. When our machine boots up, the CPU starts executing at [F000:FFF0], which is very close to the end of BIOS, so it immediately jumps back to an earlier address, [F000:E05B] and keeps on executing.

The BIOS is stored in non-volatile memory on the motherboard. It sets up an interrupt descriptor table and initializes various devices. After that, it searches for a bootable device. In our case, it loads the boot loader from the first sector, the boot sector, of the boot disk, to 0x7C00 and transfers control to it.



Part 2: The Boot Loader

Here goes our main stuff today. For this part, instead of following the structure of the lab page and answering all the exercise questions, I’ll simply walkthrough the boot loader itself, namely boot/boot.S and boot/main.c and comment on / explain the interesting things along the way.

When the boot loader starts executing, the CPU is simulating an Intel 8088. So, first comes some 16-bit code, since we are now running in real mode:

.globl start
start:
  .code16                     # Assemble for 16-bit mode
  cli                         # Disable interrupts
  cld                         # String operations increment

  # Set up the important data segment registers (DS, ES, SS).
  xorw    %ax,%ax             # Segment number zero
  movw    %ax,%ds             # -> Data Segment
  movw    %ax,%es             # -> Extra Segment
  movw    %ax,%ss             # -> Stack Segment

From the boot loader’s perspective, the BIOS may have set up its own interrupt handlers. However, the boot loader has no idea how these handlers work, so it’d better turn off interrupts when it’s running. The O/S will re-enable interrupts later anyway.

cld unsets the Direction Flag (DF) bit in the FLAGS register, making string processes start from lowest to highest addresses. Segment registers DS, ES and SS are set to 0.

  # Enable A20:
  #   For backwards compatibility with the earliest PCs, physical
  #   address line 20 is tied low, so that addresses higher than
  #   1MB wrap around to zero by default.  This code undoes this.
seta20.1:
  inb     $0x64,%al               # Wait for not busy
  testb   $0x2,%al
  jnz     seta20.1

  movb    $0xd1,%al               # 0xd1 -> port 0x64
  outb    %al,$0x64

seta20.2:
  inb     $0x64,%al               # Wait for not busy
  testb   $0x2,%al
  jnz     seta20.2

  movb    $0xdf,%al               # 0xdf -> port 0x60
  outb    %al,$0x60

Then we run into some hardware stuff. This piece of code opens the A20 gate. What is A20 gate, and why is this needed?

Both 8086 and 80286 are 16-bit processors, while 8086 has 20 address lines, and 80286 has 24. As we know, since 8086 cannot represent the entire 1M address space using one single 16-bit register, it uses the logical address representation [segment:offset] which can represent a maximum address of (0xFFFF * 0x10 + 0xFFFF) = 0x10FFEF, which is an address larger than 8086’s maximum address but less than 80286’s. This makes 8086 and 80826 to behave differently when trying to access these “High Addresses” (0x100000 ~ 0x10FFEF): The “overflown” 21th bit is ignored by 8086.

To solve the issue, IBM came up with a somehow weird idea: Ok, what about ignoring that 21th bit on 80286, too? Then we get the same behavior on both 8086 and 80286!… :-/

The plan is: We combine the 21th address line, A20 Line, with a bit called A20 Gate, which is the second bit of the 8042 chip (also known as the Keyboard Controller), using an AND gate, so that when the A20 gate is set to zero, the A20 Line will always be zero. Now, 8086 and 80286 get the same wraparound behavior when trying to access a “High Address”.

The 8042 chip was designed as a keyboard controller, but nevertheless it was modified to support a seemingly irrelevant feature. The author of the OSDev wiki page of A20 Line says:

This seems foolish now given their unrelated nature, but at the time computers weren’t quite so standardized.

A20 Line, OSDev.org

Anyways, soon we will change from real mode to 32-bit protected mode, so here we have to make sure that the A20 Gate is on – we don’t want a “hole” at the middle of our address space!

  # Switch from real to protected mode, using a bootstrap GDT
  # and segment translation that makes virtual addresses 
  # identical to their physical addresses, so that the 
  # effective memory map does not change during the switch.
  lgdt    gdtdesc
  movl    %cr0, %eax
  orl     $CR0_PE_ON, %eax
  movl    %eax, %cr0

Before switching to 32-bit protected mode, one last thing to do is to set a (temporary) GDT (Global Descriptor Table). Each entry in GDT, typically, is a 8-byte data structure that describes a memory segment, defined as below:

Global Descriptor Table – OSDev Wiki

As we can see this data structure looks pretty weird. Just look at how base address get cut into 3 pieces…Some H/W black magic optimization that I don’t understand must be going on.

The 2-bit DPL field defines the level of privilege required to access the described segment, which is our familiar ring 0, 1, 2, 3 stuff.

The Sz flag: 1 for 32-bit protected mode; 0 for 16-bit protected mode.

The G (Granularity) flag: When set, the Limit field is interpreted as 4K units; when unset, the Limit field is interpreted as bytes. Here the bit should be set, and we’ve got two huge 4G segments.

Here, the LGDT instruction loads info into the GDTR register and set up our initial GDT, as described in the code:

# Bootstrap GDT
.p2align 2                                # force 4 byte alignment
gdt:
  SEG_NULL				               # null seg
  SEG(STA_X|STA_R, 0x0, 0xffffffff)	  # code seg
  SEG(STA_W, 0x0, 0xffffffff)	        # data seg

gdtdesc:
  .word   0x17                            # sizeof(gdt) - 1
  .long   gdt                             # address gdt

Note the lower 2 bytes to be loaded into GDTR is the size of our GDT minus one:

LIMIT is 1 less than the length of the table, so if LIMIT has the value 15, then the GDT is 16 bytes long.

X86 Assembly / Global Descriptor Table

Macros are defined in inc/mmu.h. The first segment must be NULL, followed by a code segment and a data segment. Notably, The base address for both segments is zero, implying an identity mapping from virtual to physical addresses, which means the effective addresses will remain the same when we switch to protected mode.

Talking about the term “virtual address”, a quick sidenote:

For historical reasons we have used the term virtual address to refer to addresses manipulated by programs; an xv6 virtual address is the same as an x86 logical address, and is equal to the linear address to which the segmentation hardware maps it.

xv6 – a simple, Unix-like teaching operating system

Another quick note about the unused (NULL) first GDT entry:

Because the first entry of the GDT is not used by the processor, a selector that has an index of zero and a table indicator of zero (i.e., a selector that points to the first entry of the GDT), can be used as a null selector. The processor does not cause an exception when a segment register (other than CS or SS) is loaded with a null selector. It will, however, cause an exception when the segment register is used to access memory. This feature is useful for initializing unused segment registers so as to trap accidental references.

80386 Programmer’s Reference Manual – 5.1.3 Selectors

Then, we load CR0 into EAX, make sure its PE (Protected Mode Enabled) bit is set, then save it back. The CR0_PE_ON macro is defined as 0x1 at line 10.

Typing <C-a> + C enters QEMU’s monitor. info registers lists values of interesting registers one cannot easily inspect in GDB, including CR0.

Finally, we perform a ljmp, which allows the CS to be specified:

.set PROT_MODE_CSEG, 0x8         # kernel code segment selector
.set PROT_MODE_DSEG, 0x10        # kernel data segment selector
  # Jump to next instruction, but in 32-bit code segment.
  # Switches processor into 32-bit mode.
  ljmp    $PROT_MODE_CSEG, $protcseg

Under protected mode, Segment Registers are used in a different way:

Segment Registers in Protected Mode – CIS-77

So, PROT_MODE_CSEG here means the (8 >> 3 = 1)st entry in GDT, which is indeed our code segment. The segment descriptor describes a 32-bit segment (recall that Sz bit), and the following .code32 directive tells the assembler to generate instructions beginning with 0x66. Prefixes 0x66 / 0x67 toggle between 16-bit and 32-bit operands and addresses.

  .code32                     # Assemble for 32-bit mode
protcseg:
  # Set up the protected-mode data segment registers
  movw    $PROT_MODE_DSEG, %ax    # Our data segment selector
  movw    %ax, %ds                # -> DS: Data Segment
  movw    %ax, %es                # -> ES: Extra Segment
  movw    %ax, %fs                # -> FS
  movw    %ax, %gs                # -> GS
  movw    %ax, %ss                # -> SS: Stack Segment
  
  # Set up the stack pointer and call into C.
  movl    $start, %esp
  call bootmain

  # If bootmain returns (it shouldn't), loop.
spin:
  jmp spin

After setting other segment registers to our data segment, we set ESP to 0x7C00. The boot loader code execute “upwards” from this address, and our new stack will grow “downwards” from the same address, which I think is quite elegant. Later in bootmain(), the function prologue will assign the same address to EBP. In this way, 0x7C00 becomes our base address of the first stack frame.

bootmain typically should not return. If something crazy happens and it returns, CPU will be stuck in that infinite spin loop.

Now we are done with boot/boot.S! Take a break and look at boot/main.c:

#define SECTSIZE	512
#define ELFHDR		((struct Elf *) 0x10000) // scratch space

void readsect(void*, uint32_t);
void readseg(uint32_t, uint32_t, uint32_t);

void
bootmain(void)
{
	struct Proghdr *ph, *eph;

	// read 1st page off disk
	readseg((uint32_t) ELFHDR, SECTSIZE*8, 0);

	// is this a valid ELF?
	if (ELFHDR->e_magic != ELF_MAGIC)
		goto bad;

	// load each program segment (ignores ph flags)
	ph = (struct Proghdr *) ((uint8_t *) ELFHDR + ELFHDR->e_phoff);
	eph = ph + ELFHDR->e_phnum;
	for (; ph < eph; ph++)
		// p_pa is the load address of this segment (as well
		// as the physical address)
		readseg(ph->p_pa, ph->p_memsz, ph->p_offset);

	// call the entry point from the ELF header
	// note: does not return!
	((void (*)(void)) (ELFHDR->e_entry))();

bad:
	outw(0x8A00, 0x8A00);
	outw(0x8A00, 0x8E00);
	while (1)
		/* do nothing */;
}

What boot/main.c is fairly simple: It reads the first 4K page (starting from sector 1! Remember sector 0 contains the boot loader itself) off the hard disk, checks the ELF magic in the ELF header, load every segment described by program headers, then finally transfers control to kernel entry point. That’s it.

As we can see in the code, here the p_paddr field is used to determine the physical address where a segment should be loaded. By running readelf on obj/kern/kernel, we can actually see the numbers:

peilin@PWN:~/6.828/lab/obj/kern$ readelf -all kernel
ELF Header:

  Entry point address: 0x10000c

Program Headers:
  Type      Offset   VirtAddr   PhysAddr   FileSiz MemSiz  Flg Align
  LOAD      0x001000 0xf0100000 0x00100000 0x0759d 0x0759d R E 0x1000
  LOAD      0x009000 0xf0108000 0x00108000 0x0b6a8 0x0b6a8 RW  0x1000
  GNU_STACK 0x000000 0x00000000 0x00000000 0x00000 0x00000 RWE 0x10

It tells us the ((void (*)(void)) (ELFHDR->e_entry))(); statement actually jumps to 0x10000c (see 0x7d6b).

We can see the huge difference between the physical and virtual base address of the kernel. How this entire thing works? Let’s dive into the kernel:



Part 3: The Kernel

First, let’s look into kern/entry.S.

.globl entry
entry:
	movw	$0x1234,0x472			# warm boot

	# We haven't set up virtual memory yet, so we're running from
	# the physical address the boot loader loaded the kernel at: 1MB
	# (plus a few bytes).  However, the C code is linked to run at
	# KERNBASE+1MB.  Hence, we set up a trivial page directory that
	# translates virtual addresses [KERNBASE, KERNBASE+4MB) to
	# physical addresses [0, 4MB).  This 4MB region will be
	# sufficient until we set up our real page table in mem_init
	# in lab 2.

	# Load the physical address of entry_pgdir into cr3.  entry_pgdir
	# is defined in entrypgdir.c.
	movl	$(RELOC(entry_pgdir)), %eax
	movl	%eax, %cr3
	# Turn on paging.
	movl	%cr0, %eax
	orl	$(CR0_PE|CR0_PG|CR0_WP), %eax
	movl	%eax, %cr0

	# Now paging is enabled, but we're still running at a low EIP
	# (why is this okay?).  Jump up above KERNBASE before entering
	# C code.
	mov	$relocated, %eax
	jmp	*%eax

I don’t really understand the first movw $0x1234, 0x472 instruction. Seems like it has something to do with warm booting. Hmm… :-/

After that, the main goal of this piece of code is to set up a simple page table and enable paging, so that both [0, 4M) and [KERNBASE, KERNBASE+4M) virtual addresses will map to physical [0, 4M). Let’s walk through a concrete example to understand how it works.

(gdb) x/i $pc
=>    0x10002d:    jmp    *%eax
(gdb) i r eax
eax    0xf010002f    -267386833

Say now we turned on paging, set EIP to 0xf010002f. When the CPU then tries to execute an instruction at 0xf010002f, how does it translate this address into physical address?

The logical address can be represented as 0x8:0xf010002f. Remember 0x8 is the value of CS, an index into GDT. Here the GDT entry says that the base address of this segment is 0x0, so the linear address is 0x0 + 0xf010002f = 0xf010002f. Then:

Figure 5-12. 80386 Addressing Machanism

Now CR3 (also called the Page Directory Base Register, PDBR) is set to the physical address of entry_pgdir, 0x112000. The linear address can be divided into upper 10 bits 0b1111000000, middle 10 bits 0b0100000000 and lower 12 bits 0b000000101111. First we find the corresponding Page Directory Entry (PDE) using upper 10 bits and CR3: (Note that PDEs are 4 bytes long)

(gdb) x/wx 0x112000 + (0b1111000000*4)
0x112f00:    0x00110003

That 0x3 means the PTE_P flag (the table is present in the memory) and PTE_W flag (writable or not), and the remaining 0x110000 is the base address of a page table, in this case, entry_pgtable. Using this base address and the middle 10 bits we can find the corresponding Page Table Entry (PTE), whose length is also 4 bytes:

(gdb) x/wx 0x110000 + (0b0100000000*4)
0x110400:    0x00100023

This time we also have a PTE_A (0x20) bit set, indicating that this page has already been accessed before. Well, makes sense, since now we’ve already executed one instruction after turning on paging. Anyways:

The 0x100000 is the base physical address of our target page. Adding it with the offset, the lower 12 bits, the CPU successfully translate the linear address into physical address: 0x100000 + 0b000000101111 = 0x10002f. Exactly what we want! Let’s check it out:

(gdb) x/i 0xf010002f
0xf010002f    <relocated>:    mov $0x0, %ebp
(gdb) x/i 0x0010002f
0x10002f:    mov $0x0, %ebp

Yay.

Each PDE corresponds to a page table which has 1024 (defined as NPTENTRIES in kern/mmu.h) PTEs. Each PTE corresponds to 4K of physical memory, since the offset is 12 bits long. Therefore, each PDE represents a maximum range of memory of (1024 * 4K = ) 4M.

Because of design limitations JOS will use only the first 256MB of a PC’s physical memory anyway…

Lab 1: Booting a PC

In Lab 1 we only map the first 4M of the kernel, but in the next lab we are going to map the entire 256MB.

The remaining part of kern/entry.S is fairly easy to understand:

relocated:

	# Clear the frame pointer register (EBP)
	# so that once we get into debugging C code,
	# stack backtraces will be terminated properly.
	movl	$0x0,%ebp			# nuke frame pointer

	# Set the stack pointer
	movl	$(bootstacktop),%esp

	# now to C code
	call	i386_init

	# Should never get here, but in case we do, just spin.
spin:	jmp	spin

EBP is set to zero, so that our backtrace() can terminate gracefully. ESP is set to bootstacktop (0xf0110000). Then finally, we call i386_init, and we can run some C code.

OK, I want to stop the coding reading here. Quite a journey, and I definitely learned a lot. I tried to discover as much as I could, but I intentionally glossed over some H/W related stuff including multi-booting etc. That would be too much for this Lab, and I hope I will learn more about them in the future.

Finally, here’s my quick solution for the remaining code exercises:


Section “Formatted Printing to the Console”:

Exercise 8. We have omitted a small fragment of code – the code necessary to print octal numbers using patterns of the form “%o”. Find and fill in this code fragment.

Lab 1: Booting a PC

Before modifying lib/printfmt.c:

peilin@PWN:~/6.828/lab$ make qemu-nox

qemu-system-i386 -nographic -drive file=obj/kern/kernel.img,index=0,media=disk,format=raw -serial mon:stdio -gdb tcp::26000 -D qemu.log
6828 decimal is XXX octal!

After modifying this part of lib/printfmt.c:

		// (unsigned) octal
		case 'o':
			// Replaced this with mine code!
			num = getint(&ap, lflag);
			if ((long long) num < 0) {
				putch('-', putdat);
				num = -(long long) num;
			}
			base = 8;
			goto number;
peilin@PWN:~/6.828/lab$ make qemu-nox

qemu-system-i386 -nographic -drive file=obj/kern/kernel.img,index=0,media=disk,format=raw -serial mon:stdio -gdb tcp::26000 -D qemu.log
6828 decimal is 15254 octal!

Nice.

At kern/entry.S:80 we call i386_init() (kern/init.c:22), where we call cprintf() (defined in kern/printf.c)at line 36:

	cprintf("6828 decimal is %o octal!\n", 6828);

Later, cprintf() calls vprintfmt() defined at lib/printfmt.c:82, which finally calls cputchar() defined in kern/console.c. There are just so many interesting things going on in console.c, but they are too h/w to be included in this post.


Section “The Stack”:

kern/kdebug.c:

	// Mine code here:
	stab_binsearch(stabs, &lline, &rline, N_SLINE, addr);
	if (lline <= rline)
		info->eip_line = stabs[lline].n_desc;
	else
		return -1;

kern/monitor.c:

int
mon_backtrace(int argc, char **argv, struct Trapframe *tf)
{
	uint32_t *ebp;
	struct Eipdebuginfo info;

	ebp = (uint32_t *)read_ebp();

	cprintf("Stack backtrace:\n");
	while (1) {
		cprintf("  ebp %08x", ebp);
		cprintf("  eip %08x", *(ebp+1));
		cprintf("  args %08x %08x %08x %08x %08x\n", *(ebp+2), *(ebp+3), *(ebp+4), *(ebp+5), *(ebp+6));

		if (!debuginfo_eip((uintptr_t)*(ebp+1), &info)) {
			cprintf("         %s:%d:", info.eip_file, info.eip_line);
			cprintf(" %.*s", info.eip_fn_namelen, info.eip_fn_name);
			cprintf("+%d\n", (*(ebp+1) - info.eip_fn_addr));
		}

		ebp = (uint32_t *)(*ebp);
		if (ebp == 0)
			break;
	}
	return 0;
}

Nothing really fancy, we already know that the stack is a linked list. Just keep traversing the stack frames until we reach a NULL saved EBP, set by kern/entry.S.

The stab things are new to me, though. Roughly speaking:

	/* Include debugging information in kernel memory */
	.stab : {
		PROVIDE(__STAB_BEGIN__ = .);
		*(.stab);
		PROVIDE(__STAB_END__ = .);
		BYTE(0)		/* Force the linker to allocate space
				   for this section */
	}

	.stabstr : {
		PROVIDE(__STABSTR_BEGIN__ = .);
		*(.stabstr);
		PROVIDE(__STABSTR_END__ = .);
		BYTE(0)		/* Force the linker to allocate space
				   for this section */
	}

By looking at kern/kernel.ld, it seems that debugging information is added to .stab and .stabstr sections of the kernel ELF:

peilin@PWN:~/6.828/lab$ objdump -h obj/kern/kernel | grep .stab
2 .stab    00003d21 f010231c 0010231c 0000331c 2**2
3 .stabstr 0000197e f010603d 0010603d 0000703d 2**0
peilin@PWN:~/6.828/lab$ objdump -G obj/kern/kernel | grep SLINE
2 SLINE 0 44 f010000c 0
3 SLINE 0 57 f0100015 0
4 SLINE 0 58 f010001a 0
5 SLINE 0 60 f010001d 0

More information about stabs can be found here. Note that for N_SLINE (line number) symbols:

The desc field contains the line number and the value contains the code address for the start of that source line.

2.4 Line Numbers – STABS

I was getting wrong line numbers, since I mistakenly used the value field. 🙂 I should have just read the documentation in the first place. Anyways:

peilin@PWN:~/6.828/lab$ make grade

running JOS: (0.8s)
printf: OK
backtrace count: OK
backtrace arguments: OK
backtrace symbols: OK
backtrace lines: OK
Score: 50/50

Ta-dah.