K> showmappings 0xef7bc000
Virtual Address Physical Address Flags
0xef7bc000 0x003fd000 ------U-P
K> chpgperm clear 0xef7bc000
[0xef7bc000 - 0xef7bd000): Supervisor | Read-only
K> showmappings 0xef7bc000
Virtual Address Physical Address Flags
0xef7bc000 0x003fd000 --------P
K> chpgperm set 0xef7bc000 SW
[0xef7bc000 - 0xef7bd000): Supervisor | Read/write
K> showmappings 0xef7bc000
Virtual Address Physical Address Flags
0xef7bc000 0x003fd000 -------WP
K> chpgperm change 0xef7bc000 +U
[0xef7bc000 - 0xef7bd000): User | Read/write
K> showmappings 0xef7bc000
Virtual Address Physical Address Flags
0xef7bc000 0x003fd000 ------UWP
K> memdump -V 0xf0100000
Virtual Physical
[0xf0100000] [0x00100000] 1badb002 00000000 e4524ffe 7205c766
Oh my, this is just so satisfying.
Part 1: Overview
Last time, after walking through the BIOS and the bootloader, we finally entered the kernel. In entry.S, after:
- Setting up a very simple page directory, entry_pgdir;
- Enabling paging;
- Initializing the stack;
the system is ready to execute some C code, so it calls i386_init():
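void
i386_init(void)
{
    extern char edata[], end[];

    // Zero the .bss section before touching any static data.
    memset(edata, 0, end - edata);

    // The console must be initialized before cprintf() works.
    cons_init();
    cprintf("6828 decimal is %o octal!\n", 6828);

    mem_init();

    // Drop into the kernel monitor.
    while (1)
        monitor(NULL);
}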
Later we will be allocating some memory management data structures in the .bss region, so we memset() it to zero first.
Then we call cons_init() to initialize the console. Frankly I don’t know what is going on inside it, but we can’t cprintf() anything until we call this function.
Finally, after calling mem_init() (which will be our main course today), we jump into an infinite while loop to run the monitor. The monitor is basically a simple shell-like user interface with that K> prompt we’ve seen before.
This is already kind of satisfying: we’ve followed the whole path from booting until the system is finally “up”. Now it’s time to look into the final piece (for now!) of the puzzle, mem_init():
void
mem_init(void)
{
    uint32_t cr0;
    size_t n;

    i386_detect_memory();

    kern_pgdir = (pde_t *) boot_alloc(PGSIZE);
    memset(kern_pgdir, 0, PGSIZE);
    kern_pgdir[PDX(UVPT)] = PADDR(kern_pgdir) | PTE_U | PTE_P;

    pages = (struct PageInfo *) boot_alloc(npages * sizeof(struct PageInfo));
    memset(pages, 0, npages * sizeof(struct PageInfo));

    page_init();

    // Physical page descriptors ("pages")
    boot_map_region(kern_pgdir, UPAGES, npages * sizeof(struct PageInfo), PADDR(pages), PTE_U);
    // Kernel stack ("bootstack")
    boot_map_region(kern_pgdir, KSTACKTOP - KSTKSIZE, KSTKSIZE, PADDR(bootstack), PTE_W);
    // Entire physical memory
    boot_map_region(kern_pgdir, KERNBASE, (1ULL << 32) - KERNBASE, 0, PTE_W);

    lcr3(PADDR(kern_pgdir));

    cr0 = rcr0();
    cr0 |= CR0_PE|CR0_PG|CR0_AM|CR0_WP|CR0_NE|CR0_MP;
    cr0 &= ~(CR0_TS|CR0_EM);
    lcr0(cr0);
}
I’ve removed all the check_*() functions as well as most of the comments, so it’s much shorter than the original. The hints and comments were super useful when I wrote the code, but sometimes they kept me from seeing the whole picture. No worries, I’m gonna explain all of this in a top-down manner. Of course, this is only my version of mem_init(), but the basic idea is the same.
The sole purpose of entry_pgdir is to “keep the system alive” after enabling paging; otherwise the CPU would crash as soon as it tried to execute at an EIP as high as 0xf010002f. We need to set up a “real” page management system, which allows us to:
- Keep track of (physical) page frames: is this frame currently in use? If so, how many virtual pages reference it?
- Dynamically allocate and free page frames;
- Set, clear, or change mappings from virtual to physical pages as we want, including permissions.
This new and powerful system will use a page directory named kern_pgdir.
Roughly speaking, the main purpose of mem_init() is to install (set up, then switch to) kern_pgdir.
More specifically, it does the following things in sequence:
- i386_detect_memory() uses CMOS calls to detect how much physical memory is available on the machine, and saves this information in the global variables npages and npages_basemem. It also prints the result to the console:
  Physical memory: 131072K available, base = 640K, extended = 130432K
- It then allocates the page directory kern_pgdir using a temporary allocator, boot_alloc().
- Page frame status info is stored in PageInfo structs. It then allocates an array of npages PageInfo structs named pages, also using boot_alloc(). From now on we never allocate with boot_alloc() again; we use page_alloc() instead. page_alloc() itself relies on PageInfo structs, which is why we can’t use it for kern_pgdir and pages: that would be a chicken-and-egg problem!
- At this point pages is all zeros, indicating that all page frames are free, which is definitely not the case. page_init() marks the page frames that are already in use as, well, in use, then links the rest of them together in a singly linked list, page_free_list.
- It then sets up mappings for a few virtual memory regions above UTOP, including:
  - pages, to UPAGES;
  - The kernel stack bootstack we set up in entry.S, to KSTACKTOP - KSTKSIZE;
  - The entire physical memory, to KERNBASE. We did a similar thing in entry_pgdir, but only for the first 4M of physical memory. Here we do the “full version” of it.
- Finally, since all data structures are ready now, it installs kern_pgdir by loading the physical address of kern_pgdir into CR3 and then setting more flags in CR0. From now on, the CPU will consult kern_pgdir when translating linear addresses.
So, basically that’s it. My version uses boot_map_region() to do the mapping since I believe it’s easier. All we need to do now is implement boot_map_region() :-), as well as the other page management functions that will come in handy in the future.
Part 2: Function Implementations
Here’s boot_alloc(), which we only use twice to allocate:
static void *
boot_alloc(uint32_t n)
{
    static char *nextfree;  // virtual address of next byte of free memory
    char *result;

    if (!nextfree) {
        extern char end[];
        nextfree = ROUNDUP((char *) end, PGSIZE);
    }

    result = nextfree;
    if (n) {
        if ((uint32_t) nextfree + n > KERNBASE + PTSIZE - 1) {
            panic("boot_alloc: out of memory (entry_pgdir[0x3c0])\n");
        }
        nextfree = ROUNDUP(nextfree + n, PGSIZE);
    }
    return (void *) result;
}
Also, page_init():
void
page_init(void)
{
    // 1) Preserve the first frame including data structures set by BIOS,
    //    in case we ever need them.
    size_t i = 0;
    pages[i].pp_ref = 1;
    pages[i].pp_link = NULL;
    i++;

    // 2) The rest of base memory, [PGSIZE, npages_basemem * PGSIZE) is free.
    for (; i < npages_basemem; i++) {
        pages[i].pp_ref = 0;
        pages[i].pp_link = page_free_list;
        page_free_list = &pages[i];
    }

    // 3) The IO hole [IOPHYSMEM, EXTPHYSMEM), which must never be allocated.
    //    Recall that BIOS is here.
    for (; i * PGSIZE < EXTPHYSMEM; i++) {
        pages[i].pp_ref = 1;
    }

    // 4) [EXTPHYSMEM, boot_alloc(0)) contains the kernel, as well as
    //    "kern_pgdir" and "pages" we just allocated.
    for (; i * PGSIZE < PADDR((char *) boot_alloc(0)); i++) {
        pages[i].pp_ref = 1;
    }

    // 5) Everything above should be free.
    for (; i < npages; i++) {
        pages[i].pp_ref = 0;
        pages[i].pp_link = page_free_list;
        page_free_list = &pages[i];
    }
}
I actually wondered how to get the end address of pages. Fortunately I recalled that boot_alloc(0) returns the address of the next free page without allocating anything.
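To make the boot_alloc(0) trick concrete, here is a minimal, stand-alone sketch of the same bump-allocator idea. It works on plain integers instead of kernel pointers, and toy_end, toy_nextfree and toy_boot_alloc() are made-up names for illustration only (the value of toy_end is an arbitrary assumption, not my real end symbol):

```c
#include <assert.h>
#include <stdint.h>

#define PGSIZE 4096u

/* Round x up to the next multiple of n (n must be a power of two). */
static uint32_t round_up(uint32_t x, uint32_t n) {
    assert(n != 0 && (n & (n - 1)) == 0);
    return (x + n - 1) & ~(n - 1);
}

/* Hypothetical stand-in for the linker symbol `end` (assumed value). */
static uint32_t toy_end = 0x00117950u;
/* Next free address; 0 means "not initialized yet". */
static uint32_t toy_nextfree = 0;

/*
 * Returns the page-aligned start of an n-byte allocation and bumps the
 * cursor past it. With n == 0 nothing is allocated, so the return value
 * is simply the current end of everything allocated so far.
 */
static uint32_t toy_boot_alloc(uint32_t n) {
    if (toy_nextfree == 0)
        toy_nextfree = round_up(toy_end, PGSIZE);
    uint32_t result = toy_nextfree;
    if (n > 0)
        toy_nextfree = round_up(toy_nextfree + n, PGSIZE);
    return result;
}
```

Calling toy_boot_alloc(0) after a couple of allocations gives the first address past all of them, which is exactly what step 4 of page_init() needs as its upper bound.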
Then comes the more interesting boot_map_region().
static void
boot_map_region(pde_t *pgdir, uintptr_t va, size_t size, physaddr_t pa, int perm)
{
    pte_t *pte = NULL;

    for (int i = 0; i < (size / PGSIZE); i++) {
        pte = pgdir_walk(pgdir, (void *) va, 1);
        if (!pte) {
            panic("boot_map_region: pgdir_walk failed to create!\n");
        }
        *pte = pa | perm | PTE_P;
        pgdir[PDX(va)] |= perm;
        va += PGSIZE;
        pa += PGSIZE;
    }
}
For each virtual page in the given range, boot_map_region() looks up its PTE, creating a page table if the PDE has none yet, and points it at the corresponding physical page. Notice that we don’t touch pp_ref here, since pp_ref does not count mappings in virtual regions above UTOP. We only deal with UTOP and above in mem_init().
Now I’m gonna post my implementations of the other functions as well, in the order they appear in the lab.
First comes page_alloc(). Nothing really fancy here. This function does not increment pp_ref. The caller is responsible for doing it, either explicitly or via page_insert(), which we will see later. This function replaces boot_alloc().
struct PageInfo *
page_alloc(int alloc_flags)
{
    struct PageInfo *result = page_free_list;

    if (!page_free_list) {
        return NULL;
    }

    // Pop the head of the free list.
    page_free_list = result->pp_link;
    result->pp_link = NULL;

    if (alloc_flags & ALLOC_ZERO) {
        memset(page2kva(result), 0, PGSIZE);
    }
    return result;
}
Then, page_free(). Still nothing fancy. Don’t forget to check double-free, though. 🙂
void
page_free(struct PageInfo *pp)
{
    // pages + npages is one past the last valid entry, hence ">=".
    if ((pp < pages) || (pp >= pages + npages)) {
        panic("page_free: invalid free!\n");
    }
    // Still referenced, or already linked into the free list.
    if ((pp->pp_ref != 0) || (pp->pp_link != NULL)) {
        panic("page_free: invalid free!\n");
    }
    pp->pp_link = page_free_list;
    page_free_list = pp;
}
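If you want to play with the free list outside the kernel, here is a small user-space sketch of the same idea. ToyPage, toy_init(), toy_alloc() and toy_free() are hypothetical names invented for this example, and toy_free() returns an error code where the kernel version panics:

```c
#include <assert.h>
#include <stddef.h>

#define NPAGES 4

struct ToyPage {
    struct ToyPage *link;  /* next free page; NULL once allocated */
    int ref;               /* reference count */
};

static struct ToyPage toy_pages[NPAGES];
static struct ToyPage *toy_free_list;

/* Push every page onto the free list, like steps 2 and 5 of page_init(). */
static void toy_init(void) {
    toy_free_list = NULL;
    for (int i = 0; i < NPAGES; i++) {
        toy_pages[i].ref = 0;
        toy_pages[i].link = toy_free_list;
        toy_free_list = &toy_pages[i];
    }
}

/* Pop the head of the free list, or return NULL when it is empty. */
static struct ToyPage *toy_alloc(void) {
    struct ToyPage *pp = toy_free_list;
    if (!pp)
        return NULL;
    toy_free_list = pp->link;
    pp->link = NULL;
    return pp;
}

/*
 * Push pp back onto the free list. Returns 0 on success, or -1 when the
 * page is still referenced or already has a link (a best-effort
 * double-free check, same heuristic as page_free() above).
 */
static int toy_free(struct ToyPage *pp) {
    assert(pp != NULL);
    if (pp->ref != 0 || pp->link != NULL)
        return -1;
    pp->link = toy_free_list;
    toy_free_list = pp;
    return 0;
}
```

Because pages are pushed and popped at the head, the list behaves like a stack: the most recently freed page is the next one handed out. Note that the link check cannot catch a double free of the page sitting at the very tail of the list, since its link is NULL as well.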
Now take a look at pgdir_walk(), used in my boot_map_region().
pte_t *
pgdir_walk(pde_t *pgdir, const void *va, int create)
{
    pde_t *pde = pgdir + PDX(va);

    if (((*pde) & PTE_P) == 0) {
        if (!create) {
            return NULL;
        } else {
            struct PageInfo *ptp = page_alloc(ALLOC_ZERO);  // "page table page"
            if (!ptp) {
                // page_alloc failed
                return NULL;
            }
            ptp->pp_ref += 1;
            *pde = page2pa(ptp) | PTE_P;
        }
    }
    return (pte_t *) (KADDR(PTE_ADDR(*pde))) + PTX(va);
}
It receives a create flag indicating whether we want to create a page table if one is not present for the given va.
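The two-level lookup with an optional create step can be tried in user space too. The following is only a toy model under my own assumptions: tables live on the heap and the directory stores plain pointers instead of physical addresses with flag bits, and ToyPgdir and toy_walk() are names made up for this sketch:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define NENTRIES 1024

typedef uint32_t toy_pte_t;

/* Toy directory: each slot points at a heap-allocated table, or is NULL. */
struct ToyPgdir {
    toy_pte_t *tables[NENTRIES];
};

/*
 * Returns a pointer to the PTE slot for va. If the covering table does
 * not exist, returns NULL when create is 0, otherwise allocates a zeroed
 * table first (and still returns NULL if that allocation fails).
 */
static toy_pte_t *toy_walk(struct ToyPgdir *pgdir, uint32_t va, int create) {
    assert(pgdir != NULL);
    uint32_t pdx = (va >> 22) & 0x3ff;  /* top 10 bits: directory index */
    uint32_t ptx = (va >> 12) & 0x3ff;  /* next 10 bits: table index */

    if (!pgdir->tables[pdx]) {
        if (!create)
            return NULL;
        pgdir->tables[pdx] = calloc(NENTRIES, sizeof(toy_pte_t));
        if (!pgdir->tables[pdx])
            return NULL;
    }
    return &pgdir->tables[pdx][ptx];
}
```

Two neighbouring pages inside the same 4 MB region share one table, so their slots end up right next to each other; an address in a different 4 MB region still returns NULL until someone walks it with create set.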
page_lookup(), page_remove() and page_insert() are not actually used in mem_init(), but they are also useful page management functions. We will definitely need them in the future.
struct PageInfo *
page_lookup(pde_t *pgdir, void *va, pte_t **pte_store)
{
    pte_t *pte = pgdir_walk(pgdir, va, 0);

    if (!pte)
        return NULL;
    if (!(*pte & PTE_P))
        return NULL;
    if (pte_store)
        *pte_store = pte;
    return pa2page(*pte & ~0xFFF);
}
void
page_remove(pde_t *pgdir, void *va)
{
    pte_t *pte;
    struct PageInfo *pp = page_lookup(pgdir, va, &pte);

    if (!pp)
        return;
    page_decref(pp);
    memset(pte, 0, sizeof(pte_t));
    tlb_invalidate(pgdir, va);
}
That tlb_invalidate() confused me a little bit. It seems that the TLB (translation lookaside buffer) is just a table containing linear-to-physical address pairs for recently translated addresses. When the CPU translates a linear address and finds a hit in the TLB, it simply uses the cached value instead of walking the page tables again. Since we just removed the PTE for a linear address, we want to flush its TLB entry (if any) because it’s no longer valid. We do the same thing in page_insert():
// Corner-case hint: Make sure to consider what happens when the same
// pp is re-inserted at the same virtual address in the same pgdir.
// However, try not to distinguish this case in your code, as this
// frequently leads to subtle bugs; there's an elegant way to handle
// everything in one code path.
int
page_insert(pde_t *pgdir, struct PageInfo *pp, void *va, int perm)
{
    pte_t *pte = pgdir_walk(pgdir, va, 1);

    if (!pte)
        return -E_NO_MEM;

    if ((*pte & PTE_P)) {
        if (PTE_ADDR(*pte) == page2pa(pp)) {
            // Same page at the same va: undo the increment below.
            tlb_invalidate(pgdir, va);
            pp->pp_ref--;
        } else
            page_remove(pgdir, va);
    }

    pp->pp_ref++;
    *pte = page2pa(pp) | perm | PTE_P;
    pgdir[PDX(va)] |= perm;
    return 0;
}
Notice the corner case described in the comment. If we insert the same physical page at the same virtual address again, its pp_ref should not end up incremented. The comment suggests not to “distinguish this case”, but I couldn’t figure out how.
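For what it’s worth, one commonly used way to get a single code path is to take the new reference first and only then remove whatever was mapped before. Below is a toy model of that ordering, not my actual JOS code: a “PTE” is just a pointer to the mapped page, there is no TLB, and ToyPage, toy_remove() and toy_insert() are names invented for the sketch.

```c
#include <assert.h>
#include <stddef.h>

struct ToyPage {
    int ref;    /* number of mappings pointing at this page */
    int freed;  /* set to 1 when the last reference is dropped */
};

/* A toy "PTE": pointer to the mapped page, or NULL if unmapped. */
typedef struct ToyPage *toy_pte_t;

/* Unmap whatever the slot points at and drop its reference. */
static void toy_remove(toy_pte_t *pte) {
    struct ToyPage *pp = *pte;
    if (!pp)
        return;
    if (--pp->ref == 0)
        pp->freed = 1;
    *pte = NULL;
}

/* Map pp into the slot, handling "same page again" with no special case. */
static void toy_insert(toy_pte_t *pte, struct ToyPage *pp) {
    assert(pte != NULL && pp != NULL);
    pp->ref++;        /* take the new reference first ...             */
    toy_remove(pte);  /* ... so removing pp itself can't drop it to 0 */
    *pte = pp;
}
```

When the same page is re-inserted, the count goes 1, 2, then back to 1 and the page is never freed in between. In the real kernel the remove step would be page_remove(), which also takes care of the TLB invalidation.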
Part 3: Challenges!
For the challenge part, I chose the second challenge, and basically I wrote three more little tools, showmappings, chpgperm and memdump:
Challenge! Extend the JOS kernel monitor with commands to:
Lab 2: Memory Management
・Display in a useful and easy-to-read format all of the physical page mappings (or lack thereof) that apply to a particular range of virtual/linear addresses in the currently active address space. For example, you might enter ‘showmappings 0x3000 0x5000’ to display the physical page mappings and corresponding permission bits that apply to the pages at virtual addresses 0x3000, 0x4000, and 0x5000.
First comes my showmappings:
K> showmappings
Usage: showmappings LOWERBOUND [UPPERBOUND]
K> showmappings 0xef7bc000
Virtual Address Physical Address Flags
0xef7bc000 0x003fd000 ------U-P
K> showmappings 0xef7bf000
Virtual Address Physical Address Flags
0xef7bf000 0x003fe000 -------WP
K> showmappings 0xf0001000
Virtual Address Physical Address Flags
0xf0001000 0x00001000 --DA---WP
K> showmappings 0x0
Virtual Address Physical Address Flags
0x00000000 (UNMAPPED)
int
mon_showmappings(int argc, char **argv, struct Trapframe *tf)
{
    if ((argc < 2) || (argc > 3)) {
        cprintf("Usage: showmappings LOWERBOUND [UPPERBOUND]\n");
        return 0;
    }

    // Args are not sufficiently sanitized. Use with caution!
    uintptr_t lo = ROUNDDOWN(strtol(argv[1], NULL, 16), PGSIZE);
    uintptr_t up = (argc == 2) ? lo : ROUNDDOWN(strtol(argv[2], NULL, 16), PGSIZE);
    if (lo > up) {
        cprintf("showmappings: UPPERBOUND cannot be lower than LOWERBOUND\n");
        return 0;
    }

    pte_t *pte;
    extern pde_t *kern_pgdir;
    char flags[] = "GSDACTUWP";  // PTE bits 8 down to 0

    cprintf("Virtual Address Physical Address Flags\n");
    for (uintptr_t p = lo; p <= up; p += PGSIZE) {
        cprintf("0x%08x ", p);
        pte = pgdir_walk(kern_pgdir, (void *) p, 0);
        // A page table may exist while this particular PTE is not present.
        if (pte && (*pte & PTE_P)) {
            cprintf("0x%08x ", PTE_ADDR(*pte));
            int flag = (*pte & 0xFFF);
            for (int i = 0; i < 9; i++) {
                if ((flag >> (8 - i)) & 1) {
                    cprintf("%c", flags[i]);
                } else {
                    cprintf("-");
                }
            }
        } else {
            cprintf("(UNMAPPED)");
        }
        cprintf("\n");
        // Avoid wrapping around at the top of the address space.
        if (p >= 0xfffff000)
            break;
    }
    return 0;
}
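The flag column is the only slightly tricky part of showmappings, so here it is in isolation as a stand-alone sketch. pte_flags_str() is a helper name I made up for this example; it uses the same "GSDACTUWP" letters as the command, one per PTE bit from bit 8 down to bit 0:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Renders the nine low bits of a PTE as a string such as "------U-P".
 * out must have room for 9 characters plus the terminating NUL.
 */
static void pte_flags_str(uint32_t pte, char out[10]) {
    static const char names[] = "GSDACTUWP";  /* bit 8 down to bit 0 */
    assert(out != NULL);
    for (int i = 0; i < 9; i++)
        out[i] = ((pte >> (8 - i)) & 1) ? names[i] : '-';
    out[9] = '\0';
}
```

Feeding it a PTE whose low bits are 0x005 (present and user) gives ------U-P, and one whose low bits are 0x063 (present, writable, accessed, dirty) gives --DA---WP, matching the sessions above.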
Challenge! Extend the JOS kernel monitor with commands to:
Lab 2: Memory Management
・Explicitly set, clear, or change the permissions of any mapping in the current address space.
I wrote a chpgperm (CHange PaGe PERMissions) for this:
K> chpgperm
Usage: chpgperm ACTION VADDR [MODE]
ACTION is one of "set", "clear" or "change".
K> chpgperm set 0xf0000000 what?
chpgperm set: Each MODE is of the form '([[Ss]|[Uu]])([[Rr]|[Ww]])'.
K> chpgperm set 0xf010000c UW
[0xf0100000 - 0xf0101000): User | Read/write
Don’t do chpgperm set 0xf010000c UW though :-). I just can’t think of a better example off the top of my head…
Also, chpgperm only affects the Supervisor/User bit and the Read/Write bit.
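Stripped of all the argument parsing, the three actions boil down to a few bit operations on the PTE. A minimal sketch, where perm_set(), perm_clear() and perm_change() are hypothetical helper names and the PTE_* values are the usual x86 bit positions:

```c
#include <assert.h>
#include <stdint.h>

#define PTE_P 0x001u  /* present */
#define PTE_W 0x002u  /* writable */
#define PTE_U 0x004u  /* user-accessible */

/* "set": replace both the U/S and R/W bits with perm. */
static uint32_t perm_set(uint32_t pte, uint32_t perm) {
    assert((perm & ~(PTE_U | PTE_W)) == 0);  /* only U and W are allowed */
    return (pte & ~(PTE_U | PTE_W)) | perm;
}

/* "clear": back to Supervisor | Read-only, everything else untouched. */
static uint32_t perm_clear(uint32_t pte) {
    return pte & ~(PTE_U | PTE_W);
}

/* "change": add (+) or remove (-) exactly one of the two bits. */
static uint32_t perm_change(uint32_t pte, uint32_t bit, int add) {
    assert(bit == PTE_U || bit == PTE_W);
    return add ? (pte | bit) : (pte & ~bit);
}
```

Starting from a PTE of 0x003fd005 (the U and P bits, as in the very first session of this post), clear gives 0x003fd001, set with W gives 0x003fd003, and change +U gives 0x003fd007. The frame address and all other bits never change.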
int
mon_chpgperm(int argc, char **argv, struct Trapframe *tf)
{
    int action = 0;
#define F_SET    1
#define F_CLEAR  2
#define F_CHANGE 3

    if (argc < 3) {
        cprintf("Usage: chpgperm ACTION VADDR [MODE]\n");
        cprintf("ACTION is one of \"set\", \"clear\" or \"change\".\n");
        return 0;
    }

    if (!strcmp(argv[1], "set"))
        action = F_SET;
    else if (!strcmp(argv[1], "clear"))
        action = F_CLEAR;
    else if (!strcmp(argv[1], "change"))
        action = F_CHANGE;
    else {
        cprintf("chpgperm: Invalid ACTION!\n");
        return 0;
    }

    // Args are not sufficiently sanitized. Use with caution!
    extern pde_t *kern_pgdir;
    uintptr_t va = ROUNDDOWN(strtol(argv[2], NULL, 16), PGSIZE);
    pte_t *pte = pgdir_walk(kern_pgdir, (void *) va, 0);
    if (!pte) {
        cprintf("chpgperm: Cannot change permission for page [0x%08x - 0x%08x): Corresponding page table page does not exist!\n", va, (va + PGSIZE));
        return 0;
    }
    if (!(*pte & PTE_P)) {
        cprintf("chpgperm: Cannot change permission for page [0x%08x - 0x%08x): Corresponding page table entry does not exist!\n", va, (va + PGSIZE));
        return 0;
    }

    if (action == F_SET) {
        if (argc != 4) {
            cprintf("Usage: chpgperm set VADDR MODE\n");
            cprintf("Each MODE is of the form '([[Ss]|[Uu]])([[Rr]|[Ww]])'.\n");
            return 0;
        }
        if (strlen(argv[3]) != 2) {
            cprintf("chpgperm set: Each MODE is of the form '([[Ss]|[Uu]])([[Rr]|[Ww]])'.\n");
            return 0;
        }
        int perm = 0;
        if ((argv[3][0] == 'U') || (argv[3][0] == 'u')) {
            perm |= PTE_U;
        } else if ((argv[3][0] != 'S') && (argv[3][0] != 's')) {
            cprintf("chpgperm set: '%c' is not a valid flag for the U/S bit.\n", argv[3][0]);
            return 0;
        }
        if ((argv[3][1] == 'W') || (argv[3][1] == 'w')) {
            perm |= PTE_W;
        } else if ((argv[3][1] != 'R') && (argv[3][1] != 'r')) {
            cprintf("chpgperm set: '%c' is not a valid flag for the R/W bit.\n", argv[3][1]);
            return 0;
        }
        *pte = (*pte & ~(PTE_U | PTE_W)) | perm;
        cprintf("[0x%08x - 0x%08x): ", va, (va + PGSIZE));
        cprintf((*pte & PTE_U) ? "User" : "Supervisor");
        cprintf(" | ");
        cprintf((*pte & PTE_W) ? "Read/write" : "Read-only");
        cprintf("\n");
    }

    if (action == F_CLEAR) {
        if (argc != 3) {
            cprintf("chpgperm clear: Too many arguments!\n");
            cprintf("Usage: chpgperm clear VADDR\n");
            return 0;
        }
        *pte = *pte & ~(PTE_U | PTE_W);
        cprintf("[0x%08x - 0x%08x): Supervisor | Read-only\n", va, (va + PGSIZE));
    }

    if (action == F_CHANGE) {
        if (argc != 4) {
            cprintf("Usage: chpgperm change VADDR MODE\n");
            cprintf("Each MODE is of the form '[+-]([Uu]|[Ww])'.\n");
            return 0;
        }
        // Changing two flags at the same time is not allowed.
        if (strlen(argv[3]) != 2) {
            cprintf("chpgperm change: Each MODE is of the form '[+-]([Uu]|[Ww])'.\n");
            return 0;
        }
        int add = 0, perm = 0;
        if (argv[3][0] == '+') {
            add = 1;
        } else if (argv[3][0] == '-') {
            add = 0;
        } else {
            cprintf("chpgperm change: Each MODE is of the form '[+-]([Uu]|[Ww])'.\n");
            return 0;
        }
        if ((argv[3][1] == 'U') || (argv[3][1] == 'u')) {
            perm = PTE_U;
        } else if ((argv[3][1] == 'W') || (argv[3][1] == 'w')) {
            perm = PTE_W;
        } else {
            cprintf("chpgperm change: '%c' is not a valid flag either for the U/S bit or the R/W bit.\n", argv[3][1]);
            return 0;
        }
        if (add) {
            *pte |= perm;
        } else {
            *pte &= (~perm);
        }
        cprintf("[0x%08x - 0x%08x): ", va, (va + PGSIZE));
        cprintf((*pte & PTE_U) ? "User" : "Supervisor");
        cprintf(" | ");
        cprintf((*pte & PTE_W) ? "Read/write" : "Read-only");
        cprintf("\n");
    }
    return 0;
}
Challenge! Extend the JOS kernel monitor with commands to:
Lab 2: Memory Management
・Dump the contents of a range of memory given either a virtual or physical address range. Be sure the dump code behaves correctly when the range extends across page boundaries!
Finally, memdump:
K> memdump -V 0xf010000c 0xf0100020
Virtual Physical
[0xf0100000] [0x00100000] 1badb002 00000000 e4524ffe 7205c766
[0xf0100010] [0x00100010] 34000004 a000b812 220f0011 c0200fd8
[0xf0100020] [0x00100020] 0100010d c0220f80 10002fb8 bde0fff0
K> memdump -P 0x0010000c 0x00100020
Physical
[0x00100000] 1badb002 00000000 e4524ffe 7205c766
[0x00100010] 34000004 a000b812 220f0011 c0200fd8
[0x00100020] 0100010d c0220f80 10002fb8 bde0fff0
int
mon_memdump(int argc, char **argv, struct Trapframe *tf)
{
    if ((argc < 3) || (argc > 4)) {
        cprintf("Usage: memdump OPTION LOWERBOUND [UPPERBOUND]\n");
        return 0;
    }

    // Each output row covers 16 bytes, so align both bounds to 0x10.
    uintptr_t lo = ROUNDDOWN(strtol(argv[2], NULL, 16), 0x10);
    uintptr_t up = (argc == 3) ? lo : ROUNDDOWN(strtol(argv[3], NULL, 16), 0x10);
    extern pde_t *kern_pgdir;
    if (lo > up) {
        cprintf("memdump: UPPERBOUND cannot be lower than LOWERBOUND\n");
        return 0;
    }

    int virtual = 0;
    if (!strcmp(argv[1], "-V") || !strcmp(argv[1], "-v")) {
        virtual = 1;
    } else if (!strcmp(argv[1], "-P") || !strcmp(argv[1], "-p")) {
        virtual = 0;
    } else {
        cprintf("memdump: OPTION is of the form '(-[Vv])|(-[Pp])'.\n");
        return 0;
    }

    if (!virtual) {
        cprintf("Physical\n");
        for (uintptr_t p = lo; p <= up; p += 0x10) {
            cprintf("[0x%08x] ", p);
            for (int j = 0; j < 0x10; j += 4) {
                cprintf("%08lx ", *(long *) KADDR(p + j));
            }
            cprintf("\n");
        }
    } else {
        cprintf("Virtual Physical\n");
        for (uintptr_t v = lo; v <= up; v += 0x10) {
            cprintf("[0x%08x] ", v);
            // Look up every row separately so page boundaries are handled.
            struct PageInfo *pp = page_lookup(kern_pgdir, (void *) v, NULL);
            if (!pp) {
                cprintf("(UNMAPPED)\n");
                continue;
            }
            cprintf("[0x%08x] ", page2pa(pp) + PGOFF(v));
            for (int j = 0; j < 0x10; j += 4) {
                cprintf("%08lx ", *(long *) (v + j));
            }
            cprintf("\n");
        }
    }
    return 0;
}
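The row arithmetic of memdump is easy to check on its own. In this small sketch, round_down(), dump_rows() and starts_new_page() are helper names I invented for illustration; the constants are the 16-byte row size and the 4 KB page size used above:

```c
#include <assert.h>
#include <stdint.h>

#define ROW    0x10u    /* bytes per output row */
#define PGSIZE 0x1000u  /* bytes per page */

/* Round x down to a multiple of n (n must be a power of two). */
static uint32_t round_down(uint32_t x, uint32_t n) {
    assert(n != 0 && (n & (n - 1)) == 0);
    return x & ~(n - 1);
}

/* Number of rows printed for the inclusive range [lo, up]. */
static uint32_t dump_rows(uint32_t lo, uint32_t up) {
    lo = round_down(lo, ROW);
    up = round_down(up, ROW);
    return (up < lo) ? 0 : (up - lo) / ROW + 1;
}

/* 1 if the row starting at this address is the first row of a page. */
static int starts_new_page(uint32_t row) {
    return (row & (PGSIZE - 1)) == 0;
}
```

Since the page size is a multiple of the row size, an aligned row can never straddle two pages. That is why looking up the mapping once per row is enough to behave correctly across page boundaries.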
You know what? Finally, let’s use these tools to simulate a page translation by hand. Let’s take the address of kern_pgdir itself, 0xf011c000.
Its PDX is 0xf011c000 >> 22 = 0x3c0, so we need to consult the PDE kern_pgdir[0x3c0], at 0xf011c000 + (0x3c0 * sizeof(pde_t)) = 0xf011cf00:
K> memdump -V 0xf011cf00
Virtual Physical
[0xf011cf00] [0x0011cf00] 003ff023 003fc023 003fb023 003fa023
So our page table is located at physical address 0x003ff000. The PTX of our linear address is (0xf011c000 >> 12) & 0x3ff = 0x11c, which means we should consult the PTE at physical address 0x003ff000 + (0x11c * sizeof(pte_t)) = 0x003ff470:
K> memdump -P 0x003ff470
Physical
[0x003ff470] 0011c063 0011d023 0011e063 0011f023
Well…
K> showmappings 0xf011c000
Virtual Address Physical Address Flags
0xf011c000 0x0011c000 --DA---WP
The physical address of kern_pgdir is exactly 0x0011c000! Yay!
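The arithmetic of this hand translation can be double-checked with a few lines of stand-alone C. The helper names below (pdx(), ptx(), pgoff(), pte_addr(), entry_addr()) are my own lowercase stand-ins for the kernel macros, and the entry size of 4 bytes is sizeof(pde_t) and sizeof(pte_t) on 32-bit x86:

```c
#include <assert.h>
#include <stdint.h>

#define PDXSHIFT 22
#define PTXSHIFT 12

/* Page directory index: top 10 bits of the linear address. */
static uint32_t pdx(uint32_t la) { return (la >> PDXSHIFT) & 0x3ff; }
/* Page table index: next 10 bits. */
static uint32_t ptx(uint32_t la) { return (la >> PTXSHIFT) & 0x3ff; }
/* Offset within the page: low 12 bits. */
static uint32_t pgoff(uint32_t la) { return la & 0xfff; }
/* Frame address stored in a PDE or PTE (flag bits masked off). */
static uint32_t pte_addr(uint32_t entry) { return entry & ~0xfffu; }

/* Address of entry number idx in a table of 4-byte entries at base. */
static uint32_t entry_addr(uint32_t base, uint32_t idx) {
    assert(idx < 1024);
    return base + idx * 4;
}
```

Plugging in the values from the memdump output: the PDE 0x003ff023 points at the page table at 0x003ff000, the PTE 0x0011c063 points at the frame 0x0011c000, and adding the page offset of 0xf011c000 (which is zero) gives the final physical address.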
Side note: as always, when writing these little tools, I spent most of the time on the command parser instead of the “real meat”. There must be some best practices for writing parsers in C or other languages, but that’s a completely separate topic.
This completes the lab. Chapters 5 and 6 of the Intel 80386 Programmer’s Reference Manual helped me a lot in understanding the page translation process. It has the clearest figures I could find on the Internet illustrating this stuff. See you next time!