x86 Memory Protection and Translation

Don Porter
CSE 506
Lecture Goal
ò  Understand the hardware tools available on a modern
x86 processor for manipulating and protecting memory
ò  Lab 2: You will program this hardware
ò  Apologies: Material can be a bit dry, but important
ò  Plus, slides will be good reference
ò  But, cool tech tricks:
ò  How does thread-local storage (TLS) work?
ò  An actual (and tough) Microsoft interview question
Undergrad Review
ò  What is:
ò  Virtual memory?
ò  Segmentation?
ò  Paging?
Two System Goals
1) Provide an abstraction of contiguous, isolated virtual
memory to a program
2) Prevent illegal operations
ò  Prevent access to other application or OS memory
ò  Detect failures early (e.g., segfault on address 0)
ò  More recently, prevent exploits that try to execute
program data
ò  x86 processor modes
ò  x86 segmentation
ò  x86 page tables
ò  Software vs. Hardware mechanisms
ò  Advanced Features
ò  Interesting applications/problems
x86 Processor Modes
ò  Real mode – walks and talks like a really old x86 chip
ò  State at boot
ò  20-bit address space, direct physical memory access
ò  Segmentation available (no paging)
ò  Protected mode – Standard 32-bit x86 mode
ò  Segmentation and paging
ò  Privilege levels (separate user and kernel)
x86 Processor Modes, cont.
ò  Long mode – 64-bit mode (aka amd64, x86_64, etc.)
ò  Very similar to 32-bit mode (protected mode), but bigger
ò  Restrict segmentation use
ò  Garbage collect deprecated instructions
ò  Chips can still run in protected mode with old instructions
Translation Overview
[Figure: Virtual Address -> segmentation -> Linear Address -> paging -> Physical Address. Paging is available in protected/long mode only.]
ò  Segmentation cannot be disabled!
ò  But can be a no-op (aka flat mode)
x86 Segmentation
ò  A segment has:
ò  Base address (linear address)
ò  Length
ò  Type (code, data, etc).
Programming model
ò  Segments for: code, data, stack, “extra”
ò  A program can have up to 6 total segments
ò  Segments identified by registers: cs, ds, ss, es, fs, gs
ò  Prefix all memory accesses with desired segment:
ò  mov eax, ds:0x80 (load offset 0x80 from data into eax)
ò  jmp cs:0xab8
(jump execution to code offset 0xab8)
ò  mov ss:0x40, ecx
(move ecx to stack offset 0x40)
Programming, cont.
ò  This is cumbersome, so infer code, data and stack
segments by instruction type:
ò  Control-flow instructions use code segment (jump, call)
ò  Stack management (push/pop) uses stack
ò  Most loads/stores use data segment
ò  Note x86 has separate icache and dcache
ò  Extra segments (es, fs, gs) must be used explicitly
Segment management
ò  For safety (without paging), only the OS should define
segments. Why?
ò  Two segment tables the OS creates in memory:
ò  Global – any process can use these segments
ò  Local – segment definitions for a specific process
ò  How does the hardware know where they are?
ò  Dedicated registers: gdtr and ldtr
ò  Privileged instructions: lgdt, lldt
Segment registers
Selector layout: Table Index (13 bits) | Global or Local Table? (1 bit) | Ring (2 bits)
ò  Set by the OS on fork, context switch, etc.
JOS example 1
ò  Bootloader puts the kernel at phys. address 0x00100000
ò  Kernel is compiled to run at virt. address 0xf0100000
ò  Segmentation to the rescue (kern/entry.S):
ò  What is this code doing?
# null seg
SEG(STA_X|STA_R, -KERNBASE, 0xffffffff) # code seg
SEG(STA_W, -KERNBASE, 0xffffffff) # data seg
JOS ex 1, cont.
SEG(STA_X|STA_R, -KERNBASE, 0xffffffff) # code seg
(STA_X|STA_R: execute and read; -KERNBASE: base; 0xffffffff: length (4 GB))
jmp 0xf0100db8 # virtual addr. (implicit cs seg)
jmp (0xf0100db8 + -0xf0000000)
jmp 0x00100db8 # linear addr.
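The trick relies on 32-bit wraparound: a base of -KERNBASE is really 0x10000000, and adding it to a high virtual address overflows back down to the low physical range. A small sketch of that arithmetic, with a hypothetical function name:

```c
#include <stdint.h>

#define KERNBASE 0xf0000000u  /* value taken from the jmp arithmetic above */

/* Linear address produced by a segment whose base is -KERNBASE. */
uint32_t boot_seg_to_linear(uint32_t vaddr)
{
    uint32_t base = (uint32_t)0 - KERNBASE;  /* -KERNBASE mod 2^32 == 0x10000000 */
    return base + vaddr;                     /* wraps modulo 2^32 */
}
```

Passing the kernel's link address 0xf0100000 yields 0x00100000, the physical address where the bootloader placed the kernel.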
Flat segmentation
ò  The above trick is used for booting. We eventually want
to use paging.
ò  How can we make segmentation a no-op?
ò  From kern/pmap.c:
// 0x8 - kernel code segment
[GD_KT >> 3] = SEG(STA_X | STA_R, 0x0, 0xffffffff, 0),
(STA_X | STA_R: execute and read; 0x0: base; 0xffffffff: length (4 GB); 0: ring 0)
ò  x86 processor modes
ò  x86 segmentation
ò  x86 page tables
ò  Software vs. Hardware mechanisms
ò  Advanced Features
ò  Interesting applications/problems
Paging Model
ò  32 (or 64) bit address space.
ò  Arbitrary mapping of linear to physical pages
ò  Pages are most commonly 4 KB
ò  Newer processors also support page sizes of 2 and 4 MB
and 1 GB
How it works
ò  OS creates a page table
ò  Any old page with entries formatted properly
ò  Hardware interprets entries
ò  cr3 register points to the current page table
ò  Only ring0 can change cr3
Translation Overview
[Figure: two-level address translation, from the Intel 80386 Programmer's Reference Manual]
ò  Page Dir Offset (top 10 addr bits: 0xf10 >> 2): entry at cr3+0x3b4 *
ò  Page Table Offset (next 10 addr bits): entry at 0x84 *
ò  Physical Page Offset (low 12 addr bits): data we want at offset 0x150
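The 10/10/12 split in the figure can be written as shifts and masks. A sketch with hypothetical helper names (JOS provides macros for the same purpose):

```c
#include <stdint.h>

/* Index into the page directory: top 10 bits of the linear address. */
uint32_t pd_index(uint32_t la) { return (la >> 22) & 0x3ff; }

/* Index into the page table: next 10 bits. */
uint32_t pt_index(uint32_t la) { return (la >> 12) & 0x3ff; }

/* Byte offset within the 4 KB page: low 12 bits. */
uint32_t page_offset(uint32_t la) { return la & 0xfff; }

/* Byte offset of entry idx within its table: entries are 4 bytes each. */
uint32_t entry_byte_offset(uint32_t idx) { return idx * 4; }
```

For linear address 0xf0100db8 this gives directory index 0x3c0, table index 0x100, and page offset 0xdb8. The directory entry sits 0xf00 bytes past the address in cr3.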
Page Table Entries
ò  Top 20 bits are the physical address of the mapped page
ò  Why 20 bits?
ò  4k page size == 12 bits of offset
ò  Lower 12 bits for flags
Page flags
ò  3 for OS to use however it likes
ò  4 reserved by Intel, just in case
ò  3 for OS to CPU metadata
ò  User/vs kernel page,
ò  Write permission,
ò  Present bit (so we can swap out pages)
ò  2 for CPU to OS metadata
ò  Dirty (page was written), Accessed (page was read)
Back of the envelope
ò  If a page is 4K and an entry is 4 bytes, how many entries
per page?
ò  1k
ò  How large of an address space can 1 page represent?
ò  1k entries * 1page/entry * 4K/page = 4MB
ò  How large can we get with a second level of translation?
ò  1k tables/dir * 1k entries/table * 4k/page = 4 GB
ò  Nice that it works out that way!
Challenge questions
ò  What is the space overhead of paging?
ò  I.e., how much memory goes to page tables for a 4 GB
address space?
ò  What is the optimal number of levels for a 64 bit page
ò  When would you use a 2 MB or 1 GB page size?
TLB Entries
ò  The CPU caches address translations in the TLB
ò  Translation Lookaside Buffer
ò  The TLB is not coherent with memory, meaning:
ò  If you change a PTE, you need to manually invalidate
cached values
ò  See the tlb_invalidate() function in JOS
ò  x86 processor modes
ò  x86 segmentation
ò  x86 page tables
ò  Software vs. Hardware mechanisms
ò  Advanced Features
ò  Interesting applications/problems
SW vs. HW
ò  We already saw that TLB shootdown is done by software
ò  Let’s think about other paging features…
Copy-on-write paging
ò  HW: Traps to the OS on a write to read-only page
ò  OS: Allocates a new copy of the page, updates page
ò  Note: can use one of the “avail” bits for COW status
Async. mmap writeback
ò  Suppose the OS maps a writeable file into a process’s
address space.
ò  When the process exits, which pages to write back to the
ò  Could write them all, but that is wasteful
ò  Check the dirty bit in the PTE!
ò  OS clears the present bit for an entry that is swapped out
ò  What happens if you access a stale mapping?
ò  OS gets a page fault the next time it is accessed
ò  OS can replace the page, suspend process until reloaded
ò  x86 processor modes
ò  x86 segmentation
ò  x86 page tables
ò  Software vs. Hardware mechanisms
ò  Advanced Features
ò  Interesting applications/problems
Physical Address Extension (PAE)
ò  Period with 32-bit machines + >4GB RAM (2000’s)
ò  Essentially, an early deployment of a 64-bit page table
ò  Any given process can only address 4GB
ò  Including OS!
ò  Page tables themselves can address >4GB of physical
No execute (NX) bit
ò  Many security holes arise from bad input
ò  Tricks program to jump to unintended address
ò  That happens to be on heap or stack
ò  And contains bits that form malware
ò  Idea: execute protection can catch these
ò  Feels a bit like code segment, no?
ò  Bit 63 in 64-bit page tables (or 32 bit + PAE)
Nested page tables
ò  Paging tough for early Virtual Machine implementations
ò  Can’t trust a guest OS to correctly modify pages
ò  So, add another layer of paging between host-physical
and guest-physical
And now the fun stuff…
Thread-local storage (TLS)
ò  Convenient abstraction for per-thread variables
ò  Code just refers to a variable name, accesses private
ò  Example: Windows stores the thread ID (and other info)
in a thread environment block (TEB)
ò  Same code in any thread to access
ò  No notion of a thread offset or id
ò  How to do this?
TLS implementation
ò  Map a few pages per thread into a segment
ò  Use an “extra” segmentation register
ò  Usually gs
ò  Windows TEB in fs
ò  Any thread accesses first byte of TLS like this:
mov eax, gs:(0x0)
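From C, the same mechanism is reachable through the C11 _Thread_local storage class. The sketch below assumes a POSIX system with pthreads (build with -pthread where required); the function names are hypothetical. On x86 Linux, compilers typically address such variables relative to fs or gs, but that is a toolchain detail and not visible in the source.

```c
#include <pthread.h>
#include <stdint.h>

/* Each thread gets its own instance of this variable. */
static _Thread_local int tls_counter = 0;

/* Bump this thread's private counter n times and report the result. */
static void *worker(void *arg)
{
    int n = (int)(intptr_t)arg;
    for (int i = 0; i < n; i++)
        tls_counter++;
    return (void *)(intptr_t)tls_counter;
}

/*
 * Run two threads with different iteration counts. Returns 0 and stores
 * each thread's final counter, or -1 if a thread could not be created.
 */
int run_tls_demo(int *first, int *second)
{
    pthread_t a, b;
    void *ra = NULL, *rb = NULL;
    if (pthread_create(&a, NULL, worker, (void *)(intptr_t)3) != 0)
        return -1;
    if (pthread_create(&b, NULL, worker, (void *)(intptr_t)5) != 0) {
        pthread_join(a, NULL);
        return -1;
    }
    pthread_join(a, &ra);
    pthread_join(b, &rb);
    *first = (int)(intptr_t)ra;
    *second = (int)(intptr_t)rb;
    return 0;
}
```

Both threads run the same code on the same variable name, yet neither sees the other's increments, and the calling thread's copy stays untouched.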
Viva segmentation!
ò  My undergrad OS course treated segmentation as a
historical artifact
ò  Yet still widely (ab)used
ò  Also used for sandboxing in vx32, Native Client
ò  Counterpoint: TLS hack is just compensating for lack of
general-purpose registers
ò  Either way, all but fs and gs are deprecated in x64
Microsoft interview
ò  Suppose I am on a low-memory x86 system (<4MB). I
don’t care about swapping or addressing more than 4MB.
ò  How can I keep paging space overhead at one page?
ò  Recall that the CPU requires 2 levels of addr. translation
Solution sketch
ò  A 4MB address space will only use the low 22 bits of the
address space.
ò  So the first level translation will always hit entry 0
ò  Map the page table’s physical address at entry 0
ò  First translation will “loop” back to the page table
ò  Then use page table normally for 4MB space
ò  Assumes correct programs will not read address 0
ò  Getting null pointers early is nice
ò  Challenge: Refine the solution to still get null pointer exceptions
ò  Lab 2 will be fun
ò  Please do not show up unannounced
ò  I love to chat with you, but I cannot complete my other
work at the current frequency of interruptions
ò  Send email. I will schedule an appointment if needed, or
come during office hours
ò  Reminder: sign up for course mailing list
ò  Read the whole thing before posting
ò  If you have an issue, please post if resolved (and how!)
Housekeeping 2
ò  Checkpoint your VM before changing things
ò  Instructions to follow soon
ò  You break it, you buy it
ò  I’ll update enrollment tomorrow