Jan 3, 2010 - N b t dd i ( t f). â No byte addressing (sort-of). ⢠Per-core 1K-word D-cache. I-cache. D-cache. Per c
Beehive, a Multi Beehive Multi-core core Platform for Low-level Systems Research
Andrew d Birrell, ll Tom Rodeheffer, d h ff Chuck Thacker Microsoft Research, Silicon Valley
Ingredients • Xilinx field-programmable gate array (FPGA) • XUPV5 development board ($750 academic) – Virtex-5 XC5VLX110T: • 17280 x 4 flip-flops • 17280 1 280 x 4 L LUTs T • 148 x 4KByte BRAMs – Gbit Ethernet – 2 GB DRAM
• Or O BEE3 board b d with ith more and d bigger bi FPGA FPGAs January 2010
Beehive, a Multi-core Platform ...
2
Results so far • Multi-core RISC (entirely home-grown) • 100 MHz, one cycle per instruction (if cached) • 13 cores per FPGA (using only half the FPGA) • C compiler (and most of an MSIL one) • Completely C l t l malleable ll bl architecture hit t
January 2010
Beehive, a Multi-core Platform ...
3
Goals • Have a mutable computer architecture
– M Make k changes h in i h hours or days, d nott months th – More realistically than any simulation
• Design blank-sheet system software – Without an existing OS – In high-level h h l l languages l (like (l k C#)
• Do both: systems research architecture • Usable by y universities January 2010
Beehive, a Multi-core Platform ...
4
Research Examples (Hypothetical) • Forget shared memory, use message passing • Transactional memory …
– Apples Apples-to-apples to apples comparison with monitors/CVs
• Do we really y need … – – – – –
Coherent shared memory? Interrupts? Virtual memory? Kernel mode? An operating system?
January 2010
Beehive, a Multi-core Platform ...
5
Per-core CPU rw i t [31 27] instx[31:27]
ra
instx[14:10]
rb
pcx
w
wa
+1 pcInc
Registers 32 X 32
doJump pcMux
ax a
pcd1
bx
Instruction Cache 1K X 32
{inst[31:4], 4'b0} fwdA
lli
fwdB
instx
a
b
link
pc
inst
Read Queue
inst
Data from memory
Constant
ra overrides
Const
Address Queue
amux Function left, shift, arith
bmux Address to memory
Add, Sub, Logic Shift outx
Write Queue
out
Data to memory out
January 2010
Beehive, a Multi-core Platform ...
6
Instructions • 32-bit word (I and D) • 32 x 32 32-bit bi registers i Ra 31
Rw 27 26
Count 22 21
Rb 17 16
??
Function 10
9
8
6
5
4 3
Op 0
• Rw = (Ra Function Rb) Op Count
– Function = add, subtract, and, or, xor, etc – Op p = shift left, shift right, g cyclic y shift, etc.
• Variations for jumps and memory access January 2010
Beehive, a Multi-core Platform ...
7
Memory • 32-bit word, word addressed – No N byte b t addressing dd i ((sort-of) t f)
I-cache
• Per Per-core core 1K 1K-word word D D-cache cache • Per-core 1K-word I-cache
– 8-word lines, direct mapped pp
D-cache
CPU core
• Half of address space is I/O space
January 2010
Beehive, a Multi-core Platform ...
8
Memory Instructions • Separate addressing and data access – Three Th per-core queues: AQ, AQ RQ, RQ WQ
• Read:
AQR r1 r2 0 … LD r4 4 RQ 0
• Write:
AQW r1 r2 0 LD WQ r4 0
January 2010
// Writes r1 but also appends to AQR // Then some time later … // Moves M memory data d from f RQ to r4 4
// Writes r1 but also appends to AQW // Appends data from r4 to WQ Beehive, a Multi-core Platform ...
9
Architectural Curiosities • No memory coherence • No interrupts • No memory protection • No virtual memory • No N kernel k l mode d • No memory bus or I/O bus … January 2010
Beehive, a Multi-core Platform ...
10
Interconnect: the Ring • Minimize wires • Passes through:
– Each core – Ethernet controller – DRAM controller
• “train”: token + contents
– Node can modify or append – Buffered in DRAM controller node
• Used for real memory and inter-core messages January 2010
Beehive, a Multi-core Platform ...
11
Inter-core Messages • Sent by core #x, destination core #y • Up to 60 words • Doesn’t use the memory system • No wake-ups: each core polls its message queue • Accessed A d by b placing l i an I/O address dd on AQ
January 2010
Beehive, a Multi-core Platform ...
12
Inter-core Binary Semaphores • Abstractly, each is a value in [0..1] s.t.
– “P” = atomic( t i ( if > 0 then th d decrementt else l fail) f il) – “V” = atomic( if == 0 then increment else nothing)
• Implemented by 64-bit register per core
– Semaphore’s “value” is sum of its value in registers
• Uses ring messages, not the memory system • Polling, not wake-ups
– E.g. g “acquire q mutex” spins p until “P” succeeds.
January 2010
Beehive, a Multi-core Platform ...
13
Using the Ring: Ethernet • Ethernet is just a specialized core, #14
– With an Eth Ethernett attached, tt h d and d no reall memory
• Uses DMA (via ring) for packet contents • Controlled by y messages:
– Core #n -> core #14: use buffers from address X – Core #n -> core #14: send packet at address Y – Core #14 -> core #n: packet arrived, arrived at address Z
• One MAC address p per core
– Broadcast/multicast not fully supported
January 2010
Beehive, a Multi-core Platform ...
14
Software Tools • GCC
– C tto assembler bl
• Basm
– Custom assembly language to relocatable binary
• Bld
– Link binaries into final bootfile image
• Hard-wired bootstrap code
– Use TFTP to load bootfile into DRAM
January 2010
Beehive, a Multi-core Platform ...
15
C Support Libraries • Parts of libc: malloc, free, printf
– “malloc” “ ll ” uses separate t cache h lines li f for each h item it – “int i CACHELINE” for global in separate cache line
• Single-core libraries:
– thread/mutex/CV – IP, P ARP, RP UDP, DP DHCP, DHCP DNS, DN TCP
• Basic multi multi-core core support in “libmc”: libmc :
– Library implements “main”; application provides: • “mcinit” called first, “mcmain” called per-core – Library provides inter-core putchar, malloc, free
January 2010
Beehive, a Multi-core Platform ...
16
Programming in C# and its friends • Strategy: just compile MSIL (bytecodes)
– P Parse via i reflection fl ti – Covers numerous languages with no extra effort
• Tactic: MSIL to C • Status
– First 90% works – Working on the rest (particularly exceptions)
• By-product: y p portable p MSIL-to-C compiler p January 2010
Beehive, a Multi-core Platform ...
17
Debugging (work-in-progress) • Remote debug with GDB via TCP over Ethernet • Dedicate core #1 for teledebug nub
– Dedicated code, code with no bugs – High-level debugging down to the bare metal
• “debug unit” at each core #n: – – – – –
Breakpoint stops core #n, sends message to core #1 Or “kiss Or, kiss of death death” message from #1 stops core #n Message from #1 to core #n resumes execution Details: cache flushing, RQ, WQ, condition codes Debug unit is working today (used by libmc)
January 2010
Beehive, a Multi-core Platform ...
18
Creeping Kernelism • Special hardware state after breakpoints • Time-slice breakpoints triggered by core #1? • Breakpoints look a lot like slow interrupts? • Base-limit registers for multi-image programs? • Etc. • But still: adding features only if we need them January 2010
Beehive, a Multi-core Platform ...
19
What’s Next • Barrelfish port (MSR Cambridge and ETH) • Try it out on some students (MIT) • Implement transactional memory
– Not “software transactional memory” y
• Use monitors instead of lock to aid coherence? • Give it away to more universities January 2010
Beehive, a Multi-core Platform ...
20