Memory Management for High-Performance Applications
Emery Berger, Univ. of Massachusetts, Amherst

High-Performance Applications
- Web servers, search engines, scientific codes
- Written in C or C++
- Run on one or a cluster of server boxes
- Need support at every level: software (compiler, runtime system), operating system, hardware
UNIVERSITY OF MASSACHUSETTS, AMHERST • Department of Computer Science
New Applications, Old Memory Managers
- Applications and hardware have changed:
  - Multiprocessors are now commonplace
  - Applications are object-oriented and multithreaded
  - Increased pressure on the memory manager (malloc, free)
- But memory managers have not kept up

Memory Management 101
- Memory allocators:
  - Accept requests for memory and return a virtual address to a block of memory of the requested size
  - Accept requests to free memory, which allows virtual memory to be reused
- The Heap: the pool of unused memory
Current Memory Managers Limit Scalability
- As we add processors, the program slows down
- Caused by heap contention
- Inadequate support for modern applications

[Figure: "Runtime Performance" graph: speedup (ideal vs. actual) on the y-axis against number of processors (1 to 14) on the x-axis. Larson server benchmark on a 14-processor Sun.]
The Problem
- Current memory managers are inadequate for high-performance applications on modern architectures
  ⇒ They limit scalability & application performance

Implementing Memory Managers
- Memory managers must be heavily-optimized C code:
  - Hand-unrolled loops
  - Macros
  - Monolithic functions
- Hard to write, reuse, or extend

This Talk
- Problems with current memory managers: contention, false sharing, space
- Solution: a provably scalable memory manager (Hoard)
- An extended memory manager for servers (Reap)
- Building memory managers: the Heap Layers framework

Problems with General-Purpose Memory Managers
- Previous work for multiprocessors:
  - Concurrent single heap [Bigler et al. 85, Johnson 91, Iyengar 92]: impractical
  - Multiple heaps [Larson 98, Gloger 99]: reduce contention but cause other problems:
    - P-fold or even unbounded increase in space
    - Allocator-induced false sharing

Multiple Heap Allocator: Pure Private Heaps
- One heap per processor:
  - malloc gets memory from its local heap
  - free puts memory on its local heap
- Used by STL, Cilk, and ad hoc allocators

[Figure: per-processor heap diagram. Key: in use, processor 0; free, on heap 1.]

Real Code: DLmalloc 2.7.2
#define chunksize(p)          ((p)->size & ~(SIZE_BITS))

#define next_chunk(p) ((mchunkptr)( ((char*)(p)) + ((p)->size & ~PREV_INUSE) ))

#define prev_chunk(p) ((mchunkptr)( ((char*)(p)) - ((p)->prev_size) ))

#define chunk_at_offset(p, s)  ((mchunkptr)(((char*)(p)) + (s)))

#define inuse(p)\
((((mchunkptr)(((char*)(p))+((p)->size & ~PREV_INUSE)))->size) & PREV_INUSE)

#define set_inuse(p)\
((mchunkptr)(((char*)(p)) + ((p)->size & ~PREV_INUSE)))->size |= PREV_INUSE

#define clear_inuse(p)\
((mchunkptr)(((char*)(p)) + ((p)->size & ~PREV_INUSE)))->size &= ~(PREV_INUSE)

#define inuse_bit_at_offset(p, s)\
(((mchunkptr)(((char*)(p)) + (s)))->size & PREV_INUSE)

#define set_inuse_bit_at_offset(p, s)\
(((mchunkptr)(((char*)(p)) + (s)))->size |= PREV_INUSE)

#define MALLOC_ZERO(charp, nbytes)                                    \
do {                                                                  \
  INTERNAL_SIZE_T* mzp = (INTERNAL_SIZE_T*)(charp);                   \
  CHUNK_SIZE_T mctmp = (nbytes)/sizeof(INTERNAL_SIZE_T);              \
  long mcn;                                                           \
  if (mctmp < 8) mcn = 0; else { mcn = (mctmp-1)/8; mctmp %= 8; }     \
  switch (mctmp) {                                                    \
    case 0: for(;;) { *mzp++ = 0;                                     \
    case 7:           *mzp++ = 0;                                     \
    case 6:           *mzp++ = 0;                                     \
    case 5:           *mzp++ = 0;                                     \
    case 4:           *mzp++ = 0;                                     \
    case 3:           *mzp++ = 0;                                     \
    case 2:           *mzp++ = 0;                                     \
    case 1:           *mzp++ = 0; if (mcn <= 0) break; mcn--; }       \
  }                                                                   \
} while(0)