Refactored and optimized implementation

diff --git a/CHANGELOG b/CHANGELOG
index d7cbf84..94c74f2 100644
--- a/CHANGELOG
+++ b/CHANGELOG
@@ -1,3 +1,31 @@
+1.4.0
+
+Improved cross-thread deallocations by using a per-span atomic free list to minimize thread
+contention and localize free list processing to the actual span
+
+Changed the span free list to a linked list, initialized one memory page at a time as needed
+
+Reduced the number of conditionals in the fast allocation path and avoided touching the heap
+structure at all in the best case
+
+Avoided realigning blocks in deallocation unless the span is marked as used by allocations with alignment > 32 bytes
+
+Reverted block granularity and natural alignment to 16 bytes to reduce memory waste
+
+Fixed a bug in preserving data when reallocating a previously aligned (>32 bytes) block
+
+Used a compile-time span size by default for improved performance; added a build-time RPMALLOC_CONFIGURABLE
+preprocessor directive to re-enable configurability of span and page size
+
+More detailed statistics
+
+Disabled adaptive thread cache by default
+
+Fixed an issue where reallocations of large blocks could read outside of memory page boundaries
+
+Tagged mmap requests on macOS with tag 240 for identification with the vmmap tool
+
+
 1.3.2
 
 Support for alignment equal or larger than memory page size, up to span size
diff --git a/README.md b/README.md
index e8806a8..c8149b8 100644
--- a/README.md
+++ b/README.md
@@ -1,5 +1,5 @@
 # rpmalloc - Rampant Pixels Memory Allocator
-This library provides a public domain cross platform lock free thread caching 32-byte aligned memory allocator implemented in C. The latest source code is always available at https://github.com/mjansson/rpmalloc
+This library provides a public domain cross platform lock free thread caching 16-byte aligned memory allocator implemented in C. The latest source code is always available at https://github.com/mjansson/rpmalloc
 
 Platforms currently supported:
 
@@ -8,6 +8,7 @@
 - iOS
 - Linux
 - Android
+- Haiku
 
 The code should be easily portable to any platform with atomic operations and an mmap-style virtual memory management API. The API used to map/unmap memory pages can be configured in runtime to a custom implementation and mapping granularity/size.
 
@@ -16,9 +17,9 @@
 Created by Mattias Jansson ([@maniccoder](https://twitter.com/maniccoder))
 
 # Performance
-We believe rpmalloc is faster than most popular memory allocators like tcmalloc, hoard, ptmalloc3 and others without causing extra allocated memory overhead in the thread caches compared to these allocators. We also believe the implementation to be easier to read and modify compared to these allocators, as it is a single source file of ~2000 lines of C code. All allocations have a natural 32-byte alignment.
+We believe rpmalloc is faster than most popular memory allocators like tcmalloc, hoard, ptmalloc3 and others without causing extra allocated memory overhead in the thread caches compared to these allocators. We also believe the implementation to be easier to read and modify compared to these allocators, as it is a single source file of ~2500 lines of C code. All allocations have a natural 16-byte alignment.
 
-Contained in a parallel repository is a benchmark utility that performs interleaved allocations (both aligned to 8 or 16 bytes, and unaligned) and deallocations (both in-thread and cross-thread) in multiple threads. It measures number of memory operations performed per CPU second, as well as memory overhead by comparing the virtual memory mapped with the number of bytes requested in allocation calls. The setup of number of thread, cross-thread deallocation rate and allocation size limits is configured by command line arguments.
+Contained in a parallel repository is a benchmark utility that performs interleaved unaligned allocations and deallocations (both in-thread and cross-thread) in multiple threads. It measures the number of memory operations performed per CPU second, as well as memory overhead by comparing the virtual memory mapped with the number of bytes requested in allocation calls. The number of threads, cross-thread deallocation rate and allocation size limits are configured by command line arguments.
 
 https://github.com/mjansson/rpmalloc-benchmark
 
@@ -31,7 +32,7 @@
 Configuration of the thread and global caches can be important depending on your use pattern. See [CACHE](CACHE.md) for a case study and some comments/guidelines.
 
 # Using
-The easiest way to use the library is simply adding rpmalloc.[h|c] to your project and compile them along with your sources. This contains only the rpmalloc specific entry points and does not provide internal hooks to process and/or thread creation at the moment. You are required to call these functions from your own code in order to initialize and finalize the allocator in your process and threads:
+The easiest way to use the library is simply adding __rpmalloc.[h|c]__ to your project and compiling them along with your sources. This contains only the rpmalloc specific entry points and does not provide internal hooks to process and/or thread creation at the moment. You are required to call these functions from your own code in order to initialize and finalize the allocator in your process and threads:
 
 __rpmalloc_initialize__ : Call at process start to initialize the allocator
 
@@ -47,12 +48,12 @@
 
 Then simply use the __rpmalloc__/__rpfree__ and the other malloc style replacement functions. Remember all allocations are 16-byte aligned, so no need to call the explicit rpmemalign/rpaligned_alloc/rpposix_memalign functions unless you need greater alignment, they are simply wrappers to make it easier to replace in existing code.
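+
+A minimal usage sketch of the entry points above (error handling omitted; see `rpmalloc.h` for the exact declarations):
+
+```c
+#include "rpmalloc.h"
+
+int
+main(void) {
+	rpmalloc_initialize();         // call at process start
+	rpmalloc_thread_initialize();  // call in each thread that uses the allocator
+	void* block = rpmalloc(128);   // naturally 16-byte aligned
+	block = rprealloc(block, 256);
+	rpfree(block);
+	rpmalloc_thread_finalize();    // call before the thread exits
+	rpmalloc_finalize();           // call at process exit
+	return 0;
+}
+```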
 
-If you wish to override the standard library malloc family of functions and have automatic initialization/finalization of process and threads, also include the `malloc.c` file in your project. The automatic init/fini is only implemented for Linux and macOS targets. The list of libc entry points replaced may not be complete, use libc replacement only as a convenience for testing the library on an existing code base, not a final solution.
+If you wish to override the standard library malloc family of functions and have automatic initialization/finalization of process and threads, define __ENABLE_OVERRIDE__ to a non-zero value, which will include the `malloc.c` file in the compilation of __rpmalloc.c__. The list of libc entry points replaced may not be complete; use libc replacement only as a convenience for testing the library on an existing code base, not as a final solution.
 
 # Building
 To compile as a static library run the configure python script which generates a Ninja build script, then build using ninja. The ninja build produces two static libraries, one named `rpmalloc` and one named `rpmallocwrap`, where the latter includes the libc entry point overrides.
 
-The configure + ninja build also produces two shared object/dynamic libraries. The `rpmallocwrap` shared library can be used with LD_PRELOAD/DYLD_INSERT_LIBRARIES to inject in a preexisting binary, replacing any malloc/free family of function calls. This is only implemented for Linux and macOS targets. The list of libc entry points replaced may not be complete, use preloading as a convenience for testing the library on an existing binary, not a final solution.
+The configure + ninja build also produces two shared object/dynamic libraries. The `rpmallocwrap` shared library can be used with LD_PRELOAD/DYLD_INSERT_LIBRARIES to inject into a preexisting binary, replacing any malloc/free family of function calls. This is only implemented for Linux and macOS targets. The list of libc entry points replaced may not be complete; use preloading as a convenience for testing the library on an existing binary, not as a final solution. The dynamic library also provides automatic init/fini of process and threads for all platforms.
 
 The latest stable release is available in the master branch. For latest development code, use the develop branch.
 
@@ -69,6 +70,8 @@
 
 __ENABLE_THREAD_CACHE__: By default defined to 1, enables the per-thread cache. Set to 0 to disable the thread cache and directly unmap pages no longer in use (also disables the global cache).
 
+__ENABLE_ADAPTIVE_THREAD_CACHE__: Introduces a simple heuristic for the thread cache size, keeping 25% of the high water mark for each span count class.
+
 # Other configuration options
 Detailed statistics are available if __ENABLE_STATISTICS__ is defined to 1 (default is 0, or disabled), either on compile command line or by setting the value in `rpmalloc.c`. This will cause a slight overhead in runtime to collect statistics for each memory operation, and will also add 4 bytes overhead per allocation to track sizes.
 
@@ -78,6 +81,10 @@
 
 Overwrite and underwrite guards are enabled if __ENABLE_GUARDS__ is defined to 1 (default is 0, or disabled), either on compile command line or by settings the value in `rpmalloc.c`. This will introduce up to 64 byte overhead on each allocation to store magic numbers, which will be verified when freeing the memory block. The actual overhead is dependent on the requested size compared to size class limits.
 
+To include __malloc.c__ in the compilation and provide overrides of the standard library malloc entry points, define __ENABLE_OVERRIDE__ to 1. To enable automatic initialization and finalization of process and threads in order to preload the library into executables using the standard library malloc, define __ENABLE_PRELOAD__ to 1.
+
+To enable runtime configurable memory page and span sizes, define __ENABLE_CONFIGURABLE__ to 1. By default, the memory page size is determined by system APIs and the memory span size is set to 64KiB.
+
 # Huge pages
 The allocator has support for huge/large pages on Windows, Linux and MacOS. To enable it, pass a non-zero value in the config value `enable_huge_pages` when initializing the allocator with `rpmalloc_initialize_config`. If the system does not support huge pages it will be automatically disabled. You can query the status by looking at `enable_huge_pages` in the config returned from a call to `rpmalloc_config` after initialization is done.
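+
+For illustration, a sketch of enabling huge pages at initialization (assuming the config struct type is `rpmalloc_config_t` with the `enable_huge_pages` field referenced above; see `rpmalloc.h` for the exact declaration):
+
+```c
+#include <string.h>
+#include "rpmalloc.h"
+
+static void
+initialize_with_huge_pages(void) {
+	rpmalloc_config_t config;                // type name assumed from rpmalloc.h
+	memset(&config, 0, sizeof(config));      // zeroed fields fall back to their defaults
+	config.enable_huge_pages = 1;
+	rpmalloc_initialize_config(&config);
+	// Huge pages are silently disabled if unsupported, so query the effective setting
+	if (!rpmalloc_config()->enable_huge_pages) {
+		// running on regular memory pages
+	}
+}
+```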
 
@@ -85,15 +92,15 @@
 The allocator is similar in spirit to tcmalloc from the [Google Performance Toolkit](https://github.com/gperftools/gperftools). It uses separate heaps for each thread and partitions memory blocks according to a preconfigured set of size classes, up to 2MiB. Larger blocks are mapped and unmapped directly. Allocations for different size classes will be served from different set of memory pages, each "span" of pages is dedicated to one size class. Spans of pages can flow between threads when the thread cache overflows and are released to a global cache, or when the thread ends. Unlike tcmalloc, single blocks do not flow between threads, only entire spans of pages.
 
 # Implementation details
-The allocator is based on a fixed but configurable page alignment (defaults to 64KiB) and 32 byte block alignment, where all runs of memory pages (spans) are mapped to this alignment boundary. On Windows this is automatically guaranteed up to 64KiB by the VirtualAlloc granularity, and on mmap systems it is achieved by oversizing the mapping and aligning the returned virtual memory address to the required boundaries. By aligning to a fixed size the free operation can locate the header of the memory span without having to do a table lookup (as tcmalloc does) by simply masking out the low bits of the address (for 64KiB this would be the low 16 bits).
+The allocator is based on a fixed but configurable page alignment (defaults to 64KiB) and 16 byte block alignment, where all runs of memory pages (spans) are mapped to this alignment boundary. On Windows this is automatically guaranteed up to 64KiB by the VirtualAlloc granularity, and on mmap systems it is achieved by oversizing the mapping and aligning the returned virtual memory address to the required boundaries. By aligning to a fixed size the free operation can locate the header of the memory span without having to do a table lookup (as tcmalloc does) by simply masking out the low bits of the address (for 64KiB this would be the low 16 bits).
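+
+For illustration, the masking trick looks roughly like this (a sketch only; the names are illustrative and the actual span header type is internal to `rpmalloc.c`):
+
+```c
+#include <stdint.h>
+
+#define SPAN_SIZE ((uintptr_t)65536)  // 64KiB span size/alignment
+
+// With spans mapped on 64KiB boundaries, the owning span header of any block pointer
+// is found by masking out the low 16 bits of the address, with no lookup table needed.
+static void*
+span_of_block(void* block) {
+	return (void*)((uintptr_t)block & ~(SPAN_SIZE - 1));
+}
+```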
 
-Memory blocks are divided into three categories. For 64KiB span size/alignment the small blocks are [32, 2016] bytes, medium blocks (2016, 32720] bytes, and large blocks (32720, 2097120] bytes. The three categories are further divided in size classes. If the span size is changed, the small block classes remain but medium blocks go from (2016, span size] bytes.
+Memory blocks are divided into three categories. For 64KiB span size/alignment the small blocks are [16, 1024] bytes, medium blocks (1024, 32256] bytes, and large blocks (32256, 2097120] bytes. The three categories are further divided in size classes. If the span size is changed, the small block classes remain but medium blocks go from (1024, span size] bytes.
 
-Small blocks have a size class granularity of 32 bytes each in 63 buckets. Medium blocks have a granularity of 512 bytes, 60 buckets (default). Large blocks have a the same granularity as the configured span size (default 64KiB). All allocations are fitted to these size class boundaries (an allocation of 42 bytes will allocate a block of 64 bytes). Each small and medium size class has an associated span (meaning a contiguous set of memory pages) configuration describing how many pages the size class will allocate each time the cache is empty and a new allocation is requested.
+Small blocks have a size class granularity of 16 bytes in 64 buckets. Medium blocks have a granularity of 512 bytes, 61 buckets (default). Large blocks have the same granularity as the configured span size (default 64KiB). All allocations are fitted to these size class boundaries (an allocation of 36 bytes will allocate a block of 48 bytes). Each small and medium size class has an associated span (meaning a contiguous set of memory pages) configuration describing how many pages the size class will allocate each time the cache is empty and a new allocation is requested.
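+
+A worked example of the small-class rounding (illustrative only, not the internal lookup):
+
+```c
+#include <stddef.h>
+
+// Small requests are rounded up to the 16-byte class granularity,
+// so a 36-byte request is served from the 48-byte size class.
+static size_t
+small_class_size(size_t request) {
+	return (request + 15) & ~(size_t)15;
+}
+```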
 
 Spans for small and medium blocks are cached in four levels to avoid calls to map/unmap memory pages. The first level is a per thread single active span for each size class. The second level is a per thread list of partially free spans for each size class. The third level is a per thread list of free spans. The fourth level is a global list of free spans.
 
-Each span for a small and medium size class keeps track of how many blocks are allocated/free, as well as a list of which blocks that are free for allocation. To avoid locks, each span is completely owned by the allocating thread, and all cross-thread deallocations will be deferred to the owner thread.
+Each span for a small and medium size class keeps track of how many blocks are allocated/free, as well as a list of which blocks are free for allocation. To avoid locks, each span is completely owned by the allocating thread, and all cross-thread deallocations will be deferred to the owner thread through a separate free list per span.
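+
+Conceptually, a cross-thread free pushes the block onto the span's atomic free list with a compare-and-swap loop. The sketch below is illustrative only; the struct, field and function names are not the internal rpmalloc types:
+
+```c
+#include <stdatomic.h>
+#include <stdint.h>
+
+typedef struct span_sketch {
+	atomic_uintptr_t free_list_deferred;  // blocks freed by non-owner threads (illustrative name)
+} span_sketch;
+
+static void
+defer_free_sketch(span_sketch* span, void* block) {
+	uintptr_t head = atomic_load_explicit(&span->free_list_deferred, memory_order_relaxed);
+	do {
+		*(void**)block = (void*)head;  // link the block to the current list head
+	} while (!atomic_compare_exchange_weak_explicit(&span->free_list_deferred, &head,
+	                                                (uintptr_t)block, memory_order_release,
+	                                                memory_order_relaxed));
+}
+```
+
+The owner thread can later swap the whole list out atomically and process the freed blocks locally, keeping contention to a single atomic word per span.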
 
 Large blocks, or super spans, are cached in two levels. The first level is a per thread list of free super spans. The second level is a global list of free super spans.
 
@@ -104,7 +111,9 @@
 
 Memory mapping requests are always done in multiples of the memory page size. You can specify a custom page size when initializing rpmalloc with __rpmalloc_initialize_config__, or pass 0 to let rpmalloc determine the system memory page size using OS APIs. The page size MUST be a power of two.
 
-To reduce system call overhead, memory spans are mapped in batches controlled by the `span_map_count` configuration variable (which defaults to the `DEFAULT_SPAN_MAP_COUNT` value if 0, which in turn is sized according to the cache configuration define, defaulting to 32). If the memory page size is larger than the span size, the number of spans to map in a single call will be adjusted to guarantee a multiple of the page size, and the spans will be kept mapped until the entire span range can be unmapped in one call (to avoid trying to unmap partial pages).
+To reduce system call overhead, memory spans are mapped in batches controlled by the `span_map_count` configuration variable (if set to 0 it defaults to the `DEFAULT_SPAN_MAP_COUNT` value, which in turn is sized according to the cache configuration define and defaults to 64). If the memory page size is larger than the span size, the number of spans to map in a single call will be adjusted to guarantee a multiple of the page size, and the spans will be kept mapped until the entire span range can be unmapped in one call (to avoid trying to unmap partial pages).
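+
+With the defaults, that is 64 spans of 64KiB each, i.e. 4MiB of address space mapped per system call.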
+
+On macOS and iOS mmap requests are tagged with tag 240 for easy identification with the vmmap tool.
 
 # Span breaking
 Super spans (spans a multiple > 1 of the span size) can be subdivided into smaller spans to fulfull a need to map a new span of memory. By default the allocator will greedily grab and break any larger span from the available caches before mapping new virtual memory. However, spans can currently not be glued together to form larger super spans again. Subspans can traverse the cache and be used by different threads individually.
@@ -123,10 +132,10 @@
 
 However, there is memory fragmentation in the meaning that a request for x bytes followed by a request of y bytes where x and y are at least one size class different in size will return blocks that are at least one memory page apart in virtual address space. Only blocks of the same size will potentially be within the same memory page span.
 
-Unlike the similar tcmalloc where the linked list of individual blocks leads to back-to-back allocations of the same block size will spread across a different span of memory pages each time (depending on free order), rpmalloc keeps an "active span" for each size class. This leads to back-to-back allocations will most likely be served from within the same span of memory pages (unless the span runs out of free blocks). The rpmalloc implementation will also use any "holes" in memory pages in semi-filled spans before using a completely free span.
+rpmalloc keeps an "active span" and free list for each size class. This means that back-to-back allocations will most likely be served from within the same span of memory pages (unless the span runs out of free blocks). The rpmalloc implementation will also use any "holes" in memory pages in semi-filled spans before using a completely free span.
 
 # Producer-consumer scenario
-Compared to the tcmalloc implementation, rpmalloc does not suffer as much from a producer-consumer thread scenario where one thread allocates memory blocks and another thread frees the blocks. In tcmalloc the free blocks need to traverse both the thread cache of the thread doing the free operations as well as the global cache before being reused in the allocating thread. In rpmalloc the freed blocks will be reused as soon as the allocating thread needs to get new spans from the thread cache.
+Compared to some other allocators, rpmalloc does not suffer as much from a producer-consumer thread scenario where one thread allocates memory blocks and another thread frees the blocks. In some allocators the free blocks need to traverse both the thread cache of the thread doing the free operations and the global cache before being reused in the allocating thread. In rpmalloc the freed blocks will be reused as soon as the allocating thread needs to get new spans from the thread cache. This enables faster release of completely freed memory pages as blocks in a memory page will not be aliased between different owning threads.
 
 # Best case scenarios
 Threads that keep ownership of allocated memory blocks within the thread and free the blocks from the same thread will have optimal performance.
@@ -134,20 +143,16 @@
 Threads that have allocation patterns where the difference in memory usage high and low water marks fit within the thread cache thresholds in the allocator will never touch the global cache except during thread init/fini and have optimal performance. Tweaking the cache limits can be done on a per-size-class basis.
 
 # Worst case scenarios
-Since each thread cache maps spans of memory pages per size class, a thread that allocates just a few blocks of each size class (32, 64, ...) for many size classes will never fill each bucket, and thus map a lot of memory pages while only using a small fraction of the mapped memory. However, the wasted memory will always be less than 64KiB (or the configured span size) per size class. The cache for free spans will be reused by all size classes.
-
-An application that has a producer-consumer scheme between threads where one thread performs all allocations and another frees all memory will have a sub-optimal performance due to blocks crossing thread boundaries will be freed in a two step process - first deferred to the allocating thread, then freed when that thread has need for more memory pages for the requested size. However, depending on the use case the performance overhead might be small.
+Since each thread cache maps spans of memory pages per size class, a thread that allocates just a few blocks of each size class (16, 32, ...) for many size classes will never fill each bucket, and thus map a lot of memory pages while only using a small fraction of the mapped memory. However, the wasted memory will always be less than 4KiB (or the configured memory page size) per size class as each span is initialized one memory page at a time. The cache for free spans will be reused by all size classes.
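+
+As a rough worked example with the default configuration, the 64 small and 61 medium size classes bound this waste to at most about 125 * 4KiB = 500KiB of committed memory.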
 
 Threads that perform a lot of allocations and deallocations in a pattern that have a large difference in high and low water marks, and that difference is larger than the thread cache size, will put a lot of contention on the global cache. What will happen is the thread cache will overflow on each low water mark causing pages to be released to the global cache, then underflow on high water mark causing pages to be re-acquired from the global cache. This can be mitigated by changing the __MAX_SPAN_CACHE_DIVISOR__ define in the source code (at the cost of higher average memory overhead).
 
 # Caveats
-Cross-thread deallocations are more costly than in-thread deallocations, since the spans are completely owned by the allocating thread. The free operation will be deferred using an atomic list operation and the actual free operation will be performed when the owner thread requires a new block of the corresponding size class.
+Cross-thread deallocations could leave dangling spans in the owning thread's list of partially used spans if the deallocated block is the last used block in the span and the span was previously marked as partial (at least one block deallocated by the owning thread). However, as an optimization for GC-like use cases, if all the blocks in the span are freed by other threads, the span can immediately be inserted in the owning thread's span cache.
 
 VirtualAlloc has an internal granularity of 64KiB. However, mmap lacks this granularity control, and the implementation instead oversizes the memory mapping with configured span size to be able to always return a memory area with the required alignment. Since the extra memory pages are never touched this will not result in extra committed physical memory pages, but rather only increase virtual memory address space.
 
-The free, realloc and usable size functions all require the passed pointer to be within the first 64KiB (or whatever you set the span size to) of the start of the memory block. You cannot pass in any pointer from the memory block address range. 
-
-All entry points assume the passed values are valid, for example passing an invalid pointer to free would most likely result in a segmentation fault. The library does not try to guard against errors.
+All entry points assume the passed values are valid; for example, passing an invalid pointer to free would most likely result in a segmentation fault. __The library does not try to guard against errors!__
 
 # License
 
diff --git a/build/ninja/clang.py b/build/ninja/clang.py
index 7cea8fd..e9ddc50 100644
--- a/build/ninja/clang.py
+++ b/build/ninja/clang.py
@@ -40,15 +40,15 @@
 
     #Base flags
     self.cflags = ['-D' + project.upper() + '_COMPILE=1',
-                   '-funit-at-a-time', '-fstrict-aliasing',
-                   '-fno-math-errno','-ffinite-math-only', '-funsafe-math-optimizations',
+                   '-funit-at-a-time', '-fstrict-aliasing', '-fvisibility=hidden', '-fno-stack-protector',
+                   '-fomit-frame-pointer', '-fno-math-errno','-ffinite-math-only', '-funsafe-math-optimizations',
                    '-fno-trapping-math', '-ffast-math']
     self.cwarnflags = ['-W', '-Werror', '-pedantic', '-Wall', '-Weverything',
-                       '-Wno-padded', '-Wno-documentation-unknown-command']
+                       '-Wno-padded', '-Wno-documentation-unknown-command', '-Wno-static-in-inline']
     self.cmoreflags = []
     self.mflags = []
     self.arflags = []
-    self.linkflags = []
+    self.linkflags = ['-fomit-frame-pointer']
     self.oslibs = []
     self.frameworks = []
 
@@ -65,7 +65,6 @@
     if self.target.is_linux() or self.target.is_bsd() or self.target.is_raspberrypi():
       self.cflags += ['-D_GNU_SOURCE=1']
       self.linkflags += ['-pthread']
-      self.oslibs += ['m']
     if self.target.is_linux() or self.target.is_raspberrypi():
       self.oslibs += ['dl']
     if self.target.is_bsd():
@@ -85,7 +84,7 @@
       self.cflags += ['-w']
     self.cxxflags = list(self.cflags)
 
-    self.cflags += ['-std=c11']
+    self.cflags += ['-std=gnu11']
     if self.target.is_macos() or self.target.is_ios():
       self.cxxflags += ['-std=c++14', '-stdlib=libc++']
     else:
@@ -311,7 +310,7 @@
     flags = []
     if targettype == 'sharedlib':
       flags += ['-DBUILD_DYNAMIC_LINK=1']
-      if self.target.is_linux():
+      if self.target.is_linux() or self.target.is_bsd():
        flags += ['-fPIC']
     flags += self.make_targetarchflags(arch, targettype)
     return flags
@@ -321,11 +320,11 @@
     if config == 'debug':
       flags += ['-DBUILD_DEBUG=1', '-g']
     elif config == 'release':
-      flags += ['-DBUILD_RELEASE=1', '-O3', '-g', '-funroll-loops']
+      flags += ['-DBUILD_RELEASE=1', '-DNDEBUG', '-O3', '-g', '-funroll-loops']
     elif config == 'profile':
-      flags += ['-DBUILD_PROFILE=1', '-O3', '-g', '-funroll-loops']
+      flags += ['-DBUILD_PROFILE=1', '-DNDEBUG', '-O3', '-g', '-funroll-loops']
     elif config == 'deploy':
-      flags += ['-DBUILD_DEPLOY=1', '-O3', '-g', '-funroll-loops']
+      flags += ['-DBUILD_DEPLOY=1', '-DNDEBUG', '-O3', '-g', '-funroll-loops']
     return flags
 
   def make_ararchflags(self, arch, targettype):
@@ -363,7 +362,9 @@
         flags += ['-dynamiclib']
     else:
       if targettype == 'sharedlib':
-        flags += ['-shared']
+        flags += ['-shared', '-fPIC']
+    if config == 'release':
+      flags += ['-DNDEBUG', '-O3']
     return flags
 
   def make_linkarchlibs(self, arch, targettype):
diff --git a/build/ninja/gcc.py b/build/ninja/gcc.py
index da08e95..21fc2e4 100644
--- a/build/ninja/gcc.py
+++ b/build/ninja/gcc.py
@@ -52,7 +52,6 @@
     if self.target.is_linux() or self.target.is_bsd() or self.target.is_raspberrypi():
       self.cflags += ['-D_GNU_SOURCE=1']
       self.linkflags += ['-pthread']
-      self.oslibs += ['m']
     if self.target.is_linux() or self.target.is_raspberrypi():
       self.oslibs += ['dl']
     if self.target.is_bsd():
@@ -183,7 +182,7 @@
     flags = []
     if targettype == 'sharedlib':
       flags += ['-DBUILD_DYNAMIC_LINK=1']
-      if self.target.is_linux():
+      if self.target.is_linux() or self.target.is_bsd():
         flags += ['-fPIC']
     flags += self.make_targetarchflags(arch, targettype)
     return flags
diff --git a/build/ninja/msvc.py b/build/ninja/msvc.py
index 3de3858..8288d94 100644
--- a/build/ninja/msvc.py
+++ b/build/ninja/msvc.py
@@ -22,7 +22,7 @@
     self.linker = 'link'
     self.dller = 'dll'
 
-    #Command definitions
+    #Command definitions (to generate assembly, add "/FAs /Fa$out.asm")
     self.cccmd = '$toolchain$cc /showIncludes /I. $includepaths $moreincludepaths $cflags $carchflags $cconfigflags $cmoreflags /c $in /Fo$out /Fd$pdbpath /FS /nologo'
     self.cxxcmd = '$toolchain$cxx /showIncludes /I. $includepaths $moreincludepaths $cxxflags $carchflags $cconfigflags $cmoreflags /c $in /Fo$out /Fd$pdbpath /FS /nologo'
     self.ccdepfile = None
diff --git a/configure.py b/configure.py
index 985c4a9..b3e77db 100755
--- a/configure.py
+++ b/configure.py
@@ -10,21 +10,14 @@
 import generator
 
 generator = generator.Generator(project = 'rpmalloc', variables = [('bundleidentifier', 'com.rampantpixels.rpmalloc.$(binname)')])
-target = generator.target
-writer = generator.writer
-toolchain = generator.toolchain
 
 rpmalloc_lib = generator.lib(module = 'rpmalloc', libname = 'rpmalloc', sources = ['rpmalloc.c'])
 
-if not target.is_android() and not target.is_ios():
+if not generator.target.is_android() and not generator.target.is_ios():
 	rpmalloc_so = generator.sharedlib(module = 'rpmalloc', libname = 'rpmalloc', sources = ['rpmalloc.c'])
 
-if not target.is_windows():
-	if not target.is_android() and not target.is_ios():
-		rpmallocwrap_lib = generator.lib(module = 'rpmalloc', libname = 'rpmallocwrap', sources = ['rpmalloc.c', 'malloc.c', 'new.cc'], variables = {'defines': ['ENABLE_PRELOAD=1']})
+	rpmallocwrap_so = generator.sharedlib(module = 'rpmalloc', libname = 'rpmallocwrap', sources = ['rpmalloc.c'], variables = {'defines': ['ENABLE_PRELOAD=1', 'ENABLE_OVERRIDE=1']})
+	rpmallocwrap_lib = generator.lib(module = 'rpmalloc', libname = 'rpmallocwrap', sources = ['rpmalloc.c'], variables = {'defines': ['ENABLE_PRELOAD=1', 'ENABLE_OVERRIDE=1']})
 
-	if not target.is_windows() and not target.is_android() and not target.is_ios():
-		rpmallocwrap_so = generator.sharedlib(module = 'rpmalloc', libname = 'rpmallocwrap', sources = ['rpmalloc.c', 'malloc.c', 'new.cc'], variables = {'runtime': 'c++', 'defines': ['ENABLE_PRELOAD=1']})
-
-if not target.is_ios() and not target.is_android():
 	generator.bin(module = 'test', sources = ['thread.c', 'main.c'], binname = 'rpmalloc-test', implicit_deps = [rpmalloc_lib], libs = ['rpmalloc'], includepaths = ['rpmalloc', 'test'], variables = {'defines': ['ENABLE_ASSERTS=1', 'ENABLE_STATISTICS=1']})
+	generator.bin(module = 'test', sources = ['thread.c', 'main-override.cc'], binname = 'rpmallocwrap-test', implicit_deps = [rpmallocwrap_lib], libs = ['rpmallocwrap'], includepaths = ['rpmalloc', 'test'], variables = {'runtime': 'c++', 'defines': ['ENABLE_ASSERTS=1', 'ENABLE_STATISTICS=1']})
diff --git a/rpmalloc/malloc.c b/rpmalloc/malloc.c
index 47702bd..426a14a 100644
--- a/rpmalloc/malloc.c
+++ b/rpmalloc/malloc.c
@@ -9,16 +9,43 @@
  *
  */
 
-#include "rpmalloc.h"
+//
+// This file provides overrides for the standard library malloc entry points for C and new/delete operators for C++
+// It also provides automatic initialization/finalization of process and threads
+//
 
-#ifndef ENABLE_VALIDATE_ARGS
-//! Enable validation of args to public entry points
-#define ENABLE_VALIDATE_ARGS      0
+#ifndef ARCH_64BIT
+#  if defined(__LLP64__) || defined(__LP64__) || defined(_WIN64)
+#    define ARCH_64BIT 1
+_Static_assert(sizeof(size_t) == 8, "Data type size mismatch");
+_Static_assert(sizeof(void*) == 8, "Data type size mismatch");
+#  else
+#    define ARCH_64BIT 0
+_Static_assert(sizeof(size_t) == 4, "Data type size mismatch");
+_Static_assert(sizeof(void*) == 4, "Data type size mismatch");
+#  endif
 #endif
 
-#if ENABLE_VALIDATE_ARGS
-//! Maximum allocation size to avoid integer overflow
-#define MAX_ALLOC_SIZE            (((size_t)-1) - 4096)
+#if (defined(__GNUC__) || defined(__clang__)) && !defined(__MACH__)
+#pragma GCC visibility push(default)
+#endif
+
+#if ENABLE_OVERRIDE
+
+#define USE_IMPLEMENT 1
+#define USE_INTERPOSE 0
+#define USE_ALIAS 0
+
+#if defined(__APPLE__) && ENABLE_PRELOAD
+#undef USE_INTERPOSE
+#define USE_INTERPOSE 1
+#endif
+
+#if !defined(_WIN32) && !USE_INTERPOSE
+#undef USE_IMPLEMENT
+#undef USE_ALIAS
+#define USE_IMPLEMENT 0
+#define USE_ALIAS 1
 #endif
 
 #ifdef _MSC_VER
@@ -28,84 +55,193 @@
 #undef calloc
 #endif
 
-//This file provides overrides for the standard library malloc style entry points
+#if USE_IMPLEMENT
 
-extern RPMALLOC_RESTRICT void* RPMALLOC_CDECL
-malloc(size_t size);
+extern inline void* RPMALLOC_CDECL malloc(size_t size) { return rpmalloc(size); }
+extern inline void* RPMALLOC_CDECL calloc(size_t count, size_t size) { return rpcalloc(count, size); }
+extern inline void* RPMALLOC_CDECL realloc(void* ptr, size_t size) { return rprealloc(ptr, size); }
+extern inline void* RPMALLOC_CDECL reallocf(void* ptr, size_t size) { return rprealloc(ptr, size); }
+extern inline void* RPMALLOC_CDECL aligned_alloc(size_t alignment, size_t size) { return rpaligned_alloc(alignment, size); }
+extern inline void* RPMALLOC_CDECL memalign(size_t alignment, size_t size) { return rpmemalign(alignment, size); }
+extern inline int RPMALLOC_CDECL posix_memalign(void** memptr, size_t alignment, size_t size) { return rpposix_memalign(memptr, alignment, size); }
+extern inline void RPMALLOC_CDECL free(void* ptr) { rpfree(ptr); }
+extern inline void RPMALLOC_CDECL cfree(void* ptr) { rpfree(ptr); }
+extern inline size_t RPMALLOC_CDECL malloc_usable_size(void* ptr) { return rpmalloc_usable_size(ptr); }
+extern inline size_t RPMALLOC_CDECL malloc_size(void* ptr) { return rpmalloc_usable_size(ptr); }
 
-extern RPMALLOC_RESTRICT void* RPMALLOC_CDECL
-calloc(size_t count, size_t size);
-
-extern void* RPMALLOC_CDECL
-realloc(void* ptr, size_t size);
-
-extern void* RPMALLOC_CDECL
-reallocf(void* ptr, size_t size);
-
-extern void*
-reallocarray(void* ptr, size_t count, size_t size);
-
-extern RPMALLOC_RESTRICT void* RPMALLOC_CDECL
-valloc(size_t size);
-
-extern RPMALLOC_RESTRICT void* RPMALLOC_CDECL
-pvalloc(size_t size);
-
-extern RPMALLOC_RESTRICT void* RPMALLOC_CDECL
-aligned_alloc(size_t alignment, size_t size);
-
-extern RPMALLOC_RESTRICT void* RPMALLOC_CDECL
-memalign(size_t alignment, size_t size);
-
-extern int RPMALLOC_CDECL
-posix_memalign(void** memptr, size_t alignment, size_t size);
-
-extern void RPMALLOC_CDECL
-free(void* ptr);
-
-extern void RPMALLOC_CDECL
-cfree(void* ptr);
-
-extern size_t RPMALLOC_CDECL
-malloc_usable_size(
-#if defined(__ANDROID__)
-	const void* ptr
+// Overload the C++ operators using the mangled names (https://itanium-cxx-abi.github.io/cxx-abi/abi.html#mangling)
+// operators delete and delete[]
+extern void _ZdlPv(void* p); void _ZdlPv(void* p) { rpfree(p); }
+extern void _ZdaPv(void* p); void _ZdaPv(void* p) { rpfree(p); }
+#if ARCH_64BIT
+// 64-bit operators new and new[], normal and aligned
+extern void* _Znwm(uint64_t size); void* _Znwm(uint64_t size) { return rpmalloc(size); }
+extern void* _Znam(uint64_t size); void* _Znam(uint64_t size) { return rpmalloc(size); }
+extern void* _Znwmm(uint64_t size, uint64_t align); void* _Znwmm(uint64_t size, uint64_t align) { return rpaligned_alloc(align, size); }
+extern void* _Znamm(uint64_t size, uint64_t align); void* _Znamm(uint64_t size, uint64_t align) { return rpaligned_alloc(align, size); }
 #else
-	void* ptr
+// 32-bit operators new and new[], normal and aligned
+extern void* _Znwj(uint32_t size); void* _Znwj(uint32_t size) { return rpmalloc(size); }
+extern void* _Znaj(uint32_t size); void* _Znaj(uint32_t size) { return rpmalloc(size); }
+extern void* _Znwjj(uint32_t size, uint32_t align); void* _Znwjj(uint32_t size, uint32_t align) { return rpaligned_alloc(align, size); }
+extern void* _Znajj(uint32_t size, uint32_t align); void* _Znajj(uint32_t size, uint32_t align) { return rpaligned_alloc(align, size); }
 #endif
-	);
 
-extern size_t RPMALLOC_CDECL
-malloc_size(void* ptr);
+#endif
+
+#if USE_INTERPOSE
+
+typedef struct interpose_t {
+	void* new_func;
+	void* orig_func;
+} interpose_t;
+
+#define MAC_INTERPOSE_PAIR(newf, oldf) 	{ (void*)newf, (void*)oldf }
+#define MAC_INTERPOSE_SINGLE(newf, oldf) \
+__attribute__((used)) static const interpose_t macinterpose##newf##oldf \
+__attribute__ ((section("__DATA, __interpose"))) = MAC_INTERPOSE_PAIR(newf, oldf)
+
+__attribute__((used)) static const interpose_t macinterpose_malloc[]
+__attribute__ ((section("__DATA, __interpose"))) = {
+	//new and new[]
+	MAC_INTERPOSE_PAIR(rpmalloc, _Znwm),
+	MAC_INTERPOSE_PAIR(rpmalloc, _Znam),
+	//delete and delete[]
+	MAC_INTERPOSE_PAIR(rpfree, _ZdlPv),
+	MAC_INTERPOSE_PAIR(rpfree, _ZdaPv),
+	MAC_INTERPOSE_PAIR(rpmalloc, malloc),
+	MAC_INTERPOSE_PAIR(rpmalloc, calloc),
+	MAC_INTERPOSE_PAIR(rprealloc, realloc),
+	MAC_INTERPOSE_PAIR(rprealloc, reallocf),
+	MAC_INTERPOSE_PAIR(rpaligned_alloc, aligned_alloc),
+	MAC_INTERPOSE_PAIR(rpmemalign, memalign),
+	MAC_INTERPOSE_PAIR(rpposix_memalign, posix_memalign),
+	MAC_INTERPOSE_PAIR(rpfree, free),
+	MAC_INTERPOSE_PAIR(rpfree, cfree),
+	MAC_INTERPOSE_PAIR(rpmalloc_usable_size, malloc_usable_size),
+	MAC_INTERPOSE_PAIR(rpmalloc_usable_size, malloc_size)
+};
+
+#endif
+
+#if USE_ALIAS
+
+#define RPALIAS(fn) __attribute__((alias(#fn), used, visibility("default")));
+
+// Alias the C++ operators using the mangled names (https://itanium-cxx-abi.github.io/cxx-abi/abi.html#mangling)
+
+// operators delete and delete[]
+void _ZdlPv(void* p) RPALIAS(rpfree)
+void _ZdaPv(void* p) RPALIAS(rpfree)
+
+#if ARCH_64BIT
+// 64-bit operators new and new[], normal and aligned
+void* _Znwm(uint64_t size) RPALIAS(rpmalloc)
+void* _Znam(uint64_t size) RPALIAS(rpmalloc)
+extern inline void* _Znwmm(uint64_t size, uint64_t align) { return rpaligned_alloc(align, size); }
+extern inline void* _Znamm(uint64_t size, uint64_t align) { return rpaligned_alloc(align, size); }
+#else
+// 32-bit operators new and new[], normal and aligned
+void* _Znwj(uint32_t size) RPALIAS(rpmalloc)
+void* _Znaj(uint32_t size) RPALIAS(rpmalloc)
+extern inline void* _Znwjj(uint32_t size, uint32_t align) { return rpaligned_alloc(align, size); }
+extern inline void* _Znajj(uint32_t size, uint32_t align) { return rpaligned_alloc(align, size); }
+#endif
+
+void* malloc(size_t size) RPALIAS(rpmalloc)
+void* calloc(size_t count, size_t size) RPALIAS(rpcalloc)
+void* realloc(void* ptr, size_t size) RPALIAS(rprealloc)
+void* reallocf(void* ptr, size_t size) RPALIAS(rprealloc)
+void* aligned_alloc(size_t alignment, size_t size) RPALIAS(rpaligned_alloc)
+void* memalign(size_t alignment, size_t size) RPALIAS(rpmemalign)
+int posix_memalign(void** memptr, size_t alignment, size_t size) RPALIAS(rpposix_memalign)
+void free(void* ptr) RPALIAS(rpfree)
+void cfree(void* ptr) RPALIAS(rpfree)
+size_t malloc_usable_size(void* ptr) RPALIAS(rpmalloc_usable_size)
+size_t malloc_size(void* ptr) RPALIAS(rpmalloc_usable_size)
+
+#endif
+
+extern inline void* RPMALLOC_CDECL
+reallocarray(void* ptr, size_t count, size_t size) {
+	size_t total;
+#if ENABLE_VALIDATE_ARGS
+#ifdef _MSC_VER
+	int err = SizeTMult(count, size, &total);
+	if ((err != S_OK) || (total >= MAX_ALLOC_SIZE)) {
+		errno = EINVAL;
+		return 0;
+	}
+#else
+	int err = __builtin_umull_overflow(count, size, &total);
+	if (err || (total >= MAX_ALLOC_SIZE)) {
+		errno = EINVAL;
+		return 0;
+	}
+#endif
+#else
+	total = count * size;
+#endif
+	return realloc(ptr, total);
+}
+
+extern inline void* RPMALLOC_CDECL
+valloc(size_t size) {
+	get_thread_heap();
+	if (!size)
+		size = _memory_page_size;
+	size_t total_size = size + _memory_page_size;
+#if ENABLE_VALIDATE_ARGS
+	if (total_size < size) {
+		errno = EINVAL;
+		return 0;
+	}
+#endif
+	void* buffer = rpmalloc(total_size);
+	if ((uintptr_t)buffer & (_memory_page_size - 1))
+		return (void*)(((uintptr_t)buffer & ~(_memory_page_size - 1)) + _memory_page_size);
+	return buffer;
+}
+
+extern inline void* RPMALLOC_CDECL
+pvalloc(size_t size) {
+	get_thread_heap();
+	size_t aligned_size = size;
+	if (aligned_size % _memory_page_size)
+		aligned_size = (1 + (aligned_size / _memory_page_size)) * _memory_page_size;
+#if ENABLE_VALIDATE_ARGS
+	if (aligned_size < size) {
+		errno = EINVAL;
+		return 0;
+	}
+#endif
+	return valloc(aligned_size);
+}
+
+#endif // ENABLE_OVERRIDE
+
+#if ENABLE_PRELOAD
 
 #ifdef _WIN32
 
-#include <Windows.h>
+#if defined(BUILD_DYNAMIC_LINK) && BUILD_DYNAMIC_LINK
 
-static size_t page_size;
-static int is_initialized;
-
-static void
-initializer(void) {
-	if (!is_initialized) {
-		is_initialized = 1;
-		SYSTEM_INFO system_info;
-		memset(&system_info, 0, sizeof(system_info));
-		GetSystemInfo(&system_info);
-		page_size = system_info.dwPageSize;
+__declspec(dllexport) BOOL WINAPI
+DllMain(HINSTANCE instance, DWORD reason, LPVOID reserved) {
+	(void)sizeof(reserved);
+	(void)sizeof(instance);
+	if (reason == DLL_PROCESS_ATTACH)
 		rpmalloc_initialize();
-	}
-	rpmalloc_thread_initialize();
+	else if (reason == DLL_PROCESS_DETACH)
+		rpmalloc_finalize();
+	else if (reason == DLL_THREAD_ATTACH)
+		rpmalloc_thread_initialize();
+	else if (reason == DLL_THREAD_DETACH)
+		rpmalloc_thread_finalize();
+	return TRUE;
 }
 
-static void
-finalizer(void) {
-	rpmalloc_thread_finalize();
-	if (is_initialized) {
-		is_initialized = 0;
-		rpmalloc_finalize();
-	}
-}
+#endif
 
 #else
 
@@ -114,32 +250,20 @@
 #include <stdint.h>
 #include <unistd.h>
 
-static size_t page_size;
 static pthread_key_t destructor_key;
-static int is_initialized;
 
 static void
 thread_destructor(void*);
 
 static void __attribute__((constructor))
 initializer(void) {
-	if (!is_initialized) {
-		is_initialized = 1;
-		page_size = (size_t)sysconf(_SC_PAGESIZE);
-		pthread_key_create(&destructor_key, thread_destructor);
-		if (rpmalloc_initialize())
-			abort();
-	}
-	rpmalloc_thread_initialize();
+	rpmalloc_initialize();
+	pthread_key_create(&destructor_key, thread_destructor);
 }
 
 static void __attribute__((destructor))
 finalizer(void) {
-	rpmalloc_thread_finalize();
-	if (is_initialized) {
-		is_initialized = 0;
-		rpmalloc_finalize();
-	}
+	rpmalloc_finalize();
 }
 
 typedef struct {
@@ -171,24 +295,14 @@
                      const pthread_attr_t* attr,
                      void* (*start_routine)(void*),
                      void* arg) {
-	rpmalloc_thread_initialize();
+	rpmalloc_initialize();
 	thread_starter_arg* starter_arg = rpmalloc(sizeof(thread_starter_arg));
 	starter_arg->real_start = start_routine;
 	starter_arg->real_arg = arg;
 	return pthread_create(thread, attr, thread_starter, starter_arg);
 }
 
-typedef struct interpose_s {
-	void* new_func;
-	void* orig_func;
-} interpose_t;
-
-#define MAC_INTERPOSE(newf, oldf) __attribute__((used)) \
-static const interpose_t macinterpose##newf##oldf \
-__attribute__ ((section("__DATA, __interpose"))) = \
-	{ (void*)newf, (void*)oldf }
-
-MAC_INTERPOSE(pthread_create_proxy, pthread_create);
+MAC_INTERPOSE_SINGLE(pthread_create_proxy, pthread_create);
 
 #else
 
@@ -199,7 +313,7 @@
                const pthread_attr_t* attr,
                void* (*start_routine)(void*),
                void* arg) {
-#if defined(__linux__) || defined(__FreeBSD__) || defined(__OpenBSD__) || defined(__APPLE__)
+#if defined(__linux__) || defined(__FreeBSD__) || defined(__OpenBSD__) || defined(__APPLE__) || defined(__HAIKU__)
 	char fname[] = "pthread_create";
 #else
 	char fname[] = "_pthread_create";
@@ -216,306 +330,37 @@
 
 #endif
 
-RPMALLOC_RESTRICT void* RPMALLOC_CDECL
-malloc(size_t size) {
-	initializer();
-	return rpmalloc(size);
-}
-
-void* RPMALLOC_CDECL
-realloc(void* ptr, size_t size) {
-	initializer();
-	return rprealloc(ptr, size);
-}
-
-void* RPMALLOC_CDECL
-reallocf(void* ptr, size_t size) {
-	initializer();
-	return rprealloc(ptr, size);
-}
-
-void* RPMALLOC_CDECL
-reallocarray(void* ptr, size_t count, size_t size) {
-	size_t total;
-#if ENABLE_VALIDATE_ARGS
-#ifdef _MSC_VER
-	int err = SizeTMult(count, size, &total);
-	if ((err != S_OK) || (total >= MAX_ALLOC_SIZE)) {
-		errno = EINVAL;
-		return 0;
-	}
-#else
-	int err = __builtin_umull_overflow(count, size, &total);
-	if (err || (total >= MAX_ALLOC_SIZE)) {
-		errno = EINVAL;
-		return 0;
-	}
 #endif
-#else
-	total = count * size;
-#endif
-	return realloc(ptr, total);
-}
 
-RPMALLOC_RESTRICT void* RPMALLOC_CDECL
-calloc(size_t count, size_t size) {
-	initializer();
-	return rpcalloc(count, size);
-}
+#if ENABLE_OVERRIDE
 
-RPMALLOC_RESTRICT void* RPMALLOC_CDECL
-valloc(size_t size) {
-	initializer();
-	if (!size)
-		size = page_size;
-	size_t total_size = size + page_size;
-#if ENABLE_VALIDATE_ARGS
-	if (total_size < size) {
-		errno = EINVAL;
-		return 0;
-	}
-#endif
-	void* buffer = rpmalloc(total_size);
-	if ((uintptr_t)buffer & (page_size - 1))
-		return (void*)(((uintptr_t)buffer & ~(page_size - 1)) + page_size);
-	return buffer;
-}
+#if defined(__GLIBC__) && defined(__linux__)
 
-RPMALLOC_RESTRICT void* RPMALLOC_CDECL
-pvalloc(size_t size) {
-	size_t aligned_size = size;
-	if (aligned_size % page_size)
-		aligned_size = (1 + (aligned_size / page_size)) * page_size;
-#if ENABLE_VALIDATE_ARGS
-	if (aligned_size < size) {
-		errno = EINVAL;
-		return 0;
-	}
-#endif
+void* __libc_malloc(size_t size) RPALIAS(rpmalloc)
+void* __libc_calloc(size_t count, size_t size) RPALIAS(rpcalloc)
+void* __libc_realloc(void* p, size_t size) RPALIAS(rprealloc)
+void __libc_free(void* p) RPALIAS(rpfree)
+void __libc_cfree(void* p) RPALIAS(rpfree)
+void* __libc_memalign(size_t align, size_t size) RPALIAS(rpmemalign)
+int __posix_memalign(void** p, size_t align, size_t size) RPALIAS(rpposix_memalign)
+
+extern void* __libc_valloc(size_t size);
+extern void* __libc_pvalloc(size_t size);
+
+void*
+__libc_valloc(size_t size) {
 	return valloc(size);
 }
 
-RPMALLOC_RESTRICT void* RPMALLOC_CDECL
-aligned_alloc(size_t alignment, size_t size) {
-	initializer();
-	return rpaligned_alloc(alignment, size);
+void*
+__libc_pvalloc(size_t size) {
+	return pvalloc(size);
 }
 
-RPMALLOC_RESTRICT void* RPMALLOC_CDECL
-memalign(size_t alignment, size_t size) {
-	initializer();
-	return rpmemalign(alignment, size);
-}
-
-int RPMALLOC_CDECL
-posix_memalign(void** memptr, size_t alignment, size_t size) {
-	initializer();
-	return rpposix_memalign(memptr, alignment, size);
-}
-
-void RPMALLOC_CDECL
-free(void* ptr) {
-	if (!is_initialized || !rpmalloc_is_thread_initialized())
-		return;
-	rpfree(ptr);
-}
-
-void RPMALLOC_CDECL
-cfree(void* ptr) {
-	free(ptr);
-}
-
-size_t RPMALLOC_CDECL
-malloc_usable_size(
-#if defined(__ANDROID__)
-	const void* ptr
-#else
-	void* ptr
 #endif
-	) {
-	if (!rpmalloc_is_thread_initialized())
-		return 0;
-	return rpmalloc_usable_size((void*)(uintptr_t)ptr);
-}
 
-size_t RPMALLOC_CDECL
-malloc_size(void* ptr) {
-	return malloc_usable_size(ptr);
-}
+#endif
 
-#ifdef _MSC_VER
-
-extern void* RPMALLOC_CDECL
-_expand(void* block, size_t size) {
-	return realloc(block, size);
-}
-
-extern RPMALLOC_RESTRICT void* RPMALLOC_CDECL
-_recalloc(void* block, size_t count, size_t size) {
-	initializer();
-	if (!block)
-		return rpcalloc(count, size);
-	size_t newsize = count * size;
-	size_t oldsize = rpmalloc_usable_size(block);
-	void* newblock = rprealloc(block, newsize);
-	if (newsize > oldsize)
-		memset((char*)newblock + oldsize, 0, newsize - oldsize);
-	return newblock;
-}
-
-extern RPMALLOC_RESTRICT void* RPMALLOC_CDECL
-_aligned_malloc(size_t size, size_t alignment) {
-	return aligned_alloc(alignment, size);
-}
-
-extern RPMALLOC_RESTRICT void* RPMALLOC_CDECL
-_aligned_realloc(void* block, size_t size, size_t alignment) {
-	initializer();
-	size_t oldsize = rpmalloc_usable_size(block);
-	return rpaligned_realloc(block, alignment, size, oldsize, 0);
-}
-
-extern RPMALLOC_RESTRICT void* RPMALLOC_CDECL
-_aligned_recalloc(void* block, size_t count, size_t size, size_t alignment) {
-	initializer();
-	size_t newsize = count * size;
-	if (!block) {
-		block = rpaligned_alloc(count, newsize);
-		memset(block, 0, newsize);
-		return block;
-	}
-	size_t oldsize = rpmalloc_usable_size(block);
-	void* newblock = rpaligned_realloc(block, alignment, newsize, oldsize, 0);
-	if (newsize > oldsize)
-		memset((char*)newblock + oldsize, 0, newsize - oldsize);
-	return newblock;
-}
-
-void RPMALLOC_CDECL
-_aligned_free(void* block) {
-	free(block);
-}
-
-extern size_t RPMALLOC_CDECL
-_msize(void* ptr) {
-	return malloc_usable_size(ptr);
-}
-
-extern size_t RPMALLOC_CDECL
-_aligned_msize(void* block, size_t alignment, size_t offset) {
-	return malloc_usable_size(block);
-}
-
-extern intptr_t RPMALLOC_CDECL
-_get_heap_handle(void) {
-	return 0;
-}
-
-extern int RPMALLOC_CDECL
-_heap_init(void) {
-	initializer();
-	return 1;
-}
-
-extern void RPMALLOC_CDECL
-_heap_term() {
-}
-
-extern int RPMALLOC_CDECL
-_set_new_mode(int flag) {
-	(void)sizeof(flag);
-	return 0;
-}
-
-#ifndef NDEBUG
-
-extern int RPMALLOC_CDECL
-_CrtDbgReport(int reportType, char const* fileName, int linenumber, char const* moduleName, char const* format, ...) {
-	return 0;
-}
-
-extern int RPMALLOC_CDECL
-_CrtDbgReportW(int reportType, wchar_t const* fileName, int lineNumber, wchar_t const* moduleName, wchar_t const* format, ...) {
-	return 0;
-}
-
-extern int RPMALLOC_CDECL
-_VCrtDbgReport(int reportType, char const* fileName, int linenumber, char const* moduleName, char const* format, va_list arglist) {
-	return 0;
-}
-
-extern int RPMALLOC_CDECL
-_VCrtDbgReportW(int reportType, wchar_t const* fileName, int lineNumber, wchar_t const* moduleName, wchar_t const* format, va_list arglist) {
-	return 0;
-}
-
-extern int RPMALLOC_CDECL
-_CrtSetReportMode(int reportType, int reportMode) {
-	return 0;
-}
-
-extern RPMALLOC_RESTRICT void* RPMALLOC_CDECL
-_malloc_dbg(size_t size, int blockUse, char const* fileName, int lineNumber) {
-	return malloc(size);
-}
-
-extern RPMALLOC_RESTRICT void* RPMALLOC_CDECL
-_expand_dbg(void* block, size_t size, int blockUse, char const* fileName, int lineNumber) {
-	return _expand(block, size);
-}
-
-extern RPMALLOC_RESTRICT void* RPMALLOC_CDECL
-_calloc_dbg(size_t count, size_t size, int blockUse, char const* fileName, int lineNumber) {
-	return calloc(count, size);
-}
-
-extern RPMALLOC_RESTRICT void* RPMALLOC_CDECL
-_realloc_dbg(void* block, size_t size, int blockUse, char const* fileName, int lineNumber) {
-	return realloc(block, size);
-}
-
-extern RPMALLOC_RESTRICT void* RPMALLOC_CDECL
-_recalloc_dbg(void* block, size_t count, size_t size, int blockUse, char const* fileName, int lineNumber) {
-	return _recalloc(block, count, size);
-}
-
-extern RPMALLOC_RESTRICT void* RPMALLOC_CDECL
-_aligned_malloc_dbg(size_t size, size_t alignment, char const* fileName, int lineNumber) {
-	return aligned_alloc(alignment, size);
-}
-
-extern RPMALLOC_RESTRICT void* RPMALLOC_CDECL
-_aligned_realloc_dbg(void* block, size_t size, size_t alignment, char const* fileName, int lineNumber) {
-	return _aligned_realloc(block, size, alignment);
-}
-
-extern RPMALLOC_RESTRICT void* RPMALLOC_CDECL
-_aligned_recalloc_dbg(void* block, size_t count, size_t size, size_t alignment, char const* fileName, int lineNumber) {
-	return _aligned_recalloc(block, count, size, alignment);
-}
-
-extern void RPMALLOC_CDECL
-_free_dbg(void* block, int blockUse) {
-	free(block);
-}
-
-extern void RPMALLOC_CDECL
-_aligned_free_dbg(void* block) {
-	free(block);
-}
-
-extern size_t RPMALLOC_CDECL
-_msize_dbg(void* ptr) {
-	return malloc_usable_size(ptr);
-}
-
-extern size_t RPMALLOC_CDECL
-_aligned_msize_dbg(void* block, size_t alignment, size_t offset) {
-	return malloc_usable_size(block);
-}
-
-#endif  // NDEBUG
-
-extern void* _crtheap = (void*)1;
-
+#if (defined(__GNUC__) || defined(__clang__)) && !defined(__MACH__)
+#pragma GCC visibility pop
 #endif
diff --git a/rpmalloc/new.cc b/rpmalloc/new.cc
deleted file mode 100644
index 92ae2fb..0000000
--- a/rpmalloc/new.cc
+++ /dev/null
@@ -1,123 +0,0 @@
-/* new.cc  -  Memory allocator  -  Public Domain  -  2017 Mattias Jansson
- *
- * This library provides a cross-platform lock free thread caching malloc implementation in C11.
- * The latest source code is always available at
- *
- * https://github.com/mjansson/rpmalloc
- *
- * This library is put in the public domain; you can redistribute it and/or modify it without any restrictions.
- *
- */
-
-#include <new>
-#include <cstdint>
-#include <cstdlib>
-
-#include "rpmalloc.h"
-
-using namespace std;
-
-#ifdef __clang__
-#pragma clang diagnostic ignored "-Wc++98-compat"
-#endif
-
-extern void*
-operator new(size_t size);
-
-extern void*
-operator new[](size_t size);
-
-extern void
-operator delete(void* ptr) noexcept;
-
-extern void
-operator delete[](void* ptr) noexcept;
-
-extern void*
-operator new(size_t size, const std::nothrow_t&) noexcept;
-
-extern void*
-operator new[](size_t size, const std::nothrow_t&) noexcept;
-
-extern void
-operator delete(void* ptr, const std::nothrow_t&) noexcept;
-
-extern void
-operator delete[](void* ptr, const std::nothrow_t&) noexcept;
-
-extern void
-operator delete(void* ptr, size_t) noexcept;
-
-extern void
-operator delete[](void* ptr, size_t) noexcept;
-
-static int is_initialized;
-
-static void
-initializer(void) {
-	if (!is_initialized) {
-		is_initialized = 1;
-		rpmalloc_initialize();
-	}
-	rpmalloc_thread_initialize();
-}
-
-void*
-operator new(size_t size) {
-	initializer();
-	return rpmalloc(size);
-}
-
-void
-operator delete(void* ptr) noexcept {
-	if (rpmalloc_is_thread_initialized())
-		rpfree(ptr);
-}
-
-void*
-operator new[](size_t size) {
-	initializer();
-	return rpmalloc(size);
-}
-
-void
-operator delete[](void* ptr) noexcept {
-	if (rpmalloc_is_thread_initialized())
-		rpfree(ptr);
-}
-
-void*
-operator new(size_t size, const std::nothrow_t&) noexcept {
-	initializer();
-	return rpmalloc(size);
-}
-
-void*
-operator new[](size_t size, const std::nothrow_t&) noexcept {
-	initializer();
-	return rpmalloc(size);
-}
-
-void
-operator delete(void* ptr, const std::nothrow_t&) noexcept {
-	if (rpmalloc_is_thread_initialized())
-		rpfree(ptr);
-}
-
-void
-operator delete[](void* ptr, const std::nothrow_t&) noexcept {
-	if (rpmalloc_is_thread_initialized())
-		rpfree(ptr);
-}
-
-void
-operator delete(void* ptr, size_t) noexcept {
-	if (rpmalloc_is_thread_initialized())
-		rpfree(ptr);
-}
-
-void
-operator delete[](void* ptr, size_t) noexcept {
-	if (rpmalloc_is_thread_initialized())
-		rpfree(ptr);
-}
diff --git a/rpmalloc/rpmalloc.c b/rpmalloc/rpmalloc.c
index 4eae30c..451d03d 100644
--- a/rpmalloc/rpmalloc.c
+++ b/rpmalloc/rpmalloc.c
@@ -20,10 +20,6 @@
 //! Enable per-thread cache
 #define ENABLE_THREAD_CACHE       1
 #endif
-#ifndef ENABLE_ADAPTIVE_THREAD_CACHE
-//! Enable adaptive size of per-thread cache (still bounded by THREAD_CACHE_MULTIPLIER hard limit)
-#define ENABLE_ADAPTIVE_THREAD_CACHE  1
-#endif
 #ifndef ENABLE_GLOBAL_CACHE
 //! Enable global cache shared between all threads, requires thread cache
 #define ENABLE_GLOBAL_CACHE       1
@@ -40,6 +36,10 @@
 //! Enable asserts
 #define ENABLE_ASSERTS            0
 #endif
+#ifndef ENABLE_OVERRIDE
+//! Override standard library malloc/free and new/delete entry points
+#define ENABLE_OVERRIDE           0
+#endif
 #ifndef ENABLE_PRELOAD
 //! Support preloading
 #define ENABLE_PRELOAD            0
@@ -49,8 +49,8 @@
 #define DISABLE_UNMAP             0
 #endif
 #ifndef DEFAULT_SPAN_MAP_COUNT
-//! Default number of spans to map in call to map more virtual memory
-#define DEFAULT_SPAN_MAP_COUNT    32
+//! Default number of spans to map in a single call to map more virtual memory (the default values yield 4MiB here)
+#define DEFAULT_SPAN_MAP_COUNT    64
 #endif
 
 #if ENABLE_THREAD_CACHE
@@ -63,9 +63,15 @@
 #define ENABLE_UNLIMITED_THREAD_CACHE ENABLE_UNLIMITED_CACHE
 #endif
 #if !ENABLE_UNLIMITED_THREAD_CACHE
+#ifndef THREAD_CACHE_MULTIPLIER
 //! Multiplier for thread cache (cache limit will be span release count multiplied by this value)
 #define THREAD_CACHE_MULTIPLIER 16
 #endif
+#ifndef ENABLE_ADAPTIVE_THREAD_CACHE
+//! Enable adaptive size of per-thread cache (still bounded by THREAD_CACHE_MULTIPLIER hard limit)
+#define ENABLE_ADAPTIVE_THREAD_CACHE  0
+#endif
+#endif
 #endif
 
 #if ENABLE_GLOBAL_CACHE && ENABLE_THREAD_CACHE
@@ -75,7 +81,7 @@
 #endif
 #if !ENABLE_UNLIMITED_GLOBAL_CACHE
 //! Multiplier for global cache (cache limit will be span release count multiplied by this value)
-#define GLOBAL_CACHE_MULTIPLIER 64
+#define GLOBAL_CACHE_MULTIPLIER (THREAD_CACHE_MULTIPLIER * 6)
 #endif
 #else
 #  undef ENABLE_GLOBAL_CACHE
@@ -101,12 +107,14 @@
 
 /// Platform and arch specifics
 #if defined(_MSC_VER) && !defined(__clang__)
-#  define FORCEINLINE __forceinline
+#  define FORCEINLINE inline __forceinline
 #  define _Static_assert static_assert
 #else
 #  define FORCEINLINE inline __attribute__((__always_inline__))
 #endif
 #if PLATFORM_WINDOWS
+#  define WIN32_LEAN_AND_MEAN
+#  include <windows.h>
 #  if ENABLE_VALIDATE_ARGS
 #    include <Intsafe.h>
 #  endif
@@ -116,6 +124,7 @@
 #  include <stdlib.h>
 #  if defined(__APPLE__)
 #    include <mach/mach_vm.h>
+#    include <mach/vm_statistics.h>
 #    include <pthread.h>
 #  endif
 #  if defined(__HAIKU__)
@@ -124,14 +133,6 @@
 #  endif
 #endif
 
-#ifndef ARCH_64BIT
-#  if defined(__LLP64__) || defined(__LP64__) || defined(_WIN64)
-#    define ARCH_64BIT 1
-#  else
-#    define ARCH_64BIT 0
-#  endif
-#endif
-
 #include <stdint.h>
 #include <string.h>
 
@@ -145,6 +146,9 @@
 #  undef  assert
 #  define assert(x) do {} while(0)
 #endif
+#if ENABLE_STATISTICS
+#  include <stdio.h>
+#endif
 
 /// Atomic access abstraction
 #if defined(_MSC_VER) && !defined(__clang__)
@@ -159,15 +163,21 @@
 static FORCEINLINE int32_t atomic_load32(atomic32_t* src) { return *src; }
 static FORCEINLINE void    atomic_store32(atomic32_t* dst, int32_t val) { *dst = val; }
 static FORCEINLINE int32_t atomic_incr32(atomic32_t* val) { return (int32_t)_InterlockedExchangeAdd(val, 1) + 1; }
+#if ENABLE_STATISTICS || ENABLE_ADAPTIVE_THREAD_CACHE
+static FORCEINLINE int32_t atomic_decr32(atomic32_t* val) { return (int32_t)_InterlockedExchangeAdd(val, -1) - 1; }
+#endif
 static FORCEINLINE int32_t atomic_add32(atomic32_t* val, int32_t add) { return (int32_t)_InterlockedExchangeAdd(val, add) + add; }
 static FORCEINLINE void*   atomic_load_ptr(atomicptr_t* src) { return (void*)*src; }
 static FORCEINLINE void    atomic_store_ptr(atomicptr_t* dst, void* val) { *dst = val; }
-#if ARCH_64BIT
+#  if defined(__LLP64__) || defined(__LP64__) || defined(_WIN64)
 static FORCEINLINE int     atomic_cas_ptr(atomicptr_t* dst, void* val, void* ref) { return (_InterlockedCompareExchange64((volatile long long*)dst, (long long)val, (long long)ref) == (long long)ref) ? 1 : 0; }
 #else
 static FORCEINLINE int     atomic_cas_ptr(atomicptr_t* dst, void* val, void* ref) { return (_InterlockedCompareExchange((volatile long*)dst, (long)val, (long)ref) == (long)ref) ? 1 : 0; }
 #endif
 
+#define EXPECTED(x) (x)
+#define UNEXPECTED(x) (x)
+
 #else
 
 #include <stdatomic.h>
@@ -182,28 +192,34 @@
 static FORCEINLINE int32_t atomic_load32(atomic32_t* src) { return atomic_load_explicit(src, memory_order_relaxed); }
 static FORCEINLINE void    atomic_store32(atomic32_t* dst, int32_t val) { atomic_store_explicit(dst, val, memory_order_relaxed); }
 static FORCEINLINE int32_t atomic_incr32(atomic32_t* val) { return atomic_fetch_add_explicit(val, 1, memory_order_relaxed) + 1; }
+#if ENABLE_STATISTICS || ENABLE_ADAPTIVE_THREAD_CACHE
+static FORCEINLINE int32_t atomic_decr32(atomic32_t* val) { return atomic_fetch_add_explicit(val, -1, memory_order_relaxed) - 1; }
+#endif
 static FORCEINLINE int32_t atomic_add32(atomic32_t* val, int32_t add) { return atomic_fetch_add_explicit(val, add, memory_order_relaxed) + add; }
 static FORCEINLINE void*   atomic_load_ptr(atomicptr_t* src) { return atomic_load_explicit(src, memory_order_relaxed); }
 static FORCEINLINE void    atomic_store_ptr(atomicptr_t* dst, void* val) { atomic_store_explicit(dst, val, memory_order_relaxed); }
 static FORCEINLINE int     atomic_cas_ptr(atomicptr_t* dst, void* val, void* ref) { return atomic_compare_exchange_weak_explicit(dst, &ref, val, memory_order_release, memory_order_acquire); }
 
+#define EXPECTED(x) __builtin_expect((x), 1)
+#define UNEXPECTED(x) __builtin_expect((x), 0)
+    
 #endif
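
A minimal standalone sketch of how the EXPECTED/UNEXPECTED branch hints above are typically used on a hot path; the free-list layout and function names are illustrative, not taken from rpmalloc. With GCC/Clang the hints map to __builtin_expect, with MSVC they compile away.

#include <assert.h>
#include <stddef.h>

#define EXPECTED(x)   __builtin_expect((x), 1)
#define UNEXPECTED(x) __builtin_expect((x), 0)

static void*
slow_path_allocate(size_t size) {
	(void)size;
	return 0; /* stand-in for the heavier fallback path */
}

static void*
fast_path_allocate(void** free_list, size_t size) {
	/* Common case: pop the first block from an intrusive free list */
	if (EXPECTED(*free_list != 0)) {
		void* block = *free_list;
		*free_list = *((void**)block);
		return block;
	}
	/* Rare case: fall back to the slower path */
	return slow_path_allocate(size);
}

int
main(void) {
	void* block_a[1];
	void* block_b[1];
	block_a[0] = block_b; /* free list: block_a -> block_b -> NULL */
	block_b[0] = 0;
	void* free_list = block_a;
	assert(fast_path_allocate(&free_list, 32) == (void*)block_a);
	assert(free_list == (void*)block_b);
	return 0;
}
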
 
 /// Preconfigured limits and sizes
 //! Granularity of a small allocation block
-#define SMALL_GRANULARITY         32
+#define SMALL_GRANULARITY         16
 //! Small granularity shift count
-#define SMALL_GRANULARITY_SHIFT   5
+#define SMALL_GRANULARITY_SHIFT   4
 //! Number of small block size classes
-#define SMALL_CLASS_COUNT         63
+#define SMALL_CLASS_COUNT         65
 //! Maximum size of a small block
-#define SMALL_SIZE_LIMIT          (SMALL_GRANULARITY * SMALL_CLASS_COUNT)
+#define SMALL_SIZE_LIMIT          (SMALL_GRANULARITY * (SMALL_CLASS_COUNT - 1))
 //! Granularity of a medium allocation block
 #define MEDIUM_GRANULARITY        512
 //! Medium granularity shift count
 #define MEDIUM_GRANULARITY_SHIFT  9
 //! Number of medium block size classes
-#define MEDIUM_CLASS_COUNT        63
+#define MEDIUM_CLASS_COUNT        61
 //! Total number of small + medium size classes
 #define SIZE_CLASS_COUNT          (SMALL_CLASS_COUNT + MEDIUM_CLASS_COUNT)
 //! Number of large block size classes
@@ -212,18 +228,8 @@
 #define MEDIUM_SIZE_LIMIT         (SMALL_SIZE_LIMIT + (MEDIUM_GRANULARITY * MEDIUM_CLASS_COUNT))
 //! Maximum size of a large block
 #define LARGE_SIZE_LIMIT          ((LARGE_CLASS_COUNT * _memory_span_size) - SPAN_HEADER_SIZE)
-//! Size of a span header
-#define SPAN_HEADER_SIZE          64
-
-#define pointer_offset(ptr, ofs) (void*)((char*)(ptr) + (ptrdiff_t)(ofs))
-#define pointer_diff(first, second) (ptrdiff_t)((const char*)(first) - (const char*)(second))
-
-#if ARCH_64BIT
-typedef int64_t offset_t;
-#else
-typedef int32_t offset_t;
-#endif
-typedef uint32_t count_t;
+//! Size of a span header (must be a multiple of SMALL_GRANULARITY)
+#define SPAN_HEADER_SIZE          96
 
 #if ENABLE_VALIDATE_ARGS
 //! Maximum allocation size to avoid integer overflow
@@ -231,60 +237,92 @@
 #define MAX_ALLOC_SIZE            (((size_t)-1) - _memory_span_size)
 #endif
 
+#define pointer_offset(ptr, ofs) (void*)((char*)(ptr) + (ptrdiff_t)(ofs))
+#define pointer_diff(first, second) (ptrdiff_t)((const char*)(first) - (const char*)(second))
+
+#define INVALID_POINTER ((void*)((uintptr_t)-1))
+
 /// Data types
 //! A memory heap, per thread
 typedef struct heap_t heap_t;
+//! Heap spans per size class
+typedef struct heap_class_t heap_class_t;
 //! Span of memory pages
 typedef struct span_t span_t;
+//! Span list
+typedef struct span_list_t span_list_t;
+//! Span active data
+typedef struct span_active_t span_active_t;
 //! Size class definition
 typedef struct size_class_t size_class_t;
-//! Span block bookkeeping
-typedef struct span_block_t span_block_t;
-//! Span list bookkeeping
-typedef struct span_list_t span_list_t;
-//! Span data union, usage depending on span state
-typedef union span_data_t span_data_t;
 //! Global cache
 typedef struct global_cache_t global_cache_t;
 
 //! Flag indicating span is the first (master) span of a split superspan
-#define SPAN_FLAG_MASTER 1
+#define SPAN_FLAG_MASTER 1U
 //! Flag indicating span is a secondary (sub) span of a split superspan
-#define SPAN_FLAG_SUBSPAN 2
+#define SPAN_FLAG_SUBSPAN 2U
+//! Flag indicating span has blocks with increased alignment
+#define SPAN_FLAG_ALIGNED_BLOCKS 4U
 
-struct span_block_t {
-	//! Free list
-	uint16_t    free_list;
-	//! First autolinked block
-	uint16_t    first_autolink;
-	//! Free count
-	uint16_t    free_count;
-};
-
-struct span_list_t {
-	//! List size
-	uint32_t    size;
-};
-
-union span_data_t {
-	//! Span data when used as blocks
-	span_block_t block;
-	//! Span data when used in lists
-	span_list_t list;
-	//! Dummy
-	uint64_t compound;
-};
-
-#if ENABLE_ADAPTIVE_THREAD_CACHE
+#if ENABLE_ADAPTIVE_THREAD_CACHE || ENABLE_STATISTICS
 struct span_use_t {
 	//! Current number of spans used (actually used, not in cache)
-	unsigned int current;
+	atomic32_t current;
 	//! High water mark of spans used
-	unsigned int high;
+	uint32_t high;
+#if ENABLE_STATISTICS
+	//! Number of spans transitioned to global cache
+	uint32_t spans_to_global;
+	//! Number of spans transitioned from global cache
+	uint32_t spans_from_global;
+	//! Number of spans transitioned to thread cache
+	uint32_t spans_to_cache;
+	//! Number of spans transitioned from thread cache
+	uint32_t spans_from_cache;
+	//! Number of spans transitioned to reserved state
+	uint32_t spans_to_reserved;
+	//! Number of spans transitioned from reserved state
+	uint32_t spans_from_reserved;
+	//! Number of raw memory map calls
+	uint32_t spans_map_calls;
+#endif
 };
 typedef struct span_use_t span_use_t;
 #endif
 
+#if ENABLE_STATISTICS
+struct size_class_use_t {
+	//! Current number of allocations
+	atomic32_t alloc_current;
+	//! Peak number of allocations
+	int32_t alloc_peak;
+	//! Total number of allocations
+	int32_t alloc_total;
+	//! Total number of frees
+	atomic32_t free_total;
+	//! Number of spans in use
+	uint32_t spans_current;
+	//! Peak number of spans in use
+	uint32_t spans_peak;
+	//! Number of spans transitioned to cache
+	uint32_t spans_to_cache;
+	//! Number of spans transitioned from cache
+	uint32_t spans_from_cache;
+	//! Number of spans transitioned from reserved state
+	uint32_t spans_from_reserved;
+	//! Number of spans mapped
+	uint32_t spans_map_calls;
+};
+typedef struct size_class_use_t size_class_use_t;
+#endif
+
+typedef enum span_state_t {
+	SPAN_STATE_ACTIVE = 0,
+	SPAN_STATE_PARTIAL,
+	SPAN_STATE_FULL
+} span_state_t;
+
 //A span can either represent a single span of memory pages with size declared by span_map_count configuration variable,
 //or a set of spans in a contiguous region, a super span. Any reference to the term "span" usually refers to either a single
 //span or a super span. A super span can further be divided into multiple spans (or, again, super spans), where the first
@@ -294,43 +332,61 @@
 //in the same call to release the virtual memory range, but individual subranges can be decommitted individually
 //to reduce physical memory use).
 struct span_t {
-	//!	Heap ID
-	atomic32_t  heap_id;
+	//! Free list
+	void*       free_list;
+	//! State
+	uint32_t    state;
+	//! Used count when not active (not including deferred free list)
+	uint32_t    used_count;
+	//! Block count
+	uint32_t    block_count;
 	//! Size class
-	uint16_t    size_class;
+	uint32_t    size_class;
+	//! Index of last block initialized in free list
+	uint32_t    free_list_limit;
+	//! Span list size when part of a cache list, or size of deferred free list when partial/full
+	uint32_t    list_size;
+	//! Deferred free list
+	atomicptr_t free_list_deferred;
+	//! Size of a block
+	uint32_t    block_size;
 	//! Flags and counters
-	uint16_t    flags;
-	//! Span data depending on use
-	span_data_t data;
-	//! Total span counter for master spans, distance for subspans
-	uint32_t    total_spans_or_distance;
+	uint32_t    flags;
 	//! Number of spans
 	uint32_t    span_count;
+	//! Total span counter for master spans, distance for subspans
+	uint32_t    total_spans_or_distance;
 	//! Remaining span counter, for master spans
 	atomic32_t  remaining_spans;
 	//! Alignment offset
 	uint32_t    align_offset;
+	//! Owning heap
+	heap_t*     heap;
 	//! Next span
-	span_t*     next_span;
+	span_t*     next;
 	//! Previous span
-	span_t*     prev_span;
+	span_t*     prev;
 };
 _Static_assert(sizeof(span_t) <= SPAN_HEADER_SIZE, "span size mismatch");
 
+struct heap_class_t {
+	//! Free list of active span
+	void*        free_list;
+	//! Double linked list of partially used spans with free blocks for each size class.
+	//  Current active span is at head of list. Previous span pointer in head points to tail span of list.
+	span_t*      partial_span;
+};
+
 struct heap_t {
-	//! Heap ID
-	int32_t      id;
-	//! Free count for each size class active span
-	span_block_t active_block[SIZE_CLASS_COUNT];
-	//! Active span for each size class
-	span_t*      active_span[SIZE_CLASS_COUNT];
-	//! List of semi-used spans with free blocks for each size class (double linked list)
-	span_t*      size_cache[SIZE_CLASS_COUNT];
+	//! Active and semi-used span data per size class
+	heap_class_t span_class[SIZE_CLASS_COUNT];
 #if ENABLE_THREAD_CACHE
 	//! List of free spans (single linked list)
 	span_t*      span_cache[LARGE_CLASS_COUNT];
+	//! List of deferred free spans of class 0 (single linked list)
+	atomicptr_t  span_cache_deferred;
 #endif
-#if ENABLE_ADAPTIVE_THREAD_CACHE
+#if ENABLE_ADAPTIVE_THREAD_CACHE || ENABLE_STATISTICS
 	//! Current and high water mark of spans used per span count
 	span_use_t   span_use[LARGE_CLASS_COUNT];
 #endif
@@ -340,25 +396,27 @@
 	span_t*      span_reserve_master;
 	//! Number of mapped but unused spans
 	size_t       spans_reserved;
-	//! Deferred deallocation
-	atomicptr_t  defer_deallocate;
 	//! Next heap in id list
 	heap_t*      next_heap;
 	//! Next heap in orphan list
 	heap_t*      next_orphan;
 	//! Memory pages alignment offset
 	size_t       align_offset;
+	//! Heap ID
+	int32_t      id;
 #if ENABLE_STATISTICS
 	//! Number of bytes transitioned thread -> global
 	size_t       thread_to_global;
 	//! Number of bytes transitioned global -> thread
 	size_t       global_to_thread;
+	//! Allocation stats per size class
+	size_class_use_t size_class_use[SIZE_CLASS_COUNT + 1];
 #endif
 };
 
 struct size_class_t {
 	//! Size of blocks in this class
-	uint32_t size;
+	uint32_t block_size;
 	//! Number of blocks in each chunk
 	uint16_t block_count;
 	//! Class index this class is merged with
@@ -376,6 +434,8 @@
 };
 
 /// Global data
+//! Initialized flag
+static int _rpmalloc_initialized;
 //! Configuration
 static rpmalloc_config_t _memory_config;
 //! Memory page size
@@ -384,12 +444,19 @@
 static size_t _memory_page_size_shift;
 //! Granularity at which memory pages are mapped by OS
 static size_t _memory_map_granularity;
+#if RPMALLOC_CONFIGURABLE
 //! Size of a span of memory pages
 static size_t _memory_span_size;
 //! Shift to divide by span size
 static size_t _memory_span_size_shift;
 //! Mask to get to start of a memory span
 static uintptr_t _memory_span_mask;
+#else
+//! Hardwired span size (64KiB)
+#define _memory_span_size (64 * 1024)
+#define _memory_span_size_shift 16
+#define _memory_span_mask (~((uintptr_t)(_memory_span_size - 1)))
+#endif
 //! Number of spans to map in each map call
 static size_t _memory_span_map_count;
 //! Number of spans to release from thread cache to global cache (single spans)
@@ -417,16 +484,22 @@
 #if ENABLE_STATISTICS
 //! Active heap count
 static atomic32_t _memory_active_heaps;
-//! Total number of currently mapped memory pages
+//! Number of currently mapped memory pages
 static atomic32_t _mapped_pages;
-//! Total number of currently lost spans
+//! Peak number of concurrently mapped memory pages
+static int32_t _mapped_pages_peak;
+//! Number of currently unused spans
 static atomic32_t _reserved_spans;
 //! Running counter of total number of mapped memory pages since start
 static atomic32_t _mapped_total;
 //! Running counter of total number of unmapped memory pages since start
 static atomic32_t _unmapped_total;
-//! Total number of currently mapped memory pages in OS calls
+//! Number of currently mapped memory pages in OS calls
 static atomic32_t _mapped_pages_os;
+//! Number of currently allocated pages in huge allocations
+static atomic32_t _huge_pages_current;
+//! Peak number of currently allocated pages in huge allocations
+static int32_t _huge_pages_peak;
 #endif
 
 //! Current thread heap
@@ -445,9 +518,8 @@
 static _Thread_local heap_t* _memory_thread_heap TLS_MODEL;
 #endif
 
-//! Get the current thread heap
-static FORCEINLINE heap_t*
-get_thread_heap(void) {
+static inline heap_t*
+get_thread_heap_raw(void) {
 #if (defined(__APPLE__) || defined(__HAIKU__)) && ENABLE_PRELOAD
 	return pthread_getspecific(_memory_thread_heap);
 #else
@@ -455,6 +527,20 @@
 #endif
 }
 
+//! Get the current thread heap
+static inline heap_t*
+get_thread_heap(void) {
+	heap_t* heap = get_thread_heap_raw();
+#if ENABLE_PRELOAD
+	if (EXPECTED(heap != 0))
+		return heap;
+	rpmalloc_initialize();
+	return get_thread_heap_raw();
+#else
+	return heap;
+#endif
+}
+
 //! Set the current thread heap
 static void
 set_thread_heap(heap_t* heap) {
@@ -473,10 +559,6 @@
 static void
 _memory_unmap_os(void* address, size_t size, size_t offset, size_t release);
 
-//! Deallocate any deferred blocks and check for the given size class
-static void
-_memory_deallocate_deferred(heap_t* heap);
-
 //! Lookup a memory heap from heap ID
 static heap_t*
 _memory_heap_lookup(int32_t id) {
@@ -488,11 +570,29 @@
 }
 
 #if ENABLE_STATISTICS
+#  define _memory_statistics_inc(counter, value) counter += value
+#  define _memory_statistics_dec(counter, value) counter -= value
 #  define _memory_statistics_add(atomic_counter, value) atomic_add32(atomic_counter, (int32_t)(value))
+#  define _memory_statistics_add_peak(atomic_counter, value, peak) do { int32_t _cur_count = atomic_add32(atomic_counter, (int32_t)(value)); if (_cur_count > (peak)) peak = _cur_count; } while (0)
 #  define _memory_statistics_sub(atomic_counter, value) atomic_add32(atomic_counter, -(int32_t)(value))
+#  define _memory_statistics_inc_alloc(heap, class_idx) do { \
+	int32_t alloc_current = atomic_incr32(&heap->size_class_use[class_idx].alloc_current); \
+	if (alloc_current > heap->size_class_use[class_idx].alloc_peak) \
+		heap->size_class_use[class_idx].alloc_peak = alloc_current; \
+	heap->size_class_use[class_idx].alloc_total++; \
+} while(0)
+#  define _memory_statistics_inc_free(heap, class_idx) do { \
+	atomic_decr32(&heap->size_class_use[class_idx].alloc_current); \
+	atomic_incr32(&heap->size_class_use[class_idx].free_total); \
+} while(0)
 #else
+#  define _memory_statistics_inc(counter, value) do {} while(0)
+#  define _memory_statistics_dec(counter, value) do {} while(0)
 #  define _memory_statistics_add(atomic_counter, value) do {} while(0)
+#  define _memory_statistics_add_peak(atomic_counter, value, peak) do {} while (0)
 #  define _memory_statistics_sub(atomic_counter, value) do {} while(0)
+#  define _memory_statistics_inc_alloc(heap, class_idx) do {} while(0)
+#  define _memory_statistics_inc_free(heap, class_idx) do {} while(0)
 #endif
 
 static void
@@ -503,7 +603,7 @@
 _memory_map(size_t size, size_t* offset) {
 	assert(!(size % _memory_page_size));
 	assert(size >= _memory_page_size);
-	_memory_statistics_add(&_mapped_pages, (size >> _memory_page_size_shift));
+	_memory_statistics_add_peak(&_mapped_pages, (size >> _memory_page_size_shift), _mapped_pages_peak);
 	_memory_statistics_add(&_mapped_total, (size >> _memory_page_size_shift));
 	return _memory_config.memory_map(size, offset);
 }
@@ -521,78 +621,104 @@
 	_memory_config.memory_unmap(address, size, offset, release);
 }
 
+//! Declare the span to be a subspan (unless it is the master) and store its distance from the master span and its span count
+static void
+_memory_span_mark_as_subspan_unless_master(span_t* master, span_t* subspan, size_t span_count) {
+	assert((subspan != master) || (subspan->flags & SPAN_FLAG_MASTER));
+	if (subspan != master) {
+		subspan->flags = SPAN_FLAG_SUBSPAN;
+		subspan->total_spans_or_distance = (uint32_t)((uintptr_t)pointer_diff(subspan, master) >> _memory_span_size_shift);
+		subspan->align_offset = 0;
+	}
+	subspan->span_count = (uint32_t)span_count;
+}
+
+//! Use reserved spans to fulfill a memory map request (reserve size must be checked by caller)
+static span_t*
+_memory_map_from_reserve(heap_t* heap, size_t span_count) {
+	//Update the heap span reserve
+	span_t* span = heap->span_reserve;
+	heap->span_reserve = pointer_offset(span, span_count * _memory_span_size);
+	heap->spans_reserved -= span_count;
+
+	_memory_span_mark_as_subspan_unless_master(heap->span_reserve_master, span, span_count);
+	if (span_count <= LARGE_CLASS_COUNT)
+		_memory_statistics_inc(heap->span_use[span_count - 1].spans_from_reserved, 1);
+
+	return span;
+}
+
+//! Get the aligned number of spans to map, based on the wanted count, the configured mapping granularity and the page size
+static size_t
+_memory_map_align_span_count(size_t span_count) {
+	size_t request_count = (span_count > _memory_span_map_count) ? span_count : _memory_span_map_count;
+	if ((_memory_page_size > _memory_span_size) && ((request_count * _memory_span_size) % _memory_page_size))
+		request_count += _memory_span_map_count - (request_count % _memory_span_map_count);	
+	return request_count;
+}
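
A standalone sketch of the rounding rule in _memory_map_align_span_count, assuming 2MiB huge pages, 64KiB spans and the default span map count of 64; the example_ names and constants are illustrative.

#include <assert.h>
#include <stddef.h>

/* Illustrative configuration (assumptions): 2MiB pages, 64KiB spans, map 64 spans at a time */
#define EXAMPLE_PAGE_SIZE      (2 * 1024 * 1024)
#define EXAMPLE_SPAN_SIZE      (64 * 1024)
#define EXAMPLE_SPAN_MAP_COUNT 64

/* Same rounding as above: never map fewer spans than the configured map count, and
   keep the mapping a multiple of the map count when the page size exceeds the span
   size (huge pages), so no partial page is wasted. */
static size_t
example_align_span_count(size_t span_count) {
	size_t request_count = (span_count > EXAMPLE_SPAN_MAP_COUNT) ? span_count : EXAMPLE_SPAN_MAP_COUNT;
	if ((EXAMPLE_PAGE_SIZE > EXAMPLE_SPAN_SIZE) && ((request_count * EXAMPLE_SPAN_SIZE) % EXAMPLE_PAGE_SIZE))
		request_count += EXAMPLE_SPAN_MAP_COUNT - (request_count % EXAMPLE_SPAN_MAP_COUNT);
	return request_count;
}

int
main(void) {
	assert(example_align_span_count(3) == 64);    /* small requests still map the full 4MiB batch */
	assert(example_align_span_count(64) == 64);   /* already page aligned */
	assert(example_align_span_count(70) == 128);  /* 70 * 64KiB is not a page multiple, round up to 128 spans (8MiB) */
	return 0;
}
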
+
+//! Store the given spans as reserve in the given heap
+static void
+_memory_heap_set_reserved_spans(heap_t* heap, span_t* master, span_t* reserve, size_t reserve_span_count) {
+	heap->span_reserve_master = master;
+	heap->span_reserve = reserve;
+	heap->spans_reserved = reserve_span_count;
+}
+
+//! Setup a newly mapped span
+static void
+_memory_span_initialize(span_t* span, size_t total_span_count, size_t span_count, size_t align_offset) {
+	span->total_spans_or_distance = (uint32_t)total_span_count;
+	span->span_count = (uint32_t)span_count;
+	span->align_offset = (uint32_t)align_offset;
+	span->flags = SPAN_FLAG_MASTER;
+	atomic_store32(&span->remaining_spans, (int32_t)total_span_count);	
+}
+
+//! Map an aligned set of spans, taking the configured mapping granularity and the page size into account
+static span_t*
+_memory_map_aligned_span_count(heap_t* heap, size_t span_count) {
+	//If we already have some, but not enough, reserved spans, release those to heap cache and map a new
+	//full set of spans. Otherwise we would waste memory if page size > span size (huge pages)
+	size_t aligned_span_count = _memory_map_align_span_count(span_count);
+	size_t align_offset = 0;
+	span_t* span = _memory_map(aligned_span_count * _memory_span_size, &align_offset);
+	if (!span)
+		return 0;
+	_memory_span_initialize(span, aligned_span_count, span_count, align_offset);
+	_memory_statistics_add(&_reserved_spans, aligned_span_count);
+	if (span_count <= LARGE_CLASS_COUNT)
+		_memory_statistics_inc(heap->span_use[span_count - 1].spans_map_calls, 1);
+	if (aligned_span_count > span_count) {
+		if (heap->spans_reserved) {
+			_memory_span_mark_as_subspan_unless_master(heap->span_reserve_master, heap->span_reserve, heap->spans_reserved);
+			_memory_heap_cache_insert(heap, heap->span_reserve);
+		}
+		_memory_heap_set_reserved_spans(heap, span, pointer_offset(span, span_count * _memory_span_size), aligned_span_count - span_count);
+	}
+	return span;
+}
+
 //! Map in memory pages for the given number of spans (or use previously reserved pages)
 static span_t*
 _memory_map_spans(heap_t* heap, size_t span_count) {
-	if (span_count <= heap->spans_reserved) {
-		span_t* span = heap->span_reserve;
-		heap->span_reserve = pointer_offset(span, span_count * _memory_span_size);
-		heap->spans_reserved -= span_count;
-		if (span == heap->span_reserve_master) {
-			assert(span->flags & SPAN_FLAG_MASTER);
-		}
-		else {
-			//Declare the span to be a subspan with given distance from master span
-			uint32_t distance = (uint32_t)((uintptr_t)pointer_diff(span, heap->span_reserve_master) >> _memory_span_size_shift);
-			span->flags = SPAN_FLAG_SUBSPAN;
-			span->total_spans_or_distance = distance;
-			span->align_offset = 0;
-		}
-		span->span_count = (uint32_t)span_count;
-		return span;
-	}
-
-	//If we already have some, but not enough, reserved spans, release those to heap cache and map a new
-	//full set of spans. Otherwise we would waste memory if page size > span size (huge pages)
-	size_t request_spans = (span_count > _memory_span_map_count) ? span_count : _memory_span_map_count;
-	if ((_memory_page_size > _memory_span_size) && ((request_spans * _memory_span_size) % _memory_page_size))
-		request_spans += _memory_span_map_count - (request_spans % _memory_span_map_count);
-	size_t align_offset = 0;
-	span_t* span = _memory_map(request_spans * _memory_span_size, &align_offset);
-	if (!span)
-		return span;
-	span->align_offset = (uint32_t)align_offset;
-	span->total_spans_or_distance = (uint32_t)request_spans;
-	span->span_count = (uint32_t)span_count;
-	span->flags = SPAN_FLAG_MASTER;
-	atomic_store32(&span->remaining_spans, (int32_t)request_spans);
-	_memory_statistics_add(&_reserved_spans, request_spans);
-	if (request_spans > span_count) {
-		if (heap->spans_reserved) {
-			span_t* prev_span = heap->span_reserve;
-			if (prev_span == heap->span_reserve_master) {
-				assert(prev_span->flags & SPAN_FLAG_MASTER);
-			}
-			else {
-				uint32_t distance = (uint32_t)((uintptr_t)pointer_diff(prev_span, heap->span_reserve_master) >> _memory_span_size_shift);
-				prev_span->flags = SPAN_FLAG_SUBSPAN;
-				prev_span->total_spans_or_distance = distance;
-				prev_span->align_offset = 0;
-			}
-			prev_span->span_count = (uint32_t)heap->spans_reserved;
-			atomic_store32(&prev_span->heap_id, heap->id);
-			_memory_heap_cache_insert(heap, prev_span);
-		}
-		heap->span_reserve_master = span;
-		heap->span_reserve = pointer_offset(span, span_count * _memory_span_size);
-		heap->spans_reserved = request_spans - span_count;
-	}
-	return span;
+	if (span_count <= heap->spans_reserved)
+		return _memory_map_from_reserve(heap, span_count);
+	return _memory_map_aligned_span_count(heap, span_count);
 }
 
 //! Unmap memory pages for the given number of spans (or mark as unused if no partial unmappings)
 static void
 _memory_unmap_span(span_t* span) {
-	size_t span_count = span->span_count;
 	assert((span->flags & SPAN_FLAG_MASTER) || (span->flags & SPAN_FLAG_SUBSPAN));
 	assert(!(span->flags & SPAN_FLAG_MASTER) || !(span->flags & SPAN_FLAG_SUBSPAN));
 
 	int is_master = !!(span->flags & SPAN_FLAG_MASTER);
 	span_t* master = is_master ? span : (pointer_offset(span, -(int32_t)(span->total_spans_or_distance * _memory_span_size)));
-
 	assert(is_master || (span->flags & SPAN_FLAG_SUBSPAN));
 	assert(master->flags & SPAN_FLAG_MASTER);
 
+	size_t span_count = span->span_count;
 	if (!is_master) {
 		//Directly unmap subspans (unless huge pages, in which case we defer and unmap entire page range with master)
 		assert(span->align_offset == 0);
@@ -600,8 +726,7 @@
 			_memory_unmap(span, span_count * _memory_span_size, 0, 0);
 			_memory_statistics_sub(&_reserved_spans, span_count);
 		}
-	}
-	else {
+	} else {
 		//Special double flag to denote an unmapped master
 		//It must be kept in memory since span header must be used
 		span->flags |= SPAN_FLAG_MASTER | SPAN_FLAG_SUBSPAN;
@@ -623,47 +748,25 @@
 //! Unmap a single linked list of spans
 static void
 _memory_unmap_span_list(span_t* span) {
-	size_t list_size = span->data.list.size;
+	size_t list_size = span->list_size;
 	for (size_t ispan = 0; ispan < list_size; ++ispan) {
-		span_t* next_span = span->next_span;
+		span_t* next_span = span->next;
 		_memory_unmap_span(span);
 		span = next_span;
 	}
 	assert(!span);
 }
 
-//! Split a super span in two
-static span_t*
-_memory_span_split(span_t* span, size_t use_count) {
-	size_t current_count = span->span_count;
-	uint32_t distance = 0;
-	assert(current_count > use_count);
-	assert(!(span->flags & SPAN_FLAG_MASTER) || !(span->flags & SPAN_FLAG_SUBSPAN));
-	assert(!(span->flags & SPAN_FLAG_MASTER) || !(span->flags & SPAN_FLAG_SUBSPAN));
-
-	span->span_count = (uint32_t)use_count;
-	if (span->flags & SPAN_FLAG_SUBSPAN)
-		distance = span->total_spans_or_distance;
-
-	//Setup remainder as a subspan
-	span_t* subspan = pointer_offset(span, use_count * _memory_span_size);
-	subspan->flags = SPAN_FLAG_SUBSPAN;
-	subspan->total_spans_or_distance = (uint32_t)(distance + use_count);
-	subspan->span_count = (uint32_t)(current_count - use_count);
-	subspan->align_offset = 0;
-	return subspan;
-}
-
 //! Add span to head of single linked span list
 static size_t
 _memory_span_list_push(span_t** head, span_t* span) {
-	span->next_span = *head;
+	span->next = *head;
 	if (*head)
-		span->data.list.size = (*head)->data.list.size + 1;
+		span->list_size = (*head)->list_size + 1;
 	else
-		span->data.list.size = 1;
+		span->list_size = 1;
 	*head = span;
-	return span->data.list.size;
+	return span->list_size;
 }
 
 //! Remove span from head of single linked span list, returns the new list head
@@ -671,69 +774,99 @@
 _memory_span_list_pop(span_t** head) {
 	span_t* span = *head;
 	span_t* next_span = 0;
-	if (span->data.list.size > 1) {
-		next_span = span->next_span;
+	if (span->list_size > 1) {
+		assert(span->next);
+		next_span = span->next;
 		assert(next_span);
-		next_span->data.list.size = span->data.list.size - 1;
+		next_span->list_size = span->list_size - 1;
 	}
 	*head = next_span;
 	return span;
 }
 
-#endif
-#if ENABLE_THREAD_CACHE
-
 //! Split a single linked span list
 static span_t*
 _memory_span_list_split(span_t* span, size_t limit) {
 	span_t* next = 0;
 	if (limit < 2)
 		limit = 2;
-	if (span->data.list.size > limit) {
-		count_t list_size = 1;
+	if (span->list_size > limit) {
+		uint32_t list_size = 1;
 		span_t* last = span;
-		next = span->next_span;
+		next = span->next;
 		while (list_size < limit) {
 			last = next;
-			next = next->next_span;
+			next = next->next;
 			++list_size;
 		}
-		last->next_span = 0;
+		last->next = 0;
 		assert(next);
-		next->data.list.size = span->data.list.size - list_size;
-		span->data.list.size = list_size;
-		span->prev_span = 0;
+		next->list_size = span->list_size - list_size;
+		span->list_size = list_size;
+		span->prev = 0;
 	}
 	return next;
 }
 
 #endif
 
-//! Add a span to a double linked list
+//! Add a span to partial span double linked list at the head
 static void
-_memory_span_list_doublelink_add(span_t** head, span_t* span) {
+_memory_span_partial_list_add(span_t** head, span_t* span) {
 	if (*head) {
-		(*head)->prev_span = span;
-		span->next_span = *head;
-	}
-	else {
-		span->next_span = 0;
+		span->next = *head;
+		//Maintain pointer to tail span
+		span->prev = (*head)->prev;
+		(*head)->prev = span;
+	} else {
+		span->next = 0;
+		span->prev = span;
 	}
 	*head = span;
 }
 
-//! Remove a span from a double linked list
+//! Add a span to partial span double linked list at the tail
 static void
-_memory_span_list_doublelink_remove(span_t** head, span_t* span) {
-	if (*head == span) {
-		*head = span->next_span;
+_memory_span_partial_list_add_tail(span_t** head, span_t* span) {
+	span->next = 0;
+	if (*head) {
+		span_t* tail = (*head)->prev;
+		tail->next = span;
+		span->prev = tail;
+		//Maintain pointer to tail span
+		(*head)->prev = span;
+	} else {
+		span->prev = span;
+		*head = span;
 	}
-	else {
-		span_t* next_span = span->next_span;
-		span_t* prev_span = span->prev_span;
-		if (next_span)
-			next_span->prev_span = prev_span;
-		prev_span->next_span = next_span;
+}
+
+//! Pop head span from partial span double linked list
+static void
+_memory_span_partial_list_pop_head(span_t** head) {
+	span_t* span = *head;
+	*head = span->next;
+	if (*head) {
+		//Maintain pointer to tail span
+		(*head)->prev = span->prev;
+	}
+}
+
+//! Remove a span from partial span double linked list
+static void
+_memory_span_partial_list_remove(span_t** head, span_t* span) {
+	if (UNEXPECTED(*head == span)) {
+		_memory_span_partial_list_pop_head(head);
+	} else {
+		span_t* next_span = span->next;
+		span_t* prev_span = span->prev;
+		prev_span->next = next_span;
+		if (EXPECTED(next_span != 0)) {
+			next_span->prev = prev_span;
+		} else {
+			//Update pointer to tail span
+			(*head)->prev = prev_span;
+		}
 	}
 }
 
@@ -742,8 +875,8 @@
 //! Insert the given list of memory page spans in the global cache
 static void
 _memory_cache_insert(global_cache_t* cache, span_t* span, size_t cache_limit) {
-	assert((span->data.list.size == 1) || (span->next_span != 0));
-	int32_t list_size = (int32_t)span->data.list.size;
+	assert((span->list_size == 1) || (span->next != 0));
+	int32_t list_size = (int32_t)span->list_size;
 	//Unmap if cache has reached the limit
 	if (atomic_add32(&cache->size, list_size) > (int32_t)cache_limit) {
 #if !ENABLE_UNLIMITED_GLOBAL_CACHE
@@ -755,7 +888,7 @@
 	void* current_cache, *new_cache;
 	do {
 		current_cache = atomic_load_ptr(&cache->cache);
-		span->prev_span = (void*)((uintptr_t)current_cache & _memory_span_mask);
+		span->prev = (void*)((uintptr_t)current_cache & _memory_span_mask);
 		new_cache = (void*)((uintptr_t)span | ((uintptr_t)atomic_incr32(&cache->counter) & ~_memory_span_mask));
 	} while (!atomic_cas_ptr(&cache->cache, new_cache, current_cache));
 }
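
The global cache head above packs an incrementing counter into the low bits of the span pointer, bits that are free because spans are span-size aligned; the counter narrows the ABA window on the lock-free CAS. A standalone sketch of the tag/untag arithmetic, assuming 64KiB-aligned spans; the names and sample address are illustrative.

#include <assert.h>
#include <stdint.h>

/* Assumed span alignment: 64KiB, so the low 16 bits of a span pointer are zero and
   can carry an ABA counter alongside the pointer in a single atomic word. */
#define EXAMPLE_SPAN_SIZE ((uintptr_t)(64 * 1024))
#define EXAMPLE_SPAN_MASK (~(EXAMPLE_SPAN_SIZE - 1))

static uintptr_t
tag_pointer(void* span, uint32_t counter) {
	return (uintptr_t)span | ((uintptr_t)counter & ~EXAMPLE_SPAN_MASK);
}

static void*
untag_pointer(uintptr_t tagged) {
	return (void*)(tagged & EXAMPLE_SPAN_MASK);
}

int
main(void) {
	void* span = (void*)(uintptr_t)0x200000; /* hypothetical 64KiB-aligned address */
	uintptr_t tagged = tag_pointer(span, 12345);
	assert(untag_pointer(tagged) == span);
	assert((tagged & ~EXAMPLE_SPAN_MASK) == 12345);
	return 0;
}
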
@@ -771,9 +904,9 @@
 			span_t* span = (void*)span_ptr;
 			//By accessing the span ptr before it is swapped out of list we assume that a contending thread
 			//does not manage to traverse the span to being unmapped before we access it
-			void* new_cache = (void*)((uintptr_t)span->prev_span | ((uintptr_t)atomic_incr32(&cache->counter) & ~_memory_span_mask));
+			void* new_cache = (void*)((uintptr_t)span->prev | ((uintptr_t)atomic_incr32(&cache->counter) & ~_memory_span_mask));
 			if (atomic_cas_ptr(&cache->cache, new_cache, global_span)) {
-				atomic_add32(&cache->size, -(int32_t)span->data.list.size);
+				atomic_add32(&cache->size, -(int32_t)span->list_size);
 				return span;
 			}
 		}
@@ -787,8 +920,8 @@
 	void* current_cache = atomic_load_ptr(&cache->cache);
 	span_t* span = (void*)((uintptr_t)current_cache & _memory_span_mask);
 	while (span) {
-		span_t* skip_span = (void*)((uintptr_t)span->prev_span & _memory_span_mask);
-		atomic_add32(&cache->size, -(int32_t)span->data.list.size);
+		span_t* skip_span = (void*)((uintptr_t)span->prev & _memory_span_mask);
+		atomic_add32(&cache->size, -(int32_t)span->list_size);
 		_memory_unmap_span_list(span);
 		span = skip_span;
 	}
@@ -819,12 +952,39 @@
 
 #endif
 
+#if ENABLE_THREAD_CACHE
+//! Adopt the deferred span cache list
+static void
+_memory_heap_cache_adopt_deferred(heap_t* heap) {
+	atomic_thread_fence_acquire();
+	span_t* span = atomic_load_ptr(&heap->span_cache_deferred);
+	if (!span)
+		return;
+	do {
+		span = atomic_load_ptr(&heap->span_cache_deferred);
+	} while (!atomic_cas_ptr(&heap->span_cache_deferred, 0, span));
+	while (span) {
+		span_t* next_span = span->next;
+		_memory_span_list_push(&heap->span_cache[0], span);
+#if ENABLE_STATISTICS
+		atomic_decr32(&heap->span_use[span->span_count - 1].current);
+		++heap->size_class_use[span->size_class].spans_to_cache;
+		--heap->size_class_use[span->size_class].spans_current;
+#endif
+		span = next_span;
+	}
+}
+#endif
+
 //! Insert a single span into thread heap cache, releasing to global cache if overflow
 static void
 _memory_heap_cache_insert(heap_t* heap, span_t* span) {
 #if ENABLE_THREAD_CACHE
 	size_t span_count = span->span_count;
 	size_t idx = span_count - 1;
+	_memory_statistics_inc(heap->span_use[idx].spans_to_cache, 1);
+	if (!idx)
+		_memory_heap_cache_adopt_deferred(heap);
 #if ENABLE_UNLIMITED_THREAD_CACHE
 	_memory_span_list_push(&heap->span_cache[idx], span);
 #else
@@ -836,7 +996,7 @@
 	if (current_cache_size <= hard_limit) {
 #if ENABLE_ADAPTIVE_THREAD_CACHE
 		//Require 25% of high water mark to remain in cache (and at least 1, if use is 0)
-		size_t high_mark = heap->span_use[idx].high;
+		const size_t high_mark = heap->span_use[idx].high;
 		const size_t min_limit = (high_mark >> 2) + release_count + 1;
 		if (current_cache_size < min_limit)
 			return;
@@ -845,9 +1005,10 @@
 #endif
 	}
 	heap->span_cache[idx] = _memory_span_list_split(span, release_count);
-	assert(span->data.list.size == release_count);
+	assert(span->list_size == release_count);
 #if ENABLE_STATISTICS
-	heap->thread_to_global += (size_t)span->data.list.size * span_count * _memory_span_size;
+	heap->thread_to_global += (size_t)span->list_size * span_count * _memory_span_size;
+	heap->span_use[idx].spans_to_global += span->list_size;
 #endif
 #if ENABLE_GLOBAL_CACHE
 	_memory_global_cache_insert(span);
@@ -863,166 +1024,286 @@
 
 //! Extract the given number of spans from the different cache levels
 static span_t*
-_memory_heap_cache_extract(heap_t* heap, size_t span_count) {
+_memory_heap_thread_cache_extract(heap_t* heap, size_t span_count) {
 #if ENABLE_THREAD_CACHE
 	size_t idx = span_count - 1;
-	//Step 1: check thread cache
-	if (heap->span_cache[idx])
-		return _memory_span_list_pop(&heap->span_cache[idx]);
-#endif
-	//Step 2: Check reserved spans
-	if (heap->spans_reserved >= span_count)
-		return _memory_map_spans(heap, span_count);
-#if ENABLE_THREAD_CACHE
-	//Step 3: Check larger super spans and split if we find one
-	span_t* span = 0;
-	for (++idx; idx < LARGE_CLASS_COUNT; ++idx) {
-		if (heap->span_cache[idx]) {
-			span = _memory_span_list_pop(&heap->span_cache[idx]);
-			break;
-		}
-	}
-	if (span) {
-		//Mark the span as owned by this heap before splitting
-		size_t got_count = span->span_count;
-		assert(got_count > span_count);
-		atomic_store32(&span->heap_id, heap->id);
-		atomic_thread_fence_release();
-
-		//Split the span and store as reserved if no previously reserved spans, or in thread cache otherwise
-		span_t* subspan = _memory_span_split(span, span_count);
-		assert((span->span_count + subspan->span_count) == got_count);
-		assert(span->span_count == span_count);
-		if (!heap->spans_reserved) {
-			heap->spans_reserved = got_count - span_count;
-			heap->span_reserve = subspan;
-			heap->span_reserve_master = pointer_offset(subspan, -(int32_t)(subspan->total_spans_or_distance * _memory_span_size));
-		}
-		else {
-			_memory_heap_cache_insert(heap, subspan);
-		}
-		return span;
-	}
-#if ENABLE_GLOBAL_CACHE
-	//Step 4: Extract from global cache
-	idx = span_count - 1;
-	heap->span_cache[idx] = _memory_global_cache_extract(span_count);
+	if (!idx)
+		_memory_heap_cache_adopt_deferred(heap);
 	if (heap->span_cache[idx]) {
 #if ENABLE_STATISTICS
-		heap->global_to_thread += (size_t)heap->span_cache[idx]->data.list.size * span_count * _memory_span_size;
+		heap->span_use[idx].spans_from_cache++;
 #endif
 		return _memory_span_list_pop(&heap->span_cache[idx]);
 	}
 #endif
-#endif
 	return 0;
 }
 
-//! Allocate a small/medium sized memory block from the given heap
-static void*
-_memory_allocate_from_heap(heap_t* heap, size_t size) {
-	//Calculate the size class index and do a dependent lookup of the final class index (in case of merged classes)
-	const size_t base_idx = (size <= SMALL_SIZE_LIMIT) ?
-	                        ((size + (SMALL_GRANULARITY - 1)) >> SMALL_GRANULARITY_SHIFT) :
-	                        SMALL_CLASS_COUNT + ((size - SMALL_SIZE_LIMIT + (MEDIUM_GRANULARITY - 1)) >> MEDIUM_GRANULARITY_SHIFT);
-	assert(!base_idx || ((base_idx - 1) < SIZE_CLASS_COUNT));
-	const size_t class_idx = _memory_size_class[base_idx ? (base_idx - 1) : 0].class_idx;
+static span_t*
+_memory_heap_reserved_extract(heap_t* heap, size_t span_count) {
+	if (heap->spans_reserved >= span_count)
+		return _memory_map_spans(heap, span_count);
+	return 0;
+}
 
-	span_block_t* active_block = heap->active_block + class_idx;
-	size_class_t* size_class = _memory_size_class + class_idx;
-	const count_t class_size = size_class->size;
-
-	//Step 1: Try to get a block from the currently active span. The span block bookkeeping
-	//        data for the active span is stored in the heap for faster access
-use_active:
-	if (active_block->free_count) {
-		//Happy path, we have a span with at least one free block
-		span_t* span = heap->active_span[class_idx];
-		count_t offset = class_size * active_block->free_list;
-		uint32_t* block = pointer_offset(span, SPAN_HEADER_SIZE + offset);
-		assert(span && (atomic_load32(&span->heap_id) == heap->id));
-
-		if (active_block->free_count == 1) {
-			//Span is now completely allocated, set the bookkeeping data in the
-			//span itself and reset the active span pointer in the heap
-			span->data.block.free_count = active_block->free_count = 0;
-			span->data.block.first_autolink = 0xFFFF;
-			heap->active_span[class_idx] = 0;
-		}
-		else {
-			//Get the next free block, either from linked list or from auto link
-			++active_block->free_list;
-			if (active_block->free_list <= active_block->first_autolink)
-				active_block->free_list = (uint16_t)(*block);
-			assert(active_block->free_list < size_class->block_count);
-			--active_block->free_count;
-		}
-		return block;
-	}
-
-	//Step 2: No active span, try executing deferred deallocations and try again if there
-	//        was at least one of the requested size class
-	_memory_deallocate_deferred(heap);
-
-	//Step 3: Check if there is a semi-used span of the requested size class available
-	if (heap->size_cache[class_idx]) {
-		//Promote a pending semi-used span to be active, storing bookkeeping data in
-		//the heap structure for faster access
-		span_t* span = heap->size_cache[class_idx];
-		//Mark span as owned by this heap
-		atomic_store32(&span->heap_id, heap->id);
-		atomic_thread_fence_release();
-
-		*active_block = span->data.block;
-		assert(active_block->free_count > 0);
-		heap->size_cache[class_idx] = span->next_span;
-		heap->active_span[class_idx] = span;
-
-		goto use_active;
-	}
-
-	//Step 4: Find a span in one of the cache levels
-	span_t* span = _memory_heap_cache_extract(heap, 1);
-	if (!span) {
-		//Step 5: Map in more virtual memory
-		span = _memory_map_spans(heap, 1);
-		if (!span)
-			return span;
-	}
-
-#if ENABLE_ADAPTIVE_THREAD_CACHE
-	++heap->span_use[0].current;
-	if (heap->span_use[0].current > heap->span_use[0].high)
-		heap->span_use[0].high = heap->span_use[0].current;
+//! Extract a span from the global cache
+static span_t*
+_memory_heap_global_cache_extract(heap_t* heap, size_t span_count) {
+#if ENABLE_GLOBAL_CACHE
+	size_t idx = span_count - 1;
+	heap->span_cache[idx] = _memory_global_cache_extract(span_count);
+	if (heap->span_cache[idx]) {
+#if ENABLE_STATISTICS
+		heap->global_to_thread += (size_t)heap->span_cache[idx]->list_size * span_count * _memory_span_size;
+		heap->span_use[idx].spans_from_global += heap->span_cache[idx]->list_size;
 #endif
+		return _memory_span_list_pop(&heap->span_cache[idx]);
+	}
+#endif
+	return 0;
+}
 
-	//Mark span as owned by this heap and set base data
+//! Get a span from one of the cache levels (thread cache, reserved, global cache) or fallback to mapping more memory
+static span_t*
+_memory_heap_extract_new_span(heap_t* heap, size_t span_count, uint32_t class_idx) {
+	(void)sizeof(class_idx);
+#if ENABLE_ADAPTIVE_THREAD_CACHE || ENABLE_STATISTICS
+	uint32_t idx = (uint32_t)span_count - 1;
+	uint32_t current_count = (uint32_t)atomic_incr32(&heap->span_use[idx].current);
+	if (current_count > heap->span_use[idx].high)
+		heap->span_use[idx].high = current_count;
+#if ENABLE_STATISTICS
+	uint32_t spans_current = ++heap->size_class_use[class_idx].spans_current;
+	if (spans_current > heap->size_class_use[class_idx].spans_peak)
+		heap->size_class_use[class_idx].spans_peak = spans_current;
+#endif
+#endif	
+	span_t* span = _memory_heap_thread_cache_extract(heap, span_count);
+	if (EXPECTED(span != 0)) {
+		_memory_statistics_inc(heap->size_class_use[class_idx].spans_from_cache, 1);
+		return span;
+	}
+	span = _memory_heap_reserved_extract(heap, span_count);
+	if (EXPECTED(span != 0)) {
+		_memory_statistics_inc(heap->size_class_use[class_idx].spans_from_reserved, 1);
+		return span;
+	}
+	span = _memory_heap_global_cache_extract(heap, span_count);
+	if (EXPECTED(span != 0)) {
+		_memory_statistics_inc(heap->size_class_use[class_idx].spans_from_cache, 1);
+		return span;
+	}
+	//Final fallback, map in more virtual memory
+	span = _memory_map_spans(heap, span_count);
+	_memory_statistics_inc(heap->size_class_use[class_idx].spans_map_calls, 1);
+	return span;
+}
+
+//! Move the span (used for small or medium allocations) to the heap thread cache
+static void
+_memory_span_release_to_cache(heap_t* heap, span_t* span) {
+	heap_class_t* heap_class = heap->span_class + span->size_class;
+	assert(heap_class->partial_span != span);
+	if (span->state == SPAN_STATE_PARTIAL)
+		_memory_span_partial_list_remove(&heap_class->partial_span, span);
+#if ENABLE_ADAPTIVE_THREAD_CACHE || ENABLE_STATISTICS
+	atomic_decr32(&heap->span_use[0].current);
+#endif
+	_memory_statistics_inc(heap->span_use[0].spans_to_cache, 1);
+	_memory_statistics_inc(heap->size_class_use[span->size_class].spans_to_cache, 1);
+	_memory_statistics_dec(heap->size_class_use[span->size_class].spans_current, 1);
+	_memory_heap_cache_insert(heap, span);
+}
+
+//! Initialize a (partial) free list up to next system memory page, while reserving the first block
+//! as allocated, returning number of blocks in list
+static uint32_t
+free_list_partial_init(void** list, void** first_block, void* page_start, void* block_start,
+                       uint32_t block_count, uint32_t block_size) {
+	assert(block_count);
+	*first_block = block_start;
+	if (block_count > 1) {
+		void* free_block = pointer_offset(block_start, block_size);
+		void* block_end = pointer_offset(block_start, block_size * block_count);
+		//If block size is less than half a memory page, bound init to next memory page boundary
+		if (block_size < (_memory_page_size >> 1)) {
+			void* page_end = pointer_offset(page_start, _memory_page_size);
+			if (page_end < block_end)
+				block_end = page_end;
+		}
+		*list = free_block;
+		block_count = 2;
+		void* next_block = pointer_offset(free_block, block_size);
+		while (next_block < block_end) {
+			*((void**)free_block) = next_block;
+			free_block = next_block;
+			++block_count;
+			next_block = pointer_offset(next_block, block_size);
+		}
+		*((void**)free_block) = 0;
+	} else {
+		*list = 0;
+	}
+	return block_count;
+}
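
A worked example of the page-bounded initialization above, assuming 4KiB pages, 64KiB spans, the 96-byte span header and 16-byte blocks; the numbers hold for that configuration only.

#include <assert.h>
#include <stdint.h>

int
main(void) {
	/* Illustrative configuration: 4KiB pages, 64KiB spans, 96-byte header, 16-byte blocks */
	const uint32_t page_size   = 4096;
	const uint32_t span_size   = 64 * 1024;
	const uint32_t header_size = 96;
	const uint32_t block_size  = 16;

	uint32_t blocks_in_span       = (span_size - header_size) / block_size;
	uint32_t blocks_in_first_page = (page_size - header_size) / block_size;

	assert(blocks_in_span == 4090);
	/* Only one page worth of blocks is touched up front: one block is handed out as
	   the allocation, the rest go on the free list. The remaining blocks are linked
	   lazily as free_list_limit catches up with block_count in the fallback path. */
	assert(blocks_in_first_page == 250);
	return 0;
}
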
+
+//! Initialize an unused span (from cache or mapped) to be new active span
+static void*
+_memory_span_set_new_active(heap_t* heap, heap_class_t* heap_class, span_t* span, uint32_t class_idx) {
 	assert(span->span_count == 1);
-	span->size_class = (uint16_t)class_idx;
-	atomic_store32(&span->heap_id, heap->id);
+	size_class_t* size_class = _memory_size_class + class_idx;
+	span->size_class = class_idx;
+	span->heap = heap;
+	span->flags &= ~SPAN_FLAG_ALIGNED_BLOCKS;
+	span->block_count = size_class->block_count;
+	span->block_size = size_class->block_size;
+	span->state = SPAN_STATE_ACTIVE;
+	span->free_list = 0;
+
+	//Setup free list. Only initialize one system page worth of free blocks in list
+	void* block;
+	span->free_list_limit = free_list_partial_init(&heap_class->free_list, &block, 
+		span, pointer_offset(span, SPAN_HEADER_SIZE), size_class->block_count, size_class->block_size);
+	atomic_store_ptr(&span->free_list_deferred, 0);
+	span->list_size = 0;
 	atomic_thread_fence_release();
 
-	//If we only have one block we will grab it, otherwise
-	//set span as new span to use for next allocation
-	if (size_class->block_count > 1) {
-		//Reset block order to sequential auto linked order
-		active_block->free_count = (uint16_t)(size_class->block_count - 1);
-		active_block->free_list = 1;
-		active_block->first_autolink = 1;
-		heap->active_span[class_idx] = span;
-	}
-	else {
-		span->data.block.free_count = 0;
-		span->data.block.first_autolink = 0xFFFF;
-	}
+	_memory_span_partial_list_add(&heap_class->partial_span, span);
+	return block;
+}
 
-	//Return first block if memory page span
-	return pointer_offset(span, SPAN_HEADER_SIZE);
+//! Promote a partially used span (from heap used list) to be new active span
+static void
+_memory_span_set_partial_active(heap_class_t* heap_class, span_t* span) {
+	assert(span->state == SPAN_STATE_PARTIAL);
+	assert(span->block_count == _memory_size_class[span->size_class].block_count);
+	//Move data to heap size class and set span as active
+	heap_class->free_list = span->free_list;
+	span->state = SPAN_STATE_ACTIVE;
+	span->free_list = 0;
+	assert(heap_class->free_list);
+}
+
+//! Mark span as full (from active)
+static void
+_memory_span_set_active_full(heap_class_t* heap_class, span_t* span) {
+	assert(span->state == SPAN_STATE_ACTIVE);
+	assert(span == heap_class->partial_span);
+	_memory_span_partial_list_pop_head(&heap_class->partial_span);
+	span->used_count = span->block_count;
+	span->state = SPAN_STATE_FULL;
+	span->free_list = 0;
+}
+
+//! Move span from full to partial state
+static void
+_memory_span_set_full_partial(heap_t* heap, span_t* span) {
+	assert(span->state == SPAN_STATE_FULL);
+	heap_class_t* heap_class = &heap->span_class[span->size_class];
+	span->state = SPAN_STATE_PARTIAL;
+	_memory_span_partial_list_add_tail(&heap_class->partial_span, span);
+}
+
+static void*
+_memory_span_extract_deferred(span_t* span) {
+	void* free_list;
+	do {
+		free_list = atomic_load_ptr(&span->free_list_deferred);
+	} while ((free_list == INVALID_POINTER) || !atomic_cas_ptr(&span->free_list_deferred, INVALID_POINTER, free_list));
+	span->list_size = 0;
+	atomic_store_ptr(&span->free_list_deferred, 0);
+	atomic_thread_fence_release();
+	return free_list;
+}
+
+//! Pop first block from a free list
+static void*
+free_list_pop(void** list) {
+	void* block = *list;
+	*list = *((void**)block);
+	return block;
+}
+
+//! Allocate a small/medium sized memory block from the given heap
+static void*
+_memory_allocate_from_heap_fallback(heap_t* heap, uint32_t class_idx) {
+	heap_class_t* heap_class = &heap->span_class[class_idx];
+	void* block;
+
+	span_t* active_span = heap_class->partial_span;
+	if (EXPECTED(active_span != 0)) {
+		assert(active_span->state == SPAN_STATE_ACTIVE);
+		assert(active_span->block_count == _memory_size_class[active_span->size_class].block_count);
+		//Swap in free list if not empty
+		if (active_span->free_list) {
+			heap_class->free_list = active_span->free_list;
+			active_span->free_list = 0;
+			return free_list_pop(&heap_class->free_list);
+		}
+		//If the span did not fully initialize free list, link up another page worth of blocks
+		if (active_span->free_list_limit < active_span->block_count) {
+			void* block_start = pointer_offset(active_span, SPAN_HEADER_SIZE + (active_span->free_list_limit * active_span->block_size));
+			active_span->free_list_limit += free_list_partial_init(&heap_class->free_list, &block,
+				(void*)((uintptr_t)block_start & ~(_memory_page_size - 1)), block_start,
+				active_span->block_count - active_span->free_list_limit, active_span->block_size);
+			return block;
+		}
+		//Swap in deferred free list
+		atomic_thread_fence_acquire();
+		if (atomic_load_ptr(&active_span->free_list_deferred)) {
+			heap_class->free_list = _memory_span_extract_deferred(active_span);
+			return free_list_pop(&heap_class->free_list);
+		}
+
+		//If the active span is fully allocated, mark span as free floating (fully allocated and not part of any list)
+		assert(!heap_class->free_list);
+		assert(active_span->free_list_limit >= active_span->block_count);
+		_memory_span_set_active_full(heap_class, active_span);
+	}
+	assert(!heap_class->free_list);
+
+	//Try promoting a semi-used span to active
+	active_span = heap_class->partial_span;
+	if (EXPECTED(active_span != 0)) {
+		_memory_span_set_partial_active(heap_class, active_span);
+		return free_list_pop(&heap_class->free_list);
+	}
+	assert(!heap_class->free_list);
+	assert(!heap_class->partial_span);
+
+	//Find a span in one of the cache levels
+	active_span = _memory_heap_extract_new_span(heap, 1, class_idx);
+
+	//Mark span as owned by this heap and set base data, return first block
+	return _memory_span_set_new_active(heap, heap_class, active_span, class_idx);
+}
+
+//! Allocate a small sized memory block from the given heap
+static void*
+_memory_allocate_small(heap_t* heap, size_t size) {
+	//Small sizes have unique size classes
+	const uint32_t class_idx = (uint32_t)((size + (SMALL_GRANULARITY - 1)) >> SMALL_GRANULARITY_SHIFT);
+	_memory_statistics_inc_alloc(heap, class_idx);
+	if (EXPECTED(heap->span_class[class_idx].free_list != 0))
+		return free_list_pop(&heap->span_class[class_idx].free_list);
+	return _memory_allocate_from_heap_fallback(heap, class_idx);
+}
+
+//! Allocate a medium sized memory block from the given heap
+static void*
+_memory_allocate_medium(heap_t* heap, size_t size) {
+	//Calculate the size class index and do a dependent lookup of the final class index (in case of merged classes)
+	const uint32_t base_idx = (uint32_t)(SMALL_CLASS_COUNT + ((size - (SMALL_SIZE_LIMIT + 1)) >> MEDIUM_GRANULARITY_SHIFT));
+	const uint32_t class_idx = _memory_size_class[base_idx].class_idx;
+	_memory_statistics_inc_alloc(heap, class_idx);
+	if (EXPECTED(heap->span_class[class_idx].free_list != 0))
+		return free_list_pop(&heap->span_class[class_idx].free_list);
+	return _memory_allocate_from_heap_fallback(heap, class_idx);
 }
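
A standalone check of the size class index arithmetic used by _memory_allocate_small and _memory_allocate_medium, mirroring the granularity defines above; the medium value shown is the base index before the merged-class lookup through _memory_size_class.

#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Constants mirrored from the preconfigured limits above */
#define SMALL_GRANULARITY        16
#define SMALL_GRANULARITY_SHIFT  4
#define SMALL_CLASS_COUNT        65
#define SMALL_SIZE_LIMIT         (SMALL_GRANULARITY * (SMALL_CLASS_COUNT - 1))
#define MEDIUM_GRANULARITY_SHIFT 9

int
main(void) {
	/* Small sizes: one class per 16-byte step, so a 24-byte request lands in class 2 (32-byte blocks) */
	size_t size = 24;
	uint32_t small_idx = (uint32_t)((size + (SMALL_GRANULARITY - 1)) >> SMALL_GRANULARITY_SHIFT);
	assert(small_idx == 2);
	assert(SMALL_SIZE_LIMIT == 1024);

	/* Medium sizes: 512-byte steps above the 1024-byte small limit (base index before merging) */
	size = 2000;
	uint32_t medium_base_idx = (uint32_t)(SMALL_CLASS_COUNT + ((size - (SMALL_SIZE_LIMIT + 1)) >> MEDIUM_GRANULARITY_SHIFT));
	assert(medium_base_idx == 66);
	return 0;
}
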
 
 //! Allocate a large sized memory block from the given heap
 static void*
-_memory_allocate_large_from_heap(heap_t* heap, size_t size) {
+_memory_allocate_large(heap_t* heap, size_t size) {
 	//Calculate number of needed max sized spans (including header)
 	//Since this function is never called if size > LARGE_SIZE_LIMIT
 	//the span_count is guaranteed to be <= LARGE_CLASS_COUNT
@@ -1031,30 +1312,57 @@
 	if (size & (_memory_span_size - 1))
 		++span_count;
 	size_t idx = span_count - 1;
-#if ENABLE_ADAPTIVE_THREAD_CACHE
-	++heap->span_use[idx].current;
-	if (heap->span_use[idx].current > heap->span_use[idx].high)
-		heap->span_use[idx].high = heap->span_use[idx].current;
-#endif
 
-	//Step 1: Find span in one of the cache levels
-	span_t* span = _memory_heap_cache_extract(heap, span_count);
-	if (!span) {
-		//Step 2: Map in more virtual memory
-		span = _memory_map_spans(heap, span_count);
-		if (!span)
-			return span;
-	}
+	//Find a span in one of the cache levels
+	span_t* span = _memory_heap_extract_new_span(heap, span_count, SIZE_CLASS_COUNT);
 
 	//Mark span as owned by this heap and set base data
 	assert(span->span_count == span_count);
-	span->size_class = (uint16_t)(SIZE_CLASS_COUNT + idx);
-	atomic_store32(&span->heap_id, heap->id);
+	span->size_class = (uint32_t)(SIZE_CLASS_COUNT + idx);
+	span->heap = heap;
 	atomic_thread_fence_release();
 
 	return pointer_offset(span, SPAN_HEADER_SIZE);
 }
 
+//! Allocate a huge block by mapping memory pages directly
+static void*
+_memory_allocate_huge(size_t size) {
+	size += SPAN_HEADER_SIZE;
+	size_t num_pages = size >> _memory_page_size_shift;
+	if (size & (_memory_page_size - 1))
+		++num_pages;
+	size_t align_offset = 0;
+	span_t* span = _memory_map(num_pages * _memory_page_size, &align_offset);
+	if (!span)
+		return span;
+	//Store page count in span_count
+	span->size_class = (uint32_t)-1;
+	span->span_count = (uint32_t)num_pages;
+	span->align_offset = (uint32_t)align_offset;
+	_memory_statistics_add_peak(&_huge_pages_current, num_pages, _huge_pages_peak);
+
+	return pointer_offset(span, SPAN_HEADER_SIZE);
+}
+
+//! Allocate a block larger than medium size
+static void*
+_memory_allocate_oversized(heap_t* heap, size_t size) {
+	if (size <= LARGE_SIZE_LIMIT)
+		return _memory_allocate_large(heap, size);
+	return _memory_allocate_huge(size);
+}
+
+//! Allocate a block of the given size
+static void*
+_memory_allocate(heap_t* heap, size_t size) {
+	if (EXPECTED(size <= SMALL_SIZE_LIMIT))
+		return _memory_allocate_small(heap, size);
+	else if (size <= _memory_medium_size_limit)
+		return _memory_allocate_medium(heap, size);
+	return _memory_allocate_oversized(heap, size);
+}
+
 //! Allocate a new heap
 static heap_t*
 _memory_allocate_heap(void) {
@@ -1067,14 +1375,13 @@
 	atomic_thread_fence_acquire();
 	do {
 		raw_heap = atomic_load_ptr(&_memory_orphan_heaps);
-		heap = (void*)((uintptr_t)raw_heap & ~(uintptr_t)0xFF);
+		heap = (void*)((uintptr_t)raw_heap & ~(uintptr_t)0x1FF);
 		if (!heap)
 			break;
 		next_heap = heap->next_orphan;
 		orphan_counter = (uintptr_t)atomic_incr32(&_memory_orphan_counter);
-		next_raw_heap = (void*)((uintptr_t)next_heap | (orphan_counter & (uintptr_t)0xFF));
-	}
-	while (!atomic_cas_ptr(&_memory_orphan_heaps, next_raw_heap, raw_heap));
+		next_raw_heap = (void*)((uintptr_t)next_heap | (orphan_counter & (uintptr_t)0x1FF));
+	} while (!atomic_cas_ptr(&_memory_orphan_heaps, next_raw_heap, raw_heap));
 
 	if (!heap) {
 		//Map in pages for a new heap
@@ -1100,181 +1407,129 @@
 		} while (!atomic_cas_ptr(&_memory_heaps[list_idx], heap, next_heap));
 	}
 
-	//Clean up any deferred operations
-	_memory_deallocate_deferred(heap);
-
 	return heap;
 }
 
-//! Deallocate the given small/medium memory block from the given heap
+//! Deallocate the given small/medium memory block in the current thread local heap
 static void
-_memory_deallocate_to_heap(heap_t* heap, span_t* span, void* p) {
-	//Check if span is the currently active span in order to operate
-	//on the correct bookkeeping data
-	assert(span->span_count == 1);
-	const count_t class_idx = span->size_class;
-	size_class_t* size_class = _memory_size_class + class_idx;
-	int is_active = (heap->active_span[class_idx] == span);
-	span_block_t* block_data = is_active ?
-	                           heap->active_block + class_idx :
-	                           &span->data.block;
-
-	//Check if the span will become completely free
-	if (block_data->free_count == ((count_t)size_class->block_count - 1)) {
-		if (is_active) {
-			//If it was active, reset free block list
-			++block_data->free_count;
-			block_data->first_autolink = 0;
-			block_data->free_list = 0;
-		} else {
-			//If not active, remove from partial free list if we had a previous free
-			//block (guard for classes with only 1 block) and add to heap cache
-			if (block_data->free_count > 0)
-				_memory_span_list_doublelink_remove(&heap->size_cache[class_idx], span);
-#if ENABLE_ADAPTIVE_THREAD_CACHE
-			if (heap->span_use[0].current)
-				--heap->span_use[0].current;
-#endif
-			_memory_heap_cache_insert(heap, span);
-		}
+_memory_deallocate_direct(span_t* span, void* block) {
+	assert(span->heap == get_thread_heap_raw());
+	uint32_t state = span->state;
+	//Add block to free list
+	*((void**)block) = span->free_list;
+	span->free_list = block;
+	if (UNEXPECTED(state == SPAN_STATE_ACTIVE))
 		return;
-	}
-
-	//Check if first free block for this span (previously fully allocated)
-	if (block_data->free_count == 0) {
-		//add to free list and disable autolink
-		_memory_span_list_doublelink_add(&heap->size_cache[class_idx], span);
-		block_data->first_autolink = 0xFFFF;
-	}
-	++block_data->free_count;
-	//Span is not yet completely free, so add block to the linked list of free blocks
-	void* blocks_start = pointer_offset(span, SPAN_HEADER_SIZE);
-	count_t block_offset = (count_t)pointer_diff(p, blocks_start);
-	count_t block_idx = block_offset / (count_t)size_class->size;
-	uint32_t* block = pointer_offset(blocks_start, block_idx * size_class->size);
-	*block = block_data->free_list;
-	if (block_data->free_list > block_data->first_autolink)
-		block_data->first_autolink = block_data->free_list;
-	block_data->free_list = (uint16_t)block_idx;
+	uint32_t used = --span->used_count;
+	uint32_t free = span->list_size;
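+	//Once the used count equals the deferred free count every block has been freed,
+	//so the span can be released to the heap cache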
+	if (UNEXPECTED(used == free))
+		_memory_span_release_to_cache(span->heap, span);
+	else if (UNEXPECTED(state == SPAN_STATE_FULL))
+		_memory_span_set_full_partial(span->heap, span);
 }
 
-//! Deallocate the given large memory block to the given heap
+//! Put the block in the deferred free list of the owning span
 static void
-_memory_deallocate_large_to_heap(heap_t* heap, span_t* span) {
+_memory_deallocate_defer(span_t* span, void* block) {
+	atomic_thread_fence_acquire();
+	if (span->state == SPAN_STATE_FULL) {
+		if ((span->list_size + 1) == span->block_count) {
+			//Span will be completely freed by deferred deallocations, no other thread can
+			//currently touch it. Safe to move to owner heap deferred cache
+			span_t* last_head;
+			heap_t* heap = span->heap;
+			do {
+				last_head = atomic_load_ptr(&heap->span_cache_deferred);
+				span->next = last_head;
+			} while (!atomic_cas_ptr(&heap->span_cache_deferred, span, last_head));
+			return;
+		}
+	}
+
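+	//Lock the deferred free list by swapping in the INVALID_POINTER sentinel (other threads
+	//spin while it is held), link the block to the previous head and publish it as the new
+	//head, which also releases the sentinel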
+	void* free_list;
+	do {
+		atomic_thread_fence_acquire();
+		free_list = atomic_load_ptr(&span->free_list_deferred);
+		*((void**)block) = free_list;
+	} while ((free_list == INVALID_POINTER) || !atomic_cas_ptr(&span->free_list_deferred, INVALID_POINTER, free_list));
+	++span->list_size;
+	atomic_store_ptr(&span->free_list_deferred, block);
+}
+
+static void
+_memory_deallocate_small_or_medium(span_t* span, void* p) {
+	_memory_statistics_inc_free(span->heap, span->size_class);
+	if (span->flags & SPAN_FLAG_ALIGNED_BLOCKS) {
+		//Realign pointer to block start
+		void* blocks_start = pointer_offset(span, SPAN_HEADER_SIZE);
+		uint32_t block_offset = (uint32_t)pointer_diff(p, blocks_start);
+		p = pointer_offset(p, -(int32_t)(block_offset % span->block_size));
+	}
+	//Check if block belongs to this heap or if deallocation should be deferred
+	if (span->heap == get_thread_heap_raw())
+		_memory_deallocate_direct(span, p);
+	else
+		_memory_deallocate_defer(span, p);
+}
+
+//! Deallocate the given large memory block to the current heap
+static void
+_memory_deallocate_large(span_t* span) {
 	//Decrease counter
 	assert(span->span_count == ((size_t)span->size_class - SIZE_CLASS_COUNT + 1));
 	assert(span->size_class >= SIZE_CLASS_COUNT);
 	assert(span->size_class - SIZE_CLASS_COUNT < LARGE_CLASS_COUNT);
 	assert(!(span->flags & SPAN_FLAG_MASTER) || !(span->flags & SPAN_FLAG_SUBSPAN));
 	assert((span->flags & SPAN_FLAG_MASTER) || (span->flags & SPAN_FLAG_SUBSPAN));
-#if ENABLE_ADAPTIVE_THREAD_CACHE
+	//Large blocks can always be deallocated and transferred between heaps
+	//TODO: Investigate whether it is better to defer large spans as well through span_cache_deferred,
+	//possibly with some heuristic to pick either scheme at runtime per deallocation
+	heap_t* heap = get_thread_heap();
+#if ENABLE_ADAPTIVE_THREAD_CACHE || ENABLE_STATISTICS
 	size_t idx = span->span_count - 1;
-	if (heap->span_use[idx].current)
-		--heap->span_use[idx].current;
+	atomic_decr32(&span->heap->span_use[idx].current);
 #endif
 	if ((span->span_count > 1) && !heap->spans_reserved) {
 		heap->span_reserve = span;
 		heap->spans_reserved = span->span_count;
 		if (span->flags & SPAN_FLAG_MASTER) {
 			heap->span_reserve_master = span;
-		}
-		else { //SPAN_FLAG_SUBSPAN
+		} else { //SPAN_FLAG_SUBSPAN
 			uint32_t distance = span->total_spans_or_distance;
 			span_t* master = pointer_offset(span, -(int32_t)(distance * _memory_span_size));
 			heap->span_reserve_master = master;
 			assert(master->flags & SPAN_FLAG_MASTER);
 			assert(atomic_load32(&master->remaining_spans) >= (int32_t)span->span_count);
 		}
-	}
-	else {
+		_memory_statistics_inc(heap->span_use[idx].spans_to_reserved, 1);
+	} else {
 		//Insert into cache list
 		_memory_heap_cache_insert(heap, span);
 	}
 }
 
-//! Process pending deferred cross-thread deallocations
+//! Deallocate the given huge span
 static void
-_memory_deallocate_deferred(heap_t* heap) {
-	//Grab the current list of deferred deallocations
-	atomic_thread_fence_acquire();
-	void* p = atomic_load_ptr(&heap->defer_deallocate);
-	if (!p || !atomic_cas_ptr(&heap->defer_deallocate, 0, p))
-		return;
-	do {
-		void* next = *(void**)p;
-		span_t* span = (void*)((uintptr_t)p & _memory_span_mask);
-		_memory_deallocate_to_heap(heap, span, p);
-		p = next;
-	} while (p);
-}
-
-//! Defer deallocation of the given block to the given heap
-static void
-_memory_deallocate_defer(int32_t heap_id, void* p) {
-	//Get the heap and link in pointer in list of deferred operations
-	heap_t* heap = _memory_heap_lookup(heap_id);
-	if (!heap)
-		return;
-	void* last_ptr;
-	do {
-		last_ptr = atomic_load_ptr(&heap->defer_deallocate);
-		*(void**)p = last_ptr; //Safe to use block, it's being deallocated
-	} while (!atomic_cas_ptr(&heap->defer_deallocate, p, last_ptr));
-}
-
-//! Allocate a block of the given size
-static void*
-_memory_allocate(size_t size) {
-	if (size <= _memory_medium_size_limit)
-		return _memory_allocate_from_heap(get_thread_heap(), size);
-	else if (size <= LARGE_SIZE_LIMIT)
-		return _memory_allocate_large_from_heap(get_thread_heap(), size);
-
-	//Oversized, allocate pages directly
-	size += SPAN_HEADER_SIZE;
-	size_t num_pages = size >> _memory_page_size_shift;
-	if (size & (_memory_page_size - 1))
-		++num_pages;
-	size_t align_offset = 0;
-	span_t* span = _memory_map(num_pages * _memory_page_size, &align_offset);
-	if (!span)
-		return span;
-	atomic_store32(&span->heap_id, 0);
-	//Store page count in span_count
-	span->span_count = (uint32_t)num_pages;
-	span->align_offset = (uint32_t)align_offset;
-
-	return pointer_offset(span, SPAN_HEADER_SIZE);
+_memory_deallocate_huge(span_t* span) {
+	//Oversized allocation, page count is stored in span_count
+	size_t num_pages = span->span_count;
+	_memory_unmap(span, num_pages * _memory_page_size, span->align_offset, num_pages * _memory_page_size);
+	_memory_statistics_sub(&_huge_pages_current, num_pages);
 }
 
 //! Deallocate the given block
 static void
 _memory_deallocate(void* p) {
-	if (!p)
-		return;
-
 	//Grab the span (always at start of span, using span alignment)
 	span_t* span = (void*)((uintptr_t)p & _memory_span_mask);
-	int32_t heap_id = atomic_load32(&span->heap_id);
-	if (heap_id) {
-		heap_t* heap = get_thread_heap();
-		if (span->size_class < SIZE_CLASS_COUNT) {
-			//Check if block belongs to this heap or if deallocation should be deferred
-			if (heap->id == heap_id)
-				_memory_deallocate_to_heap(heap, span, p);
-			else
-				_memory_deallocate_defer(heap_id, p);
-		}
-		else {
-			//Large blocks can always be deallocated and transferred between heaps
-			_memory_deallocate_large_to_heap(heap, span);
-		}
-	}
-	else {
-		//Oversized allocation, page count is stored in span_count
-		size_t num_pages = span->span_count;
-		_memory_unmap(span, num_pages * _memory_page_size, span->align_offset, num_pages * _memory_page_size);
-	}
+	if (UNEXPECTED(!span))
+		return;
+	if (EXPECTED(span->size_class < SIZE_CLASS_COUNT))
+		_memory_deallocate_small_or_medium(span, p);
+	else if (span->size_class != (uint32_t)-1)
+		_memory_deallocate_large(span);
+	else
+		_memory_deallocate_huge(span);
 }
 
 //! Reallocate the given block to the given size
@@ -1283,37 +1538,41 @@
 	if (p) {
 		//Grab the span using guaranteed span alignment
 		span_t* span = (void*)((uintptr_t)p & _memory_span_mask);
-		int32_t heap_id = atomic_load32(&span->heap_id);
-		if (heap_id) {
+		if (span->heap) {
 			if (span->size_class < SIZE_CLASS_COUNT) {
 				//Small/medium sized block
 				assert(span->span_count == 1);
-				size_class_t* size_class = _memory_size_class + span->size_class;
 				void* blocks_start = pointer_offset(span, SPAN_HEADER_SIZE);
-				count_t block_offset = (count_t)pointer_diff(p, blocks_start);
-				count_t block_idx = block_offset / (count_t)size_class->size;
-				void* block = pointer_offset(blocks_start, block_idx * size_class->size);
-				if ((size_t)size_class->size >= size)
-					return block; //Still fits in block, never mind trying to save memory
+				uint32_t block_offset = (uint32_t)pointer_diff(p, blocks_start);
+				uint32_t block_idx = block_offset / span->block_size;
+				void* block = pointer_offset(blocks_start, block_idx * span->block_size);
 				if (!oldsize)
-					oldsize = size_class->size - (uint32_t)pointer_diff(p, block);
-			}
-			else {
+					oldsize = span->block_size - (uint32_t)pointer_diff(p, block);
+				if ((size_t)span->block_size >= size) {
+					//Still fits in block, never mind trying to save memory, but preserve data if alignment changed
+					if ((p != block) && !(flags & RPMALLOC_NO_PRESERVE))
+						memmove(block, p, oldsize);
+					return block;
+				}
+			} else {
 				//Large block
 				size_t total_size = size + SPAN_HEADER_SIZE;
 				size_t num_spans = total_size >> _memory_span_size_shift;
 				if (total_size & (_memory_span_mask - 1))
 					++num_spans;
-				size_t current_spans = (span->size_class - SIZE_CLASS_COUNT) + 1;
-				assert(current_spans == span->span_count);
+				size_t current_spans = span->span_count;
+				assert(current_spans == ((span->size_class - SIZE_CLASS_COUNT) + 1));
 				void* block = pointer_offset(span, SPAN_HEADER_SIZE);
-				if ((current_spans >= num_spans) && (num_spans >= (current_spans / 2)))
-					return block; //Still fits and less than half of memory would be freed
 				if (!oldsize)
-					oldsize = (current_spans * _memory_span_size) - (size_t)pointer_diff(p, span);
+					oldsize = (current_spans * _memory_span_size) - (size_t)pointer_diff(p, block) - SPAN_HEADER_SIZE;
+				if ((current_spans >= num_spans) && (num_spans >= (current_spans / 2))) {
+					//Still fits in block, never mind trying to save memory, but preserve data if alignment changed
+					if ((p != block) && !(flags & RPMALLOC_NO_PRESERVE))
+						memmove(block, p, oldsize);
+					return block;
+				}
 			}
-		}
-		else {
+		} else {
 			//Oversized block
 			size_t total_size = size + SPAN_HEADER_SIZE;
 			size_t num_pages = total_size >> _memory_page_size_shift;
@@ -1322,20 +1581,28 @@
 			//Page count is stored in span_count
 			size_t current_pages = span->span_count;
 			void* block = pointer_offset(span, SPAN_HEADER_SIZE);
-			if ((current_pages >= num_pages) && (num_pages >= (current_pages / 2)))
-				return block; //Still fits and less than half of memory would be freed
 			if (!oldsize)
-				oldsize = (current_pages * _memory_page_size) - (size_t)pointer_diff(p, span);
+				oldsize = (current_pages * _memory_page_size) - (size_t)pointer_diff(p, block) - SPAN_HEADER_SIZE;
+			if ((current_pages >= num_pages) && (num_pages >= (current_pages / 2))) {
+				//Still fits in block, never mind trying to save memory, but preserve data if alignment changed
+				if ((p != block) && !(flags & RPMALLOC_NO_PRESERVE))
+					memmove(block, p, oldsize);
+				return block;
+			}
 		}
+	} else {
+		oldsize = 0;
 	}
 
 	//Size is greater than block size, need to allocate a new block and deallocate the old
+	heap_t* heap = get_thread_heap();
 	//Avoid hysteresis by overallocating if increase is small (below 37%)
 	size_t lower_bound = oldsize + (oldsize >> 2) + (oldsize >> 3);
-	void* block = _memory_allocate((size > lower_bound) ? size : ((size > oldsize) ? lower_bound : size));
-	if (p) {
+	size_t new_size = (size > lower_bound) ? size : ((size > oldsize) ? lower_bound : size);
+	void* block = _memory_allocate(heap, new_size);
+	if (p && block) {
 		if (!(flags & RPMALLOC_NO_PRESERVE))
-			memcpy(block, p, oldsize < size ? oldsize : size);
+			memcpy(block, p, oldsize < new_size ? oldsize : new_size);
 		_memory_deallocate(p);
 	}
 
@@ -1347,13 +1614,11 @@
 _memory_usable_size(void* p) {
 	//Grab the span using guaranteed span alignment
 	span_t* span = (void*)((uintptr_t)p & _memory_span_mask);
-	int32_t heap_id = atomic_load32(&span->heap_id);
-	if (heap_id) {
+	if (span->heap) {
 		//Small/medium block
 		if (span->size_class < SIZE_CLASS_COUNT) {
-			size_class_t* size_class = _memory_size_class + span->size_class;
 			void* blocks_start = pointer_offset(span, SPAN_HEADER_SIZE);
-			return size_class->size - ((size_t)pointer_diff(p, blocks_start) % size_class->size);
+			return span->block_size - ((size_t)pointer_diff(p, blocks_start) % span->block_size);
 		}
 
 		//Large block
@@ -1369,7 +1634,7 @@
 //! Adjust and optimize the size class properties for the given class
 static void
 _memory_adjust_size_class(size_t iclass) {
-	size_t block_size = _memory_size_class[iclass].size;
+	size_t block_size = _memory_size_class[iclass].block_size;
 	size_t block_count = (_memory_span_size - SPAN_HEADER_SIZE) / block_size;
 
 	_memory_size_class[iclass].block_count = (uint16_t)block_count;
@@ -1380,18 +1645,73 @@
 	while (prevclass > 0) {
 		--prevclass;
 		//A class can be merged if number of pages and number of blocks are equal
-		if (_memory_size_class[prevclass].block_count == _memory_size_class[iclass].block_count) {
+		if (_memory_size_class[prevclass].block_count == _memory_size_class[iclass].block_count)
 			memcpy(_memory_size_class + prevclass, _memory_size_class + iclass, sizeof(_memory_size_class[iclass]));
-		}
-		else {
+		else
 			break;
-		}
 	}
 }
 
-#if defined(_WIN32) || defined(__WIN32__) || defined(_WIN64)
-#  include <Windows.h>
+static void
+_memory_heap_finalize(void* heapptr) {
+	heap_t* heap = heapptr;
+	if (!heap)
+		return;
+	//Release thread cache spans back to global cache
+#if ENABLE_THREAD_CACHE
+	_memory_heap_cache_adopt_deferred(heap);
+	for (size_t iclass = 0; iclass < LARGE_CLASS_COUNT; ++iclass) {
+		span_t* span = heap->span_cache[iclass];
+#if ENABLE_GLOBAL_CACHE
+		while (span) {
+			assert(span->span_count == (iclass + 1));
+			size_t release_count = (!iclass ? _memory_span_release_count : _memory_span_release_count_large);
+			span_t* next = _memory_span_list_split(span, (uint32_t)release_count);
+#if ENABLE_STATISTICS
+			heap->thread_to_global += (size_t)span->list_size * span->span_count * _memory_span_size;
+			heap->span_use[iclass].spans_to_global += span->list_size;
+#endif
+			_memory_global_cache_insert(span);
+			span = next;
+		}
 #else
+		if (span)
+			_memory_unmap_span_list(span);
+#endif
+		heap->span_cache[iclass] = 0;
+	}
+#endif
+
+	//Orphan the heap
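+	//(the low 9 bits of the list head carry the same ABA counter that is masked off when
+	//adopting an orphaned heap in _memory_allocate_heap)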
+	void* raw_heap;
+	uintptr_t orphan_counter;
+	heap_t* last_heap;
+	do {
+		last_heap = atomic_load_ptr(&_memory_orphan_heaps);
+		heap->next_orphan = (void*)((uintptr_t)last_heap & ~(uintptr_t)0x1FF);
+		orphan_counter = (uintptr_t)atomic_incr32(&_memory_orphan_counter);
+		raw_heap = (void*)((uintptr_t)heap | (orphan_counter & (uintptr_t)0x1FF));
+	} while (!atomic_cas_ptr(&_memory_orphan_heaps, raw_heap, last_heap));
+
+	set_thread_heap(0);
+
+#if ENABLE_STATISTICS
+	atomic_decr32(&_memory_active_heaps);
+	assert(atomic_load32(&_memory_active_heaps) >= 0);
+#endif
+}
+
+#if defined(_MSC_VER) && !defined(__clang__) && (!defined(BUILD_DYNAMIC_LINK) || !BUILD_DYNAMIC_LINK)
+#include <fibersapi.h>
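+//When built with MSVC without dynamic linking, use a fiber local storage (FLS) callback
+//to finalize the calling thread's heap when the thread exits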
+static DWORD fls_key;
+static void NTAPI
+rp_thread_destructor(void* value) {
+	if (value)
+		rpmalloc_thread_finalize();
+}
+#endif
+
+#if PLATFORM_POSIX
 #  include <sys/mman.h>
 #  include <sched.h>
 #  ifdef __FreeBSD__
@@ -1405,14 +1725,24 @@
 #include <errno.h>
 
 //! Initialize the allocator and setup global data
-int
+extern inline int
 rpmalloc_initialize(void) {
+	if (_rpmalloc_initialized) {
+		rpmalloc_thread_initialize();
+		return 0;
+	}
 	memset(&_memory_config, 0, sizeof(rpmalloc_config_t));
 	return rpmalloc_initialize_config(0);
 }
 
 int
 rpmalloc_initialize_config(const rpmalloc_config_t* config) {
+	if (_rpmalloc_initialized) {
+		rpmalloc_thread_initialize();
+		return 0;
+	}
+	_rpmalloc_initialized = 1;
+
 	if (config)
 		memcpy(&_memory_config, config, sizeof(rpmalloc_config_t));
 
@@ -1421,8 +1751,12 @@
 		_memory_config.memory_unmap = _memory_unmap_os;
 	}
 
-	_memory_huge_pages = 0;
+#if RPMALLOC_CONFIGURABLE
 	_memory_page_size = _memory_config.page_size;
+#else
+	_memory_page_size = 0;
+#endif
+	_memory_huge_pages = 0;
 	_memory_map_granularity = _memory_page_size;
 	if (!_memory_page_size) {
 #if PLATFORM_WINDOWS
@@ -1493,12 +1827,12 @@
 #endif
 		}
 #endif
-	}
-	else {
+	} else {
 		if (config && config->enable_huge_pages)
 			_memory_huge_pages = 1;
 	}
 
+	//The ABA counter in the heap orphan list uses the low 9 bits of the pointer (bitmask 0x1FF),
+	//which requires heaps to be aligned to at least 512 bytes
 	if (_memory_page_size < 512)
 		_memory_page_size = 512;
 	if (_memory_page_size > (64 * 1024 * 1024))
@@ -1511,19 +1845,20 @@
 	}
 	_memory_page_size = ((size_t)1 << _memory_page_size_shift);
 
+#if RPMALLOC_CONFIGURABLE
 	size_t span_size = _memory_config.span_size;
 	if (!span_size)
 		span_size = (64 * 1024);
 	if (span_size > (256 * 1024))
 		span_size = (256 * 1024);
 	_memory_span_size = 4096;
-	_memory_span_size = 4096;
 	_memory_span_size_shift = 12;
 	while (_memory_span_size < span_size) {
 		_memory_span_size <<= 1;
 		++_memory_span_size_shift;
 	}
 	_memory_span_mask = ~(uintptr_t)(_memory_span_size - 1);
+#endif
 
 	_memory_span_map_count = ( _memory_config.span_map_count ? _memory_config.span_map_count : DEFAULT_SPAN_MAP_COUNT);
 	if ((_memory_span_size * _memory_span_map_count) < _memory_page_size)
@@ -1540,9 +1875,12 @@
 	_memory_span_release_count_large = (_memory_span_release_count > 8 ? (_memory_span_release_count / 4) : 2);
 
 #if (defined(__APPLE__) || defined(__HAIKU__)) && ENABLE_PRELOAD
-	if (pthread_key_create(&_memory_thread_heap, 0))
+	if (pthread_key_create(&_memory_thread_heap, _memory_heap_finalize))
 		return -1;
 #endif
+#if defined(_MSC_VER) && !defined(__clang__) && (!defined(BUILD_DYNAMIC_LINK) || !BUILD_DYNAMIC_LINK)
+	fls_key = FlsAlloc(&rp_thread_destructor);
+#endif
 
 	atomic_store32(&_memory_heap_id, 0);
 	atomic_store32(&_memory_orphan_counter, 0);
@@ -1550,27 +1888,32 @@
 	atomic_store32(&_memory_active_heaps, 0);
 	atomic_store32(&_reserved_spans, 0);
 	atomic_store32(&_mapped_pages, 0);
+	_mapped_pages_peak = 0;
 	atomic_store32(&_mapped_total, 0);
 	atomic_store32(&_unmapped_total, 0);
 	atomic_store32(&_mapped_pages_os, 0);
+	atomic_store32(&_huge_pages_current, 0);
+	_huge_pages_peak = 0;
 #endif
 
 	//Setup all small and medium size classes
-	size_t iclass;
-	for (iclass = 0; iclass < SMALL_CLASS_COUNT; ++iclass) {
-		size_t size = (iclass + 1) * SMALL_GRANULARITY;
-		_memory_size_class[iclass].size = (uint16_t)size;
+	size_t iclass = 0;
+	_memory_size_class[iclass].block_size = SMALL_GRANULARITY;
+	_memory_adjust_size_class(iclass);
+	for (iclass = 1; iclass < SMALL_CLASS_COUNT; ++iclass) {
+		size_t size = iclass * SMALL_GRANULARITY;
+		_memory_size_class[iclass].block_size = (uint32_t)size;
 		_memory_adjust_size_class(iclass);
 	}
-
-	_memory_medium_size_limit = _memory_span_size - SPAN_HEADER_SIZE;
+	//Require at least two blocks per span for medium sizes, otherwise fall back to large allocations
+	_memory_medium_size_limit = (_memory_span_size - SPAN_HEADER_SIZE) >> 1;
 	if (_memory_medium_size_limit > MEDIUM_SIZE_LIMIT)
 		_memory_medium_size_limit = MEDIUM_SIZE_LIMIT;
 	for (iclass = 0; iclass < MEDIUM_CLASS_COUNT; ++iclass) {
 		size_t size = SMALL_SIZE_LIMIT + ((iclass + 1) * MEDIUM_GRANULARITY);
 		if (size > _memory_medium_size_limit)
-			size = _memory_medium_size_limit;
-		_memory_size_class[SMALL_CLASS_COUNT + iclass].size = (uint16_t)size;
+			break;
+		_memory_size_class[SMALL_CLASS_COUNT + iclass].block_size = (uint32_t)size;
 		_memory_adjust_size_class(SMALL_CLASS_COUNT + iclass);
 	}
 
@@ -1588,34 +1931,50 @@
 	atomic_thread_fence_acquire();
 
 	rpmalloc_thread_finalize();
-
-#if ENABLE_STATISTICS
-	//If you hit this assert, you still have active threads or forgot to finalize some thread(s)
-	assert(atomic_load32(&_memory_active_heaps) == 0);
-#endif
+	//rpmalloc_dump_statistics(stderr);
 
 	//Free all thread caches
 	for (size_t list_idx = 0; list_idx < HEAP_ARRAY_SIZE; ++list_idx) {
 		heap_t* heap = atomic_load_ptr(&_memory_heaps[list_idx]);
 		while (heap) {
-			_memory_deallocate_deferred(heap);
-
 			if (heap->spans_reserved) {
 				span_t* span = _memory_map_spans(heap, heap->spans_reserved);
 				_memory_unmap_span(span);
 			}
 
 			for (size_t iclass = 0; iclass < SIZE_CLASS_COUNT; ++iclass) {
-				span_t* span = heap->active_span[iclass];
-				if (span && (heap->active_block[iclass].free_count == _memory_size_class[iclass].block_count)) {
-					heap->active_span[iclass] = 0;
-					heap->active_block[iclass].free_count = 0;
-					_memory_heap_cache_insert(heap, span);
+				heap_class_t* heap_class = heap->span_class + iclass;
+				span_t* span = heap_class->partial_span;
+				while (span) {
+					span_t* next = span->next;
+					if (span->state == SPAN_STATE_ACTIVE) {
+						uint32_t used_blocks = span->block_count;
+						if (span->free_list_limit < span->block_count)
+							used_blocks = span->free_list_limit;
+						uint32_t free_blocks = 0;
+						void* block = heap_class->free_list;
+						while (block) {
+							++free_blocks;
+							block = *((void**)block);
+						}
+						block = span->free_list;
+						while (block) {
+							++free_blocks;
+							block = *((void**)block);
+						}
+						if (used_blocks == (free_blocks + span->list_size))
+							_memory_heap_cache_insert(heap, span);
+					} else {
+						if (span->used_count == span->list_size)
+							_memory_heap_cache_insert(heap, span);
+					}
+					span = next;
 				}
 			}
 
-			//Free span caches (other thread might have deferred after the thread using this heap finalized)
 #if ENABLE_THREAD_CACHE
+			//Free span caches (other thread might have deferred after the thread using this heap finalized)
+			_memory_heap_cache_adopt_deferred(heap);
 			for (size_t iclass = 0; iclass < LARGE_CLASS_COUNT; ++iclass) {
 				if (heap->span_cache[iclass])
 					_memory_unmap_span_list(heap->span_cache[iclass]);
@@ -1637,6 +1996,13 @@
 	atomic_store_ptr(&_memory_orphan_heaps, 0);
 	atomic_thread_fence_release();
 
+#if (defined(__APPLE__) || defined(__HAIKU__)) && ENABLE_PRELOAD
+	pthread_key_delete(_memory_thread_heap);
+#endif
+#if defined(_MSC_VER) && !defined(__clang__) && (!defined(BUILD_DYNAMIC_LINK) || !BUILD_DYNAMIC_LINK)
+	FlsFree(fls_key);
+#endif
+
 #if ENABLE_STATISTICS
 	//If you hit these asserts you probably have memory leaks or double frees in your code
 	assert(!atomic_load32(&_mapped_pages));
@@ -1644,23 +2010,23 @@
 	assert(!atomic_load32(&_mapped_pages_os));
 #endif
 
-#if (defined(__APPLE__) || defined(__HAIKU__)) && ENABLE_PRELOAD
-	pthread_key_delete(_memory_thread_heap);
-#endif
+	_rpmalloc_initialized = 0;
 }
 
 //! Initialize thread, assign heap
-void
+extern inline void
 rpmalloc_thread_initialize(void) {
-	if (!get_thread_heap()) {
+	if (!get_thread_heap_raw()) {
 		heap_t* heap = _memory_allocate_heap();
 		if (heap) {
+			atomic_thread_fence_acquire();
 #if ENABLE_STATISTICS
 			atomic_incr32(&_memory_active_heaps);
-			heap->thread_to_global = 0;
-			heap->global_to_thread = 0;
 #endif
 			set_thread_heap(heap);
+#if defined(_MSC_VER) && !defined(__clang__) && (!defined(BUILD_DYNAMIC_LINK) || !BUILD_DYNAMIC_LINK)
+			FlsSetValue(fls_key, heap);
+#endif
 		}
 	}
 }
@@ -1668,63 +2034,14 @@
 //! Finalize thread, orphan heap
 void
 rpmalloc_thread_finalize(void) {
-	heap_t* heap = get_thread_heap();
-	if (!heap)
-		return;
-
-	_memory_deallocate_deferred(heap);
-
-	for (size_t iclass = 0; iclass < SIZE_CLASS_COUNT; ++iclass) {
-		span_t* span = heap->active_span[iclass];
-		if (span && (heap->active_block[iclass].free_count == _memory_size_class[iclass].block_count)) {
-			heap->active_span[iclass] = 0;
-			heap->active_block[iclass].free_count = 0;
-			_memory_heap_cache_insert(heap, span);
-		}
-	}
-
-	//Release thread cache spans back to global cache
-#if ENABLE_THREAD_CACHE
-	for (size_t iclass = 0; iclass < LARGE_CLASS_COUNT; ++iclass) {
-		span_t* span = heap->span_cache[iclass];
-#if ENABLE_GLOBAL_CACHE
-		while (span) {
-			assert(span->span_count == (iclass + 1));
-			size_t release_count = (!iclass ? _memory_span_release_count : _memory_span_release_count_large);
-			span_t* next = _memory_span_list_split(span, (uint32_t)release_count);
-			_memory_global_cache_insert(span);
-			span = next;
-		}
-#else
-		if (span)
-			_memory_unmap_span_list(span);
-#endif
-		heap->span_cache[iclass] = 0;
-	}
-#endif
-
-	//Orphan the heap
-	void* raw_heap;
-	uintptr_t orphan_counter;
-	heap_t* last_heap;
-	do {
-		last_heap = atomic_load_ptr(&_memory_orphan_heaps);
-		heap->next_orphan = (void*)((uintptr_t)last_heap & ~(uintptr_t)0xFF);
-		orphan_counter = (uintptr_t)atomic_incr32(&_memory_orphan_counter);
-		raw_heap = (void*)((uintptr_t)heap | (orphan_counter & (uintptr_t)0xFF));
-	}
-	while (!atomic_cas_ptr(&_memory_orphan_heaps, raw_heap, last_heap));
-
-	set_thread_heap(0);
-
-#if ENABLE_STATISTICS
-	atomic_add32(&_memory_active_heaps, -1);
-#endif
+	heap_t* heap = get_thread_heap_raw();
+	if (heap)
+		_memory_heap_finalize(heap);
 }
 
 int
 rpmalloc_is_thread_initialized(void) {
-	return (get_thread_heap() != 0) ? 1 : 0;
+	return (get_thread_heap_raw() != 0) ? 1 : 0;
 }
 
 const rpmalloc_config_t*
@@ -1746,12 +2063,16 @@
 		return 0;
 	}
 #else
+	int flags = MAP_PRIVATE | MAP_ANONYMOUS | MAP_UNINITIALIZED;
 #  if defined(__APPLE__)
-	void* ptr = mmap(0, size + padding, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS | MAP_UNINITIALIZED, (_memory_huge_pages ? VM_FLAGS_SUPERPAGE_SIZE_2MB : -1), 0);
+	int fd = (int)VM_MAKE_TAG(240U);
+	if (_memory_huge_pages)
+		fd |= VM_FLAGS_SUPERPAGE_SIZE_2MB;
+	void* ptr = mmap(0, size + padding, PROT_READ | PROT_WRITE, flags, fd, 0);
 #  elif defined(MAP_HUGETLB)
-	void* ptr = mmap(0, size + padding, PROT_READ | PROT_WRITE, (_memory_huge_pages ? MAP_HUGETLB : 0) | MAP_PRIVATE | MAP_ANONYMOUS | MAP_UNINITIALIZED, -1, 0);
+	void* ptr = mmap(0, size + padding, PROT_READ | PROT_WRITE, (_memory_huge_pages ? MAP_HUGETLB : 0) | flags, -1, 0);
 #  else
-	void* ptr = mmap(0, size + padding, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS | MAP_UNINITIALIZED, -1, 0);
+	void* ptr = mmap(0, size + padding, PROT_READ | PROT_WRITE, flags, -1, 0);
 #  endif
 	if ((ptr == MAP_FAILED) || !ptr) {
 		assert("Failed to map virtual memory block" == 0);
@@ -1816,7 +2137,7 @@
 
 // Extern interface
 
-RPMALLOC_RESTRICT void*
+extern inline RPMALLOC_ALLOCATOR void*
 rpmalloc(size_t size) {
 #if ENABLE_VALIDATE_ARGS
 	if (size >= MAX_ALLOC_SIZE) {
@@ -1824,15 +2145,16 @@
 		return 0;
 	}
 #endif
-	return _memory_allocate(size);
+	heap_t* heap = get_thread_heap();
+	return _memory_allocate(heap, size);
 }
 
-void
+extern inline void
 rpfree(void* ptr) {
 	_memory_deallocate(ptr);
 }
 
-RPMALLOC_RESTRICT void*
+extern inline RPMALLOC_ALLOCATOR void*
 rpcalloc(size_t num, size_t size) {
 	size_t total;
 #if ENABLE_VALIDATE_ARGS
@@ -1852,12 +2174,13 @@
 #else
 	total = num * size;
 #endif
-	void* block = _memory_allocate(total);
+	heap_t* heap = get_thread_heap();
+	void* block = _memory_allocate(heap, total);
 	memset(block, 0, total);
 	return block;
 }
 
-void*
+extern inline RPMALLOC_ALLOCATOR void*
 rprealloc(void* ptr, size_t size) {
 #if ENABLE_VALIDATE_ARGS
 	if (size >= MAX_ALLOC_SIZE) {
@@ -1868,7 +2191,7 @@
 	return _memory_reallocate(ptr, size, 0, 0);
 }
 
-void*
+extern RPMALLOC_ALLOCATOR void*
 rpaligned_realloc(void* ptr, size_t alignment, size_t size, size_t oldsize,
                   unsigned int flags) {
 #if ENABLE_VALIDATE_ARGS
@@ -1891,16 +2214,18 @@
 				memcpy(block, ptr, oldsize < size ? oldsize : size);
 			rpfree(ptr);
 		}
-	}
-	else {
+		//Mark as having aligned blocks
+		span_t* span = (span_t*)((uintptr_t)block & _memory_span_mask);
+		span->flags |= SPAN_FLAG_ALIGNED_BLOCKS;
+	} else {
 		block = _memory_reallocate(ptr, size, oldsize, flags);
 	}
 	return block;
 }
 
-RPMALLOC_RESTRICT void*
+extern RPMALLOC_ALLOCATOR void*
 rpaligned_alloc(size_t alignment, size_t size) {
-	if (alignment <= 32)
+	if (alignment <= 16)
 		return rpmalloc(size);
 
 #if ENABLE_VALIDATE_ARGS
@@ -1920,6 +2245,9 @@
 		ptr = rpmalloc(size + alignment);
 		if ((uintptr_t)ptr & align_mask)
 			ptr = (void*)(((uintptr_t)ptr & ~(uintptr_t)align_mask) + alignment);
+		//Mark as having aligned blocks
+		span_t* span = (span_t*)((uintptr_t)ptr & _memory_span_mask);
+		span->flags |= SPAN_FLAG_ALIGNED_BLOCKS;
 		return ptr;
 	}
 
@@ -1985,20 +2313,21 @@
 		goto retry;
 	}
 
-	atomic_store32(&span->heap_id, 0);
 	//Store page count in span_count
+	span->size_class = (uint32_t)-1;
 	span->span_count = (uint32_t)num_pages;
 	span->align_offset = (uint32_t)align_offset;
+	_memory_statistics_add_peak(&_huge_pages_current, num_pages, _huge_pages_peak);
 
 	return ptr;
 }
 
-RPMALLOC_RESTRICT void*
+extern inline RPMALLOC_ALLOCATOR void*
 rpmemalign(size_t alignment, size_t size) {
 	return rpaligned_alloc(alignment, size);
 }
 
-int
+extern inline int
 rpposix_memalign(void **memptr, size_t alignment, size_t size) {
 	if (memptr)
 		*memptr = rpaligned_alloc(alignment, size);
@@ -2007,45 +2336,70 @@
 	return *memptr ? 0 : ENOMEM;
 }
 
-size_t
+extern inline size_t
 rpmalloc_usable_size(void* ptr) {
 	return (ptr ? _memory_usable_size(ptr) : 0);
 }
 
-void
+extern inline void
 rpmalloc_thread_collect(void) {
-	heap_t* heap = get_thread_heap();
-	if (heap)
-		_memory_deallocate_deferred(heap);
 }
 
 void
 rpmalloc_thread_statistics(rpmalloc_thread_statistics_t* stats) {
 	memset(stats, 0, sizeof(rpmalloc_thread_statistics_t));
-	heap_t* heap = get_thread_heap();
-	void* p = atomic_load_ptr(&heap->defer_deallocate);
-	while (p) {
-		void* next = *(void**)p;
-		span_t* span = (void*)((uintptr_t)p & _memory_span_mask);
-		stats->deferred += _memory_size_class[span->size_class].size;
-		p = next;
-	}
+	heap_t* heap = get_thread_heap_raw();
+	if (!heap)
+		return;
 
-	for (size_t isize = 0; isize < SIZE_CLASS_COUNT; ++isize) {
-		if (heap->active_block[isize].free_count)
-			stats->active += heap->active_block[isize].free_count * _memory_size_class[heap->active_span[isize]->size_class].size;
-
-		span_t* cache = heap->size_cache[isize];
-		while (cache) {
-			stats->sizecache = cache->data.block.free_count * _memory_size_class[cache->size_class].size;
-			cache = cache->next_span;
+	for (size_t iclass = 0; iclass < SIZE_CLASS_COUNT; ++iclass) {
+		size_class_t* size_class = _memory_size_class + iclass;
+		heap_class_t* heap_class = heap->span_class + iclass;
+		span_t* span = heap_class->partial_span;
+		while (span) {
+			atomic_thread_fence_acquire();
+			size_t free_count = span->list_size;
+			if (span->state == SPAN_STATE_PARTIAL)
+				free_count += (size_class->block_count - span->used_count);
+			stats->sizecache = free_count * size_class->block_size;
+			span = span->next;
 		}
 	}
 
 #if ENABLE_THREAD_CACHE
 	for (size_t iclass = 0; iclass < LARGE_CLASS_COUNT; ++iclass) {
 		if (heap->span_cache[iclass])
-			stats->spancache = (size_t)heap->span_cache[iclass]->data.list.size * (iclass + 1) * _memory_span_size;
+			stats->spancache = (size_t)heap->span_cache[iclass]->list_size * (iclass + 1) * _memory_span_size;
+		span_t* deferred_list = !iclass ? atomic_load_ptr(&heap->span_cache_deferred) : 0;
+		//TODO: Incorrect, for deferred lists the size is NOT stored in list_size
+		if (deferred_list)
+			stats->spancache = (size_t)deferred_list->list_size * (iclass + 1) * _memory_span_size;
+	}
+#endif
+#if ENABLE_STATISTICS
+	stats->thread_to_global = heap->thread_to_global;
+	stats->global_to_thread = heap->global_to_thread;
+
+	for (size_t iclass = 0; iclass < LARGE_CLASS_COUNT; ++iclass) {
+		stats->span_use[iclass].current = (size_t)atomic_load32(&heap->span_use[iclass].current);
+		stats->span_use[iclass].peak = (size_t)heap->span_use[iclass].high;
+		stats->span_use[iclass].to_global = (size_t)heap->span_use[iclass].spans_to_global;
+		stats->span_use[iclass].from_global = (size_t)heap->span_use[iclass].spans_from_global;
+		stats->span_use[iclass].to_cache = (size_t)heap->span_use[iclass].spans_to_cache;
+		stats->span_use[iclass].from_cache = (size_t)heap->span_use[iclass].spans_from_cache;
+		stats->span_use[iclass].to_reserved = (size_t)heap->span_use[iclass].spans_to_reserved;
+		stats->span_use[iclass].from_reserved = (size_t)heap->span_use[iclass].spans_from_reserved;
+		stats->span_use[iclass].map_calls = (size_t)heap->span_use[iclass].spans_map_calls;
+	}
+	for (size_t iclass = 0; iclass < SIZE_CLASS_COUNT; ++iclass) {
+		stats->size_use[iclass].alloc_current = (size_t)atomic_load32(&heap->size_class_use[iclass].alloc_current);
+		stats->size_use[iclass].alloc_peak = (size_t)heap->size_class_use[iclass].alloc_peak;
+		stats->size_use[iclass].alloc_total = (size_t)heap->size_class_use[iclass].alloc_total;
+		stats->size_use[iclass].free_total = (size_t)atomic_load32(&heap->size_class_use[iclass].free_total);
+		stats->size_use[iclass].spans_to_cache = (size_t)heap->size_class_use[iclass].spans_to_cache;
+		stats->size_use[iclass].spans_from_cache = (size_t)heap->size_class_use[iclass].spans_from_cache;
+		stats->size_use[iclass].spans_from_reserved = (size_t)heap->size_class_use[iclass].spans_from_reserved;
+		stats->size_use[iclass].map_calls = (size_t)heap->size_class_use[iclass].spans_map_calls;
 	}
 #endif
 }
@@ -2055,8 +2409,11 @@
 	memset(stats, 0, sizeof(rpmalloc_global_statistics_t));
 #if ENABLE_STATISTICS
 	stats->mapped = (size_t)atomic_load32(&_mapped_pages) * _memory_page_size;
+	stats->mapped_peak = (size_t)_mapped_pages_peak * _memory_page_size;
 	stats->mapped_total = (size_t)atomic_load32(&_mapped_total) * _memory_page_size;
 	stats->unmapped_total = (size_t)atomic_load32(&_unmapped_total) * _memory_page_size;
+	stats->huge_alloc = (size_t)atomic_load32(&_huge_pages_current) * _memory_page_size;
+	stats->huge_alloc_peak = (size_t)_huge_pages_peak * _memory_page_size;
 #endif
 #if ENABLE_GLOBAL_CACHE
 	for (size_t iclass = 0; iclass < LARGE_CLASS_COUNT; ++iclass) {
@@ -2064,3 +2421,91 @@
 	}
 #endif
 }
+
+void
+rpmalloc_dump_statistics(void* file) {
+#if ENABLE_STATISTICS
+	//If you hit this assert, you still have active threads or forgot to finalize some thread(s)
+	assert(atomic_load32(&_memory_active_heaps) == 0);
+
+	for (size_t list_idx = 0; list_idx < HEAP_ARRAY_SIZE; ++list_idx) {
+		heap_t* heap = atomic_load_ptr(&_memory_heaps[list_idx]);
+		while (heap) {
+			fprintf(file, "Heap %d stats:\n", heap->id);
+			fprintf(file, "Class   CurAlloc  PeakAlloc   TotAlloc    TotFree  BlkSize BlkCount SpansCur SpansPeak  PeakAllocMiB  ToCacheMiB FromCacheMiB FromReserveMiB MmapCalls\n");
+			for (size_t iclass = 0; iclass < SIZE_CLASS_COUNT; ++iclass) {
+				if (!heap->size_class_use[iclass].alloc_total) {
+					assert(!atomic_load32(&heap->size_class_use[iclass].free_total));
+					assert(!heap->size_class_use[iclass].spans_map_calls);
+					continue;
+				}
+				fprintf(file, "%3u:  %10u %10u %10u %10u %8u %8u %8d %9d %13zu %11zu %12zu %14zu %9u\n", (uint32_t)iclass,
+					atomic_load32(&heap->size_class_use[iclass].alloc_current),
+					heap->size_class_use[iclass].alloc_peak,
+					heap->size_class_use[iclass].alloc_total,
+					atomic_load32(&heap->size_class_use[iclass].free_total),
+					_memory_size_class[iclass].block_size,
+					_memory_size_class[iclass].block_count,
+					heap->size_class_use[iclass].spans_current,
+					heap->size_class_use[iclass].spans_peak,
+					((size_t)heap->size_class_use[iclass].alloc_peak * (size_t)_memory_size_class[iclass].block_size) / (size_t)(1024 * 1024),
+					((size_t)heap->size_class_use[iclass].spans_to_cache * _memory_span_size) / (size_t)(1024 * 1024),
+					((size_t)heap->size_class_use[iclass].spans_from_cache * _memory_span_size) / (size_t)(1024 * 1024),
+					((size_t)heap->size_class_use[iclass].spans_from_reserved * _memory_span_size) / (size_t)(1024 * 1024),
+					heap->size_class_use[iclass].spans_map_calls);
+			}
+			fprintf(file, "Spans  Current     Peak  PeakMiB  Cached  ToCacheMiB FromCacheMiB ToReserveMiB FromReserveMiB ToGlobalMiB FromGlobalMiB  MmapCalls\n");
+			for (size_t iclass = 0; iclass < LARGE_CLASS_COUNT; ++iclass) {
+				if (!heap->span_use[iclass].high && !heap->span_use[iclass].spans_map_calls)
+					continue;
+				fprintf(file, "%4u: %8d %8u %8zu %7u %11zu %12zu %12zu %14zu %11zu %13zu %10u\n", (uint32_t)(iclass + 1),
+					atomic_load32(&heap->span_use[iclass].current),
+					heap->span_use[iclass].high,
+					((size_t)heap->span_use[iclass].high * (size_t)_memory_span_size * (iclass + 1)) / (size_t)(1024 * 1024),
+					heap->span_cache[iclass] ? heap->span_cache[iclass]->list_size : 0,
+					((size_t)heap->span_use[iclass].spans_to_cache * (iclass + 1) * _memory_span_size) / (size_t)(1024 * 1024),
+					((size_t)heap->span_use[iclass].spans_from_cache * (iclass + 1) * _memory_span_size) / (size_t)(1024 * 1024),
+					((size_t)heap->span_use[iclass].spans_to_reserved * (iclass + 1) * _memory_span_size) / (size_t)(1024 * 1024),
+					((size_t)heap->span_use[iclass].spans_from_reserved * (iclass + 1) * _memory_span_size) / (size_t)(1024 * 1024),
+					((size_t)heap->span_use[iclass].spans_to_global * (size_t)_memory_span_size * (iclass + 1)) / (size_t)(1024 * 1024),
+					((size_t)heap->span_use[iclass].spans_from_global * (size_t)_memory_span_size * (iclass + 1)) / (size_t)(1024 * 1024),
+					heap->span_use[iclass].spans_map_calls);
+			}
+			fprintf(file, "ThreadToGlobalMiB GlobalToThreadMiB\n");
+			fprintf(file, "%17zu %17zu\n", (size_t)heap->thread_to_global / (size_t)(1024 * 1024), (size_t)heap->global_to_thread / (size_t)(1024 * 1024));
+			heap = heap->next_heap;
+		}
+	}
+
+	fprintf(file, "Global stats:\n");
+	size_t huge_current = (size_t)atomic_load32(&_huge_pages_current) * _memory_page_size;
+	size_t huge_peak = (size_t)_huge_pages_peak * _memory_page_size;
+	fprintf(file, "HugeCurrentMiB HugePeakMiB\n");
+	fprintf(file, "%14zu %11zu\n", huge_current / (size_t)(1024 * 1024), huge_peak / (size_t)(1024 * 1024));
+
+	size_t mapped = (size_t)atomic_load32(&_mapped_pages) * _memory_page_size;
+	size_t mapped_os = (size_t)atomic_load32(&_mapped_pages_os) * _memory_page_size;
+	size_t mapped_peak = (size_t)_mapped_pages_peak * _memory_page_size;
+	size_t mapped_total = (size_t)atomic_load32(&_mapped_total) * _memory_page_size;
+	size_t unmapped_total = (size_t)atomic_load32(&_unmapped_total) * _memory_page_size;
+	size_t reserved_total = (size_t)atomic_load32(&_reserved_spans) * _memory_span_size;
+	fprintf(file, "MappedMiB MappedOSMiB MappedPeakMiB MappedTotalMiB UnmappedTotalMiB ReservedTotalMiB\n");
+	fprintf(file, "%9zu %11zu %13zu %14zu %16zu %16zu\n",
+		mapped / (size_t)(1024 * 1024),
+		mapped_os / (size_t)(1024 * 1024),
+		mapped_peak / (size_t)(1024 * 1024),
+		mapped_total / (size_t)(1024 * 1024),
+		unmapped_total / (size_t)(1024 * 1024),
+		reserved_total / (size_t)(1024 * 1024));
+
+	fprintf(file, "\n");
+#else
+	(void)sizeof(file);
+#endif
+}
+
+#if ENABLE_PRELOAD || ENABLE_OVERRIDE
+
+#include "malloc.c"
+
+#endif
diff --git a/rpmalloc/rpmalloc.h b/rpmalloc/rpmalloc.h
index 6a89ff9..2f48bc9 100644
--- a/rpmalloc/rpmalloc.h
+++ b/rpmalloc/rpmalloc.h
@@ -18,46 +18,107 @@
 #endif
 
 #if defined(__clang__) || defined(__GNUC__)
-# define RPMALLOC_ATTRIBUTE __attribute__((__malloc__))
-# define RPMALLOC_RESTRICT
+# define RPMALLOC_EXPORT __attribute__((visibility("default")))
+# define RPMALLOC_ALLOCATOR
+# define RPMALLOC_ATTRIB_MALLOC __attribute__((__malloc__))
+# if defined(__clang_major__) && (__clang_major__ < 4)
+# define RPMALLOC_ATTRIB_ALLOC_SIZE(size)
+# define RPMALLOC_ATTRIB_ALLOC_SIZE2(count, size)
+# else
+# define RPMALLOC_ATTRIB_ALLOC_SIZE(size) __attribute__((alloc_size(size)))
+# define RPMALLOC_ATTRIB_ALLOC_SIZE2(count, size)  __attribute__((alloc_size(count, size)))
+# endif
 # define RPMALLOC_CDECL
 #elif defined(_MSC_VER)
-# define RPMALLOC_ATTRIBUTE
-# define RPMALLOC_RESTRICT __declspec(restrict)
+# define RPMALLOC_EXPORT
+# define RPMALLOC_ALLOCATOR __declspec(allocator) __declspec(restrict)
+# define RPMALLOC_ATTRIB_MALLOC
+# define RPMALLOC_ATTRIB_ALLOC_SIZE(size)
+# define RPMALLOC_ATTRIB_ALLOC_SIZE2(count,size)
 # define RPMALLOC_CDECL __cdecl
 #else
-# define RPMALLOC_ATTRIBUTE
-# define RPMALLOC_RESTRICT
+# define RPMALLOC_EXPORT
+# define RPMALLOC_ALLOCATOR
+# define RPMALLOC_ATTRIB_MALLOC
+# define RPMALLOC_ATTRIB_ALLOC_SIZE(size)
+# define RPMALLOC_ATTRIB_ALLOC_SIZE2(count,size)
 # define RPMALLOC_CDECL
 #endif
 
+//! Define RPMALLOC_CONFIGURABLE to enable runtime configuration of page and span sizes
+#ifndef RPMALLOC_CONFIGURABLE
+#define RPMALLOC_CONFIGURABLE 0
+#endif
+
 //! Flag to rpaligned_realloc to not preserve content in reallocation
 #define RPMALLOC_NO_PRESERVE    1
 
 typedef struct rpmalloc_global_statistics_t {
-	//! Current amount of virtual memory mapped (only if ENABLE_STATISTICS=1)
+	//! Current amount of virtual memory mapped, not all of which may have been committed (only if ENABLE_STATISTICS=1)
 	size_t mapped;
-	//! Current amount of memory in global caches for small and medium sizes (<64KiB)
+	//! Peak amount of virtual memory mapped, not all of which may have been committed (only if ENABLE_STATISTICS=1)
+	size_t mapped_peak;
+	//! Current amount of memory in global caches for small and medium sizes (<32KiB)
 	size_t cached;
-	//! Total amount of memory mapped (only if ENABLE_STATISTICS=1)
+	//! Current amount of memory allocated in huge allocations, i.e. larger than LARGE_SIZE_LIMIT, which is 2MiB by default (only if ENABLE_STATISTICS=1)
+	size_t huge_alloc;
+	//! Peak amount of memory allocated in huge allocations, i.e. larger than LARGE_SIZE_LIMIT, which is 2MiB by default (only if ENABLE_STATISTICS=1)
+	size_t huge_alloc_peak;
+	//! Total amount of memory mapped since initialization (only if ENABLE_STATISTICS=1)
 	size_t mapped_total;
-	//! Total amount of memory unmapped (only if ENABLE_STATISTICS=1)
+	//! Total amount of memory unmapped since initialization (only if ENABLE_STATISTICS=1)
 	size_t unmapped_total;
 } rpmalloc_global_statistics_t;
 
 typedef struct rpmalloc_thread_statistics_t {
-	//! Current number of bytes available for allocation from active spans
-	size_t active;
-	//! Current number of bytes available in thread size class caches
+	//! Current number of bytes available in thread size class caches for small and medium sizes (<32KiB)
 	size_t sizecache;
-	//! Current number of bytes available in thread span caches
+	//! Current number of bytes available in thread span caches for small and medium sizes (<32KiB)
 	size_t spancache;
-	//! Current number of bytes in pending deferred deallocations
-	size_t deferred;
-	//! Total number of bytes transitioned from thread cache to global cache
+	//! Total number of bytes transitioned from thread cache to global cache (only if ENABLE_STATISTICS=1)
 	size_t thread_to_global;
-	//! Total number of bytes transitioned from global cache to thread cache
+	//! Total number of bytes transitioned from global cache to thread cache (only if ENABLE_STATISTICS=1)
 	size_t global_to_thread;
+	//! Per span count statistics (only if ENABLE_STATISTICS=1)
+	struct {
+		//! Currently used number of spans
+		size_t current;
+		//! High water mark of spans used
+		size_t peak;
+		//! Number of spans transitioned to global cache
+		size_t to_global;
+		//! Number of spans transitioned from global cache
+		size_t from_global;
+		//! Number of spans transitioned to thread cache
+		size_t to_cache;
+		//! Number of spans transitioned from thread cache
+		size_t from_cache;
+		//! Number of spans transitioned to reserved state
+		size_t to_reserved;
+		//! Number of spans transitioned from reserved state
+		size_t from_reserved;
+		//! Number of raw memory map calls (not satisfied from reserved spans, resulting in actual OS mmap calls)
+		size_t map_calls;
+	} span_use[32];
+	//! Per size class statistics (only if ENABLE_STATISTICS=1)
+	struct {
+		//! Current number of allocations
+		size_t alloc_current;
+		//! Peak number of allocations
+		size_t alloc_peak;
+		//! Total number of allocations
+		size_t alloc_total;
+		//! Total number of frees
+		size_t free_total;
+		//! Number of spans transitioned to cache
+		size_t spans_to_cache;
+		//! Number of spans transitioned from cache
+		size_t spans_from_cache;
+		//! Number of spans transitioned from reserved state
+		size_t spans_from_reserved;
+		//! Number of raw memory map calls (not satisfied from reserved spans, resulting in actual OS mmap calls)
+		size_t map_calls;
+	} size_use[128];
 } rpmalloc_thread_statistics_t;
 
 typedef struct rpmalloc_config_t {
@@ -82,9 +143,11 @@
 	void (*memory_unmap)(void* address, size_t size, size_t offset, size_t release);
 	//! Size of memory pages. The page size MUST be a power of two. All memory mapping
 	//  requests to memory_map will be made with size set to a multiple of the page size.
+	//  Used if RPMALLOC_CONFIGURABLE is defined to 1, otherwise system page size is used.
 	size_t page_size;
 	//! Size of a span of memory blocks. MUST be a power of two, and in [4096,262144]
-	//  range (unless 0 - set to 0 to use the default span size).
+	//  range (unless 0 - set to 0 to use the default span size). Used if RPMALLOC_CONFIGURABLE
+	//  is defined to 1.
 	size_t span_size;
 	//! Number of spans to map at each request to map new virtual memory blocks. This can
 	//  be used to minimize the system call overhead at the cost of virtual memory address
@@ -103,92 +166,96 @@
 } rpmalloc_config_t;
 
 //! Initialize allocator with default configuration
-extern int
+RPMALLOC_EXPORT int
 rpmalloc_initialize(void);
 
 //! Initialize allocator with given configuration
-extern int
+RPMALLOC_EXPORT int
 rpmalloc_initialize_config(const rpmalloc_config_t* config);
 
 //! Get allocator configuration
-extern const rpmalloc_config_t*
+RPMALLOC_EXPORT const rpmalloc_config_t*
 rpmalloc_config(void);
 
 //! Finalize allocator
-extern void
+RPMALLOC_EXPORT void
 rpmalloc_finalize(void);
 
 //! Initialize allocator for calling thread
-extern void
+RPMALLOC_EXPORT void
 rpmalloc_thread_initialize(void);
 
 //! Finalize allocator for calling thread
-extern void
+RPMALLOC_EXPORT void
 rpmalloc_thread_finalize(void);
 
 //! Perform deferred deallocations pending for the calling thread heap
-extern void
+RPMALLOC_EXPORT void
 rpmalloc_thread_collect(void);
 
 //! Query if allocator is initialized for calling thread
-extern int
+RPMALLOC_EXPORT int
 rpmalloc_is_thread_initialized(void);
 
 //! Get per-thread statistics
-extern void
+RPMALLOC_EXPORT void
 rpmalloc_thread_statistics(rpmalloc_thread_statistics_t* stats);
 
 //! Get global statistics
-extern void
+RPMALLOC_EXPORT void
 rpmalloc_global_statistics(rpmalloc_global_statistics_t* stats);
 
+//! Dump all statistics in human readable format to file (should be a FILE*)
+RPMALLOC_EXPORT void
+rpmalloc_dump_statistics(void* file);
+
 //! Allocate a memory block of at least the given size
-extern RPMALLOC_RESTRICT void*
-rpmalloc(size_t size) RPMALLOC_ATTRIBUTE;
+RPMALLOC_EXPORT RPMALLOC_ALLOCATOR void*
+rpmalloc(size_t size) RPMALLOC_ATTRIB_MALLOC RPMALLOC_ATTRIB_ALLOC_SIZE(1);
 
 //! Free the given memory block
-extern void
+RPMALLOC_EXPORT void
 rpfree(void* ptr);
 
 //! Allocate a memory block of at least the given size and zero initialize it
-extern RPMALLOC_RESTRICT void*
-rpcalloc(size_t num, size_t size) RPMALLOC_ATTRIBUTE;
+RPMALLOC_EXPORT RPMALLOC_ALLOCATOR void*
+rpcalloc(size_t num, size_t size) RPMALLOC_ATTRIB_MALLOC RPMALLOC_ATTRIB_ALLOC_SIZE2(1, 2);
 
 //! Reallocate the given block to at least the given size
-extern void*
-rprealloc(void* ptr, size_t size);
+RPMALLOC_EXPORT RPMALLOC_ALLOCATOR void*
+rprealloc(void* ptr, size_t size) RPMALLOC_ATTRIB_MALLOC RPMALLOC_ATTRIB_ALLOC_SIZE(2);
 
 //! Reallocate the given block to at least the given size and alignment,
 //  with optional control flags (see RPMALLOC_NO_PRESERVE).
 //  Alignment must be a power of two and a multiple of sizeof(void*),
 //  and should ideally be less than memory page size. A caveat of rpmalloc
 //  internals is that this must also be strictly less than the span size (default 64KiB)
-extern void*
-rpaligned_realloc(void* ptr, size_t alignment, size_t size, size_t oldsize, unsigned int flags);
+RPMALLOC_EXPORT RPMALLOC_ALLOCATOR void*
+rpaligned_realloc(void* ptr, size_t alignment, size_t size, size_t oldsize, unsigned int flags) RPMALLOC_ATTRIB_MALLOC RPMALLOC_ATTRIB_ALLOC_SIZE(3);
 
 //! Allocate a memory block of at least the given size and alignment.
 //  Alignment must be a power of two and a multiple of sizeof(void*),
 //  and should ideally be less than memory page size. A caveat of rpmalloc
 //  internals is that this must also be strictly less than the span size (default 64KiB)
-extern RPMALLOC_RESTRICT void*
-rpaligned_alloc(size_t alignment, size_t size) RPMALLOC_ATTRIBUTE;
+RPMALLOC_EXPORT RPMALLOC_ALLOCATOR void*
+rpaligned_alloc(size_t alignment, size_t size) RPMALLOC_ATTRIB_MALLOC RPMALLOC_ATTRIB_ALLOC_SIZE(2);
 
 //! Allocate a memory block of at least the given size and alignment.
 //  Alignment must be a power of two and a multiple of sizeof(void*),
 //  and should ideally be less than memory page size. A caveat of rpmalloc
 //  internals is that this must also be strictly less than the span size (default 64KiB)
-extern RPMALLOC_RESTRICT void*
-rpmemalign(size_t alignment, size_t size) RPMALLOC_ATTRIBUTE;
+RPMALLOC_EXPORT RPMALLOC_ALLOCATOR void*
+rpmemalign(size_t alignment, size_t size) RPMALLOC_ATTRIB_MALLOC RPMALLOC_ATTRIB_ALLOC_SIZE(2);
 
 //! Allocate a memory block of at least the given size and alignment.
 //  Alignment must be a power of two and a multiple of sizeof(void*),
 //  and should ideally be less than memory page size. A caveat of rpmalloc
 //  internals is that this must also be strictly less than the span size (default 64KiB)
-extern int
+RPMALLOC_EXPORT int
 rpposix_memalign(void **memptr, size_t alignment, size_t size);
 
 //! Query the usable size of the given memory block (from given pointer to the end of block)
-extern size_t
+RPMALLOC_EXPORT size_t
 rpmalloc_usable_size(void* ptr);
 
 #ifdef __cplusplus
diff --git a/test/main-override.cc b/test/main-override.cc
new file mode 100644
index 0000000..1134d37
--- /dev/null
+++ b/test/main-override.cc
@@ -0,0 +1,165 @@
+
+#if defined(_WIN32) && !defined(_CRT_SECURE_NO_WARNINGS)
+#  define _CRT_SECURE_NO_WARNINGS
+#endif
+
+#include <rpmalloc.h>
+#include <thread.h>
+#include <test.h>
+
+#include <stdint.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <string.h>
+#include <math.h>
+
+static size_t _hardware_threads;
+
+static void
+test_initialize(void);
+
+static int
+test_fail(const char* reason) {
+	fprintf(stderr, "FAIL: %s\n", reason);
+	return -1;
+}
+
+static int
+test_alloc(void) {
+	void* p = malloc(371);
+	if (!p)
+		return test_fail("malloc failed");
+	if ((rpmalloc_usable_size(p) < 371) || (rpmalloc_usable_size(p) > (371 + 16)))
+		return test_fail("usable size invalid (1)");
+	rpfree(p);
+
+	p = new int;
+	if (!p)
+		return test_fail("new failed");
+	if (rpmalloc_usable_size(p) != 16)
+		return test_fail("usable size invalid (2)");
+	delete static_cast<int*>(p);
+
+	p = new int[16];
+	if (!p)
+		return test_fail("new[] failed");
+	if (rpmalloc_usable_size(p) != 16*sizeof(int))
+		return test_fail("usable size invalid (3)");
+	delete[] static_cast<int*>(p);
+
+	printf("Allocation tests passed\n");
+	return 0;
+}
+
+static int
+test_free(void) {
+	free(rpmalloc(371));
+	free(new int);
+	free(new int[16]);
+	printf("Free tests passed\n");
+	return 0;
+}
+
+static void
+basic_thread(void* argp) {
+	(void)sizeof(argp);
+	int res = test_alloc();
+	if (res) {
+		thread_exit(static_cast<uintptr_t>(res));
+		return;
+	}
+	res = test_free();
+	if (res) {
+		thread_exit(static_cast<uintptr_t>(res));
+		return;
+	}
+	thread_exit(0);
+}
+
+static int
+test_thread(void) {
+	uintptr_t thread[2];
+	uintptr_t threadres[2];
+
+	thread_arg targ;
+	memset(&targ, 0, sizeof(targ));
+	targ.fn = basic_thread;
+	for (int i = 0; i < 2; ++i)
+		thread[i] = thread_run(&targ);
+
+	for (int i = 0; i < 2; ++i) {
+		threadres[i] = thread_join(thread[i]);
+		if (threadres[i])
+			return -1;
+	}
+
+	printf("Thread tests passed\n");
+	return 0;
+}
+
+int
+test_run(int argc, char** argv) {
+	(void)sizeof(argc);
+	(void)sizeof(argv);
+	test_initialize();
+	if (test_alloc())
+		return -1;
+	if (test_free())
+		return -1;
+	if (test_thread())
+		return -1;
+	printf("All tests passed\n");
+	return 0;
+}
+
+#if (defined(__APPLE__) && __APPLE__)
+#  include <TargetConditionals.h>
+#  if defined(__IPHONE__) || (defined(TARGET_OS_IPHONE) && TARGET_OS_IPHONE) || (defined(TARGET_IPHONE_SIMULATOR) && TARGET_IPHONE_SIMULATOR)
+#    define NO_MAIN 1
+#  endif
+#elif (defined(__linux__) || defined(__linux))
+#  include <sched.h>
+#endif
+
+#if !defined(NO_MAIN)
+
+int
+main(int argc, char** argv) {
+	return test_run(argc, argv);
+}
+
+#endif
+
+#ifdef _WIN32
+#include <Windows.h>
+
+static void
+test_initialize(void) {
+	SYSTEM_INFO system_info;
+	GetSystemInfo(&system_info);
+	_hardware_threads = static_cast<size_t>(system_info.dwNumberOfProcessors);
+}
+
+#elif (defined(__linux__) || defined(__linux))
+
+static void
+test_initialize(void) {
+	cpu_set_t prevmask, testmask;
+	CPU_ZERO(&prevmask);
+	CPU_ZERO(&testmask);
+	sched_getaffinity(0, sizeof(prevmask), &prevmask);     //Get current mask
+	sched_setaffinity(0, sizeof(testmask), &testmask);     //Set zero mask
+	sched_getaffinity(0, sizeof(testmask), &testmask);     //Get mask for all CPUs
+	sched_setaffinity(0, sizeof(prevmask), &prevmask);     //Reset current mask
+	int num = CPU_COUNT(&testmask);
+	_hardware_threads = static_cast<size_t>(num > 1 ? num : 1);
+}
+
+#else
+
+static void
+test_initialize(void) {
+	_hardware_threads = 1;
+}
+
+#endif
diff --git a/test/main.c b/test/main.c
index 8630954..679287f 100644
--- a/test/main.c
+++ b/test/main.c
@@ -12,6 +12,7 @@
 #include <stdio.h>
 #include <string.h>
 #include <math.h>
+#include <time.h>
 
 #define pointer_offset(ptr, ofs) (void*)((char*)(ptr) + (ptrdiff_t)(ofs))
 #define pointer_diff(first, second) (ptrdiff_t)((const char*)(first) - (const char*)(second))
@@ -22,6 +23,12 @@
 test_initialize(void);
 
 static int
+test_fail(const char* reason) {
+	fprintf(stderr, "FAIL: %s\n", reason);
+	return -1;
+}
+
+static int
 test_alloc(void) {
 	unsigned int iloop = 0;
 	unsigned int ipass = 0;
@@ -36,61 +43,88 @@
 	for (id = 0; id < 20000; ++id)
 		data[id] = (char)(id % 139 + id % 17);
 
-	void* testptr = rpmalloc(253000);
-	testptr = rprealloc(testptr, 154);
-	//Verify that blocks are 32 byte size aligned
+	//Verify that blocks are 16 byte size aligned
+	void* testptr = rpmalloc(16);
+	if (rpmalloc_usable_size(testptr) != 16)
+		return test_fail("Bad base alloc usable size");
+	rpfree(testptr);
+	testptr = rpmalloc(32);
+	if (rpmalloc_usable_size(testptr) != 32)
+		return test_fail("Bad base alloc usable size");
+	rpfree(testptr);
+	testptr = rpmalloc(128);
+	if (rpmalloc_usable_size(testptr) != 128)
+		return test_fail("Bad base alloc usable size");
+	rpfree(testptr);
+	for (iloop = 0; iloop <= 1024; ++iloop) {
+		testptr = rpmalloc(iloop);
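+		//Small sizes should report usable size rounded up to the 16 byte granularity
+		//(minimum one block for a zero size request)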
+		size_t wanted_usable_size = 16 * ((iloop / 16) + ((!iloop || (iloop % 16)) ? 1 : 0));
+		if (rpmalloc_usable_size(testptr) != wanted_usable_size)
+			return test_fail("Bad base alloc usable size");
+		rpfree(testptr);
+	}
+
+	//Verify medium block sizes (until class merging kicks in)
+	for (iloop = 1025; iloop <= 6000; ++iloop) {
+		testptr = rpmalloc(iloop);
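+		//Medium sizes should report usable size rounded up to the 512 byte granularity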
+		size_t wanted_usable_size = 512 * ((iloop / 512) + ((iloop % 512) ? 1 : 0));
+		if (rpmalloc_usable_size(testptr) != wanted_usable_size)
+			return test_fail("Bad medium alloc usable size");
+		rpfree(testptr);
+	}
+
+	//Large reallocation test
+	testptr = rpmalloc(253000);
+	testptr = rprealloc(testptr, 151);
 	if (rpmalloc_usable_size(testptr) != 160)
-		return -1;
+		return test_fail("Bad usable size");
 	if (rpmalloc_usable_size(pointer_offset(testptr, 16)) != 144)
-		return -1;
+		return test_fail("Bad offset usable size");
 	rpfree(testptr);
 
 	//Reallocation tests
-	for (iloop = 1; iloop < 32; ++iloop) {
+	for (iloop = 1; iloop < 24; ++iloop) {
 		size_t size = 37 * iloop;
 		testptr = rpmalloc(size);
 		*((uintptr_t*)testptr) = 0x12345678;
-		if (rpmalloc_usable_size(testptr) < size)
-			return -1;
-		if (rpmalloc_usable_size(testptr) >= (size + 32))
-			return -1;
-		testptr = rprealloc(testptr, size + 32);
-		if (rpmalloc_usable_size(testptr) < (size + 32))
-			return -1;
-		if (rpmalloc_usable_size(testptr) >= ((size + 32) * 2))
-			return -1;
+		size_t wanted_usable_size = 16 * ((size / 16) + ((size % 16) ? 1 : 0));
+		if (rpmalloc_usable_size(testptr) != wanted_usable_size)
+			return test_fail("Bad usable size (alloc)");
+		testptr = rprealloc(testptr, size + 16);
+		if (rpmalloc_usable_size(testptr) < (wanted_usable_size + 16))
+			return test_fail("Bad usable size (realloc)");
 		if (*((uintptr_t*)testptr) != 0x12345678)
-			return -1;
+			return test_fail("Data not preserved on realloc");
 		rpfree(testptr);
 
 		testptr = rpaligned_alloc(128, size);
 		*((uintptr_t*)testptr) = 0x12345678;
-		if (rpmalloc_usable_size(testptr) < size)
-			return -1;
-		if (rpmalloc_usable_size(testptr) >= (size + 128 + 32))
-			return -1;
+		wanted_usable_size = 16 * ((size / 16) + ((size % 16) ? 1 : 0));
+		if (rpmalloc_usable_size(testptr) < wanted_usable_size)
+			return test_fail("Bad usable size (aligned alloc)");
+		if (rpmalloc_usable_size(testptr) > (wanted_usable_size + 128))
+			return test_fail("Bad usable size (aligned alloc)");
 		testptr = rpaligned_realloc(testptr, 128, size + 32, 0, 0);
-		if (rpmalloc_usable_size(testptr) < (size + 32))
-			return -1;
-		if (rpmalloc_usable_size(testptr) >= (((size + 32) * 2) + 128))
-			return -1;
+		if (rpmalloc_usable_size(testptr) < (wanted_usable_size + 32))
+			return test_fail("Bad usable size (aligned realloc)");
 		if (*((uintptr_t*)testptr) != 0x12345678)
-			return -1;
+			return test_fail("Data not preserved on realloc");
 		void* unaligned = rprealloc(testptr, size);
 		if (unaligned != testptr) {
 			ptrdiff_t diff = pointer_diff(testptr, unaligned);
 			if (diff < 0)
-				return -1;
+				return test_fail("Bad realloc behaviour");
 			if (diff >= 128)
-				return -1;
+				return test_fail("Bad realloc behaviour");
 		}
 		rpfree(testptr);
 	}
 
+	static size_t alignment[3] = { 0, 64, 256 };
 	for (iloop = 0; iloop < 64; ++iloop) {
 		for (ipass = 0; ipass < 8142; ++ipass) {
 			size_t size = iloop + ipass + datasize[(iloop + ipass) % 7];
-			char* baseptr = rpmalloc(size);
+			char* baseptr = rpaligned_alloc(alignment[ipass % 3], size);
 			for (size_t ibyte = 0; ibyte < size; ++ibyte)
 				baseptr[ibyte] = (char)(ibyte & 0xFF);
 
@@ -99,7 +133,7 @@
 			baseptr = rprealloc(baseptr, resize);
 			for (size_t ibyte = 0; ibyte < capsize; ++ibyte) {
 				if (baseptr[ibyte] != (char)(ibyte & 0xFF))
-					return -1;
+					return test_fail("Data not preserved on realloc");
 			}
 
 			size_t alignsize = (iloop * ipass + datasize[(iloop + ipass * 3) % 7]) & 0x2FF;
@@ -107,7 +141,7 @@
 			baseptr = rpaligned_realloc(baseptr, 128, alignsize, resize, 0);
 			for (size_t ibyte = 0; ibyte < capsize; ++ibyte) {
 				if (baseptr[ibyte] != (char)(ibyte & 0xFF))
-					return -1;
+					return test_fail("Data not preserved on realloc");
 			}
 
 			rpfree(baseptr);
@@ -118,27 +152,27 @@
 		for (ipass = 0; ipass < 8142; ++ipass) {
 			addr[ipass] = rpmalloc(500);
 			if (addr[ipass] == 0)
-				return -1;
+				return test_fail("Allocation failed");
 
 			memcpy(addr[ipass], data + ipass, 500);
 
 			for (icheck = 0; icheck < ipass; ++icheck) {
 				if (addr[icheck] == addr[ipass])
-					return -1;
+					return test_fail("Bad allocation result");
 				if (addr[icheck] < addr[ipass]) {
 					if (pointer_offset(addr[icheck], 500) > addr[ipass])
-						return -1;
+						return test_fail("Bad allocation result");
 				}
 				else if (addr[icheck] > addr[ipass]) {
 					if (pointer_offset(addr[ipass], 500) > addr[icheck])
-						return -1;
+						return test_fail("Bad allocation result");
 				}
 			}
 		}
 
 		for (ipass = 0; ipass < 8142; ++ipass) {
 			if (memcmp(addr[ipass], data + ipass, 500))
-				return -1;
+				return test_fail("Data corruption");
 		}
 
 		for (ipass = 0; ipass < 8142; ++ipass)
@@ -151,20 +185,20 @@
 
 			addr[ipass] = rpmalloc(cursize);
 			if (addr[ipass] == 0)
-				return -1;
+				return test_fail("Allocation failed");
 
 			memcpy(addr[ipass], data + ipass, cursize);
 
 			for (icheck = 0; icheck < ipass; ++icheck) {
 				if (addr[icheck] == addr[ipass])
-					return -1;
+					return test_fail("Identical pointer returned from allocation");
 				if (addr[icheck] < addr[ipass]) {
 					if (pointer_offset(addr[icheck], rpmalloc_usable_size(addr[icheck])) > addr[ipass])
-						return -1;
+						return test_fail("Invalid pointer inside another block returned from allocation");
 				}
 				else if (addr[icheck] > addr[ipass]) {
 					if (pointer_offset(addr[ipass], rpmalloc_usable_size(addr[ipass])) > addr[icheck])
-						return -1;
+						return test_fail("Invalid pointer inside another block returned from allocation");
 				}
 			}
 		}
@@ -172,7 +206,7 @@
 		for (ipass = 0; ipass < 1024; ++ipass) {
 			unsigned int cursize = datasize[ipass%7] + ipass;
 			if (memcmp(addr[ipass], data + ipass, cursize))
-				return -1;
+				return test_fail("Data corruption");
 		}
 
 		for (ipass = 0; ipass < 1024; ++ipass)
@@ -183,27 +217,27 @@
 		for (ipass = 0; ipass < 1024; ++ipass) {
 			addr[ipass] = rpmalloc(500);
 			if (addr[ipass] == 0)
-				return -1;
+				return test_fail("Allocation failed");
 
 			memcpy(addr[ipass], data + ipass, 500);
 
 			for (icheck = 0; icheck < ipass; ++icheck) {
 				if (addr[icheck] == addr[ipass])
-					return -1;
+					return test_fail("Identical pointer returned from allocation");
 				if (addr[icheck] < addr[ipass]) {
 					if (pointer_offset(addr[icheck], 500) > addr[ipass])
-						return -1;
+						return test_fail("Invalid pointer inside another block returned from allocation");
 				}
 				else if (addr[icheck] > addr[ipass]) {
 					if (pointer_offset(addr[ipass], 500) > addr[icheck])
-						return -1;
+						return test_fail("Invalid pointer inside another block returned from allocation");
 				}
 			}
 		}
 
 		for (ipass = 0; ipass < 1024; ++ipass) {
 			if (memcmp(addr[ipass], data + ipass, 500))
-				return -1;
+				return test_fail("Data corruption");
 		}
 
 		for (ipass = 0; ipass < 1024; ++ipass)
@@ -216,7 +250,7 @@
 		rpmalloc_initialize();
 		addr[0] = rpmalloc(iloop);
 		if (!addr[0])
-			return -1;
+			return test_fail("Allocation failed");
 		rpfree(addr[0]);
 		rpmalloc_finalize();
 	}
@@ -225,7 +259,7 @@
 		rpmalloc_initialize();
 		addr[0] = rpmalloc(iloop);
 		if (!addr[0])
-			return -1;
+			return test_fail("Allocation failed");
 		rpfree(addr[0]);
 		rpmalloc_finalize();
 	}
@@ -234,7 +268,7 @@
 		rpmalloc_initialize();
 		addr[0] = rpmalloc(iloop);
 		if (!addr[0])
-			return -1;
+			return test_fail("Allocation failed");
 		rpfree(addr[0]);
 		rpmalloc_finalize();
 	}
@@ -243,7 +277,7 @@
 	for (iloop = 0; iloop < (2 * 1024 * 1024); iloop += 16) {
 		addr[0] = rpmalloc(iloop);
 		if (!addr[0])
-			return -1;
+			return test_fail("Allocation failed");
 		rpfree(addr[0]);
 	}
 	rpmalloc_finalize();
@@ -254,6 +288,45 @@
 }
 
 static int
+test_realloc(void) {
+	srand((unsigned int)time(0));
+
+	rpmalloc_initialize();
+
+	size_t pointer_count = 4096;
+	void** pointers = rpmalloc(sizeof(void*) * pointer_count);
+	memset(pointers, 0, sizeof(void*) * pointer_count);
+
+	size_t alignments[5] = {0, 16, 32, 64, 128};
+
+	for (size_t iloop = 0; iloop < 8000; ++iloop) {
+		for (size_t iptr = 0; iptr < pointer_count; ++iptr) {
+			if (iloop)
+				rpfree(rprealloc(pointers[iptr], rand() % 4096));
+			pointers[iptr] = rpaligned_alloc(alignments[(iptr + iloop) % 5], iloop + iptr);
+		}
+	}
+
+	for (size_t iptr = 0; iptr < pointer_count; ++iptr)
+		rpfree(pointers[iptr]);
+	rpfree(pointers);
+
+	size_t bigsize = 1024 * 1024;
+	void* bigptr = rpmalloc(bigsize);
+	while (bigsize < 3 * 1024 * 1024) {
+		++bigsize;
+		bigptr = rprealloc(bigptr, bigsize);
+	}
+	rpfree(bigptr);
+
+	rpmalloc_finalize();
+
+	printf("Memory reallocation tests passed\n");
+
+	return 0;
+}
+
+static int
 test_superalign(void) {
 
 	rpmalloc_initialize();
@@ -268,7 +341,7 @@
 					size_t alloc_size = sizes[isize] + iloop + ipass;
 					uint8_t* ptr = rpaligned_alloc(alignment[ialign], alloc_size);
 					if (!ptr || ((uintptr_t)ptr & (alignment[ialign] - 1)))
-						return -1;
+						return test_fail("Super alignment allocation failed");
 					ptr[0] = 1;
 					ptr[alloc_size - 1] = 1;
 					rpfree(ptr);
@@ -290,6 +363,7 @@
 	unsigned int        datasize[32];
 	unsigned int        num_datasize; //max 32
 	void**              pointers;
+	void**              crossthread_pointers;
 } allocator_thread_arg_t;
 
 static void
@@ -320,7 +394,7 @@
 
 			addr[ipass] = rpmalloc(4 + cursize);
 			if (addr[ipass] == 0) {
-				ret = -1;
+				ret = test_fail("Allocation failed");
 				goto end;
 			}
 
@@ -329,23 +403,19 @@
 
 			for (icheck = 0; icheck < ipass; ++icheck) {
 				if (addr[icheck] == addr[ipass]) {
-					ret = -1;
+					ret = test_fail("Identical pointer returned from allocation");
 					goto end;
 				}
 				if (addr[icheck] < addr[ipass]) {
-					if (pointer_offset(addr[icheck], *(uint32_t*)addr[icheck]) > addr[ipass]) {
-						if (pointer_offset(addr[icheck], *(uint32_t*)addr[icheck]) > addr[ipass]) {
-							ret = -1;
-							goto end;
-						}
+					if (pointer_offset(addr[icheck], *(uint32_t*)addr[icheck] + 4) > addr[ipass]) {
+						ret = test_fail("Invalid pointer inside another block returned from allocation");
+						goto end;
 					}
 				}
 				else if (addr[icheck] > addr[ipass]) {
-					if (pointer_offset(addr[ipass], *(uint32_t*)addr[ipass]) > addr[ipass]) {
-						if (pointer_offset(addr[ipass], *(uint32_t*)addr[ipass]) > addr[icheck]) {
-							ret = -1;
-							goto end;
-						}
+					if (pointer_offset(addr[ipass], *(uint32_t*)addr[ipass] + 4) > addr[icheck]) {
+						ret = test_fail("Invalid pointer inside another block returned from allocation");
+						goto end;
 					}
 				}
 			}
@@ -355,7 +425,7 @@
 			cursize = *(uint32_t*)addr[ipass];
 
 			if (memcmp(pointer_offset(addr[ipass], 4), data, cursize)) {
-				ret = -1;
+				ret = test_fail("Data corrupted");
 				goto end;
 			}
 
@@ -378,24 +448,72 @@
 	unsigned int iloop = 0;
 	unsigned int ipass = 0;
 	unsigned int cursize;
-	unsigned int iwait = 0;
+	unsigned int iextra = 0;
 	int ret = 0;
 
 	rpmalloc_thread_initialize();
 
 	thread_sleep(10);
 
+	size_t next_crossthread = 0;
+	size_t end_crossthread = arg.loops * arg.passes;
+
+	void** extra_pointers = rpmalloc(sizeof(void*) * arg.loops * arg.passes);
+
 	for (iloop = 0; iloop < arg.loops; ++iloop) {
 		for (ipass = 0; ipass < arg.passes; ++ipass) {
-			cursize = arg.datasize[(iloop + ipass + iwait) % arg.num_datasize ] + ((iloop + ipass) % 1024);
-
-			void* addr = rpmalloc(cursize);
-			if (addr == 0) {
-				ret = -1;
+			size_t iarg = (iloop + ipass + iextra++) % arg.num_datasize;
+			cursize = arg.datasize[iarg] + ((iloop + ipass) % 21);
+			void* first_addr = rpmalloc(cursize);
+			if (first_addr == 0) {
+				ret = test_fail("Allocation failed");
 				goto end;
 			}
 
-			arg.pointers[iloop * arg.passes + ipass] = addr;
+			iarg = (iloop + ipass + iextra++) % arg.num_datasize;
+			cursize = arg.datasize[iarg] + ((iloop + ipass) % 71);
+			void* second_addr = rpmalloc(cursize);
+			if (second_addr == 0) {
+				ret = test_fail("Allocation failed");
+				goto end;
+			}
+
+			iarg = (iloop + ipass + iextra++) % arg.num_datasize;
+			cursize = arg.datasize[iarg] + ((iloop + ipass) % 17);
+			void* third_addr = rpmalloc(cursize);
+			if (third_addr == 0) {
+				ret = test_fail("Allocation failed");
+				goto end;
+			}
+
+			rpfree(first_addr);
+			arg.pointers[iloop * arg.passes + ipass] = second_addr;
+			extra_pointers[iloop * arg.passes + ipass] = third_addr;
+
+			while ((next_crossthread < end_crossthread) &&
+			        arg.crossthread_pointers[next_crossthread]) {
+				rpfree(arg.crossthread_pointers[next_crossthread]);
+				arg.crossthread_pointers[next_crossthread] = 0;
+				++next_crossthread;
+			}
+		}
+	}
+
+	for (iloop = 0; iloop < arg.loops; ++iloop) {
+		for (ipass = 0; ipass < arg.passes; ++ipass) {
+			rpfree(extra_pointers[(iloop * arg.passes) + ipass]);
+		}
+	}
+
+	rpfree(extra_pointers);
+
+	while (next_crossthread < end_crossthread) {
+		if (arg.crossthread_pointers[next_crossthread]) {
+			rpfree(arg.crossthread_pointers[next_crossthread]);
+			arg.crossthread_pointers[next_crossthread] = 0;
+			++next_crossthread;
+		} else {
+			thread_yield();
 		}
 	}
 
@@ -426,12 +544,15 @@
 	for (iloop = 0; iloop < arg.loops; ++iloop) {
 		rpmalloc_thread_initialize();
 
+		unsigned int max_datasize = 0;
 		for (ipass = 0; ipass < arg.passes; ++ipass) {
-			cursize = 4 + arg.datasize[(iloop + ipass + iwait) % arg.num_datasize] + ((iloop + ipass) % 1024);
+			cursize = arg.datasize[(iloop + ipass + iwait) % arg.num_datasize] + ((iloop + ipass) % 1024);
+			if (cursize > max_datasize)
+				max_datasize = cursize;
 
 			addr[ipass] = rpmalloc(4 + cursize);
 			if (addr[ipass] == 0) {
-				ret = -1;
+				ret = test_fail("Allocation failed");
 				goto end;
 			}
 
@@ -439,24 +560,30 @@
 			memcpy(pointer_offset(addr[ipass], 4), data, cursize);
 
 			for (icheck = 0; icheck < ipass; ++icheck) {
+				size_t this_size = *(uint32_t*)addr[ipass];
+				size_t check_size = *(uint32_t*)addr[icheck];
+				if (this_size != cursize) {
+					ret = test_fail("Data corrupted in this block (size)");
+					goto end;
+				}
+				if (check_size > max_datasize) {
+					ret = test_fail("Data corrupted in previous block (size)");
+					goto end;
+				}
 				if (addr[icheck] == addr[ipass]) {
-					ret = -1;
+					ret = test_fail("Identical pointer returned from allocation");
 					goto end;
 				}
 				if (addr[icheck] < addr[ipass]) {
-					if (pointer_offset(addr[icheck], *(uint32_t*)addr[icheck]) > addr[ipass]) {
-						if (pointer_offset(addr[icheck], *(uint32_t*)addr[icheck]) > addr[ipass]) {
-							ret = -1;
-							goto end;
-						}
+					if (pointer_offset(addr[icheck], check_size + 4) > addr[ipass]) {
+						ret = test_fail("Invalid pointer inside another block returned from allocation");
+						goto end;
 					}
 				}
 				else if (addr[icheck] > addr[ipass]) {
-					if (pointer_offset(addr[ipass], *(uint32_t*)addr[ipass]) > addr[ipass]) {
-						if (pointer_offset(addr[ipass], *(uint32_t*)addr[ipass]) > addr[icheck]) {
-							ret = -1;
-							goto end;
-						}
+					if (pointer_offset(addr[ipass], cursize + 4) > addr[icheck]) {
+						ret = test_fail("Invalid pointer inside another block returned from allocation");
+						goto end;
 					}
 				}
 			}
@@ -464,9 +591,13 @@
 
 		for (ipass = 0; ipass < arg.passes; ++ipass) {
 			cursize = *(uint32_t*)addr[ipass];
+			if (cursize > max_datasize) {
+				ret = test_fail("Data corrupted (size)");
+				goto end;
+			}
 
 			if (memcmp(pointer_offset(addr[ipass], 4), data, cursize)) {
-				ret = -1;
+				ret = test_fail("Data corrupted");
 				goto end;
 			}
 			
@@ -474,6 +605,7 @@
 		}
 
 		rpmalloc_thread_finalize();
+		thread_yield();
 	}
 
 end:
@@ -555,10 +687,11 @@
 		num_alloc_threads = 4;
 
 	for (unsigned int ithread = 0; ithread < num_alloc_threads; ++ithread) {
-		unsigned int iadd = ithread * (16 + ithread);
+		unsigned int iadd = (ithread * (16 + ithread) + ithread) % 128;
 		arg[ithread].loops = 50;
 		arg[ithread].passes = 1024;
 		arg[ithread].pointers = rpmalloc(sizeof(void*) * arg[ithread].loops * arg[ithread].passes);
+		memset(arg[ithread].pointers, 0, sizeof(void*) * arg[ithread].loops * arg[ithread].passes);
 		arg[ithread].datasize[0] = 19 + iadd;
 		arg[ithread].datasize[1] = 249 + iadd;
 		arg[ithread].datasize[2] = 797 + iadd;
@@ -567,13 +700,13 @@
 		arg[ithread].datasize[5] = 344 + iadd;
 		arg[ithread].datasize[6] = 3892 + iadd;
 		arg[ithread].datasize[7] = 19 + iadd;
-		arg[ithread].datasize[8] = 14954 + iadd;
+		arg[ithread].datasize[8] = 154 + iadd;
 		arg[ithread].datasize[9] = 39723 + iadd;
 		arg[ithread].datasize[10] = 15 + iadd;
 		arg[ithread].datasize[11] = 493 + iadd;
 		arg[ithread].datasize[12] = 34 + iadd;
 		arg[ithread].datasize[13] = 894 + iadd;
-		arg[ithread].datasize[14] = 6893 + iadd;
+		arg[ithread].datasize[14] = 193 + iadd;
 		arg[ithread].datasize[15] = 2893 + iadd;
 		arg[ithread].num_datasize = 16;
 
@@ -581,6 +714,10 @@
 		targ[ithread].arg = &arg[ithread];
 	}
 
+	for (unsigned int ithread = 0; ithread < num_alloc_threads; ++ithread) {
+		arg[ithread].crossthread_pointers = arg[(ithread + 1) % num_alloc_threads].pointers;
+	}
+
 	for (int iloop = 0; iloop < 32; ++iloop) {
 		for (unsigned int ithread = 0; ithread < num_alloc_threads; ++ithread)
 			thread[ithread] = thread_run(&targ[ithread]);
@@ -590,10 +727,6 @@
 		for (unsigned int ithread = 0; ithread < num_alloc_threads; ++ithread) {
 			if (thread_join(thread[ithread]) != 0)
 				return -1;
-
-			//Off-thread deallocation
-			for (size_t iptr = 0; iptr < arg[ithread].loops * arg[ithread].passes; ++iptr)
-				rpfree(arg[ithread].pointers[iptr]);
 		}
 	}
 
@@ -677,6 +810,8 @@
 	test_initialize();
 	if (test_alloc())
 		return -1;
+	if (test_realloc())
+		return -1;
 	if (test_superalign())
 		return -1;
 	if (test_crossthread())
@@ -685,6 +820,7 @@
 		return -1;
 	if (test_threaded())
 		return -1;
+	printf("All tests passed\n");
 	return 0;
 }
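
Note (not part of the patch): the usable-size assertions added to test_alloc() above all follow the same arithmetic. Small requests are rounded up to the allocator's 16-byte block granularity (a zero-byte request still occupies one 16-byte block), and the medium requests in the 1025..6000 byte range exercised by the test are rounded up in 512-byte steps. A minimal standalone sketch of that rounding, mirroring the expressions used in the test (helper names are illustrative):

#include <stddef.h>
#include <stdio.h>

/* Expected usable size for a small request (0..1024 bytes): round up to the
   16-byte block granularity; a zero-byte request still consumes one block. */
static size_t
expected_small_usable_size(size_t size) {
	return 16 * ((size / 16) + ((!size || (size % 16)) ? 1 : 0));
}

/* Expected usable size for a medium request (1025..6000 bytes in the test):
   round up to the 512-byte medium block granularity. */
static size_t
expected_medium_usable_size(size_t size) {
	return 512 * ((size / 512) + ((size % 512) ? 1 : 0));
}

int
main(void) {
	printf("%zu %zu %zu\n",
	       expected_small_usable_size(0),      /* 16 */
	       expected_small_usable_size(151),    /* 160, as asserted after rprealloc(testptr, 151) */
	       expected_medium_usable_size(1025)); /* 1536 */
	return 0;
}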
 
diff --git a/test/thread.c b/test/thread.c
index b9af573..15b8d1d 100644
--- a/test/thread.c
+++ b/test/thread.c
@@ -82,7 +82,7 @@
 void
 thread_sleep(int milliseconds) {
 #ifdef _WIN32
-	SleepEx((DWORD)milliseconds, 1);
+	SleepEx((DWORD)milliseconds, 0);
 #else
 	struct timespec ts;
 	ts.tv_sec  = milliseconds / 1000;
diff --git a/test/thread.h b/test/thread.h
index 49ce1a5..5c0d873 100644
--- a/test/thread.h
+++ b/test/thread.h
@@ -1,6 +1,9 @@
 
 #include <stdint.h>
 
+#ifdef __cplusplus
+extern "C" {
+#endif
 
 struct thread_arg {
 	void (*fn)(void*);
@@ -25,3 +28,7 @@
 
 extern void
 thread_fence(void);
+
+#ifdef __cplusplus
+}
+#endif
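
Note (not part of the patch): in the test/main.c changes above, the rewritten allocator_thread() replaces the old off-thread deallocation loop in test_threaded() with a handoff scheme: each thread publishes its allocations in its pointers array and concurrently drains the next thread's array through crossthread_pointers, yielding until every slot has been freed. The condensed sketch below shows the same producer/consumer handoff in isolation, using C11 atomics for the slot exchange, pthreads for the threads, and the rpmalloc calls already used by the test; rpmalloc.h is assumed to be on the include path and all other names are illustrative.

#include <pthread.h>
#include <stdatomic.h>
#include <sched.h>
#include <rpmalloc.h>

#define SLOT_COUNT 1024

/* One slot per pending cross-thread free: the producer stores a block here,
   the consumer picks it up and frees it on its own thread. */
static _Atomic(void*) slots[SLOT_COUNT];

static void*
producer(void* unused) {
	(void)unused;
	rpmalloc_thread_initialize();
	for (size_t islot = 0; islot < SLOT_COUNT; ++islot) {
		void* block = rpmalloc(16 + (islot % 500));
		/* Publish the block so the consumer thread can free it */
		atomic_store_explicit(&slots[islot], block, memory_order_release);
	}
	rpmalloc_thread_finalize();
	return 0;
}

static void*
consumer(void* unused) {
	(void)unused;
	rpmalloc_thread_initialize();
	for (size_t islot = 0; islot < SLOT_COUNT; ++islot) {
		void* block;
		/* Spin (yielding) until the producer has published this slot, then
		   free the block from this thread - a cross-thread deallocation. */
		while (!(block = atomic_load_explicit(&slots[islot], memory_order_acquire)))
			sched_yield();
		rpfree(block);
	}
	rpmalloc_thread_finalize();
	return 0;
}

int
main(void) {
	rpmalloc_initialize();
	pthread_t threads[2];
	pthread_create(&threads[0], 0, producer, 0);
	pthread_create(&threads[1], 0, consumer, 0);
	pthread_join(threads[0], 0);
	pthread_join(threads[1], 0);
	rpmalloc_finalize();
	return 0;
}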