Merge branch 'release/1.3.0'
diff --git a/CACHE.md b/CACHE.md
index b29907f..6450930 100644
--- a/CACHE.md
+++ b/CACHE.md
@@ -1,5 +1,5 @@
 # Thread caches
-rpmalloc has a thread cache of free memory blocks which can be used in allocations without interfering with other threads or going to system to map more memory, as well as a global cache shared by all threads to let pages flow between threads. Configuring the size of these caches can be crucial to obtaining good performance while minimizing memory overhead blowup. Below is a simple case study using the benchmark tool to compare different thread cache configurations for rpmalloc.
+rpmalloc has a thread cache of free memory blocks which can be used in allocations without interfering with other threads or going to the system to map more memory, as well as a global cache shared by all threads to let spans of memory pages flow between threads. Configuring the size of these caches can be crucial to obtaining good performance while minimizing memory overhead. Below is a simple case study using the benchmark tool to compare different thread cache configurations for rpmalloc.
 
 The rpmalloc thread cache is configured as either unlimited, performance oriented (the default values), size oriented (where both the thread cache and the global cache are reduced significantly), or disabled (where both thread and global caches are disabled and completely free pages are unmapped directly).
 
diff --git a/CHANGELOG b/CHANGELOG
index 0ca80de..f2f25a7 100644
--- a/CHANGELOG
+++ b/CHANGELOG
@@ -1,3 +1,24 @@
+1.3.0
+
+Make span size configurable and all spans equal in size, removing span size classes and streamlining the thread cache.
+
+Allow super spans to be reserved in advance and split up into multiple used spans to reduce the number of system calls. This will not increase committed physical pages, only reserved virtual memory space.
+
+Allow super spans to be reused for allocations of lower size, breaking up the super span and storing remainder in thread cache in order to reduce load on global cache and reduce cache overhead.
+
+Fixed an issue where an allocation of zero bytes would cause a segmentation fault from indexing size class array with index -1.
+
+Fixed an issue where an allocation of maximum large block size (2097120 bytes) would index the heap cache array out of bounds and potentially cause a segmentation fault depending on earlier allocation patterns.
+
+Fixed an issue where memory pages at the start of an aligned span run were not completely unmapped on POSIX systems.
+
+Fixed an issue where spans were not correctly marked as owned by the heap after traversing the global span cache.
+
+Added function to access the allocator configuration after initialization to find default values.
+
+Removed allocated and reserved statistics to reduce code complexity.
+
+
 1.2.2
 
 Add configurable memory mapper providing map/unmap of memory pages. Default to VirtualAlloc/mmap if none provided. This allows rpmalloc to be used in contexts where memory is provided by internal means.
diff --git a/README.md b/README.md
index 617c131..c93fc6c 100644
--- a/README.md
+++ b/README.md
@@ -1,5 +1,5 @@
 # rpmalloc - Rampant Pixels Memory Allocator
-This library provides a public domain cross platform lock free thread caching 16-byte aligned memory allocator implemented in C. The latest source code is always available at https://github.com/rampantpixels/rpmalloc
+This library provides a public domain cross platform lock free thread caching 32-byte aligned memory allocator implemented in C. The latest source code is always available at https://github.com/rampantpixels/rpmalloc
 
 Platforms currently supported:
 
@@ -9,7 +9,7 @@
 - Linux
 - Android
 
-The code should be easily portable to any platform with atomic operations and an mmap-style virtual memory management API. The API used to map/unmap memory pages can be configured in runtime to a custom implementation.
+The code should be easily portable to any platform with atomic operations and an mmap-style virtual memory management API. The API used to map/unmap memory pages can be configured at runtime with a custom implementation and mapping granularity/size.
 
 This library is put in the public domain; you can redistribute it and/or modify it without any restrictions. Or, if you choose, you can use it under the MIT license.
 
@@ -18,7 +18,7 @@
 Created by Mattias Jansson ([@maniccoder](https://twitter.com/maniccoder)) / Rampant Pixels - http://www.rampantpixels.com
 
 # Performance
-We believe rpmalloc is faster than most popular memory allocators like tcmalloc, hoard, ptmalloc3 and others without causing extra allocated memory overhead in the thread caches. We also believe the implementation to be easier to read and modify compared to these allocators, as it is a single source file of ~1800 lines of C code.
+We believe rpmalloc is faster than most popular memory allocators like tcmalloc, hoard and ptmalloc3, without causing extra allocated memory overhead in the thread caches compared to these allocators. We also believe the implementation to be easier to read and modify, as it is a single source file of ~2100 lines of C code. All allocations have a natural 32-byte alignment.
 
 Contained in a parallel repository is a benchmark utility that performs interleaved allocations (both aligned to 8 or 16 bytes, and unaligned) and deallocations (both in-thread and cross-thread) in multiple threads. It measures the number of memory operations performed per CPU second, as well as memory overhead by comparing the virtual memory mapped with the number of bytes requested in allocation calls. The number of threads, the cross-thread deallocation rate and the allocation size limits are configured by command line arguments.
 
@@ -37,7 +37,7 @@
 
 __rpmalloc_initialize__ : Call at process start to initialize the allocator
 
-__rpmalloc_initialize_config__ : Optional entry point to call at process start to initialize the allocator with a custom memory mapping backend and/or memory page size
+__rpmalloc_initialize_config__ : Optional entry point to call at process start to initialize the allocator with a custom memory mapping backend, memory page size and mapping granularity
 
 __rpmalloc_finalize__: Call at process exit to finalize the allocator
 
@@ -45,6 +45,8 @@
 
 __rpmalloc_thread_finalize__: Call at each thread exit to finalize and release thread cache back to global cache
 
+__rpmalloc_config__: Get the current runtime configuration of the allocator
+
-Then simply use the __rpmalloc__/__rpfree__ and the other malloc style replacement functions. Remember all allocations are 16-byte aligned, so no need to call the explicit rpmemalign/rpaligned_alloc/rpposix_memalign functions unless you need greater alignment, they are simply wrappers to make it easier to replace in existing code.
+Then simply use the __rpmalloc__/__rpfree__ and the other malloc style replacement functions. Remember all allocations are 32-byte aligned, so there is no need to call the explicit rpmemalign/rpaligned_alloc/rpposix_memalign functions unless you need greater alignment; they are simply wrappers to make it easier to replace in existing code.
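+
+For illustration, a minimal usage sketch of the entry points above (error handling omitted; this is a sketch assuming the default configuration, not part of the library):
+
+```c
+#include "rpmalloc.h"
+
+int main(void) {
+	rpmalloc_initialize();          // once at process start
+	rpmalloc_thread_initialize();   // in every thread that allocates
+
+	void* block = rpmalloc(100);    // fitted to a 32-byte granular size class
+	block = rprealloc(block, 200);
+	rpfree(block);
+
+	rpmalloc_thread_finalize();     // at each thread exit
+	rpmalloc_finalize();            // once at process exit
+	return 0;
+}
+```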
 
 If you wish to override the standard library malloc family of functions and have automatic initialization/finalization of process and threads, also include the `malloc.c` file in your project. The automatic init/fini is only implemented for Linux and macOS targets. The list of libc entry points replaced may not be complete, use libc replacement only as a convenience for testing the library on an existing code base, not a final solution.
@@ -57,13 +59,13 @@
 The latest stable release is available in the master branch. For latest development code, use the develop branch.
 
 # Cache configuration options
-Free memory pages are cached both per thread and in a global cache for all threads. The size of the thread caches is determined by an adaptive scheme where each cache is limited by a percentage of the maximum allocation count of the corresponding size class. The size of the global caches is determined by a multiple of the maximum of all thread caches. The factors controlling the cache sizes can be set by either defining one of four presets, or by editing the individual defines in the `rpmalloc.c` source file for fine tuned control. If you do not define any of the following three directives, the default preset will be used which is to increase caches and prioritize performance over memory overhead (but not making caches unlimited).
+Free memory pages are cached both per thread and in a global cache for all threads. The size of the thread caches is determined by an adaptive scheme where each cache is limited by a percentage of the maximum allocation count of the corresponding size class. The size of the global caches is determined by a multiple of the maximum of all thread caches. The factors controlling the cache sizes can be set by editing the individual defines in the `rpmalloc.c` source file for fine-tuned control.
 
-__ENABLE_UNLIMITED_CACHE__: This will make all caches infinite, i.e never release spans to global cache unless thread finishes, and never unmap memory pages back to the OS. Highest performance but largest memory overhead.
+__ENABLE_UNLIMITED_CACHE__: By default defined to 0, set to 1 to make all caches infinite, i.e. never release spans to the global cache unless the thread finishes, and never unmap memory pages back to the OS. Highest performance but largest memory overhead.
 
-__ENABLE_SPACE_PRIORITY_CACHE__: This will reduce caches to minimize memory overhead while still maintaining decent performance.
+__ENABLE_GLOBAL_CACHE__: By default defined to 1, enables the global cache shared between all threads. Set to 0 to disable the global cache and directly unmap pages evicted from the thread cache.
 
-__DISABLE_CACHE__: This will completely disable caches for free pages and instead immediately unmap memory pages back to the OS when no longer in use. Minimizes memory overhead but heavily reduces performance.
+__ENABLE_THREAD_CACHE__: By default defined to 1, enables the per-thread cache. Set to 0 to disable the thread cache and directly unmap pages no longer in use (also disables the global cache).
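+
+Since these defines are wrapped in `#ifndef` guards in `rpmalloc.c`, one way to set them without editing the source is a small wrapper translation unit, sketched below (the file name is hypothetical):
+
+```c
+/* rpmalloc_tuned.c - hypothetical wrapper, compiled in place of rpmalloc.c */
+#define ENABLE_THREAD_CACHE 1   /* keep the per-thread span cache */
+#define ENABLE_GLOBAL_CACHE 0   /* unmap spans evicted from thread caches directly */
+#include "rpmalloc.c"
+```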
 
 # Other configuration options
 Detailed statistics are available if __ENABLE_STATISTICS__ is defined to 1 (default is 0, or disabled), either on compile command line or by setting the value in `rpmalloc.c`. This will cause a slight overhead in runtime to collect statistics for each memory operation, and will also add 4 bytes overhead per allocation to track sizes.
@@ -72,19 +74,19 @@
 
 Asserts are enabled if __ENABLE_ASSERTS__ is defined to 1 (default is 0, or disabled), either on compile command line or by setting the value in `rpmalloc.c`.
 
-Overwrite and underwrite guards are enabled if __ENABLE_GUARDS__ is defined to 1 (default is 0, or disabled), either on compile command line or by settings the value in `rpmalloc.c`. This will introduce up to 32 byte overhead on each allocation to store magic numbers, which will be verified when freeing the memory block. The actual overhead is dependent on the requested size compared to size class limits.
+Overwrite and underwrite guards are enabled if __ENABLE_GUARDS__ is defined to 1 (default is 0, or disabled), either on the compile command line or by setting the value in `rpmalloc.c`. This will introduce up to 64 bytes of overhead on each allocation to store magic numbers, which will be verified when freeing the memory block. The actual overhead is dependent on the requested size compared to size class limits.
 
 # Quick overview
 The allocator is similar in spirit to tcmalloc from the [Google Performance Toolkit](https://github.com/gperftools/gperftools). It uses separate heaps for each thread and partitions memory blocks according to a preconfigured set of size classes, up to 2MiB. Larger blocks are mapped and unmapped directly. Allocations for different size classes will be served from different sets of memory pages, where each "span" of pages is dedicated to one size class. Spans of pages can flow between threads when the thread cache overflows and are released to a global cache, or when the thread ends. Unlike tcmalloc, single blocks do not flow between threads, only entire spans of pages.
 
 # Implementation details
-The allocator is based on 64KiB page alignment and 16 byte block alignment, where all runs of memory pages are mapped to 64KiB boundaries. On Windows this is automatically guaranteed by the VirtualAlloc granularity, and on mmap systems it is achieved by atomically incrementing the address where pages are mapped to. By aligning to 64KiB boundaries the free operation can locate the header of the memory block without having to do a table lookup (as tcmalloc does) by simply masking out the low 16 bits of the address.
+The allocator is based on a fixed but configurable page alignment (defaults to 64KiB) and 32 byte block alignment, where all runs of memory pages (spans) are mapped to this alignment boundary. On Windows this is automatically guaranteed up to 64KiB by the VirtualAlloc granularity, and on mmap systems it is achieved by oversizing the mapping and aligning the returned virtual memory address to the required boundaries. By aligning to a fixed size the free operation can locate the header of the memory span without having to do a table lookup (as tcmalloc does) by simply masking out the low bits of the address (for 64KiB this would be the low 16 bits).
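+
+A sketch of that masking idea for the default 64KiB span size (the names below are illustrative, not the identifiers used in `rpmalloc.c`):
+
+```c
+#include <stdint.h>
+
+/* With spans mapped on 64KiB boundaries, the span header owning any block
+   is found by masking out the low 16 bits of the block address. */
+#define EXAMPLE_SPAN_SIZE ((uintptr_t)65536)
+#define EXAMPLE_SPAN_MASK (~(EXAMPLE_SPAN_SIZE - 1))
+
+static inline void*
+example_span_of_block(void* block) {
+	return (void*)((uintptr_t)block & EXAMPLE_SPAN_MASK);
+}
+```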
 
-Memory blocks are divided into three categories. Small blocks are [16, 2032] bytes, medium blocks (2032, 32720] bytes, and large blocks (32720, 2097120] bytes. The three categories are further divided in size classes.
+Memory blocks are divided into three categories. For the 64KiB span size/alignment, small blocks are [32, 2016] bytes, medium blocks (2016, 32720] bytes, and large blocks (32720, 2097120] bytes. The three categories are further divided into size classes. If the span size is changed, the small block classes remain the same but medium blocks instead cover the (2016, span size] byte range.
 
-Small blocks have a size class granularity of 16 bytes each in 127 buckets. Medium blocks have a granularity of 512 bytes, 60 buckets. Large blocks have a 64KiB granularity, 32 buckets. All allocations are fitted to these size class boundaries (an allocation of 34 bytes will allocate a block of 48 bytes). Each small and medium size class has an associated span (meaning a contiguous set of memory pages) configuration describing how many pages the size class will allocate each time the cache is empty and a new allocation is requested.
+Small blocks have a size class granularity of 32 bytes each in 63 buckets. Medium blocks have a granularity of 512 bytes, 60 buckets (default). Large blocks have the same granularity as the configured span size (default 64KiB). All allocations are fitted to these size class boundaries (an allocation of 42 bytes will allocate a block of 64 bytes). Each small and medium size class has an associated span (meaning a contiguous set of memory pages) configuration describing how many pages the size class will allocate each time the cache is empty and a new allocation is requested.
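+
+As a sketch of the size class rounding for small blocks (32-byte granularity; the helper name is illustrative):
+
+```c
+#include <stddef.h>
+
+/* Round a small request up to the next 32-byte size class boundary,
+   e.g. 42 -> 64, 1 -> 32, 2016 -> 2016. */
+static inline size_t
+example_small_class_size(size_t size) {
+	return (size + 31) & ~(size_t)31;
+}
+```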
 
-Spans for small and medium blocks are cached in four levels to avoid calls to map/unmap memory pages. The first level is a per thread single active span for each size class. The second level is a per thread list of partially free spans for each size class. The third level is a per thread list of free spans for each number of pages in the span configuration. The fourth level is a global list of free spans for each number of pages in the span configuration.
+Spans for small and medium blocks are cached in four levels to avoid calls to map/unmap memory pages. The first level is a per thread single active span for each size class. The second level is a per thread list of partially free spans for each size class. The third level is a per thread list of free spans. The fourth level is a global list of free spans.
 
 Each span for a small and medium size class keeps track of how many blocks are allocated/free, as well as a list of which blocks that are free for allocation. To avoid locks, each span is completely owned by the allocating thread, and all cross-thread deallocations will be deferred to the owner thread.
 
@@ -93,9 +95,18 @@
 # Memory mapping
 By default the allocator uses OS APIs to map virtual memory pages as needed, either `VirtualAlloc` on Windows or `mmap` on POSIX systems. If you want to use your own custom memory mapping provider you can use __rpmalloc_initialize_config__ and pass function pointers to map and unmap virtual memory. These functions should reserve and free the requested number of bytes.
 
-The functions do not need to deal with alignment, this is done by rpmalloc internally. However, ideally the map function should return pages aligned to 64KiB boundaries in order to avoid extra mapping requests (see caveats section below). What will happen is that if the first map call returns an address that is not 64KiB aligned, rpmalloc will immediately unmap that block and call a new mapping request with the size increased by 64KiB, then perform the alignment internally. To avoid this double mapping, always return blocks aligned to 64KiB.
+The functions must guarantee alignment to the configured span size. Either provide the span size during initialization using __rpmalloc_initialize_config__, or use __rpmalloc_config__ to find the required alignment, which is equal to the span size. The span size MUST be a power of two in the [4096, 262144] range, and be a multiple (or divisor) of the memory page size.
 
-Memory mapping requests are always done in multiples of the memory page size. You can specify a custom page size when initializing rpmalloc with __rpmalloc_initialize_config__, or pass 0 to let rpmalloc determine the system memory page size using OS APIs. The page size MUST be a power of two in [512, 16384] range.
+Memory mapping requests are always done in multiples of the memory page size or the span size, whichever is larger. You can specify a custom page size when initializing rpmalloc with __rpmalloc_initialize_config__, or pass 0 to let rpmalloc determine the system memory page size using OS APIs. The page size MUST be a power of two in the [512, 16384] range.
+
+To reduce system call overhead, memory spans are mapped in batches controlled by the `span_map_count` configuration variable (which, if set to 0, falls back to the `DEFAULT_SPAN_MAP_COUNT` define in `rpmalloc.c`, 16 by default). If the platform can handle partial unmaps (unmapping one or more spans of memory pages that were mapped in a larger batch) the `unmap_partial` configuration variable should be set to non-zero. If not, spans will be kept until the entire batch can be unmapped.
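+
+A sketch of setting these parameters at initialization; the `rpmalloc_config_t` field names below mirror the configuration variables mentioned in this section and should be treated as assumptions if you target another version:
+
+```c
+#include <string.h>
+#include "rpmalloc.h"
+
+/* Any field left at zero falls back to its default value. */
+static void
+example_initialize(void) {
+	rpmalloc_config_t config;
+	memset(&config, 0, sizeof(config));
+	config.page_size = 0;        /* 0: query the system page size */
+	config.span_size = 65536;    /* power of two in the [4096, 262144] range */
+	config.span_map_count = 16;  /* spans mapped per batch/system call */
+	rpmalloc_initialize_config(&config);
+}
+```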
+
+# Span breaking
+Super spans (spans that are a multiple > 1 of the span size) can be subdivided into smaller spans to fulfill a need to map a new span of memory. By default the allocator will greedily grab and break any larger span from the available caches before mapping new virtual memory. However, spans can currently not be glued together to form larger super spans again. Subspans can traverse the cache and be used by different threads individually.
+
+A span that is a subspan of a larger super span can be individually decommitted to reduce physical memory pressure when the span is evicted from caches and scheduled to be unmapped. The original super span keeps track of the subspans it is broken up into, and when the entire range is decommitted the super span will be unmapped. This supports platforms like Windows, which require the entire virtual memory range that was mapped in a call to VirtualAlloc to be unmapped in one call to VirtualFree, while still allowing individual pages in subspans to be decommitted.
+
+If you use a custom memory map/unmap function you need to take this into account by looking at the `release` parameter given to the `memory_unmap` function. It is set to 0 for decommitting individual pages and 1 for releasing the entire super span memory range.
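+
+Below is a sketch of a custom mapping backend honoring the `release` parameter, using the Windows APIs discussed above. The callback signatures mirror the internal defaults in `rpmalloc.c`; the exact `rpmalloc_config_t` field names are assumptions:
+
+```c
+#include <windows.h>
+#include <string.h>
+#include "rpmalloc.h"
+
+/* VirtualAlloc returns 64KiB-aligned memory, so with the default span size
+   no extra alignment work is needed and the alignment offset stays zero. */
+static void*
+example_memory_map(size_t size, size_t* offset) {
+	*offset = 0;
+	return VirtualAlloc(0, size, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
+}
+
+static void
+example_memory_unmap(void* address, size_t size, size_t offset, int release) {
+	(void)offset;
+	if (release)
+		VirtualFree(address, 0, MEM_RELEASE);      /* release the whole super span range */
+	else
+		VirtualFree(address, size, MEM_DECOMMIT);  /* decommit the pages of a subspan */
+}
+
+static void
+example_initialize_with_mapper(void) {
+	rpmalloc_config_t config;
+	memset(&config, 0, sizeof(config));
+	config.memory_map = example_memory_map;
+	config.memory_unmap = example_memory_unmap;
+	rpmalloc_initialize_config(&config);
+}
+```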
 
 # Memory guards
 If you define the __ENABLE_GUARDS__ to 1, all memory allocations will be padded with extra guard areas before and after the memory block (while still honoring the requested alignment). These dead zones will be filled with a pattern and checked when the block is freed. If the patterns are not intact the callback set in initialization config is called, or if not set an assert is fired.
@@ -118,17 +129,17 @@
 Threads that have allocation patterns where the difference between the memory usage high and low water marks fits within the thread cache thresholds in the allocator will never touch the global cache except during thread init/fini, and will have optimal performance. Tweaking the cache limits can be done on a per-size-class basis.
 
 # Worst case scenarios
-Since each thread cache maps spans of memory pages per size class, a thread that allocates just a few blocks of each size class (16, 32, 48, ...) for many size classes will never fill each bucket, and thus map a lot of memory pages while only using a small fraction of the mapped memory. However, the wasted memory will always be less than 64KiB per size class.
+Since each thread cache maps spans of memory pages per size class, a thread that allocates just a few blocks of each size class (32, 64, ...) for many size classes will never fill each bucket, and thus map a lot of memory pages while only using a small fraction of the mapped memory. However, the wasted memory will always be less than 64KiB (or the configured span size) per size class. The cache for free spans will be reused by all size classes.
 
 An application that has a producer-consumer scheme between threads, where one thread performs all allocations and another frees all memory, will have sub-optimal performance, since blocks crossing thread boundaries will be freed in a two step process - first deferred to the allocating thread, then freed when that thread needs more memory pages for the requested size. However, depending on the use case the performance overhead might be small.
 
-Threads that perform a lot of allocations and deallocations in a pattern that have a large difference in high and low water marks, and that difference is larger than the thread cache size, will put a lot of contention on the global cache. What will happen is the thread cache will overflow on each low water mark causing pages to be released to the global cache, then underflow on high water mark causing pages to be re-acquired from the global cache.
+Threads that perform a lot of allocations and deallocations in a pattern that has a large difference between high and low water marks, where that difference exceeds the thread cache size, will put a lot of contention on the global cache. What will happen is that the thread cache will overflow on each low water mark, causing pages to be released to the global cache, then underflow on the high water mark, causing pages to be re-acquired from the global cache. This can be mitigated by changing the __MAX_SPAN_CACHE_DIVISOR__ define in the source code (at the cost of higher average memory overhead).
 
 # Caveats
 Cross-thread deallocations are more costly than in-thread deallocations, since the spans are completely owned by the allocating thread. The free operation will be deferred using an atomic list operation and the actual free operation will be performed when the owner thread requires a new block of the corresponding size class.
 
-VirtualAlloc has an internal granularity of 64KiB. However, mmap lacks this granularity control, and the implementation instead oversizes the memory mapping with 64KiB to be able to always return a memory area with this alignment. Since the extra memory pages are never touched this will not result in extra committed physical memory pages, but rather only increase virtual memory address space.
+VirtualAlloc has an internal granularity of 64KiB. However, mmap lacks this granularity control, and the implementation instead oversizes the memory mapping with the configured span size to be able to always return a memory area with the required alignment. Since the extra memory pages are never touched this will not result in extra committed physical memory pages, but rather only increase virtual memory address space.
 
-The free, realloc and usable size functions all require the passed pointer to be within the first 64KiB page block of the start of the memory block. You cannot pass in any pointer from the memory block address range. 
+The free, realloc and usable size functions all require the passed pointer to be within the first 64KiB (or whatever you set the span size to) of the start of the memory block. You cannot pass in an arbitrary pointer from within the memory block address range.
 
 All entry points assume the passed values are valid, for example passing an invalid pointer to free would most likely result in a segmentation fault. The library does not try to guard against errors.
diff --git a/build/msvs/rpmalloc.sln b/build/msvs/rpmalloc.sln
index 32d1b29..b806eda 100644
--- a/build/msvs/rpmalloc.sln
+++ b/build/msvs/rpmalloc.sln
@@ -1,10 +1,12 @@
 
 Microsoft Visual Studio Solution File, Format Version 12.00
 # Visual Studio 15
-VisualStudioVersion = 15.0.26228.4
+VisualStudioVersion = 15.0.27130.2010
 MinimumVisualStudioVersion = 10.0.40219.1
 Project("{8BC9CEB8-8B4A-11D0-8D11-00A0C91BC942}") = "rpmalloc", "rpmalloc.vcxproj", "{65DC4291-954E-4B91-8889-4F3ADCC9D2D5}"
 EndProject
+Project("{8BC9CEB8-8B4A-11D0-8D11-00A0C91BC942}") = "test", "test.vcxproj", "{C31980DD-1241-4EF8-A351-69DAF982A7B9}"
+EndProject
 Global
 	GlobalSection(SolutionConfigurationPlatforms) = preSolution
 		Debug|x64 = Debug|x64
@@ -21,8 +23,19 @@
 		{65DC4291-954E-4B91-8889-4F3ADCC9D2D5}.Release|x64.Build.0 = Release|x64
 		{65DC4291-954E-4B91-8889-4F3ADCC9D2D5}.Release|x86.ActiveCfg = Release|Win32
 		{65DC4291-954E-4B91-8889-4F3ADCC9D2D5}.Release|x86.Build.0 = Release|Win32
+		{C31980DD-1241-4EF8-A351-69DAF982A7B9}.Debug|x64.ActiveCfg = Debug|x64
+		{C31980DD-1241-4EF8-A351-69DAF982A7B9}.Debug|x64.Build.0 = Debug|x64
+		{C31980DD-1241-4EF8-A351-69DAF982A7B9}.Debug|x86.ActiveCfg = Debug|Win32
+		{C31980DD-1241-4EF8-A351-69DAF982A7B9}.Debug|x86.Build.0 = Debug|Win32
+		{C31980DD-1241-4EF8-A351-69DAF982A7B9}.Release|x64.ActiveCfg = Release|x64
+		{C31980DD-1241-4EF8-A351-69DAF982A7B9}.Release|x64.Build.0 = Release|x64
+		{C31980DD-1241-4EF8-A351-69DAF982A7B9}.Release|x86.ActiveCfg = Release|Win32
+		{C31980DD-1241-4EF8-A351-69DAF982A7B9}.Release|x86.Build.0 = Release|Win32
 	EndGlobalSection
 	GlobalSection(SolutionProperties) = preSolution
 		HideSolutionNode = FALSE
 	EndGlobalSection
+	GlobalSection(ExtensibilityGlobals) = postSolution
+		SolutionGuid = {50C54715-12C7-4F8C-B7B6-B65A30D91DFF}
+	EndGlobalSection
 EndGlobal
diff --git a/build/msvs/test.vcxproj b/build/msvs/test.vcxproj
new file mode 100644
index 0000000..0a7b33f
--- /dev/null
+++ b/build/msvs/test.vcxproj
@@ -0,0 +1,239 @@
+<?xml version="1.0" encoding="utf-8"?>
+<Project DefaultTargets="Build" ToolsVersion="15.0" xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
+  <ItemGroup Label="ProjectConfigurations">
+    <ProjectConfiguration Include="Debug|Win32">
+      <Configuration>Debug</Configuration>
+      <Platform>Win32</Platform>
+    </ProjectConfiguration>
+    <ProjectConfiguration Include="Release|Win32">
+      <Configuration>Release</Configuration>
+      <Platform>Win32</Platform>
+    </ProjectConfiguration>
+    <ProjectConfiguration Include="Debug|x64">
+      <Configuration>Debug</Configuration>
+      <Platform>x64</Platform>
+    </ProjectConfiguration>
+    <ProjectConfiguration Include="Release|x64">
+      <Configuration>Release</Configuration>
+      <Platform>x64</Platform>
+    </ProjectConfiguration>
+  </ItemGroup>
+  <ItemGroup>
+    <ClCompile Include="..\..\test\main.c" />
+    <ClCompile Include="..\..\test\thread.c" />
+  </ItemGroup>
+  <ItemGroup>
+    <ClInclude Include="..\..\test\test.h" />
+    <ClInclude Include="..\..\test\thread.h" />
+  </ItemGroup>
+  <ItemGroup>
+    <ProjectReference Include="rpmalloc.vcxproj">
+      <Project>{65dc4291-954e-4b91-8889-4f3adcc9d2d5}</Project>
+    </ProjectReference>
+  </ItemGroup>
+  <PropertyGroup Label="Globals">
+    <VCProjectVersion>15.0</VCProjectVersion>
+    <ProjectGuid>{C31980DD-1241-4EF8-A351-69DAF982A7B9}</ProjectGuid>
+    <Keyword>Win32Proj</Keyword>
+    <RootNamespace>test</RootNamespace>
+    <WindowsTargetPlatformVersion>10.0.16299.0</WindowsTargetPlatformVersion>
+  </PropertyGroup>
+  <Import Project="$(VCTargetsPath)\Microsoft.Cpp.Default.props" />
+  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'" Label="Configuration">
+    <ConfigurationType>Application</ConfigurationType>
+    <UseDebugLibraries>true</UseDebugLibraries>
+    <PlatformToolset>v141</PlatformToolset>
+    <CharacterSet>Unicode</CharacterSet>
+  </PropertyGroup>
+  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|Win32'" Label="Configuration">
+    <ConfigurationType>Application</ConfigurationType>
+    <UseDebugLibraries>false</UseDebugLibraries>
+    <PlatformToolset>v141</PlatformToolset>
+    <WholeProgramOptimization>true</WholeProgramOptimization>
+    <CharacterSet>Unicode</CharacterSet>
+  </PropertyGroup>
+  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|x64'" Label="Configuration">
+    <ConfigurationType>Application</ConfigurationType>
+    <UseDebugLibraries>true</UseDebugLibraries>
+    <PlatformToolset>v141</PlatformToolset>
+    <CharacterSet>Unicode</CharacterSet>
+  </PropertyGroup>
+  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|x64'" Label="Configuration">
+    <ConfigurationType>Application</ConfigurationType>
+    <UseDebugLibraries>false</UseDebugLibraries>
+    <PlatformToolset>v141</PlatformToolset>
+    <WholeProgramOptimization>true</WholeProgramOptimization>
+    <CharacterSet>Unicode</CharacterSet>
+  </PropertyGroup>
+  <Import Project="$(VCTargetsPath)\Microsoft.Cpp.props" />
+  <ImportGroup Label="ExtensionSettings">
+  </ImportGroup>
+  <ImportGroup Label="Shared">
+  </ImportGroup>
+  <ImportGroup Label="PropertySheets" Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'">
+    <Import Project="$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props" Condition="exists('$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props')" Label="LocalAppDataPlatform" />
+  </ImportGroup>
+  <ImportGroup Label="PropertySheets" Condition="'$(Configuration)|$(Platform)'=='Release|Win32'">
+    <Import Project="$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props" Condition="exists('$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props')" Label="LocalAppDataPlatform" />
+  </ImportGroup>
+  <ImportGroup Label="PropertySheets" Condition="'$(Configuration)|$(Platform)'=='Debug|x64'">
+    <Import Project="$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props" Condition="exists('$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props')" Label="LocalAppDataPlatform" />
+  </ImportGroup>
+  <ImportGroup Label="PropertySheets" Condition="'$(Configuration)|$(Platform)'=='Release|x64'">
+    <Import Project="$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props" Condition="exists('$(UserRootDir)\Microsoft.Cpp.$(Platform).user.props')" Label="LocalAppDataPlatform" />
+  </ImportGroup>
+  <PropertyGroup Label="UserMacros" />
+  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|x64'">
+    <LinkIncremental>false</LinkIncremental>
+    <OutDir>..\..\bin\windows\release\x86-64\</OutDir>
+    <IntDir>$(SolutionDir)$(Platform)\$(Configuration)\$(ProjectName)\</IntDir>
+    <TargetName>rpmalloc-test</TargetName>
+  </PropertyGroup>
+  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'">
+    <LinkIncremental>false</LinkIncremental>
+    <OutDir>..\..\bin\windows\debug\x86\</OutDir>
+    <IntDir>$(SolutionDir)$(Platform)\$(Configuration)\$(ProjectName)\</IntDir>
+    <TargetName>rpmalloc-test</TargetName>
+  </PropertyGroup>
+  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|x64'">
+    <LinkIncremental>false</LinkIncremental>
+    <OutDir>..\..\bin\windows\debug\x86-64\</OutDir>
+    <IntDir>$(SolutionDir)$(Platform)\$(Configuration)\$(ProjectName)\</IntDir>
+    <TargetName>rpmalloc-test</TargetName>
+  </PropertyGroup>
+  <PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|Win32'">
+    <LinkIncremental>false</LinkIncremental>
+    <OutDir>..\..\bin\windows\release\x86\</OutDir>
+    <IntDir>$(SolutionDir)$(Platform)\$(Configuration)\$(ProjectName)\</IntDir>
+    <TargetName>rpmalloc-test</TargetName>
+  </PropertyGroup>
+  <ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Release|x64'">
+    <ClCompile>
+      <PrecompiledHeader>NotUsing</PrecompiledHeader>
+      <WarningLevel>Level3</WarningLevel>
+      <Optimization>Full</Optimization>
+      <FunctionLevelLinking>false</FunctionLevelLinking>
+      <IntrinsicFunctions>true</IntrinsicFunctions>
+      <SDLCheck>
+      </SDLCheck>
+      <PreprocessorDefinitions>NDEBUG;_CONSOLE;%(PreprocessorDefinitions)</PreprocessorDefinitions>
+      <ConformanceMode>true</ConformanceMode>
+      <CompileAsManaged>false</CompileAsManaged>
+      <CompileAsWinRT>false</CompileAsWinRT>
+      <MultiProcessorCompilation>true</MultiProcessorCompilation>
+      <InlineFunctionExpansion>AnySuitable</InlineFunctionExpansion>
+      <FavorSizeOrSpeed>Size</FavorSizeOrSpeed>
+      <OmitFramePointers>true</OmitFramePointers>
+      <EnableFiberSafeOptimizations>true</EnableFiberSafeOptimizations>
+      <RuntimeLibrary>MultiThreaded</RuntimeLibrary>
+      <ExceptionHandling>false</ExceptionHandling>
+      <StringPooling>true</StringPooling>
+      <BufferSecurityCheck>false</BufferSecurityCheck>
+      <FloatingPointModel>Fast</FloatingPointModel>
+      <FloatingPointExceptions>false</FloatingPointExceptions>
+      <CreateHotpatchableImage>false</CreateHotpatchableImage>
+      <AdditionalIncludeDirectories>..\..\test;..\..\rpmalloc;%(AdditionalIncludeDirectories)</AdditionalIncludeDirectories>
+    </ClCompile>
+    <Link>
+      <SubSystem>Console</SubSystem>
+      <EnableCOMDATFolding>true</EnableCOMDATFolding>
+      <OptimizeReferences>true</OptimizeReferences>
+      <GenerateDebugInformation>true</GenerateDebugInformation>
+    </Link>
+  </ItemDefinitionGroup>
+  <ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Debug|Win32'">
+    <ClCompile>
+      <PrecompiledHeader>NotUsing</PrecompiledHeader>
+      <WarningLevel>Level3</WarningLevel>
+      <Optimization>Disabled</Optimization>
+      <PreprocessorDefinitions>WIN32;_DEBUG;_CONSOLE;%(PreprocessorDefinitions)</PreprocessorDefinitions>
+      <ConformanceMode>true</ConformanceMode>
+      <DebugInformationFormat>ProgramDatabase</DebugInformationFormat>
+      <CompileAsManaged>false</CompileAsManaged>
+      <CompileAsWinRT>false</CompileAsWinRT>
+      <MultiProcessorCompilation>true</MultiProcessorCompilation>
+      <SDLCheck>
+      </SDLCheck>
+      <BasicRuntimeChecks>Default</BasicRuntimeChecks>
+      <RuntimeLibrary>MultiThreaded</RuntimeLibrary>
+      <MinimalRebuild>false</MinimalRebuild>
+      <ExceptionHandling>false</ExceptionHandling>
+      <BufferSecurityCheck>false</BufferSecurityCheck>
+      <FunctionLevelLinking>false</FunctionLevelLinking>
+      <FloatingPointModel>Fast</FloatingPointModel>
+      <FloatingPointExceptions>false</FloatingPointExceptions>
+      <CreateHotpatchableImage>false</CreateHotpatchableImage>
+      <AdditionalIncludeDirectories>..\..\test;..\..\rpmalloc;%(AdditionalIncludeDirectories)</AdditionalIncludeDirectories>
+    </ClCompile>
+    <Link>
+      <SubSystem>Console</SubSystem>
+      <GenerateDebugInformation>true</GenerateDebugInformation>
+    </Link>
+  </ItemDefinitionGroup>
+  <ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Debug|x64'">
+    <ClCompile>
+      <PrecompiledHeader>NotUsing</PrecompiledHeader>
+      <WarningLevel>Level3</WarningLevel>
+      <Optimization>Disabled</Optimization>
+      <SDLCheck>
+      </SDLCheck>
+      <PreprocessorDefinitions>NDEBUG;_CONSOLE;%(PreprocessorDefinitions)</PreprocessorDefinitions>
+      <ConformanceMode>true</ConformanceMode>
+      <DebugInformationFormat>ProgramDatabase</DebugInformationFormat>
+      <CompileAsManaged>false</CompileAsManaged>
+      <CompileAsWinRT>false</CompileAsWinRT>
+      <MultiProcessorCompilation>true</MultiProcessorCompilation>
+      <BasicRuntimeChecks>Default</BasicRuntimeChecks>
+      <RuntimeLibrary>MultiThreaded</RuntimeLibrary>
+      <MinimalRebuild>false</MinimalRebuild>
+      <ExceptionHandling>false</ExceptionHandling>
+      <BufferSecurityCheck>false</BufferSecurityCheck>
+      <FunctionLevelLinking>false</FunctionLevelLinking>
+      <FloatingPointModel>Fast</FloatingPointModel>
+      <FloatingPointExceptions>false</FloatingPointExceptions>
+      <CreateHotpatchableImage>false</CreateHotpatchableImage>
+      <AdditionalIncludeDirectories>..\..\test;..\..\rpmalloc;%(AdditionalIncludeDirectories)</AdditionalIncludeDirectories>
+    </ClCompile>
+    <Link>
+      <SubSystem>Console</SubSystem>
+      <GenerateDebugInformation>true</GenerateDebugInformation>
+    </Link>
+  </ItemDefinitionGroup>
+  <ItemDefinitionGroup Condition="'$(Configuration)|$(Platform)'=='Release|Win32'">
+    <ClCompile>
+      <PrecompiledHeader>NotUsing</PrecompiledHeader>
+      <WarningLevel>Level3</WarningLevel>
+      <Optimization>Full</Optimization>
+      <FunctionLevelLinking>false</FunctionLevelLinking>
+      <IntrinsicFunctions>true</IntrinsicFunctions>
+      <PreprocessorDefinitions>WIN32;NDEBUG;_CONSOLE;%(PreprocessorDefinitions)</PreprocessorDefinitions>
+      <ConformanceMode>true</ConformanceMode>
+      <CompileAsManaged>false</CompileAsManaged>
+      <CompileAsWinRT>false</CompileAsWinRT>
+      <SDLCheck>
+      </SDLCheck>
+      <MultiProcessorCompilation>true</MultiProcessorCompilation>
+      <InlineFunctionExpansion>AnySuitable</InlineFunctionExpansion>
+      <FavorSizeOrSpeed>Size</FavorSizeOrSpeed>
+      <OmitFramePointers>true</OmitFramePointers>
+      <EnableFiberSafeOptimizations>true</EnableFiberSafeOptimizations>
+      <RuntimeLibrary>MultiThreaded</RuntimeLibrary>
+      <ExceptionHandling>false</ExceptionHandling>
+      <StringPooling>true</StringPooling>
+      <BufferSecurityCheck>false</BufferSecurityCheck>
+      <FloatingPointModel>Fast</FloatingPointModel>
+      <FloatingPointExceptions>false</FloatingPointExceptions>
+      <CreateHotpatchableImage>false</CreateHotpatchableImage>
+      <AdditionalIncludeDirectories>..\..\test;..\..\rpmalloc;%(AdditionalIncludeDirectories)</AdditionalIncludeDirectories>
+    </ClCompile>
+    <Link>
+      <SubSystem>Console</SubSystem>
+      <EnableCOMDATFolding>true</EnableCOMDATFolding>
+      <OptimizeReferences>true</OptimizeReferences>
+      <GenerateDebugInformation>true</GenerateDebugInformation>
+    </Link>
+  </ItemDefinitionGroup>
+  <Import Project="$(VCTargetsPath)\Microsoft.Cpp.targets" />
+  <ImportGroup Label="ExtensionTargets">
+  </ImportGroup>
+</Project>
\ No newline at end of file
diff --git a/build/ninja/toolchain.py b/build/ninja/toolchain.py
index 1d7c6f7..593349a 100644
--- a/build/ninja/toolchain.py
+++ b/build/ninja/toolchain.py
@@ -139,7 +139,7 @@
       elif localarch == 'i686':
         self.archs = ['x86']
       else:
-        self.archs = [localarch]
+        self.archs = [str(localarch)]
     elif self.target.is_macos():
       self.archs = ['x86-64']
     elif self.target.is_ios():
@@ -159,7 +159,7 @@
       self.initialize_default_configs()
 
   def initialize_default_configs(self):
-    self.configs = ['debug', 'release'] #, 'profile', 'deploy']
+    self.configs = ['debug', 'release']
 
   def initialize_toolchain(self):
     if self.android != None:
diff --git a/configure.py b/configure.py
index f307b4a..f690770 100755
--- a/configure.py
+++ b/configure.py
@@ -15,15 +15,16 @@
 toolchain = generator.toolchain
 
 rpmalloc_lib = generator.lib(module = 'rpmalloc', libname = 'rpmalloc', sources = ['rpmalloc.c'])
-rpmallocguards_lib = generator.lib(module = 'rpmalloc', libname = 'rpmallocguards', sources = ['rpmalloc.c'], variables = {'defines': ['ENABLE_GUARDS=1']})
+rpmallocguard_lib = generator.lib(module = 'rpmalloc', libname = 'rpmallocguard', sources = ['rpmalloc.c'], variables = {'defines': ['ENABLE_ASSERTS=1', 'ENABLE_STATISTICS=1', 'ENABLE_GUARDS=1']})
 
-if not target.is_android():
+if not target.is_android() and not target.is_ios():
 	rpmallocwrap_lib = generator.lib(module = 'rpmalloc', libname = 'rpmallocwrap', sources = ['rpmalloc.c', 'malloc.c', 'new.cc'], variables = {'defines': ['ENABLE_PRELOAD=1']})
 
-if not target.is_windows() and not target.is_android() and not target.is_ios():
+if not target.is_android() and not target.is_ios():
 	rpmalloc_so = generator.sharedlib(module = 'rpmalloc', libname = 'rpmalloc', sources = ['rpmalloc.c'])
+
+if not target.is_windows() and not target.is_android() and not target.is_ios():
 	rpmallocwrap_so = generator.sharedlib(module = 'rpmalloc', libname = 'rpmallocwrap', sources = ['rpmalloc.c', 'malloc.c', 'new.cc'], variables = {'runtime': 'c++', 'defines': ['ENABLE_PRELOAD=1']})
 
 if not target.is_ios() and not target.is_android():
-	generator.bin(module = 'test', sources = ['thread.c', 'main.c'], binname = 'rpmalloc-test', implicit_deps = [rpmalloc_lib], libs = ['rpmalloc'], includepaths = ['rpmalloc', 'test'])
-	generator.bin(module = 'test', sources = ['thread.c', 'main.c'], binname = 'rpmalloc-test-guards', implicit_deps = [rpmallocguards_lib], libs = ['rpmallocguards'], includepaths = ['rpmalloc', 'test'], variables = {'defines': ['ENABLE_GUARDS=1']})
+	generator.bin(module = 'test', sources = ['thread.c', 'main.c'], binname = 'rpmalloc-test', implicit_deps = [rpmallocguard_lib], libs = ['rpmallocguard'], includepaths = ['rpmalloc', 'test'], variables = {'defines': ['ENABLE_GUARDS=1']})
diff --git a/rpmalloc/rpmalloc.c b/rpmalloc/rpmalloc.c
index 448b6b0..9fcf1de 100644
--- a/rpmalloc/rpmalloc.c
+++ b/rpmalloc/rpmalloc.c
@@ -11,71 +11,77 @@
 
 #include "rpmalloc.h"
 
-// Build time configurable limits
-
-// Presets, if none is defined it will default to performance priority
-//#define ENABLE_UNLIMITED_CACHE
-//#define DISABLE_CACHE
-//#define ENABLE_SPACE_PRIORITY_CACHE
-
-// Presets for cache limits
-#if defined(ENABLE_UNLIMITED_CACHE)
-// Unlimited caches
-#define MIN_SPAN_CACHE_RELEASE 16
-#define MAX_SPAN_CACHE_DIVISOR 1
-#elif defined(DISABLE_CACHE)
-//Disable cache
-#define MIN_SPAN_CACHE_RELEASE 1
-#define MAX_SPAN_CACHE_DIVISOR 0
-#elif defined(ENABLE_SPACE_PRIORITY_CACHE)
-// Space priority cache limits
-#define MIN_SPAN_CACHE_SIZE 8
-#define MIN_SPAN_CACHE_RELEASE 8
-#define MAX_SPAN_CACHE_DIVISOR 16
-#define GLOBAL_SPAN_CACHE_MULTIPLIER 1
-#else
-// Default - performance priority cache limits
-//! Limit of thread cache in number of spans for each page count class (undefine for unlimited cache - i.e never release spans to global cache unless thread finishes)
-//! Minimum cache size to remain after a release to global cache
-#define MIN_SPAN_CACHE_SIZE 8
-//! Minimum number of spans to transfer between thread and global cache
-#define MIN_SPAN_CACHE_RELEASE 16
-//! Maximum cache size divisor (max cache size will be max allocation count divided by this divisor)
-#define MAX_SPAN_CACHE_DIVISOR 8
-//! Multiplier for global span cache limit (max cache size will be calculated like thread cache and multiplied with this)
-#define GLOBAL_SPAN_CACHE_MULTIPLIER 4
-#endif
-
+/// Build time configurable limits
+#ifndef HEAP_ARRAY_SIZE
 //! Size of heap hashmap
 #define HEAP_ARRAY_SIZE           79
-
+#endif
+#ifndef ENABLE_THREAD_CACHE
+//! Enable per-thread cache
+#define ENABLE_THREAD_CACHE       1
+#endif
+#ifndef ENABLE_GLOBAL_CACHE
+//! Enable global cache shared between all threads, requires thread cache
+#define ENABLE_GLOBAL_CACHE       1
+#endif
 #ifndef ENABLE_VALIDATE_ARGS
 //! Enable validation of args to public entry points
 #define ENABLE_VALIDATE_ARGS      0
 #endif
-
 #ifndef ENABLE_STATISTICS
 //! Enable statistics collection
 #define ENABLE_STATISTICS         0
 #endif
-
 #ifndef ENABLE_ASSERTS
 //! Enable asserts
 #define ENABLE_ASSERTS            0
 #endif
-
 #ifndef ENABLE_PRELOAD
 //! Support preloading
 #define ENABLE_PRELOAD            0
 #endif
-
 #ifndef ENABLE_GUARDS
 //! Enable overwrite/underwrite guards
 #define ENABLE_GUARDS             0
 #endif
+#ifndef ENABLE_UNLIMITED_CACHE
+//! Unlimited cache disables any cache limitations
+#define ENABLE_UNLIMITED_CACHE    0
+#endif
+#ifndef DEFAULT_SPAN_MAP_COUNT
+//! Default number of spans to map in a call to map more virtual memory
+#define DEFAULT_SPAN_MAP_COUNT    16
+#endif
+//! Minimum cache size to remain after a release to global cache
+#define MIN_SPAN_CACHE_SIZE 64
+//! Minimum number of spans to transfer between thread and global cache
+#define MIN_SPAN_CACHE_RELEASE 16
+//! Maximum cache size divisor (max cache size will be max allocation count divided by this divisor)
+#define MAX_SPAN_CACHE_DIVISOR 4
+//! Minimum cache size to remain after a release to global cache, large spans
+#define MIN_LARGE_SPAN_CACHE_SIZE 8
+//! Minimum number of spans to transfer between thread and global cache, large spans
+#define MIN_LARGE_SPAN_CACHE_RELEASE 4
+//! Maximum cache size divisor, large spans (max cache size will be max allocation count divided by this divisor)
+#define MAX_LARGE_SPAN_CACHE_DIVISOR 16
+//! Multiplier for global span cache limit (max cache size will be calculated like thread cache and multiplied with this)
+#define MAX_GLOBAL_CACHE_MULTIPLIER 8
 
-// Platform and arch specifics
+#if !ENABLE_THREAD_CACHE
+#  undef ENABLE_GLOBAL_CACHE
+#  define ENABLE_GLOBAL_CACHE 0
+#  undef MIN_SPAN_CACHE_SIZE
+#  undef MIN_SPAN_CACHE_RELEASE
+#  undef MAX_SPAN_CACHE_DIVISOR
+#  undef MIN_LARGE_SPAN_CACHE_SIZE
+#  undef MIN_LARGE_SPAN_CACHE_RELEASE
+#  undef MAX_LARGE_SPAN_CACHE_DIVISOR
+#endif
+#if !ENABLE_GLOBAL_CACHE
+#  undef MAX_GLOBAL_CACHE_MULTIPLIER
+#endif
 
+/// Platform and arch specifics
 #ifdef _MSC_VER
 #  define ALIGNED_STRUCT(name, alignment) __declspec(align(alignment)) struct name
 #  define FORCEINLINE __forceinline
@@ -109,12 +115,20 @@
 
 #if defined( _WIN32 ) || defined( __WIN32__ ) || defined( _WIN64 )
 #  define PLATFORM_WINDOWS 1
+#  define PLATFORM_POSIX 0
+#else
+#  define PLATFORM_WINDOWS 0
+#  define PLATFORM_POSIX 1
 #endif
 
 #include <stdint.h>
 #include <string.h>
 
 #if ENABLE_ASSERTS
+#  undef NDEBUG
+#  if defined(_MSC_VER) && !defined(_DEBUG)
+#    define _DEBUG
+#  endif
 #  include <assert.h>
 #else
 #  undef  assert
@@ -125,19 +139,19 @@
 #  define MAGIC_GUARD 0xDEADBAAD
 #endif
 
-// Atomic access abstraction
+/// Atomic access abstraction
 ALIGNED_STRUCT(atomic32_t, 4) {
-	int32_t nonatomic;
+	volatile int32_t nonatomic;
 };
 typedef struct atomic32_t atomic32_t;
 
 ALIGNED_STRUCT(atomic64_t, 8) {
-	int64_t nonatomic;
+	volatile int64_t nonatomic;
 };
 typedef struct atomic64_t atomic64_t;
 
 ALIGNED_STRUCT(atomicptr_t, 8) {
-	void* nonatomic;
+	volatile void* nonatomic;
 };
 typedef struct atomicptr_t atomicptr_t;
 
@@ -173,7 +187,7 @@
 
 static FORCEINLINE void*
 atomic_load_ptr(atomicptr_t* src) {
-	return src->nonatomic;
+	return (void*)((uintptr_t)src->nonatomic);
 }
 
 static FORCEINLINE void
@@ -189,73 +203,42 @@
 	                                      (long long)val, (long long)ref) == (long long)ref) ? 1 : 0;
 #  else
 	return (_InterlockedCompareExchange((volatile long*)&dst->nonatomic,
-	                                      (long)val, (long)ref) == (long)ref) ? 1 : 0;
+	                                    (long)val, (long)ref) == (long)ref) ? 1 : 0;
 #  endif
 #else
 	return __sync_bool_compare_and_swap(&dst->nonatomic, ref, val);
 #endif
 }
 
-static void
-thread_yield(void);
-
-// Preconfigured limits and sizes
-
-//! Memory page size
-static size_t _memory_page_size;
-//! Shift to divide by page size
-static size_t _memory_page_size_shift;
-//! Maximum number of pages in a span (span max size divided by page size)
-static size_t _memory_max_page_count;
-
-//! Granularity of all memory page spans for small & medium block allocations
-#define SPAN_ADDRESS_GRANULARITY  65536
-
-//! Maximum size of a span of memory pages
-#define SPAN_MAX_SIZE             (SPAN_ADDRESS_GRANULARITY)
-//! Mask for getting the start of a span of memory pages
-#define SPAN_MASK                 (~((uintptr_t)SPAN_MAX_SIZE - 1))
-//! Maximum number of memory pages in a span
-#define SPAN_MAX_PAGE_COUNT       (SPAN_MAX_SIZE >> _memory_page_size_shift)
-//! Number of size classes for spans
-#define SPAN_CLASS_COUNT          4
-//! Span size class granularity
-#define SPAN_CLASS_GRANULARITY    ((SPAN_ADDRESS_GRANULARITY >> _memory_page_size_shift) / SPAN_CLASS_COUNT)
-
+/// Preconfigured limits and sizes
 //! Granularity of a small allocation block
-#define SMALL_GRANULARITY         16
+#define SMALL_GRANULARITY         32
 //! Small granularity shift count
-#define SMALL_GRANULARITY_SHIFT   4
-//! Maximum size of a small block
-#define SMALL_SIZE_LIMIT          2032
+#define SMALL_GRANULARITY_SHIFT   5
 //! Number of small block size classes
-#define SMALL_CLASS_COUNT         (SMALL_SIZE_LIMIT / SMALL_GRANULARITY)
-
+#define SMALL_CLASS_COUNT         63
+//! Maximum size of a small block
+#define SMALL_SIZE_LIMIT          2016
 //! Granularity of a medium allocation block
 #define MEDIUM_GRANULARITY        512
 //! Medium granularity shift count
 #define MEDIUM_GRANULARITY_SHIFT  9
 //! Number of medium block size classes
 #define MEDIUM_CLASS_COUNT        60
-//! Maximum size of a medium block
-#define MEDIUM_SIZE_LIMIT         (SMALL_SIZE_LIMIT + (MEDIUM_GRANULARITY * MEDIUM_CLASS_COUNT) - SPAN_HEADER_SIZE)
-
 //! Total number of small + medium size classes
 #define SIZE_CLASS_COUNT          (SMALL_CLASS_COUNT + MEDIUM_CLASS_COUNT)
-
 //! Number of large block size classes
 #define LARGE_CLASS_COUNT         32
+//! Maximum size of a medium block
+#define MEDIUM_SIZE_LIMIT         (SMALL_SIZE_LIMIT + (MEDIUM_GRANULARITY * MEDIUM_CLASS_COUNT) - SPAN_HEADER_SIZE)
 //! Maximum size of a large block
-#define LARGE_SIZE_LIMIT          ((LARGE_CLASS_COUNT * SPAN_MAX_SIZE) - SPAN_HEADER_SIZE)
-
-#define SPAN_LIST_LOCK_TOKEN      ((void*)1)
+#define LARGE_SIZE_LIMIT          ((LARGE_CLASS_COUNT * _memory_span_size) - SPAN_HEADER_SIZE)
+//! Size of a span header
+#define SPAN_HEADER_SIZE          32
 
 #define pointer_offset(ptr, ofs) (void*)((char*)(ptr) + (ptrdiff_t)(ofs))
 #define pointer_diff(first, second) (ptrdiff_t)((const char*)(first) - (const char*)(second))
 
-//! Size of a span header
-#define SPAN_HEADER_SIZE          32
-
 #if ARCH_64BIT
 typedef int64_t offset_t;
 #else
@@ -266,29 +249,34 @@
 #if ENABLE_VALIDATE_ARGS
 //! Maximum allocation size to avoid integer overflow
 #undef  MAX_ALLOC_SIZE
-#define MAX_ALLOC_SIZE            (((size_t)-1) - SPAN_ADDRESS_GRANULARITY)
+#define MAX_ALLOC_SIZE            (((size_t)-1) - _memory_span_size)
 #endif
 
-// Data types
-
+/// Data types
 //! A memory heap, per thread
 typedef struct heap_t heap_t;
 //! Span of memory pages
 typedef struct span_t span_t;
 //! Size class definition
 typedef struct size_class_t size_class_t;
-//! Span block bookkeeping 
+//! Span block bookkeeping
 typedef struct span_block_t span_block_t;
-//! Span list bookkeeping 
+//! Span list bookkeeping
 typedef struct span_list_t span_list_t;
 //! Span data union, usage depending on span state
 typedef union span_data_t span_data_t;
 //! Cache data
 typedef struct span_counter_t span_counter_t;
+//! Global cache
+typedef struct global_cache_t global_cache_t;
 
-//Alignment offset must match in both structures
-//to keep the data when transitioning between being
-//used for blocks and being part of a list
+//! Flag indicating span is the first (master) span of a split superspan
+#define SPAN_FLAG_MASTER 1
+//! Flag indicating span is a secondary (sub) span of a split superspan
+#define SPAN_FLAG_SUBSPAN 2
+
+//Alignment offset must match in both structures to keep the data when
+//transitioning between being used for blocks and being part of a list
 struct span_block_t {
 	//! Free list
 	uint16_t    free_list;
@@ -316,11 +304,24 @@
 	span_list_t list;
 };
 
+//A span can either represent a single span of memory pages with size declared by the span_map_count configuration variable,
+//or a set of spans in a contiguous region, a super span. Any reference to the term "span" usually refers to both a single
+//span or a super span. A super span can further be divided into multiple spans (or, again, super spans), where the first
+//(super)span is the master and subsequent (super)spans are subspans. The master span keeps track of how many subspans
+//are still alive and mapped in virtual memory, and once all subspans and the master have been unmapped the entire
+//superspan region is released and unmapped (on Windows, for example, the entire superspan range has to be released
+//in the same call that releases the virtual memory range, but individual subranges can be decommitted individually
+//to reduce physical memory use).
 struct span_t {
 	//!	Heap ID
 	atomic32_t  heap_id;
 	//! Size class
-	count_t     size_class;
+	uint16_t    size_class;
+	// TODO: If we could store remainder part of flags as an atomic counter, the entire check
+	//       if master is owned by calling heap could be simplified to an atomic dec from any thread
+	//       since remainder of a split super span only ever decreases, never increases
+	//! Flags and counters
+	uint16_t    flags;
 	//! Span data
 	span_data_t data;
 	//! Next span
@@ -330,6 +331,7 @@
 };
 _Static_assert(sizeof(span_t) <= SPAN_HEADER_SIZE, "span size mismatch");
 
+//Adaptive cache counter of a single superspan span count
 struct span_counter_t {
 	//! Allocation high water mark
 	uint32_t  max_allocations;
@@ -342,22 +344,28 @@
 struct heap_t {
 	//! Heap ID
 	int32_t      id;
-	//! Deferred deallocation
-	atomicptr_t  defer_deallocate;
 	//! Free count for each size class active span
 	span_block_t active_block[SIZE_CLASS_COUNT];
 	//! Active span for each size class
 	span_t*      active_span[SIZE_CLASS_COUNT];
-	//! List of demi-used spans with free blocks for each size class (double linked list)
+	//! List of semi-used spans with free blocks for each size class (double linked list)
 	span_t*      size_cache[SIZE_CLASS_COUNT];
-	//! List of free spans for each page count (single linked list)
-	span_t*      span_cache[SPAN_CLASS_COUNT];
+#if ENABLE_THREAD_CACHE
+	//! List of free spans (single linked list)
+	span_t*      span_cache[LARGE_CLASS_COUNT];
 	//! Allocation counters
-	span_counter_t span_counter[SPAN_CLASS_COUNT];
-	//! List of free spans for each large class count (single linked list)
-	span_t*      large_cache[LARGE_CLASS_COUNT];
-	//! Allocation counters for large blocks
-	span_counter_t large_counter[LARGE_CLASS_COUNT];
+	span_counter_t span_counter[LARGE_CLASS_COUNT];
+#endif
+	//! Mapped but unused spans
+	span_t*      span_reserve;
+	//! Master span for mapped but unused spans
+	span_t*      span_reserve_master;
+	//! Number of mapped but unused spans
+	size_t       spans_reserved;
+	//! Deferred deallocation
+	atomicptr_t  defer_deallocate;
+	//! Deferred unmaps
+	atomicptr_t  defer_unmap;
 	//! Next heap in id list
 	heap_t*      next_heap;
 	//! Next heap in orphan list
@@ -365,10 +373,6 @@
 	//! Memory pages alignment offset
 	size_t       align_offset;
 #if ENABLE_STATISTICS
-	//! Number of bytes currently requested in allocations
-	size_t       requested;
-	//! Number of bytes current allocated
-	size_t       allocated;
 	//! Number of bytes transitioned thread -> global
 	size_t       thread_to_global;
 	//! Number of bytes transitioned global -> thread
@@ -378,9 +382,7 @@
 
 struct size_class_t {
 	//! Size of blocks in this class
-	uint16_t size;
-	//! Number of pages to allocate for a chunk
-	uint16_t page_count;
+	uint32_t size;
 	//! Number of blocks in each chunk
 	uint16_t block_count;
 	//! Class index this class is merged with
@@ -388,48 +390,67 @@
 };
 _Static_assert(sizeof(size_class_t) == 8, "Size class size mismatch");
 
+struct global_cache_t {
+	//! Cache list pointer
+	atomicptr_t cache;
+	//! Cache size
+	atomic32_t size;
+	//! ABA counter
+	atomic32_t counter;
+};
+
+/// Global data
 //! Configuration
 static rpmalloc_config_t _memory_config;
-
+//! Memory page size
+static size_t _memory_page_size;
+//! Shift to divide by page size
+static size_t _memory_page_size_shift;
+//! Mask to get to start of a memory page
+static size_t _memory_page_mask;
+//! Granularity at which memory pages are mapped by OS
+static size_t _memory_map_granularity;
+//! Size of a span of memory pages
+static size_t _memory_span_size;
+//! Shift to divide by span size
+static size_t _memory_span_size_shift;
+//! Mask to get to start of a memory span
+static uintptr_t _memory_span_mask;
 //! Global size classes
 static size_class_t _memory_size_class[SIZE_CLASS_COUNT];
-
+//! Run-time size limit of medium blocks
+static size_t _memory_medium_size_limit;
 //! Heap ID counter
 static atomic32_t _memory_heap_id;
-
+#if ENABLE_THREAD_CACHE
+//! Adaptive cache max allocation count
+static uint32_t _memory_max_allocation[LARGE_CLASS_COUNT];
+#endif
+#if ENABLE_GLOBAL_CACHE
 //! Global span cache
-static atomicptr_t _memory_span_cache[SPAN_CLASS_COUNT];
-
-//! Global large cache
-static atomicptr_t _memory_large_cache[LARGE_CLASS_COUNT];
-
+static global_cache_t _memory_span_cache[LARGE_CLASS_COUNT];
+#endif
 //! All heaps
 static atomicptr_t _memory_heaps[HEAP_ARRAY_SIZE];
-
 //! Orphaned heaps
 static atomicptr_t _memory_orphan_heaps;
-
 //! Running orphan counter to avoid ABA issues in linked list
 static atomic32_t _memory_orphan_counter;
-
 //! Active heap count
 static atomic32_t _memory_active_heaps;
-
-//! Adaptive cache max allocation count
-static uint32_t _memory_max_allocation[SPAN_CLASS_COUNT];
-
-//! Adaptive cache max allocation count
-static uint32_t _memory_max_allocation_large[LARGE_CLASS_COUNT];
-
 #if ENABLE_STATISTICS
-//! Total number of mapped memory pages
+//! Total number of currently mapped memory pages
 static atomic32_t _mapped_pages;
+//! Total number of currently reserved spans
+static atomic32_t _reserved_spans;
 //! Running counter of total number of mapped memory pages since start
 static atomic32_t _mapped_total;
 //! Running counter of total number of unmapped memory pages since start
 static atomic32_t _unmapped_total;
 #endif
 
+#define MEMORY_UNUSED(x) (void)sizeof((x))
+
 //! Current thread heap
 #if defined(__APPLE__) && ENABLE_PRELOAD
 static pthread_key_t _memory_thread_heap;
@@ -446,6 +467,7 @@
 static _Thread_local heap_t* _memory_thread_heap TLS_MODEL;
 #endif
 
+//! Get the current thread heap
 static FORCEINLINE heap_t*
 get_thread_heap(void) {
 #if defined(__APPLE__) && ENABLE_PRELOAD
@@ -455,6 +477,7 @@
 #endif
 }
 
+//! Set the current thread heap
 static void
 set_thread_heap(heap_t* heap) {
 #if defined(__APPLE__) && ENABLE_PRELOAD
@@ -464,12 +487,15 @@
 #endif
 }
 
+//! Default implementation to map more virtual memory
 static void*
-_memory_map_os(size_t page_count);
+_memory_map_os(size_t size, size_t* offset);
 
+//! Default implementation to unmap virtual memory
 static void
-_memory_unmap_os(void* ptr, size_t page_count);
+_memory_unmap_os(void* address, size_t size, size_t offset, int release);
 
+//! Deallocate any deferred blocks and check for the given size class
 static int
 _memory_deallocate_deferred(heap_t* heap, size_t size_class);
 
@@ -483,255 +509,541 @@
 	return heap;
 }
 
-//! Get the span size class from page count
-static size_t
-_span_class_from_page_count(size_t page_count) {
-	assert((page_count > 0) && (page_count <= _memory_max_page_count));
-	return ((page_count + SPAN_CLASS_GRANULARITY - 1) / SPAN_CLASS_GRANULARITY) - 1;
-}
+#if ENABLE_THREAD_CACHE
 
 //! Increase an allocation counter
 static void
-_memory_counter_increase(span_counter_t* counter, uint32_t* global_counter) {
+_memory_counter_increase(span_counter_t* counter, uint32_t* global_counter, size_t span_count) {
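+	//A sketch of the adaptive limit: cache_limit follows the allocation high water mark divided
+	//by the configured cache divisor, clamped to [min release + min size, span size - 2] spans
+	//(or simply set to the upper bound when unlimited caches are enabled)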
 	if (++counter->current_allocations > counter->max_allocations) {
 		counter->max_allocations = counter->current_allocations;
-#if MAX_SPAN_CACHE_DIVISOR > 0
-		counter->cache_limit = counter->max_allocations / MAX_SPAN_CACHE_DIVISOR;
+		const uint32_t cache_limit_max = (uint32_t)_memory_span_size - 2;
+#if !ENABLE_UNLIMITED_CACHE
+		counter->cache_limit = counter->max_allocations / ((span_count == 1) ? MAX_SPAN_CACHE_DIVISOR : MAX_LARGE_SPAN_CACHE_DIVISOR);
+		const uint32_t cache_limit_min = (span_count == 1) ? (MIN_SPAN_CACHE_RELEASE + MIN_SPAN_CACHE_SIZE) : (MIN_LARGE_SPAN_CACHE_RELEASE + MIN_LARGE_SPAN_CACHE_SIZE);
+		if (counter->cache_limit < cache_limit_min)
+			counter->cache_limit = cache_limit_min;
+		if (counter->cache_limit > cache_limit_max)
+			counter->cache_limit = cache_limit_max;
+#else
+		counter->cache_limit = cache_limit_max;
 #endif
 		if (counter->max_allocations > *global_counter)
 			*global_counter = counter->max_allocations;
 	}
 }
 
-static void*
-_memory_map(size_t page_count, size_t* align_offset) {
-	void* mapped_address;
-	void* aligned_address;
-	size_t size = page_count * _memory_page_size;
-
-	mapped_address = _memory_config.memory_map(size);
-
-	if (!((uintptr_t)mapped_address & ~(uintptr_t)SPAN_MASK)) {
-		aligned_address = mapped_address;
-		*align_offset = 0;
-	}
-	else {
-		//Retry with space for alignment
-		_memory_config.memory_unmap(mapped_address, size);
-
-		size_t padding = SPAN_ADDRESS_GRANULARITY;
-		mapped_address = _memory_config.memory_map(size + padding);
-		padding -= (uintptr_t)mapped_address % SPAN_ADDRESS_GRANULARITY;
-		aligned_address = pointer_offset(mapped_address, padding);
-		//Offset could be 0x10000 (64KiB) if mapped pages are aligned, divide by 2 to fit in uint16_t
-		assert(padding <= SPAN_ADDRESS_GRANULARITY);
-		assert(!((uintptr_t)mapped_address & ~(uintptr_t)SPAN_MASK));
-		*align_offset = (size_t)padding / 2;
-	}
+#else
+#  define _memory_counter_increase(counter, global_counter, span_count) do {} while (0)
+#endif
 
 #if ENABLE_STATISTICS
-	atomic_add32(&_mapped_pages, (int32_t)(size >> _memory_page_size_shift));
-	atomic_add32(&_mapped_total, (int32_t)(size >> _memory_page_size_shift));
+#  define _memory_statistics_add(atomic_counter, value) atomic_add32(atomic_counter, (int32_t)(value))
+#  define _memory_statistics_sub(atomic_counter, value) atomic_add32(atomic_counter, -(int32_t)(value))
+#else
+#  define _memory_statistics_add(atomic_counter, value) do {} while(0)
+#  define _memory_statistics_sub(atomic_counter, value) do {} while(0)
 #endif
- 
- 	return aligned_address;
+
+//! Map more virtual memory
+static void*
+_memory_map(size_t size, size_t* offset) {
+	assert(!(size % _memory_page_size));
+	_memory_statistics_add(&_mapped_pages, (size >> _memory_page_size_shift));
+	_memory_statistics_add(&_mapped_total, (size >> _memory_page_size_shift));
+	return _memory_config.memory_map(size, offset);
 }
 
+//! Unmap virtual memory
 static void
-_memory_unmap(void* address, size_t page_count, size_t align_offset) {
-	size_t size = page_count * _memory_page_size;
-	void* mapped_address = pointer_offset(address, -(offset_t)(align_offset * 2));
-	if (align_offset)
-		size += SPAN_ADDRESS_GRANULARITY;
-	_memory_config.memory_unmap(mapped_address, size);
+_memory_unmap(void* address, size_t size, size_t offset, int release) {
+	assert((size < _memory_span_size) || !((uintptr_t)address & ~_memory_span_mask));
+	assert(!(size % _memory_page_size));
+	_memory_statistics_sub(&_mapped_pages, (size >> _memory_page_size_shift));
+	_memory_statistics_add(&_unmapped_total, (size >> _memory_page_size_shift));
+	_memory_config.memory_unmap(address, size, offset, release);
 }
 
-//! Insert the given list of memory page spans in the global cache for small/medium blocks
+//! Make flags field in a span from flags, remainder/distance and count
+#define SPAN_MAKE_FLAGS(flags, remdist, count) ((uint16_t)((flags) | ((uint16_t)((remdist) - 1) << 2) | ((uint16_t)((count) - 1) << 9))); assert((flags) < 4); assert((remdist) && (remdist) < 128); assert((count) && (count) < 128)
+//! Check if span has any of the given flags
+#define SPAN_HAS_FLAG(flags, flag) ((flags) & (flag))
+//! Get the distance from flags field
+#define SPAN_DISTANCE(flags) (1 + (((flags) >> 2) & 0x7f))
+//! Get the remainder from flags field
+#define SPAN_REMAINS(flags) (1 + (((flags) >> 2) & 0x7f))
+//! Get the count from flags field
+#define SPAN_COUNT(flags) (1 + (((flags) >> 9) & 0x7f))
+//! Set the remainder in the flags field (MUST be done from the owner heap thread)
+#define SPAN_SET_REMAINS(flags, remains) flags = ((uint16_t)(((flags) & 0xfe03) | ((uint16_t)((remains) - 1) << 2))); assert((remains) < 128)
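+//
+//A sketch of the 16-bit flags field layout implied by the macros above:
+//  bits 0-1   span type flags (SPAN_FLAG_MASTER / SPAN_FLAG_SUBSPAN)
+//  bits 2-8   (remainder - 1) for master spans, (distance to master - 1) for subspans
+//  bits 9-15  (span count - 1)
+//SPAN_REMAINS and SPAN_DISTANCE read the same 7-bit field; which meaning applies depends on
+//whether the span is flagged as a master or as a subspan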
+
+//! Resize the given super span to the given count of spans, store the remainder in the heap reserved spans fields
 static void
-_memory_global_cache_insert(span_t* first_span, size_t list_size, size_t page_count) {
-	assert((list_size == 1) || (first_span->next_span != 0));
-#if MAX_SPAN_CACHE_DIVISOR > 0
-	while (1) {
-		size_t span_class_idx = _span_class_from_page_count(page_count);
-		void* global_span_ptr = atomic_load_ptr(&_memory_span_cache[span_class_idx]);
-		if (global_span_ptr != SPAN_LIST_LOCK_TOKEN) {
-			uintptr_t global_list_size = (uintptr_t)global_span_ptr & ~SPAN_MASK;
-			span_t* global_span = (span_t*)((void*)((uintptr_t)global_span_ptr & SPAN_MASK));
+_memory_set_span_remainder_as_reserved(heap_t* heap, span_t* span, size_t use_count) {
+	size_t current_count = SPAN_COUNT(span->flags);
 
-#ifdef GLOBAL_SPAN_CACHE_MULTIPLIER
-			size_t cache_limit = GLOBAL_SPAN_CACHE_MULTIPLIER * (_memory_max_allocation[span_class_idx] / MAX_SPAN_CACHE_DIVISOR);
-			if ((global_list_size >= cache_limit) && (global_list_size > MIN_SPAN_CACHE_SIZE))
-				break;
-#endif
-			//We only have 16 bits for size of list, avoid overflow
-			if ((global_list_size + list_size) > 0xFFFF)
-				break;
+	assert(!SPAN_HAS_FLAG(span->flags, SPAN_FLAG_MASTER) || !SPAN_HAS_FLAG(span->flags, SPAN_FLAG_SUBSPAN));
+	assert((current_count > 1) && (current_count < 127));
+	assert(!heap->spans_reserved);
+	assert(SPAN_COUNT(span->flags) == current_count);
+	assert(current_count > use_count);
 
-			//Use prev_span as skip pointer over this sublist range of spans
-			first_span->data.list.size = (uint32_t)list_size;
-			first_span->prev_span = global_span;
-
-			//Insert sublist into global cache
-			global_list_size += list_size;
-			void* first_span_ptr = (void*)((uintptr_t)first_span | global_list_size);
-			if (atomic_cas_ptr(&_memory_span_cache[span_class_idx], first_span_ptr, global_span_ptr))
-				return;
-		}
-		else {
-			//Atomic operation failed, yield timeslice and retry
-			thread_yield();
-			atomic_thread_fence_acquire();
-		}
+	heap->span_reserve = pointer_offset(span, use_count * _memory_span_size);
+	heap->spans_reserved = current_count - use_count;
+	if (!SPAN_HAS_FLAG(span->flags, SPAN_FLAG_MASTER | SPAN_FLAG_SUBSPAN)) {
+		//We must store the heap id before setting as master, to force unmaps to defer to this heap thread
+		atomic_store32(&span->heap_id, heap->id);
+		atomic_thread_fence_release();
+		heap->span_reserve_master = span;
+		span->flags = SPAN_MAKE_FLAGS(SPAN_FLAG_MASTER, current_count, use_count);
+		_memory_statistics_add(&_reserved_spans, current_count);
 	}
-#endif
-	//Global cache full, release pages
-	for (size_t ispan = 0; ispan < list_size; ++ispan) {
-		assert(first_span);
-		span_t* next_span = first_span->next_span;
-		_memory_unmap(first_span, page_count, first_span->data.list.align_offset);
-		first_span = next_span;
+	else if (SPAN_HAS_FLAG(span->flags, SPAN_FLAG_MASTER)) {
+		//Only owner heap thread can modify a master span
+		assert(atomic_load32(&span->heap_id) == heap->id);
+		uint16_t remains = SPAN_REMAINS(span->flags);
+		assert(remains >= current_count);
+		heap->span_reserve_master = span;
+		span->flags = SPAN_MAKE_FLAGS(SPAN_FLAG_MASTER, remains, use_count);
 	}
+	else { //SPAN_FLAG_SUBSPAN
+		//Resizing a subspan is a safe operation in any thread
+		uint16_t distance = SPAN_DISTANCE(span->flags);
+		span_t* master = pointer_offset(span, -(int)distance * (int)_memory_span_size);
+		heap->span_reserve_master = master;
+		assert(SPAN_HAS_FLAG(master->flags, SPAN_FLAG_MASTER));
+		assert(SPAN_REMAINS(master->flags) >= current_count);
+		span->flags = SPAN_MAKE_FLAGS(SPAN_FLAG_SUBSPAN, distance, use_count);
+	}
+	assert((SPAN_COUNT(span->flags) + heap->spans_reserved) == current_count);
 }
 
-//! Extract a number of memory page spans from the global cache for small/medium blocks
+//! Map in memory pages for the given number of spans (or use previously reserved pages)
 static span_t*
-_memory_global_cache_extract(size_t page_count) {
-	span_t* span = 0;
-	size_t span_class_idx = _span_class_from_page_count(page_count);
-	atomicptr_t* cache = &_memory_span_cache[span_class_idx];
-	atomic_thread_fence_acquire();
-	void* global_span_ptr = atomic_load_ptr(cache);
-	while (global_span_ptr) {
-		if ((global_span_ptr != SPAN_LIST_LOCK_TOKEN) &&
-		        atomic_cas_ptr(cache, SPAN_LIST_LOCK_TOKEN, global_span_ptr)) {
-			//Grab a number of thread cache spans, using the skip span pointer
-			//stored in prev_span to quickly skip ahead in the list to get the new head
-			uintptr_t global_span_count = (uintptr_t)global_span_ptr & ~SPAN_MASK;
-			span = (span_t*)((void*)((uintptr_t)global_span_ptr & SPAN_MASK));
-			assert((span->data.list.size == 1) || (span->next_span != 0));
-
-			span_t* new_global_span = span->prev_span;
-			global_span_count -= span->data.list.size;
-
-			//Set new head of global cache list
-			void* new_cache_head = global_span_count ?
-			                       ((void*)((uintptr_t)new_global_span | global_span_count)) :
-			                       0;
-			atomic_store_ptr(cache, new_cache_head);
-			atomic_thread_fence_release();
-			break;
-		}
-
-		//List busy, yield timeslice and retry
-		thread_yield();
-		atomic_thread_fence_acquire();
-		global_span_ptr = atomic_load_ptr(cache);
+_memory_map_spans(heap_t* heap, size_t span_count) {
+	if (span_count <= heap->spans_reserved) {
+		span_t* span = heap->span_reserve;
+		heap->span_reserve = pointer_offset(span, span_count * _memory_span_size);
+		heap->spans_reserved -= span_count;
+		//Declare the span to be a subspan with given distance from master span
+		uint16_t distance = (uint16_t)((uintptr_t)pointer_diff(span, heap->span_reserve_master) >> _memory_span_size_shift);
+		span->flags = SPAN_MAKE_FLAGS(SPAN_FLAG_SUBSPAN, distance, span_count);
+		span->data.block.align_offset = 0;
+		return span;
 	}
 
+	//We cannot request extra spans if we already have some (but not enough) pending reserved spans
+	size_t request_spans = (heap->spans_reserved || (span_count > _memory_config.span_map_count)) ? span_count : _memory_config.span_map_count;
+	size_t align_offset = 0;
+	span_t* span = _memory_map(request_spans * _memory_span_size, &align_offset);
+	span->flags = SPAN_MAKE_FLAGS(0, request_spans, request_spans);
+	span->data.block.align_offset = (uint16_t)align_offset;
+	if (request_spans > span_count) {
+		//We have extra spans, store them as reserved spans in heap
+		_memory_set_span_remainder_as_reserved(heap, span, span_count);
+	}
 	return span;
 }
 
-/*! Insert the given list of memory page spans in the global cache for large blocks,
-    similar to _memory_global_cache_insert */
+//! Defer unmapping of the given span to the owner heap
+static int
+_memory_unmap_defer(int32_t heap_id, span_t* span) {
+	//Get the heap and link in pointer in list of deferred operations
+	heap_t* heap = _memory_heap_lookup(heap_id);
+	if (!heap)
+		return 0;
+	atomic_store32(&span->heap_id, heap_id);
+	void* last_ptr;
+	do {
+		last_ptr = atomic_load_ptr(&heap->defer_unmap);
+		span->next_span = last_ptr;
+	} while (!atomic_cas_ptr(&heap->defer_unmap, span, last_ptr));
+	return 1;
+}
+
+//! Unmap memory pages for the given number of spans (or mark as unused if no partial unmappings)
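+//  Subspans are unmapped directly, while a master span is kept mapped (flagged as both master
+//  and subspan) until the remainder count it carries reaches zero, at which point the entire
+//  super span address range is released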
 static void
-_memory_global_cache_large_insert(span_t* span_list, size_t list_size, size_t span_count) {
-	assert((list_size == 1) || (span_list->next_span != 0));
-	assert(span_list->size_class == (SIZE_CLASS_COUNT + (span_count - 1)));
-#if MAX_SPAN_CACHE_DIVISOR > 0
-	atomicptr_t* cache = &_memory_large_cache[span_count - 1];
-	while (1) {
-		void* global_span_ptr = atomic_load_ptr(cache);
-		if (global_span_ptr != SPAN_LIST_LOCK_TOKEN) {
-			uintptr_t global_list_size = (uintptr_t)global_span_ptr & ~SPAN_MASK;
-			span_t* global_span = (span_t*)((void*)((uintptr_t)global_span_ptr & SPAN_MASK));
-
-#ifdef GLOBAL_SPAN_CACHE_MULTIPLIER
-			size_t cache_limit = GLOBAL_SPAN_CACHE_MULTIPLIER * (_memory_max_allocation_large[span_count-1] / MAX_SPAN_CACHE_DIVISOR);
-			if ((global_list_size >= cache_limit) && (global_list_size > MIN_SPAN_CACHE_SIZE))
-				break;
-#endif
-			if ((global_list_size + list_size) > 0xFFFF)
-				break;
-
-			span_list->data.list.size = (uint32_t)list_size;
-			span_list->prev_span = global_span;
-
-			global_list_size += list_size;
-			void* new_global_span_ptr = (void*)((uintptr_t)span_list | global_list_size);
-			if (atomic_cas_ptr(cache, new_global_span_ptr, global_span_ptr))
-				return;
-		}
-		else {
-			thread_yield();
-			atomic_thread_fence_acquire();
-		}
+_memory_unmap_span(heap_t* heap, span_t* span) {
+	size_t span_count = SPAN_COUNT(span->flags);
+	assert(!SPAN_HAS_FLAG(span->flags, SPAN_FLAG_MASTER) || !SPAN_HAS_FLAG(span->flags, SPAN_FLAG_SUBSPAN));
+	//A plain run of spans can be unmapped directly
+	if (!SPAN_HAS_FLAG(span->flags, SPAN_FLAG_MASTER | SPAN_FLAG_SUBSPAN)) {
+		_memory_unmap(span, span_count * _memory_span_size, span->data.list.align_offset, 1);
+		return;
 	}
-#endif
-	//Global cache full, release spans
-	for (size_t ispan = 0; ispan < list_size; ++ispan) {
-		assert(span_list);
-		span_t* next_span = span_list->next_span;
-		_memory_unmap(span_list, span_count * SPAN_MAX_PAGE_COUNT, span_list->data.list.align_offset);
-		span_list = next_span;
+
+	uint32_t is_master = SPAN_HAS_FLAG(span->flags, SPAN_FLAG_MASTER);
+	span_t* master = is_master ? span : (pointer_offset(span, -(int)SPAN_DISTANCE(span->flags) * (int)_memory_span_size));
+
+	assert(is_master || SPAN_HAS_FLAG(span->flags, SPAN_FLAG_SUBSPAN));
+	assert(SPAN_HAS_FLAG(master->flags, SPAN_FLAG_MASTER));
+
+	//Check if we own the master span, otherwise defer (only owner of master span can modify remainder field)
+	int32_t master_heap_id = atomic_load32(&master->heap_id);
+	if (heap && (master_heap_id != heap->id)) {
+		if (_memory_unmap_defer(master_heap_id, span))
+			return;
+	}
+	if (!is_master) {
+		//Directly unmap subspans
+		assert(span->data.list.align_offset == 0);
+		_memory_unmap(span, span_count * _memory_span_size, 0, 0);
+		_memory_statistics_sub(&_reserved_spans, span_count);
+	}
+	else {
+		//Special double flag to denote an unmapped master
+		//It must be kept in memory since span header must be used
+		span->flags |= SPAN_FLAG_MASTER | SPAN_FLAG_SUBSPAN;
+	}
+	//We are in owner thread of the master span
+	uint32_t remains = SPAN_REMAINS(master->flags);
+	assert(remains >= span_count);
+	remains = ((uint32_t)span_count >= remains) ? 0 : (remains - (uint32_t)span_count);
+	if (!remains) {
+		//Everything unmapped, unmap the master span with release flag to unmap the entire range of the super span
+		assert(SPAN_HAS_FLAG(master->flags, SPAN_FLAG_MASTER) && SPAN_HAS_FLAG(master->flags, SPAN_FLAG_SUBSPAN));
+		span_count = SPAN_COUNT(master->flags);
+		_memory_unmap(master, span_count * _memory_span_size, master->data.list.align_offset, 1);
+		_memory_statistics_sub(&_reserved_spans, span_count);
+	}
+	else {
+		//Set remaining spans
+		SPAN_SET_REMAINS(master->flags, remains);
 	}
 }
 
-/*! Extract a number of memory page spans from the global cache for large blocks,
-    similar to _memory_global_cache_extract */
+//! Process pending deferred cross-thread unmaps, returning any span matching the wanted span count
 static span_t*
-_memory_global_cache_large_extract(size_t span_count) {
-	span_t* span = 0;
-	atomicptr_t* cache = &_memory_large_cache[span_count - 1];
+_memory_unmap_deferred(heap_t* heap, size_t wanted_count) {
+	//Grab the current list of deferred unmaps
 	atomic_thread_fence_acquire();
-	void* global_span_ptr = atomic_load_ptr(cache);
-	while (global_span_ptr) {
-		if ((global_span_ptr != SPAN_LIST_LOCK_TOKEN) &&
-			atomic_cas_ptr(cache, SPAN_LIST_LOCK_TOKEN, global_span_ptr)) {
-			uintptr_t global_list_size = (uintptr_t)global_span_ptr & ~SPAN_MASK;
-			span = (span_t*)((void*)((uintptr_t)global_span_ptr & SPAN_MASK));
-			assert((span->data.list.size == 1) || (span->next_span != 0));
-			assert(span->size_class == (SIZE_CLASS_COUNT + (span_count - 1)));
+	span_t* span = atomic_load_ptr(&heap->defer_unmap);
+	if (!span || !atomic_cas_ptr(&heap->defer_unmap, 0, span))
+		return 0;
+	span_t* found_span = 0;
+	do {
+		//Verify that we own the master span, otherwise re-defer to owner
+		void* next = span->next_span;
+		if (!found_span && SPAN_COUNT(span->flags) == wanted_count) {
+			assert(!SPAN_HAS_FLAG(span->flags, SPAN_FLAG_MASTER) || !SPAN_HAS_FLAG(span->flags, SPAN_FLAG_SUBSPAN));
+			found_span = span;
+		}
+		else {
+			uint32_t is_master = SPAN_HAS_FLAG(span->flags, SPAN_FLAG_MASTER);
+			span_t* master = is_master ? span : (pointer_offset(span, -(int)SPAN_DISTANCE(span->flags) * (int)_memory_span_size));
+			int32_t master_heap_id = atomic_load32(&master->heap_id);
+			if ((atomic_load32(&span->heap_id) == master_heap_id) ||
+			        !_memory_unmap_defer(master_heap_id, span)) {
+				//We own the master span (or heap merged and abandoned)
+				_memory_unmap_span(heap, span);
+			}
+		}
+		span = next;
+	} while (span);
+	return found_span;
+}
 
-			span_t* new_global_span = span->prev_span;
-			global_list_size -= span->data.list.size;
+//! Unmap a single linked list of spans
+static void
+_memory_unmap_span_list(heap_t* heap, span_t* span) {
+	size_t list_size = span->data.list.size;
+	for (size_t ispan = 0; ispan < list_size; ++ispan) {
+		span_t* next_span = span->next_span;
+		_memory_unmap_span(heap, span);
+		span = next_span;
+	}
+	assert(!span);
+}
 
-			void* new_global_span_ptr = global_list_size ?
-			                            ((void*)((uintptr_t)new_global_span | global_list_size)) :
-			                            0;
-			atomic_store_ptr(cache, new_global_span_ptr);
-			atomic_thread_fence_release();
+#if ENABLE_THREAD_CACHE
+
+//! Split a super span in two
+static span_t*
+_memory_span_split(heap_t* heap, span_t* span, size_t use_count) {
+	uint16_t distance = 0;
+	size_t current_count = SPAN_COUNT(span->flags);
+	assert(current_count > use_count);
+	assert(!SPAN_HAS_FLAG(span->flags, SPAN_FLAG_MASTER) || !SPAN_HAS_FLAG(span->flags, SPAN_FLAG_SUBSPAN));
+	if (!SPAN_HAS_FLAG(span->flags, SPAN_FLAG_MASTER | SPAN_FLAG_SUBSPAN)) {
+		//Must store heap in master span before use, to avoid issues when unmapping subspans
+		atomic_store32(&span->heap_id, heap->id);
+		atomic_thread_fence_release();
+		span->flags = SPAN_MAKE_FLAGS(SPAN_FLAG_MASTER, current_count, use_count);
+		_memory_statistics_add(&_reserved_spans, current_count);
+	}
+	else if (SPAN_HAS_FLAG(span->flags, SPAN_FLAG_MASTER)) {
+		//Only valid to call on master span if we own it
+		assert(atomic_load32(&span->heap_id) == heap->id);
+		uint16_t remains = SPAN_REMAINS(span->flags);
+		assert(remains >= current_count);
+		span->flags = SPAN_MAKE_FLAGS(SPAN_FLAG_MASTER, remains, use_count);
+	}
+	else { //SPAN_FLAG_SUBSPAN
+		distance = SPAN_DISTANCE(span->flags);
+		span->flags = SPAN_MAKE_FLAGS(SPAN_FLAG_SUBSPAN, distance, use_count);
+	}
+	//Setup remainder as a subspan
+	span_t* subspan = pointer_offset(span, use_count * _memory_span_size);
+	subspan->flags = SPAN_MAKE_FLAGS(SPAN_FLAG_SUBSPAN, distance + use_count, current_count - use_count);
+	subspan->data.list.align_offset = 0;
+	return subspan;
+}
+
+//! Add span to head of single linked span list, returning the new list size
+static size_t
+_memory_span_list_push(span_t** head, span_t* span) {
+	span->next_span = *head;
+	if (*head)
+		span->data.list.size = (*head)->data.list.size + 1;
+	else
+		span->data.list.size = 1;
+	*head = span;
+	return span->data.list.size;
+}
+
+//! Remove span from head of single linked span list, updating the head and returning the removed span
+static span_t*
+_memory_span_list_pop(span_t** head) {
+	span_t* span = *head;
+	span_t* next_span = 0;
+	if (span->data.list.size > 1) {
+		next_span = span->next_span;
+		assert(next_span);
+		next_span->data.list.size = span->data.list.size - 1;
+	}
+	*head = next_span;
+	return span;
+}
+
+//! Split a single linked span list, keeping the first limit spans in the given list and returning the remainder (or 0 if no split was needed)
+static span_t*
+_memory_span_list_split(span_t* span, size_t limit) {
+	span_t* next = 0;
+	if (limit < 2)
+		limit = 2;
+	if (span->data.list.size > limit) {
+		count_t list_size = 1;
+		span_t* last = span;
+		next = span->next_span;
+		while (list_size < limit) {
+			last = next;
+			next = next->next_span;
+			++list_size;
+		}
+		last->next_span = 0;
+		assert(next);
+		next->data.list.size = span->data.list.size - list_size;
+		span->data.list.size = list_size;
+		span->prev_span = 0;
+	}
+	return next;
+}
+
+#endif
+
+//! Add a span to a double linked list
+static void
+_memory_span_list_doublelink_add(span_t** head, span_t* span) {
+	if (*head) {
+		(*head)->prev_span = span;
+		span->next_span = *head;
+	}
+	else {
+		span->next_span = 0;
+	}
+	*head = span;
+}
+
+//! Remove a span from a double linked list
+static void
+_memory_span_list_doublelink_remove(span_t** head, span_t* span) {
+	if (*head == span) {
+		*head = span->next_span;
+	}
+	else {
+		span_t* next_span = span->next_span;
+		span_t* prev_span = span->prev_span;
+		if (next_span)
+			next_span->prev_span = prev_span;
+		prev_span->next_span = next_span;
+	}
+}
+
+#if ENABLE_GLOBAL_CACHE
+
+//! Insert the given list of memory page spans in the global cache, unmapping the list if the cache limit would be exceeded
+static void
+_memory_cache_insert(heap_t* heap, global_cache_t* cache, span_t* span, size_t cache_limit) {
+	assert((span->data.list.size == 1) || (span->next_span != 0));
+	int32_t list_size = (int32_t)span->data.list.size;
+	//Unmap if cache has reached the limit
+	if (atomic_add32(&cache->size, list_size) > (int32_t)cache_limit) {
+		_memory_unmap_span_list(heap, span);
+		atomic_add32(&cache->size, -list_size);
+		return;
+	}
+	void* current_cache, *new_cache;
+	do {
+		current_cache = atomic_load_ptr(&cache->cache);
+		span->prev_span = (void*)((uintptr_t)current_cache & _memory_span_mask);
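+		//Spans are span size aligned, so the bits below the span mask are free to carry
+		//an ABA counter tag alongside the list head pointer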
+		new_cache = (void*)((uintptr_t)span | ((uintptr_t)atomic_incr32(&cache->counter) & ~_memory_span_mask));
+	} while (!atomic_cas_ptr(&cache->cache, new_cache, current_cache));
+}
+
+//! Extract a number of memory page spans from the global cache
+static span_t*
+_memory_cache_extract(global_cache_t* cache) {
+	uintptr_t span_ptr;
+	do {
+		void* global_span = atomic_load_ptr(&cache->cache);
+		span_ptr = (uintptr_t)global_span & _memory_span_mask;
+		if (span_ptr) {
+			span_t* span = (void*)span_ptr;
+			//By accessing the span pointer before it is swapped out of the list we assume that a
+			//contending thread does not manage to traverse and unmap the span before we read it
+			void* new_cache = (void*)((uintptr_t)span->prev_span | ((uintptr_t)atomic_incr32(&cache->counter) & ~_memory_span_mask));
+			if (atomic_cas_ptr(&cache->cache, new_cache, global_span)) {
+				atomic_add32(&cache->size, -(int32_t)span->data.list.size);
+				return span;
+			}
+		}
+	} while (span_ptr);
+	return 0;
+}
+
+//! Finalize a global cache, only valid from allocator finalization (not thread safe)
+static void
+_memory_cache_finalize(global_cache_t* cache) {
+	void* current_cache = atomic_load_ptr(&cache->cache);
+	span_t* span = (void*)((uintptr_t)current_cache & _memory_span_mask);
+	while (span) {
+		span_t* skip_span = (void*)((uintptr_t)span->prev_span & _memory_span_mask);
+		atomic_add32(&cache->size, -(int32_t)span->data.list.size);
+		_memory_unmap_span_list(0, span);
+		span = skip_span;
+	}
+	assert(!atomic_load32(&cache->size));
+	atomic_store_ptr(&cache->cache, 0);
+	atomic_store32(&cache->size, 0);
+}
+
+//! Insert the given list of memory page spans in the global cache, using an adaptive cache limit derived from the allocation high water mark
+static void
+_memory_global_cache_insert(heap_t* heap, span_t* span) {
+	//Calculate adaptive limits
+	size_t span_count = SPAN_COUNT(span->flags);
+	const size_t cache_divisor = (span_count == 1) ? MAX_SPAN_CACHE_DIVISOR : (MAX_LARGE_SPAN_CACHE_DIVISOR * span_count * 2);
+	const size_t cache_limit = (MAX_GLOBAL_CACHE_MULTIPLIER * _memory_max_allocation[span_count - 1]) / cache_divisor;
+	const size_t cache_limit_min = MAX_GLOBAL_CACHE_MULTIPLIER * (span_count == 1 ? MIN_SPAN_CACHE_SIZE : MIN_LARGE_SPAN_CACHE_SIZE);
+	_memory_cache_insert(heap, &_memory_span_cache[span_count - 1], span, cache_limit > cache_limit_min ? cache_limit : cache_limit_min);
+}
+
+//! Extract a list of memory page spans from the global cache for the given span count
+static span_t*
+_memory_global_cache_extract(size_t span_count) {
+	span_t* span = _memory_cache_extract(&_memory_span_cache[span_count - 1]);
+	assert(!span || (SPAN_COUNT(span->flags) == span_count));
+	return span;
+}
+
+#endif
+
+//! Insert a single span into thread heap cache, releasing to global cache if overflow
+static void
+_memory_heap_cache_insert(heap_t* heap, span_t* span) {
+#if ENABLE_THREAD_CACHE
+	size_t span_count = SPAN_COUNT(span->flags);
+	size_t idx = span_count - 1;
+	if (_memory_span_list_push(&heap->span_cache[idx], span) <= heap->span_counter[idx].cache_limit)
+		return;
+	heap->span_cache[idx] = _memory_span_list_split(span, heap->span_counter[idx].cache_limit);
+	assert(span->data.list.size == heap->span_counter[idx].cache_limit);
+#if ENABLE_STATISTICS
+	heap->thread_to_global += (size_t)span->data.list.size * span_count * _memory_span_size;
+#endif
+#if ENABLE_GLOBAL_CACHE
+	_memory_global_cache_insert(heap, span);
+#else
+	_memory_unmap_span_list(heap, span);
+#endif
+#else
+	_memory_unmap_span(heap, span);
+#endif
+}
+
+//! Extract the given number of spans from the different cache levels
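+//  in order: thread cache, reserved spans, deferred unmaps, splitting a larger cached super span,
+//  and finally the global cache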
+static span_t*
+_memory_heap_cache_extract(heap_t* heap, size_t span_count) {
+#if ENABLE_THREAD_CACHE
+	size_t idx = span_count - 1;
+	//Step 1: check thread cache
+	if (heap->span_cache[idx])
+		return _memory_span_list_pop(&heap->span_cache[idx]);
+#endif
+	//Step 2: Check reserved spans
+	if (heap->spans_reserved >= span_count)
+		return _memory_map_spans(heap, span_count);
+	//Step 3: Try processing deferred unmappings
+	span_t* span = _memory_unmap_deferred(heap, span_count);
+	if (span)
+		return span;
+#if ENABLE_THREAD_CACHE
+	//Step 4: Check larger super spans and split if we find one
+	for (++idx; idx < LARGE_CLASS_COUNT; ++idx) {
+		if (heap->span_cache[idx]) {
+			span = _memory_span_list_pop(&heap->span_cache[idx]);
 			break;
 		}
-
-		thread_yield();
-		atomic_thread_fence_acquire();
-		global_span_ptr = atomic_load_ptr(cache);
 	}
-	return span;
+	if (span) {
+		//Mark the span as owned by this heap before splitting
+		size_t got_count = SPAN_COUNT(span->flags);
+		assert(got_count > span_count);
+		atomic_store32(&span->heap_id, heap->id);
+		atomic_thread_fence_release();
+
+		//Split the span and store as reserved if no previously reserved spans, or in thread cache otherwise
+		span_t* subspan = _memory_span_split(heap, span, span_count);
+		assert((SPAN_COUNT(span->flags) + SPAN_COUNT(subspan->flags)) == got_count);
+		assert(SPAN_COUNT(span->flags) == span_count);
+		if (!heap->spans_reserved) {
+			heap->spans_reserved = got_count - span_count;
+			heap->span_reserve = subspan;
+			heap->span_reserve_master = pointer_offset(subspan, -(int32_t)SPAN_DISTANCE(subspan->flags) * (int32_t)_memory_span_size);
+		}
+		else {
+			_memory_heap_cache_insert(heap, subspan);
+		}
+		return span;
+	}
+#if ENABLE_GLOBAL_CACHE
+	//Step 5: Extract from global cache
+	idx = span_count - 1;
+	heap->span_cache[idx] = _memory_global_cache_extract(span_count);
+	if (heap->span_cache[idx]) {
+#if ENABLE_STATISTICS
+		heap->global_to_thread += (size_t)heap->span_cache[idx]->data.list.size * span_count * _memory_span_size;
+#endif
+		return _memory_span_list_pop(&heap->span_cache[idx]);
+	}
+#endif
+#endif
+	return 0;
 }
 
 //! Allocate a small/medium sized memory block from the given heap
 static void*
 _memory_allocate_from_heap(heap_t* heap, size_t size) {
-#if ENABLE_STATISTICS
-	//For statistics we need to store the requested size in the memory block
-	size += sizeof(size_t);
-#endif
-
 	//Calculate the size class index and do a dependent lookup of the final class index (in case of merged classes)
-	const size_t class_idx = _memory_size_class[(size <= SMALL_SIZE_LIMIT) ?
-		((size + (SMALL_GRANULARITY - 1)) >> SMALL_GRANULARITY_SHIFT) - 1 :
-		SMALL_CLASS_COUNT + ((size - SMALL_SIZE_LIMIT + (MEDIUM_GRANULARITY - 1)) >> MEDIUM_GRANULARITY_SHIFT) - 1].class_idx;
+	const size_t base_idx = (size <= SMALL_SIZE_LIMIT) ?
+	                        ((size + (SMALL_GRANULARITY - 1)) >> SMALL_GRANULARITY_SHIFT) :
+	                        SMALL_CLASS_COUNT + ((size - SMALL_SIZE_LIMIT + (MEDIUM_GRANULARITY - 1)) >> MEDIUM_GRANULARITY_SHIFT);
+	assert(!base_idx || ((base_idx - 1) < SIZE_CLASS_COUNT));
+	const size_t class_idx = _memory_size_class[base_idx ? (base_idx - 1) : 0].class_idx;
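+	//For example, assuming a small granularity of 32 bytes (shift 5), a 100 byte request maps to
+	//base_idx (100 + 31) >> 5 = 4 and then to whatever class index class 3 was merged into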
 
 	span_block_t* active_block = heap->active_block + class_idx;
 	size_class_t* size_class = _memory_size_class + class_idx;
 	const count_t class_size = size_class->size;
 
-#if ENABLE_STATISTICS
-	heap->allocated += class_size;
-	heap->requested += size;
-#endif
-
 	//Step 1: Try to get a block from the currently active span. The span block bookkeeping
 	//        data for the active span is stored in the heap for faster access
 use_active:
@@ -762,11 +1074,6 @@
 			assert(active_block->free_list < size_class->block_count);
 		}
 
-#if ENABLE_STATISTICS
-		//Store the requested size for statistics
-		*(size_t*)pointer_offset(block, class_size - sizeof(size_t)) = size;
-#endif
-
 		return block;
 	}
 
@@ -784,48 +1091,29 @@
 		span_t* span = heap->size_cache[class_idx];
 		*active_block = span->data.block;
 		assert(active_block->free_count > 0);
-		span_t* next_span = span->next_span;
-		heap->size_cache[class_idx] = next_span;
+		heap->size_cache[class_idx] = span->next_span;
 		heap->active_span[class_idx] = span;
+
+		//Mark span as owned by this heap
+		atomic_store32(&span->heap_id, heap->id);
+		atomic_thread_fence_release();
+
 		goto use_active;
 	}
 
-	//Step 4: No semi-used span available, try grab a span from the thread cache
-	size_t span_class_idx = _span_class_from_page_count(size_class->page_count);
-	span_t* span = heap->span_cache[span_class_idx];
+	//Step 4: Find a span in one of the cache levels
+	span_t* span = _memory_heap_cache_extract(heap, 1);
 	if (!span) {
-		//Step 5: No span available in the thread cache, try grab a list of spans from the global cache
-		span = _memory_global_cache_extract(size_class->page_count);
-#if ENABLE_STATISTICS
-		if (span)
-			heap->global_to_thread += (size_t)span->data.list.size * size_class->page_count * _memory_page_size;
-#endif
-	}
-	if (span) {
-		if (span->data.list.size > 1) {
-			//We got a list of spans, we will use first as active and store remainder in thread cache
-			span_t* next_span = span->next_span;
-			assert(next_span);
-			next_span->data.list.size = span->data.list.size - 1;
-			heap->span_cache[span_class_idx] = next_span;
-		}
-		else {
-			heap->span_cache[span_class_idx] = 0;
-		}
-	}
-	else {
-		//Step 6: All caches empty, map in new memory pages
-		size_t align_offset = 0;
-		span = _memory_map(size_class->page_count, &align_offset);
-		span->data.block.align_offset = (uint16_t)align_offset;
+		//Step 5: Map in more virtual memory
+		span = _memory_map_spans(heap, 1);
 	}
 
 	//Mark span as owned by this heap and set base data
+	assert(SPAN_COUNT(span->flags) == 1);
+	span->size_class = (uint16_t)class_idx;
 	atomic_store32(&span->heap_id, heap->id);
 	atomic_thread_fence_release();
 
-	span->size_class = (count_t)class_idx;
-
 	//If we only have one block we will grab it, otherwise
 	//set span as new span to use for next allocation
 	if (size_class->block_count > 1) {
@@ -841,12 +1129,7 @@
 	}
 
 	//Track counters
-	_memory_counter_increase(&heap->span_counter[span_class_idx], &_memory_max_allocation[span_class_idx]);
-
-#if ENABLE_STATISTICS
-	//Store the requested size for statistics
-	*(size_t*)pointer_offset(span, SPAN_HEADER_SIZE + class_size - sizeof(size_t)) = size;
-#endif
+	_memory_counter_increase(&heap->span_counter[0], &_memory_max_allocation[0], 1);
 
 	//Return first block if memory page span
 	return pointer_offset(span, SPAN_HEADER_SIZE);
@@ -856,122 +1139,35 @@
 static void*
 _memory_allocate_large_from_heap(heap_t* heap, size_t size) {
 	//Calculate number of needed max sized spans (including header)
+	//Since this function is never called if size > LARGE_SIZE_LIMIT
+	//the span_count is guaranteed to be <= LARGE_CLASS_COUNT
 	size += SPAN_HEADER_SIZE;
-	size_t num_spans = size / SPAN_MAX_SIZE;
-	if (size % SPAN_MAX_SIZE)
-		++num_spans;
-	size_t idx = num_spans - 1;
+	size_t span_count = size >> _memory_span_size_shift;
+	if (size & (_memory_span_size - 1))
+		++span_count;
+	size_t idx = span_count - 1;
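+	//For example, with the default 64KiB span size a 100000 byte request (plus span header)
+	//spills over one span and needs two spans, giving idx 1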
 
-	if (!idx) {
-		//Shared with medium/small spans
-		size_t span_class_idx = _span_class_from_page_count(SPAN_MAX_PAGE_COUNT);
-		//Step 1: Check span cache
-		span_t* span = heap->span_cache[span_class_idx];
-		if (!span) {
-			_memory_deallocate_deferred(heap, 0);
-			span = heap->span_cache[span_class_idx];
-		}
-		if (!span) {
-			//Step 2: No span available in the thread cache, try grab a list of spans from the global cache
-			span = _memory_global_cache_extract(SPAN_MAX_PAGE_COUNT);
-#if ENABLE_STATISTICS
-			if (span)
-				heap->global_to_thread += (size_t)span->data.list.size * SPAN_MAX_PAGE_COUNT * _memory_page_size;
+#if ENABLE_THREAD_CACHE
+	if (!heap->span_cache[idx])
+		_memory_deallocate_deferred(heap, SIZE_CLASS_COUNT + idx);
+#else
+	_memory_deallocate_deferred(heap, SIZE_CLASS_COUNT + idx);
 #endif
-		}
-		if (span) {
-			if (span->data.list.size > 1) {
-				//We got a list of spans, we will use first as active and store remainder in thread cache
-				span_t* next_span = span->next_span;
-				assert(next_span);
-				next_span->data.list.size = span->data.list.size - 1;
-				heap->span_cache[span_class_idx] = next_span;
-			}
-			else {
-				heap->span_cache[span_class_idx] = 0;
-			}
-		}
-		else {
-			//Step 3: All caches empty, map in new memory pages
-			size_t align_offset = 0;
-			span = _memory_map(SPAN_MAX_PAGE_COUNT, &align_offset);
-			span->data.block.align_offset = (uint16_t)align_offset;
-		}
-
-		//Mark span as owned by this heap and set base data
-		atomic_store32(&span->heap_id, heap->id);
-		atomic_thread_fence_release();
-
-		span->size_class = SIZE_CLASS_COUNT;
-
-		//Track counters
-		_memory_counter_increase(&heap->span_counter[span_class_idx], &_memory_max_allocation[span_class_idx]);
-
-		return pointer_offset(span, SPAN_HEADER_SIZE);
+	//Step 1: Find span in one of the cache levels
+	span_t* span = _memory_heap_cache_extract(heap, span_count);
+	if (!span) {
+		//Step 2: Map in more virtual memory
+		span = _memory_map_spans(heap, span_count);
 	}
 
-use_cache:
-	//Step 1: Check if cache for this large size class (or the following, unless first class) has a span
-	while (!heap->large_cache[idx] && (idx < LARGE_CLASS_COUNT) && (idx < num_spans + 1))
-		++idx;
-	span_t* span = heap->large_cache[idx];
-	if (span) {
-		//Happy path, use from cache
-		if (span->data.list.size > 1) {
-			span_t* new_head = span->next_span;
-			assert(new_head);
-			new_head->data.list.size = span->data.list.size - 1;
-			heap->large_cache[idx] = new_head;
-		}
-		else {
-			heap->large_cache[idx] = 0;
-		}
-
-		span->size_class = SIZE_CLASS_COUNT + (count_t)idx;
-
-		//Increase counter
-		_memory_counter_increase(&heap->large_counter[idx], &_memory_max_allocation_large[idx]);
-
-		return pointer_offset(span, SPAN_HEADER_SIZE);
-	}
-
-	//Restore index, we're back to smallest fitting span count
-	idx = num_spans - 1;
-
-	//Step 2: Process deferred deallocation
-	if (_memory_deallocate_deferred(heap, SIZE_CLASS_COUNT + idx))
-		goto use_cache;
-	assert(!heap->large_cache[idx]);
-
-	//Step 3: Extract a list of spans from global cache
-	span = _memory_global_cache_large_extract(num_spans);
-	if (span) {
-#if ENABLE_STATISTICS
-		heap->global_to_thread += (size_t)span->data.list.size * num_spans * SPAN_MAX_SIZE;
-#endif
-		//We got a list from global cache, store remainder in thread cache
-		if (span->data.list.size > 1) {
-			span_t* new_head = span->next_span;
-			assert(new_head);
-			new_head->prev_span = 0;
-			new_head->data.list.size = span->data.list.size - 1;
-			heap->large_cache[idx] = new_head;
-		}
-	}
-	else {
-		//Step 4: Map in more memory pages
-		size_t align_offset = 0;
-		span = _memory_map(num_spans * SPAN_MAX_PAGE_COUNT, &align_offset);
-		span->data.block.align_offset = (uint16_t)align_offset;
-	}
-	//Mark span as owned by this heap
+	//Mark span as owned by this heap and set base data
+	assert(SPAN_COUNT(span->flags) == span_count);
+	span->size_class = (uint16_t)(SIZE_CLASS_COUNT + idx);
 	atomic_store32(&span->heap_id, heap->id);
 	atomic_thread_fence_release();
 
-	span->size_class = SIZE_CLASS_COUNT + (count_t)idx;
-
 	//Increase counter
-	_memory_counter_increase(&heap->large_counter[idx], &_memory_max_allocation_large[idx]);
+	_memory_counter_increase(&heap->span_counter[idx], &_memory_max_allocation[idx], span_count);
 
 	return pointer_offset(span, SPAN_HEADER_SIZE);
 }
@@ -988,151 +1184,88 @@
 	atomic_thread_fence_acquire();
 	do {
 		raw_heap = atomic_load_ptr(&_memory_orphan_heaps);
-		heap = (void*)((uintptr_t)raw_heap & ~(uintptr_t)0xFFFF);
+		heap = (void*)((uintptr_t)raw_heap & _memory_page_mask);
 		if (!heap)
 			break;
 		next_heap = heap->next_orphan;
 		orphan_counter = (uintptr_t)atomic_incr32(&_memory_orphan_counter);
-		next_raw_heap = (void*)((uintptr_t)next_heap | (orphan_counter & 0xFFFF));
+		next_raw_heap = (void*)((uintptr_t)next_heap | (orphan_counter & ~_memory_page_mask));
 	}
 	while (!atomic_cas_ptr(&_memory_orphan_heaps, next_raw_heap, raw_heap));
 
-	if (heap) {
-		heap->next_orphan = 0;
-		return heap;
+	if (!heap) {
+		//Map in pages for a new heap
+		size_t align_offset = 0;
+		heap = _memory_map((1 + (sizeof(heap_t) >> _memory_page_size_shift)) * _memory_page_size, &align_offset);
+		memset(heap, 0, sizeof(heap_t));
+		heap->align_offset = align_offset;
+
+		//Get a new heap ID
+		do {
+			heap->id = atomic_incr32(&_memory_heap_id);
+			if (_memory_heap_lookup(heap->id))
+				heap->id = 0;
+		} while (!heap->id);
+
+		//Link in heap in heap ID map
+		size_t list_idx = heap->id % HEAP_ARRAY_SIZE;
+		do {
+			next_heap = atomic_load_ptr(&_memory_heaps[list_idx]);
+			heap->next_heap = next_heap;
+		} while (!atomic_cas_ptr(&_memory_heaps[list_idx], heap, next_heap));
 	}
 
-	//Map in pages for a new heap
-	size_t align_offset = 0;
-	heap = _memory_map(1 + (sizeof(heap_t) >> _memory_page_size_shift), &align_offset);
-	memset(heap, 0, sizeof(heap_t));
-	heap->align_offset = align_offset;
+#if ENABLE_THREAD_CACHE
+	heap->span_counter[0].cache_limit = MIN_SPAN_CACHE_RELEASE + MIN_SPAN_CACHE_SIZE;
+	for (size_t idx = 1; idx < LARGE_CLASS_COUNT; ++idx)
+		heap->span_counter[idx].cache_limit = MIN_LARGE_SPAN_CACHE_RELEASE + MIN_LARGE_SPAN_CACHE_SIZE;
+#endif
 
-	//Get a new heap ID
-	do {
-		heap->id = atomic_incr32(&_memory_heap_id);
-		if (_memory_heap_lookup(heap->id))
-			heap->id = 0;
-	}
-	while (!heap->id);
-
-	//Link in heap in heap ID map
-	size_t list_idx = heap->id % HEAP_ARRAY_SIZE;
-	do {
-		next_heap = atomic_load_ptr(&_memory_heaps[list_idx]);
-		heap->next_heap = next_heap;
-	}
-	while (!atomic_cas_ptr(&_memory_heaps[list_idx], heap, next_heap));
+	//Clean up any deferred operations
+	_memory_unmap_deferred(heap, 0);
+	_memory_deallocate_deferred(heap, 0);
 
 	return heap;
 }
 
-//! Add a span to a double linked list
-static void
-_memory_list_add(span_t** head, span_t* span) {
-	if (*head) {
-		(*head)->prev_span = span;
-		span->next_span = *head;
-	}
-	else {
-		span->next_span = 0;
-	}
-	*head = span;
-}
-
-//! Remove a span from a double linked list
-static void
-_memory_list_remove(span_t** head, span_t* span) {
-	if (*head == span) {
-		*head = span->next_span;
-	}
-	else {
-		if (span->next_span)
-			span->next_span->prev_span = span->prev_span;
-		span->prev_span->next_span = span->next_span;
-	}
-}
-
-//! Insert span into thread cache, releasing to global cache if overflow
-static void
-_memory_heap_cache_insert(heap_t* heap, span_t* span, size_t page_count) {
-#if MAX_SPAN_CACHE_DIVISOR == 0
-	(void)sizeof(heap);
-	_memory_global_cache_insert(span, 1, page_count);
-#else
-	size_t span_class_idx = _span_class_from_page_count(page_count);
-	span_t** cache = &heap->span_cache[span_class_idx];
-	span->next_span = *cache;
-	if (*cache)
-		span->data.list.size = (*cache)->data.list.size + 1;
-	else
-		span->data.list.size = 1;
-	*cache = span;
-#if MAX_SPAN_CACHE_DIVISOR > 1
-	//Check if cache exceeds limit
-	if ((span->data.list.size >= (MIN_SPAN_CACHE_RELEASE + MIN_SPAN_CACHE_SIZE)) &&
-			(span->data.list.size > heap->span_counter[span_class_idx].cache_limit)) {
-		//Release to global cache
-		count_t list_size = 1;
-		span_t* next = span->next_span;
-		span_t* last = span;
-		while (list_size < MIN_SPAN_CACHE_RELEASE) {
-			last = next;
-			next = next->next_span;
-			++list_size;
-		}
-		next->data.list.size = span->data.list.size - list_size;
-		last->next_span = 0; //Terminate list
-		*cache = next;
-		_memory_global_cache_insert(span, list_size, page_count);
-#if ENABLE_STATISTICS
-		heap->thread_to_global += list_size * page_count * _memory_page_size;
-#endif
-	}
-#endif
-#endif
-}
-
 //! Deallocate the given small/medium memory block from the given heap
 static void
 _memory_deallocate_to_heap(heap_t* heap, span_t* span, void* p) {
 	//Check if span is the currently active span in order to operate
 	//on the correct bookkeeping data
+	assert(SPAN_COUNT(span->flags) == 1);
 	const count_t class_idx = span->size_class;
 	size_class_t* size_class = _memory_size_class + class_idx;
 	int is_active = (heap->active_span[class_idx] == span);
 	span_block_t* block_data = is_active ?
-		heap->active_block + class_idx :
-		&span->data.block;
-
-#if ENABLE_STATISTICS
-	heap->allocated -= size_class->size;
-	heap->requested -= *(size_t*)pointer_offset(p, size_class->size - sizeof(size_t));
-#endif
+	                           heap->active_block + class_idx :
+	                           &span->data.block;
 
 	//Check if the span will become completely free
 	if (block_data->free_count == ((count_t)size_class->block_count - 1)) {
+#if ENABLE_THREAD_CACHE
 		//Track counters
-		size_t span_class_idx = _span_class_from_page_count(size_class->page_count);
-		assert(heap->span_counter[span_class_idx].current_allocations > 0);
-		--heap->span_counter[span_class_idx].current_allocations;
+		assert(heap->span_counter[0].current_allocations > 0);
+		if (heap->span_counter[0].current_allocations)
+			--heap->span_counter[0].current_allocations;
+#endif
 
 		//If it was active, reset counter. Otherwise, if not active, remove from
 		//partial free list if we had a previous free block (guard for classes with only 1 block)
 		if (is_active)
 			block_data->free_count = 0;
 		else if (block_data->free_count > 0)
-			_memory_list_remove(&heap->size_cache[class_idx], span);
+			_memory_span_list_doublelink_remove(&heap->size_cache[class_idx], span);
 
-		//Add to span cache
-		_memory_heap_cache_insert(heap, span, size_class->page_count);
+		//Add to heap span cache
+		_memory_heap_cache_insert(heap, span);
 		return;
 	}
 
 	//Check if first free block for this span (previously fully allocated)
 	if (block_data->free_count == 0) {
 		//add to free list and disable autolink
-		_memory_list_add(&heap->size_cache[class_idx], span);
+		_memory_span_list_doublelink_add(&heap->size_cache[class_idx], span);
 		block_data->first_autolink = (uint16_t)size_class->block_count;
 	}
 	++block_data->free_count;
@@ -1148,57 +1281,26 @@
 //! Deallocate the given large memory block from the given heap
 static void
 _memory_deallocate_large_to_heap(heap_t* heap, span_t* span) {
-	//Check if aliased with 64KiB small/medium spans
-	if (span->size_class == SIZE_CLASS_COUNT) {
-		//Track counters
-		size_t span_class_idx = _span_class_from_page_count(SPAN_MAX_PAGE_COUNT);
-		--heap->span_counter[span_class_idx].current_allocations;
-		//Add to span cache
-		_memory_heap_cache_insert(heap, span, SPAN_MAX_PAGE_COUNT);
-		return;
-	}
-
 	//Decrease counter
-	size_t idx = span->size_class - SIZE_CLASS_COUNT;
-	span_counter_t* counter = heap->large_counter + idx;
-	assert(counter->current_allocations > 0);
-	--counter->current_allocations;
-
-#if MAX_SPAN_CACHE_DIVISOR == 0
-	_memory_global_cache_large_insert(span, 1, idx + 1);
-#else
-	//Insert into cache list
-	span_t** cache = heap->large_cache + idx;
-	span->next_span = *cache;
-	if (*cache)
-		span->data.list.size = (*cache)->data.list.size + 1;
-	else
-		span->data.list.size = 1;
-	*cache = span;
-#if MAX_SPAN_CACHE_DIVISOR > 1
-	//Check if cache exceeds limit
-	if ((span->data.list.size >= (MIN_SPAN_CACHE_RELEASE + MIN_SPAN_CACHE_SIZE)) &&
-			(span->data.list.size > counter->cache_limit)) {
-		//Release to global cache
-		count_t list_size = 1;
-		span_t* next = span->next_span;
-		span_t* last = span;
-		while (list_size < MIN_SPAN_CACHE_RELEASE) {
-			last = next;
-			next = next->next_span;
-			++list_size;
-		}
-		assert(next->next_span);
-		next->data.list.size = span->data.list.size - list_size;
-		last->next_span = 0; //Terminate list
-		*cache = next;
-		_memory_global_cache_large_insert(span, list_size, idx + 1);
-#if ENABLE_STATISTICS
-		heap->thread_to_global += list_size * (idx + 1) * SPAN_MAX_SIZE;
+	size_t idx = (size_t)span->size_class - SIZE_CLASS_COUNT;
+	size_t span_count = idx + 1;
+	assert(SPAN_COUNT(span->flags) == span_count);
+	assert(span->size_class >= SIZE_CLASS_COUNT);
+	assert(idx < LARGE_CLASS_COUNT);
+#if ENABLE_THREAD_CACHE
+	assert(heap->span_counter[idx].current_allocations > 0);
+	if (heap->span_counter[idx].current_allocations)
+		--heap->span_counter[idx].current_allocations;
 #endif
+	if (!heap->spans_reserved && (span_count > 1)) {
+		//Split the span and store remainder as reserved spans
+		//Must split to a dummy 1-span master since we cannot have master spans as reserved
+		_memory_set_span_remainder_as_reserved(heap, span, 1);
+		span_count = 1;
 	}
-#endif
-#endif
+
+	//Insert into cache list
+	_memory_heap_cache_insert(heap, span);
 }
 
 //! Process pending deferred cross-thread deallocations
@@ -1207,16 +1309,14 @@
 	//Grab the current list of deferred deallocations
 	atomic_thread_fence_acquire();
 	void* p = atomic_load_ptr(&heap->defer_deallocate);
-	if (!p)
-		return 0;
-	if (!atomic_cas_ptr(&heap->defer_deallocate, 0, p))
+	if (!p || !atomic_cas_ptr(&heap->defer_deallocate, 0, p))
 		return 0;
 	//Keep track if we deallocate in the given size class
 	int got_class = 0;
 	do {
 		void* next = *(void**)p;
 		//Get span and check which type of block
-		span_t* span = (void*)((uintptr_t)p & SPAN_MASK);
+		span_t* span = (void*)((uintptr_t)p & _memory_span_mask);
 		if (span->size_class < SIZE_CLASS_COUNT) {
 			//Small/medium block
 			got_class |= (span->size_class == size_class);
@@ -1250,7 +1350,7 @@
 //! Allocate a block of the given size
 static void*
 _memory_allocate(size_t size) {
-	if (size <= MEDIUM_SIZE_LIMIT)
+	if (size <= _memory_medium_size_limit)
 		return _memory_allocate_from_heap(get_thread_heap(), size);
 	else if (size <= LARGE_SIZE_LIMIT)
 		return _memory_allocate_large_from_heap(get_thread_heap(), size);
@@ -1258,10 +1358,10 @@
 	//Oversized, allocate pages directly
 	size += SPAN_HEADER_SIZE;
 	size_t num_pages = size >> _memory_page_size_shift;
-	if (size % _memory_page_size)
+	if (size & (_memory_page_size - 1))
 		++num_pages;
 	size_t align_offset = 0;
-	span_t* span = _memory_map(num_pages, &align_offset);
+	span_t* span = _memory_map(num_pages * _memory_page_size, &align_offset);
 	atomic_store32(&span->heap_id, 0);
 	//Store page count in next_span
 	span->next_span = (span_t*)((uintptr_t)num_pages);
@@ -1277,7 +1377,7 @@
 		return;
 
 	//Grab the span (always at start of span, using 64KiB alignment)
-	span_t* span = (void*)((uintptr_t)p & SPAN_MASK);
+	span_t* span = (void*)((uintptr_t)p & _memory_span_mask);
 	int32_t heap_id = atomic_load32(&span->heap_id);
 	heap_t* heap = get_thread_heap();
 	//Check if block belongs to this heap or if deallocation should be deferred
@@ -1293,7 +1393,7 @@
 	else {
 		//Oversized allocation, page count is stored in next_span
 		size_t num_pages = (size_t)span->next_span;
-		_memory_unmap(span, num_pages, span->data.list.align_offset);
+		_memory_unmap(span, num_pages * _memory_page_size, span->data.list.align_offset, 1);
 	}
 }
 
@@ -1301,8 +1401,8 @@
 static void*
 _memory_reallocate(void* p, size_t size, size_t oldsize, unsigned int flags) {
 	if (p) {
-		//Grab the span (always at start of span, using 64KiB alignment)
-		span_t* span = (void*)((uintptr_t)p & SPAN_MASK);
+		//Grab the span using guaranteed span alignment
+		span_t* span = (void*)((uintptr_t)p & _memory_span_mask);
 		int32_t heap_id = atomic_load32(&span->heap_id);
 		if (heap_id) {
 			if (span->size_class < SIZE_CLASS_COUNT) {
@@ -1316,21 +1416,21 @@
 			else {
 				//Large block
 				size_t total_size = size + SPAN_HEADER_SIZE;
-				size_t num_spans = total_size / SPAN_MAX_SIZE;
-				if (total_size % SPAN_MAX_SIZE)
+				size_t num_spans = total_size >> _memory_span_size_shift;
+				if (total_size & (_memory_span_size - 1))
 					++num_spans;
 				size_t current_spans = (span->size_class - SIZE_CLASS_COUNT) + 1;
 				if ((current_spans >= num_spans) && (num_spans >= (current_spans / 2)))
 					return p; //Still fits and less than half of memory would be freed
 				if (!oldsize)
-					oldsize = (current_spans * (size_t)SPAN_MAX_SIZE) - SPAN_HEADER_SIZE;
+					oldsize = (current_spans * _memory_span_size) - SPAN_HEADER_SIZE;
 			}
 		}
 		else {
 			//Oversized block
 			size_t total_size = size + SPAN_HEADER_SIZE;
 			size_t num_pages = total_size >> _memory_page_size_shift;
-			if (total_size % _memory_page_size)
+			if (total_size & (_memory_page_size - 1))
 				++num_pages;
 			//Page count is stored in next_span
 			size_t current_pages = (size_t)span->next_span;
@@ -1344,7 +1444,7 @@
 	//Size is greater than block size, need to allocate a new block and deallocate the old
 	//Avoid hysteresis by overallocating if increase is small (below 37%)
 	size_t lower_bound = oldsize + (oldsize >> 2) + (oldsize >> 3);
-	void* block = _memory_allocate(size > lower_bound ? size : lower_bound);
+	void* block = _memory_allocate((size > lower_bound) ? size : ((size > oldsize) ? lower_bound : size));
 	if (p) {
 		if (!(flags & RPMALLOC_NO_PRESERVE))
 			memcpy(block, p, oldsize < size ? oldsize : size);
@@ -1357,19 +1457,17 @@
 //! Get the usable size of the given block
 static size_t
 _memory_usable_size(void* p) {
-	//Grab the span (always at start of span, using 64KiB alignment)
-	span_t* span = (void*)((uintptr_t)p & SPAN_MASK);
+	//Grab the span using guaranteed span alignment
+	span_t* span = (void*)((uintptr_t)p & _memory_span_mask);
 	int32_t heap_id = atomic_load32(&span->heap_id);
 	if (heap_id) {
-		if (span->size_class < SIZE_CLASS_COUNT) {
-			//Small/medium block
-			size_class_t* size_class = _memory_size_class + span->size_class;
-			return size_class->size;
-		}
+		//Small/medium block
+		if (span->size_class < SIZE_CLASS_COUNT)
+			return _memory_size_class[span->size_class].size;
 
 		//Large block
 		size_t current_spans = (span->size_class - SIZE_CLASS_COUNT) + 1;
-		return (current_spans * (size_t)SPAN_MAX_SIZE) - SPAN_HEADER_SIZE;
+		return (current_spans * _memory_span_size) - SPAN_HEADER_SIZE;
 	}
 
 	//Oversized block, page count is stored in next_span
@@ -1380,27 +1478,18 @@
 //! Adjust and optimize the size class properties for the given class
 static void
 _memory_adjust_size_class(size_t iclass) {
-	//Calculate how many pages are needed for 255 blocks
 	size_t block_size = _memory_size_class[iclass].size;
-	size_t page_count = (block_size * 255) >> _memory_page_size_shift;
-	page_count = (page_count == 0) ? 1 : ((page_count > _memory_max_page_count) ? _memory_max_page_count : page_count);
-	//Merge page counts to span size class granularity
-	page_count = ((page_count + (SPAN_CLASS_GRANULARITY - 1)) / SPAN_CLASS_GRANULARITY) * SPAN_CLASS_GRANULARITY;
-	if (page_count > _memory_max_page_count)
-		page_count = _memory_max_page_count;
-	size_t block_count = ((page_count * _memory_page_size) - SPAN_HEADER_SIZE) / block_size;
-	//Store the final configuration
-	_memory_size_class[iclass].page_count = (uint16_t)page_count;
+	size_t block_count = (_memory_span_size - SPAN_HEADER_SIZE) / block_size;
+
 	_memory_size_class[iclass].block_count = (uint16_t)block_count;
 	_memory_size_class[iclass].class_idx = (uint16_t)iclass;
-	
+
 	//Check if previous size classes can be merged
 	size_t prevclass = iclass;
 	while (prevclass > 0) {
 		--prevclass;
 		//A class can be merged if number of pages and number of blocks are equal
-		if ((_memory_size_class[prevclass].page_count == _memory_size_class[iclass].page_count) &&
-		        (_memory_size_class[prevclass].block_count == _memory_size_class[iclass].block_count)) {
+		if (_memory_size_class[prevclass].block_count == _memory_size_class[iclass].block_count) {
 			memcpy(_memory_size_class + prevclass, _memory_size_class + iclass, sizeof(_memory_size_class[iclass]));
 		}
 		else {
@@ -1431,28 +1520,32 @@
 rpmalloc_initialize_config(const rpmalloc_config_t* config) {
 	if (config)
 		memcpy(&_memory_config, config, sizeof(rpmalloc_config_t));
-	if (!_memory_config.memory_map)
+
+	int default_mapper = 0;
+	if (!_memory_config.memory_map || !_memory_config.memory_unmap) {
+		default_mapper = 1;
 		_memory_config.memory_map = _memory_map_os;
-	if (!_memory_config.memory_unmap)
 		_memory_config.memory_unmap = _memory_unmap_os;
-	
+	}
+
 	_memory_page_size = _memory_config.page_size;
 	if (!_memory_page_size) {
-#ifdef PLATFORM_WINDOWS
+#if PLATFORM_WINDOWS
 		SYSTEM_INFO system_info;
 		memset(&system_info, 0, sizeof(system_info));
 		GetSystemInfo(&system_info);
-		if (system_info.dwAllocationGranularity < SPAN_ADDRESS_GRANULARITY)
-			return -1;
 		_memory_page_size = system_info.dwPageSize;
+		_memory_map_granularity = system_info.dwAllocationGranularity;
 #else
 		_memory_page_size = (size_t)sysconf(_SC_PAGESIZE);
+		_memory_map_granularity = _memory_page_size;
 #endif
 	}
+
 	if (_memory_page_size < 512)
 		_memory_page_size = 512;
-	if (_memory_page_size > 16384)
-		_memory_page_size = 16384;
+	if (_memory_page_size > (16 * 1024))
+		_memory_page_size = (16 * 1024);
 
 	_memory_page_size_shift = 0;
 	size_t page_size_bit = _memory_page_size;
@@ -1460,9 +1553,31 @@
 		++_memory_page_size_shift;
 		page_size_bit >>= 1;
 	}
-
 	_memory_page_size = ((size_t)1 << _memory_page_size_shift);
-	_memory_max_page_count = (SPAN_MAX_SIZE >> _memory_page_size_shift);
+	_memory_page_mask = ~(uintptr_t)(_memory_page_size - 1);
+
+	size_t span_size = _memory_config.span_size;
+	if (!span_size)
+		span_size = (64 * 1024);
+	if (span_size > (256 * 1024))
+		span_size = (256 * 1024);
+	_memory_span_size = 4096;
+	_memory_span_size_shift = 12;
+	while ((_memory_span_size < span_size) || (_memory_span_size < _memory_page_size)) {
+		_memory_span_size <<= 1;
+		++_memory_span_size_shift;
+	}
+	_memory_span_mask = ~(uintptr_t)(_memory_span_size - 1);
+
+	_memory_config.page_size = _memory_page_size;
+	_memory_config.span_size = _memory_span_size;
+
+	if (!_memory_config.span_map_count)
+		_memory_config.span_map_count = DEFAULT_SPAN_MAP_COUNT;
+	if (_memory_config.span_size * _memory_config.span_map_count < _memory_config.page_size)
+		_memory_config.span_map_count = (_memory_config.page_size / _memory_config.span_size);
+	if (_memory_config.span_map_count > 128)
+		_memory_config.span_map_count = 128;
 
 #if defined(__APPLE__) && ENABLE_PRELOAD
 	if (pthread_key_create(&_memory_thread_heap, 0))
@@ -1471,6 +1586,7 @@
 
 	atomic_store32(&_memory_heap_id, 0);
 	atomic_store32(&_memory_orphan_counter, 0);
+	atomic_store32(&_memory_active_heaps, 0);
 
 	//Setup all small and medium size classes
 	size_t iclass;
@@ -1479,10 +1595,14 @@
 		_memory_size_class[iclass].size = (uint16_t)size;
 		_memory_adjust_size_class(iclass);
 	}
+
+	_memory_medium_size_limit = _memory_span_size - SPAN_HEADER_SIZE;
+	if (_memory_medium_size_limit > MEDIUM_SIZE_LIMIT)
+		_memory_medium_size_limit = MEDIUM_SIZE_LIMIT;
 	for (iclass = 0; iclass < MEDIUM_CLASS_COUNT; ++iclass) {
 		size_t size = SMALL_SIZE_LIMIT + ((iclass + 1) * MEDIUM_GRANULARITY);
-		if (size > MEDIUM_SIZE_LIMIT)
-			size = MEDIUM_SIZE_LIMIT;
+		if (size > _memory_medium_size_limit)
+			size = _memory_medium_size_limit;
 		_memory_size_class[SMALL_CLASS_COUNT + iclass].size = (uint16_t)size;
 		_memory_adjust_size_class(SMALL_CLASS_COUNT + iclass);
 	}
@@ -1498,6 +1618,8 @@
 	atomic_thread_fence_acquire();
 
 	rpmalloc_thread_finalize();
+	//If you hit this assert, you still have active threads or forgot to finalize some thread(s)
+	assert(atomic_load32(&_memory_active_heaps) == 0);
 
 	//Free all thread caches
 	for (size_t list_idx = 0; list_idx < HEAP_ARRAY_SIZE; ++list_idx) {
@@ -1505,76 +1627,63 @@
 		while (heap) {
 			_memory_deallocate_deferred(heap, 0);
 
-			for (size_t iclass = 0; iclass < SPAN_CLASS_COUNT; ++iclass) {
-				const size_t page_count = (iclass + 1) * SPAN_CLASS_GRANULARITY;
-				span_t* span = heap->span_cache[iclass];
-				unsigned int span_count = span ? span->data.list.size : 0;
-				for (unsigned int ispan = 0; ispan < span_count; ++ispan) {
-					span_t* next_span = span->next_span;
-					_memory_unmap(span, page_count, span->data.list.align_offset);
-					span = next_span;
+			//Free span caches (another thread might have deferred frees after the thread using this heap finalized)
+#if ENABLE_THREAD_CACHE
+			for (size_t iclass = 0; iclass < LARGE_CLASS_COUNT; ++iclass) {
+				if (heap->span_cache[iclass])
+					_memory_unmap_span_list(0, heap->span_cache[iclass]);
+			}
+#endif
+			heap = heap->next_heap;
+		}
+	}
+
+#if ENABLE_GLOBAL_CACHE
+	//Free global caches
+	for (size_t iclass = 0; iclass < LARGE_CLASS_COUNT; ++iclass)
+		_memory_cache_finalize(&_memory_span_cache[iclass]);
+#endif
+
+	for (size_t list_idx = 0; list_idx < HEAP_ARRAY_SIZE; ++list_idx) {
+		heap_t* heap = atomic_load_ptr(&_memory_heaps[list_idx]);
+		atomic_store_ptr(&_memory_heaps[list_idx], 0);
+		while (heap) {
+			if (heap->spans_reserved) {
+				span_t* span = heap->span_reserve;
+				span_t* master = heap->span_reserve_master;
+				uint32_t remains = SPAN_REMAINS(master->flags);
+
+				assert(master != span);
+				assert(remains >= heap->spans_reserved);
+				_memory_unmap(span, heap->spans_reserved * _memory_span_size, 0, 0);
+				_memory_statistics_sub(&_reserved_spans, heap->spans_reserved);
+				remains = ((uint32_t)heap->spans_reserved >= remains) ? 0 : (remains - (uint32_t)heap->spans_reserved);
+				if (!remains) {
+					uint32_t master_span_count = SPAN_COUNT(master->flags);
+					_memory_statistics_sub(&_reserved_spans, master_span_count);
+					_memory_unmap(master, master_span_count * _memory_span_size, master->data.list.align_offset, 1);
+				}
+				else {
+					SPAN_SET_REMAINS(master->flags, remains);
 				}
 			}
 
-			//Free large spans
-			for (size_t iclass = 0; iclass < LARGE_CLASS_COUNT; ++iclass) {
-				const size_t span_count = iclass + 1;
-				span_t* span = heap->large_cache[iclass];
-				while (span) {
-					span_t* next_span = span->next_span;
-					_memory_unmap(span, span_count * SPAN_MAX_PAGE_COUNT, span->data.list.align_offset);
-					span = next_span;
-				}
-			}
+			_memory_unmap_deferred(heap, 0);
 
 			heap_t* next_heap = heap->next_heap;
-			_memory_unmap(heap, 1 + (sizeof(heap_t) >> _memory_page_size_shift), heap->align_offset);
+			_memory_unmap(heap, (1 + (sizeof(heap_t) >> _memory_page_size_shift)) * _memory_page_size, heap->align_offset, 1);
 			heap = next_heap;
 		}
-
-		atomic_store_ptr(&_memory_heaps[list_idx], 0);
 	}
 	atomic_store_ptr(&_memory_orphan_heaps, 0);
-
-	//Free global caches
-	for (size_t iclass = 0; iclass < SPAN_CLASS_COUNT; ++iclass) {
-		void* span_ptr = atomic_load_ptr(&_memory_span_cache[iclass]);
-		size_t cache_count = (uintptr_t)span_ptr & ~SPAN_MASK;
-		span_t* span = (span_t*)((void*)((uintptr_t)span_ptr & SPAN_MASK));
-		while (cache_count) {
-			span_t* skip_span = span->prev_span;
-			unsigned int span_count = span->data.list.size;
-			for (unsigned int ispan = 0; ispan < span_count; ++ispan) {
-				span_t* next_span = span->next_span;
-				_memory_unmap(span, (iclass + 1) * SPAN_CLASS_GRANULARITY, span->data.list.align_offset);
-				span = next_span;
-			}
-			span = skip_span;
-			cache_count -= span_count;
-		}
-		atomic_store_ptr(&_memory_span_cache[iclass], 0);
-	}
-
-	for (size_t iclass = 0; iclass < LARGE_CLASS_COUNT; ++iclass) {
-		void* span_ptr = atomic_load_ptr(&_memory_large_cache[iclass]);
-		size_t cache_count = (uintptr_t)span_ptr & ~SPAN_MASK;
-		span_t* span = (span_t*)((void*)((uintptr_t)span_ptr & SPAN_MASK));
-		while (cache_count) {
-			span_t* skip_span = span->prev_span;
-			unsigned int span_count = span->data.list.size;
-			for (unsigned int ispan = 0; ispan < span_count; ++ispan) {
-				span_t* next_span = span->next_span;
-				_memory_unmap(span, (iclass + 1) * SPAN_MAX_PAGE_COUNT, span->data.list.align_offset);
-				span = next_span;
-			}
-			span = skip_span;
-			cache_count -= span_count;
-		}
-		atomic_store_ptr(&_memory_large_cache[iclass], 0);
-	}
-
 	atomic_thread_fence_release();
 
+#if ENABLE_STATISTICS
+	//If you hit these asserts you probably have memory leaks or double frees in your code
+	assert(!atomic_load32(&_mapped_pages));
+	assert(!atomic_load32(&_reserved_spans));
+#endif
+
 #if defined(__APPLE__) && ENABLE_PRELOAD
 	pthread_key_delete(_memory_thread_heap);
 #endif
@@ -1584,13 +1693,13 @@
 void
 rpmalloc_thread_initialize(void) {
 	if (!get_thread_heap()) {
-		heap_t* heap =  _memory_allocate_heap();
+		atomic_incr32(&_memory_active_heaps);
+		heap_t* heap = _memory_allocate_heap();
 #if ENABLE_STATISTICS
 		heap->thread_to_global = 0;
 		heap->global_to_thread = 0;
 #endif
 		set_thread_heap(heap);
-		atomic_incr32(&_memory_active_heaps);
 	}
 }
 
@@ -1601,71 +1710,27 @@
 	if (!heap)
 		return;
 
-	atomic_add32(&_memory_active_heaps, -1);
-
 	_memory_deallocate_deferred(heap, 0);
+	_memory_unmap_deferred(heap, 0);
 
 	//Release thread cache spans back to global cache
-	for (size_t iclass = 0; iclass < SPAN_CLASS_COUNT; ++iclass) {
-		const size_t page_count = (iclass + 1) * SPAN_CLASS_GRANULARITY;
+#if ENABLE_THREAD_CACHE
+	for (size_t iclass = 0; iclass < LARGE_CLASS_COUNT; ++iclass) {
 		span_t* span = heap->span_cache[iclass];
+#if ENABLE_GLOBAL_CACHE
+		const size_t span_count = iclass + 1;
 		while (span) {
-			if (span->data.list.size > MIN_SPAN_CACHE_RELEASE) {
-				count_t list_size = 1;
-				span_t* next = span->next_span;
-				span_t* last = span;
-				while (list_size < MIN_SPAN_CACHE_RELEASE) {
-					last = next;
-					next = next->next_span;
-					++list_size;
-				}
-				last->next_span = 0; //Terminate list
-				next->data.list.size = span->data.list.size - list_size;
-				_memory_global_cache_insert(span, list_size, page_count);
-				span = next;
-			}
-			else {
-				_memory_global_cache_insert(span, span->data.list.size, page_count);
-				span = 0;
-			}
+			assert(SPAN_COUNT(span->flags) == span_count);
+			span_t* next = _memory_span_list_split(span, !iclass ? MIN_SPAN_CACHE_RELEASE : (MIN_LARGE_SPAN_CACHE_RELEASE / span_count));
+			_memory_global_cache_insert(0, span);
+			span = next;
 		}
+#else
+		if (span)
+			_memory_unmap_span_list(heap, span);
+#endif
 		heap->span_cache[iclass] = 0;
 	}
-
-	for (size_t iclass = 0; iclass < LARGE_CLASS_COUNT; ++iclass) {
-		const size_t span_count = iclass + 1;
-		span_t* span = heap->large_cache[iclass];
-		while (span) {
-			if (span->data.list.size > MIN_SPAN_CACHE_RELEASE) {
-				count_t list_size = 1;
-				span_t* next = span->next_span;
-				span_t* last = span;
-				while (list_size < MIN_SPAN_CACHE_RELEASE) {
-					last = next;
-					next = next->next_span;
-					++list_size;
-				}
-				last->next_span = 0; //Terminate list
-				next->data.list.size = span->data.list.size - list_size;
-				_memory_global_cache_large_insert(span, list_size, span_count);
-				span = next;
-			}
-			else {
-				_memory_global_cache_large_insert(span, span->data.list.size, span_count);
-				span = 0;
-			}
-		}
-		heap->large_cache[iclass] = 0;
-	}
-
-	//Reset allocation counters
-	memset(heap->span_counter, 0, sizeof(heap->span_counter));
-	memset(heap->large_counter, 0, sizeof(heap->large_counter));
-#if ENABLE_STATISTICS
-	heap->requested = 0;
-	heap->allocated = 0;
-	heap->thread_to_global = 0;
-	heap->global_to_thread = 0;
 #endif
 
 	//Orphan the heap
@@ -1674,13 +1739,14 @@
 	heap_t* last_heap;
 	do {
 		last_heap = atomic_load_ptr(&_memory_orphan_heaps);
-		heap->next_orphan = (void*)((uintptr_t)last_heap & ~(uintptr_t)0xFFFF);
+		heap->next_orphan = (void*)((uintptr_t)last_heap & _memory_page_mask);
 		orphan_counter = (uintptr_t)atomic_incr32(&_memory_orphan_counter);
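+		//Tag the heap pointer with the orphan counter in the low bits (cleared by _memory_page_mask) to avoid ABA issues in the lock-free orphan list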
-		raw_heap = (void*)((uintptr_t)heap | (orphan_counter & 0xFFFF));
+		raw_heap = (void*)((uintptr_t)heap | (orphan_counter & ~_memory_page_mask));
 	}
 	while (!atomic_cas_ptr(&_memory_orphan_heaps, raw_heap, last_heap));
 
 	set_thread_heap(0);
+	atomic_add32(&_memory_active_heaps, -1);
 }
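Since rpmalloc_finalize now asserts that no heaps are still active and rpmalloc_thread_finalize orphans the heap, every thread that uses the allocator should pair rpmalloc_thread_initialize with rpmalloc_thread_finalize before the allocator itself is finalized. A minimal usage sketch with POSIX threads (the worker function and sizes are illustrative, not part of this change):

```c
#include <pthread.h>
#include "rpmalloc.h"

//Illustrative worker: each thread pairs thread initialize/finalize around its allocations
static void*
worker(void* arg) {
	(void)arg;
	rpmalloc_thread_initialize();
	void* block = rpmalloc(1024);
	rpfree(block);
	rpmalloc_thread_finalize(); //releases the thread cache and orphans the heap
	return 0;
}

int
main(void) {
	rpmalloc_initialize();
	pthread_t thread;
	pthread_create(&thread, 0, worker, 0);
	pthread_join(thread, 0);
	rpmalloc_finalize(); //all threads finalized, so the active heap assert holds
	return 0;
}
```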
 
 int
@@ -1688,57 +1754,149 @@
 	return (get_thread_heap() != 0) ? 1 : 0;
 }
 
+const rpmalloc_config_t*
+rpmalloc_config(void) {
+	return &_memory_config;
+}
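The new accessor returns the configuration after initialization has filled in defaults, so callers can discover the effective page size, span size and span map count. A minimal sketch:

```c
#include <stdio.h>
#include "rpmalloc.h"

int
main(void) {
	rpmalloc_initialize();
	//After initialization the config holds the resolved values, including
	//defaults substituted for any fields that were left at zero.
	const rpmalloc_config_t* config = rpmalloc_config();
	printf("page size:      %zu\n", config->page_size);
	printf("span size:      %zu\n", config->span_size);
	printf("span map count: %zu\n", config->span_map_count);
	rpmalloc_finalize();
	return 0;
}
```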
+
 //! Map new pages to virtual memory
 static void*
-_memory_map_os(size_t size) {
-	void* ptr;
+_memory_map_os(size_t size, size_t* offset) {
+	//The requested size is either a single heap page or a (multiple) span - only spans need to be aligned, and only if the span size exceeds the map granularity
+	size_t padding = ((size >= _memory_span_size) && (_memory_span_size > _memory_map_granularity)) ? _memory_span_size : 0;
 
-#if ENABLE_STATISTICS
-	atomic_add32(&_mapped_pages, (int32_t)(size >> _memory_page_size_shift));
-	atomic_add32(&_mapped_total, (int32_t)(size >> _memory_page_size_shift));
-#endif
-
-#ifdef PLATFORM_WINDOWS
-	ptr = VirtualAlloc(0, size, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
-#else
-	size_t padding = SPAN_ADDRESS_GRANULARITY;
-
-	ptr = mmap(0, size + padding, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS | MAP_UNINITIALIZED, -1, 0);
-	if (ptr == MAP_FAILED)
+#if PLATFORM_WINDOWS
+	//Ok to MEM_COMMIT - according to MSDN, "actual physical pages are not allocated unless/until the virtual addresses are actually accessed"
+	void* ptr = VirtualAlloc(0, size + padding, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
+	if (!ptr) {
+		assert("Failed to map virtual memory block" == 0);
 		return 0;
-
-	padding -= (uintptr_t)ptr % SPAN_ADDRESS_GRANULARITY;
-	ptr = pointer_offset(ptr, padding);
+	}
+#else
+	void* ptr = mmap(0, size + padding, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS | MAP_UNINITIALIZED, -1, 0);
+	if ((ptr == MAP_FAILED) || !ptr) {
+		assert("Failed to map virtual memory block" == 0);
+		return 0;
+	}
 #endif
 
+	if (padding) {
+		size_t final_padding = padding - ((uintptr_t)ptr & ~_memory_span_mask);
+#if PLATFORM_POSIX
+		//Unmap the last unused pages; on Windows this is handled by the final VirtualFree with MEM_RELEASE call
+		size_t remains = padding - final_padding;
+		if (remains)
+			munmap(pointer_offset(ptr, final_padding + size), remains);
+#endif
+		ptr = pointer_offset(ptr, final_padding);
+		assert(final_padding <= _memory_span_size);
+		assert(!(final_padding & 7)); //Offset is stored in 8-byte units below
+		assert(!((uintptr_t)ptr & ~_memory_span_mask));
+		*offset = final_padding >> 3;
+		assert(*offset < 65536);
+	}
+
 	return ptr;
 }
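The padding logic above over-maps by one span size, advances the returned pointer to the next span boundary, and records the skipped head in 8-byte units so the offset fits the 16-bit limit; the unmap path below reverses that shift. A small standalone sketch of just this encode/decode arithmetic, using a hypothetical 64KiB span size and address:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

int
main(void) {
	const uintptr_t span_size = 64 * 1024;               //hypothetical default span size
	const uintptr_t span_mask = ~(span_size - 1);

	uintptr_t raw = 0x13000;                             //page aligned, but not span aligned
	uintptr_t padding = span_size - (raw & ~span_mask);  //bytes to the next span boundary
	uintptr_t aligned = raw + padding;

	assert(!(aligned & ~span_mask));                     //now span aligned
	assert(!(padding & 7));                              //encodable in 8-byte units

	size_t offset = padding >> 3;                        //stored with the span
	assert(offset < 65536);                              //must fit in a 16-bit field

	uintptr_t original = aligned - (offset << 3);        //unmap path recovers the mapping start
	assert(original == raw);
	return 0;
}
```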
 
 //! Unmap pages from virtual memory
 static void
-_memory_unmap_os(void* ptr, size_t size) {
-#if ENABLE_STATISTICS
-	atomic_add32(&_mapped_pages, -(int32_t)(size >> _memory_page_size_shift));
-	atomic_add32(&_unmapped_total, (int32_t)(size >> _memory_page_size_shift));
+_memory_unmap_os(void* address, size_t size, size_t offset, int release) {
+	assert(release || (offset == 0));
+	if (release && offset) {
+		offset <<= 3;
+#if PLATFORM_POSIX
+		size += offset;
 #endif
-
-#ifdef PLATFORM_WINDOWS
-	VirtualFree(ptr, 0, MEM_RELEASE);
-	(void)sizeof(size);
+		address = pointer_offset(address, -(int32_t)offset);
+	}
+#if PLATFORM_WINDOWS
+	if (!VirtualFree(address, release ? 0 : size, release ? MEM_RELEASE : MEM_DECOMMIT)) {
+		DWORD err = GetLastError();
+		(void)sizeof(err);
+		assert("Failed to unmap virtual memory block" == 0);
+	}
 #else
-	munmap(ptr, size);
+	MEMORY_UNUSED(release);
+	if (munmap(address, size)) {
+		assert("Failed to unmap virtual memory block" == 0);
+	}
 #endif
 }
 
-//! Yield the thread remaining timeslice
+#if ENABLE_GUARDS
 static void
-thread_yield(void) {
-#ifdef PLATFORM_WINDOWS
-	YieldProcessor();
-#else
-	sched_yield();
-#endif
+_memory_guard_validate(void* p) {
+	if (!p)
+		return;
+	void* block_start;
+	size_t block_size = _memory_usable_size(p);
+	span_t* span = (void*)((uintptr_t)p & _memory_span_mask);
+	int32_t heap_id = atomic_load32(&span->heap_id);
+	if (heap_id) {
+		if (span->size_class < SIZE_CLASS_COUNT) {
+			void* span_blocks_start = pointer_offset(span, SPAN_HEADER_SIZE);
+			size_class_t* size_class = _memory_size_class + span->size_class;
+			count_t block_offset = (count_t)pointer_diff(p, span_blocks_start);
+			count_t block_idx = block_offset / (count_t)size_class->size;
+			block_start = pointer_offset(span_blocks_start, block_idx * size_class->size);
+		}
+		else {
+			block_start = pointer_offset(span, SPAN_HEADER_SIZE);
+		}
+	}
+	else {
+		block_start = pointer_offset(span, SPAN_HEADER_SIZE);
+	}
+	uint32_t* deadzone = block_start;
+	//If these asserts fire, you have written to memory before the block start
+	for (int i = 0; i < 8; ++i) {
+		if (deadzone[i] != MAGIC_GUARD) {
+			if (_memory_config.memory_overwrite)
+				_memory_config.memory_overwrite(p);
+			else
+				assert("Memory overwrite before block start" == 0);
+			return;
+		}
+		deadzone[i] = 0;
+	}
+	deadzone = (uint32_t*)pointer_offset(block_start, block_size - 32);
+	//If these asserts fire, you have written to memory after the block end
+	for (int i = 0; i < 8; ++i) {
+		if (deadzone[i] != MAGIC_GUARD) {
+			if (_memory_config.memory_overwrite)
+				_memory_config.memory_overwrite(p);
+			else
+				assert("Memory overwrite after block end" == 0);
+			return;
+		}
+		deadzone[i] = 0;
+	}
 }
+#else
+#define _memory_guard_validate(block)
+#endif
+
+#if ENABLE_GUARDS
+static void
+_memory_guard_block(void* block) {
+	if (block) {
+		size_t block_size = _memory_usable_size(block);
+		uint32_t* deadzone = block;
+		deadzone[0] = deadzone[1] = deadzone[2] = deadzone[3] =
+		deadzone[4] = deadzone[5] = deadzone[6] = deadzone[7] = MAGIC_GUARD;
+		deadzone = (uint32_t*)pointer_offset(block, block_size - 32);
+		deadzone[0] = deadzone[1] = deadzone[2] = deadzone[3] =
+		deadzone[4] = deadzone[5] = deadzone[6] = deadzone[7] = MAGIC_GUARD;
+	}
+}
+#define _memory_guard_pre_alloc(size) size += 64
+#define _memory_guard_pre_realloc(block, size) block = pointer_offset(block, -32); size += 64
+#define _memory_guard_post_alloc(block, size) _memory_guard_block(block); block = pointer_offset(block, 32); size -= 64
+#else
+#define _memory_guard_pre_alloc(size)
+#define _memory_guard_pre_realloc(block, size)
+#define _memory_guard_post_alloc(block, size)
+#endif
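With ENABLE_GUARDS the request is padded by 64 bytes and the block is bracketed by two 32-byte dead zones filled with MAGIC_GUARD; a corrupted zone is reported through the memory_overwrite callback if one was configured, otherwise the assert fires. A hedged sketch of wiring up that callback (only meaningful in a build compiled with ENABLE_GUARDS=1; the deliberate overwrite is for demonstration):

```c
#include <stdio.h>
#include <string.h>
#include "rpmalloc.h"

//Called by the allocator when a guard zone around the given block is overwritten
static void
report_overwrite(void* address) {
	fprintf(stderr, "guard zone overwritten around block %p\n", address);
}

int
main(void) {
	rpmalloc_config_t config;
	memset(&config, 0, sizeof(config));
	config.memory_overwrite = report_overwrite;
	rpmalloc_initialize_config(&config);

	char* block = rpmalloc(100);
	size_t usable = rpmalloc_usable_size(block);
	block[usable] = 'x'; //first byte past the usable area lands in the trailing dead zone
	rpfree(block);       //guard validation runs here and invokes the callback

	rpmalloc_finalize();
	return 0;
}
```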
 
 // Extern interface
 
@@ -1750,77 +1908,15 @@
 		return 0;
 	}
 #endif
-#if ENABLE_GUARDS
-	size += 32;
-#endif
+	_memory_guard_pre_alloc(size);
 	void* block = _memory_allocate(size);
-#if ENABLE_GUARDS
-	if (block) {
-		size_t block_size = _memory_usable_size(block);
-		uint32_t* deadzone = block;
-		deadzone[0] = deadzone[1] = deadzone[2] = deadzone[3] = MAGIC_GUARD;
-		deadzone = (uint32_t*)pointer_offset(block, block_size - 16);
-		deadzone[0] = deadzone[1] = deadzone[2] = deadzone[3] = MAGIC_GUARD;
-		block = pointer_offset(block, 16);
-	}
-#endif
+	_memory_guard_post_alloc(block, size);
 	return block;
 }
 
-#if ENABLE_GUARDS
-static void
-_memory_validate_integrity(void* p) {
-	if (!p)
-		return;
-	void* block_start;
-	size_t block_size = _memory_usable_size(p);
-	span_t* span = (void*)((uintptr_t)p & SPAN_MASK);
-	int32_t heap_id = atomic_load32(&span->heap_id);
-	if (heap_id) {
-		if (span->size_class < SIZE_CLASS_COUNT) {
-			void* span_blocks_start = pointer_offset(span, SPAN_HEADER_SIZE);
-			size_class_t* size_class = _memory_size_class + span->size_class;
-			count_t block_offset = (count_t)pointer_diff(p, span_blocks_start);
-			count_t block_idx = block_offset / (count_t)size_class->size;
-	 		block_start = pointer_offset(span_blocks_start, block_idx * size_class->size);
-	 	}
-	 	else {
-			block_start = pointer_offset(span, SPAN_HEADER_SIZE);
-	 	}
-  	}
-	else {
-		block_start = pointer_offset(span, SPAN_HEADER_SIZE);
-	}
-	uint32_t* deadzone = block_start;
-	//If these asserts fire, you have written to memory before the block start
-	for (int i = 0; i < 4; ++i) {
-		if (deadzone[i] == MAGIC_GUARD)
-			continue;
-		if (_memory_config.memory_overwrite)
-			_memory_config.memory_overwrite(p);
-		else
-			assert(deadzone[i] == MAGIC_GUARD && "Memory overwrite before block start");
-		return;
-	}
-	deadzone = (uint32_t*)pointer_offset(block_start, block_size - 16);
-	//If these asserts fire, you have written to memory after the block end
-	for (int i = 0; i < 4; ++i) {
-		if (deadzone[i] == MAGIC_GUARD)
-			continue;
-		if (_memory_config.memory_overwrite)
-			_memory_config.memory_overwrite(p);
-		else
-			assert(deadzone[i] == MAGIC_GUARD && "Memory overwrite after block end");
-		return;
-	}
-}
-#endif
-
 void
 rpfree(void* ptr) {
-#if ENABLE_GUARDS
-	_memory_validate_integrity(ptr);
-#endif	
+	_memory_guard_validate(ptr);
 	_memory_deallocate(ptr);
 }
 
@@ -1828,7 +1924,7 @@
 rpcalloc(size_t num, size_t size) {
 	size_t total;
 #if ENABLE_VALIDATE_ARGS
-#ifdef PLATFORM_WINDOWS
+#if PLATFORM_WINDOWS
 	int err = SizeTMult(num, size, &total);
 	if ((err != S_OK) || (total >= MAX_ALLOC_SIZE)) {
 		errno = EINVAL;
@@ -1844,21 +1940,9 @@
 #else
 	total = num * size;
 #endif
-#if ENABLE_GUARDS
-	total += 32;
-#endif
+	_memory_guard_pre_alloc(total);
 	void* block = _memory_allocate(total);
-#if ENABLE_GUARDS
-	if (block) {
-		size_t block_size = _memory_usable_size(block);
-		uint32_t* deadzone = block;
-		deadzone[0] = deadzone[1] = deadzone[2] = deadzone[3] = MAGIC_GUARD;
-		deadzone = (uint32_t*)pointer_offset(block, block_size - 16);
-		deadzone[0] = deadzone[1] = deadzone[2] = deadzone[3] = MAGIC_GUARD;
-		block = pointer_offset(block, 16);
-		total -= 32;
-	}
-#endif
+	_memory_guard_post_alloc(block, total);
 	memset(block, 0, total);
 	return block;
 }
@@ -1871,21 +1955,10 @@
 		return ptr;
 	}
 #endif
-#if ENABLE_GUARDS
-	_memory_validate_integrity(ptr);
-	size += 32;
-#endif
+	_memory_guard_validate(ptr);
+	_memory_guard_pre_realloc(ptr, size);
 	void* block = _memory_reallocate(ptr, size, 0, 0);
-#if ENABLE_GUARDS
-	if (block) {
-		size_t block_size = _memory_usable_size(block);
-		uint32_t* deadzone = block;
-		deadzone[0] = deadzone[1] = deadzone[2] = deadzone[3] = MAGIC_GUARD;
-		deadzone = (uint32_t*)pointer_offset(block, block_size - 16);
-		deadzone[0] = deadzone[1] = deadzone[2] = deadzone[3] = MAGIC_GUARD;
-		block = pointer_offset(block, 16);
-	}
-#endif
+	_memory_guard_post_alloc(block, size);
 	return block;
 }
 
@@ -1893,45 +1966,34 @@
 rpaligned_realloc(void* ptr, size_t alignment, size_t size, size_t oldsize,
                   unsigned int flags) {
 #if ENABLE_VALIDATE_ARGS
-	if ((size + alignment < size) || (alignment > PAGE_SIZE)) {
+	if ((size + alignment < size) || (alignment > _memory_page_size)) {
 		errno = EINVAL;
 		return 0;
 	}
 #endif
 	void* block;
-	if (alignment > 16) {
+	if (alignment > 32) {
 		block = rpaligned_alloc(alignment, size);
 		if (!(flags & RPMALLOC_NO_PRESERVE))
 			memcpy(block, ptr, oldsize < size ? oldsize : size);
 		rpfree(ptr);
 	}
 	else {
-#if ENABLE_GUARDS
-		_memory_validate_integrity(ptr);
-		size += 32;
-#endif
+		_memory_guard_validate(ptr);
+		_memory_guard_pre_realloc(ptr, size);
 		block = _memory_reallocate(ptr, size, oldsize, flags);
-#if ENABLE_GUARDS
-		if (block) {
-			size_t block_size = _memory_usable_size(block);
-			uint32_t* deadzone = block;
-			deadzone[0] = deadzone[1] = deadzone[2] = deadzone[3] = MAGIC_GUARD;
-			deadzone = (uint32_t*)pointer_offset(block, block_size - 16);
-			deadzone[0] = deadzone[1] = deadzone[2] = deadzone[3] = MAGIC_GUARD;
-			block = pointer_offset(block, 16);
-		}
-#endif
+		_memory_guard_post_alloc(block, size);
 	}
 	return block;
 }
 
 RPMALLOC_RESTRICT void*
 rpaligned_alloc(size_t alignment, size_t size) {
-	if (alignment <= 16)
+	if (alignment <= 32)
 		return rpmalloc(size);
 
 #if ENABLE_VALIDATE_ARGS
-	if ((size + alignment < size) || (alignment > PAGE_SIZE)) {
+	if ((size + alignment < size) || (alignment > _memory_page_size)) {
 		errno = EINVAL;
 		return 0;
 	}
@@ -1963,7 +2025,7 @@
 	if (ptr) {
 		size = _memory_usable_size(ptr);
 #if ENABLE_GUARDS
-		size -= 32;
+		size -= 64;
 #endif
 	}
 	return size;
@@ -1971,21 +2033,19 @@
 
 void
 rpmalloc_thread_collect(void) {
-	_memory_deallocate_deferred(get_thread_heap(), 0);
+	heap_t* heap = get_thread_heap();
+	_memory_unmap_deferred(heap, 0);
+	_memory_deallocate_deferred(heap, 0);
 }
 
 void
 rpmalloc_thread_statistics(rpmalloc_thread_statistics_t* stats) {
 	memset(stats, 0, sizeof(rpmalloc_thread_statistics_t));
 	heap_t* heap = get_thread_heap();
-#if ENABLE_STATISTICS
-	stats->allocated = heap->allocated;
-	stats->requested = heap->requested;
-#endif
 	void* p = atomic_load_ptr(&heap->defer_deallocate);
 	while (p) {
 		void* next = *(void**)p;
-		span_t* span = (void*)((uintptr_t)p & SPAN_MASK);
+		span_t* span = (void*)((uintptr_t)p & _memory_span_mask);
 		stats->deferred += _memory_size_class[span->size_class].size;
 		p = next;
 	}
@@ -2001,10 +2061,12 @@
 		}
 	}
 
-	for (size_t isize = 0; isize < SPAN_CLASS_COUNT; ++isize) {
-		if (heap->span_cache[isize])
-			stats->spancache = (size_t)heap->span_cache[isize]->data.list.size * (isize + 1) * SPAN_CLASS_GRANULARITY * _memory_page_size;
+#if ENABLE_THREAD_CACHE
+	for (size_t iclass = 0; iclass < LARGE_CLASS_COUNT; ++iclass) {
+		if (heap->span_cache[iclass])
+			stats->spancache = (size_t)heap->span_cache[iclass]->data.list.size * (iclass + 1) * _memory_span_size;
 	}
+#endif
 }
 
 void
@@ -2015,24 +2077,9 @@
 	stats->mapped_total = (size_t)atomic_load32(&_mapped_total) * _memory_page_size;
 	stats->unmapped_total = (size_t)atomic_load32(&_unmapped_total) * _memory_page_size;
 #endif
-	for (size_t iclass = 0; iclass < SPAN_CLASS_COUNT; ++iclass) {
-		void* global_span_ptr = atomic_load_ptr(&_memory_span_cache[iclass]);
-		while (global_span_ptr == SPAN_LIST_LOCK_TOKEN) {
-			thread_yield();
-			global_span_ptr = atomic_load_ptr(&_memory_span_cache[iclass]);
-		}
-		uintptr_t global_span_count = (uintptr_t)global_span_ptr & ~SPAN_MASK;
-		size_t list_bytes = global_span_count * (iclass + 1) * SPAN_CLASS_GRANULARITY * _memory_page_size;
-		stats->cached += list_bytes;
-	}
+#if ENABLE_GLOBAL_CACHE
 	for (size_t iclass = 0; iclass < LARGE_CLASS_COUNT; ++iclass) {
-		void* global_span_ptr = atomic_load_ptr(&_memory_large_cache[iclass]);
-		while (global_span_ptr == SPAN_LIST_LOCK_TOKEN) {
-			thread_yield();
-			global_span_ptr = atomic_load_ptr(&_memory_large_cache[iclass]);
-		}
-		uintptr_t global_span_count = (uintptr_t)global_span_ptr & ~SPAN_MASK;
-		size_t list_bytes = global_span_count * (iclass + 1) * SPAN_MAX_PAGE_COUNT * _memory_page_size;
-		stats->cached_large += list_bytes;
+		stats->cached += (size_t)atomic_load32(&_memory_span_cache[iclass].size) * (iclass + 1) * _memory_span_size;
 	}
+#endif
 }
diff --git a/rpmalloc/rpmalloc.h b/rpmalloc/rpmalloc.h
index 52ab2d3..f596101 100644
--- a/rpmalloc/rpmalloc.h
+++ b/rpmalloc/rpmalloc.h
@@ -39,8 +39,6 @@
 	size_t mapped;
 	//! Current amount of memory in global caches for small and medium sizes (<64KiB)
 	size_t cached;
-	//! Curren amount of memory in global caches for large sizes (>=64KiB)
-	size_t cached_large;
 	//! Total amount of memory mapped (only if ENABLE_STATISTICS=1)
 	size_t mapped_total;
 	//! Total amount of memory unmapped (only if ENABLE_STATISTICS=1)
@@ -48,10 +46,6 @@
 } rpmalloc_global_statistics_t;
 
 typedef struct rpmalloc_thread_statistics_t {
-	//! Amount of memory currently requested in allocations (only if ENABLE_STATISTICS=1)
-	size_t requested;
-	//! Amount of memory actually allocated in memory blocks (only if ENABLE_STATISTICS=1)
-	size_t allocated;
 	//! Current number of bytes available for allocation from active spans
 	size_t active;
 	//! Current number of bytes available in thread size class caches
@@ -68,17 +62,32 @@
 
 typedef struct rpmalloc_config_t {
 	//! Map memory pages for the given number of bytes. The returned address MUST be
-	//  2 byte aligned, and should ideally be 64KiB aligned. If memory returned is not
-	//  64KiB aligned rpmalloc will call unmap and then another map request with size
-	//  padded by 64KiB in order to align it internally.
-	void* (*memory_map)(size_t size);
+	//  aligned to the rpmalloc span size, which will always be a power of two.
+	//  Optionally the function can store an alignment offset in the offset variable
+	//  in case it performs alignment and the returned pointer is offset from the
+	//  actual start of the memory region due to this alignment. The alignment offset
+	//  will be passed to the memory unmap function. The alignment offset MUST NOT be
+	//  larger than 65535 (storable in a uint16_t); if it could be larger, encode it in
+	//  units of its natural alignment so that it fits into 16 bits.
+	void* (*memory_map)(size_t size, size_t* offset);
 	//! Unmap the memory pages starting at address and spanning the given number of bytes.
-	//  Address will always be an address returned by an earlier call to memory_map function.
-	void (*memory_unmap)(void* address, size_t size);
-	//! Size of memory pages. All allocation requests will be made in multiples of this page
-	//  size. If set to 0, rpmalloc will use system calls to determine the page size. The page
-	//  size MUST be a power of two in [512,16384] range (2^9 to 2^14).
+	//  If release is set to 1, the unmap is for an entire span range as returned by
+	//  a previous call to memory_map, and the entire range should be released.
+	//  If release is set to 0, the unmap is a partial decommit of a subset of the mapped
+	//  memory range.
+	void (*memory_unmap)(void* address, size_t size, size_t offset, int release);
+	//! Size of memory pages. The page size MUST be a power of two in the [512,16384]
+	//  range (2^9 to 2^14), or set to 0 to use the system page size. All memory mapping
+	//  requests to memory_map will be made with size set to a multiple of the page size.
 	size_t page_size;
+	//! Size of a span of memory pages. MUST be a multiple of the page size and in the
+	//  [4096,262144] range, or set to 0 to use the default span size.
+	size_t span_size;
+	//! Number of spans to map at each request to map new virtual memory blocks. This can
+	//  be used to minimize the system call overhead at the cost of virtual memory address
+	//  space. The extra mapped pages will not be written until actually used, so physical
+	//  committed memory should not be affected in the default implementation.
+	size_t span_map_count;
 	//! Debug callback if memory guards are enabled. Called if a memory overwrite is detected
 	void (*memory_overwrite)(void* address);
 } rpmalloc_config_t;
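The comments above define the mapper contract: memory_map should return memory aligned to the configured span size (or record in *offset how far the returned pointer was advanced to reach alignment), and memory_unmap receives that offset back together with a release flag that distinguishes releasing a whole mapping from a partial decommit. Below is a minimal POSIX sketch of a conforming pair; my_memory_map and my_memory_unmap are hypothetical names, the effective span size is queried through rpmalloc_config(), and this is only an illustration of the contract, not the library's built-in implementation:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include "rpmalloc.h"

//Illustrative custom mapper: span-size aligned results, alignment offset via *offset
static void*
my_memory_map(size_t size, size_t* offset) {
	size_t span_size = rpmalloc_config()->span_size;
	size_t padding = span_size; //over-map so the result can be aligned to a span boundary
	uint8_t* ptr = mmap(0, size + padding, PROT_READ | PROT_WRITE,
	                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (ptr == MAP_FAILED)
		return 0;
	size_t shift = padding - ((uintptr_t)ptr & (span_size - 1)); //bytes to the next boundary
	size_t tail = padding - shift;
	if (tail)
		munmap(ptr + shift + size, tail); //trim the unused tail immediately
	*offset = shift >> 3;                 //head stays mapped; report it in 8-byte units
	assert(*offset < 65536);
	return ptr + shift;
}

static void
my_memory_unmap(void* address, size_t size, size_t offset, int release) {
	if (release && offset) {
		//Recover the start of the original mapping, including the retained head
		size += (offset << 3);
		address = (void*)((uintptr_t)address - (offset << 3));
	}
	munmap(address, size); //full release or partial decommit of a subrange
}

int
main(void) {
	rpmalloc_config_t config;
	memset(&config, 0, sizeof(config));
	config.memory_map = my_memory_map;
	config.memory_unmap = my_memory_unmap;
	rpmalloc_initialize_config(&config);
	void* block = rpmalloc(100);
	rpfree(block);
	rpmalloc_finalize();
	return 0;
}
```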
@@ -89,6 +98,9 @@
 extern int
 rpmalloc_initialize_config(const rpmalloc_config_t* config);
 
+extern const rpmalloc_config_t*
+rpmalloc_config(void);
+
 extern void
 rpmalloc_finalize(void);
 
diff --git a/test/main.c b/test/main.c
index c88248e..28532dc 100644
--- a/test/main.c
+++ b/test/main.c
@@ -28,6 +28,11 @@
 #define pointer_offset(ptr, ofs) (void*)((char*)(ptr) + (ptrdiff_t)(ofs))
 //#define pointer_diff(first, second) (ptrdiff_t)((const char*)(first) - (const char*)(second))
 
+static size_t _hardware_threads;
+
+static void
+test_initialize(void);
+
 static int
 test_alloc(void) {
 	unsigned int iloop = 0;
@@ -43,13 +48,17 @@
 	for (id = 0; id < 20000; ++id)
 		data[id] = (char)(id % 139 + id % 17);
 
+	void* testptr = rpmalloc(253000);
+	testptr = rprealloc(testptr, 154);
+	rpfree(testptr);
+
 	for (iloop = 0; iloop < 64; ++iloop) {
 		for (ipass = 0; ipass < 8142; ++ipass) {
 			addr[ipass] = rpmalloc(500);
 			if (addr[ipass] == 0)
 				return -1;
 
-			memcpy(addr[ipass], data, 500);
+			memcpy(addr[ipass], data + ipass, 500);
 
 			for (icheck = 0; icheck < ipass; ++icheck) {
 				if (addr[icheck] == addr[ipass])
@@ -66,7 +75,7 @@
 		}
 
 		for (ipass = 0; ipass < 8142; ++ipass) {
-			if (memcmp(addr[ipass], data, 500))
+			if (memcmp(addr[ipass], data + ipass, 500))
 				return -1;
 		}
 
@@ -82,7 +91,7 @@
 			if (addr[ipass] == 0)
 				return -1;
 
-			memcpy(addr[ipass], data, cursize);
+			memcpy(addr[ipass], data + ipass, cursize);
 
 			for (icheck = 0; icheck < ipass; ++icheck) {
 				if (addr[icheck] == addr[ipass])
@@ -100,7 +109,7 @@
 
 		for (ipass = 0; ipass < 1024; ++ipass) {
 			unsigned int cursize = datasize[ipass%7] + ipass;
-			if (memcmp(addr[ipass], data, cursize))
+			if (memcmp(addr[ipass], data + ipass, cursize))
 				return -1;
 		}
 
@@ -114,7 +123,7 @@
 			if (addr[ipass] == 0)
 				return -1;
 
-			memcpy(addr[ipass], data, 500);
+			memcpy(addr[ipass], data + ipass, 500);
 
 			for (icheck = 0; icheck < ipass; ++icheck) {
 				if (addr[icheck] == addr[ipass])
@@ -131,7 +140,7 @@
 		}
 
 		for (ipass = 0; ipass < 1024; ++ipass) {
-			if (memcmp(addr[ipass], data, 500))
+			if (memcmp(addr[ipass], data + ipass, 500))
 				return -1;
 		}
 
@@ -141,6 +150,42 @@
 
 	rpmalloc_finalize();
 
+	for (iloop = 0; iloop < 2048; iloop += 16) {
+		rpmalloc_initialize();
+		addr[0] = rpmalloc(iloop);
+		if (!addr[0])
+			return -1;
+		rpfree(addr[0]);
+		rpmalloc_finalize();
+	}
+
+	for (iloop = 2048; iloop < (64 * 1024); iloop += 512) {
+		rpmalloc_initialize();
+		addr[0] = rpmalloc(iloop);
+		if (!addr[0])
+			return -1;
+		rpfree(addr[0]);
+		rpmalloc_finalize();
+	}
+
+	for (iloop = (64 * 1024); iloop < (2 * 1024 * 1024); iloop += 4096) {
+		rpmalloc_initialize();
+		addr[0] = rpmalloc(iloop);
+		if (!addr[0])
+			return -1;
+		rpfree(addr[0]);
+		rpmalloc_finalize();
+	}
+
+	rpmalloc_initialize();
+	for (iloop = 0; iloop < (2 * 1024 * 1024); iloop += 16) {
+		addr[0] = rpmalloc(iloop);
+		if (!addr[0])
+			return -1;
+		rpfree(addr[0]);
+	}
+	rpmalloc_finalize();
+
 	printf("Memory allocation tests passed\n");
 
 	return 0;
@@ -161,22 +206,24 @@
 	unsigned int ipass = 0;
 	unsigned int icheck = 0;
 	unsigned int id = 0;
-	void* addr[4096];
-	char data[8192];
+	void** addr;
+	uint32_t* data;
 	unsigned int cursize;
 	unsigned int iwait = 0;
 	int ret = 0;
 
 	rpmalloc_thread_initialize();
 
-	for (id = 0; id < 8192; ++id)
-		data[id] = (char)id;
+	addr = rpmalloc(sizeof(void*) * arg.passes);
+	data = rpmalloc(512 * 1024);
+	for (id = 0; id < 512 * 1024 / 4; ++id)
+		data[id] = id;
 
 	thread_sleep(1);
 
 	for (iloop = 0; iloop < arg.loops; ++iloop) {
 		for (ipass = 0; ipass < arg.passes; ++ipass) {
-			cursize = 4 + arg.datasize[(iloop + ipass + iwait) % arg.num_datasize] + (iloop % 1024);
+			cursize = 4 + arg.datasize[(iloop + ipass + iwait) % arg.num_datasize] + ((iloop + ipass) % 1024);
 
 			addr[ipass] = rpmalloc(4 + cursize);
 			if (addr[ipass] == 0) {
@@ -218,11 +265,14 @@
 				ret = -1;
 				goto end;
 			}
-			
+
 			rpfree(addr[ipass]);
 		}
 	}
 
+	rpfree(data);
+	rpfree(addr);
+
 	rpmalloc_thread_finalize();
 
 end:
@@ -240,11 +290,11 @@
 
 	rpmalloc_thread_initialize();
 
-	thread_sleep(1);
+	thread_sleep(10);
 
 	for (iloop = 0; iloop < arg.loops; ++iloop) {
 		for (ipass = 0; ipass < arg.passes; ++ipass) {
-			cursize = arg.datasize[(iloop + ipass + iwait) % arg.num_datasize ] + (iloop % 1024);
+			cursize = arg.datasize[(iloop + ipass + iwait) % arg.num_datasize ] + ((iloop + ipass) % 1024);
 
 			void* addr = rpmalloc(cursize);
 			if (addr == 0) {
@@ -256,9 +306,9 @@
 		}
 	}
 
+end:
 	rpmalloc_thread_finalize();
 
-end:
 	thread_exit((uintptr_t)ret);
 }
 
@@ -284,7 +334,7 @@
 		rpmalloc_thread_initialize();
 
 		for (ipass = 0; ipass < arg.passes; ++ipass) {
-			cursize = 4 + arg.datasize[(iloop + ipass + iwait) % arg.num_datasize] + (iloop % 1024);
+			cursize = 4 + arg.datasize[(iloop + ipass + iwait) % arg.num_datasize] + ((iloop + ipass) % 1024);
 
 			addr[ipass] = rpmalloc(4 + cursize);
 			if (addr[ipass] == 0) {
@@ -348,18 +398,31 @@
 
 	rpmalloc_initialize();
 
-	num_alloc_threads = 3;
+	num_alloc_threads = _hardware_threads;
+	if (num_alloc_threads < 2)
+		num_alloc_threads = 2;
+	if (num_alloc_threads > 32)
+		num_alloc_threads = 32;
 
 	arg.datasize[0] = 19;
 	arg.datasize[1] = 249;
 	arg.datasize[2] = 797;
-	arg.datasize[3] = 3;
-	arg.datasize[4] = 79;
-	arg.datasize[5] = 34;
+	arg.datasize[3] = 3058;
+	arg.datasize[4] = 47892;
+	arg.datasize[5] = 173902;
 	arg.datasize[6] = 389;
-	arg.num_datasize = 7;
-	arg.loops = 4096;
-	arg.passes = 1024;
+	arg.datasize[7] = 19;
+	arg.datasize[8] = 2493;
+	arg.datasize[9] = 7979;
+	arg.datasize[10] = 3;
+	arg.datasize[11] = 79374;
+	arg.datasize[12] = 3432;
+	arg.datasize[13] = 548;
+	arg.datasize[14] = 38934;
+	arg.datasize[15] = 234;
+	arg.num_datasize = 16;
+	arg.loops = 100;
+	arg.passes = 4000;
 
 	thread_arg targ = { allocator_thread, &arg };
 	for (i = 0; i < num_alloc_threads; ++i)
@@ -384,39 +447,63 @@
 
 static int 
 test_crossthread(void) {
-	uintptr_t thread;
-	allocator_thread_arg_t arg;
+	uintptr_t thread[8];
+	allocator_thread_arg_t arg[8];
+	thread_arg targ[8];
 
 	rpmalloc_initialize();
 
-	arg.loops = 100;
-	arg.passes = 1024;
-	arg.pointers = rpmalloc(sizeof(void*) * arg.loops * arg.passes);
-	arg.datasize[0] = 19;
-	arg.datasize[1] = 249;
-	arg.datasize[2] = 797;
-	arg.datasize[3] = 3;
-	arg.datasize[4] = 79;
-	arg.datasize[5] = 34;
-	arg.datasize[6] = 389;
-	arg.num_datasize = 7;
+	size_t num_alloc_threads = _hardware_threads;
+	if (num_alloc_threads < 2)
+		num_alloc_threads = 2;
+	if (num_alloc_threads > 4)
+		num_alloc_threads = 4;
 
-	thread_arg targ = { crossallocator_thread, &arg };
-	thread = thread_run(&targ);
+	for (unsigned int ithread = 0; ithread < num_alloc_threads; ++ithread) {
+		unsigned int iadd = ithread * (16 + ithread);
+		arg[ithread].loops = 50;
+		arg[ithread].passes = 1024;
+		arg[ithread].pointers = rpmalloc(sizeof(void*) * arg[ithread].loops * arg[ithread].passes);
+		arg[ithread].datasize[0] = 19 + iadd;
+		arg[ithread].datasize[1] = 249 + iadd;
+		arg[ithread].datasize[2] = 797 + iadd;
+		arg[ithread].datasize[3] = 3 + iadd;
+		arg[ithread].datasize[4] = 7923 + iadd;
+		arg[ithread].datasize[5] = 344 + iadd;
+		arg[ithread].datasize[6] = 3892 + iadd;
+		arg[ithread].datasize[7] = 19 + iadd;
+		arg[ithread].datasize[8] = 14954 + iadd;
+		arg[ithread].datasize[9] = 39723 + iadd;
+		arg[ithread].datasize[10] = 15 + iadd;
+		arg[ithread].datasize[11] = 493 + iadd;
+		arg[ithread].datasize[12] = 34 + iadd;
+		arg[ithread].datasize[13] = 894 + iadd;
+		arg[ithread].datasize[14] = 6893 + iadd;
+		arg[ithread].datasize[15] = 2893 + iadd;
+		arg[ithread].num_datasize = 16;
 
-	thread_sleep(1000);
+		targ[ithread].fn = crossallocator_thread;
+		targ[ithread].arg = &arg[ithread];
+	}
 
-	if (thread_join(thread) != 0)
-		return -1;
+	for (int iloop = 0; iloop < 32; ++iloop) {
+		for (unsigned int ithread = 0; ithread < num_alloc_threads; ++ithread)
+			thread[ithread] = thread_run(&targ[ithread]);
 
-	//Off-thread deallocation
-	for (size_t iptr = 0; iptr < arg.loops * arg.passes; ++iptr)
-		rpfree(arg.pointers[iptr]);
+		thread_sleep(100);
 
-	rpfree(arg.pointers);
+		for (unsigned int ithread = 0; ithread < num_alloc_threads; ++ithread) {
+			if (thread_join(thread[ithread]) != 0)
+				return -1;
 
-	//Simulate thread exit
-	rpmalloc_thread_finalize();
+			//Off-thread deallocation
+			for (size_t iptr = 0; iptr < arg[ithread].loops * arg[ithread].passes; ++iptr)
+				rpfree(arg[ithread].pointers[iptr]);
+		}
+	}
+
+	for (unsigned int ithread = 0; ithread < num_alloc_threads; ++ithread)
+		rpfree(arg[ithread].pointers);
 
 	rpmalloc_finalize();
 
@@ -436,7 +523,11 @@
 	rpmalloc_initialize();
 
 	num_passes = 100;
-	num_alloc_threads = 5;
+	num_alloc_threads = _hardware_threads;
+	if (num_alloc_threads < 2)
+		num_alloc_threads = 2;
+	if (num_alloc_threads > 64)
+		num_alloc_threads = 64;
 
 	arg.loops = 500;
 	arg.passes = 10;
@@ -548,24 +639,27 @@
 test_run(int argc, char** argv) {
 	(void)sizeof(argc);
 	(void)sizeof(argv);
+	test_initialize();
 	if (test_alloc())
 		return -1;
-	if (test_threaded())
-		return -1;
 	if (test_crossthread())
 		return -1;
 	if (test_threadspam())
 		return -1;
 	if (test_overwrite())
 		return -1;
+	if (test_threaded())
+		return -1;
 	return 0;
 }
 
-#if ( defined( __APPLE__ ) && __APPLE__ )
+#if (defined(__APPLE__) && __APPLE__)
 #  include <TargetConditionals.h>
-#  if defined( __IPHONE__ ) || ( defined( TARGET_OS_IPHONE ) && TARGET_OS_IPHONE ) || ( defined( TARGET_IPHONE_SIMULATOR ) && TARGET_IPHONE_SIMULATOR )
+#  if defined(__IPHONE__) || (defined(TARGET_OS_IPHONE) && TARGET_OS_IPHONE) || (defined(TARGET_IPHONE_SIMULATOR) && TARGET_IPHONE_SIMULATOR)
 #    define NO_MAIN 1
 #  endif
+#elif (defined(__linux__) || defined(__linux))
+#  include <sched.h>
 #endif
 
 #if !defined(NO_MAIN)
@@ -576,3 +670,37 @@
 }
 
 #endif
+
+#ifdef _WIN32
+#include <Windows.h>
+
+static void
+test_initialize(void) {
+	SYSTEM_INFO system_info;
+	GetSystemInfo(&system_info);
+	_hardware_threads = (size_t)system_info.dwNumberOfProcessors;
+}
+
+#elif (defined(__linux__) || defined(__linux))
+
+static void
+test_initialize(void) {
+	cpu_set_t prevmask, testmask;
+	CPU_ZERO(&prevmask);
+	CPU_ZERO(&testmask);
+	sched_getaffinity(0, sizeof(prevmask), &prevmask);     //Get current mask
+	sched_setaffinity(0, sizeof(testmask), &testmask);     //Set zero mask
+	sched_getaffinity(0, sizeof(testmask), &testmask);     //Get mask for all CPUs
+	sched_setaffinity(0, sizeof(prevmask), &prevmask);     //Reset current mask
+	int num = CPU_COUNT(&testmask);
+	_hardware_threads = (size_t)(num > 1 ? num : 1);
+}
+
+#else
+
+static void
+test_initialize(void) {
+	_hardware_threads = 1;
+}
+
+#endif