Merge to master for 1.4.2

diff --git a/BENCHMARKS.md b/BENCHMARKS.md
index 14ba507..c389560 100644
--- a/BENCHMARKS.md
+++ b/BENCHMARKS.md
@@ -1,5 +1,5 @@
 # Benchmarks
-Contained in a parallell repository is a benchmark utility that performs interleaved allocations (both aligned to 8 or 16 bytes, and unaligned) and deallocations (both in-thread and cross-thread) in multiple threads. It measures number of memory operations performed per CPU second, as well as memory overhead by comparing the virtual memory mapped with the number of bytes requested in allocation calls. The setup of number of thread, cross-thread deallocation rate and allocation size limits is configured by command line arguments.
+Contained in a parallel repository is a benchmark utility that performs interleaved allocations (both aligned to 8 or 16 bytes, and unaligned) and deallocations (both in-thread and cross-thread) in multiple threads. It measures the number of memory operations performed per CPU second, as well as memory overhead by comparing the virtual memory mapped with the number of bytes requested in allocation calls. The number of threads, cross-thread deallocation rate and allocation size limits are configured by command line arguments.
 
 https://github.com/mjansson/rpmalloc-benchmark
 
diff --git a/CHANGELOG b/CHANGELOG
index 2820daf..cfc9e73 100644
--- a/CHANGELOG
+++ b/CHANGELOG
@@ -1,3 +1,22 @@
+1.4.2
+
+Fixed an issue where calling _exit might hang the main thread cleanup in rpmalloc if another
+worker thread was terminated while holding exclusive access to the global cache.
+
+Improved caches to prioritize master spans in a super span to avoid leaving master spans mapped due to
+remaining subspans in caches.
+
+Improved cache reuse by allowing large blocks to use caches from slightly larger cache classes.
+
+Fixed an issue where thread heap statistics would go out of sync when a free span was deferred
+to another thread heap.
+
+API breaking change - added a flag to rpmalloc_thread_finalize to avoid releasing thread caches.
+Pass a nonzero value to retain the old behaviour of releasing thread caches to the global cache.
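
A minimal sketch of the new call at worker thread exit; the include path and the wrapper function are illustrative assumptions, not part of the library:

```c
#include "rpmalloc/rpmalloc.h"

// Hypothetical helper called when a worker thread exits.
static void
worker_thread_exit(void) {
	// Nonzero releases the thread caches to the global cache (the pre-1.4.2
	// behaviour); zero skips the release and retains the caches.
	rpmalloc_thread_finalize(1);
}
```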
+
+Added an option to the configuration to set a custom error callback for assert failures (if ENABLE_ASSERT is enabled).
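
A sketch of wiring up the callback, assuming the config field is named error_callback as used by the rpmalloc_assert macro later in this patch, and that the library is built with ENABLE_ASSERT:

```c
#include <stdio.h>
#include <string.h>
#include "rpmalloc/rpmalloc.h"

// Receives the formatted assert message (condition, file and line).
static void
assert_handler(const char* message) {
	fprintf(stderr, "rpmalloc assert: %s\n", message);
}

int
main(void) {
	rpmalloc_config_t config;
	memset(&config, 0, sizeof(config));
	config.error_callback = assert_handler;
	rpmalloc_initialize_config(&config);
	// ... allocate and free as usual ...
	rpmalloc_finalize();
	return 0;
}
```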
+
+
 1.4.1
 
 Dual license as both released to public domain or under MIT license
diff --git a/README.md b/README.md
index 626f63a..d31e0e8 100644
--- a/README.md
+++ b/README.md
@@ -33,7 +33,7 @@
 
 # Required functions
 
-Before calling any other function in the API, you __MUST__ call the initization function, either __rpmalloc_initialize__ or __pmalloc_initialize_config__, or you will get undefined behaviour when calling other rpmalloc entry point.
+Before calling any other function in the API, you __MUST__ call the initialization function, either __rpmalloc_initialize__ or __rpmalloc_initialize_config__, or you will get undefined behaviour when calling other rpmalloc entry points.
 
 Before terminating your use of the allocator, you __SHOULD__ call __rpmalloc_finalize__ in order to release caches and unmap virtual memory, as well as prepare the allocator for global scope cleanup at process exit or dynamic library unload depending on your use case.
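
As a rough sketch of the required call order under these rules (the include path is an assumption; error handling omitted):

```c
#include "rpmalloc/rpmalloc.h"

int
main(void) {
	// MUST be called before any other rpmalloc entry point
	rpmalloc_initialize();

	void* block = rpmalloc(128);
	rpfree(block);

	// Worker threads would call rpmalloc_thread_initialize() on entry and
	// rpmalloc_thread_finalize(1) on exit; the main thread only needs this:
	rpmalloc_finalize();
	return 0;
}
```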
 
@@ -104,7 +104,7 @@
 
 Memory blocks are divided into three categories. For 64KiB span size/alignment the small blocks are [16, 1024] bytes, medium blocks (1024, 32256] bytes, and large blocks (32256, 2097120] bytes. The three categories are further divided into size classes. If the span size is changed, the small block classes remain but the medium blocks cover the (1024, span size] byte range instead.
 
-Small blocks have a size class granularity of 16 bytes each in 64 buckets. Medium blocks have a granularity of 512 bytes, 61 buckets (default). Large blocks have a the same granularity as the configured span size (default 64KiB). All allocations are fitted to these size class boundaries (an allocation of 36 bytes will allocate a block of 48 bytes). Each small and medium size class has an associated span (meaning a contiguous set of memory pages) configuration describing how many pages the size class will allocate each time the cache is empty and a new allocation is requested.
+Small blocks have a size class granularity of 16 bytes each in 64 buckets. Medium blocks have a granularity of 512 bytes, 61 buckets (default). Large blocks have the same granularity as the configured span size (default 64KiB). All allocations are fitted to these size class boundaries (an allocation of 36 bytes will allocate a block of 48 bytes). Each small and medium size class has an associated span (meaning a contiguous set of memory pages) configuration describing how many pages the size class will allocate each time the cache is empty and a new allocation is requested.
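
A small sketch of the rounding described above (the 16-byte small-class granularity), reproducing the 36 -> 48 byte example; the helper is illustrative and not part of the library API:

```c
#include <stdio.h>
#include <stdint.h>

// Round a small allocation request up to its 16-byte size class boundary.
static uint32_t
small_block_size(uint32_t request) {
	const uint32_t granularity = 16;
	uint32_t class_idx = (request + (granularity - 1)) / granularity;
	return class_idx * granularity;
}

int
main(void) {
	printf("%u\n", small_block_size(36)); // prints 48, as in the example above
	return 0;
}
```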
 
 Spans for small and medium blocks are cached in four levels to avoid calls to map/unmap memory pages. The first level is a per thread single active span for each size class. The second level is a per thread list of partially free spans for each size class. The third level is a per thread list of free spans. The fourth level is a global list of free spans.
 
@@ -113,7 +113,7 @@
 Large blocks, or super spans, are cached in two levels. The first level is a per thread list of free super spans. The second level is a global list of free super spans.
 
 # Memory mapping
-By default the allocator uses OS APIs to map virtual memory pages as needed, either `VirtualAlloc` on Windows or `mmap` on POSIX systems. If you want to use your own custom memory mapping provider you can use __rpmalloc_initialize_config__ and pass function pointers to map and unmap virtual memory. These function should reserve and free the requested number of bytes. 
+By default the allocator uses OS APIs to map virtual memory pages as needed, either `VirtualAlloc` on Windows or `mmap` on POSIX systems. If you want to use your own custom memory mapping provider you can use __rpmalloc_initialize_config__ and pass function pointers to map and unmap virtual memory. These functions should reserve and free the requested number of bytes.
 
 The returned memory address from the memory map function MUST be aligned to the memory page size and the memory span size (whichever is larger), both of which are configurable. Either provide the page and span sizes during initialization using __rpmalloc_initialize_config__, or use __rpmalloc_config__ to find the required alignment which is equal to the maximum of page and span size. The span size MUST be a power of two in the [4096, 262144] range, and be a multiple or divisor of the memory page size.
 
@@ -128,7 +128,7 @@
 
 A span that is a subspan of a larger super span can be individually decommitted to reduce physical memory pressure when the span is evicted from caches and scheduled to be unmapped. The entire original super span will keep track of the subspans it is broken up into, and when the entire range is decommitted the super span will be unmapped. This allows platforms like Windows, which require the entire virtual memory range that was mapped in a call to VirtualAlloc to be unmapped in one call to VirtualFree, while still decommitting individual pages in subspans (if the page size is smaller than the span size).
 
-If you use a custom memory map/unmap function you need to take this into account by looking at the `release` parameter given to the `memory_unmap` function. It is set to 0 for decommitting invididual pages and the total super span byte size for finally releasing the entire super span memory range.
+If you use a custom memory map/unmap function you need to take this into account by looking at the `release` parameter given to the `memory_unmap` function. It is set to 0 when decommitting individual pages, and to the total super span byte size when finally releasing the entire super span memory range.
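
A minimal POSIX sketch of a custom backend honouring the `release` convention above; the `memory_map`/`memory_unmap` field names and signatures are assumed to match the default implementations in rpmalloc.c, and note that plain `mmap` only satisfies the alignment requirement when the span size does not exceed the page size:

```c
#include <string.h>
#include <sys/mman.h>
#include "rpmalloc/rpmalloc.h"

static void*
my_memory_map(size_t size, size_t* offset) {
	(void)offset; // no alignment padding performed in this sketch
	void* ptr = mmap(0, size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	return (ptr == MAP_FAILED) ? 0 : ptr;
}

static void
my_memory_unmap(void* address, size_t size, size_t offset, size_t release) {
	(void)offset;
	if (release)
		munmap(address, release);               // full release of the super span range
	else
		madvise(address, size, MADV_DONTNEED);  // partial unmap: just decommit the pages
}

int
main(void) {
	rpmalloc_config_t config;
	memset(&config, 0, sizeof(config));
	config.memory_map = my_memory_map;
	config.memory_unmap = my_memory_unmap;
	rpmalloc_initialize_config(&config);
	// ... allocate and free as usual ...
	rpmalloc_finalize();
	return 0;
}
```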
 
 # Memory fragmentation
 There is no memory fragmentation by the allocator in the sense that it will not leave unallocated and unusable "holes" in the memory pages by calls to allocate and free blocks of different sizes. This is due to the fact that the memory pages allocated for each size class are split up into perfectly aligned blocks which are not reused for a request of a different size. The block freed by a call to `rpfree` will always be immediately available for an allocation request within the same size class.
diff --git a/build/ninja/clang.py b/build/ninja/clang.py
index bd4f821..b30549b 100644
--- a/build/ninja/clang.py
+++ b/build/ninja/clang.py
@@ -38,7 +38,7 @@
     self.cxxcmd = '$toolchain$cxx -MMD -MT $out -MF $out.d $includepaths $moreincludepaths $cxxflags $carchflags $cconfigflags $cmoreflags $cxxenvflags -c $in -o $out'
     self.ccdeps = 'gcc'
     self.ccdepfile = '$out.d'
-    self.arcmd = self.rmcmd('$out') + ' && $toolchain$ar crsD $ararchflags $arflags $arenvflags $out $in'
+    self.arcmd = self.rmcmd('$out') + ' && $toolchain$ar crs $ararchflags $arflags $arenvflags $out $in'
     if self.target.is_windows():
       self.linkcmd = '$toolchain$link $libpaths $configlibpaths $linkflags $linkarchflags $linkconfigflags $linkenvflags /debug /nologo /subsystem:console /dynamicbase /nxcompat /manifest /manifestuac:\"level=\'asInvoker\' uiAccess=\'false\'\" /tlbid:1 /pdb:$pdbpath /out:$out $in $libs $archlibs $oslibs $frameworks'
       self.dllcmd = self.linkcmd + ' /dll'
@@ -52,7 +52,7 @@
                    '-fno-trapping-math', '-ffast-math']
     self.cwarnflags = ['-W', '-Werror', '-pedantic', '-Wall', '-Weverything',
                        '-Wno-c++98-compat', '-Wno-padded', '-Wno-documentation-unknown-command',
-                       '-Wno-implicit-fallthrough', '-Wno-static-in-inline', '-Wno-reserved-id-macro']
+                       '-Wno-implicit-fallthrough', '-Wno-static-in-inline', '-Wno-reserved-id-macro', '-Wno-disabled-macro-expansion']
     self.cmoreflags = []
     self.mflags = []
     self.arflags = []
@@ -76,8 +76,14 @@
       self.oslibs += ['m']
     if self.target.is_linux() or self.target.is_raspberrypi():
       self.oslibs += ['dl']
+    if self.target.is_raspberrypi():
+      self.linkflags += ['-latomic']
     if self.target.is_bsd():
       self.oslibs += ['execinfo']
+    if self.target.is_haiku():
+      self.cflags += ['-D_GNU_SOURCE=1']
+      self.linkflags += ['-lpthread']
+      self.oslibs += ['m']
     if not self.target.is_windows():
       self.linkflags += ['-fomit-frame-pointer']
 
@@ -391,7 +397,7 @@
       if targettype == 'sharedlib':
         flags += ['-shared', '-fPIC']
     if config != 'debug':
-      if targettype == 'bin' or targettype == 'sharedlib':
+      if (targettype == 'bin' or targettype == 'sharedlib') and self.use_lto():
         flags += ['-flto']
     return flags
 
diff --git a/build/ninja/gcc.py b/build/ninja/gcc.py
index 299be53..f136491 100644
--- a/build/ninja/gcc.py
+++ b/build/ninja/gcc.py
@@ -24,7 +24,7 @@
     self.cxxcmd = '$toolchain$cxx -MMD -MT $out -MF $out.d $includepaths $moreincludepaths $cxxflags $carchflags $cconfigflags $cmoreflags $cxxenvflags -c $in -o $out'
     self.ccdeps = 'gcc'
     self.ccdepfile = '$out.d'
-    self.arcmd = self.rmcmd('$out') + ' && $toolchain$ar crsD $ararchflags $arflags $arenvflags $out $in'
+    self.arcmd = self.rmcmd('$out') + ' && $toolchain$ar crs $ararchflags $arflags $arenvflags $out $in'
     self.linkcmd = '$toolchain$link $libpaths $configlibpaths $linkflags $linkarchflags $linkconfigflags $linkenvflags -o $out $in $libs $archlibs $oslibs'
 
     #Base flags
@@ -54,8 +54,13 @@
       self.linkflags += ['-pthread']
     if self.target.is_linux() or self.target.is_raspberrypi():
       self.oslibs += ['dl']
+    if self.target.is_raspberrypi():
+      self.linkflags += ['-latomic']
     if self.target.is_bsd():
       self.oslibs += ['execinfo']
+    if self.target.is_haiku():
+      self.cflags += ['-D_GNU_SOURCE=1']
+      self.linkflags += ['-lpthread']
 
     self.includepaths = self.prefix_includepaths((includepaths or []) + ['.'])
 
diff --git a/build/ninja/generator.py b/build/ninja/generator.py
index eb89058..5de7ec1 100644
--- a/build/ninja/generator.py
+++ b/build/ninja/generator.py
@@ -49,6 +49,9 @@
     parser.add_argument('--updatebuild', action='store_true',
                         help = 'Update submodule build scripts',
                         default = '')
+    parser.add_argument('--lto', action='store_true',
+                        help = 'Build with Link Time Optimization',
+                        default = False)
     options = parser.parse_args()
 
     self.project = project
@@ -91,6 +94,8 @@
       variables['monolithic'] = True
     if options.coverage:
       variables['coverage'] = True
+    if options.lto:
+      variables['lto'] = True
     if self.subninja != '':
       variables['internal_deps'] = True
 
diff --git a/build/ninja/platform.py b/build/ninja/platform.py
index cf91c14..5867ed6 100644
--- a/build/ninja/platform.py
+++ b/build/ninja/platform.py
@@ -5,7 +5,7 @@
 import sys
 
 def supported_platforms():
-  return [ 'windows', 'linux', 'macos', 'bsd', 'ios', 'android', 'raspberrypi', 'tizen', 'sunos' ]
+  return [ 'windows', 'linux', 'macos', 'bsd', 'ios', 'android', 'raspberrypi', 'tizen', 'sunos', 'haiku' ]
 
 class Platform(object):
   def __init__(self, platform):
@@ -20,7 +20,7 @@
       self.platform = 'macos'
     elif self.platform.startswith('win'):
       self.platform = 'windows'
-    elif 'bsd' in self.platform:
+    elif 'bsd' in self.platform or self.platform.startswith('dragonfly'):
       self.platform = 'bsd'
     elif self.platform.startswith('ios'):
       self.platform = 'ios'
@@ -32,6 +32,8 @@
       self.platform = 'tizen'
     elif self.platform.startswith('sunos'):
       self.platform = 'sunos'
+    elif self.platform.startswith('haiku'):
+      self.platform = 'haiku'
 
   def platform(self):
     return self.platform
@@ -63,5 +65,8 @@
   def is_sunos(self):
     return self.platform == 'sunos'
 
+  def is_haiku(self):
+    return self.platform == 'haiku'
+
   def get(self):
     return self.platform
diff --git a/build/ninja/toolchain.py b/build/ninja/toolchain.py
index d10d840..30fda08 100644
--- a/build/ninja/toolchain.py
+++ b/build/ninja/toolchain.py
@@ -54,6 +54,7 @@
     #Set default values
     self.build_monolithic = False
     self.build_coverage = False
+    self.build_lto = False
     self.support_lua = False
     self.internal_deps = False
     self.python = 'python'
@@ -132,7 +133,7 @@
   def initialize_default_archs(self):
     if self.target.is_windows():
       self.archs = ['x86-64']
-    elif self.target.is_linux() or self.target.is_bsd() or self.target.is_sunos():
+    elif self.target.is_linux() or self.target.is_bsd() or self.target.is_sunos() or self.target.is_haiku():
       localarch = subprocess.check_output(['uname', '-m']).decode().strip()
       if localarch == 'x86_64' or localarch == 'amd64':
         self.archs = ['x86-64']
@@ -208,6 +209,8 @@
         self.build_monolithic = get_boolean_flag(val)
       elif key == 'coverage':
         self.build_coverage = get_boolean_flag(val)
+      elif key == 'lto':
+        self.build_lto = get_boolean_flag(val)
       elif key == 'support_lua':
         self.support_lua = get_boolean_flag(val)
       elif key == 'internal_deps':
@@ -234,6 +237,8 @@
       self.build_monolithic = get_boolean_flag(prefs['monolithic'])
     if 'coverage' in prefs:
       self.build_coverage = get_boolean_flag( prefs['coverage'] )
+    if 'lto' in prefs:
+      self.build_lto = get_boolean_flag( prefs['lto'] )
     if 'support_lua' in prefs:
       self.support_lua = get_boolean_flag(prefs['support_lua'])
     if 'python' in prefs:
@@ -258,6 +263,9 @@
   def use_coverage(self):
     return self.build_coverage
 
+  def use_lto(self):
+    return self.build_lto
+
   def write_variables(self, writer):
     writer.variable('buildpath', self.buildpath)
     writer.variable('target', self.target.platform)
diff --git a/rpmalloc/malloc.c b/rpmalloc/malloc.c
index 56becb8..63d5b88 100644
--- a/rpmalloc/malloc.c
+++ b/rpmalloc/malloc.c
@@ -292,12 +292,37 @@
 	else if (reason == DLL_THREAD_ATTACH)
 		rpmalloc_thread_initialize();
 	else if (reason == DLL_THREAD_DETACH)
-		rpmalloc_thread_finalize();
+		rpmalloc_thread_finalize(1);
 	return TRUE;
 }
 
+//end BUILD_DYNAMIC_LINK
+#else
+
+extern void
+_global_rpmalloc_init(void) {
+	rpmalloc_set_main_thread();
+	rpmalloc_initialize();
+}
+
+#if defined(__clang__) || defined(__GNUC__)
+
+static void __attribute__((constructor))
+initializer(void) {
+	_global_rpmalloc_init();
+}
+
+#elif defined(_MSC_VER)
+
+#pragma section(".CRT$XIB",read)
+__declspec(allocate(".CRT$XIB")) void (*_rpmalloc_module_init)(void) = _global_rpmalloc_init;
+#pragma comment(linker, "/include:_rpmalloc_module_init")
+
 #endif
 
+//end !BUILD_DYNAMIC_LINK
+#endif 
+
 #else
 
 #include <pthread.h>
@@ -305,6 +330,9 @@
 #include <stdint.h>
 #include <unistd.h>
 
+extern void
+rpmalloc_set_main_thread(void);
+
 static pthread_key_t destructor_key;
 
 static void
@@ -312,6 +340,7 @@
 
 static void __attribute__((constructor))
 initializer(void) {
+	rpmalloc_set_main_thread();
 	rpmalloc_initialize();
 	pthread_key_create(&destructor_key, thread_destructor);
 }
@@ -340,7 +369,7 @@
 static void
 thread_destructor(void* value) {
 	(void)sizeof(value);
-	rpmalloc_thread_finalize();
+	rpmalloc_thread_finalize(1);
 }
 
 #ifdef __APPLE__
@@ -368,7 +397,8 @@
                const pthread_attr_t* attr,
                void* (*start_routine)(void*),
                void* arg) {
-#if defined(__linux__) || defined(__FreeBSD__) || defined(__OpenBSD__) || defined(__APPLE__) || defined(__HAIKU__)
+#if defined(__linux__) || defined(__FreeBSD__) || defined(__OpenBSD__) || defined(__NetBSD__) || defined(__DragonFly__) || \
+    defined(__APPLE__) || defined(__HAIKU__)
 	char fname[] = "pthread_create";
 #else
 	char fname[] = "_pthread_create";
diff --git a/rpmalloc/rpmalloc.c b/rpmalloc/rpmalloc.c
index a23d62a..5186f61 100644
--- a/rpmalloc/rpmalloc.c
+++ b/rpmalloc/rpmalloc.c
@@ -20,7 +20,7 @@
 #if defined(__clang__)
 #pragma clang diagnostic ignored "-Wunused-macros"
 #pragma clang diagnostic ignored "-Wunused-function"
-#elif defined(__GCC__)
+#elif defined(__GNUC__)
 #pragma GCC diagnostic ignored "-Wunused-macros"
 #pragma GCC diagnostic ignored "-Wunused-function"
 #endif
@@ -120,14 +120,15 @@
 #  ifndef WIN32_LEAN_AND_MEAN
 #    define WIN32_LEAN_AND_MEAN
 #  endif
-#  include <Windows.h>
+#  include <windows.h>
 #  if ENABLE_VALIDATE_ARGS
-#    include <Intsafe.h>
+#    include <intsafe.h>
 #  endif
 #else
 #  include <unistd.h>
 #  include <stdio.h>
 #  include <stdlib.h>
+#  include <time.h>
 #  if defined(__APPLE__)
 #    include <TargetConditionals.h>
 #    if !TARGET_OS_IPHONE && !TARGET_OS_SIMULATOR
@@ -137,7 +138,6 @@
 #    include <pthread.h>
 #  endif
 #  if defined(__HAIKU__)
-#    include <OS.h>
 #    include <pthread.h>
 #  endif
 #endif
@@ -149,11 +149,6 @@
 #if defined(_WIN32) && (!defined(BUILD_DYNAMIC_LINK) || !BUILD_DYNAMIC_LINK)
 #include <fibersapi.h>
 static DWORD fls_key;
-static void NTAPI
-_rpmalloc_thread_destructor(void* value) {
-	if (value)
-		rpmalloc_thread_finalize();
-}
 #endif
 
 #if PLATFORM_POSIX
@@ -162,6 +157,14 @@
 #  ifdef __FreeBSD__
 #    include <sys/sysctl.h>
 #    define MAP_HUGETLB MAP_ALIGNED_SUPER
+#    ifndef PROT_MAX
+#      define PROT_MAX(f) 0
+#    endif
+#  else
+#    define PROT_MAX(f) 0
+#  endif
+#  ifdef __sun
+extern int madvise(caddr_t, size_t, int);
 #  endif
 #  ifndef MAP_UNINITIALIZED
 #    define MAP_UNINITIALIZED 0
@@ -175,9 +178,21 @@
 #    define _DEBUG
 #  endif
 #  include <assert.h>
+#define RPMALLOC_TOSTRING_M(x) #x
+#define RPMALLOC_TOSTRING(x) RPMALLOC_TOSTRING_M(x)
+#define rpmalloc_assert(truth, message)                                                                      \
+	do {                                                                                                     \
+		if (!(truth)) {                                                                                      \
+			if (_memory_config.error_callback) {                                                             \
+				_memory_config.error_callback(                                                               \
+				    message " (" RPMALLOC_TOSTRING(truth) ") at " __FILE__ ":" RPMALLOC_TOSTRING(__LINE__)); \
+			} else {                                                                                         \
+				assert((truth) && message);                                                                  \
+			}                                                                                                \
+		}                                                                                                    \
+	} while (0)
 #else
-#  undef  assert
-#  define assert(x) do {} while(0)
+#  define rpmalloc_assert(truth, message) do {} while(0)
 #endif
 #if ENABLE_STATISTICS
 #  include <stdio.h>
@@ -355,6 +370,8 @@
 #define SPAN_FLAG_SUBSPAN 2U
 //! Flag indicating span has blocks with increased alignment
 #define SPAN_FLAG_ALIGNED_BLOCKS 4U
+//! Flag indicating an unmapped master span
+#define SPAN_FLAG_UNMAPPED_MASTER 8U
 
 #if ENABLE_ADAPTIVE_THREAD_CACHE || ENABLE_STATISTICS
 struct span_use_t {
@@ -363,6 +380,8 @@
 	//! High water mark of spans used
 	atomic32_t high;
 #if ENABLE_STATISTICS
+	//! Number of spans in deferred list
+	atomic32_t spans_deferred;
 	//! Number of spans transitioned to global cache
 	atomic32_t spans_to_global;
 	//! Number of spans transitioned from global cache
@@ -570,6 +589,8 @@
 
 //! Initialized flag
 static int _rpmalloc_initialized;
+//! Main thread ID
+static uintptr_t _rpmalloc_main_thread_id;
 //! Configuration
 static rpmalloc_config_t _memory_config;
 //! Memory page size
@@ -626,6 +647,10 @@
 static heap_t* _memory_first_class_orphan_heaps;
 #endif
 #if ENABLE_STATISTICS
+//! Allocations counter
+static atomic64_t _allocation_counter;
+//! Deallocations counter
+static atomic64_t _deallocation_counter;
 //! Active heap count
 static atomic32_t _memory_active_heaps;
 //! Number of currently mapped memory pages
@@ -634,6 +659,8 @@
 static int32_t _mapped_pages_peak;
 //! Number of mapped master spans
 static atomic32_t _master_spans;
+//! Number of unmapped dangling master spans
+static atomic32_t _unmapped_master_spans;
 //! Number of currently unused spans
 static atomic32_t _reserved_spans;
 //! Running counter of total number of mapped memory pages since start
@@ -662,7 +689,11 @@
 #    define _Thread_local __declspec(thread)
 #    define TLS_MODEL
 #  else
-#    define TLS_MODEL __attribute__((tls_model("initial-exec")))
+#    ifndef __HAIKU__
+#      define TLS_MODEL __attribute__((tls_model("initial-exec")))
+#    else
+#      define TLS_MODEL
+#    endif
 #    if !defined(__clang__) && defined(__GNUC__)
 #      define _Thread_local __thread
 #    endif
@@ -702,14 +733,21 @@
 	uintptr_t tid;
 #  if defined(__i386__)
 	__asm__("movl %%gs:0, %0" : "=r" (tid) : : );
-#  elif defined(__MACH__) && !TARGET_OS_IPHONE && !TARGET_OS_SIMULATOR
-	__asm__("movq %%gs:0, %0" : "=r" (tid) : : );
 #  elif defined(__x86_64__)
+#    if defined(__MACH__)
+	__asm__("movq %%gs:0, %0" : "=r" (tid) : : );
+#    else
 	__asm__("movq %%fs:0, %0" : "=r" (tid) : : );
+#    endif
 #  elif defined(__arm__)
 	__asm__ volatile ("mrc p15, 0, %0, c13, c0, 3" : "=r" (tid));
 #  elif defined(__aarch64__)
+#    if defined(__MACH__)
+	// tpidr_el0 likely unused, always return 0 on iOS
+	__asm__ volatile ("mrs %0, tpidrro_el0" : "=r" (tid));
+#    else
 	__asm__ volatile ("mrs %0, tpidr_el0" : "=r" (tid));
+#    endif
 #  else
 	tid = (uintptr_t)((void*)get_thread_heap_raw());
 #  endif
@@ -731,6 +769,49 @@
 		heap->owner_thread = get_thread_id();
 }
 
+//! Set main thread ID
+extern void
+rpmalloc_set_main_thread(void);
+
+void
+rpmalloc_set_main_thread(void) {
+	_rpmalloc_main_thread_id = get_thread_id();
+}
+
+static void
+_rpmalloc_spin(void) {
+#if defined(_MSC_VER)
+	_mm_pause();
+#elif defined(__x86_64__) || defined(__i386__)
+	__asm__ volatile("pause" ::: "memory");
+#elif defined(__aarch64__) || (defined(__arm__) && __ARM_ARCH >= 7)
+	__asm__ volatile("yield" ::: "memory");
+#elif defined(__powerpc__) || defined(__powerpc64__)
+	// Unclear if this has ever been compiled for these architectures, but included as a precaution
+	__asm__ volatile("or 27,27,27");
+#elif defined(__sparc__)
+	__asm__ volatile("rd %ccr, %g0 \n\trd %ccr, %g0 \n\trd %ccr, %g0");
+#else
+	struct timespec ts = {0};
+	nanosleep(&ts, 0);
+#endif
+}
+
+#if defined(_WIN32) && (!defined(BUILD_DYNAMIC_LINK) || !BUILD_DYNAMIC_LINK)
+static void NTAPI
+_rpmalloc_thread_destructor(void* value) {
+#if ENABLE_OVERRIDE
+	// If this is called on main thread it means rpmalloc_finalize
+	// has not been called and shutdown is forced (through _exit) or unclean
+	if (get_thread_id() == _rpmalloc_main_thread_id)
+		return;
+#endif
+	if (value)
+		rpmalloc_thread_finalize(1);
+}
+#endif
+
+
 ////////////
 ///
 /// Low level memory map/unmap
@@ -743,8 +824,8 @@
 //  returns address to start of mapped region to use
 static void*
 _rpmalloc_mmap(size_t size, size_t* offset) {
-	assert(!(size % _memory_page_size));
-	assert(size >= _memory_page_size);
+	rpmalloc_assert(!(size % _memory_page_size), "Invalid mmap size");
+	rpmalloc_assert(size >= _memory_page_size, "Invalid mmap size");
 	_rpmalloc_stat_add_peak(&_mapped_pages, (size >> _memory_page_size_shift), _mapped_pages_peak);
 	_rpmalloc_stat_add(&_mapped_total, (size >> _memory_page_size_shift));
 	return _memory_config.memory_map(size, offset);
@@ -757,10 +838,10 @@
 //  release is set to 0 for partial unmap, or size of entire range for a full unmap
 static void
 _rpmalloc_unmap(void* address, size_t size, size_t offset, size_t release) {
-	assert(!release || (release >= size));
-	assert(!release || (release >= _memory_page_size));
+	rpmalloc_assert(!release || (release >= size), "Invalid unmap size");
+	rpmalloc_assert(!release || (release >= _memory_page_size), "Invalid unmap size");
 	if (release) {
-		assert(!(release % _memory_page_size));
+		rpmalloc_assert(!(release % _memory_page_size), "Invalid unmap size");
 		_rpmalloc_stat_sub(&_mapped_pages, (release >> _memory_page_size_shift));
 		_rpmalloc_stat_add(&_unmapped_total, (release >> _memory_page_size_shift));
 	}
@@ -772,12 +853,12 @@
 _rpmalloc_mmap_os(size_t size, size_t* offset) {
 	//Either size is a heap (a single page) or a (multiple) span - we only need to align spans, and only if larger than map granularity
 	size_t padding = ((size >= _memory_span_size) && (_memory_span_size > _memory_map_granularity)) ? _memory_span_size : 0;
-	assert(size >= _memory_page_size);
+	rpmalloc_assert(size >= _memory_page_size, "Invalid mmap size");
 #if PLATFORM_WINDOWS
 	//Ok to MEM_COMMIT - according to MSDN, "actual physical pages are not allocated unless/until the virtual addresses are actually accessed"
 	void* ptr = VirtualAlloc(0, size + padding, (_memory_huge_pages ? MEM_LARGE_PAGES : 0) | MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
 	if (!ptr) {
-		assert(ptr && "Failed to map virtual memory block");
+		rpmalloc_assert(ptr, "Failed to map virtual memory block");
 		return 0;
 	}
 #else
@@ -788,7 +869,10 @@
 		fd |= VM_FLAGS_SUPERPAGE_SIZE_2MB;
 	void* ptr = mmap(0, size + padding, PROT_READ | PROT_WRITE, flags, fd, 0);
 #  elif defined(MAP_HUGETLB)
-	void* ptr = mmap(0, size + padding, PROT_READ | PROT_WRITE, (_memory_huge_pages ? MAP_HUGETLB : 0) | flags, -1, 0);
+	void* ptr = mmap(0, size + padding, PROT_READ | PROT_WRITE | PROT_MAX(PROT_READ | PROT_WRITE), (_memory_huge_pages ? MAP_HUGETLB : 0) | flags, -1, 0);
+#  elif defined(MAP_ALIGNED)
+	const size_t align = (sizeof(size_t) * 8) - (size_t)(__builtin_clzl(size - 1));
+	void* ptr = mmap(0, size + padding, PROT_READ | PROT_WRITE, (_memory_huge_pages ? MAP_ALIGNED(align) : 0) | flags, -1, 0);
 #  elif defined(MAP_ALIGN)
 	caddr_t base = (_memory_huge_pages ? (caddr_t)(4 << 20) : 0);
 	void* ptr = mmap(base, size + padding, PROT_READ | PROT_WRITE, (_memory_huge_pages ? MAP_ALIGN : 0) | flags, -1, 0);
@@ -796,29 +880,30 @@
 	void* ptr = mmap(0, size + padding, PROT_READ | PROT_WRITE, flags, -1, 0);
 #  endif
 	if ((ptr == MAP_FAILED) || !ptr) {
-		assert("Failed to map virtual memory block" == 0);
+		if (errno != ENOMEM)
+			rpmalloc_assert((ptr != MAP_FAILED) && ptr, "Failed to map virtual memory block");
 		return 0;
 	}
 #endif
 	_rpmalloc_stat_add(&_mapped_pages_os, (int32_t)((size + padding) >> _memory_page_size_shift));
 	if (padding) {
 		size_t final_padding = padding - ((uintptr_t)ptr & ~_memory_span_mask);
-		assert(final_padding <= _memory_span_size);
-		assert(final_padding <= padding);
-		assert(!(final_padding % 8));
+		rpmalloc_assert(final_padding <= _memory_span_size, "Internal failure in padding");
+		rpmalloc_assert(final_padding <= padding, "Internal failure in padding");
+		rpmalloc_assert(!(final_padding % 8), "Internal failure in padding");
 		ptr = pointer_offset(ptr, final_padding);
 		*offset = final_padding >> 3;
 	}
-	assert((size < _memory_span_size) || !((uintptr_t)ptr & ~_memory_span_mask));
+	rpmalloc_assert((size < _memory_span_size) || !((uintptr_t)ptr & ~_memory_span_mask), "Internal failure in padding");
 	return ptr;
 }
 
 //! Default implementation to unmap pages from virtual memory
 static void
 _rpmalloc_unmap_os(void* address, size_t size, size_t offset, size_t release) {
-	assert(release || (offset == 0));
-	assert(!release || (release >= _memory_page_size));
-	assert(size >= _memory_page_size);
+	rpmalloc_assert(release || (offset == 0), "Invalid unmap size");
+	rpmalloc_assert(!release || (release >= _memory_page_size), "Invalid unmap size");
+	rpmalloc_assert(size >= _memory_page_size, "Invalid unmap size");
 	if (release && offset) {
 		offset <<= 3;
 		address = pointer_offset(address, -(int32_t)offset);
@@ -830,19 +915,28 @@
 #if !DISABLE_UNMAP
 #if PLATFORM_WINDOWS
 	if (!VirtualFree(address, release ? 0 : size, release ? MEM_RELEASE : MEM_DECOMMIT)) {
-		assert(address && "Failed to unmap virtual memory block");
+		rpmalloc_assert(0, "Failed to unmap virtual memory block");
 	}
 #else
 	if (release) {
 		if (munmap(address, release)) {
-			assert("Failed to unmap virtual memory block" == 0);
+			rpmalloc_assert(0, "Failed to unmap virtual memory block");
 		}
 	} else {
-#if defined(POSIX_MADV_FREE)
-		if (posix_madvise(address, size, POSIX_MADV_FREE))
+#if defined(MADV_FREE_REUSABLE)
+		int ret;
+		while ((ret = madvise(address, size, MADV_FREE_REUSABLE)) == -1 && (errno == EAGAIN))
+			errno = 0;
+		if ((ret == -1) && (errno != 0))
+#elif defined(MADV_FREE)
+		if (madvise(address, size, MADV_FREE))
 #endif
+#if defined(MADV_DONTNEED)
+		if (madvise(address, size, MADV_DONTNEED)) {
+#else
 		if (posix_madvise(address, size, POSIX_MADV_DONTNEED)) {
-			assert("Failed to madvise virtual memory block as free" == 0);
+#endif
+			rpmalloc_assert(0, "Failed to madvise virtual memory block as free");
 		}
 	}
 #endif
@@ -885,19 +979,16 @@
 //! Add a span to double linked list at the head
 static void
 _rpmalloc_span_double_link_list_add(span_t** head, span_t* span) {
-	if (*head) {
-		span->next = *head;
+	if (*head)
 		(*head)->prev = span;
-	} else {
-		span->next = 0;
-	}
+	span->next = *head;
 	*head = span;
 }
 
 //! Pop head span from double linked list
 static void
 _rpmalloc_span_double_link_list_pop_head(span_t** head, span_t* span) {
-	assert(*head == span);
+	rpmalloc_assert(*head == span, "Linked list corrupted");
 	span = *head;
 	*head = span->next;
 }
@@ -905,16 +996,15 @@
 //! Remove a span from double linked list
 static void
 _rpmalloc_span_double_link_list_remove(span_t** head, span_t* span) {
-	assert(*head);
+	rpmalloc_assert(*head, "Linked list corrupted");
 	if (*head == span) {
 		*head = span->next;
 	} else {
 		span_t* next_span = span->next;
 		span_t* prev_span = span->prev;
 		prev_span->next = next_span;
-		if (EXPECTED(next_span != 0)) {
+		if (EXPECTED(next_span != 0))
 			next_span->prev = prev_span;
-		}
 	}
 }
 
@@ -937,7 +1027,7 @@
 //! Declare the span to be a subspan and store distance from master span and span count
 static void
 _rpmalloc_span_mark_as_subspan_unless_master(span_t* master, span_t* subspan, size_t span_count) {
-	assert((subspan != master) || (subspan->flags & SPAN_FLAG_MASTER));
+	rpmalloc_assert((subspan != master) || (subspan->flags & SPAN_FLAG_MASTER), "Span master pointer and/or flag mismatch");
 	if (subspan != master) {
 		subspan->flags = SPAN_FLAG_SUBSPAN;
 		subspan->offset_from_master = (uint32_t)((uintptr_t)pointer_diff(subspan, master) >> _memory_span_size_shift);
@@ -1006,21 +1096,24 @@
 			_rpmalloc_heap_cache_insert(heap, heap->span_reserve);
 		}
 		if (reserved_count > DEFAULT_SPAN_MAP_COUNT) {
+			// If huge pages, make sure only one thread maps more memory to avoid bloat
+			while (!atomic_cas32_acquire(&_memory_global_lock, 1, 0))
+				_rpmalloc_spin();
 			size_t remain_count = reserved_count - DEFAULT_SPAN_MAP_COUNT;
 			reserved_count = DEFAULT_SPAN_MAP_COUNT;
 			span_t* remain_span = (span_t*)pointer_offset(reserved_spans, reserved_count * _memory_span_size);
-			if (_memory_global_reserve)
+			if (_memory_global_reserve) {
+				_rpmalloc_span_mark_as_subspan_unless_master(_memory_global_reserve_master, _memory_global_reserve, _memory_global_reserve_count);
 				_rpmalloc_span_unmap(_memory_global_reserve);
+			}
 			_rpmalloc_global_set_reserved_spans(span, remain_span, remain_count);
+			atomic_store32_release(&_memory_global_lock, 0);
 		}
 		_rpmalloc_heap_set_reserved_spans(heap, span, reserved_spans, reserved_count);
 	}
 	return span;
 }
 
-static span_t*
-_rpmalloc_global_get_reserved_spans(size_t span_count);
-
 //! Map in memory pages for the given number of spans (or use previously reserved pages)
 static span_t*
 _rpmalloc_span_map(heap_t* heap, size_t span_count) {
@@ -1029,9 +1122,8 @@
 	span_t* span = 0;
 	if (_memory_page_size > _memory_span_size) {
 		// If huge pages, make sure only one thread maps more memory to avoid bloat
-		while (!atomic_cas32_acquire(&_memory_global_lock, 1, 0)) {
-			/* Spin */
-		}
+		while (!atomic_cas32_acquire(&_memory_global_lock, 1, 0))
+			_rpmalloc_spin();
 		if (_memory_global_reserve_count >= span_count) {
 			size_t reserve_count = (!heap->spans_reserved ? DEFAULT_SPAN_MAP_COUNT : span_count);
 			if (_memory_global_reserve_count < reserve_count)
@@ -1057,18 +1149,18 @@
 //! Unmap memory pages for the given number of spans (or mark as unused if no partial unmappings)
 static void
 _rpmalloc_span_unmap(span_t* span) {
-	assert((span->flags & SPAN_FLAG_MASTER) || (span->flags & SPAN_FLAG_SUBSPAN));
-	assert(!(span->flags & SPAN_FLAG_MASTER) || !(span->flags & SPAN_FLAG_SUBSPAN));
+	rpmalloc_assert((span->flags & SPAN_FLAG_MASTER) || (span->flags & SPAN_FLAG_SUBSPAN), "Span flag corrupted");
+	rpmalloc_assert(!(span->flags & SPAN_FLAG_MASTER) || !(span->flags & SPAN_FLAG_SUBSPAN), "Span flag corrupted");
 
 	int is_master = !!(span->flags & SPAN_FLAG_MASTER);
 	span_t* master = is_master ? span : ((span_t*)pointer_offset(span, -(intptr_t)((uintptr_t)span->offset_from_master * _memory_span_size)));
-	assert(is_master || (span->flags & SPAN_FLAG_SUBSPAN));
-	assert(master->flags & SPAN_FLAG_MASTER);
+	rpmalloc_assert(is_master || (span->flags & SPAN_FLAG_SUBSPAN), "Span flag corrupted");
+	rpmalloc_assert(master->flags & SPAN_FLAG_MASTER, "Span flag corrupted");
 
 	size_t span_count = span->span_count;
 	if (!is_master) {
 		//Directly unmap subspans (unless huge pages, in which case we defer and unmap entire page range with master)
-		assert(span->align_offset == 0);
+		rpmalloc_assert(span->align_offset == 0, "Span align offset corrupted");
 		if (_memory_span_size >= _memory_page_size) {
 			_rpmalloc_unmap(span, span_count * _memory_span_size, 0, 0);
 			_rpmalloc_stat_sub(&_reserved_spans, span_count);
@@ -1076,17 +1168,19 @@
 	} else {
 		//Special double flag to denote an unmapped master
 		//It must be kept in memory since span header must be used
-		span->flags |= SPAN_FLAG_MASTER | SPAN_FLAG_SUBSPAN;
+		span->flags |= SPAN_FLAG_MASTER | SPAN_FLAG_SUBSPAN | SPAN_FLAG_UNMAPPED_MASTER;
+		_rpmalloc_stat_add(&_unmapped_master_spans, 1);
 	}
 
 	if (atomic_add32(&master->remaining_spans, -(int32_t)span_count) <= 0) {
 		//Everything unmapped, unmap the master span with release flag to unmap the entire range of the super span
-		assert(!!(master->flags & SPAN_FLAG_MASTER) && !!(master->flags & SPAN_FLAG_SUBSPAN));
+		rpmalloc_assert(!!(master->flags & SPAN_FLAG_MASTER) && !!(master->flags & SPAN_FLAG_SUBSPAN), "Span flag corrupted");
 		size_t unmap_count = master->span_count;
 		if (_memory_span_size < _memory_page_size)
 			unmap_count = master->total_spans;
 		_rpmalloc_stat_sub(&_reserved_spans, unmap_count);
 		_rpmalloc_stat_sub(&_master_spans, 1);
+		_rpmalloc_stat_sub(&_unmapped_master_spans, 1);
 		_rpmalloc_unmap(master, unmap_count * _memory_span_size, master->align_offset, (size_t)master->total_spans * _memory_span_size);
 	}
 }
@@ -1094,8 +1188,8 @@
 //! Move the span (used for small or medium allocations) to the heap thread cache
 static void
 _rpmalloc_span_release_to_cache(heap_t* heap, span_t* span) {
-	assert(heap == span->heap);
-	assert(span->size_class < SIZE_CLASS_COUNT);
+	rpmalloc_assert(heap == span->heap, "Span heap pointer corrupted");
+	rpmalloc_assert(span->size_class < SIZE_CLASS_COUNT, "Invalid span size class");
 #if ENABLE_ADAPTIVE_THREAD_CACHE || ENABLE_STATISTICS
 	atomic_decr32(&heap->span_use[0].current);
 #endif
@@ -1115,7 +1209,7 @@
 //! as allocated, returning number of blocks in list
 static uint32_t
 free_list_partial_init(void** list, void** first_block, void* page_start, void* block_start, uint32_t block_count, uint32_t block_size) {
-	assert(block_count);
+	rpmalloc_assert(block_count, "Internal failure");
 	*first_block = block_start;
 	if (block_count > 1) {
 		void* free_block = pointer_offset(block_start, block_size);
@@ -1144,8 +1238,8 @@
 
 //! Initialize an unused span (from cache or mapped) to be new active span, putting the initial free list in heap class free list
 static void*
-_rpmalloc_span_initialize_new(heap_t* heap, span_t* span, uint32_t class_idx) {
-	assert(span->span_count == 1);
+_rpmalloc_span_initialize_new(heap_t* heap, heap_size_class_t* heap_size_class, span_t* span, uint32_t class_idx) {
+	rpmalloc_assert(span->span_count == 1, "Internal failure");
 	size_class_t* size_class = _memory_size_class + class_idx;
 	span->size_class = class_idx;
 	span->heap = heap;
@@ -1158,11 +1252,11 @@
 
 	//Setup free list. Only initialize one system page worth of free blocks in list
 	void* block;
-	span->free_list_limit = free_list_partial_init(&heap->size_class[class_idx].free_list, &block, 
+	span->free_list_limit = free_list_partial_init(&heap_size_class->free_list, &block, 
 		span, pointer_offset(span, SPAN_HEADER_SIZE), size_class->block_count, size_class->block_size);
 	//Link span as partial if there remains blocks to be initialized as free list, or full if fully initialized
 	if (span->free_list_limit < span->block_count) {
-		_rpmalloc_span_double_link_list_add(&heap->size_class[class_idx].partial_span, span);
+		_rpmalloc_span_double_link_list_add(&heap_size_class->partial_span, span);
 		span->used_count = span->free_list_limit;
 	} else {
 #if RPMALLOC_FIRST_CLASS_HEAPS
@@ -1188,7 +1282,7 @@
 
 static int
 _rpmalloc_span_is_fully_utilized(span_t* span) {
-	assert(span->free_list_limit <= span->block_count);
+	rpmalloc_assert(span->free_list_limit <= span->block_count, "Span free list corrupted");
 	return !span->free_list && (span->free_list_limit >= span->block_count);
 }
 
@@ -1219,7 +1313,7 @@
 		span->used_count -= free_count;
 	}
 	//If this assert triggers you have memory leaks
-	assert(span->list_size == span->used_count);
+	rpmalloc_assert(span->list_size == span->used_count, "Memory leak detected");
 	if (span->list_size == span->used_count) {
 		_rpmalloc_stat_dec(&heap->span_use[0].current);
 		_rpmalloc_stat_dec(&heap->size_class_use[iclass].spans_current);
@@ -1245,7 +1339,7 @@
 static void
 _rpmalloc_global_cache_finalize(global_cache_t* cache) {
 	while (!atomic_cas32_acquire(&cache->lock, 1, 0))
-		/* Spin */;
+		_rpmalloc_spin();
 
 	for (size_t ispan = 0; ispan < cache->count; ++ispan)
 		_rpmalloc_span_unmap(cache->span[ispan]);
@@ -1270,7 +1364,7 @@
 
 	size_t insert_count = count;
 	while (!atomic_cas32_acquire(&cache->lock, 1, 0))
-		/* Spin */;
+		_rpmalloc_spin();
 
 	if ((cache->count + insert_count) > cache_limit)
 		insert_count = cache_limit - cache->count;
@@ -1291,29 +1385,73 @@
 	}
 	atomic_store32_release(&cache->lock, 0);
 
-	for (size_t ispan = insert_count; ispan < count; ++ispan)
-		_rpmalloc_span_unmap(span[ispan]);
+	span_t* keep = 0;
+	for (size_t ispan = insert_count; ispan < count; ++ispan) {
+		span_t* current_span = span[ispan];
+		// Keep master spans that have remaining subspans to avoid dangling them
+		if ((current_span->flags & SPAN_FLAG_MASTER) &&
+		    (atomic_load32(&current_span->remaining_spans) > (int32_t)current_span->span_count)) {
+			current_span->next = keep;
+			keep = current_span;
+		} else {
+			_rpmalloc_span_unmap(current_span);
+		}
+	}
+
+	if (keep) {
+		while (!atomic_cas32_acquire(&cache->lock, 1, 0))
+			_rpmalloc_spin();
+
+		size_t islot = 0;
+		while (keep) {
+			for (; islot < cache->count; ++islot) {
+				span_t* current_span = cache->span[islot];
+				if (!(current_span->flags & SPAN_FLAG_MASTER) || ((current_span->flags & SPAN_FLAG_MASTER) &&
+				    (atomic_load32(&current_span->remaining_spans) <= (int32_t)current_span->span_count))) {
+					_rpmalloc_span_unmap(current_span);
+					cache->span[islot] = keep;
+					break;
+				}
+			}
+			if (islot == cache->count)
+				break;
+			keep = keep->next;
+		}
+
+		if (keep) {
+			span_t* tail = keep;
+			while (tail->next)
+				tail = tail->next;
+			tail->next = cache->overflow;
+			cache->overflow = keep;
+		}
+
+		atomic_store32_release(&cache->lock, 0);
+	}
 }
 
 static size_t
 _rpmalloc_global_cache_extract_spans(span_t** span, size_t span_count, size_t count) {
 	global_cache_t* cache = &_memory_span_cache[span_count - 1];
 
-	size_t extract_count = count;
+	size_t extract_count = 0;
 	while (!atomic_cas32_acquire(&cache->lock, 1, 0))
-		/* Spin */;
+		_rpmalloc_spin();
 
-	if (extract_count > cache->count)
-		extract_count = cache->count;
+	size_t want = count - extract_count;
+	if (want > cache->count)
+		want = cache->count;
 
-	memcpy(span, cache->span + (cache->count - extract_count), sizeof(span_t*) * extract_count);
-	cache->count -= (uint32_t)extract_count;
+	memcpy(span + extract_count, cache->span + (cache->count - want), sizeof(span_t*) * want);
+	cache->count -= (uint32_t)want;
+	extract_count += want;
 
 	while ((extract_count < count) && cache->overflow) {
 		span_t* current_span = cache->overflow;
 		span[extract_count++] = current_span;
 		cache->overflow = current_span->next;
 	}
+
 	atomic_store32_release(&cache->lock, 0);
 
 	return extract_count;
@@ -1343,37 +1481,37 @@
 	span_t* span = (span_t*)((void*)atomic_exchange_ptr_acquire(&heap->span_free_deferred, 0));
 	while (span) {
 		span_t* next_span = (span_t*)span->free_list;
-		assert(span->heap == heap);
+		rpmalloc_assert(span->heap == heap, "Span heap pointer corrupted");
 		if (EXPECTED(span->size_class < SIZE_CLASS_COUNT)) {
-			assert(heap->full_span_count);
+			rpmalloc_assert(heap->full_span_count, "Heap span counter corrupted");
 			--heap->full_span_count;
+			_rpmalloc_stat_dec(&heap->span_use[0].spans_deferred);
 #if RPMALLOC_FIRST_CLASS_HEAPS
 			_rpmalloc_span_double_link_list_remove(&heap->full_span[span->size_class], span);
 #endif
-			if (single_span && !*single_span) {
+			_rpmalloc_stat_dec(&heap->span_use[0].current);
+			_rpmalloc_stat_dec(&heap->size_class_use[span->size_class].spans_current);
+			if (single_span && !*single_span)
 				*single_span = span;
-			} else {
-				_rpmalloc_stat_dec(&heap->span_use[0].current);
-				_rpmalloc_stat_dec(&heap->size_class_use[span->size_class].spans_current);
+			else
 				_rpmalloc_heap_cache_insert(heap, span);
-			}
 		} else {
 			if (span->size_class == SIZE_CLASS_HUGE) {
 				_rpmalloc_deallocate_huge(span);
 			} else {
-				assert(span->size_class == SIZE_CLASS_LARGE);
-				assert(heap->full_span_count);
+				rpmalloc_assert(span->size_class == SIZE_CLASS_LARGE, "Span size class invalid");
+				rpmalloc_assert(heap->full_span_count, "Heap span counter corrupted");
 				--heap->full_span_count;
 #if RPMALLOC_FIRST_CLASS_HEAPS
 				_rpmalloc_span_double_link_list_remove(&heap->large_huge_span, span);
 #endif
 				uint32_t idx = span->span_count - 1;
-				if (!idx && single_span && !*single_span) {
+				_rpmalloc_stat_dec(&heap->span_use[idx].spans_deferred);
+				_rpmalloc_stat_dec(&heap->span_use[idx].current);
+				if (!idx && single_span && !*single_span)
 					*single_span = span;
-				} else {
-					_rpmalloc_stat_dec(&heap->span_use[idx].current);
+				else
 					_rpmalloc_heap_cache_insert(heap, span);
-				}
 			}
 		}
 		span = next_span;
@@ -1428,7 +1566,7 @@
 		}
 	}
 	//Heap is now completely free, unmap and remove from heap list
-	size_t list_idx = heap->id % HEAP_ARRAY_SIZE;
+	size_t list_idx = (size_t)heap->id % HEAP_ARRAY_SIZE;
 	heap_t* list_heap = _memory_heaps[list_idx];
 	if (list_heap == heap) {
 		_memory_heaps[list_idx] = heap->next_heap;
@@ -1497,11 +1635,6 @@
 static span_t*
 _rpmalloc_heap_thread_cache_extract(heap_t* heap, size_t span_count) {
 	span_t* span = 0;
-	if (span_count == 1) {
-		_rpmalloc_heap_cache_adopt_deferred(heap, &span);
-		if (span)
-			return span;
-	}
 #if ENABLE_THREAD_CACHE
 	span_cache_t* span_cache;
 	if (span_count == 1)
@@ -1517,6 +1650,18 @@
 }
 
 static span_t*
+_rpmalloc_heap_thread_cache_deferred_extract(heap_t* heap, size_t span_count) {
+	span_t* span = 0;
+	if (span_count == 1) {
+		_rpmalloc_heap_cache_adopt_deferred(heap, &span);
+	} else {
+		_rpmalloc_heap_cache_adopt_deferred(heap, 0);
+		span = _rpmalloc_heap_thread_cache_extract(heap, span_count);
+	}
+	return span;
+}
+
+static span_t*
 _rpmalloc_heap_reserved_extract(heap_t* heap, size_t span_count) {
 	if (heap->spans_reserved >= span_count)
 		return _rpmalloc_span_map(heap, span_count);
@@ -1558,10 +1703,11 @@
 	return 0;
 }
 
-//! Get a span from one of the cache levels (thread cache, reserved, global cache) or fallback to mapping more memory
-static span_t*
-_rpmalloc_heap_extract_new_span(heap_t* heap, size_t span_count, uint32_t class_idx) {
-	span_t* span;
+static void
+_rpmalloc_inc_span_statistics(heap_t* heap, size_t span_count, uint32_t class_idx) {
+	(void)sizeof(heap);
+	(void)sizeof(span_count);
+	(void)sizeof(class_idx);
 #if ENABLE_ADAPTIVE_THREAD_CACHE || ENABLE_STATISTICS
 	uint32_t idx = (uint32_t)span_count - 1;
 	uint32_t current_count = (uint32_t)atomic_incr32(&heap->span_use[idx].current);
@@ -1569,48 +1715,68 @@
 		atomic_store32(&heap->span_use[idx].high, (int32_t)current_count);
 	_rpmalloc_stat_add_peak(&heap->size_class_use[class_idx].spans_current, 1, heap->size_class_use[class_idx].spans_peak);
 #endif
+}
+
+//! Get a span from one of the cache levels (thread cache, reserved, global cache) or fallback to mapping more memory
+static span_t*
+_rpmalloc_heap_extract_new_span(heap_t* heap, heap_size_class_t* heap_size_class, size_t span_count, uint32_t class_idx) {
+	span_t* span;
 #if ENABLE_THREAD_CACHE
-	if (class_idx < SIZE_CLASS_COUNT) {
-		if (heap->size_class[class_idx].cache) {
-			span = heap->size_class[class_idx].cache;
-			span_t* new_cache = 0;
-			if (heap->span_cache.count)
-				new_cache = heap->span_cache.span[--heap->span_cache.count];
-			heap->size_class[class_idx].cache = new_cache;
+	if (heap_size_class && heap_size_class->cache) {
+		span = heap_size_class->cache;
+		heap_size_class->cache = (heap->span_cache.count ? heap->span_cache.span[--heap->span_cache.count] : 0);
+		_rpmalloc_inc_span_statistics(heap, span_count, class_idx);
+		return span;
+	}
+#endif
+	(void)sizeof(class_idx);
+	// Allow 50% overhead to increase cache hits
+	size_t base_span_count = span_count;
+	size_t limit_span_count = (span_count > 2) ? (span_count + (span_count >> 1)) : span_count;
+	if (limit_span_count > LARGE_CLASS_COUNT)
+		limit_span_count = LARGE_CLASS_COUNT;
+	do {
+		span = _rpmalloc_heap_thread_cache_extract(heap, span_count);
+		if (EXPECTED(span != 0)) {
+			_rpmalloc_stat_inc(&heap->size_class_use[class_idx].spans_from_cache);
+			_rpmalloc_inc_span_statistics(heap, span_count, class_idx);
 			return span;
 		}
-	}
-#else
-	(void)sizeof(class_idx);
-#endif
-	span = _rpmalloc_heap_thread_cache_extract(heap, span_count);
-	if (EXPECTED(span != 0)) {
-		_rpmalloc_stat_inc(&heap->size_class_use[class_idx].spans_from_cache);
-		return span;
-	}
-	span = _rpmalloc_heap_reserved_extract(heap, span_count);
-	if (EXPECTED(span != 0)) {
-		_rpmalloc_stat_inc(&heap->size_class_use[class_idx].spans_from_reserved);
-		return span;
-	}
-	span = _rpmalloc_heap_global_cache_extract(heap, span_count);
-	if (EXPECTED(span != 0)) {
-		_rpmalloc_stat_inc(&heap->size_class_use[class_idx].spans_from_cache);
-		return span;
-	}
+		span = _rpmalloc_heap_thread_cache_deferred_extract(heap, span_count);
+		if (EXPECTED(span != 0)) {
+			_rpmalloc_stat_inc(&heap->size_class_use[class_idx].spans_from_cache);
+			_rpmalloc_inc_span_statistics(heap, span_count, class_idx);
+			return span;
+		}
+		span = _rpmalloc_heap_reserved_extract(heap, span_count);
+		if (EXPECTED(span != 0)) {
+			_rpmalloc_stat_inc(&heap->size_class_use[class_idx].spans_from_reserved);
+			_rpmalloc_inc_span_statistics(heap, span_count, class_idx);
+			return span;
+		}
+		span = _rpmalloc_heap_global_cache_extract(heap, span_count);
+		if (EXPECTED(span != 0)) {
+			_rpmalloc_stat_inc(&heap->size_class_use[class_idx].spans_from_cache);
+			_rpmalloc_inc_span_statistics(heap, span_count, class_idx);
+			return span;
+		}
+		++span_count;
+	} while (span_count <= limit_span_count);
 	//Final fallback, map in more virtual memory
-	span = _rpmalloc_span_map(heap, span_count);
+	span = _rpmalloc_span_map(heap, base_span_count);
+	_rpmalloc_inc_span_statistics(heap, base_span_count, class_idx);
 	_rpmalloc_stat_inc(&heap->size_class_use[class_idx].spans_map_calls);
 	return span;
 }
 
 static void
 _rpmalloc_heap_initialize(heap_t* heap) {
+	memset(heap, 0, sizeof(heap_t));
 	//Get a new heap ID
 	heap->id = 1 + atomic_incr32(&_memory_heap_id);
 
 	//Link in heap in heap ID map
-	size_t list_idx = heap->id % HEAP_ARRAY_SIZE;
+	size_t list_idx = (size_t)heap->id % HEAP_ARRAY_SIZE;
 	heap->next_heap = _memory_heaps[list_idx];
 	_memory_heaps[list_idx] = heap;
 }
@@ -1717,7 +1883,7 @@
 _rpmalloc_heap_allocate(int first_class) {
 	heap_t* heap = 0;
 	while (!atomic_cas32_acquire(&_memory_global_lock, 1, 0))
-		/* Spin */;
+		_rpmalloc_spin();
 	if (first_class == 0)
 		heap = _rpmalloc_heap_extract_orphan(&_memory_orphan_heaps);
 #if RPMALLOC_FIRST_CLASS_HEAPS
@@ -1727,59 +1893,71 @@
 	if (!heap)
 		heap = _rpmalloc_heap_allocate_new();
 	atomic_store32_release(&_memory_global_lock, 0);
+	_rpmalloc_heap_cache_adopt_deferred(heap, 0);
 	return heap;
 }
 
 static void
-_rpmalloc_heap_release(void* heapptr, int first_class) {
+_rpmalloc_heap_release(void* heapptr, int first_class, int release_cache) {
 	heap_t* heap = (heap_t*)heapptr;
 	if (!heap)
 		return;
 	//Release thread cache spans back to global cache
 	_rpmalloc_heap_cache_adopt_deferred(heap, 0);
+	if (release_cache || heap->finalize) {
 #if ENABLE_THREAD_CACHE
-	for (size_t iclass = 0; iclass < LARGE_CLASS_COUNT; ++iclass) {
-		span_cache_t* span_cache;
-		if (!iclass)
-			span_cache = &heap->span_cache;
-		else
-			span_cache = (span_cache_t*)(heap->span_large_cache + (iclass - 1));
-		if (!span_cache->count)
-			continue;
+		for (size_t iclass = 0; iclass < LARGE_CLASS_COUNT; ++iclass) {
+			span_cache_t* span_cache;
+			if (!iclass)
+				span_cache = &heap->span_cache;
+			else
+				span_cache = (span_cache_t*)(heap->span_large_cache + (iclass - 1));
+			if (!span_cache->count)
+				continue;
 #if ENABLE_GLOBAL_CACHE
-		if (heap->finalize) {
+			if (heap->finalize) {
+				for (size_t ispan = 0; ispan < span_cache->count; ++ispan)
+					_rpmalloc_span_unmap(span_cache->span[ispan]);
+			} else {
+				_rpmalloc_stat_add64(&heap->thread_to_global, span_cache->count * (iclass + 1) * _memory_span_size);
+				_rpmalloc_stat_add(&heap->span_use[iclass].spans_to_global, span_cache->count);
+				_rpmalloc_global_cache_insert_spans(span_cache->span, iclass + 1, span_cache->count);
+			}
+#else
 			for (size_t ispan = 0; ispan < span_cache->count; ++ispan)
 				_rpmalloc_span_unmap(span_cache->span[ispan]);
-		} else {
-			_rpmalloc_stat_add64(&heap->thread_to_global, span_cache->count * (iclass + 1) * _memory_span_size);
-			_rpmalloc_stat_add(&heap->span_use[iclass].spans_to_global, span_cache->count);
-			_rpmalloc_global_cache_insert_spans(span_cache->span, iclass + 1, span_cache->count);
+#endif
+			span_cache->count = 0;
 		}
-#else
-		for (size_t ispan = 0; ispan < span_cache->count; ++ispan)
-			_rpmalloc_span_unmap(span_cache->span[ispan]);
 #endif
-		span_cache->count = 0;
 	}
-#endif
 
 	if (get_thread_heap_raw() == heap)
 		set_thread_heap(0);
 
 #if ENABLE_STATISTICS
 	atomic_decr32(&_memory_active_heaps);
-	assert(atomic_load32(&_memory_active_heaps) >= 0);
+	rpmalloc_assert(atomic_load32(&_memory_active_heaps) >= 0, "Still active heaps during finalization");
 #endif
 
-	while (!atomic_cas32_acquire(&_memory_global_lock, 1, 0))
-		/* Spin */;
+	// If we are forcibly terminating with _exit the state of the
+	// lock atomic is unknown and it's best to just go ahead and exit
+	if (get_thread_id() != _rpmalloc_main_thread_id) {
+		while (!atomic_cas32_acquire(&_memory_global_lock, 1, 0))
+			_rpmalloc_spin();
+	}
 	_rpmalloc_heap_orphan(heap, first_class);
 	atomic_store32_release(&_memory_global_lock, 0);
 }
 
 static void
-_rpmalloc_heap_release_raw(void* heapptr) {
-	_rpmalloc_heap_release(heapptr, 0);
+_rpmalloc_heap_release_raw(void* heapptr, int release_cache) {
+	_rpmalloc_heap_release(heapptr, 0, release_cache);
+}
+
+static void
+_rpmalloc_heap_release_raw_fc(void* heapptr) {
+	_rpmalloc_heap_release_raw(heapptr, 1);
 }
 
 static void
@@ -1830,7 +2008,7 @@
 		span_cache->count = 0;
 	}
 #endif
-	assert(!atomic_load_ptr(&heap->span_free_deferred));
+	rpmalloc_assert(!atomic_load_ptr(&heap->span_free_deferred), "Heaps still active during finalization");
 }
 
 
@@ -1850,25 +2028,25 @@
 
 //! Allocate a small/medium sized memory block from the given heap
 static void*
-_rpmalloc_allocate_from_heap_fallback(heap_t* heap, uint32_t class_idx) {
-	span_t* span = heap->size_class[class_idx].partial_span;
+_rpmalloc_allocate_from_heap_fallback(heap_t* heap, heap_size_class_t* heap_size_class, uint32_t class_idx) {
+	span_t* span = heap_size_class->partial_span;
 	if (EXPECTED(span != 0)) {
-		assert(span->block_count == _memory_size_class[span->size_class].block_count);
-		assert(!_rpmalloc_span_is_fully_utilized(span));
+		rpmalloc_assert(span->block_count == _memory_size_class[span->size_class].block_count, "Span block count corrupted");
+		rpmalloc_assert(!_rpmalloc_span_is_fully_utilized(span), "Internal failure");
 		void* block;
 		if (span->free_list) {
-			//Swap in free list if not empty
-			heap->size_class[class_idx].free_list = span->free_list;
+			//Span local free list is not empty, swap to size class free list
+			block = free_list_pop(&span->free_list);
+			heap_size_class->free_list = span->free_list;
 			span->free_list = 0;
-			block = free_list_pop(&heap->size_class[class_idx].free_list);
 		} else {
 			//If the span did not fully initialize free list, link up another page worth of blocks			
 			void* block_start = pointer_offset(span, SPAN_HEADER_SIZE + ((size_t)span->free_list_limit * span->block_size));
-			span->free_list_limit += free_list_partial_init(&heap->size_class[class_idx].free_list, &block,
+			span->free_list_limit += free_list_partial_init(&heap_size_class->free_list, &block,
 				(void*)((uintptr_t)block_start & ~(_memory_page_size - 1)), block_start,
 				span->block_count - span->free_list_limit, span->block_size);
 		}
-		assert(span->free_list_limit <= span->block_count);
+		rpmalloc_assert(span->free_list_limit <= span->block_count, "Span block count corrupted");
 		span->used_count = span->free_list_limit;
 
 		//Swap in deferred free list if present
@@ -1880,7 +2058,7 @@
 			return block;
 
 		//The span is fully utilized, unlink from partial list and add to fully utilized list
-		_rpmalloc_span_double_link_list_pop_head(&heap->size_class[class_idx].partial_span, span);
+		_rpmalloc_span_double_link_list_pop_head(&heap_size_class->partial_span, span);
 #if RPMALLOC_FIRST_CLASS_HEAPS
 		_rpmalloc_span_double_link_list_add(&heap->full_span[class_idx], span);
 #endif
@@ -1889,10 +2067,10 @@
 	}
 
 	//Find a span in one of the cache levels
-	span = _rpmalloc_heap_extract_new_span(heap, 1, class_idx);
+	span = _rpmalloc_heap_extract_new_span(heap, heap_size_class, 1, class_idx);
 	if (EXPECTED(span != 0)) {
 		//Mark span as owned by this heap and set base data, return first block
-		return _rpmalloc_span_initialize_new(heap, span, class_idx);
+		return _rpmalloc_span_initialize_new(heap, heap_size_class, span, class_idx);
 	}
 
 	return 0;
@@ -1901,32 +2079,34 @@
 //! Allocate a small sized memory block from the given heap
 static void*
 _rpmalloc_allocate_small(heap_t* heap, size_t size) {
-	assert(heap);
+	rpmalloc_assert(heap, "No thread heap");
 	//Small sizes have unique size classes
 	const uint32_t class_idx = (uint32_t)((size + (SMALL_GRANULARITY - 1)) >> SMALL_GRANULARITY_SHIFT);
+	heap_size_class_t* heap_size_class = heap->size_class + class_idx;
 	_rpmalloc_stat_inc_alloc(heap, class_idx);
-	if (EXPECTED(heap->size_class[class_idx].free_list != 0))
-		return free_list_pop(&heap->size_class[class_idx].free_list);
-	return _rpmalloc_allocate_from_heap_fallback(heap, class_idx);
+	if (EXPECTED(heap_size_class->free_list != 0))
+		return free_list_pop(&heap_size_class->free_list);
+	return _rpmalloc_allocate_from_heap_fallback(heap, heap_size_class, class_idx);
 }
 
 //! Allocate a medium sized memory block from the given heap
 static void*
 _rpmalloc_allocate_medium(heap_t* heap, size_t size) {
-	assert(heap);
+	rpmalloc_assert(heap, "No thread heap");
 	//Calculate the size class index and do a dependent lookup of the final class index (in case of merged classes)
 	const uint32_t base_idx = (uint32_t)(SMALL_CLASS_COUNT + ((size - (SMALL_SIZE_LIMIT + 1)) >> MEDIUM_GRANULARITY_SHIFT));
 	const uint32_t class_idx = _memory_size_class[base_idx].class_idx;
+	heap_size_class_t* heap_size_class = heap->size_class + class_idx;
 	_rpmalloc_stat_inc_alloc(heap, class_idx);
-	if (EXPECTED(heap->size_class[class_idx].free_list != 0))
-		return free_list_pop(&heap->size_class[class_idx].free_list);
-	return _rpmalloc_allocate_from_heap_fallback(heap, class_idx);
+	if (EXPECTED(heap_size_class->free_list != 0))
+		return free_list_pop(&heap_size_class->free_list);
+	return _rpmalloc_allocate_from_heap_fallback(heap, heap_size_class, class_idx);
 }
 
 //! Allocate a large sized memory block from the given heap
 static void*
 _rpmalloc_allocate_large(heap_t* heap, size_t size) {
-	assert(heap);
+	rpmalloc_assert(heap, "No thread heap");
 	//Calculate number of needed max sized spans (including header)
 	//Since this function is never called if size > LARGE_SIZE_LIMIT
 	//the span_count is guaranteed to be <= LARGE_CLASS_COUNT
@@ -1936,12 +2116,12 @@
 		++span_count;
 
 	//Find a span in one of the cache levels
-	span_t* span = _rpmalloc_heap_extract_new_span(heap, span_count, SIZE_CLASS_LARGE);
+	span_t* span = _rpmalloc_heap_extract_new_span(heap, 0, span_count, SIZE_CLASS_LARGE);
 	if (!span)
 		return span;
 
 	//Mark span as owned by this heap and set base data
-	assert(span->span_count == span_count);
+	rpmalloc_assert(span->span_count >= span_count, "Internal failure");
 	span->size_class = SIZE_CLASS_LARGE;
 	span->heap = heap;
 
@@ -1956,7 +2136,8 @@
 //! Allocate a huge block by mapping memory pages directly
 static void*
 _rpmalloc_allocate_huge(heap_t* heap, size_t size) {
-	assert(heap);
+	rpmalloc_assert(heap, "No thread heap");
+	_rpmalloc_heap_cache_adopt_deferred(heap, 0);
 	size += SPAN_HEADER_SIZE;
 	size_t num_pages = size >> _memory_page_size_shift;
 	if (size & (_memory_page_size - 1))
@@ -1984,6 +2165,7 @@
 //! Allocate a block of the given size
 static void*
 _rpmalloc_allocate(heap_t* heap, size_t size) {
+	_rpmalloc_stat_add64(&_allocation_counter, 1);
 	if (EXPECTED(size <= SMALL_SIZE_LIMIT))
 		return _rpmalloc_allocate_small(heap, size);
 	else if (size <= _memory_medium_size_limit)
@@ -2014,7 +2196,7 @@
 		// and size aligned to span header size multiples is less than size + alignment,
 		// then use natural alignment of blocks to provide alignment
 		size_t multiple_size = size ? (size + (SPAN_HEADER_SIZE - 1)) & ~(uintptr_t)(SPAN_HEADER_SIZE - 1) : SPAN_HEADER_SIZE;
-		assert(!(multiple_size % SPAN_HEADER_SIZE));
+		rpmalloc_assert(!(multiple_size % SPAN_HEADER_SIZE), "Failed alignment calculation");
 		if (multiple_size <= (size + alignment))
 			return _rpmalloc_allocate(heap, multiple_size);
 	}
@@ -2106,6 +2288,8 @@
 #endif
 	++heap->full_span_count;
 
+	_rpmalloc_stat_add64(&_allocation_counter, 1);
+
 	return ptr;
 }
 
@@ -2120,7 +2304,7 @@
 static void
 _rpmalloc_deallocate_direct_small_or_medium(span_t* span, void* block) {
 	heap_t* heap = span->heap;
-	assert(heap->owner_thread == get_thread_id() || !heap->owner_thread || heap->finalize);
+	rpmalloc_assert(heap->owner_thread == get_thread_id() || !heap->owner_thread || heap->finalize, "Internal failure");
 	//Add block to free list
 	if (UNEXPECTED(_rpmalloc_span_is_fully_utilized(span))) {
 		span->used_count = span->block_count;
@@ -2130,8 +2314,8 @@
 		_rpmalloc_span_double_link_list_add(&heap->size_class[span->size_class].partial_span, span);
 		--heap->full_span_count;
 	}
-	--span->used_count;
 	*((void**)block) = span->free_list;
+	--span->used_count;
 	span->free_list = block;
 	if (UNEXPECTED(span->used_count == span->list_size)) {
 		_rpmalloc_span_double_link_list_remove(&heap->size_class[span->size_class].partial_span, span);
@@ -2141,6 +2325,8 @@
 
 static void
 _rpmalloc_deallocate_defer_free_span(heap_t* heap, span_t* span) {
+	if (span->size_class != SIZE_CLASS_HUGE)
+		_rpmalloc_stat_inc(&heap->span_use[span->span_count - 1].spans_deferred);
 	//This list does not need ABA protection, no mutable side state
 	do {
 		span->free_list = (void*)atomic_load_ptr(&heap->span_free_deferred);
@@ -2193,9 +2379,9 @@
 //! Deallocate the given large memory block to the current heap
 static void
 _rpmalloc_deallocate_large(span_t* span) {
-	assert(span->size_class == SIZE_CLASS_LARGE);
-	assert(!(span->flags & SPAN_FLAG_MASTER) || !(span->flags & SPAN_FLAG_SUBSPAN));
-	assert((span->flags & SPAN_FLAG_MASTER) || (span->flags & SPAN_FLAG_SUBSPAN));
+	rpmalloc_assert(span->size_class == SIZE_CLASS_LARGE, "Bad span size class");
+	rpmalloc_assert(!(span->flags & SPAN_FLAG_MASTER) || !(span->flags & SPAN_FLAG_SUBSPAN), "Span flag corrupted");
+	rpmalloc_assert((span->flags & SPAN_FLAG_MASTER) || (span->flags & SPAN_FLAG_SUBSPAN), "Span flag corrupted");
 	//We must always defer (unless finalizing) if from another heap since we cannot touch the list or counters of another heap
 #if RPMALLOC_FIRST_CLASS_HEAPS
 	int defer = (span->heap->owner_thread && (span->heap->owner_thread != get_thread_id()) && !span->heap->finalize);
@@ -2206,7 +2392,7 @@
 		_rpmalloc_deallocate_defer_free_span(span->heap, span);
 		return;
 	}
-	assert(span->heap->full_span_count);
+	rpmalloc_assert(span->heap->full_span_count, "Heap span counter corrupted");
 	--span->heap->full_span_count;
 #if RPMALLOC_FIRST_CLASS_HEAPS
 	_rpmalloc_span_double_link_list_remove(&span->heap->large_huge_span, span);
@@ -2216,9 +2402,8 @@
 	size_t idx = span->span_count - 1;
 	atomic_decr32(&span->heap->span_use[idx].current);
 #endif
-	heap_t* heap = get_thread_heap();
-	assert(heap);
-	span->heap = heap;
+	heap_t* heap = span->heap;
+	rpmalloc_assert(heap, "No thread heap");
 	if ((span->span_count > 1) && !heap->finalize && !heap->spans_reserved) {
 		heap->span_reserve = span;
 		heap->spans_reserved = span->span_count;
@@ -2227,8 +2412,8 @@
 		} else { //SPAN_FLAG_SUBSPAN
 			span_t* master = (span_t*)pointer_offset(span, -(intptr_t)((size_t)span->offset_from_master * _memory_span_size));
 			heap->span_reserve_master = master;
-			assert(master->flags & SPAN_FLAG_MASTER);
-			assert(atomic_load32(&master->remaining_spans) >= (int32_t)span->span_count);
+			rpmalloc_assert(master->flags & SPAN_FLAG_MASTER, "Span flag corrupted");
+			rpmalloc_assert(atomic_load32(&master->remaining_spans) >= (int32_t)span->span_count, "Master span count corrupted");
 		}
 		_rpmalloc_stat_inc(&heap->span_use[idx].spans_to_reserved);
 	} else {
@@ -2240,7 +2425,7 @@
 //! Deallocate the given huge span
 static void
 _rpmalloc_deallocate_huge(span_t* span) {
-	assert(span->heap);
+	rpmalloc_assert(span->heap, "No span heap");
 #if RPMALLOC_FIRST_CLASS_HEAPS
 	int defer = (span->heap->owner_thread && (span->heap->owner_thread != get_thread_id()) && !span->heap->finalize);
 #else
@@ -2250,7 +2435,7 @@
 		_rpmalloc_deallocate_defer_free_span(span->heap, span);
 		return;
 	}
-	assert(span->heap->full_span_count);
+	rpmalloc_assert(span->heap->full_span_count, "Heap span counter corrupted");
 	--span->heap->full_span_count;
 #if RPMALLOC_FIRST_CLASS_HEAPS
 	_rpmalloc_span_double_link_list_remove(&span->heap->large_huge_span, span);
@@ -2265,6 +2450,7 @@
 //! Deallocate the given block
 static void
 _rpmalloc_deallocate(void* p) {
+	_rpmalloc_stat_add64(&_deallocation_counter, 1);
 	//Grab the span (always at start of span, using span alignment)
 	span_t* span = (span_t*)((uintptr_t)p & _memory_span_mask);
 	if (UNEXPECTED(!span))
@@ -2277,7 +2463,6 @@
 		_rpmalloc_deallocate_huge(span);
 }
 
-
 ////////////
 ///
 /// Reallocation entry points
@@ -2295,7 +2480,7 @@
 		span_t* span = (span_t*)((uintptr_t)p & _memory_span_mask);
 		if (EXPECTED(span->size_class < SIZE_CLASS_COUNT)) {
 			//Small/medium sized block
-			assert(span->span_count == 1);
+			rpmalloc_assert(span->span_count == 1, "Span counter corrupted");
 			void* blocks_start = pointer_offset(span, SPAN_HEADER_SIZE);
 			uint32_t block_offset = (uint32_t)pointer_diff(p, blocks_start);
 			uint32_t block_idx = block_offset / span->block_size;
@@ -2510,7 +2695,7 @@
 				_memory_page_size = 2 * 1024 * 1024;
 				_memory_map_granularity = _memory_page_size;
 			}
-#elif defined(__APPLE__)
+#elif defined(__APPLE__) || defined(__NetBSD__)
 			_memory_huge_pages = 1;
 			_memory_page_size = 2 * 1024 * 1024;
 			_memory_map_granularity = _memory_page_size;
@@ -2605,7 +2790,7 @@
 	_memory_span_release_count_large = (_memory_span_release_count > 8 ? (_memory_span_release_count / 4) : 2);
 
 #if (defined(__APPLE__) || defined(__HAIKU__)) && ENABLE_PRELOAD
-	if (pthread_key_create(&_memory_thread_heap, _rpmalloc_heap_release_raw))
+	if (pthread_key_create(&_memory_thread_heap, _rpmalloc_heap_release_raw_fc))
 		return -1;
 #endif
 #if defined(_WIN32) && (!defined(BUILD_DYNAMIC_LINK) || !BUILD_DYNAMIC_LINK)
@@ -2637,6 +2822,18 @@
 #if RPMALLOC_FIRST_CLASS_HEAPS
 	_memory_first_class_orphan_heaps = 0;
 #endif
+#if ENABLE_STATISTICS
+	atomic_store32(&_memory_active_heaps, 0);
+	atomic_store32(&_mapped_pages, 0);
+	_mapped_pages_peak = 0;
+	atomic_store32(&_master_spans, 0);
+	atomic_store32(&_reserved_spans, 0);
+	atomic_store32(&_mapped_total, 0);
+	atomic_store32(&_unmapped_total, 0);
+	atomic_store32(&_mapped_pages_os, 0);
+	atomic_store32(&_huge_pages_current, 0);
+	_huge_pages_peak = 0;
+#endif
 	memset(_memory_heaps, 0, sizeof(_memory_heaps));
 	atomic_store32_release(&_memory_global_lock, 0);
 
@@ -2648,7 +2845,7 @@
 //! Finalize the allocator
 void
 rpmalloc_finalize(void) {
-	rpmalloc_thread_finalize();
+	rpmalloc_thread_finalize(1);
 	//rpmalloc_dump_statistics(stdout);
 
 	if (_memory_global_reserve) {
@@ -2685,9 +2882,9 @@
 #endif
 #if ENABLE_STATISTICS
 	//If you hit these asserts you probably have memory leaks (perhaps global scope data doing dynamic allocations) or double frees in your code
-	assert(atomic_load32(&_mapped_pages) == 0);
-	assert(atomic_load32(&_reserved_spans) == 0);
-	assert(atomic_load32(&_mapped_pages_os) == 0);
+	rpmalloc_assert(atomic_load32(&_mapped_pages) == 0, "Memory leak detected");
+	rpmalloc_assert(atomic_load32(&_reserved_spans) == 0, "Memory leak detected");
+	rpmalloc_assert(atomic_load32(&_mapped_pages_os) == 0, "Memory leak detected");
 #endif
 
 	_rpmalloc_initialized = 0;
@@ -2710,10 +2907,10 @@
 
 //! Finalize thread, orphan heap
 void
-rpmalloc_thread_finalize(void) {
+rpmalloc_thread_finalize(int release_caches) {
 	heap_t* heap = get_thread_heap_raw();
 	if (heap)
-		_rpmalloc_heap_release_raw(heap);
+		_rpmalloc_heap_release_raw(heap, release_caches);
 	set_thread_heap(0);
 #if defined(_WIN32) && (!defined(BUILD_DYNAMIC_LINK) || !BUILD_DYNAMIC_LINK)
 	FlsSetValue(fls_key, 0);
@@ -2964,13 +3161,14 @@
 			((size_t)atomic_load32(&heap->size_class_use[iclass].spans_from_reserved) * _memory_span_size) / (size_t)(1024 * 1024),
 			atomic_load32(&heap->size_class_use[iclass].spans_map_calls));
 	}
-	fprintf(file, "Spans  Current     Peak  PeakMiB  Cached  ToCacheMiB FromCacheMiB ToReserveMiB FromReserveMiB ToGlobalMiB FromGlobalMiB  MmapCalls\n");
+	fprintf(file, "Spans  Current     Peak Deferred  PeakMiB  Cached  ToCacheMiB FromCacheMiB ToReserveMiB FromReserveMiB ToGlobalMiB FromGlobalMiB  MmapCalls\n");
 	for (size_t iclass = 0; iclass < LARGE_CLASS_COUNT; ++iclass) {
 		if (!atomic_load32(&heap->span_use[iclass].high) && !atomic_load32(&heap->span_use[iclass].spans_map_calls))
 			continue;
-		fprintf(file, "%4u: %8d %8u %8zu %7u %11zu %12zu %12zu %14zu %11zu %13zu %10u\n", (uint32_t)(iclass + 1),
+		fprintf(file, "%4u: %8d %8u %8u %8zu %7u %11zu %12zu %12zu %14zu %11zu %13zu %10u\n", (uint32_t)(iclass + 1),
 			atomic_load32(&heap->span_use[iclass].current),
 			atomic_load32(&heap->span_use[iclass].high),
+			atomic_load32(&heap->span_use[iclass].spans_deferred),
 			((size_t)atomic_load32(&heap->span_use[iclass].high) * (size_t)_memory_span_size * (iclass + 1)) / (size_t)(1024 * 1024),
 #if ENABLE_THREAD_CACHE
 			(unsigned int)(!iclass ? heap->span_cache.count : heap->span_large_cache[iclass - 1].count),
@@ -2985,6 +3183,7 @@
 			((size_t)atomic_load32(&heap->span_use[iclass].spans_from_global) * (size_t)_memory_span_size * (iclass + 1)) / (size_t)(1024 * 1024),
 			atomic_load32(&heap->span_use[iclass].spans_map_calls));
 	}
+	fprintf(file, "Full spans: %zu\n", heap->full_span_count);
 	fprintf(file, "ThreadToGlobalMiB GlobalToThreadMiB\n");
 	fprintf(file, "%17zu %17zu\n", (size_t)atomic_load64(&heap->thread_to_global) / (size_t)(1024 * 1024), (size_t)atomic_load64(&heap->global_to_thread) / (size_t)(1024 * 1024));
 }
@@ -2994,16 +3193,14 @@
 void
 rpmalloc_dump_statistics(void* file) {
 #if ENABLE_STATISTICS
-	//If you hit this assert, you still have active threads or forgot to finalize some thread(s)
-	assert(atomic_load32(&_memory_active_heaps) == 0);
 	for (size_t list_idx = 0; list_idx < HEAP_ARRAY_SIZE; ++list_idx) {
 		heap_t* heap = _memory_heaps[list_idx];
 		while (heap) {
 			int need_dump = 0;
 			for (size_t iclass = 0; !need_dump && (iclass < SIZE_CLASS_COUNT); ++iclass) {
 				if (!atomic_load32(&heap->size_class_use[iclass].alloc_total)) {
-					assert(!atomic_load32(&heap->size_class_use[iclass].free_total));
-					assert(!atomic_load32(&heap->size_class_use[iclass].spans_map_calls));
+					rpmalloc_assert(!atomic_load32(&heap->size_class_use[iclass].free_total), "Heap statistics counter mismatch");
+					rpmalloc_assert(!atomic_load32(&heap->size_class_use[iclass].spans_map_calls), "Heap statistics counter mismatch");
 					continue;
 				}
 				need_dump = 1;
@@ -3024,6 +3221,20 @@
 	fprintf(file, "HugeCurrentMiB HugePeakMiB\n");
 	fprintf(file, "%14zu %11zu\n", huge_current / (size_t)(1024 * 1024), huge_peak / (size_t)(1024 * 1024));
 
+	size_t global_cache = 0;
+	for (size_t iclass = 0; iclass < LARGE_CLASS_COUNT; ++iclass) {
+		global_cache_t* cache = _memory_span_cache + iclass;
+		global_cache += (size_t)cache->count * (iclass + 1) * _memory_span_size;
+
+		span_t* span = cache->overflow;
+		while (span) {
+			global_cache += (iclass + 1) * _memory_span_size;
+			span = span->next;
+		}
+	}
+	fprintf(file, "GlobalCacheMiB\n");
+	fprintf(file, "%14zu\n", global_cache / (size_t)(1024 * 1024));
+
 	size_t mapped = (size_t)atomic_load32(&_mapped_pages) * _memory_page_size;
 	size_t mapped_os = (size_t)atomic_load32(&_mapped_pages_os) * _memory_page_size;
 	size_t mapped_peak = (size_t)_mapped_pages_peak * _memory_page_size;
@@ -3040,9 +3251,17 @@
 		reserved_total / (size_t)(1024 * 1024));
 
 	fprintf(file, "\n");
-#else
-	(void)sizeof(file);
+#if 0
+	int64_t allocated = atomic_load64(&_allocation_counter);
+	int64_t deallocated = atomic_load64(&_deallocation_counter);
+	fprintf(file, "Allocation count: %lli\n", allocated);
+	fprintf(file, "Deallocation count: %lli\n", deallocated);
+	fprintf(file, "Current allocations: %lli\n", (allocated - deallocated));
+	fprintf(file, "Master spans: %d\n", atomic_load32(&_master_spans));
+	fprintf(file, "Dangling master spans: %d\n", atomic_load32(&_unmapped_master_spans));
 #endif
+#endif
+	(void)sizeof(file);
 }
 
 #if RPMALLOC_FIRST_CLASS_HEAPS
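
The statistics hunks above add a Deferred column to the span usage table and a GlobalCacheMiB summary. As a usage sketch (assuming the library was built with ENABLE_STATISTICS, otherwise the dump is effectively a no-op; the helper name below is illustrative):

	#include <stdio.h>
	#include <rpmalloc.h>

	//! Illustrative helper: flush deferred frees, then dump allocator statistics
	static void
	dump_allocator_stats(void) {
		rpmalloc_thread_collect();          // process pending deferred deallocations
		rpmalloc_dump_statistics(stdout);   // FILE* is passed as void*
	}
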
@@ -3062,7 +3281,7 @@
 extern inline void
 rpmalloc_heap_release(rpmalloc_heap_t* heap) {
 	if (heap)
-		_rpmalloc_heap_release(heap, 1);
+		_rpmalloc_heap_release(heap, 1, 1);
 }
 
 extern inline RPMALLOC_ALLOCATOR void*
@@ -3070,7 +3289,7 @@
 #if ENABLE_VALIDATE_ARGS
 	if (size >= MAX_ALLOC_SIZE) {
 		errno = EINVAL;
-		return ptr;
+		return 0;
 	}
 #endif
 	return _rpmalloc_allocate(heap, size);
@@ -3081,7 +3300,7 @@
 #if ENABLE_VALIDATE_ARGS
 	if (size >= MAX_ALLOC_SIZE) {
 		errno = EINVAL;
-		return ptr;
+		return 0;
 	}
 #endif
 	return _rpmalloc_aligned_allocate(heap, alignment, size);
diff --git a/rpmalloc/rpmalloc.h b/rpmalloc/rpmalloc.h
index 6b85c0a..b1fa757 100644
--- a/rpmalloc/rpmalloc.h
+++ b/rpmalloc/rpmalloc.h
@@ -153,6 +153,9 @@
 	//  If you set a memory_unmap function, you must also set a memory_map function or
 	//  else the default implementation will be used for both.
 	void (*memory_unmap)(void* address, size_t size, size_t offset, size_t release);
+	//! Called when an assert fails, if asserts are enabled. Will use the standard assert()
+	//  if this is not set.
+	void (*error_callback)(const char* message);
 	//! Size of memory pages. The page size MUST be a power of two. All memory mapping
 	//  requests to memory_map will be made with size set to a multiple of the page size.
 	//  Used if RPMALLOC_CONFIGURABLE is defined to 1, otherwise system page size is used.
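
A usage sketch for the new error_callback field, mirroring the test added in test/main.c further down in this diff (the callback is only invoked when the library is built with ENABLE_ASSERT; the function names below are illustrative):

	#include <stdio.h>
	#include <rpmalloc.h>

	//! Route rpmalloc assert failures to the application's own logging
	static void
	on_rpmalloc_error(const char* message) {
		fprintf(stderr, "rpmalloc: %s\n", message);
	}

	int
	main(void) {
		rpmalloc_config_t config = {0};
		config.error_callback = on_rpmalloc_error;
		rpmalloc_initialize_config(&config);
		// ... application code ...
		rpmalloc_finalize();
		return 0;
	}
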
@@ -200,7 +203,7 @@
 
 //! Finalize allocator for calling thread
 RPMALLOC_EXPORT void
-rpmalloc_thread_finalize(void);
+rpmalloc_thread_finalize(int release_caches);
 
 //! Perform deferred deallocations pending for the calling thread heap
 RPMALLOC_EXPORT void
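
A minimal sketch of a worker thread using the new signature (per the changelog, a nonzero release_caches keeps the pre-1.4.2 behaviour of releasing the thread caches to the global cache, zero skips the release; the function name below is illustrative and assumes rpmalloc_initialize() was already called by the main thread):

	#include <rpmalloc.h>

	//! Illustrative worker thread entry point
	static void
	worker_proc(void* arg) {
		(void)arg;
		rpmalloc_thread_initialize();
		void* block = rpmalloc(128);
		// ... thread work ...
		rpfree(block);
		rpmalloc_thread_finalize(1);  // nonzero: flush thread caches to the global cache
	}
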
@@ -284,7 +287,7 @@
 typedef struct heap_t rpmalloc_heap_t;
 
 //! Acquire a new heap. Will reuse existing released heaps or allocate memory for a new heap
-//  if none available. Heap API is imlemented with the strict assumption that only one single
+//  if none available. Heap API is implemented with the strict assumption that only one single
 //  thread will call heap functions for a given heap at any given time, no functions are thread safe.
 RPMALLOC_EXPORT rpmalloc_heap_t*
 rpmalloc_heap_acquire(void);
@@ -327,7 +330,7 @@
 //  less than memory page size. A caveat of rpmalloc internals is that this must also be strictly less than
 //  the span size (default 64KiB).
 RPMALLOC_EXPORT RPMALLOC_ALLOCATOR void*
-rpmalloc_heap_aligned_realloc(rpmalloc_heap_t* heap, void* ptr, size_t alignment, size_t size, unsigned int flags) RPMALLOC_ATTRIB_MALLOC RPMALLOC_ATTRIB_ALLOC_SIZE(3);
+rpmalloc_heap_aligned_realloc(rpmalloc_heap_t* heap, void* ptr, size_t alignment, size_t size, unsigned int flags) RPMALLOC_ATTRIB_MALLOC RPMALLOC_ATTRIB_ALLOC_SIZE(4);
 
 //! Free the given memory block from the given heap. The memory block MUST be allocated
 //  by the same heap given to this function.
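
For context on the heap functions touched above, a minimal first-class heap sketch (requires building with RPMALLOC_FIRST_CLASS_HEAPS; as noted above, a given heap must only be used from one thread at a time; the function name below is illustrative):

	#include <rpmalloc.h>

	//! Illustrative single-threaded use of the first-class heap API
	static void
	heap_example(void) {
		rpmalloc_initialize();
		rpmalloc_heap_t* heap = rpmalloc_heap_acquire();
		void* block = rpmalloc_heap_alloc(heap, 256);
		// ... use block ...
		rpmalloc_heap_free(heap, block);
		rpmalloc_heap_release(heap);  // heap and its remaining caches return to the allocator
		rpmalloc_finalize();
	}
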
diff --git a/test/main.c b/test/main.c
index f8db4c7..0a92ac7 100644
--- a/test/main.c
+++ b/test/main.c
@@ -454,7 +454,7 @@
 	thread_sleep(1);
 
 	if (arg.init_fini_each_loop)
-		rpmalloc_thread_finalize();
+		rpmalloc_thread_finalize(1);
 
 	for (iloop = 0; iloop < arg.loops; ++iloop) {
 		if (arg.init_fini_each_loop)
@@ -504,7 +504,7 @@
 		}
 
 		if (arg.init_fini_each_loop)
-			rpmalloc_thread_finalize();
+			rpmalloc_thread_finalize(1);
 	}
 
 	if (arg.init_fini_each_loop)
@@ -513,7 +513,7 @@
 	rpfree(data);
 	rpfree(addr);
 
-	rpmalloc_thread_finalize();
+	rpmalloc_thread_finalize(1);
 
 end:
 	thread_exit((uintptr_t)ret);
@@ -676,7 +676,7 @@
 	}
 
 end:
-	rpmalloc_thread_finalize();
+	rpmalloc_thread_finalize(1);
 
 	thread_exit((uintptr_t)ret);
 }
@@ -777,12 +777,12 @@
 			rpfree(addr[ipass]);
 		}
 
-		rpmalloc_thread_finalize();
+		rpmalloc_thread_finalize(1);
 		thread_yield();
 	}
 
 end:
-	rpmalloc_thread_finalize();
+	rpmalloc_thread_finalize(1);
 	thread_exit((uintptr_t)ret);
 }
 
@@ -1061,6 +1061,36 @@
 	return 0;
 }
 
+static int got_error;
+
+static void
+test_error_callback(const char* message) {
+	//printf("%s\n", message);
+	(void)sizeof(message);
+	got_error = 1;
+}
+
+static int
+test_error(void) {
+	//printf("Detecting memory leak\n");
+
+	rpmalloc_config_t config = {0};
+	config.error_callback = test_error_callback;
+	rpmalloc_initialize_config(&config);
+
+	rpmalloc(10);
+
+	rpmalloc_finalize();
+
+	if (!got_error) {
+		printf("Leak not detected and reported as expected\n");
+		return -1;
+	}
+
+	printf("Error detection test passed\n");
+	return 0;
+}
+
 int
 test_run(int argc, char** argv) {
 	(void)sizeof(argc);
@@ -1080,6 +1110,8 @@
 		return -1;
 	if (test_first_class_heaps())
 		return -1;
+	if (test_error())
+		return -1;
 	printf("All tests passed\n");
 	return 0;
 }
diff --git a/test/thread.c b/test/thread.c
index 152b268..ff4758b 100644
--- a/test/thread.c
+++ b/test/thread.c
@@ -1,5 +1,6 @@
 
 #include <thread.h>
+#include <errno.h>
 
 #ifdef _MSC_VER
 #  define ATTRIBUTE_NORETURN
@@ -81,10 +82,12 @@
 #ifdef _WIN32
 	SleepEx((DWORD)milliseconds, 0);
 #else
-	struct timespec ts;
+	struct timespec ts, remaining;
 	ts.tv_sec  = milliseconds / 1000;
 	ts.tv_nsec = (long)(milliseconds % 1000) * 1000000L;
-	nanosleep(&ts, 0);
+
+	while (nanosleep(&ts, &remaining) != 0 && errno == EINTR)
+		ts = remaining;
 #endif
 }