Merge branch 'release/1.2.1'
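
The argument validation added to valloc and pvalloc in this release can be illustrated with a minimal sketch. It assumes the same wrap-around check as the ENABLE_VALIDATE_ARGS path in malloc.c. The helper name round_up_to_page is hypothetical and not part of rpmalloc.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not an rpmalloc API): round size up to the next
   multiple of page_size, rejecting requests that wrap around SIZE_MAX. */
static int
round_up_to_page(size_t size, size_t page_size, size_t* aligned_size) {
	size_t rounded = size;
	if (rounded % page_size)
		rounded = (1 + (rounded / page_size)) * page_size;
	/* If the multiplication wrapped, the result is smaller than the input */
	if (rounded < size) {
		errno = EINVAL;
		return -1;
	}
	*aligned_size = rounded;
	return 0;
}
```

A caller would pass the rounded value, not the original size, on to the page-aligned allocation. For a 4096-byte page, a request of 5000 bytes rounds to 8192, while a request of SIZE_MAX - 10 is rejected with EINVAL.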
diff --git a/CHANGELOG b/CHANGELOG
index df801a6..565dcc2 100644
--- a/CHANGELOG
+++ b/CHANGELOG
@@ -1,3 +1,15 @@
+1.2.1
+
+Split library into an rpmalloc-only base library and a preloadable malloc wrapper library.
+
+Add arg validation to valloc and pvalloc.
+
+Change ARM memory barrier instructions to dmb ish/ishst for compatibility.
+
+Improve preload compatibility on Apple platforms by using pthread key for TLS in wrapper library.
+
+Fix ABA issue in orphaned heap linked list.
+
 1.2
 
 Dual license under MIT
diff --git a/README.md b/README.md
index 15b3025..b744fb9 100644
--- a/README.md
+++ b/README.md
@@ -45,12 +45,14 @@
 
 Then simply use the __rpmalloc__/__rpfree__ and the other malloc style replacement functions. Remember all allocations are 16-byte aligned, so no need to call the explicit rpmemalign/rpaligned_alloc/rpposix_memalign functions unless you need greater alignment, they are simply wrappers to make it easier to replace in existing code.
 
-If you wish to override the standard library malloc family of functions and have automatic initialization/finalization of process and threads, also include the `malloc.c` file in your project. The automatic init/fini is only implemented for Linux and macOS targets.
+If you wish to override the standard library malloc family of functions and have automatic initialization/finalization of process and threads, also include the `malloc.c` file in your project. The automatic init/fini is only implemented for Linux and macOS targets. The list of libc entry points replaced may not be complete; use libc replacement only as a convenience for testing the library on an existing code base, not as a final solution.
 
 # Building
-To compile as a static library run the configure python script which generates a Ninja build script, then build using ninja. Or use the Visual Studio or XCode projects available in the build subdirectories. This also includes the malloc overrides and init/fini glue code.
+To compile as a static library run the configure python script which generates a Ninja build script, then build using ninja. The ninja build produces two static libraries, one named `rpmalloc` and one named `rpmallocwrap`, where the latter includes the libc entry point overrides.
 
-The configure + ninja build also produces a shared object/dynamic library that can be used with LD_PRELOAD/DYLD_INSERT_LIBRARIES to inject in a preexisting binary, replacing any malloc/free family of function calls. This is only implemented for Linux and macOS targets.
+The configure + ninja build also produces two shared object/dynamic libraries. The `rpmallocwrap` shared library can be used with LD_PRELOAD/DYLD_INSERT_LIBRARIES to inject in a preexisting binary, replacing any malloc/free family of function calls. This is only implemented for Linux and macOS targets. The list of libc entry points replaced may not be complete; use preloading as a convenience for testing the library on an existing binary, not as a final solution.
+
+The latest stable release is available in the master branch. For the latest development code, use the develop branch.
 
 # Cache configuration options
 Free memory pages are cached both per thread and in a global cache for all threads. The size of the thread caches is determined by an adaptive scheme where each cache is limited by a percentage of the maximum allocation count of the corresponding size class. The size of the global caches is determined by a multiple of the maximum of all thread caches. The factors controlling the cache sizes can be set by either defining one of four presets, or by editing the individual defines in the `rpmalloc.c` source file for fine tuned control. If you do not define any of the following three directives, the default preset will be used which is to increase caches and prioritize performance over memory overhead (but not making caches unlimited).
@@ -111,4 +113,6 @@
 
 VirtualAlloc has an internal granularity of 64KiB. However, mmap lacks this granularity control, and the implementation instead keeps track and atomically increases a running address counter of where memory pages should be mapped to in the virtual address range. If some other code in the process uses mmap to reserve a part of virtual memory spance this counter needs to catch up and resync in order to keep the 64KiB granularity of span addresses, which could potentially be time consuming.
 
-The free, realloc and usable size functions all require the passed pointer to be within the first 64KiB page block of the start of the memory block. You cannot pass in any pointer from the memory block address range.
+The free, realloc and usable size functions all require the passed pointer to be within the first 64KiB page block of the start of the memory block. You cannot pass in any pointer from the memory block address range.
+
+All entry points assume the passed values are valid; for example, passing an invalid pointer to free would most likely result in a segmentation fault. The library does not try to guard against errors.
diff --git a/build/msvs/rpmalloc.vcxproj b/build/msvs/rpmalloc.vcxproj
index fcfe11f..21dbc2e 100644
--- a/build/msvs/rpmalloc.vcxproj
+++ b/build/msvs/rpmalloc.vcxproj
@@ -19,7 +19,6 @@
     </ProjectConfiguration>
   </ItemGroup>
   <ItemGroup>
-    <ClCompile Include="..\..\rpmalloc\malloc.c" />
     <ClCompile Include="..\..\rpmalloc\rpmalloc.c" />
   </ItemGroup>
   <ItemGroup>
diff --git a/build/ninja/generator.py b/build/ninja/generator.py
index fbc79f9..725f490 100644
--- a/build/ninja/generator.py
+++ b/build/ninja/generator.py
@@ -110,11 +110,11 @@
   def writer(self):
     return self.writer
 
-  def lib(self, module, sources, basepath = None, configs = None, includepaths = None, variables = None, externalsources = False):
-    return self.toolchain.lib(self.writer, module, sources, basepath, configs, includepaths, variables, None, externalsources)
+  def lib(self, module, sources, libname, basepath = None, configs = None, includepaths = None, variables = None, externalsources = False):
+    return self.toolchain.lib(self.writer, module, sources, libname, basepath, configs, includepaths, variables, None, externalsources)
 
-  def sharedlib(self, module, sources, basepath = None, configs = None, includepaths = None, libpaths = None, implicit_deps = None, libs = None, frameworks = None, variables = None, externalsources = False):
-    return self.toolchain.sharedlib(self.writer, module, sources, basepath, configs, includepaths, libpaths, implicit_deps, libs, frameworks, variables, None, externalsources)
+  def sharedlib(self, module, sources, libname, basepath = None, configs = None, includepaths = None, libpaths = None, implicit_deps = None, libs = None, frameworks = None, variables = None, externalsources = False):
+    return self.toolchain.sharedlib(self.writer, module, sources, libname, basepath, configs, includepaths, libpaths, implicit_deps, libs, frameworks, variables, None, externalsources)
 
   def bin(self, module, sources, binname, basepath = None, configs = None, includepaths = None, libpaths = None, implicit_deps = None, libs = None, frameworks = None, variables = None, outpath = None, externalsources = False):
     return self.toolchain.bin(self.writer, module, sources, binname, basepath, configs, includepaths, libpaths, implicit_deps, libs, frameworks, variables, outpath, externalsources)
diff --git a/build/ninja/toolchain.py b/build/ninja/toolchain.py
index 6cb7778..dcad7c9 100644
--- a/build/ninja/toolchain.py
+++ b/build/ninja/toolchain.py
@@ -368,24 +368,24 @@
     writer.newline()
     return built
 
-  def lib(self, writer, module, sources, basepath, configs, includepaths, variables, outpath, externalsources):
+  def lib(self, writer, module, sources, libname, basepath, configs, includepaths, variables, outpath, externalsources):
     built = {}
     if basepath == None:
       basepath = ''
     if configs is None:
       configs = list(self.configs)
-    libfile = self.libprefix + module + self.staticlibext
+    libfile = self.libprefix + libname + self.staticlibext
     if outpath is None:
       outpath = self.libpath
     return self.build_sources(writer, 'lib', 'multilib', module, sources, libfile, basepath, outpath, configs, includepaths, None, None, None, variables, None, externalsources)
 
-  def sharedlib(self, writer, module, sources, basepath, configs, includepaths, libpaths, implicit_deps, libs, frameworks, variables, outpath, externalsources):
+  def sharedlib(self, writer, module, sources, libname, basepath, configs, includepaths, libpaths, implicit_deps, libs, frameworks, variables, outpath, externalsources):
     built = {}
     if basepath == None:
       basepath = ''
     if configs is None:
       configs = list(self.configs)
-    libfile = self.libprefix + module + self.dynamiclibext
+    libfile = self.libprefix + libname + self.dynamiclibext
     if outpath is None:
       outpath = self.binpath
     return self.build_sources(writer, 'sharedlib', 'multisharedlib', module, sources, libfile, basepath, outpath, configs, includepaths, libpaths, libs, implicit_deps, variables, frameworks, externalsources)
diff --git a/build/xcode/rpmalloc.xcworkspace/contents.xcworkspacedata b/build/xcode/rpmalloc.xcworkspace/contents.xcworkspacedata
deleted file mode 100644
index 26b72c3..0000000
--- a/build/xcode/rpmalloc.xcworkspace/contents.xcworkspacedata
+++ /dev/null
@@ -1,7 +0,0 @@
-<?xml version="1.0" encoding="UTF-8"?>
-<Workspace
-   version = "1.0">
-   <FileRef
-      location = "group:rpmalloc/rpmalloc.xcodeproj">
-   </FileRef>
-</Workspace>
diff --git a/build/xcode/rpmalloc/rpmalloc.xcodeproj/project.pbxproj b/build/xcode/rpmalloc/rpmalloc.xcodeproj/project.pbxproj
deleted file mode 100644
index f44b1b3..0000000
--- a/build/xcode/rpmalloc/rpmalloc.xcodeproj/project.pbxproj
+++ /dev/null
@@ -1,279 +0,0 @@
-// !$*UTF8*$!
-{
-	archiveVersion = 1;
-	classes = {
-	};
-	objectVersion = 46;
-	objects = {
-
-/* Begin PBXBuildFile section */
-		CD3DDFC81DEA2FC8007D58FA /* rpmalloc.c in Sources */ = {isa = PBXBuildFile; fileRef = CD3DDFC61DEA2FC8007D58FA /* rpmalloc.c */; };
-		CD3DDFC91DEA2FC8007D58FA /* rpmalloc.h in Headers */ = {isa = PBXBuildFile; fileRef = CD3DDFC71DEA2FC8007D58FA /* rpmalloc.h */; };
-/* End PBXBuildFile section */
-
-/* Begin PBXFileReference section */
-		CD3DDFBF1DEA2FAF007D58FA /* librpmalloc.a */ = {isa = PBXFileReference; explicitFileType = archive.ar; includeInIndex = 0; path = librpmalloc.a; sourceTree = BUILT_PRODUCTS_DIR; };
-		CD3DDFC61DEA2FC8007D58FA /* rpmalloc.c */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.c; name = rpmalloc.c; path = ../../../rpmalloc/rpmalloc.c; sourceTree = "<group>"; };
-		CD3DDFC71DEA2FC8007D58FA /* rpmalloc.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; name = rpmalloc.h; path = ../../../rpmalloc/rpmalloc.h; sourceTree = "<group>"; };
-/* End PBXFileReference section */
-
-/* Begin PBXFrameworksBuildPhase section */
-		CD3DDFBC1DEA2FAF007D58FA /* Frameworks */ = {
-			isa = PBXFrameworksBuildPhase;
-			buildActionMask = 2147483647;
-			files = (
-			);
-			runOnlyForDeploymentPostprocessing = 0;
-		};
-/* End PBXFrameworksBuildPhase section */
-
-/* Begin PBXGroup section */
-		CD3DDFB61DEA2FAF007D58FA = {
-			isa = PBXGroup;
-			children = (
-				CD3DDFCA1DEA3012007D58FA /* rpmalloc */,
-				CD3DDFC01DEA2FAF007D58FA /* Products */,
-			);
-			sourceTree = "<group>";
-		};
-		CD3DDFC01DEA2FAF007D58FA /* Products */ = {
-			isa = PBXGroup;
-			children = (
-				CD3DDFBF1DEA2FAF007D58FA /* librpmalloc.a */,
-			);
-			name = Products;
-			sourceTree = "<group>";
-		};
-		CD3DDFCA1DEA3012007D58FA /* rpmalloc */ = {
-			isa = PBXGroup;
-			children = (
-				CD3DDFC61DEA2FC8007D58FA /* rpmalloc.c */,
-				CD3DDFC71DEA2FC8007D58FA /* rpmalloc.h */,
-			);
-			name = rpmalloc;
-			sourceTree = "<group>";
-		};
-/* End PBXGroup section */
-
-/* Begin PBXHeadersBuildPhase section */
-		CD3DDFBD1DEA2FAF007D58FA /* Headers */ = {
-			isa = PBXHeadersBuildPhase;
-			buildActionMask = 2147483647;
-			files = (
-				CD3DDFC91DEA2FC8007D58FA /* rpmalloc.h in Headers */,
-			);
-			runOnlyForDeploymentPostprocessing = 0;
-		};
-/* End PBXHeadersBuildPhase section */
-
-/* Begin PBXNativeTarget section */
-		CD3DDFBE1DEA2FAF007D58FA /* rpmalloc */ = {
-			isa = PBXNativeTarget;
-			buildConfigurationList = CD3DDFC31DEA2FAF007D58FA /* Build configuration list for PBXNativeTarget "rpmalloc" */;
-			buildPhases = (
-				CD3DDFBB1DEA2FAF007D58FA /* Sources */,
-				CD3DDFBC1DEA2FAF007D58FA /* Frameworks */,
-				CD3DDFBD1DEA2FAF007D58FA /* Headers */,
-			);
-			buildRules = (
-			);
-			dependencies = (
-			);
-			name = rpmalloc;
-			productName = rpmalloc;
-			productReference = CD3DDFBF1DEA2FAF007D58FA /* librpmalloc.a */;
-			productType = "com.apple.product-type.library.static";
-		};
-/* End PBXNativeTarget section */
-
-/* Begin PBXProject section */
-		CD3DDFB71DEA2FAF007D58FA /* Project object */ = {
-			isa = PBXProject;
-			attributes = {
-				LastUpgradeCheck = 0820;
-				ORGANIZATIONNAME = "Rampant Pixels";
-				TargetAttributes = {
-					CD3DDFBE1DEA2FAF007D58FA = {
-						CreatedOnToolsVersion = 8.1;
-						DevelopmentTeam = TDSNBN44YV;
-						ProvisioningStyle = Automatic;
-					};
-				};
-			};
-			buildConfigurationList = CD3DDFBA1DEA2FAF007D58FA /* Build configuration list for PBXProject "rpmalloc" */;
-			compatibilityVersion = "Xcode 3.2";
-			developmentRegion = English;
-			hasScannedForEncodings = 0;
-			knownRegions = (
-				en,
-			);
-			mainGroup = CD3DDFB61DEA2FAF007D58FA;
-			productRefGroup = CD3DDFC01DEA2FAF007D58FA /* Products */;
-			projectDirPath = "";
-			projectRoot = "";
-			targets = (
-				CD3DDFBE1DEA2FAF007D58FA /* rpmalloc */,
-			);
-		};
-/* End PBXProject section */
-
-/* Begin PBXSourcesBuildPhase section */
-		CD3DDFBB1DEA2FAF007D58FA /* Sources */ = {
-			isa = PBXSourcesBuildPhase;
-			buildActionMask = 2147483647;
-			files = (
-				CD3DDFC81DEA2FC8007D58FA /* rpmalloc.c in Sources */,
-			);
-			runOnlyForDeploymentPostprocessing = 0;
-		};
-/* End PBXSourcesBuildPhase section */
-
-/* Begin XCBuildConfiguration section */
-		CD3DDFC11DEA2FAF007D58FA /* Debug */ = {
-			isa = XCBuildConfiguration;
-			buildSettings = {
-				ALWAYS_SEARCH_USER_PATHS = NO;
-				CLANG_ANALYZER_NONNULL = YES;
-				CLANG_CXX_LANGUAGE_STANDARD = "c++0x";
-				CLANG_CXX_LIBRARY = "libc++";
-				CLANG_ENABLE_CODE_COVERAGE = NO;
-				CLANG_ENABLE_MODULES = NO;
-				CLANG_ENABLE_OBJC_ARC = YES;
-				CLANG_WARN_BOOL_CONVERSION = YES;
-				CLANG_WARN_CONSTANT_CONVERSION = YES;
-				CLANG_WARN_DIRECT_OBJC_ISA_USAGE = YES_ERROR;
-				CLANG_WARN_DOCUMENTATION_COMMENTS = YES;
-				CLANG_WARN_EMPTY_BODY = YES;
-				CLANG_WARN_ENUM_CONVERSION = YES;
-				CLANG_WARN_INFINITE_RECURSION = YES;
-				CLANG_WARN_INT_CONVERSION = YES;
-				CLANG_WARN_OBJC_ROOT_CLASS = YES_ERROR;
-				CLANG_WARN_SUSPICIOUS_MOVES = YES;
-				CLANG_WARN_UNREACHABLE_CODE = YES;
-				CLANG_WARN__DUPLICATE_METHOD_MATCH = YES;
-				CODE_SIGN_IDENTITY = "-";
-				COPY_PHASE_STRIP = NO;
-				DEBUG_INFORMATION_FORMAT = "dwarf-with-dsym";
-				ENABLE_STRICT_OBJC_MSGSEND = YES;
-				ENABLE_TESTABILITY = NO;
-				GCC_C_LANGUAGE_STANDARD = c11;
-				GCC_DYNAMIC_NO_PIC = NO;
-				GCC_ENABLE_CPP_EXCEPTIONS = NO;
-				GCC_ENABLE_CPP_RTTI = NO;
-				GCC_FAST_MATH = YES;
-				GCC_OPTIMIZATION_LEVEL = 0;
-				GCC_PREPROCESSOR_DEFINITIONS = (
-					"DEBUG=1",
-					"$(inherited)",
-				);
-				GCC_THREADSAFE_STATICS = NO;
-				GCC_WARN_64_TO_32_BIT_CONVERSION = YES;
-				GCC_WARN_ABOUT_RETURN_TYPE = YES_ERROR;
-				GCC_WARN_UNDECLARED_SELECTOR = YES;
-				GCC_WARN_UNINITIALIZED_AUTOS = YES_AGGRESSIVE;
-				GCC_WARN_UNUSED_FUNCTION = YES;
-				GCC_WARN_UNUSED_VARIABLE = YES;
-				HEADER_SEARCH_PATHS = ../../..;
-				MACOSX_DEPLOYMENT_TARGET = 10.7;
-				MTL_ENABLE_DEBUG_INFO = YES;
-				ONLY_ACTIVE_ARCH = NO;
-				SDKROOT = macosx;
-			};
-			name = Debug;
-		};
-		CD3DDFC21DEA2FAF007D58FA /* Release */ = {
-			isa = XCBuildConfiguration;
-			buildSettings = {
-				ALWAYS_SEARCH_USER_PATHS = NO;
-				CLANG_ANALYZER_NONNULL = YES;
-				CLANG_CXX_LANGUAGE_STANDARD = "c++0x";
-				CLANG_CXX_LIBRARY = "libc++";
-				CLANG_ENABLE_CODE_COVERAGE = NO;
-				CLANG_ENABLE_MODULES = NO;
-				CLANG_ENABLE_OBJC_ARC = YES;
-				CLANG_WARN_BOOL_CONVERSION = YES;
-				CLANG_WARN_CONSTANT_CONVERSION = YES;
-				CLANG_WARN_DIRECT_OBJC_ISA_USAGE = YES_ERROR;
-				CLANG_WARN_DOCUMENTATION_COMMENTS = YES;
-				CLANG_WARN_EMPTY_BODY = YES;
-				CLANG_WARN_ENUM_CONVERSION = YES;
-				CLANG_WARN_INFINITE_RECURSION = YES;
-				CLANG_WARN_INT_CONVERSION = YES;
-				CLANG_WARN_OBJC_ROOT_CLASS = YES_ERROR;
-				CLANG_WARN_SUSPICIOUS_MOVES = YES;
-				CLANG_WARN_UNREACHABLE_CODE = YES;
-				CLANG_WARN__DUPLICATE_METHOD_MATCH = YES;
-				CODE_SIGN_IDENTITY = "-";
-				COPY_PHASE_STRIP = NO;
-				DEBUG_INFORMATION_FORMAT = "dwarf-with-dsym";
-				ENABLE_NS_ASSERTIONS = NO;
-				ENABLE_STRICT_OBJC_MSGSEND = YES;
-				ENABLE_TESTABILITY = NO;
-				GCC_C_LANGUAGE_STANDARD = c11;
-				GCC_ENABLE_CPP_EXCEPTIONS = NO;
-				GCC_ENABLE_CPP_RTTI = NO;
-				GCC_FAST_MATH = YES;
-				GCC_OPTIMIZATION_LEVEL = fast;
-				GCC_PREPROCESSOR_DEFINITIONS = "NDEBUG=1";
-				GCC_THREADSAFE_STATICS = NO;
-				GCC_UNROLL_LOOPS = YES;
-				GCC_WARN_64_TO_32_BIT_CONVERSION = YES;
-				GCC_WARN_ABOUT_RETURN_TYPE = YES_ERROR;
-				GCC_WARN_UNDECLARED_SELECTOR = YES;
-				GCC_WARN_UNINITIALIZED_AUTOS = YES_AGGRESSIVE;
-				GCC_WARN_UNUSED_FUNCTION = YES;
-				GCC_WARN_UNUSED_VARIABLE = YES;
-				HEADER_SEARCH_PATHS = ../../..;
-				LLVM_LTO = YES;
-				MACOSX_DEPLOYMENT_TARGET = 10.7;
-				MTL_ENABLE_DEBUG_INFO = NO;
-				ONLY_ACTIVE_ARCH = NO;
-				SDKROOT = macosx;
-			};
-			name = Release;
-		};
-		CD3DDFC41DEA2FAF007D58FA /* Debug */ = {
-			isa = XCBuildConfiguration;
-			buildSettings = {
-				CONFIGURATION_BUILD_DIR = ../../../lib/macosx/debug;
-				DEVELOPMENT_TEAM = TDSNBN44YV;
-				EXECUTABLE_PREFIX = lib;
-				PRODUCT_NAME = "$(TARGET_NAME)";
-			};
-			name = Debug;
-		};
-		CD3DDFC51DEA2FAF007D58FA /* Release */ = {
-			isa = XCBuildConfiguration;
-			buildSettings = {
-				CONFIGURATION_BUILD_DIR = ../../../lib/macosx/release;
-				DEVELOPMENT_TEAM = TDSNBN44YV;
-				EXECUTABLE_PREFIX = lib;
-				PRODUCT_NAME = "$(TARGET_NAME)";
-			};
-			name = Release;
-		};
-/* End XCBuildConfiguration section */
-
-/* Begin XCConfigurationList section */
-		CD3DDFBA1DEA2FAF007D58FA /* Build configuration list for PBXProject "rpmalloc" */ = {
-			isa = XCConfigurationList;
-			buildConfigurations = (
-				CD3DDFC11DEA2FAF007D58FA /* Debug */,
-				CD3DDFC21DEA2FAF007D58FA /* Release */,
-			);
-			defaultConfigurationIsVisible = 0;
-			defaultConfigurationName = Release;
-		};
-		CD3DDFC31DEA2FAF007D58FA /* Build configuration list for PBXNativeTarget "rpmalloc" */ = {
-			isa = XCConfigurationList;
-			buildConfigurations = (
-				CD3DDFC41DEA2FAF007D58FA /* Debug */,
-				CD3DDFC51DEA2FAF007D58FA /* Release */,
-			);
-			defaultConfigurationIsVisible = 0;
-			defaultConfigurationName = Release;
-		};
-/* End XCConfigurationList section */
-	};
-	rootObject = CD3DDFB71DEA2FAF007D58FA /* Project object */;
-}
diff --git a/configure.py b/configure.py
index 86fd434..1daf2df 100755
--- a/configure.py
+++ b/configure.py
@@ -14,6 +14,8 @@
 writer = generator.writer
 toolchain = generator.toolchain
 
-rpmalloc_lib = generator.lib(module = 'rpmalloc', sources = ['rpmalloc.c', 'malloc.c', 'new.cc'])
+rpmalloc_lib = generator.lib(module = 'rpmalloc', libname = 'rpmalloc', sources = ['rpmalloc.c'])
+rpmallocwrap_lib = generator.lib(module = 'rpmalloc', libname = 'rpmallocwrap', sources = ['rpmalloc.c', 'malloc.c', 'new.cc'], variables = {'defines': ['ENABLE_PRELOAD=1']})
 if not target.is_windows():
-	rpmalloc_so = generator.sharedlib(module = 'rpmalloc', sources = ['rpmalloc.c', 'malloc.c', 'new.cc'], variables = {'runtime': 'c++'})
+	rpmalloc_so = generator.sharedlib(module = 'rpmalloc', libname = 'rpmalloc', sources = ['rpmalloc.c'])
+	rpmallocwrap_so = generator.sharedlib(module = 'rpmalloc', libname = 'rpmallocwrap', sources = ['rpmalloc.c', 'malloc.c', 'new.cc'], variables = {'runtime': 'c++', 'defines': ['ENABLE_PRELOAD=1']})
diff --git a/rpmalloc/malloc.c b/rpmalloc/malloc.c
index 64c3305..dc8fe7c 100644
--- a/rpmalloc/malloc.c
+++ b/rpmalloc/malloc.c
@@ -11,42 +11,65 @@
 
 #include "rpmalloc.h"
 
+#ifndef ENABLE_VALIDATE_ARGS
+//! Enable validation of args to public entry points
+#define ENABLE_VALIDATE_ARGS      0
+#endif
+
+#if ENABLE_VALIDATE_ARGS
+//! Maximum allocation size to avoid integer overflow
+#define MAX_ALLOC_SIZE            (((size_t)-1) - 4096)
+#endif
+
+#ifdef _MSC_VER
+#pragma warning (disable : 4100)
+#undef malloc
+#undef free
+#undef calloc
+#endif
+
 //This file provides overrides for the standard library malloc style entry points
 
-extern void*
+extern RPMALLOC_RESTRICT void* RPMALLOC_CDECL
 malloc(size_t size);
 
-extern void*
+extern RPMALLOC_RESTRICT void* RPMALLOC_CDECL
 calloc(size_t count, size_t size);
 
-extern void *
+extern void* RPMALLOC_CDECL
 realloc(void* ptr, size_t size);
 
+extern void* RPMALLOC_CDECL
+reallocf(void* ptr, size_t size);
+
-extern void*
+extern void* RPMALLOC_CDECL
+reallocarray(void* ptr, size_t count, size_t size);
+
+extern RPMALLOC_RESTRICT void* RPMALLOC_CDECL
 valloc(size_t size);
 
-extern void*
+extern RPMALLOC_RESTRICT void* RPMALLOC_CDECL
 pvalloc(size_t size);
 
-extern void*
+extern RPMALLOC_RESTRICT void* RPMALLOC_CDECL
 aligned_alloc(size_t alignment, size_t size);
 
-extern void*
+extern RPMALLOC_RESTRICT void* RPMALLOC_CDECL
 memalign(size_t alignment, size_t size);
 
-extern int
+extern int RPMALLOC_CDECL
 posix_memalign(void** memptr, size_t alignment, size_t size);
 
-extern void
+extern void RPMALLOC_CDECL
 free(void* ptr);
 
-extern void
+extern void RPMALLOC_CDECL
 cfree(void* ptr);
 
-extern size_t
+extern size_t RPMALLOC_CDECL
 malloc_usable_size(void* ptr);
 
-extern size_t
+extern size_t RPMALLOC_CDECL
 malloc_size(void* ptr);
 
 #ifdef _WIN32
@@ -78,8 +101,6 @@
 	}
 }
 
-//TODO: Injection from rpmalloc compiled as DLL not yet implemented
-
 #else
 
 #include <pthread.h>
@@ -100,7 +121,8 @@
 		is_initialized = 1;
 		page_size = (size_t)sysconf(_SC_PAGESIZE);
 		pthread_key_create(&destructor_key, thread_destructor);
-		rpmalloc_initialize();
+		if (rpmalloc_initialize())
+			abort();
 	}
 	rpmalloc_thread_initialize();
 }
@@ -188,81 +210,300 @@
 
 #endif
 
-void*
-calloc(size_t count, size_t size) {
-	initializer();
-	return rpcalloc(count, size);
-}
-
-void
-free(void* ptr) {
-	if (!is_initialized || !rpmalloc_is_thread_initialized())
-		return;
-	rpfree(ptr);
-}
-
-void
-cfree(void* ptr) {
-	free(ptr);
-}
-
-void*
+RPMALLOC_RESTRICT void* RPMALLOC_CDECL
 malloc(size_t size) {
 	initializer();
 	return rpmalloc(size);
 }
 
-void*
+void* RPMALLOC_CDECL
 realloc(void* ptr, size_t size) {
 	initializer();
 	return rprealloc(ptr, size);
 }
 
-void*
+void* RPMALLOC_CDECL
+reallocf(void* ptr, size_t size) {
+	initializer();
+	return rprealloc(ptr, size);
+}
+
+void* RPMALLOC_CDECL
+reallocarray(void* ptr, size_t count, size_t size) {
+	size_t total;
+#if ENABLE_VALIDATE_ARGS
+#ifdef _MSC_VER
+	int err = SizeTMult(count, size, &total);
+	if ((err != S_OK) || (total >= MAX_ALLOC_SIZE)) {
+		errno = EINVAL;
+		return 0;
+	}
+#else
+	int err = __builtin_umull_overflow(count, size, &total);
+	if (err || (total >= MAX_ALLOC_SIZE)) {
+		errno = EINVAL;
+		return 0;
+	}
+#endif
+#else
+	total = count * size;
+#endif
+	return realloc(ptr, total);
+}
+
+RPMALLOC_RESTRICT void* RPMALLOC_CDECL
+calloc(size_t count, size_t size) {
+	initializer();
+	return rpcalloc(count, size);
+}
+
+RPMALLOC_RESTRICT void* RPMALLOC_CDECL
 valloc(size_t size) {
 	initializer();
 	if (!size)
 		size = page_size;
 	size_t total_size = size + page_size;
+#if ENABLE_VALIDATE_ARGS
+	if (total_size < size) {
+		errno = EINVAL;
+		return 0;
+	}
+#endif
 	void* buffer = rpmalloc(total_size);
 	if ((uintptr_t)buffer & (page_size - 1))
 		return (void*)(((uintptr_t)buffer & ~(page_size - 1)) + page_size);
 	return buffer;
 }
 
-void*
+RPMALLOC_RESTRICT void* RPMALLOC_CDECL
 pvalloc(size_t size) {
-	if (size % page_size)
-		size = (1 + (size / page_size)) * page_size;
+	size_t aligned_size = size;
+	if (aligned_size % page_size)
+		aligned_size = (1 + (aligned_size / page_size)) * page_size;
+#if ENABLE_VALIDATE_ARGS
+	if (aligned_size < size) {
+		errno = EINVAL;
+		return 0;
+	}
+#endif
-	return valloc(size);
+	return valloc(aligned_size);
 }
 
-void*
+RPMALLOC_RESTRICT void* RPMALLOC_CDECL
 aligned_alloc(size_t alignment, size_t size) {
 	initializer();
 	return rpaligned_alloc(alignment, size);
 }
 
-void*
+RPMALLOC_RESTRICT void* RPMALLOC_CDECL
 memalign(size_t alignment, size_t size) {
 	initializer();
 	return rpmemalign(alignment, size);
 }
 
-int
+int RPMALLOC_CDECL
 posix_memalign(void** memptr, size_t alignment, size_t size) {
 	initializer();
 	return rpposix_memalign(memptr, alignment, size);
 }
 
-size_t
+void RPMALLOC_CDECL
+free(void* ptr) {
+	if (!is_initialized || !rpmalloc_is_thread_initialized())
+		return;
+	rpfree(ptr);
+}
+
+void RPMALLOC_CDECL
+cfree(void* ptr) {
+	free(ptr);
+}
+
+size_t RPMALLOC_CDECL
 malloc_usable_size(void* ptr) {
 	if (!rpmalloc_is_thread_initialized())
 		return 0;
 	return rpmalloc_usable_size(ptr);
 }
 
-size_t
+size_t RPMALLOC_CDECL
 malloc_size(void* ptr) {
 	return malloc_usable_size(ptr);
 }
+
+#ifdef _MSC_VER
+
+extern void* RPMALLOC_CDECL
+_expand(void* block, size_t size) {
+	return realloc(block, size);
+}
+
+extern RPMALLOC_RESTRICT void* RPMALLOC_CDECL
+_recalloc(void* block, size_t count, size_t size) {
+	initializer();
+	if (!block)
+		return rpcalloc(count, size);
+	size_t newsize = count * size;
+	size_t oldsize = rpmalloc_usable_size(block);
+	void* newblock = rprealloc(block, newsize);
+	if (newsize > oldsize)
+		memset((char*)newblock + oldsize, 0, newsize - oldsize);
+	return newblock;
+}
+
+extern RPMALLOC_RESTRICT void* RPMALLOC_CDECL
+_aligned_malloc(size_t size, size_t alignment) {
+	return aligned_alloc(alignment, size);
+}
+
+extern RPMALLOC_RESTRICT void* RPMALLOC_CDECL
+_aligned_realloc(void* block, size_t size, size_t alignment) {
+	initializer();
+	size_t oldsize = rpmalloc_usable_size(block);
+	return rpaligned_realloc(block, alignment, size, oldsize, 0);
+}
+
+extern RPMALLOC_RESTRICT void* RPMALLOC_CDECL
+_aligned_recalloc(void* block, size_t count, size_t size, size_t alignment) {
+	initializer();
+	size_t newsize = count * size;
+	if (!block) {
+		block = rpaligned_alloc(alignment, newsize);
+		memset(block, 0, newsize);
+		return block;
+	}
+	size_t oldsize = rpmalloc_usable_size(block);
+	void* newblock = rpaligned_realloc(block, alignment, newsize, oldsize, 0);
+	if (newsize > oldsize)
+		memset((char*)newblock + oldsize, 0, newsize - oldsize);
+	return newblock;
+}
+
+void RPMALLOC_CDECL
+_aligned_free(void* block) {
+	free(block);
+}
+
+extern size_t RPMALLOC_CDECL
+_msize(void* ptr) {
+	return malloc_usable_size(ptr);
+}
+
+extern size_t RPMALLOC_CDECL
+_aligned_msize(void* block, size_t alignment, size_t offset) {
+	return malloc_usable_size(block);
+}
+
+extern intptr_t RPMALLOC_CDECL
+_get_heap_handle(void) {
+	return 0;
+}
+
+extern int RPMALLOC_CDECL
+_heap_init(void) {
+	initializer();
+	return 1;
+}
+
+extern void RPMALLOC_CDECL
+_heap_term(void) {
+}
+
+extern int RPMALLOC_CDECL
+_set_new_mode(int flag) {
+	(void)sizeof(flag);
+	return 0;
+}
+
+#ifndef NDEBUG
+
+extern int RPMALLOC_CDECL
+_CrtDbgReport(int reportType, char const* fileName, int linenumber, char const* moduleName, char const* format, ...) {
+	return 0;
+}
+
+extern int RPMALLOC_CDECL
+_CrtDbgReportW(int reportType, wchar_t const* fileName, int lineNumber, wchar_t const* moduleName, wchar_t const* format, ...) {
+	return 0;
+}
+
+extern int RPMALLOC_CDECL
+_VCrtDbgReport(int reportType, char const* fileName, int linenumber, char const* moduleName, char const* format, va_list arglist) {
+	return 0;
+}
+
+extern int RPMALLOC_CDECL
+_VCrtDbgReportW(int reportType, wchar_t const* fileName, int lineNumber, wchar_t const* moduleName, wchar_t const* format, va_list arglist) {
+	return 0;
+}
+
+extern int RPMALLOC_CDECL
+_CrtSetReportMode(int reportType, int reportMode) {
+	return 0;
+}
+
+extern RPMALLOC_RESTRICT void* RPMALLOC_CDECL
+_malloc_dbg(size_t size, int blockUse, char const* fileName, int lineNumber) {
+	return malloc(size);
+}
+
+extern RPMALLOC_RESTRICT void* RPMALLOC_CDECL
+_expand_dbg(void* block, size_t size, int blockUse, char const* fileName, int lineNumber) {
+	return _expand(block, size);
+}
+
+extern RPMALLOC_RESTRICT void* RPMALLOC_CDECL
+_calloc_dbg(size_t count, size_t size, int blockUse, char const* fileName, int lineNumber) {
+	return calloc(count, size);
+}
+
+extern RPMALLOC_RESTRICT void* RPMALLOC_CDECL
+_realloc_dbg(void* block, size_t size, int blockUse, char const* fileName, int lineNumber) {
+	return realloc(block, size);
+}
+
+extern RPMALLOC_RESTRICT void* RPMALLOC_CDECL
+_recalloc_dbg(void* block, size_t count, size_t size, int blockUse, char const* fileName, int lineNumber) {
+	return _recalloc(block, count, size);
+}
+
+extern RPMALLOC_RESTRICT void* RPMALLOC_CDECL
+_aligned_malloc_dbg(size_t size, size_t alignment, char const* fileName, int lineNumber) {
+	return aligned_alloc(alignment, size);
+}
+
+extern RPMALLOC_RESTRICT void* RPMALLOC_CDECL
+_aligned_realloc_dbg(void* block, size_t size, size_t alignment, char const* fileName, int lineNumber) {
+	return _aligned_realloc(block, size, alignment);
+}
+
+extern RPMALLOC_RESTRICT void* RPMALLOC_CDECL
+_aligned_recalloc_dbg(void* block, size_t count, size_t size, size_t alignment, char const* fileName, int lineNumber) {
+	return _aligned_recalloc(block, count, size, alignment);
+}
+
+extern void RPMALLOC_CDECL
+_free_dbg(void* block, int blockUse) {
+	free(block);
+}
+
+extern void RPMALLOC_CDECL
+_aligned_free_dbg(void* block) {
+	free(block);
+}
+
+extern size_t RPMALLOC_CDECL
+_msize_dbg(void* ptr) {
+	return malloc_usable_size(ptr);
+}
+
+extern size_t RPMALLOC_CDECL
+_aligned_msize_dbg(void* block, size_t alignment, size_t offset) {
+	return malloc_usable_size(block);
+}
+
+#endif  // NDEBUG
+
+extern void* _crtheap = (void*)1;
+
+#endif
diff --git a/rpmalloc/new.cc b/rpmalloc/new.cc
index 56200da..cbaaaba 100644
--- a/rpmalloc/new.cc
+++ b/rpmalloc/new.cc
@@ -13,52 +13,111 @@
 #include <cstdint>
 #include <cstdlib>
 
+#include "rpmalloc.h"
+
 using namespace std;
 
 #ifdef __clang__
 #pragma clang diagnostic ignored "-Wc++98-compat"
 #endif
 
-void* operator new(size_t size) {
-	return malloc(size);
+extern void*
+operator new(size_t size);
+
+extern void*
+operator new[](size_t size);
+
+extern void
+operator delete(void* ptr) noexcept;
+
+extern void
+operator delete[](void* ptr) noexcept;
+
+extern void*
+operator new(size_t size, const std::nothrow_t&) noexcept;
+
+extern void*
+operator new[](size_t size, const std::nothrow_t&) noexcept;
+
+extern void
+operator delete(void* ptr, const std::nothrow_t&) noexcept;
+
+extern void
+operator delete[](void* ptr, const std::nothrow_t&) noexcept;
+
+extern void
+operator delete(void* ptr, size_t) noexcept;
+
+extern void
+operator delete[](void* ptr, size_t) noexcept;
+
+static int is_initialized;
+
+static void
+initializer(void) {
+	if (!is_initialized) {
+		is_initialized = 1;
+		rpmalloc_initialize();
+	}
+	rpmalloc_thread_initialize();
 }
 
-void operator delete(void* ptr) noexcept {
-	free(ptr);
+void*
+operator new(size_t size) {
+	initializer();
+	return rpmalloc(size);
 }
 
-void* operator new[](size_t size) {
-	return malloc(size);
+void
+operator delete(void* ptr) noexcept {
+	if (rpmalloc_is_thread_initialized())
+		rpfree(ptr);
 }
 
-void operator delete[](void* ptr) noexcept {
-	free(ptr);
+void*
+operator new[](size_t size) {
+	initializer();
+	return rpmalloc(size);
 }
 
-void* operator new(size_t size, const std::nothrow_t&) noexcept {
-	return malloc(size);
+void
+operator delete[](void* ptr) noexcept {
+	if (rpmalloc_is_thread_initialized())
+		rpfree(ptr);
 }
 
-void* operator new[](size_t size, const std::nothrow_t&) noexcept {
-	return malloc(size);
+void*
+operator new(size_t size, const std::nothrow_t&) noexcept {
+	initializer();
+	return rpmalloc(size);
 }
 
-void operator delete(void* ptr, const std::nothrow_t&) noexcept {
-	return free(ptr);
+void*
+operator new[](size_t size, const std::nothrow_t&) noexcept {
+	initializer();
+	return rpmalloc(size);
 }
 
-void operator delete[](void* ptr, const std::nothrow_t&) noexcept {
-	return free(ptr);
+void
+operator delete(void* ptr, const std::nothrow_t&) noexcept {
+	if (rpmalloc_is_thread_initialized())
+		rpfree(ptr);
 }
 
-#if 0
-
-void operator delete(void* ptr, size_t) noexcept {
-	free(ptr);
+void
+operator delete[](void* ptr, const std::nothrow_t&) noexcept {
+	if (rpmalloc_is_thread_initialized())
+		rpfree(ptr);
 }
 
-void operator delete[](void* ptr, size_t) noexcept {
-	free(ptr);
+void
+operator delete(void* ptr, size_t) noexcept {
+	if (rpmalloc_is_thread_initialized())
+		rpfree(ptr);
 }
 
-#endif
+void
+operator delete[](void* ptr, size_t) noexcept {
+	if (rpmalloc_is_thread_initialized())
+		rpfree(ptr);
+}
diff --git a/rpmalloc/rpmalloc.c b/rpmalloc/rpmalloc.c
index 43d57c4..607fbfc 100644
--- a/rpmalloc/rpmalloc.c
+++ b/rpmalloc/rpmalloc.c
@@ -64,29 +64,31 @@
 #define ENABLE_ASSERTS            0
 #endif
 
+#ifndef ENABLE_PRELOAD
+//! Support preloading
+#define ENABLE_PRELOAD            0
+#endif
+
 // Platform and arch specifics
 
 #ifdef _MSC_VER
 #  define ALIGNED_STRUCT(name, alignment) __declspec(align(alignment)) struct name
 #  define FORCEINLINE __forceinline
-#  define TLS_MODEL
 #  define _Static_assert static_assert
-#  define _Thread_local __declspec(thread)
 #  define atomic_thread_fence_acquire() //_ReadWriteBarrier()
 #  define atomic_thread_fence_release() //_ReadWriteBarrier()
 #  if ENABLE_VALIDATE_ARGS
 #    include <Intsafe.h>
 #  endif
 #else
+#  if defined(__APPLE__) && ENABLE_PRELOAD
+#    include <pthread.h>
+#  endif
 #  define ALIGNED_STRUCT(name, alignment) struct __attribute__((__aligned__(alignment))) name
 #  define FORCEINLINE inline __attribute__((__always_inline__))
-#  define TLS_MODEL __attribute__((tls_model("initial-exec")))
-#  if !defined(__clang__) && defined(__GNUC__)
-#    define _Thread_local __thread
-#  endif
 #  ifdef __arm__
-#    define atomic_thread_fence_acquire() __asm volatile("dmb sy" ::: "memory")
-#    define atomic_thread_fence_release() __asm volatile("dmb st" ::: "memory")
+#    define atomic_thread_fence_acquire() __asm volatile("dmb ish" ::: "memory")
+#    define atomic_thread_fence_release() __asm volatile("dmb ishst" ::: "memory")
 #  else
 #    define atomic_thread_fence_acquire() //__asm volatile("" ::: "memory")
 #    define atomic_thread_fence_release() //__asm volatile("" ::: "memory")
@@ -377,15 +379,15 @@
 //! Global large cache
 static atomicptr_t _memory_large_cache[LARGE_CLASS_COUNT];
 
-//! Current thread heap
-static _Thread_local heap_t* _memory_thread_heap TLS_MODEL;
-
 //! All heaps
 static atomicptr_t _memory_heaps[HEAP_ARRAY_SIZE];
 
 //! Orphaned heaps
 static atomicptr_t _memory_orphan_heaps;
 
+//! Running orphan counter to avoid ABA issues in linked list
+static atomic32_t _memory_orphan_counter;
+
 //! Active heap count
 static atomic32_t _memory_active_heaps;
 
@@ -404,6 +406,40 @@
 static atomic32_t _unmapped_total;
 #endif
 
+//! Current thread heap
+#if defined(__APPLE__) && ENABLE_PRELOAD
+static pthread_key_t _memory_thread_heap;
+#else
+#  ifdef _MSC_VER
+#    define _Thread_local __declspec(thread)
+#    define TLS_MODEL
+#  else
+#    define TLS_MODEL __attribute__((tls_model("initial-exec")))
+#    if !defined(__clang__) && defined(__GNUC__)
+#      define _Thread_local __thread
+#    endif
+#  endif
+static _Thread_local heap_t* _memory_thread_heap TLS_MODEL;
+#endif
+
+static FORCEINLINE heap_t*
+get_thread_heap(void) {
+#if defined(__APPLE__) && ENABLE_PRELOAD
+	return pthread_getspecific(_memory_thread_heap);
+#else
+	return _memory_thread_heap;
+#endif
+}
+
+static void
+set_thread_heap(heap_t* heap) {
+#if defined(__APPLE__) && ENABLE_PRELOAD
+	pthread_setspecific(_memory_thread_heap, heap);
+#else
+	_memory_thread_heap = heap;
+#endif
+}
+
 static void*
 _memory_map(size_t page_count);
 
@@ -868,17 +904,22 @@
 //! Allocate a new heap
 static heap_t*
 _memory_allocate_heap(void) {
+	uintptr_t raw_heap, next_raw_heap;
+	uintptr_t orphan_counter;
 	heap_t* heap;
 	heap_t* next_heap;
 	//Try getting an orphaned heap
 	atomic_thread_fence_acquire();
 	do {
-		heap = atomic_load_ptr(&_memory_orphan_heaps);
+		raw_heap = (uintptr_t)atomic_load_ptr(&_memory_orphan_heaps);
+		heap = (void*)(raw_heap & ~(uintptr_t)0xFFFF);
 		if (!heap)
 			break;
 		next_heap = heap->next_orphan;
+		orphan_counter = atomic_incr32(&_memory_orphan_counter);
+		next_raw_heap = (uintptr_t)next_heap | (orphan_counter & 0xFFFF);
 	}
-	while (!atomic_cas_ptr(&_memory_orphan_heaps, next_heap, heap));
+	while (!atomic_cas_ptr(&_memory_orphan_heaps, (void*)next_raw_heap, (void*)raw_heap));
 
 	if (heap) {
 		heap->next_orphan = 0;
@@ -1119,6 +1160,8 @@
 _memory_deallocate_defer(int32_t heap_id, void* p) {
 	//Get the heap and link in pointer in list of deferred opeations
 	heap_t* heap = _memory_heap_lookup(heap_id);
+	if (!heap)
+		return;
 	void* last_ptr;
 	do {
 		last_ptr = atomic_load_ptr(&heap->defer_deallocate);
@@ -1130,9 +1173,9 @@
 static void*
 _memory_allocate(size_t size) {
 	if (size <= MEDIUM_SIZE_LIMIT)
-		return _memory_allocate_from_heap(_memory_thread_heap, size);
+		return _memory_allocate_from_heap(get_thread_heap(), size);
 	else if (size <= LARGE_SIZE_LIMIT)
-		return _memory_allocate_large_from_heap(_memory_thread_heap, size);
+		return _memory_allocate_large_from_heap(get_thread_heap(), size);
 
 	//Oversized, allocate pages directly
 	size += SPAN_HEADER_SIZE;
@@ -1156,7 +1199,7 @@
 	//Grab the span (always at start of span, using 64KiB alignment)
 	span_t* span = (void*)((uintptr_t)p & SPAN_MASK);
 	int32_t heap_id = atomic_load32(&span->heap_id);
-	heap_t* heap = _memory_thread_heap;
+	heap_t* heap = get_thread_heap();
 	//Check if block belongs to this heap or if deallocation should be deferred
 	if (heap_id == heap->id) {
 		if (span->size_class < SIZE_CLASS_COUNT)
@@ -1292,11 +1335,11 @@
 #else
 #  include <sys/mman.h>
 #  include <sched.h>
-#  include <errno.h>
 #  ifndef MAP_UNINITIALIZED
 #    define MAP_UNINITIALIZED 0
 #  endif
 #endif
+#include <errno.h>
 
 //! Initialize the allocator and setup global data
 int
@@ -1308,14 +1351,19 @@
 	if (system_info.dwAllocationGranularity < SPAN_ADDRESS_GRANULARITY)
 		return -1;
 #else
-#if ARCH_64BIT
+#  if defined(__APPLE__) && ENABLE_PRELOAD
+	if (pthread_key_create(&_memory_thread_heap, 0))
+		return -1;
+#  endif
+#  if ARCH_64BIT
 	atomic_store64(&_memory_addr, 0x1000000000ULL);
-#else
+#  else
 	atomic_store64(&_memory_addr, 0x1000000ULL);
-#endif
+#  endif
 #endif
 
 	atomic_store32(&_memory_heap_id, 0);
+	atomic_store32(&_memory_orphan_counter, 0);
 
 	//Setup all small and medium size classes
 	size_t iclass;
@@ -1331,7 +1379,7 @@
 		_memory_size_class[SMALL_CLASS_COUNT + iclass].size = (uint16_t)size;
 		_memory_adjust_size_class(SMALL_CLASS_COUNT + iclass);
 	}
-	
+
 	//Initialize this thread
 	rpmalloc_thread_initialize();
 	return 0;
@@ -1417,18 +1465,22 @@
 	}
 
 	atomic_thread_fence_release();
+
+#if defined(__APPLE__) && ENABLE_PRELOAD
+	pthread_key_delete(_memory_thread_heap);
+#endif
 }
 
 //! Initialize thread, assign heap
 void
 rpmalloc_thread_initialize(void) {
-	if (!_memory_thread_heap) {
+	if (!get_thread_heap()) {
 		heap_t* heap =  _memory_allocate_heap();
 #if ENABLE_STATISTICS
 		heap->thread_to_global = 0;
 		heap->global_to_thread = 0;
 #endif
-		_memory_thread_heap = heap;
+		set_thread_heap(heap);
 		atomic_incr32(&_memory_active_heaps);
 	}
 }
@@ -1436,7 +1488,7 @@
 //! Finalize thread, orphan heap
 void
 rpmalloc_thread_finalize(void) {
-	heap_t* heap = _memory_thread_heap;
+	heap_t* heap = get_thread_heap();
 	if (!heap)
 		return;
 
@@ -1508,19 +1560,22 @@
 #endif
 
 	//Orphan the heap
+	uintptr_t raw_heap, orphan_counter;
 	heap_t* last_heap;
 	do {
 		last_heap = atomic_load_ptr(&_memory_orphan_heaps);
-		heap->next_orphan = last_heap;
+		heap->next_orphan = (void*)((uintptr_t)last_heap & ~(uintptr_t)0xFFFF);
+		orphan_counter = atomic_incr32(&_memory_orphan_counter);
+		raw_heap = (uintptr_t)heap | (orphan_counter & 0xFFFF);
 	}
-	while (!atomic_cas_ptr(&_memory_orphan_heaps, heap, last_heap));
+	while (!atomic_cas_ptr(&_memory_orphan_heaps, (void*)raw_heap, last_heap));
 	
-	_memory_thread_heap = 0;
+	set_thread_heap(0);
 }
 
 int
 rpmalloc_is_thread_initialized(void) {
-	return (_memory_thread_heap != 0) ? 1 : 0;
+	return (get_thread_heap() != 0) ? 1 : 0;
 }
 
 //! Map new pages to virtual memory
@@ -1574,6 +1629,7 @@
 
 #ifdef PLATFORM_WINDOWS
 	VirtualFree(ptr, 0, MEM_RELEASE);
+	(void)sizeof(page_count);
 #else
 	munmap(ptr, PAGE_SIZE * page_count);
 #endif
@@ -1606,7 +1662,7 @@
 
 // Extern interface
 
-void* 
+RPMALLOC_RESTRICT void*
 rpmalloc(size_t size) {
 #if ENABLE_VALIDATE_ARGS
 	if (size >= MAX_ALLOC_SIZE) {
@@ -1622,7 +1678,7 @@
 	_memory_deallocate(ptr);
 }
 
-void*
+RPMALLOC_RESTRICT void*
 rpcalloc(size_t num, size_t size) {
 	size_t total;
 #if ENABLE_VALIDATE_ARGS
@@ -1667,12 +1723,17 @@
 		return 0;
 	}
 #endif
-	//TODO: If alignment > 16, we need to copy to new aligned position
-	(void)sizeof(alignment);
+	if (alignment > 16) {
+		void* block = rpaligned_alloc(alignment, size);
+		if (!(flags & RPMALLOC_NO_PRESERVE))
+			memcpy(block, ptr, oldsize < size ? oldsize : size);
+		rpfree(ptr);
+		return block;
+	}
 	return _memory_reallocate(ptr, size, oldsize, flags);
 }
 
-void*
+RPMALLOC_RESTRICT void*
 rpaligned_alloc(size_t alignment, size_t size) {
 	if (alignment <= 16)
 		return rpmalloc(size);
@@ -1690,7 +1751,7 @@
 	return ptr;
 }
 
-void*
+RPMALLOC_RESTRICT void*
 rpmemalign(size_t alignment, size_t size) {
 	return rpaligned_alloc(alignment, size);
 }
@@ -1711,13 +1772,13 @@
 
 void
 rpmalloc_thread_collect(void) {
-	_memory_deallocate_deferred(_memory_thread_heap, 0);
+	_memory_deallocate_deferred(get_thread_heap(), 0);
 }
 
 void
 rpmalloc_thread_statistics(rpmalloc_thread_statistics_t* stats) {
 	memset(stats, 0, sizeof(rpmalloc_thread_statistics_t));
-	heap_t* heap = _memory_thread_heap;
+	heap_t* heap = get_thread_heap();
 #if ENABLE_STATISTICS
 	stats->allocated = heap->allocated;
 	stats->requested = heap->requested;
diff --git a/rpmalloc/rpmalloc.h b/rpmalloc/rpmalloc.h
index 948b872..e2a435f 100644
--- a/rpmalloc/rpmalloc.h
+++ b/rpmalloc/rpmalloc.h
@@ -19,13 +19,16 @@
 
 #if defined(__clang__) || defined(__GNUC__)
 # define RPMALLOC_ATTRIBUTE __attribute__((__malloc__))
-# define RPMALLOC_CALL
+# define RPMALLOC_RESTRICT
+# define RPMALLOC_CDECL
 #elif defined(_MSC_VER)
 # define RPMALLOC_ATTRIBUTE
-# define RPMALLOC_CALL __declspec(restrict)
+# define RPMALLOC_RESTRICT __declspec(restrict)
+# define RPMALLOC_CDECL __cdecl
 #else
 # define RPMALLOC_ATTRIBUTE
-# define RPMALLOC_CALL
+# define RPMALLOC_RESTRICT
+# define RPMALLOC_CDECL
 #endif
 
 //! Flag to rpaligned_realloc to not preserve content in reallocation
@@ -87,13 +90,13 @@
 extern void
 rpmalloc_global_statistics(rpmalloc_global_statistics_t* stats);
 
-extern RPMALLOC_CALL void*
+extern RPMALLOC_RESTRICT void*
 rpmalloc(size_t size) RPMALLOC_ATTRIBUTE;
 
 extern void
 rpfree(void* ptr);
 
-extern RPMALLOC_CALL void*
+extern RPMALLOC_RESTRICT void*
 rpcalloc(size_t num, size_t size) RPMALLOC_ATTRIBUTE;
 
 extern void*
@@ -102,10 +105,10 @@
 extern void*
 rpaligned_realloc(void* ptr, size_t alignment, size_t size, size_t oldsize, unsigned int flags);
 
-extern RPMALLOC_CALL void*
+extern RPMALLOC_RESTRICT void*
 rpaligned_alloc(size_t alignment, size_t size) RPMALLOC_ATTRIBUTE;
 
-extern RPMALLOC_CALL void*
+extern RPMALLOC_RESTRICT void*
 rpmemalign(size_t alignment, size_t size) RPMALLOC_ATTRIBUTE;
 
 extern int