Add optimized routines for pairwise long adds and _mm_mullo_epi32 (#25)

* Add optimized routines for pairwise long adds and _mm_mullo_epi32

vpaddlq_uN can be implemented like so:
{
    const __m128i ff = _mm_set1_epi2N((1 << N) - 1);
    __m128i low = _mm_and_si128(a, ff);
    __m128i high = _mm_srli_epi2N(a, N);
    return _mm_add_epi2N(low, high);
}

and the other unsigned pairwise adds are the same.
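
As a concrete illustration, the template above can be instantiated with N = 16, which corresponds to vpaddlq_u16 (eight uint16_t lanes in, four uint32_t lanes out). This is a minimal SSE2 sketch, not the code from NEON_2_SSE.h; the names paddl_u16_sse2 and paddl_u16_ref are hypothetical and exist only for this example.

```c
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical name: pairwise long add of eight uint16_t lanes into four
   uint32_t lanes, i.e. the template with N = 16. */
static __m128i paddl_u16_sse2(__m128i a)
{
    const __m128i ff = _mm_set1_epi32((1 << 16) - 1); /* 0x0000FFFF in each 32-bit lane */
    __m128i low = _mm_and_si128(a, ff);   /* even-indexed 16-bit elements */
    __m128i high = _mm_srli_epi32(a, 16); /* odd-indexed 16-bit elements */
    return _mm_add_epi32(low, high);      /* sums fit in 32 bits, so no overflow */
}

/* Scalar reference with the same semantics, for comparison. */
static void paddl_u16_ref(const uint16_t in[8], uint32_t out[4])
{
    for (int i = 0; i < 4; i++)
        out[i] = (uint32_t)in[2 * i] + (uint32_t)in[2 * i + 1];
}
```

The mask and shift work because each 32-bit lane already holds one adjacent pair of 16-bit elements, so no shuffle is needed.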

vpaddlq_s32 can be implemented like so:
{
    __m128i top, bot;
    bot = _mm_shuffle_epi32(a, _MM_SHUFFLE(0, 0, 2, 0));
    bot = _MM_CVTEPI32_EPI64(bot);
    top = _mm_shuffle_epi32(a, _MM_SHUFFLE(0, 0, 3, 1));
    top = _MM_CVTEPI32_EPI64(top);
    return _mm_add_epi64(top, bot);
}
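
The signed routine above depends on _MM_CVTEPI32_EPI64, whose definition is not shown in this message. The sketch below swaps in an SSE2-only sign extension so that it compiles on its own; cvtepi32_epi64_sse2 and paddl_s32_sse2 are hypothetical names, and the helper is an assumption about what such a widening step could look like, not necessarily what the header does.

```c
#include <emmintrin.h> /* SSE2 intrinsics; also provides _MM_SHUFFLE */
#include <stdint.h>

/* Hypothetical stand-in for _MM_CVTEPI32_EPI64: sign-extend the two low
   int32 lanes to int64 using only SSE2. */
static __m128i cvtepi32_epi64_sse2(__m128i x)
{
    __m128i sign = _mm_srai_epi32(x, 31); /* all ones where the lane is negative */
    return _mm_unpacklo_epi32(x, sign);   /* interleave each value with its sign word */
}

/* Pairwise long add of four int32 lanes into two int64 lanes. */
static __m128i paddl_s32_sse2(__m128i a)
{
    /* Gather the even elements (a0, a2) and odd elements (a1, a3)
       into the two low 32-bit lanes. */
    __m128i bot = _mm_shuffle_epi32(a, _MM_SHUFFLE(0, 0, 2, 0));
    __m128i top = _mm_shuffle_epi32(a, _MM_SHUFFLE(0, 0, 3, 1));
    bot = cvtepi32_epi64_sse2(bot);
    top = cvtepi32_epi64_sse2(top);
    return _mm_add_epi64(top, bot); /* a0 + a1, a2 + a3 */
}
```

Widening before the add is what distinguishes the signed case from the unsigned one: a plain mask would discard the sign of the low element.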

And _mm_mullo_epi32 uses the same routine that GCC uses with vector
extensions (Clang uses a similar method, but it uses pshufd which is
slow on pre-Penryn chips):

{
    __m128i a_high = _mm_srli_epi64(a, 32);
    __m128i low = _mm_mul_epu32(a, b);
    __m128i b_high = _mm_srli_epi64(b, 32);
    __m128i high = _mm_mul_epu32(a_high, b_high);
    low = _mm_shuffle_epi32(low, _MM_SHUFFLE(0, 0, 2, 0));
    high = _mm_shuffle_epi32(high, _MM_SHUFFLE(0, 0, 2, 0));
    return _mm_unpacklo_epi32(low, high);
}
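
For anyone who wants to check the multiply routine in isolation, here is a self-contained sketch that wraps the same steps in a function and adds a scalar reference. The names mullo_epi32_sse2 and mullo_ref are hypothetical. The routine uses an unsigned 32x32->64 multiply, which is safe because the low 32 bits of a product are the same for signed and unsigned operands.

```c
#include <emmintrin.h> /* SSE2 intrinsics; also provides _MM_SHUFFLE */
#include <stdint.h>

/* Hypothetical name: low 32 bits of each 32x32 product, SSE2 only. */
static __m128i mullo_epi32_sse2(__m128i a, __m128i b)
{
    __m128i a_high = _mm_srli_epi64(a, 32);        /* move lanes 1 and 3 down */
    __m128i low = _mm_mul_epu32(a, b);             /* 64-bit products of lanes 0 and 2 */
    __m128i b_high = _mm_srli_epi64(b, 32);
    __m128i high = _mm_mul_epu32(a_high, b_high);  /* 64-bit products of lanes 1 and 3 */
    /* Keep the low 32 bits of each product in the two low lanes. */
    low = _mm_shuffle_epi32(low, _MM_SHUFFLE(0, 0, 2, 0));
    high = _mm_shuffle_epi32(high, _MM_SHUFFLE(0, 0, 2, 0));
    return _mm_unpacklo_epi32(low, high);          /* p0, p1, p2, p3 */
}

/* Scalar reference; unsigned arithmetic avoids signed-overflow UB. */
static void mullo_ref(const int32_t a[4], const int32_t b[4], uint32_t out[4])
{
    for (int i = 0; i < 4; i++)
        out[i] = (uint32_t)a[i] * (uint32_t)b[i];
}
```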
1 file changed
tree: c03283b3425c6330bfdf8f5ffdb5a0a522f2f3e9
  1. cmake/
  2. .gitignore
  3. CMakeLists.txt
  4. LICENSE
  5. NEON_2_SSE.h
  6. ReadMe.md
ReadMe.md

The NEON_2_SSE.h file is intended to simplify ARM->IA32 porting. It provides a mapping (or a real port) from ARM NEON intrinsics (as defined in the “arm_neon.h” header) to x86 SSE (up to SSE4.2) intrinsic functions as defined in the corresponding x86 compiler header files.


To take advantage of this file, include it instead of “arm_neon.h” in your project that uses ARM NEON intrinsics, compile as usual, and enjoy the result.

For a significant performance improvement in some cases, you might need to define USE_SSE4 in your project settings. Otherwise, only SIMD up to SSSE3 is used.

If the NEON2SSE_DISABLE_PERFORMANCE_WARNING macro is defined, the performance warnings are disabled.
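
A minimal sketch of what this setup might look like in a source file follows. It assumes NEON_2_SSE.h is on the include path, that defining the macros before the include is equivalent to setting them in the project settings, and that the compiler has SSE4 code generation enabled where required (for example `-msse4.1` with GCC or Clang). The function add4 is hypothetical; the intrinsics it calls are standard NEON ones.

```c
#include <stdint.h>

/* Both macros are optional; define them before including the header. */
#define USE_SSE4                             /* allow SSE4 code paths */
#define NEON2SSE_DISABLE_PERFORMANCE_WARNING /* silence performance warnings */
#include "NEON_2_SSE.h"                      /* used instead of <arm_neon.h> */

/* Hypothetical example: add four int32 values with NEON intrinsics,
   which the header maps to SSE on x86. */
void add4(const int32_t *x, const int32_t *y, int32_t *out)
{
    int32x4_t vx = vld1q_s32(x);
    int32x4_t vy = vld1q_s32(y);
    vst1q_s32(out, vaddq_s32(vx, vy));
}
```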

For more information and license please read the NEON_2_SSE.h content.