- 9e63749 AVX 8bit kernel. Forked from AVX2+FMA version by T.J. Alumbaugh · 3 years, 8 months ago
- 29a155b Update README.md by Benoit Jacob · 3 years, 9 months ago
- ce0e559 Changes are excluded via Copybara by Ruy Contributors · 3 years, 9 months ago
- 4b1972b Changes are excluded via Copybara by Ruy Contributors · 3 years, 9 months ago
- 59c2de8 Rename kOutOfOrder -> kGeneric, kInOrder -> kA55ish, by Benoit Jacob · 3 years, 9 months ago
- 4f6a37b Reimplement :tune on top of :cpuinfo. by Benoit Jacob · 3 years, 9 months ago
- f99b42b Add bzl_library rules for .bzl files without one. by Ruy Contributors · 3 years, 9 months ago
- 2b24016 Adds AVX float packing code. by T.J. Alumbaugh · 3 years, 9 months ago
- 70d32d6 Adds AVX path and AVX float kernel. by T.J. Alumbaugh · 3 years, 9 months ago
- d4822f4 Adds AVX path and AVX float kernel. by T.J. Alumbaugh · 3 years, 9 months ago
- 18e34fa Adds AVX path and AVX float kernel. by T.J. Alumbaugh · 3 years, 9 months ago
- d7bd2a1 Print extra information in case of disagreeing TestResults. by T.J. Alumbaugh · 3 years, 9 months ago
- 5bb02fb check_macros improvements: promote operands before comparisons (avoids -Wsign-compare errors with GCC in cases like RUY_CHECK_NE(unsigned_bitmask_expression, 0)) and move all of the implementation to an inline function instead of having half of it in the macro. by Benoit Jacob · 3 years, 9 months ago
- f876353 Add missing #include of <cstring>. by Benoit Jacob · 3 years, 9 months ago
- bfe6e0d Simplify bias-loading code now that bias buffers are always rounded up to multiple of kernel size. by Benoit Jacob · 3 years, 10 months ago
- b53312b Use lambdas to shorten source code like we did in the avx512 kernel. by Benoit Jacob · 3 years, 10 months ago
- f611892 Handle per-column multipliers in the avx512 kernel without transposing the 16x16 accumulator block. by Benoit Jacob · 3 years, 10 months ago
- 1efd970 Optimized packing code path for row-major 8bit inputs for the x86 paths. by Benoit Jacob · 3 years, 10 months ago
- 257a0fc Optimized packing code path for row-major 8bit inputs for the kNeon path. Written in intrinsics to handle 3 cases at once: by Benoit Jacob · 3 years, 10 months ago
- 550655f Use lambdas to shorten Kernel8bitAvx512's source code, and to split the resulting non-opt binary code into smaller functions. This makes no difference in opt builds, but for non-opt builds this reduces the stack frame of this function from 60k down to 24k. This avoids stack overflows in some toolchains. by Benoit Jacob · 3 years, 10 months ago
- ec99c70 Optimized packing code path for row-major float inputs. by Benoit Jacob · 3 years, 10 months ago
- bebf022 Optimized packing code path for row-major 8bit inputs for the kNeonDotprod path. by Benoit Jacob · 3 years, 10 months ago
- d492ac8 Fix the build on some toolchains - a missing #include<cstring> and some avx512 intrinsic synonyms. by Benoit Jacob · 3 years, 10 months ago
- 90f7274 Rename packing code implementation functions now that they are explicitly about one specific source matrix storage order. by Benoit Jacob · 3 years, 10 months ago
- cd375d3 Templatize packing code paths on the source order, so that we support any combination of source orders, with the worst case being a fallback to the standard C++ packing code, which readily supports any storage order. by Benoit Jacob · 3 years, 10 months ago
- 5210e3e Simplification of FallBackToStandardCpp now that we are past the incremental steps toward supporting any channel_dimension. by Benoit Jacob · 3 years, 10 months ago
- 6d218c3 Efficient support for any channel_dimension for quantized kernels on AVX-512, part 2: handling of per-channel multipliers. by Benoit Jacob · 3 years, 10 months ago
- c1d5b4f Efficient support for any channel_dimension for quantized kernels on AVX-512, part 1: non-per-channel-multiplier case, so we only have to deal with bias vectors for now. by Benoit Jacob · 3 years, 10 months ago
- bb9349c Efficient support for any channel_dimension for quantized kernels on AVX2. by Benoit Jacob · 3 years, 10 months ago
- bd21e0c Simplify x86 kernels by using the fact that there always is a per-channel buffer to read from, even in the non-perchannel case (in that case, its size is just the kernel's width and one must use 0 as offset). by Benoit Jacob · 3 years, 10 months ago
- 98c5213 Simplify x86 kernels thanks to the new fact that perchannel buffers are rounded to next multiple of kernel width. by Benoit Jacob · 3 years, 10 months ago
- a776b5d Fix runtime detection of support for our AVX2+FMA code path: we were only checking for AVX2, which happens to imply FMA on Intel CPUs. by Benoit Jacob · 3 years, 10 months ago
- 7784e18 FMA is technically a separate ISA extension from AVX2. by Benoit Jacob · 3 years, 10 months ago
- 27d16d0 Efficient support for any channel_dimension for float kernels on AVX-512. by Benoit Jacob · 3 years, 10 months ago
- 592d30c Efficient support for any channel_dimension for float kernels on AVX2. by Benoit Jacob · 3 years, 10 months ago
- f88e08e Allow the user to specify that they have allocated a slightly larger capacity for the per-channel buffers, so that ruy can then avoid reallocating and copying these buffers. by Benoit Jacob · 3 years, 10 months ago
- 388ffd2 Fix ARM32 packing code reading past the end of the source matrix, and finish enabling the use of SeparateMappingVector in StorageMatrix in the test code to guard against that (it had discovered this issue). by Benoit Jacob · 3 years, 10 months ago
- 856f0fd Add comments and some minor simplifications to packing code. by Benoit Jacob · 3 years, 10 months ago
- e600a4d Avoid overrunning per-channel buffers, whose size is that of the corresponding user-facing matrix dimension, but which assembly kernels tend to address as if they had the same size as the corresponding packed matrix dimension. AddressSanitizer can't see what asm kernels do. by Benoit Jacob · 3 years, 10 months ago
- f5b52f9 Minor optimization of in-order arm64 kernels, interleave the dup's used in the channels-are-columns case with other instructions. by Benoit Jacob · 3 years, 10 months ago
- 62aa923 Minor simplification of arm32 assembly: the add instruction itself can be conditional. by Benoit Jacob · 3 years, 10 months ago
- ec970ca Efficient support for any channel_dimension for quantized kernels on ARM32. by Benoit Jacob · 3 years, 10 months ago
- 53c5454 Efficient support for any channel_dimension for float kernels on ARM32. by Benoit Jacob · 3 years, 10 months ago
- 3cacc71 Efficient support for any channel_dimension for kNeonDotprod quantized kernels on ARM64. by Benoit Jacob · 3 years, 10 months ago
- ffb0866 Efficient support for any channel_dimension for kNeon quantized kernels on ARM64. by Benoit Jacob · 3 years, 10 months ago
- 1f9e146 Ensure that the 1Col kernels are not used with channel_dimension==kCol, so that we don't need to update them. by Benoit Jacob · 3 years, 10 months ago
- caf57cc Efficient support for any channel_dimension for float kernels on ARM64. by Benoit Jacob · 3 years, 10 months ago
- df335bc Groundwork to pass channel_dimension down to kernels and to incrementally enable fast kernels in the channel_dimension==kCol case. by Benoit Jacob · 3 years, 10 months ago
- cd4f776 Revisiting RUY_OPT(AVOID_ALIASING). by Benoit Jacob · 3 years, 10 months ago
- b3edb05 Fix benchmarking of caching. by Benoit Jacob · 3 years, 10 months ago
- 2d09352 Allow benchmarking any combination of storage orders, and disable the randomization of the channel_dimension in the case of benchmarking, so that the actual storage order of the destination matrix being benchmarked internally matches what is specified (no internal transposition). Randomization of the channel_dimension is kept in non-benchmark tests. by Benoit Jacob · 3 years, 10 months ago
- c03ab18 Allow disabling the reference path in the benchmark. by Benoit Jacob · 3 years, 10 months ago
- c72d487 Start of a documentation directory. by Benoit Jacob · 3 years, 10 months ago
- 33fa58e Remove RUY_OPT(NATIVE_ROUNDING) or rather, the ability to disable it. by Benoit Jacob · 3 years, 10 months ago
- 8525a43 Make the reference/standard-cpp code in ApplyMultiplier match the ARM code, by changing the RoundingDivideByPOT function, which was borrowed from gemmlowp, to a RoundingRightShift function that is more like just a standard rounding arithmetic shift instruction, breaking ties upwards instead of away-from-zero. by Benoit Jacob · 3 years, 10 months ago
- 7fb015f Avoid relying on std::max being constexpr, which is c++14 behavior but is not implemented on TensorFlow continuous integration on Ubuntu 16. by Benoit Jacob · 3 years, 10 months ago
- e7f175f Remove ExpectedOutcome support, it was used for death tests in test_special_mul_params, which has been removed already. by Benoit Jacob · 3 years, 10 months ago
- 03bbc8f Store perchannel members in a union with their non-perchannel counterpart. by Benoit Jacob · 3 years, 10 months ago
- 39df743 Split the storage of MulParams data members into 3 separate template specializations for the floating-point, raw integer and quantized cases. by Benoit Jacob · 3 years, 10 months ago
- f5e0fac Remove cpuinfo from s390x build as there is no support yet by cdavoudian · 3 years, 10 months ago
- 8678f55 Reduce to the case of column-major destination matrix by transposing the whole Mul in the row-major destination matrix case. by Benoit Jacob · 3 years, 10 months ago
- 5b496e0 Some refactoring in create_trmul_params.* ahead of implementing the transposition technique to reduce to column-major destination. by Benoit Jacob · 3 years, 10 months ago
- c17ae28 Implement the channels_dimension==kCol case. by Benoit Jacob · 3 years, 11 months ago
- 375895e Change Transpose functions to return the result by value. by Benoit Jacob · 3 years, 11 months ago
- e273e15 Store the MulParams by value, in a char[] buffer, in TrMulParams. by Benoit Jacob · 3 years, 11 months ago
- fd803fb Add a channel_dimension member to MulParams, bringing the last piece to make Ruy's API fully LHS<->RHS symmetric, allowing the implementation to transpose the whole Mul to reduce to column major destination matrices. by Benoit Jacob · 3 years, 11 months ago
- d2509b7 Make FixedKernelLayout internal by Benoit Jacob · 3 years, 11 months ago
- 66961ae Fix up templates specialization for change by Ruy Contributors · 3 years, 11 months ago
- 19b09a4 Make FixedKernelLayout internal by Robert David · 3 years, 11 months ago
- c9f5f9c Clean up #includes and deps among kernel* and pack*. by Benoit Jacob · 3 years, 11 months ago
- ae6e0ed trim down common.h, keeping only the macros. by Benoit Jacob · 3 years, 11 months ago
- 5efd3eb Make FixedKernelLayout internal by Benoit Jacob · 3 years, 11 months ago
- 43680a7 Detemplatize on MulParamsType, part 2. by Benoit Jacob · 3 years, 11 months ago
- 1acc6f5 Avoid templatizing on MulParamsType, instead templatize on AccumScalar/DstScalar, as the only MulParamsType is MulParams<AccumScalar,DstScalar> (part 1). by Benoit Jacob · 3 years, 11 months ago
- 412e17e Finish cleaning up mul_params.h: remove ZeroPointSupport and LayoutSupport enums, and other now-unused things. Mark MulParams as final. by Benoit Jacob · 3 years, 11 months ago
- 5111a55 Remove the LoopStructure enum. by Benoit Jacob · 3 years, 11 months ago
- 5c28dfe Delete test_special_mul_params and de-templatize the test code on a MulParamsType, restricting it to non-subclassed MulParams. This is a temporary regression in testing coverage, but in the next commit in this series we will recover the testing of special StandardCpp kernel layouts thanks to the new Paths, while removing the other features that subclassing MulParams offered. by Benoit Jacob · 3 years, 11 months ago
- bf0c1c4 Introduce new internal-only Paths that are variants of kStandardCpp exercising internal corners of ruy. by Benoit Jacob · 3 years, 11 months ago
- f6363d0 Delete stale file, forgot to remove it in cl/317146687. by Benoit Jacob · 3 years, 11 months ago
- fb8fa3b The example code was still teaching people to use <ruy::kAllPaths>, which most users now don't need or want. by Benoit Jacob · 3 years, 11 months ago
- 1014033 Shuffle Path values a bit. kStandardCpp=1, other values < 0x10 will be used for kStandardCpp variants for internal testing purposes, SIMD paths start at 0x10. by Benoit Jacob · 3 years, 11 months ago
- 8dd9136 Remove SSE4.2 and VNNI placeholder code for now. by Benoit Jacob · 3 years, 11 months ago
- 4d8ad9f The word 'packed' is being used for too many things, so rename to make it more specific in each case. by Benoit Jacob · 3 years, 11 months ago
- e6603bf Rename Other to OtherSide for readability at call sites, and use it in one more place. by Benoit Jacob · 3 years, 11 months ago
- b7649fa Refactoring of the front-end code. by Benoit Jacob · 3 years, 11 months ago
- 0b64129 Check that the actually used kernel code path matches the path we think we're taking, at least when it should match, i.e. in standard cases that fast code paths are supposed to handle. by Benoit Jacob · 3 years, 11 months ago
- c45f194 Fix a recent regression (from cl/316525635): when the LHS/RHS scalar type was uint8 (not int8), we had disabled all NEON paths on ARM 32bit (not on ARM 64bit)! by Benoit Jacob · 3 years, 11 months ago
- 072976c Restructure pack*.h headers so that just pack_common.h does not provide any code path, only common helpers, so that one can't accidentally #include pack_common.h instead of pack.h and silently fall back to slow code. by Benoit Jacob · 3 years, 11 months ago
- e7b27d6 Restructure kernel*.h headers so that just kernel_common.h does not provide any code path, only common helpers, so that one can't accidentally #include kernel_common.h instead of kernel.h and silently fall back to slow code. by Benoit Jacob · 3 years, 11 months ago
- b896b0c Support --cpu=armeabi, used in TensorFlow Raspberry Pi builds like here: by Benoit Jacob · 3 years, 11 months ago
- 9ad26c7 Complete the rollback by deleting files that were added by that CL and not deleted by the rollback. by Benoit Jacob · 3 years, 11 months ago
- 34ea9f4 Rollback refactoring. by Ruy Contributors · 3 years, 11 months ago
- 3281c7c Rename Other to OtherSide for readability at call sites, and use it in one more place. by Ruy Contributors · 3 years, 11 months ago
- b786fbd Rollback refactoring. by Ruy Contributors · 3 years, 11 months ago
- 93fdb9e The word 'packed' is being used for too many things, so rename to make it more specific in each case. by Benoit Jacob · 4 years ago
- db28e82 Rename Other to OtherSide for readability at call sites, and use it in one more place. by Benoit Jacob · 4 years ago
- 40394f7 Update our arm32 detection logic to support the case of cpu=='armv7a' as opposed to cpu=='armeabi-v7a' as we have on Android. Use naming that's more explicit as to our intent to just assume NEON support. by Benoit Jacob · 4 years ago
- c03298c Import the fix from XNNPACK's cpuinfo.BUILD to support the case where cpu=="armv7a". by Benoit Jacob · 4 years ago
- 55cb53a Refactoring of the front-end code. by Benoit Jacob · 4 years ago
- 921b9fe Better comments in trmul.cc. by Benoit Jacob · 4 years ago