commit | 4621b0cc7fd4f694b3ec4a2827cf82c4983fd237 | |
---|---|---|
author | George Steed <george.steed@arm.com> | Wed May 15 15:35:56 2024 +0100 |
committer | Frank Barchard <fbarchard@chromium.org> | Thu Oct 24 21:25:23 2024 +0000 |
tree | 2de87843298cf37068ade992fde0cca5939cec82 | |
parent | faade2f73f4a33ead7e19f9506a588074d3154f9 | |
[AArch64] Rework data loading in ScaleFilterCols_NEON

Lane-indexed LD2 instructions are slow and introduce an unnecessary dependency on the previous iteration of the loop. To avoid this dependency use a scalar load for the first iteration and lane-indexed LD1 for the remainder, then TRN1 and TRN2 to split out the even and odd elements.

Reduction in runtimes observed compared to the existing Neon implementation:

Cortex-A55: -6.7%
Cortex-A510: -13.2%
Cortex-A520: -13.1%
Cortex-A76: -54.5%
Cortex-A715: -60.3%
Cortex-A720: -61.0%
Cortex-X1: -69.1%
Cortex-X2: -68.6%
Cortex-X3: -73.9%
Cortex-X4: -73.8%
Cortex-X925: -69.0%

Bug: b/42280945
Change-Id: I1c4adfb82a43bdcf2dd4cc212088fc21a5812244
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5872804
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
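
The data flow described in the commit message can be modelled in portable C. This is only a rough sketch, not the libyuv implementation: the helper names (`trn1_u8x8`, `trn2_u8x8`, `gather_pairs`) are hypothetical, and splitting the eight pixel pairs across two registers (even-numbered pixels in one, odd-numbered in the other) is an assumption made so that the byte-wise TRN1/TRN2 step yields the left and right neighbours in pixel order. Each lane write copies two adjacent source bytes at a 16.16 fixed-point position, standing in for a halfword load.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of byte-wise TRN1: interleave the even-indexed lanes of a and b. */
static void trn1_u8x8(const uint8_t a[8], const uint8_t b[8], uint8_t out[8]) {
  for (int i = 0; i < 4; ++i) {
    out[2 * i] = a[2 * i];
    out[2 * i + 1] = b[2 * i];
  }
}

/* Scalar model of byte-wise TRN2: interleave the odd-indexed lanes of a and b. */
static void trn2_u8x8(const uint8_t a[8], const uint8_t b[8], uint8_t out[8]) {
  for (int i = 0; i < 4; ++i) {
    out[2 * i] = a[2 * i + 1];
    out[2 * i + 1] = b[2 * i + 1];
  }
}

/* Gather 8 adjacent byte pairs at 16.16 fixed-point positions x, x + dx, ...
 * left[j] receives src[xi_j] and right[j] receives src[xi_j + 1].
 * Assumption: pairs for even-numbered pixels go to one register and pairs for
 * odd-numbered pixels to another. */
static void gather_pairs(const uint8_t* src, int x, int dx,
                         uint8_t left[8], uint8_t right[8]) {
  uint8_t even[8]; /* models 4 halfword lanes: pixels 0, 2, 4, 6 */
  uint8_t odd[8];  /* models 4 halfword lanes: pixels 1, 3, 5, 7 */
  assert(src != NULL);
  for (int lane = 0; lane < 4; ++lane) {
    /* On real hardware lane 0 would be a plain scalar load that overwrites
     * the whole register, so it does not wait on the previous loop iteration;
     * lanes 1..3 would be lane-indexed LD1. */
    memcpy(&even[2 * lane], src + (x >> 16), 2);
    x += dx;
    memcpy(&odd[2 * lane], src + (x >> 16), 2);
    x += dx;
  }
  trn1_u8x8(even, odd, left);  /* first byte of every pair, in pixel order */
  trn2_u8x8(even, odd, right); /* second byte of every pair, in pixel order */
}
```

With a step of `0x18000` (1.5 source pixels per output pixel), `left` and `right` hold the two neighbouring source bytes for each of eight output pixels, ready to be blended by the fractional part of the position.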
libyuv is an open source project that includes YUV scaling and conversion functionality.
See Getting started for instructions on how to get started developing.
You can also browse the docs directory for more documentation.