vmx: optimize vmx_composite_over_n_8888_8888_ca

This patch optimizes vmx_composite_over_n_8888_8888_ca by removing use
of expand_alpha_1x128, unpack/pack and in_over_2x128 in favor of
splat_alpha, in_over and MUL/ADD macros from pixman_combine32.h.

Running "lowlevel-blt-bench -n over_8888_8888" on POWER8, 8 cores,
3.4GHz, RHEL 7.2 ppc64le gave the following results:

reference memcpy speed = 23475.4MB/s (5868.8MP/s for 32bpp fills)

                Before          After           Change
              --------------------------------------------
L1              244.97          474.05         +93.51%
L2              243.74          473.05         +94.08%
M               243.29          467.16         +92.02%
HT              144.03          252.79         +75.51%
VT              174.24          279.03         +60.14%
R               109.86          149.98         +36.52%
RT              47.96           53.18          +10.88%
Kops/s          524             576            +9.92%

Signed-off-by: Oded Gabbay <oded.gabbay@gmail.com>
Acked-by: Pekka Paalanen <pekka.paalanen@collabora.co.uk>
Acked-by: Siarhei Siamashka <siarhei.siamashka@gmail.com>
1 file changed