Optimize asm for decoding copy fragments.
Relative to the previous commit:
benchmark                     old MB/s     new MB/s     speedup
BenchmarkWordsDecode1e1-8     518.05       518.80       1.00x
BenchmarkWordsDecode1e2-8     776.28       871.43       1.12x
BenchmarkWordsDecode1e3-8     995.41       1411.32      1.42x
BenchmarkWordsDecode1e4-8     615.92       1469.60      2.39x
BenchmarkWordsDecode1e5-8     453.95       771.07       1.70x
BenchmarkWordsDecode1e6-8     453.74       872.19       1.92x
Benchmark_UFlat0-8            863.12       1129.79      1.31x
Benchmark_UFlat1-8            766.01       1075.37      1.40x
Benchmark_UFlat2-8            15463.36     15617.45     1.01x
Benchmark_UFlat3-8            1388.63      1438.15      1.04x
Benchmark_UFlat4-8            4367.79      4838.37      1.11x
Benchmark_UFlat5-8            844.84       1075.46      1.27x
Benchmark_UFlat6-8            442.42       811.70       1.83x
Benchmark_UFlat7-8            437.68       781.87       1.79x
Benchmark_UFlat8-8            458.19       819.38       1.79x
Benchmark_UFlat9-8            423.36       724.43       1.71x
Benchmark_UFlat10-8           1023.05      1193.70      1.17x
Benchmark_UFlat11-8           507.18       879.15       1.73x
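
For readers unfamiliar with the term, a copy fragment tells the decoder to
append `length` bytes that start `offset` bytes back in the output already
produced. The sketch below illustrates those semantics in plain Go. It is not
the assembly changed by this commit; the function name `decodeCopy` and its
signature are hypothetical and chosen only for illustration.

```go
package main

import "fmt"

// decodeCopy is a hypothetical helper, not part of any real package API.
// It appends length bytes to dst, reading from offset bytes before the
// current end of dst. The byte-at-a-time forward copy is deliberate: when
// offset < length the source and destination overlap, and the copy must
// repeat the most recent offset bytes as a pattern.
func decodeCopy(dst []byte, offset, length int) ([]byte, error) {
	if offset <= 0 || offset > len(dst) || length < 0 {
		return nil, fmt.Errorf("invalid copy: offset=%d length=%d len(dst)=%d",
			offset, length, len(dst))
	}
	for i := 0; i < length; i++ {
		// len(dst) grows by one each iteration, so this index
		// advances through the source one byte at a time.
		dst = append(dst, dst[len(dst)-offset])
	}
	return dst, nil
}

func main() {
	// Overlapping copy: offset 2, length 6 repeats "ab" three more times.
	out, err := decodeCopy([]byte("ab"), 2, 6)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%s\n", out) // prints "abababab"
}
```

A faster implementation would typically copy several bytes per step when the
source and destination do not overlap and fall back to a slower path when they
do; the simple loop above trades speed for obvious correctness.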
1 file changed