Optimize asm for decoding literal fragments.
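
For context, a literal fragment in the Snappy format is a tag byte whose low two bits are 00, followed by the literal bytes; the upper six bits hold length-1 for lengths up to 60, and values 60..63 mean that 1..4 little-endian length bytes follow. The pure-Go sketch below only illustrates that decoding step. It is not the assembly changed by this commit, and the names decodeLiteral and errCorrupt are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// errCorrupt is a hypothetical error value used only by this sketch.
var errCorrupt = errors.New("corrupt input")

// decodeLiteral decodes one literal fragment that starts at src[0] and
// appends its payload to dst. It returns the extended dst and the number
// of source bytes consumed (tag + optional length bytes + payload).
func decodeLiteral(dst, src []byte) ([]byte, int, error) {
	// A literal tag has its low two bits set to 00.
	if len(src) == 0 || src[0]&0x03 != 0x00 {
		return dst, 0, errCorrupt
	}
	n := int(src[0] >> 2) // upper six bits of the tag
	hdr := 1              // bytes used by the tag and any length bytes
	if n >= 60 {
		// 60..63 mean that 1..4 little-endian length bytes follow.
		extra := n - 59
		if len(src) < 1+extra {
			return dst, 0, errCorrupt
		}
		n = 0
		for i := 0; i < extra; i++ {
			n |= int(src[1+i]) << (8 * uint(i))
		}
		hdr += extra
	}
	n++ // the stored value is length-1
	if n <= 0 || n > len(src)-hdr {
		return dst, 0, errCorrupt
	}
	return append(dst, src[hdr:hdr+n]...), hdr + n, nil
}

func main() {
	// Tag 0x0c: low bits 00 (literal), upper six bits 3, so length 4.
	out, used, err := decodeLiteral(nil, []byte{0x0c, 'a', 'b', 'c', 'd'})
	fmt.Println(string(out), used, err)
}
```

The bounds checks and the byte-by-byte copy are the kind of work a hand-written assembly fast path would typically target, though which of them this commit changes is not stated here.
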
Decode throughput relative to the previous commit:
benchmark                     old MB/s     new MB/s     speedup
BenchmarkWordsDecode1e1-8     519.36       518.05       1.00x
BenchmarkWordsDecode1e2-8     691.63       776.28       1.12x
BenchmarkWordsDecode1e3-8     858.97       995.41       1.16x
BenchmarkWordsDecode1e4-8     581.86       615.92       1.06x
BenchmarkWordsDecode1e5-8     380.78       453.95       1.19x
BenchmarkWordsDecode1e6-8     403.12       453.74       1.13x
Benchmark_UFlat0-8            784.21       863.12       1.10x
Benchmark_UFlat1-8            625.49       766.01       1.22x
Benchmark_UFlat2-8            15366.67     15463.36     1.01x
Benchmark_UFlat3-8            1321.47      1388.63      1.05x
Benchmark_UFlat4-8            4338.83      4367.79      1.01x
Benchmark_UFlat5-8            770.24       844.84       1.10x
Benchmark_UFlat6-8            386.10       442.42       1.15x
Benchmark_UFlat7-8            376.79       437.68       1.16x
Benchmark_UFlat8-8            400.47       458.19       1.14x
Benchmark_UFlat9-8            362.89       423.36       1.17x
Benchmark_UFlat10-8           943.89       1023.05      1.08x
Benchmark_UFlat11-8           493.98       507.18       1.03x
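
The speedup column appears to be the new throughput divided by the old one, rounded to two decimals; that reading is an assumption, but it matches every row above. A minimal sketch of the arithmetic, using a hypothetical helper named speedup:

```go
package main

import "fmt"

// speedup formats the ratio new/old the way the table's last column
// appears to: two decimal places followed by "x".
func speedup(oldMBs, newMBs float64) string {
	return fmt.Sprintf("%.2fx", newMBs/oldMBs)
}

func main() {
	fmt.Println(speedup(691.63, 776.28)) // BenchmarkWordsDecode1e2-8: 1.12x
	fmt.Println(speedup(625.49, 766.01)) // Benchmark_UFlat1-8: 1.22x
}
```
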
1 file changed