Optimize asm for decoding literal fragments.

Relative to the previous commit:

benchmark                     old MB/s     new MB/s     speedup
BenchmarkWordsDecode1e1-8     519.36       518.05       1.00x
BenchmarkWordsDecode1e2-8     691.63       776.28       1.12x
BenchmarkWordsDecode1e3-8     858.97       995.41       1.16x
BenchmarkWordsDecode1e4-8     581.86       615.92       1.06x
BenchmarkWordsDecode1e5-8     380.78       453.95       1.19x
BenchmarkWordsDecode1e6-8     403.12       453.74       1.13x
Benchmark_UFlat0-8            784.21       863.12       1.10x
Benchmark_UFlat1-8            625.49       766.01       1.22x
Benchmark_UFlat2-8            15366.67     15463.36     1.01x
Benchmark_UFlat3-8            1321.47      1388.63      1.05x
Benchmark_UFlat4-8            4338.83      4367.79      1.01x
Benchmark_UFlat5-8            770.24       844.84       1.10x
Benchmark_UFlat6-8            386.10       442.42       1.15x
Benchmark_UFlat7-8            376.79       437.68       1.16x
Benchmark_UFlat8-8            400.47       458.19       1.14x
Benchmark_UFlat9-8            362.89       423.36       1.17x
Benchmark_UFlat10-8           943.89       1023.05      1.08x
Benchmark_UFlat11-8           493.98       507.18       1.03x
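
For reference, the speedup column appears to be the new throughput divided by
the old throughput, rounded to two decimals. A minimal sketch of that
arithmetic follows; the helper name speedup is hypothetical and is not part of
the package or of any benchmark tool:

```go
package main

import "fmt"

// speedup formats the ratio of new to old throughput (both in MB/s) in the
// same style as the "speedup" column: two decimals followed by "x".
// This is an illustrative helper, not code from the commit.
func speedup(oldMBps, newMBps float64) string {
	return fmt.Sprintf("%.2fx", newMBps/oldMBps)
}

func main() {
	fmt.Println(speedup(691.63, 776.28)) // BenchmarkWordsDecode1e2-8
	fmt.Println(speedup(625.49, 766.01)) // Benchmark_UFlat1-8
}
```

A value of 1.00x, as on BenchmarkWordsDecode1e1-8, means the change is within
rounding of no difference for that input.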
1 file changed