Speed up decompression by making the fast path for literals faster.

We do the fast-path step as soon as possible; in fact, as soon as we know the
literal length. Since we usually hit the fast path, we can then skip the checks
for long literals and available input space (beyond what the fast path check
already does).

Note that this changes the decompression Writer API; however, it does not
change the ABI, since writers are always templatized and as such never
cross compilation units. The new API is slightly more general, in that it
doesn't hard-code the value 16. Note that we also take care to check
for len <= 16 first, since the other two checks almost always succeed
(so we don't want to waste time checking for them until we have to).

The improvements are most marked on Nehalem, but are generally positive
on other platforms as well. All microbenchmarks are 64-bit, opt.

Clovertown (Core 2):

  Benchmark     Time(ns)    CPU(ns) Iterations
  --------------------------------------------
  BM_UFlat/0      110226     110224     100000 886.0MB/s  html    [ +1.5%]
  BM_UFlat/1     1036523    1036508      10000 646.0MB/s  urls    [ -0.8%]
  BM_UFlat/2       26775      26775     522570 4.4GB/s  jpg       [ +0.0%]
  BM_UFlat/3       49738      49737     280974 1.8GB/s  pdf       [ +0.3%]
  BM_UFlat/4      446790     446792      31334 874.3MB/s  html4   [ +0.8%]
  BM_UFlat/5       40561      40562     350424 578.5MB/s  cp      [ +1.3%]
  BM_UFlat/6       18722      18722     746903 568.0MB/s  c       [ +1.4%]
  BM_UFlat/7        5373       5373    2608632 660.5MB/s  lsp     [ +8.3%]
  BM_UFlat/8     1615716    1615718       8670 607.8MB/s  xls     [ +2.0%]
  BM_UFlat/9      345278     345281      40481 420.1MB/s  txt1    [ +1.4%]
  BM_UFlat/10     294855     294855      47452 404.9MB/s  txt2    [ +1.6%]
  BM_UFlat/11     914263     914263      15316 445.2MB/s  txt3    [ +1.1%]
  BM_UFlat/12    1222694    1222691      10000 375.8MB/s  txt4    [ +1.4%]
  BM_UFlat/13     584495     584489      23954 837.4MB/s  bin     [ -0.6%]
  BM_UFlat/14      66662      66662     210123 547.1MB/s  sum     [ +1.2%]
  BM_UFlat/15       7368       7368    1881856 547.1MB/s  man     [ +4.0%]
  BM_UFlat/16     110727     110726     100000 1021.4MB/s  pb     [ +2.3%]
  BM_UFlat/17     382138     382141      36616 460.0MB/s  gaviota [ -0.7%]

Westmere (Core i7):

  Benchmark     Time(ns)    CPU(ns) Iterations
  --------------------------------------------
  BM_UFlat/0       78861      78853     177703 1.2GB/s  html      [ +2.1%]
  BM_UFlat/1      739560     739491      18912 905.4MB/s  urls    [ +3.4%]
  BM_UFlat/2        9867       9866    1419014 12.0GB/s  jpg      [ +3.4%]
  BM_UFlat/3       31989      31986     438385 2.7GB/s  pdf       [ +0.2%]
  BM_UFlat/4      319406     319380      43771 1.2GB/s  html4     [ +1.9%]
  BM_UFlat/5       29639      29636     472862 791.7MB/s  cp      [ +5.2%]
  BM_UFlat/6       13478      13477    1000000 789.0MB/s  c       [ +2.3%]
  BM_UFlat/7        4030       4029    3475364 880.7MB/s  lsp     [ +8.7%]
  BM_UFlat/8     1036585    1036492      10000 947.5MB/s  xls     [ +6.9%]
  BM_UFlat/9      242127     242105      57838 599.1MB/s  txt1    [ +3.0%]
  BM_UFlat/10     206499     206480      67595 578.2MB/s  txt2    [ +3.4%]
  BM_UFlat/11     641635     641570      21811 634.4MB/s  txt3    [ +2.4%]
  BM_UFlat/12     848847     848769      16443 541.4MB/s  txt4    [ +3.1%]
  BM_UFlat/13     384968     384938      36366 1.2GB/s  bin       [ +0.3%]
  BM_UFlat/14      47106      47101     297770 774.3MB/s  sum     [ +4.4%]
  BM_UFlat/15       5063       5063    2772202 796.2MB/s  man     [ +7.7%]
  BM_UFlat/16      83663      83656     167697 1.3GB/s  pb        [ +1.8%]
  BM_UFlat/17     260224     260198      53823 675.6MB/s  gaviota [ -0.5%]

Barcelona (Opteron):

  Benchmark     Time(ns)    CPU(ns) Iterations
  --------------------------------------------
  BM_UFlat/0      112490     112457     100000 868.4MB/s  html    [ -0.4%]
  BM_UFlat/1     1066719    1066339      10000 627.9MB/s  urls    [ +1.0%]
  BM_UFlat/2       24679      24672     563802 4.8GB/s  jpg       [ +0.7%]
  BM_UFlat/3       50603      50589     277285 1.7GB/s  pdf       [ +2.6%]
  BM_UFlat/4      452982     452849      30900 862.6MB/s  html4   [ -0.2%]
  BM_UFlat/5       43860      43848     319554 535.1MB/s  cp      [ +1.2%]
  BM_UFlat/6       21419      21413     653573 496.6MB/s  c       [ +1.0%]
  BM_UFlat/7        6646       6645    2105405 534.1MB/s  lsp     [ +0.3%]
  BM_UFlat/8     1828487    1827886       7658 537.3MB/s  xls     [ +2.6%]
  BM_UFlat/9      391824     391714      35708 370.3MB/s  txt1    [ +2.2%]
  BM_UFlat/10     334913     334816      41885 356.6MB/s  txt2    [ +1.7%]
  BM_UFlat/11    1042062    1041674      10000 390.7MB/s  txt3    [ +1.1%]
  BM_UFlat/12    1398902    1398456      10000 328.6MB/s  txt4    [ +1.7%]
  BM_UFlat/13     545706     545530      25669 897.2MB/s  bin     [ -0.4%]
  BM_UFlat/14      71512      71505     196035 510.0MB/s  sum     [ +1.4%]
  BM_UFlat/15       8422       8421    1665036 478.7MB/s  man     [ +2.6%]
  BM_UFlat/16     112053     112048     100000 1009.3MB/s  pb     [ -0.4%]
  BM_UFlat/17     416723     416713      33612 421.8MB/s  gaviota [ -2.0%]

R=sanjay


git-svn-id: https://snappy.googlecode.com/svn/trunk@53 03e5f5b5-db94-4691-08a0-1a8bf15f6143
1 file changed