Optimize ARM64 SIMD code for Cavium ThunderX

Per @ssvb:
ThunderX is an ARM64 chip that dedicates most of its transistor real
estate to providing 48 cores, so each core is not as fast as a result.
Each core is dual-issue & in-order for scalar instructions and has only
a single-issue half-width NEON unit, so the peak throughput is one
128-bit instruction per 2 cycles.  So careful instruction scheduling is
important.  Furthermore, ThunderX has an extremely slow implementation
of ld2 and ld3, so this commit implements the equivalent of those
instructions using ld1.

Compression speedup relative to libjpeg-turbo 1.4.2:
48-core ThunderX (RunAbove ARM Cloud), Linux, 64-bit: 58-85% (avg. 74%)
relative to jpeg-6b: 1.75-2.14x (avg. 1.95x)

Refer to #49 and #51 for discussion.

Closes #51.

This commit also wordsmiths the ChangeLog entry (the ARMv8 SIMD
implementation is "complete" only for compression-- it still lacks some
decompression algorithms, as does the ARMv7 implementation.)

Based on:
https://github.com/mayeut/libjpeg-turbo/commit/9405b5fd031558113bdfeae193a2b14baa589a75

which is based on:
https://github.com/libjpeg-turbo/libjpeg-turbo/commit/f561944ff70adef65bb36212913bd28e6a2926d6
https://github.com/libjpeg-turbo/libjpeg-turbo/commit/962c8ab21feb3d7fc2a7a1ec8d26f6b985bbb86f
2 files changed
tree: 013c35d9228fac97bde0d8c66a14d1d0d22d075d
  1. cmakescripts/
  2. doc/
  3. java/
  4. md5/
  5. release/
  6. sharedlib/
  7. simd/
  8. testimages/
  9. win/
  10. .gitignore
  11. acinclude.m4
  12. bmp.c
  13. bmp.h
  14. BUILDING.md
  15. cderror.h
  16. cdjpeg.c
  17. cdjpeg.h
  18. change.log
  19. ChangeLog.txt
  20. cjpeg.1
  21. cjpeg.c
  22. CMakeLists.txt
  23. coderules.txt
  24. configure.ac
  25. djpeg.1
  26. djpeg.c
  27. doxygen-extra.css
  28. doxygen.config
  29. example.c
  30. jaricom.c
  31. jcapimin.c
  32. jcapistd.c
  33. jcarith.c
  34. jccoefct.c
  35. jccolext.c
  36. jccolor.c
  37. jcdctmgr.c
  38. jchuff.c
  39. jchuff.h
  40. jcinit.c
  41. jcmainct.c
  42. jcmarker.c
  43. jcmaster.c
  44. jcomapi.c
  45. jconfig.h.in
  46. jconfig.txt
  47. jconfigint.h.in
  48. jcparam.c
  49. jcphuff.c
  50. jcprepct.c
  51. jcsample.c
  52. jcstest.c
  53. jctrans.c
  54. jdapimin.c
  55. jdapistd.c
  56. jdarith.c
  57. jdatadst-tj.c
  58. jdatadst.c
  59. jdatasrc-tj.c
  60. jdatasrc.c
  61. jdcoefct.c
  62. jdcoefct.h
  63. jdcol565.c
  64. jdcolext.c
  65. jdcolor.c
  66. jdct.h
  67. jddctmgr.c
  68. jdhuff.c
  69. jdhuff.h
  70. jdinput.c
  71. jdmainct.c
  72. jdmainct.h
  73. jdmarker.c
  74. jdmaster.c
  75. jdmerge.c
  76. jdmrg565.c
  77. jdmrgext.c
  78. jdphuff.c
  79. jdpostct.c
  80. jdsample.c
  81. jdsample.h
  82. jdtrans.c
  83. jerror.c
  84. jerror.h
  85. jfdctflt.c
  86. jfdctfst.c
  87. jfdctint.c
  88. jidctflt.c
  89. jidctfst.c
  90. jidctint.c
  91. jidctred.c
  92. jinclude.h
  93. jmemmgr.c
  94. jmemnobs.c
  95. jmemsys.h
  96. jmorecfg.h
  97. jpeg_nbits_table.h
  98. jpegcomp.h
  99. jpegint.h
  100. jpeglib.h
  101. jpegtran.1
  102. jpegtran.c
  103. jquant1.c
  104. jquant2.c
  105. jsimd.h
  106. jsimd_none.c
  107. jsimddct.h
  108. jstdhuff.c
  109. jutils.c
  110. jversion.h
  111. libjpeg.map.in
  112. libjpeg.txt
  113. LICENSE.md
  114. Makefile.am
  115. rdbmp.c
  116. rdcolmap.c
  117. rdgif.c
  118. rdjpgcom.1
  119. rdjpgcom.c
  120. rdppm.c
  121. rdrle.c
  122. rdswitch.c
  123. rdtarga.c
  124. README.ijg
  125. README.md
  126. structure.txt
  127. tjbench.c
  128. tjbenchtest.in
  129. tjbenchtest.java.in
  130. tjexampletest.in
  131. tjunittest.c
  132. tjutil.c
  133. tjutil.h
  134. transupp.c
  135. transupp.h
  136. turbojpeg-jni.c
  137. turbojpeg-mapfile
  138. turbojpeg-mapfile.jni
  139. turbojpeg.c
  140. turbojpeg.h
  141. usage.txt
  142. wizard.txt
  143. wrbmp.c
  144. wrgif.c
  145. wrjpgcom.1
  146. wrjpgcom.c
  147. wrppm.c
  148. wrrle.c
  149. wrtarga.c