ARM: NEON better instruction scheduling of over_n_8888

New head, tail, tail/head blocks are added and instructions
are reordered to eliminate pipeline stalls

Performance numbers of before/after

- cortex a8 -
before : L1: 375.39  L2: 391.93  M:114.39 ( 40.99%)  HT: 99.37  VT: 98.20  R: 90.24  RT: 32.87 ( 240Kops/s)
after  : L1: 481.90  L2: 483.46  M:114.29 ( 40.69%)  HT:106.91  VT: 93.38  R: 90.74  RT: 29.51 ( 236Kops/s)

- cortex a9 -
before : L1: 324.50  L2: 332.79  M:155.55 ( 47.51%)  HT:111.93  VT: 93.58  R: 71.92  RT: 28.21 ( 233Kops/s)
after  : L1: 355.87  L2: 364.49  M:156.90 ( 47.59%)  HT:111.52  VT: 91.76  R: 72.16  RT: 28.22 ( 234Kops/s)
1 file changed