firmware/2lib: Use SSE2 to speed up Montgomery multiplication

This commit implements Algorithm 2 from the paper "Montgomery
Multiplication Using Vector Instructions" (August 20, 2013, cf.
https://eprint.iacr.org/2013/519.pdf). This variant of the Montgomery
multiplication algorithm decouples some arithmetic steps so that they
can be performed in parallel.
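
As an illustration of that decoupling, here is a minimal scalar sketch
(not the SSE2 code added by this commit). It assumes 32-bit
little-endian words, an odd modulus and inputs already reduced modulo
m. The names mont_mul_2way, inv32 and MONT_MAX_WORDS are hypothetical.
The d[] chain accumulates a*b and the e[] chain accumulates q*m. The
two chains only interact through the computation of q, which is what
lets a vector unit process them side by side.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MONT_MAX_WORDS 64	/* hypothetical cap, 2048-bit operands */

/* Inverse of an odd 32-bit word modulo 2^32 (Newton iteration). */
static uint32_t inv32(uint32_t m0)
{
	uint32_t inv = m0;	/* already correct to 3 bits for odd m0 */
	for (int i = 0; i < 5; i++)
		inv *= 2u - m0 * inv;	/* doubles the correct bits */
	return inv;
}

/* c = a * b * 2^(-32 * n) mod m, with a, b < m and m odd. */
static void mont_mul_2way(uint32_t *c, const uint32_t *a,
			  const uint32_t *b, const uint32_t *m, size_t n)
{
	uint32_t d[MONT_MAX_WORDS] = {0};	/* chain 1: a * b */
	uint32_t e[MONT_MAX_WORDS] = {0};	/* chain 2: q * m */
	const uint32_t mu = inv32(m[0]);

	assert(n >= 1 && n <= MONT_MAX_WORDS && (m[0] & 1));

	for (size_t j = 0; j < n; j++) {
		/* q makes the low words of both chains equal. */
		uint32_t q = mu * (a[j] * b[0] + d[0] - e[0]);
		uint64_t t0 = ((uint64_t)a[j] * b[0] + d[0]) >> 32;
		uint64_t t1 = ((uint64_t)q * m[0] + e[0]) >> 32;

		for (size_t i = 1; i < n; i++) {
			/* Two independent multiply-accumulate lanes. */
			uint64_t p0 = (uint64_t)a[j] * b[i] + t0 + d[i];
			uint64_t p1 = (uint64_t)q * m[i] + t1 + e[i];
			d[i - 1] = (uint32_t)p0;
			e[i - 1] = (uint32_t)p1;
			t0 = p0 >> 32;
			t1 = p1 >> 32;
		}
		d[n - 1] = (uint32_t)t0;
		e[n - 1] = (uint32_t)t1;
	}

	/* c = d - e, then add m back if the difference is negative. */
	uint64_t borrow = 0;
	for (size_t i = 0; i < n; i++) {
		uint64_t s = (uint64_t)d[i] - e[i] - borrow;
		c[i] = (uint32_t)s;
		borrow = (s >> 32) & 1;
	}
	if (borrow) {
		uint64_t carry = 0;
		for (size_t i = 0; i < n; i++) {
			uint64_t s = (uint64_t)c[i] + m[i] + carry;
			c[i] = (uint32_t)s;
			carry = s >> 32;
		}
	}
}
```

In a vectorized version, p0 and p1 would typically be produced by a
single packed 32x32->64 multiply. That mapping is an assumption about
how such code is usually written, not a description of this change.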

This implementation leverages the SSE2 instruction set to perform
these arithmetic operations in parallel. It is selected by setting the
`X86_SSE2_INSTR' compilation flag and can deliver a significant
performance boost to the RSA signature verification process.

For instance, on a Meteor Lake rex0 board we measured a 56% reduction
in the execution time of the modular exponentiation operation.

| modpow() function call during boot | original | SSE2 Algorithm 2 |
|------------------------------------+----------+------------------|
| coreboot/verstage 1                |    6.531 |            2.965 |
| coreboot/verstage 2                |    1.854 |            0.750 |
| coreboot/verstage 3                |    1.841 |            0.751 |
| depthcharge 1                      |    0.547 |            0.273 |
| depthcharge 2                      |    0.152 |            0.079 |
| depthcharge 3                      |    0.164 |            0.077 |
|------------------------------------+----------+------------------|
| Total (ms)                         |   11.089 |            4.895 |
| Ratio compared to original         |    100 % |             44 % |

However, on some SoCs, SSE2 may not perform well before the silicon is
fully initialized. For example, on a Raptor Lake board we observed a
performance degradation before FSP-S and an improvement after FSP-S.

On a Raptor Lake brya0 board:

| modpow() function call during boot | original | SSE2 Algorithm 2 |
|------------------------------------+----------+------------------|
| coreboot/verstage 1                |    5.642 |           10.145 |
| coreboot/verstage 2                |    1.560 |            2.412 |
| coreboot/verstage 3                |    1.548 |            2.413 |
| depthcharge 1                      |    0.693 |            0.248 |
| depthcharge 2                      |    0.172 |            0.065 |
| depthcharge 3                      |    0.223 |            0.067 |
|------------------------------------+----------+------------------|
| Total (ms)                         |    9.838 |           15.350 |
| Ratio compared to original         |    100 % |            156 % |

Hence the optimized code should only be enabled where measurements
show a benefit.

BUG=b:312709384
BRANCH=main
TEST=modpow time duration is divided by two on rex board

Cq-Depend: chromium:5146107
Change-Id: I8739f94c0f6423f6e848bcae10b56e8be8e352ce
Signed-off-by: Jeremy Compostella <jeremy.compostella@intel.com>
Signed-off-by: Muhammad Monir Hossain <muhammad.monir.hossain@intel.com>
Reviewed-on: https://chromium-review.googlesource.com/c/chromiumos/platform/vboot_reference/+/5055254
Commit-Queue: Julius Werner <jwerner@chromium.org>
Tested-by: Julius Werner <jwerner@chromium.org>
Reviewed-by: Julius Werner <jwerner@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/c/chromiumos/platform/vboot_reference/+/5149259
Commit-Queue: YH Lin <yueherngl@chromium.org>
Tested-by: Subrata Banik <subratabanik@chromium.org>
Reviewed-by: YH Lin <yueherngl@chromium.org>