ARM's 64-bit mode (AArch64 ARMv8):
- Good web presentation about AArch64: http://people.linaro.org/~rikuvoipio/aarch64-talk/#/
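- All AArch64 instructions are 32 bits long (instead of a mix of 16-bit and 32-bit encodings, as in Thumb-2).
- ARMv8 64-bit mode has doubled the register width (64-bit instead of 32-bit) and roughly doubled the number of general-purpose CPU registers: 31 (X0-X30) instead of 16, with register number 31 encoding either the zero register or the stack pointer, depending on the instruction.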
- When only part of a large register (CPU or FPU/SIMD) is used, the higher bits are ignored for reads and set to 0 for writes.
- Smaller FPU/SIMD registers are no longer packed into larger registers, but are mapped one-to-one onto the low-order bits of the 128-bit register with the same number. ie: D2 = Q2[0:63], not Q1[0:63], and scalar S2 = D2[0:31], not D1[0:31].
- Scalar VFP has explicit round-float-to-int instructions for many types of rounding modes.
- Scalar VFP also has instructions to convert between float and fixed-point, with rounding. eg: FCVTZS W0, S1, #14 converts the float in S1 to a signed fixed-point value in W0 with 14 fractional bits, rounding toward zero.
- NEON instruction names no longer start with "V"; they now start with "S" (signed), "U" (unsigned), "F" (floating-point) or "P" (polynomial), and several other things about the mnemonics have changed.
- NEON has proper IEEE 754 float & double vector support (instead of some basic float support and no double support at all).
- NEON includes some new operations such as vector reduction between lanes, vector float normalization (FRECPX, FMULX), vector bit-shifts, vector reciprocal & sqrt estimation. See ARMv8 ISA doc 5.8.25 for full list.
- Even AArch32 in ARMv8 has a few new instructions compared to ARMv7: VCVTx.s32.f32, VRINTx, VMAXNM/VMINNM.
- Integer division by 0 (SDIV/UDIV) returns 0 rather than raising a divide-by-zero trap.
- The instructions LDM, STM, PUSH and POP have all been removed; LDP & STP (Load-Pair & Store-Pair) replace them.
- There is now quite fine-grained control of cache prefetching through the PRFM (Prefetch Mem) instruction, such as prefetch-to-L1 or prefetch-instructions-to-L2, etc.
- New instructions LDNP & STNP (non-temporal load or store pair) hint that the data is unlikely to be reused, so an implementation may access DRAM directly instead of filling the cache. They are aimed at situations where you only expect to access the memory location once, such as bulk memcpy. These non-temporal accesses can still be combined with the PRFM instruction to prefetch the data in advance, such as PRFM PLDL2STRM.
- Stack Pointer (XSP) must be aligned to 16 bytes (quad-word) when it is used to address memory.
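
A few of the points above (partial register writes, float to fixed-point conversion, and division by zero) can be sketched in portable C. This is only a rough model of the behaviour as described in these notes, not real AArch64 code: the helper names `write_w`, `sdiv32` and `fcvtzs_w_s` are made up for illustration, and the NaN and saturation handling in `fcvtzs_w_s` is an assumption about how FCVTZS treats out-of-range inputs.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of writing a 32-bit W register: the upper 32 bits
 * of the enclosing X register are set to 0, not preserved. */
static uint64_t write_w(uint64_t x_old, uint32_t w_new) {
    (void)x_old;                 /* old contents are discarded entirely */
    return (uint64_t)w_new;      /* zero-extended into the 64-bit register */
}

/* Hypothetical model of a 32-bit SDIV: dividing by 0 yields 0, no trap. */
static int32_t sdiv32(int32_t n, int32_t d) {
    if (d == 0) {
        return 0;
    }
    if (n == INT32_MIN && d == -1) {
        return INT32_MIN;        /* overflow case wraps instead of trapping */
    }
    return n / d;                /* C division truncates toward zero */
}

/* Hypothetical model of FCVTZS Wd, Sn, #fbits: scale by 2^fbits, then
 * round toward zero. NaN -> 0 and saturation are assumptions here. */
static int32_t fcvtzs_w_s(float s, unsigned fbits) {
    assert(fbits >= 1 && fbits <= 32);
    if (s != s) {
        return 0;                /* NaN input */
    }
    /* A float times 2^fbits (fbits <= 32) is exactly representable in a double. */
    double scaled = (double)s * (double)(1ULL << fbits);
    if (scaled >= 2147483648.0) {
        return INT32_MAX;        /* saturate on positive overflow */
    }
    if (scaled <= -2147483648.0) {
        return INT32_MIN;        /* saturate on negative overflow */
    }
    return (int32_t)scaled;      /* the cast truncates toward zero */
}
```

For example, `fcvtzs_w_s(1.5f, 14)` gives 24576 (1.5 * 2^14), which mirrors the FCVTZS W0, S1, #14 example, and `sdiv32(7, 0)` gives 0.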