ARM's 64-bit mode (AArch64 ARMv8):

Last updated on 11th June, 2016 by Shervin Emami. Posted originally on 11th June, 2016.

Good web presentation about AArch64: http://people.linaro.org/~rikuvoipio/aarch64-talk/#/
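All AArch64 instructions are 32 bits long (instead of some 16-bit and some 32-bit).
ARMv8 64-bit mode has doubled the register widths (64-bit instead of 32-bit) and roughly doubled the number of general CPU registers: there are 31 general-purpose registers (X0 - X30) instead of 16, and register number 31 encodes either the stack pointer or a zero register, depending on the instruction.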
When only part of a large register (CPU or FPU/SIMD) is used, the higher bits are ignored for reads and set to 0 for writes.
Smaller FPU/SIMD registers are no longer packed into larger registers, but are mapped one-to-one to the low-order bits of the 128-bit register. ie: D2 = V2[0:63], not Q1[0:63], and scalar S2 = D2[0:31], not D1[0:31].
Scalar VFP has explicit round-float-to-int instructions for many types of rounding modes.
Scalar VFP also has instructions to convert between float and fixed-point, with rounding. eg: FCVTZS W0, S1, #14
NEON instruction names no longer start with "V", now start with "S" (signed) or "U" (unsigned) or "F" (floating-point) or "P" (polynomial), and several other things about the mnemonics have changed.
NEON has proper IEEE754 float & double vector support (instead of some basic float support and no double support at all).
NEON includes some new operations such as vector reduction between lanes, vector float normalization (FRECPX, FMULX), vector bit-shifts, vector reciprocal & sqrt estimation. See ARMv8 ISA doc 5.8.25 for full list.
Even AArch32 in ARMv8 has a few new instructions compared to ARMv7: VCVTx.s32.f32, VRINTx, VMAXNM/VMINNM.
Integer division by 0 creates the result 0 without a divide-by-zero error trap.
The instructions LDM, STM, PUSH, POP have all been removed, but are replaced with LDP & STP (Load-Pair & Store-Pair) instructions.
There is now quite fine-grained control of cache prefetching through the PRFM (Prefetch Mem) instruction, such as prefetch-to-L1 or prefetch-instructions-to-L2, etc.
New instructions LDNP & STNP (non-temporal load or store pair) hint that the access should go to DRAM directly instead of through the cache. They are aimed at situations where you only expect to access the memory location once, such as a bulk memcpy. These non-cached instructions still support the PRFM instruction to prefetch the data in advance, such as the PRFM PLDL2STRM instruction.
The stack pointer (SP) must be aligned to 16 bytes (quad-word).
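
The float-to-fixed-point conversion mentioned in the list above (FCVTZS W0, S1, #14) can be approximated in portable C. This is only a sketch of the arithmetic, assuming the instruction scales by 2^14, rounds toward zero, saturates on overflow and converts NaN to 0. The function name float_to_fixed_q14 is made up for illustration, and a compiler is not guaranteed to emit FCVTZS for it.

```c
#include <math.h>
#include <stdint.h>

/* Hypothetical helper: emulate "FCVTZS W0, S1, #14" in plain C.
   Converts a float to signed 32-bit fixed-point with 14 fractional bits. */
int32_t float_to_fixed_q14(float x)
{
    if (isnan(x))
        return 0;                         /* assumption: NaN converts to 0 */

    double scaled = (double)x * 16384.0;  /* multiply by 2^14 */

    if (scaled >= 2147483647.0)
        return INT32_MAX;                 /* saturate on positive overflow */
    if (scaled <= -2147483648.0)
        return INT32_MIN;                 /* saturate on negative overflow */

    return (int32_t)scaled;               /* C conversion truncates toward zero */
}
```

For example, float_to_fixed_q14(1.5f) gives 24576 (1.5 * 16384), and small values of either sign such as 0.00001f truncate to 0.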

64-bit ARM in C/C++:

Data types in AArch64 mode:

char:                8-bit unsigned.
bool / _Bool:        8-bit unsigned. False is 0 and True is 1.
long long:           64-bit signed.
int:                 32-bit signed in all three data models shown below.
long, pointer:       might be 32-bit or 64-bit, depending on ILP32/LP64/LLP64 shown below.

64-bit Data models:

ILP32:               long is 32-bit, wchar_t is 32-bit, all pointers & size_t are 32-bit.
LP64:                long is 64-bit, wchar_t is 32-bit, all pointers & size_t are 64-bit.
LLP64:               long is 32-bit, wchar_t is 16-bit, all pointers & size_t are 64-bit.
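
The three data models can be told apart from the sizes of int, long and a pointer. A minimal sketch, assuming only the sizes listed above; the enum and function names are made up for illustration:

```c
#include <stddef.h>

/* Hypothetical names for the three data models described above. */
enum data_model { MODEL_UNKNOWN, MODEL_ILP32, MODEL_LP64, MODEL_LLP64 };

/* Classify a data model from the sizes (in bytes) of int, long and a pointer. */
enum data_model classify_data_model(size_t int_size, size_t long_size,
                                    size_t ptr_size)
{
    if (int_size == 4 && long_size == 4 && ptr_size == 4)
        return MODEL_ILP32;   /* int, long and pointers are all 32-bit */
    if (int_size == 4 && long_size == 8 && ptr_size == 8)
        return MODEL_LP64;    /* long and pointers are 64-bit */
    if (int_size == 4 && long_size == 4 && ptr_size == 8)
        return MODEL_LLP64;   /* only pointers (and long long) are 64-bit */
    return MODEL_UNKNOWN;
}

/* Data model of whichever compiler and target builds this file. */
enum data_model current_data_model(void)
{
    return classify_data_model(sizeof(int), sizeof(long), sizeof(void *));
}
```

On a typical AArch64 Linux toolchain, current_data_model() would be expected to return MODEL_LP64.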

AArch64 64-bit AAPCS Calling Convention:

AAPCS (the ARM Architecture Procedure Call Standard) is the convention the C/C++ compiler follows when calling functions. By default on Linux, C/C++ uses the LP64 data model, so "int" is still 32-bit, but "long" and pointers are 64-bit.

Default sizes in AArch64 64-bit AAPCS:

char:        8 bits
int:         32 bits
long:        64 bits
long long:   64 bits
pointer:     64 bits

AArch64 has these GCC-specific defines:

#define __aarch64__ 1
#define __SIZEOF_POINTER__  8

eg:

#if __SIZEOF_POINTER__ == 8
    // 64-bit pointers.
#else
    // 32-bit pointers.
#endif
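
The same idea can be wrapped in small functions so the rest of the code does not need #if blocks. A sketch, assuming a GCC-compatible compiler: __LP64__ is another predefined macro from GCC and Clang (not listed above), and the function names are made up for illustration.

```c
/* Pointer width in bits, using the GCC define when it is available. */
int pointer_bits(void)
{
#if defined(__SIZEOF_POINTER__)
    return __SIZEOF_POINTER__ * 8;
#else
    return (int)sizeof(void *) * 8;   /* fallback for other compilers */
#endif
}

/* 1 when compiled for AArch64, 0 for any other target. */
int is_aarch64(void)
{
#if defined(__aarch64__)
    return 1;
#else
    return 0;
#endif
}

/* 1 when the compiler reports the LP64 data model, else 0. */
int is_lp64(void)
{
#if defined(__LP64__)
    return 1;
#else
    return 0;
#endif
}
```

The #else branches let the same file build on other targets, such as a 32-bit ARM or x86 host.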

Registers in the AArch64 Calling Convention:

X0 - X7:    arguments & result (instead of R0-R3).
X8:         indirect result (struct) location (new!).
X9 - X15:   spare temp registers (new!).
X16 - X17:  intra-procedure-call scratch registers IP0 & IP1 (PLT, linker) (instead of R12).
X18:        platform specific (TLS) (instead of R9).
X19 - X28:  callee-saved registers (instead of R4-R8,R10,R11). Must save all 64-bits even in ILP32 mode!
X29:        frame pointer (instead of R7 in iOS).  Must save all 64-bits even in ILP32 mode!
X30:        link register (instead of R14).
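X31:        not a named register; register number 31 means stack pointer "SP" or the zero register "XZR" (0 value), depending on the instruction (instead of R13).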
(And notice there is no longer a "PC" register that was R15. It is implicitly used by certain instructions like branches & functions).

No soft-float option anymore (since VFP/SIMD is mandatory).

VFP/SIMD registers in the AArch64 hard-float Calling Convention:

V0 - V7:    arguments & result.
V8 - V15:   callee-saved registers (same registers as AArch32), but only the low 64 bits (bits 64:127 are not saved).
V16 - V31:  spare temp registers (new!)

Function arguments larger than 16 bytes (eg: large structs) are passed by reference, while smaller structs are passed by value.
Function arguments that are structs are padded to multiples of 8 bytes.
Function arguments that are floating-point (of any size) or SIMD vectors use the first 8 float/SIMD registers.
Function arguments each take up 64-bits of register or 64-bits of memory.
Unlike in the 32-bit AAPCS, half-precision floats can be passed directly.
If a function returns data too large to return in registers (eg: a struct larger than 16 bytes), then the caller passes an extra pointer in register X8. The function stores the result into the memory pointed to by X8, instead of returning it by value in registers.
To see a basic .S assembly file, compile a simple Hello World C program using "gcc -S main.c -o main.S", to give you the assembly file "main.S".
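
The rules above can be explored with a few small functions. This is a sketch with made-up struct and function names; the register assignments in the comments are what the rules above would predict, not something the C code itself can prove, so compile with "gcc -S" on an AArch64 toolchain to check them.

```c
#include <stdint.h>

/* 16 bytes: small enough to be passed by value.
   Expected to travel in two general registers (X0 and X1). */
struct Small { int64_t a, b; };

/* 24 bytes: larger than 16 bytes, so expected to be passed by reference
   and returned through the memory block pointed to by X8. */
struct Large { int64_t a, b, c; };

/* Argument expected in X0 & X1, result in X0. */
int64_t sum_small(struct Small s)
{
    return s.a + s.b;
}

/* Argument expected as a pointer to a copy made by the caller. */
int64_t sum_large(struct Large s)
{
    return s.a + s.b + s.c;
}

/* Result is too big for registers: expected to be written via X8. */
struct Large make_large(int64_t base)
{
    struct Large r = { base, base + 1, base + 2 };
    return r;
}

/* Float argument expected in S0 (low 32 bits of V0), int in W0,
   float result in S0. */
float scale(float v, int n)
{
    return v * (float)n;
}
```

Looking for "x8" in the generated assembly of make_large() is a quick way to see the indirect result pointer in use.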

Linux ELF64 info for a typical 64-bit ARM C/C++ AArch64 application:

Output of a typical "file mytest" command:

    mytest: ELF 64-bit LSB executable, version 1 (SYSV), statically linked, for GNU/Linux 3.9.0, not stripped

Output of a typical "readelf -h mytest" command:

    ELF Header:
    Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
    Class:                             ELF64
    Data:                              2s complement, little endian
    Version:                           1 (current)
    OS/ABI:                            UNIX - System V
    ABI Version:                       0
    Type:                              EXEC (Executable file)
    Machine:                           AArch64
    Version:                           0x1
    Entry point address:               0x400c10
    Start of program headers:          64 (bytes into file)
    Start of section headers:          814728 (bytes into file)
    Flags:                             0x0
    Size of this header:               64 (bytes)
    Size of program headers:           56 (bytes)
    Number of program headers:         4
    Size of section headers:           64 (bytes)
    Number of section headers:         36
    Section header string table index: 33
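
The first few header fields shown above can also be checked from C. A minimal sketch, assuming the standard ELF layout (16 bytes of e_ident, then a 2-byte e_type and a 2-byte e_machine) and the e_machine value 183 for AArch64; the struct and function names are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_EM_AARCH64 183   /* e_machine value for AArch64 */

/* Hypothetical summary of the fields checked below. */
struct elf_summary {
    int is_elf;            /* magic is 7f 45 4c 46 (0x7f 'E' 'L' 'F') */
    int is_64bit;          /* e_ident[4] == 2, shown as "Class: ELF64" */
    int is_little_endian;  /* e_ident[5] == 1, shown as "little endian" */
    int is_aarch64;        /* e_machine == 183 */
};

/* Inspect the first 20 bytes of a file that may be an ELF binary. */
struct elf_summary summarize_elf_header(const uint8_t *hdr, size_t len)
{
    struct elf_summary s = { 0, 0, 0, 0 };

    if (len < 20)
        return s;          /* too short to hold e_ident + e_type + e_machine */

    s.is_elf = hdr[0] == 0x7f && hdr[1] == 'E' && hdr[2] == 'L' && hdr[3] == 'F';
    if (!s.is_elf)
        return s;

    s.is_64bit = hdr[4] == 2;
    s.is_little_endian = hdr[5] == 1;

    /* e_machine is at byte offset 18, stored in the file's own byte order. */
    uint16_t machine = s.is_little_endian
        ? (uint16_t)(hdr[18] | (hdr[19] << 8))
        : (uint16_t)((hdr[18] << 8) | hdr[19]);
    s.is_aarch64 = machine == SKETCH_EM_AARCH64;

    return s;
}
```

Feeding it the first 20 bytes of a file (for example read with fread) is enough; the remaining header fields are not inspected.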