These lists were created with the AIDA64 Instruction Latency Dump feature. If you do not trust software measurements, wait for the official Intel/AMD/etc. guides and hope they will be more detailed and accurate than the current ones. ;) You can create such a dump in AIDA64 by right-clicking the bottom status bar of the AIDA64 main window -> CPU Debug -> Instruction Latency Dump. It works fully in the trial version, too.
In this dump, latency means the time it takes before the next dependent instruction of the same type can start. Throughput means the time it takes before the next independent instruction of the same type can start:
```
  L: ADD rax, rax       T: ADD rax, rax
     ADD rax, rax          ADD rbx, rbx
     ADD rax, rax          ADD rcx, rcx
     ...                   ...
```
These values are measured with long chains of instructions (~6000), so they are sustained rates; peak rates can be higher.
Some instructions do not modify the target register, e.g. CMP, TEST, BT, NOP. A dependent chain cannot be built from them, so it is not always possible to measure the instruction latency directly.
Some instructions never depend on a previous one: they use different source and destination register sets or have a memory operand. Their latency cannot always be measured directly either, but it is possible to measure instruction pairs, e.g. PUSH + POP, or MOV reg, [mem] + MOV [mem], reg. The abbreviation "LS pair" means a pair formed from the load form and the store form of a move instruction.
Newer processors can recognize that some instructions with identical operands are independent of previous ones. In this case latency can be lower than 1. The classic example is the XOR instruction: the result of XOR eax, eax is always 0, so it never depends on the result of the previous XOR. XOR r32_1, r32_2 means that the two operands are different registers:
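To make the difference between the two chain types concrete, here is a minimal sketch of a toy in-order scheduler. It is not how AIDA64 measures anything; the function name `cycles_per_instruction`, the parameters `issue_width` and `num_regs`, and the hypothetical CPU (1-cycle ADD, 4 starts per cycle) are all assumptions made for illustration.

```python
def cycles_per_instruction(n, latency, issue_width, num_regs):
    """Toy in-order model of an instruction chain (not AIDA64's method).

    Instruction i reads and writes register i % num_regs, so num_regs=1
    gives a fully dependent chain and a larger num_regs an independent one.
    """
    ready = [0] * num_regs   # cycle at which each register's new value is available
    issued = {}              # cycle -> number of instructions started in that cycle
    cycle = 0                # start cycle of the previous instruction (in-order issue)
    finish = 0
    for i in range(n):
        reg = i % num_regs
        start = max(cycle, ready[reg])          # wait for the source operand
        while issued.get(start, 0) >= issue_width:
            start += 1                          # wait for a free issue slot
        issued[start] = issued.get(start, 0) + 1
        ready[reg] = start + latency
        finish = max(finish, ready[reg])
        cycle = start
    return finish / n


# Hypothetical CPU: 1-cycle ADD, up to 4 ADDs started per cycle.
print(cycles_per_instruction(6000, latency=1, issue_width=4, num_regs=1))  # latency chain
print(cycles_per_instruction(6000, latency=1, issue_width=4, num_regs=8))  # throughput chain
```

In this model the dependent chain costs 1 cycle per instruction and the independent chain 0.25, which mirrors how an L value and a T value for the same instruction can differ. Dividing by a long chain length is also what makes the result a sustained rate.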
```
 L: XOR rax, rbx        T: XOR rax, rbx
    XOR rax, rbx           XOR rbx, rcx
    XOR rax, rbx           XOR rcx, rdx
    ...                    ...
```
If the TP value is less than 1, more than one instruction of the same type can start in the same clock cycle.
With a memory operand, throughput can be higher than latency, because the throughput measurement uses more memory locations than the latency measurement.
FSQRT throughput can be higher than FSQRT latency on older processors because FSQRT is measured via
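As a rough illustration of reading TP values below 1, the reciprocal gives the approximate number of instruction starts per clock cycle. The helper name `starts_per_cycle` and the sample TP values are hypothetical and not taken from any dump.

```python
def starts_per_cycle(tp):
    """Convert a TP value (cycles between starts) into instruction starts per clock cycle."""
    if tp <= 0:
        raise ValueError("TP must be positive")
    return 1.0 / tp


# Sample values only; measured TP values are rounded, so the result is approximate.
for tp in (1.0, 0.5, 0.33, 0.25):
    print(f"TP {tp:.2f} -> about {starts_per_cycle(tp):.1f} starts per cycle")
```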
```
 L: FSQRT               T: FSQRT 
    FSQRT                  FDECSTP
    FSQRT                  FSQRT
    FSQRT                  FDECSTP
     ...                   ...
```
chains, and older processors cannot execute FSQRT and FDECSTP in parallel. Update: FDECSTP has been changed to FXCH.

The (I)DIV latency on modern processors depends on the operand size. (I)DIV always uses the rDX:rAX registers for the dividend, the quotient and the remainder, and only for some operand sizes is it possible for the stored quotient and remainder to equal the dividend (e.g. if AX = 0xFEFF, AX is still 0xFEFF after a DIV AL), so rDX/rAX must be refreshed between divisions. So "DIV r8 12/ 8b ax upd" means

```
 L: DIV al              T: DIV bl
    MOV ax, const          MOV ax, const
    DIV al                 DIV cl
    MOV ax, const          MOV ax, const
    ...                    ...
```

chains. Similarly "DIV r32 2^62/2^31 eax/edx" means

```
 L: DIV eax            T: DIV ebx
    MOV eax, const1       MOV eax, const1
    MOV edx, const2       MOV edx, const2
    DIV eax               DIV ecx
    MOV eax, const1       MOV eax, const1
    MOV edx, const2       MOV edx, const2
    ...                   ...
```
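
The AX = 0xFEFF case mentioned for DIV AL can be checked with a small model of the unsigned 8-bit DIV (AX divided by an 8-bit divisor, quotient to AL, remainder to AH). This is only a sketch: the function name `div_r8` is hypothetical, and the second input (100 divided by 7) is an arbitrary value chosen to show the usual case where AX changes.

```python
def div_r8(ax, divisor):
    """Model unsigned 8-bit DIV: AX / divisor -> AL = quotient, AH = remainder."""
    ax &= 0xFFFF
    divisor &= 0xFF
    if divisor == 0:
        raise ZeroDivisionError("#DE: divide by zero")
    quotient, remainder = divmod(ax, divisor)
    if quotient > 0xFF:
        raise OverflowError("#DE: quotient does not fit in AL")
    return (remainder << 8) | quotient  # new AX = AH:AL


ax = 0xFEFF
new_ax = div_r8(ax, ax & 0xFF)   # DIV AL: the divisor is AL = 0xFF
print(hex(new_ax))               # special case: AX keeps its value

print(hex(div_r8(100, 7)))       # usual case: AX is overwritten by the result
```

In the usual case the next DIV in the chain would see a different dividend, which is why the chains reload ax (or eax and edx) with constants between divisions.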

For some x87 instruction combinations (and for some SSE instructions in 32-bit mode), the 8 available registers are not enough to measure the instruction throughput.
It is a measurement, not a constant table, so some values are rounded.
Keep in mind that even though instruction latency and throughput are important, they may not directly reflect CPU performance!