• These lists were created by the AIDA64 Instruction Latency Dump feature. If you do not believe in software measurements, wait for the official Intel/AMD/etc. guide and hope it will be more detailed and accurate than the current one. ;) You can create such a dump in AIDA64 by right-clicking on the bottom status bar of the AIDA64 main window -> CPU Debug -> Instruction Latency Dump. It works fully in the trial version, too.
  • In this dump, latency means the time until the next dependent instruction of the same type can start. Throughput means the time until the next independent instruction of the same type can start:
      L: ADD rax, rax       T: ADD rax, rax
         ADD rax, rax          ADD rbx, rbx
         ADD rax, rax          ADD rcx, rcx
         ...                   ...
    
  • These values are measured with long chains of instructions (~6000), so they are sustained rates; peak values can be higher.
  • Some instructions do not modify the target register, e.g. CMP, TEST, BT, NOP. For these, it is not always possible to measure the instruction latency directly.
  • Some instructions never depend on a previous one: they use different source and destination register sets or have a memory operand. Their latency cannot always be measured directly, but it is possible to measure instruction pairs, e.g. PUSH + POP, or MOV reg, [mem] + MOV [mem], reg. The abbreviation "LS pair" means a pair formed from the load form and the store form of a move instruction.
  • Newer processors can recognize that some instructions with the same operand are independent of previous ones. In this case the latency can be lower than 1. The classic example is the XOR instruction: the result of XOR eax, eax is always 0, so it never depends on the result of the previous XOR. XOR r32_1, r32_2 means
     L: XOR rax, rbx        T: XOR rax, rbx
        XOR rax, rbx           XOR rbx, rcx
        XOR rax, rbx           XOR rcx, rdx
        ...                    ...
    
  • If the TP (throughput) value is less than 1, more than one instruction of the same type can start in the same clock cycle.
  • With a memory operand, the throughput value can be higher than the latency value, because the throughput measurement uses more memory locations than the latency measurement.
  • FSQRT throughput can be higher than FSQRT latency on older processors because FSQRT is measured via
     L: FSQRT               T: FSQRT 
        FSQRT                  FDECSTP
        FSQRT                  FSQRT
        FSQRT                  FDECSTP
         ...                   ...
    
    chains, and older processors cannot execute FSQRT and FDECSTP in parallel. Update: FDECSTP has since been replaced by FXCH.
  • The (I)DIV latency on modern processors depends on the operand size. (I)DIV always uses the rDX:rAX registers for the dividend, the quotient and the remainder, and only in some cases do the quotient and remainder written back reproduce the dividend (e.g. if AX = 0xFEFF, then after DIV AL, AX is still 0xFEFF). In all other cases rDX/rAX must be refreshed between divisions. So "DIV r8 12/ 8b ax upd" means
     L: DIV al              T: DIV bl
        MOV ax, const          MOV ax, const
        DIV al                 DIV cl
        MOV ax, const          MOV ax, const
        ...                    ...
    
    chains. Similarly "DIV r32 2^62/2^31 eax/edx" means
     L: DIV eax            T: DIV ebx
        MOV eax, const1       MOV eax, const1
        MOV edx, const2       MOV edx, const2
        DIV eax               DIV ecx
        MOV eax, const1       MOV eax, const1
        MOV edx, const2       MOV edx, const2
        ...                   ...
    
  • For some x87 instruction combinations (and for some SSE instructions in 32-bit mode) the 8 available registers are not enough to measure the instruction throughput.
  • These are measurements, not a table of constants, so some values are rounded.
  • Keep in mind that even though instruction latency and throughput are important, they may not directly reflect CPU performance!
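
The DIV note above can be checked with a small sketch. The Python function below is a hypothetical helper (div_r8 is not part of AIDA64 or of the dump) that emulates the unsigned 8-bit DIV, which divides AX by the operand and writes the quotient to AL and the remainder to AH. Assuming that behaviour, it shows that AX = 0xFEFF survives DIV AL unchanged, while another starting value (0x0064, chosen here only for illustration) does not, which is presumably why the measured chains reload AX with MOV.

```python
def div_r8(ax, divisor):
    """Emulate unsigned x86 DIV r/m8 (hypothetical helper for illustration).

    AX is divided by the 8-bit divisor; the quotient goes to AL and the
    remainder to AH. Returns the new 16-bit value of AX.
    """
    ax &= 0xFFFF
    divisor &= 0xFF
    if divisor == 0:
        raise ZeroDivisionError("#DE: divide by zero")
    quotient, remainder = divmod(ax, divisor)
    if quotient > 0xFF:
        raise OverflowError("#DE: quotient does not fit in AL")
    return (remainder << 8) | quotient  # AH = remainder, AL = quotient


# DIV AL with AX = 0xFEFF: the divisor is AL = 0xFF.
ax = 0xFEFF
print(hex(div_r8(ax, ax & 0xFF)))  # 0xfeff, AX is unchanged

# DIV AL with AX = 0x0064: the divisor is AL = 0x64.
ax = 0x0064
print(hex(div_r8(ax, ax & 0xFF)))  # 0x1, AX changed, so it must be reloaded
```

In the first case the result written back to AX happens to equal the dividend, so a chain of DIV AL instructions could repeat without a refresh; in the second case the dividend is overwritten, so a MOV ax, const is needed before the next division sees the same input.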