Emulation is only an implementation of the ISA. The x86 *ISA* is hardly "as grea...

rayiner · on June 10, 2017

x86-64 is a perfectly acceptable ISA. Strong memory ordering, no architectural optimizations leaking out like branch delay slots or stack windows. Pretty good i-cache efficiency through the use of two-address code and memory operands. Of course, Intel didn't have much to do with it. Most of it is either an accident of history or the work of AMD, who did a lot of work regularizing the ISA in the 64-bit transition.

pcwalton · on June 10, 2017

1. Yeah, ARM has one nasty architectural optimization leaking out: the program counter register being 8 bytes ahead of where it should be due to pipelining. Thankfully that got fixed up in AArch64, and if 32-bit mode gets dropped down the line (which is allowed by the architecture) it'll be a thing of the past. x86 has some architectural leaks too, though: the aliasing of the MMX and FP stacks as a hack for compatibility with early versions of Windows comes to mind. This one hasn't been fixed.

2. The REX prefixes are a nightmare: most instructions have one and this tremendously bloats up the instruction stream size. For this reason, the i-cache efficiency is not good compared to actual compressed instruction sets such as Thumb-2 (not that Thumb-2 is wonderful either). Note that if you do extreme hand-optimization of binary size, you can get x86-64 down pretty far, but so few people do that that it doesn't matter in practice.

3. Two address code isn't necessarily a win, especially since it doubles the number of REX prefixes. In AArch64 "and x9,x10,x11" is 4 bytes; in x86-64 "mov r9,r10; and r9,r11" is 6 bytes (and clobbers the condition codes). There's a reason compilers love to emit the three-address LEA...

4. Memory operands are nice, though I think the squeeze on instruction space makes them not worth it in practice. I'd rather use that opcode space for more registers.

5. Immediate encoding on x86-64 is crazy inefficient. "mov rax,1" is a whopping 7 bytes.
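Points 2 and 3 are both about encoding length, so here is a minimal Python sketch that assembles the point 3 sequence by hand. It assumes the standard "op r/m64, r64" forms (MOV = 89 /r, AND = 21 /r) and the REX prefix layout 0100WRXB; the helper names are made up for illustration.

```python
# Sketch: hand-encode 64-bit register-to-register ops on x86-64 to count bytes.
# Helper names are hypothetical; opcodes are the standard MR ("op r/m64, r64") forms.

MOV_RM64_R64 = 0x89      # MOV r/m64, r64
AND_RM64_R64 = 0x21      # AND r/m64, r64
AARCH64_INSN_BYTES = 4   # every AArch64 instruction is 32 bits wide


def rex(w: int, r: int, x: int, b: int) -> int:
    """REX prefix byte: 0100WRXB."""
    return 0x40 | (w << 3) | (r << 2) | (x << 1) | b


def modrm_reg_reg(reg: int, rm: int) -> int:
    """ModRM with mod=11 (register direct); only the low 3 bits of each register fit."""
    return 0xC0 | ((reg & 7) << 3) | (rm & 7)


def encode_op_rr64(opcode: int, dst: int, src: int) -> bytes:
    """Encode 'op dst, src' on 64-bit registers (0=rax ... 15=r15).

    ModRM.rm holds dst and ModRM.reg holds src; bit 3 of each register number
    goes into REX.B / REX.R, and REX.W selects the 64-bit operand size.
    """
    prefix = rex(1, src >> 3, 0, dst >> 3)
    return bytes([prefix, opcode, modrm_reg_reg(src, dst)])


R9, R10, R11 = 9, 10, 11
mov = encode_op_rr64(MOV_RM64_R64, R9, R10)    # mov r9, r10
and_ = encode_op_rr64(AND_RM64_R64, R9, R11)   # and r9, r11
print(mov.hex(" "), "|", and_.hex(" "))        # 4d 89 d1 | 4d 21 d9
print(len(mov) + len(and_), "bytes on x86-64 vs", AARCH64_INSN_BYTES, "on AArch64")
# 6 bytes on x86-64 vs 4 on AArch64
```

With 32-bit operands on the legacy registers (eax through edi) the REX byte disappears; a 64-bit operand size or any register from r8 up brings it back.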

gsg · on June 11, 2017

Immediates are a bit of a mix. mov doesn't have very nice encodings, but many instructions do: push $1 is 2 bytes, addl $2, %eax is only 3 bytes.

There's no question that x86-64 could be improved on in terms of code density.
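A sketch of the point about immediates: several x86 instructions have a short form with a sign-extended 8-bit immediate, and an assembler can pick it whenever the value fits. The helper names below are hypothetical; the opcodes (6A/68 for PUSH, 83 /0 and 81 /0 for ADD) are the standard Intel encodings.

```python
# Sketch: choose the sign-extended imm8 form when the immediate fits in a byte.
#   PUSH imm8 = 6A ib        PUSH imm32 = 68 id
#   ADD r/m32, imm8 = 83 /0 ib   ADD r/m32, imm32 = 81 /0 id
# (For eax specifically there is also a 5-byte ADD EAX, imm32 = 05 id; ignored here.)

def fits_imm8(value: int) -> bool:
    return -128 <= value <= 127


def push_imm(value: int) -> bytes:
    if fits_imm8(value):
        return bytes([0x6A]) + value.to_bytes(1, "little", signed=True)
    return bytes([0x68]) + value.to_bytes(4, "little", signed=True)


def add_r32_imm(reg: int, value: int) -> bytes:
    """add <32-bit reg>, imm   (reg 0=eax ... 15=r15d)."""
    prefix = bytes([0x41]) if reg >= 8 else b""   # REX.B for r8d..r15d
    modrm = 0xC0 | (0 << 3) | (reg & 7)            # mod=11, /0 selects ADD
    if fits_imm8(value):
        return prefix + bytes([0x83, modrm]) + value.to_bytes(1, "little", signed=True)
    return prefix + bytes([0x81, modrm]) + value.to_bytes(4, "little", signed=True)


print(push_imm(1).hex(" "))            # 6a 01
print(add_r32_imm(0, 2).hex(" "))      # 83 c0 02
print(add_r32_imm(0, 1000).hex(" "))   # 81 c0 e8 03 00 00
```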

kccqzy · on June 10, 2017

Regarding 5, no, it's five bytes (b8 01 00 00 00) for movl $1,%eax. If you actually have a 64-bit immediate, just the immediate itself would be 8 bytes, and the actual instruction is 10 bytes.

besselheim · on June 11, 2017

MOV RAX, 1 is seven bytes: 48 C7 C0 01 00 00 00

pcwalton · on June 11, 2017

The parent has a good point actually, because "mov eax,1" automatically zero-extends in 64-bit mode.

It's still one byte longer than the equivalent AArch64 instruction, though.
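The byte counts in this subthread can be checked with a small encoder. This sketch, with hypothetical helper names, picks the shortest of the three MOV forms discussed (zero-extending `mov r32, imm32`, sign-extending `REX.W C7 /0`, and the 10-byte `movabs`) and deliberately ignores other idioms such as `xor` or `lea`.

```python
# Sketch: shortest MOV-only way to load a constant into a 64-bit register.
#   mov r32, imm32     B8+rd id        (writing a 32-bit register zero-extends to 64 bits)
#   mov r64, simm32    REX.W C7 /0 id  (immediate sign-extended to 64 bits)
#   movabs r64, imm64  REX.W B8+rd io
# Register numbers: 0=rax ... 15=r15. Helper names are hypothetical.

def mov_r32_imm32(reg: int, value: int) -> bytes:
    prefix = bytes([0x41]) if reg >= 8 else b""          # REX.B for r8d..r15d
    return prefix + bytes([0xB8 + (reg & 7)]) + value.to_bytes(4, "little")


def mov_r64_simm32(reg: int, value: int) -> bytes:
    rex = 0x48 | (reg >> 3)                              # REX.W, plus REX.B for r8..r15
    modrm = 0xC0 | (reg & 7)                             # mod=11, /0
    return bytes([rex, 0xC7, modrm]) + value.to_bytes(4, "little", signed=True)


def movabs_r64_imm64(reg: int, value: int) -> bytes:
    rex = 0x48 | (reg >> 3)
    return bytes([rex, 0xB8 + (reg & 7)]) + (value & (2**64 - 1)).to_bytes(8, "little")


def shortest_mov(reg: int, value: int) -> bytes:
    """value may be given as a signed or unsigned 64-bit constant."""
    if 0 <= value < 2**32:
        return mov_r32_imm32(reg, value)      # zero-extension fills the upper half
    if -2**31 <= value < 2**31:
        return mov_r64_simm32(reg, value)     # small negative values
    return movabs_r64_imm64(reg, value)


print(mov_r64_simm32(0, 1).hex(" "))   # 48 c7 c0 01 00 00 00
print(shortest_mov(0, 1).hex(" "))     # b8 01 00 00 00
print(len(shortest_mov(0, 2**40)))     # 10
```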

kccqzy · on June 11, 2017

Fine. If you are really into getting the shortest instruction, try "xorl %eax,%eax" then "incl %eax" which is four bytes (31 c0 ff c0).

gsg · on June 11, 2017

push $1/pop %eax (6a 01 58) is shorter, but perhaps not the best idea.

nayuki · on June 11, 2017

Actually, there is no 32-bit pop instruction in x86-64 mode. Your code won't work.

gsg · on June 11, 2017

You're right: it should be pushq $1/pop %rax (which is also three bytes, although there will be a prefix byte for registers r8 through r15).
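For comparison, a sketch that assembles both tricks from this subthread byte by byte. Helper names are hypothetical and only the registers needed here are handled; the opcodes are the standard ones (XOR r/m32, r32 = 31 /r; INC r/m32 = FF /0; PUSH imm8 = 6A ib; POP r64 = 58+rd).

```python
# Sketch: two short ways to put 1 in rax, assembled by hand.

def xor_r32_self(reg: int) -> bytes:
    """xor r32, r32 for reg 0..7 only (keeps REX out of it)."""
    return bytes([0x31, 0xC0 | (reg << 3) | reg])


def inc_r32(reg: int) -> bytes:
    """inc r32 via FF /0; the one-byte 40+rd forms are REX prefixes in 64-bit mode."""
    return bytes([0xFF, 0xC0 | reg])


def push_imm8(value: int) -> bytes:
    return bytes([0x6A]) + value.to_bytes(1, "little", signed=True)


def pop_r64(reg: int) -> bytes:
    """64-bit mode has no 32-bit POP; r8..r15 need a REX.B prefix."""
    prefix = bytes([0x41]) if reg >= 8 else b""
    return prefix + bytes([0x58 + (reg & 7)])


xor_inc = xor_r32_self(0) + inc_r32(0)   # xor eax,eax ; inc eax
push_pop = push_imm8(1) + pop_r64(0)     # push 1 ; pop rax
print(xor_inc.hex(" "), len(xor_inc))    # 31 c0 ff c0 4
print(push_pop.hex(" "), len(push_pop))  # 6a 01 58 3
print(pop_r64(8).hex(" "))               # 41 58
```

The push/pop version wins on size but goes through memory, which is presumably why it was flagged as "perhaps not the best idea".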

bogomipz · on June 10, 2017

Can you elaborate on what is meant by "optimizations leaking out"? I am not familiar with this term. Thanks.

rayiner · on June 10, 2017

Like, implementation details that leak out into the architecture for optimization reasons. Classic example is branch delay slots: https://en.wikipedia.org/wiki/Delay_slot.
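To make the delay-slot example concrete, here is a toy interpreter for a made-up two-instruction ISA (`addi` and `j` are hypothetical here, not any real encoding). With `delay_slot=True`, the instruction right after a jump still executes before control reaches the target, which is the pipeline detail that classic MIPS exposed in the architecture.

```python
# Toy interpreter: a single accumulator r1, "addi n" adds n, "j t" jumps to index t.

def run(program, delay_slot: bool) -> int:
    r1 = 0
    pc = 0
    pending_target = None          # jump target waiting for its delay slot to run
    while 0 <= pc < len(program):
        op, arg = program[pc]
        redirect = pending_target  # set if the previous instruction was a delayed jump
        pending_target = None
        next_pc = pc + 1
        if op == "addi":
            r1 += arg
        elif op == "j":
            if delay_slot:
                pending_target = arg   # the next instruction is the delay slot
            else:
                next_pc = arg
        if redirect is not None:       # this instruction was the delay slot: jump now
            next_pc = redirect
        pc = next_pc
    return r1


program = [
    ("j", 3),       # 0: jump to 3
    ("addi", 1),    # 1: delay slot
    ("addi", 100),  # 2: always skipped
    ("addi", 10),   # 3: jump target
]
print(run(program, delay_slot=False))  # 10
print(run(program, delay_slot=True))   # 11
```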

bogomipz · on June 11, 2017

Thanks, this is very interesting.

cwyers · on June 10, 2017

Microsoft isn't emulating x86-64, so far this is 32-bit only.

bubblethink · on June 11, 2017

Almost everything that you described is microarchitectural, and not tied to the ISA.

chrisseaton · on June 12, 2017

> Almost everything that you described is microarchitectural, and not tied to the ISA.

They listed these features 'strong memory ordering', '(no) branch delay slots', '(no) stack windows', 'good i-cache efficiency through the use of two-address code and memory operands'.

Every single one of those is a property of the ISA - the instruction set, its semantics and encoding - not the implementation.

Which do you think isn't part of the ISA?

bubblethink · on June 14, 2017

Strong memory ordering is a contract. There is nothing intrinsic about the ISA, or its virtues, that dictates the ordering one way or the other. It is entirely guided by what the vendor wants to support. x86's ordering is similar to TSO in SPARC, which uses a RISC-like ISA. The ordering is described as a part of the ISA, but any ISA can implement a strong ordering (at the risk of performance losses) if they want to.

I-cache efficiency: again, implementation-specific. Efficiency is entirely a result of implementation, isn't it?

No branch delay slot: yes, this is a part of the ISA. My point though was that it is uncommon enough that I wouldn't call it a great virtue of x86 per se.
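What "ordering as a contract" means can be shown with the classic store-buffering litmus test. This is a toy model, not a full x86-TSO formalization: it enumerates every execution of two threads under sequential consistency (SC) and under a simplified TSO in which each thread's stores wait in a private FIFO buffer before reaching memory.

```python
# Store-buffering litmus test:
#   thread 0: x = 1; r0 = y        thread 1: y = 1; r1 = x

THREADS = (
    (("st", "x", 1), ("ld", "y", "r0")),
    (("st", "y", 1), ("ld", "x", "r1")),
)


def outcomes(tso: bool) -> set:
    results = set()

    def explore(pcs, mem, bufs, regs):
        if all(pc == len(t) for pc, t in zip(pcs, THREADS)) and not any(bufs):
            results.add((regs["r0"], regs["r1"]))
            return
        for tid, thread in enumerate(THREADS):
            if pcs[tid] < len(thread):
                op, var, arg = thread[pcs[tid]]
                new_pcs = pcs[:tid] + (pcs[tid] + 1,) + pcs[tid + 1:]
                if op == "st" and tso:
                    # The store enters this thread's buffer; memory is not updated yet.
                    new_bufs = bufs[:tid] + (bufs[tid] + ((var, arg),),) + bufs[tid + 1:]
                    explore(new_pcs, mem, new_bufs, regs)
                elif op == "st":
                    explore(new_pcs, {**mem, var: arg}, bufs, regs)
                else:
                    # A load sees this thread's youngest buffered store, else memory.
                    value = mem[var]
                    for bvar, bval in bufs[tid]:
                        if bvar == var:
                            value = bval
                    explore(new_pcs, mem, bufs, {**regs, arg: value})
            if bufs[tid]:
                # Drain this thread's oldest buffered store to memory.
                (bvar, bval), rest = bufs[tid][0], bufs[tid][1:]
                new_bufs = bufs[:tid] + (rest,) + bufs[tid + 1:]
                explore(pcs, {**mem, bvar: bval}, new_bufs, regs)

    explore((0, 0), {"x": 0, "y": 0}, ((), ()), {})
    return results


print(sorted(outcomes(tso=False)))  # [(0, 1), (1, 0), (1, 1)]
print(sorted(outcomes(tso=True)))   # [(0, 0), (0, 1), (1, 0), (1, 1)]
```

The extra `(0, 0)` outcome is the one store-to-load reordering TSO permits; on x86, putting an MFENCE between each store and the following load rules it out.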

rayiner · on June 12, 2017

Which thing is micro architectural?

youdontknowtho · on June 11, 2017

Intel chips have been RISC-like internally for years. They have an instruction decode stage that converts x86 instructions into an internal ISA that's more RISC-ish.

Ars Technica, as always, has the details of how that has evolved over the years. Can't remember when the article in question was written, though.
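As a rough illustration of that decode step, here is a toy "cracker" that splits an instruction with a memory operand into RISC-like load/operate/store steps. The micro-op names and the `tmp0` register are hypothetical; this is not Intel's real micro-op format, and real decoders can also fuse some of these steps back together.

```python
# Toy illustration of cracking a CISC instruction into RISC-like micro-ops.

def crack(mnemonic: str, dst: str, src: str) -> list:
    """Operands written as "[...]" are memory operands; anything else is a register."""
    if dst.startswith("["):        # read-modify-write: load, operate, store
        return [("load", "tmp0", dst), (mnemonic, "tmp0", src), ("store", dst, "tmp0")]
    if src.startswith("["):        # load-op
        return [("load", "tmp0", src), (mnemonic, dst, "tmp0")]
    return [(mnemonic, dst, src)]  # already register-to-register


for uop in crack("add", "[rbx+8]", "rax"):
    print(uop)
# ('load', 'tmp0', '[rbx+8]')
# ('add', 'tmp0', 'rax')
# ('store', '[rbx+8]', 'tmp0')
```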

yellowapple · on June 10, 2017

"x86-64 is as bloated as RISC architectures usually are without any of RISC's benefits."

Huh? I thought the whole point of a RISC ISA is to not be bloated.

pcwalton · on June 10, 2017

Bloated in terms of instruction encoding. All instructions on RISC architectures usually have a uniform size, as opposed to CISC architectures which are usually variable length. (Tons of exceptions exist in both directions of course.)

baobrien · on June 10, 2017

To add, in the case of RISC-V, the base integer ISA and most of the core extensions use a fixed-length 32-bit encoding (RV32/64 E/IMFAD). The basic encoding, however, allows for shorter and longer instructions in 16-bit increments. There is also the compressed ISA extension that encodes a subset of IMFAD into 16-bit instructions. The per-byte dynamic code size of the compressed extension ends up being on par with x86/x64 and Thumb2.
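The "16-bit increments" scheme means a decoder can tell an instruction's length from the low bits of its first 16-bit parcel. A sketch, assuming the base length-encoding patterns from the RISC-V unprivileged spec (helper names are hypothetical; 80-bit and longer encodings are not handled):

```python
# Low bits of the first parcel:
#   ...aa,      aa != 11   -> 16-bit (compressed)
#   ...bbb11,   bbb != 111 -> 32-bit
#   ...011111              -> 48-bit
#   ...0111111             -> 64-bit

def rv_insn_length(parcel: int):
    if (parcel & 0b11) != 0b11:
        return 2
    if ((parcel >> 2) & 0b111) != 0b111:
        return 4
    if (parcel & 0b111111) == 0b011111:
        return 6
    if (parcel & 0b1111111) == 0b0111111:
        return 8
    return None  # longer or reserved encodings: not handled in this sketch


def split_instructions(code: bytes) -> list:
    """Split little-endian RISC-V code into instruction-sized chunks."""
    out, i = [], 0
    while i + 2 <= len(code):
        length = rv_insn_length(int.from_bytes(code[i:i + 2], "little"))
        if length is None or i + length > len(code):
            break
        out.append(code[i:i + length])
        i += length
    return out


code = bytes.fromhex("0100" "13000000")  # c.nop, then addi x0, x0, 0
print([len(insn) for insn in split_instructions(code)])  # [2, 4]
```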

thesz · on June 11, 2017

They are necessarily bloated; they have to be, to make programs run faster.

For example, Alpha AXP, one of the least bloated ISAs, did not provide non-word-aligned loads and stores: it provided word-aligned loads and stores, plus a way to extract and/or combine bytes and subwords from/to a whole word. It still ended up with separate instructions for loading and storing every subword type, for the reason I stated above: to make programs run faster and to make them smaller.
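Here is a sketch of what byte access costs on a machine with only aligned 8-byte loads and stores, in the spirit of Alpha's load-quadword-plus-extract/insert sequences. Function names are hypothetical, and Alpha's real instructions (LDQ_U, EXTBL, INSBL, MSKBL and friends) are not modelled one-to-one.

```python
# Byte loads/stores emulated with aligned 8-byte (quadword) accesses only.

def load_quad(mem: bytearray, addr: int) -> int:
    assert addr % 8 == 0, "only aligned 8-byte loads exist in this model"
    return int.from_bytes(mem[addr:addr + 8], "little")


def store_quad(mem: bytearray, addr: int, value: int) -> None:
    assert addr % 8 == 0, "only aligned 8-byte stores exist in this model"
    mem[addr:addr + 8] = value.to_bytes(8, "little")


def load_byte(mem: bytearray, addr: int) -> int:
    quad = load_quad(mem, addr & ~7)          # aligned quadword containing the byte
    return (quad >> (8 * (addr & 7))) & 0xFF  # extract the byte lane


def store_byte(mem: bytearray, addr: int, value: int) -> None:
    base, shift = addr & ~7, 8 * (addr & 7)
    quad = load_quad(mem, base)
    quad &= ~(0xFF << shift)                  # clear the target lane
    quad |= (value & 0xFF) << shift           # insert the new byte
    store_quad(mem, base, quad)               # rewrite the whole quadword


mem = bytearray(range(16))
print(load_byte(mem, 10))      # 10
store_byte(mem, 3, 0xAA)
print(mem[:8].hex(" "))        # 00 01 02 aa 04 05 06 07
```

A single byte store becomes a load, a mask, an insert and a store, and the read-modify-write is not atomic against another writer touching a neighbouring byte in the same quadword.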

The same is true for every RISC ISA I studied.

For example, MIPS includes an instruction to store a floating-point number at the address reg1+reg2 (the indexed stores such as SWXC1/SDXC1). This can be split into two RISC instructions and fused at runtime in hardware, but still, here it is!

muricula · on June 10, 2017

The first RISC should have read CISC. It's a typo.