Bitset Decoding on Apple’s A12

An additional difficulty is that ARM processors do not have a fast 64-bit population-count instruction (to determine the number of 1-bits in a register). Similarly, ARM processors lack a “trailing zero” instruction: you have to reverse the bit order and use a “leading zero” instruction. If these numbers can be trusted, then the Apple A12 might be more efficient than an Intel Skylake processor (3.5 cycles vs. 3 cycles).
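To make the “reverse, then count leading zeros” trick concrete, here is a minimal portable sketch in C. The helper names (`reverse_bits64`, `leading_zeros64`, `trailing_zeros64`, `popcount64`) are hypothetical and chosen for illustration; the functions emulate in software what the hardware would do with dedicated instructions, so they show the logic rather than the performance.

```c
#include <stdint.h>

/* Reverse the bit order of a 64-bit word by swapping ever larger groups.
   This emulates in software what a hardware bit-reversal instruction does. */
static inline uint64_t reverse_bits64(uint64_t x) {
    x = ((x >> 1) & 0x5555555555555555ULL) | ((x & 0x5555555555555555ULL) << 1);
    x = ((x >> 2) & 0x3333333333333333ULL) | ((x & 0x3333333333333333ULL) << 2);
    x = ((x >> 4) & 0x0F0F0F0F0F0F0F0FULL) | ((x & 0x0F0F0F0F0F0F0F0FULL) << 4);
    x = ((x >> 8) & 0x00FF00FF00FF00FFULL) | ((x & 0x00FF00FF00FF00FFULL) << 8);
    x = ((x >> 16) & 0x0000FFFF0000FFFFULL) | ((x & 0x0000FFFF0000FFFFULL) << 16);
    return (x >> 32) | (x << 32);
}

/* Count the zero bits above the most significant 1-bit; 64 when x is zero. */
static inline int leading_zeros64(uint64_t x) {
    int n = 0;
    if (x == 0) return 64;
    while ((x & 0x8000000000000000ULL) == 0) {
        x <<= 1;
        n++;
    }
    return n;
}

/* Trailing zeros obtained as described in the text:
   reverse the bit order, then count leading zeros. */
static inline int trailing_zeros64(uint64_t x) {
    return leading_zeros64(reverse_bits64(x));
}

/* Population count: clear the lowest set bit until none remain. */
static inline int popcount64(uint64_t x) {
    int count = 0;
    while (x != 0) {
        x &= x - 1;
        count++;
    }
    return count;
}
```

For example, `trailing_zeros64(8)` reverses the word so that bit 3 becomes bit 60, leaving three leading zeros, which is the expected answer. In practice one would normally rely on compiler builtins such as `__builtin_ctzll` and `__builtin_popcountll` in GCC and Clang and let the compiler pick the instruction sequence.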

Source: lemire.me