While working on my block explorer I realized that my grasp on what it means for a system to be big or little endian was weak at best. I took the time to think about what exactly the difference between the two was. In the process I cleared up confusions in my overall understanding of encodings.
***
At a lower level, data structures in computer programs are all kept as series of bits. The states of these bits are stored on a piece of hardware such as a flash drive, an SSD, a mechanical hard drive, a stick of RAM, a CD, a cache, etc. Each bit has some physical location within its storage device. We can abstract this observation and give each bit-state location a unique numerical address. Typically, numerically adjacent addresses map to physically adjacent bits, but this need not be the case.
We are often curious to know the state of the bits on our physical devices. Our goal becomes to bring an invisible state of the world into our field of consciousness. To do this we need our machine to read the bits on its storage device and construct light on our monitor that forms characters we can perceive with our eyes. The characters we choose to represent the raw data are usually the same ones we use when we write numbers in hexadecimal: 0123456789ABCDEF. Since there are 16 characters in our symbol system, each character can represent an ordered list of 4 bits. To get a better understanding, we can work through an example.
Here is a mapping of addresses to bit states, with the address written in base 10 and bit state written as a 1 for on and a 0 for off.
0 -> 1
1 -> 0
2 -> 1
3 -> 0
4 -> 1
5 -> 1
6 -> 1
7 -> 0
8 -> 1
9 -> 1
10 -> 0
11 -> 1
12 -> 1
13 -> 1
14 -> 0
15 -> 1
Writing out the 1's and 0's with the lowest address on the left, the above becomes:
1010111011011101
If we want to represent the data above with the hexadecimal characters, we do the following. We write out a hexadecimal string such that if we were to convert each hexadecimal character into its 4-digit binary equivalent, we would have the bit state of each address, with the leftmost bit being the lowest address (0) and the rightmost bit being the highest address (15):
1010 | 1110 | 1101 | 1101 |
A | E | D | D |
AEDD
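To make the procedure concrete, here is a minimal sketch of the conversion in Python (assuming we already have the bit states collected into a string, lowest address first):

# Bit state at addresses 0..15, lowest address first.
bits = "1010111011011101"

hex_chars = ""
for i in range(0, len(bits), 4):
    nibble = bits[i:i + 4]  # each group of 4 bits maps to one character
    hex_chars += format(int(nibble, 2), "X")

print(hex_chars)  # AEDD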
It's important to understand that so far we are not using the hexadecimal characters nor their 4-digit binary equivalents to represent numbers. These characters are only being used to represent the state of the data as it is stored in some storage device.
However, we do often use the state of chunks of memory to represent numbers. Coming up with a standard for how bit states map to numbers allows one to do arithmetic operations. We could, in theory, come up with any arbitrary mapping. For example, one could make the 5th position in memory represent the most significant bit of a number, the 0th the 2nd most significant bit, the 9th the 3rd, etc. This would admittedly be a stupid and confusing scheme for representing numbers. In practice there are two prevailing conventions: big endian and little endian.
In a big endian[1] system, the bit state at the lowest address represents the most significant bit in a binary number and the state at the highest address represents the least significant bit. The physical state of the machine above in a big endian context represents the following number (I've written the same number 3 times in different bases):
1010111011011101 (base 2)
AEDD (base 16)
44,765 (base 10)
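This reading can be checked with Python's int.from_bytes; a small sketch:

# The two bytes of the machine state above, lowest address first.
state = bytes([0b10101110, 0b11011101])

print(state.hex().upper())           # AEDD
print(int.from_bytes(state, "big"))  # 44765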
In little endian, the least significant byte is placed at the lowest memory address. However, within that byte, the bit with the lowest address is the most significant.
0 -> 1 (9th most significant bit)
1 -> 0 (10th most significant bit)
2 -> 1 (11th most significant bit)
3 -> 0 (12th most significant bit)
4 -> 1 (13th most significant bit)
5 -> 1 (14th most significant bit)
6 -> 1 (15th most significant bit)
7 -> 0 (16th most significant bit)
8 -> 1 (1st most significant bit)
9 -> 1 (2nd most significant bit)
10 -> 0 (3rd most significant bit)
11 -> 1 (4th most significant bit)
12 -> 1 (5th most significant bit)
13 -> 1 (6th most significant bit)
14 -> 0 (7th most significant bit)
15 -> 1 (8th most significant bit)
In a little endian context, the same state above now represents a different number, again written 3 times in different bases:
1101110110101110 (base 2)
DDAE (base 16)
56,750 (base 10)
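The same check as before, now asking for the little endian reading:

# Same physical bytes; the byte at the lowest address (0xAE) is now
# the *least* significant.
state = bytes([0b10101110, 0b11011101])

print(int.from_bytes(state, "little"))  # 56750, i.e. 0xDDAE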
If you tell a computer "store 56,750 starting at address 0 for me", the state of the bits at the addresses in memory on a big endian machine will differ from that on a little endian machine. But if you give the more specific instruction "write byte 9F at address 0", then the resulting state will be the same on the two machines.
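Both situations can be simulated with Python's struct module, which lets one pick the byte order explicitly rather than using the host machine's:

import struct

# Storing the 16-bit number 56,750 gives different byte layouts...
print(struct.pack(">H", 56750).hex(" "))  # dd ae (big endian machine)
print(struct.pack("<H", 56750).hex(" "))  # ae dd (little endian machine)

# ...but a single byte has no byte order, so "write byte 9F at
# address 0" leaves both machines in the same state.
print(struct.pack("B", 0x9F).hex())       # 9f on either machine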
***
As jfw points out, some fields in trb have the endianness of their internal and external representations swapped. So when you extract the hex characters that represent the "state of the machine" that is storing the block data, you need to (sometimes) reverse the order of the bytes so that the resulting hexadecimal string represents the correct numerical value for the field you're observing.
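The flip itself is simple; a sketch (flip_bytes is a name invented here, and the input values are made up for illustration):

def flip_bytes(hex_string):
    # Reverse the byte order of an even-length hex string.
    return bytes.fromhex(hex_string)[::-1].hex()

print(flip_bytes("aedd"))      # ddae
print(flip_bytes("00ff1a2b"))  # 2b1aff00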
Creating/finding an exhaustive list of which fields in trb are represented in little endian and which are represented in big endian is on the todo list.
[1] Also known as Network Byte Order, since it is the standard way to organize bits to send across networks.
> However, within that byte, the bit with the lowest address is the most significant.
That's not quite it. What you seem to miss is that bits are not individually addressable in typical CPU instruction sets: you move data between registers and memory a word at a time or, at the smallest, a byte at a time. You can get at individual bits through shift/mask operations, but those are defined arithmetically, i.e. "shift right" always means shifting toward the least significant end, independent of how the bits are laid out physically.
The annoying appearance of mixed ordering comes into play with external representations with sub-byte-level symbols, such as a hex dump. With hex, it's a well entrenched convention to write the two half-byte digits ("nibbles") big-endian. Thus if you take a numeric 0x12345678, read it byte-at-a-time on an LE machine, and display each byte using the normal programming language hex facility, you get "78 56 34 12" rather than the 87654321 that the CPU and RAM could well be using internally.
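This example can be reproduced portably in Python, since struct accepts an explicit little endian layout regardless of the host machine:

import struct

# Pack 0x12345678 the way an LE machine stores it, then display each
# byte with the usual hex convention (nibbles written big endian).
le_bytes = struct.pack("<I", 0x12345678)
print(le_bytes.hex(" "))  # 78 56 34 12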
In TRB, from what I've seen so far, the extra byte reversal is done only for parsing/display of hashes, i.e. block and tx IDs, and not numeric fields. See also.
I see. I knew that bits are not individually accessible, but thought it was okay to omit that detail from my overview.
So the LE machine may or may not have the internal bits of a byte reversed. But regardless of how the bits are laid out physically, the external representation for a single byte is always displayed in big endian. With this understanding, I gather that how the bits are laid out on a per-byte or per-word level is black-box'd.
It's not that I'm quibbling over an omitted detail though; it's right in the quoted sentence: an assumption that bits have addresses. Certainly you could have a machine where they do, but if it worked as you described, it would be a mixed-order monster rather than LE.
The reason people tend to think LE is somehow intrinsically "reversed" is our human convention of writing numbers BE. Incidentally there'd be some advantages if we'd gone the other way: we wouldn't have to right-align numeric fields in tables for the place values to line up, and addition/subtraction/multiplication would proceed left-to-right the same as writing.
Yes, my incorrect assumption was that although the machine could only access bits one-word-at-a-time, those bits did have some notion of an address. And I also believed that perhaps some machines were this 'mixed order monster.'
From doing a bit more noobsearch I understand now that it's only the bytes that have addresses. (Arguably a machine could be designed to only give addresses to words.)
At the risk of further extending the tangential nitpicks:
Bits within flag registers are typically modified one at a time, by whichever operation has caused that bit to have its value set.