While working on my block explorer I realized that my grasp on what it means for a system to be big or little endian was weak at best. I took the time to think about what exactly was the difference between the two. In the process I cleared up confusions in my overall understanding of encodings.
***
At a lower level, data structures in computer programs are all kept as series of bits. The state of these bits are stored on a piece of hardware such as a flash drive, an ssd drive, a mechanical hard drive, a stick of RAM, a CD, a cache, etc. Each bit has some physical location within its storage device. We can abstract this observation and give each location of a bit state a unique numerical address. Ostensibly numerically adjacent addresses map to physically adjacent bits, but this need not be the case.
We are often curious to know the state of the bits on our physical devices. Our goal becomes to bring an invisible state of the world into our field of consciousness. To do this we need our machine to read the bits on its storage device and construct light on our monitor that form characters we can perceive with our eyes. The characters we choose to represent the raw data are usually the same ones we use when we write numbers in hexadecimal: 0123456789ABCDEF. Since there are 16 characters in our symbol system, each character can represent an ordered list of 4 bits. To get a better understanding, we can work through an example.
Here is a mapping of addresses to bit states, with the address written in base 10 and bit state written as a 1 for on and a 0 for off.
0 -> 1
1 -> 0
2 -> 1
3 -> 0
4 -> 1
5 -> 1
6 -> 1
7 -> 0
8 -> 1
9 -> 1
10 -> 0
11 -> 1
12 -> 1
13 -> 1
14 -> 0
15 -> 1
Writing out the 1's and 0's with the lowest address on the left, the above becomes:
1010111011011101
If we want to represent the data above with the hexadecimal characters we do the following. We write out a hexadecimal string such that if we were to convert each hexadecimal character into its 4 digit binary equivilant we would have the bit state of each address, with the leftmost bit being the lowest address (0) and the rightmost bit being the highest address (15)
1010 | 1110 | 1101 | 1101 |
A | E | D | D |
AEDD
It's important to understand that so far we are not using the hexadecimal characters nor their 4-digit binary equivilant to represent numbers. These characters are only being used to represent the state of the data as it is stored in some storage device.
However, we do often use the state of chunks of memory to represent numbers. Coming up for a standard for how bit states map to numbers allows one to do arithmetic operations. We could, in theory, come up with any arbitrary mapping. For example, one could make the 5th position in memory represent the most significant bit of a number, then the 0th the 2nd most significant bit, the 9th the 3rd, etc. This would admittedly be a stupid and confusing scheme for representing numbers. In practice there are two prevailing conventions - big endian and little endian.
In a big endian1 system, the bit state at the lowest address represents the most sigificant bit in a binary number and the state in the highest address represents the least significant bit. The physical state of the machine above in a big endian context represents the following number: (I've written the same number 3 times in different bases)
1010111011011101 (base 2)
AEDD (base 16)
44,765 (base 10)
In little endian, the least significant byte is placed at the lowest memory address. However, within that byte, the bit with the lowest address is the most significant.
0 -> 1 (9th most significant bit)
1 -> 0 (10th most significant bit)
2 -> 1 (11th most significant bit)
3 -> 0 (12th most significant bit)
4 -> 1 (13th most significant bit)
5 -> 1 (14th most significant bit)
6 -> 1 (15th most significant bit)
7 -> 0 (16th most significant bit)
8 -> 1 (1st most significant bit)
9 -> 1 (2nd most significant bit)
10 -> 0 (3rd most significant bit)
11 -> 1 (4th most significant bit)
12 -> 1 (5th most significant bit)
13 -> 1 (6th most significant bit)
14 -> 0 (7th most significant bit)
15 -> 1 (8th most significant bit)
When in the context of little endian, the same above state now represents a different number, again written 3 times in different bases:
1101110110101110 (base 2)
DDAE (base 16)
56,750 (base 10)
If you tell a computer "store 56,750 starting at address 0 for me", the state of the bits at the addresses in memory on a big/little endian machine will differ. But if you give the more specfic instruction "write byte 9F at address 0", then the resulting state will be the same on the two machines.
***
As jfw points out, some fields in trb have the endianness of their internal and external representations swapped. So when you extract the hex characters that represent the "state of the machine" that is storing the block data, you need to (sometimes) reverse the order of the bytes so that the resulting hexadecimal string represents the correct numerical value for the field you're observing.
Creating/finding an exhausitive list of which fields in trb are represented in little endian and which are represented in big endian is on the todo list.
- Also known as Network Byte Order, since it is the standard way to organize bits to send accross networks. [↩]