Improve Your Hex Life With This One Weird Trick

One of the best ways to learn about network protocols is to deeply analyze raw network packet hex dumps. Few people do this, which is a shame: there’s a lot to be said for going straight to the raw data and comparing it to the mental model you’re trying to build up.

When I’m trying to understand something, it always helps if my brain is generating questions like “Why is this done this way?” and “What tradeoffs were the designers considering?”. If the software I’m using to analyze some data format, especially a binary one, is too helpful, then my brain just switches into power-saving mode (the default state), which is great at daydreaming but bad at learning.

For example, given a tool like Wireshark, where all I have to do is click on some byte sequence and the UI says “That’s the header length in 4-byte words”, my brain just goes “Ah, I see. So that’s the header length in 4-byte words. Thank you oh wise shark of the wires! Say, what are the next bytes for? The ‘type of service’ you say? How wonderful. And the ones after those?”. Click, click, click. Study so easy … what are we doing for lunch? Sushi anyone?

Wireshark. Like being given a library when you asked for a book.

Some of the best learning comes from guided struggle. Too much struggle can be frustrating; too little and the knowledge won’t stick. For our networking classes, our compromise has been to select packets in Wireshark, right-click, choose “copy as hex”, and paste the result into a text editor.

After much referencing of various diagrams and specifications, we usually end up with a detailed analysis that looks something like this:

$ cat annotated_hexdump.txt
# link layer envelope - ethernet frame
c4 e9 84 87 60 28 # destination addr - router wifi card
a4 5e 60 df 2e 1b # source addr - local wifi card
08 00 # type/protocol - ipv4 = 0x0800
# network layer envelope - ip datagram
45 # first byte is two 4 bit numbers
# version number = 4
# length of header in 4 byte words = 5
00 # type of service
# first 6 bits are "differentiated services code point" = 0
# last 2 bits are "explicit congestion notification" = 0
00 40 # total length = 64
d0 03 # id
00 00 # fragment fields
# first bit unused
# second is "do not fragment" = 0
# third is "more fragments to follow" = 0
# and last 13 are the fragment number (offset) = 0
40 # time to live = 64
06 # protocol - tcp
2c ee # header checksum
c0 a8 00 65 # source addr = 192.168.0.101
c0 1e fc 9a # destination addr = 192.30.252.154
# transport layer envelope - tcp segment
e7 9f # source port = 59295
00 50 # destination port = 80
5e ab 22 65 # sequence number = 1588273765
00 00 00 00 # acknowledgement number = 0
b0 02 # first 4 bits are size of header in 4 byte words = 11
# next 6 bits are unused
# next 6 bits are the flags
# of which just the second last is set
# the second last bit is the "SYN" flag
ff ff # receive window size = 65535
58 23 # checksum
00 00 # urgent pointer (unused in practice)
# additional tcp options
02 04 05 b4 # 1st option
# set the maximum segment size to 1460
01 # no-op option (for alignment)
03 03 05 # 2nd option
# set the window scale shift count to 5 (window is multiplied by 2^5 = 32)
01 01 # no-op options (for alignment)
08 0a 3a 4d bd c5 00 00 00 00 # 3rd option
# local timestamp = 978173381
# remote timestamp = 0
04 02 # 4th option, enable selective acknowledgement
00 00 # two bytes of zero padding (for alignment)
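
A quick way to double-check the decimal values in comments like these is to let the shell do the conversion. Here’s a minimal sketch; the helper names hex_to_dec, hex_to_ip and nibbles are made up for this example and aren’t part of any tool mentioned here.

```shell
#!/bin/bash
# Hypothetical helpers for double-checking hand-decoded fields.

# Big-endian hex bytes (spaces allowed) to decimal.
hex_to_dec() {
  printf '%d\n' "0x$(printf '%s' "$1" | tr -d ' ')"
}

# Four space-separated hex bytes to a dotted-quad IPv4 address.
hex_to_ip() {
  # Word splitting on the unquoted $1 turns "c0 1e fc 9a" into four arguments.
  set -- $1
  printf '%d.%d.%d.%d\n' "0x$1" "0x$2" "0x$3" "0x$4"
}

# Split one hex byte into its high and low 4-bit halves.
nibbles() {
  echo "$(( 0x$1 >> 4 )) $(( 0x$1 & 0xf ))"
}
```

For instance, hex_to_dec "e7 9f" prints 59295 (the source port), hex_to_ip "c0 1e fc 9a" prints 192.30.252.154 (the destination address), and nibbles 45 prints 4 5 (the IP version and the header length in 4-byte words).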

Doing this in a few different classes got me thinking: what’s the easiest way to do the reverse? That is, given a text file of hex-encoded byte values and a bunch of comments, what program will generate the original byte stream?

Nothing came to mind, but it’s not a difficult program to write:

$ cat to_bytestream
#!/bin/bash
# remove comments
sed -e 's/#.*//' |
# remove whitespace
tr -d ' \n' |
# convert ascii hex chars to bytestream
xxd -r -p
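
Since the annotations are typed by hand, it can be worth sanity-checking the hex before converting it. The sketch below is one possible guard, not part of the script above: check_hex is a hypothetical name, and all it checks is that whatever is left after stripping comments and whitespace is an even number of hex digits.

```shell
#!/bin/bash
# check_hex (hypothetical helper): read an annotated hex dump on stdin and
# verify that only complete hex bytes remain once comments are removed.
check_hex() {
  # Same cleanup as to_bytestream: drop comments, then whitespace.
  digits=$(sed -e 's/#.*//' | tr -d ' \t\n')
  case $digits in
    *[!0-9a-fA-F]*)
      echo "non-hex character outside a comment" >&2
      return 1 ;;
  esac
  if [ $(( ${#digits} % 2 )) -ne 0 ]; then
    echo "odd number of hex digits: ${#digits}" >&2
    return 1
  fi
  echo "ok: $(( ${#digits} / 2 )) bytes"
}
```

Run as check_hex < annotated_hexdump.txt against the dump above, this should report 78 bytes: the 14-byte ethernet header plus the 64-byte IP total length.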

Let’s pass the output bytestream to hexdump to make sure it works:

$ < annotated_hexdump.txt to_bytestream | hexdump
0000000 c4 e9 84 87 60 28 a4 5e 60 df 2e 1b 08 00 45 00
0000010 00 40 d0 03 00 00 40 06 2c ee c0 a8 00 65 c0 1e
0000020 fc 9a e7 9f 00 50 5e ab 22 65 00 00 00 00 b0 02
0000030 ff ff 58 23 00 00 02 04 05 b4 01 03 03 05 01 01
0000040 08 0a 3a 4d bd c5 00 00 00 00 04 02 00 00
000004e

The nice thing about having this program is that we can “roundtrip” our analysis and confirm we didn’t accidentally remove (or duplicate) a byte sequence while moving things around and adding comments.

We just need to compare the two bytestreams:

$ < annotated_hexdump.txt to_bytestream | cmp - original_bytestream

(If the shell syntax has you scratching your head, try the excellent explainshell.com.)
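
The check above presumes a binary copy of the original bytes is already on disk. If all that’s at hand is the untouched hex pasted from Wireshark, a text-level comparison gives a similar guarantee. This is a rough sketch of that different technique, not the byte-level check itself: normalize_hex and same_hex are hypothetical names, and it assumes the paste contains nothing but hex digits and whitespace.

```shell
#!/bin/bash
# normalize_hex: strip comments and whitespace, then lowercase the digits.
normalize_hex() {
  sed -e 's/#.*//' | tr -d ' \t\r\n' | tr 'A-F' 'a-f'
}

# same_hex ANNOTATED PASTED: succeed only if both files hold the same digits.
same_hex() {
  [ "$(normalize_hex < "$1")" = "$(normalize_hex < "$2")" ]
}
```

With the raw paste saved under a name like pasted_hex.txt (also an assumption), same_hex annotated_hexdump.txt pasted_hex.txt && echo all good plays the same role as the cmp pipeline.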

Combine this roundtrip check command with a file-watching tool like wach, and we have a tight feedback loop for doing “by hand” analysis of binary data, while confirming that we’re not accidentally making comments about byte sequences that don’t even exist in the original source data.

$ wach '
< annotated_hexdump.txt to_bytestream |
cmp - original_bytestream && { clear; echo all good at; date; }'

Do this kind of analysis with enough binary formats, and your perception of tools like Wireshark will shift from intimidating “Oracle of Mysteries” status to the more approachable “Instrument of Insight” level.