Google CTF: David and the Tree // room2042

We’re given a ZIP file challenge.zip containing the first 26 chapters of “Gadsby: A Story of Over 50,000 Words Without Using the Letter “E”” by Ernest Vincent Wright.

The text of the book in the ZIP file is entirely in uppercase, but it is identical to the text on Project Gutenberg if converted to uppercase. This led us to believe that the challenge is about the ZIP file itself, not the contents. We examined the file using Kaitai Struct’s WebIDE and tools like zipinfo. After some time we realized that we got two hints from the challenge’s name and description:

“what’s more important is not what you have, but what you’re missing”: we’re looking for something missing, the ZIP’ed book omits the letter ‘E’.
“David and the Tree”: the ZIP file is compressed using the DEFLATE algorithm which uses (David) Huffman trees to compress the data.

This led us to believe that the letter E, together with its byte code representing its compressed version, might be present in the Huffman tree for each of the deflated files. Each byte code for the compressed letter E could then encode a character of the flag. Note that normally there would be no reason to include the letter E in the Huffman tree as the uncompressed text does not contain the letter E.

After some really ugly hacking, we indeed got the flag this way: CTF{!-!OLE-E-COM!7RESSION}. In case you’re interested in the quick and dirty way we solved this challenge: we wrote a simple script that gets the byte code for the letter E from the Huffman tree in two steps. First, we extract the DEFLATE bytes from the ZIP file using Kaitai Struct’s ZIP parser. Next, we parse the DEFLATE bytes using a deflatedecompress.py script found on Github to get to its Huffman tree. We hacked our way into the deflatedecompress.py script to access the Huffman tree by adding a global variable hacky = litlencode right after the line litlencode = CanonicalCode(codelens[ : numlitlencodes]).

While the byte codes associated with the letter E are 9 bits long and do not directly encode ASCII characters, we knew we were on the right track as the letter E should normally not have appeared in the Huffman tree. It turns out that we have to drop the most significant bit and reverse order of the remaining bits to get to the character of the flag.

#!/usr/bin/env python3
import io

import deflatedecompress  # modified from https://github.com/nayuki/Simple-DEFLATE-decompressor/blob/master/python/deflatedecompress.py
import zip  # from https://formats.kaitai.io/zip/python.html


parsed_zip = zip.Zip.from_file('challenge.zip')

flag = []
for section_number, section in enumerate(parsed_zip.sections):
    if isinstance(section.body, zip.Zip.LocalFile):
        deflated_file = section.body.body
    else:
        continue

    deflated_buf = deflatedecompress.BitInputStream(io.BytesIO(deflated_file))
    inflated_buf = deflatedecompress.Decompressor.decompress_to_bytes(deflated_buf)

    table = deflatedecompress.hacky._code_bits_to_symbol

    for k, v in table.items():
        if v == ord('E'):
            k = k & 0xFF

            # reverse bit order
            rk = 0
            for _ in range(8):
                rk = rk << 1
                if k & 1 == 1:
                    rk = rk | 1
                k = k >> 1

            flag.append(rk)

print(bytes(flag).decode())