Recognizing ZLIB Compression

Published: 2019-07-29. Last Updated: 2019-07-29 15:30:24 UTC
by Didier Stevens (Version: 1)

In diary entry "Analyzing Compressed PowerShell Scripts" and video "Video: Analyzing Compressed PowerShell Scripts" I show how to decompress ZLIB compressed data.

Let me share some more info on ZLIB compressed data. Compressing data with ZLIB is called deflating, and the algorithm is called DEFLATE.
When I compress the text "Hello, Hello, Hello, Hello" with Python's ZLIB module, I obtain the following binary data (represented in hexadecimal): 789cf348cdc9c9d751f0c0a400745608b5.
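
As a minimal sketch of that experiment (assuming Python 3 and its standard zlib module; the variable names are only illustrative), the hexadecimal string above can be inflated back to the original text. The exact bytes produced by compression may vary with the zlib build and compression level, so the sketch prints them rather than assuming a value.

```python
import zlib

text = b"Hello, Hello, Hello, Hello"

# The hexadecimal string quoted above, converted to bytes.
stream = bytes.fromhex("789cf348cdc9c9d751f0c0a400745608b5")

# Inflating the ZLIB stream gives back the original text.
recovered = zlib.decompress(stream)
print(recovered)

# Deflating the text produces a ZLIB stream. The exact bytes may differ
# between zlib builds and compression levels, so they are only printed here.
compressed = zlib.compress(text)
print(compressed.hex())
```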

This data is structured according to RFC 1950: the first byte (0x78 in this example) is known as CMF (Compression Method and Flags). This byte is very often equal to 0x78. The 4 least significant bits identify the compression method (8 is DEFLATE and 15 is reserved). The 4 most significant bits encode the size of the window when the compression method is 8. This value is often 7 (32K window size).
0x78 is the lowercase letter x in ASCII, so it is easy to recognize in an ASCII dump. So, if you encounter some high entropy data that starts with x (0x78), it might be ZLIB compressed data according to RFC 1950.
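
The CMF byte can be split into its two fields with a few bit operations. This is only a sketch: parse_cmf is a hypothetical helper name, and the window size formula (2 to the power of the 4-bit value plus 8) is taken from RFC 1950.

```python
def parse_cmf(cmf):
    """Split a CMF byte into compression method and window info (RFC 1950)."""
    cm = cmf & 0x0F   # 4 least significant bits: compression method
    cinfo = cmf >> 4  # 4 most significant bits: window size info
    # For method 8 (DEFLATE), the window size is 2 ** (CINFO + 8) bytes,
    # and RFC 1950 does not allow CINFO values above 7.
    window = 1 << (cinfo + 8) if cm == 8 and cinfo <= 7 else None
    return cm, cinfo, window

cm, cinfo, window = parse_cmf(0x78)
print(cm, cinfo, window)  # 8 7 32768
print(chr(0x78))          # x
```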

My tool translate.py can be used, with function ZlibD (ZLIB Decompression), to decompress this data.

There's a second header byte after CMF: FLG (flags). Depending on these flags, there might be some more header data (a preset dictionary identifier, when the FDICT flag is set), but usually it's the compressed data that follows directly. This data is compressed with the DEFLATE algorithm and is structured according to RFC 1951. translate.py can also decompress this data, using function ZlibRawD.
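
The same distinction can be sketched with Python's zlib module instead of translate.py (this is not how translate.py is implemented, just an illustration; the variable names are hypothetical). Per RFC 1950, CMF and FLG read together as a 16-bit big-endian number must be a multiple of 31, bit 5 of FLG is the FDICT flag, and the stream ends with a 4-byte Adler-32 checksum of the uncompressed data. Passing a negative window size tells zlib to expect raw DEFLATE data without the RFC 1950 header and trailer.

```python
import zlib

stream = bytes.fromhex("789cf348cdc9c9d751f0c0a400745608b5")
cmf, flg = stream[0], stream[1]

# RFC 1950: CMF * 256 + FLG must be a multiple of 31.
header_ok = (cmf * 256 + flg) % 31 == 0
# Bit 5 of FLG (FDICT) announces a 4-byte preset dictionary identifier.
fdict = bool(flg & 0x20)
# Bits 6-7 of FLG (FLEVEL) hint at the compression level that was used.
flevel = flg >> 6
print(header_ok, fdict, flevel)

# Without a preset dictionary, the raw DEFLATE data (RFC 1951) starts at
# offset 2 and ends before the 4-byte Adler-32 trailer.
raw = stream[2:-4]
text = zlib.decompress(raw, -zlib.MAX_WBITS)  # negative wbits: raw DEFLATE
print(text)

# The trailer is the Adler-32 checksum of the uncompressed data.
adler_ok = zlib.adler32(text) == int.from_bytes(stream[-4:], "big")
print(adler_ok)
```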