Using the MD5 collision attack on zip/gzip/bzip2 and Linux package formats

First, we will  have a look at the various zip formats. With the knowledge of computing MD5 colliding blocks for any initialization vector, the attacker could do similar tricks to these formats as well as described in the main paper (located at homepage of the project).

Various zip/gzip/bzip2 formats

Well, there is a catch: how to get the block inside uncompressed and how to cause CRC32 collision along with MD5 collision?

1. Finding the right spot to place the block

2. Creating concurrent MD5 and CRC32 collisions

The zip formats use CRC32 merely as an interity check against errors. We know that CRC32 itself is not cryptographically secure and it is fairly simple to find two binary strings having the same CRC32 checksum. However, we need to cause concurrent collision of CRC32 and MD5. This is not trivial, but not so hard either. Now the birthday paradox comes handy. There are 232 possible CRC32 values, so when we choose 216 colliding blocks with same MD5 hash, we have 50% chance of getting two blocks causing collision in MD5 as well as CRC32. We don't have to compute 216 MD5 colliding blocks, we need just 16 of them to use Joux's multicollision trick [1]. We use the following notation: MD5(H, M) means MD5 computed with hash context H as initialization vector.

At first, we find the right place in the zip/gzip/bzip2 file and compute the MD5 hash context H0 up to this position in the file we want to put colliding blocks into. Then, we compute first pair of colliding blocks, M(0, 0) and M(0, 1). We know that
MD5(H0,M(0,0)) = MD5(H0,M(0,1)) = H1 (1)

Next, we continue computing another 15 pairs of colliding blocks, always using the previous hash context as our initialization vector value:
Hi+1=MD5(Hi,M(i,0))=MD5(Hi,M(i,1)) (2)

Then, both messages (symbol || means concatenation)
M(0, 0) || M(0, 1) || ... || M(15, 0)
M(0, 1) || M(1, 1) || ... || M(15, 1)

have the same MD5 hash starting from context H0. Moreover, using (1) and (2) we can swap M(i, 0) for M(i, 1) in the messages, while messages will still have the same MD5 hash. That gives us 216 colliding blocks with same MD5 hash. That is enough for our birthday paradox attack.

3. Time estimate for finding concurrent MD5 and CRC32 collision

A simple algorithm using backtrack to test all the 216 blocks has time complexity of 216 tests and space complexity of 216 words to store corresponding block-CRC32 checksum table. Ideal structure would be some sort of hash table.
We need to compute the 16 colliding block pairs first, which would take approximately 16 hours on an IBM p690 [2]. Using the bogomips estimate for IBM p690 in [3] and the fact that a better PC with Intel/Athlon processor has about 6000 bogomips, we get approx. 16*20 = 320 hours of machine time on PC. Searching the blocks for CRC32 collision we need to process 16 blocks*128 bytes*16 384 messages = 32 MB through CRC32 algorithm. Using benchmark [4], it will take less than a second (time for additional operations is negligible).
That totals in 320 hours. The computation can be easily distributed among more machines, e.g. using only 16 PCs would yield the result in less than one day.
Note: in bzip2 and gzip formats, the differing bits from colliding blocks won't be extracted into files, so install script (Makefile, configure, etc.) would have to read the original tar.bz2 or tar.gz.

Linux package formats

Linux packages are very common in Linux software distribution (instead of self-extract executables). Some formats are rpm, deb, etc. (almost each distribution has own package format). The packages usually implement MD5 for integrity checking, with optional signatures.

RPM format

The problem is, MD5 checksums are computed for each stored file. That means that an attacker would have to compute such colliding pair of blocks, so that it will cause collision for the complete rpm package and the file stored inside as well. Such requirement means (using current knowledge of MD5 collisions), that the hash context must be identical from the point of view of whole package as well as the single file inside. Generally that's not true in a package.
However, similar trick used in the bzip2/gzip formats can be used here. We do not put it directly into a stored file, but into lead header instead. Lead header is not checked by MD5 internally [5]:

The signature section follows the lead in the RPM package file. It contains information that can be used to verify the integrity, and optionally, the authenticity of the majority of the package file. The signature is implemented as a header structure.

You probably noticed the word, "majority", above. The information in the signature header structure is based on the contents of the package file's header and archive only. The data in the lead and the signature header structure are not included when the signature information is created, nor are they part of any subsequent checks based on that information.

While that omission might seem to be a weakness in RPM's design, it really isn't. In the case of the lead, since it is used only for easy identification of package files, any changes made to that part of the file would, at worst, leave the file in such a state that RPM wouldn't recognize it as a valid package file. Likewise, any changes to the signature header structure would make it impossible to verify the file's integrity, since the signature information would have been changed from their original values.

Shortly, the RPM file is organized into special blocks, each block is preceded by a tag. A tag tells the type of the following block, its length and some other parameters. We are interested in the SIGTAG_MD5 tag, which describes a block where MD5 checksum of the archive is stored. This block of of the BIN (binary) type, with length of 16 bytes (obviously). But we could change the tag by changing the block's size so that we could put the colliding block there, after the original MD5 sum itself. We assume that the rpm installer only compares the first 16 bytes of the block to the checksum of the rest of archive. We haven't tried it out yet or studied closely the rpm installer's source code, so no guarantees. Other option might be insterting a new block into rpm containing our MD5 colliding block, or changing the SIGTAG_SIZE tag's type and size to store the colliding block after the size. To be sure, one would have to inspect rpm installer's source code, but we are quite convinced that there is a place to put the colliding block where it would not be checked.
Using attack like this, the install scriptlet located inside the rpm package would have to read the original rpm's header, but we don't think that would be a problem.

Other package formats

We've looked at the structure of deb package, which looks very similar to rpm package in spirit. We can assume that most other packages have similar formats that allow mentioned attacks.


[1] A. Joux. Multicollision on Iterated Hash Function. Advances in Cryptology, CRYPTO 2004, Lecture Notes in Computer Science 3152.
[2] X. Wang, D. Feng, X. Lai, H. Yu, "Collisions for Hash Functions MD4, MD5, HAVAL-128 and RIPEMD", rump session, CRYPTO 2004, Cryptology ePrint Archive, Report 2004/199,
[3] IBM p690 bogomips estimate in forum,
[4] CRC32 benchmark,