### How compressible is big data? An interesting theorem.

I am doing some research on compressing data available as tables (rows and columns, or cubes) more efficiently. This is the reverse of the usual data science approach: instead of receiving compressed data and applying statistical techniques to extract insights, here we start with uncompressed data, extract all possible insights, and eliminate everything but the insights in order to compress the data. In the process, I wondered whether one can design an algorithm that compresses any data set by at least one bit. Intuitively, the answer is clearly no; otherwise you could recursively compress any data set down to 0 bits. Any algorithm will compress some data sets and make other data sets bigger after compression. Data that looks random, that has no pattern, cannot be compressed. I have seen contests offering an award if you find a compression algorithm that defeats this principle, but it would be a waste of time to participate.
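The counting argument behind this can be checked directly (a quick sketch of mine, not from any particular source): there are more bitstrings of length L than there are bitstrings of all shorter lengths combined, so no lossless scheme can map every length-L input to a strictly shorter output.

```python
# Pigeonhole check: 2**L inputs of length L, but only 2**L - 1 possible
# outputs of length strictly less than L (lengths 0 through L-1 combined).
# So at least one length-L input cannot be shortened by any lossless code.
L = 10
inputs = 2 ** L
shorter_outputs = sum(2 ** j for j in range(L))   # = 2**L - 1
print(inputs, shorter_outputs)   # -> 1024 1023
```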

But what if you design an algorithm that leaves a data set unchanged whenever it cannot be compressed? Would you then be able, on average, to compress any data set? Note that if you assemble numbers together at random to create a data set, the resulting data set will be mostly random. In fact, the vast majority of all possible data sets are almost random and not compressible. Data sets resulting from experiments are usually not random, but they represent a tiny minority of all potential data sets. In practice, this tiny minority represents all the data sets that data scientists are confronted with.
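This contrast is easy to see with Python's standard zlib module (my own illustration; any general-purpose compressor would behave similarly): strongly patterned data shrinks dramatically, while random bytes do not compress at all.

```python
import os
import zlib

# Patterned data (like structured experimental measurements) compresses well;
# uniformly random bytes are, in practice, incompressible.
structured = bytes(i % 16 for i in range(10_000))   # period-16 repetition
random_data = os.urandom(10_000)                    # cryptographically random

for label, data in [("structured", structured), ("random", random_data)]:
    compressed = zlib.compress(data, 9)
    print(f"{label}: {len(data)} -> {len(compressed)} bytes")
```

On a typical run the structured input shrinks to a few dozen bytes, while the random input comes out slightly larger than it went in (zlib adds a small header/checksum overhead).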

It turns out that the answer is still no. Even if you leave incompressible data sets "as is" and compress only those that can be compressed, the average compression factor (of any data compression algorithm) will be negative. The explanation is as follows: you need to add one bit to every data set; this extra bit tells you whether the data set is compressed using your algorithm or left uncompressed. This extra bit makes the whole thing impossible. Interestingly, official patents have been granted claiming that all data can be compressed. These are snake oil (according to the author of the gzip compression tool); it is amazing that they were approved by the patent office.
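Here is a toy illustration of the flag-bit argument (my own example, not from the article): a "compressor" that shrinks exactly one input, the all-zero string, to 0 bits and leaves everything else unchanged. One extra flag bit per data set is still needed to record which branch was taken, and it wipes out the savings on average.

```python
L = 8                      # length of each data set, in bits
total_raw = (2 ** L) * L   # bits to store all 2**L data sets uncompressed

total_flagged = 0
for value in range(2 ** L):
    compressed_len = 0 if value == 0 else L   # only the all-zero string shrinks
    total_flagged += 1 + compressed_len       # +1 for the compressed/raw flag

print(total_raw, total_flagged)   # -> 2048 2296
```

The flagged scheme spends 2296 bits versus 2048 bits raw: the single 8-bit saving is swamped by the 256 flag bits.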

Anyway, here's the mathematical proof, in simple words.

**Theorem**

There is no algorithm that, on average, will successfully compress any data set, even if it leaves incompressible data sets uncompressed.

**Proof**

Let y be a multivariate vector with integer values, representing the compressed data. Say that y can take on m different values. Let x be the original data; for any x, y = f(x) represents the compressed data.

Let S be the set of possible original data sets, and say S has k distinct elements. How many of the x ∈ S can be compressed efficiently by f, that is, mapped to one of the m compressed values? Denote that number as n. Note that n depends on k and m. Now we need to prove that:

 n * (1 + log2 m) + (k - n) * (1 + log2 k) ≥ k log2 k

where:

• log2 m is the number of bits required to store a compressed data set
• log2 k is the number of bits required to store an uncompressed data set
• the number 1 corresponds to the extra bit needed to tell whether a data set is stored in compressed or uncompressed format
• k log2 k represents the number of bits required to store ALL k data sets, as is, without using any compression whatsoever
• n * (1 + log2 m) + (k - n) * (1 + log2 k) represents the number of bits required to store ALL k data sets, compressing a data set if and only if compression is efficient, and leaving it uncompressed otherwise
• n is the number of data sets (out of k) that can be compressed efficiently
• log2 is the logarithm in base 2
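To make the bookkeeping concrete, here is a small numeric check of the two sides of the inequality (the values of k, m, and n below are hypothetical sample values of mine, chosen for illustration):

```python
from math import log2

def flagged_bits(k, m, n):
    # bits to store all k data sets: n compressed, k - n left raw,
    # plus one flag bit for each of the k data sets
    return n * (1 + log2(m)) + (k - n) * (1 + log2(k))

def raw_bits(k):
    # bits to store all k data sets raw, with no flag bit
    return k * log2(k)

k, m, n = 1024, 64, 64
print(flagged_bits(k, m, n), raw_bits(k))   # -> 11008.0 10240.0
```

Even though 64 data sets shrink from 10 bits to 6 bits each, the 1024 flag bits cost more than the 256 bits saved, so the flagged total exceeds the raw total.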

The proof consists in showing that the left-hand side of this inequality is always at least as large as the right-hand side (k log2 k).

In practice, m < k; otherwise the result is obvious and meaningless (if m > k, it means that your compression algorithm ALWAYS increases the size of the initial data set, regardless of the data set). Also, since compression must be lossless (invertible), at most m distinct data sets can be mapped to the m compressed values. As a result, we have

 n ≤ m, and n ≤ k

Expanding the left-hand side and simplifying, the inequality to prove can be written as n * log2 (m / k) + k ≥ 0. And since m < k, log2 (k / m) is positive, so this is equivalent to

 n ≤ k / log2 (k / m).

This last inequality is always satisfied when m < k. Indeed, since n ≤ m, it suffices to check that m ≤ k / log2 (k / m), that is, log2 (k / m) ≤ k / m. This holds because log2 t ≤ t for every t ≥ 1, applied to t = k / m. The theorem is proved.
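For readers who prefer empirical evidence to algebra, the inequality can also be brute-forced over every admissible combination of k, m, and n up to a cap (a verification sketch of mine):

```python
from math import log2

# Brute-force check of the theorem's inequality for every n <= m < k up to a cap:
# n*(1 + log2 m) + (k - n)*(1 + log2 k) >= k*log2 k should never fail.
violations = 0
for k in range(2, 65):
    for m in range(1, k):
        for n in range(0, m + 1):
            left = n * (1 + log2(m)) + (k - n) * (1 + log2(k))
            if left < k * log2(k) - 1e-9:   # small tolerance for float rounding
                violations += 1
print(violations)   # -> 0
```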