I am doing some research to compress data available as tables (rows and columns, or cubes) more efficiently. This is the reverse data science approach: instead of receiving compressed data and applying statistical techniques to extract insights, here, we are looking at uncompressed data, extract all possible insights, and eliminate everything but the insights, to compress the data.
In this process, I was wondering if one can design an algorithm that can compress any data set, by at least one bit. Intuitively, the answer is clearly no, otherwise you could recursilvely compress any data set to 0 bit. Any algorithm will compress some data sets, and make some other data sets bigger after compression. Data that looks random, that has no pattern, can not be compressed. I have seen contests offering an award if you find a compression algorithm that defeats this principle, but it would be a waste of time participating.
But what if you design an algorithm that, when a data set can not be compressed, leaves the data set unchanged? Would you be able, on average, to compress any data set then? Note that if you assemble numbers together to create a data set, the resulting data set would be mostly random. In fact, the vast majority of all data sets, are almost random and not compressible. But data sets resulting from experiments are usually not random, but they represent a tiny minority of all potential data sets. In practice this tiny minority represents all data sets that data scientists are confronted to.
It turns out that the answer is no. Even if you leave uncompressible data sets "as is" and compress those that can be compressed, on average, the compression factor (of any data compression algorithm) will be negative. The explanation is as follows: you need to add 1 bit to any data set: this extra bit tells you whether the data set is compressed using your algorithm, or left uncompressed. This extra bit makes the whole thing impossible. Interestingly, there have been official patents claiming that all data can be compressed. These are snake oil (according to the founder of the GZIP compressing tool), it is amazing that they were approved by the patent office.
Anyway, here's the mathematical proof, in simple words.
Theorem
There is no algorithm that, on average, will successfully compress any data set, even it leaves uncompressible data sets uncompressed.
Proof
Let y be a multivariate vector with integer values, representing the compressed data. Let say that y can take on m different values. Let x be the original data, and for any x, y=f(y) represents the compressed data.
How many solutions can we have to the equation f(y) ∈ S, where S is a set that has k distinct elements? Let denote the number of solutions in question as n. In other words, how many different values can n take, if the uncompressed data can take on k potential values? Note that n depends on k and m. Now we need to prove that:
[1] n * (1 + log2 m) + (k -n ) * (1 + log2 k) ≥ k log2 k
where:
The proof consists in showing that the left hand side of the equation [1] is always larger than the right hand side (k log2 k)
In practice, m < k, otherwise the result is obvious and meaningless (if m > k, it means that your compression algorithm ALWAYS increases the size of the initial data set, regardless of the data set). As a result, we have
[2] n ≤ m, and n ≤ k
Equation [1] can be written as n * log2 (m / k) + k ≥ 0. And since m < k, we have
[3] n ≤ k / log2 (k / m).
Equation [3] is always verified when m < k and [2] is satisfied. Indeed k / log2 (k / m) is always minimum (for a given k) when m = 1, and since n ≤ k / log2 k, the theorem is proved.
Bulk email represents one of the largest portions of legitimate emailing (spam is not included in this category). Sending bulk email requires a lot of bandwidth, and technical expertize to obtain high delivery rates. Newsletter that you subscribe too are typically sent via newsletter management companies, such as Vertical Response, MailChimp, Constant Contact or iContact. It is also expensive, with $10,000 per year to manage a 100,000 mailing list (including mailing, unsubscribes, reporting, A/B testing, resolving issues with ISP's and blacklisting services such as Spamhaus, and so on).
What if Gmail, Yahoo mail and Hotmail (they account to more than 60% of email addresses targeted by bulk email) offered the following services to make bulk emailing less bandwidth-consuming, and easier to monitor. Any time you send a newsletter to more than (say) 50,000 Gmail recipients, here is how it works:
This achieves the following goal: Gmail actually distributes the message (not you), using Google servers that are close to their Gmail servers. There is also one fewer node between the sender (the mailing list management company) and the recipient, thus saving considerable bandwidth. In short, it benefits both the sender, Gmail and the recipient (the latter one benefits thanks to better monitoring capability by Gmail, to block a message when deemed spammy).
There is a problem: what if you send a customized email to 50,000 recipents? For instance, the message starts with "Hi [Your Name]". The workaround is simple: Gmail could accept a few macros in your message, such as [Your Name], and deliver the customized version to all 50,000. All is needed is a very rudimentary macro language. And of course, the mailing list uploaded on Google servers must contain the email address but also the first name., for this type of customized message.
Related articles
This is potentially one of the worst nightmares for security experts. This type of fraud has been observed in the context of click fraud, but the payload potential is far bigger if it ever gets implemented to steal bank account login/password.
About the scheme:
An infected user - his computer has been infected by a virus, and (say) Firefox is now corrupt on his computer - tries to logon to his bank account. He types the correct domain name (say www.key.com) on the URL box in Firefox, and the real key.com webpage in question shows up. But when the key.com page shows us on the browser, everything is legit except the key.com login box that was substituted, on the fly, by a script on your hijacked computer, planted by a Botnet client who wants to access your bank account to make wire transfers to his account.
Once you enter your loging/password in the box, your info gets transferred to the criminals. If the criminals are smart enough, you won't notice anything: atfer entering your credentials, maybe you get served a genuine key.com error page, but it's too late: criminals got your login/password and are now wiring all your money to external bank accounts.
A potential strategy, for criminals to make this system more effective, is to have the Botnet operator send millions of email messages to users known to be infected by its Botnet. The Botnet operator just have to send a message (that will look very legitimate), providing the real URL for you to sign up on your real key.com account, knowing that your browser is infected.
While I haven't seen any scheme like this so far (involving hijacking your bank account via browser sign-on Trojan through browser infection), I've seen the exact same scheme used in the context of click fraud, deployed by a company known as MediaForce.com, still operating as of today, substituting genuine banner ads by fake ones - to promote their porn and Viagra ads from their clients.
Here's a new idea for Google to make money and cover the costs of processing / filtering billions of messages per day.
This is a solution to eliminate spam as well, without too many false positives as currently.
Solution: Google to create its own newsletter management system!
Or at least, Google works with major providers (Vertical Response, Constant Contact, iContact, Mail Chimp etc.) to allow their clients (the companies sending billions of messages each day, such as LinkedIn) to pay a fee based on volume. The fee would help the sender to not end up in Gmail spam box, as long as it complies with Google policies. Even better: let Google offer this newsletter management service directly to clients who want to reach Gmail more effectively, under Google's controls and conditions.
I believe Google is now in position to offer this service, as more than 50% of new personal email accounts currently created are Gmail, and they last much longer than any corporate email accounts (you don't lose your Gmail account when you lose your job). Indeed, we would be one of the first clients to sign up with Gmail Contact (that's the name I have invented for the Google newsletter management service). Google could reasonably charge $100 per 20,000 messages sent to Gmail accounts: the potential revenue is huge.
If Google would offer this service internally (rather than through a 3rd party such as Constant Contact), they would make more money and have more control, and the task of eliminating spam would be easier and less costly.
Currently, since Google offers none of these services, we face the following issues:
In the meanwhile, a solution for companies regularly sending newsletters to a large number of subscribers is to:
Guest blog by Vincent Granville, first posted here.
Here's some simple JavaScript code to encode numbers, such as credit card numbers, passwords made up of digits, phone numbers, social security numbers, dates such as 20131014 etc.
NSA Headquarters
How does it work?
This code is very simple, it is by no means strong encryption. It is indeed less sophisticated than uuencode. But uuencode is for geeks, while our app is easy to use by any mainstream people. The encoded value is also a text string, easy to copy and paste in any email client. The encoded value has some randomness, in the sense that encoding twice the same values will result in two different encoded values. Finally, it is more secure than it seems at first glance, if you don't tell anyone (except over the phone) where the decoder can be found. I will create a version that accepts parameters, to make it even more secure.
Related articles
Here's the JavaScript / HTML code for those interested (this is the source code of the web page where our application is hosted). You could save it as an HTML document on your local machine, with file name (say)encode.html in a folder (say) C://Webpages, and then open and run it from a browser on your local machine: the URL for this local webpage would be \\/C:/Webpages/encode.html if you use Chrome.
<html>
<script language="Javascript">
<!--
function encrypt2() {
var form=document.forms[0]
if (form.encrypt.checked) {
form.cardnumber.value=crypt(form.cardnumber.value)
} else {
form.cardnumber.value=decrypt(form.cardnumber.value)
}
}
function crypt(string) {
var len=string.length
var intCarlu
var carlu
var newString="e"
if ((string.charCodeAt(i)!=101)&&(len>0)) {
for (var i=0; i<len; i++) {
intCarlu=string.charCodeAt(i)
rnd=Math.floor(Math.random()*7)
newIntCarlu=30+10*rnd+intCarlu+i-48
if (newIntCarlu<48) { newIntCarlu+=50 }
if (newIntCarlu>=58 && newIntCarlu<=64) { newIntCarlu+=10 }
if (newIntCarlu>=90 && newIntCarlu<=96) { newIntCarlu+=10 }
carlu=String.fromCharCode(newIntCarlu)
newString=newString.concat(carlu)
}
return newString
} else {
return string
}
}
function decrypt(string) {
var len=string.length
var intCarlu
var carlu
var newString=""
if (string.charCodeAt(i)==101) {
for (var i=1; i<len; i++) {
intCarlu=string.charCodeAt(i)
carlu=String.fromCharCode(48+(intCarlu-i+1)%10)
newString=newString.concat(carlu)
}
return newString
} else {
return string
}
}
// -->
</script>
<form>
Enter Number <input type=text name=cardnumber size=19><p>
Encrypt / Decrypt <input type=checkbox name=encrypt onClick="encrypt2()">
</form>
</html>
Guest blog by Vincent Granville, first posted here.
With email encryption being targeted by the government as if it was criminal activity (read the story about the Lavabit platform), this could be a great opportunity for mathematicians and data scientists: creating a startup that offers encrypted email that no government or entity could ever decrypt, offering safe solutions to corporations who don't want their secrets stolen by competitors, criminals or the government.
Here's the kind of email platform that I have in mind:
Indeed, the government could create such an app and disguise it as a private enterprise: it would in this case be an honeypot app. Some people worry that the government is tracking everyone and that you could get in trouble (your Internet connection shut down, bank account frozen) because you posted stuff that the government algorithms deem extremely dangerous, maybe a comment about pressure cookers. At the same time, I believe the threat is somewhat exaggerated. While there is a risk for false positives, you will never be sent in jail for talking about pressure cooker recipes (at worst, you'll get a visit from the NSA - someone indeed did). While big data and big brother are getting bigger and more powerful every second, the number of available cells in prison is not increasing. Maybe it is even decreasing. So even if magically, millions of people suddenly wanted to become law enforcement, NSA, CIA or FBI agents (and the money was available to train and hire them), there is just simply not enough prison cells to accommodate more prisoners (US has the largest prison population of any country, measured as the proportion of people incarcerated at any given time).
On the other side, many people seemed to be OK with increased regulations and more police. I think this is a side effect of living in an over-crowded world, with unsustainable population growth: the younger generation accepts or is forced into lower quality of life, having to share a small apartment with many roommates in over-crowded cities. They are more risk-adverse on average, and worry about all sorts of real issues such as increased terrorism, the risk of an epidemics, giant financial systems that could collapse under their own weight, pollution killing people at a younger age, etc. I believe eventually people will find solutions to escape from this environment, maybe by building floating cities, cities under the see, or underground cities. In my case, after many years of cubicle life and the morning and afternoon rat race (AKA the commute), I no longer drive to work, and have a much better lifestyle working from home 100% of the time - for the safest job one could ever wish to have: one that you created yourself, an adaptive, lean, agile enterprise that you founded yourself with a few great partners. But this is another story.
Anyone interested in building this encryption app? Note that no system is perfectly safe. If there's an invisible camera behind you, filming everything you do on your computer, then my system offers no protection for you - though it would still be safe for the recipient, unless he also has a camera tracking all his computer activity. But the link between you and the recipient (the fact that both of you are connected) would be invisible to any third party. And increased security can be achieved if you use the web app from an anonymous computer - maybe from a public computer in some hotel lobby.
Related articles