Many applications store passwords for user authentication. Using an appropriate password hashing algorithm can efficiently protect the stored passwords even when the persisted password hashes get stolen by an attacker.
Unfortunately many developers assigned with the task to implement a persistent password storage lack the necessary cryptographical background knowledge to choose a strong password hashing algorithm, often leading to passwords stored in plain or hashed with weak algorithms such as a secure hash algorithm without any salt or iterations.
This article aims to help the cryptographically unencumbered developers to make the right choice when hashing user passwords. The first part will start with a closer look on the goal we try to achieve, and then examine the secure hash functions which build the core of every password hashing. The following parts will show how to further strengthen a raw hash function against cracking attacks, finally leading to the state-of-the-art algorithms PBKDF2 and scrypt.
Plain Wrong
Persistend data and often complete databases are stolen all the time. Often they contain user data, including passwords used to access a service or website. If you don’t believe me, just look at a few recently affected organizations (and a duck):
- Java-Forum (german)
- OVH web hosting service (german)
- Ubuntu-Forum
- Hetzner web hosting service
- Evernote (german)
- Adobe Forum
- Gamigo online gaming platform
- Yahoo Voice
- Donald Duck (german)
One rule I draw from this: Databases in general and especially those containing any personal or confidential data (such as passwords) should ideally always be encrypted when stored on persistent storage. This does not solve all problems, but for data in general this is probably the maximum of cryptographic security you can get for the time being. (There is some very interesting research going on into processing of encrypted data without the need to reveal the actual plaintext behind it. But don’t expect anything from that to be ready for production in the near future.)
However, there is one category of especially sensible data which we can protect much better when stored in a database: secret passwords used to authenticate the users of our applications. Possession of such passwords will enable an attacker to impersonate the user against the application opening the doors to all kinds of espionage and sabotage and further penetration. And because users, against all good advice, tend to reuse passwords for different services, cracking a users password for one service often means access to numerous other unrelated (and possibly much more critical) services.
Luckily, we do not need to (and therefore never should) store such passwords in plaintext. Instead we store the result H of some function called a KDF:
H = KDF(P)
KDF stands for key derivation function. A KDF can be used to generate a pseudo-random, fixed-size cryptographic key from a variable length user password or passphrase (hence the name of these functions) or to transform such passwords into pseudo-random, fixed-size hashes suitable for authentication.
The most important property of KDF when used for password hashing is that it must be easy to calculate the result H for any given plaintext password P but difficult (ideally impossible) to calculate the original password P from a given value H. Then, if we only store H instead of P, an attacker getting to know H will not be able to determine the original password P from this value. This property is called pre-image resistance.
To be suitable for authentication, it seems we need to demand another property from a KDF: For two non-equal passwords P1 and P2, the resulting values H1 and H2 should never be equal. We call a pair (P1, P2) with P1 ≠ P2 and H1 = KDF(P1) = KDF(P2) = H2 a Collision. It seems a good KDF should be collision free.
We could then use the following protocol to authenticate a user:
- The client asks the user for the username and the password P and sends these to the server.
- The server calculates the value H = KDF(P) from the received password.
- The server retrieves the user data corresponding to the specified username from the database, including the expected KDF result H’.
- The server compares the value H calculated from the client request with the value H’ stored in the database. If the values match, the password was correct and the user is authenticated.
Because our KDF is collision free, we know that H = H’ can only be true if P = P’ (where P’ is the expected password of the user).
But it turns out that this second property is overly strict. We don’t really need the KDF to be completely collision free. Instead we only need the chance for a collision to be sufficiently small that an attacker cannot gain any advantage from such collisions. Or, in other words, collisions should be hard to find (ideally at least as hard as simply guessing a users password). Such a relaxed property is called collision resistance.
Relaxing the second property allows us to use a well known cryptographic building block to construct a KDF: a cryptographic hash function.
Cryptographic Hash Functions
Every developer should know hash functions which take an input of arbitrary length (say some string of characters) and calculates an output of fixed length (for example a 32 bit integer).
A cryptographic hash function is a hash function with the additional properties pre-image resistance, collision resistance and second pre-image resistance. We already saw the former two properties as necessary properties of a KDF. The third property, second pre-image resistance, means that given a message M1 it should be difficult to find another message M2 such that the hash values of both messages are equal. While this is an essential property for the construction of digital signatures, it clearly is not very relevant for our use to protect passwords. (After all, if an attacker already knows a password P1 he will simply use it. Finding some other password P2 with the same hash value will not help his plans.)
So, knowing the often used MD5, many developers will resort to simply use MD5 for password hashing:
KDF(P) := MD5(P)
However, doing this will give you about the worst possible password hashing. Passwords hashed like this are so easy to crack (as you will see below) that you are probably better off to simply store your passwords in plaintext.
The first problem is that MD5 has been broken a long time ago:
Therefore we suggest that in the future MD5 should no longer be implemented in applications like signature schemes, where a collision-resistant hash function is required. According to our present knowledge, the best recommendations for alternatives to MD5 are SHA-1 and RIPEMD-160.
(The Status of MD5 After a Recent Attack, CryptoBytes, RSA Laboratories, Volume 2, Number 2 — Summer 1996)
Note that this attack on collision resistance did not yet disqualify MD5 for use in password hashing. The attack simplified the task to find two arbitrary messages P1 and P2 which both result in the same hash: MD5(P1) = MD5(P2). This is no direct help to an attacker attempting to break hashed passwords.
But first successfull attacks on a hash algorithm, which typically concern (strong) collision resistance, are a clear sign that the security of the algorithm as a whole is in peril, and that it is about time to look for an alternative.
MD5 held up for another 13 years, but in 2009 a pre-image attack followed. See Finding Preimages in Full MD5 Faster Than Exhaustive Search (Yu Sasaki, Kazumaro Aoki) and Construction of the Initial Structure for Preimage Attack of MD5 (Ming Mao, Shaohui Chen, Jin Xu) if you are interested in the cryptographical details. Though this attack was still theoretical, today no one should be using MD5 as a cryptographically secure hash algorithm anymore.
Back in 1996 the RSA Laboratories recommended the algorithms SHA-1 and RIPEMD-160 to replace MD5, but in 2013 I would not want to give such a recommendation. There have been (at least partially) successfull attacks on SHA-1 (see for example Schneier on Security: SHA-1 Broken), and the only reason RIPEMD-160 looks slightly better on paper is most probably that it has been analyzed much less than SHA-1.
Happy Birthday
But much more important: MD5, RIPEMD-160 and SHA-1 all have the problem that their hashes are too small for general use. You might argue that this deficiency is less relevant for password hashing where birthday attacks are not normally an issue, but since there are alternatives with bigger hash sizes with the SHA-2 family and SHA-3 you can follow this simple general advise:
When using secure hash algorithms always use hash sizes with 256 bits and more.
This will give you a “security level” of (at least) 128 bits even in cases where a birthday attack could be mounted.
Considering the standardized hash functions that currently are available for us, there is a corollary following:
When choosing a secure hash algorithm in 2013 (or later) choose SHA-256, SHA-384, SHA-512 or better yet (where already available) SHA-3 with a hash size of at least 256 bits.
Note that SHA-224 (slightly breaking our 256 bit limit) is practically identical to SHA-256 with the result truncated to 224 bits, and likewise SHA-384 is practically SHA-512 truncated to 384 bits. If you need to specify a hash algorithm with an output of exactly 224 resp. 384 bits you could use these SHA variants. If you need even more flexibility, you can always take any hash algorithm considered secure with an output bigger than your desired hash output size and truncate it to the number of bits needed. As long as your final output size is not smaller than appropriate for your minimally acceptable security level (usually 256 bits for secure hashes) this will not reduce the security of the hash algorithm.
Length Extensions
There is more to know if you want to use “raw” secure hash algorithms as cryptographic building blocks. The practically available secure hash algorithms are not perfect secure hash algorithms and it pays off to know about their weaknesses. The most important one in SHA-256 and SHA-512 and a number of older algorithms (such as MD5 and SHA-1) is the “length extension bug”. This “bug” or weakness results from the fact that the complete state of an engine calculating a hash like SHA-256 is contained in the output of its hash function. So, omitting some details like padding to block sizes etc., for any two messages M1 and M2 we could write:
SHA-256(M1 ∥ M2) = SHA-256(SHA-256(M1) ∥ M2)
Where the symbol ∥ means the concatenation of two byte sequences.
For an illustration that this is a problem consider a (too) simple message authentication code (MAC) which works by calculating the secure hash of a secret password K concatenated with the message M:
MACK(M) := SHA-256(K ∥ M)
Now imagine a user Alice sending an authenticated message M to another user Bob with whom she shares a secret key K. Bob will expect to receive the message and the calculated MAC:
M, MACK(M)
But an attacker (let’s call him Mallory) intercepts Alices transmission and attempts to alter the message by appending some suffix:
M ∥ X, MACK(M)
However, Mallory cannot forward the transmission like this. Bob would check the MAC and realize that MACK(M) ≠ MACK(M ∥ X). Mallory needs to calculate a correct MAC to trick Bob into believing that the complete message comes from Alice. But Mallory does not know the secret key K. Fortunately (for Mallory) SHA-256 and its application in our MAC scheme makes this very easy:
MACK(M ∥ X) = SHA-256(K ∥ M ∥ X) = SHA-256(SHA-256(K ∥ M) ∥ X) = SHA-256(MACK(M) ∥ X)
Since Mallory has the value of MACK(M) from the original message he has no problems of generating a valid looking MAC for his new message M ∥ X and making Bob believe that he is reading a genuine message from Alice.
Granted, the authentication scheme has room for improvements; switching the order of the concatenation and defining MACK(M) := SHA-256(M ∥ K) would already prevent the attack shown here. But if we slightly modify the scenario, we run into another problem called a “partial message collision”.
Partial Message Collisions
Partial message collisions can be seen in all practically used secure hash functions (MD5, SHA-1, SHA-2, SHA-3 and others), because all of them are designed to iteratively work on blocks of data. This makes it possible to hash a stream of data without the need to buffer more than the small block currently processed. This is a very usefull property of these hash algorithms but it is also a cryptographical weakness.
That these collisions might be a problem can be seen if we use SHA-256(M ∥ K) to “sign” messages which we receive from other users. This setup gives Mallory a chance to search for pairs of messages M1 and M2 for which SHA-256(M1) = SHA-256(M2). Mallory could then let us sign M1 resulting in a signature which also works for M2:
MACK(M1) = SHA-256(M1 ∥ K) = SHA-256(M2 ∥ K) = MACK(M2)
In effect, we blindly signed message M2 while believing to sign M1. (Though, again, we omitted details like padding to block sizes which might complicate the attackers life.)
If the hash function is strong and the hash length big enough, this weakness will not be a problem, because it will be very difficult to find colliding messages M1 and M2. On the other hand the attack can be prepared offline and it is a classical birthday attack halving the effective size of the hash, which is a very significant reduction of the hash strength.
Stronger Hashing
Whatever the exact attack vectors may be, the crucial point here is that we certainly don’t want such mathematical relationships like length extensions or partial message collisions in a “perfect”, truly secure hash algorithm. Luckily there is a way to fix both of these problems by hashing the message twice, as shown by Bruce Schneier in his book “Cryptography Engineering”:
SHADBL-256(M) := SHA-256(SHA-256(M) ∥ M)
The idea here is to let the initial state of the (outer) hash engine depend on every bit of the message before hashing the complete message (again). I don’t know of a formal proof but given our two scenarios above, you can probably intuitively see, that these and similar attacks will not work anymore, and we can also give some pseudo-mathematical explanation:
MACK(M ∥ X)
= SHADBL-256(K ∥ M ∥ X)
= SHA-256(SHA-256(K ∥ M ∥ X) ∥ K ∥ M ∥ X)
= SHA-256(K ∥ M ∥ X ∥ K ∥ M ∥ X)
≠ SHA-256(K ∥ M ∥ K ∥ M ∥ X)
= SHA-256(SHA-256(SHA-256(K ∥ M) ∥ K ∥ M) ∥ X)
= SHA-256(SHA-256DBL(K ∥ M) ∥ X)
= SHA-256(MACK(M) ∥ X)
And for the partial message collision:
MACK(M1)
= SHADBL-256(M1 ∥ K)
= SHA-256(SHA-256(M1 ∥ K) ∥ M1 ∥ K)
= SHA-256(M1 ∥ K ∥ M1 ∥ K)
≠ SHA-256(M2 ∥ K ∥ M2 ∥ K)
= SHA-256(SHA-256(M2 ∥ K) ∥ M2 ∥ K)
= SHADBL-256(M2 ∥ K)
= MACK(M2)
This fix is very well suited for short messages such as passwords. But for longer messages, and especially if you calculate the hash value of a stream, SHADBL is not only twice as expensive as the simple SHA, it also forces you to buffer the complete message.
If your goal is to fix the partial message collision there seems to be no easy way without buffering the message. However, since all attacks using partial message collisions are birthday attacks, which we already countered with our rule to use hash sizes of 256 bits or more. This leaves a security level of at least 128 bits even after a successfull birthday attack, which should normally be sufficient.
On the other hand, if you want to just fix the more disturbing length extension problems in SHA-256 and SHA-512, there are more efficient constructions than hashing the complete message twice. One (proposed by Bruce Schneier in his book) is to hash the result of the hash function again:
SHAD-256(M) := SHA-256(SHA-256(M))
This construction works by doing some “special processing” at the end of the hash operation (namely hashing the result once again with a “fresh” engine) to break the mathematical relationship making the length extension attacks possible.
Your second option is to truncate the output of the hash function. This also introduces a final “special processing” and hides some part of the state of the hash engine from the attacker. Because the resulting hash value should still have at least 256 bits, we are forced to use SHA-512 in this case:
SHA-512/256 := “the first 256 bits of SHA-512(M)”
Because truncation prevents the length extension attacks, SHA-384 and SHA-3 are protected without any additional measures. SHA-384 is mostly SHA-512 (with an internal state of 512 bits) truncated to 384 bits, therefore 128 bits of the internal state are hidden from the attacker. (This is only half of the hidden state compared to SHA-512/256, but in practice it should be sufficient to prevent length extension attacks.) SHA-3 finally has an internal state of 1600 bits, so even for a maximal hash length of 512 bits, there are more than 1000 bits of internal state hidden from the attacker.
Conclusion
After having learnt some basics about cryptographic hash functions, we can now choose a reasonable hash algorithm such as SHA-512/256 or SHA-3 instead of MD5. But there is a crucial ingredient for a secure password hashing which is still missing from our recipe: the “salt”. This will be the topic of the next part of this article.