Securely storing and verifying passwords
Last updated: April 12, 2012
The task of storing and verifying passwords arises in just about any piece of software that has any level of security incorporated into it. This task is not quite as straightforward as an uninitiated programmer might think. Certainly, the most obvious solution to the problem is not at all close to secure. This document briefly discusses some standard or common approaches to password storage and verification. This document or one which covers the same issues should be considered required reading for anybody implementing a password storage and verification scheme without prior experience.
The Naive Approach
What I've decided to call a "naive password storage scheme" is probably exactly what you think it is. It's the simplest possible password storage scheme that could work in practice. It's also the least secure scheme possible. Despite this, it is still sometimes used occasionally by people who don't know any better.
Naive password storage
We keep pairs of usernames and passwords stored in a database, plain and simple. When a user authenticates themselves to our application, we look to see if the username and submitted password occur together in our database.
That is, the verification procedure is very simply:
- Take the offered password,
- Look up the stored password,
- Compare the offered password from step 1 to the stored password from step 2 and verify the password if and only if these match.
The password database simply looks like this:
username | password ----------------+------------------------ alice | password bob | secret eve | money ----------------+------------------------
Throughout this document, example password databases will be presented for each storage scheme discussed. The usernames and corresponding passwords will always be the same as those above.
Why do it like this?
The only reason to implement this password storage system is because it is obvious, easy and you didn't know any better! It should never be used in practice, under any circumstances, for reasons we shall soon see.
Problems with the Naive Approach
What if somebody gets read access through the database by exploiting a web server vulnerability? What if a disgruntled system administrator sells the database to the local organised crime syndicate? What if someone physically steals a database server? What if an old backup tape containing the database is thrown out and found by a dumpster diver?
As soon as any one of these things happens, whoever illegitimately comes into possession of your database knows every single one of your user's passwords. They can log in as any one of your users and wreak havoc with their account.
It's easy to take the attitude of "none of those things would ever happen to me!". The truth is that these things occasionally do happen. In December of 2006, the popular (and excellent) social bookmarking site reddit had a backup of their database stolen. They had stored their user's passwords as plain text in that database - they were using a naive password storage scheme. The database thief subsequently knew every Reddit user's password.
Occasionally, things go horribly wrong. There is nothing that can be done to prevent this. But you can take steps to make sure that when things go horribly wrong, the consequences are not as bad as they have to be. Never, ever use a naive password scheme.
The first step up the security ladder from naive password storage is something called a hashed password storage scheme.
The central concept for this section is that of a hash function. A sufficient definition of a hash function for our purposes is this:
A hash function is a function which:
- Takes as input an arbitrarily long sequence of bits,
- Produces as output a fixed length sequence of bits, and
- has the following properties:
- Making even a small change to the input (e.g. flipping a single bit) will cause a substantial change in the output (e.g. will flip at least half the bits)
- Given an output of the hash function, finding an input which maps to it is very, very hard.
- Finding two different inputs which map to the same output is very, very hard.
Lots of different hash functions have been designed over the years. By far the most common are MD5 and the SHA family of hash functions. Hash functions have a variety of uses in computer science, including assuring the integrity of data and as a vital component of digital signatures. The design of new, more secure hash functions has recently become a hot topic in cryptography. You can read more about hash functions on Wikipedia.
In an attempt to make the concept clear, here are some example strings and their corresponding hashes (the output of a hash function given a certain input is usually called the hash of that input) using the MD5 hash algorithm:
|The quick brown fox jumps over the lazy dog||37c4b87edffc5d198ff5a185cee7ee09|
|the quick brown fox jumps over the lazy dog||1e280e1713df124d35709cf6138d9f91|
Observing the second and third strings, notice that making a simple change to the "Quick brown fox..." string (changing the first letter from upper to lower case) has resulted in the hash changing entirely. This is the first property in our list of properties above.
Hashed password storage
In a hashed password storage scheme, rather than storing pairs of usernames and plaint's passwords, we store usernames and the hashes of plain text passwords. When a user authenticates themselves to our system, we take the password they have provided, compute its hash and compare the result.
In other words, the verification procedure is very simply:
- Take the offered password,
- Compute the hash of the offered password,
- Look up the stored hash of the user's password
- Compare the computed hash from step 2 with the stored hash from step 3, verifying the password if and only if these match.
The password database will now look like this (using MD5 hashes):
username | hashed_password ----------------+---------------------------------- alice | 286755fad04869ca523320acce0dc6a4 bob | dd02c7c2232759874e1c205587017bed eve | 21eb8dc2babd4f6db0b3bef1923d7398 ----------------+----------------------------------
Why do it like this?
Consider the problem we found with the naive password storage scheme: as soon as the password database was compromised in any one of several possible ways, every user's password is immediately compromised. This problem does not occur in a hashed password storage scheme. If the database is compromised, the attacker is in possession of the hash of each user's password, not the passwords themselves. By virtue of property 2 in our list of hash function properties, retrieving the actual password from this hash is very, very hard. For an ideally secure hash function, the attacker's only option is continually compute the hash of random guessed passwords until a matching hash is found. This sort of exhaustive searching - or "brute force" attack - is supposed to be outside the realm of feasibility. Whether it is or not can be questionable, as we shall see later, but it certainly provides more security than the naive scheme.
Problems with Hashed Password Storage
Hashed password storage is a definite step up from naive password storage, but it is not without problems itself. Some problems are discussed in this section. In particular, the first two problems we discuss are security vulnerabilities which are sufficient reason not to use a plain hashed password storage scheme in production.
Vulnerability to "Rainbow tables"
The theoretical ideal which gives hashed password storage an advantage is that if an attacker only has a hash of a user's password, the only way to recover the actual password is to repeatedly hash random guess after random guess until a match is found. This is supposed to be too impractical to ever succeed.
The unfortunate reality is that the vast majority of users choose passwords which make the search for a matching hash substantially easier. A large proportion of passwords are short (8 characters or under), contain only lowercase letters (no punctuation or digits) and are either ordinary words one would find in a dictionary for the user's native language or common names in that language. Thus, rather than having to hash every possible password, an attacker needs only to hash a relatively small subset of all possible passwords before having a good chance of finding at least once match.
Large tables which contain the hashes of a range of likely passwords are called "rainbow tables" (some purists will insist that this usage of the term rainbow table is a simplification of the actual meaning of the term. They're right, and you can see Wikipedia for the full details, but this approximation of the phrase is sufficient for our purposes), and are a traded commodity amongst those in the business of breaking security systems. Reportedly, large rainbow tables can be bought on CDs or DVDs at the right kinds of conventions. Certainly, online rainbow table services and projects are around - see, e.g., www.freerainbowtables.com, The Shmoo Group or Project Rainbow Crack.
So, while hashing passwords certainly increases the difficulty of breaking a password, it is insufficient by itself to stop a determined and resourceful attacker. This unfortunate situation can be avoided by "salting" passwords, as discussed in the next section.
Identifiability of Identical Passwords
One less serious problem with hashed password storage is that if two different users have the same password, then the same hashes will be present in the database. If the database is stolen by someone who has an account themselves (or the ability to get on-demand access to the database is obtained by somebody who can subsequently create an account), then they may find the hash that corresponds to their (known) password simply by searching for their username. They can then search for other usernames which have the same hash, and log in to any such accounts using their own password.
Complication of forgotten password recovery
Something else worth noting about hashed password storage schemes is that they somewhat complicate the task of dealing with forgotten passwords. If a user forgets their password for some application which uses hashed password storage, the administrator of that application is powerless to recover that password - all they or anybody else has access to is the hash.
Thus, using a hashed password storage scheme means you need to implement a more sophisticated password recovery scheme. Of course, the scheme doesn't really need to be complicated. A common solution is to generate a random replacement password (storing its hash!) and then email that password to the user. The user can log in with the random password once and then change it to one of their choosing.
It is worth noting that all hash functions which will be available as existing libraries in common programming languages will have been designed for speed. This is because hash functions are often used to ensure data integrity: if you download a large file (say an .iso), you can make sure that the download was not corrupted by computing the hash of what you downloaded and comparing it to the hash of a known good copy, published on the download website. In this sort of application, one wants to compute the hash relatively quickly.
When using hash functions to secure passwords, however, speed is not desirable at all. The faster a hash function is, the faster an attacker can compute rainbow tables. Furthermore, a slow hash function is not really a problem in this application. Slowing a hash function down by a factor of ten might mean that logging in takes 0.1 seconds instead of 0.01 seconds for a legitimate user who knows the correct password (no big deal), but that brute forcing a single password takes an attacker 10 years instead of 1 year (certainly a big deal for the attacker). In light of this, some applications iterate a standard hash function like MD5 a few times over and store the result of that - i.e., they store the hash of the hash of the hash of the...etc., say 10 levels deep.
The final password storage scheme we consider in this document might be called a salted and hashed password storage scheme. This is more secure than both naive password storage and hashed password storage. It is not a new idea - Unix operating systems have been storing user passwords on the local hard drive in exactly this way for literally decades. However, astonishingly few programmers who implement password storage systems are aware of this sort of scheme. Even those who are occasionally don't understand all of its details.
If you are responsible for developing a password storage system from scratch, you should use the scheme described in this section. Anything less is taking an unnecessary risk, and should things go wrong your efforts will make you look negligent at best.
Hashed and Salted Password Storage
The idea behind salting passwords is that, instead of hashing the user's password verbatim, we add a small quantity of random "salt" to the end of the password and then compute and store the hash of this combination. We also store the salt by itself. When a user attempts to authenticate themselves using their password, we append the stored salt to the offered password and compare the resulting hash against that in the database.
The password verification process is:
- Take the offered password,
- Look up the username's corresponding salt string,
- Append the salt string to the offered password and compute the hash of the result,
- Look up the stored hash of the user's salted password,
- Compare the hash computed in step 3 with the stored hash from step 4 and verify the password if and only if these match. The password database looks something like this:
username | salt | hashed_password_salt ----------------+-------------------+---------------------------------- alice | dB7e1fN9ugtL48OU | a4b84caf1d746ca4f8837ac2edd9d22e bob | 8MExmNnbJYryuclK | cf556ff70a072c90bd495579febca9d1 eve | hJeBg5YU2k0XGLQu | edc75e1c45802955901e98cea2870fa1 ----------------+------------------------------------------------------
Why do it like this?
The advantage to salting is that it defeats attackers using pre-computed tables of the hashes of easy passwords. Even if users choose passwords like "dog" or "cat", hashes stored in the database are actually of strings like "catbHnYP9WFpjqGV3SX". Thus, instead of requiring a pre-computed table of all passwords of length 8 or less using only lower case characters in order to recover passwords, an attacker requires a pre-computed table of all passwords of length at least 16 using lower and upper case letters, together with digits and possibly even punctuation. The longer the salt and the larger the range of characters within it, the more effort pre-computed tables take to generate and thus the more secure the scheme.
One salt for the whole table, or one per user?
One sometimes sees explanations of a hashed and slated password scheme in which the same random salt is used for the entire database of passwords, instead of there being a unique random salt per user as I've done here.
While using the same salt for each user will save storage space, it does add less security. If the database is compromised, an attacker may set about constructing a rainbow table for the database - this will take as much time as constructing a rainbow table for an unsalted database of hashed passwords.
A space-time trade-off
A possible variation on randomly generating a salt for each user and storing that salt in the database is to use the hash of the user's username as the salt. This saves some storage space, in that there does not need to be a salt column in the password database, but it makes password checking take slightly longer as one extra hash needs to be computed. Whether or not this is a good trade depends on your particular hardware situation. This trick does not affect the security of the overall scheme in any way.
Doesn't the salt need to be secret?
There seems to be some confusion as to whether or not the random salt(s) need(s) to be kept secret. The answer is quite definitively that they do not. Anybody who insists that they must be (and there are plenty of these sorts of people on the internet) has entirely misunderstood the purpose of salting. In fact, they haven't really thought things through at all - the salt must be readable by the password verifying program in order for the scheme to work at all. So you can put your salts in your database with no obfuscation at all without affecting the security of the overall scheme in any way.
Secure transmission of passwords
An ever present truth in information security is that any system (be it for password based authentication or something entirely different) is only as strong as its weakest link. The relevance of this truth here is that no matter how securely your passwords are stored in the database, the system is vulnerable if users submit their passwords for authentication over a public network - such as the internet - without some sort of preventative measure.
If you're writing a web application where the users authenticate themselves by POSTing a filled out HTML form (by far the most common way this is done), unless you have good reason not to you should do the post over a HTTPS connection, rather than a plain HTTP connection. Otherwise, the password travels from the end user's personal computer to your web server unencrypted. Anybody in between the two computers - and that includes, but is not limited to, people working for your user's ISP and your own ISP - can recover that password with minimal effort.
Passwords should always be transmitted to web applications over HTTPS. This includes not just the main log in form, but also pages which handle functionality like changing passwords.