THE IMPORTANCE OF HASHING PASSWORDS, PART 4: THE HARDWARE THREAT

The third part of this series presented PBKDF2 as a modern key derivation and password hashing algorithm. But PBKDF2 has its limitations; for best protection against password cracking the iteration count (defining the computing power needed to hash a password) should be chosen as high as possible. On the other hand, a higher iteration count also means that a login of a regular user will be slower. The maximal time users are prepared to wait for a successfull login will limit the maximal iteration count which you can choose for the available computing power.

For some time we could at least assume that all but the most resourcefull attackers will have roughly the same computing power at hand as the defenders have on their login servers. An attacker might be able to set up (and finance) hardware to hash passwords 100 or even 1000 times faster than a server, but this could be compensated for by a sufficiently high iteration count. However, by using hardware specialized towards massively parallel execution of hashing operations the relation of the average servers and the potential attackers “hashing power” shifted more and more to the advantage of the attacker. Hashing algorithms like the scrypt algorithm presented in this Blog article attempt to shift this relation back in favor of the defender.

Hardware Supported Attacks

The key to a successfull attack on password hashing algorithms like PBKDF2 is parallelization. A salted password hash is cracked by hashing a dictionary of potential passwords using the same algorithm and salt, and comparing the results against the given password hash. The expensive operation here is the hashing, while the mere comparison of the hashes even for very big dictionaries can be done quickly with any modern database even on moderate hardware.

The salt (see part 2) prevents the attacker from precomputing all the password hashes of the dictionary and use these one-time precomputed values for all password hashes to be cracked. Thanks to the salt each password hash is “unique” and the attacker is forced to to recalculate the dictionary hashes for each specific target hash.

However, the process of computing password hashes can be highly parallelized by splitting the dictionary test potential passwords in parallel. If you had a dictionary of one million possible passwords and hardware which can do one million of password hashing operations concurrently, you can check your complete dictionary against a given password hash in roughly the time needed to do a single-threaded password hash calculation. Even if your single-threaded hashing performance is very low compared to the attacked servers performance, the check of the complete dictionary will be rather fast.

The first blow to algorithms like PBKDF2 came with the use of GPUs as general purpose computing devices which allowed highly parallel calculations not comparable with the use of conventional CPUs. But nowadays hashing is done even more efficiently by Application Specific Integrated Circuits (“ASICs”) which are specifically customized to attack a specific hashing algorithm.

The drive to extremely parallelize hashing did not come from password crackers but rather from the “mining” of cryptocurrencies, first and foremost Bitcoin mining. Mining Bitcoins or other cryptocurrencies in essence boils down to do hash calculations faster than your competitors. Because miners can make real money with their activity, they are willing to invest into specialized hardware to gain an advantage above the competitors. (And if the competitors don’t want to fall out of the race, they will be forced to follow suit.) This flow of money into companies and factories producing ASICs for cryptocurrency mining pushed the further development of these systems.

As a mere side effect of the cryptocurrency mining game, attackers of password hashes will have comparatively cheap but very powerfull ASIC hardware easily available, which with some customization will allow to calculate password hashes in an order of magnitudes faster than any conventional server. Against an attacker employing such hardware the PBKDF2 algorithm will be a rather weak defense.

scrypt to the Defense

Searching for a way to defend password hashes against highly parallel attacks it is helpfull to realize that highly parallel systems have an inherent memory problem. If you have a real lot of parallel threads and each thread needs a real lot of memory, then you will need a real whopping lot of memory driving your hardware costs considerably higher.

So what we are looking for is an algorithm which not only lets us tweak the computing power requirement (like PBKDF2) but also lets us enforce the use of a certain amount of memory to compute a password hash. On a regular PC or server you will have at most a few threads doing user logins concurrently (each doing a single-threaded password hash operation) and there will be plenty of memory for every thread. A highly-parallel ASIC system on the other hand, will have much more limited memory resources per thread due to the high number of concurrent threads. (You may invest more of the available transistors to memory but this will leave less for the processing and considerably decrease the number of concurrent threads possible.)

As of today, the best known “memory hard” password hashing algorithm is the scrypt algorithm. In this article I will focus on the scrypt algorithm as it is being currently standardized in the RFC draft draft-josefsson-scrypt-kdf-01 (“The scrypt Password-Based Key Derivation Function”). The original paper presenting the algorithm can be found here: https://www.tarsnap.com/scrypt/scrypt.pdf (PDF)

Let’s start with a birds eye view of the algorithm:

The basic input to the algorithm is the passphrase and the salt which we already know well from the previous parts of this article series. In addition the algorithm can be configured using several parameters:

CPU/Memory Cost Parameter (N): As we will see, this is the equivalent to the iteration count in PBKDF2 and determines how much processing time and/or memory is needed to execute the algorithm. As with an iteration count you want to choose this parameter as high as possible to make the attackers life harder.
Block Size (r): The block size determines the memory needed for a basic block processed by the core part of the algorithm, the scryptROMix function. By increasing the block size you increase the memory requirements without considerably influencing the CPU requirements.
Parallelization Parameter (p): As you can see in the overview above, the algorithm splits the work into p scryptROMix operations which can be executed concurrently. The parallelization parameter allows to finetune the “concurrency” of the algorithm and increase the CPU requirements without increasing the memory requirements.
Output Length (dkLen): This is simply the number of bytes finally produced by the algorithm. For our use case of password hashing you will normally choose a value around 32 bytes (or 256 bits) for a reasonable hash size. (In other use cases, namely the generation of key material, you would choose the output length according to the number and sizes of the key(s) being generated.) The output length is only relevant for the last PBKDF2 operation and does not otherwise influence the algorithm.

The basic structure of the scrypt algorithm as presented in the overview diagram above is not very impressing. First PBKDF2 is used to mix the input data and “stretch” or “trim” it to the input size neded by the core of the scrypt algorithm. The result of the PBKDF2 operation is a “homogeneous” byte string which can be split into p blocks, each containing r bytes. (Note that the iteration count chosen for the PBKDF2 function is 1 since we don’t use PBKDF2 here to maximize processing time.)

The blocks are then processed independently from each other by the scryptROMix operation (which will be explained below). This block processing can be done concurrently so the number of blocks (p) determines the “maximal concurrency” of the algorithm.

Finally the result of each scryptROMix operation is once again mixed and “stretched” or “trimmed” to the desired output size (dkLen), using PBKDF2 with an iteration count of 1 as before.

The algorithm overview looks quite reasonable but does not give us any insight into the specific qualities of scrypt. While we see that we can tweak the “concurrency” by modifying the parameter p, the meaning of the other parameters and how the influence the processing power and memory requirements is still a mystery.

The solution to the mystery lies in the scryptROMix operation:

The scryptROMix operation essentially consists of a sequence of scryptBlockMix operations on the input block of 128 * r bytes. We will have a more detailed look at scryptBlockMix below, but its essential property is to mix all bits of the input block in some “unpredictable” way and producing an output block of equal size to the input block.

The scryptBlockMix sequence itself is split into two “loops” with N iterations per loop. In the first loop a vector V consisting of the N input blocks of each iteration is stored.

A single block of the vector V is then xored into the input block of each scryptBlockMix operation of the second loop. Note that the last bytes of the input block itself determine which block of V to choose for the specific iteration.

Compared to the scryptROMix operation, scryptBlockMix is rather simple again:

Here the Salsa operation stands for the Salsa20/8 Core algorithm which is a variant of Salsa20 Core reduced from 20 to 8 rounds. The Salsa20 Core algorith is a function which mixes the input bits similar to a hash function but works on strings of 64 bytes and produces output of the same size.

Salsa20 Core has been designed for the Salsa20 encryption function, but scrypt does not otherwise have anything to do with Salsa20. In fact, the original paper presenting scrypt (Stronger Key Derivation Via Sequential Memory-Hard Functions) does not specifically request Salsa20/8 Core but states that Salsa20/8 Core seems to be an ideal match for scrypt because it is fast but not parallelizable, which are both properties desirable for the mix function used in scrypt.

Note that Salsa20 Core does not have the properties of a cryptographical hash function. This is not necessary for the application in scrypt which works with any mix function (where each output bit depends on each input bit) which is fast, not parallelizable and not “predictable” in a sense that there is a more efficient way to determine the output for a given input than actually executing the calculations of the algorithm. (Since the number of possible input strings is finite it would theoretically be possible to build a complete table of inputs and outputs, but the huge number of input strings makes this infeasible in practice.)

scryptBlockMix then just splits its input into blocks of 64 bytes size and applies the Salsa20/8 Core function to the blocks to produce the corresponding output block. To intermix all bits of each block with all the bits of each other block, the first input block is xored with the last input block and each other input block is xored with the previous output block.

This essentially “stretches” the Salsa20/8 Core function from the fixed block size of 64 bytes to the variable block size of the blocks processed by the scryptROMix operation. Note that the basic properties of the Salsa20/8 Core function are retained: The value of each output bit depends on the value of each input bit and the process is not (significantly) parallelizable. (Since the input of each Salsa calculation depends on the output of the previous Salsa calculation, these calculations cannot be done concurrently.)

How it Works

So how does scrypt enforce minimal memory requirements? The answer lies in the vector V of the scryptROMix algorithm. This vector of 128 * r * N bytes is built in the first loop of the algorithm and by choosing essentially unpredictable elements from that vector in the second loop the engine executing the algorithm will be forced to keep the complete vector readily available in memory (ideally in the CPU cache) to maximize execution speed.

As you will notice, it is not really necessary to keep the complete vector in memory to execute the algorithm. As it turns out it is really difficult to “force” an algorithm engine to use a considerable amount of memory for the execution of an algorithm like scrypt. In practice there will always be the possibility trade memory against processing power. One could easily implement scryptROMix without storing the vector V at all! When accessing a specific element of V in the second loop the calculations of the first loop can be repeated up to the iteration corresponding to the requested element, therefore “recalculating” that vector element. By storing a variable number of intermediate results of the first loop (instead of each result) an attacker might even finetune the CPU/memory tradeoff to best suit his means.

However, the crucial point is that doing such recalculations instead of storing the vector will increase the necessary processing power by such factors that the advantage of highly parallel hardware is compensated. At least that is the promise of scrypt.

But we have not seen the end of the evolution yet. After the success of Bitcoin a great number competing crypto currencies (more than 275 by May 2014 according to Wikipedia) have sprung up, some of them, like Dogecoin and Auroracoin, using scrypt as their hash algorithm to increase the mining difficulty. As soon as such a crypto currency gains some popularity (which seems to be the case at least for Dogecoin) there will be an increasing incentive to develop hardware specialized into scrypt calculations.

So scrypt is just the next step in the arms race between cryptography and cryptoanalysis, and in my opinion it does not look too good for the defenders. However, if you have a chance to choose scrypt over (say) PBKDF2, you should certainly do so. It may not completely stop every dedicated attacker but it will certainly slow him down and make his job much more difficult and expensive.

Some Practical Advice

When using scrypt you will have to choose the parameters N, r and p. Ideally you would tune these to optimal values according to the amount of memory, parallelism and processing power available at production time and to the properties of the memory subsystem in question.

In practice, needing a quick solution to hash some passwords, you will probably not have the luxury of extensive testing and finetuning. The authors of the scrypt paper recommend r = 8 and p = 1 for typical current PC hardware.

Given the above starting points for r and p you could then try a value around 32’000 for N, which will consume around 16 MiB to store the vector V. If the resulting key derivations are intolerably slow, decrease N possibly while increasing r to keep the memory use constant. On the other hand, if you have some CPU power left, increase either N (which will also increase the memory use) or p (which will increase CPU use without influencing the memory use).

On servers doing a lot of parallel logins, using 16 MiB (or more) per login might be too much. In this case decrease N which will decrease the needed memory as well as the needed processing power. Then increase p as much as tolerable to get the optimal processing power consumption for the chosen N. (An example for a reasonable setup would be N around 1000 and p = r = 8.)

Conclusion

This article ends the password hashing series. I hope I could give you some insight into the importance of hashing password for persistent storage and the inner workings of modern password algorithms.

References

All articles of the series:

The Importance of Hashing Passwords, Part 1: Cryptographic Hashes: An introduction into cryptographic one-way hashes such as SHA.
The Importance of Hashing Passwords, Part 2: Spice it up: What is salt and why is it important?
The Importance of Hashing Passwords, Part 3: Raise the Price: How PBKDF2 attempts to slow down attackers by chaining iterated hash operations.
The Importance of Hashing Passwords, Part 4: The Hardware Threat: How to use scrypt to defend against dedicated cracking hardware by enforcing minimal memory requirements.

References:

Finding Preimages in Full MD5 Faster Than Exhaustive Search (Yu Sasaki, Kazumaro Aoki)
Construction of the Initial Structure for Preimage Attack of MD5 (Ming Mao, Shaohui Chen, Jin Xu)
Schneier on Security: SHA-1 Broken (Bruce Schneier)
Cryptography Engineering (Bruce Schneier)
SplashData News: Worst Passwords of 2012 — and How to Fix Them
RFC 2104: HMAC: Keyed-Hashing for Message Authentication
RFC 2898: PKCS #5: Password-Based Cryptography Specification Version 2.0
Internet Draft: The scrypt Password-Based Key Derivation Function, draft-josefsson-scrypt-kdf-00 (S. Josefsson e.a.)
The Salsa20 core (D. J. Bernstein)
Snuffle 2005: the Salsa20 encryption function (D. J. Bernstein)
Stronger Key Derivation via Sequential Memory-hard Functions [PDF] (Colin Percival)
Bitcoin Homepage
Dogecoin Homepage
Auroracoin Homepage

Wikipedia Glossary: