Why are some techniques for generating random data in Python more secure than others?
Is "random" really random? Let's find out.
Think of a number from 0 to 5. C'mon. Humor me and try this out.
Do you have something in mind? Good. Now, consider how you landed on that specific numeric value and why you chose it. Were you thinking of a birthday? Maybe a recent sports score? Perhaps you glanced at a particular visual pattern or quantity of something nearby?
There are actually a LOT of ways that the human mind can generate "random" data. Some techniques are more unpredictable and elaborate than others. The same can be said for programming languages and the techniques they use for generating random data. Python is a great example.
For the purpose of illustration, let's say I have a list of numbers in Python:
numbers = ['0', '1', '2', '3', '4', '5']
How would I tell Python to randomly choose one of those values from the list? One approach would be to use the random.choice() function and run some code like:
import random
numbers = ['0', '1', '2', '3', '4', '5']
print(random.choice(numbers))
In modern versions of Python (3.6 and later), I could instead use secrets.choice() and run some code like:
import secrets
numbers = ['0', '1', '2', '3', '4', '5']
print(secrets.choice(numbers))
So what exactly is the difference here? Basically, it comes down to how random a "random" value actually is. Computers are GREAT at following instructions, but it's much harder to get them to produce unpredictable data.
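random.choice() is a Python function that uses a "standard" pseudo-random number generator (PRNG); in Python's case, that's the Mersenne Twister. To a human, whatever value it picks from the list would seem completely unplanned, but in actuality it's predictable. The algorithm that creates the perceived randomness is deterministic and relies on a starting point (typically referred to as a "seed"). That seed value (and how the computer comes up with it) is very important. With a standard PRNG, the seed could be a user-set value or an otherwise predictable data point like the computer's local time. Python's random module does seed itself from the operating system when it can, falling back to the clock otherwise, but that doesn't make it safe: anyone who learns the seed, or observes enough of the output to work out the generator's internal state, can reproduce every value that follows.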
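To make the role of the seed concrete, here is a minimal sketch. The helper name picks_from_seed and the seed value 2024 are made up for illustration; the point is only that two generators started from the same seed produce exactly the same "random" picks.

```python
import random

numbers = ['0', '1', '2', '3', '4', '5']

def picks_from_seed(seed, count=10):
    """Return `count` picks from `numbers` using a generator seeded with `seed`."""
    # A private generator, independent of the module-level one used by random.choice()
    rng = random.Random(seed)
    return [rng.choice(numbers) for _ in range(count)]

# Hypothetical seed value; any fixed value behaves the same way
first_run = picks_from_seed(2024)
second_run = picks_from_seed(2024)

print(first_run)
print(second_run)
print(first_run == second_run)  # True: same seed, same sequence
```

That reproducibility is handy for tests and simulations, and it is exactly what an attacker is hoping for.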
In contrast, secrets.choice() uses a cryptographically secure pseudo-random number generator (CSPRNG), specifically the one provided by the operating system. The underlying math and algorithms still need seed values, but these are much harder to predict and reproduce (which is good when it comes to security). They draw on sources of "entropy" (real-world unpredictability) gathered by the system, such as the timing of hardware events (e.g. mouse movements) or thermal noise/decay.
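One way to see the difference is to try seeding the secure generator. The secrets module is built on random.SystemRandom, which reads from the operating system's randomness source. As a rough sketch (the seed value 2024 is again arbitrary), calling seed() on it is accepted but has no effect:

```python
import random

numbers = ['0', '1', '2', '3', '4', '5']

# The same class of generator that the secrets module uses under the hood
sys_rng = random.SystemRandom()

sys_rng.seed(2024)  # accepted for API compatibility, but ignored
first_run = [sys_rng.choice(numbers) for _ in range(32)]

sys_rng.seed(2024)  # "re-seeding" does not rewind anything
second_run = [sys_rng.choice(numbers) for _ in range(32)]

print(first_run)
print(second_run)
# With 32 picks from 6 options, a match here would be astronomically unlikely
print(first_run == second_run)
```

There is no seed for us to set, which also means there is no seed for an attacker to guess.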
OK, so with all that said, which one should be used and when? It really just depends (I know, I know, classic consultant answer, right?). It's true though! If we're not protecting anything with the randomness, it might not matter. On the other hand, if we need this random data for security use cases (e.g. generating a passphrase), then it's ABSOLUTELY important to make the data as difficult as possible to reconstruct or predict.
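For the passphrase case, a minimal sketch might look like the following. The word list and the make_passphrase helper are hypothetical and only here for illustration; a real passphrase should draw from a list of thousands of words, not eight.

```python
import secrets

# Hypothetical mini word list for illustration only; far too small for real use
WORDS = ['anchor', 'breeze', 'cactus', 'dynamo',
         'ember', 'falcon', 'glacier', 'harbor']

def make_passphrase(word_count=4, separator='-'):
    """Join `word_count` words, each picked with the OS-backed secure generator."""
    return separator.join(secrets.choice(WORDS) for _ in range(word_count))

passphrase = make_passphrase()
print(passphrase)
```

Swapping secrets.choice() for random.choice() here would still print something that looks random, which is the whole problem: the two are indistinguishable by eye.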