Note: This blog was created as an entry to the Summer of Math Exposition 3 by 3Blue1Brown which is a part initiative - part competition to increase the number and quality of resources on the internet related to Maths topics.

Edit after SoME3 results. You can check out here other interesting entries and math explainers.

Motivation

You’re with friends playing cards and it’s your turn to shuffle. You think to yourself ‘I don’t know how to shuffle’ or ‘maybe I should ask someone else to do it for me’. But there is no escape, you shuffle and deal the cards, everyone looks to one another in confusion until someone says: ‘Who shuffled these?’

Usually when playing cards, people use the standard method of riffle shuffling (splitting the deck in two and then mixing them) like this:

I'm not very good at doing that, so I usually do pile shuffling (creating any number of piles by adding cards from the unshuffled deck) which looks nicer on the board:

However a few weeks ago I thought: how can I be sure that my deck is properly shuffled? What does that even mean? To answer this we'll work our way from visual examples, some testing and definitions that were posponed to the end when they're needed. I hope that when we use those definitions you won't have any doubt on the why.

We'll focus on a Ticket to Ride deck (boardgame where we pay with colored cards the cost to build roads of different size) because it is while playing this game that I wondered if pile shuffling is enough. Also, there is a second reason, which will become apparent at the middle of the explanation: for an English deck of cards you have 4 kinds but you have a lot of complicated relationships between cards of same or different kind: full house, straight, flush, four of a kind, etc.

The blog is divided in these major sections:

First tests. We'll apply riffle and pile shuffling to a sorted deck, and define what a used deck means.
Measuring randomness. We will define what to look for and what to measure.
Finally some results. We will apply a method of shuffling a large number of times and analyze some interesting comparisons.

First tests

Let’s begin by testing a few shuffles and looking at the resulting deck. We will represent a deck as a series of stripes of color as shown in some of the images below and for the first third of this blog we'll start with a sorted deck, but then we will study what happens with an used deck.

We’ll start with a sorted deck of Ticket to Ride which contains 110 cards in total (8 colors with 12 cards each, and 14 multicolored). For ease of visualization, we won’t represent a card with edges, just its color. We see that doing just 1 riffle shuffle is not enough, it’s obvious we need to keep shuffling. Remember, each stripe of color represents 1 card.

However, let’s draw attention to a simple detail. When you do a riffle shuffle you have two equal halves. Even so, there is something left in the air: should the top card from the left or right half, stay on top?

The difference is subtle, but it’s there. So if we do multiple riffles shuffles, we won’t be keeping tabs on that. This will add some variability to multiple uses of the same method to shuffle. See Appendix A1 for more details.

Then, multiple uses of the same method means starting with a sorted deck, applying the method (1 riffle shuffle, 7 riffle shuffles, etc.) and getting a shuffled deck as a result.

So, how many riffle shuffles should we do to get a well shuffled deck? Research tells us that for an English deck of cards (52 cards, 4 kinds) 7 riffle shuffles is enough to get a randomized deck. That sounds good, let’s apply it here.

If we compare these different attempts at shuffling a sorted deck first we see that they’re different from each other. That’s consistent with the variability we already mentioned (left-right). However, some of these shuffled decks don’t look right. What do you see? I can see some patterns in #1 (blue-green-yellow-orange-red) and #3 (green-black-orange).

We see patterns, but we can't say for sure if the method is good or bad. You have probably encountered this. You have a friend that is great at shuffling, they deal everyone’s hand, and someone gets lucky. Your friend thinks ‘I slipped up and I didn’t shuffle that well’, but maybe that’s not what happened. Maybe cards lined up just right to create a pattern.

Takeaway 1. Any method of shuffling could encounter some patterns. That doesn’t make it a bad method necessarily.

Other methods

Then, how do we know if the method we’re using is good enough? What does good enough mean in shuffling? We’ll answer these, but first let’s look at the other method we’re testing: pile shuffling.

The image below shows applying different method to a sorted deck. We can see that a single riffle shuffle is not enough, and neither is doing 9 pile shuffling. These methods just move cards around.

As a reference, a random shuffle was added (see Appendix A4 for more on this), where basically we asked the computer to do the best random shuffle it could provide.

Lastly, if we compare multiples uses of the riffle method with multiples uses of the piles method, they don’t look that different. You could argue that some patterns are present, but we can’t say for sure that this will happen all the time.

How can we decide which method is better? Let’s establish three important observations:

Patterns. We are seeing patterns which are sequences of colors that appears more than once. For instance, in ‘pile multiple’ the sequence red-yellow-green-blue appears three times.
Initial deck. Up until now, we’ve used our shuffling methods to sorted decks. When in reality, we will be shuffling used decks or decks at end-of-the-game. We need to improve on that before reaching any conclusion.
Repetition. Let’s say you have a 100 used decks. For each one, you apply the same method: 7 riffle shuffles. What do you expect? The resulting 100 shuffled decks should not be all the same! We need to analyze if a method is good enough over a large number of scenarios.

Takeaway 2. To test if a method is good enough, we need to check for patterns over several uses of the method to account for possible variability.

Intermission: Constructing a plausible deck at end-of-the-game

When you finish playing you don’t have either an ordered deck, nor a random one. In the case of the English deck, you will probably have pairs, straights or any other. In the case of Ticket to Ride, you will have a semi-structured deck:

Part of the desk will be totally random, if you started the game with a well shuffled deck.
The rest will be in chunks of colored cards.
The proportion of these two parts is unknown.

So, at the end of the game, the deck will look like the image below. We have a sorted deck as a reference, and also a totally random one. The two 'ticket decks' (end-of-the-game decks) created have a mix of these two types of behavior. On the next sections, we will only apply methods only to used decks. See Appendix A3, for more details on the game.

Measuring randomness

Now, we set out to define a simple measure of randomness based on the concepts already mentioned before such as patterns and repeatability. We will do a simple count method, such as the image below for a test small deck. Some sequences (2 colors) are not present, and some are twice. If I shuffle again, will the count table stay the same?

The general procedure to be used is shown in the diagram below, will be done thousands of times to get a good sample of the variability shuffled decks could have.

It’s important to note that the initial used deck will be randomized each time we apply any method, so that we will always get a new result. We can repeat this for other methods or different sizes of patterns. However, we will limit ourselves to riffle/pile shuffling and only to 2-card patterns. See Appendix A2 for an explanation on how that restriction doesn't affect our method.

It’s important to note, that when comparing two methods, we will apply those methods to the same used deck, before going to the next used deck. This, so that we're fair to each method.

Finally getting some results

We will divide this major section in three parts, so that we can follow a logical train of thought and reach our grand finale.

Doing just a few shuffles.
Doing several shuffles.
Understanding how the number of shuffles influences the deck.

R1. Doing two riffles, or just two piles

We can see at first glance that both methods have some differences, and we will go over each one. Note that we have 81 possible patterns (9 colors by 9 colors) such as: green-red, red-red, yellow-multicolor, etc.

Pile shuffling has some peaks at same color patterns (such as red-red or green-green). This is expected: we started with a deck that already has some same color cards next to each other and the method used was not enough to set them apart.
Both methods have a higher average at the beginning. This is related to multicolor cards and it’s because there are more of them (14) vs any other color (any other color has 12). So, it’s more likely that one of those patterns emerges. These patterns are of the type (multicolor)-(any color). And why at the beginning? Because I ordered the patterns that way when listing them.
Riffle shuffling has some peaks, right before the valleys. This is due to patterns of the type (any color)-(multicolor). Similar to the previous item.
Riffle shuffling has some valleys, at the same place that pile shuffling has peaks. This happens because for a well shuffled deck same color patterns are less likely. Let’s do an example: you have 2 orange cards and 2 green ones. These are all the possibilities: orange-orange (two times), orange—green(four times), green-orange (four times) and green-green (two times). See the image below.

Takeaway 3. With just a few shuffles most patterns appear with almost the same frequency in the deck over a large number of games.

R2. Doing 7 riffles, or 9 piles

This comparison will raise some important questions which will then allow us to reach the end result (R3) which we set out to answer.

We can see in the image below that the average ocurrence of each pattern is almost identical for both methods. We can still see some differences, but they’re due to two facts we already mentioned.

Same color patterns are less likely that mixed color patterns
Multicolored cards are higher in number that any other color

These facts will stay the same. Even a third method was added: ask the computer to randomize the deck as best it can and we get the same result. So if we look at the average ocurrence of any pattern we know that 7 riffle shuffles or 9 piles are enough to have a random ocurrence of any pattern for a large number of shuffled decks.

So, what do we do now? Surely there must be differences between riffle and pile shuffling. That leads us to include another statistical definition here: the standard deviation. I want to avoid a formal definition, so we’ll give an example as shown in the table. What method would you prefer?

We can see that even though both methods have the same average ocurrence for green-red patterns, method B has a higher standard deviation. If we look closely at method B we see a large variation from shuffled deck to shuffled deck, and this is what the standard deviation tries to measure: how much do decks deviate from an expected value or from each other.

Note: if you have doubts on how the standard deviation works, head over to Appendix A5. But, if you got the idea and the concept, then I would advise just continuing with the main blog.

Even if the standard deviation were too large, patterns could occur on average with the same frequency as others. What we need is for shuffled decks to be as closely similar to each other as possible, so that our games are really consistent and feel random.

We'll calculate the standard deviation for each method, and for each possible pattern. Now we can see a difference between these two methods! The riffle shuffle has in general a lower deviation than pile shuffling. The key question you should ask yourself here is this one: what will happen if I do 4 riffle shuffles instead of 7, and 13 piles instead of 9. Will this result stay the same?

Takeaway 4. For a large number of games, doing 7 riffle shuffles versus 9 piles shuffling will yield the same average occurrence of patterns, but in general with a lower standard deviation.

R3. How do results change when we change the number of times we shuffle?

Should the shuffle be good enough then if we do 1, 4, 15 or 30 riffle shuffles? To do this, for each parameter we repeat the same procedure explained in the previous sections. And here, we have two options:

Look at the variation at how in general the standard deviation changes when changing either: number of shuffles, method or sequence.
Or, creating a general value of the standard deviation that it is representative of all patterns. This is the option we used and the general value is just the average value of the standard deviation. This allows us to differentiate the scenarios by method and number of shuffles, only.

What we get, is a really interesting result after all this work. We can see that the deviation for riffle shuffling falls rapidly after 3 shuffles and then slowly after 15. As for the pile method, the standard deviation also falls after 4, but much more slowly.

So, we can get a pretty well shuffled deck with 7 to 10 riffle shuffles, but we can't get a good one with pile shuffling. Even if we did 30 piles, which is impractical, we would get something just as good as doing 3 riffle shuffles.

Summary

I believe a summary is in order, given that the work done here involves several steps. Then, we’ll end with some brief thoughts.

The whole point was to explore how can I be sure that I’m shuffling correctly. If I’m bad at one method, what are the implications of using another one?
For multiple riffle shuffling, we saw that some patterns emerged. This made us question if that meant that the method was faulty or not.
We tried another method, pile shuffling and found that it can be close enough to riffle shuffling. At this point, we were only seeing qualitatively if there were differences or not. But that can only take us so far, we needed to include a quantitative method.
We started to count the numbers of patterns in each shuffled deck and we did it a large number of times. Then, we extracted two important statistics: the average of occurrences of each pattern and how much each shuffle deviated from an ideal quantity.
We found that doing riffle or pile shuffling a few times was enough to get a consistent average across all patterns, but that riffle shuffling allows us to obtain that with a lower deviation.

In practical terms this means that riffle shuffling gets with a few shuffles more consistent and more random-like games. This was the goal we set out to reach, to prove that a method really is consistent or good enough. We managed to do it by looking at the result of shuffled decks and understanding two simple statistics: average and standard deviation.

Now, if you think you have a better method than riffle shuffling, you have the means to test it!

Closing thoughts

I learned a lot creating this blog and now I remember back a few months ago while doing pile shuffling on Ticket to Ride decks at home, thinking if I wasn't hindering my game nights by doing this particular method. But after all this work, I can prove my past self wrong and show that, indeed, I need to properly riffle shuffle my decks.

If you’re interested in delving deeper into notation and more formal definitions, please refer to the References.

The Appendices at the end go over some mathematical relevant observations but are not needed to understand the main blog.

Disclaimer

Most images used were of my own making or they’re the result from my code on Python, tables in Excel or diagrams in PowerPoint.

'Ticket to Ride' is a trademarked boardgame that involves constructing railroad routes across various landscapes and cities. It's important to clarify that I have no connections with the creators or proprietors of the 'Ticket to Ride' brand/name/game. Any references made to the game in this blog are solely for educational purposes and do not imply any endorsement or affiliation with the brand or its developers. My perspectives and opinions remain impartial and are not representative of any official relationship with the brand.

The Ticket to Ride image used was created by user 'JIP' under the license CC BY-SA 4.0.

About the author

I’m a Mechanical Engineer with a passion for Maths and Education. I’ve worked as a tutor for Maths and Physics courses at a college level. I usually play videogames on my spare time. I have plans to do other blogs which could be interesting, also related to Maths and Physics, so stay tuned!

References

[1] https://en.wikipedia.org/wiki/Shuffling (good starting point, work your way up from here)

[2] Anke van Zuylen and Frans Schalekamp (2004). THE ACHILLES' HEEL OF THE GSR SHUFFLE: A NOTE ON NEW AGE SOLITAIRE. Probability in the Engineering and Informational Sciences, 18, pp 315-328

[3] Vennos A., Michaels A., (2021) Shannon Entropy Loss in Mixed-Radix Conversions. Entropy. 23(8):967.

[4] Diaconis, P., Graham R.L., Kantor, W.M. (1983) The Mathematics of Perfect Shuffles. Advances in Applied Mathematics, 4, pp 175-196.

Appendices

A1. What do you mean by variability in a multiple riffle shuffle?

At every riffle shuffle we have 2 possibilities (left or right). If we do a riffle shuffle N times. Then the number of possibilities is really 2^N, which is not really random. So, if we started with a sorted deck and did the method using 7 riffle shuffles (2^7=128) after 128 we will surely encounter a unique deck for a second time. For a bigger number such as 20 riffle shuffles (2^20 ~ 1 million), I would be surprised if you could distinguish the decks from each other, so they will truly feel random.

A2. What about long patterns? At some point we won’t be able to see them all in a finite deck

If we are talking of a sequence of a finite number of cards: red-blue, green-yellow-white, green-black-yellow, etc. At some point, if the sequence is long enough we’ll run into an apparent problem. Let’s call a n-sequence, a sequence of n cards which can be composed from k possible colors (9 in total in Ticket to Ride). Then, a n-sequence has k^n possible configurations. We’ll compare this number to N, which is the total number of cards.

Parameter	Quantity
Total cards N	110
Number of 1-sequences	9
Number of 2-sequences	9 x 9 = 81
Number of 3-sequences	9 x 9 x 9 = 729
…

If we try to count for each possible 3-sequence how many times it appears in a 110-card deck, even if by some miracle you can manage to order the cards and each 3-sequence appears only once, then it means that only you can see some of them (108 to be precise).

This is when repeatability comes into play. If my shuffle method is consistent over repetitions and creates a truly randomized deck, then we will encounter over time all possible sequences equally.

A3. How did you create a used deck at the end of the game?

If you know how to play Ticket to Ride, you are probably wondering how did I create a random deck, on demand, that truly captures the reality of the game. If you looked closely enough, you noticed that it has some errors. I’ll give some comments here on what was done, and what wasn’t covered:

From the Ticket to Ride standard map I counted the number of roads for any color. I summarized this in a distribution that selects a random road and tells me: length and color.
I did not include the fact the multicolored cards can only be played in singles or pairs, in groups with colored cards. I treated it just as another color.
I did not account for the fact that there are definite objectives (from one city to another). So the selection of roads does not necessarily reflect that. This is probably the most important note here, but also the hardest to work out. To truly create a real deck at the end of the game short of creating a bot that plays the game over and over, I would have to account for the network of roads and possible ways to win which I deemed not useful for the purposes of this blog.

In summary, the decks used for the explanation, are in the opinion of the author, sufficiently close to reality. Over a large number of end-of-the-game or used decks, the main difference will focus on how well a shuffle method works, even though some subtleties are lost. Working this out could be an entirely new blog from the ground up.

A4. What do you mean by random on your computer

The code used Python as programming language. When randomness was needed the random library was used. You can see more details on the library or some discussion about it.

But in practical terms, the method creates something that looks pretty random and you will have a hard time proving otherwise. So, this is random enough for our purposes because we only need something to compare our shuffles to.

A5. Average and standard deviation formal definitions, with examples. [Recommended Appendix]

We will briefly define this statistics in case you need a refresher or more guidance:

First, we'll define the N ocurrences of a pattern or variable with these symbols: x1, x2, x3, ..., xN.

Numerical Example. The ocurrences of a green-red pattern in used decks in Ticket to Ride: 3, 0, 1, 2, 4.

Average

Then, the average of N values is a representative value of a list of numbers and it's calculated by summing them up, and then dividing by the total.

Numerical Example. Based on the previous numerical example, the average would be:

Day-to-day example. You want a general idea of how much you spend on bills every month. You calculate the total, for each month, and then take an average to get an idea of how much could next month be.

Standard deviation

Finally, we define the standard deviation of N values as a way to measure the spread of the values.

Numerical example: Based on the previous numerical example, the standard deviation would be:

Day-to-day example. You already have the total bill amount for each month, but you see that they change a lot from month to month. You calculate the standard deviation, because you need a margin of error. For instance, if the average is $200 and the standard deviation $50, then next month could most likely cost from $150 to $250.

More comments on the standard deviation

To round up here, I'll make two comments on how it's useful and why it's defined like that.

First, for the numerical example we get 2 as average, and 1.41 as standard deviation. We could say that most values then, will be in the range of 0.59 (2 minus 1.41) and 3.41 (2 plus 1.41). That's a practical use of these definitions, because we're getting more information than just an average.

Second, you could be wondering why do we use such a complicated formula. We could've used the range (the maximum value minus the minimum value). Or we could've avoided using powers and roots. It seems unnecessarily complicated.

To answer this, as succintly as possible, I can first say, that there is no easy answer (sorry!). But to give you some food for thought (maybe you can get an answer) I'll show a method to convince yourself:

Come up with a new variable that captures the spread of the values. I'll use the range and compare myself to the standard deviation (STD).
Create two sets of numbers. For instance,

Set A. 1, 2, 3, 4 and 5. Range: 4. STD=1.41
Set B. 1, 3, 3, 3 and 5. Range: 4. STD=1.26

We get the same range, but the sets have some slights differences
Now, let's do a more extreme example, with two new sets.

Set A. 1, 2, 3, 4 and 5. Range: 4. STD=1.41
Set B. 1, 2, 3, 4 and 50. Range: 49. STD=19.02

Now we have kind of an opposite situation. The sets are similar, but the range is really different. What would've happened if we had used the standard definition?

For the first two sets, the STD value is similar, but it picked up some differences that the range couldn't.
For the second two sets, the STD has a large difference, but not as much as the range. So it 'understood' that most values are near lower values. Then, it also picked up more information.

On both cases, we could say that the standard deviation picked up more information that our variable.
Homework for you. Come up with new sets to test this and new possible definitions. Is the standard deviation still better?

I realize this is kind of a non-answer. But, unfortunately, the standard deviation is used because it has proven itself useful in a lot of ways in Maths. And, in some cases where more mathematical formalism is needed, it is the natural choice as no other definition would be possible or useful.

Note: You can check some other simple examples here and a more technical argument on the why we use the standard deviation here.

Comments

AnonymousAugust 21, 2023 at 7:14 AM
Something that occurred to me while reading - as a general method, perhaps you could make a matrix for each position in the original deck (just imagining the cards as labelled 1-52 etc), ending up at given position in the final deck. I suppose an ideally random shuffle would have a uniform value across the entire matrix, with biased shuffles having peaks in the response. However, it may well be a less peaky response might have a tendency to create more patterns (for whatever reason) - which is after all what starts the complaints around the table haha.
AnonymousOctober 9, 2023 at 8:40 AM
Could you explain the process of pile shuffling as I've never heard of it. Do you take a small random number of cards and try to randomly put it on different piles? Of course I could just google it but maybe a bit more extensive of an explanation is a good idea.

Search This Blog

Math Blog

Who shuffled these? A visual and mathematical introduction to shuffling cards