How many packs to complete the album? – Wagner Gonçalves Pinto

The World Cup recently ended. Along with the competition comes the sticker album. It is quite a big tradition, but I have never joined in. I got interested in the statistics behind it and asked myself how many stickers you must buy to fill the album completely.

My approach is a statistical simulation: packs are modeled one by one until the album is complete. The same procedure is repeated for a large number of runs to get an estimated distribution of the total number of packs (and stickers) needed to complete the album. First, I tested the convergence of the routine under two basic assumptions: the distribution of the stickers is uniform (that is, you have an equal chance of getting any sticker) and no sticker is repeated within a pack (this one is maintained for all the tests here). Second, I tested the advantage of buying the missing stickers (from 1 to 50). Finally, I evaluated two cases where the distribution is not uniform: for a selected nation, the stickers are more abundant (from +10% to +50%) or rarer (from -10% to -50%) than the others.

This analysis can be performed for any album; the only necessary variables are the number of stickers in the album and the number of stickers in a pack. For this case, the values for the Panini World Cup sticker book are used:

682 stickers in the album;
5 stickers per pack.

The possibility of buying missing stickers directly from Panini (maximum of 50) is also considered in this work.

Algorithm

I implemented the procedure in C. I do not recommend this if you just want the result; you will get there much faster with Python (there is a nice tutorial here) or R.

The procedure to complete the album is quite simple: you start with a boolean array full of false values, with as many elements as there are stickers in the album. Then, while the album is not complete, you fill a pack (select the indices of its stickers) and set the selected elements to true in the album's boolean array.
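As a rough illustration of the loop described above, here is a minimal sketch in C. The function name packsToCompleteAlbum and its parameters are hypothetical (they are not taken from the original implementation), and the draw is assumed to be uniform with no repeated stickers inside a pack.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

// Hypothetical helper: fills one album and returns the number of packs opened,
// or -1 if memory could not be allocated.
int packsToCompleteAlbum(int n_stickers, int pack_size){
    // pack_size > n_stickers would make a pack without repeats impossible
    assert(n_stickers > 0 && pack_size > 0 && pack_size <= n_stickers);

    bool *album = calloc((size_t)n_stickers, sizeof *album);  // all false
    int *pack = malloc((size_t)pack_size * sizeof *pack);
    if (album == NULL || pack == NULL){
        free(album);
        free(pack);
        return -1;
    }

    int collected = 0;  // unique stickers already in the album
    int packs = 0;      // packs opened so far

    while (collected < n_stickers){
        // fill one pack, drawing again whenever a sticker is repeated
        for (int i = 0; i < pack_size; i++){
            bool repeated;
            do {
                pack[i] = rand() % n_stickers;
                repeated = false;
                for (int j = 0; j < i; j++){
                    repeated = repeated || (pack[j] == pack[i]);
                }
            } while (repeated);
        }
        // mark the stickers of this pack as owned
        for (int i = 0; i < pack_size; i++){
            if (!album[pack[i]]){
                album[pack[i]] = true;
                collected++;
            }
        }
        packs++;
    }

    free(album);
    free(pack);
    return packs;
}
```

Calling packsToCompleteAlbum(682, 5) many times, after seeding the generator with srand(), and collecting the returned values gives the kind of distribution discussed in the rest of this post.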

To make sure there are no repeated stickers in the pack, a loop is implemented. The function for filling the packs is presented below, where fun_sticker is a pointer to a function that returns the selected sticker:

#include <stdbool.h>
#include <stddef.h>

void fillPackage(int *selected_stickers, const size_t size, int (*fun_sticker)(const void *), const void *input){
    size_t i, j;
    int sticker;
    // set to false to allow repeated stickers inside a pack
    const bool no_repeat = true;
    bool repeated;

    for (i = 0; i < size; i++){
        do {
            // selecting a new sticker
            sticker = (*fun_sticker)(input);
            // checking if the sticker is already in the package
            repeated = false;
            for (j = 0; j < i; j++){
                repeated = repeated || (sticker == selected_stickers[j]);
            }
        // draw again while the selected sticker is already in the package
        } while (no_repeat && repeated);
        selected_stickers[i] = sticker;
    }
}

Two functions are implemented for selecting the sticker to be added to the pack:

Uniform distribution

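For the uniform distribution, the function rand() is used to draw an index from 0 to 681. The argument is received as a const void pointer so that the function matches the fun_sticker parameter of fillPackage:

#include <stdlib.h>

int uniformDist(const void *input){
    // input points to the number of stickers in the album
    const int n = *(const int *)input;
    return rand() % n;
}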

The random number generator is seeded with the current time, in the main program:

// Setting random function seed - with time
srand(time(NULL));

Don’t forget to include the time library with #include <time.h>.
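One caveat, as a side note: rand() % n is only exactly uniform when RAND_MAX + 1 is a multiple of n. The C standard only guarantees that RAND_MAX is at least 32767; in that worst case, 32768 = 48 × 682 + 32, so the first 32 stickers would be drawn about 2% more often than the others. On platforms where RAND_MAX is much larger, the effect is negligible. A possible workaround is rejection sampling, sketched below; uniformUnbiased is a hypothetical name and is not part of the original code.

```c
#include <assert.h>
#include <stdlib.h>

// Hypothetical alternative to rand() % n: returns an integer in [0, n)
// where every value has exactly the same chance of being drawn
int uniformUnbiased(int n){
    assert(n > 0 && n <= RAND_MAX);
    // number of distinct values that rand() can return
    const unsigned int range = (unsigned int)RAND_MAX + 1u;
    // largest multiple of n that is not greater than range
    const unsigned int limit = range - (range % (unsigned int)n);
    unsigned int r;
    do {
        // discard the few values above the last complete block of n
        r = (unsigned int)rand();
    } while (r >= limit);
    return (int)(r % (unsigned int)n);
}
```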

Non-uniform distribution

Sampling from a discrete, non-uniform distribution is not such a direct task. I used Vose’s Alias Method, extremely well presented by Keith Schwarz (Darts, Dice, and Coins: Sampling from a Discrete Distribution); I really recommend you take a look. In short, it transforms a discrete probability distribution into a combination of a fair die and a biased coin, illustrated next (with the same values used by Mr Schwarz). The implementation is rather long, so the code is not presented in this post, but you can check it out here: alias_method.c

To test my implementation, 100,000 calls are performed for the sample distribution. The results and the Vose’s Alias Method settings (the probability and alias arrays) are presented in the following table:

variable	0	1	2	3	4	5	6
distribution	0.125	0.200	0.100	0.250	0.100	0.100	0.125
biased coin probability	0.875	0.975	0.700	1.000	0.700	0.700	0.875
alias	1	3	3	–	1	3	3
result	0.1250	0.2002	0.1003	0.2505	0.1019	0.0977	0.1243
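To show how the two arrays in the table are used, here is a minimal sketch of the sampling step only (not the table construction, and not the original alias_method.c). The names aliasSample and aliasTableProbability are hypothetical; the array values are copied from the table, with entry 3 pointing to itself since its alias is never used.

```c
#include <assert.h>
#include <stdlib.h>

#define N_VALUES 7

// Values copied from the table above
static const double coin_prob[N_VALUES] = {0.875, 0.975, 0.700, 1.000, 0.700, 0.700, 0.875};
// Entry 3 has no alias (its coin always wins), so it points to itself here
static const int alias[N_VALUES] = {1, 3, 3, 3, 1, 3, 3};

// Draws one value following the distribution encoded by the two arrays
int aliasSample(void){
    // fair die: pick one of the N_VALUES columns
    const int column = rand() % N_VALUES;
    // biased coin: uniform number in [0, 1) compared to the column probability
    const double coin = rand() / ((double)RAND_MAX + 1.0);
    return (coin < coin_prob[column]) ? column : alias[column];
}

// Rebuilds the probability of one value from the two arrays (sanity check)
double aliasTableProbability(int value){
    assert(value >= 0 && value < N_VALUES);
    // chance of landing on the value's own column and keeping it
    double p = coin_prob[value];
    // plus the chance of being reached as the alias of another column
    for (int column = 0; column < N_VALUES; column++){
        if (column != value && alias[column] == value){
            p += 1.0 - coin_prob[column];
        }
    }
    return p / N_VALUES;
}
```

The second function recomputes the "distribution" row from the other two rows, which is a quick way to check that a probability/alias pair is consistent before running any simulation.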

Album #1: uniform distribution

The simulation is performed for the complete album, with a uniform distribution of the stickers. Convergence is evaluated to find the minimum number of runs needed for a significant result. The following graphs present the obtained probability distribution of the number of packs and the global statistics (mean, median and mode).

Graphs with the probability distribution of the number of packages to fill the album for several numbers of runs (10, 100, 1000, 10000, 100000 and 1000000) and a graph of the global statistics (mean, median and mode) as a function of the number of runs, in log scale.

Taking into account the most extensive simulation (1 million runs), the average number of packs needed to complete the album is 969.33, or about 4,847 stickers. The answer is quite close to the values presented in the press:

album stickers	average number of packages	source
682	969	current work
680*	961	Deutsche Welle
682	967	The Guardian
669**	941	La Nación (in Spanish)

*album for the 2016 UEFA European Championships
**the Argentinian album variation, like many others, only has 670 stickers (more info here)

In the final simulation, the minimum number of packs was 519 and the maximum was 2954. In the end, even if you are lucky, it’s going to be quite an expensive album (£773 for the average number of packs, according to The Guardian article). Of course, this result doesn’t account for filling multiple albums (as a group of friends would) and, most importantly, swapping stickers. If you want a cheaper album, you should try both.

At about 10 thousand runs, both the distribution and the global statistics have converged. So, for the following cases, 10,000 runs are evaluated.

Album #2: buying missing stickers

If you are having trouble with that one specific sticker, Panini offers the opportunity to buy it directly from their website. At least in France, you can buy a maximum of 50 stickers. The same simulation, with uniform distribution, is performed, but now it is stopped before all 682 unique stickers are collected, considering that the missing stickers are bought from Panini. As before, the probability distribution and the global statistics are plotted.

Graphs with the probability distribution of the number of packages to fill the album for several number of missing stickers (0, 10, 25 and 50) and a graph of the global statistics (mean, median and mode) as a function of the number of missing stickers (from 0 to 50).

As you can see, on average and globally, there is an important gain in buying missing stickers (which is the same as filling a smaller album). Compared to the initial case, the average number of packs is reduced from 969.33 to 355.68 when the last 50 stickers are bought, a 63% reduction. Naturally, a sticker bought directly is more expensive than one from a random pack, so it is up to you to see if you can profit from this possibility or not. I also invite you to check the post of Laurie Belcher on the matter.

I attempted to fit a model for the evolution of the average number of packs as a function of the number of missing stickers, but as you can imagine, the curves are non-linear and a good fit is not so easy to find, so I leave that job to the readers.
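One candidate model comes from the coupon collector's problem: if stickers were drawn one at a time, uniformly and independently, the expected number of draws until only m stickers are missing from an album of n would be n (H_n - H_m), where H_k is the k-th harmonic number. The sketch below divides this by the pack size. It is only an approximation under those assumptions (it ignores the no-repeat rule inside a pack and the fact that packs are bought whole), and the function names are hypothetical.

```c
#include <assert.h>

// k-th harmonic number: H_k = 1 + 1/2 + ... + 1/k, with H_0 = 0
static double harmonic(int k){
    double h = 0.0;
    for (int i = 1; i <= k; i++){
        h += 1.0 / i;
    }
    return h;
}

// Approximate mean number of packs needed when the last n_bought stickers
// are purchased directly instead of being found in packs
double expectedPacksApprox(int n_stickers, int pack_size, int n_bought){
    assert(n_stickers > 0 && pack_size > 0);
    assert(n_bought >= 0 && n_bought <= n_stickers);
    // expected single draws: n * (H_n - H_m), then converted to packs
    return n_stickers * (harmonic(n_stickers) - harmonic(n_bought)) / pack_size;
}
```

For 682 stickers and packs of 5, this gives about 969 packs when no sticker is bought and about 355 packs when the last 50 are bought, close to the simulated averages of 969.33 and 355.68.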

Albums #3 and #4: non-uniform distributions

The point here was not to check whether the stickers are actually distributed uniformly, as I cannot check this by myself, but to test what would happen if they were not. As there is pretty much an infinity of ways to play with the distribution (random disturbance, probability laws, etc.), I followed a simple logic, based on a single, unchecked and probably false premise: in a given country (let’s say, Brazil), the public would be more interested in getting stickers of its national team (the Brazilian players in my example), and according to that logic, Panini’s production and distribution could be biased. So I simulated two options:

less frequent Brazilian players’ stickers, so everyone is buying more stickers (Album #3);
more frequent Brazilian players’ stickers, so everyone is happy (Album #4).

The discrete distribution is achieved by the method introduced previously, and an example of what it produces is presented (1 million calls, Brazil’s national team stickers – from 352 to 371 – are 50% more probable than the others):

Result of Vose's Alias Method for a discrete distribution of stickers, where the stickers of the Brazilian football team are 50% more probable than the others.
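As an illustration of how such a biased distribution could be built before handing it to the alias method, here is a minimal sketch. It assumes that the offset is applied as a relative weight (1 + offset for the selected team, 1 for all the others) followed by a normalization, and that the range 352 to 371 is inclusive; the original code may do this differently. biasedDistribution is a hypothetical name.

```c
#include <assert.h>

// Fills prob[0..n-1] with a distribution where the stickers from index
// first to index last (inclusive) are (1 + offset) times as likely as the rest.
// offset = 0.5 means 50% more probable, offset = -0.5 means 50% less probable.
void biasedDistribution(double *prob, int n, int first, int last, double offset){
    assert(n > 0 && first >= 0 && first <= last && last < n);
    assert(offset > -1.0);  // weights must stay positive

    // relative weights before normalization
    double total = 0.0;
    for (int i = 0; i < n; i++){
        prob[i] = (i >= first && i <= last) ? 1.0 + offset : 1.0;
        total += prob[i];
    }
    // normalization, so that the probabilities add up to 1
    for (int i = 0; i < n; i++){
        prob[i] /= total;
    }
}
```

With this convention, biasedDistribution(prob, 682, 352, 371, 0.5) would reproduce the +50% case, and an offset of -0.5 the rarest case of Album #3.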

The results are presented below for absolute probability offsets from 0 to 50%.

Graphs with the probability distribution of the number of packages to fill the album for several national stickers probability offsets (0, -10%, -20%, -30%, -40% and -50%) and a graph of the global statistics (mean, median and mode) as a function of the absolute variation of probability (from 0% to 50%).

Graphs with the probability distribution of the number of packages to fill the album for several national stickers probability offsets (0, +10%, +20%, +30%, +40% and +50%) and a graph of the global statistics (mean, median and mode) as a function of the absolute variation of probability (from 0% to 50%).

We can see that rare stickers can be quite a problem when completing the album: the mean increases from 969.33 (uniform) to 1104.18 (-50% probability) and the distribution moves to the right, which means that it is globally harder to complete the album. For the second test, with abundant national stickers, there is almost no influence on the final result (the last mean is 993.93 packs for +50%).

What we can conclude is that, for the performed simulations, an imbalance in the sticker distribution must be large to have a noticeable effect on the average. The average is not so representative, in the sense that you only need one sticker to be infinitely hard to get and you’ll never complete the album. There are many parameters that may produce those oscillations, and of course, when you are completing your album, you do not really care about what you can get “on average”.

Conclusions

In summary, you will need, on average, about 970 packs to complete your album. Buying missing stickers is a great option, for both your mental health and your wallet, since you can finish sooner. A localized disturbance in the sticker probability distribution can change the results, but it must be quite large (more than 30%).

My analysis suffers from a big limitation: it considers a lone collector on a desert island. One interesting direction is modelling the swapping of stickers between collectors (there is a nice post by Freddy Boulton on that matter) and filling more than one album at a time. Algorithm-wise, the natural path would be parallelization, which is really straightforward since the tasks of filling multiple albums are completely independent.
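Since each album is independent, one possible way to parallelize the runs is an OpenMP parallel loop, sketched below. This is only a sketch and not the original implementation: it needs OpenMP support at compile time (for example -fopenmp with GCC; without it the loop simply runs serially), and it replaces rand() with a small xorshift generator seeded per run, because rand() relies on a hidden state shared by all threads. All names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ALBUM_SIZE 682  // stickers in the album
#define PACK_SIZE 5     // stickers per pack

// Small xorshift generator with explicit state, so each run owns its stream
static uint32_t nextRandom(uint32_t *state){
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

// Fills one album with uniform packs without repeats; returns the packs opened
static int fillOneAlbum(uint32_t seed){
    uint32_t state = seed;  // must be nonzero for xorshift
    bool album[ALBUM_SIZE] = {false};
    int collected = 0, packs = 0;

    while (collected < ALBUM_SIZE){
        int pack[PACK_SIZE];
        for (int i = 0; i < PACK_SIZE; i++){
            bool repeated;
            do {
                pack[i] = (int)(nextRandom(&state) % ALBUM_SIZE);
                repeated = false;
                for (int j = 0; j < i; j++){
                    repeated = repeated || (pack[j] == pack[i]);
                }
            } while (repeated);
        }
        for (int i = 0; i < PACK_SIZE; i++){
            if (!album[pack[i]]){
                album[pack[i]] = true;
                collected++;
            }
        }
        packs++;
    }
    return packs;
}

// Mean number of packs over n_runs independent albums
double meanPacksParallel(int n_runs){
    assert(n_runs > 0);
    long long total = 0;
#ifdef _OPENMP
    #pragma omp parallel for reduction(+:total)
#endif
    for (int run = 0; run < n_runs; run++){
        // the odd multiplier keeps the seed nonzero and spreads consecutive runs
        total += fillOneAlbum(((uint32_t)run + 1u) * 2654435761u);
    }
    return (double)total / n_runs;
}
```

Seeding by run index makes the result reproducible regardless of how the runs are split among threads, at the cost of using a simpler generator than the one in the rest of the post.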

If you are more into the math, I invite you to look at this article: Paninimania: sticker rarity and cost-effective strategy, by Sylvain Sardy and Yvan Velenik, and to check a global summary at the Wikipedia article on The Coupon collector’s problem, a classical subject in probability.

If you are interested, the complete implementation is available here (Attention: if you run the code without any modification, it will take about 5 minutes to run and 95 files (240 Mb) are going to be generated! All the results presented here – and some others – are going to be recalculated). The graphs are produced in Python, using matplotlib.

Update March 2023: a slightly modified version of the code was uploaded to my GitHub: sticker-packs-estimation. The original version can still be found at the link presented just before.

Share, try, modify, and contact me if you have any suggestions or corrections!

—
Recapitulation of all the articles linked on this post: