Abstract
Recent trends in mathematics education emphasize discovery learning over drill. This has proven to be a bad idea in some cases, for the simple reason that practice is required to learn basic arithmetic skills. Drills in arithmetic skills can be made interesting through gamification. This study proposes a family of puzzles that gamify arithmetic practice. The puzzles are designed with an evolutionary algorithm forming an instance of automatic content generation. Two methods of evolutionary puzzle design are presented and discussed. The first method used transformed the problem into an almost trivial optimization. The second algorithm was designed to avoid the flaws of the first and produced a huge variety of puzzles. A hardness measure, based on the difficulty experienced by the evolutionary puzzle generator, is employed. The hardness measure is tested on a large collection of puzzles produced with the evolutionary automatic content generation system. An initial assumption, that all the pieces in the puzzle must be used to achieve a maximum score, was shown to be incorrect in puzzles located via automatic search. Two classes of puzzle are defined: those where the optimal solution uses all pieces and those where the optimal solution fails to use at least one piece. The latter sort of puzzle were found to be far more common in the search space.
Introduction
There has been a sharp decline of scores for basic math skills on standardized international tests by Canadian students (Richards, 2017). The puzzles given in this study are intended to provide a gamified version of arithmetic drill. Novel interventions in techniques of teaching basic math skills are desirable. In Weatherby et al. (2018) the authors perform a review of math education practices that demonstrate that recent attempts to improve mathematics education in Canada have had little or no effect. Additional information on recent policy interventions appear in Mou and Atkinson (2020) and also suggest that additional approaches should be added to the mix. This study reports on part of a collection of research initiated at the request of a small group of parents and high school teachers to provide more interesting arithmetic drill.
Gamification is the use of game elements or a game-like structure in other contexts. In this material presented in this study it is intended for use in curriculum material development and learning. An introduction to the applications of gamification in learning and education appears in Kim et al. (2018). This study finds that gamification increases student engagement with material being taught. A literature review of gamification in education appears in Majuri et al. (2018) while on overview of the techniques of gamification in education is available in Kim et al. (2018).
This study defines an open-ended family of arithmetic puzzles and presents evolutionary algorithms for discovery of interesting instances of the puzzle. An instance of the puzzle is shown in Fig. 1. The numbers shown are presented to the player, together with a collection of playing pieces called polyominos. The player’s task is to place as many of the polyominos as they wish on the board in an optimal scoring configuration. A design constraint for the puzzle is that it must always be possible to exactly cover the board with the polyominos supplied.

An example of the polyomino arithmetic puzzle with the playing area filled with polyominos. This puzzle scores 112 points, see the text for details.
Scoring is performed by looking at the number appearing under each polyomino. The score for a polyomino is the larger of the sum or product of the numbers it covers. For the example board, most of the polyominos score best by multiplying the numbers, but some score better by adding. The blocks of numbers are 1,1,4,3,1; 2,1,1,1; 2,3,4,1; 4,1,4,1; 1,2,1,4; 3,1,2,4; 1,1,2,2; 3,1,2,2; 1,1,3. Their scores are 12m, 5a, 24m, 16m, 8e, 24m, 6a, 12m, and 5a, where a, e, and m denote add, multiply, or either. The total score for the example board is
Scoring a board provides practice with addition, multiplication, and strategy in the form of comparison when determining which of addition or multiplication yields the better score. The puzzle consists in placing the polyominos, which also engages spatial perception and planning skills. The goal of getting the highest possible score reduces the tedium of arithmetic practice and the problem of placing the polyominos to maximize the score provides additional entertainment. A discovery moment for students also arises in many of the puzzles because the optimum score is obtained by failing to place a one or more of the polyominos.
In the original conception of Polyomino Math used only multiplication in scoring. Changing the scoring rule to the best of multiplying or adding increases the difficulty of the procedural content generation problem and increases the scope of skills practice afforded by the puzzle. Polyomino Math Puzzles represents an open-ended collection of puzzles that can be made harder by increasing the size of the puzzle board, by carefully choosing the polyominos used, the numbers used, and the placement of those numbers. This multi-factorial design problem makes the design of instances of the puzzle a natural target for computational intelligence for procedural content generation.
The remainder of this study is structured as follows. Section 2 reviews past work and defines terms. Section 3 gives the design of experiments including specification of the evolutionary algorithms used. Section 4 gives and discusses results while Section 5 draws conclusions and outlines next steps.
Background
Puzzle games like Minesweeper™ or Same Game have the property that there are a huge number of different instances of the game. A random game generator can create easy games, hard games, games that are impossible, or games which turn on purely random choices. Deductive reasoning in minesweeper sometimes reveals that there are two possible moves, one of which must result in death, but yielding no obvious way to choose between them. This sort of variation in game difficulty can be managed with procedural content generation, using algorithms to steer the difficulty of an instance of a game.
Procedural content generation (Hendrikx et al., 2013) is the use of procedural methods to generate content for games (Ashlock et al., 2011b), the world wide web (Oliver et al., 2002), terrain maps (Doran and Parberry, 2010), or any of a number of other application domains (Hingston et al., 2008). The technique used in this study is an example of search based procedural content generation. Search based PCG (Togelius et al., 2010b) uses search methods rather than composing algorithms that generate acceptable content in a single pass. Both sorts of content generation can suffer materially from scaling problems which can be addressed by problem decomposition with an off-line and an on-line phase to the software (Ashlock and McGuinness, 2011). In Sorenson and Pasquier (2010), levels for 2D sidescroller and top-down 2D adventure games are automatically generated using a two population feasible/infeasible evolutionary algorithm.
In Togelius et al. (2010a) multiobjective optimization is applied to the task of search-based procedural content generation for real time strategy maps. In Ashlock et al. (2011a) level maps for dungeons were generated with checkpoint based fitness functions. This work was extended in Ashlock et al. (2011b) by having multiple types of walls that defined multiple mazes that co-existed in a single level map. Related work includes generating distinct co-existing maps with designed tactical properties (Ashlock et al., 2011b). Another representation used to generate level maps is the evolution of fashion based cellular automata which are studied in Johnson et al. (2010); Ashlock and Bickley (2016). This cellular automata based method has remarkable scalability and reusability properties.
The use of procedural content generation to create educational materials is a relatively new application. In Smith and Hartveld (2014) the authors make the case that digital content generation is a good way to generate a sufficiently large collection of examples that students are more likely, through exposure to the rich collection of examples, to inductively learn the underlying principles. This study joins this effort, using PCG to create puzzles that encourage the practice of basic arithmetic skills.
Terminology and properties
A polyomino is a connected shape made of squares of equal size that are joined along full faces. The connected regions in Fig. 1 serve as examples of polyominos. The size of a polyomino is the number of squares making it up. Polyominos are referred to by their size. A single square is a 1-omino. Two adjacent squares are a 2-omino or domino. A polyomino with n squares is called an n-omino.
The first PCG system for Polyomino Math operated on a fixed arrangement of polyominos, supplied as an input to the algorithm, and evolved the positioning on the board of a list of numbers supplied by the user. The goal was to find the maximum score for those numbers on a board composed of the fixed arrangement of polyominos. The use of a fixed arrangement of polyominos was motivated by the following lemma.
So long as a set of polyominos covers a board, the maximum score, resulting from rearrangement of a fixed collection of numbers, is independent of the arrangement of the polyominos.
Assume that we have located an optimal position of the numbers to maximize the score for a placement of polyominos on the board. If we glue the numbers to the polyominos and place them differently on the board, so long as all are placed, the score remains the same. Since this is true of the score arising from any placement of the numbers, this score remains optimal for the new placement of the polyominos. □
Examples of a layout of numbers that forms a Type II board appears in Fig. 2. The maximum score on this board with the given set of polyominos is 416, but this arrangement does not use all the polyominos. Several local optima scoring 403, 404, 405, and 410 are also shown and also do not use all the polyominos.

Results for one set of polyominos on one board of numbers. This example was chosen because it is a type II board, one on which the maximum score never uses all the polyominos, even though the polyominos can cover the board exactly. The set of polyominos used are those in the first board with a single
An additional issue that arises from attempting to fit numbers to a fixed array of polyominos is that the number rearranging problem turned out to be a very easy optimization problem, solvable with a greedy algorithm; this is discussed in Section 3. This led the investigation to a more useful version of the PCG evolutionary algorithm, which places polyominos on a fixed field of numbers, trying to maximize the score, but compares games between distinct runs on the basis of the time required for the evolutionary algorithm to discover its best result. The evolutionary algorithm’s time-to-solution is a surrogate difficulty measure for the puzzles being evolved.
The evolutionary algorithm
Two evolutionary algorithms were designed for this study. The first was an ordered gene algorithm that shuffled a set of numbers specified by the user to be used with a fixed array of polyominos. The algorithm functioned well but was found to be trivial, leading to the discovery of Lemma 1, and motivating the decision to abandon this algorithm for generating polyomino math puzzles. The algorithm gathered the largest numbers under the largest polyominos – clearly a correct strategy for maximizing score – which means that the evolutionary algorithm was shadowing a fairly obvious greedy algorithm. The greedy algorithm sorts the numbers in descending order, sort the polyominos by size, and then place the sorted numbers under the polyominos in descending order of size of both numbers and polyominos. This is a good example of a situation in which an evolutionary algorithm generated evidence that it was not needed.
The revised form of the algorithm worked with a fixed field of numbers on the rectangular board, which can be chosen by the user to avoid too many obvious placements of polyominos on blocks of large numbers. The algorithm then placed the polyominos on the grid of numbers, maximizing the score. The input to the algorithm is a list of polyominos, with repetition allowed, and the fixed number grid. Earlier work (Ashlock and Taylor, 2016) had shown how to generate large numbers of polyomino puzzles, that it to say sets of polyominos that fill a specified rectangular domain. The term “polyomino” is a bit ambiguous until we deal with the issue of symmetries, as shown in Fig. 3.
The representation used by the evolutionary algorithm is a list of

All eight symmetries of an example 4-omino.
Once the gene has been decoded to determine what order the polyominos will attempt to be placed in and what their orientation will be, each polyomino is placed in the manner specified above. A polyomino that is not placed is discarded. Given a placement of polyominos on the board, it is always possible to reverse this algorithm to find a gene that would yield that placement. If the genetically specified first placements of the polyominos is the place they appear in a target arrangement, then the algorithm will reconstruct the target arrangement. This means that the representation is complete: it can encode all possible solutions.
The evolutionary algorithm used to evolve the real number genes is steady state (Syswerda, 1991). It updates the population using size seven single tournament selection. Seven population members are selected uniformly at random, and the two best reproduce and replace the two worst. Reproduction uses two point crossover of the array of numbers and mutates from 1–7 loci in the list of numbers, with the number of modified loci selected uniformly at random. A loci is modified by generating a new value for that loci in the range
An experiment consists of an number of independent runs of the evolutionary algorithm, specified in the descriptions of individual experiments. Each run uses 400,000 instances of tournament selection, recording the highest scoring configuration of polyominos as its result.
The significance of varying the placement of polyominos was of primary interest during the evolution. Accordingly it was important to test with many distinct sets of polyominos. Using a brute force algorithm, a listing of the 827 unique sets of 3 and 4-ominos that can tile a 5 by 6 grid was generated. Keeping the size of the polyominos small and consistent (in this case no larger than 4) stops the multiplication from forcing obvious placements for the largest polyomino(s).
An array of scores containing mostly 1s, 2s, and 3s with some 4s (8 1s, 8 2s, 8 3s, and 6 4s) was shuffled to produce 100 5 by 6 boards. Each of the 827 distinct polyomino sets was evolved against all 100 of these randomly created numerical arrays for a total of 82,700 distinct polyomino set/board pairings. For each pairing, the evolutionary algorithm performed 20 runs.
Two searches of the space were performed using the above pairings. The initial search of the space was performed with a population size of 4000. The population size was reduced to 20 for the second search.

An 8-omino created by fuseing two 4-ominos used in the experiments.
To get a better feel for the fitness gradient during evolution, sampled polyomino sets were also evolved against 10 by 10 boards. A similarly proportioned array of scores (24/25 1s, 29/30 2s, 29/30 3s and 14/15 4s) was shuffled to produce 6 random 10 by 10 boards.
Composing 4 distinct random sets of 3 and 4-ominos that can fill a 5 by 5 grid yields a polyomino set that is guaranteed to fill a 10 by 10 board in multiple distinct configurations. 200 such randomly generated polyomino sets was evolved against all 50 randomly created boards for a total of 1200 distinct polyomino set/board pairings. For each pairing, the evolutionary algorithm performed 10 runs. The population size was set to 1000 for these runs.
Larger polyominos
A final round of testing was performed to measure the effects of introducing a single larger polyomino. As before, 4 distinct random sets of 3 and 4-ominos that each can fill a 5 by 5 grid were composed into a larger set. Given a valid layout of each larger set in a 10 by 10 grid, 2 adjacent 4-ominos were randomly selected to be fused into an 8-omino. An exemplary result is shown in Fig. 4. This method of introducing a single larger polyomino to each set guaranteed that each resulting set was able to completely cover the 10 by 10 grid. 50 distinct polyomino sets were generated using this approach, and each was tested against the above 6 randomly created boards for a total of 300 distinct polyomino set/board pairings. For each pairing, the evolutionary algorithm performed 10 runs.

The graph displays the evolution of the maximum fitness in a population over time. The area under the graph is the raw material, before normalization, of the hardness measure. Normalization consists of passing to the average maximum fitness value and dividing by the maximum fitness found in any run.
Intuitively, these larger polyominos were predicted to have a significant impact on the optimal solution. They have a large scoring potential, but also serve to inflexibly partition the board, which limits the placement of the other polyominos. The population size was set to 1000 for these runs.
For the smaller

An example of a board with a high hardness score. This board is type II and the configuration of polyominos show in one that maximizes fitness at 404.
It is important to note that the units of the hardness measure are normalized sum of gain of fitness per unit time, where the fitness has an arbitrary scale arising from the size of board under examination. This means that these hardness numbers are relative and also, as computed in this study, are samples from a distribution of hardness estimates. For that reason we will display hardness as box plots of thirty samples of hardness obtained from distinct runs of the evolutionary algorithm. Another consequence of this setup is that the hardness numbers are only useful for relative comparison of the hardness of different instances of the polyomino math puzzle with the same size of boards.

Shown are the distribution of hardness estimates for the ten hardest boards for the
During the initial search of the
Suspecting that the large population size was causing a near ideal solution to often be present in the starting population, the population size was reduced to 20 and the search was reperformed, in order to give the hardness measure more traction. The new results contained more runs that started significantly further from where they plateaued, and many went through several increases in maximum fitness before eventually plateauing. This verifies that the smaller population grants more traction in sorting instances of the puzzle by difficulty. To give an idea of the extent of the contrast, 57% of the 1,654,000 runs in this experiment saw a 50%+ increase in the maximum fitness of the starting population.
Figure 7 shows, for the four types of experiments performed, variation in hardness for the six hardest puzzles in four groups of experiments. There was relatively little variation in hardness for low hardness puzzles, motivating the display of hardness variation of only the hardest puzzles, which did exhibit significant variation. Each panel shows the variation in hardness involving one set of polyominos. Overall the hardness results demonstrate that we can find easy and difficult puzzles with the proposed harness measure.
Population size did not have an effect on the maximum fitness found for each polyomino set/board pairing. The overwhelming majority of the time, all 20 runs found the same maximum fitness for a given polyomino set/board pairing. This lends confidence, if not certainty, to the belief that evolution is in fact finding the global optima for each puzzle. In the second round of testing, there were several polyomino set/board pairings that consistently demonstrated significant improvement during evolution across the 20 runs. This implies that these boards would be most challenging for a human to solve.
Increasing the number of distinct polyominos was shown to correlate with an increase in average hardness. Figure 8 shows that in the second test, average hardness monotonically increases with the number of distinct polyominos. This assessment was performed by taking the exhaustive enumeration of all 847 sets of 3- and 4-ominos that can fill a
Increasing the board size to 10 by 10 had the expected effect of making the problem more challenging for the evolutionary algorithm. It also seems likely that this board size would be very hard for students. Maximum fitness exhibited far more improvements during evolution for these runs, and even with only 10 runs performed for each polyomino set/board pairing, it was rare for two runs to find the exact same maximum fitness. It was also very unusual for the range of maximum fitness across the 10 runs to vary wildly; in all but 1 of the 1200 polyomino set/board pairings the lowest maximum fitness found in any of the 10 runs was at least 80% of the maximum fitness found in the 10 runs.

Average hardness across all boards as a function of the number of distinct polyominos in a polyomino set.
Maximum fitness changed during evolution far fewer times when large partitioning polyominos were added into the mix. This is unsurprising, as a good placement of the large polyomino significantly increases the score, which penalizes many of the possible rearrangements of the polyominos. While there was still variation in the maximum fitness found across the 10 runs for each polyomino set/board pairing containing a partitioning polyomino, the partitioning polyomino was always evolved to the exact same position on the board.
Examine the four solutions, using the same polyominos set with a fused polyomino, that appear in Fig. 9. The fused polyomino appears in the same position every time, garnering a score for itself of 48 – far from the largest score. This is because the fused polynomial is not wasted on large scores that come from relatively compact collections of high scoring numbers; those can be captured by smaller polyominos. The fused polyomino gathers in a relatively high score that only it can capture. This is, in part, because there is no insane block of high scoring numbers that the fused polyomino perfectly covers.
While Type I boards were relatively rare outcomes of the procedural content generator, an example of one is shown in Fig. 10. This solution, with a score of 471, appeared in all thirty of the evolutionary runs for the pieces and array of numbers shown.

Four evolved puzzles using a fused polyomino, built from two 4-ominos. Note the fused polyomino appears in the same position in all four examples. This polyomino is near the center of the grid and is red in the top left example and cyan in the other three examples.

An example of an evolved type I board. The solution shown was found by 30 out of 30 runs of the evolutionary algorithm for the array of numbers and polyominos shown.
This study explored a design space for the polyomino arithmetic puzzle, learned something about the design space, and arrived at an acceptable but improvable design for a PCG system for the puzzles. This system was constructed around an evolutionary algorithm. A preliminary hardness measure on the puzzle, derived from the behavior of the evolutionary algorithm, was devised and tested, providing proof-of-concept and suggesting directions for better hardness measures, particularly measures not dominated by outliers.
The
The most surprising discovery in the course of this study was that Type II boards, those where leaving a polyomino out is required to achieve the best score, were far more common than Type I boards. In fact, for the
Non-rectangular boards
During the review process, the issue of non-rectangular boards was raised. The evolutionary algorithm decodes the chromosome into an order in which to attempt to place the polyominos and a orientation for each of the polyominos. That polyomino is then greedily placed where it fits. While the algorithm currently does this in a rectangular target space, using a non-rectangular space requires only a simple modification of the algorithm. Taking a rectangular bounding box for the desired board and then obstruct the portions of this board that are not to have numbers in them or polyominos placed on them yields the ability to search for non-rectangular designs. This is a modest revision of the algorithm.
Validation and a better hardness criterion
An excellent algorithm to provide both a better hardness test and a search algorithm that can validate the maximum score of relatively large boards is Monte-Carlo tree search (MCTS) (Browne et al., 2012). Solving polyomino math puzzles can easily be re-cast as a single player game, a natural domain for MCTS. MCTS can be run on a puzzle with differing time allowances which will create a graded sorting of harnesses by which allowance of time first permits an accurate solution. The true maximum scoring solution, located by pruned, exhaustive search, can be used to set the bar for MCTS hardness determination.
While MCTS, like the evolutionary algorithm used in this study, grant traction to some version of our hardness measure, neither is guaranteed to find the true maximum fitness and may potentially mistake a Type I puzzle for a Type II puzzle. Using a constraint satisfaction solver (Lecoutre, 2013) or Planning Domain Definition Language tools (Haslum et al., 2019) are approaches to exact solutions that would be useful to obtain exact solutions, something that would be needed in a published volume of such puzzles.
An initial study to solving the problem of covering a board with pieces using an MCTS solver appears as part of Ashlock and McGuinness (2016). Extending this work to a method of optimizing the score of a polyomino math puzzle appears straightforward. The moves of the one-player game consist of choosing a polyomino, an orientation, and an initial position as the starting point of a scan to determine if the piece fit.
Constructing and generalizing type I boards
Suppose that we have a board and a set of polyominos of size
The construction given has many potential degrees of freedom. They include the following:
Numbers assigned to a polyomino of a given size may be assigned to any polyomino of that size. Any time two different sets of numbers the same size are not identical, their assignments to polyominos in a covering of the board may be swapped. The numbers assigned to a particular polyomino may be put down in any order, so long as they remain under that poylomino. Using Lemma 1, each possible covering of the board with the polyominos is another version of the problem, representing an additional Type I board.
If all the numbers are the same then all these degrees of freedom vanish, but having a diversity of numbers is required in any case to make the puzzle more interesting than simply figuring out a placement of the polyominos that covers the board. In Ashlock and McGuinness (2016) it was shown that boards that can be covered by a set of polyominos usually have a large number of different ways to be covered by polyominos. For a board with an area of 36, hundreds of distinct solutions were located. This means that, depending on the diversity of the numbers used in the puzzle, there is the potential for a combinatorial explosion of different Type I puzzles that can be constructed, starting with an initial type I puzzle, by exploiting these degrees of freedom.
It is important to note that most type I polyomino math puzzles are not ones that arise from this construction. In most arrays of numbers, moving the numbers around changed the best score and has an excellent chance of turning a type I puzzle into a type II puzzle.
Deployment
The ability to generate thousands of puzzles with graded hardness is a pleasant accomplishment, but it is important to try to move the results to more places than a publication in an academic journal. One of the authors runs a popular blog entitled Occupy Math that will permit broad generalization of a collection of puzzles. A sheet of puzzles, with cutouts for the polyominos, will permit distribution and voluntary feedback on the effectiveness of the puzzle as a teaching tool.
