As computer programmers, we often write code to do a task and then almost immediately begin to wonder, “Can this be done more efficiently, what if I try x, will that be a better solution?” Which of course is often a necessary step after just getting something to work. Make it work, then make it work better. This process is now common and often referred to as iterative programming. Now, as a programmer manually iterates over versions of code the solution evolves towards an optimal implementation. Typically, versions of software number on the order of 10s of iterations for a medium sized project. What if you could iterate on the order of 100s, 1000s, or 100000s of versions? What if you could simulate the evolution of the program and automatically pick the best candidate solutions? This is essentially what genetic programming is. And if you are using python, then there are some great genetic programming libraries available to do the heavy lifting of generating, mutating, and testing candidate solutions to iterate on and converge on an optimal solution. For this example, we will use DEAP.
So, “how do you simulate a program?” The obvious first step is to define what the program needs to do. Given a set of specific inputs, what is the output and/or given a certain state what is the next state. If you’re a programmer, you are probably already doing something close to this with Test Driven Development (TDD) or Behaviour Driven Development (BDD) methodologies. So not much changes, for the initial work of a programmer, regarding the identification of the solution domain. For our example, we'll pick something relatively simple and straightforward but not always obvious, like my children's homework. They are often challenged with pattern recognition problems, like given three numbers what is the next number in the series, or now that some are a little older, what is the function that describes the series. Let's have the computer write a program to solve this for us.
As with any python program let's import the modules we'll use first:
Ok, so here is our little homework problem: given an input x, determine a function to calculate y:
And in code:
Next we need to setup the primitive functions that can be used to build a solution. Here is where knowledge of the problem domain can help to narrow the scope of all possible solutions and therefore arrive at a solution faster. So, given that our homework problem domain is generally limited to arithmetic solutions with small integers we will add primitives representing these possible solutions:
Line 1 creates the signature of the function we will derive. Recall our problem from the table above, we are generating f(x)=y. So we have 1 input parameter, the name of the function will be "main", and the parameter name is "x". Lines 2,4,5 are adding primitive python functions for multiplication, addition, and subtraction.
If we were to add the standard division function, we would be in trouble because the blind evolutionary process would try division by zero. These, of course, would be invalid candidates and fall out of the evolutionary tree, but during execution we will still get exceptions, so introducing a safe division function catches the exception to allow candidate testing to be performed yet still return an invalid candidate, this is the safe_div function below. Also since we typically are working with small integers on these homework problems, we'll define a primitive to return a random small integer, rand_int_10. These functions are used on lines 3,6 above to add to the primitives set.
Next, we need to provide a way to determine if a generated solution is correct. This is called fitness and is the measure of the fitness of individual candidate solutions. In our example, we are looking to maximize the total number of passing results from our test input/output set, so we add a positive weight for our fitness. Also we define what an individual candidate solution is, a tree of primitives that meets our fitness function:
Now we need a function that can be used to evaluate an individual against our test input/output in order to measure it against our fitness:
The above evaluation function will compile the individual as code and bind to a local function that can be executed for each test input. We sum these results so that our maximization fitness function will award higher "fitness" to individuals that pass all the test cases.
We can now bind together our primitives, fitness, and individual in a toolbox that the genetic generation and testing algorithms can operate on for us:
We register our primitives on line 2. Our fitness was bound to the individual earlier and so we register the individual on line 3. Critically, we register our fitness evaluation function on line 6. The other lines are generally applied here, but can be used to tune the generation and mutation of individuals and can be optimized later.
Finally, we can define our main function that will use the toolbox to drive the generation of individuals and test their fitness, reporting result statistics along the way. For our example we will use some built-in individual evolution algorithms, but even these algorithms can be customized as needed with the DEAP library.
The HallOfFame object will maintain the best individuals (only want 1 here) across all of the population runs. Line 17 will print out the actual code of the solution.
Running this code, you'll get something like this:
gen nevals avg std min max
0 40 0.45 0.630476 0 2
1 36 1.025 0.651441 0 2
2 30 1.475 0.547152 0 2
3 36 1.5 0.67082 0 2
4 32 1.675 0.468375 1 2
5 36 1.4 0.538516 0 2
6 30 1.5 0.632456 0 2
7 29 1.825 1.4472 0 10
8 32 2.8 3.07571 0 10
9 35 5.75 4.27639 0 10
10 33 8.325 3.64272 0 10
That last line is the generated code, if you put this code into a python function and run the input from above through it, you will get the desired output. Pretty cool.
Now the fun begins, try running a few more times and see the different results you may get. Here is an interesting result:
sub(sub(sub(mul(x0, sub(add(x0, x0), sub(1, 5))), mul(add(x0, 7), sub(x0, x0))), mul(add(x0, x0), add(x0, 1))), mul(mul(add(x0, 7), sub(x0, x0)), mul(safe_div(8, 10), add(x0, 2))))
Hardly an optimal result, but try it out - it works. No one ever said computers can write readable code. However by tuning the evolution algorithm, toolbox setup, and given more time to evolve (increased population executions, multiprocessing on multiple computers) an optimized solution can be derived for any definable result. So, bring it back to your current process of design and test definition, can you see how once these steps are done the computer can actually write the code?