Wednesday, October 31, 2018

Genetic algorithms and the Zodiac killer’s 340 cipher

A few weeks ago there was a Travel Channel special about the Zodiac killer hosted by the Mysteries at the Museum guy.  As usual, I actually tuned in the watch the previous show but was too lazy to change the channel when this special started.  Anyway, there was a serial killer active in the 60's and 70's who was never caught, dubbed the Zodiac killer.  The killer left behind a series of ciphers, some of which have yet to be decoded.

That got me thinking... has anyone ever tried using an evolutionary/genetic algorithm to solve the cipher?  Well, who knows, but let's give it a try!

What’s a Genetic Algorithm?

It’s an approach to solving a problem that mimics natural selection in the natural world.  Wikipedia has a much more detailed definition.  We’ll start with a set of “candidate” solutions, measure and select the fittest candidates, then combine the fittest candidates to create a new generation of candidates.  Repeat hundreds or thousands of times, and you might end up with an optimal solution.

This is especially useful when optimization is necessary and a "brute-force" approach isn't practical.  One example would be to find which out of 100 songs would fit best on a 60 minute CD.  You could go through and evaluate every possible combination.  Or... try a genetic algorithm.

In this case,  not knowing which letters are repeated as cipher symbols, there would be over one quadrillion possible combinations, which is far too many to test and analyze.  We need a better way to find a solution.  Enter the genetic algorithm...

Candidates

Ideally we’d start with candidates that are as strong as possible.  In this case I figured minimal effort would involve randomly matching the 62 cipher symbols to letters.  To be sure we don’t end up
With 62 “Z” or “Q” letters, let’s also take into account the general probability of a letter occurring in the English language.  

Having too few candidates might not offer enough diversity and you’ll converge on a non-optimal solution.  Too many candidates and it might affect performance or take too long to converge.

static public class RandomLetterGenerator
    {
        /// <summary>
        /// Dictionary of letters and their probability of occurring in the English language
        /// </summary>
        static private Dictionary<char, double> dictLetterProbabilities = new Dictionary<char, double>()
        {
            {'e',.1116},{'a',.0850},{'r',.0758},{'i',.0754},{'o',.0716},{'t',.0695},{'n',.0665},{'s',.0574},{'l',.0549},{'c',.0454},{'u',.0363},{'d',.0338},{'p',.0317},{'m',.0301},{'h',.0300},{'g',.0247},{'b',.0207},{'f',.0181},{'y',.0178},{'w',.0129},{'k',.0110},{'v',.0101},{'x',.0029},{'z',.0027},{'j',.0020},{'q',.0020}
        };
        static public Random r = new Random();
        /// <summary>
        /// Gets a random character a-z taking into account the probability of a letter occurring in a word.
        /// </summary>
        /// <returns>A character</returns>
        static public char GetWeightedRandomChar()
        {
            double next = r.NextDouble();
            double sum = 0;
            // Keep adding probabilities until the sum is greater than the random double.  The current letter is then the one returned.
            foreach(KeyValuePair<char,double> kvp in dictLetterProbabilities)
            {
                sum += kvp.Value;
                if (sum >= next)
                {
                    return kvp.Key;
                }
            }
            return null;
        }

    }


Fitness/Selection

Now that we have a set of candidates, we need a way to measure which are the best.  This is usually referred to as the “fitness” of a candidate.  I chose to check what percentage of the 340 characters match the 1000 most common English words.  We’ll just take a substring of the 340 characters with a length that matches the longest word in our “Dictionary”, removing a character until we match a word or run out of characters.  This will give preference to longer words over shorter words.  Otherwise we’d just end up with a bunch of one and two letter word matches.  Next, loop until all 340 characters have been exhausted.

You could set your algorithm to stop after a certain fitness is found, or if it doesn’t improve after N consecutive generations.  My solution runs through N generation regardless of if the fitness is improving or not.  This lets me watch how it progresses while I'm testing the solution.

Crossover

After we find the fittest candidates, we should remove the least fit and mate/crossover the fittest of the generation to replace them.  How you crossover two candidates and how many you cull involves some trial and error.  

The point where you crossover is called, amazingly, the “crossover point”.  In our case we have 62 symbols, so my program randomly selects a number between 0 and 61. Any symbol before the chosen index gets copied from the first candidate and any symbol after or equal to that index is copied from the second candidate.  Combine the two and you have a new candidate created from the strongest of the previous generation.  

Also, my algorithm finds a random mate within the population.  I’ve seen some algorithms mate 1&2, 3&4, etc, but randomizing it seemed to produce better results in this case.  This algorithm also keenest the fittest of each generation.  A “true” genetic algorithm would only have offspring in the next generation.  However, keeping the fittest candidates from the previous generation produced better results in this case.  Again, you can customize the process however you see fit.  It will involve some trial and error.



/// <summary>
        /// Crossover between two candidates.  The crossover point is a random point in the cipher key.
        /// </summary>
        /// <param name="c"></param>
        /// <param name="mutationProbability"></param>
        /// <returns></returns>
        public Candidate MateWith(Candidate c, double mutationProbability)
        {
            string newCipher = string.Empty;
            if (r == null)
                r = new Random();
            // Randomize a mutation
            if (r.NextDouble() <= mutationProbability)
            {
                char[] charArray = CipherKey.ToCharArray();
                charArray[r.Next(0, CipherKey.Length - 1)] = RandomLetterGenerator.GetWeightedRandomChar();
                CipherKey = new string(charArray);                               
            }
            // Randomize the crossover point
            int crossoverPt = r.Next(0, CipherKey.Length-1);
            for (int i=0;i<crossoverPt;i++)
            {
                newCipher += CipherKey[i];
            }
            for(int i=crossoverPt;i<c.CipherKey.Length;i++)
            {
                newCipher += c.CipherKey[i];
            }
            return new Candidate() { CipherKey = newCipher,Fitness=null };
        }

Mutation

Notice mutationProbability in the snippet above.  That's because we also need to add mutation into the mix.  Every time two candidates are mated, there should be a small chance of one or more letters being switched.  This helps add diversity to the candidate pool and helps prevent premature convergence to a lower fitness solution.  What percentage and how many letters to switch?  That’s up to you.  My program flips one letter for the sake of simplicity.

Results?

Unfortunately, I wasn’t able to reach a solution that made any sense.  Finding an excuse to incorporate an evolutionary algorithm was worth it for me though.  There are some cloud based text analytic APIs out there.  Perhaps incorporating something like that would improve the outcome.  I wasn't able to reach any meaningful solution, but maybe you can improve on the idea and find a solution that makes sense!

The full C# source code can be found on github for your review .(https://github.com/1000littleideas/Zodiac340).

The textbox on the left will display the fittest candidate thus far while the textbox on the right displays the fittest candidate from each generation.  It ended up at just over 70%, but there were a lot of two/three letter words that didn't combine to mean much.




No comments:

Post a Comment

Genetic algorithms and the Zodiac killer’s 340 cipher

A few weeks ago there was a Travel Channel special about the Zodiac killer hosted by the Mysteries at the Museum guy.  As usual, I actually...