Notion Page Link <--- This has better formatting and looks nicer
Problem
When we perform the decryptions, one of the questions we have to solve is:
How do you know what decryption to do next?
This is solved by a search algorithm.
This proposal describes the search algorithm for Ares.
Solution
Our nodes will be the decrypted text, and our edges will be the decryption modules.
The struct for nodes should look like (taken from here):
struct Nodes<V> {
    children: Vec<Nodes<V>>, // nodes produced by decoding this node's text
    parents: Vec<Nodes<V>>,  // nodes this text was decoded from
    value: V,                // the decoded text itself
}
Where the value is the decoded text.
A simple Breadth First Search is:
use std::collections::VecDeque;

impl<V> Nodes<V> {
    fn bfs(&self, f: impl Fn(&V)) {
        // Queue of nodes still to visit, starting from this node.
        let mut q = VecDeque::new();
        q.push_back(self);
        while let Some(t) = q.pop_front() {
            // Visit the node, then queue its children.
            f(&t.value);
            for child in &t.children {
                q.push_back(child);
            }
        }
    }
}
Recording Parents
Because each edge (decoding) takes one input and produces one output, following the path to the parents is easy.
All we need to do is, when creating a child node, tell it which node we are currently on; we can do this by appending the current node to the child's parents vector.
To keep only the last X nodes, we can treat that vector as a First In, First Out queue: when its length reaches X, we remove the first (oldest) node.
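A minimal sketch of that bookkeeping, assuming we store the ancestor texts directly on each node; MAX_PARENTS and every name here are placeholders rather than the real Ares types:

use std::collections::VecDeque;

const MAX_PARENTS: usize = 5; // "X": how many ancestors we keep (assumed value)

struct Node {
    value: String,             // the decoded text at this node
    parents: VecDeque<String>, // the last MAX_PARENTS ancestor texts, oldest first
}

impl Node {
    // Create a child node: copy our history, append ourselves, and drop the
    // oldest entry once the queue grows past MAX_PARENTS (first in, first out).
    fn child(&self, decoded_text: String) -> Node {
        let mut parents = self.parents.clone();
        parents.push_back(self.value.clone());
        if parents.len() > MAX_PARENTS {
            parents.pop_front();
        }
        Node { value: decoded_text, parents }
    }
}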
Edges
Because each edge is a decoding / decryption algorithm, we need something along the lines of an array of all decryption objects we can run.
This array should be sorted according to Godly Searcher; it has to be sorted because:
- If we do all decodings first, when will we get to the decryptions?
- And vice-versa.
Our first searcher will be breadth-first search, so we do not need to sort them.
Ideally, instead of keeping a vector of edges in each node, we keep a global list. So something like this:
Global Array of Edges, Nodes, and their heuristic score.
This way, we always expand the node with the best heuristic next. If we didn't keep a global list, we'd have to expand node-by-node, which isn't optimal: perhaps after Rot13 on text2 → text3, the next best choice is Binary Code on text1 → text4, not whatever else is attached to text2.
Because of our idea of only running the decoders that might work (based on entropy, index of coincidence), each node will have to add its possible edges to this array.
In a way, this array is a queue of which nodes (texts) to expand using which edges (decoders).
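A rough sketch of what that global queue could look like, using a max-heap keyed on the heuristic score; the struct, its fields, and the scores are purely illustrative:

use std::collections::BinaryHeap;

// One pending expansion: "run this decoder (edge) on this text (node)".
// The score is listed first so the derived ordering is driven by it.
#[derive(PartialEq, Eq, PartialOrd, Ord)]
struct PendingEdge {
    score: u32,            // heuristic score, higher = try sooner
    node_id: usize,        // which node (text) to expand
    decoder: &'static str, // which decoder (edge) to run on it
}

fn main() {
    // One global queue instead of a per-node edge list.
    let mut queue = BinaryHeap::new();
    queue.push(PendingEdge { score: 40, node_id: 2, decoder: "Rot13" });
    queue.push(PendingEdge { score: 90, node_id: 1, decoder: "Binary Code" });

    // pop() always returns the globally best-scoring edge, regardless of
    // which node happened to be expanded last.
    while let Some(best) = queue.pop() {
        println!("next: {} on text{}", best.decoder, best.node_id);
    }
}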
Multi-threading
There are quite a few ways we can multi-thread this; the most important idea is that the queue every node adds to has to be locked.
1. Rayon
- Because our data is an array of edges, and we run the function .crack() on all of the edges, we may be able to easily use Rayon here.
2. We manually create threads one at a time
- We create one thread per edge, and each thread will obtain a lock on the queue system.
3. We divide the arrays and thread those mini-arrays
- There is a context switching cost to (2); we can avoid that with (3).
4. We use Async
- Async has less of a context switching cost and allows us to do one async task per edge.
I am leaning towards Rayon as it's easier.
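As a sketch of option (1), assuming a hypothetical Decoder trait with a crack() method (not the real Ares API), Rayon lets us run all the edges for one node in parallel while still funnelling results through a locked queue:

use rayon::prelude::*;
use std::sync::Mutex;

// Hypothetical decoder interface; the real Ares trait will differ.
trait Decoder: Sync {
    fn crack(&self, text: &str) -> Option<String>;
}

fn expand(text: &str, decoders: &[Box<dyn Decoder>], queue: &Mutex<Vec<String>>) {
    // Run .crack() on all the edges in parallel; Rayon manages the thread pool.
    decoders.par_iter().for_each(|decoder| {
        if let Some(result) = decoder.crack(text) {
            // The queue that every node adds to has to be locked.
            queue.lock().unwrap().push(result);
        }
    });
}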
What will the Heuristic be?
There are a few things I want to achieve.
Speculative Galloping
Let's say our current decryption tree shows we have successfully decoded Base64 twice in a row. Because Base64 fails to decode if the input is not Base64 (sometimes it can succeed if lucky, but generally it fails), we will make the assumption that all the remaining decodings are Base64.
This pattern is common. You have a string that's encoded with Base64, Base32, whatever multiple times and you have to decode it.
Speculative Galloping comes from the Timsort algorithm.
In Timsort, roughly, if our array looks like [1, 2, 3, 4, 5] we can assume the rest of it is "sorted" and thus we "gallop" until we see it's unsorted.
In our speculative galloping, we speculate that the plaintext was encoded with Base64 (for example) multiple times. So we "gallop" by only doing Base64 until it fails.
Eventually, Base64 will either:
- Come across a string it can't decode
- Or will trigger the checkers
And we win!
We make the edge case where the text is encoded with Base64 100 times much faster.
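A minimal sketch of the galloping loop, with decode_base64 and looks_like_plaintext standing in for a real decoder and checker:

// Keep speculatively applying the same decoder ("galloping") until it fails
// to decode or a checker says we have reached the plaintext.
fn gallop_base64(
    mut text: String,
    decode_base64: impl Fn(&str) -> Option<String>,
    looks_like_plaintext: impl Fn(&str) -> bool,
) -> String {
    while let Some(decoded) = decode_base64(&text) {
        text = decoded;
        if looks_like_plaintext(&text) {
            break; // the checkers fired: we win
        }
    }
    text // either the plaintext, or the last layer Base64 could not decode
}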
All Decoders are Equal, but some are more equal than others
There are a lot of decoders. Some of them are fast, some of them are popular. For example, Minecraft Enchanting Table is not as popular as Base64.
For that reason, we can prioritise decoders.
We can set:
- Popularity ratings (like name-that-hash and pywhat). The more popular something is, the higher its number, from 0 → 1.
- Speed. It is much quicker to decode Base64 or Binary than it is to decode some other things. This is because if you see the character "7" in a supposedly binary string, you know it's not binary, so you can stop early. We can benchmark the speed using the Filter System (see the sketch after this list).
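A tiny illustration of how these two ratings could be combined to order decoders; the numbers are made up:

// Per-decoder priority information; the values are invented for illustration.
struct DecoderRating {
    name: &'static str,
    popularity: f64, // 0 → 1, more popular = higher
    speed: f64,      // 0 → 1, quicker to succeed or fail early = higher
}

fn main() {
    let mut ratings = vec![
        DecoderRating { name: "Base64", popularity: 1.0, speed: 0.9 },
        DecoderRating { name: "Minecraft Enchanting Table", popularity: 0.05, speed: 0.5 },
    ];
    // Sort so the popular, fast decoders are tried first.
    ratings.sort_by(|a, b| {
        (b.popularity * b.speed)
            .partial_cmp(&(a.popularity * a.speed))
            .unwrap()
    });
    for r in &ratings {
        println!("{}: {:.2}", r.name, r.popularity * r.speed);
    }
}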
Minecraft Enchanting Table and XOR? Unlikely buddy.
It's unlikely to see a chain of decryptions like that.
In our Configuration file we can define "rules" that will help speed things up.
We should also define a rule like "No base64 then rot13" in our config file, as that's very unlikely.
These will, of course, be optional.
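A sketch of how such a rule might be represented and checked; the rule shape (a forbidden pair of decoders) is an assumption, and the config format itself is not decided here:

// Hypothetical rule loaded from the configuration file:
// "never try `then` directly after `after`".
struct ForbiddenSequence {
    after: &'static str,
    then: &'static str,
}

// Skip edges the user has marked as very unlikely.
fn allowed(previous: &str, next: &str, rules: &[ForbiddenSequence]) -> bool {
    !rules.iter().any(|r| r.after == previous && r.then == next)
}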
Entropy
If the previous 5 nodes do not show a normalisation in entropy, then we are not "getting close to the plaintext" and should prune (abandon) that path, unless we are in speculative galloping mode.
Entropy "normalises" the closer we are to plaintext, look at this:
This is encrypted with Base64 -> Rot13 -> Vigenère (key: “key”).
The Shannon Entropy is 5.23.
When we decode the Base64:
It’s now 3.88. We can tell if we are going in the right direction by the normalisation of Entropy.
Therefore, we should:
- Prioritise paths where their entropy "normalises" the most and:
- Prune paths where the entropy does not normalise (or gets closer to the encrypted / compressed part).
This will allow us to:
- Run longer since we are freeing memory
- Be faster over larger inputs since we are deleting unnecessary paths.
This should be optional as I have not benchmarked or properly tested this idea.
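A rough sketch of this check, assuming we track the Shannon entropy of the last few texts on a path; the window of 5 comes from above, while the "never decreased" pruning condition is an assumption to be tuned:

// Shannon entropy over bytes, in bits per byte.
fn shannon_entropy(text: &str) -> f64 {
    let bytes = text.as_bytes();
    if bytes.is_empty() {
        return 0.0;
    }
    let mut counts = [0usize; 256];
    for &b in bytes {
        counts[b as usize] += 1;
    }
    let len = bytes.len() as f64;
    counts
        .iter()
        .filter(|&&c| c > 0)
        .map(|&c| {
            let p = c as f64 / len;
            -p * p.log2()
        })
        .sum()
}

// Abandon a path when the last 5 entropies never moved towards plaintext
// (i.e. never decreased), unless we are in speculative galloping mode.
fn should_prune(last_entropies: &[f64], galloping: bool) -> bool {
    !galloping
        && last_entropies.len() >= 5
        && last_entropies.windows(2).all(|w| w[1] >= w[0])
}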
🧙♂️ Knowing what the right decryption is
Entropy for Base64 falls around 3 - 5. We can use this information to determine what the next decryption method is. For example, Rot13's Entropy may be 3 or 4.
If we decrypt text and the resulting entropy falls into one of those known ranges, we can guess it's Rot13 and do that next. We can do this because classical ciphers are not truly random, so we can "guess" what something is by looking at it; the same goes for encodings.
We need to normalise the entropy (Cyberchef doesn't have this) by dividing by the length. This is because the shorter the text is, the less entropy it has (and vice-versa for larger texts).
This idea is not properly tested as Cyberchef does not support normalised entropy.
We can also do this using Index of Coincidence.
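For reference, the textbook Index of Coincidence calculation over letters (not Ares code); English plaintext sits around 0.066, while uniformly random letters sit around 0.038:

// IC = sum(f_i * (f_i - 1)) / (N * (N - 1)) over the 26 letter counts.
fn index_of_coincidence(text: &str) -> f64 {
    let mut counts = [0u64; 26];
    for c in text.chars().filter(|c| c.is_ascii_alphabetic()) {
        counts[(c.to_ascii_lowercase() as u8 - b'a') as usize] += 1;
    }
    let n: u64 = counts.iter().sum();
    if n < 2 {
        return 0.0;
    }
    let numerator: u64 = counts.iter().map(|&f| f * f.saturating_sub(1)).sum();
    numerator as f64 / (n * (n - 1)) as f64
}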
The 10% rule
We can decrypt 10% of a string, and if that first 10% is valid (using Quadgrams we can see if it passes a basic check, or another checker like Entropy or Chi Squared), we decrypt the rest.
We should only do this for slower ciphers.
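A sketch of the 10% rule, with decrypt and passes_check standing in for a real cipher module and checker (it also assumes ASCII input so the slice lands on a character boundary):

fn ten_percent_rule(
    ciphertext: &str,
    decrypt: impl Fn(&str) -> String,
    passes_check: impl Fn(&str) -> bool,
) -> Option<String> {
    // Decrypt only the first 10% of the string...
    let prefix_len = (ciphertext.len() / 10).max(1);
    let preview = decrypt(&ciphertext[..prefix_len]);
    // ...and only pay for the full decryption if the preview passes a quick check.
    if passes_check(&preview) {
        Some(decrypt(ciphertext))
    } else {
        None
    }
}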
❓ Decay
Generally the deeper we go, the less likely we are to decrypt.
If for some reason we are 40 levels deep on one node, and have not explored the other nodes we should prioritise them.
We can do this by using Exponential Decay to prioritise the paths which have not been searched yet.
Exponential decay - Wikipedia
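One possible way to apply that decay to a path's score, where lambda is a tuning constant we would still have to pick:

// Scale a path's priority down the deeper it is, so shallow, unexplored
// nodes win: score * e^(-lambda * depth).
fn decayed_score(base_score: f64, depth: u32, lambda: f64) -> f64 {
    base_score * (-lambda * depth as f64).exp()
}
// e.g. with lambda = 0.1, a node 40 levels deep keeps only e^-4 ≈ 2% of its score.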
Success Time, Fail Time, Likely Chance
We should also take into account:
- If it was successful, how long would it take on average?
- In the absolute worst case where it tries all keys, how long does that take?
- How likely is it that this text is encrypted with X? We can work this out on-the-fly using Entropy or one of our previous calculations.
So... What is the heuristic?
We should take all of these and multiply them, and divide them by the failure time.
If a decoder takes very long to fail, we would want to run it roughly last, so we can try the quicker decoders first.
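As a rough sketch, the combination could look like this; which factors go in and how they are scaled are assumptions that still need benchmarking:

// Multiply the factors together and divide by the worst-case failure time.
struct EdgeStats {
    popularity: f64,          // 0 → 1, from the popularity ratings
    likelihood: f64,          // 0 → 1, chance this text is encrypted with X
    entropy_improvement: f64, // how much this path has been normalising entropy
    decay: f64,               // exponential decay for how deep the path already is
    worst_fail_secs: f64,     // worst case time to fail, e.g. trying every key
}

fn heuristic(e: &EdgeStats) -> f64 {
    (e.popularity * e.likelihood * e.entropy_improvement * e.decay)
        / e.worst_fail_secs.max(f64::EPSILON) // avoid dividing by zero
}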
We will need to heavily benchmark and test this; we can create a platform in our app to do this.