A metaheuristic is a high-level set of established mathematical rules, combined with some randomness, that guides a search process towards (near-)optimal solutions to a problem.¹
Metaheuristics are used to solve optimization problems as an alternative to exact methods like Newton’s Method or Gradient Descent. Their advantage over such methods is their low computational cost. The search process of metaheuristic algorithms like PSO is based on two main principles:
Exploration (or Diversification)
Exploitation (or Intensification)
Through Exploration the algorithm makes sure the whole search space is covered. The algorithm is “exploring” the search space and thereby increasing the chances of finding the optimal solution. In contrast, Exploitation describes the process of using already known information to direct the search towards promising solutions. The metaheuristic algorithm converges towards a minimum, but this is not guaranteed to be the global optimum. One challenge in applying algorithms like PSO is to avoid Exploitation towards local minima. However, excessive Exploration is not the solution to that challenge either, because then convergence speed drops. It is always a tradeoff between Exploration and Exploitation.
Another key element of Metaheuristics are their Hyperparameters, and different Metaheuristics have different ones. Because of their impact on convergence speed and solution accuracy, it is mostly useful (or necessary) to perform a study on the Hyperparameters of your algorithm. If you are interested in a specific Hyperparameter Tuning method, have a look at one of my other stories, covering Deep Random Search.
Particle Swarm Optimization²
Let’s start with the obvious analogy. PSO works similarly to a swarm of birds searching for a food source. Every bird has its own memory and can remember the best solution it has found so far. This is also called the cognitive component of the update rule. But the birds can communicate with each other and exchange their best found solutions. This is the social component of the update rule.
You can now define a weight, or a degree of trust, for each component. Either you want your particles more isolated from each other and increase the impact of the cognitive component, or you want the opposite and therefore increase the impact of the social component. In addition to the weights for each component there is another Hyperparameter: in each iteration of PSO a random scaling factor (e.g. between 0 and 2) is multiplied with each component. Most Metaheuristics have this probabilistic aspect.
Figure 1 shows the position update rule of the algorithm. In an analogy to classical mechanics, the next position depends on the current position and the velocity (and a timestep, but you can set this to 1 at the beginning, it’s just another scaling factor). The position is a solution vector for your optimization problem (arguments of the function you want to optimize).
Figure 1: PSO position update rule
Figure 2 shows the velocity update rule of the algorithm. The weight c and the random scaling factor r need to be defined for both the cognitive and the social component. The cognitive component is calculated via the difference between the current position and the particle’s individual best known solution XP. The social component is calculated via the difference between the current position and the global, or swarm’s, best known position XG.
Figure 2: PSO velocity update rule
There is a third component in the velocity update rule called inertia. Again, in analogy to classical mechanics, this component affects whether a particle can change its velocity quickly or rather slowly.
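Since the figures may not render here, the two update rules can be written out. This is the standard textbook formulation of PSO (subscripts 1 and 2 mark the cognitive and social component, w is the inertia weight; the figures may use slightly different notation):

```latex
% Position update (Figure 1)
x_i(t+1) = x_i(t) + v_i(t+1)\,\Delta t

% Velocity update (Figure 2)
v_i(t+1) = w\,v_i(t)
         + c_1 r_1 \bigl(x_i^{P} - x_i(t)\bigr)   % cognitive component
         + c_2 r_2 \bigl(x^{G}   - x_i(t)\bigr)   % social component
```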
And that’s basically it. You need the position and velocity update rules; define your Hyperparameters and you are ready to go!
What is this useful for?
Metaheuristic optimization can be used in many different fields of application, e.g. mathematics (optimization problems) or engineering (controller tuning). One advantage over gradient-based optimization algorithms is that the function of interest does not need to be differentiable at every point of the search space. You also save the computational cost of calculating gradients. In general, Metaheuristics are considered faster than classical approaches, but again, there is no guarantee they find the global optimum.
Let’s go with an example and analyze a typical benchmark function, the sphere function. I won’t go into much detail here, but to compare different algorithms there are many so-called benchmark functions which are used to test an optimization algorithm and measure its performance. The sphere function has a global minimum of 0.
Figure 3: mathematical description of the sphere function Figure 4: sphere function (Source: http://benchmarkfcns.xyz/benchmarkfcns/spherefcn.html)
I defined a PSO algorithm with 50 particles. The sphere function has 10 dimensions in this example. After 5000 iterations (about 5 seconds in this case) the algorithm stopped and the best solution found was 1e-150. Figure 5 shows the result of the search for the first 50 iterations and two selected dimensions. It clearly shows how the particles are exploring the search space and later converging towards 0 (Exploration vs. Exploitation).
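For the curious, here is a minimal sketch of such an experiment in Python. The hyperparameter values (w, c1, c2, the search bounds) are illustrative assumptions, not necessarily the ones used for Figure 5, and the demo uses far fewer iterations:

```python
import random

random.seed(42)  # for a reproducible demo run

def sphere(x):
    # Sphere benchmark: global minimum 0 at the origin.
    return sum(xi * xi for xi in x)

def pso(f, dim=10, n_particles=50, iters=5000,
        w=0.7, c1=1.5, c2=1.5, bounds=(-5.0, 5.0)):
    lo, hi = bounds
    X = [[random.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    V = [[0.0] * dim for _ in range(n_particles)]
    P = [x[:] for x in X]                 # personal best positions
    g = min(P, key=f)[:]                  # global (swarm) best position
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = random.random(), random.random()
                V[i][d] = (w * V[i][d]
                           + c1 * r1 * (P[i][d] - X[i][d])   # cognitive
                           + c2 * r2 * (g[d] - X[i][d]))     # social
                X[i][d] += V[i][d]        # position update with dt = 1
            if f(X[i]) < f(P[i]):
                P[i] = X[i][:]
                if f(P[i]) < f(g):
                    g = P[i][:]
    return g, f(g)

best, best_val = pso(sphere, iters=200)   # fewer iterations for a quick demo
print(best_val)                           # very close to the true minimum 0
```

Even this short run shows the typical behavior: early iterations scatter across the bounds (Exploration), later ones collapse onto the best known region (Exploitation).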
Figure 5: PSO on the sphere function
Particle Swarm Optimization is a metaheuristic optimization method. It uses the concepts of exploration and exploitation. This class of algorithms can be faster than classical approaches like Newton’s Method or Gradient Descent, but you have to be aware that Metaheuristics might not find the global optimum and can get stuck in a local minimum.
The way PSO works is similar to a swarm of birds searching for a food source. Each bird memorizes the best source it has spotted so far. All birds can communicate within the swarm and share their most promising spots with each other. There is a tradeoff between the impact of the social and the cognitive component, always depending on your problem.
PSO has several Hyperparameters which need to be tuned to your problem. One tuning method could be, for example, Deep Random Search.
Thank you for reading this far! If you have any thoughts, annotations or ideas about this story feel free to leave a comment. Also, if you want me to write a story about a specific topic regarding this story (or any other topic) just send me a message or leave a comment.
References
1. Blondin, M. J. (2020). Controller Tuning Optimization Methods for Multi-Constraints and Nonlinear Systems: A Metaheuristic Approach. Cham, Switzerland: Springer. Retrieved 15 January 2021 from https://link.springer.com/book/10.1007%2F978-3-030-64541-0
2. Venter, G. & Sobieski, J. (2002). Particle Swarm Optimization. 43rd AIAA Structures, Structural Dynamics, and Materials Conference, Denver, Colorado.
I had the pleasure of witnessing this interaction the other day: two kids were locked in a battle of wits to see who could think of the highest number.
Player one served with a million. Player two with an amazing return of a billion. Player one went for the smash: INFINITY. But player two caught it on the volley — INFINITY PLUS ONE.
At this point the umpire (me) intervenes with the ruling that infinity plus one is actually just infinity. (I ignored the technicality that if they are counting cardinals, then they should have said aleph null — they are just kids after all.)
How can this be? After all, at school they were taught that when you add a number to another number, you make it bigger. By that logic infinity plus one must be larger than infinity.
The problem with infinity is that it is larger than any (finite) number you can think of and it defies all intuition. Some interesting features of infinity are illustrated in a thought experiment presented by David Hilbert in 1924.
Hilbert’s Hotel
Every year there is a conference on hyper dimensional maths. The whole universe is invited and they all stay in Hilbert’s Hotel. Thankfully, this hotel has infinitely many rooms and (unlike any maths conference I have been to) all the rooms are booked up.
Problem 1
During dinner, a latecomer arrives. He rings the bell at the front desk and along comes a tired-looking bellhop.
“We’re full” the bellhop proclaims and slowly starts to trudge away.
The manager Hilbert immediately interrupts and asks the bellhop to get the person in room one to move all their stuff into room two. The occupant in room two then moves all their stuff to room three and so on and so forth. Since there are infinitely many rooms, this will not cause a problem since there will always be a next room to move into. This now frees up room one for the latecomer to move into.
Result!
What about if a (finite) group of latecomers arrive? Say three.
Using a similar method, Hilbert gets the bellhop to move the occupant of room one into room four (1+3), the occupant of room two into room five (2+3), and so on, each person moving up by three rooms. This then frees up three rooms for our late party to move into.
So our tired bellhop (from having to move an infinite number of people one room up) can deal with any finite party coming into a full hotel.
Problem 2
Now in the middle of the night a SuperBus arrives. This bus is rather special since it contains an infinite number of people, all eager to get to their respective beds.
The bellhop gulps. The previous method definitely won’t work since he can’t move the occupant of room one into room ‘infinity plus one’, let alone the occupant of room two, and so on.
Apart from building a new hotel, the bellhop is all out of ideas. So he reluctantly wakes up Hilbert.
“We have a problem” he says sheepishly.
Still half asleep, Hilbert instructs the bellhop to move the guest in room one into room two, the occupant of room two into room four, room three into room six. In general, the occupant of room n gets moved into room 2n. This frees up every odd-numbered room. Then everyone on the bus is assigned a number m and moves into room 2m−1, which covers exactly the odd rooms. Problem solved!
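The room assignments from Problems 1 and 2 are simple enough to write down as (hypothetical) helper functions. Existing guests land on the even rooms, bus passengers on the odd rooms, so nobody collides:

```python
def room_after_finite_party(old_room, k):
    # Problem 1: a party of k latecomers arrives - every guest shifts up k rooms.
    return old_room + k

def room_after_superbus(old_room):
    # Problem 2: the guest in room n moves to room 2n (the even rooms).
    return 2 * old_room

def room_for_bus_passenger(m):
    # Passenger m from the SuperBus takes room 2m - 1 (the odd rooms).
    return 2 * m - 1

print(room_after_superbus(3))     # guest from room 3 ends up in room 6
print(room_for_bus_passenger(1))  # first bus passenger gets room 1
```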
Problem 3
In the small hours of the morning, all hell breaks loose; an infinite horde of SuperBuses arrives. Hilbert leaps into action; he had envisioned this event and already had a plan.
He assigns guest in room k to room number 2ᵏ. So the guest in room 5 gets moved to room 32.
He assigns every bus a prime number starting with 3 (there are infinitely many primes so this is fine) and then numbers each person in each bus. Then the kᵗʰ passenger in the bus assigned prime p gets sent to room pᵏ. For example, the second passenger in the bus assigned prime 11 gets sent to room 11² = 121.
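A sketch of Problem 3’s assignment (the function names are mine). Unique prime factorization guarantees that no two people are ever sent to the same room:

```python
def primes_from_3(n_buses):
    # First n_buses odd primes: 3, 5, 7, 11, ... (2 is reserved for the hotel guests).
    primes, candidate = [], 3
    while len(primes) < n_buses:
        if all(candidate % p for p in range(2, int(candidate ** 0.5) + 1)):
            primes.append(candidate)
        candidate += 2
    return primes

def room_for_existing_guest(k):
    # The guest currently in room k moves to room 2^k.
    return 2 ** k

def room_for_passenger(bus_prime, k):
    # The k-th passenger of the bus assigned prime p goes to room p^k.
    return bus_prime ** k

print(room_for_existing_guest(5))   # guest from room 5 moves to room 32
print(room_for_passenger(11, 2))    # 2nd passenger of the bus with prime 11 -> room 121
```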
Here we show which room number each person gets assigned to
This method leaves lots of rooms empty but that’s fine since at least we got everyone in and ready for the conference.
Extension Problems
There are a few more problems that I will leave here for you to ponder over.
Suppose the hotel is next to an ocean and an infinite fleet of MegaBoats arrives, each carrying an infinite fleet of MegaBuses. How can they be accommodated?
This is three layers of infinity, so to speak. Can you generalise to accommodate an arbitrary number of layers of infinite guests? How about infinitely many layers of infinite guests?
Infinity is a MESS
Infinity is so much bigger than we can understand: an infinite number of infinite buses of infinite passengers. Little thought experiments like this one show us that:
Infinity is really large, like really really large
We really don’t understand infinity
It gets worse.
There are larger types of infinities and this is the smallest. The next highest infinity is mind-bogglingly larger than the one we have encountered in Hilbert’s Hotel. But for now, we will let the poor bellhop sleep for a bit.
*Unless otherwise stated, infinity will refer to countable infinity. This means there is a way to ‘count’ up to it: it comes after every number in the list 1, 2, 3, 4, …
Want to win a million bucks? Just solve one of these problems. No strings attached. Ok maybe one string: the problems are somewhat hard. Scratch that, really hard.
The Gauntlet is Thrown
At the start of the millennium, the Clay Mathematics Institute put forward these seven problems, deemed some of the most difficult problems that remain open. Each problem has a one million dollar bounty for the first person to provide a valid proof (or disproof).
I have been wanting to write this article for quite some time, but struggled to decide at what level I should present the material. After a lot of umming and ahing I decided to present it at an elementary university level. A few things will be assumed (like knowledge of groups and complex plane) but everything that I think is ‘new’ will be explained.
At the time I am writing this, only the Poincaré conjecture has been solved. Grigori Perelman presented the proof in 2003 and was officially awarded the Millennium Prize in 2010, which he declined.
I will be presenting this conjecture (now theorem) first and then the remaining unsolved problems in order of increasing complexity.
Poincaré Conjecture
Every closed, simply connected, 3-manifold is homeomorphic to the 3-sphere.
Let’s break this down word by word. Firstly, a manifold is an object, or a generalisation of a space, that is locally like Euclidean space. This means that if you zoom in on it, it looks like a line or a plane or regular 3-dimensional space and so on. An example of a manifold would be a sphere (like the surface of a ball): if you zoom in close enough, it looks flat. The dimension of a manifold is the dimension of the space that it locally looks like; for example, a sphere locally looks like a plane, which means it has dimension 2, and a circle locally looks like a line, so it has dimension 1. For a manifold of dimension n, we sometimes write it as an n-manifold; similarly, for a specific shape like a sphere, which can be generalised to n dimensions, we write the n-sphere.
A manifold is closed if it has no boundary and is bounded, that is, it doesn’t go off to infinity (technically a manifold is closed if it doesn’t have a boundary and is compact, but if the manifold is immersed in Euclidean n-space then, given it is boundaryless, compactness just requires being bounded). A line between 0 and 1 has a boundary at 0 and at 1 and so is not closed. A circle has no boundary and so is closed.
Closed 2-manifold — this particular one is called a 2-torus
A manifold is simply connected if it has no ‘holes’:
(A) is a simply connected space (B) is not simply connected
An equivalent formulation of being simply-connected is that each loop can be continuously tightened to a point.
(A) a loop in A can be tightened to a point (B) here a loop in B gets ‘caught’ on a hole and can’t be tightened to a point
Two manifolds are homeomorphic if you can deform one into the other and back again continuously. Permissible deformations include stretching, squeezing and twisting, but not ripping, tearing or puncturing. This leads to the famous equivalence between a doughnut and a mug.
How to deform a doughnut into a mug
In topology, we want to classify all manifolds into classes where all manifolds within a certain class are all homeomorphic to each other.
In two dimensions, it is (somewhat) easy to see that if a manifold is closed and without holes then it is equivalent to a sphere. In other words, being closed and simply connected is sufficient for a 2-manifold to be homeomorphic to the sphere (the 2-sphere).
Poincaré hypothesised at the start of the 20th century that this is true also in three dimensions, that is any closed, simply connected 3-manifold is homeomorphic to the 3-sphere.
In 2002, Grigori Perelman proved this to be true by using techniques such as Ricci flow and manifold surgery.
P vs NP
Can every problem whose solution can be verified quickly, be solved quickly?
Problems can be categorised into different complexity classes. Here we are interested in the classes P and NP. These stand for Polynomial time and Non-deterministic Polynomial time, respectively.
In essence, a P problem can be solved ‘quickly’ and verified ‘quickly’, whereas an NP problem (currently) does not have a ‘quick’ solution. More specifically, given a problem with an input of size n, the time taken to solve it, if it is in the P class, grows according to some polynomial in n; if it is NP (and not in P), it will grow faster.
An example of a problem that is thought to be NP (I say thought since it hinges on the verity of the hypothesis) is the (decision problem version of the) travelling salesman problem:
Given a list of cities and the distance between each, can you construct a route that visits each city whose total length is less than a given distance?
This problem is hard and taxing to solve, but a solution is easy to check: a solution is a list of cities to visit in order, and one can verify that it is valid just by adding up the distances and comparing the total to the given bound. It is important to note that increasing the length of the list of cities makes the time to solve grow much faster than any polynomial.
On the other hand, an example of a P problem is verifying if a number is in a given list. It is easy to solve and easy to check, and if you double the size of the list, the time taken will also double (so the time taken doesn’t grow too fast).
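The asymmetry is easy to see in code: checking a proposed tour takes time polynomial in the number of cities, even if finding one does not. The distance table below is a made-up toy instance:

```python
def route_length(route, dist):
    # Sum of the pairwise distances along the route (linear in its length).
    return sum(dist[a][b] for a, b in zip(route, route[1:]))

def verify_tour(route, cities, dist, bound):
    # Quick *verification*: visits every city exactly once and stays under the bound.
    return sorted(route) == sorted(cities) and route_length(route, dist) <= bound

dist = {
    "A": {"B": 1, "C": 4},
    "B": {"A": 1, "C": 2},
    "C": {"A": 4, "B": 2},
}
print(verify_tour(["A", "B", "C"], ["A", "B", "C"], dist, bound=3))  # True: 1 + 2 = 3
```

Verifying scales gently with the number of cities; an exhaustive search over all orderings does not.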
What the P vs NP problem asks is if in fact NP problems are distinct from P problems. Otherwise framed, does there exist some secret or hidden algorithm that can solve previously considered hard problems quickly?
Navier–Stokes Existence and Smoothness
In three space dimensions and time, given an initial velocity, does there exist a vector velocity field and a scalar pressure field, which are both smooth and globally defined, that solve the Navier–Stokes equations?
The Navier–Stokes equations are non-linear partial differential equations that describe fluid motion in 3D. They are a set of two equations that link the velocity vector field and its rate of change with the pressure field and the external forces applied to the fluid. The equations are written as such:
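For reference, one common incompressible form of the equations (a standard formulation, with ρ the density, ν the kinematic viscosity and f the external force per unit mass) is:

```latex
\frac{\partial \mathbf{v}}{\partial t}
  + (\mathbf{v} \cdot \nabla)\,\mathbf{v}
  = -\frac{1}{\rho}\,\nabla p + \nu\,\nabla^{2}\mathbf{v} + \mathbf{f},
\qquad
\nabla \cdot \mathbf{v} = 0
```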
We won’t delve into what each term means but essentially, the first equation is the (viscous) fluids version of Newton’s F=ma — the forces contributing are the sum of pressure, viscous stress and external forces. The second equation is very simply a conservation of mass and requires that the fluid is incompressible.
For a solution to be ‘valid’ we have two conditions:
The vector field v and scalar field p are globally defined and continuous over the whole space.
The total kinetic energy is bounded. (The integral of the norm squared of v over the whole space is bounded.)
So the Navier-Stokes problem is boiled down to proving one of two cases:
The affirmative: given f = 0 and an initial velocity field (that has to satisfy certain conditions) there exists a velocity and pressure field that satisfies (1) and (2).
The breakdown: There exists an initial vector field and external force field where there is no solution that satisfies (1) and (2).
N-S equations govern the diffusion of milk in tea
Riemann Hypothesis
Do all non-trivial zeros of the Riemann zeta function have real part equal to 1/2?
Again, let us break this down. Firstly the Riemann zeta function is defined by the following equation
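Written out, the defining series is:

```latex
\zeta(s) = \sum_{n=1}^{\infty} \frac{1}{n^{s}}
```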
which is valid for s > 1. Note that for s = 1 the function reduces to the harmonic series, which blows up. We can do some fancy maths to analytically continue (essentially extend) the function to the whole complex plane, apart from a pole at s = 1, with the following functional relationship:
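One common form of this functional relationship (the version whose cosine factor produces the trivial zeros) is:

```latex
\zeta(1 - s) = 2\,(2\pi)^{-s} \cos\!\left(\frac{\pi s}{2}\right) \Gamma(s)\,\zeta(s)
```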
Now we want to find the s for which ζ(s) = 0. Since the cosine factor in the functional relationship is 0 for odd integers s, ζ(−2n) is 0 for every positive integer n. These are called trivial zeros, since they are zero due to the nature of the cosine. Instead, we are interested in when ζ is zero of its own accord.
It is known that all non-trivial zeros have real part between 0 and 1; this region is known as the critical strip. As it turns out, it seems that if s is a non-trivial zero (i.e. if ζ(s) = 0 and s is not a negative even number) then s = 1/2 + iy for some value y, i.e. the real part of s is 1/2. This is known as the critical line.
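You can check this numerically. The sketch below continues ζ to Re(s) > 0 via the alternating (Dirichlet eta) series, which is slowly convergent and for demonstration only; 14.134725 is the imaginary part of the first non-trivial zero:

```python
def zeta(s, terms=200_000):
    # Dirichlet eta series: eta(s) = sum of (-1)^(n+1) / n^s, valid for Re(s) > 0.
    # Then zeta(s) = eta(s) / (1 - 2^(1-s)).  Slowly convergent - demo only.
    eta = sum((-1) ** (n + 1) / n ** s for n in range(1, terms + 1))
    return eta / (1 - 2 ** (1 - s))

print(abs(zeta(2) - 3.141592653589793 ** 2 / 6))  # close to 0: zeta(2) = pi^2 / 6
print(abs(zeta(0.5 + 14.134725j)))                # close to 0: a zero on the critical line
```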
Birch and Swinnerton-Dyer Conjecture
Given an elliptic curve E over ℚ, does the algebraic rank always coincide with the analytic rank?
An elliptic curve E, is the set of solutions of an equation of the form y²=x³+Ax+B with the constraint that the discriminant ∆=-16(4A³ + 27B²) ≠0. The constraint just ensures that the curve is sufficiently nice.
Two elliptic curves. Left: y² = x³-1.5x+1, right: y²=x³-4x+1
We now restrict the solutions to the elliptic curve by requiring that x and y be rational. This is what we mean by a curve over ℚ. Now we can use this curve E to form a group denoted E(ℚ). We have a pretty neat binary operation: given two points, we draw a line through them, find the third intersection with E and reflect it through the x-axis.
How to add two points A and B to find C
In order to fully make it a group, we need to add a point at infinity which acts as the identity of the group (for the reader familiar with projective geometry: E is a non-singular projective curve, so we get the identity for free from the ambient space).
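The group law is concrete enough to code up. Here is a sketch over ℚ using exact rational arithmetic, on the curve y² = x³ − 4x + 1 from the figure above (None stands for the point at infinity; function names are mine):

```python
from fractions import Fraction

A, B = Fraction(-4), Fraction(1)       # the curve y^2 = x^3 - 4x + 1

def on_curve(P):
    if P is None:                      # None is the point at infinity
        return True
    x, y = P
    return y * y == x ** 3 + A * x + B

def add(P, Q):
    # Chord-and-tangent group law on E(Q); None acts as the identity.
    if P is None: return Q
    if Q is None: return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2 and y1 == -y2:
        return None                     # P + (-P) = point at infinity
    if P == Q:
        lam = (3 * x1 * x1 + A) / (2 * y1)  # tangent slope (doubling)
    else:
        lam = (y2 - y1) / (x2 - x1)         # chord slope
    x3 = lam * lam - x1 - x2
    y3 = lam * (x1 - x3) - y1          # reflect the third intersection in the x-axis
    return (x3, y3)

P = (Fraction(0), Fraction(1))
Q = (Fraction(2), Fraction(1))
print(add(P, Q))                       # a new rational point on the curve
```

Adding rational points keeps producing rational points, which is exactly why E(ℚ) forms a group.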
The first natural question to ask is what can we deduce about the structure of E(ℚ)?
A result due to Mordell and Weil tells us that E(ℚ) is finitely generated and can be written as
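In symbols, the Mordell–Weil decomposition reads:

```latex
E(\mathbb{Q}) \cong \mathbb{Z}^{r} \oplus E(\mathbb{Q})_{\mathrm{tors}}
```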
where E(ℚ)_tors is the set of all points in E(ℚ) that have finite order, and r is known as the algebraic rank of the curve E.
Great, now we have the first half. Now we need to understand the analytic rank.
Let us now restrict the solutions even further by considering E over the finite field of size p, where p is a prime number.
We define the following values
and then finally the L-series of E at s as such
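In standard notation, the values and the L-series referenced above are (the product runs over primes of good reduction, i.e. p not dividing ∆):

```latex
a_p = p + 1 - \#E(\mathbb{F}_p)

L(E, s) = \prod_{p \,\nmid\, \Delta} \left( 1 - a_p\, p^{-s} + p^{\,1-2s} \right)^{-1}
```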
Recall that ∆ is the discriminant of the elliptic curve. Then we can expand L as a Taylor series around s = 1:
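That expansion has the shape:

```latex
L(E, s) = c\,(s - 1)^{\,r_{\mathrm{an}}} + \text{higher-order terms}, \qquad c \neq 0
```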
Here, r_an is the analytic rank of the curve. Those familiar with complex analysis will recognise r_an as the vanishing order of the zero.
At long last! We can write the Birch and Swinnerton-Dyer conjecture very simply as
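That is, the algebraic rank equals the analytic rank:

```latex
r = r_{\mathrm{an}}
```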
Ok so what does this all mean? As it turns out, computing the algebraic rank is quite hard whereas the analytic rank is somewhat easier. This conjecture provides a bridge between the land of analysis and the land of algebra.
Yang–Mills Existence and Mass Gap
Given any compact simple gauge group G, does there exist a non-trivial quantum Yang–Mills theory on ℝ⁴ that has a mass gap Δ > 0?
A quick disclaimer here: I am hardly an expert in particle physics so I will give my best superficial understanding here.
What this problem is asking is to make modern physics mathematically rigorous.
We start off with the idea of gauge symmetries: these are essentially freedoms in how we describe a physical system. For example, it makes no difference how and where we orient our coordinate system.
A neat theorem due to Emmy Noether is that for every symmetry, there is a corresponding conservation law. For example:
Time invariance (i.e. it doesn’t matter if you start your experiment now or in 5 minutes after you’ve had a cup of tea) gives rise directly to the conservation of energy
Translational invariance gives rise to conservation of momentum
Next, moving on to Yang-Mills theory.
The best explanation I could find is given by Lawrence Krauss. Imagine a chess board: if you swap every white square for a black square and every black square for a white one, then the game is pretty much identical. Not much has happened, but there has been a change, so that is quite a simple symmetry.
Imagine now however that I locally switch the colour of a certain square, and do so as much as I want throughout the board. The board will look very strange but I can write a rule book that has accounted for all the swaps I’ve made. This rule book then dictates how the game is played.
Now the rule book is in fact a field, the game is a Yang-Mills theory, and swapping the colours locally is a gauge symmetry.
So, let’s go over it:
A gauge group is a group of (possibly very bizarre) symmetries of a system. This gives rise to a conservation law, and we can write a ‘rule book’, a field that defines how particles interact, which is a Yang-Mills theory.
This has already been done in the case of the electromagnetic force and the strong nuclear force which are completely described using quantum electrodynamics and quantum chromodynamics.
What the Yang-Mills Existence part asks (we will get to the mass gap in a second) is: does this description exist for all four fundamental forces? And more interestingly, can they be unified?
On to the mass gap: an excitation in one of these fields is actually a particle. A mass gap is essentially a stipulation that the masses of these particles have to be bounded below, so that you can’t find a particle that is arbitrarily light. This is what we observe in nature. It is called a mass gap since there is a gap between 0 and the mass of the lightest particle.
So for a Yang-Mills theory to be ‘good’ at describing reality, it must also exhibit this mass gap.
Hodge Conjecture
Let X be a non-singular complex projective manifold. Then can every Hodge class on X be written as a linear combination with rational coefficients of the cohomology classes of complex subvarieties of X?
This one is a doozy. I’m going to go into a lot less detail here because blimey it is hard to understand.
There is a natural interchange between algebraic equations and geometrical figures. Solutions to x²+y²-1= 0 forms a circle and x+y-1=0 forms a line.
So we can come up with some crazy equations, and the solutions form a (sometimes very complicated) shape; these are called algebraic cycles. If these algebraic cycles are smooth enough, then they can be called manifolds (recall these from the Poincaré conjecture).
So algebraic cycles (read: solutions of equations) can form manifolds; if we add more equations, then we get algebraic cycles on the manifold.
Adding z = 0 to the equation x²+y²+z²=1, we get a circle
Now from a topological perspective, we can draw crazy shapes on a manifold and then group these shapes together if they can be deformed into each other. They are grouped into homology classes.
Two different homology classes on a torus
Now this looks exactly like the interchange we considered above: we are moving from an algebraic description of a shape and a geometrical description. The problem is, given a manifold, when does a homology class contain one shape that can be described as an algebraic cycle on that manifold?
Now, unfortunately, we have so far been dealing with manifolds that live in regular Euclidean space. The Hodge conjecture deals with manifolds that live in projective complex n-dimensional space (which has real dimension 2n), so here everything hits the fan. A manifold is non-singular if there are no ‘pointy bits’.
Hodge came up with a neat and elegant idea to tell if a homology class was equivalent to an algebraic cycle and this is in essence the Hodge conjecture. I will leave it to the interested reader to pursue a post-graduate degree in algebraic geometry if they want to understand more.
And there we go! If you made it this far, well done — give yourself a pat on the back.
The interested reader can have a peruse through The Millennium Problems: The Seven Greatest Unsolved Mathematical Puzzles Of Our Time by Keith Devlin. He gives a very good description of each of the problems.
The very interested reader can try to solve one! I dare you.
The terms machine learning and artificial intelligence are thrown around a lot these days. It has become fashionable to say that such and such a device has artificial intelligence. But what does this mean?
My objective is to give a non-mathematical description of how AI algorithms work. Many papers on the topic quickly get bogged down in heavy-duty mathematics, which I think obscures the beauty of the algorithms.
We start with a ‘normal’ algorithm. An algorithm is a set of instructions that, given an input, will give a determined output. A simple example is an adding algorithm: you input two numbers and the algorithm spits out the sum of those numbers. Another way an algorithm can be viewed is as a classifier. We can consider the space of all possible inputs and the space of all possible outputs, also called labels. Then we can see an algorithm as a function assigning each point in input space the appropriate point in output space. We call the classifier h which maps from the input space to the output space. So in the case of our adding algorithm, the input space is all pairs of numbers and the output space is all the numbers.
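As a toy illustration, here is the adding algorithm viewed as a classifier h from the input space (pairs of numbers) to the output space (numbers):

```python
def h(pair):
    # The 'adding' classifier: assigns each point of the input space
    # (a pair of numbers) its label in the output space (their sum).
    a, b = pair
    return a + b

print(h((2, 3)))  # 5
```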
Spice things up a bit
In the case of a normal algorithm, everything is determined. The classifier behaves in exactly the way we expect it to work. 2+3 will always come out as 5 (I have a degree in maths so I can say this with certainty). This algorithm is quite boring and easy to program. A programmer can write an explicit set of instructions on how to add two numbers together. But what if we consider something slightly more challenging? The classic example of number classification: I give you a hand written digit (0–9) and you have to tell me which one it is.
Hand written digit courtesy of the MNIST Dataset
I defy anyone to write an explicit set of instructions that can classify a hand written digit with more accuracy than just assigning labels at random (which would be right 10% of the time).
Interestingly enough, humans can do this without any issues — are we smarter than computers? This is a question best asked to a philosopher and is a major digression from the topic at hand but is worth thinking about.
Let’s get learning
This is where machine learning comes into play. We want an algorithm (or classifier) that gets fed training data and uses that to learn how to classify unseen images. An important feature of our classifier is that it must generalise from the data. We don’t need it to fit the data perfectly, but we want it to be able to handle unseen data with ease and accuracy.
The grey line is very accurate when it comes to the training data but does not generalise well. The red line however generalises well and we hope will be able to classify unseen data
The algorithm takes the image as input and will spit out a list of ten numbers between 0 and 1. Each number corresponds to a digit, the higher the number, the more confident the algorithm thinks it is that digit.
Hand written digit courtesy of the MNIST Dataset — h ‘correctly’ classifies an 8. We will see later that this has quadratic loss of 0.13 (2 d.p.)
The classifier works by taking in the image but also has a bunch of parameters that we can play with to try and classify unseen images correctly.
Training wheels
Training data is images along with the digit each one shows (that is, you already know the correct label). For example, it would be an image of the number four along with the list (0,0,0,0,1,0,0,0,0,0) (recall that our algorithm spits out 10 numbers between 0 and 1 for the certainty that it is that digit). The algorithm then takes in the image and spits out its result. We then need to tell the classifier how 'wrong' it was; we call this the loss. One conventional way of doing this is by adding up the squares of the differences, called quadratic loss. Say that the classifier spits out (0.43, 0.13, 0.05, 0.17, 0.57, 0.79, 0.70, 0.38, 0.93, 0.35). Then the loss would be 0.43² + 0.13² + 0.05² + 0.17² + (0.57–1)² + 0.79² + 0.70² + 0.38² + 0.93² + 0.35², giving a loss of 2.66 (2 d.p.). The aim, obviously, is to minimize the loss function: a low loss rewards the algorithm for accurately predicting what the number is, but also for correctly identifying what it isn't.
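The quadratic-loss arithmetic above can be checked with a short sketch (the function name is ours, chosen for illustration):

```python
# Quadratic loss for the worked example: the target digit is 4, so the
# "true" output vector has a 1 in position 4 and 0 everywhere else.
def quadratic_loss(predicted, target):
    # Sum of squared differences between prediction and target.
    return sum((p - t) ** 2 for p, t in zip(predicted, target))

predicted = [0.43, 0.13, 0.05, 0.17, 0.57, 0.79, 0.70, 0.38, 0.93, 0.35]
target = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]  # one-hot encoding of the digit 4

loss = quadratic_loss(predicted, target)
print(round(loss, 2))  # 2.66
```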
Minimize loss
So ideally we want to tweak the parameters of the classifier so that the loss function is zero on every piece of training data. This is problematic for two reasons:
We could be overfitting our data
It is actually bloody difficult to do this
So we at least want to get the loss function to be pretty small on average. To do this we will apply gradient descent to the average loss function in parameter space (those of you that have read my article on dynamical systems will be familiar with gradient descent).
Instead of simply computing the loss for a specific input and set of parameters, we will compute the average loss for a specific set of parameters by averaging over all the training data. Remember, the loss of a specific piece of training data is just a number, so taking the average is trivial (provided our data set is finite, mind you). In practice, it is inconvenient to take the average over all training data, so we randomly choose a few pieces each time. So now we will consider the average loss, where the input is now the parameters.
We look at the average loss function for every possible combination of parameters. In practice, we will only look at the loss function for parameters near those that are currently being used by our classifier.
Everything going downhill
The idea is that we look at how to make small nudges to our parameters that would have the biggest effect in decreasing the loss function: the direction of steepest descent of the loss. We nudge the parameters a bit in that direction, then recompute the steepest descent and repeat the process. The analogy to keep in mind is a ball rolling down a mountain. At each point, it goes in the direction of steepest descent to minimize its altitude. We can think of the ball's altitude as the loss (the higher it is, the higher the loss) and its latitude and longitude as the parameters. In this case there are only two parameters, but in practice there can be hundreds of thousands.
Gradient descent in action
After applying gradient descent to our parameters, we reach a local minimum of the loss function. That means we are doing the best we can for all nearby parameter values. If the loss at this minimum turns out to be quite high (that is, the classifier is not labelling inputs accurately) then tough luck. You could choose to start again with wildly different parameters and hope you don't fall into the same local minimum, and also hope that the one you do fall into is 'better' than the previous one. It turns out that finding global minima is quite difficult.
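The ball-rolling procedure can be sketched numerically. The two-parameter loss surface below is hypothetical, picked only so the minimum is known in advance:

```python
# A minimal numeric sketch of gradient descent on a toy loss with two
# parameters (the "latitude and longitude" of the ball-on-a-mountain analogy).
def loss(a, b):
    return (a - 3) ** 2 + (b + 1) ** 2  # minimum at a=3, b=-1

def grad(a, b):
    return 2 * (a - 3), 2 * (b + 1)     # partial derivatives

a, b = 0.0, 0.0          # initial (wild) guess for the parameters
learning_rate = 0.1      # size of each nudge
for _ in range(200):     # nudge, recompute steepest descent, repeat
    da, db = grad(a, b)
    a -= learning_rate * da
    b -= learning_rate * db

print(round(a, 3), round(b, 3))  # converges near a=3, b=-1
```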
Result!
Now we hopefully have a well-oiled algorithm that can classify hand written digits fairly confidently; by 'fairly confidently' we mean that the average loss is low.
To recap:
We start off with an algorithm that wildly guesses what each image is
We feed it an image that we know what the answer is
Then we compute how wrong it is
According to how wrong it is and other mathematical mumbo-jumbo we slightly adjust the parameters in the algorithm to decrease the wrongness
Rinse and repeat steps 2–4
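The recap above can be sketched as a toy training loop. The one-parameter 'classifier' and the training data below are hypothetical stand-ins for a real network:

```python
# The recap as a sketch: a one-parameter "classifier" h(x) = w * x
# learns the rule y = 2x from training pairs.
import random

training_data = [(x, 2 * x) for x in range(1, 6)]  # inputs with known answers
w = random.uniform(-1, 1)  # step 1: start with a wild guess

for _ in range(1000):                    # step 5: rinse and repeat
    x, y = random.choice(training_data)  # step 2: feed it a known example
    error = w * x - y                    # step 3: compute how wrong it is
    w -= 0.01 * 2 * error * x            # step 4: nudge w to decrease wrongness

print(round(w, 2))  # ≈ 2.0
```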
And that is it! That is fundamentally how machine learning works. This is the blandest sort of machine learning algorithm, called a neural network. We can always spice things up later, but these are the core principles of how most artificial intelligences learn.
The whole structure of how a neural network learns is somewhat similar to how a baby learns. A baby is inquisitive about the world around it, maybe it puts its hand on top of the stove and burns itself. This is considered a high loss. But the baby then adjusts its brain to tell it not to do that again. Similarly, if the baby does its business in the toilet, it will get a reward — a low loss. It will then learn to repeat that. And so a human is formed that behaves as expected and can generalise to unfamiliar situations: you don’t have to teach an adult to use the toilet in a house they have never been to before.
So in many ways AI mimics the way humans learn to try and teach a bunch of computer transistors how to ‘think’. But is this any different to a bunch of neurons thinking?
The Indonesian caves of Borneo Island offer an insight into the most primitive form of recorded communication. Some 40,000 years ago, before the evolution of written language, physical illustrations on cave walls were the most precise method of recorded communication. As time progressed, the methodology of record-taking evolved from cave paintings to complex alphabets, offering complete expression through the complex structure that is language. In the English language, ideas as concrete as a “tree” & as complex as “love” are expressed through the use of 26 letters, symbols that carry zero intrinsic value besides that which society has assigned them.
Information, within the context of this recently-established branch, is defined as the resolution of uncertainty (or surprise). Consider communicating the presence of a tree. With a preserved image, for example, information on the shape, colors, & even relative size of neighboring trees is all discernible. But is it perfectly precise? Only relatively so; a discrepancy is revealed when comparing the supplementary information provided by written language. With an unlimited number of words, language offers far more descriptive information not casually observed, such as where the tree was planted, by whom, & the type of soil it absorbs. Different communication methods imply a difference in the uncertainty communicated.
Known as the Father of Information Theory, mathematician Claude Shannon famously introduced a logical, mathematical way of measuring this difference in recorded communication methods: he defined it as entropy. Entropy adds a mathematical tool-set for measuring the relationship between the principles of disorder & uncertainty. A higher entropy value indicates a higher level of uncertainty of information; or, more practically, a higher number of possible outputs of a function. To arrive at the modern formula for measuring information, we first have to go back to the ~1920s, when one Ralph Hartley first built on the work of Harry Nyquist.
Early Entropy— A Measure Of Uncertainty
Information is defined as the resolution of uncertainty — if no questions are necessary to determine a value, there is no information being presented. Entropy is directly related to the information of a system. The higher the entropy, the more uncertainty is attributed to the definition of a symbol (a number, letter, etc…). Entropy, or uncertainty, is mathematically at a maximum when a symbol could equally have many meanings (uniform distribution). The simplest unit of entropy is when a symbol could equally have two meanings. A coin-flip for example has this binary property — it’s either heads or tails.
Originally Published — https://www.setzeus.com/
In the graph above, the y-axis, H(x), represents entropy for a symbol with two possibilities (again, binary); notice that entropy, or uncertainty, is at a maximum (equal to 1) in the middle of the graph. This intuitively checks out: when we flip a coin, for example, the uncertainty of the value it'll land on is maximized while it's in the air. On the flip side, entropy is at a minimum (equal to 0) at the opposite ends of the x-axis, where we know the current value of the binary symbol. With that in mind, it's worth stating that coin flips, yes/no questions, Booleans & binary digits (0 or 1) are all mathematically equivalent; they are all represented in information theory by the graph above.
This Symbol With A Binary Property, However We Want To Refer To It (a “Bit”), Is The Base Unit Of Entropy Used In All Of Information Theory
Chronologically, the term “bit” wasn't adopted until much later in history; however, we're adopting it now for ease of understanding as we work our way up to modern information theory. A fun trivia fact: the word “bit” is short-hand for binary digit.
Reasonably, it follows that an intended message with a single-step solution (a single bit/coin-flip/yes or no/Boolean of uncertainty), has a lower entropy value than that which requires two, or more additional steps. For example, let’s consider a fair coin flip. When trying to determine whether the result is heads or tails, a single bit will suffice to decipher the value: “was it heads?” Either way, we are certain of the return value with a single response. This problem has an entropy value of one because the answer can be determined in a matter of one step:
When more complex problems are considered, it’ll likely take an additional number of Booleans to determine the answer with certainty. We now know that communicating a single symbol, the simplest symbol, a bit, yields an entropy value of one.
What If I Sent A Message With Three Symbols (All Bits)?
Continuing the example above, let's say we now wanted to send a three-symbol character with three coin flips. The number of total possible messages is now eight (2³), but the entropy of this message has a value of three. Remember, entropy is not the number of message possibilities; rather, it's the minimum number of yes/no questions required to determine the message.
What If I Sent A Message With A Single Non-Binary Symbol (Character)?
Now say we want to send a single character, that means our symbol could be any one of twenty-six letters (1/26) — so what is the entropy value this time around? Essentially, how many yes/no questions, do we have to answer to determine the value of our letter, say J?
The most identifiable starting point is to simply ask, in alphabetical order, one character at a time: “is X the intended letter?” For example, is A the letter we're trying to encode? How about B? Or C?…With this method, on average, a single symbol would take us 13 “yes/no” questions to determine (or have 13 bits of entropy). However, there's a big caveat here: measuring information requires the most effective way of assigning value to symbols. Instead of individually searching character-by-character, we borrow a concept from the binary search algorithm. Again searching for the letter J, we instead ask: is J (our intended character) in the first half of the alphabet? Yes. Next, we split the remaining array: is our character in the first 6 characters (A–F)? No. And so on & so forth until we identify our symbol, J. Notice that this way of assigning value to a symbol using Booleans is much more effective; instead of an average of thirteen (13) yes/no questions, with this second method, we never need more than five (5) yes/no questions (occasionally only four (4)).
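The binary-search questioning can be sketched in a few lines (the helper name is ours, chosen for illustration):

```python
# Halve the remaining candidate letters with each yes/no question
# until only one letter is left.
import string

def questions_needed(letter):
    candidates = list(string.ascii_uppercase)  # 26 letters
    count = 0
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        count += 1                 # one yes/no question asked
        if letter in half:         # "is it in the first half?"
            candidates = half
        else:
            candidates = candidates[len(candidates) // 2 :]
    return count

print(questions_needed("J"))  # 5 — never more than five questions
```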
Through this observation we see that, theoretically, each additional bit of entropy doubles the number of possible symbols in the message space. With this relationship understood, calculating the entropy of a single alphabet character under a uniform distribution is straightforward:
The formula above arrives at the approximate amount of entropy (4.7 bits) associated with sending a single, random, alphabet character. As Ralph Hartley concluded in his eminent paper, Transmission of Information:
What We Have Done Then Is To Take As Our Practical Measure Of InformationThe Logarithm Of The Number Of Possible Symbol Sequences
His early definition of information differed slightly in notation. He defined information as H = n log (s), where H is the information, n is the number of symbols in the message (letters, numbers, etc…), & s is the number of different symbols available at each selection (essentially the size of the symbol set).
Credit Khan Academy
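Hartley's H = n log(s) is easy to sketch in code (the function name is ours, chosen for illustration):

```python
# Hartley's measure: H = n * log2(s), where n is the number of symbols
# in the message and s is the size of the symbol set.
import math

def hartley(n, s):
    return n * math.log2(s)

print(round(hartley(1, 26), 1))  # one random letter: ~4.7 bits
print(hartley(3, 2))             # three coin flips: 3.0 bits
```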
Still not quite at modern information theory, we’ve learned a solid amount:
Information Theory provides a way to quantify the amount of surprise for a communication event
Entropy, or uncertainty, is a measure of the minimum amount of yes/no questions that are required to determine a symbol value
Established that the binary digit, the bit, has an entropy value of 1 & is therefore the base unit within this field of math
Derived an early function of information that assumes a set of uniformly distributed symbols values
Up to this point, we've assumed that each value in a symbol set is equally likely; however, that's a convenient oversimplification. As we know from real life, the theory isn't quite this neat & all symbol values aren't created equally. There are organic, measurable patterns to our language & other forms of communication. As a quick thought experiment, count the number of times the letter “e” shows up in the previous sentence: does that distribution of characters look like a uniform distribution of 1/26?
When Symbols Aren’t Random — Markov Chains
Fast forward to 1948, when the father of modern information theory, Claude Shannon, proposed in his groundbreaking paper, “A Mathematical Theory of Communication” that there are patterns in communication that can be exploited to output the same message (value) in fewer steps (bits).
Written Language Offers Patterns That Make The Next Value In The Sequence More Predictable Due To Their Prior Values.
In other words, the prior value leaves the next value with less uncertainty (entropy). The best example of this is the predictability of a ‘U’ appearing after a ‘Q’ in English writing. If ‘Q’ is followed by ‘U’ 90% of the time, the potential outcome of the next letter is no longer uniform across the entire system; it skews towards the value of ‘U’ at a rate of 90%.
This creates a system in which the next value depends on the previous one. Russian mathematician Andrey Markov formalized this in his groundbreaking work on what are now called Markov Chains. In it, he showed that the probabilities of future values that depend on previous events are fixed, & that over the continuous running of such a system, the outcomes will match their statistical probabilities.
Letter Frequency Analysis
Remembering the dependency that ‘Q’ has on ‘U’, with the 9/10 probability of the return value being ‘U’ (p(x)), the entropy, or uncertainty, of a ‘U’ appearing after a ‘Q’ is H(X) = 0.13 bits. The entropy for any value of a completely randomized alphabet, one with complete independence, is H(X) = 4.7 bits. With this established pattern & the dependence ‘Q’ exhibits on ‘U’, the entropy is reduced by an astounding 4.57 bits. Instead of cutting the alphabet in half on the first step, the question that reveals the most information is: “Is the value ‘U’?” 90% of the time this will be true, & only one bit is spent. This allows for the removal of unnecessary questions, in turn lowering the total entropy of the system. Through the computer analysis of millions of written works, the standard probabilities of each letter within the English language have been devised. These standard probabilities can be used for independent events; taking dependent value outputs into consideration (Markov Chains), relative frequencies have also been established for letters following other letters.
With this realization, Shannon modernized information theory by evolving Hartley's function. With a set of random, uniform values X, we calculate the entropy of encoding a single symbol with the log (base 2) of X. With a set of related, interdependent values X, we calculate the entropy of a single symbol by adding up the individual entropy contributions of each possible symbol value in the symbol set:
However, the formula above assumes uniform distribution, which through Markov chains we now know to be false. In order to account for this, we need to amend the formula to multiply by the probability of each symbol value x, p(x):
Originally Published On https://www.setzeus.com/public-blog-post/intro-to-information-theory
Last, we need to replace the n inside the log function. We're calculating the number of yes/no questions required to arrive at each individual character, rather than the aggregate outcome. Continuing with the alphabet example, we know from the Markov chain above that it doesn't take an equal number of bits to guess whether a character is an e versus a z. Therefore, for each term of the sum, we need the effective number of possibilities for that specific outcome, which is not 26 but rather 1 / p(x):
The formula we've now derived is the very same one that Claude Shannon derived, & it catapulted him to his status as one of the fathers of information theory. He slightly rearranged the variables above; this equation is at the very core of modern information theory:
Credit Ian Stewart
H(x) is entropy, the measure of uncertainty associated with the variable X. p(x) is the probability of outcome x of variable X. And log base-2 of 1/p(x) is the number of bits required to decode the outcome x of variable X. Again, the base unit here, what H(x) is measured in, is bits.
Theoretically, Shannon’s updated formula, which leverages the principle behind Markov chains, should decrease the entropy in a set of symbol values since we’ve moved away from uniform distribution. Once again returning to the alphabet, we can confirm this through an example. Recall in an earlier example we calculated that H(x) = 4.7 for a single alphabet character. Let’s compare that to H(x) calculated with the updated formula:
The Markov %s Used Here Are The Values Seen In The Graph Above
As seen from the sum in the bottom-right, this intuition indeed checks out with a final entropy value of H(x) = 4.18. With this general formula for entropy, since the 1950s, information theory & its applications have only grown faster.
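Shannon's formula is equally short in code. As a sanity check, it reproduces the fair coin's 1 bit and the uniform alphabet's ~4.7 bits (the function name is ours, chosen for illustration):

```python
# Shannon's formula: H(X) = sum over x of p(x) * log2(1 / p(x)).
import math

def shannon_entropy(probabilities):
    return sum(p * math.log2(1 / p) for p in probabilities if p > 0)

print(shannon_entropy([0.5, 0.5]))               # a fair coin: exactly 1.0 bit
print(round(shannon_entropy([1 / 26] * 26), 1))  # uniform alphabet: 4.7 bits
```

Feeding in measured English letter frequencies instead of the uniform 1/26 is what brings the value down from 4.7, as in the comparison above.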
Onward To Applications
The concepts learned here, built from the mathematical bit up, are recognizable & powerful in the digital era. Likely one of the most widespread & influential uses of information theory is the role it plays in lossless data compression. This is a form of compression utilized in database records, word files, image files, & video files. In this form of compression, data can be perfectly reassembled to its original state from the compressed file. Utilizing the principles of Shannon's entropy & Markov Chains, information can be retrieved without fault. This ability to compress has allowed for the mass production of devices that hold unfathomable amounts of data. This is most impressive in the case of music files. The early days of recorded music relied on vinyl records, an uncompressed format of information; a vinyl record could store a single album in a lossless, uncompressed state. With modern compression technology, music files are compressed down to the bits pertaining to pitch, volume, resonance, etc. (popular music formats such as MP3 go further still & use lossy compression, discarding detail the ear is unlikely to miss). Another powerful example is the increase in hardware storage capability, again an outcome powered by the mathematical principles outlined here.
The digital age would only be a distant dream without Shannon’s contribution to the world of communication. In similar fashion to every other under-recognized mathematical principle, information theory plays a vital role in our day-to-day functions.
Captivated by the motion of waves & mesmerized by the perfectly-imperfect symmetry of leaves, it's crystal-clear we're a pattern-seeking species. Externally, studies have shown that we use patterns to weigh our environment for danger: a disruption in our daily routine (particularly back in hunter-gatherer societies) signals to our consciousness that something is off. Internally, patterns are inscribed in our DNA; in an energy-conservation effort, most biological processes are duplicative & are therefore likely to generate a visual form indicative of patterns.
In math, the branch centered around the study of continuous, patterned, yet irregular scalar symmetry is known as Fractal Geometry.
Published In — https://www.setzeus.com/
The story of modern fractals traces back to the 17th century, when Rene Descartes first introduced the concept of graphing a polynomial function. As elegant & as evolutionary as such graphs were, we (ie — the math community) eventually noticed a glaring issue: they failed to easily & predictably trace out the patterns noticed IRL. Using absurdly-long & convoluted polynomials in an attempt to graph the examples shown above highlighted obvious shortcomings.
They were missing a key, required shift in frame of reference. The trick to mathematically mapping organic patterns is not to analyze them as they are, but rather to think of what it took to produce them. They were missing the principle of iteration.
Evolutionary forces carve out the most effective processes for a species & then repeat that process — thus, nature is commonly, organically generative. A more familiar, universal example of this is the famed Fibonacci sequence seen in sunflower patterns & spiral shells. A core reason for its popularity is that its intricacy is generated by iterating a stunningly-simple sequence:
The Fibonacci sequence is the simplest example of an iterative function, but it does the trick in highlighting its generative nature; for each new iteration, the input used is simply the output from the preceding iteration. It's a basic but illustrative example of the generative process behind fractals.
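The iteration described above can be sketched directly, feeding each output back in as the next input:

```python
# The Fibonacci sequence as an iterated function: each new term is
# produced from the outputs of the preceding iterations.
def fibonacci(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b  # the outputs of one iteration become the next inputs
    return a

print([fibonacci(i) for i in range(10)])  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
```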
Fractals, the crux of fractal geometry, are infinitely complex & detailed patterns that are self-similar across different scales; they’re mathematical objects created by recursions of functions in the complex space. As we’ll see shortly, fractal geometry brings us much closer to replicating the irregularities & intricacies that surround us.
We’ve covered the concepts of iterating a function (ie — using its previous output as the next input), however, before moving on to Julia’s contributions, it’s worth covering the basics of complex functions.
Quick Prerequisite
A complex number, as hopefully most remember, is a two-part number with a real & an imaginary component: a + bi. When seen in applied exercises, the short-hand notation for these two-part complex numbers is the single letter: z.
Complex Numbers & Polynomials
When graphed, complex numbers use a different coordinate system, the complex plane; logically, the plane runs along two expected axes: real & imaginary.
Complex Plane & Example Complex Numbers
Seen above, the coordinate system is no longer Cartesian along an x-&-y axis, but rather structured along the real & imaginary axes. Graphing complex numbers, once algebraically separated into their two parts, is straightforward.
Magnitude Of Complex Numbers
Complex numbers, much like their counterpart, the real numbers, share a property of magnitude, or “size.” However, while the magnitude of a real number is immediately identifiable as its absolute value, arriving at the size of a complex number is a bit more involved:
Seen above, we turn to the good-ole Pythagorean Theorem; the relative “size” of a complex number is the shortest distance between the point & the origin in the complex plane.
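A quick sketch of the Pythagorean computation (the helper name is ours; Python's built-in abs on a complex number does the same thing):

```python
# The magnitude ("size") of a complex number a + bi is sqrt(a^2 + b^2),
# the distance from the point to the origin in the complex plane.
import math

def magnitude(a, b):
    return math.sqrt(a ** 2 + b ** 2)

print(magnitude(3, 4))     # 5.0
print(abs(complex(3, 4)))  # Python's built-in abs gives the same value
```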
Under Iteration, Limits & Orbits
As we’ve foreshadowed & will shortly see, the entire behavior for fractal geometry derives from iterating a complex function (specifically z² + c). The property we care most about is magnitude; analogous to approximating the limit of a natural function as X approaches infinity, the fractal concept known as “escaping” tests if a complex function orbit escapes from the origin (aka, the magnitude approaches infinity under iteration).
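The escape test can be sketched as follows; the iteration cap is an arbitrary choice of ours, while the escape radius of 2 is the conventional one for z² + c:

```python
# "Escaping": iterate z -> z^2 + c and check whether the orbit's
# magnitude blows up toward infinity.
def escapes(z, c, max_iterations=100):
    for _ in range(max_iterations):
        z = z * z + c
        if abs(z) > 2:  # once |z| exceeds 2, the orbit provably diverges
            return True
    return False

print(escapes(0j, 0j))      # False: the orbit stays at the origin
print(escapes(0j, 1 + 0j))  # True: 0 -> 1 -> 2 -> 5 -> ... escapes
```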
Prequel — Gaston Julia
The prequel to modern fractal geometry begins in the early 1900s with a young protagonist by the name Gaston Julia. A curious collegiate student fascinated with music & mathematics, he was particularly drawn to complex numbers & functions.
Gaston Julia
His contribution to modern fractal geometry started when his attention was piqued by the 1879 paper by Sir Arthur Cayley, The Newton-Fourier Imaginary Problem. In it, Cayley sought out the roots of the equation f(z) = z³ + c = 0 using the Newton-Raphson iterative method. Since there are three roots, he wondered if one could predict which of the three roots a given starting value of z would reach as a limit. He failed in his quest & left readers with the open challenge: “appears to present considerable difficulty.” While Cayley’s solution to finding “basins of attraction” for each root through iteration was indeed on the right track, the technology of his time limited his perspective; seen below, when iterated, the shape of these basins is infinitely complex — providing us with an accurate preview of the scalar-symmetry we see in fractals:
Basins Of Attraction: Z³ + C
Unfortunately, life abruptly interrupted Gaston's endeavors midway through his academic career at the University of Paris when he was drafted & conscripted into the army for World War I. Compounding his misfortune, in 1915 he lost his nose & was nearly blinded; awarded the Legion of Honour for his valor, Julia, unfortunately, had to wear a black strap across his face for the rest of his life.
Released from service at the young age of twenty-five, Julia triumphantly returned to his favorite topic: the iteration of complex polynomial functions. In an impressive & successful effort, he published a massive 199-page memoir on the subject in 1918. Titled Mémoire sur l'itération des fonctions rationnelles, it was published in the eminent Journal de Mathématiques Pures et Appliquées, won him the annual Grand Prix award from the French Academy of Sciences, & temporarily catapulted him to academic fame. The publication, along with contributions by Pierre Fatou, is the crux of fractal geometry; though the results were quickly forgotten in their time, they incontrovertibly laid the foundations for future mathematicians.
Julia Sets
The groundbreaking dissertation drew particular attention to a crucial distinction between iterations of points (or orbits) that converge to a repeated pattern versus those that escape (aka their magnitude goes to infinity). Those that repeat after multiple iterations, typically marked by their mesmerizing visual patterns, are known as Julia Sets. The mathematical inverses of Julia sets, Fatou Sets, arise from iterated complex functions that lead to escaped orbits or clusters of orbits. Before moving on, it's worth clarifying that the formula here doesn't deviate from a complex binomial with only a changing constant (though remember these are complex numbers, so they both have two parts):
Now, the Julia Sets below are for our visual benefit, but they were not constructed by Gaston Julia himself, as they require modern computing; this, however, makes Gaston's accomplishment appear all the more impressive, as it suggests that he mentally “viewed” the mesmerizing intricacy of bounded, iterated sets through manual work or thought experiments.
Six Julia Sets With Two Fatou Sets — Can You Tell Which Is Which?
Hopefully, the above provides a concrete example of the difference between these two types of sets; they’re all Julia sets except for the two Fatou sets on the bottom row. Again, each of these visualizations is simply a result of iterating the function Z² + C with different values of C (shown above). As exciting as these initial findings were, they fizzled out of the mainstream & further research remained dormant.
A quick thought experiment before we move on: the examples represent only eight different values of C out of the entire complex plane. What would happen if we were to iterate every single point in search of Fatou & Julia sets?
Enter The Fractalist — Benoit Mandelbrot
Progress remained stagnant for decades. In fact, the now-accepted term fractals wasn't coined until the 1970s, when one Benoit Mandelbrot brought the topic back to life.
Appropriately, Benoit's entry to the field was inspired by nature; the first relevant publication for modern fractals is his 1967 investigative report, How Long Is the Coast of Britain? Statistical Self-Similarity and Fractional Dimension. In it, he explores geographical curves & discovers that while they're undefinable, they are:
statistically “self-similar”: each portion can be considered a reduced-scale image of the whole. In that case, the degree of complication can be described by a quantity D that has many properties of a “dimension,” though it is fractional.
Mandelbrot, who was born in 1924 in Poland, had read the work of both men & had studied under Julia in the 1940s, so he knew where to turn. Shortly after his observations on the coast of Britain, he began using a computer to map out Julia sets, both in search of possible matches & to confirm his suspicion that Julia sets were marked by this fractal property. Mandelbrot observed & tracked the result of iterating each constant (C) in the complex plane by placing a white or black dot depending on whether the function converged to a Julia or a Fatou set.
The resultant diagram, seen above, is today known as the famous Mandelbrot Set. There are many variations with rich colors & captivating gradients; however, the original diagram was a simple black-&-white rendition. For a more mathematical definition, the Mandelbrot set marks the set of points in the complex plane such that the corresponding Julia set is connected; or, in other words:
The Mandelbrot set is a dictionary of all Julia sets
Original Image Credit To Paul Bourke
The visual above is the crux of fractal geometry. Tying it all together, it highlights the relationship between the Mandelbrot Set & all the corresponding Julia Sets within. Every dot in black (within the set) connects to a continuous Julia set, yet the middle lowest red dot, stemming from the white (outside of the set), connects to a cluster, a Fatou set.
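The dot-placing procedure described above can be sketched for a single constant c (the function name and iteration cap are our choices for illustration):

```python
# For each constant c, iterate z -> z^2 + c starting from z = 0 and mark
# whether the orbit stays bounded (black dot, connected Julia set) or
# escapes (white dot).
def in_mandelbrot_set(c, max_iterations=200):
    z = 0j
    for _ in range(max_iterations):
        z = z * z + c
        if abs(z) > 2:  # escaped: c is outside the set
            return False
    return True         # stayed bounded: c is (very likely) inside the set

print(in_mandelbrot_set(0j))       # True: the origin is in the set
print(in_mandelbrot_set(-1 + 0j))  # True: the orbit cycles 0, -1, 0, -1, ...
print(in_mandelbrot_set(1 + 0j))   # False: the orbit escapes to infinity
```

Sweeping c over a grid of points in the complex plane and plotting the True/False results reproduces the familiar black-&-white diagram.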
In Closing
Mandelbrot subsequently displayed images of the set & elaborated on its significance in speeches, papers & books. Much like Julia's publication nearly a half-century earlier, the Mandelbrot Set rocketed Benoit Mandelbrot to academic fame; consequently, the renewed interest established fractal geometry & eventually set the foundation for additional branches of math (such as chaos theory).
I won't explore the philosophy here, but it's nothing short of poetic irony that it took imaginary numbers to mathematically map out reality. There is order behind the chaos; as Julia & Mandelbrot taught us, all we need to do is shift our perspective. This is hardly the end, however; next, we'll delve into some semi-original work as we attempt to bring significance to the Mandelbrot Set in three dimensions…
A story on reproduction in biology, machines, and AI
25 Nov 2019 — 13 min read. Image: Wikipedia (https://en.wikipedia.org/wiki/Self-replicating_machine)
We are already used to seeing the impact of AI everywhere around us, not just in digital life (from recommendations on Netflix to voice recognition by Siri and Alexa) but also in the physical world: Amazon Go, CCTV surveillance, and driverless taxis, and it has indeed made our lives better. What is a bit disappointing is that this “intelligence” doesn’t seem anything like human, or even biological, intelligence: it is just a set of mathematical models whose degrees of freedom have been fit to some empirical observations. What do we expect from an intelligent creature? Apart from raw intelligence, probably a sense of humor, compassion, the ability to interpret its own decisions and, of course, the ability to reproduce. Everything but the last point has already been successfully implemented in some algorithmic form, which is why this article will focus on trying to build an AI (an artificial neural network, here) that is able to copy itself, i.e. is capable of self-replication, the simplest form of reproduction.
We will review the history of self-replicating machines and the corresponding concepts, and we will code our own neural networks that are able to replicate themselves. During this research, some very interesting but natural questions arise: is self-replication even useful for the algorithms? How do they balance reproduction against the other tasks we want to teach them? Can we, as humans, learn something from this experiment? This article is highly influenced by the work Neural Quines. As always, you can find all the source code and try the experiment by yourself in my GitHub.
Self-replication in nature
DNA replication http://jonlieffmd.com/blog/dna-proofreading-correcting-mutations-during-replication-cellullar-self-directed-engineering
As Wikipedia tells us, in nature the process of replication is essentially the reproduction:
Reproduction (or procreation or breeding) is the biological process by which new individual organisms — “offspring” — are produced from their “parents”. Reproduction is a fundamental feature of all known life; each individual organism exists as the result of reproduction. There are two forms of reproduction: asexual and sexual.
The creation of “offspring” that differ from their parents is a topic for another discussion, but self-replication was actually the very first biological act of reproduction. It was the very dawn of evolution, and natural selection did the job, favoring those creatures that were better at replicating themselves. Thanks to randomness and mutation, the first simple cells appeared. Everything else is history.
Self-replicating machines
Rather scary mechanical self-reproduction from Cornell University
One of the first to think about such a machine was the celebrated John von Neumann, who imagined an automaton able to create its offspring, or at least to copy itself. As a base model he selected (pretty obviously for his time) a Turing Machine and tried to build such an automaton, which he called the Universal Constructor A. The very first iteration looked like the following:
Von Neumann’s Theory of Self-Reproducing Automata: A Useful Framework for Biosemiotics?, Dennis P. Waters
Von Neumann meant to create a Turing Machine A that creates another automaton X based on the instruction Ф(X), but he realized that this is not enough for reproduction: apart from creating the automaton X, you also need to create the instruction Ф(X) for the future “babies”.
The next step he called the Universal Copier B:
Von Neumann’s Theory of Self-Reproducing Automata: A Useful Framework for Biosemiotics?, Dennis P. Waters
This is basically a “Turing Quine” that literally outputs its own source code, in our case the instruction Ф(X). What’s next? Exactly: somehow we need to combine the Universal Constructor A and the Universal Copier B. Unfortunately, we can’t just combine them in the (A + B) mode, because then what happens if the description Φ(A + B) is input into the automaton (A + B)? The result should be an extra copy of both the automaton and its description.
Again, Von Neumann envisioned a third control automaton C with executive control over the activities of (A + B), ensuring that the symbolic input description Φ(A + B) is used in the correct role at the correct time, i.e. either to construct or to copy. Check the paper for the explanation of why three sub-assembly automata (A + B + C) are actually enough for the simplest self-replicating Turing machine; it is easy to follow.
Why do we need self-reproducing AI?
Seems like the idea of self-replication makes sense in nature (thanks, Cap) and in algorithms on different scales. But why would we need it in AI programs?
If biological life began with the first self-replicator, and we are already inspired by biological methods in creating artificial intelligence, shouldn’t we kick off from the same starting point?
The latest research in neural networks is very focused on learning the right representations, which is why many hypotheses for influencing the weights, such as generative or multitask modeling, are popular today. Shouldn’t self-awareness and self-replication be such a representation goal as well?
It is also important for security: for instance, an AI program that can output its own “source code” could block other attempts to read the weights from a file or reverse-engineer it somehow.
Another nice property of a self-reproducing system is the ability to self-repair from the damages as some living creatures already do.
Apart from the technological challenges, humans tend to like playing all-mighty gods, and creating something that is able to recreate itself can look attractive from the “creating artificial life” point of view.
Quines
Talking about computer programs, those that are able to replicate themselves (i.e. to output their own source code) have been known for some time as quines. For example, in Python, a program that outputs itself may look like the following; you can find more theory and sophisticated examples here:
s = 's = %r\nprint(s%%s)'
print(s%s)
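We can check that these two lines really form a quine by executing them and comparing the captured output with the source itself (a quick sketch of mine):

```python
import io
import contextlib

# The two-line quine as a string (straight quotes; %r and %% are essential)
src = "s = 's = %r\\nprint(s%%s)'\nprint(s%s)"

# Run the program and capture everything it prints
buf = io.StringIO()
with contextlib.redirect_stdout(buf):
    exec(src)

# The output is exactly the source, plus print's trailing newline
assert buf.getvalue() == src + "\n"
```

The trick is that %r inserts the repr of the string into itself, so the format string both describes and reconstructs the whole program.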
Talking about neural networks and machine learning models, we can consider as a neural quine a model that is able to reproduce all of its own parameters and structure very accurately. That means a multilayer perceptron has to be able to return all of its own weight matrices and bias vectors.
The name “quine” was coined by Douglas Hofstadter in his popular science book Gödel, Escher, Bach, in honor of philosopher Willard Van Orman Quine (1908–2000), who made an extensive study of indirect self-reference, and in particular for the following paradox-producing expression, known as Quine’s paradox: “Yields falsehood when preceded by its quotation” yields falsehood when preceded by its quotation.
In our case of machine learning quines, several challenges lie ahead:
How to define a mathematical setup that will allow a neural network to predict its own weights?
How do we make such a neural network solve other, useful tasks while remaining able to reproduce itself?
Having built such a model, what does this self-replication mean? Does it help the training? What do we learn from it?
Neural Quines
Alright, so we have agreed that a neural network that predicts its own weights is a neural quine. What is our next step? Right: to predict the weights directly, for example as a set of matrices with a mean squared error as the loss function… but there are several problems. First of all, the size of the output will be many times larger than the input size, which by itself leads to difficulties in convergence. Another issue arises with multilayer networks: should we set this up as multitask learning, with each separate weight matrix as a separate task? And what should we do with convolutional or recurrent architectures? It seems we need some indirect way to predict the weights.
The idea described in the original Neural Quines paper is brilliant indeed and is inspired by HyperNEAT, a neuroevolution method for designing neural architectures. Instead of predicting all the weights in the output at the same time, we can use the coordinate (or index) of each weight as the input and output a single number that represents the value of that weight. This way we ensure the smallest possible dimension for both the input and the output, without any constraint on the inner architecture and, hence, on the weights! Schematically, the architecture will look like the following:
The architecture of the vanilla neural quine
But what should we do when, apart from self-replication, we would like to solve some additional classification or regression problem? Thanks to the flexibility of graph-based neural architectures and the ideas of multimodal and multitask learning, we can do that as well:
The architecture of the auxiliary task neural quine
Mathematically, we have a loss suitable for regression that measures the deviation between the actual weights of the model and the predicted ones,
Weights reproduction loss for the regular neural quine
and in the case of solving an additional task next to self-replication (the authors call it auxiliary), we can add it in the multitask manner with some regularization coefficient:
Total loss for the auxiliary neural quine
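The equation images are not reproduced here; a reconstruction consistent with the surrounding description (the symbols ŵ and λ are my own notation, not necessarily the paper's) would be:

```latex
\mathcal{L}_{\text{quine}} = \sum_i \left( \hat{w}_i - w_i \right)^2,
\qquad
\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{task}} + \lambda \,\mathcal{L}_{\text{quine}}
```

Here ŵᵢ is the network's prediction for its own i-th weight wᵢ, and λ trades off the auxiliary task against self-replication.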
In PyTorch code, the architecture will look like the following (also, check the source code on GitHub):
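Since the embedded gist is not reproduced here, below is a sketch of mine of the vanilla quine idea (class and method names are my own, not the article's reference code): the network receives the coordinate of one of its own weights and outputs a single predicted value for that weight.

```python
import torch
import torch.nn as nn

class VanillaQuine(nn.Module):
    def __init__(self, coord_dim=2, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(coord_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # one scalar: the predicted weight value
        )

    def forward(self, coords):
        return self.net(coords)

    def weight_coordinates(self):
        # Enumerate (coordinate, actual value) pairs over every parameter:
        # here a coordinate is simply (parameter index, flat index)
        coords, values = [], []
        for p_idx, p in enumerate(self.parameters()):
            for flat_idx, v in enumerate(p.detach().flatten().tolist()):
                coords.append([float(p_idx), float(flat_idx)])
                values.append(v)
        return torch.tensor(coords), torch.tensor(values)

model = VanillaQuine()
coords, actual = model.weight_coordinates()
predicted = model(coords).squeeze(-1)
sr_loss = torch.mean((predicted - actual) ** 2)  # self-replication loss
```

Training would minimize `sr_loss` (note the subtlety that the target weights move as the network updates them); the auxiliary variant adds a task head and sums the two losses.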
To learn more about multitask learning and its benefits, please check out this article of mine; it shows some examples in time series analysis and computer vision alongside the source code!
Experiments
To test the idea, I chose relatively simple multilayer perceptrons with a single hidden layer. As an auxiliary classification task, I played with the MNIST dataset, with pictures rescaled to 8×8 pixels to speed up convergence and keep the number of weights to be replicated very limited (
Using no more than high school algebra, here’s how to solve the infamous question 6 from the 1988 International Mathematics Olympiad.
Motivation
This problem has a reputation for being one of the hardest, and perhaps the hardest, IMO problem of all time. And you can solve it using only high school algebra. Struggling through and understanding the solution will do several really cool things for you, but will take time, so motivation is essential.
First, you’ll understand that you can help advance mathematics. I spent several days solving this problem, whereas in the competition they have 4 and a half hours for 3 questions. But I’m not amazing at mathematics — I’m an undergraduate economist. I do maths for fun, and I didn’t give up. You don’t need to be a genius to be a good mathematician. If you work hard, you will be able to solve really hard problems and, maybe one day that will be solving hard research problems. In real life, it doesn’t matter if you are slower than the super geniuses, because there aren’t enough super geniuses to go round. We need ordinary people to work hard and solve problems. Rome wasn’t built in a day, and nor was it built by one person!
Second, you’ll get a taste for what mathematicians do when approaching a problem and the language they use. Normally it takes well into an undergraduate degree to see this because the material is advanced. I have written my solution so that it uses no more than algebra — that’s right, no dreaded calculus or anything! — so that it’s accessible. The solution is then broken down into these things called ‘Lemmas’ and ‘Corollaries’. Lemmas are where we solve a smaller sub-problem which will help solve the big problem. Corollaries are where we look at some implications of using the Lemmas.
The way mathematicians lay things out makes sense. The idea is that we break everything into manageable steps so I, as the problem-solver, can make incremental progress and so you, the reader, can understand every step of the proof. If you can’t understand it, give it a good 10 minutes or so, get pen and paper, and try and figure it out. Ask me in the comments if you’re still stuck, and I’ll reply to you or update my solution to improve the exposition.
Manageable steps is the reason why everyone can work hard and get good at math. If you can’t solve the problem, you play around and solve smaller, perhaps relevant, statements.
I’ll leave exercises throughout. It’s up to you whether or not to use them. If you do work through them, you’ll understand the solution better. If you are stuck on one and aren’t making any progress, go back to it a bit later.
Statement of the problem
IMO proof 1988, question 6
Prove that if a and b are positive integers such that ab + 1 divides a² + b², then x = (a² + b²)/(ab + 1) is a square. For instance, a = 3, b = 27, x = 9 works. So does a = 125, b = 3120, x = 25.
Exercise 1. Can you find any other examples of three integers which work? Can you spot any patterns?
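Exercise 1 can be explored with a quick brute-force search (a sketch of mine, not part of the original solution, assuming the equation in the problem statement is the standard IMO 1988 Q6 expression x = (a² + b²)/(ab + 1)):

```python
import math

def find_solutions(limit):
    """Return all triples (a, b, x) with 1 <= a <= b < limit
    where x = (a^2 + b^2) / (a*b + 1) is an integer."""
    sols = []
    for a in range(1, limit):
        for b in range(a, limit):
            num, den = a * a + b * b, a * b + 1
            if num % den == 0:
                sols.append((a, b, num // den))
    return sols

sols = find_solutions(100)
print(sols)
# Patterns to spot: every x found is a perfect square,
# and pairs of the form (n, n**3) always work, giving x = n**2.
assert all(math.isqrt(x) ** 2 == x for _, _, x in sols)
```

Running the search also surfaces pairs like (8, 30) that are not of the (n, n³) form, hinting at the chains of paired solutions the lemmas below exploit.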
Lemma 1
Solutions are paired. Finding one of a pair enables you to find the other.
Imagine you have a list (or ‘set’) of all pairs of integers (x, a).
Doing some algebra, we can see
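The algebra referred to here can be reconstructed from context (the original equation images are missing; this assumes x = (a² + b²)/(ab + 1) as in the problem statement). Treating the defining equation as a quadratic in b:

```latex
x(ab + 1) = a^2 + b^2
\;\Longrightarrow\;
b^2 - xab + (a^2 - x) = 0
\;\Longrightarrow\;
b = \frac{xa \pm \sqrt{x^2a^2 - 4a^2 + 4x}}{2}
```

The expression under the square root is the quantity that must be a square for b to be an integer.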
Exercise 2. Can you work out how I got some or all of these results?
Now we whittle it down to all pairs of integers such that
is a square. Checking the potential even and odd pairings of x and a, we can work out that b will be an integer. (As we are just checking even and odd parities, below I only look at whether oddness or evenness results after an operation.)
Exercise 3. Why is the square root of an odd number odd? Why is the square root of an even number even? Why is an even number divided by two guaranteed to be an integer?
If we find an (x, a, b) that works, all we need is (x, a) to deduce both values of b which work for those values of x and a. This is because of the +/- in the formula we found for b.
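In other words (my own restatement via Vieta's formulas, assuming the quadratic in b that comes from x = (a² + b²)/(ab + 1)): if b and b* are its two roots, then

```latex
b + b^* = xa, \qquad b \, b^* = a^2 - x
```

so from one solution (x, a, b) the paired value b* = xa − b follows immediately, which is exactly why solutions come in pairs.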
Lemma 2: The Magical Step
There are no solutions where a < 0 and b > 0 (or vice-versa).
This isn’t immediately obvious at all. To start off with, we will rephrase the question. Eyeballing the equation and flipping signs appropriately, you will notice that this is the same as there being positive integers (x, a, b) such that
Exercise 4. Can you understand why this is true? (Hint: let a* be < 0 and x* …) If this ratio is greater than 3/2, we will get unbounded growth by pinging back and forth: using our improved bound on a to increase the bound on b, and then using the increased bound on b to increase the bound on a! But this is a contradiction. After all, a and b are finite, yet by repeating our procedure we have also shown they are greater than any finite number.
Lemma 3
a,b ≥ root x for all positive integer solutions.
Suppose our arch-nemesis handed us (x, a) such that
is an integer, and a < root x. Then we can write the following as a solution:
However, by assumption
This is a contradiction, using lemma 2! After all, we’d have found a solution with a > 0 and b < 0.
Thus both a,b ≥ root x
Lemma 4: The Magical Step revisited
One of a,b < x or one of the paired solutions has one of a*,b* < x
Start with b.
Thus, by assumption, we can deduce
Why? Suppose it were less than x. Then we would be done, as we would have a paired solution with b < x, which is what we wanted.
Now we want to find a suitable lower bound for that horrible square root expression. We have to be quite cunning to make the inequality tight enough, but not so tight as to lose generality. For x >= 3, we can safely write
Exercise 8. Try and understand why the above inequalities hold.
This allows us to write
As both a and b are greater than x, we also get (by using the same logic with variable names switched)
Thus a > 2x, b > 2x.
But now we can repeat this procedure!!!!!
And we will also be able to deduce that b > 4x. We can repeat this forever. This means there are no solutions, as the lower bound for both a and b can be made arbitrarily large.
Exercise 9. That was hard. Reflect for a bit on why the logic works if you are unsure. Re-read the lemma again now and in a few hours or tomorrow.
Lemma 5
For each solution where neither of a, b equals root x, we derive a contradiction. One of a and b is less than x by Lemma 4. Let’s say it’s a, which gives us the following inequality.
We find a solution b* such that
By the magical lemma, this is a contradiction, or b* = 0, which gives an easy set of cases to verify. (Alternatively, you could use the inequality above to show that there are no integer solutions in the range between ax − 2a/x and xa which work, as the only option is xa − 1, which can be checked by hand.)
Corollary 1
For each solution (x, a, b) there is an implied solution where a = root x or b = root x.
We know that there is an implied solution where one of a or b is 0 by lemma 2.
We can easily see that solutions where a < 0 and b < 0 are mirrored in solutions where a > 0 and b > 0, as the negative signs cancel.
So we look at positive integer solutions. By lemma 3 there are no solutions where one of a or b is less than root x. (This used lemma 2).
Now we get onto lemma 4. This told us that, given a solution triple (x, a, b), one out of a*, b*, a, b is < x, where a* and b* are the paired or ‘implied’ solutions. Thus, suppose we found that all solutions with one of a or b < x have x being a square. Then this would be true for all solutions where both a and b >= x, as x remains the same when we pass to the a* and b* this solution is paired with.
Lemma 5 told us that this was indeed true. And so we are done. Can you see how to classify all the solutions? That’s coming next!!
Reflections and thoughts
That was hard. I find it hard sometimes to remember why a bit worked! If you have made it this far, you should feel proud, even if you understand just 10% of what happened. I always feel a bit baffled when I first encounter some hard maths, and it’s only over a long amount of pondering that I ‘get it’.
Exercise 10. Reflect on how well you understood the solution. Spend some time thinking about the bits you found hard, re-read them, write stuff out by hand on paper, ask questions in the comments. If you remember, read through the solution in a few days again, and then in a week or so, and see if you understand more of it.
Math is something to be taken slow. It takes time to digest and understand fully. I often spent days or even weeks mulling over a hard problem.
Free Math Resources
Khan Academy — website with Math, Science and coding, all free. Goes from basics right up to some university level stuff.
3Blue1Brown — Math educator. He creates amazing animated videos to help give an intuitive understanding of math.
Art of Problem Solving — range of math problems, especially competition math.
Past International Maths Olympiad Problems — these are hard. They’re fun to attempt, but be willing to spend a long time working on one.
Dexter’s Notes — Lecture notes and practice sheets from the Cambridge Mathematics Tripos. This is HARD! Not recommended for beginners. Part IA means first year undergraduate, Part IB means second year undergraduate, Part II is third year undergraduate, Part III is fourth year undergraduate / Master’s student.
MIT Opencourseware — Resources, practice exercises and exams from MIT.
I’m a mathematics and philosophy enthusiast, who studies economics at Cambridge University. You can contact me in the comments or via my facebook page. If you want some guidance in how to improve in math, I’d be glad to help!
Edit: 28.08.2020 I’m just updating this article to say I have now succeeded in switching to maths. Yay 🙂