The British writer, mathematician and logician Charles Lutwidge Dodgson (which was Lewis Carroll’s real name) worked in the fields of geometry, matrix algebra, mathematical logic and linear algebra. Dodgson was also an influential logician. (He introduced the Method of Trees, the earliest use of a truth tree.) For some time Dodgson was Mathematical Lecturer at Christ Church, Oxford University.
And, of course, Dodgson (under the name Lewis Carroll) also wrote Alice in Wonderland, Alice Through the Looking Glass and many other books and poems.
Lewis Carroll (I’ll use Dodgson’s better-known pen name from now on) formulated his premises paradox in 1895. (The argument was advanced in Carroll’s paper ‘What the Tortoise Said to Achilles’, which was published in the journal Mind.) This paradox concerns the possibility that infinitely many premises are required in order to reach a single conclusion.
Firstly, two or more premises are usually linked to a conclusion in a logical argument.
So how are they linked?
Now, that can be a question about the philosophical nature of the link between premises and conclusion: whether it’s an example of entailment (or logical consequence), implication (or material implication) or whatnot. However, that wasn’t the point that Lewis Carroll was making. He was making a purely logical (not a philosophy-of-logic) point.
To put this simply: in order to justify (or explain) how any given premises are related to a conclusion (or how the premises entail or imply the conclusion), a further premise will need to be brought into the argument. Moreover, another premise will be required in order to tell us (or show us) how (or why) it’s the case that if the premises are true, then the conclusion must follow and also be true.
Now that added explanation (or justification) will itself be a premise within the argument.
Of course, the same problem will repeat itself.
Just as we brought in a premise to link two, three or more premises to a conclusion, now we have to say how this new premise is itself linked to those previous premises. Or, alternatively, we’ll need to know how all the premises (when taken together) are linked to the conclusion.
A solution has been offered to this logical paradox.
That solution is simply to argue that no added premises are needed in the first place. That is, the link between the premises and the conclusion doesn’t need to be explained (or justified) by a further premise.
So why is that?
The (or one) answer is to say that the premises and conclusion are simply linked by a rule of inference. That rule itself explains (or shows) the relation between the premises and the conclusion. Nonetheless, that seems (at least at a prima facie level) like a cop-out. To simply say that premises are linked to a conclusion by a rule of inference doesn’t appear to be saying anything… much. Surely that rule (again!) will need to be explained by a further premise.
It can also be argued that stating that premises are linked to a conclusion by an inference rule is itself a further premise.
However, that’s unless that rule of inference somehow works without any justification or explanation.
In other words, it’s not a justification: it’s just a rule. It’s something that (to use Wittgenstein’s much-quoted phrase) “can be shown but not said”, just as the nature of the logical constants can’t be said, only shown.
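One way to make the “rule, not premise” distinction concrete is to implement modus ponens as machinery that operates on premises, rather than as an item in the premise list itself. The sketch below is my own illustration (the representation and function name are not from the text): the rule lives in the code, and only the premises live in the data.

```python
def modus_ponens_closure(premises):
    """Repeatedly apply the rule 'from P and (P -> Q), infer Q'.

    An implication is represented as the tuple ("->", P, Q).
    Note that the rule itself never appears as a premise: it is part
    of the machinery that manipulates premises, which is one way to
    read the standard response to Carroll's regress.
    """
    derived = set(premises)
    changed = True
    while changed:
        changed = False
        for p in list(derived):
            # If p is an implication whose antecedent is already derived,
            # add its consequent.
            if isinstance(p, tuple) and p[0] == "->" and p[1] in derived:
                if p[2] not in derived:
                    derived.add(p[2])
                    changed = True
    return derived

# Premises: A, A -> B, B -> C. The rule is applied, not stated.
premises = {"A", ("->", "A", "B"), ("->", "B", "C")}
print(modus_ponens_closure(premises))
```

The point of the sketch is that no premise of the form “if A and A → B are true, then B is true” ever needs to be added to the set: the while-loop *is* the rule.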
In probability theory and statistics, Markov’s inequality gives an upper bound for the probability that a non-negative function of a random variable is greater than or equal to some positive constant.
The inequality is named after the Russian mathematician Andrey Markov (1856–1922) and is useful for relating probabilities to expectations by providing bounds on the cumulative distribution function of a random variable.
Introduction
Suppose that you are tossing N coins, one after the other. A first natural question to ask is “what is roughly the number of HEADS you should expect to see?”. Well, if we assume that the coin flips are independent and unbiased, then in expectation you should see N/2 HEADS. OK, that seems reasonable so far.
So, let’s ask a second question: “how likely is it to see more than 2N/3 HEADS?” First, we note that it is definitely possible to observe at least 2N/3 HEADS, so the probability of such an event is positive. But, is there a way to bound such a probability? The answer is YES, and the simplest way to do so is by using the elegant Markov’s inequality.
The starting point for Markov’s inequality to apply is a random variable X that is non-negative. In our case, X represents the number of HEADS we observe in N coin flips. Let E[X] denote the expected value of this random variable, and let A > 0 be some positive number. Then Markov’s inequality says that the probability of X taking value at least as large as A is at most E[X]/A. Formally,
P(X ≥ A) ≤ E[X] / A (Markov’s inequality).
That’s it!
It is important to observe here that Markov’s inequality is interesting when one wants to bound how much larger a random variable can get than its expected value. In particular, it directly gives the following bound for any number t > 1:
P(X ≥ t · E[X]) ≤ 1/t, a direct implication of Markov’s inequality. A picture describing the event that Markov’s inequality bounds. [Source: http://homepage.cs.uiowa.edu/~sriram/5360/fall18/notes/9.10/week4Notes.pdf]
An example
Before proving the inequality, let’s apply it to our example. We would like a bound on the probability that we observe at least 2N/3 HEADS in our N coin flips. As stated, it is easy to compute that E[X] = N/2. Thus, by applying Markov’s inequality, we get P(X ≥ 2N/3) ≤ E[X] / (2N/3) = (N/2) / (2N/3) = 3/4.
So, right away, this implies that with probability at least 1/4, we will observe fewer than 2N/3 HEADS, or equivalently, we will observe more than N/3 TAILS.
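As a sanity check, here is a quick Monte-Carlo sketch (the function name and parameters are my own, not from the article) comparing the empirical tail probability to Markov’s 3/4 bound. As we will see, the bound holds comfortably; in fact, for independent fair coins the true probability is far smaller, which is why sharper tools like the Chernoff bound exist.

```python
import random

def estimate_tail(n_coins=30, trials=5000, seed=0):
    """Empirically estimate P(X >= 2N/3), where X is the number of
    HEADS in N independent fair coin flips."""
    rng = random.Random(seed)
    threshold = 2 * n_coins / 3
    hits = sum(
        1 for _ in range(trials)
        if sum(rng.randint(0, 1) for _ in range(n_coins)) >= threshold
    )
    return hits / trials

empirical = estimate_tail()
markov_bound = 0.75  # E[X]/A = (N/2)/(2N/3) = 3/4, independent of N
print(empirical, "<=", markov_bound)
```

Running this shows an empirical tail probability of a few percent, well below 3/4: Markov’s inequality is valid here but loose.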
Let’s see now how one can prove Markov’s inequality.
The proof of Markov’s Inequality
The proof is surprisingly simple! We will use the definition of the expected value and the law of total probability, or more precisely, the law of total expectation. We have
E[X] = E[X | X < A] · P(X < A) + E[X | X ≥ A] · P(X ≥ A).
Since X is always non-negative, we immediately have that E[X | X < A] ≥ 0. Thus, the first term on the right-hand side of the equality is always non-negative. Let’s drop this term, and get that
E[X] ≥ E[X | X ≥ A] · P(X ≥ A).
Almost there! The only observation we need to make is that when we condition the variable X to be at least A, then clearly, the expected value under this condition is always at least A. More formally,
E[X | X ≥ A] ≥ A.
Plugging this into the above inequality, we get that
E[X] ≥ A · P(X ≥ A).
By rearranging the terms, we conclude that
P(X ≥ A) ≤ E[X] / A.
Markov’s inequality is officially proved!
The great thing about Markov’s inequality is that, besides being so easy to prove, it is sometimes tight! And even if it is not, it still often gives very useful bounds.
A pictorial proof of Markov’s inequality when X is a discrete random variable (like in our example) by Jeff Erickson. [Source: https://math.stackexchange.com/questions/518873/visualizing-markov-and-chebyshev-inequalities]
Tightness of Markov’s inequality
A simple example that shows that the inequality is tight is the following. Let Z be a random variable that takes the value K (for some K > 1) with probability 1/K, and the value 0 with probability 1 - 1/K.
Note that its expected value is E[Z] = K · (1/K) = 1. Markov’s inequality now implies that
P(Z ≥ K) ≤ E[Z] / K = 1/K.
Indeed, the probability that Z is at least K is equal to the probability that Z is exactly K (as Z only takes values 0 and K), and this probability is equal to 1/K.
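A quick check of this tightness claim with exact rational arithmetic (the function name is mine, not from the article):

```python
from fractions import Fraction

def markov_gap(K):
    """For Z with P(Z = K) = 1/K and P(Z = 0) = 1 - 1/K, compare the
    exact tail probability P(Z >= K) with the Markov bound E[Z]/K."""
    p_tail = Fraction(1, K)           # P(Z >= K) = P(Z = K) = 1/K
    expectation = K * Fraction(1, K)  # E[Z] = K * (1/K) = 1
    bound = expectation / K           # Markov: P(Z >= K) <= E[Z]/K
    return p_tail, bound

for K in (2, 5, 100):
    tail, bound = markov_gap(K)
    assert tail == bound  # the inequality holds with equality: it is tight
```

Since the exact tail probability equals the bound for every K, no inequality of the form P(X ≥ A) ≤ c · E[X]/A with c < 1 can hold for all non-negative random variables.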
Conclusion
Markov’s inequality, although simple, is often very useful, and in many cases provides satisfactory bounds, especially when more sophisticated techniques cannot be applied (e.g., when we do not know much more about a random variable than its expected value and the fact that it is always non-negative). Of course, there are cases where we can do a much better analysis (as in the example above with the independent coin flips, where one can apply more sophisticated techniques, such as the Chernoff bound). Nevertheless, it is always good to keep it in mind, and, well, if you ever forget the statement, the proof is 2 minutes away!
When I studied topology as an undergraduate, I always found it difficult to answer the inevitable question from family and friends:
“What on Earth is topology?”
Every time, I would give them a slightly different answer, but I was never really satisfied with any of my explanations. If you’ve ever googled topology, you will no doubt have encountered the animation of a doughnut morphing into a coffee mug. Most explanations I gave involved something related to this animation: how in topology a doughnut and a coffee mug are the same, or how a sphere and a cube are the same. But an answer like this doesn’t really tell anyone what topology is about, what working in topology actually involves, or why it’s even a worthwhile pursuit.
The well known coffee-mug doughnut animation. Source: Wikimedia Commons
If you take an undergraduate course in general topology, you might struggle with relating what you’re learning to the familiar doughnut coffee mug animation. The purpose of this essay is to establish the basic concept of general topology and show how this concept is related to this familiar animation and other geometrical ideas. Next, we will have a look at why it is actually useful and interesting to consider doughnuts and coffee mugs to be the same things.
In general, something that I find a lot of people struggle with, myself included, is trying to understand exactly how abstract areas of mathematics can actually have real world applications. After we develop an understanding of topological ideas, we’ll take a look at how topology is, perhaps unexpectedly, related to the way we think about real world concepts. Before this, we’ll take a look at the most basic idea in topology. This is the definition that you will be faced with if you ever open a topology textbook or take an introductory topology course.
Topological Space
A topological space is a set of mathematical objects with the most basic form of structure possible. When we talk about structure in mathematics, we often mean that we have the ability to add mathematical objects together, multiply them together, determine how far away the objects are from each other and many other ideas. Obviously, we can do all of these things with the numbers we encounter every day.
But the structure of a topological space is more basic than the ideas of addition, multiplication and distance. In fact, spaces where we have these things are specific cases of a topological space, meaning that the real numbers are actually a very specific case of a topological space.
The structure on a topological space is called the topology of the space. All the topology is, is a collection of subsets of the set of mathematical objects, known as “the open sets” of the space. Which sets are contained in the topology defines the structure of the space. This might seem very vague and abstract, but that’s because it is. It’s the most abstract form of structure that we have in mathematics.
It’s not necessary to understand this definition fully, just keep in mind that the topology, and the “open sets” inside of it somehow determine the structure of the space. It’s also important to remember that what makes a topological space distinct from another topological space is the sets that we choose to put in the topology of the space. Below I have given a more formal definition of a topological space if you’re interested, but it certainly isn’t necessary to read.
Definition of a topological space
A topological space (X, τ) is a set X together with a collection τ of subsets of X, such that:
1. X and the empty set are contained in τ.
2. Any union of sets in τ is also in τ.
3. Any finite intersection of sets in τ is also in τ.
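To make the definition concrete, here is a small brute-force check of the three axioms on a finite set. This is my own sketch (not from the text); note that on a finite collection, checking closure under pairwise unions and intersections is enough, since larger unions and intersections can be built up two sets at a time.

```python
from itertools import combinations

def is_topology(X, tau):
    """Check the three topology axioms for a finite set X and a
    candidate collection tau of subsets, given as frozensets."""
    tau = set(tau)
    # Axiom 1: X and the empty set belong to tau.
    if frozenset() not in tau or frozenset(X) not in tau:
        return False
    # Axioms 2 and 3: closure under unions and finite intersections.
    # For a finite collection, pairwise closure suffices.
    for a, b in combinations(tau, 2):
        if a | b not in tau or a & b not in tau:
            return False
    return True

X = {1, 2, 3}
tau_good = {frozenset(), frozenset({1}), frozenset({1, 2}), frozenset(X)}
# This one is NOT a topology: it is missing the union {1, 2}.
tau_bad = {frozenset(), frozenset({1}), frozenset({2}), frozenset(X)}
print(is_topology(X, tau_good), is_topology(X, tau_bad))
```

The failing example shows the definition has teeth: a topology is not any old collection of subsets, and choosing different collections on the same set X gives genuinely different topological spaces.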
So how does this relate to doughnuts and coffee mugs?
Often, topological spaces can be visualised by geometric objects, such as a sphere:
Figure 1: A sphere
The topological space that represents a sphere is the set of points such that if you were to plot them in three-dimensional space they would make up a sphere, along with a topology. Recalling that the topology defines the structure of the space, it is the topology that is keeping the sphere together. We can imagine the topology as “the thing that is keeping all of the points from falling to the ground”, making the sphere a single object, not just two hemispheres pressed against one another. Now, let’s suppose we have some other topological space visualised as this object below:
Figure 2: An ellipsoid
Suppose that the sphere above (Fig. 1) is made of play-dough, then we can easily stretch the sphere into this other object (Fig. 2). Being able to do this with a three-dimensional object means that these two objects are topologically the same. This might seem strange, but ask yourself, what is different about these two shapes? They look different, but if we can easily squish or stretch one into the other, are they actually distinct?
These two objects have the same topology, which means that, even though these two objects are geometrically different, they are topologically exactly the same object. We can stretch our play dough sphere into any strange shape imaginable and all of these shapes are exactly the same shape in the eyes of topology. Okay, maybe not every shape imaginable, there are a few rules with how we can stretch our play-dough:
We aren’t allowed to rip holes in our play-dough; and
we aren’t allowed to take two points of the play-dough and merge them together (which is why we can’t make any doughnut shapes).
If we violate these rules with our stretching, then our two objects are no longer topologically the same. Topologists call this stretching, without breaking our established rules, a homeomorphism, which is just a way of mathematically describing exactly how we mold our play-dough into shapes of the same topologies. So if we can come up with a homeomorphism between two topological spaces, then these spaces have the same topology. This is where the coffee mug and doughnut animation comes in. We can come up with a topological space that describes a doughnut and then, imagining our doughnut is made of play-dough, stretch it into a coffee mug without breaking our rules. So yes, in topology a coffee mug and a doughnut are the same thing, topologically.
Figure 3. A not-particularly-delicious-looking doughnut
Why is a sphere not a doughnut?
Now that we know how we can tell if two objects in topology are the same, let’s now look at how we can tell if two objects are topologically different. There are a number of different properties that topological spaces have that allow us to distinguish them. For three-dimensional objects, such as spheres and doughnuts, the main thing we can use to distinguish objects is the number of holes they have. Two objects are topologically different if one object has more holes than the other. This is because they break our previously established rules for stretching our play-dough. To create a hole, we either have to rip a hole in the play-dough or wrap the play-dough into a doughnut shape and merge two ends together.
Figure 4. We can mold a play-dough sphere into a doughnut shape, but the edges can’t be merged together without breaking our rules. The two circle faces seen in the macaroni shape still exist when we bend it into a doughnut.
Another common way to distinguish between three-dimensional objects topologically is to imagine walking on them. Consider walking on a sphere, for instance. Suppose you start at some point and walk all the way around a great circle on the sphere. Now, after you’ve reached your starting point again, rotate 90 degrees in either direction and walk around another great circle. During this second walk around the sphere, you will cross over your first path. This happens no matter which point you start from.
Figure 5. A sphere with two intersecting paths
This same phenomenon happens on any three-dimensional object that is topologically the same as a sphere. However, on some objects which are not topologically the same as a sphere, there is a way we can do this without crossing the first path. One particular object on which this works is a doughnut, as you can see here:
Figure 6. If we start where the blue and green paths meet and follow the blue path, then follow the green path, we can walk without crossing over where we’ve already walked.
There are many different properties that are preserved between objects that are topologically the same, but are not necessarily preserved between objects that are topologically different. These topological properties are the main tool used to determine whether two objects are different.
Other topological objects
So far we’ve only talked about topological spaces that can be visualised in 3 dimensions, but a nice thing about topology is that it allows us to easily describe objects that exist in 4, 5 or higher dimensions using the same tools.
One such object that often comes up in topology is the Klein bottle, which if you’re a fan of the YouTube channel Numberphile (or Clifford Stoll!), you will no doubt have seen:
A representation of the Klein bottle in three-dimensional space
Strictly speaking, we can’t actually observe a true Klein bottle in three-dimensional space, but by allowing it to intersect itself, we can get some idea of what its properties are. In four-dimensional space this object doesn’t actually intersect itself. It’s hard to imagine, but it bends around in the fourth dimension to connect back into itself. The Klein bottle might look like it has an inside and an outside, but you can trace a continuous path from a particular point that goes along the ‘outside’ and ‘inside’ of the Klein bottle, returning to the original point — the 3D representation is the same surface, topologically. Because of this, the Klein bottle has no volume.
An interesting thing about paths on a Klein bottle, however, is that if you were to walk along the path described above, when you returned to your original point, you would actually be a mirror image of yourself. This is a topological property of objects that are topologically equivalent (or homeomorphic) to the Klein bottle. So clearly the Klein bottle is not homeomorphic to the sphere or to the doughnut, as no matter how we walk on a sphere or doughnut, we will never be the mirror image of ourselves when we return to our starting point. Objects with this property are called non-orientable: Klein bottles are non-orientable, while spheres and doughnuts are orientable. Another famous non-orientable surface is the Möbius strip. This object can easily be made with a strip of paper, and there are plenty of online tutorials on how to do this.
When the crab walks around the Möbius strip and returns to its original position, it is a mirror image of itself. Source: Wikimedia Commons
While the Möbius strip is non-orientable, it is not topologically equivalent to the Klein bottle, but it is integral in its construction. A Klein bottle is actually constructed by gluing the edges of two Möbius strips together; attempting to do this in three-dimensional space is impossible (you can try it).
Constructing a doughnut from a sheet of paper
A more practical way of thinking about the topology of objects that are hard to visualise in three-dimensional space (such as the Klein bottle), is by considering their gluing diagram. A gluing diagram acts as instructions on how we can construct an object with a certain topology. This construction works by stretching and gluing the edges of a 2D shape.
Before we consider the gluing diagram of a complex shape, let’s first consider the gluing diagram for a simpler shape, the doughnut:
Figure 7. The gluing diagram for a doughnut
Imagine the square in the diagram is made of play-dough. Next, imagine stretching the square to attach or “glue” opposite edges. When we glue these edges together, we need the arrows to be pointing in the same direction. So we stretch the above diagram out as follows:
Figure 8: How to construct a doughnut from its gluing diagram
The next diagram is similar to Figure 7 except the two red arrows are now in opposite directions. This means that we need to twist the object so that the arrows are pointing in the same direction before we glue the edges together:
Figure 9. A more complex gluing diagram
The first step in the above gluing diagram is to stretch the square so that the two blue lines meet and we construct a cylinder, just like the first step for building the doughnut. Except now, the two red arrows are pointing in opposite directions to each other, whereas the red arrows for the doughnut gluing were facing in the same direction. This means that we have to somehow twist one end of the cylinder so that the arrows are pointing the same direction before gluing them together. As you might imagine, this is physically impossible. Hence the surface that results from this gluing diagram is physically impossible as well. In fact, it’s a physically impossible surface we’ve already seen: the Klein bottle!
Source: Fouriest Series on Tumblr
Gluing diagrams are an easy way to see whether an object is orientable or not. We can imagine that walking on a gluing diagram works similarly to how the world in Pac-Man works. When Pac-Man reaches one side of the world, he appears coming out of the other side. If we imagine Pac-Man moving on a gluing diagram, when he enters one side, he will emerge from the other side of the same colour, with the arrow determining his orientation.
Suppose Pac-Man enters the right-hand side of the torus gluing diagram, then he will emerge from the left hand side unchanged. This is how the topology of the normal Pac-Man world works.
Figure 10: Pac-Man walking on a torus
Now consider if Pac-Man enters the right-hand side of the Klein bottle gluing diagram. Then Pac-Man will emerge from the left hand side as a vertical flip of himself:
Figure 11: Pac-Man walking on a Klein bottle
So gluing diagrams allow us to easily consider some topological properties of objects that would otherwise be difficult to work with.
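The Pac-Man picture translates directly into coordinate arithmetic. Below is a small sketch of my own (the function and parameter names are illustrative): a point leaving the unit-square gluing diagram wraps around unchanged on the torus, but crossing a glued vertical edge of the Klein bottle also flips the vertical coordinate, just as Pac-Man emerges vertically flipped.

```python
def wrap(pos, surface, width=1.0, height=1.0):
    """Wrap a point (x, y) that leaves the unit-square gluing diagram.

    surface = "torus": both pairs of edges are glued without a twist.
    surface = "klein": the left/right edges are glued with a vertical
    flip, so crossing them mirrors the y-coordinate.
    """
    x, y = pos
    if surface == "torus":
        return (x % width, y % height)
    if surface == "klein":
        flips, x = divmod(x, width)
        if int(flips) % 2 == 1:  # crossed a twisted edge an odd number of times
            y = height - y       # emerge vertically flipped
        return (x, y % height)
    raise ValueError(surface)

print(wrap((1.25, 0.3), "torus"))  # unchanged orientation
print(wrap((1.25, 0.3), "klein"))  # vertically flipped
```

This also makes orientability computable in miniature: walking a loop that crosses the twisted edge an odd number of times returns you mirrored, which never happens on the torus.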
Why is topology useful?
Topology is actually surprisingly useful in the world of statistics. An emerging research area in statistics is topological data analysis. Useful data has some sort of structure, in the form of patterns or trends, and data analysis is essentially the process of uncovering this structure. Finding structure in data can often depend on how we view it, that is, what statistical tests are used, which variables we compare to other variables and what visual representations we use.
From topology, we know that things that look completely different can actually have the same structure. This idea can apply to data as well, as data can appear completely different depending on how we view it, even though we are always dealing with the same data.
In topological data analysis, the structure of data is treated topologically. We know that a topological property is a particular feature of an object that is preserved through objects that are topologically the same. So when performing topological data analysis, we search for certain properties that persist in the data after viewing it in a variety of different ways. We can think of this as being analogous to stretching our data like it’s play-dough. By doing this, we can determine the real structure of the data and distinguish it from any dependence on how the data was observed.
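To give a flavour of this, here is a toy sketch of the 0-dimensional case (all names are mine, and this is a drastically simplified stand-in for real persistent-homology software): we count the connected components of a point cloud as a linking radius grows. Features that persist across a wide range of radii, like the two clusters below, reflect genuine structure; short-lived ones are noise.

```python
from itertools import combinations

def connected_components(points, radius):
    """Number of connected components when any two points within
    `radius` of each other are linked (union-find)."""
    parent = list(range(len(points)))

    def find(i):
        # Path-halving find.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for (i, p), (j, q) in combinations(enumerate(points), 2):
        if (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2 <= radius ** 2:
            parent[find(i)] = find(j)

    return len({find(i) for i in range(len(points))})

# Two well-separated clusters: the "two components" feature persists
# over a wide range of radii before everything merges into one.
cloud = [(0, 0), (0.1, 0.1), (0.2, 0), (5, 5), (5.1, 5.2), (5.2, 5)]
print([connected_components(cloud, r) for r in (0.05, 0.5, 10)])
```

In real topological data analysis the same idea is pushed to higher dimensions (loops, voids), but the principle is the one above: structure that survives many “viewing scales” is treated as real.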
This is just one of many applications of topology in the so called ‘real world.’ Other applications of topology also involve determining whether things that look different are actually the same. This is very important in situations where different people might choose to represent the same information in different ways. A few examples of things that have various different representations include, molecular structure, geographical maps, DNA structure and knots in ropes or string.
It can be hard to see initially, but topology is the foundation for most areas in mathematics. Defining exactly how topology is ‘used’ is quite difficult, as it’s so ingrained in the way mathematics works that often we don’t even notice we are using it. Only relatively recently has topology been considered and researched independently of other areas of mathematics, and new research and applications of topology continue to arise.
Epilogue
I hope that after reading this you can now appreciate that a doughnut and a coffee mug being the same is a much deeper and more useful idea than it might initially have seemed. If you’re interested in digging deeper into topology, here are some recommended sources:
Disclaimer: All graphics in this story without a source were created using Wolfram Mathematica.
Welcome back! In my last article, Formal Truth and Logic, we introduced many fundamental concepts of logic in Mathematics (e.g. “What precisely is a ‘statement’?” or “How can we compose logic statements?”). If you require extra context on the notations or definitions used throughout this article, I highly recommend reading that article first if you have not done so already.
Mathematics as a science is in a distinguished position compared to most other sciences. This is due to its ability to make statements which are universally true (or false) based solely on the initially laid out assumptions of a statement. Why is that significant? It allows us to draw conclusions deductively, meaning that conclusions preserve the truth of their original statements (s. [1] p. 5).
Fig. 1 Example of a deductive conclusion.
It is the converse of induction, the way that empirical sciences advance: Induction builds on collected evidence to generalize from individual observations to a general principle. Scientific methods nowadays are highly sophisticated. They are intended to minimize the likelihood of drawing flawed conclusions from well-designed studies. But the defining factor in this process is human judgement and the rigorous application of these methods in good faith. After all this, the result is still an approximation whose initial assumption of uniformity, that the studied phenomena can reasonably be extrapolated from the past into the future, has not and in most instances cannot be validated (s. [1] ibid.).
If, on the other hand, we build on statements that are known to be true to construct new statements which preserve this truth, then we can substitute these statements as new primitives for further deductions. They follow the pattern “If X is true, then Y” or “Given X, then Y”. This way we gradually build up a tree of knowledge, in which every node is a universally valid statement qualified only through its specified premises and no other undefined environmental factors. Its validity is given by the strict application of deductive rules and the prior, more primitive statements which allow us to relate assumptions to the conclusion, tracing back all the way down to the fundamental axioms.
Proofs to Begin With
It all starts with a statement. The statement alone could possibly be true, possibly not. This statement is called a conjecture and becomes a theorem, once it has successfully been proven true.
Theorem: For two positive numbers a, b ∈ ℝ, if a < b then 0 < b -a.
Proof:The order axioms apply to the set of real numbers ℝ. This includes the law of monotony, which states that if a < b, then also a + c < b + c. That means a < b ⇔ a + (-a) < b + (-a) ⇔ 0 < b-a, the initial statement (s. [5] p. 293f).
Proofs and theorems build on each other and create a hierarchy. In textbooks this hierarchy is often directional, from the more fundamental to the more advanced topics, as the theory is gradually built up. Theorems and proofs are introduced in a didactically sensible order to support the later, more complex concepts.
Theorem:For two positive numbers a, b ∈ ℝ if a < b then -b < -a.
Proof:The previous theorem states that a < b implies 0 < b-a. Hence, a < b ⇒ 0 < b-a ⇔ 0 < -(-b)-a ⇔ 0 + (-b) < -(-b)-a + (-b) ⇔ -b < -a (s. [5] ibid.).
But you could well imagine that there are many valid paths that lead to success. A more advanced property or concept could lead to a more elegant, more concise way to derive one and the same theorem.
A proof does not necessarily need to adhere to a formal structure. Its logical structure and conclusiveness alone must be evident. A proof can hence be surprisingly sparse in words:
Theorem: Let c denote the length of the hypotenuse and a and b denote the lengths of the other two sides of a right-angled triangle; then a² + b² = c².
Proof:
Fig. 2 Proving the Pythagorean Theorem geometrically.
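Assuming the figure shows the standard rearrangement argument (four copies of the right-angled triangle arranged inside a square of side a + b, surrounding a tilted square of side c), the picture translates into algebra as follows:

```latex
(a + b)^2 = c^2 + 4 \cdot \tfrac{1}{2}ab
\;\Longrightarrow\;
a^2 + 2ab + b^2 = c^2 + 2ab
\;\Longrightarrow\;
a^2 + b^2 = c^2
```

The large square’s area is computed twice, once directly and once as the inner square plus the four triangles; cancelling 2ab yields the theorem.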
Theorems and Lemmata
More often than not, the complexity of the relationship between different objects and their properties requires a strategic approach. The properties on which the to-be-proven theorem leans, though, can be extremely specific in their scientific merit to the surveyed relationship. In other words: sometimes, the premises required to prove your theorem are just way too specific to what you are currently trying to do.
Similar to how in good software engineering one decouples the interface of a function from its implementation, mathematicians often precede their main theorem with a series of one or more lemmata. Structurally, a lemma is a supportive theorem: a proven statement describing a relationship between a number of premises and a conclusion. Lemmata serve to derive those aforementioned highly technical properties, which in turn make it possible to cleanly and concisely prove the main theorem. Or they split the underlying components of a theorem into separate components, Divide-And-Conquer style.
The Root of All Proofs
So, proofs build on one another. But what lies at the root of them? As children we might have been asked this as a brain teaser: Define a term, an object, or a concept, without using this term or any of its synonyms in its definition. In Mathematics, we would like to have a logically constructed system, too, without self-references or any of its definitions relying circularly on one another.
This is where axioms come in: Elementary statements defined to be true (s. [5] p. 287). How is that possible? Well, defining something to be universally true without providing proof is only possible via general consensus. Yet, the set of fundamental axioms is carefully chosen to support the entire existing body of knowledge. There are several — admittedly informal — quality criteria for an axiomatic system to be recognised as such:
Each axiom should be logically independent, i.e. it should not be decomposable into separate, more elementary statements, not even into other axioms.
The system should be consistent, i.e. under no circumstance should it be possible to derive a contradiction under that system. A contradiction occurs if for some statement B we can legally derive both B and its negation ¬B under the same logic system. Such a system would be considered inconsistent.
Ideally, we would like such a system to be also complete, i.e. we would like every true statement to be derivable under that system. However, this criterion is a weak one: as Gödel proved, a consistent system that is expressive enough to describe arithmetic cannot be complete at the same time.
Could we prove an axiom wrong? And if we did, would that bring down all of modern Maths? Frankly, nothing would stop us from trying. But it would probably not lead to the fall of Mathematics as we know it. The reason for that lies in the selection process of the axioms: They are selected to match and describe intuitive properties of Mathematical objects like sets, fields, and numbers, that we can observe as we follow a Mathematical thought chain.
A good example are the order axioms, which describe the ordering properties of all number sets from the natural numbers ℕ to the real numbers ℝ (cf. [5] p. 289). Showing that one of these axioms violates the consistency criterion does not disprove the ordering property over ℝ per se, but it proves that our description of this fundamental property was inconsistent. If, on the other hand, you can define a set which violates these axioms by design, then the ordering property simply does not apply to this set, meaning any other theorems derived from the order axioms would not apply either.
One other example comes from the field axioms, which state that a mathematical field 𝔽 satisfies these properties (cf. [4] p. 50f; s. [5] p. 288):
1. Addition over 𝔽…
… is commutative.
… is associative.
… contains a neutral element 0 ∈ 𝔽 such that 0 + a = a for any a ∈ 𝔽.
… has for every a ∈ 𝔽 an inverse element -a such that (-a) + a = 0.
2. Multiplication over 𝔽…
… is commutative.
… is associative.
… contains a neutral element 1 ∈ 𝔽 such that 1 ∙ a = a for any a ∈ 𝔽∖{0}.
… has for every a ∈ 𝔽∖{0} an inverse element a⁻¹ such that a⁻¹ ∙ a = 1.
3. Addition and Multiplication over 𝔽…
… are left-distributive.
… are right-distributive.
It should be fairly easy to spot that all these properties apply to the real numbers ℝ and the rational numbers ℚ. But do they apply to the integers ℤ? Intuitively, we might guess yes: addition and multiplication are both commutative, associative and distributive over ℤ, and we can name the neutral elements of addition and multiplication: 0 and 1.
What about the inverse elements? The additive inverse element for any integer a is -a. The multiplicative inverse element for any integer a would have to be a⁻¹. In the cases a = 1 and a = -1 the inverse elements are 1 and -1 respectively, which are integers. If a = 2, however, the inverse element is 2⁻¹ = ½, which is obviously not an integer (s. [4] p. 48). By virtue of our defined axioms, we can conclude that the integers ℤ do not fulfill the qualities of a mathematical field (cf. [4] p. 51 and p. 63f).
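The failed axiom can be checked mechanically. Below is a small Python sketch of my own (not from the cited sources) that brute-forces which integers in a small window possess an integer multiplicative inverse:

```python
def has_integer_inverse(a: int) -> bool:
    """True if some integer b in a small search window satisfies a * b == 1."""
    return any(a * b == 1 for b in range(-10, 11))

witnesses = [a for a in range(-10, 11) if a != 0 and has_integer_inverse(a)]
print(witnesses)  # prints [-1, 1]: only ±1 have integer inverses
```

Only 1 and -1 survive, which is exactly the failure of the multiplicative-inverse axiom over ℤ.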
Strategies for Proving
In Maths, theorems are used to state facts about the relationship between multiple statements. They are of the form
(1) A₁ ∧ A₂ ∧ … ∧ Aₙ → B
where the statement B is called the conclusion and the statements
(2) A₁ ∧ A₂ ∧ … ∧ Aₙ = ⋀ᵢ₌₁ⁿ Aᵢ
are called the premises. This type of statement is called an argument, and it is considered a valid argument if, whenever all premises Aᵢ are true, the conclusion B is true as well (s. [6] p. 597). In other words: for the argument to be valid, the conclusion can never be false while all the premises are true. This matches our definition of an implication from our last article, the Introduction into Mathematical Logic. Remember that we defined an implication itself to be false if and only if we attempt to imply something false from a true statement.
A theorem is accompanied by a proof, deducing the conclusion from the initial premises. One little subtlety to consider: the validity of the argument itself, A → B, acts as a premise. What we are really interested in is whether we can legitimately conclude statement B from that implication. Thus, writing ⋀ᵢ₌₁ⁿ Aᵢ = A, the logical structure of a proof takes the form
(3) (A → B) ∧ A → B.
Direct Proof
A direct proof is the simplest form of a proof matching the structure of statement (3). Setting the premises A as given, we derive conclusion B by sequentially applying axioms, definitions, and other theorems (cf. [4] p. 31).
The initial examples further above in this article are all examples of direct proofs.
Proof by Contradiction
Often it proves difficult to show directly that A → B is true; it can be easier to show instead that its negation ¬(A → B) leads to a contradiction, from which it follows that A → B is correct (s. [4] p. 31f; [6] p. 618). The statements A → B and ¬(A ∧ ¬B) are equivalent; I encourage you to verify this yourself by comparing their truth tables. Thus, if we would like to prove a statement of the form “If A, then B…”, we can instead suppose “A and ¬B…” and try to reach a contradiction (s. [3] p. 102f).
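As a quick way to do that comparison, here is a small Python script of my own (with `implies` standing in for material implication) that enumerates all four truth assignments:

```python
def implies(a: bool, b: bool) -> bool:
    # material implication: false only when A is true and B is false
    return (not a) or b

for a in (False, True):
    for b in (False, True):
        assert implies(a, b) == (not (a and not b))
print("A → B and ¬(A ∧ ¬B) have identical truth tables")
```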
Probably the most prominent application of proof by contradiction is the proof that the square root of 2 is irrational:
Theorem: √2 is an irrational number (i.e. it cannot be represented as a fraction with both an integer numerator and an integer denominator).
Proof: Suppose for a moment that √2 were a rational number. Then it would be possible to represent √2 as a fully simplified fraction a / b with a, b ∈ ℤ:
(4) √2 = a / b ⇔ 2 = a² / b² ⇔ a² = 2b².
2b² is certainly an even number, therefore a² must be even. Now suppose a were an odd number; then we could write a as 2k + 1. Inserting this into a² yields:
(5) a² = (2k + 1)² = 4k² + 4k + 1.
But this contradicts our statement that a² must be even. Hence a must be an even number, too. We can write a as 2k and insert into a² = 2b² again:
(6) a² = 2b² ⇔ 4k² = 2b² ⇔ 2k² = b².
Thus b must be an even number as well. Since a and b are evidently both even, this contradicts our initial statement that a / b was a fully simplified fraction. Ergo, it is not possible to represent √2 as a fraction of integers, and √2 must be irrational (s. [5] p. 319).
Proof by Contraposition
Although, as stated before, the statements A → B and B → A are emphatically not equivalent, the statements A → B and ¬B → ¬A very much are (cf. [3] p. 51 and 96f; s. [4] p. 31). This becomes evident when comparing their truth tables:
Fig. 3 Proof by contraposition.
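The comparison in Fig. 3 can also be done mechanically. A short Python sketch of my own, again modelling material implication as a function:

```python
def implies(a: bool, b: bool) -> bool:
    return (not a) or b  # material implication

rows = [(a, b) for a in (False, True) for b in (False, True)]
# A → B agrees with its contrapositive ¬B → ¬A on every row,
assert all(implies(a, b) == implies(not b, not a) for a, b in rows)
# but not with its converse B → A (they differ when A=False, B=True).
assert implies(False, True) != implies(True, False)
print("contrapositive equivalent, converse not")
```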
So, similar to proof by contradiction, if it appears difficult to show that A → B, it might instead be easier to show ¬B → ¬A, which is called proof by contraposition. More intuitively, the statements
(7) “If I called my mother, then she would be happy.”
and
(8) “If my mother is not happy, then I did not call her.”
convey the same relationship between the two statements “I called my mother” and “My mother is happy”. On the other hand, it would be impossible to extrapolate the same information from the statement
(9) “If my mother is happy, then I called her.”
Given statement (7) there is not enough information to determine whether statement (9) is equally true.
Theorem: For any a ∈ ℤ, if a² is an even number, then a is also an even number.
Proof: We prove this theorem by contraposition, showing that if a is not even, then a² cannot be even either. Hence, suppose a is an odd number. It follows that a can be represented as an even number plus one:
(10) a = 2k + 1.
Squaring both sides gives
(11) a² = (2k + 1)² = 4k² + 4k + 1 = 2(2k² + 2k) + 1, which is always an odd number.
We have shown that if a is odd, then a² is necessarily odd, too. It follows that if a² is an even number, then a must also be an even number.
Proof by Mathematical Induction
Mathematical induction works the same way we described induction further above, generalizing from an initial clause to a universal statement; though in this instance, the induction is performed via strict application of deductive rules.
The significance mathematical induction has for Computer Science would easily warrant its own article. It is particularly useful for statements that involve a form of discrete repetition, like sequences, series, summation, etc., generally any time we hear a sentence like “Show that for all n ∈ ℕ…”.
It begins with the induction hypothesis, the statement that we would like to prove. The first objective is to prove that the hypothesis is true for one specific instance of n, the base case, for example n = 1. In the next step, the induction step, we assume that the hypothesis holds for an arbitrary n ∈ ℕ and show that this implies the hypothesis for n + 1, without any further restrictions (s. [4] p. 27).
So, suppose P(n) with n ∈ ℕ is the above-mentioned induction hypothesis. We first show that P(1) is true. Then, assuming that P(n) were already proven, we show that P(n + 1) must equally be true. If we succeed, we can start again with the one case that we have already proven, P(n = 1), and deduce that P(n = 2) must evidently be true as well. Given that P(n = 2) has now also been proven, P(n = 3) must also be true, and so forth, proving P(n) for all n ∈ ℕ.
Theorem: Let n, m ∈ ℕ with n ≤ m and let A be a set with m elements. There are ∏ᵢ₌₀ⁿ⁻¹(m-i) ways to choose n elements from A, where each element in A can be chosen only once.
Proof: Let P(n) be the statement that ∏ᵢ₌₀ⁿ⁻¹(m-i) is the formula for calculating the number of ways to choose n elements from a set A of size m. We prove P(n) via induction over n.
Base case: Let n = 1. There are m ways to choose exactly 1 element from A:
(12) ∏ᵢ₌₀¹⁻¹(m–i) = (m–0) = m, so P(1) is true.
Induction step: Assuming P(n) holds for some n, we show that P(n+1) must also hold, i.e. that there are ∏ᵢ₌₀ⁿ(m-i) ways to choose n+1 elements from A. When we choose the (n+1)ᵗʰ element, n elements have already been selected and removed from the available choices, so there are (m-n) elements left to choose from. The rule of product says we can multiply the number of choices for the (n+1)ᵗʰ element with the number of ways to choose the previous n elements. Therefore, the number of ways to choose the first n+1 elements is:
(13) (m-n) ∙ ∏ᵢ₌₀ⁿ⁻¹(m-i) = ∏ᵢ₌₀ⁿ(m-i).
The statement P(n+1) evidently holds true, establishing the inductive step. Since the base case and the inductive step have both been shown to hold true, it follows that the initial statement P(n) holds for every n ∈ ℕ.
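Before (or after) trusting the induction, we can sanity-check the formula numerically. A small Python sketch of my own that compares the product from the theorem against the standard falling factorial `math.perm` (available since Python 3.8):

```python
import math

def ordered_choices(m: int, n: int) -> int:
    """The theorem's product: (m-0)(m-1)...(m-(n-1))."""
    result = 1
    for i in range(n):
        result *= m - i
    return result

# Cross-check against the built-in falling factorial.
for m in range(1, 9):
    for n in range(1, m + 1):
        assert ordered_choices(m, n) == math.perm(m, n)
print("the product formula matches math.perm for all tested m, n")
```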
Proving the Equivalence of Two or More Statements
Bear in mind that we stressed in the last article that an implication alone is unidirectional: the statements A → B and B → A have strictly different meanings. To show the equivalence A ↔ B of two statements A and B, it is necessary to show that the implications A → B and B → A are each true individually (s. [3] p. 53f; [4] p. 31).
Fig. 4 The equivalence of two statements.
With 3, 4, 5 or even more statements that are all supposedly equivalent, the number of pairwise implications to prove grows very rapidly (s. Fig. 5.a).
But implications are transitive, meaning that if A → B is true, and B → C is true, it follows that A → C is also true. And if A → B, B → C and C → A are all true statements, then by their transitive property we have already provided sufficient proof that A ↔ B ↔ C, i.e. that all three statements are in fact equivalent — again, I encourage you to compare their truth tables and verify this yourself.
So, to show that the statements A, B, C, … are all equivalent, we can line them up in an implication chain A → B, and B → C, and C → …, and … → A. If we can show that each statement in the line correctly implies the next, then we have established the equivalence of all the statements in the chain (s. [4] p. 32). Note that the very last statement in the line must be shown to imply the first statement as well, to form a closed and valid implication circle (s. Fig 5.b).
Fig. 5 Equivalence between multiple statements.
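The truth-table comparison suggested above can be automated. In this Python sketch of mine, every truth assignment that satisfies a closed implication circle forces all three statements to agree:

```python
def implies(a: int, b: int) -> bool:
    return (not a) or b  # material implication on truth values 0/1

# Every assignment satisfying the circle A→B, B→C, C→A forces A = B = C.
for a in (0, 1):
    for b in (0, 1):
        for c in (0, 1):
            if implies(a, b) and implies(b, c) and implies(c, a):
                assert a == b == c
print("a closed implication circle forces all statements to agree")
```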
Conclusion
We would like to think of proofs, just as we would probably like to think of any scientific statement, as giving us insight into a deeper causal relationship. But if we were to relate two seemingly unrelated phenomena and could prove that this relationship is true for all possible evaluations of the argument, then what does this tell us about its validity? What precisely does it tell us about its causality? The logicists’ view has been that Mathematical statements derive their truth “by virtue of their form” (s. [2] p. 208), that is, from the correctness of their syntactic structure and the mechanical application of inference rules, as distinct from their semantic meaning. Consequently, logic as a discipline strictly distinguishes between the syntax and the semantics of logic expressions (cf. [6] p. 582ff).
Although this article concludes our introduction to foundational logic in Mathematics, it would be interesting to pick apart what precisely this last bit means. I might pick the topic up one more time in a future article, delving deeper into statement and predicate logic. The direction forward, though, will be to move to topics where what we have learned about logic arithmetic and proofs will become very relevant in an applied setting.
For example: what we have learned serves excellently as a bridge to digital logic, paving the way to talk about the architecture of computer systems. Hope to see you there!
References
[1] A. Schulz, L.Schlipf, and J. Rollin: Einführung in die wissenschaftliche Methodik der Informatik: Wissenschaftstheorie. Kurseinheit 2, Volume 2, FernUniversität, Hagen, 2019.
[2] E. Snapper: The three crises in mathematics: Logicism, intuitionism and formalism. Mathematics Magazine, 52(4), 1979, p. 207–216.
[3] D. J. Velleman: How to Prove It: A Structured Approach. 3rd ed., Cambridge University Press, Cambridge, 2019.
This article flows out of a video conversation between the YouTubers and mathematicians Grant Sanderson and Matt Parker (references below) and an email conversation between myself and a friend of mine, Peter Gaffney, as we explored our own solutions to this beguiling puzzle.
Exploring different approaches to this seemingly impossible problem reveals some interesting connections between various branches of mathematics including logic, set theory, multidimensional vector spaces and even category theory.
Outline of the Puzzle
You and a fellow prisoner are imprisoned in a dungeon and are facing execution. As is usual in these problems, the prison warden has both a love of recreational mathematics and an unusual amount of judicial latitude when it comes to deciding your fate. They want to give you and your cellmate a chance of freedom but don’t want to make it too easy for you.
They have a chessboard where each square is covered by a coin, showing either heads or tails. Moreover, it’s a special chessboard with a hidden compartment in each square. A single one of these squares contains a symbolic key to the jail and freedom for you and your cell companion. You will know which square contains the key; your fellow prisoner has to guess it.
The rules are as follows:
(1) You and your cellmate can discuss how to encode a message using the chessboard but the prison warden can hear and understand everything that you say.
(2) Once you have decided on a system, your companion leaves the room.
(3) You observe the prison warden hiding the key in one square and then arranging the 64 coins as heads or tails however they deem fit — presumably trying to frustrate your system.
(4) You then turn over exactly one coin on the chess board and leave the room.
(5) Your companion re-enters the room, without having any opportunity to see or communicate with you. He observes the chessboard and the arrangement of coins and points to the square where he believes the key and freedom reside.
(6) Nothing sneaky is allowed, on pain of immediate death, i.e. this is a pure logic problem; there is no metagame. Paper and pencil, a calculator and plenty of time are available to you and your cellmate if necessary.
In order to make our diagrams easier to understand, we’ll use dark and light counters to represent heads and tails and limit ourselves to a 4 x 4 chessboard with 2⁴=16 squares. However, everything we discover will apply equally well to a standard chessboard with 2⁶=64 squares, or even far larger chessboards, provided the total number of squares is a power of two.
Diagram 2: The chessboard from Diagram 1, now covered in counters after the secret compartment has been closed
Some Properties of the Problem
The prison warden gets to set all the coins on the board and we only get to flip one to convey our message: a tiny signal in all that noise. At first glance this seems completely impossible, but let’s start by understanding what is relevant in our puzzle and what we can discard. Then we’ll see if we can start with a simpler version of the problem.
For starters, the geometry of the chessboard is irrelevant, because we are not dealing with chess moves or anything else that would make the relationship of the squares to each other important. The squares could all be in a row, in a four-dimensional 2x2x2x2 hypercube or in any other configuration. Additionally, the order of the squares doesn’t matter as long as we keep track of it.
In some sense, the chessboard must encode the 4 bits of information required to specify the row and column of the key square. Given that the prison warden will have chosen a configuration either at random or maliciously, some or all the bits encoded may be wrong. So it must be possible to choose a coin to flip (represented here by the counter changing color) that will fix the wrong bits without impacting the correct ones. In the case where the chessboard already encodes 4 correct bits, there must be at least one square that has no effect on the 4 bits when its coin is flipped.
A single-square chessboard (1 = 2⁰) has no information to transfer because the key can only be in that square. The first non-trivial case is two squares (2 = 2¹). We will start with this and generalize using 4 distinct approaches to arrive at a solution.
Diagram 3: The simplest non trivial case of 2 squares.
Feel free to take a pause before continuing. It’s a fun problem that is very rewarding to mull over.
Let’s start by solving the two-square case which, based on the hints above, is relatively simple. We need one square that does nothing when we flip its coin (aka change its counter); let’s choose the one on the left. This means that the right square needs to encode the location of the key: let’s choose light for left and dark for right. When the warden has finished with the board, if the right-hand coin indicates the incorrect square for the key, we flip it. However, if it is already correct, we flip the left-hand coin, which has no effect on our key-location message.
Now we’re ready to scale up to bigger boards and consider various approaches. These approaches can be read separately if you want to dip in and out of this article.
Approach A: Hypercube
Let’s imagine our squares to be arranged into a 2x2x2x2 tesseract — a 4 dimensional hypercube of side length 2.
If we consider each dimension separately, we will have 4 separate pairs of squares, and we already know how to solve a single pair of squares. However, we do have an issue, as we no longer have single coins for each square. Each square has behind it a collection of 2³=8 coins, one for every possible combination of the other dimensions. Somehow we need to aggregate all these to produce a single virtual “coin” on the right representing that dimension. A single coin is an odd number of coins, which suggests the possibility that we regard the virtual “coin” as dark if there is an odd number of dark coins in this collection of 8, and light otherwise; this is often called parity. To make this clearer, let’s temporarily drop to 2 dimensions and show the parity coins, each aggregated from a collection of just two coins in this case, as shown in Diagram 4.
Diagram 4: A 2×2 chessboard with the parity “coins” worked out.
We can change our dimension’s parity from odd to even or vice versa by flipping either of the 2 coins. Even nicer, the dimensions are of course independent: there is a coin for every single combination of dimension coordinates, so we can work out the coordinate of the coin to flip (left or right) separately for each dimension. If the current parity coin for a dimension Dₓ does not indicate the location of the key in that dimension, we can change the parity by flipping a coin that is on the right for that dimension: Dₓ=R. If the parity already indicates the location of the key in that dimension, then we flip a coin on the left: Dₓ=L. When we have done this for all available dimensions, we will have unique coordinates for the coin to flip.
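Approach A can be condensed into a few lines of code. The following Python sketch is my own illustration (the function names are mine, and the random board stands in for the warden's choices):

```python
import itertools
import random

def dimension_parities(board):
    """For each dimension d, the parity of dark coins on that dimension's 1-side."""
    return [sum(v for p, v in board.items() if p[d]) % 2 for d in range(4)]

def coin_to_flip(board, key):
    # Wrong parity -> flip on the 1 ("right") side of that dimension;
    # correct parity -> flip on the 0 ("left") side, leaving it untouched.
    return tuple(1 if parity != k else 0
                 for parity, k in zip(dimension_parities(board), key))

random.seed(1)  # stands in for the warden's arbitrary configuration
board = {p: random.randint(0, 1) for p in itertools.product((0, 1), repeat=4)}
key = (0, 1, 1, 0)  # the key's coordinate in each of the 4 dimensions
flip = coin_to_flip(board, key)
board[flip] ^= 1    # the single flip we are allowed
assert tuple(dimension_parities(board)) == key
print("after one flip, the four parities spell out the key:", key)
```

The final assertion holds for any starting board, which is the whole trick: one flip is always enough.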
So we have a way to solve a 2×2 board or indeed our 2x2x2x2 hypercube. But how do we map the chessboard squares onto this hypercube? One approach is to drop down to a single dimension first, as in Diagram 5.
Diagram 5: Making a chessboard one dimensional
We can then use the algorithm that many programming languages use for implementing a multi-dimensional array in a linear address space. This involves a form of recursive nesting which should be fairly obvious from Diagram 6.
Diagram 6: Allocating a multi-dimensional array as a line.
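The recursive nesting of Diagram 6 is just the usual row-major flattening. A tiny Python sketch of mine, with a hypothetical helper name:

```python
def linear_index(coords, side=2):
    """Collapse nested coordinates into one index, most significant dimension first."""
    index = 0
    for c in coords:
        index = index * side + c
    return index

assert linear_index((0, 1, 1, 0)) == 6   # agrees with the binary label 0110
assert linear_index((1, 1, 1, 1)) == 15  # the last of the 16 squares

all_coords = [(a, b, c, d) for a in (0, 1) for b in (0, 1)
              for c in (0, 1) for d in (0, 1)]
assert sorted(linear_index(p) for p in all_coords) == list(range(16))
print("all 16 hypercube coordinates map onto indices 0..15")
```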
Putting all this together, we get the labeling shown in Diagram 7.
Diagram 7: 4 Dimensional Labeling for our chessboard with the location of the key marked
Now we can produce the virtual coins for each dimension Dₓ by working out the parity of each collection of 8 coins where Dₓ=R as shown in Diagram 8.
Diagram 8: Calculating the parity for each dimension in the hypercube
The board currently codes for the position: D₀=R, D₁=L, D₂=R, D₃=R. We want it to code for the position: D₀=L, D₁=R, D₂=R, D₃=L. D₀, D₁ and D₃ are incorrect. We indicate a change with R and can leave the parity of a dimension alone with L so we can fix the position encoded by the board by flipping the coin at D₀=R, D₁=R, D₂=L, D₃=R.
Approach B: Binary
Let’s consider our board again but label our rows and columns. Instead of using the usual letters and decimal numbers, we label both with a pair of binary digits (bits) in Diagram 9. This highlights that we need to somehow encode the position of our key (in the third column of the second row) as 0110.
Diagram 9: One possible configuration of “coins”.
With our two square chessboard, we can label the first square 0 and the second square 1 but how do we extend this to larger chessboards? We can number pairs of squares 0 and 1 and then do the same for pairs of pairs and so on as in Diagram 10.
Diagram 10: Labeling the squares on our chessboard
And yes, we’ve just reinvented binary counting meaning that each of the squares on our chessboard can be labelled by a 4 digit binary number as in Diagram 11.
Diagram 11: Labeling each square with a binary number.
Because the width of the chessboard is a power of two (4 = 2²), this gives us a simple pattern when we consider the squares that have a one in various bit positions. If we use blue for bit 0 (D₀), red for bit 1 (D₁), green for bit 2 (D₂) and yellow for bit 3 (D₃), we can more easily see how the chessboard is divided into bands, as shown in Diagram 12.
Diagram 12: Bands on the chessboard corresponding to locations with different bits set to 1
In a sense this is the result of collapsing our 4D tesseract into 2D. Two of the dimensions, represented by the pink and yellow bands, have maintained their left-right symmetry. The other two dimensions have had their symmetry broken by folding. We assigned binary numbers to each square on the chessboard, but another way of looking at these numbers is as bit vectors: that is, as the coordinates of a coin location in a 4-dimensional vector space, as in Equation 1.
Equation 1: A bit vector as a linear combination of basis vectors
As with all vector spaces, a general vector here is a linear combination of the independent basis vectors. However, the scaling factors are all either 0 or 1, which implies that each basis vector is either part of the linear combination or it is not. In effect, this vector space describes the 2x2x2x2 tesseract from Approach A.
Now we could just calculate the parity of each band in Diagram 12 as before, i.e. check whether there is an odd or an even number of dark coins in each band. However, we don’t actually need to count: we can just toggle between odd and even every time we encounter a dark coin in the band. This is effectively modulo-2 addition, or can also be considered a NOT operation where 0 and 1 represent False and True. We only want this NOT operation to take place when we encounter a dark coin in a band represented by a 1 bit in the appropriate position. There is a neat way to do this using an exclusive OR operation.
One way of looking at exclusive OR is as a controlled NOT with a control argument and a data argument. When the control argument is zero, nothing happens to the data argument. However when the control argument is one, the data argument gets inverted either from 0 to 1 or 1 to 0. Finally the exclusive OR operator available in many computer languages doesn’t just operate on a single bit. It can operate on two bit vectors where the corresponding individual bits in each vector are exclusive ORed to get the result.
So all we need to do is the following: scan the board from top to bottom, calculating a cumulative value which starts as the vector 0000. Each time we encounter a dark coin, we perform an exclusive OR of our square’s location label with the current cumulative value to get the new cumulative value. At the end, this cumulative value represents the location of a square. If we use the board in Diagram 9, this turns out to be 1011. However, we need the board to represent the third column of the second row, or 0110. To find the bits that need to be corrected, we simply exclusive OR these two vectors together to get 1101, which is the location of the coin that we need to flip.
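The whole of Approach B fits in a few lines. This Python sketch is my own rendering of the scan-and-XOR procedure (the random board stands in for the warden's configuration, not the exact board of Diagram 9):

```python
from functools import reduce
import random

def board_value(board):
    """Cumulative XOR of the 4-bit labels of every dark (1) square."""
    return reduce(lambda acc, label: acc ^ label,
                  (label for label in range(16) if board[label]), 0)

random.seed(7)                    # the warden's arbitrary configuration
board = [random.randint(0, 1) for _ in range(16)]
key = 0b0110                      # third column of the second row
flip = board_value(board) ^ key   # XOR cancels whatever is already right
board[flip] ^= 1                  # the one flip we are allowed
assert board_value(board) == key  # the cellmate now reads off the key
print(f"flipped square {flip:04b}")
```

Because flipping square `flip` XORs the cumulative value by `flip`, the new value is always `old ^ old ^ key = key`, regardless of the warden's board.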
You may wonder whether there are other logical operations we could choose when assembling the cumulative value that will indicate the key location. In practice, the choice of operator is quite constrained. We want each bit location to be independent, which rules out operations involving borrows or carries. Operations like AND or OR aren’t fit for purpose either, because AND and OR ignore their second argument when the first one is 0 or 1 respectively. This makes it possible for a malicious warden to choose coins that will lock out information from a region of the board where he can then hide the key. So we need a logical operation that is always sensitive to changes in each input argument. That only leaves exclusive OR, which returns True when its arguments are different, and its inverse, equivalence, which returns True when its arguments are the same.
Approach C: Exceptions to the exceptions
To ease into this approach, keep in mind the rule for leap years which has a basic condition modified by subsequent conditions:
It’s a leap year if the year is divisible by 4 unless (the year is divisible by 100 unless (the year is divisible by 400)).
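Written as code, the nested "unless" structure becomes a chain of early returns. A Python sketch of my own of this standard Gregorian rule:

```python
def is_leap_year(year: int) -> bool:
    # divisible by 4, unless divisible by 100, unless divisible by 400
    if year % 400 == 0:
        return True
    if year % 100 == 0:
        return False
    return year % 4 == 0

assert [is_leap_year(y) for y in (2000, 1900, 2024, 2023)] == [True, False, True, False]
print("2000 and 2024 are leap years; 1900 and 2023 are not")
```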
Diagram 13: Our chessboard with a purpose applied to each square
On the chessboard in Diagram 13, we use combinations of letters to label the rows and columns. For example, A by itself represents the 2nd column; B by itself represents the 3rd column; A and B together represent the 4th column; and no A or B indicates the 1st column. This means that A can indicate either the second or the fourth column. The squares are also labeled with combinations of letters. By analogy with the leap-year rule we have the following (involving all the squares that contain A in their label):
A square is set if it has a dark “coin”. We are referring to an A column if square A is set unless AB is set unless AC is set unless ABC is set unless AD is set unless ABD is set unless ACD is set unless ABCD is set.
The rules for B columns and C and D rows are very similar. English is a little vague when it comes to logic, but X unless Y could be regarded as X and not Y. So if X and Y are invited to a party but really don’t like each other, we might have the rule X unless Y, and Y unless X; more formally (X and not Y) or (Y and not X), which is exclusive OR. Because our Xs and Ys are disjoint, the full rule for X will give us one half of the exclusive OR and the full rule for Y will give us the other half.
When my friend Peter came up with this approach, knowing my love of Pascal’s triangle, he was keen to point out that the numbers of squares are 1 with no letters, 4 with one letter, 6 with 2 letters, 4 with 3 letters and 1 with 4 letters, which is a row of the triangle: 1 4 6 4 1.
To understand why this is, let’s look at the simpler 2×2 example in Diagram 14. We are just interested in the numbers of letters, so we can replace our As and Bs with the more general x, and we get a graphical representation of (1+x)² and another row of the triangle: 1 + 2x + x².
Diagram 14: A justification for why Pascal’s triangle makes an appearance
By analogy, our 4×4 board would represent (1+x)⁴, but that does suggest we should reorder our square labeling as in Diagram 15. Please note that while this might look more elegant, the order of the squares is irrelevant and their labeling doesn’t need to match up with the row and column labeling used to identify the key location.
Diagram 15: The chessboard with the squares reordered and those containing a light “coin” showing grey text
When we evaluate the four rules for the board configuration shown in Diagram 9, we discover that the rule for B produces False while the rules for A, C and D all produce True, indicating the second column of the fourth row. The key is located in BC, so we need to change the results of rules A, B and D, which we can do by flipping the coin at ABD to its light color.
We can emphasize how the rows and columns overlap by using colors: blue for A, red for B, green for C and yellow for D as in Diagram 16.
Diagram 16: The same chessboard using separate colors for each letter
While A and C are each in two pieces, this is basically a Venn Diagram for 4 sets, similar to that in Diagram 17.
Diagram 17: A complete Venn Diagram for 4 sets.
Diagram 16 keeps all the overlaps in the Venn Diagram the same size. In two dimensions, this is only possible with some non contiguous sets in the diagram. Of course Approach A is also effectively a Venn Diagram in 4 dimensions where all the overlaps are the same size.
Nothing to do with our problem here, but interestingly, when the consistency of overlap size is relaxed, it’s possible to get some remarkably symmetrical Venn diagrams in 2D such as this gorgeous 11 set example:
Diagram 18: A symmetrical Venn Diagram for 11 sets by Mamakani and Ruskey
Back to our chessboard, where we can go even further. Let’s say A, B, C and D represent single-element sets so that A={0}, B={1}, C={2}, D={3}; then all the squares on the chessboard, reading from top left to bottom right, represent the entire Power Set of the set {0,1,2,3}, as shown in Equation 2. A Power Set is the set of all possible subsets of a given set.
Equation 2: The Power Set for {0,1,2,3}
Thus one can see how a power set for a set with n elements itself has 2ⁿ elements. As each of the subsets represents one way of choosing k elements from a set with n elements, we have another reason for Pascal’s triangle in Equation 3.
Equation 3: Numbers of subsets with k elements in the power set for a set with n elements
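We can reproduce both counts in a few lines. A Python sketch of mine using `itertools.combinations` to enumerate the power set of {0,1,2,3}:

```python
from itertools import combinations

base = {0, 1, 2, 3}
power_set = [set(c) for k in range(len(base) + 1)
             for c in combinations(sorted(base), k)]

assert len(power_set) == 2 ** len(base)   # 2^4 = 16 subsets, one per square
sizes = [sum(1 for s in power_set if len(s) == k) for k in range(5)]
assert sizes == [1, 4, 6, 4, 1]           # the Pascal row from Equation 3
print(sizes)  # prints [1, 4, 6, 4, 1]
```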
Our chessboard, by having some dark coins, has selected one particular subset of the Power Set. This subset is itself a collection of elements which are themselves subsets of {0,1,2,3}. The subset selected by the chessboard is in fact (A∪B)⊕(A∪C)⊕D⊕(B∪C)⊕(A∪B∪D)⊕(A∪C∪D)⊕(A∪B∪C∪D). In order to reduce this expression, we’re going to make use of the identities in Equation 4.
Equation 4: Set Identity Rules that can be used for reduction
The first two rules in Equation 4 work in tandem to ensure that an odd number of Xs exclusive ORed together produces an X whereas an even number produces the null set. The third rule distributes exclusive OR across union (also called disjunction). It’s worth emphasizing that this is not a general rule and only works because Y and Z are disjoint. Not relevant here but there is a general distribution rule involving exclusive OR shown in Equation 5:
Equation 5: General distribution rule for intersection over exclusive OR
Intersection (also called conjunction) distributes over exclusive OR. In fact these two operations form the field GF(2) where intersection is equivalent to modulo 2 multiplication and exclusive OR is equivalent to modulo 2 addition.
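This, too, is easy to verify exhaustively. A Python sketch of my own checking that XOR and AND on {0, 1} behave exactly like modulo-2 addition and multiplication, including the distribution rule from Equation 5:

```python
def xor(a, b):
    return a ^ b   # exclusive OR: addition modulo 2

def land(a, b):
    return a & b   # AND/intersection: multiplication modulo 2

for a in (0, 1):
    for b in (0, 1):
        assert xor(a, b) == (a + b) % 2
        assert land(a, b) == (a * b) % 2
        for c in (0, 1):
            # AND distributes over XOR, as in Equation 5
            assert land(a, xor(b, c)) == xor(land(a, b), land(a, c))
print("XOR and AND on {0, 1} form GF(2)")
```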
So now, with the help of our identities, we can do the reduction as shown in equation 6 which indicates the square ACD as expected.
Equation 6: Reducing the Chessboard set expression
So to sum up: the chessboard is the set of all possible subsets of a set of 4 elements, the power set of {0,1,2,3}. For a set of n elements, the power set contains 2ⁿ elements, and so there are 2⁴=16 squares on the chessboard, one for each subset. Moreover, the dark coins select a set of these subsets, which is in fact a subset of the power set: one of 2^(2^n) = 2¹⁶ = 65536 such subsets. Our algorithm for identifying the key square from a board configuration is in fact a mapping from a subset of the power set of {0,1,2,3} to a subset of {0,1,2,3}.
If you’ve ever played Reversi or Othello, you’ve engaged in a power struggle over a power set where one player is striving to fill the set and the other to empty it. 🙂
In approach B, each square was identified by a unique binary number which could also be regarded as a bit vector. In approach C, each square is identified by a unique set. This is not that surprising because in programming languages, bit vectors are often used to encode sets that may only contain small numbers. The typical word length in a modern computer is 64 which allows us to efficiently encode sets that contain as elements any of the numbers from 0 to 63. For sparse sets with lots of missing elements, a hash table is more efficient because a 0 bit is not needed to record each absence.
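This encoding is easy to sketch. Under the usual convention (my own example, not code from the article), element i of the set becomes bit i of the word, and the set operations map onto single bitwise instructions:

```python
# Encoding small-number sets as bit vectors: each element i of the set
# becomes bit i of an integer, so set algebra becomes bitwise logic.
def to_mask(s):
    return sum(1 << e for e in s)

def to_set(mask):
    return {i for i in range(mask.bit_length()) if mask >> i & 1}

a, b = {0, 2, 3}, {1, 2}
assert to_mask(a) | to_mask(b) == to_mask(a | b)   # union        -> OR
assert to_mask(a) & to_mask(b) == to_mask(a & b)   # intersection -> AND
assert to_mask(a) ^ to_mask(b) == to_mask(a ^ b)   # exclusive OR -> XOR
```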
Approach D: Self Composition
Diagram 19: Self Composition of some 2 and 3 dimensional shapes.
When Category Theory was developed to formalize fundamental structures in various branches of mathematics, it was so abstract that few thought it would ever have a practical application. Yet the idea of Composition, for example, quickly turned out to be an extremely powerful way of addressing problem domains using Functional Programming. You can get a very informal idea of self composition from Diagram 19. Here we see that a square can be composed from 4 smaller squares, and a triangle can be composed from 4 smaller triangles, 3 at the corners and one upside down in the middle. Hexagons can tile the plane but they can’t self compose. Extending to three dimensions, we can see that a cube is composed of 8 smaller cubes. It might seem that a tetrahedron could be composed from 5 smaller tetrahedrons, one at each corner and one upside down again in the middle, but that shape in the middle turns out to be an octahedron.
But I digress. How can self composition help us with our chessboard? Let’s look at the simplest nontrivial case of two squares again, as in Diagram 3. Is there some sense in which this two square chessboard can be composed from two additional two square chessboards? We can obviously discard features not relevant to our problem, like the size and color of the squares. In fact we can discard everything but the coin on the right, which tells us which square the key is under (dark = right, light = left). In the case of a two square chessboard containing a two square chessboard in each of the left and right squares, things work nicely because each two square chessboard simplifies to a coin. But how do we generalize this to multiple levels? The answer is to consider a single coin as a special case of a stack of coins.
Rather than nesting all these chessboards in our diagrams, it will be a lot neater to show the whole lot as a binary tree. Here each step of composition occurs moving up as we combine two squares containing two stacks of coins to form a single stack of coins as in Diagram 20 which will make more sense shortly.
Diagram 20: Working up the tree using our composition rule at each step
Finally we get to the crux of this approach: we need a rule for combining the stacks of coins. We know the rule at the lowest level already. Make a stack of one from the coin on the right.
We’re going to need 4 coins when we get to the root which we can then use to work our way back down the tree using light for left and dark for right. I got stuck initially with this method until I realized that an extra coin was needed at the top of the stack as a form of eager evaluation. Our top coin is the parity of the entire board at that node in the tree. Ordinarily this is of no interest because we can’t use it to communicate anything to our cellmate. We are obliged to flip a coin so the parity of the complete board will always change from odd to even or even to odd beyond our control. However if this is not the root node of the tree, the parity of the complete board will be of use if that board turns out to be on the right hand of the next node up. If it is on the left hand, the whole board parity is discarded. Producing intermediate results that we might need and discarding them if we don’t is a characteristic of parallel systems.
This all leads us to the rule for combining stacks shown in Equation 7.
Equation 7: The rule for combining stacks of coins
Here ⚬ represents the operation of composing two stacks of coins to form a new stack that is one higher. Stacks of coins are numbered from the top down. [0] indicates the top coin and [>0] indicates every coin except the top. The comma represents concatenation where multiple stacks are merged to form a single stack. Finally ⊕ represents exclusive OR which can be applied to two single coins. However it can also be applied to two stacks of coins where this implies that the operation is actually being applied to corresponding individual coins in each stack.
⊕ requires its arguments to have the same number of coins. The result in other cases is undefined because there will be at least one coin with no corresponding coin in the other stack. To avoid this undefined behavior, we need to always combine stacks of the same size. This implies that the tree we are working with needs to be perfectly balanced, i.e. have the same depth everywhere. This in turn means that the chessboard can only contain a number of squares that is a whole power of two. 3Blue1Brown has a very neat alternative justification for this limitation involving graphs and coloring.
The rule in Equation 7 might be easier to understand if we work through an actual example with coin stacks, as shown in Equation 8.
Equation 8: An example of this rule being applied to two stacks of coins
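Equation 7 itself is an image, but the description above pins the rule down well enough to sketch it in code. The reading below is my own reconstruction, so treat it as a sketch: the new top coin is the parity of the combined board, below it sits the right half's old top coin (the left/right address coin for this level), and the two halves' remaining address coins are exclusive ORed pairwise.

```python
def combine(left, right):
    # One reading of the composition rule (a reconstruction, not the
    # article's image):
    #   new top coin = parity of the combined board (left[0] XOR right[0])
    #   next coin    = the right half's parity (this level's address coin)
    #   lower coins  = pairwise XOR of the halves' remaining address coins
    return [left[0] ^ right[0], right[0]] + \
           [a ^ b for a, b in zip(left[1:], right[1:])]

def board_stack(coins):
    stacks = [[c] for c in coins]          # a single coin is a stack of one
    while len(stacks) > 1:
        stacks = [combine(stacks[i], stacks[i + 1])
                  for i in range(0, len(stacks), 2)]
    return stacks[0]
```

Under this reading, for every 4-coin board the top coin of `board_stack` comes out as the whole-board parity, and the coins beneath it spell out, in binary, the XOR of the dark squares' indices, i.e. the square the board currently addresses.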
Next we need to build up a stack of coins corresponding to the location of the key as shown in Diagram 21. Each time we move up the tree we add a coin to the stack. If we entered the node from the left, this coin is light and if we entered from the right it is dark.
Diagram 21: Working out which stack of coins will address the location of the key
Now (Equation 9), we take the stack of coins assembled in Diagram 21, discard the top one for whole board parity and exclusive OR this stack with the stack that we generated in Diagram 20.
Equation 9: Exclusive ORing two stacks at the root to get the stack we need to walk down the tree
We now have the stack of coins that address the square on the chessboard containing the coin that we need to flip. We walk down the tree (Diagram 22), taking the top coin off the stack at each step and going left if it is light and right if it is dark. And that’s all there is to it. 🙂
Diagram 22: Walking down the tree using the stack of coins as the address of the flip square.
This might seem more complicated than the other approaches. However, it has one useful feature: it is a parallel algorithm. For the last half century, we have been thoroughly spoiled by Moore’s law, with transistor counts (and with them processor performance) doubling roughly every two years. These halcyon days are coming to an end, so we need to move away from doing things one at a time. Algorithms and their implementations will need to be inherently parallel to address the problems in our future.
If you’d like to see how to go from these binary trees to modular logic networks that can perform the individual tasks for both prisoners, do let me know in the comments.
Practical Application: Hamming Coding
This is not just a fun problem in recreational mathematics. We can turn it on its head to implement Hamming coding, an early form of error correction used in digital communications and recording. This time the arrangement of coins on the chessboard represents a message, and the symbolic key is always under square 0000 in the top left corner. A single coin can then be flipped, representing an error in the message. Using our existing algorithm, we can determine which coin this is and flip it back so that the 4 bits address square 0000 again, i.e. even parity on the right for every dimension.
In order to always achieve even parity, we need a single coin for every dimension that can be flipped if necessary to make the number of coins on the right even. These parity squares are not available for data. Also the 0000 square is not available, because an error can’t be detected there. For a small number of dimensions this is ridiculously inefficient. For example, for a 2×2 board, we have only 1 data bit. However, each time we add a dimension we double the number of bits available and only have to subtract 1 for use as a parity bit. This deal gets better every time we double the block size: the efficiency is (2ⁿ − n − 1) / 2ⁿ, which works out to 69% data squares for our small chessboard (n=4) and 89% data squares for a normal sized chessboard (n=6). However, if we go too far, double or triple errors become too likely, and these can’t be corrected by this scheme.
Diagram 23: The parity squares for n=2, 4 and 6
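Locating the error is a one-liner once the parity coins are in place. The sketch below is my own illustration, assuming the parity squares are chosen so that the indices of all 1-bits in a valid block XOR to zero:

```python
from functools import reduce

def syndrome(block):
    # XOR together the indices of all 1-bits; 0 means no single-bit error.
    return reduce(lambda acc, i: acc ^ i, (i for i, b in enumerate(block) if b), 0)

# A 16-bit block whose 1-bit indices XOR to 0 (parity already satisfied)...
block = [0] * 16
for i in (1, 2, 3):        # 1 ^ 2 ^ 3 == 0
    block[i] = 1

block[5] ^= 1              # ...then a single transmission error at position 5
print(syndrome(block))     # → 5, locating the flipped bit
```

Flipping the located bit back restores a zero syndrome, i.e. the key returns to square 0000.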
3blue1brown explores Hamming coding in more detail here. The Exclusive OR operation is key to the efficient operation of this algorithm. This operation has all sorts of other cool applications. It can be used to encrypt a message by exclusive ORing the data with a random key of the same length. The receiver repeats the operation to recover the data. It can allow one drive to back up two other drives in some forms of RAID storage. It truly is the star of boolean algebra.
Conclusion
So having investigated these various approaches, we may have developed an intuition for the problem space. We can change which squares we use to encode different bit combinations. We can aggregate tails instead of heads; we can aggregate the coins on the left instead of the right. We can use the equivalence operation instead of exclusive OR. However, all of these just represent symmetries: whichever we choose makes no fundamental difference to the underlying solution. And of course all our various approaches used different mathematics to converge on the same solution.
As far as this problem is concerned, there is no difference between a binary dimension in a vector space, a binary digit, a set or indeed a level in a perfectly balanced binary tree.
For other takes on this wonderfully evocative problem, have a look at Matt Parker and Grant Sanderson’s video conversation discussing solutions in an actual dungeon (well, almost). There is an interactive version of approach B available here. And if you want to delve even deeper, have a look at the topics for Information Theory, Entropy and Combinatorics in Wikipedia.
In the late 1800s, mathematics was undergoing a radical shift. Led by David Hilbert (among others), a new group of mathematicians became more interested in abstracting ideas rather than solving problems. Initially criticized for lacking practicality, this difference in philosophy has led to a vast amount of interesting results, one of them being Category Theory.
Category Theory was initially developed in the 1940s by Samuel Eilenberg and Saunders Mac Lane as an attempt to create a general language that can be applied to any field of mathematics. Instead of focusing on exact definitions, Category Theory emphasizes the relationships between objects. This can lead to fascinating connections between a variety of seemingly unrelated concepts. The two main aspects of Category Theory are categories and functors. Let’s see what both of these are.
A category is defined as a collection of objects that have morphisms between them. This is super broad, so some examples will help!
The category Set is the collection of all sets and the morphisms are functions that go between these sets.
The category Group is the collection of all groups and the morphisms are group homomorphisms between these groups.
The category Vector(K) is the collection of vector spaces (over the field K) and the morphisms are linear transformations between these vector spaces.
Hopefully these three examples give you an idea of just how broad Category Theory is: we’ve just seen examples in three major fields of math!
I’ve left out a few details in this definition. First, every object must have an identity morphism, which just takes that object to itself. Second is the property of composition. This means that the morphisms in a category must be composable. We can easily think about this concept in the Set category via function composition. For example, if we have two functions:
Then we can compose these functions to give
Composition also holds for the homomorphisms of the Group category and the linear transformations of the Vector(K) category.
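Composition in the Set category can be made concrete with ordinary functions; here is a toy sketch of my own, checking the identity and associativity laws on a couple of simple morphisms:

```python
def compose(g, f):
    # (g ∘ f)(x) = g(f(x)) -- morphism composition in Set
    return lambda x: g(f(x))

identity = lambda x: x

f = lambda x: x + 1        # a morphism int -> int
g = lambda x: 2 * x        # another morphism int -> int

h = compose(g, f)
assert h(3) == 8                                  # g(f(3)) = 2 * 4
assert compose(f, identity)(3) == f(3)            # identity law
assert compose(compose(g, f), f)(3) == \
       compose(g, compose(f, f))(3)               # associativity
```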
The second major aspect of Category Theory is functors. These are maps between categories, so a functor must tell us how to map both the objects and the morphisms of one category to another. A functor must respect morphism composition and map the identity morphism of the input category to the identity morphism of the output category.
Hopefully you can see how powerful functors are. There are a variety of standard functors that can give insight into different aspects of math. I’ll give a basic one here, but more can be found in the suggested reading at the end.
The Forgetful Functor will “forget” some properties of the original object. When this functor goes from Group to Set, it maps every group to a set containing the elements of that group. Each group homomorphism is mapped to a function that brings elements from the domain group to elements of the image group. An example of the Forgetful Functor is shown below.
The Forgetful Functor showing a map between two objects and a morphism.
To make things even more abstract, we can look at the category of functors and examine the functors between functors. This can be done repeatedly, and this “layering” of categories is called Higher Category Theory. Working with the category Functor has led to fascinating results, such as the Yoneda Lemma. Studies such as this led mathematician Norman Steenrod to call it “abstract nonsense.”
Given its breadth, Category Theory has found its way into a wide array of fields such as theoretical physics, computational theory, and resource theory. It is still a young field, with so much more to offer.
If you want to learn more about Category Theory, here are some links to get you started.
Seven Sketches in Compositionality: An Invitation to Applied Category Theory is a fantastic, approachable textbook that gives an introduction to Category Theory through its applications. On one occasion while reading, I found myself very confused by a passage and emailed David Spivak on a whim. He responded the next day with a fantastic explanation! My only complaint is that it can get lost in definitions without always having a clear big picture. I highly recommend this book.
What is Category Theory Anyway? is a series of blog posts that gives a very clear walkthrough of the basics of Category Theory. While the previous textbook can get lost in the details, this series is the opposite! The greater picture is always clear in the writing, even if it sometimes lacks formality. I found the two works complemented each other nicely.
Categories for the Working Mathematician is commonly referred to as the “definitive” book on Category Theory. Written by one of the founders, it covers the vast array of advancements that have been made since Category Theory was created. I would use this book for reference.
Category Theory in Context is a fantastic, recent book written by one of the world’s leading mathematicians in Category Theory: Emily Riehl. I have not personally read this book, but I have heard great things about it.
Thanks for reading! Leave a comment if you have any thoughts or questions about this article.
If you haven’t read part 1, you can find it here. The general gist of it is that we look at our data and have a guess at what sort of model would fit it best, then we find the parameters for that model that are the most likely to have generated our data.
As promised, we will look at something more complicated now.
Suppose the histogram of our data looks like the following:
It doesn’t look like anything you will have encountered in high school statistics.
Well, sort of.
One could argue that it looks like two Gaussians mixed together.
To figure out what model to use, we can think about how this data was generated.
I propose the following ansatz model:
Where j is a Bernoulli variable.
Let us unwrap this equation. We have two Gaussians with means μ_1, μ_2 and variances σ²_1, σ²_2. The probability density function of these is given by p(x|j), where j can be 1 or 2. We have another random variable, j, that is a Bernoulli variable and acts as a mixing component.
To generate the data, simply sample the Bernoulli variable; this tells us which Gaussian to further sample from.
So when we are estimating our parameters, not only do we have to estimate the means and variances but we have an extra thing to calculate and that is which data point comes from which Gaussian. We wrap this up in an indicator variable called Z, defined as follows:
Where x^i is the i-th data point and G_j is the j-th Gaussian. Now we can write the probability of observing a data point is:
Z is a (latent) random variable so we can compute the expected value of it. We call this expected value the responsibility and can define it as follows:
The last line follows by Bayes’ theorem.
We are now in a position to write down the likelihood of observing our data D. Recall that the likelihood is just the product of all the probabilities:
For technical reasons I won’t go into, instead of computing the log-likelihood like we did last time, we will compute the expected log-likelihood. I will leave it as an exercise to show that the following holds:
Now it is not so easy doing this by hand so instead we are going to optimise it via coordinate wise ascent. What this means in our case is that we will optimise the mean and variance parameters, then optimise the mixture coefficients p(j). Using these new parameters, we recompute the responsibilities, and repeat.
Step 1: Optimise Gaussian parameters
Note that the only term that is dependent on the Gaussians in the expected log-likelihood is T_2. So we can differentiate the expected log-likelihood with respect to μ_k and σ_k to find
Similar to the previous article, we set these to 0 to find the optima. (We would technically also need to look at the second derivative to make sure that we are finding a maximum, but that is very finicky so it will be omitted.)
We now get our maximum likelihood estimates (MLE) for mean and variance:
Intuitively, this is just the regular estimates that we derived in the previous part but each data point is weighted with how ‘responsible’ that Gaussian is for the point in question and then normalised. This can be seen by looking at the best case scenario: if we know exactly which point comes from which Gaussian, then the responsibility is either 0 or 1 and these equations collapse to exactly the regular maximum likelihood estimates of mean and variance.
Step 2: Optimise the mixture coefficients
Now we want to maximise our mixture coefficients. Recall that these are the probability of sampling from each Gaussian.
Similar to step 1, we note that T_1 is the only term that is dependent on the mixture coefficients. We could just do exactly what we did above — that is differentiate with respect to p(k) and set it to zero.
But when is math ever that simple?
Since the mixture coefficients form a probability distribution, we have an additional condition that p(1)+p(2) = 1. So we can’t just optimise willy-nilly; we have to include this constraint in the equation. This is exactly the kind of problem that the method of Lagrange multipliers is useful for.
If you aren’t familiar with Lagrange multipliers, don’t worry, I won’t dive into the derivation. If, on the other hand, you do know the method, I leave you to check the derivation yourself.
In any case, we get the new mixture coefficients:
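For readers who do want to check it: assuming the term T_1 has the form Σᵢ Σₖ r_ik log p(k) (consistent with the expected log-likelihood above), the constrained maximisation runs roughly as follows.

```latex
\mathcal{L} = \sum_{i=1}^{N}\sum_{k=1}^{2} r_{ik}\,\log p(k)
            + \lambda\Big(1 - \sum_{k=1}^{2} p(k)\Big),
\qquad
\frac{\partial\mathcal{L}}{\partial p(k)}
  = \frac{\sum_{i=1}^{N} r_{ik}}{p(k)} - \lambda = 0
  \;\Longrightarrow\;
p(k) = \frac{1}{\lambda}\sum_{i=1}^{N} r_{ik}.
```

Summing the last expression over k, and using Σₖ p(k) = 1 together with Σₖ r_ik = 1 for each data point, gives λ = N, so p(k) = (1/N) Σᵢ r_ik, i.e. the average responsibility of Gaussian k.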
Again, this makes sense if you interpret it as the average number of points each Gaussian is responsible for.
Step 3: Recompute Responsibilities
Now that we have optimised the parameters of the model, we can recompute the responsibilities.
This can be done using the Bayesian expansion from the definition of responsibility.
OK great so we have this theoretical framework, now to apply it to our data.
We randomly initialise the means, variances, and mixture coefficients. The means and variances have to be somewhat close to the true value or else the algorithm will struggle to converge.
Then from that we can compute the responsibilities.
Using these responsibilities, we can recompute the parameters via the regime described above.
This process will converge to the true model. Below are some snapshots of the convergence process
Taken after 1, 2, 10, 20 iterations — The Python notebook can be found here
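The whole procedure fits comfortably in plain Python. The sketch below is my own minimal implementation (not the article's notebook): it draws a sample from two Gaussians, then alternates the E-step (responsibilities via Bayes' theorem) and the M-step (responsibility-weighted parameter updates) described above.

```python
import math
import random

def gaussian_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_step(data, params):
    (mu1, var1), (mu2, var2), p1 = params
    # E-step: responsibility of Gaussian 1 for each point, via Bayes
    r = []
    for x in data:
        w1 = p1 * gaussian_pdf(x, mu1, var1)
        w2 = (1 - p1) * gaussian_pdf(x, mu2, var2)
        r.append(w1 / (w1 + w2))
    # M-step: responsibility-weighted means, variances, mixing weight
    n1 = sum(r)
    n2 = len(data) - n1
    mu1 = sum(ri * x for ri, x in zip(r, data)) / n1
    mu2 = sum((1 - ri) * x for ri, x in zip(r, data)) / n2
    var1 = sum(ri * (x - mu1) ** 2 for ri, x in zip(r, data)) / n1
    var2 = sum((1 - ri) * (x - mu2) ** 2 for ri, x in zip(r, data)) / n2
    return (mu1, var1), (mu2, var2), n1 / len(data)

random.seed(0)
data = [random.gauss(0, 1) for _ in range(300)] + \
       [random.gauss(5, 1) for _ in range(300)]

params = ((-1.0, 1.0), (6.0, 1.0), 0.5)   # rough initial guesses
for _ in range(30):
    params = em_step(data, params)
```

After 30 iterations the estimated means land close to the true values 0 and 5, with the mixing coefficient near 0.5.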
This can be extended to multiple Gaussians and is not restricted to Gaussians at all.
Consider for example a shop owner who wants to model the number of customers that walk into their shop throughout the day. A sensible base model would be a Poisson distribution. However, we can expect that the mean number of visitors in the shop will vary throughout the day — for example it could be busier at lunch time and then again in the evening. So a mixture model of Poisson variables would be appropriate.
I will leave it as an exercise to derive the update equation for the parameter of the Poisson. As a few hints:
The expected log-likelihood equation still holds; you just need to swap in the new distribution. The equation for the mixture coefficients is model independent.
Fibonacci numbers are one of the most captivating things in mathematics. They hold a special place in almost every mathematician’s heart. Throughout history, people have done a lot of research around these numbers, and as a result, quite a lot of interesting facts have been discovered.
Let us see what they look like –
Any number in this sequence is the sum of the previous two numbers, and this pattern is mathematically written as
where n is a positive integer greater than 1, Fₙ is the n−th Fibonacci number with F₀=0 and F₁=1.
Now, this expression is fairly easy to understand and quite sufficient to produce any Fibonacci number by plugging in the required value of n. Its only drawback is that if we want to know a particular number Fₙ in the sequence, we need the two numbers Fₙ₋₁ and Fₙ₋₂ that came before it; that’s just how this formula works. It is not hard to imagine that if we need a number that is far ahead in the sequence, we will have to do a lot of “back” calculations, which might be tedious.
In this article, we are going to discuss another formula to obtain any Fibonacci number in the sequence, which is (arguably) easier to work with.
Let us define a function F(x), such that it can be expanded in a power series like this
Here we have omitted F₀, because F₀=0, by definition.
(Issues regarding the convergence and uniqueness of the series are beyond the scope of the article)
Our job is to find an explicit form of the function, F(x), such that the coefficients, Fₙ are the Fibonacci numbers.
In order to make use of this function, first we have to rearrange the original formula. If we make the replacement
the original formula becomes
It is easy to check that this modification still produces the same sequence of numbers, starting from n=1, instead of n=0.
Next, we multiply the last equation by xⁿ to get
and perform a summation over n to get
Let us first consider the left hand side of this equation —
Now, we try to represent this expansion in terms of F(x), by doing the following simple manipulations –
Multiply and divide by x to get
Add and subtract x ⋅ F₁ to get
Using the definition of F(x), this expression can now be written as
Therefore, using the fact that F₁=1, we can write the entire left hand side as
Let us now consider the right hand side —
If we expand these two terms, we get
We have again omitted F₀, because F₀=0.
By taking out a factor of x from the second expansion, we get
Using the definition of F(x), this can finally be written as
Therefore, by equating the left and the right hand sides, the original formula can be re-written in terms of F(x) as,
Let us now simplify this expression a bit more. The denominator is a quadratic whose roots are easily obtained to be
(For an easy graphical method of finding roots, check out this article)
Using these roots, it is possible to write the denominator as
so that we can write
We can split the denominator and write this as
Before we proceed, we need to understand a useful fact about geometric series. If we have an infinite series
with |ax| < 1, then its sum is given by
This means, if the sum of an infinite geometric series is finite, we can always have the following equality –
Using this idea, we can write the expression of F(x) as
The factor of 1/√5 is due to the substitution of α and β.
Recalling the original definition of F(x), we can finally write the following equality
and by comparing the n−th terms on both sides, we get a nice result
with
(The number α is also a very interesting object in itself. It goes by the name of golden ratio, which deserves its own separate article.)
We can see from the following table, that by plugging the values of n, we can directly find all Fibonacci numbers!
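That comparison is easy to automate. A quick sketch of my own, checking the closed form against the recurrence:

```python
import math

def fib_iter(n):
    # direct use of the recurrence F(n) = F(n-1) + F(n-2)
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

def fib_binet(n):
    # closed form: F(n) = (alpha^n - beta^n) / sqrt(5)
    sqrt5 = math.sqrt(5)
    alpha = (1 + sqrt5) / 2   # the golden ratio
    beta = (1 - sqrt5) / 2
    return round((alpha ** n - beta ** n) / sqrt5)

print(fib_binet(10))  # → 55, with no "back" calculations needed
```

One caveat of the closed form in floating point: rounding keeps it exact only up to roughly n = 70, beyond which exact (integer or symbolic) arithmetic is needed.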
This article was originally published at physicsgarage.
Electromagnetism, as compared to gravitoelectromagnetism (GEM), was formulated by its proponents through painstaking observations of the behavior of real objects, i.e. magnets and inductors, in their respective electric and magnetic fields. A similar observation of a gravitational field, while it would seemingly be easy, is on the contrary counterintuitive, especially when the supposed objects possess both particle and wave-like properties of significantly foreign aspects. As a result, researchers have mainly focused on theorizing the similarity between electromagnetism and gravity in order to study the gravitational field¹²³⁴.
Today Electrogravitostatics, a terminology to explain the relationship between all three aspects, is a gray area even though a lot of research is currently in focus. Instead of heading directly to the complex equations of electromagnetism, in this essay we will first explain gravitational potential energy as a relatively clear case that binds the three components together. We will then attempt to come up with relevant equations relating to electrogravitostatics.
From Gravity to Electricity
We will set up a number of comparative analogs between gravitational, electric, and magnetic fields. Suspended objects possess, or rather gain, energy purely due to their position relative to the ground on the Earth’s surface; this is referred to as gravitational potential energy. In practice, this is the form that we convert to electricity, for instance through the generation of hydroelectricity, and it is calculated as E = mgH, the product of mass, gravitational strength, and the object’s height, i.e. its position relative to the ground. In hydroelectric power generation, the logic is that water falling from a defined height H (say, that of a waterfall) is run through the blades of a turbine (Fig. 1), turning it into a mechanical rotating machine. From here, electromagnetics as formulated by James Clerk Maxwell takes over. Located at the center of the turbine is a rotating field magnet and a stationary armature.
Fig 1: Author’s illustration of hydro-electric generation and transmission equipment setup
When the mass is expressed as a function of density, m = ρV, and the time derivative dE/dt is taken, assuming a continuous flow of fluid, we get the power P generated, such that P = ρgHQ. In the case of hydroelectricity, Q is the volumetric flow rate of water. An extra term, η, is however introduced to correct the power so obtained, leading to the equation P = ηρgHQ, where η is the turbine’s efficiency.
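Plugging in some numbers makes the formula concrete. The values below are my own illustrative choices, not figures from the essay:

```python
def hydro_power(efficiency, density, g, head, flow):
    # P = eta * rho * g * H * Q, the corrected hydropower equation above
    return efficiency * density * g * head * flow

# Illustrative numbers: a 50 m head, 20 m^3/s of water,
# and a 90% efficient turbine.
p = hydro_power(0.9, 1000, 9.81, 50, 20)
print(p / 1e6)  # about 8.8 MW
```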
Gravity and Statics
What we explained in the previous section is nothing new. The actual meat on the bone is how, upon being raised, a massive object gains the gravitational potential that would be converted to mechanical and consequently electrical energy: gravitostatics. To understand this, you will want to answer these two questions:
How high would you have to raise a body of mass m, in order to attain the total energy E=mc² predicted by Albert Einstein?
What are the trends in gravitational potential energy density as H increases?
The short answer to the first question is, it never happens. The explanation is given by answering the second question. The maximum gravitational energy possible is such that:
Eq. (1)
Now, the real question is how does this happen?
Assuming the mass remains constant, the only explanation that can be given is that the energy density of a body reduces when it is moved away from a gravitational field. That is to say, for every meter you raise a mass m away from the Earth’s surface, its internal energy reduces purely due to position. This rate of change is given by:
Eq. (2)
Where ∇ (nabla) is the operator for the spatial derivative (with respect to H), ρ is the mass-energy density in joules per kilogram, rₛ is the Schwarzschild radius of the object, r is its radius, and c is the speed of light. Eq. (2) is what gravity really is, when the object under investigation is taken to be Earth. Therefore, the maximum value of H you are looking for (in the first question) is given by:
Eq. (3)
When Eq. (2–3) are applied to Earth as a planet, we get the properties of Earth’s gravitational field as follows:
Eq. (4)
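Eq. (3) is given only as an image, but if we read the first question naively as equating mgH with mc² (an assumption on my part), the maximum height for a constant g works out as follows:

```python
c = 299_792_458          # speed of light, m/s
g = 9.81                 # Earth's surface gravity, m/s^2

# Equating m*g*H with m*c^2 (assuming g stays constant, which it of
# course would not over such distances) gives H = c^2 / g.
H_max = c ** 2 / g
print(H_max)  # ~9.2e15 m, roughly a light-year
```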
Explaining the Change in Potential Energy Density
The scenario in the foregoing sections implies that any object moved away from a gravitational field, e.g. elevated over the Earth’s surface, loses energy, and thus it no longer satisfies E=mc² in the strictest definition. Note that E=mc² is an abbreviated version, since the full version contains both rest and relativistic terms, such that E² = m²c⁴ + p²c². The shorter version is therefore normally used for stationary bodies, since they possess no momentum, i.e. p=0.
We are now suggesting that the relativistic term (p²c²) is not the only component to be eliminated due to rest conditions; there is also a reduction in mass-energy just from moving the object away from the Earth’s surface. This energy is equivalent to the gravitational potential E = −mgH, and it is static in nature since it is recovered when the object is returned to the ground, and hence what we refer to simply as potential energy. It is as though the body gets charged and discharged when raised or lowered, respectively, over the surface of the Earth.
How this Compares with Electrostatics
You might have noticed that the above system doesn’t look like the statics you know. One hypothesis is to think of it this way: there is no difference between pulling apart two stuck magnets and lifting an object from the ground, in that the forces acting are the same but the fields are different. One is a magnetic field, the other a gravitational field. Bodies elevated over a gravity field gain mass, much as magnets suck in all the steel filings around them, seemingly becoming one large object.
An even better way to see this is by recalling a phenomenon first observed by Dr. William Wall in 1708. He noticed what is now referred to as the global atmospheric electrical circuit⁵, in which the atmosphere is positively charged while the ground surface of the Earth has a negative charge (see Fig. 2).
Fig. 2: Author’s Illustration of Global Atmospheric Electrical Circuit
Looking at it in that sense therefore allows us to hypothesize that any object elevated from the ground instantly forms a two-plate capacitor, with the first “plate” being the surface of the Earth, while the second one is the object itself. We also remember from basic electrostatics that for a two-plate capacitor, the capacitance C is estimated by multiplying the relative permittivity εᵣ with the total number of plates and the separation distance H = A/d, as in Fig. 2. Notice that since n=2 and εᵣ=1 (for vacuum and air), the capacitance thus generated is exactly equal to the height H in quantity, but with the units of Farads. For such a capacitor, the total stored energy is given by E=CV². We know what the energy is from the gravitational potential E = mgH, and the capacitance equation is as shown in the illustration. We can therefore solve for the voltage equivalent of the gravitational field energy as follows:
Eq. (5)
Going by Eq. (5), the voltage equivalent of a free falling body must be given as:
Eq. (6)
Indeed, we can then substitute the capacitance C and the voltage obtained in Eq. (6) into the conventional formula E = CV² to obtain the following:
Eq. (7)
We can also validate Eq. (7) by substituting the maximum height and the value of g obtained in Eqs. (3)–(4) to recover the maximum gravitational potential energy of Eq. (1), in the following manner:
Eq. (8)
Past studies equate mass to charge when estimating the Lorentz force in Maxwell-like equations. This is, in truth, unacceptable for two reasons. First, gravity is currently assumed to be monopolar, meaning it has no North or South pole like ordinary magnets. Secondly, when that comparison is made, the units of charge and mass are conflated, an obvious mistake that conflicts with the fundamental laws of physics and mathematics.
This method should therefore be of significance to anyone trying to relate gravitational field studies to electromagnetism. For instance, following the above methodology, one can now expressly write the charge equivalent of mass as:
Eq. (9)
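Since the equation images themselves are not reproduced in this text, the following is a minimal numeric sketch of the analogy as described in the prose, assuming the relations stated above: the capacitance is numerically equal to the height H, the stored energy follows the article's form E = CV², and the charge equivalent follows q = CV. All numeric values are illustrative, not taken from the article.

```python
import math

# Illustrative values (not from the article)
m = 2.0    # mass, kg
g = 9.81   # gravitational acceleration, m/s^2
H = 10.0   # elevation above the ground, m

# Per the analogy described above: capacitance numerically equals H
C = H                          # Farads, by the article's claim

# Solving E = C V^2 with E = m g H for the voltage equivalent
V = math.sqrt(m * g * H / C)   # reduces to sqrt(m g) when C = H

E = C * V**2                   # reproduces the gravitational potential m g H
q = C * V                      # charge equivalent of mass, per q = C V

print(E, q)
```

Note that with C = H the voltage equivalent depends only on m and g, so the recovered energy scales linearly with height, consistent with E = mgH.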
Conclusion
As a concluding remark, the above system of equations is fully commensurate with all the laws of physics and mathematics, and is dimensionally sound. It is evident also that it obeys other electrostatic relations, such as q = CV (not discussed here), which reduces back to E = CV². More importantly, mass is no longer equated directly to charge; its own charge equivalent must always be estimated through the method presented in Eq. (9).
References
¹ Heaviside, O., 1893. A gravitational and electromagnetic analogy. The Electrician, 31(18), pp.5125–5134.
² Padmanabhan, T., 2010. Thermodynamical aspects of gravity: new insights. Reports on Progress in Physics, 73(4), p.046901.
³ Bakopoulos, A., 2016. Gravitoelectromagnetism: Basic principles, novel approaches and their application to Electromagnetism. arXiv preprint arXiv:1610.08357.
⁴ López, G.V., 2018. On Maxwell Equations for Gravitational Field. Journal of Applied Mathematics and Physics, 6(04), p.932.
⁵ Harrison, R.G., 2011. Fair weather atmospheric electricity. Journal of Physics: Conference Series, 301(1), p.012001. IOP Publishing.
I recently started a job in which I, trained as a mathematician, am working with a number of colleagues who were trained as physicists. At lunch, this situation led to a discussion of the connections between the two fields, which in turn led to a discussion of Eugene Wigner’s article “The Unreasonable Effectiveness of Mathematics in the Natural Sciences.” I reluctantly admitted that although I had heard and read about the article, and thought I understood its general gist, I had never taken the time to read and digest the article for myself. To remedy this sad state of affairs I decided to carefully read it and provide a summary to increase my understanding of the article. I hope that you, the reader, find it helpful as well.
I should note that my intent here is to dissect and digest the article itself, not the responses to it (for instance, Richard Hamming’s 1980 article, “The Unreasonable Effectiveness of Mathematics”) or any questions connected with it. Wigner’s article serves as the tip of a very interesting conceptual iceberg concerned with, among other topics, the connections between mathematics and science, as well as questions of mathematical philosophy and ontology. I plan to explore Wigner’s article here and perhaps address these related issues separately in future articles.
Eugene Wigner
Eugene Wigner (1902–1995) (natively, Wigner Jenő) attended the Kaiser Wilhelm Institute in Berlin, where he was an assistant to Karl Weissenberg and Richard Becker, and then the University of Göttingen, where he served as assistant to the great David Hilbert; it was Wigner who helped introduce group theory into physics. He later worked on the Manhattan Project, after which he was hired as the Director of Research and Development at the Clinton Laboratory (now the Oak Ridge National Laboratory). After the war he worked at several governmental institutes, including the National Bureau of Standards, the National Research Council, the National Science Foundation and the General Advisory Committee of the Atomic Energy Commission. In 1963, he received the Nobel Prize in Physics for “contributions to the theory of the atomic nucleus and the elementary particles, particularly through the discovery and application of fundamental symmetry principles.”
In 1960, his article “The Unreasonable Effectiveness of Mathematics in the Natural Sciences” was published in Communications on Pure and Applied Mathematics. The article poses two related issues: the effectiveness of mathematics in physics (beyond that of a mere supporting tool), and the uniqueness of mathematical theories in physics. To the first point, the effectiveness of mathematics, he says, “The enormous usefulness of mathematics in the natural sciences is something bordering on the mysterious and there is no rational explanation for it.” To address this issue, he addresses the following questions in turn: “What is mathematics?”, “What is physics?”, “What is the role of mathematics in physical theories?”, and “Is the success of physical theories truly surprising?” To the second point, the uniqueness of physical theories, he has less to say, although his discourse on this subject is also interesting and provocative.
Let’s explore his responses to these questions in turn.
What is mathematics?
Wigner defines mathematics as “the science of skillful operations with concepts and rules invented just for this purpose.” The emphasis is on the concepts, without which, he notes, only a handful of interesting theorems could be proved from the axioms. Furthermore, these concepts are not chosen for simplicity, but rather “with a view of permitting ingenious logical operations which appeal to our aesthetic sense both as operations and also in their results of great generality and simplicity.”
An example of such a mathematical concept is the complex numbers. Complex numbers are not born of experience, but rather of mathematical beauty. The theory of complex numbers, he asserts, was not pursued for its applications, but rather for the beauty inherent in the theory. In other words, mathematicians study mathematics not as a means to an end, but as a subject in and of itself.
What is physics?
Physics, on the other hand, is concerned with the laws of nature. Such a law predicts some future event given a present event, or, put another way, is a conditional (if-then) statement. An example is Galileo’s law of falling bodies — the observation that if two rocks are dropped at the same time from the same height then they will land on the ground below at the same time. Such a law is surprising, asserts Wigner, because it is invariant (the experiment with the rocks gives the same result no matter where on Earth, or at what time, you carry it out) and because it depends only on specific variables, and not on others that conceivably might have had an effect (the objects need not be rocks, and the result does not depend on their size, shape, or other factors one might consider).
How is math used in physics? Is its success in physical theories truly surprising?
The use of mathematics in physics as a tool is clear. Since physics is formulated in terms of laws written as conditional statements, mathematics is the obvious choice to formulate these expressions. But the role of mathematics in physics goes far beyond this obvious point. According to Wigner, “…the laws of nature must have been formulated in the language of mathematics to be an object for the use of applied mathematics… The mathematical formulation of the physicist’s often crude experience leads in an uncanny number of cases to an amazingly accurate description of a large class of phenomena.” In other words, we “get more out of the equations than we put in.”
He gives three examples of this amazing accuracy: the laws of falling bodies, quantum mechanics, and quantum electrodynamics. Let’s take these examples one at a time.
Isaac Newton noticed that the path of a falling body (perhaps a thrown rock) on the Earth and the path of the moon in the sky are two particular cases of the same general notion of a conic trajectory. From this observation, he postulated the universal law of gravitation, which states that the gravitational attraction between two objects is proportional to the product of their masses and inversely proportional to the square of the distance between them. While Newton, given the restrictions of his day, could only verify the results with an accuracy of 4%, the law was later shown to be accurate to within less than a ten thousandth of one percent. This law, therefore, is a fantastic example of a mathematical formalism that has proved accurate beyond any reasonable expectations.
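As a back-of-the-envelope illustration of the inverse-square law (a sketch using textbook values, not figures from Wigner's article), one can redo Newton's classic "moon test": if gravity falls off as 1/r², the moon's acceleration toward the Earth should equal surface gravity scaled by (R_earth/r_moon)².

```python
import math

g = 9.81                  # surface gravity, m/s^2
R_earth = 6.371e6         # Earth's mean radius, m
r_moon = 3.844e8          # mean Earth-moon distance, m (~60 Earth radii)
T_moon = 27.32 * 86400    # sidereal month, s

# Inverse-square prediction for the moon's acceleration toward Earth
predicted = g * (R_earth / r_moon) ** 2

# Observed centripetal acceleration of the moon: a = 4 pi^2 r / T^2
observed = 4 * math.pi**2 * r_moon / T_moon**2

print(f"predicted {predicted:.2e} m/s^2, observed {observed:.2e} m/s^2")
```

With these rough mean values the two accelerations agree to within about a percent; Newton's own check, limited by the data of his day, was cruder, as noted above.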
Quantum mechanics gives an even more astounding example. Matrix algebra had been studied independently of any applications by pure mathematicians for some time when Max Born realized that Werner Heisenberg’s rules of computation were formally identical with the rules of computation with matrices. Born, Pascual Jordan, and Heisenberg then replaced the position and momentum variables of the classical equations of motion with these matrices and applied the result to an idealized problem. The new formulation worked, but would it work in a realistic setting, not just a toy problem? Within months, Wolfgang Pauli applied the new formulation to a realistic problem (the hydrogen atom) and the results matched up. Since Heisenberg’s original calculations were abstracted from problems that included the old theory of the hydrogen atom to begin with, this result was not too surprising. However, the “miracle” occurred next, when matrix mechanics was applied to problems for which Heisenberg’s rules no longer applied: the equations of motion of atoms with greater numbers of electrons. These calculations were shown to agree with experimental data to within one part in ten million! Once again, mathematics developed independently of physics was applied to physics to give spectacularly accurate results, far beyond the expectations of the original theory.
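The "formally identical with matrices" point can be made concrete. In a truncated harmonic-oscillator basis (an illustrative construction, not Born's or Heisenberg's own calculation), position and momentum become matrices whose commutator reproduces the canonical relation [x, p] = iħ (here with ħ = 1) everywhere except at the truncation edge:

```python
import numpy as np

n = 6  # size of the truncated basis

# Ladder (annihilation) operator in the harmonic-oscillator basis, hbar = 1
a = np.diag(np.sqrt(np.arange(1, n)), k=1)

x = (a + a.T) / np.sqrt(2)        # position as a matrix
p = 1j * (a.T - a) / np.sqrt(2)   # momentum as a matrix

comm = x @ p - p @ x              # the commutator [x, p]

# Away from the truncation edge, [x, p] equals i times the identity
print(np.allclose(comm[:n-1, :n-1], 1j * np.eye(n - 1)))  # True
```

The last diagonal entry deviates only because the basis is cut off at n states; in the full infinite-dimensional algebra the relation holds exactly, which is the structure Born recognized in Heisenberg's rules.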
The third example is the theory of the “Lamb shift” in quantum electrodynamics. This theory was developed purely in terms of mathematics, devoid of any experimental data, and only tested after the fact. The agreement with experimental data is better than one part in a thousand.
Uniqueness of physical theories
For the second big issue, the uniqueness of physical theories, Wigner states, “We cannot know whether a theory formulated in terms of mathematical concepts is uniquely appropriate.” [emphasis mine]. By this, he means that just because we develop a theory that describes some physical phenomena, we do not know if there might be another theory, based on, for instance, different choices of variables or different mathematical machinery, that might produce the same results.
For example, consider the incompatibility of the theory of quantum phenomena and the theory of relativity, two models that have their roots in mutually exclusive groups of phenomena. Relativity theory applies to macroscopic bodies (for instance, stars) and is formulated in terms of four-dimensional Riemannian space, whereas quantum theory applies to the microscopic world and is modeled using infinite-dimensional Hilbert space. The point of contention is the so-called “event of coincidence”, which refers to the collision of infinitely small particles. The result from relativity theory is primitive and defines a point in space-time, whereas quantum theory yields a non-primitive solution that is not isolated in space-time. We therefore know of an example for which these two theories are directly contradictory. Perhaps they are both approximations of a larger theory that will explain both cases. As of this writing, we do not know.
References
Hamming, R. W. (1980). “The Unreasonable Effectiveness of Mathematics”. The American Mathematical Monthly. 87 (2): 81–90.
Wigner, E. P. (1960). “The unreasonable effectiveness of mathematics in the natural sciences. Richard Courant lecture in mathematical sciences delivered at New York University, May 11, 1959”. Communications on Pure and Applied Mathematics. 13 (1): 1–14.