Robert Cunningham

Notes on the Best Lectures at MIT

This semester I took 18.404 (“Theory of Computation”). It’s widely known as being among the best classes at MIT.

18.404 vis-a-vis 1100 other classes. "Adjusted ratings" means that ratings are adjusted to give more weight to large classes with high ratings; this reflects the intuition that 30 ratings averaging 6.9 gives more signal than one rating averaging 7.0.

Prof. Sipser teaches 18.404 in the standard format, with eighty-minute lectures on Tuesdays and Thursdays. Perhaps it’s because the format is standard that the rating is impressive. It’s like winning a Michelin star for boiling an egg. What makes a large lecture amazing?

If there’s one thing that distinguishes Sipser’s teaching style, it’s this: the main thrust of the lectures is clear, and he does not get bogged down by anything. Because of this, he manages to keep the interest and attention of all 200 students, instead of only the first two rows.

For example, instead of encouraging esoteric and nitpicky questions, he frequently prompts by asking whether “anyone has a question that will help others to understand” or whether “anyone who doesn’t quite understand can ask a clarifying question.” One time someone asked about whether [some quantity] should be [that quantity]+1. He said something along the lines of “in this case, I happen to know that there’s no +1, but really, what’s a +1 between friends?”

Sometimes he’ll prove a result in the $\implies$ direction but refer to the book for the $\impliedby$ direction. Sometimes he proves the one-dimensional case, which captures the intuition, and refers us to the two-dimensional generalization in the book. Sometimes he’ll cite a result and say “this isn’t too important; you can go back later and prove it to yourself if you need to, but it’s not critical.” [1]

He frequently restates important earlier results from the lecture, because he says that students sometimes space out for a few minutes. Every once in a while, he stops and says “so far we haven’t done anything crazy; if you’ve fallen off here, we can get you back on the train pretty easily. But in 10 minutes we’ll be too far gone.”

So the lectures are tightly focused on the big picture, and Sipser aggressively manages questions, proof scope, and attention to proof details to achieve this focus.

In addition to the focus on the big picture, the delivery of the main ideas is meticulously planned. It’s as if he takes all of the attention that usually goes into the detail of proofs and redirects it into the details of formulating the main ideas clearly. When I asked him about it, he explicitly mentioned the following:

  • He tries to formulate theorem statements without negation, since negated statements can be hard to wrap your head around.
  • He tries to lay out the structure of analogies on the board in a way that communicates their meaning.
  • He wants [the algorithm], which is the main point of the lecture, to be on the central board so it draws attention. Expecting it to take two blackboard panes, he leaves those panes blank during the first part of the lecture.

Some of this, I’m sure, comes from practice. He’s been teaching 18.404 since at least 2001. But he also mentions that teaching well is a huge time commitment; apparently he spent roughly fifteen hours preparing for each lecture over the pandemic. At eighty minutes per lecture, that’s eleven minutes of preparation for every minute spent lecturing.

So why are the lectures in 18.404 rated so well? I think it’s a combination of an unwavering focus on the big picture, and close attention to detail in how to most effectively communicate the big picture. This combination allows Prof. Sipser to teach effectively to most of the audience in a huge class, who then turn around and give 18.404 stellar ratings.


[1] A common failure mode for lecturers is to introduce a theorem that only 2% of the class understands and then spend three lectures proving it in detail. In short, 18.404 is the exact opposite of this.

Hey! The purpose of this series is to teach you how to design, manufacture, and program a circuit board. Afterwards, you’ll be well positioned to work on a simple circuit board that drives something like bluetooth headphones, a quadcopter, or an electric skateboard.

My first circuit board <3 (and the one we're making in this tutorial!)

The focus throughout the series is on teaching a small set of skills that can get you surprisingly far: the equivalent of pip install and for-loops for electrical engineering.

I’m still quite new to this, but making a circuit board is way easier than I thought before I started. I’ve written this tutorial with the goal of making it widely accessible. Notably,

  • You don’t need to understand advanced stuff about circuits, like inductance. It might be useful if you remember some high school physics, like Ohm’s law, but you can pick this up as you need it. The most advanced individual circuits we’ll see look a lot like this.
  • You don’t need to know how to solder. I don’t. You can get your boards manufactured overseas with all the components perfectly inserted and soldered by a machine.
  • It doesn’t cost a lot of money. I spent less than $100 total on this project. That bought me five fully assembled boards and all the equipment necessary to program them. If you grew up with Arduino and SparkFun, you might be surprised (like I was) that an MPU6050 inertial measurement unit ($25 on a breakout board from SparkFun) actually costs $2 as a bare component (and less in bulk).

Concretely, we’ll be designing a circuit board that controls a robot. It’ll read data from a gyroscope and drive four servo motors. We’ll control it with a few buttons, power it with four AA batteries, and program it in Rust.

Let’s dive in!

Creating a Schematic

Our first order of business will be creating a schematic of our circuit board. A schematic is an abstract diagram that indicates which components we want to use, and how they should be electrically connected to one another.

Inertial Measurement Unit

The first component we’ll add to our schematic will be an inertial measurement unit (IMU). IMUs contain several sensors like gyroscopes and accelerometers integrated into a single chip, and tell you how fast you’re turning, where you’re pointing, and whether you’re accelerating.

We’ll start by choosing a specific IMU. Luckily, there’s a great third party comparison of the gyroscopes on many popular IMUs. You can go through this list and choose any that strikes your preferred price-performance-functionality tradeoff; we’ll go with the MPU 6050, which is a classic. [1]

Open up your schematic editor of choice (this tutorial uses EasyEDA [2]). Create a new schematic, then go to Place > Symbol, and search for the MPU 6050. Once you place it, you’ll see an abstract representation of the MPU 6050 with all its pins. To figure out what these pins do, you’ll need to take a look at the part’s datasheet (Google is your friend). These tend to be long and verbose; the MPU6050 datasheet is more than 50 pages long. A good place to start is the pinout table. This table describes the purpose of each of the pins on the bottom of the MPU6050. Some of these descriptions are self-explanatory; pin 18 is labelled “GND” and is supposed to be connected to ground. We can start by connecting that one in our schematic editor.

We can also put the X symbol on all eight NC pins and both RESV pins, which the pinout chart labels as “Not internally connected.” and “Reserved. Do not connect.” respectively. Although the X symbol doesn’t do anything, it’s a good reminder that we don’t need to connect anything there.

We can X out all the unused pins.
Soon, though, we’re going to start running into less obvious pins. For example, it’s totally unclear how the REGOUT pin, which the datasheet labels as “Regulator filter capacitor connection,” is supposed to connect.

That’s where the Typical Operating Circuit comes in handy. This is a different diagram in the datasheet where the manufacturer basically lets you know how reasonable people set up their ICs. There might be three pages of information on which capacitor values are acceptable, but the typical operating circuit should give you reasonable defaults. Here, we can see that pin 10 (REGOUT) is supposed to be connected to a 0.1uF capacitor and then to GND. Let’s add that to the schematic (recall that 0.1uF = 100nF).

You’ll notice that there are lots of 100nF capacitors in the parts library. The ones that we want are in the “Basic Part” class (these are much cheaper than the “Extended Part” versions in low volumes). The four basic 100nF capacitors differ in the 4-number code that comes afterward; those are sizes. We chose the 0603 size, which is 0.06 inches by 0.03 inches. Then we connect it to REGOUT on one side and ground on the other.

You might wonder what the purpose of this capacitor is. It’s called a bypass capacitor, which you can think of like a little backup battery kept close to the chip. When the electrical inputs to the chip are noisy, the bypass capacitors effectively absorb and supply power to smooth over those irregularities.

You can copy capacitors from the typical application circuit for VLOGIC, CPOUT, and VDD. What about the rest of the pins?

CLKIN and FSYNC say “optional, connect to ground if unused.” We’re not trying to do anything fancy, so we ground them.

SDA and SCL implement a communication protocol called I2C. We’ll use these two pins to send questions and instructions from our microcontroller to our MPU6050.

How do we send bits over these wires? We start by connecting SDA and SCL to +3.3v on the schematic. To send a logical zero, the microcontroller connects its side of the wire to ground internally, which forces the line to zero volts. To send a logical one, the microcontroller does nothing with the pin, and the +3.3v connection tugs the line high automatically.

There’s one catch. If you just connect +3.3v directly to ground inside your microcontroller, you have a short circuit, and your chip will get hot and fry itself. To avoid this problem, we put a “pull-up” resistor between the power supply and the microcontroller. This sharply limits how much current an internal ground can pull from the 3.3v supply.

This all sounds complicated, but it’s not so bad once you do it once. Here’s how it’ll look when you’re done. Notice the 4.7K pull-up resistors between the power supply (V3.3) and the pins. The SDA and SCL labels will eventually connect to SDA and SCL pins on the microcontroller. The V3.3 label indicates a connection to +3.3 volts, supplied from somewhere else on the board. AUX_DA and AUX_CL are another I2C pair for “connecting external sensors,” which we don’t have. We shouldn’t leave them unconnected, since I2C lines need to be kept at 3.3v by default, so we connect them to our 3.3v power through another pair of 4.7k pull-up resistors.
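As a sanity check, we can compute how much current one of those pull-ups actually allows. The 3.3v supply and 4.7k value come from the schematic above; treating the chip’s internal ground as ideal is a simplifying assumption.

```typescript
// Back-of-the-envelope check on one 4.7k I2C pull-up resistor.
const vdd = 3.3;      // supply voltage, volts
const rPullup = 4700; // pull-up resistance, ohms

// When the chip pulls the line low, Ohm's law limits the current to I = V / R.
const currentMa = (vdd / rPullup) * 1000; // ≈ 0.70 mA

// Power dissipated in the resistor while the line is held low: P = V^2 / R.
const powerMw = ((vdd * vdd) / rPullup) * 1000; // ≈ 2.3 mW
```

Less than a milliamp, rather than the destructive short-circuit current a direct connection would allow.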

AD0 controls the “I2C slave address LSB.” If you search the datasheet for AD0, you’ll find this diagram, which says that the MPU6050’s I2C address is 1101000 when AD0 is low, and 1101001 when AD0 is high. Either is fine for us; we’ll choose to set it low by connecting it to ground. INT is an interrupt pin; the gyroscope can activate it to quickly signal something like “this chip is in free-fall.” We won’t use it.

EP is on the schematic, but not on the datasheet pinout. Stack Exchange says that it refers to the big pad on the bottom of the chip, which should sometimes be soldered to ground as a heatsink. However, looking through the datasheet, we see: We therefore choose not to connect EP.

When all of that is said and done, your schematic should look something like this. Congratulations! You’ve set up your first chip.

Power (3.3v LDO)

We said before that the 3.3V label indicates a connection to +3.3 volts, generated somewhere else on the board. Let’s tackle that now.

The MPU6050 needs 3.3 volts to run, but we’re planning to run our controller on a set of four AA batteries, which together will output 4 * 1.5 volts = 6 volts.

To decrease the voltage from 6v to 3.3v, we’ll use an LDO (low-dropout regulator). LDOs are simple, cheap, and inefficient: they take a range of input voltages and output a constant voltage. If we input 6 volts, our LDO should output 3.3 volts and burn off the other 2.7 volts as heat.
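To put a number on “inefficient,” here is a rough sketch; the 100 mA load current is a made-up figure for illustration, not something from a datasheet.

```typescript
// How much power the LDO burns: everything above 3.3 V becomes heat.
const vIn = 6.0;      // four AA cells in series
const vOut = 3.3;     // regulated output
const loadAmps = 0.1; // hypothetical 100 mA load

const heatWatts = (vIn - vOut) * loadAmps; // ≈ 0.27 W dissipated in the LDO
const efficiency = vOut / vIn;             // 0.55: only ~55% of battery power reaches the load
```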

We’ll try to choose a popular and well-documented LDO, so go to this website and search for “3.3v LDO,” then sort down by stock. You’ll see at the top there are a bunch of 1117-series parts (notice how they all have 1117 in their part number). We’ll grab one of these 1117 chips. Digging through its datasheet, we find this wonderful typical operating circuit in Chinese. We copy that diagram into our schematic as follows.

Here’s our final LDO setup.

Battery Connector

Next, we’ll connect the input side of the LDO to our battery with a battery connector. Our generic battery pack comes with a 2.54-mm spaced connector. Using the same parts website, I eventually found a matching header. It’s pretty simple to connect in the schematic. Make sure you put the VBAT connection on the + side of your battery, and the GND connection on the - side.

Microcontroller: Selection

Next we’ll select a microcontroller, which is the star of the show.

We’ll use a chip from the STM32 series, which is perhaps the most popular modern line of microcontrollers. More specifically, we’ll choose the STM32F3 series, which is based on a Cortex-M4 core (mid-range) and has floating point support. It also has a good Rust HAL for later!

Reading the STM32F3xx documentation page, it seems like the F302 is a standard choice without bells and whistles like high resolution timers.

Finally, the STM32F302 can be configured with varying amounts of flash memory, RAM, and pins. We’ll ultimately go with the STM32F302C8T6, which has 64k of flash, 16k of RAM, and comes in a 48-pin package. Here she is: Phew! That’s a lot of pins to connect.

Microcontroller: Power

Let’s start with power. Looking at the STM32F3xx datasheet section on power supply, we see the following diagram. Fortunately, everything on the right is inside the microcontroller; we only have to worry about the left side. According to the top of the diagram, VBAT should be connected directly to the battery.

According to the middle of the diagram, we should connect each of the VDD pins to 3.3v, and each of the VSS pins to ground. We’re also supposed to add four 100nF and one 4.7uF bypass capacitors.

According to the bottom of the diagram, we should also attach VDDA and VSSA to power and ground respectively, with one 10nF and one 1uF capacitor. (If we wanted particularly accurate analog readings, we could take further care to isolate these pins from other electrical noise on the board.) The schematic can be deceiving; although these look like four separate circuits, the V3.3s and GNDs all link together. This is why it doesn’t matter that the 4.7uF capacitor is only graphically attached to the bottom pins; it’s electrically connected to all of them. [3]

Microcontroller: Clocks

The STM32F302 has high speed and low speed internal clocks. They’re not particularly accurate. If we need higher timing accuracy, we can choose to add external crystal oscillators to the OSC_IN, OSC_OUT, OSC32_IN, and OSC32_OUT pins. Since we don’t need particularly high timing accuracy, we’ll save the cost and leave these as generic I/O pins.

Microcontroller: Reset

We need a way to restart the microcontroller; that’s handled by the NRST pin. According to the STM32F302 datasheet, NRST is active low. This means that the pin should usually be high (at 3.3v). When we push the reset button, it should temporarily drop to ground to restart the board. We’ll do that with the following circuit. SW1 is a generic pushbutton; when you push, it connects the 1/2 side to the 3/4 side. By default, NRST is held to 3.3v, since it’s connected to V3.3 through that 47k resistor. When SW1 is held, NRST is also connected to ground through a 10k resistor. Since the ground resistor is smaller, the voltage on NRST jumps closer to ground. This gadget is called a voltage divider. [4]
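Plugging in the schematic’s resistor values, the divider output is easy to check:

```typescript
// NRST voltage while SW1 is held: a 47k/10k voltage divider from V3.3 to GND.
const v33 = 3.3;
const rTop = 47000;    // ohms, between V3.3 and NRST
const rBottom = 10000; // ohms, between NRST and GND (through the button)

// Standard divider formula: vOut = vIn * rBottom / (rTop + rBottom).
const vNrst = (v33 * rBottom) / (rTop + rBottom); // ≈ 0.58 V, comfortably logic low
```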

Microcontroller: Boot

When the microcontroller starts, where should it execute code from? Most often, it should execute from main memory (where our programs are written). But sometimes we might want to boot from a stable system bootloader instead.

This choice is controlled by the Boot0 pin. Boot0 should usually be low, corresponding to main flash memory; when we hold a button, we’ll set Boot0 high to boot from system memory instead. We won’t worry about the third chart entry, embedded SRAM, for now.

To achieve this, we’ll use the same gadget we used for NRST, but with V3.3 and GND reversed. If we want to boot from system memory, we’ll hold down the SW2 button during boot up.

Microcontroller: Debug

Now we need a way to program our microcontroller. We’ll do that using a debug tool that speaks the SWD (serial wire debug) protocol. The connector for that tool is a 2x5 male header with 1.27mm spacing, laid out like this:

None of our microcontroller pins are labelled “SWDIO”, but we can reference this handy table to see which pin handles that function. In our case, it looks like the PA13 pin handles SWDIO. We also need to attach SWDCLK, SWO, and NRST. First we’ll put little tags on the relevant pins: Then we’ll grab a 2x5 pin header with 1.27mm spacing and attach those tags to it, along with power and ground.

We now have the basics of the microcontroller fully configured. All that’s left is to attach inputs and outputs for the IMU and servos.

Microcontroller: IMU I2C

Recall that our IMU speaks to our microcontroller through the I2C protocol, using an SDA and an SCL pin. Where can we find SDA and SCL on our microcontroller?

Scrolling through the pinout table, we notice that PA10 has an alternate function as “I2C2_SDA”, and PA9 is similarly listed for “I2C2_SCL”. Since we don’t need PA9 and PA10 for anything else, we’ll use the second I2C interface, “I2C2”. (If we had needed PA9 for a different alternate function, we could have used I2C1 or I2C3 instead.) We connect PA9 and PA10 to the IMU with the same SDA and SCL tags we used earlier.

Servos

Finally, for the servos. We can control servos by sending them pulses of varying lengths, using a system called pulse width modulation (PWM). We’ll discuss this more in part 3, when we program the microcontroller. For now, we need to take care to attach the servos to IO pins with the hardware needed to generate PWM signals.

Searching for “PWM” in the datasheet, we find the following. And looking at TIM2, we’re in luck! It can drive four independent PWM channels for our four servos. Checking the pinout table, we see that TIM2_CH1 corresponds to PA0, TIM2_CH2 to PA1, and so forth. We connect some labels as usual.

Our servos have a 3x1 female connector with 2.54mm spaced pins. They accept power, ground, and signal on those three lines, with power in the middle. We grab four corresponding male headers from the parts library and connect GND, VBAT, and each signal line. That’s all; the finished schematic should look like this.

A word of congratulations: designing schematics is probably the hardest part of this project, especially for someone with a programming background. And designing your first schematic is much harder than your second: by the time you design your second schematic, you’ll already understand LDOs, SWD, timers, pull-up resistors, etc. Now you’re over the hump!

Unfortunately, we’re not ready to send this off to a fabrication facility yet. Schematics represent electrical connections, but they don’t say anything about how the PCB should be laid out in physical space. In part two, we’ll cover routing, which is the process of flattening that electrical graph onto a series of copper traces on a two dimensional piece of fiberglass. Then we’ll submit it for manufacturing. In part three, we’ll program our new PCB in Rust!


Notes

[1] Instead of buying an MPU 6050 on a breakout board from Adafruit, of course, we’re going to buy just the little black chip directly from LCSC.

[2] You’ll need a schematic editor (a piece of EDA, or electronic design automation, software). Although it pains me, for beginners I recommend EasyEDA over alternatives like KiCad and anything from AutoCAD. Hardcore professionals seem to use either KiCad or an AutoCAD product for schematic design. Unfortunately, if you do that, “please place a 10K ohm resistor here” is a three-step process:

  • Place a resistor symbol onto your schematic.
  • Specify the resistor’s footprint. The footprint answers questions like “how wide is this resistor?” and “where should the conductive pads be placed?” and “how wide are the conductive pads?”
  • Choose which exact 10K resistor you want (e.g. a CR0402FF2103G), after making sure your manufacturer has enough of them in stock and at a reasonable price.

The alternative is EasyEDA, which is owned by LCSC, a very large Chinese electronics manufacturer. It’s pretty mediocre software overall, but placing a resistor is a single step. You choose the exact model immediately (CR0402FF2103G) with access to LCSC’s current stock and pricing information, and it automatically knows the footprint. As you’d expect, getting the whole thing manufactured by LCSC is extremely easy.

I’m no KiCAD expert; if there’s some way to get KiCAD to integrate more easily with manufacturers, or if I’m missing something else, please let me know!

[3] However, it’s good to eventually have the bypass capacitors physically close to the power pins, and this layout helps me remember that when I do the routing.

[4] This gadget is sometimes called a voltage divider. We can quickly calculate the low voltage with Ohm’s law (given that the currents through both resistors are equal): 10 / (47 + 10) * 3.3v = 0.58v.

Here’s a surprising result. Any program that transforms source code must have an input that behaves the same before and after being transformed. In other words, for this input, called a fixed point, the input and output programs behave exactly the same way.

Sometimes fixed points are easy to find. The code console.log("hello") is an obvious fixed point of a transformation that swaps the characters ‘a’ and ‘z’:

const swapAandZ = (input) => input.replace(/[az]/g, c => c === 'a' ? 'z' : 'a')

const input: string = 'console.log("hello")'

eval(input) === eval(swapAandZ(input))
// our === implies the same return value, console outputs, error text, etc.

Some are slightly harder; can you think of a fixed point of “append the source to itself”?

It turns out that while (true) {...} and the empty program are both fixed points.

const double = (source: string) => source + source

const input = "while (true) {console.log('loop');} "

eval(input) // => loop loop loop loop loop ...
eval(double(input)) // => loop loop loop loop loop ...

But some examples are extremely non-obvious. Can you think of a fixed point of “hash the source and execute that hash”? The program would need to behave exactly the same hashed and unhashed.

import { createHash } from 'node:crypto'

const hash = (source: string) => createHash('sha256').update(source).digest('hex');

const input = "some crazy program"

// eval(input) === eval(hash(input))

But we’ll prove that in general, any transformation, no matter how crazy, must have a fixed point.

// for any transformation
declare const transform: (source: string) => string;

// there exists an input
declare const input: string;

// where executing the program is the same as executing its transformation
eval(input) === eval(transform(input))

The technique we’ll use to prove this remarkable truth is diagonalization. To warm up to the technique, we’ll start with two simple examples: the proof that there are more reals than natural numbers, and a proof that a seemingly powerful class of functions, the primitive recursive functions, can’t be complete.

We’ll use that newfound expertise with diagonalization to show the above result, that all code transformation programs have a fixed point. This result is called Rogers’s Theorem.

Rogers’s theorem will then help us understand a terse and opaque piece of math called the Diagonal Lemma. We’ll close by using the Diagonal Lemma for a pair of excellent proofs of Gödel’s Incompleteness Theorem (all theories must be either incomplete or inconsistent) and Tarski’s Undefinability of Truth (theories cannot define an internal notion of truth).

The Diagonalization Trilemma

Diagonalization. Let’s say you have an “evaluation” function $f: X \times X \rightarrow Y$. You can think of $f(3)(5)$ as selecting the 3rd object from a long list of objects, and evaluating it on 5. [0]

Let’s say you also have a $\delta: Y \rightarrow Y$ with no fixed point. You can think of $\delta$ as a “give me something different” function: it changes every $y$ into a different $y$.

Then there exists a function $g: X \rightarrow Y$ that isn’t equal to $f(k)$ for any $k$. In other words, there’s at least one new object $g$ that isn’t equal to $f(3)$ or $f(55)$ or $f(574)$, or any other object $f(x)$ in the original list of objects. (Note that $f(55)$ and $g$ are both functions representing objects.)

Proof. Take $g(x) = \delta(f(x)(x))$. For any $k$, $g \neq f(k)$, since $g(k) \neq f(k)(k)$. $\blacksquare$

Spelling out the proof: if you’re concerned that maybe $g = f(234)$, you can check that $g$ and $f(234)$ differ at index 234: $g(234) \neq f(234)(234)$. That’s because $g(234)$ equals $\delta(f(234)(234))$, which is different from $f(234)(234)$ because $\delta$ is the “give me something different” function.
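The construction is mechanical enough to transcribe directly. Here’s a sketch in TypeScript, with a toy $f$ and $\delta$ chosen purely for illustration:

```typescript
// g(x) = delta(f(x)(x)): read down the diagonal and change every entry.
const makeG = <X, Y>(f: (i: X) => (j: X) => Y, delta: (y: Y) => Y) =>
  (x: X): Y => delta(f(x)(x));

// Toy example: f(i)(j) = 10 * i + j, and delta = "add one" (no fixed point).
const f = (i: number) => (j: number) => 10 * i + j;
const delta = (y: number) => y + 1;
const g = makeG(f, delta);

// g disagrees with f(k) at index k, for every k.
g(3);    // => 34
f(3)(3); // => 33
```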

The most important part of this proof is that it establishes a crucial trilemma, which we’ll use for the rest of this post.

Trilemma. For all sets $X$ and $Y$ with a function $f: X \times X \rightarrow Y$ and a function $\delta: Y \rightarrow Y$, at least one of the following three things must be true:

  1. $\delta: Y \rightarrow Y$ has a fixed point.
  2. There exists some $g: X \rightarrow Y$ which isn’t found in $f(n)$ for $n \in \mathbb{N}$, but $g$ is obviously the right kind of object, so the list of objects $f(n)$ must not be exhaustive.
  3. There exists some $g: X \rightarrow Y$ which isn’t found in $f(n)$ for $n \in \mathbb{N}$, but $f(n)$ obviously lists all the possible objects, so $g$ must actually be a new and different kind of object.

Example: Primitive Recursion

The first objects we’ll look at are the primitive recursive functions. Primitive recursive functions are all the functions made of combinations of five functional building blocks: $z$, $s$, $p$, $c$, and $\rho$. Don’t worry if you haven’t seen them before. The upshot is that you can easily list them, numbering them sequentially:

1 => z() // zero function
2 => s() // +1 function
3 => p()
...
372 => c(s, z) // composition of 0 and +1 (the constant 1)
373 => c(s, s) // composition of +1 and +1 (+2)
...

We’ll choose $f(i)(j)$ as the evaluation function, which takes the $i$th primitive recursive function and computes it on $j$. For example, evaluating the 373rd function at the value 3:

f(373, 3) => c(s, s)(3) => s(s(3)) => 3 + 1 + 1 => 5
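To make that arithmetic checkable, here are toy versions of two of the building blocks, simplified to unary functions (the real definitions handle arbitrary arities):

```typescript
// s: the successor (+1) function; c: composition of unary functions.
const s = (x: number) => x + 1;
const c = (g: (x: number) => number, h: (x: number) => number) =>
  (x: number) => g(h(x));

// The 373rd function, c(s, s), evaluated at 3: s(s(3)) = 5.
c(s, s)(3); // => 5
```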

We can easily build a function $\delta(x) = x + 1$ that has no fixed point.

So we consult our trilemma, and decide that either (1) $\delta$ is broken, (2) the list $f(i)$ isn’t complete, or (3) $g$ is a new kind of object.

$\delta$ has no fixed point, and $f(n)$ obviously enumerates all the primitive recursive functions (by definition). So it must be that $g(i) = \delta(f(i)(i))$ is a new kind of object: a non-primitive-recursive (but computable) function.

This might surprise you if you’ve only played with primitive recursive functions for a little bit: they can compute functions like plus, times, isPrime, nthPrime, ifelse, and more. But it turns out that they aren’t Turing complete, since they can’t simulate the eval function.

Example: Reals and Naturals

You might recognize the name diagonalization from a proof you saw in math class. Georg Cantor famously used diagonalization to prove that there are more real numbers than natural numbers. In our trilemma framing, the argument goes like this.

Start with a list of real numbers.

1 => 0.139042986720983745...
2 => 0.349857294587293845...
3 => 0.239841509204872018...

Our evaluation function $f(i)(j)$ takes the $i$th real number and then gets its $j$th decimal digit:

f(3)(3) => 0.23[9]841509204872018... => 9

We can build a function delta = (x) => x == 1 ? 2 : 1 which maps everything (except 1) to 1, and then sends 1 to 2. It definitely has no fixed point.

We find that we’ve eliminated two options from our trilemma. $\delta$ has no fixed point. Our new $g$ is definitely a real number, since we can get its decimal digit at any location: the first three are $0.211...$ So then it must be (2) to blame: our original list of real numbers couldn’t have been exhaustive.

And so we see that there are more real numbers than any countable list can contain.
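The whole argument runs fine as code on a toy three-real “list.” The digit strings below are the ones from the table above, truncated:

```typescript
// The listed reals, as strings of decimal digits.
const reals = [
  "139042986720983745",
  "349857294587293845",
  "239841509204872018",
];

// f(i)(j): the j-th decimal digit of the i-th real (both 1-indexed).
const digitOf = (i: number) => (j: number) => Number(reals[i - 1][j - 1]);

// delta: map every digit to a different one (1 -> 2, everything else -> 1).
const delta = (d: number) => (d === 1 ? 2 : 1);

// The diagonal real g: its j-th digit differs from real j's j-th digit.
const g = (j: number) => delta(digitOf(j)(j));

[g(1), g(2), g(3)]; // => [2, 1, 1], i.e. g = 0.211..., off the list by construction
```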

Example: Church-Turing & Rogers’ Theorem

This time we’re going to list all the JavaScript programs, ordered alphabetically:

1 => eval('a')
2 => eval('b')
3 => eval('c')
...
10348423 => eval('x => x + x')

Predictably, our evaluation function $f(i)(j)$ takes the $i$th program in the list and runs it on the source of the $j$th program.

f(10348423)(1) => eval('x => x + x')('a') => 'aa'

We’re going to use delta = (x) => x + 1. Recall that $\delta$ operates on the program’s output, so for example

delta(f(10348423)(1)) // => 'aa' + 1 => 'aa1'

But something seems wrong here. Our argument above says that either:

  1. $\delta$ has a fixed point. It doesn’t, right?
  2. Our list of JavaScript programs isn’t exhaustive. But surely we’ll eventually print out even the longest of JavaScript programs by iterating alphabetically?
  3. Our newly manufactured JavaScript program, $g$, has no string representation. But surely every JavaScript program is representable by a string?

Do you see the issue? It turns out that our $\delta$ does have a fixed point. In particular, eval("while (true) {}") is a fixed point of $\delta$. If your program loops forever, and you add one to the result, it still loops forever.

Not only that, but all viable $\delta$s share this flaw. To build the “change that output” program $\delta$, you need to know whether your original program halts or not. And in general, you can’t know whether it halts or loops, so you can’t build a good $\delta$. [1]

This constitutes a somewhat shocking proof that it’s impossible to build a function $\delta$ that takes any program and modifies it to produce a different result. In other words, all programs mapping programs to programs have a fixed point. This result is called Rogers’s Theorem.

There’s a nice immediate corollary: any language lets you either solve the halting problem or build an interpreter for itself. JavaScript allows interpreters but not halting-solvers. Primitive recursive functions allow halting-solvers (all P.R. functions halt), but can’t build their own interpreters. These correspond to branches (1) and (3) of the trilemma respectively.

New Tool: Diagonal Lemma

We’ve just seen that all programs that map programs to other programs have a fixed point. Next, we’re going to see a very similar result, but this time applied to mathematical theorems. It’s wrapped up in an opaque piece of math called the Diagonal Lemma. But once we understand the Diagonal Lemma, it opens the door to amazingly short demonstrations of Gödel’s Theorem and Tarski’s Theorem.

Let’s say you’re working in some proof system, which is basically some laws (“axioms”) plus some deductive rules. Call that system TT.

Give me a statement $A(x)$. Then $T$ can prove that there exists a sentence $\phi$ where $\phi$ is true if and only if $A(\ulcorner \phi \urcorner)$ is true, where $\ulcorner \phi \urcorner$ is the numerical representation of the bytes in $\phi$. Mathematically, we can express the same as follows, with $T \vdash$ read “$T$ proves”:

∀A ∃φ: T ⊢ φ ⟺ A(⌜φ⌝)

So let’s say that we’re dealing with A(x) := x * 2 = 4. Then there’s a sentence φ where strong enough theories can prove that φ ⟺ ⌜φ⌝ * 2 = 4. In this case, we can choose φ to be the sentence 0=1. Then ⌜φ⌝ = 3161393, and indeed T can prove that

0=1 ⟺ 3161393 * 2 = 4

since both sides are false.
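We can double-check that number using the same byte-encoding as the appendix. (A quick sketch; the helper name is my own.)

```typescript
// Interpret the UTF-8 bytes of a formula as one big integer (its Gödel number).
const goedel = (s: string): bigint => {
  const hex = Array.from(new TextEncoder().encode(s))
    .map(b => b.toString(16).padStart(2, "0"))
    .join("");
  return BigInt(`0x${hex}`);
};

console.log(goedel("0=1")); // 3161393n
```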

In the domain of programs, we’ve seen that any program you can name that transforms programs into other programs has a fixed point. The Diagonal Lemma says something extremely similar for math: any statement you can name that maps the numerical representation of a statement to another statement has a fixed point. And as a bonus, strong enough theories can prove that.

Proof Sketch: Diagonal Lemma

Consider a sufficiently strong theory T and an arbitrary formula A with one free variable. Then there exists a ψ so that T proves that ψ ⟺ A(⌜ψ⌝).

The set of formulas with one free variable is enumerable (for example, by printing all strings and then discarding any that don’t parse as a formula with exactly one free variable):

274 => "x = 1"
...
672 => "x = x"
...
1837501 => "5x = x && 4 = 5"

We’re going to define a function f(s)(s′) which evaluates s on the byte-representation of s′. For example, f("x is prime")("x*x=100") evaluates to “33823693116354608 is prime” (the string “x*x=100” in UTF-8 has the decimal representation 33823693116354608). Notice that f maps two formulas with a free variable into a single sentence with no free variables.
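As a sketch, f can be implemented by substituting the Gödel number of the second formula for the free variable in the first. (The helper names are mine, and the naive regex substitution assumes the variable symbol x doesn’t appear inside other words.)

```typescript
// Gödel-number a string by reading its UTF-8 bytes as one big integer.
const goedel = (s: string): bigint => {
  const hex = Array.from(new TextEncoder().encode(s))
    .map(b => b.toString(16).padStart(2, "0"))
    .join("");
  return BigInt(`0x${hex}`);
};

// f(s)(s'): plug the Gödel number of s' in for the free variable x in s.
const f = (s: string) => (sPrime: string): string =>
  s.replace(/x/g, goedel(sPrime).toString());

console.log(f("x is prime")("x*x=100")); // "33823693116354608 is prime"
```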

Let δ(φ) = A(⌜φ⌝).

Now for the trilemma. All the formulas in our system definitely have a written representation, and all written representations with one free variable definitely count as formulas in our system.

So it must be that δ has a fixed point. And that fixed point is exactly what we’re after: some ψ so that ψ and A(⌜ψ⌝) are logically equivalent. (Notice that we’ve shifted from “output equivalence” to “logical equivalence.”)

So we’ve seen how we can prove this ourselves. We’ll omit the proof that T itself can also prove it, since it’s a bit too hairy for our taste. If you look into it, it basically involves showing that T can reconstruct and prove this same argument inside itself.

Proof Sketch: Gödel’s Incompleteness Theorem

We’ve finally arrived at Gödel’s Incompleteness Theorem. Here’s a quick reminder of the statement.

Gödel’s First Incompleteness Theorem. Any sufficiently powerful theory must be either inconsistent or incomplete.

The core idea is to construct a sentence S like “this sentence is unprovable.” If S is provable, then S is false, so the surrounding theory can prove something contradictory (this is called being inconsistent). If T can’t prove S, then S is true, so the surrounding theory has truths that it can’t prove (this is called being incomplete).

That’s a slick argument, but are sentences like “this sentence is unprovable” even allowed? How do you express something like “this sentence” formally? The Diagonal Lemma allows us to do just that.

Let A(⌜x⌝) = “∀p, the string p is not a proof of x.” In other words, A(⌜x⌝) = “x is unprovable”.

Then by the Diagonal Lemma, we can use T to prove that there exists an x with x ⟺ A(⌜x⌝), or x ⟺ (x is unprovable). And “x ⟺ (x is unprovable)” is exactly “this sentence is unprovable.” So we can build these sentences!

Proof Sketch: Tarski’s Theorem

Here’s a quick reminder of what Tarski’s Theorem says.

Tarski’s Theorem. Any sufficiently powerful theory cannot define its internal notion of truth.

To prove it, we take A(⌜x⌝) = (x is not true). The Diagonal Lemma guarantees that we can prove the existence of an x such that x ⟺ (x is not true). But “x is true ⟺ x is not true” is obviously contradictory.

The only thing we can point to is our original A, which must have been flawed. It must be that theories cannot define predicates like “x is true” or “x is not true.”

Conclusion

I hope you’re convinced that diagonalization is an extremely powerful tool.

From our original diagonalization trilemma, we were able to immediately show that primitive recursive functions aren’t complete, that there are more reals than naturals, and that program-transforming programs must all have fixed points.

Then we used the same trilemma to construct a new tool, the Diagonal Lemma, and closed out the post by using it to construct the strange sentences required for Gödel and Tarski’s famous theorems.

We’ve kept the math relatively light, leaning on intuition and code where necessary. If you want a more rigorous treatment of this topic, I highly recommend continuing with Yanofsky [0], upon which this post is based.

Notes

[0] Note that we’ll use the curried f(3)(5) and uncurried f(3, 5) forms interchangeably in this post.

[1] A delightful corollary: if your language is weak enough that you can solve the halting problem, then it must not be able to build an interpreter for itself. That’s what happens with P.R. functions, which all halt.

Appendix

Gödel Numbers

Gödel numbers are a way of converting formulas into numbers. We denote the Gödel number of x as ⌜x⌝. You can imagine ⌜x⌝ as just interpreting the underlying bytes of x as an integer instead of a string.

const gödel = (s: string): bigint => {
    const byteArray = new TextEncoder().encode(s);
    const hexString = Array.from(byteArray).map(byte => byte.toString(16).padStart(2, '0')).join('');
    return BigInt(`0x${hexString}`);
}

const unGödel = (n: bigint): string => {
    let hexString = n.toString(16);
    // toString(16) drops a leading zero, so re-pad to a whole number of bytes
    if (hexString.length % 2 === 1) hexString = '0' + hexString;
    const byteArray = new Uint8Array(hexString.match(/.{1,2}/g)!.map(byte => parseInt(byte, 16)));
    return new TextDecoder().decode(byteArray);
}

console.log(gödel("x * 2 = 4")) // 2215926989203043065908n
console.log(unGödel(2215926989203043065908n)) // x * 2 = 4

Diagonalization Proof (no exposition)

Here’s the straight math version of the general diagonalization argument.

Diagonalization. For all sets X and Y with a function f: X × X → Y and a function δ: Y → Y with no fixed point, there exists at least one function g: X → Y such that there exists no k such that f(k) = g.

Proof. Let g(x) = δ(f(x, x)). For any k, g ≠ f(k), since g(k) ≠ f(k)(k).
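As a finite sanity check, here’s a toy instance of my own: X = {0, 1, 2}, Y = the booleans, and δ = negation, which has no fixed point on booleans.

```typescript
// f encodes an arbitrary 3x3 boolean table: f(i)(j) is row i, column j.
const X = [0, 1, 2];
const f = (i: number) => (j: number): boolean => ((i >> j) & 1) === 1;

// Negation has no fixed point on booleans.
const delta = (y: boolean): boolean => !y;

// The diagonal construction: g disagrees with row k at column k,
// so g can't equal any row f(k).
const g = (x: number): boolean => delta(f(x)(x));

for (const k of X) {
  console.log(g(k) !== f(k)(k)); // true for every k
}
```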

Acknowledgements

This post (and the primitive recursive post) were prepared as part of my final work for the MIT class Seminar in Logic (18.504). Thanks to Henry Cohn, Lisa Kondrich, and Laker Newhouse for their extensive suggestions, feedback, and help with both this post and the primitive recursive introduction post.

References

[0] Noson S. Yanofsky (2003). A Universal Approach to Self-Referential Paradoxes, Incompleteness and Fixed Points. Bulletin of Symbolic Logic, 9(3), 362–386.

[1] Garrabrant, S., & Eisenstat, S. (n.d.). Diagonalization fixed point exercises. Retrieved from https://www.lesswrong.com/posts/FZkLa3GRLW97fpknG/diagonalization-fixed-point-exercises

[2] Kelly, K. T. (n.d.). Computability and Learnability.

[3] Primitive recursive function. (2023). Retrieved from https://en.wikipedia.org/wiki/Primitive_recursive_function

[4] Smith, P. (2013). An Introduction to Gödel’s Theorems. Cambridge, England: Cambridge University Press.

[5] Solomon, E. (1958). Unpacking the diagonal lemma. Retrieved from https://math.stackexchange.com/questions/127852/unpacking-the-diagonal-lemma