<h1>Preventing Log4j with Capabilities</h1>
<p>A capability-safe language would have minimized the impact of, or even
prevented, the <a href="https://www.lunasec.io/docs/blog/log4j-zero-day/">log4j
vulnerability</a>.</p>
<p>If this is the first time you’re hearing about that vulnerability, you should go
read about it instead of this post! And also go patch it, if you have Java
software that uses or might transitively use <code>log4j</code> to log. It’s a doozy.</p>
<p>There are multiple issues surrounding this vulnerability that I’ll talk about in
this post:</p>
<ol>
<li>It’s doing string interpolation on user supplied strings.</li>
<li>It’s accessing the network, without anyone realizing it might do that.
(This is the part that capabilities would help with.)</li>
</ol>
<h2 id="the-surface-issue-string-interpolation-on-user-supplied-strings">The surface issue: String interpolation on user supplied strings</h2>
<p>The first issue is that <code>log4j</code> was performing string replacement on user
supplied strings, not just on the template strings written by the developer.
For example, if the developer using <code>log4j</code> writes this:</p>
<div class="highlight"><pre><code class="java"><span class="n">logger</span><span class="o">.</span><span class="na">debug</span><span class="o">(</span><span class="s">"user-name={}"</span><span class="o">,</span> <span class="n">userName</span><span class="o">);</span>
</code></pre>
</div>
<p>Then <code>log4j</code> will substitute <code>userName</code> into <code>{}</code> as it should, but it will
<em>also</em> perform string replacements inside <code>userName</code>. So if someone picks the
user name <code>{o}${o}</code> because they think it looks like a pair of glasses, then
this line of code will attempt to expand <code>${o}</code> by looking up what <code>o</code> expands
to in a developer provided configuration file.</p>
<p>But that’s nonsensical: a user name is a <em>string</em>, not a <em>logging template</em>.
The user who picked <code>{o}${o}</code> probably doesn’t even know how to program; they
were not attempting to write a <code>log4j</code> string template, they were drawing a
pair of glasses!</p>
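<p>To make the bug concrete, here’s a toy sketch (emphatically <em>not</em> log4j’s actual code, and with made-up names) of the difference between expanding <code>${...}</code> lookups only in the developer’s template versus also in substituted user data:</p>

```java
import java.util.Map;

public class TemplateDemo {
    // Stands in for a developer-provided configuration file.
    static final Map<String, String> CONFIG = Map.of("o", "[expanded]");

    // Replace each ${key} with its value from the config.
    static String expandLookups(String s) {
        for (Map.Entry<String, String> e : CONFIG.entrySet()) {
            s = s.replace("${" + e.getKey() + "}", e.getValue());
        }
        return s;
    }

    // Buggy order: substitute the user string first, then expand lookups,
    // so lookups hidden inside user data get expanded too.
    static String buggyLog(String template, String userData) {
        return expandLookups(template.replace("{}", userData));
    }

    // Correct order: expand lookups in the template only, then substitute.
    static String fixedLog(String template, String userData) {
        return expandLookups(template).replace("{}", userData);
    }

    public static void main(String[] args) {
        String glasses = "{o}${o}";
        System.out.println(buggyLog("user-name={}", glasses)); // user-name={o}[expanded]
        System.out.println(fixedLog("user-name={}", glasses)); // user-name={o}${o}
    }
}
```

In the fixed ordering, the user’s glasses survive untouched; in the buggy ordering, part of them gets treated as a template.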
<p>Contrast that to this code:</p>
<div class="highlight"><pre><code class="java"><span class="n">logger</span><span class="o">.</span><span class="na">debug</span><span class="o">(</span><span class="s">"user-name="</span> <span class="o">+</span> <span class="n">userName</span><span class="o">);</span>
</code></pre>
</div>
<p>In this case, <code>logger.debug()</code> has no way of knowing that <code>userName</code> might
contain user data: it was simply handed a single argument, and its first
argument is meant to be a template. Thus it is appropriate for it to try to
expand <code>${o}</code> in the log message. Writing this code is a mistake by the
<em>developer</em> using <code>log4j</code>, whereas the behavior of the <code>"user-name={}"</code> code is
a bug in <code>log4j</code>.</p>
<p>If this bug weren’t present, there would probably be far fewer vulnerable
applications in the wild, because many developers using <code>log4j</code> probably did use
the form <code>"user-name={}"</code> instead of the form <code>"user-name=" + userName</code>.</p>
<h2 id="the-deeper-issue-why-is-my-logger-using-the-network">The deeper issue: Why is my logger using the network?</h2>
<p>That was one issue. Another, deeper, issue is that <code>log4j</code> was fetching
arbitrary Java code off the network and executing it, <em>when no one expected it
to</em>.</p>
<p>I have, in my head, a little person who was born and raised in a world where
capability-safe software is the default. And this person is yelling. He is
yelling:</p>
<blockquote>
<p>I hear this vulnerability affected pretty much everyone using <code>log4j</code>. But
why did everyone pass the network to the logger? Sure, maybe some people
wanted to use JNDI and LDAP or something, but most people didn’t, so why would
those people <em>give the logger the network</em>?</p>
</blockquote>
<p>The answer, of course, is that no one “gave” the logger access to the network.
It just <em>had</em> access, because <em>all code</em> in Java has access to the network.
You can tell by these type signatures in the Java <code>net</code> library:</p>
<div class="highlight"><pre><code class="java"><span class="kd">public</span> <span class="kd">class</span> <span class="nc">URL</span> <span class="o">{</span>
<span class="c1">// Make a new URL. Anyone can do it.</span>
<span class="kd">public</span> <span class="nf">URL</span><span class="o">(</span><span class="n">String</span><span class="o">)</span> <span class="o">;</span>
<span class="c1">// Turn a URL into a URLConnection. Anyone can do it.</span>
<span class="c1">// (Despite the name, this doesn't actually open the connection,</span>
<span class="c1">// it just makes a URLConnection object.)</span>
<span class="kd">public</span> <span class="n">URLConnection</span> <span class="nf">openConnection</span><span class="o">();</span>
<span class="o">}</span>
<span class="kd">public</span> <span class="kd">class</span> <span class="nc">URLConnection</span> <span class="o">{</span>
<span class="c1">// Actually open the connection. Anyone can do it.</span>
<span class="kd">public</span> <span class="kd">abstract</span> <span class="kt">void</span> <span class="nf">connect</span><span class="o">();</span>
<span class="o">}</span>
</code></pre>
</div>
<p>(Links: <a href="https://docs.oracle.com/javase/7/docs/api/java/net/URL.html#URL(java.lang.String)">new URL</a>, <a href="https://docs.oracle.com/javase/7/docs/api/java/net/URL.html#openConnection()">openConnection</a>, <a href="https://docs.oracle.com/javase/7/docs/api/java/net/URLConnection.html#connect()">connect</a>.)</p>
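<p>Putting the chain together, this is all it takes for arbitrary code to reach the network (stopping just short of actually calling <code>connect()</code>):</p>

```java
import java.net.URL;
import java.net.URLConnection;

public class AmbientNetwork {
    public static void main(String[] args) throws Exception {
        // Step 1: anyone can make a URL from a plain string.
        URL url = new URL("https://example.com/payload");
        // Step 2: anyone can turn it into a URLConnection.
        // (This does not yet open the connection.)
        URLConnection conn = url.openConnection();
        // Step 3: conn.connect() would actually open it.
        System.out.println(url.getHost());
    }
}
```

No capability is handed in from the caller anywhere in this chain; the network is simply ambient.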
<p>By chaining these three methods together, arbitrary Java code can open a
connection to any URL it wants to. This may not look strange to you or me, but
it looks <em>very</em> problematic to the little capability-person in my head. He is
saying:</p>
<blockquote>
<p>Wait, these three methods together let you create a network connection <em>from
nothing</em>? That’s a violation of the integrity of your type system!</p>
</blockquote>
<blockquote>
<p>It’s like… say there is an authentication package, that <em>all</em> authentication
goes through, and you can try to authenticate a user, and if it passes you’ll
get an <code>AuthenticatedUser</code>, and then you can use the <code>AuthenticatedUser</code> to
perform more privileged actions.</p>
<p>For this to work well, it’s important that <em>all</em> authentication happens in the
authentication package, and that <em>only it</em> can create an <code>AuthenticatedUser</code>.
You can do this in Java, by making the constructors for <code>AuthenticatedUser</code>
non-public and ensuring that they are only called in the authentication
package, and only if the authentication succeeds. This can be a very useful
abstraction in a large codebase: it tells you that (certain kinds of)
authentication bugs can <em>only</em> happen inside the authentication package.</p>
<p>And this abstraction <em>breaks</em> if random code can conjure up an
<code>AuthenticatedUser</code> <em>from nothing</em>, and use it to perform privileged actions.</p>
<p>Likewise, you shouldn’t be able to conjure up a network connection <em>from
nothing</em>. Any network connection must ultimately originate from the <code>Network</code>
object.</p>
<p>[…listening…]</p>
<p>Oh, you don’t <em>have</em> a <code>Network</code> object? So any random library code can just
access the network, on its own. And this is true not only of your
dependencies, but the dependencies of your dependencies. So the only way to
check if your application might transitively access the network would be to like…
search through the source code of all of your transitive dependencies? <em>Wow.</em>
Just wow. And you’re wondering why you have so many vulner—</p>
</blockquote>
<p>Let’s cut off my imaginary capabilities-person there, he’s getting a little
snarky.</p>
<p>What he’s making fun of us for not having is a different API that looks
like this (or something like it; there are a lot of ways to organize it):</p>
<div class="highlight"><pre><code class="java"><span class="kd">public</span> <span class="kd">class</span> <span class="nc">URL</span> <span class="o">{</span>
<span class="c1">// Make a new URL. Anyone can do it.</span>
<span class="k">new</span> <span class="nf">URL</span><span class="o">(</span><span class="n">String</span><span class="o">);</span>
<span class="o">}</span>
<span class="c1">// The ultimate source of all network access.</span>
<span class="c1">// This class is a capability. It grants access to all URLs.</span>
<span class="kd">class</span> <span class="nc">Network</span> <span class="o">{</span>
<span class="c1">// Turn a URL into a URLConnection.</span>
<span class="c1">// You can only do this if you have a Network object.</span>
<span class="c1">// Once it's done, the URLConnection grants the capability to open the</span>
<span class="c1">// connection to that url.</span>
<span class="kd">public</span> <span class="n">URLConnection</span> <span class="nf">openConnection</span><span class="o">(</span><span class="n">URL</span><span class="o">);</span>
<span class="o">}</span>
<span class="c1">// This class is a capability. It grants access to one particular URL.</span>
<span class="kd">class</span> <span class="nc">URLConnection</span> <span class="o">{</span>
<span class="c1">// Actually open the connection.</span>
<span class="kd">public</span> <span class="kd">abstract</span> <span class="kt">void</span> <span class="nf">connect</span><span class="o">();</span>
<span class="o">}</span>
</code></pre>
</div>
<p>This raises the question: who can construct a <code>Network</code> object? If anyone can just
make one, then nothing substantial has changed. The <code>log4j</code> package (or really,
its JNDI dependency) would privately construct a <code>Network</code> and otherwise do the
same thing.</p>
<p>The trick is, <em>there are no constructors for <code>Network</code></em>. Instead, there is
exactly one <code>Network</code> object ever in existence, and it is passed in to the
program at one location, perhaps as an argument to <code>main</code>. (Actually, if the
operating system was capability-safe and Java was cooperating with it, then
<code>main</code> would be given a <code>Network</code> if and only if the executable was given
network access.) And likewise for similar capabilities like a <code>Filesystem</code>
object:</p>
<div class="highlight"><pre><code class="java"><span class="kd">public</span> <span class="kd">static</span> <span class="kt">void</span> <span class="nf">main</span><span class="o">(</span>
<span class="n">String</span> <span class="n">args</span><span class="o">[],</span>
<span class="n">Network</span> <span class="n">network</span><span class="o">,</span>
<span class="n">Filesystem</span> <span class="n">filesystem</span><span class="o">)</span> <span class="o">{</span>
<span class="o">...</span>
<span class="o">}</span>
</code></pre>
</div>
<p>The point is to use unforgeable Java objects to grant capabilities. Unforgeable
means that arbitrary code can’t create one from nothing; this can be
accomplished in Java simply by it not having constructors. If you pass some code
a reference to one of these capability objects, directly or indirectly, you are
granting it access to the resource it represents. This is the essence of
capabilities: <strong>unforgeable objects that grant access to the resource they
represent</strong>. It’s very simple.</p>
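<p>Here’s a minimal sketch of the pattern, with hypothetical names. The private constructor is what makes <code>Network</code> unforgeable; the single instance is minted at startup (simulated here, since the real handoff would come from the runtime) and flows only where it is explicitly passed:</p>

```java
final class Network {
    private Network() {} // no public constructor: code can't forge one
    // Stands in for the OS/runtime handing main its one Network.
    static Network bootstrapOnlyOnce() { return new Network(); }
    String open(String host) { return "connected:" + host; }
}

final class Logger {
    private final StringBuilder buffer = new StringBuilder();
    // No Network parameter anywhere: this logger *cannot* reach the network.
    void info(String msg) { buffer.append(msg).append('\n'); }
    String dump() { return buffer.toString(); }
}

public class CapabilityDemo {
    public static void main(String[] args) {
        Network network = Network.bootstrapOnlyOnce();
        Logger logger = new Logger();
        logger.info("Adding one");
        // Only main holds the capability, so only main can open connections.
        System.out.println(network.open("example.com"));
        System.out.print(logger.dump());
    }
}
```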
<p>Let’s see how capability safety would influence <code>log4j</code>. First off, here’s what
<code>log4j</code>’s interface looks like currently:</p>
<div class="highlight"><pre><code class="java"><span class="kn">import</span> <span class="nn">org.apache.log4j.Logger</span><span class="o">;</span>
<span class="kd">public</span> <span class="kd">class</span> <span class="nc">Incrementer</span> <span class="o">{</span>
<span class="kd">private</span> <span class="kd">static</span> <span class="kd">final</span> <span class="n">Logger</span> <span class="n">LOGGER</span>
<span class="o">=</span> <span class="n">Logger</span><span class="o">.</span><span class="na">getLogger</span><span class="o">(</span><span class="n">Incrementer</span><span class="o">.</span><span class="na">class</span><span class="o">);</span>
<span class="kd">public</span> <span class="kt">int</span> <span class="nf">increment</span><span class="o">(</span><span class="kt">int</span> <span class="n">number</span><span class="o">)</span> <span class="o">{</span>
<span class="n">LOGGER</span><span class="o">.</span><span class="na">info</span><span class="o">(</span><span class="s">"Adding one"</span><span class="o">);</span>
<span class="k">return</span> <span class="n">number</span> <span class="o">+</span> <span class="mi">1</span><span class="o">;</span>
<span class="o">}</span>
<span class="o">}</span>
</code></pre>
</div>
<p>Notice that <code>Network</code> isn’t passed to <code>LOGGER</code>. Thus <code>log4j</code> <em>can’t</em> access
the network! So when the <code>log4j</code> maintainers considered implementing the JNDI
feature that introduced the vulnerability, a few things could have happened
next:</p>
<ol>
<li>They decide that it’s totally reasonable for a logging
library to access the network all the time by default, and add a <code>Network</code>
argument to the <code>Logger.getLogger()</code> method. Some users accept this and fall
prey to the vulnerability, but more discerning users are concerned by this
request for network access and switch to a simpler logging library that
doesn’t require it, thus avoiding the vulnerability.</li>
<li>They don’t think a logging library should be accessing the
network at all, or feel that requiring a <code>Network</code> parameter would be
frightening or inconvenient to users, and reject the feature.</li>
<li>They think that the feature is worthwhile, but don’t want
the breaking API change of modifying <code>getLogger</code> to require a <code>Network</code>. So
instead they introduce a new method, perhaps called
<code>Logger.getLoggerWithNetwork(MyClass.class, network)</code>. Since most users don’t
use this method and <code>log4j</code> can’t access the network without it, this
prevents the great majority of vulnerabilities.</li>
</ol>
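<p>As a concrete sketch of that third option (hypothetical names; this is not a real log4j API), the existing entry point stays network-free and a new entry point opts in:</p>

```java
final class Network {
    private Network() {}
    static Network forDemo() { return new Network(); } // stands in for the real handoff
}

class Logger {
    private final Network network; // null means: cannot touch the network
    private Logger(Network n) { this.network = n; }

    // The familiar entry point: no capability, so no JNDI, no network.
    static Logger getLogger(Class<?> c) { return new Logger(null); }
    // The opt-in entry point for users who actually want network lookups.
    static Logger getLoggerWithNetwork(Class<?> c, Network n) { return new Logger(n); }

    boolean canUseJndi() { return network != null; }
}

public class OptInDemo {
    public static void main(String[] args) {
        Logger plain = Logger.getLogger(OptInDemo.class);
        Logger jndi = Logger.getLoggerWithNetwork(OptInDemo.class, Network.forDemo());
        System.out.println(plain.canUseJndi()); // false
        System.out.println(jndi.canUseJndi());  // true
    }
}
```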
<p>All three possibilities are better than what actually happened, which was that
<code>log4j</code> suddenly gained the ability to access the network, but its API did
not change to reflect this so users did not notice. Thus, a capability-safe
language would have saved, or at least mitigated, the day.</p>
<p>So future language designers, please consider making your language
capability-safe.</p>
<p>I’m not actually sure what’s a good reference for more reading, besides Mark
Miller’s thesis if you really want to get into it. But here are some
possibilities:</p>
<ul>
<li><a href="http://www.erights.org/talks/thesis/markm-thesis.pdf">Mark Miller’s thesis</a></li>
<li><a href="https://fuchsia.dev/fuchsia-src/concepts/components/v2/capabilities?hl=en">Capabilities in
Fuchsia</a></li>
<li><a href="https://github.com/bytecodealliance/cap-std">Capabilities in Rust</a></li>
</ul>
Sun, 26 Dec 2021 00:00:00 -0500
http://justinpombrio.net//2021/12/26/preventing-log4j-with-capabilities.html
<h1>Traffic Engineering with Portals, Part II</h1>
<p>In <a href="https://justinpombrio.net/2021/05/15/traffic-engineering-with-portals.html">part I</a>, I
introduced the question of how best to arrange a network of portals for efficient travel, and
proposed a (to my mind) satisfying <em>bit shifting</em> solution.</p>
<p><em>(This post is going to go faster with less detailed explanation than the last. If parts are
confusing, try working the math out for yourself!)</em></p>
<h2 id="slightly-more-efficient-portal-network">Slightly more efficient portal network</h2>
<p>But you can do better than bit-shifting! Alex Elsayed points out that my “bit shifting” approach has
been studied, and is called <a href="https://en.wikipedia.org/wiki/De_Bruijn_graph">de Bruijn graphs</a>. And
further, there’s a better approach, called <a href="https://en.wikipedia.org/wiki/Kautz_graph">Kautz
graphs</a>! For portal networks with out-degree 2, Kautz
graphs let you have 50% more nodes, or alternatively reduce the route length by about 1/2.</p>
<p>To repeat from last time, my bit-shifting solution—a.k.a. de Bruijn graphs—was to label each hub
with a 20-bit address, and have two outgoing portals from each hub, labeled 0 and 1. Exiting through
the 0 portal shifts the bits in the address left and adds a zero at the end:</p>
<pre><code>a1 a2 ... a19 a20 -> a2 a3 ... a19 a20 0
</code></pre>
<p>and the 1 portal adds a one at the end instead:</p>
<pre><code>a1 a2 ... a19 a20 -> a2 a3 ... a19 a20 1
</code></pre>
<p>If you follow the sequence of portals that spells out your destination address in binary, you’ll get
there in 20 hops. And this is exactly what de Bruijn graphs are.</p>
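<p>This routing rule is easy to check in code. A small sketch: starting from <em>any</em> hub, following the destination’s 20 bits most-significant-first always lands at the destination in exactly 20 hops:</p>

```java
public class DeBruijnRoute {
    static final int BITS = 20;
    static final int MASK = (1 << BITS) - 1;

    // Exiting through portal b (0 or 1) shifts the address left and appends b.
    static int step(int hub, int portal) {
        return ((hub << 1) | portal) & MASK;
    }

    // Follow the destination's bits, most significant first.
    static int travel(int src, int dst) {
        int hub = src;
        for (int i = BITS - 1; i >= 0; i--) {
            hub = step(hub, (dst >> i) & 1);
        }
        return hub;
    }

    public static void main(String[] args) {
        System.out.println(travel(0xABCDE, 0x12345) == 0x12345); // true
    }
}
```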
<p>Kautz graphs are similar, with one modification: adjacent digits of the address are required to be
different. This makes binary addresses pretty useless (they would have to alternate 010101…!), so
let’s assume a 20-digit ternary address instead (larger bases also work). How many such addresses
are there? There are three possibilities for the first digit, and two possibilities for each
subsequent digit (since it can’t be the same as the previous digit). Thus a 20-digit ternary Kautz
address has <code>3*2^19</code> possibilities.</p>
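<p>You can verify the <code>3*2^(n-1)</code> count by brute force for small <code>n</code> (20 digits is far too many to enumerate, but the recurrence is the same at every length):</p>

```java
public class KautzCount {
    // Count n-digit ternary strings in which no two adjacent digits are equal.
    static int count(int n) {
        int limit = 1;
        for (int i = 0; i < n; i++) limit *= 3; // 3^n candidate strings
        int total = 0;
        for (int x = 0; x < limit; x++) {
            int v = x, prev = -1;
            boolean ok = true;
            for (int i = 0; i < n; i++) {
                int digit = v % 3;
                if (digit == prev) { ok = false; break; }
                prev = digit;
                v /= 3;
            }
            if (ok) total++;
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(count(5));     // 48
        System.out.println(3 * (1 << 4)); // 48, i.e. 3 * 2^(5-1)
    }
}
```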
<p>And how about portals and routes? There are two exit portals from each hub (there <em>would</em> be three,
but one of the digits would make an illegal address so you can skip that one). Typically, if you
want to get from address <code>abcd</code> to address <code>WXYZ</code> (assuming that addresses are length 4 to keep this
example short) you can proceed as with a de Bruijn network:</p>
<pre><code> abcd
portal_W: bcdW
portal_X: cdWX
portal_Y: dWXY
portal_Z: WXYZ
</code></pre>
<p>But if <code>W</code> is the same digit as <code>d</code>, then this route would go through illegal addresses! So in that
case, to get from <code>abcd</code> to <code>WXYZ</code> you have to take a different route, one that is one hop shorter:</p>
<pre><code> abcd
portal_X: bcdX
portal_Y: cdXY
portal_Z: dXYZ
</code></pre>
<p>There’s a 1/3 chance that <code>W</code> is the same digit as <code>d</code>, so the average route length for 20-digit
addresses is 19 2/3.</p>
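<p>Here’s a brute-force check of that 1/3 figure on short addresses (length 4, so the enumeration stays tiny; it also reconfirms the <code>3*2^(n-1)</code> address count):</p>

```java
import java.util.ArrayList;
import java.util.List;

public class KautzRoutes {
    // Enumerate all n-digit ternary addresses with distinct adjacent digits.
    static List<int[]> addresses(int n) {
        List<int[]> out = new ArrayList<>();
        build(new int[n], 0, out);
        return out;
    }

    static void build(int[] a, int i, List<int[]> out) {
        if (i == a.length) { out.add(a.clone()); return; }
        for (int d = 0; d < 3; d++) {
            if (i > 0 && d == a[i - 1]) continue; // adjacent digits must differ
            a[i] = d;
            build(a, i + 1, out);
        }
    }

    // Fraction of ordered (src, dst) pairs that need the one-hop-shorter
    // detour, i.e. dst's first digit equals src's last digit.
    static double collisionFraction(int n) {
        List<int[]> addrs = addresses(n);
        long hits = 0;
        for (int[] src : addrs)
            for (int[] dst : addrs)
                if (src[n - 1] == dst[0]) hits++;
        return (double) hits / ((long) addrs.size() * addrs.size());
    }

    public static void main(String[] args) {
        System.out.println(addresses(4).size());  // 24 = 3 * 2^3
        System.out.println(collisionFraction(4)); // exactly 1/3
    }
}
```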
<p>Putting this all together, we can compare bit-shifting, a.k.a. de Bruijn, with Kautz:</p>
<pre><code>Method | Num hubs | Deg | Avg. route len
---------+----------+-----+---------------
DeBruijn | 2^20 | 2 | 20
Kautz | 3*2^19 | 2 | 19.67
</code></pre>
<p>(“degree” means “number of exit portals per hub”.) So the Kautz approach connects more hubs while
having slightly shorter routes.</p>
<h2 id="much-more-efficient-portal-network">Much more efficient portal network</h2>
<p>But you can do even better than that! My friend David Meierfrankenfeld designed a portal network
that’s better than I thought was possible.</p>
<p>It works like this. We’ll split the hubs into two categories, <em>destination hubs</em> and <em>transit hubs</em>,
and arrange the network as follows:</p>
<ul>
<li>There are <code>8^6 = 2^18</code> transit hubs. These transit hubs are connected <em>to each other</em> via my “bit
shifting” (a.k.a. de Bruijn) arrangement but in octal instead of binary. Thus you can get from any
transit hub to any other in 6 hops.</li>
<li>Each transit hub is “assigned” 8 destination hubs. Thus there are <code>8^6*8 = 2^21</code> destination hubs,
and 8/9 of all hubs are destination hubs. The eight destination hubs assigned to a transit hub <code>h</code>
are connected in two strings of four, like this: <code>h->1->2->3->4->h</code> and <code>h->5->6->7->8->h</code>. Thus
each transit hub has degree ten: 8 portals to other transit hubs, and 2 to destination hubs. And
each destination hub has degree 1.</li>
<li>
<p>An address takes the form:</p>
<pre><code> T[6 digit octal code][a or b repeated 0 to 4 times]
</code></pre>
<p>For example, <code>221357aa</code> is a valid address.</p>
</li>
</ul>
<p>To navigate from one destination hub to another, you walk along the string of destination hubs until
you get to the transit hub, then use the octal address to get to the target transit hub, then walk
along the appropriate string of destination hubs according to the (a or b repeated 0 to 4 times)
portion of the address.</p>
<p>This portal network is surprisingly efficient:</p>
<ul>
<li>The average degree is 2. This is because 1/9 of the hubs have degree 10 and 8/9 have degree 1, so
the average degree is <code>1/9 * 10 + 8/9 * 1 = 18/9 = 2</code>.</li>
<li>The average route length is 11. To prove this, first notice that the route between two transit
hubs is always 6. And the average extra route length following the a/b portion of an address is 5
(2.5 going out plus 2.5 going in), so the average route length is <code>6 + 5 = 11</code>. (One nice property
is that the <em>round trip</em> route length is always exactly 22.)</li>
</ul>
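<p>These two averages are quick to recompute:</p>

```java
public class MeierfrankenfeldStats {
    // Average hops along a string of four destination hubs: leaving from
    // position k takes 5-k hops to reach the transit hub, and arriving at
    // position k takes k hops from the transit hub.
    static double avgRouteLength() {
        double out = 0, in = 0;
        for (int k = 1; k <= 4; k++) { out += 5 - k; in += k; }
        return 6 + out / 4 + in / 4; // 6 transit hops + 2.5 out + 2.5 in
    }

    // 1/9 of hubs are transit hubs (degree 10), 8/9 are destination hubs (degree 1).
    static double avgDegree() {
        return (1.0 * 10 + 8.0 * 1) / 9;
    }

    public static void main(String[] args) {
        System.out.println(avgRouteLength()); // 11.0
        System.out.println(avgDegree());      // 2.0
    }
}
```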
<p>This is a good deal more efficient than bit shifting! The thing to compare it to is bit shifting
(i.e. a de Bruijn graph) using a binary address of length 21:</p>
<pre><code>Method | Num hubs | Avg. deg | Avg. route len
-----------------+------------+----------+---------------
DeBruijn | 2^21 | 2 | 21
Meierfrankenfeld | (9/8)*2^21 | 2 | 11
</code></pre>
<p>That nearly halves the average route length!</p>
<p><strong>Question for the reader:</strong> in the last post, I proved that for <code>2^20</code> hubs and degree 2, you
couldn’t have route lengths smaller than <code>2^19</code>. Which is very wrong! Meierfrankenfeld’s approach
has routes of length 11 for even more hubs than that. What assumption(s) was the proof making that
are invalidated in Meierfrankenfeld’s approach?</p>
<p>This raises another question. There must be some upper bound on how well you could do. What is it?</p>
<p>I have a hypothesis, which I will humbly call the:</p>
<h2 id="fundamental-theorem-of-portal-traffic-engineering">Fundamental Theorem of Portal Traffic Engineering</h2>
<p>Suppose we have a portal network (i.e. graph), with <code>N</code> hubs (i.e. nodes) and exactly one
designated <em>route</em> from any hub to any other hub. That is, while there may be more than one
<em>path</em> from one hub to another, exactly one of these paths is a <em>route</em>.</p>
<p>Furthermore, define:</p>
<ul>
<li><code>len</code> is a random variable giving the length of a route chosen uniformly at random.</li>
<li><code>H[len]</code> is the <a href="https://en.wikipedia.org/wiki/Entropy_(information_theory)">entropy</a> of <code>len</code>.
For example, if half the routes are one length and half another, then <code>H[len]</code> is <code>log 2</code>, i.e.
1 bit. Let’s use base 2 for the logs.</li>
<li><code>Avg[len]</code> is the average route length.</li>
<li><code>flow(h)</code> is the fraction of all traffic that proceeds through an exit portal at hub h.
In other words, if you snip each route into individual <em>hops</em> from one portal to another, it’s the
fraction of all hops (among all routes) that start at h. So <code>Sum_h flow(h) = 1</code> by definition.</li>
<li><code>deg(h)</code> is the degree of hub h, i.e. the number of exit portals at h.</li>
<li><code>N</code> is the number of hubs.</li>
</ul>
<p>Then:</p>
<pre><code>H[len] + Avg[len] * Sum_h (flow(h) * log deg(h)) >= log N
</code></pre>
<p>Call the left hand side of this inequality <code>E</code>, because it’s an upper bound of the <em>entropy</em> of the
routes.</p>
<h3 id="proof-sketch">Proof Sketch</h3>
<p>Place a person who knows the navigation rules for the portal network at a source hub selected
uniformly at random. Hand them a uniformly random destination address, and tell them to travel
there. The entropy of the decisions they make while traveling must be, in expectation, at least as
large as the entropy of the destination address. Otherwise there would be more destinations than
routes. The entropy of the destination address is <code>log N</code>, so:</p>
<pre><code>H[travel decisions] >= log N
</code></pre>
<p>The decisions they make can be split into the decision of how long a route to take, plus the
sequence of decisions about what exit portal to take at each step of the route. The entropy of the
route length is simply <code>H[len]</code>. And while traveling the route, they must make a decision of which
exit portal to go through at each hub. So:</p>
<pre><code>H[len] + Avg[len] * Avg[entropy of which portal to take] >= log N
</code></pre>
<p>(Notice that I took the <em>average</em> of the path length and entropy. I think this works out due to
linearity of expectation.)</p>
<p>What’s the entropy of the decision of which portal to take at a hub <code>h</code>? There are <code>deg(h)</code> exit
portals, so it’s at most <code>log deg(h)</code>:</p>
<pre><code>H[len] + Avg[len] * Avg[log deg(h)] >= log N
</code></pre>
<p>But wait! This is assuming that we’re equally likely to be at any hub. But we’re more likely to be
at some hubs than others. So we need to take a <em>weighted average</em>: <code>log deg(h)</code> weighted by
<code>flow(h)</code> (the chance that we’re at <code>h</code>, of all places). This gives:</p>
<pre><code>H[len] + Avg[len] * Sum_h (flow(h) * log deg(h)) >= log N
</code></pre>
<p>Which is the theorem.</p>
<h3 id="what-it-means">What it Means</h3>
<p>What is this saying?</p>
<p>We <em>want</em> to minimize <code>Avg[len]</code> and <code>flow(h)</code> and <code>deg(h)</code>. But they all contribute positively to
the left hand side of the inequality, which must be <em>at least</em> <code>log N</code>, so we can’t minimize them
all at once! Instead, there’s a tradeoff. I would phrase it as a tradeoff between:</p>
<ul>
<li>The average route length</li>
<li>The average degree</li>
<li>How much traffic goes through higher-degree nodes</li>
</ul>
<p>I’ll leave off with a table of entropy stats for various portal networks:</p>
<pre><code>Method           | hubs |H[len]|Avg[len]| flow |Avg[deg]|  E   |E-log N
-----------------+------+------+--------+------+--------+------+-------
ring             | 2^20 |  20  |  2^19  |  1   |   1    |  20  |   0
star             | 2^20 |  0   |   2    |  1   |  2^20  | ~40  |  20
bit flipping     | 2^20 |  3.2 |   10   |  1   |   20   | ~46  |  26
DeBruijn base 2  | 2^20 |  0   |   20   |  1   |   2    |  20  |   0
DeBruijn base 16 | 2^20 |  0   |   5    |  1   |   16   |  20  |   0
Meierfrankenfeld |~2^21 |  2.8 |   11   | 0.55 |   2    | 22.7 |  1.7
Kautz graph      |3*2^19|  1   | 19.67  |  1   |   2    | 20.5 |  0.08
</code></pre>
<p>All these portal networks have a particular form that simplifies the formula: some of their hubs
have degree 1, and thus a <code>log deg(h)</code> of 0, and can be ignored; the rest have a fixed degree <code>d</code>. This
allows the formula to be simplified to:</p>
<pre><code>H[len] + Avg[len] * flow * log d >= log N
</code></pre>
<p>where <code>flow</code> is the fraction of traffic that flows through hubs with degree d.</p>
<p>You can see from this table what kind of tradeoffs different portal networks make:</p>
<ul>
<li>A ring portal network decreases average degree at the expense of increasing average route length.</li>
<li>A DeBruijn portal network maximizes traffic uniformity, putting all traffic through hubs of the
same degree.</li>
<li>A Kautz portal network is similar, but increases the <code>H[len]</code> term to decrease the average route
length a bit.</li>
<li>A Meierfrankenfeld portal network routes more traffic through higher degree nodes, in exchange for
greatly decreasing route length.</li>
<li>Star and bit-flipping are just bad.</li>
</ul>
<p>The last column shows how “tight” <code>E</code> is to what the theorem says is possible (0 means it couldn’t
possibly be any smaller). The fact that many of these different approaches are so close to the
boundary of what the theorem says is possible gives me some confidence that it might be correct.</p>
Sun, 17 Oct 2021 00:00:00 -0400
http://justinpombrio.net//2021/10/17/traffic-engineering-with-portals-part-ii.html
<h1>Traffic Engineering with Portals</h1>
<p>Suppose you invented portals. Like you were able to make a pair of portals, and if you stepped
through one you’d come out the other, wherever it was. What would you do with them?</p>
<p>Well the first thing to do is obviously to solve the world’s energy problems. Anchor one portal
above another, so that if you fall through the lower portal, you’ll come out of the upper portal —
just above where you were. Add a turbine generator and pour in some water and BAM, infinite energy.</p>
<p><a href="https://www.youtube.com/watch?v=HWA3Jjwr1Lo">Two Portals</a> → <a href="https://www.reddit.com/r/woahdude/comments/2ieskm/waterfall_a_lithograph_by_dutch_artist_m_c_escher/">Free Energy</a></p>
<p>But what next?</p>
<p>Say you want to bring the world together. Build “hubs” all over the world that connect to each
other. If you build a million hubs, that’s one for every ~7,000 people. That might be enough.</p>
<p>How can you connect the hubs, so that it’s easy to get from any hub to any other?</p>
<p>If you connect every hub directly to every other hub with a portal, then you can always get from a
starting hub to a destination hub in one hop. But that’s… less than practical. You would need to
make one million squared equals one trillion portals: over a hundred portals <em>per person on Earth</em>.
Where would we even put them all?</p>
<p>One million portals per hub is too many. Let’s go minimal instead. How about connecting the portals
in a big ring? Two portals per hub: one that goes to the next hub in the ring, and one that goes to
the previous hub. That’s a reasonable number of portals to build, but it might take a while to get
from one hub to another. Worst case, you need to traverse 500,000 portals.</p>
<p>To compare the methods so far:</p>
<pre><code>Method | Portals/Hub | Worst #Hops
-------+-------------+------------
Clique | 1,000,000 | 1
Ring | 2 | 500,000
</code></pre>
<p>Maybe there’s a sweet spot in the middle?</p>
<p>Trees are a nice data structure. Let’s try making a tree! Arrange the hubs in a binary tree. This
tree will have depth ~20, and a typical hub will have 3 portals: one to go to its parent, and two
to go to its children (the root lacks a parent, and the leaves lack children). To get from a source
hub to a destination hub, you walk up the tree to the common ancestor, then down the tree to the
destination. Worst case, that’s 38 hops. Not bad!</p>
<p>…except that half of all paths take you through the root hub. That hub is going to get
<em>way</em> too much traffic: about half a million times the traffic of most other hubs. In contrast, the
previous approaches would have a <em>uniform</em> traffic distribution, meaning that if travellers pick
source and destination hubs uniformly at random, then each hub will receive equal traffic.</p>
<pre><code>Method | Portals/Hub | Worst #Hops | Traffic Distribution
-------+-------------+-------------+---------------------
Clique | 1,000,000 | 1 | Uniform
Ring | 2 | 500,000 | Uniform
Tree | 3 | 38 | Skewed
</code></pre>
<p>So trees aren’t great for navigation because too many travellers would need to cross the root. Is
there a way to get a small number of portals, a small number of hops, and a uniform distribution of
traffic across the hubs?</p>
<p>Why yes there is! Give each hub a 20-digit binary id. Then build 20 portals per hub: one to toggle
each digit in its id. For example, if we’re at hub <code>10000000000000000000</code>, then taking the third
portal (out of 20) will bring us to the hub with id <code>10100000000000000000</code>, because that’s the id
you get when you toggle the third bit.</p>
<p>To get from a starting hub to a destination hub, you just need to take portal number N for each bit
position N in which the starting hub’s id differs from the destination hub’s id. Worst case, that’s
20 hops. And the traffic will be uniform: it won’t be biased toward any particular hub.</p>
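<p>This navigation rule is mechanical: XOR the two ids, and take one portal per differing bit. A quick sketch (the <code>route</code> helper is mine, and I number bits from the least significant end):</p>

```rust
// Bit-flipping layout: each hub has a 20-bit id, and portal k toggles bit k.
const BITS: u32 = 20;

/// The portal numbers to take to get from hub `from` to hub `to`.
fn route(from: u32, to: u32) -> Vec<u32> {
    let diff = from ^ to; // the bit positions where the ids differ
    (0..BITS).filter(|k| diff >> k & 1 == 1).collect()
}

fn main() {
    // Worst case: all 20 bits differ, so 20 hops.
    assert_eq!(route(0, (1 << BITS) - 1).len(), 20);

    // Taking the listed portals really does get you there.
    let (from, to) = (0b1010, 0b0110);
    let mut hub = from;
    for portal in route(from, to) {
        hub ^= 1 << portal; // portal k toggles bit k
    }
    assert_eq!(hub, to);
}
```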
<p>(We’ve been aiming for one million hubs, but there are <code>2^20</code> ids, which is a bit larger. No biggie,
though. We can just build a few tens of thousands of additional hubs to fill out every id.)</p>
<p>This is a much better portal layout than the previous ones. It’s actually usable!</p>
<pre><code>Method | Portals/Hub | Worst #Hops | Traffic Distribution
-------------+-------------+-------------+---------------------
Clique | 1,000,000 | 1 | Uniform
Ring | 2 | 500,000 | Uniform
Tree | 3 | 38 | Skewed
Bit Flipping | 20 | 20 | Uniform
</code></pre>
<h3 id="puzzle">Puzzle</h3>
<p>Can you do better? Reduce the portals/hub or #hops, while keeping the traffic uniform, and the
navigation method simple (don’t want travellers to get lost!).</p>
<p>If you want to puzzle it out, stop here. Try to find a nice portal layout.</p>
<hr />
<p>I wasn’t sure how much better you could do.</p>
<p>I was worried you would need a really complex portal layout, and there would be a tradeoff between
comprehensibility and efficiency.</p>
<p>There’s not.</p>
<p>You can get four one-way portals/hub and 20 hops worst case, with a <em>very</em> simple navigation rule.</p>
<h3 id="solution">Solution</h3>
<p>Like the last approach, give each hub a 20-bit id. Give every hub two outbound portals, a “zero”
portal and a “one” portal. The “zero” portal takes you to the hub whose id you can get by (i)
shifting every bit in the id one to the left, and (ii) using 0 as the rightmost bit. Similarly, the
“one” portal shifts the bits one to the left and uses 1 as the rightmost bit.</p>
<p>Let’s look at an example, using 4-digit ids instead of 20-digit ids for simplicity. Say you want to
get from the hub with id 0110 to the hub with id 1000. You would first follow portal “one”, then
portal “zero”, then portal “zero”, then portal “zero”. That is: you just spell out your destination
address! Here’s where that will take you:</p>
<pre><code> 0110
portal_1: 1101
portal_0: 1010
portal_0: 0100
portal_0: 1000
</code></pre>
<p>In general, if you want to get from <code>abcd</code> to <code>WXYZ</code>, you take portals <code>W</code>, <code>X</code>, <code>Y</code>, <code>Z</code>, in that order:</p>
<pre><code> abcd
portal_W: bcdW
portal_X: cdWX
portal_Y: dWXY
portal_Z: WXYZ
</code></pre>
<p>If you get lost and want to go home, it’s easy. You don’t even need to know what hub you’re at. Just
follow portal <code>a</code>, then portal <code>b</code>, then portal <code>c</code>, then portal <code>d</code>, and you’re home.</p>
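<p>The whole navigation scheme fits in a few lines. Here’s a sketch using 4-bit ids, matching the example above (the helper names are mine):</p>

```rust
// Bit-shifting layout with 4-bit ids, as in the example above.
const BITS: u32 = 4;
const MASK: u32 = (1 << BITS) - 1;

/// Portal `b` (0 or 1): shift the id left and use `b` as the rightmost bit.
fn portal(hub: u32, b: u32) -> u32 {
    (hub << 1 | b) & MASK
}

/// To reach `dest`, spell out its bits from most to least significant.
fn travel(mut hub: u32, dest: u32) -> u32 {
    for k in (0..BITS).rev() {
        hub = portal(hub, dest >> k & 1);
    }
    hub
}

fn main() {
    // The worked example: 0110 -> 1000 via portals 1, 0, 0, 0.
    assert_eq!(travel(0b0110, 0b1000), 0b1000);
    // And the "going home" trick works from anywhere: spelling out an
    // id takes you to that hub no matter where you start.
    for start in 0..=MASK {
        assert_eq!(travel(start, 0b1010), 0b1010);
    }
}
```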
<p>So you get to your destination in 20 hops. The whole scheme is beautifully symmetric, so the traffic
will be uniformly spread across the hubs.</p>
<p>How efficient is 4 portals/hub and 20 hops worst case? Could there be a much better solution out
there, waiting to be found?</p>
<p>Nope! This solution is nearly optimal, if you assume that portals are meant to be taken one-way to
prevent traffic jams. Four portals/hub gives you 2 entrance portals and 2 exit portals, which means
that you’re making a binary choice with every hop. Twenty binary choices in a row allow you to
reach <em>at most</em> <code>2^20</code> (just over one million) possible destinations. So we couldn’t do any better!</p>
<p>Well… we could get a <em>little</em> better. This analysis assumed that you’re definitely going to make
20 hops, but you could also stop after fewer, which would open up more possible destinations. If you
take up to 19 hops, but are allowed to stop early, how many possible destinations is that? It’s 1
(if you stop at your starting hub) + 2 (if you take one hop) + 4 (if you take two hops) + … +
<code>2^19</code>, which is <code>2^20-1</code> (let’s ignore the -1).</p>
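<p>That sum is a geometric series; here’s a quick sanity check of the arithmetic:</p>

```rust
fn main() {
    // 1 + 2 + 4 + ... + 2^19: destinations reachable in at most 19 hops.
    let reachable: u32 = (0..20).map(|k| 1u32 << k).sum();
    assert_eq!(reachable, (1 << 20) - 1); // 2^20 - 1
    println!("{}", reachable); // prints 1048575
}
```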
<p>An optimal solution, then, would require a worst case of 19 hops instead of 20. So we’re <em>very
close</em> to optimal, while having a solution that’s easy to describe and has a uniform traffic distribution!</p>
<pre><code>Method | Portals/Hub | Worst #Hops | Traffic Distribution
-------------+-------------+-------------+---------------------
Clique | 1,000,000 | 1 | Uniform
Ring | 2 | 500,000 | Uniform
Tree | 3 | 38 | Skewed
Bit Flipping | 20 | 20 | Uniform
Bit Shifting | 4 | 20 | Uniform
</code></pre>
<p>So that’s why you should hire me if you’re looking for a traffic engineer for portals.</p>
Sat, 15 May 2021 00:00:00 -0400
http://justinpombrio.net//2021/05/15/traffic-engineering-with-portals.html
http://justinpombrio.net//2021/05/15/traffic-engineering-with-portals.htmlAlgebra and Data Types<blockquote>
<p>Addition, multiplication, and exponentiation model data types.</p>
</blockquote>
<hr />
<p>In math class you’ve done algebra, with addition and multiplication and exponentiation and
polynomials like <code>1 + x + x²</code>. And while programming, you’ve worked with <code>enum</code>s and <code>struct</code>s and
functions and lists. You probably thought these things were unrelated.</p>
<p>Surprise! They’re deeply related, and by the end of this post you’ll see how to use algebra to
refactor your data types. The crowning example in this post will be finding an equivalent
representation of red-black trees.</p>
<p><strong>Table of contents:</strong></p>
<ul>
<li><a href="#algebraic-data-types">Algebraic Data Types</a></li>
<li><a href="#refactoring-with-algebra">Refactoring with Algebra</a></li>
<li><a href="#numbers">Numbers</a></li>
<li><a href="#arrays">Arrays</a></li>
<li><a href="#functions">Functions</a></li>
<li><a href="#lists">Lists</a></li>
<li><a href="#binary-trees">Binary Trees</a></li>
<li><a href="#summary">Summary</a></li>
</ul>
<h2 id="algebraic-data-types">Algebraic Data Types</h2>
<p>If you already know why algebraic data types are called “algebraic”, and how they relate to
addition and multiplication, feel free to skip this section and jump to <a href="#refactoring-with-algebra">“Refactoring with
Algebra”</a>. Otherwise, don’t be afraid: algebraic data types are very
simple, and I’ll show examples below. All the code in this post will be in Rust.</p>
<p>To have algebraic data types, you need two things: “product types” and “sum types”.</p>
<h3 id="product-types">Product Types</h3>
<p><code>struct</code>s in Rust are product types:</p>
<div class="highlight"><pre><code class="rust"><span class="c1">// Structs are product types</span>
<span class="k">struct</span> <span class="n">Rectangle</span> <span class="p">{</span>
<span class="n">x</span><span class="o">:</span> <span class="k">i32</span><span class="p">,</span>
<span class="n">y</span><span class="o">:</span> <span class="k">i32</span><span class="p">,</span>
<span class="n">width</span><span class="o">:</span> <span class="k">u32</span><span class="p">,</span>
<span class="n">height</span><span class="o">:</span> <span class="k">u32</span><span class="p">,</span>
<span class="p">}</span>
</code></pre>
</div>
<p><code>Rectangle</code> is called a product type because to calculate the <em>number of possible Rectangles</em>, you
<em>multiply</em> the number of possible values of its fields. There are <code>2^32</code> possible values for <code>x</code>,
for <code>y</code>, for <code>width</code>, and for <code>height</code>, so there are <code>2^32 * 2^32 * 2^32 * 2^32 = 2^128</code> possible
<code>Rectangle</code>s. Let’s call this number, <code>2^128</code>, the <em>cardinality</em> of <code>Rectangle</code>.</p>
<p>The same thing is true for tuples: the number of possible values of a tuple is the product of the
number of possible values of its elements. So tuples are also considered product types:</p>
<div class="highlight"><pre><code class="rust"><span class="c1">// Also a product type</span>
<span class="k">type</span> <span class="n">Pos</span> <span class="o">=</span> <span class="p">(</span><span class="k">i32</span><span class="p">,</span> <span class="k">i32</span><span class="p">);</span>
</code></pre>
</div>
<h3 id="sum-types">Sum Types</h3>
<p>We used a product type for <code>Pos</code> because it contains an x-coordinate <em>and</em> a y-coordinate. On the
other hand, you use a sum type when you have one thing <em>or</em> another thing. So Rust <code>enum</code>s are sum
types.</p>
<p>For example, a vending machine has multiple states, and needs to store different information
depending on which state it is in:</p>
<div class="highlight"><pre><code class="rust"><span class="c1">// This is a sum type</span>
<span class="k">enum</span> <span class="n">VendingMachineState</span> <span class="p">{</span>
<span class="c1">/// Just sitting around</span>
<span class="n">Idle</span><span class="p">,</span>
<span class="c1">/// Someone has put money in.</span>
<span class="c1">/// Store the total number of cents inserted so far.</span>
<span class="n">MoneyInserted</span><span class="p">(</span><span class="k">u32</span><span class="p">),</span>
<span class="c1">/// Someone bought an item, and we're dispensing it.</span>
<span class="c1">/// Store the letter label of the item they bought.</span>
<span class="n">Dispensing</span><span class="p">(</span><span class="n">char</span><span class="p">),</span>
<span class="p">}</span>
</code></pre>
</div>
<p>(This is not meant to be a full featured implementation of a vending machine. For example, it doesn’t
handle “someone put too much money in” or “help I’m out of quarters”.)</p>
<p><code>VendingMachineState</code> is called a sum type because its number of possible values is the <em>sum</em> of the
number of possible values of each of its options. There is <code>1</code> <code>Idle</code> value, <code>2^32</code> possible
“amounts of money” in the <code>MoneyInserted</code> state, and about <code>2^32</code> possible <code>char</code>s in the
<code>Dispensing</code> state (a Rust <code>char</code> takes 4 bytes, though strictly only the 1,112,064 Unicode
scalar values are valid; we’ll approximate with <code>2^32</code>). So altogether, <code>VendingMachineState</code>
has a cardinality of roughly <code>1 + 2^32 + 2^32 = 1 + 2^33</code>.</p>
<p>Notice that this is not a count of the number of <em>plausible</em> values of <code>VendingMachineState</code>. For
example, the <code>char</code> will probably only ever be an ascii letter, and the “amount of money inserted
in cents” should always be much less than <code>u32::MAX</code>. Instead, we are counting the number of
possible values allowed by the type system.</p>
<h4 id="result-and-option">Result and Option</h4>
<p>Let’s look at the cardinality of two common built-in sum types in Rust: <code>Result</code> and <code>Option</code>.</p>
<div class="highlight"><pre><code class="rust"><span class="k">enum</span> <span class="n">Result</span><span class="o"><</span><span class="n">T</span><span class="p">,</span> <span class="n">E</span><span class="o">></span> <span class="p">{</span>
<span class="n">Ok</span><span class="p">(</span><span class="n">T</span><span class="p">),</span>
<span class="n">Err</span><span class="p">(</span><span class="n">E</span><span class="p">),</span>
<span class="p">}</span>
</code></pre>
</div>
<p>A <code>Result<T, E></code> contains either a value of type <code>T</code> or a value of type <code>E</code>. Thus its cardinality
is the sum of the cardinalities of <code>T</code> and <code>E</code>, which I’ll write as <code>T + E</code>.</p>
<div class="highlight"><pre><code class="rust"><span class="k">enum</span> <span class="n">Option</span><span class="o"><</span><span class="n">T</span><span class="o">></span> <span class="p">{</span>
<span class="n">Some</span><span class="p">(</span><span class="n">T</span><span class="p">),</span>
<span class="n">None</span><span class="p">,</span>
<span class="p">}</span>
</code></pre>
</div>
<p>Since an <code>Option<T></code> contains <em>either</em> a <code>T</code> or no data, its cardinality is <code>1 + T</code>: all the
possible values of <code>T</code>, plus the extra <code>None</code> option.</p>
<h3 id="the-baffling-lack-of-sum-types">The Baffling Lack of Sum Types</h3>
<p>To reiterate:</p>
<ul>
<li>You use a product type when you have one thing <em>and</em> another thing.</li>
<li>You use a sum type when you have one thing <em>or</em> another thing.</li>
</ul>
<p><strong><rant></strong></p>
<p>How, then, are you supposed to represent X <em>or</em> Y in a language that lacks sum types?</p>
<p>There are several different workarounds, depending on the language. Here are a couple:</p>
<ul>
<li>Many languages allow values to be <code>null</code> (think objects in Java, or pointers in C). If you have a
type <code>T</code> that could also be <code>null</code>, its cardinality will be <code>1 + T</code> (just like <code>Option</code>). This is
a wimpy version of a sum type that you can use to emulate real sum types. To represent the sum
type <code>A + B</code>, you store a nullable <code>A</code> <em>and</em> a nullable <code>B</code>, and you take care to ensure that at
all times, exactly one of the two is <code>null</code>. The downside is that the language will allow you to
accidentally set 0 or 2 of the values to <code>null</code>, thereby constructing nonsensical data.</li>
<li>Many languages have abstract classes and inheritance (or the like). You can use this to emulate
the sum type <code>A + B</code> by making an abstract class for the sum, plus concrete classes for <code>A</code> and
<code>B</code>. One downside is that this tends to be verbose, and to split what is logically a single
function on <code>A + B</code> into multiple method implementations. It also invokes machinery that’s
<em>vastly</em> over-complicated for the task at hand.</li>
</ul>
<p>As a programming languages person, this drives me bonkers. Do you know how long we’ve had sum types?
Since the <a href="https://en.wikipedia.org/wiki/Algebraic_data_type">70s</a>! They are very easy to implement
and to type check. And they’re both safer and ergonomically nicer than the alternatives.</p>
<p>If you’re ever designing a language, please, I beg you, give it sum types.</p>
<p><strong></rant></strong></p>
<p>That’s my only rant this post, I promise.</p>
<h2 id="refactoring-with-algebra">Refactoring with Algebra</h2>
<p>Now for the magic!</p>
<p>Just as you can refactor <em>code</em> by rewriting it in a way that looks different but does the same
thing, you can refactor <em>data type(s)</em> by arranging their contents in a different way. For example,
we could replace the <code>x</code> and <code>y</code> in our <code>Rectangle</code> type with a <code>Pos</code> tuple:</p>
<div class="highlight"><pre><code class="rust"><span class="k">struct</span> <span class="n">Rectangle</span> <span class="p">{</span>
<span class="n">x</span><span class="o">:</span> <span class="k">i32</span><span class="p">,</span>
<span class="n">y</span><span class="o">:</span> <span class="k">i32</span><span class="p">,</span>
<span class="n">width</span><span class="o">:</span> <span class="k">u32</span><span class="p">,</span>
<span class="n">height</span><span class="o">:</span> <span class="k">u32</span><span class="p">,</span>
<span class="p">}</span>
</code></pre>
</div>
<div class="highlight"><pre><code class="rust"><span class="k">type</span> <span class="n">Pos</span> <span class="o">=</span> <span class="p">(</span><span class="k">i32</span><span class="p">,</span> <span class="k">i32</span><span class="p">);</span>
<span class="k">struct</span> <span class="n">NewRectangle</span> <span class="p">{</span>
<span class="n">pos</span><span class="o">:</span> <span class="n">Pos</span><span class="p">,</span>
<span class="n">width</span><span class="o">:</span> <span class="k">u32</span><span class="p">,</span>
<span class="n">height</span><span class="o">:</span> <span class="k">u32</span><span class="p">,</span>
<span class="p">}</span>
</code></pre>
</div>
<p>This probably isn’t news to you. What you may not know is that you can use <em>algebra</em> to verify that
this refactoring is correct!</p>
<p>The key insight is that for the refactoring to be correct, the information contained in the
<code>NewRectangle</code> type must be the same as the information contained in the old <code>Rectangle</code> type.
Therefore, the total number of possible values must remain the same. We can verify this with
algebra:</p>
<pre><code>Rectangle
= i32 * i32 * u32 * u32
= (i32 * i32) * u32 * u32
= Pos * u32 * u32
= NewRectangle
</code></pre>
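<p>Matching cardinalities means there’s a lossless conversion each way. A sketch (the conversion functions are mine):</p>

```rust
struct Rectangle {
    x: i32,
    y: i32,
    width: u32,
    height: u32,
}

type Pos = (i32, i32);
struct NewRectangle {
    pos: Pos,
    width: u32,
    height: u32,
}

// The refactoring is lossless because these two functions are inverses.
fn to_new(r: Rectangle) -> NewRectangle {
    NewRectangle { pos: (r.x, r.y), width: r.width, height: r.height }
}

fn to_old(r: NewRectangle) -> Rectangle {
    Rectangle { x: r.pos.0, y: r.pos.1, width: r.width, height: r.height }
}

fn main() {
    let r = to_old(to_new(Rectangle { x: -3, y: 7, width: 10, height: 20 }));
    assert_eq!((r.x, r.y, r.width, r.height), (-3, 7, 10, 20));
}
```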
<p>In general, two data types contain the same information if and only if the algebraic expressions for
those data types are equal. There are a couple ways to make use of this:</p>
<ul>
<li>You can verify that a refactoring doesn’t accidentally gain or lose information, like we did above
for <code>Rectangle</code> and <code>NewRectangle</code>.</li>
<li>You can derive refactoring techniques from laws of algebra. Every algebraic law gives a different
refactoring technique!</li>
</ul>
<p>In the rest of this post, we’ll work through the refactorings implied by a couple dozen laws of
algebra, and see many applications of them.</p>
<h3 id="commutativity">Commutativity</h3>
<p>The <em>commutative</em> law for multiplication says <code>A * B = B * A</code>. Remember that <code>A * B</code> <em>as a data
type</em> is a pair <code>(A, B)</code> or alternatively a <code>struct</code> with fields of type <code>A</code> and <code>B</code>. Thus
commutativity suggests that you can refactor a type by re-ordering its elements (thus winning the
contest for most boring refactoring ever):</p>
<div class="highlight"><pre><code class="rust"><span class="p">(</span><span class="n">String</span><span class="p">,</span> <span class="n">DateTime</span><span class="p">)</span> <span class="o">=</span> <span class="p">(</span><span class="n">DateTime</span><span class="p">,</span> <span class="n">String</span><span class="p">)</span>
</code></pre>
</div>
<div class="highlight"><pre><code class="rust"><span class="k">struct</span> <span class="n">Pos</span> <span class="p">{</span>
<span class="n">x</span><span class="o">:</span> <span class="k">u32</span><span class="p">,</span>
<span class="n">y</span><span class="o">:</span> <span class="k">u32</span><span class="p">,</span>
<span class="p">}</span>
<span class="o">=</span>
<span class="k">struct</span> <span class="n">Pos</span> <span class="p">{</span>
<span class="n">y</span><span class="o">:</span> <span class="k">u32</span><span class="p">,</span>
<span class="n">x</span><span class="o">:</span> <span class="k">u32</span><span class="p">,</span>
<span class="p">}</span>
</code></pre>
</div>
<p>There’s also a commutative law for addition: <code>A + B = B + A</code>. Remember that <code>A + B</code> is an enum with
two variants, one containing an <code>A</code> and one containing a <code>B</code>. So this law says, for example:</p>
<div class="highlight"><pre><code class="rust"><span class="n">Result</span><span class="o"><</span><span class="n">T</span><span class="p">,</span> <span class="n">E</span><span class="o">></span> <span class="o">=</span> <span class="n">Result</span><span class="o"><</span><span class="n">E</span><span class="p">,</span> <span class="n">T</span><span class="o">></span>
</code></pre>
</div>
<p>(Note that this isn’t good <em>programming practice</em>: semantically, the <code>Ok</code> and <code>Err</code> variants of a
<code>Result</code> are not symmetric. For example, the <code>?</code> operator treats them differently, and programmers
expect that if there’s an error it goes in the <code>Err</code> variant and not in the <code>Ok</code> variant. However,
the algebra only cares about the information content, and flipping a <code>Result</code> does keep the same
information content.)</p>
<p>Similarly:</p>
<div class="highlight"><pre><code class="rust"><span class="k">enum</span> <span class="n">Color</span> <span class="p">{</span> <span class="n">Red</span><span class="p">,</span> <span class="n">Yellow</span> <span class="p">}</span> <span class="o">=</span> <span class="k">enum</span> <span class="n">Color</span> <span class="p">{</span> <span class="n">Yellow</span><span class="p">,</span> <span class="n">Red</span> <span class="p">}</span>
</code></pre>
</div>
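<p>Each commutativity law is witnessed by an information-preserving conversion function. A sketch (the <code>flip</code> helper is mine, not part of the standard library):</p>

```rust
// Commutativity of addition, A + B = B + A, as a conversion function.
fn flip<T, E>(r: Result<T, E>) -> Result<E, T> {
    match r {
        Ok(t) => Err(t),
        Err(e) => Ok(e),
    }
}

fn main() {
    let r: Result<i32, String> = Ok(5);
    assert_eq!(flip(r), Err(5));
    // Flipping twice gets you back where you started: no information lost.
    assert_eq!(flip(flip(Ok::<i32, String>(5))), Ok(5));
}
```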
<h3 id="associativity">Associativity</h3>
<p>There are also associative laws. One for multiplication:</p>
<pre><code>(A * B) * C = A * (B * C) = A * B * C
</code></pre>
<p>Which gives some equivalences between tuples:</p>
<div class="highlight"><pre><code class="rust"><span class="p">((</span><span class="n">A</span><span class="p">,</span> <span class="n">B</span><span class="p">),</span> <span class="n">C</span><span class="p">)</span> <span class="o">=</span> <span class="p">(</span><span class="n">A</span><span class="p">,</span> <span class="p">(</span><span class="n">B</span><span class="p">,</span> <span class="n">C</span><span class="p">))</span> <span class="o">=</span> <span class="p">(</span><span class="n">A</span><span class="p">,</span> <span class="n">B</span><span class="p">,</span> <span class="n">C</span><span class="p">)</span>
</code></pre>
</div>
<p>And one for addition:</p>
<pre><code>(A + B) + C = A + (B + C) = A + B + C
</code></pre>
<p>Which gives some equivalences between enums:</p>
<div class="highlight"><pre><code class="rust"><span class="k">enum</span> <span class="n">StopLight</span> <span class="p">{</span> <span class="n">Green</span><span class="p">,</span> <span class="n">Yellow</span><span class="p">,</span> <span class="n">Red</span> <span class="p">}</span>
<span class="o">=</span>
<span class="k">enum</span> <span class="n">DontGoColor</span> <span class="p">{</span> <span class="n">Yellow</span><span class="p">,</span> <span class="n">Red</span> <span class="p">}</span>
<span class="k">enum</span> <span class="n">StopLight</span> <span class="p">{</span> <span class="n">Green</span><span class="p">,</span> <span class="n">DontGo</span><span class="p">(</span><span class="n">DontGoColor</span><span class="p">)</span> <span class="p">}</span>
<span class="o">=</span>
<span class="k">enum</span> <span class="n">MaybeGoColor</span> <span class="p">{</span> <span class="n">Green</span><span class="p">,</span> <span class="n">Yellow</span> <span class="p">}</span>
<span class="k">enum</span> <span class="n">StopLight</span> <span class="p">{</span> <span class="n">Red</span><span class="p">,</span> <span class="n">MaybeGo</span><span class="p">(</span><span class="n">MaybeGoColor</span><span class="p">)</span> <span class="p">}</span>
</code></pre>
</div>
<h3 id="distributivity">Distributivity</h3>
<p>So far we’ve only looked at <em>boring</em> algebraic laws that involved addition <em>or</em> multiplication. But
distributivity involves <em>both</em>:</p>
<pre><code>A * (B + C) = (A * B) + (A * C)
</code></pre>
<p>In terms of data types, we get a law involving both tuples/structs and enums:</p>
<div class="highlight"><pre><code class="rust"><span class="p">(</span><span class="n">A</span><span class="p">,</span> <span class="n">Result</span><span class="o"><</span><span class="n">T</span><span class="p">,</span> <span class="n">E</span><span class="o">></span><span class="p">)</span> <span class="o">=</span> <span class="n">Result</span><span class="o"><</span><span class="p">(</span><span class="n">A</span><span class="p">,</span> <span class="n">T</span><span class="p">),</span> <span class="p">(</span><span class="n">A</span><span class="p">,</span> <span class="n">E</span><span class="p">)</span><span class="o">></span>
</code></pre>
</div>
<p>Or a more realistic example:</p>
<div class="highlight"><pre><code class="rust"><span class="c1">// A binary tree with data on both leaves and branches.</span>
<span class="c1">// A leaf has no children, and a branch has two children.</span>
<span class="k">struct</span> <span class="n">BinaryTree</span><span class="o"><</span><span class="n">Data</span><span class="o">></span> <span class="p">{</span>
<span class="n">data</span><span class="o">:</span> <span class="n">Data</span><span class="p">,</span>
<span class="n">children</span><span class="o">:</span> <span class="n">Children</span><span class="p">,</span>
<span class="p">}</span>
<span class="k">enum</span> <span class="n">Children</span> <span class="p">{</span>
<span class="n">Leaf</span><span class="p">,</span>
<span class="n">Branch</span><span class="p">(</span><span class="n">Box</span><span class="o"><</span><span class="n">BinaryTree</span><span class="o"><</span><span class="n">Data</span><span class="o">>></span><span class="p">,</span> <span class="n">Box</span><span class="o"><</span><span class="n">BinaryTree</span><span class="o"><</span><span class="n">Data</span><span class="o">>></span><span class="p">),</span>
<span class="p">}</span>
<span class="o">=</span>
<span class="k">enum</span> <span class="n">BinaryTree</span><span class="o"><</span><span class="n">Data</span><span class="o">></span> <span class="p">{</span>
<span class="n">Leaf</span><span class="p">(</span><span class="n">Data</span><span class="p">),</span>
<span class="n">Branch</span><span class="p">(</span><span class="n">Data</span><span class="p">,</span> <span class="n">Box</span><span class="o"><</span><span class="n">BinaryTree</span><span class="o"><</span><span class="n">Data</span><span class="o">>></span><span class="p">,</span> <span class="n">Box</span><span class="o"><</span><span class="n">BinaryTree</span><span class="o"><</span><span class="n">Data</span><span class="o">>></span><span class="p">),</span>
<span class="p">}</span>
</code></pre>
</div>
<p>This is a common refactoring. It is typically better to use the first form, since it doesn’t
logically duplicate <code>Data</code>.</p>
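<p>Distributivity, too, is witnessed by a conversion between the two representations. A sketch (I’ve renamed the types <code>Tree</code> and <code>FlatTree</code> to keep them apart; the <code>distribute</code> helper is mine):</p>

```rust
// The factored representation: Data * (1 + Tree*Tree).
struct Tree<D> {
    data: D,
    children: Children<D>,
}
enum Children<D> {
    Leaf,
    Branch(Box<Tree<D>>, Box<Tree<D>>),
}

// The distributed representation: Data + Data*Tree*Tree.
enum FlatTree<D> {
    Leaf(D),
    Branch(D, Box<FlatTree<D>>, Box<FlatTree<D>>),
}

// Distributivity in action: push `data` into each variant of `children`.
fn distribute<D>(t: Tree<D>) -> FlatTree<D> {
    match t.children {
        Children::Leaf => FlatTree::Leaf(t.data),
        Children::Branch(l, r) => FlatTree::Branch(
            t.data,
            Box::new(distribute(*l)),
            Box::new(distribute(*r)),
        ),
    }
}

fn main() {
    let leaf = |d| Box::new(Tree { data: d, children: Children::Leaf });
    let t = Tree { data: 1, children: Children::Branch(leaf(2), leaf(3)) };
    match distribute(t) {
        FlatTree::Branch(d, _, _) => assert_eq!(d, 1),
        FlatTree::Leaf(_) => unreachable!(),
    }
}
```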
<h3 id="subtraction-as-a-constraint">Subtraction as a Constraint</h3>
<p>One interesting thing we can do is use subtraction to <em>constrain</em> a type.</p>
<p>A Go function that produces an answer <code>T</code> or an error <code>E</code> will return a pair <code>(Option<T>,
Option<E>)</code>. (I’m writing Go’s <code>nil</code>-able values as Rust-like <code>Option</code>s for convenience.)
In algebra:</p>
<pre><code> (Option<T>, Option<E>)
= (1 + T) * (1 + E)
</code></pre>
<p>However, programmers are expected to maintain the convention that exactly one of these two <code>Option</code>s
should be filled. Thus these two cases are invalid:</p>
<ul>
<li><code>1</code>: Neither an answer nor an error is present</li>
<li><code>T*E</code>: Both an answer and an error are present</li>
</ul>
<p>Let’s start with our full type and subtract out these two invalid cases, to see what’s left:</p>
<pre><code> (1 + T) * (1 + E) - 1 - T*E
= 1 + T + E + T*E - 1 - T*E
= T + E
</code></pre>
<p>This is the cardinality of Rust’s <code>Result<T, E></code>! So Rust’s <code>Result</code> type allows exactly the valid
Go results and no more, using the type system to enforce what is only a convention in Go.</p>
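<p>The subtraction shows up concretely if you try to convert a Go-style pair into a <code>Result</code>: the two subtracted cases are exactly the ones with no sensible translation. A sketch (the function is mine):</p>

```rust
/// Convert a Go-style (Option<T>, Option<E>) pair into a Result.
/// The two "subtracted" cases cannot be translated, so we panic on them.
fn from_go_style<T, E>(pair: (Option<T>, Option<E>)) -> Result<T, E> {
    match pair {
        (Some(t), None) => Ok(t),
        (None, Some(e)) => Err(e),
        (None, None) => panic!("invalid: neither answer nor error (the `1` case)"),
        (Some(_), Some(_)) => panic!("invalid: both answer and error (the `T*E` case)"),
    }
}

fn main() {
    assert_eq!(from_go_style::<i32, String>((Some(5), None)), Ok(5));
    assert_eq!(
        from_go_style::<i32, String>((None, Some("boom".to_string()))),
        Err("boom".to_string())
    );
}
```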
<h2 id="numbers">Numbers</h2>
<p>So far we’ve seen big numbers like <code>i32 = 2^32</code>. But what about <em>small</em> numbers?</p>
<h3 id="counting-down-two">Counting Down: Two</h3>
<p><code>2</code> would be the number for a type that has <em>two</em> possible values. Either one thing, or another
thing, but nothing else. You know a type like this! It’s the friendly boolean:</p>
<div class="highlight"><pre><code class="rust"><span class="k">enum</span> <span class="n">Bool</span> <span class="p">{</span>
<span class="n">False</span><span class="p">,</span>
<span class="n">True</span><span class="p">,</span>
<span class="p">}</span>
</code></pre>
</div>
<p>What can you say in algebra about the number two? One thing is <code>A + A = 2A</code>. This means:</p>
<div class="highlight"><pre><code class="rust"><span class="n">Result</span><span class="o"><</span><span class="n">A</span><span class="p">,</span> <span class="n">A</span><span class="o">></span> <span class="o">=</span> <span class="p">(</span><span class="n">bool</span><span class="p">,</span> <span class="n">A</span><span class="p">)</span>
</code></pre>
</div>
<p>This is relevant to Rust’s standard library <a href="https://doc.rust-lang.org/stable/std/primitive.slice.html#method.binary_search">binary search function</a>:</p>
<div class="highlight"><pre><code class="rust"><span class="k">pub</span> <span class="k">fn</span> <span class="n">binary_search</span><span class="p">(</span><span class="o">&</span><span class="n">self</span><span class="p">,</span> <span class="n">x</span><span class="o">:</span> <span class="o">&</span><span class="n">T</span><span class="p">)</span> <span class="o">-></span> <span class="n">Result</span><span class="o"><</span><span class="n">usize</span><span class="p">,</span> <span class="n">usize</span><span class="o">></span>
</code></pre>
</div>
<p>The docs for this say:</p>
<blockquote>
<p>If the value is found then <code>Result::Ok</code> is returned, containing the index of the matching element.
[…] If the value is not found then <code>Result::Err</code> is returned, containing the index where a
matching element could be inserted while maintaining sorted order.</p>
</blockquote>
<p>Since <code>Result<usize, usize></code> is equivalent to <code>(bool, usize)</code>, this function could also have
returned both an index and a boolean specifying the meaning of that index. Though in this case
I think the <code>Result</code> type is more clear.</p>
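<p>Here’s that equivalence as a pair of conversion functions (the helper names are mine):</p>

```rust
// Result<usize, usize> and (bool, usize) carry the same information:
// the bool records which variant we were in.
fn to_pair(r: Result<usize, usize>) -> (bool, usize) {
    match r {
        Ok(i) => (true, i),
        Err(i) => (false, i),
    }
}

fn to_result((found, i): (bool, usize)) -> Result<usize, usize> {
    if found { Ok(i) } else { Err(i) }
}

fn main() {
    let xs = [1, 3, 5, 7];
    // 5 is found at index 2; 4 is absent and would be inserted at index 2.
    assert_eq!(to_pair(xs.binary_search(&5)), (true, 2));
    assert_eq!(to_pair(xs.binary_search(&4)), (false, 2));
    // The conversions are inverses, so no information is lost.
    assert_eq!(to_result(to_pair(Ok(9))), Ok(9));
}
```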
<h3 id="counting-down-one">Counting Down: One</h3>
<p><code>1</code> would be the cardinality of a type that has <em>one</em> possible value.</p>
<p>In Rust, this is the <em>unit type</em>, written <code>()</code>. It’s like a tuple with 0 elements. It’s the most
boring type. It only has one possible value, which is also written <code>()</code>. Rust stores it in literally
0 bytes:</p>
<div class="highlight"><pre><code class="rust"><span class="n">println</span><span class="o">!</span><span class="p">(</span><span class="s">"{}"</span><span class="p">,</span> <span class="n">std</span><span class="o">::</span><span class="n">mem</span><span class="o">::</span><span class="n">size_of</span><span class="o">::<</span><span class="p">()</span><span class="o">></span><span class="p">());</span>
<span class="c1">// prints 0</span>
</code></pre>
</div>
<h4 id="multiplicative-identity">Multiplicative Identity</h4>
<p>One is the multiplicative identity. This is a fancy way of saying:</p>
<pre><code>1 * A = A
</code></pre>
<p>In terms of types, this means:</p>
<div class="highlight"><pre><code class="rust"><span class="p">((),</span> <span class="n">A</span><span class="p">)</span> <span class="o">=</span> <span class="n">A</span>
</code></pre>
</div>
<p>In other words, there’s no reason to include the <code>()</code>. It doesn’t add any information.</p>
<p>(Why, then, does Rust even have <code>()</code>? One reason is that it’s the return type for functions that
“don’t return anything”. In C or Java this would be written “<code>void</code>”.)</p>
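<p>Rust’s memory layout agrees with the algebra here: pairing a type with <code>()</code> doesn’t change its size. A quick check:</p>

```rust
fn main() {
    // `()` is zero-sized, so the tuple `((), u32)` occupies exactly as
    // much memory as a `u32` on its own: the unit component is free.
    assert_eq!(
        std::mem::size_of::<((), u32)>(),
        std::mem::size_of::<u32>()
    );
    println!("both are {} bytes", std::mem::size_of::<u32>());
}
```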
<h3 id="counting-down-zero">Counting Down: Zero?</h3>
<p>What about zero? Zero is the sum of no things. It’s an enum with no variants:</p>
<div class="highlight"><pre><code class="rust"><span class="k">enum</span> <span class="n">Zero</span> <span class="p">{}</span>
</code></pre>
</div>
<p>Since there are no variants for it, you cannot construct a <code>Zero</code>. It has 0 possible values.</p>
<p>In Rust, this type is written <code>!</code> and pronounced <a href="https://doc.rust-lang.org/std/primitive.never.html">“the never
type”</a>.</p>
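<p>One fun consequence of having zero variants: a <code>match</code> on such a value needs no arms at all, and the compiler accepts it, because the value can never exist. A sketch (the name <code>absurd</code> is the conventional one in type-theory circles, not anything from std):</p>

```rust
enum Zero {}

// Given a `Zero` (which can never happen), we can "produce" a value of
// any type by matching with zero arms: there are no variants to cover.
fn absurd(z: Zero) -> u64 {
    match z {}
}

fn main() {
    // We can never actually call `absurd`, since no `Zero` can be made.
    // The type also takes up no space:
    assert_eq!(std::mem::size_of::<Zero>(), 0);
    println!("Zero is {} bytes", std::mem::size_of::<Zero>());
}
```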
<h4 id="additive-identity">Additive Identity</h4>
<p>In algebra, zero is the additive identity:</p>
<pre><code>A + 0 = A
</code></pre>
<p>In data types, this means:</p>
<div class="highlight"><pre><code class="rust"><span class="n">Result</span><span class="o"><</span><span class="n">A</span><span class="p">,</span> <span class="o">!></span> <span class="o">=</span> <span class="n">A</span>
</code></pre>
</div>
<p>As <a href="https://doc.rust-lang.org/std/primitive.never.html">explained in the Rust docs</a>:</p>
<blockquote>
<p>Since the <code>Err</code> variant contains a <code>!</code>, it can never occur. If the <code>exhaustive_patterns</code> feature
is present this means we can exhaustively match on <code>Result<T, !></code> by just taking the <code>Ok</code> variant.
This illustrates another behaviour of <code>!</code> - it can be used to “delete” certain enum variants from
generic types like <code>Result</code>.</p>
</blockquote>
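<p>On stable Rust, <code>!</code> can’t yet be written freely in type position, but the standard library provides <code>std::convert::Infallible</code>, an empty enum that serves as a stable stand-in. A sketch of “deleting” the <code>Err</code> variant with it:</p>

```rust
use std::convert::Infallible;

// A Result whose Err type is uninhabited carries exactly as much
// information as its Ok type: the Err case can never occur.
fn unwrap_never<T>(r: Result<T, Infallible>) -> T {
    match r {
        Ok(v) => v,
        Err(e) => match e {}, // zero arms: Infallible has no variants
    }
}

fn main() {
    // Parsing a String from a &str cannot fail, so its Err type is
    // Infallible, and we can extract the Ok value with no error handling.
    let r: Result<String, Infallible> = "hello".parse();
    assert_eq!(unwrap_never(r), "hello");
}
```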
<h4 id="multiplicative-absorbative">Multiplicative Absorption</h4>
<p>Zero plays a second role in algebra too. It’s the “absorbing” element of multiplication, also
called the annihilator (aren’t these names great?):</p>
<pre><code>A * 0 = 0
</code></pre>
<p>In data types, this means:</p>
<div class="highlight"><pre><code class="rust"><span class="p">(</span><span class="n">A</span><span class="p">,</span> <span class="o">!</span><span class="p">)</span> <span class="o">=</span> <span class="o">!</span>
</code></pre>
</div>
<p>In other words, <code>!</code> is contagious. There’s no way to construct it, so there’s no way to construct a
tuple or struct that contains it.</p>
<h3 id="many">Many</h3>
<p>We just saw some very small numbers: 2, 1, 0. And we’ve seen intermediate numbers, like <code>char =
2^32</code>. How about very large numbers?</p>
<p>Let’s take <code>String</code>. <code>String</code> has <em>infinitely many</em> possible values. We could say that the
cardinality of <code>String</code> is <code>∞</code>, but <code>∞</code> isn’t really a number, and our algebra would fall apart if
we tried to use it.</p>
<p>Instead we’re going to just… not simplify <code>String</code>. In the algebra, we’ll continue to call it
<code>String</code>, and treat it as an indeterminate, never replacing it with a concrete number.</p>
<p>In fact, it’s useful to do this for more than just <code>String</code>. For example, you may want to avoid
treating <code>char</code> and <code>u32</code> as identical, even if you give them the same cardinality (strictly
speaking, <code>char</code> has only 1,112,064 valid values, since surrogate code points are excluded), since
one is meant to store a unicode character, and the other a number. You can achieve this just by
leaving <code>char</code> as <code>char</code> and not setting it equal to <code>2^32</code>.</p>
<p>Summing up the numbers we’ve seen:</p>
<pre><code>Data Type Algebraic Expression
------------------------------------
! 0
() 1
bool 2
u8 2^8, or just u8
char 2^32, or just char
String String
</code></pre>
<h2 id="arrays">Arrays</h2>
<p>What’s the algebraic expression for an array? Well, the array <code>[A; n]</code> contains <code>n</code> <code>A</code>s. So its
expression is <code>A * A * ... * A = A^n</code>.</p>
<h3 id="exponentiation-laws">Exponentiation Laws</h3>
<p>And now we can get some refactoring rules for arrays from the laws of exponents!</p>
<h4 id="exponentiation-is-repeated-multiplication">Exponentiation is Repeated Multiplication</h4>
<p>An array can be stored in a product type instead:</p>
<pre><code>A^3 = A * A * A
</code></pre>
<div class="highlight"><pre><code class="rust"><span class="c1">// An array of three u8s</span>
<span class="k">type</span> <span class="n">Color</span> <span class="o">=</span> <span class="p">[</span><span class="k">u8</span><span class="p">;</span> <span class="m">3</span><span class="p">];</span>
<span class="c1">// Can also be represented as a struct with 3 fields:</span>
<span class="k">struct</span> <span class="n">Color</span> <span class="p">{</span>
<span class="n">red</span><span class="o">:</span> <span class="k">u8</span><span class="p">,</span>
<span class="n">green</span><span class="o">:</span> <span class="k">u8</span><span class="p">,</span>
<span class="n">blue</span><span class="o">:</span> <span class="k">u8</span><span class="p">,</span>
<span class="p">}</span>
</code></pre>
</div>
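<p>The equivalence is witnessed by trivially invertible conversions. A sketch (I’ve renamed the two types so they can coexist; the <code>From</code> impls are my own, not std’s):</p>

```rust
// The same 24 bits of information, in both shapes.
type ColorArray = [u8; 3];

#[derive(Debug, PartialEq)]
struct ColorStruct {
    red: u8,
    green: u8,
    blue: u8,
}

impl From<ColorArray> for ColorStruct {
    fn from([red, green, blue]: ColorArray) -> Self {
        ColorStruct { red, green, blue }
    }
}

impl From<ColorStruct> for ColorArray {
    fn from(c: ColorStruct) -> Self {
        [c.red, c.green, c.blue]
    }
}

fn main() {
    let array: ColorArray = [255, 128, 0];
    let s = ColorStruct::from(array);
    // Round-tripping loses nothing, as A^3 = A * A * A promises.
    assert_eq!(ColorArray::from(s), [255, 128, 0]);
}
```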
<h4 id="matrix-storage">Matrix Storage</h4>
<pre><code>(A^m)^n = (A^n)^m
</code></pre>
<p>This gives the equivalence between row-major and column-major storage for matrices:</p>
<div class="highlight"><pre><code class="rust"><span class="k">type</span> <span class="n">RowMajorMatrix</span><span class="o"><</span><span class="n">A</span><span class="p">,</span> <span class="k">const</span> <span class="n">M</span><span class="o">:</span> <span class="n">usize</span><span class="p">,</span> <span class="k">const</span> <span class="n">N</span><span class="o">:</span> <span class="n">usize</span><span class="o">></span>
<span class="o">=</span> <span class="p">[[</span><span class="n">A</span><span class="p">;</span> <span class="n">N</span><span class="p">];</span> <span class="n">M</span><span class="p">];</span>
<span class="c1">// access with `matrix[row][col]`</span>
<span class="k">type</span> <span class="n">ColMajorMatrix</span><span class="o"><</span><span class="n">A</span><span class="p">,</span> <span class="k">const</span> <span class="n">M</span><span class="o">:</span> <span class="n">usize</span><span class="p">,</span> <span class="k">const</span> <span class="n">N</span><span class="o">:</span> <span class="n">usize</span><span class="o">></span>
<span class="o">=</span> <span class="p">[[</span><span class="n">A</span><span class="p">;</span> <span class="n">M</span><span class="p">];</span> <span class="n">N</span><span class="p">];</span>
<span class="c1">// access with `matrix[col][row]`</span>
</code></pre>
</div>
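<p>The <code>(A^m)^n = (A^n)^m</code> law is realized by transposing. A sketch using <code>std::array::from_fn</code>, with a concrete element type to keep the bounds simple:</p>

```rust
// Transpose an M-by-N row-major matrix into an N-by-M column-major one.
// Both store the same M*N elements; only the nesting order differs.
fn transpose<const M: usize, const N: usize>(m: [[i32; N]; M]) -> [[i32; M]; N] {
    std::array::from_fn(|col| std::array::from_fn(|row| m[row][col]))
}

fn main() {
    let row_major = [[1, 2, 3], [4, 5, 6]]; // 2 rows, 3 columns
    let col_major = transpose(row_major);
    assert_eq!(col_major, [[1, 4], [2, 5], [3, 6]]);
    // The same element, addressed both ways:
    assert_eq!(row_major[1][2], col_major[2][1]);
}
```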
<h4 id="array-flattening">Array Flattening</h4>
<p>Of course, you can always just jam your matrix into one big array:</p>
<pre><code>(A^m)^n = A^(m*n)
</code></pre>
<div class="highlight"><pre><code class="rust"><span class="k">type</span> <span class="n">FlatMatrix</span><span class="o"><</span><span class="n">A</span><span class="p">,</span> <span class="k">const</span> <span class="n">M</span><span class="o">:</span> <span class="n">usize</span><span class="p">,</span> <span class="k">const</span> <span class="n">N</span><span class="o">:</span> <span class="n">usize</span><span class="o">></span>
<span class="o">=</span> <span class="p">[</span><span class="n">A</span><span class="p">;</span> <span class="n">M</span> <span class="o">*</span> <span class="n">N</span><span class="p">];</span>
<span class="c1">// access with `matrix[N * row + col]`</span>
</code></pre>
</div>
<p>(I probably got some of these indices backwards… you get the idea, though.)</p>
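<p>We can check the flat indexing against the nested one with concrete sizes (stable Rust can’t yet write <code>[A; M * N]</code> with generic <code>M</code> and <code>N</code>, so the sizes are hard-coded):</p>

```rust
fn main() {
    const M: usize = 2; // rows
    const N: usize = 3; // columns
    let nested: [[i32; N]; M] = [[1, 2, 3], [4, 5, 6]];
    // The same matrix, jammed into one array of M * N elements.
    let flat: [i32; 6] = [1, 2, 3, 4, 5, 6];
    for row in 0..M {
        for col in 0..N {
            // `flat[N * row + col]` addresses the same element.
            assert_eq!(flat[N * row + col], nested[row][col]);
        }
    }
    println!("row-major indexing agrees");
}
```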
<h4 id="array-splitting">Array Splitting</h4>
<p>If you have a long array, you can split it into a first part and a second part:</p>
<pre><code>A^(m+n) = A^m * A^n
</code></pre>
<div class="highlight"><pre><code class="rust"><span class="p">[</span><span class="n">A</span><span class="p">;</span> <span class="m">6</span><span class="p">]</span> <span class="o">=</span> <span class="p">([</span><span class="n">A</span><span class="p">;</span> <span class="m">2</span><span class="p">],</span> <span class="p">[</span><span class="n">A</span><span class="p">;</span> <span class="m">4</span><span class="p">])</span>
</code></pre>
</div>
<h4 id="more-array-splitting">More Array Splitting</h4>
<p>An array of pairs is equivalent to a pair of arrays:</p>
<pre><code>(A*B)^n = A^n * B^n
</code></pre>
<div class="highlight"><pre><code class="rust"><span class="p">[(</span><span class="n">A</span><span class="p">,</span> <span class="n">B</span><span class="p">);</span> <span class="n">n</span><span class="p">]</span> <span class="o">=</span> <span class="p">([</span><span class="n">A</span><span class="p">;</span> <span class="n">n</span><span class="p">],</span> <span class="p">[</span><span class="n">B</span><span class="p">;</span> <span class="n">n</span><span class="p">])</span>
</code></pre>
</div>
<p>This is the famous <a href="https://en.wikipedia.org/wiki/AoS_and_SoA">array of structs vs. struct of
arrays</a> equivalence.</p>
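<p>Rust’s iterators witness both directions of this equivalence, via <code>zip</code> and <code>unzip</code>:</p>

```rust
fn main() {
    // An array of structs (well, pairs)...
    let pairs = [(1, 'a'), (2, 'b'), (3, 'c')];
    // ...becomes a struct of arrays (here, a pair of Vecs):
    let (nums, letters): (Vec<i32>, Vec<char>) = pairs.iter().copied().unzip();
    assert_eq!(nums, vec![1, 2, 3]);
    assert_eq!(letters, vec!['a', 'b', 'c']);
    // And zip reassembles the original. Nothing is lost, because
    // (unlike in the zip example later) the two sides have equal length.
    let back: Vec<(i32, char)> = nums.into_iter().zip(letters).collect();
    assert_eq!(back, pairs);
}
```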
<h3 id="exponentiation-with-small-numbers">Exponentiation with Small Numbers</h3>
<p>We can find more rules by asking what happens if either the base or the exponent is a small number.</p>
<h4 id="short-arrays">Short Arrays</h4>
<p>An array of length 1 might as well not be an array:</p>
<pre><code>A^1 = A
</code></pre>
<div class="highlight"><pre><code class="rust"><span class="p">[</span><span class="n">A</span><span class="p">;</span> <span class="m">1</span><span class="p">]</span> <span class="o">=</span> <span class="n">A</span>
</code></pre>
</div>
<h4 id="shorter-arrays">Shorter Arrays</h4>
<p>An array of length 0 contains no information at all:</p>
<pre><code>A^0 = 1
</code></pre>
<div class="highlight"><pre><code class="rust"><span class="p">[</span><span class="n">A</span><span class="p">;</span> <span class="m">0</span><span class="p">]</span> <span class="o">=</span> <span class="p">()</span>
</code></pre>
</div>
<p>The Rust compiler confirms:</p>
<div class="highlight"><pre><code class="rust"><span class="n">println</span><span class="o">!</span><span class="p">(</span><span class="s">"{}"</span><span class="p">,</span> <span class="n">std</span><span class="o">::</span><span class="n">mem</span><span class="o">::</span><span class="n">size_of</span><span class="o">::<</span><span class="p">[</span><span class="n">usize</span><span class="p">;</span> <span class="m">0</span><span class="p">]</span><span class="o">></span><span class="p">());</span>
<span class="c1">// prints 0</span>
</code></pre>
</div>
<h4 id="array-of-units">Array of Units</h4>
<p>An array of units also contains no information:</p>
<pre><code>1^n = 1
</code></pre>
<div class="highlight"><pre><code class="rust"><span class="p">[();</span> <span class="n">n</span><span class="p">]</span> <span class="o">=</span> <span class="p">()</span>
</code></pre>
</div>
<p>Rustc confirms again:</p>
<div class="highlight"><pre><code class="rust"><span class="n">println</span><span class="o">!</span><span class="p">(</span><span class="s">"{}"</span><span class="p">,</span> <span class="n">std</span><span class="o">::</span><span class="n">mem</span><span class="o">::</span><span class="n">size_of</span><span class="o">::<</span><span class="p">[();</span> <span class="m">100</span><span class="p">]</span><span class="o">></span><span class="p">());</span>
<span class="c1">// prints 0</span>
</code></pre>
</div>
<h2 id="functions">Functions</h2>
<p>There’s one last data type we will model: functions!</p>
<p>Functions are totally data. You can stick one in a variable:</p>
<div class="highlight"><pre><code class="rust"><span class="k">fn</span> <span class="n">main</span><span class="p">()</span> <span class="p">{</span>
<span class="k">fn</span> <span class="n">func</span><span class="p">(</span><span class="n">n</span><span class="o">:</span> <span class="k">u8</span><span class="p">)</span> <span class="o">-></span> <span class="n">bool</span> <span class="p">{</span> <span class="n">n</span> <span class="o">></span> <span class="m">0</span> <span class="p">}</span>
<span class="k">let</span> <span class="n">var</span><span class="o">:</span> <span class="k">fn</span><span class="p">(</span><span class="k">u8</span><span class="p">)</span> <span class="o">-></span> <span class="n">bool</span> <span class="o">=</span> <span class="n">func</span><span class="p">;</span>
<span class="p">}</span>
</code></pre>
</div>
<p>(There is a variety of function types in Rust, due to its ownership model. Let’s completely ignore
this fact.)</p>
<p>The whole trick of this blog post is to count the number of possible values of a data type. So how
many distinct functions from <code>u8</code> to <code>bool</code> are there?</p>
<p>There are infinitely many. For example:</p>
<ul>
<li>The function that returns <code>true</code> if its input is zero, and <code>false</code> otherwise.</li>
<li>The function that ignores its input, prints “Hi Mom!”, then returns <code>true</code>.</li>
<li>The function that mines bitcoins for 5 hours, stores them in a wallet, then returns <code>true</code>.</li>
<li>The function that prints “Haha!”, then runs forever.</li>
</ul>
<p>This is, uh, not really what we were looking for. The <code>u8</code>-to-<code>bool</code>-ness is getting buried
beneath the side effects.</p>
<p>The better question to ask is, how many <em>pure</em>, <em>terminating</em> functions from <code>u8</code> to <code>bool</code> are
there?</p>
<p>This is easier to answer: to specify such a function, you have to say what boolean it returns on
input <code>0</code> <em>and</em> what boolean it returns on input <code>1</code> … <em>and</em> what boolean it returns on input
<code>255</code>. Altogether, that’s <code>2*2*...*2 = 2^256</code> possible functions.</p>
<p>In general, the cardinality of a function from <code>A</code> to <code>B</code> is <code>B^A</code>.</p>
<h3 id="exponentiation-now-for-functions">Exponentiation: now for Functions</h3>
<p>Let’s go through some of the algebraic laws for exponentiation again, this time applying them to
functions rather than arrays.</p>
<h4 id="identity">Identity</h4>
<p>First off:</p>
<pre><code>A^n = A^n
</code></pre>
<p>Well no duh. What am I getting at? While the two <code>A^n</code>s look the same, I mean the first one to be
interpreted as the cardinality of <code>fn(n) -> A</code>, and the second one as the cardinality of <code>[A; n]</code>.
This is saying that a function whose input has <code>n</code> possible values and whose output is <code>A</code>, can be
represented as an array of <code>n</code> <code>A</code>s.</p>
<p>This was the answer to a Google interview question! The
<a href="https://www.geeksforgeeks.org/count-set-bits-in-an-integer/">question</a> was to find the fastest way
to count the number of set bits in a byte (i.e., the number of 1s in its binary representation). The
answer was to realize that:</p>
<pre><code> fn count_ones(byte: u8) -> u8
= u8^u8
= u8^256
= [u8; 256]
</code></pre>
<p>In other words: if you <em>really</em> care about the speed of a function, you can store all 256 of its
possible return values in a lookup table. Memory is cheap these days!</p>
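<p>The table is easy to build by evaluating the function once at each of its 256 possible inputs (here using the built-in <code>count_ones</code> as the reference implementation):</p>

```rust
fn main() {
    // One entry per possible input: fn(u8) -> u8 becomes [u8; 256].
    let table: [u8; 256] = std::array::from_fn(|i| (i as u8).count_ones() as u8);

    // Counting set bits is now a single indexing operation.
    assert_eq!(table[0b0000_0000], 0);
    assert_eq!(table[0b1011_0001], 4);
    assert_eq!(table[0b1111_1111], 8);
    println!("table agrees with count_ones");
}
```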
<h4 id="currying">Currying</h4>
<p><a href="https://en.wikipedia.org/wiki/Currying">Currying</a> is the idea that a function that takes two
arguments can also be expressed as a function that takes the first argument and returns a function
that takes the second argument:</p>
<pre><code>A^(m*n) = (A^m)^n
</code></pre>
<div class="highlight"><pre><code class="rust"><span class="c1">// A function from `A` and `B` to `C`:</span>
<span class="k">fn</span><span class="p">(</span><span class="n">A</span><span class="p">,</span> <span class="n">B</span><span class="p">)</span> <span class="o">-></span> <span class="n">C</span>
<span class="c1">// Can also be expressed as a function from `A` to a function from `B` to `C`:</span>
<span class="k">fn</span><span class="p">(</span><span class="n">A</span><span class="p">)</span> <span class="o">-></span> <span class="p">(</span><span class="k">fn</span><span class="p">(</span><span class="n">B</span><span class="p">)</span> <span class="o">-></span> <span class="n">C</span><span class="p">)</span>
</code></pre>
</div>
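<p>In Rust, currying can be sketched with a function that returns a closure:</p>

```rust
// Uncurried: one function from (A, B) to C.
fn add(a: i32, b: i32) -> i32 {
    a + b
}

// Curried: a function from A to (a function from B to C).
fn add_curried(a: i32) -> impl Fn(i32) -> i32 {
    move |b| add(a, b)
}

fn main() {
    let add2 = add_curried(2);
    // Both forms compute the same thing:
    assert_eq!(add(2, 3), 5);
    assert_eq!(add2(3), 5);
    // Partial application is the payoff of the curried form:
    assert_eq!(add2(40), 42);
}
```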
<h4 id="function-splitting">Function Splitting</h4>
<p>If you have a function that takes an <code>enum</code> (like <code>Result</code>), you can split it into functions that
each handle one of the variants of the <code>enum</code> (like one for the <code>Ok</code> and one for the <code>Err</code>):</p>
<pre><code>C^(A+B) = C^A * C^B
</code></pre>
<div class="highlight"><pre><code class="rust"><span class="k">fn</span><span class="p">(</span><span class="n">Result</span><span class="o"><</span><span class="n">A</span><span class="p">,</span> <span class="n">B</span><span class="o">></span><span class="p">)</span> <span class="o">-></span> <span class="n">C</span>
<span class="o">=</span>
<span class="p">(</span><span class="k">fn</span><span class="p">(</span><span class="n">A</span><span class="p">)</span> <span class="o">-></span> <span class="n">C</span><span class="p">,</span> <span class="k">fn</span><span class="p">(</span><span class="n">B</span><span class="p">)</span> <span class="o">-></span> <span class="n">C</span><span class="p">)</span>
</code></pre>
</div>
<p>(This is analogous to the way you can emulate sum types in OOP: instead of having a single function
that acts on the sum <code>A + B</code>, you have two methods—one for handling <code>A</code>, and one for handling
<code>B</code>—that live in different classes.)</p>
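<p>A sketch of the splitting, with one function per variant and a dispatcher that shows nothing is lost:</p>

```rust
// One function handling the whole sum...
fn describe(r: Result<i32, bool>) -> String {
    match r {
        Ok(n) => describe_ok(n),
        Err(b) => describe_err(b),
    }
}

// ...is the same information as a pair of functions, one per variant.
fn describe_ok(n: i32) -> String {
    format!("got the number {}", n)
}

fn describe_err(b: bool) -> String {
    format!("got the boolean {}", b)
}

fn main() {
    assert_eq!(describe(Ok(7)), "got the number 7");
    assert_eq!(describe(Err(true)), "got the boolean true");
}
```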
<h4 id="more-function-splitting">More Function Splitting</h4>
<p>A function that returns a pair is equivalent to a pair of functions:</p>
<pre><code>(B*C)^A = B^A * C^A
</code></pre>
<div class="highlight"><pre><code class="rust"><span class="k">fn</span><span class="p">(</span><span class="n">A</span><span class="p">)</span> <span class="o">-></span> <span class="p">(</span><span class="n">B</span><span class="p">,</span> <span class="n">C</span><span class="p">)</span>
<span class="o">=</span>
<span class="p">(</span><span class="k">fn</span><span class="p">(</span><span class="n">A</span><span class="p">)</span> <span class="o">-></span> <span class="n">B</span><span class="p">,</span> <span class="k">fn</span><span class="p">(</span><span class="n">A</span><span class="p">)</span> <span class="o">-></span> <span class="n">C</span><span class="p">)</span>
</code></pre>
</div>
<h3 id="exponentiation-with-small-numbers-1">Exponentiation with Small Numbers</h3>
<p>Again, we can find more rules by asking what happens if either the base or the exponent is a small
number.</p>
<h4 id="short-functions">Short Functions</h4>
<p>A function that just takes the unit value (or equivalently no arguments) might as well not be a
function:</p>
<pre><code>A^1 = A
</code></pre>
<div class="highlight"><pre><code class="rust"><span class="k">fn</span><span class="p">(())</span> <span class="o">-></span> <span class="n">A</span> <span class="o">=</span> <span class="n">A</span>
</code></pre>
</div>
<p>Remember that exponentiation models <em>pure</em>, <em>terminating</em> functions. So this law is saying that a
function that takes a unit argument can only be useful (beyond its single return value) if it has a
side effect.</p>
<h4 id="shorter-functions">Shorter Functions</h4>
<p>A function that takes in a “never” argument is equivalent to unit:</p>
<pre><code>A^0 = 1
</code></pre>
<div class="highlight"><pre><code class="rust"><span class="k">fn</span><span class="p">(</span><span class="o">!</span><span class="p">)</span> <span class="o">-></span> <span class="n">A</span> <span class="o">=</span> <span class="p">()</span>
</code></pre>
</div>
<p>[UPDATE] I wasn’t really sure about this, but a friend clarified it. We’re classifying functions
<em>based on their behavior</em>. How many functions are there from <code>!</code> to <code>A</code>, <em>counting functions as
equal if they have the same behavior</em>? Why, there’s just one. They’re <em>all</em> the same, because you
<em>can’t call</em> such a function.</p>
<h4 id="void-functions">Void Functions</h4>
<p>A function that doesn’t return anything is sometimes called a void function (because it returns
“<code>void</code>” in C). Void functions can’t do anything unless they have a side effect:</p>
<pre><code>1^A = 1
</code></pre>
<div class="highlight"><pre><code class="rust"><span class="k">fn</span><span class="p">(</span><span class="n">A</span><span class="p">)</span> <span class="o">-></span> <span class="p">()</span> <span class="o">=</span> <span class="p">()</span>
</code></pre>
</div>
<h2 id="lists">Lists</h2>
<p>We’ve looked at arrays, but how about lists? Unfortunately, there’s a lot of confusing terminology
around lists and their representations in various languages, so let me be clear. By “list”, I mean a
finite but arbitrarily long (and resizable) sequence of elements of the same type. Rust’s most
commonly used list type is called <code>Vec<A></code>. It’s backed by an array that’s re-allocated as needed,
but that’s an implementation detail.</p>
<p>How many possible values does a <code>Vec<A></code> have? This is easiest to answer if you think of a <code>Vec<A></code>
like a linked list (it’s the same information content either way). It’s either empty, <em>or</em> it has an
element and another list:</p>
<pre><code>Vec<A> = 1 + A*Vec<A>
</code></pre>
<p>Let’s substitute in the right hand side of the equation… repeatedly:</p>
<pre><code>Vec<A> = 1 + A*Vec<A>
= 1 + A*(1 + A*Vec<A>)
= 1 + A + A^2*Vec<A>
= 1 + A + A^2*(1 + A*Vec<A>)
= 1 + A + A^2 + A^3*Vec<A>
= ...
= 1 + A + A^2 + A^3 + ...
</code></pre>
<p>This is saying that a list of <code>A</code>s can have 0 <code>A</code>s (because it’s empty), or 1 <code>A</code> (because it has
length 1), or 2 <code>A</code>s (because it has length 2), etc.</p>
<p>It’s tedious to work with infinite polynomials like this, though, so let’s try to find a closed-form
solution:</p>
<pre><code>Vec<A> = 1 + A*Vec<A>
Vec<A> - A*Vec<A> = 1
(1 - A)*Vec<A> = 1
Vec<A> = 1/(1 - A)
</code></pre>
<p>There we go! A list of <code>A</code>s has <code>1/(1-A)</code> possible values.</p>
<p><em>“But Justin,” you say, “what does division mean? You haven’t shown any data type that it models.”</em></p>
<p><em>“And Justin,” you continue, “what if <code>A</code> is <code>bool</code>? Then <code>1/(1-A) = 1/(1-2) = -1</code>. What are
negative numbers supposed to be?”</em></p>
<p>Shhh child, do not fear. One need not <em>understand</em> the formula for it to be true.</p>
<p><em>“You don’t know either, do you?”</em></p>
<p>Fear leads to anger. Anger leads to hate. Hate leads to algebraically manipulating infinite
polynomials because you weren’t willing to simplify them.</p>
<h4 id="short-lists">Short Lists</h4>
<p>And the <code>1/(1-A)</code> list formula <em>does</em> work.</p>
<p>Say we have a short list, with fewer than 3 elements. Equivalently, this is a list subject to
the constraint that it does <em>not</em> have 3 or more elements. We can represent all lists with 3 or
more elements as <code>A^3 * Vec<A></code>: you store the first three elements, plus a list for the rest.
Therefore our short list has cardinality:</p>
<pre><code>Vec<A> - A^3 * Vec<A>
</code></pre>
<p>I.e., it’s a list (<code>Vec<A></code>) subject to the constraint (<code>-</code>) that it does not have 3 or more
elements (<code>A^3 * Vec<A></code>).</p>
<p>Simplifying with algebra:</p>
<pre><code> Vec<A> - A^3 * Vec<A>
= 1/(1 - A) - A^3/(1 - A)
= (1 - A^3) / (1 - A)
= (1 - A) * (1 + A + A^2) / (1 - A)
= 1 + A + A^2
</code></pre>
<p>In other words, it has length 0, 1, or 2. Not a startling insight, but it shows that the list
formula works.</p>
<h4 id="zip">Zip</h4>
<p>Here’s a more interesting example. The
<a href="https://doc.rust-lang.org/std/iter/trait.Iterator.html#method.zip">zip</a> function takes two
iterators and produces an iterator of pairs, by consuming from both iterators at once until one of
them stops producing values. If we assume the iterators are finite (the methods in this blog post
don’t work on infinite data structures), then they’re equivalent to lists, so we can use the
<code>1/(1-A)</code> list formula for them.</p>
<p>Since the <code>zip</code> function might not fully consume its input iterators, it pretty clearly throws some
information away. Let’s figure out what is lost, exactly. <code>zip(Vec<A>, Vec<B>)</code> returns <code>Vec<A*B></code>,
but if it were lossless, it would have to return some additional information. Let’s call this extra
information <code>X</code>. Then <code>Vec<A>*Vec<B> = X*Vec<A*B></code>. Solving for <code>X</code> and simplifying:</p>
<pre><code>X = Vec<A>*Vec<B> / Vec<A*B>
= [1/(1-A) * 1/(1-B)] / [1/(1-A*B)]
= [1/(1-A)(1-B)] / [1/(1-A*B)]
= (1-A*B) / (1-A)(1-B)
= 1 + A/(1-A) + B/(1-B) // this step's tricky
= 1 + A*Vec<A> + B*Vec<B>
</code></pre>
<p>So if you wanted <code>zip</code> to be a lossless function, then not only would it have to produce
<code>Vec<A*B></code>, it would also have to preserve the unused elements of its inputs:</p>
<ul>
<li><code>1</code> in case its inputs are the same length so all elements have been paired up;</li>
<li><code>A*Vec<A></code> (i.e., a non-empty list of <code>A</code>s) in case its first input has unused elements; or</li>
<li><code>B*Vec<B></code> (i.e., a non-empty list of <code>B</code>s) in case its second input has unused elements.</li>
</ul>
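<p>We can make that bookkeeping concrete: a lossless zip that returns the pairs plus exactly the leftover described by <code>1 + A*Vec&lt;A&gt; + B*Vec&lt;B&gt;</code>. (The names <code>Leftover</code> and <code>zip_lossless</code> are mine; this is a sketch, not a std API.)</p>

```rust
// The extra information that ordinary `zip` throws away: either
// nothing, or a non-empty run of unused As, or a non-empty run of Bs.
#[derive(Debug, PartialEq)]
enum Leftover<A, B> {
    Nothing,            // 1
    ExtraAs(A, Vec<A>), // A * Vec<A>: a non-empty list of As
    ExtraBs(B, Vec<B>), // B * Vec<B>: a non-empty list of Bs
}

fn zip_lossless<A, B>(mut xs: Vec<A>, mut ys: Vec<B>) -> (Vec<(A, B)>, Leftover<A, B>) {
    let n = xs.len().min(ys.len());
    let extra_xs = xs.split_off(n); // unused tail of xs (empty unless xs is longer)
    let extra_ys = ys.split_off(n); // unused tail of ys (empty unless ys is longer)
    let pairs = xs.into_iter().zip(ys).collect();
    // At most one of the two tails can be non-empty.
    let leftover = if let Some((first, rest)) = split_first_owned(extra_xs) {
        Leftover::ExtraAs(first, rest)
    } else if let Some((first, rest)) = split_first_owned(extra_ys) {
        Leftover::ExtraBs(first, rest)
    } else {
        Leftover::Nothing
    };
    (pairs, leftover)
}

// Helper: split a vec into its first element and the rest, if non-empty.
fn split_first_owned<T>(mut v: Vec<T>) -> Option<(T, Vec<T>)> {
    if v.is_empty() {
        None
    } else {
        let first = v.remove(0);
        Some((first, v))
    }
}

fn main() {
    let (pairs, leftover) = zip_lossless(vec![1, 2, 3], vec!['a']);
    assert_eq!(pairs, vec![(1, 'a')]);
    assert_eq!(leftover, Leftover::ExtraAs(2, vec![3]));
}
```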
<h4 id="alternating-lists">Alternating Lists</h4>
<p>Here’s another example of refactoring with algebra.</p>
<p>Say we have a list of alternating elements <code>A</code> and <code>B</code>, that could start with either <code>A</code> or <code>B</code>. We
could represent that with this Rust type:</p>
<div class="highlight"><pre><code class="rust"><span class="k">enum</span> <span class="n">Alternating</span> <span class="p">{</span>
<span class="n">StartWithA</span><span class="p">(</span><span class="n">ListA</span><span class="p">),</span>
<span class="n">StartWithB</span><span class="p">(</span><span class="n">ListB</span><span class="p">),</span>
<span class="p">}</span>
<span class="k">enum</span> <span class="n">ListA</span> <span class="p">{</span>
<span class="n">Empty</span><span class="p">,</span>
<span class="n">Cons</span><span class="p">(</span><span class="n">A</span><span class="p">,</span> <span class="n">Box</span><span class="o"><</span><span class="n">ListB</span><span class="o">></span><span class="p">),</span>
<span class="p">}</span>
<span class="k">enum</span> <span class="n">ListB</span> <span class="p">{</span>
<span class="n">Empty</span><span class="p">,</span>
<span class="n">Cons</span><span class="p">(</span><span class="n">B</span><span class="p">,</span> <span class="n">Box</span><span class="o"><</span><span class="n">ListA</span><span class="o">></span><span class="p">),</span>
<span class="p">}</span>
</code></pre>
</div>
<p>There’s one subtlety here. I missed it the first time I wrote this section, but the algebra made it
visible.</p>
<p>These definitions give <em>two</em> ways to express an empty list: either an empty <code>StartWithA</code> or an
empty <code>StartWithB</code>. To avoid having two ways to represent the same information, we should pick one
representation, and declare the other one illegal. So let’s say that an empty list should be
represented by <code>Alternating::StartWithA(ListA::Empty)</code>, and that
<code>Alternating::StartWithB(ListB::Empty)</code> is illegal.</p>
<p>Then the cardinalities are:</p>
<pre><code>Alternating = ListA + ListB - 1 // minus 1 for the illegal state
ListA = 1 + A*ListB
ListB = 1 + B*ListA
</code></pre>
<p>Now to do some algebra, to find another representation! First, substitute <code>ListB</code> into the equation
for <code>ListA</code>, so that we can solve for <code>ListA</code>:</p>
<pre><code>ListA = 1 + A*ListB
ListA = 1 + A*(1 + B*ListA)
ListA = 1 + A + A*B*ListA
ListA(1 - A*B) = 1 + A
ListA = (1 + A)/(1 - A*B)
</code></pre>
<p>Then use that to solve for <code>Alternating</code>:</p>
<pre><code>Alternating
= ListA + ListB - 1
= ListA + 1 + B*ListA - 1
= ListA + B*ListA
= (1 + B) * ListA
= (1 + B) * (1 + A) / (1 - A*B)
= (1 + B) * (1 + A) * 1/(1 - A*B)
</code></pre>
<p>Wow! What is this type? Remember that <code>1 + X</code> means <code>Option<X></code> and <code>1/(1 - X)</code> means <code>Vec<X></code>.
Putting this together, and helpfully describing the fields, gives:</p>
<div class="highlight"><pre><code class="rust"><span class="k">struct</span> <span class="n">Alternating</span><span class="o"><</span><span class="n">A</span><span class="p">,</span> <span class="n">B</span><span class="o">></span> <span class="p">{</span>
<span class="c1">/// If the sequence starts with a B, it's stored here</span>
<span class="n">initial_b</span><span class="o">:</span> <span class="n">Option</span><span class="o"><</span><span class="n">B</span><span class="o">></span><span class="p">,</span>
<span class="c1">/// If the sequence ends with an A, it's stored here</span>
<span class="n">final_a</span><span class="o">:</span> <span class="n">Option</span><span class="o"><</span><span class="n">A</span><span class="o">></span><span class="p">,</span>
<span class="c1">/// The rest of the elements, in order:</span>
<span class="n">middle_elements</span><span class="o">:</span> <span class="n">Vec</span><span class="o"><</span><span class="p">(</span><span class="n">A</span><span class="p">,</span> <span class="n">B</span><span class="p">)</span><span class="o">></span><span class="p">,</span>
<span class="p">}</span>
</code></pre>
</div>
<p>That’s a nicer representation, isn’t it? Three types became one, the linked-lists became a <code>Vec</code>,
and the illegal state vanished.</p>
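<p>To see that the new representation really does encode an alternating sequence, here’s a sketch of reading it back out in order: the optional leading <code>B</code>, then each <code>(A, B)</code> pair, then the optional trailing <code>A</code> (the <code>flatten</code> helper and its tag strings are mine, just for display):</p>

```rust
// The flat representation of an alternating sequence, as derived above.
struct Alternating<A, B> {
    initial_b: Option<B>,
    final_a: Option<A>,
    middle_elements: Vec<(A, B)>,
}

// Read the sequence back out in order. The result always alternates
// A and B, by construction.
fn flatten(alt: Alternating<i32, char>) -> Vec<String> {
    let mut out = Vec::new();
    if let Some(b) = alt.initial_b {
        out.push(format!("B:{}", b));
    }
    for (a, b) in alt.middle_elements {
        out.push(format!("A:{}", a));
        out.push(format!("B:{}", b));
    }
    if let Some(a) = alt.final_a {
        out.push(format!("A:{}", a));
    }
    out
}

fn main() {
    // The sequence B, A, B, A: a leading B, one (A, B) pair, a trailing A.
    let alt = Alternating {
        initial_b: Some('x'),
        final_a: Some(2),
        middle_elements: vec![(1, 'y')],
    };
    assert_eq!(flatten(alt), vec!["B:x", "A:1", "B:y", "A:2"]);
}
```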
<h2 id="binary-trees">Binary Trees</h2>
<p>One last magic trick. A <a href="https://en.wikipedia.org/wiki/Red%E2%80%93black_tree">Red-black
tree</a> is a kind
of balanced binary tree. To stay balanced, it considers every node to be either red or black, and
maintains these invariants:</p>
<ul>
<li>The root is black.</li>
<li>All leaves are black.</li>
<li>A red node has two black nodes as children.</li>
<li>Every path from the root to a leaf crosses the same number of black nodes.</li>
</ul>
<p>Say that the leaves have type <code>X</code>, and the non-leaf nodes have type <code>Y</code>. Then we can capture the
first three invariants in Rust types:</p>
<div class="highlight"><pre><code class="rust"><span class="k">enum</span> <span class="n">Node</span><span class="o"><</span><span class="n">X</span><span class="p">,</span> <span class="n">Y</span><span class="o">></span> <span class="p">{</span>
<span class="n">Red</span><span class="p">(</span><span class="n">RedNode</span><span class="o"><</span><span class="n">X</span><span class="p">,</span> <span class="n">Y</span><span class="o">></span><span class="p">),</span>
<span class="n">Black</span><span class="p">(</span><span class="n">BlackNode</span><span class="o"><</span><span class="n">X</span><span class="p">,</span> <span class="n">Y</span><span class="o">></span><span class="p">),</span>
<span class="p">}</span>
<span class="k">enum</span> <span class="n">BlackNode</span><span class="o"><</span><span class="n">X</span><span class="p">,</span> <span class="n">Y</span><span class="o">></span> <span class="p">{</span>
<span class="n">Leaf</span> <span class="p">{</span>
<span class="n">leaf_data</span><span class="o">:</span> <span class="n">X</span><span class="p">,</span>
<span class="p">},</span>
<span class="n">Branch</span> <span class="p">{</span>
<span class="n">branch_data</span><span class="o">:</span> <span class="n">Y</span><span class="p">,</span>
<span class="n">child_0</span><span class="o">:</span> <span class="n">Box</span><span class="o"><</span><span class="n">Node</span><span class="o"><</span><span class="n">X</span><span class="p">,</span> <span class="n">Y</span><span class="o">>></span><span class="p">,</span>
<span class="n">child_1</span><span class="o">:</span> <span class="n">Box</span><span class="o"><</span><span class="n">Node</span><span class="o"><</span><span class="n">X</span><span class="p">,</span> <span class="n">Y</span><span class="o">>></span><span class="p">,</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="k">struct</span> <span class="n">RedNode</span><span class="o"><</span><span class="n">X</span><span class="p">,</span> <span class="n">Y</span><span class="o">></span> <span class="p">{</span>
<span class="n">branch_data</span><span class="o">:</span> <span class="n">Y</span><span class="p">,</span>
<span class="n">child_0</span><span class="o">:</span> <span class="n">Box</span><span class="o"><</span><span class="n">BlackNode</span><span class="o"><</span><span class="n">X</span><span class="p">,</span> <span class="n">Y</span><span class="o">>></span><span class="p">,</span>
<span class="n">child_1</span><span class="o">:</span> <span class="n">Box</span><span class="o"><</span><span class="n">BlackNode</span><span class="o"><</span><span class="n">X</span><span class="p">,</span> <span class="n">Y</span><span class="o">>></span><span class="p">,</span>
<span class="p">}</span>
</code></pre>
</div>
<p>We can use algebra to find an alternate form for red-black trees.</p>
<p>First, translate the Rust types to algebraic equations:</p>
<pre><code>// N: Node
// R: RedNode
// B: BlackNode
N = R + B
B = x + y*N*N
= x + y*(R + B)*(R + B)
R = y*B*B
</code></pre>
<p>But remember that this didn’t capture the last invariant, that “every path from the root to a leaf
crosses the same number of black nodes”. In order to capture it, we’ll need to index the types <code>R</code> and
<code>B</code> by the number of black nodes crossed. So I’ll write <code>Bn</code> for “the type of a black node, such
that every path from it to its leaves crosses <code>n</code> black nodes (including itself)”. Using this, we
can get the full equations for a red-black tree:</p>
<pre><code>B1 = x // n=1 -> must be a leaf
B{n+1} = y*(Bn + Rn)*(Bn + Rn)
Rn = y*Bn*Bn
</code></pre>
<p>Now that the equations are complete, we can simplify them. With a little algebra we can substitute
away <code>Rn</code>, so that there’s only one remaining type:</p>
<pre><code>B1 = x // this stays the same
B{n+1} = y * (Bn + Rn) * (Bn + Rn)
= y * (Bn + y*Bn*Bn) * (Bn + y*Bn*Bn)
= y * Bn * (1 + y*Bn) * Bn * (1 + y*Bn)
= y * Bn * Bn * (1 + y*Bn) * (1 + y*Bn)
</code></pre>
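<p>If you don’t trust the factoring, it’s an ordinary polynomial identity, so it’s easy to spot-check numerically (a quick sketch; the function names are my own):</p>

```python
# Spot-check the factoring step from the derivation above:
#   y * (B + y*B*B)**2  ==  y * B*B * (1 + y*B)**2
# A polynomial identity that holds at enough sample points holds everywhere.
def before(y, b):
    return y * (b + y * b * b) ** 2

def after(y, b):
    return y * b * b * (1 + y * b) ** 2

assert all(before(y, b) == after(y, b)
           for y in range(1, 10) for b in range(1, 10))
print("factoring verified on sample points")
```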
<p>We can translate this back into Rust. There’s no good way to keep the <code>n</code> parameter, so we’ll erase
it, but remember that, per the last equation, the <code>n</code> value of a child is always 1 less than its
parent’s. Thus every leaf of the tree sits at the same depth, given by <code>n</code>. So erasing <code>n</code> we have:</p>
<div class="highlight"><pre><code class="rust"><span class="k">enum</span> <span class="n">Node</span><span class="o"><</span><span class="n">X</span><span class="p">,</span> <span class="n">Y</span><span class="o">></span> <span class="p">{</span>
<span class="n">Leaf</span> <span class="p">{</span>
<span class="n">leaf_data</span><span class="o">:</span> <span class="n">X</span><span class="p">,</span>
<span class="p">},</span>
<span class="n">Branch</span> <span class="p">{</span>
<span class="n">branch_data</span><span class="o">:</span> <span class="n">Y</span><span class="p">,</span>
<span class="n">child_0</span><span class="o">:</span> <span class="n">Box</span><span class="o"><</span><span class="n">Node</span><span class="o"><</span><span class="n">X</span><span class="p">,</span> <span class="n">Y</span><span class="o">>></span><span class="p">,</span>
<span class="n">child_1</span><span class="o">:</span> <span class="n">Box</span><span class="o"><</span><span class="n">Node</span><span class="o"><</span><span class="n">X</span><span class="p">,</span> <span class="n">Y</span><span class="o">>></span><span class="p">,</span>
<span class="n">child_2</span><span class="o">:</span> <span class="n">Option</span><span class="o"><</span><span class="p">(</span><span class="n">Y</span><span class="p">,</span> <span class="n">Box</span><span class="o"><</span><span class="n">Node</span><span class="o"><</span><span class="n">X</span><span class="p">,</span> <span class="n">Y</span><span class="o">>></span><span class="p">)</span><span class="o">></span><span class="p">,</span>
<span class="n">child_3</span><span class="o">:</span> <span class="n">Option</span><span class="o"><</span><span class="p">(</span><span class="n">Y</span><span class="p">,</span> <span class="n">Box</span><span class="o"><</span><span class="n">Node</span><span class="o"><</span><span class="n">X</span><span class="p">,</span> <span class="n">Y</span><span class="o">>></span><span class="p">)</span><span class="o">></span><span class="p">,</span>
<span class="p">},</span>
<span class="p">}</span>
</code></pre>
</div>
<p>So we’ve found another representation of red-black trees!</p>
<p>Is it useful? It at least warranted a <a href="https://en.wikipedia.org/wiki/Red%E2%80%93black_tree#Analogy_to_B-trees_of_order_4">prominent
mention</a> on the
Wikipedia page for red-black trees. This section starts out:</p>
<blockquote>
<p>A red–black tree is similar in structure to a B-tree of order 4, where each node can contain
between 1 and 3 values and (accordingly) between 2 and 4 child pointers.</p>
</blockquote>
<p>By “values”, it means “data in nodes”, i.e. our <code>Y</code>s. And indeed the <code>Branch</code> variant has 1-3 <code>Y</code>s
and (accordingly) 2-4 <code>Node<X, Y></code>s.</p>
<p>So that’s how you can use algebra to find another representation of a data structure, even one with
complicated invariants like a red-black tree.</p>
<h2 id="summary">Summary</h2>
<p>I don’t know about you, but I was excited to discover just how <em>extensively</em> you can model data
types with algebra. This post covered a lot, so here’s a cheat sheet to remember all the algebraic
laws and what they say about types:</p>
<p><a href="/src/algebra-and-datatypes-reference.pdf">Cheat Sheet</a></p>
<p>To read more in this vein, you might want to check out:</p>
<ul>
<li><a href="https://web.archive.org/web/20160607191240/http://chris-taylor.github.io/blog/2013/02/10/the-algebra-of-algebraic-data-types/">Blog posts by Chris Taylor on the same topic</a></li>
<li>Combinatorics. Remember when I dodged the question of why using <code>1/(1-A)</code> as the cardinality of a
list wasn’t total nonsense? That’s called a <em>generating function</em>, and you learn to wield those
in a class on combinatorics. UPDATE: book recommendation – <a href="https://en.wikipedia.org/wiki/Concrete_Mathematics">Concrete Mathematics</a></li>
<li><a href="https://github.com/hmemcpy/milewski-ctfp-pdf">Category Theory for Programmers</a>. By great
coincidence, I discovered it as I was writing this post. It shows a number of applications of
category theory to programming, including proofs of some of the laws in this post.</li>
<li><a href="https://themattchan.com/docs/algprog.pdf">Algebra of Programming</a></li>
</ul>
<p><a href="https://lobste.rs/s/aul5kz/algebra_data_types">Discussion on lobste.rs</a></p>
Thu, 11 Mar 2021 00:00:00 -0500
http://justinpombrio.net//2021/03/11/algebra-and-data-types.html

What's a Confidence Interval?<blockquote>
<p>I think I <em>finally</em> understand exactly what a confidence interval is. This post explains my
understanding.</p>
</blockquote>
<hr />
<p>Alice is a researcher, and wants to know what fraction of people are left-handed. Since no one has
studied this before, she can’t just go to <a href="https://en.wikipedia.org/wiki/Handedness">Wikipedia</a> and
look up the answer. Instead she surveys some people chosen uniformly at random from Earth, analyzes
the results with the Clopper-Pearson method (as one does), and announces:</p>
<blockquote>
<p>The 95% confidence interval for the proportion of people who are left handed ranges from 5.43% to
12.91%! (Hereafter 6%-13% for brevity.)</p>
</blockquote>
<p>What exactly does this mean?</p>
<p>The obvious answer would be that there’s a 95% chance that the proportion of lefties is between 6% and
13%. After all, it is called a 95% confidence interval, so you should be 95% confident that the true
value lies inside the interval. And <a href="https://featuredcontent.psychonomic.org/confidence-intervals-more-like-confusion-intervals/">more than half of psychology researchers in the Netherlands
agree with this statement</a>.</p>
<p>Nevertheless, this is wrong. There is not a 95% chance that the fraction of lefties is between 6%
and 13%.</p>
<p>There <em>is</em> a kind of interval for which that would be true. It’s called a <em>credible interval</em>. It’s
a different thing, that’s calculated a different way.</p>
<h2 id="confidence-intervals-vs-credence-intervals">Confidence Intervals vs. Credible Intervals</h2>
<p>Let me explain the difference between confidence intervals and credible intervals.</p>
<p>Say you are curious about some value <code>X</code>; in our example, the percentage of people who are left
handed. Then you can compute either a confidence interval, or a credible interval.</p>
<p>A 95% confidence interval has a lower and upper bound; call them <code>L</code> and <code>U</code>. 95% of the time, <code>X</code>
will lie between <code>L</code> and <code>U</code>:</p>
<pre><code>P(L < X < U) = 0.95
</code></pre>
<p>(Yes, I realize this appears to contradict what I said earlier. It does not. Keep reading.)</p>
<p>On the other hand, a 95% credible interval has a lower and upper bound; call them <code>L</code> and <code>U</code>. 95%
of the time, <code>X</code> will lie between <code>L</code> and <code>U</code>:</p>
<pre><code>P(L < X < U) = 0.95
</code></pre>
<p>Obviously, this is <em>completely different</em>.</p>
<p>Well, glad we cleared that up. Thanks for reading!</p>
<h2 id="frequentist-vs-bayesian-interpretations">Frequentist vs. Bayesian interpretations</h2>
<p>You’re still here?</p>
<p>Oh, perhaps you were confused by the fact that both the English description and the mathematical
formulas completely hid the distinction between the two types of intervals.</p>
<p>The crux of the matter is that you use a confidence interval when taking a Frequentist
interpretation of probability, and a credible interval when taking a Bayesian interpretation. And
while both of these use probability theory to model the experiment, the <em>way</em> they model it is
completely different.</p>
<p>The difference is which values they consider to be random variables, and which they consider to be
constants.</p>
<p>In the Frequentist interpretation:</p>
<ul>
<li>The proportion of people who are left-handed, <code>X</code>, is a constant (albeit an unknown one).</li>
<li>The computed lower and upper bounds, <code>L</code> and <code>U</code>, are random variables.</li>
</ul>
<p>On the other hand, in the Bayesian interpretation:</p>
<ul>
<li>The proportion of people who are left-handed, <code>X</code>, is a random variable.</li>
<li>The computed lower and upper bounds, <code>L</code> and <code>U</code>, are constants.</li>
</ul>
<p>From now on, I’ll put a <code>?</code> after random variables in formulas, to distinguish them from constants. So in a
(Frequentist) confidence interval:</p>
<pre><code>P(L? < X < U?) = 0.95
</code></pre>
<p>In other words, 95% of the time, the random computed confidence interval will contain the true
constant value X. So if you perform the same experiment over and over, surveying a different set of
people and computing a fresh confidence interval each time, 95% of these confidence intervals will
contain the true value.</p>
<p>On the other hand, in a (Bayesian) credible interval:</p>
<pre><code>P(L < X? < U) = 0.95
</code></pre>
<p>In other words, there’s a 95% chance the random true value lies inside the credible interval you
calculated after surveying a single set of people.</p>
<p>See? Completely different.</p>
<p>The consequences of this aren’t immediately clear, though. It isn’t even obvious whether this
difference really matters. Let’s explore!</p>
<h2 id="there-is-not-a-95-chance-the-true-value-lies-in-a-particular-interval">There is not a 95% chance the true value lies in a particular interval</h2>
<p>At the beginning of this post, I said that there is not a 95% chance that the fraction of lefties is
between 6% and 13%. But I <em>also</em> said just a second ago that, in a 95% confidence interval:</p>
<pre><code>P(L? < X < U?) = 0.95
</code></pre>
<p>And L=6% and U=13%. What gives?</p>
<p>In the Frequentist interpretation, L and U are random variables. They were randomized when Alice
picked which people to survey (uniformly at random). If her random selection picked different
people, then she might have seen more or fewer lefties, giving different values for L and U.</p>
<p>However, the <em>specific</em> values of L=6% and U=13% are clearly <em>not</em> random. They were random until
Alice ran her survey, but now we know exactly what they are. So</p>
<pre><code>P(0.06 < X < 0.13)
</code></pre>
<p><em>doesn’t make any sense</em> in this interpretation. <code>X</code> is not a random variable; it’s a constant whose
value we happen not to know. <code>P(0.06 < X < 0.13)</code> is either 1 or 0; we just don’t know which.</p>
<p>Here’s a more extreme example that may guide your intuition. Instead of getting L and U from a
confidence interval calculation, say we get them (literally) from the roll of a die. The die roll
will be an integer from 1 to 6, and we’ll define L to be one less than the roll and U to be one more.
(For example, if you roll a 2, then <code>L? = 1</code> and <code>U? = 3</code>.) And say that, instead of being an <em>unknown</em>
constant, X is a known constant with the value 2.5. Then, what is this probability?</p>
<pre><code>P(L? < X < U?)
</code></pre>
<p>Well, this is true if the die comes up 2 (giving <code>L? = 1</code> and <code>U? = 3</code>) or 3 (giving <code>L? = 2</code> and
<code>U? = 4</code>), and false otherwise. That’s 2 of the 6 possible rolls. So the probability that X is
between L and U is 33%.</p>
<p>Now say we roll a 5, so <code>L? = 4</code> and <code>U? = 6</code>. What is this probability?</p>
<pre><code>P(4 < X < 6)
</code></pre>
<p>It’s 0. Duh. 2.5 is not greater than 4. X is not a random variable. Likewise, in the Frequentist
interpretation, you can’t talk about the probability that the proportion of lefties is between 6%
and 13%, because the proportion of lefties is not a random variable.</p>
<p>I realize that this “extreme” example may not be very convincing for what is happening with the
confidence interval. The problem is that X in this example is known, but you only realistically
compute a confidence interval when X is unknown.</p>
<p>Honestly, I think the Frequentist interpretation is counter-intuitive: it interprets L and U as
random variables, even though we know their values (6% and 13%), and X as a constant, even though we
have no idea what value it has.</p>
<h2 id="okay-but-i-still-want-to-know-if-x-is-in-the-interval">Okay, but I still want to know if X is in the interval</h2>
<p>The reason that Alice did an experiment was because she wants to know what X is. So it’s pretty
unsatisfying that, in the Frequentist interpretation, you can’t make any statements about how likely
it is that X is between 6% and 13%. You can only make the long, awkward, and not particularly
helpful statement that “the interval 6%-13% was generated by a random process which produces
intervals that contain the true proportion of lefties 95% of the time”.</p>
<p>I said above that in the Bayesian interpretation, X is a random variable. So can you take Alice’s
confidence interval, and somehow do some math and switch to a Bayesian perspective to get
probability bounds on X? Not really: in the Bayesian interpretation, the computation that Alice used
to get a confidence interval is meaningless.</p>
<p>(Except if the computation for the confidence interval is the same as the computation for the
credible interval, which happens for some Frequentist methods, but not others. It does not happen
for the Clopper-Pearson method that Alice used to get the 6-13% interval.)</p>
<p>So when someone <em>does</em> speculate on whether a particular confidence interval contains X or not, they
have left the realm of mathematics and are making claims <em>outside of probability theory</em>, using
good old-fashioned heuristics and guesswork. It’s fine to do this, just realize that you are no
longer doing math.</p>
<h2 id="many-possible-confidence-intervals">Many possible confidence intervals</h2>
<p>One important thing to realize is that there’s more than one confidence interval. There are many
statistical tests you can perform which will give you a confidence interval. As long as you apply
them appropriately, the interval they give you will have the property of confidence intervals, which
is that:</p>
<pre><code>P(L? < X < U?) = 0.95
</code></pre>
<p>The reason there can be several <em>different</em> ways to compute an interval with this property is that
it’s a weak property. It just tells you that <em>on average</em>, 95% of intervals will contain the true
value. But some intervals may be more plausible than others.</p>
<p>I’ll give an extreme, though technically valid, example.</p>
<p>Here’s a statistical test that produces a perfectly valid confidence interval, for estimating a
parameter whose value is known to lie between 0% and 100% (e.g. the proportion of people who are left
handed):</p>
<ol>
<li>Roll a D20.</li>
<li>If it comes up anything other than 20, the confidence interval is 0%-100% (which obviously
contains the true value). If it comes up 20, the confidence interval is 200%-300% (which
obviously doesn’t).</li>
</ol>
<p>This obeys the confidence interval property precisely: 95% of the time, you’ll roll something other
than 20 and the interval will contain the true value, and 5% of the time you’ll roll a 20 and it
won’t.</p>
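<p>Since a D20 has exactly 20 equally likely outcomes, we can check the coverage of this silly test exhaustively rather than by simulation (a minimal sketch; the true value of 9% is an arbitrary stand-in):</p>

```python
# Enumerate all 20 die rolls and count how often the resulting
# "confidence interval" contains the true value.
true_value = 0.09  # say, the true fraction of lefties

hits = 0
for roll in range(1, 21):
    if roll == 20:
        interval = (2.0, 3.0)  # 200%-300%: never contains the true value
    else:
        interval = (0.0, 1.0)  # 0%-100%: always contains the true value
    if interval[0] <= true_value <= interval[1]:
        hits += 1

print(hits / 20)  # 0.95: exactly the advertised 95% coverage
```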
<p>With real statistical tests, it’s rare that a particular confidence interval will be <em>this</em>
ridiculous. But real tests sometimes return an implausibly tight or loose confidence interval.</p>
<p>The issue with this is that, if the true value turns out to be much larger than the upper bound of
the interval, I would <em>like</em> to be able to say “Wow, what a high value! I am surprised!”. But
instead I only get to say “Wow, what a value! Either that is higher than the data suggested, or we
got unlucky and the confidence interval was misleadingly tight for no particular reason!”.</p>
<p>(I read a great article by an actual statistician that talked about a statistical test
that—properly employed—would sometimes return a confidence interval that was partially negative,
for a value that was by definition positive. The point was, this is completely legit! If you feel it
isn’t, you don’t yet grok the confidence interval property. I’ll link to the article here if I ever
find it again.)</p>
<h2 id="many-possible-priors">Many possible priors</h2>
<p>Let’s contrast this to the Bayesian approach. Remember, in the Bayesian interpretation, a credible
interval has the property:</p>
<pre><code>P(L < X? < U) = 0.95
</code></pre>
<p>X, the proportion of people who are left handed, is a random variable. What is its distribution?</p>
<p>Actually, say that Alice hasn’t even surveyed anyone yet. At this point in time, X is still a random
variable, so it must be taken from some distribution. But which one? We have <em>no idea</em> how many
people are left handed; that’s the whole reason Alice is doing an experiment in the first place.</p>
<p>It’s tempting to refuse to pick a distribution for X. We’re <em>scientists</em>, after all, and we don’t
want to make any unnecessary assumptions. To give a distribution for X <em>before we even have any
data</em> seems at best hubris and at worst bias.</p>
<p>Nevertheless, in the Bayesian interpretation, X is random, and thus we must assume some distribution
for it prior to gathering data. This is called its <em>prior</em> distribution.</p>
<h2 id="computing-a-credible-interval">Computing a credible interval</h2>
<p>Let’s roll with this, and compute a credible interval. Since we <em>have</em> to pick a prior, let’s assume
that X’s starting distribution is uniform. (Remember that <code>X</code> is the fraction of all people who are
left handed.) So our starting distribution says that it’s equally likely for <code>X</code> to have any
particular value:</p>
<pre><code>Value of X: 1% 2% 3% ... 99% 100%
Probability: 1% 1% 1% ... 1% 1%
</code></pre>
<p>As a graph:</p>
<p><img src="/src/img/bayes-plot-0.png" width="60%" /></p>
<p>(Yes, I know the value of X is almost certainly not a perfect integer. Let’s simplify and say it
is. It doesn’t change the conclusion, I promise. Oh, and I’m also pretending it can’t be 0%. Shhh.)</p>
<p>That’s the probability distribution of X, before we’ve gathered any data. As soon as we start
gathering data, it changes.</p>
<p>So we pick a person uniformly at random from Earth, and ask them whether they’re left-handed. They
say (in Bengali) that they’re not; they’re right-handed. We can now use some basic probability
theory to update X’s distribution to account for this information.</p>
<p>Let <code>A</code> be short for “the person we randomly picked was not left handed”. Then the new distribution
should be <code>P(X|A)</code>, meaning “the probability that <code>X</code> has a particular value, given our observation
of a non-left-handed person”. We can compute it like this:</p>
<pre><code>P(X|A) = P(A|X) * P(X) / P(A)
       = (1 - X) * 0.01 / 0.5
       = 0.02 * (1 - X)
</code></pre>
<p>(The law of probability used in the first step is called <em>Bayes’ rule</em>.)</p>
<p>So our updated distribution for X—which is called a <em>posterior</em> distribution—is <code>0.02 (1 - X)</code>:</p>
<pre><code>Value of X: 1% 2% 3% ... 99% 100%
Probability: 2% 1.98% 1.96% ... 0.02% 0%
</code></pre>
<p>As a graph:</p>
<p><img src="/src/img/bayes-plot-1.png" width="60%" /></p>
<p>The hypothesis that “literally everyone is left-handed” has been eliminated by the observation of a
right-handed person.</p>
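<p>Here’s that single update as a few lines of Python (a sketch; the names are my own, and I divide by the exact P(A) = 0.495 rather than the rounded 0.5, which is what the table’s values use):</p>

```python
# One Bayesian update: start from a uniform prior over X = 1%..100%,
# then observe one right-handed person (event A, with P(A|X) = 1 - X).
xs = [i / 100 for i in range(1, 101)]
prior = [0.01] * 100                       # uniform prior

unnorm = [(1 - x) * p for x, p in zip(xs, prior)]
p_a = sum(unnorm)                          # P(A) = 0.495
posterior = [u / p_a for u in unnorm]      # Bayes' rule

print(round(p_a, 3))           # 0.495
print(round(posterior[0], 4))  # P(X=1%) = 0.02, matching the table
print(posterior[-1])           # P(X=100%) = 0.0: hypothesis eliminated
```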
<p>If we next see a left-handed person, our distribution for X updates to:</p>
<pre><code>Value of X: 1% 2% ... 50% ... 98% 99%
Probability: 0.06% 0.12% ... 1.5% ... 0.12% 0.06%
</code></pre>
<p><img src="/src/img/bayes-plot-2.png" width="60%" /></p>
<p>Remember, at this point we’ve seen one leftie and one rightie. So it makes sense that the
distribution is symmetric.</p>
<p>Here’s how the graph changes over 35 hypothetical observations:</p>
<p><img src="/src/img/bayes-plot.gif" width="60%" /></p>
<p>(The sudden jumps to the right are observations of a left-handed person.)</p>
<p>Say we keep going, and get to 21 lefties out of 243 interviewees (these are the numbers I assumed
for Alice’s confidence interval at the beginning of the post). Then our graph will look like this:</p>
<pre><code>Value of X: 1% 2% ... 5% 6% 7% 8% 9% 10% 11% 12% ...
Probability: ~0% ~0% ... 1% 5% 14% 21% 22% 17% 11% 5% ...
</code></pre>
<p><img src="/src/img/bayes-plot-242.png" width="60%" /></p>
<p>Computing a credible interval from this is easy. The probabilities of the various possibilities for
X add up to 100%. The credible interval picks the lower and upper values for X—call them L and
U—such that 2.5% of the probability lies below L, 2.5% lies above U, and 95% lies between them. In
this case, the credible interval is 5.74% to 12.86%.</p>
<p>So if you accept a uniform prior then there’s a 95% chance that the proportion of the population
that is left-handed is between 5.74% and 12.86%.</p>
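<p>The whole computation, from uniform prior to credible interval, fits in a few lines of Python (a sketch using the same 1% grid as above, with names of my own choosing; on this coarse grid the bounds land on whole percentages, consistent with the exact 5.74% to 12.86%):</p>

```python
from math import comb

# Posterior over the 1%..100% grid after seeing k lefties in n interviews,
# starting from the uniform prior. The comb(n, k) factor cancels in the
# normalization, but it keeps the weights interpretable as likelihoods.
n, k = 243, 21
xs = [i / 100 for i in range(1, 101)]
weights = [comb(n, k) * x**k * (1 - x) ** (n - k) for x in xs]
total = sum(weights)
posterior = [w / total for w in weights]

def credible_interval(xs, probs, mass=0.95):
    """Chop (1 - mass)/2 of the probability off each tail."""
    tail = (1 - mass) / 2
    acc, lo, hi = 0.0, xs[0], xs[-1]
    for x, p in zip(xs, probs):
        if acc <= tail < acc + p:
            lo = x          # the 2.5% mark falls in this cell
        if acc < 1 - tail <= acc + p:
            hi = x          # the 97.5% mark falls in this cell
        acc += p
    return lo, hi

lo, hi = credible_interval(xs, posterior)
print(lo, hi)
```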
<p>If you were to pick a different prior, you’d get a different credible interval. There’s an infinite
number of possible priors you could pick (and a uniform prior isn’t always appropriate), but every
prior leads to exactly one credible interval.</p>
<h2 id="computing-a-confidence-interval">Computing a Confidence Interval</h2>
<p>How could you follow in Alice’s footsteps, to compute a confidence interval?</p>
<p>Wikipedia lists five methods for computing a confidence interval for a binomial distribution (i.e.,
one with a yes/no answer, like “is this person left-handed?”). They are:</p>
<ol>
<li>Normal approximation interval</li>
<li>Wilson score interval</li>
<li>Jeffreys interval</li>
<li>Clopper-Pearson interval</li>
<li>Agresti-Coull interval</li>
</ol>
<p>The Normal, Wilson, and Agresti-Coull intervals can be “permissive”, meaning that when you ask for a
95% confidence interval, the probability that the interval contains the true value might be less
than 95%. (Personally, I would have called this “wrong”, but “permissive” seems to be the term of
art.) So let’s ignore those.</p>
<p>The Clopper-Pearson interval is what I had Alice use. If you ask for a 95% interval, there’s always
at least a 95% chance that the interval contains the true value. Though it can be a bit
conservative. You can see that in this post: Alice’s confidence interval is a bit wider than the
Bayesian credible interval.</p>
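<p>If you’re curious how such an interval comes out of the data, Clopper-Pearson bounds can be computed with nothing fancier than binomial tail probabilities and bisection. A sketch (the function names are mine; in practice a statistics package ships this):</p>

```python
from math import comb

def binom_tail_ge(n, k, p):
    """P(at least k successes in n trials, each succeeding with probability p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def bisect(f, lo, hi):
    """f(lo) is True and f(hi) is False; home in on the boundary."""
    for _ in range(60):  # 60 halvings of [0, 1] is plenty of precision
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if f(mid) else (lo, mid)
    return lo

def clopper_pearson(k, n, alpha=0.05):
    """Clopper-Pearson confidence interval for a binomial proportion."""
    # Lower bound: the p at which seeing >= k successes becomes as rare as alpha/2.
    lower = 0.0 if k == 0 else bisect(
        lambda p: binom_tail_ge(n, k, p) < alpha / 2, 0.0, 1.0)
    # Upper bound: the p at which seeing <= k successes becomes as rare as alpha/2.
    upper = 1.0 if k == n else bisect(
        lambda p: 1 - binom_tail_ge(n, k + 1, p) >= alpha / 2, 0.0, 1.0)
    return lower, upper

lo, hi = clopper_pearson(21, 243)  # Alice's data: 21 lefties out of 243
print(round(lo, 4), round(hi, 4))  # about 0.0543 and 0.1291, as at the top of the post
```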
<p>The final option is the Jeffreys interval. It is a Bayesian credible interval that also obeys the rules
for a confidence interval. As a credible interval, it’s like what we calculated above, except that
instead of starting with a uniform distribution, it starts with a beta distribution with parameters
(1/2, 1/2). That distribution looks <a href="https://www.wolframalpha.com/input/?i=beta+distribution+alpha%3D0.5+beta%3D0.5">like
this</a>, which seems a
little odd to me because it makes “about half of all people are left-handed” the least plausible
possibility of all.</p>
<p>As a confidence interval, the Jeffreys interval has the bonus property that it’s equally likely for the
true value to lie on either side of the interval. According to Wikipedia, this is in contrast to the
Wilson interval, which is centered too close to <code>X = 1/2</code>.</p>
<p>There’s a lesson here: while the probability that the true value lies outside a randomly generated
95% confidence interval is 5%, the chance it lies <em>above</em> the interval need not be 2.5%.</p>
<h2 id="conclusion">Conclusion</h2>
<p>I think the main takeaways are:</p>
<ul>
<li>There is a simple, technical difference between Frequentist and Bayesian interpretations of
probability theory: which things in the real world are modeled as constants, and which are modeled
as random variables.</li>
<li>A confidence interval <em>does not tell you</em> how likely it is that the true value lies in any particular
range. To find out, you must either (i) do Bayesian math, or (ii) reason <em>outside of</em> probability
theory.</li>
</ul>
Fri, 19 Feb 2021 00:00:00 -0500
http://justinpombrio.net//2021/02/19/confidence-intervals.html