I recently read a great introductory paper on Supersingular Isogeny Diffie-Hellman (SIDH) by Craig Costello and wanted to summarize just the math of it (with some simplifications) for myself. Hopefully this summary is clear enough to also be useful to people who aren't myself.

§ Background

We deal with supersingular (we won't define this word) elliptic curves defined over finite fields of the form $\mathbb{F}_{p^2}$.

For SIDH, we deal only with curves such that $E(\mathbb{F}_{p^2}) \cong \mathbb{Z}_{p+1} \times \mathbb{Z}_{p+1}$. The kernel of the left multiplication map $P \to [a]P$ is of the form $\mathbb{Z}_a \times \mathbb{Z}_a$. We say that a map $\phi: E \to E'$ is an $\ell$-isogeny iff $|\ker \phi| = \ell$. Notice that $\ell$-isogenies only exist for $\ell \mid p+1$. Every $\ell$-isogeny $\phi: E \to E'$ also has an associated $\ell$-isogeny, called the dual isogeny $\psi: E' \to E$, such that $\phi \circ \psi$ is the left-multiplication map $[\ell]$. Vélu's formulas tell us that every subgroup $G \leq E$ corresponds to a unique isogeny out of $E$ (to some curve $E'$) with kernel G. They also tell us how to construct an isogeny given its desired kernel.

Claim: On a supersingular curve $E$ of the form above, for any prime $\ell \mid p+1$, there are exactly $\ell+1$ nontrivial $\ell$-isogenies out of $E$.

Proof: Using Vélu's formulas, it suffices to show that E has exactly $\ell+1$ many subgroups of prime order $\ell$. Firstly, since these are of order $\ell$, they all must be subgroups of $E[\ell] \cong \mathbb{Z}_\ell \times \mathbb{Z}_\ell$. Let $P$ and $Q$ denote the free generators of $E[\ell]$. Consider the family of cyclic subgroups generated by $P + [k]Q$ for $0 \leq k < \ell$. We claim that these are not only distinct, but disjoint. Suppose for distinct $k$, $k'$ that

$$ \alpha(P + [k]Q) = \beta(P + [k']Q) $$

for some integers $\alpha$, $\beta$. Rearranging,

$$ [\alpha]P - [\beta]P = [\beta k]Q - [\alpha k']Q $$

Since $P$ and $Q$ are linearly independent, both sides of this equation must be the identity. Therefore,

$$ \alpha \equiv \beta\ (\textrm{mod } \ell) $$

Putting this together with the fact that the RHS is the identity,

$$ \beta k \equiv \alpha k' \equiv \beta k'\ (\textrm{mod } \ell), $$

we conclude $k \equiv k'\ (\textrm{mod } \ell)$. This gives us a family of $\ell$ many disjoint subgroups. The final subgroup to include is the one generated by $Q$. It is clear that this is also disjoint from the previously mentioned subgroups. Finally, there cannot be any more order $\ell$ subgroups, since we have listed all the possible cyclic subgroups of $\mathbb{Z}_\ell \times \mathbb{Z}_\ell$, and all groups of prime order are cyclic.

A supersingular $\ell$-isogeny graph is a graph whose nodes are $j$-invariants (i.e., elements of $\mathbb{F}_{p^2}$ which are in bijective correspondence with isomorphism classes of elliptic curves) and whose edges represent $\ell$-isogenies between them. Note this is an undirected graph because every $\ell$-isogeny has a dual $\ell$-isogeny going in the other direction. Also, by the above claim, every node has degree $\ell+1$. It turns out that this is an expander graph, which means random walks on it mix quickly. In other words, the diameter of an expander graph on $N$ elements is roughly $\log(N)$. SIDH exploits this property to make its security claims.

§ SIDH

The purpose of SIDH is to compute a shared secret by exchanging public keys. Same idea as Diffie Hellman.

§ Public Parameters

A prime $p = 2^{e_A}3^{e_B} - 1$. This way, $p+1$ is divisible by powers of 2 and 3. $e_A$ and $e_B$ are chosen so that $2^{e_A} \approx 3^{e_B}$.
A starting curve $E$ defined over $\mathbb{F}_{p^2}$
Two points $P_A,Q_A \in E$ such that $\langle P_A, Q_A \rangle \cong \mathbb{Z}_{2^{e_A}} \times \mathbb{Z}_{2^{e_A}}$.
Two points $P_B,Q_B \in E$ such that $\langle P_B, Q_B \rangle \cong \mathbb{Z}_{3^{e_B}} \times \mathbb{Z}_{3^{e_B}}$.

§ Protocol

Here's the gist of the protocol: Alice takes a walk on a 2-isogeny graph and marks her stopping point. Bob takes a walk on a 3-isogeny graph and marks his stopping point. The constraint that $2^{e_A} \approx 3^{e_B}$ means that the number of possible stopping points that Alice can get to is roughly equal to the number of possible stopping points Bob can get to. They then share their stopping points with each other. Then both of them do the same walks as before, but starting at each other's stopping point. The final endpoints of Alice's and Bob's walks are isomorphic (as elliptic curves, since points on the isogeny graph are curves), which means they have the same $j$-invariant.

Concretely, here's the protocol sequence. Alice initiates:

Alice:

Picks a random secret scalar $0 \leq k_A < 2^{e_A}$ and defines a secret generator $S_A = P_A + [k_A]Q_A$. Note that $S_A$ has order $2^{e_A}$ because $P_A$ and $Q_A$ are linearly independent. By the proof above, all the possible subgroups $\langle S_A\rangle$ are disjoint.
Constructs a $2^{e_A}$-isogeny $\phi_A: E \to E/\langle S_A\rangle$. This is done iteratively by repeatedly constructing 2-isogenies with elements of $\langle S_A\rangle$ in the kernel. The way you do this is by getting a point $R_A = [2^{e_A-1}]S_A$ of order 2. Using Vélu's formulas, you can construct a $\phi_0: E \to E/\langle R_A\rangle$. Now let $S'_A = \phi_0(S_A)$. Then $S'_A$ has order $2^{e_A-1}$. Let $R'_A = [2^{e_A-2}]S'_A$, ... Once you've constructed $\phi_0$, ..., $\phi_{e_A}$, let $\phi_A$ be the composition of all of these.
Sends Bob $(\phi_A(E), \phi_A(P_B), \phi_A(Q_B))$, where $\phi_A(E)$ is the description of the output curve. We call this tuple Alice's "public key".

Bob:

Picks a secret $0 < k_B ≤ 3^{e_B}$, lets $S_B = P_B + [k_B]Q_B$, and constructs a $3^{e_B}$-isogeny $\phi_B: E \to E/\langle S_B\rangle$ in the same way as above.
Sends Alice $(\phi_B(E), \phi_B(P_A), \phi_B(Q_A))$. We call this tuple Bob's "public key".

Alice:

Uses the same method to construct $\psi_A: \phi_B(E) \to \phi_B(E)/\langle T_A\rangle$ where $$ T_A = \phi_B(S_A) = \phi_B(P_A) + [k_A]\phi_B(Q_A) $$
Computes the $j$-invariant of $\psi_A(\phi_B(E))$

Bob:

Same idea as above: Construct $\psi_B: \phi_A(E) \to \phi_A(E)/\langle T_B\rangle$ where $$ T_B = \phi_A(S_B) = \phi_A(P_B) + [k_B]\phi_A(Q_B) $$
Computes the $j$-invariant of $\psi_B(\phi_A(E))$

End of Protocol

Note that, since $j$-variants of isomorphic curves are equal, and $$ (E/\langle S_A\rangle)/\langle T_B\rangle \cong E/\langle S_A, S_B\rangle \cong (E/\langle S_B\rangle)/\langle T_A\rangle, $$

the computed $j$-invariants are the same. Thus, it makes sense to have the $j$-invariant be the shared secret.

§ An Attack on Static Public Keys

This attack is due to Galbraith, Petit, Shani, and Ti. If Alice keeps reusing the same $k_A$, Bob can determine its value by by initiating $\lceil\log_2(k_A)\rceil$ many SIDH exchanges. Here's the exploit:

Bob can figure out the bottom bit of $k_A$ by publishing $\phi_B(Q_A) + L_2$ instead of $\phi_B(Q_A)$, where $L_2$ is a point of order 2. If $k_A$ is odd, i.e., if $L_2$ is not killed by $k_A$, then the protocol does not produce agreement, because $T_A + L_2$ is not the image of $S_A$ in Bob's curve $\phi_B(E)$. If $k_A$ is even, then $L_2$ is killed by $k_A$ and the protocol completes successfully. Thus, the bottom bit is leaked.

Say Bob finds it was even. Now Bob wants to know if $k_A \equiv 0\ (\textrm{mod } 4)$ or $2\ (\textrm{mod } 4)$. Then he sends $\phi_B(Q_A) + L_4$ where $L_4$ is a point of order 4, and uses the same logic. If Bob had found $k_A$ was odd, then he would send $\phi_B(P_A) - L_4$ instead of $\phi_B(P_A)$, and $\phi_B(Q_A) + L_4$ instead of $\phi_B(Q_A)$. So if $k_A$ is $1\ (\textrm{mod } 4)$, then

$$ \begin{align*} &\phi_B(P_A) - L_4 + k_A(\phi_B(Q_A) + L_4) \newline &= \phi_B(P_A) - L_4 + [k_A]\phi_B(Q_A) + L_4 \newline &= \phi_B(P_A) + [k_A]\phi_B(Q_A) \end{align*} $$

and the protocol succeeds. Otherwise,

$$ \begin{align*} &\phi_B(P_A) - L_4 + k_A(\phi_B(Q_A) + L_4) \newline &= \phi_B(P_A) - L_4 + [k_A]\phi_B(Q_A) + [3]L_4 \newline &= \phi_B(P_A) + [k_A]\phi_B(Q_A) + [2]L_4 \end{align*} $$

and the protocol fails. Bob can continue like this to recover each successive bit of $k_A$.

§ Patching it up

How do you prevent the above attack from working? Ideally, Alice would like to be able to tell when she receives a malicious public key, i.e., one of the form $\phi_B(Q_A) + L$ for some low-order point $L$. If she could do this, then she would be able to terminate the protocol early and not reveal anything about the shared secret.

Unfortunately, given just a public key share, there's no way to tell if a given it was modified in this way or not.

But what if Alice was given more than a public key share? What if Bob somehow gave Alice his private ephemeral key as well? Then Alice would be able to compute Bob's public key share from his private key and check that it matches the one she received from Bob. If they match, then the public key was honest and the protocol completes successfully. If they don't match, then Alice can conclude that Bob cheated and she can abort the protocol early. This is precisely what the SIKE key encapsulation mechanism does. Here's the new protocol (a little simplified):

Alice

Picks her secrets randomly, as before
Publishes her public key $pk_A$, computed as before (i.e., a description of a curve, and two points on that curve)

Bob

Picks a random bitstring $m$
Computes his secret scalar $k_B = G(m \| pk_A)$, where $G$ is a pseudorandom number generator. He also computes the corresponding public key $pk_B$.
Computes the $j$-invariant of the SIDH exchange (as described above) between $k_B$ and $pk_A$
Derives a symmetric key $\kappa = \textrm{KDF}(j)$, where $\textrm{KDF}$ is some key-derivation function.
Computes $c = \textrm{Enc}_\kappa(m)$, where $\textrm{Enc}$ is the encryption function for some symmetric encryption scheme.
Sends Alice the tuple $(pk_B, c)$
Computes the shared secret (to be used if the protocol complete successfully) $K = \textrm{KDF}(m \| k_B \| c)$.

Alice

Uses $pk_B$ and her secret $k_A$ to derive the $j$-invariant, denoted $j'$
Computes $\kappa' = \textrm{KDF}(j')$
Decrypts $m' = \textrm{Dec}_{\kappa'}(c)$ (recall Alice received $c$ from Bob)
Computes the secret scalar $k_B' = G(m' \| pk_A)$ and its corresponding public key $pk_B'$
Checks that $pk_B' = pk_B$. If these are not equal, the protocol aborts.
Derives the shared secret $K = \textrm{KDF}(m' \| k_B' \| c)$

§ Analysis

Let's give an informal sketch of how this prevents information leakage.

Say Bob wants Alice to leak the lowest bit of her secret scalar $k_A$. The only leakage mechanism he has access to is learning whether the protocol succeeded or failed (since this can be used to convey a bit of information). He needs the protocol to do one thing when $k_A$ is even, and the other thing if it's odd.

If Bob computes and conveys everything honestly (i.e., as described in this new protocol), then it should be clear that the protocol succeeds with 100% probability. Thus, this leaks nothing. If Bob lies about $c$, then $m' = \textrm{Dec}_{\kappa'}(c)$ is certainly going to differ from $m$,¹ which results in a different $k'_B$ and thus a different $pk_B'$. So that strategy causes the protocol to always fail with overwhelming probability, and thus leak nothing.

So what if Bob lies about $pk_B$ and tries the "add a low-order point" trick from the previous section? As before, this gives Bob control over the value $j'$ that Alice derives. If $k_A$ is even, it will be the same as Bob's $j$, and if it's odd it will be different. So far so good. Continuing on the "good" branch, if $j' = j$, then Alice's steps 2, 3, and 4 all continue to derive the same values as Bob's. In step 5, Alice can see that the $pk_B$ she received doesn't match the public key corresponding to $k_B'$, so Alice aborts. On the "bad" branch, if $j' \neq j$, then all the values derived in steps 2, 3, and 4 are wrong, and Alice is certain to abort. Thus, this strategy also causes the protocol to always fail with overwhelming probability, thus leaking nothing!

I hope that explanation was understandable. The way the authors of SIKE prove security is actually far more elegant than this: it turns out that the "patch" I described above is a result of a much more general theorem.² Also, the patch I presented isn't totally accurate, so if you want more detail I encourage you to check out the spec (algorithms 1 and 2, and the proofs of security in section 4.3).

This relies on some assumption about the symmetric key primitive and on $\textrm{KDF}$, so that even if you modified $j'$ as well, you couldn't get $c$ to decrypt to a chosen plaintext

The FO transformation is precisely made to turn an IND-CPA PKE into an IND-CCA PKE with very little overhead. Here's a paper describing what it is and how it works. Here is the extremely influential paper that the authors of SIKE cite.

SIKE is currently in round 2 of the NIST post-quantum cryptography competition