Shor's algorithm is essentially phase estimation with the unitary matrix $U_{x,N}$ where

\[ U_{x,N} |y \rangle = |xy \textrm{ mod } N \rangle \textrm{ if } y < N \]
\[ U_{x,N} |y \rangle = |y \rangle \textrm{ if } y \geq N \]

A fast U exists

We start by convincing ourself that there exists a fast quantum circuit which evaluates $U_{x,N}$.

We know that multiplication (mod N) can be done quickly (for a single binary input) on a classical computer. What that means is that if you go to your ECE friend, they could make a short (poly-time) classical circuit which eats $y$ and spits back out $xy mod N$. If you insisted, your ECE friend could even make a quick circuit that does this out of a set of three-qubit reversible gates called tofelli gates (which are universal for reversible classical circuits).

But tofelli gates can also be treated as quantum gates. If we just take the classical circuit and put in quantum toffeli gates, then we will have a quantum circuit. And we could take our quantum circuit full of toffeli gates and turn it into one with just H, P, and CNOT.

When we run this quantum circuit it gives the same results that you get classically for each basis element. But in the quantum case you can run it in superposition!

Note, that this trick works not only for $U_{x,N}$ but any classical function $f$ which can be evaluated quickly.

Aside: This single fact makes quantum mechanics look pretty powerful. Suppose that I have a function $f$ I want to compute. In the time that classically I can do $f(0)$, in quantum mechanics I can get a superposition $\sum_i f(i)$. It's as if I did all the calculations in parallel! But what happens if I measure the system now?

We could in principle follow this approach but is hard because you have to have a good ECE friend (if anyone actually suceeds in doing this for Shors please tell me)

Another approach to make your unitary matrix and then use you work from Universal Gates (here) to change that unitary into $H$, CNOT, and P gates. This is straightforward to do but might not scale well (since the algorithm to convert unitaries into these gates automatically aren't always efficient). If someone, decides to do this, please do some tests on the scaling and let me know how it scales.

A fast controlled-$U_{x,N}$ exists

We just argued that $U_{x,N}$ can, in principle, be made out of a polynomial number of gates.
But if we can make a circuit out of a polynomial number of gates then we can make a controlled version of it out of a polynomial number of gates.

This can be done by taking your circuit and adding a controlled wire to each gat element. Then we get control-hadamard, control-CNOT, and control-phases. We can then turn each of those into just CNOT, H, and P gates. We will only have increased the number of gates by a bit.

Our approach

The approach we actually recommend taking is to implement (classical gates) and (control-classical gates) as atomic gates in your simulator. This is only a bit of cheating since we know that they are classically fast.

For simulator Ic:
In the previous section you've generated $U_{x,N}$. It should be straigtforward to add this as a gate in a sparse fasion into your simulator. Because it acts on a number of wires you just have to be careful to kron it against the right number of identities in the correct place.

To get the control-$U_{x,N}$ working is a bit trickier. Essentially you have to figure out what the big unitary matrix for this gate will look like (it will be sparse) and then add it to your simulator.

For simulator II and III:
Your simulator essentially works by taking basis elements and responding with the new basis element. Since python does modular multiplication this should be just as easy to add as any of the other gates.

1	`FUNC 4 5 xyModN 2 15`

where the first number is the first wire for the function, the next number is the number of wires for the function (so it works on lines 4,5,6,7,8), the "xyModN" is the name of the function in python, and everything after that is parameters to that function (in this case, $x$ and $N$).

1	`CFUNC 3 4 5 xyModN 2 15`

where the first number is the control-bit (like the CNOT), the second number is the first wire for the function, the next number is the number of wires for the function, the "xyModN" is the name of the function in python, and everything after that is parameters to that function (in this case, $x$ and $N$).

Classical and Controlled Classical

A fast U exists

A fast controlled-$U_{x,N}$ exists

Our approach