Restricted Boltzmann Machines

In this section, we will learn how to build and train a restricted Boltzmann machine (RBM) to solve a simple engineering task.

The restricted Boltzmann machine is an example of a physics-inspired method that is useful for engineering problems. At a high-level, RBMs are probability distributions for binary data with many tunable parameters. When "training" an RBM, we tune its parameters to make the RBM's probability distribution match the distribution of a training data set. Our goal in this unit is to make and train an RBM whose probability distribution matches the distribution of images in the MNIST handwritten digit data set.

Like a Hopfield network, a restricted Boltzmann machine is a particular type of Ising model. We saw that a Hopfield network can be thought of as a zero-temperature Ising model which equilibrates to a local energy minimum. Similarly, a restricted Boltzmann machine can be thought of as a finite-temperature Ising model with a particular connection structure between its spins. Because of their relation to statistical physics, RBMs are more interpretable than some other machine learning approaches. And because of their particular structure, they can be efficiently trained.

A restricted Boltzmann machine is a probability distribution over binary variables, which, like in the Hopfield network, can be interpreted as spins or neurons. In an RBM, the binary variables are separated into two types, "visible" $v_i$ or "hidden" $h_j$ variables. The visible and hidden variables do not interact with variables of the same type, but only interact with variables of the other type. This special "restricted" structure allows us to separate the binary variables into two layers. Described using the language of statistical physics, an RBM is simply a Boltzmann distribution of an Ising model for the spins $v_i,h_j = \pm 1$:
\[ \begin{align*} p(v,h) = \frac{1}{Z}\exp(-E(v,h)) \end{align*} \]
where $E(v,h)=-\sum_{ij}W_{ij} v_i h_j - \sum_i a_i v_i - \sum_j b_j h_j$ is the energy of the visible and hidden spins, $Z$ is a normalization constant, and the temperature is set to $T=1$. Each entry of the weight matrix $W_{ij}$ describes an interaction between a visible spin $v_i$ and a hidden spin $h_j$. The visible bias $a_i$ (hidden bias $b_j$) is a magnetic field that encourages the visible spin $v_i$ (hidden spin $h_j$) to turn on. Note that the energy $E(v,h)$, and therefore the probability $p(v,h)$, depends on the weights $W_{ij}$ and biases $a_i, b_j$. These parameters define the RBM and are what we will tune during training.

Our goal is to write an RBM code and make it "learn" a probability distribution from a training data set. Once trained on this data, the machine will be able to produce new samples from the data's distribution (or its best estimate).

RBM as an Ising Model

To summarize, an RBM is really just an Ising model.

Restricted Boltzmann Machine $\longleftrightarrow$ Ising Model

Neuron $\longleftrightarrow$ Ising Spin
- $+1$ = Spin Up
- $-1$ = Spin Down
$W_{ij} \longleftrightarrow J_{ij}$
$a_i,b_j \longleftrightarrow h_i$

The energy of the corresponding Ising model is
$$E(v,h)=-\sum_{i,j} v_i W_{ij} h_j - \sum_{i} v_i a_{i} - \sum_{j} h_j b_{j}.$$
When we sample an RBM, we obtain the spin configuration $(v,h)$ with the $T=1$ Boltzmann probability
$$
p(v,h) = \frac{1}{Z}\exp(-E(v,h)).
$$

There are $N_v$ visible spins $v_1,\ldots,v_{N_v}$ and $N_h$ hidden spins $h_1,\ldots,h_{N_h}$.

Understanding Probability Distributions

There are a number of probability distributions that we need to be familiar with to understand RBMs. These probabilities characterize the frequency with which we observe configurations of visible and hidden spins when sampling an RBM.

The first and most important is the joint probability $p(v,h)$. This is a probability distribution over multiple random variables that gives the probability of observing these variables simultaneously. In other words, it tells us probability that the visible spins are in configuration $v$ and the hidden spins are in configuration $h$. By definition, for an RBM, this is given by the Boltzmann distribution
$$
p(v,h)=\frac{\exp[-E(v,h)]}{\sum_{v,h} \exp[-E(v,h)]} \equiv \frac{\exp[-E(v,h)]}{Z}.
$$
This probability distribution contains all of the information about how variables $v$ and $h$ are correlated with one another. The remaining probability distributions that we discuss are derived from the joint distribution.

Next are the marginal probabilities $p(v)$ and $p(h)$. These are probability distributions for subsystems of the entire system. Consider $p(v)$. Intuitively, $p(v)$ is the probability of observing the visible spin configuration $v$, averaged over all possible hidden spin configurations $h$. This quantity can be computed from the joint distribution $p(v) \propto \sum_h p(v,h)$, where the sum over $h$ is a sum over all $2^{N_h}$ spin confiurations of the hidden spins. So that this is a normalized probability distribution, with $\sum_v p(v) = 1$, the marginal distribution for the visible spins is defined as
\[ \begin{align*} p(v) = \frac{\sum_h p(v,h)}{\sum_{v,h} p(v,h)} = \frac{\sum_h \exp[-E(v,h)]}{\sum_{v,h} \exp[-E(v,h)]}. \end{align*} \]
Likewise, the marginal distribution for the hidden spins is
\[ \begin{align*} p(h) = \frac{\sum_v p(v,h)}{\sum_{v,h} p(v,h)} = \frac{\sum_v \exp[-E(v,h)]}{\sum_{v,h} \exp[-E(v,h)]}. \end{align*} \]

Finally, we discuss the conditional probabilities $p(v|h)$ and $p(h|v)$, which are read as "probability of $v$ given $h$" and "probability of $h$ given $v$". These are a bit subtle, so take some time to think about them and how they relate to the other distributions. A conditional probability is a probability distribution of a subsystem conditioned on the complementary subsystem taking a specific value. Consider $p(v|h=\hat{h})$. This is the probability that $v$ occurs given that $h$ is fixed to the value $\hat{h}$. Like the marginal distribution, this probability can be computed from the joint distribution $p(v|h=\hat{h}) \propto p(v,h=\hat{h})$. Notice that $p(v|h=\hat{h})$ depends on two variables $v$ and $\hat{h}$, but is only a probability distribution over the variable $v$. This means that it is only normalized over $v$ so that $\sum_v p(v|h=\hat{h})=1$ is true for any $\hat{h}$. This leads us to the normalized definition of $p(v|h)$ (where we drop the $h=\hat{h}$ notation) and its explicit form for an RBM
\[ \begin{align*} p(v|h) = \frac{p(v,h)}{\sum_{v} p(v,h)} = \frac{\exp[-E(v,h)]}{\sum_{v} \exp[-E(v,h)]}. \end{align*} \]
Likewise, $p(h|v)$ is
\[ \begin{align*} p(h|v) = \frac{p(v,h)}{\sum_{h} p(v,h)} = \frac{\exp[-E(v,h)]}{\sum_{h} \exp[-E(v,h)]}. \end{align*} \]
Notice how closely related all of these probability distributions are. For example, the marginal and conditional probabilities are all directly related to the joint distribution $p(v,h)$, but have different normalizations, which dramatically affect their interpretations.

In general, joint, marginal, and conditional probability distributions are related by a simple formula. Suppose that we have a system split into subsystem $A$ and its complement $B$. The probability of $A$ conditioned on $B$, denoted by $p(A|B)$, relates to the marginal distribution of $B$ and the joint distribution of $A$ and $B$ via
$$
p(A,B) = p(A|B) p(B).
$$
You should check why this formula is true from the definitions of conditional and marginal probabilities.

This relation forms the basis of Bayes' theorem,
$$
p(A|B) p(B) = p(B|A) p(A),
$$
which holds whenever the joint distribution is symmetric, $p(A,B) = p(B,A)$.

Why are marginal and conditional probabilities important? Why did we not consider them in the Ising model unit?

The variables in an RBM are by design split into two groups, visible and hidden variables. When training an RBM, the visible variables are associated with data, such as the pixels in images, while the hidden variables are simply extra variables in the model, sometimes called auxiliary or latent variables. Because of the special interpretation of the visible variables, it is natural to consider the probability of just the visible spins, i.e., the marginal distribution $p(v)$. This marginal distribution is the probability distribution that we will want to match to a data set during training. This is why we care about marginal probabilities.

An important technical detail about RBMs is that, because of the bipartite structure of the RBM, there is an efficient way to sample spin configurations using conditional probabilities $p(v|h)$ and $p(h|v)$. This efficient sampling allows us to perform efficient training. This is why we care about conditional probabilities.

In the Ising model unit, there was only one type of spin and there was no reason to split the spins into different subsystems. So we never needed to consider probability distributions that related different subsystems.

Building and Sampling an RBM

Your first goal is to set up the basic code structure for your RBM. You will want to store the states of your visible and hidden variables, as well as the values of your parameters, the weight matrix and the biases. Note that the weight matrix is generally rectangular, as there may be different numbers of visible $N_v$ and hidden neurons $N_h$. You should also set up a function that computes the energy, which will be useful for debugging.

Next, we want to learn how to sample configurations from our RBM, given the weights and biases, which we will need later in our training. Our goal is to produce spin configurations $(v,h)$ with probability $p(v,h)$. Since the RBM is just an Ising model, one way we could perform this sampling is by using Metropolis Markov Chain Monte Carlo as we did in the Ising model unit, where we performed single spin flips randomly by using the energy function. It turns out there is a more efficient sampling method for RBMs, because of their special restricted structure, that allows us to "flip" many spins at once instead of one-at-a-time.

The efficient Monte Carlo method that we will use to sample the joint distribution $p(v,h)$ is known as Gibbs sampling. Gibbs sampling involves the following steps:

Given the current visible layer $v$, probabilistically choose a configuration for the hidden layer $h$.
Given the chosen $h$, probabilistically choose a configuration for $v$.
Repeat steps 1 and 2 $k$-times.

This process is a Markov chain, and therefore has a stationary distribution. To produce the desired distribution, we must probabilistically choose new configurations of the hidden or visible layers in such a way that the stationary distribution is proportional to $p(v,h) \propto \exp[-E(v,h)]$. This can be done by using conditional probabilities in each step of the sampling process. Specifically, the first two steps are:

Given $v$, choose a new $h$ with probability $p(h|v)$.
Given $h$, choose a new $v$ with probability $p(v|h)$.

Step 1. Because of the bipartite structure of the RBM, the conditional probability has a particularly nice, factorable form. For the hidden spins conditioned on the visible spins, it is
\[ \begin{align*} p(h|v) = \prod_{j=1}^{N_h} p(h_j|v) \equiv \prod_{j=1}^{N_h} \frac{\exp[m_j h_j]}{\exp[m_j h_j] + \exp[-m_j h_j]} \end{align*} \]
where $m_j$ is an "effective magnetic field" felt by hidden spin $h_j$, defined as
$$
m_j \equiv \sum_{i} W_{ij} v_i + b_j.
$$
It is helpful to check this yourself, by using the definition of $p(v,h)$ and $p(h|v)$.

Because $p(h|v)$ factors into a product of $p(h_j|v)$, the $h_j$ variables are distributed independently (for a given $v$) and we can sample them independently and at the same time. This is what makes Gibbs sampling efficient for RBMs. In practice, to sample $p(h|v)$, we go through each hidden neuron $h_j$ and choose to set it to $+1$ with probability $p(h_j=+1|v)=\exp(m_j)/(\exp(m_j) + \exp(-m_j))$ and set it to $-1$ otherwise.

Step 2. The exact same logic applies. The conditional probability $p(v|h)$ factors into
\[ \begin{align*} p(v|h) = \prod_{i=1}^{N_v} p(v_i|h) \equiv \prod_{i=1}^{N_v} \frac{\exp[m_i v_i]}{\exp[m_i v_i] + \exp[-m_i v_i]} \end{align*} \]
where $m_i$ is an "effective magnetic field" felt by visible spin $v_i$, defined as
$$
m_i \equiv \sum_{j} W_{ij} h_j + a_i.
$$
To sample $p(v|h)$, go through each visible spin $v_i$ and set it to $+1$ with probability $p(v_i=+1|h)=\exp(m_i)/(\exp(m_i) + \exp(-m_i))$ and set it to $-1$ otherwise.

One iteration of step 1 generates samples from $p(h|v)$. One iteration of step 2 generates samples from $p(v|h)$. However, $k$ iterations of alternating steps 1 and 2, produces samples $(v,h)$ that are actually approximately distributed according to the joint distribution $p(v,h)$. This approximation becomes better as $k$ is increased, but works well even for modest $k$. Making $k$ larger leads to more accurate, but less efficient sampling. You can use $k=10$.

Your goal is to implement Gibbs sampling. Since it is essential for the training algorithm, you need to make sure that it is working correctly by testing your generated samples against analytic results. We can do this check for a small RBM.

To check your Gibbs sampling:

Setup an RBM with $N_h=2$ hidden neurons and $N_v=5$ visible neurons. Choose random initial states for the visible and hidden neurons. Choose random weights and biases.

For the given $v$, sample the hidden layer spins 100,000 times and store a histogram of the resulting spin configurations. To do this I did something like:

map<string,double> probs;
//loop 100,000 times
string myString=rbm.toString(rbm.hidden);
  if (probs.find(myString)==probs.end())
    probs.insert(make_pair(myString,0));
  probs[myString]+=1;
  norm+=1;

Compare this histogram with the analytical result for $p(h|v)$ obtained by brute-force calculation. My largest error was about 0.003.

For a given $h$, sample the visible layer spins 100,000 times and store a histogram of the results.
- Compare this with the analytical result for $p(v|h)$.
Starting from a random initial spin configuration $(v_0,h_0)$, perform $k=10$ iterations of Gibbs sampling and obtain the final spin configuration $(v_f, h_f)$. Repeat this 100,000 times and store the $(v_f,h_f)$ configurations into a histogram. Also make separate histograms of $v_f$ and $h_f$.
- Compare the $(v_f,h_f)$-histogram with the analytical result for $p(v,h)$.
- Compare the $v_f$ histogram with the analytic result for $p(v)$.
- Compare the $h_f$ histogram with the analytic result for $p(h)$
Run this a couple different times to make sure everything agrees.

Comment on implementation: When writing up your RBM, it is possible to perform all of the calculations using for loops. However, if you are willing, it is worth trying to perform the calculations with matrix-vector multiplications. This can greatly simplify your code, since nested for loops can be replaced with a few lines of matrix-vector manipulations, and can speed up your code significantly, since matrix-vector libraries are highly optimized. While the C++ standard library does not include matrix-vector functionality, there is a (relatively) easy-to-use library called Eigen that allows you to work with matrices in C++. A handy reference for Eigen commands, and their equivalent commands in Matlab, is available here.

[15%] Grading: Write Gibbs sampling code that samples the hidden and visible layers of your RBM. For a small $N_h=2$, $N_v=5$ RBM, compute the probability distributions $p(v,h), p(v), p(h), p(v|h)$ and $p(h|v)$ using Gibbs sampling and by brute-force calculation using the explicit formulas for these probabilities that we gave above. Make sure that the approximate RBM probabilities and the exact brute-force probabilities agree within 0.01.

Unsupervised Learning and Training an RBM

The task that restricted Boltzmann machines are typically used for is unsupervised learning. The goal of unsupervised learning is to train a machine to learn the probability distribution underlying a data set. To accomplish this goal, the machine is trained on unlabeled data, such as raw images without any captions or identifying information. By training on a large enough data set of high quality samples, the machine will hopefully be able to learn the correlations in the data set. Ultimately, we would like the machine to be able to generalize from the data it has seen. A well-trained unsupervised learning model should be able to generate new data samples that have the same features as the original training data. For this reason, such models are often called generative models.

Your next big step is to train your RBM to learn a probability distribution, called the data distribution, from a data set of samples of the distribution, called the training data. During training, the RBM's parameters, the weights $W_{ij}$ and biases $a_i,b_j$, are tuned so that the marginal distribution of the visible spins $p(v)$ matches the data distribution represented by the training data set. After successful training, sampling the visible layer of your RBM should (ideally) produces new samples that are also from the same underlying distribution of the training data.

In general it might be hard to exactly reproduce the data distribution, so instead we aim to get a probability distribution that is "close." Let us define a measure of the "close-ness" of two distributions $p(v)$ and $q(v)$ as

$$
O = \sum_v q(v) \log \left( \frac{q(v)}{p(v)} \right).
$$

In the information theory literature, $O$ is called the Kullback–Leibler divergence, or the relative entropy of the distributions. In our case, $p(v)$ is the marginal distribution of the RBM and $q(v)$ is the data distribution. The function $O$, which is a function of the RBM parameters since it depends on $p(v)$, is the objective function that we seek to optimize to learn the data distribution.

We will minimize $O$ locally via the method of gradient descent. Gradient descent finds local minima of an objective function using the function's derivatives. Specifically, it starts at a particular point in the objective function's parameter space, computes the objective function's derivatives with respect to its parameters at that point, and then "steps" to a new point in the parameter space using the derivatives' values. Because the derivatives decrease in magnitude as a minimum is approached, local minima are found by iterating the stepping procedure.

Gradient descent is performed by repeatedly updating the parameters as follows:
\[ \begin{align*} W_{ij} &\rightarrow W_{ij} - \eta \frac{\partial O}{\partial W_{ij}} \\ a_{i} &\rightarrow a_{i} - \eta \frac{\partial O}{\partial a_{i}} \\ b_{j} &\rightarrow b_{j} - \eta \frac{\partial O}{\partial b_{j}} \end{align*} \]
where $\eta$ is an arbitrary parameter called the learning rate, or step size. The learning rate parameter affects the convergence of gradient descent and is generally chosen through trial and error.

To perform gradient descent, we need to evaluate the gradients of the objective function $O$ with respect to the parameters. For our particular objective function, the relative entropy between the data distribution and the RBM marginal $p(v)$ distribution, the gradients turn out to have the following form
\[ \begin{align*} \frac{\partial O}{\partial W_{ij}} &= \langle v_i h_j\rangle_{\textrm{RBM}} - \langle v_i h_j\rangle_{\textrm{RBM}|\textrm{data}} \\ \frac{\partial O}{\partial a_{i}} &= \langle v_i\rangle_{\textrm{RBM}} - \langle v_i\rangle_{\textrm{RBM}|\textrm{data}} \\ \frac{\partial O}{\partial b_{j}} &= \langle h_j\rangle_{\textrm{RBM}} - \langle h_j\rangle_{\textrm{RBM}|\textrm{data}} \\ \end{align*} \]
where $\langle f(v,h) \rangle_{\textrm{RBM}|\textrm{data}} = \sum_{v,h} f(v,h) p(h|v) q(v)$ represents an average of the function $f$ over the RBM, given that the visible variables are distributed according to the training data distribution $q(v)$; and $\langle f(v,h) \rangle_{\textrm{RBM}} = \sum_{v,h} f(v,h) p(v,h)$ represents an average over the RBM's joint distribution $p(v,h)$ without any conditioning on the visible variables.

Training the RBM with gradient descent amounts to approximately computing these expectation values by performing Gibbs sampling and updating the RBM parameters using the approximate gradients.

A pseudocode of the gradient descent calculation is as follows:

Initialize the gradient variables myDeriv_weights=0; myDeriv_visible_bias=0; myDeriv_hidden_bias=0.
Sample a configuration $v$ from the data distribution $q(v)$. You can do this by randomly picking a data point from the training data set.
Sample $h$ from $p(h|v)$ as we did in the last section. Update the gradient variables:
- myDeriv_weights -= outerProduct(v,h)
- myDeriv_visible_bias -= v
- myDeriv_hidden_bias -= h
Perform $k$ iterations of Gibbs sampling: sample the visible and hidden variables back and forth $v_0 \rightarrow h_1 \rightarrow v_1 \rightarrow h_2 \rightarrow \cdots \rightarrow h_k \rightarrow v_k$. Using the final spin configuration $(v_k, h_k)$, update the gradient variables:
- myDeriv_weights += outerProduct(v,h)
- myDeriv_visible_bias += v
- myDeriv_hidden_bias += h
Repeat steps 2-4 $N$ times. Compute the averages of the gradients and update the weights and biases according to them.

Mini-batches

Suppose that our training data set is made up of $N$ samples $x_1,\ldots,x_N$. In practice, doing the averages over all $N$ samples gives us a really accurate estimate of the gradients, but is costly. In practice, we do not need such an accurate derivative. Therefore, we can average over smaller chunks, or "mini-batches", of the data. This is a tradeoff that results in lower accuracy but greater efficiency. In practice, what many people do is chose an arbitrary "mini-batch size" $M$, something like 100 that includes a statistically useful number of samples. The easiest way to make mini-batches is to first shuffle the entire data set and then separate the data into $N/M$ mini-batches so that $ x_1 $ through $x_M$ are in mini-batch 1, $x_{M+1}$ through $x_{2M}$ are in mini-batch 2, etc.

The Training Algorithm

To summarize, the entire training algorithm is as follows:

Initialize the weights $W_{ij}$ and biases $a_i,b_j$ randomly.
For each step (or epoch), loop over all of the $N$ data points in the training data set:
- For each mini-batch, loop over $M$ data points in the data set:
  - For the data points in the mini-batch, compute the average of the gradients $\partial O/\partial W_{ij},\partial O/\partial a_{i},\partial O/\partial b_{j}$ using Gibbs sampling.
- Update the weights and biases using the average gradients.
  \[ \begin{align*} W_{ij} &\rightarrow W_{ij} - \eta \partial O/\partial W_{ij} \\ a_{i} &\rightarrow a_{i} - \eta \partial O/\partial a_{i} \\ b_{j} &\rightarrow b_{j} - \eta \partial O/\partial b_{j} \end{align*} \]

Testing

To convince yourself that you have a working RBM code, you need to perform some tests. A good way to test a complicated method such as the RBM training algorithm is to run the algorithm on a small example that you can check by hand or with another method. For example, you can define an arbitrary probability distribution of three binary variables and train an RBM with $N_v=3$ visible spins to match the distribution.

Here's one way to go about producing an arbitrary probability distribution:

 std::random_device rd;
 std::mt19937 mt;
 vector<int> ranInts;
 std::uniform_int_distribution<> ran_int(0,100);
 for (//loop over binary numbers in visible layers)
      ranInts.push_back(ran_int(mt));
    }
  // At this point ranInts of [2,1,4,2] means there is a 2/8 chance of 00, a 1/8 chance of 01, a 4/8 chance of 10, and a 2/8 chance of 11
  std::discrete_distribution<> d(ranInts.begin(),ranInts.end()); 
  myVisibleLayer=binary_nums[d(mt)]; //assuming binary_nums is a vector with binary numbers in it
 // this is the probability distribution you are training over.  

You should be able to analytically compute the probability distribution and sample it and check that it matches the RBM $p(v)$ distribution after the RBM is trained.

Some things you can check to debug your code:

Compute the value of the objective function at each iteration of the gradient descent. You should expect the objective function to decrease with each iteration.
Compute the gradients using the finite difference approximation $\partial O/ W_{ij} \approx (O(W_{ij} + \delta) - O(W_{ij}))/\delta$ for small $\delta$ and check that the finite difference gradients approximately match the estimated gradients obtained from Gibbs sampling when using a lot of samples.

For comparison, I used the following parameters for my RBM training:

$k=1$ (this is a typical but non-trivial approximation)
$\eta=0.0001$
I performed 1,000,000 steps of gradient descent.
I used 3 visible and 5 hidden neurons.

The difference between my two probability distributions on any of the eight possibilities was $\leq 0.01$. Your mileage may vary but should be similar.

[15%] Grading: Implement the RBM training algorithm. Using this algorithm, train your RBM and show that it can successfully learn a simple toy probability distribution.

Training your RBM on MNIST

Finally, we will use our RBM code to tackle a more realistic and difficult engineering task: learning hand-written digits. After training, your RBM will have learned some general features of hand-written digits and be able to generate its own convincing drawings of numbers. In addition to being used as a generative model, a trained RBM can also be useful for anomaly detection, for example, checking if a particular image is likely to be from the data distribution or from another distribution. In our case, that means that $p(v_{digit}) \gg p(v_{letter})$ for our RBM if $v_{digit}$ is an image of a hand-written digit and $v_{letter}$ is an image of a hand-written letter. RBMs can also be used as "feature-detectors" that supplement supervised learning methods in machine learning.

The training data set that we will use to train our RBM is the MNIST database. The MNIST hand-written digit database is a dataset of images and labels. Each image is a black-and-white picture of a hand-written number between $0$ and $9$. The label indicates which of these numbers the image corresponds to. Since we are doing unsupervised learning, we will ignore the labels and only use the images of the digits themselves as our training data set for the RBM.

You can download the dataset and read it into your C++ code using code on this page. When you read in the MNIST data, divide everything by 256 to get things scaled between 0 and 1. Then apply a threshold of $0.5$ to all of the pixel values to convert them into binary numbers.

Your goal is to train your RBM on the MNIST data set. This will require some careful tuning of the algorithm parameters. Here are some suggested RBM parameters that you can use:

Initialize the weights $W_{ij}$ and the visible biases $a_i$ with random numbers from a uniform distribution between $-0.01$ and $0.01$. Initialize the hidden biases $b_j$ to zero. (It seems that the training can get stuck in a bad local minimum if you initialize the hidden biases to be too large.)
$N_v=28\times 28=784$ visible spins, $N_h=50$ hidden spins.
Mini-batch size: $M=200$.
Epochs of training (number of times iterating over the entire dataset): 100.
Learning rate: $\eta=0.02$.

Here are some things you can do to debug your RBM training:

Compute the average of the RBM parameters and how much the weights and biases change at each step (or epoch) of training. If the training has converged, then the weights and biases will be not too large and will change very little at each step. If the training is failing, for example if the learning rate is too large, then the weights and biases might blow up and change dramatically at each step.
Train on a small piece of the entire dataset. For example, I used 5,000 samples from the MNIST dataset while testing. Then when you think things are working and you found good training parameters, rerun your calculations on the entire dataset.
Visualize the weights and biases. Since the visible spins $v_i$ form a 2D image, they can be relabeled as if they were on a 2D square grid $v_i \rightarrow v_{(x,y)}$, where $(x,y)$ is a coordinate in the grid. Therefore, for a given hidden spin $h_j$, the weights connecting to the visible spins $W_{ij}$ can be reshaped into a matrix $W_{ij} \rightarrow W^{j}_{(x,y)}$. These matrices can be visualized, say with matplotlib.pyplot.matshow in python. Visually, these weight matrices represent the correlations the hidden spins have with the visible spins. In a trained RBM, the weight matrices should look like "features" that contain elements of handwritten digits, such as straight and curved strokes. If your weights all look noisy or identical for all $j$, then probably the training is not working correctly.
Compute the objective function over part of the training data set and see that it decreases during training. Do this infrequently, say once every training step (epoch), since this is an expensive calculation.

If you are really stuck, a handy, put pretty technical, reference for RBM training can be found here.

[20%] Grading: Train an RBM on the MNIST handwritten digits data set. Show that your trained RBM is a generative model that can generate new images of digits that it has not seen before. Do this by initializing the visible spins $v$ to a random state and performing alternating Gibbs sampling $k=10000$ times. The resulting $v$ sampled from the RBM should look like a random hand-written digit. Repeat this 20 times to convince yourself that the RBM learned how to draw digits. (Don't worry if your resulting digits look noisy. That's normal. If you want to generate nicer looking images, you could plot $p(v|h)$ at the end of your Gibbs sampling. That should look like a smooth grayscale image.)

Optional: Training your RBM on an Ising Model

If you are interested, you can also train your RBM on Ising spin configurations.

We saw that RBMs, because of their special two-layer visible-hidden structure, are able to be efficiently trained. We also saw that they can learn complicated probability distributions, such as MNIST. This then leads us to the possibility of using RBMs to efficiently represent complicated probability distributions encountered in physics problems.

A recent idea, explored in this paper, is to learn the statistical mechanical properties of Ising models using RBMs. Since you have already worked with RBMs and Ising models, you have all the available tools ready to reproduce the results of this paper.

Another idea is to interpret how RBMs learn during training. Some argue that RBM learning is similar to the renormalization group (RG), which we discussed earlier. To examine this connection, you could do the following:

Using your Ising model code, sample spin configurations on a $27\times 27$ lattice and save them to a file. This will be your training data set.
Train an RBM with $N_v=27\times 27$ visible spins and $N_h=9 \times 9$ hidden spins on your training data set. Think of your hidden layer as a $9 \times 9$ Ising model with 81 hidden neurons. When you train your RBM, don't allow all the connections. Instead only allow non-zero weights between the $3 \times 3$ blocks that you used when coarse-graining your Ising model in the RG unit.
Once you've trained your Ising model, go ahead and look at the probability distribution of the hidden layer. What does it look like?