Labs Q&A
Lab 4
Q1. Is it sufficient to perform alternating Gibbs sampling in the top RBM for only a couple of iterations before running the network top to bottom for generative purposes?
No, you should run it for as many as 200 iterations. Importantly, remember to clamp (fix) the label unit corresponding to the digit you wish to generate samples of throughout all 200 iterations.
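For illustration, here is a minimal sketch of clamped alternating Gibbs sampling in the top RBM, written in plain NumPy. The names, layer sizes, and random weights are placeholder assumptions (a top RBM whose visible layer concatenates 500 penultimate units with 10 label units is assumed, not prescribed by the lab); the essential point is that the one-hot label is never resampled.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample(p):
    return (rng.random(p.shape) < p).astype(float)

# Hypothetical top-RBM shapes: 500 penultimate + 10 label visible units, 2000 hidden.
n_pen, n_lab, n_top = 500, 10, 2000
W = rng.normal(0, 0.01, size=(n_pen + n_lab, n_top))  # stand-in for trained weights
b_vis = np.zeros(n_pen + n_lab)
b_hid = np.zeros(n_top)

def generate_top_state(label, n_gibbs=200):
    """Alternating Gibbs sampling with the label unit clamped throughout."""
    v = sample(np.full(n_pen, 0.5))  # random start for the penultimate units
    lab = np.zeros(n_lab)
    lab[label] = 1.0                 # one-hot label, clamped for all iterations
    for _ in range(n_gibbs):
        vis = np.concatenate([v, lab])
        h = sample(sigmoid(vis @ W + b_hid))
        p_vis = sigmoid(h @ W.T + b_vis)
        v = sample(p_vis[:n_pen])    # resample ONLY the non-label units
    return v  # then propagate top-down through the directed layers to get an image
```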
Q2. What is the expected recognition performance for the DBN without fine tuning?
It should be around 80-90%.
Q3. Why is a trained DBN with satisfactory recognition performance poor in generating samples?
It is common that, despite reasonable classification accuracy, the DBN is not well trained (for example, not trained for long enough), and this shows up as poor generation performance. Bear in mind that recognition only needs to predict 10 units, while generation has to predict 784 units. In this case, the training of each RBM should be improved with momentum, weight decay, and more training epochs. Usually, if the recognition performance is above 80%, generation should also work, producing at least a few digits legibly, if not all of them.
Q4. Should gradient descent be equipped with some extra regularisation mechanisms for robust RBM learning?
If you train your weights with plain gradient descent alone (the weight update is the learning rate times the difference between the data correlation and the model correlation), you will certainly observe learning in the RBM, but it may not be sufficiently robust. It is useful to implement weight decay as a regulariser, as well as momentum, which smooths the gradient. The effect of these additions is reflected, among other things, in the weight distribution (see the visualisation suggested in Q6).
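A minimal sketch of what such an update can look like, assuming plain NumPy; the function name and the hyperparameter values are illustrative, not prescribed by the lab:

```python
import numpy as np

def cd_weight_update(W, velocity, pos_corr, neg_corr,
                     lr=0.1, momentum=0.9, weight_decay=1e-4):
    """Weight update: (data correlation - model correlation) with L2 weight
    decay, smoothed by a momentum term carried in `velocity`."""
    grad = (pos_corr - neg_corr) - weight_decay * W
    velocity = momentum * velocity + lr * grad
    return W + velocity, velocity

# Illustrative initialisation; pos_corr and neg_corr would come from the
# positive and negative phase statistics of contrastive divergence.
W = np.random.default_rng(0).normal(0, 0.01, size=(784, 500))
velocity = np.zeros_like(W)
```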
Q5. What is the cost function that RBM minimises? Is it a reconstruction loss?
No, it is not a reconstruction loss (even though you will find one in the code). The reconstruction loss only provides some insight into the training process; what the RBM actually minimises is an energy function. So even if the reconstruction loss is satisfactorily low (it tends to drop relatively quickly), this does not imply that you have reached the desired energy state. The reconstruction loss is therefore not the metric on which to base the decision to stop training. It is more informative to monitor the convergence of the weights and biases themselves.
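For reference, a minimal sketch of the energy of a joint configuration and of a simple statistic for monitoring weight convergence; the formula is the standard RBM energy, while the function names and plain-NumPy style are my own assumptions:

```python
import numpy as np

def rbm_energy(v, h, W, b_vis, b_hid):
    """Energy of a joint configuration: E(v, h) = -b_vis.v - b_hid.h - v.W.h"""
    return -(v @ b_vis) - (h @ b_hid) - (v @ W @ h)

def mean_weight_change(W_prev, W_curr):
    """Average absolute change per weight between epochs; a value that keeps
    shrinking suggests the weights are converging."""
    return np.abs(W_curr - W_prev).mean()
```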
Q6. How to monitor the effects of training RBM with MNIST data?
As mentioned in Q5, monitoring the reconstruction error does not provide sufficient insight. It is recommended that you also visualise the weights, which is easy since the weights arriving at each hidden unit can be plotted in the input image space. Make sure the weights are smooth rather than grainy and appear as small blobs (either positive or negative). This intuition is lost when moving to higher RBMs, so potential problems should be fixed while training the first RBM (it is connected directly to the input, which allows an intuitive reading of its weights in image space).
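A minimal plotting sketch, assuming the weight matrix W has shape (784, n_hidden) with at least 25 hidden units; the use of matplotlib and the 5x5 grid are arbitrary choices:

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_receptive_fields(W, grid=(5, 5), img_shape=(28, 28)):
    """Show the incoming weights of the first grid[0]*grid[1] hidden units
    as images in the input space."""
    fig, axes = plt.subplots(*grid, figsize=(6, 6))
    for j, ax in enumerate(axes.ravel()):
        ax.imshow(W[:, j].reshape(img_shape), cmap='gray')
        ax.axis('off')
    plt.tight_layout()
    plt.show()
```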
Q7. What is the expected range of weight values?
All weights should lie approximately within the range of -5 to 5, so anything far outside this range may be a cause for concern.
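A quick check is a histogram of all weight values; the random W below is only a stand-in so that the snippet runs, substitute your trained matrix:

```python
import numpy as np
import matplotlib.pyplot as plt

W = np.random.default_rng(0).normal(0, 1.0, size=(784, 500))  # stand-in weights

plt.hist(W.ravel(), bins=100)
plt.xlabel('weight value')
plt.ylabel('count')
plt.show()
```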
Q8. What should be computed in the process of mini-batch training based on the collected data and model correlations?
Make sure you compute the mean (average) over the mini-batch, not the sum! Otherwise you will inadvertently multiply each update by the mini-batch size.
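A sketch of the averaged correlation (variable names are illustrative):

```python
import numpy as np

def batch_correlation(v, h):
    """Mean outer product <v h> over a mini-batch.
    v: (batch_size, n_visible), h: (batch_size, n_hidden).
    Note: v.T @ h alone would SUM over the batch, scaling the update
    by batch_size; dividing by v.shape[0] gives the mean."""
    return (v.T @ h) / v.shape[0]
```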
Q9. Is it really so important when probabilities and when samples are used in DBN computations?
It is vital to know when to use probabilities and when to use samples, both for robust performance and for reasonable training time. As a general rule, probabilities are preferred wherever permissible, since they retain the most information, whereas sampling introduces stochasticity, and we may need to run the network longer to recover the underlying probabilities. On the other hand, probabilities cannot be used everywhere, and in those places sampled activations are used instead (Section 3 of the "Practical Guide to RBM" and the lab presentation under Lab 4 cover this in detail).
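Below is a CD-1 sketch marking where samples versus probabilities are conventionally used, in the spirit of Section 3 of the guide; the NumPy implementation and the names are my own illustration, not the lab code:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
sample = lambda p: (rng.random(p.shape) < p).astype(float)

def cd1_gradient(v0, W, b_vis, b_hid):
    """One CD-1 pass; v0: (batch, n_visible) binary data."""
    ph0 = sigmoid(v0 @ W + b_hid)
    h0 = sample(ph0)                  # SAMPLE: hidden states driving the reconstruction
    pv1 = sigmoid(h0 @ W.T + b_vis)   # PROBABILITIES: visible reconstruction
    ph1 = sigmoid(pv1 @ W + b_hid)    # PROBABILITIES: final hidden pass, no sampling needed
    pos = v0.T @ ph0 / v0.shape[0]    # data correlation (probabilities reduce noise)
    neg = pv1.T @ ph1 / v0.shape[0]   # model correlation
    return pos - neg                  # weight-gradient estimate
```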