Event Abstract

Contrastive Divergence Learning May Diverge When Training Restricted Boltzmann Machines

  • 1 Ruhr-Universität Bochum, Bernstein Center for Computational Neuroscience, Germany
  • 2 Ruhr-Universität Bochum, Institut für Neuroinformatik, Germany

Understanding and modeling how brains learn higher-level representations from sensory input is one of the key challenges in computational neuroscience and machine learning. Layered generative models such as deep belief networks (DBNs) are promising for unsupervised learning of such representations, and new algorithms that operate in a layer-wise fashion make learning these models computationally tractable [1-5]. Restricted Boltzmann Machines (RBMs) are the typical building blocks of DBN layers. They are undirected graphical models whose structure is a bipartite graph connecting input (visible) and hidden neurons. Training large undirected graphical models by likelihood maximization in general involves averages over an exponential number of terms, and obtaining unbiased estimates of these averages by Markov chain Monte Carlo methods typically requires many sampling steps. Recently, however, it was shown that estimates obtained after running the chain for just a few steps can be sufficient for model training [3]. In particular, gradient ascent on the k-step Contrastive Divergence (CD-k), a biased estimator of the log-likelihood gradient based on k steps of Gibbs sampling, has become the most common way to train RBMs [1-5].
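To make the CD-k estimator concrete, the following minimal NumPy sketch computes the CD-k gradient estimates for a binary RBM with sigmoid units, assuming block Gibbs sampling between the visible and hidden layer. The function and variable names (cd_k, W, b, c) are illustrative and are not taken from the abstract or the cited papers.

    # Minimal sketch of CD-k for a binary RBM (NumPy only).
    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def sample_bernoulli(p):
        # Draw binary states with the given activation probabilities.
        return (rng.random(p.shape) < p).astype(float)

    def cd_k(v0, W, b, c, k=1):
        """Return CD-k estimates of the log-likelihood gradients.

        v0 : batch of visible vectors, shape (n, n_visible)
        W  : weights, shape (n_visible, n_hidden)
        b  : visible biases, c : hidden biases
        """
        # Positive phase: hidden probabilities given the data.
        ph0 = sigmoid(v0 @ W + c)

        # k steps of block Gibbs sampling, starting from the data.
        vk = v0
        for _ in range(k):
            hk = sample_bernoulli(sigmoid(vk @ W + c))
            vk = sample_bernoulli(sigmoid(hk @ W.T + b))
        phk = sigmoid(vk @ W + c)

        # Biased approximation of <v h>_data - <v h>_model.
        n = v0.shape[0]
        dW = (v0.T @ ph0 - vk.T @ phk) / n
        db = (v0 - vk).mean(axis=0)
        dc = (ph0 - phk).mean(axis=0)
        return dW, db, dc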

Contrastive Divergence learning does not necessarily reach the maximum likelihood estimate of the parameters (e.g., because of the bias). However, we show that the situation is much worse. We demonstrate empirically that for some benchmark problems taken from the literature [6], CD learning systematically leads to a steady decrease of the log-likelihood after an initial increase (see supplementary Figure 1). This seems to happen especially when learning more complex distributions, which are exactly the targets when RBMs are used within DBNs. The reason for the decreasing log-likelihood is a growth of the model parameter magnitudes. The estimation bias depends on the mixing rate of the Markov chain, and it is well known that mixing slows down as the magnitude of the model parameters grows [1,3]. Weight decay can therefore solve the problem if the strength of the regularization term is adjusted correctly: if chosen too large, learning is not accurate enough; if chosen too small, learning still diverges. For large k, the effect is less pronounced. Increasing k, as suggested in [1] for finding parameters with higher likelihood, may therefore prevent divergence. However, divergence occurs even for values of k that are too large to be computationally tractable for large models. Thus, a dynamic schedule for controlling k is needed.
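As an illustration of the quantities discussed above, the sketch below adds a weight-decay term to a plain CD-k update and monitors both the exact log-likelihood (feasible only for small models, since it enumerates all visible configurations) and the largest weight magnitude, whose growth accompanies the divergence. It reuses the cd_k function from the sketch above; the hyperparameter values are illustrative and not those used in the reported experiments.

    # Sketch of CD-k training with weight decay and an exact log-likelihood monitor.
    import itertools
    import numpy as np

    def free_energy(v, W, b, c):
        # F(v) = -b.v - sum_j softplus(c_j + v.W[:, j])
        return -v @ b - np.logaddexp(0.0, v @ W + c).sum(axis=-1)

    def exact_log_likelihood(data, W, b, c):
        # Enumerate all 2^n_visible configurations; only feasible for small RBMs.
        n_visible = W.shape[0]
        all_v = np.array(list(itertools.product([0, 1], repeat=n_visible)), dtype=float)
        log_z = np.logaddexp.reduce(-free_energy(all_v, W, b, c))
        return (-free_energy(data, W, b, c) - log_z).mean()

    def train(data, n_hidden=8, k=1, lr=0.1, weight_decay=1e-4, epochs=2000, seed=0):
        rng = np.random.default_rng(seed)
        n_visible = data.shape[1]
        W = 0.01 * rng.standard_normal((n_visible, n_hidden))
        b = np.zeros(n_visible)
        c = np.zeros(n_hidden)
        for epoch in range(epochs):
            dW, db, dc = cd_k(data, W, b, c, k=k)   # cd_k as sketched above
            W += lr * (dW - weight_decay * W)       # weight decay limits parameter growth
            b += lr * db
            c += lr * dc
            if epoch % 100 == 0:
                # Watch for the initial increase and later decrease of the log-likelihood,
                # together with the growing weight magnitudes.
                print(epoch, exact_log_likelihood(data, W, b, c), np.abs(W).max())
        return W, b, c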

References

1. Bengio Y, Delalleau O. Justifying and Generalizing Contrastive Divergence. Neural Computation 21(6):1601-1621, 2009

2. Bengio Y, Lamblin P, Popovici D, Larochelle H. Greedy layer-wise training of deep networks. Advances in Neural Information Processing Systems 19 (NIPS), pp. 153-160, MIT Press, 2007

3. Hinton GE. Training products of experts by minimizing contrastive divergence. Neural Computation 14(8):1771-1800, 2002

4. Hinton GE. Learning multiple layers of representation. Trends in Cognitive Sciences 11(10):428-434, 2007

5. Hinton GE, Osindero S, Teh YW. A fast learning algorithm for deep belief nets. Neural Computation 18(7):1527-1554, 2006

6. MacKay DJC. Information Theory, Inference, and Learning Algorithms. Cambridge University Press, 2003

Conference: Bernstein Conference on Computational Neuroscience, Frankfurt am Main, Germany, 30 Sep - 2 Oct, 2009.

Presentation Type: Poster Presentation

Topic: Abstracts

Citation: Fischer A and Igel C (2009). Contrastive Divergence Learning May Diverge When Training Restricted Boltzmann Machines. Front. Comput. Neurosci. Conference Abstract: Bernstein Conference on Computational Neuroscience. doi: 10.3389/conf.neuro.10.2009.14.121

Copyright: The abstracts in this collection have not been subject to any Frontiers peer review or checks, and are not endorsed by Frontiers. They are made available through the Frontiers publishing platform as a service to conference organizers and presenters.

The copyright in the individual abstracts is owned by the author of each abstract or his/her employer unless otherwise stated.

Each abstract, as well as the collection of abstracts, are published under a Creative Commons CC-BY 4.0 (attribution) licence (https://creativecommons.org/licenses/by/4.0/) and may thus be reproduced, translated, adapted and be the subject of derivative works provided the authors and Frontiers are attributed.

For Frontiers’ terms and conditions please see https://www.frontiersin.org/legal/terms-and-conditions.

Received: 27 Aug 2009; Published Online: 27 Aug 2009.

* Correspondence: Christian Igel, Ruhr-Universität Bochum, Bernstein Center for Computational Neuroscience, Bochum, Germany, christian.igel@neuroinformatik.rub.de