# ParMAC: distributed optimisation of nested functions, with application to learning binary autoencoders

@article{CarreiraPerpin2019ParMACDO, title={ParMAC: distributed optimisation of nested functions, with application to learning binary autoencoders}, author={Miguel {\'A}. Carreira-Perpi{\~n}{\'a}n and Mehdi Alizadeh}, journal={ArXiv}, year={2019}, volume={abs/1605.09114} }

Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to optimise such “nested” functions is the method of auxiliary coordinates (MAC) (Carreira-Perpiñán and Wang, 2014). MAC introduces an auxiliary coordinate for each data point in order to decouple the nested model into independent submodels. This decomposes the optimisation into steps that alternate…
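
The alternation that MAC sets up can be illustrated on the simplest possible case. The sketch below (all names and parameter values are illustrative, not from the paper) applies the idea to a two-layer *linear* model y ≈ W2 W1 x: the nested objective is replaced by a quadratic-penalty version with one auxiliary coordinate z_n per point, and the algorithm alternates a W-step, where each layer is an independent least-squares fit, with a Z-step, where each z_n decouples from the rest.

```python
import numpy as np

# Minimal MAC-style sketch for a two-layer linear model. The nested problem
#   min_{W1,W2} ||Y - W2 W1 X||^2
# is replaced by the penalised, decoupled problem
#   min_{W1,W2,Z} ||Y - W2 Z||^2 + mu ||Z - W1 X||^2
# with one auxiliary coordinate (column of Z) per data point.

rng = np.random.default_rng(0)
D, H, Dy, N = 8, 4, 3, 200
X = rng.normal(size=(D, N))
W1_true = rng.normal(size=(H, D))
W2_true = rng.normal(size=(Dy, H))
Y = W2_true @ W1_true @ X          # noiseless targets, exactly representable

W1 = rng.normal(size=(H, D))
W2 = rng.normal(size=(Dy, H))
Z = W1 @ X                          # initialise coordinates by a forward pass
mu = 1.0

for it in range(100):
    # W-step: with Z fixed, each layer is an independent least-squares fit.
    W1 = np.linalg.lstsq(X.T, Z.T, rcond=None)[0].T   # fit W1 X ≈ Z
    W2 = np.linalg.lstsq(Z.T, Y.T, rcond=None)[0].T   # fit W2 Z ≈ Y
    # Z-step: with the weights fixed, each z_n decouples; setting the
    # gradient to zero gives (W2^T W2 + mu I) z_n = W2^T y_n + mu W1 x_n,
    # solved for all n at once.
    A = W2.T @ W2 + mu * np.eye(H)
    Z = np.linalg.solve(A, W2.T @ Y + mu * (W1 @ X))

nested_err = np.linalg.norm(Y - W2 @ W1 @ X) / np.linalg.norm(Y)
print(f"relative nested error after MAC iterations: {nested_err:.3f}")
```

The point relevant to ParMAC is the structure, not the linear algebra: in the W-step the submodels are independent of each other, and in the Z-step the data points are independent of each other, which is what makes the computation distributable.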


#### 5 Citations

ParMAC: Distributed Optimisation

- 2019

Many powerful machine learning models are based on the composition of multiple processing layers, such as deep nets, which gives rise to nonconvex objective functions. A general, recent approach to…

LocoProp: Enhancing BackProp via Local Loss Optimization

- Computer Science
- ArXiv
- 2021

A local loss construction approach for optimizing neural networks is studied and it is shown that the construction consistently improves convergence, reducing the gap between first-order and second-order methods.

Fenchel Lifted Networks: A Lagrange Relaxation of Neural Network Training

- Computer Science, Mathematics
- AISTATS
- 2020

This model represents activation functions as equivalent biconvex constraints and uses Lagrange multipliers to arrive at a rigorous lower bound of the traditional neural network training problem.

Improving CTC Using Stimulated Learning for Sequence Modeling

- Computer Science
- ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2019

Connectionist temporal classification (CTC) is a sequence-level loss that has been successfully applied to train recurrent neural network (RNN) models for automatic speech recognition. However, one…

Training Deep Architectures Without End-to-End Backpropagation: A Brief Survey

- Computer Science, Mathematics
- ArXiv
- 2021

This tutorial paper surveys training alternatives to end-to-end backpropagation (E2EBP), the de facto standard for training deep architectures. These alternatives allow for greater modularity and transparency in deep learning workflows, aligning deep learning with mainstream computer science engineering, which heavily exploits modularization for scalability.

#### References

Showing 1–10 of 72 references

Distributed optimization of deeply nested systems

- Computer Science, Mathematics
- AISTATS
- 2014

This work describes a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC), which replaces the original problem involving a deeply nested function with a constrained problem in an augmented space without nesting.

Large Scale Distributed Deep Networks

- Computer Science
- NIPS
- 2012

This paper considers the problem of training a deep network with billions of parameters using tens of thousands of CPU cores and develops two algorithms for large-scale distributed training, Downpour SGD and Sandblaster L-BFGS, which increase the scale and speed of deep network training.

Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers

- Computer Science
- Found. Trends Mach. Learn.
- 2011

It is argued that the alternating direction method of multipliers is well suited to distributed convex optimization, and in particular to large-scale problems arising in statistics, machine learning, and related areas.
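
As a concrete illustration, the sketch below implements ADMM on the lasso, the running example in Boyd et al.'s monograph: min (1/2)‖Ax − b‖² + λ‖z‖₁ subject to x = z. Problem sizes and the parameter values λ, ρ are illustrative, not tuned.

```python
import numpy as np

# ADMM for the lasso:  min (1/2)||Ax - b||^2 + lam*||z||_1  s.t.  x = z.
# Alternates an x-update (a linear solve), a z-update (soft-thresholding,
# the prox of the l1 term), and a scaled dual update.

rng = np.random.default_rng(3)
m, n = 100, 30
A = rng.normal(size=(m, n))
x_true = np.zeros(n)
x_true[:5] = rng.normal(size=5)          # sparse ground-truth signal
b = A @ x_true                            # noiseless observations

lam, rho = 0.1, 1.0
x = np.zeros(n); z = np.zeros(n); u = np.zeros(n)
AtA, Atb = A.T @ A, A.T @ b
# The x-update solves the same linear system every iteration, so factor once.
L = np.linalg.cholesky(AtA + rho * np.eye(n))

def soft_threshold(v, k):
    return np.sign(v) * np.maximum(np.abs(v) - k, 0.0)

for it in range(200):
    rhs = Atb + rho * (z - u)
    x = np.linalg.solve(L.T, np.linalg.solve(L, rhs))   # x-update
    z = soft_threshold(x + u, lam / rho)                # z-update (prox)
    u = u + x - z                                       # scaled dual ascent

rel = np.linalg.norm(z - x_true) / np.linalg.norm(x_true)
print(f"relative error of the ADMM solution: {rel:.3f}")
```

The splitting is what makes ADMM attractive for distributed settings: the quadratic term and the non-smooth ℓ1 term are handled in separate, simple subproblems that communicate only through the dual variable.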

1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs

- Computer Science
- INTERSPEECH
- 2014

This work shows empirically that in SGD training of deep neural networks, one can, at no or nearly no loss of accuracy, quantize the gradients aggressively—to but one bit per value—if the quantization error is carried forward across minibatches (error feedback), and implements data-parallel deterministically distributed SGD by combining this finding with AdaGrad.
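
The error-feedback mechanism is simple to state in code. The sketch below (not the paper's implementation; names and the per-tensor scaling rule are illustrative) quantizes each gradient to its sign, scaled to preserve the mean magnitude, and folds the quantization residual into the next minibatch's gradient before quantizing again.

```python
import numpy as np

# 1-bit gradient quantization with error feedback: each value is sent as
# +scale or -scale (one bit plus a shared per-tensor scale), and the
# residual g - q is carried into the next step instead of being discarded.

def one_bit_quantize(grad, error):
    """Return (quantized gradient, new residual error)."""
    g = grad + error                  # fold in the carried-over residual
    scale = np.mean(np.abs(g))        # one shared scale for the whole tensor
    q = scale * np.sign(g)            # 1 bit per value: +scale / -scale
    return q, g - q                   # residual carried to the next step

rng = np.random.default_rng(1)
error = np.zeros(1000)
true_sum = np.zeros(1000)
quant_sum = np.zeros(1000)
for step in range(500):
    grad = rng.normal(size=1000)      # stand-in for a minibatch gradient
    true_sum += grad
    q, error = one_bit_quantize(grad, error)
    quant_sum += q

# With error feedback the accumulated quantized gradients track the true
# accumulated gradients exactly up to the final carried residual:
# quant_sum = true_sum - error, so the drift never grows with the number
# of steps.
drift = np.linalg.norm(true_sum - quant_sum)
print(f"drift after 500 steps: {drift:.2f}")
```

Without the `error` term the quantization noise would accumulate across minibatches; with it, each step's noise is corrected on the next step, which is why such aggressive compression costs little accuracy.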

A fast, universal algorithm to learn parametric nonlinear embeddings

- Computer Science, Mathematics
- NIPS
- 2015

Using the method of auxiliary coordinates, a training algorithm is derived that alternates steps that train an auxiliary embedding with steps that train the mapping; it can reuse N-body methods developed for nonlinear embeddings, yielding linear-time iterations.

Petuum: A New Platform for Distributed Machine Learning on Big Data

- Computer Science
- IEEE Transactions on Big Data
- 2015

This work proposes a general-purpose framework, Petuum, that systematically addresses data- and model-parallel challenges in large-scale ML, by observing that many ML programs are fundamentally optimization-centric and admit error-tolerant, iterative-convergent algorithmic solutions.

Learning both Weights and Connections for Efficient Neural Network

- Computer Science
- NIPS
- 2015

A method that reduces the storage and computation required by neural networks by an order of magnitude, without affecting their accuracy, by learning only the important connections and pruning redundant connections with a three-step method.

Distributed Coordinate Descent Method for Learning with Big Data

- Computer Science, Mathematics
- J. Mach. Learn. Res.
- 2016

This paper develops and analyzes Hydra: HYbriD cooRdinAte descent method for solving loss minimization problems with big data, and gives bounds on the number of iterations sufficient to approximately solve the problem with high probability.

Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent

- Computer Science, Mathematics
- NIPS
- 2011

This work shows, using novel theoretical analysis, algorithms, and implementation, that SGD can be implemented without any locking, and presents an update scheme called HOGWILD! which allows processors access to shared memory with the possibility of overwriting each other's work.
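
The lock-free idea can be demonstrated in a few lines. The sketch below (a toy illustration, not the paper's code; all names and sizes are made up) runs four threads of SGD on a shared parameter vector for sparse least-squares with no locking: updates may overwrite each other, but because each example touches only a few coordinates, conflicts are rare and the result is barely affected.

```python
import threading
import numpy as np

# HOGWILD!-style lock-free SGD on sparse linear regression: four threads
# update a shared weight vector with no synchronization at all.

rng = np.random.default_rng(2)
N, D = 2000, 50
X = rng.normal(size=(N, D)) * (rng.random((N, D)) < 0.1)   # ~10% nonzeros
w_true = rng.normal(size=D)
y = X @ w_true                      # noiseless labels
w = np.zeros(D)                     # shared parameters, updated lock-free
lr = 0.05

def worker(rows):
    # A few passes over this thread's rows; each update reads and writes
    # only the coordinates where the example is nonzero.
    for _ in range(10):
        for i in rows:
            x = X[i]
            nz = np.nonzero(x)[0]
            if nz.size == 0:
                continue
            g = (w[nz] @ x[nz] - y[i]) * x[nz]   # sparse LS gradient
            w[nz] -= lr * g                      # unsynchronized write

threads = [threading.Thread(target=worker, args=(range(t, N, 4),))
           for t in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

err = np.linalg.norm(w - w_true) / np.linalg.norm(w_true)
print(f"relative parameter error after lock-free SGD: {err:.3f}")
```

The sparsity of the examples is essential to the argument: two concurrent updates collide only when their examples share a nonzero coordinate, which is what keeps the lock-free races harmless.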

Asynchronous stochastic gradient descent for DNN training

- Computer Science
- 2013 IEEE International Conference on Acoustics, Speech and Signal Processing
- 2013

This paper describes asynchronous stochastic gradient descent (ASGD), an effective approximation of backpropagation used to parallelize computation across multiple GPUs; it achieves a 3.2× speed-up on 4 GPUs over a single GPU, without any loss in recognition performance.