Large-scale convex optimization: parallelization and variance reduction

TRAORÉ, MOHAMED CHEIK IBRAHIM
2024-06-10

Abstract

In this work, we investigate two aspects of large-scale optimization for convex functions defined on an infinite-dimensional separable Hilbert space: parallelized methods and incremental methods. These methods are used to efficiently solve problems arising in data science, especially in machine learning and inverse problems.

In parallelized optimization methods, the computational load of running the algorithm is distributed among several workers. For example, if the algorithm involves a gradient computation, one can assign each worker a coordinate of the gradient to compute and then assemble the results. A parallelized algorithm is called synchronous if there is a synchronization phase in which the local information of all workers is updated; it is called asynchronous if there is no such phase. In practice, asynchronous implementations are preferred to synchronous ones, but their analysis has to account for delayed information, which is modeled by a delay vector. In this work, we study an asynchronous version of random block coordinate descent, where only one randomly selected coordinate is used at each iteration. We consider a version in which the selection probabilities of the coordinates are arbitrary, in contrast to what is done in the literature for asynchronous algorithms, and we allow a coordinate-wise stepsize rule. Under a convexity assumption, we prove weak convergence of the iterates and a sublinear convergence rate. Assuming an additional error bound condition, we prove a linear convergence rate and strong convergence of the iterates. In both cases, the dependence of the rates on the delay vector is linear.

Incremental optimization methods are iterative algorithms used to minimize a function defined as a finite sum of functions; the function is minimized by using one summand at each iteration instead of the whole sum. We are interested in the case where the summand is chosen at random, which leads to stochastic algorithms such as stochastic gradient descent (SGD) or the stochastic proximal point algorithm (SPPA). While they are cheaper than their deterministic counterparts in terms of computation and memory, stochastic methods suffer from slower convergence rates, mainly because of the variance introduced by the random sampling. Variance reduction techniques have therefore been used in the literature to recover the rates of the deterministic algorithms; these techniques were first applied to stochastic gradient methods. In our work, we are instead concerned with SPPA, which has recently been studied and shown to be more stable than SGD. We focus on the application of variance reduction methods to SPPA and, in particular, introduce a general variance reduction scheme for SPPA. Many variance-reduced SPPA-based algorithms can be recovered from this scheme, mimicking those that already exist for SGD (SVRG, SAGA, etc.). We recover the standard sublinear (respectively, linear) convergence rates of the proximal point algorithm (PPA) when the stepsize is constant and the function is convex (respectively, convex and satisfying the Polyak-Łojasiewicz condition).
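To make the first family of methods concrete, the following minimal Python sketch shows a serial randomized coordinate descent step with arbitrary selection probabilities and coordinate-wise stepsizes. The asynchronous delays analyzed in the thesis are omitted, and all names (random_bcd, probs, stepsizes) are illustrative choices for this sketch, not taken from the manuscript.

```python
import numpy as np

def random_bcd(grad, x0, probs, stepsizes, n_iter=1000, seed=0):
    """Serial sketch of randomized coordinate descent with arbitrary
    selection probabilities and coordinate-wise stepsizes.

    grad      : callable returning the full gradient of a smooth convex f
    probs     : selection probability of each coordinate (sums to 1)
    stepsizes : one stepsize per coordinate, e.g. 1 / L_i for a
                coordinate-wise Lipschitz constant L_i
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(n_iter):
        i = rng.choice(len(x), p=probs)       # pick one coordinate at random
        x[i] -= stepsizes[i] * grad(x)[i]     # update only that coordinate
    return x

# Example: least squares 0.5 * ||A x - b||^2; the coordinate-wise Lipschitz
# constants are the diagonal entries of A^T A, here 4 and 1.
A = np.array([[2.0, 0.0], [0.0, 1.0]])
b = np.array([1.0, 1.0])
grad = lambda x: A.T @ (A @ x - b)
x_approx = random_bcd(grad, np.zeros(2), probs=[0.5, 0.5],
                      stepsizes=[1 / 4.0, 1.0], n_iter=2000)
```

For the second family, an SVRG-flavoured instance of a variance-reduced stochastic proximal point iteration can be sketched as below. The thesis introduces a general scheme from which such variants can be recovered; this snippet only illustrates the common pattern of subtracting a snapshot-based control variate before a proximal step on a single summand, and the callables prox_fi and grad_fi, as well as the parameter names, are assumptions made for the sketch.

```python
import numpy as np

def svrg_sppa(prox_fi, grad_fi, x0, n, gamma, n_epochs=20, seed=0):
    """SVRG-flavoured variance-reduced stochastic proximal point sketch.

    prox_fi(i, v, gamma) : proximal operator of the i-th summand f_i at v
    grad_fi(i, x)        : gradient of f_i at x
    A full gradient at a snapshot point acts as a control variate that is
    subtracted before the proximal step on a single randomly chosen f_i.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(n_epochs):
        w = x.copy()                                           # snapshot point
        full_grad = np.mean([grad_fi(i, w) for i in range(n)], axis=0)
        for _ in range(n):                                     # inner loop
            i = rng.integers(n)
            v = x - gamma * (grad_fi(i, w) - full_grad)        # control variate shift
            x = prox_fi(i, v, gamma)                           # proximal step on f_i
    return x
```

For a finite sum of least-squares terms f_i(x) = 0.5 * (a_i^T x - b_i)^2, the proximal operator of each f_i has a closed form, so both sketches can be tested on the same toy problem.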
Keywords: Convex optimization; asynchronous algorithms; randomized block-coordinate descent; error bounds; stochastic quasi-Fejér sequences; forward-backward algorithm; convergence rates; stochastic optimization; proximal point algorithm; variance reduction techniques; SVRG; SAGA
Files in this record:
phdunige_4961486.pdf — doctoral thesis (open access), 5.39 MB, Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11567/1177695