Research on the Proximal Gradient Method for Composite Optimization Problems under Generalized Smoothness Assumptions

Lin Yang; Na Xian

doi:10.4236/jamp.2026.144076

Journal of Applied Mathematics and Physics > Vol.14 No.4, April 2026

Research on the Proximal Gradient Method for Composite Optimization Problems under Generalized Smoothness Assumptions

Lin Yang, Na Xian^*
Key Laboratory of Optimization Theory and Applications, School of Mathematical Sciences, China West Normal University, Nanchong, China.
DOI: 10.4236/jamp.2026.144076 PDF HTML XML 33 Downloads 203 Views

Abstract

The proximal gradient method (PGD) is an important approach for solving composite optimization problems consisting of the sum of a smooth function and a nonsmooth function. Classical convergence analysis of PGD typically assumes that the smooth function has a globally Lipschitz continuous gradient. In recent years, researchers have relaxed this assumption from various perspectives, thereby providing theoretical support for the application of PGD to more general problems. In particular, for unconstrained smooth optimization problems, Li et al. introduced the concept of $ℓ (\cdot)$ -smoothness, studied the convergence rates of classical gradient methods, and showed that under this generalized smoothness condition, the convergence rates of classical gradient methods remain consistent with those under the classical smoothness condition. Nevertheless, existing results are mostly focused on unconstrained smooth optimization, and the corresponding theoretical analysis of PGD for composite optimization problems still requires further development. To this end, this paper further investigates the convergence rates of PGD for solving composite optimization problems within the $ℓ (\cdot)$ -smoothness framework. Under the assumption that the smooth component in the composite optimization problem is convex, we prove that the sequence of function values generated by the constant-stepsize PGD under-smoothness achieves a convergence rate of $O (1 / k)$ .

Keywords

Generalized Smoothness, Proximal Gradient Method, Composite Optimization Problem

Share and Cite:

Yang, L. and Xian, N. (2026) Research on the Proximal Gradient Method for Composite Optimization Problems under Generalized Smoothness Assumptions. Journal of Applied Mathematics and Physics, 14, 1612-1626. doi: 10.4236/jamp.2026.144076.

1. Introduction

This paper considers the following composite convex optimization problem:

$\min_{x \in R^{d}} F (x) = f (x) + g (x)$ (1)

where $f : ℝ^{d} \to ℝ \cup {+ \infty}$ is a proper closed convex function that is differentiable on the open effective domain $X = dom f$ , $g : ℝ^{d} \to ℝ \cup {+ \infty}$ is a proper closed convex function satisfying $dom g \subseteq X$ .

Problem (1) has wide applications in fields such as compressed sensing, sparse phase retrieval [1] [2], network quantization [3], and machine learning. Currently, researchers have proposed various classical methods for solving problem (1), including the subgradient algorithm, splitting algorithms [4], and the proximal gradient method [5].

The proximal gradient method (PGD) combines gradient descent with the proximal operator to solve optimization problems with nonsmooth regularization terms by decomposing complex objective functions. Its iterative scheme is as follows:

$x^{t + 1} = {prox}_{λ_{t}, g} {x^{t} - \nabla f (x)},$

where $λ_{t} > 0$ is the step size parameter, and ${prox}_{λ_{t}, g}$ denotes the proximal operator of the function $g$ defined as

${prox}_{λ_{t}, g} (z) = arg min_{x \in ℝ^{d}} {g (x) + \frac{1}{2 λ_{t}} {‖ x - z ‖}^{2}} .$

In the convergence theory analysis of traditional PGD, it is often required that the gradient satisfy Lipschitz continuity. Under this smoothness condition, if $f$ is convex, the algorithm can achieve a sublinear convergence rate of $O (1 / k)$ .

However, the classical smoothness condition of gradient Lipschitz continuity is relatively strict. For functions with rapidly changing curvature or uneven local geometric structures, this assumption is often difficult to satisfy. Particularly in machine learning and large-scale optimization, many objective functions, although differentiable or even twice differentiable, may not have globally Lipschitz continuous gradients. Even in some standard models, the global smoothness constant may be unbounded [6], such as the $ℓ_{2}$ -regression function. Additionally, when the function $f$ is an augmented Lagrangian function or the dual function of the original problem (which is not necessarily uniformly convex), the objective function may also exhibit nonsmooth or non-uniformly smooth characteristics [7] [8]. Therefore, how to establish algorithm convergence theory under conditions weaker than classical smoothness remains an important question.

Significant progress has been made in research on weakening the classical smoothness assumption. In 2020, Zhang et al. [9], based on an in-depth study of the training process of deep neural networks such as LSTM [10] and ResNet [11], first proposed a non-uniform smoothness condition— $(L_{0}, L_{1})$ -smoothness—to characterize the dependence between the variation of the function gradient and the gradient norm. Compared with the classical smoothness condition of gradient Lipschitz continuity, this type of condition allows the local smoothness degree to vary with the position of the iteration point or the gradient norm, thus covering a broader class of functions, such as univariate polynomials and exponential functions. Under this generalized smoothness condition, Zhang et al. proved the convergence rate of gradient clipping [12] and demonstrated its advantages over gradient descent methods in neural network training. Since the introduction of this generalized smoothness condition, related research has been extensively developed in various fields, including minimax optimization [13], bilevel optimization [14], and variational inequalities [15].

On this basis, Chen et al. [16] further proposed a more general class of symmetric generalized smoothness conditions— $α$ -symmetric $(L_{0}, L_{1})$ -smoothness. This concept generalizes existing smoothness definitions and covers many mainstream machine learning problems and important function classes, such as distributionally robust optimization [17] [18], higher-order polynomials, and exponential functions. Under this framework, the researchers established corresponding descent lemmas for different values of $α$ . On this basis, to solve non-convex optimization problems, Chen et al. designed several deterministic normalized gradient descent algorithms [19], which achieve an optimal complexity of $O (ϵ^{- 2})$ . Meanwhile, they found that under the stochastic setting, the classic variance reduction algorithm SPIDER [20] can also achieve the optimal sample complexity of $O (ϵ^{- 3})$ .

In the same year, Li et al. [21] also generalized the $(L_{0}, L_{1})$ -smoothness condition by replacing the original affine function $L_{0} + L_{1} ‖ \cdot ‖$ with a more general non-decreasing continuous function $ℓ (\cdot)$ to characterize the dependence between the norm of the Hessian matrix and the gradient norm, thereby proposing a more generalized smoothness concept— $ℓ (\cdot)$ -smoothness. That is, there exists a non-decreasing continuous function $ℓ (\cdot)$ such that $f$ satisfies

$‖ \nabla^{2} f (x) ‖ \leq ℓ (‖ \nabla f (x) ‖) .$

This generalization significantly expands the applicability of the generalized smoothness theory, enabling researchers to handle function classes with more complex curvature variations. Based on this new framework, Li et al. systematically analyzed the convergence rates of gradient descent methods in convex, strongly convex, and non-convex settings. Notably, under the condition that the objective function is convex, they further proved that Nesterov’s accelerated gradient method [22] achieves an optimal computational complexity of $O (ϵ^{- 0.5})$ , which exactly matches the optimal bound under the classical smoothness condition. In addition, they also conducted an in-depth study of stochastic non-convex optimization problems and, under the assumption of bounded variance, proved that stochastic gradient descent achieves an optimal complexity of $O (ϵ^{- 4})$ , which also matches its optimal bound under the classical smoothness condition.

Related Work

Overall, existing literature mainly focuses on smooth unconstrained problems, while studies for composite optimization problems are relatively limited. In particular, for the proximal gradient method, a fundamental algorithm for solving composite optimization problems, its theoretical analysis under the generalized smooth framework remains incomplete. Although the convergence analysis of proximal gradient methods based on the Kurdyka-Łojasiewicz (KL) property has successfully extended the theoretical framework to non-convex and non-globally Lipschitz settings [8], such methods still have obvious limitations: their convergence relies on line search strategies to ensure iterative descent, and they can only yield a unified sublinear convergence rate determined by the KL exponent, failing to recover the classical exact convergence rate under convex structures. Therefore, it is necessary to re-examine the algorithmic theory for optimization problems under generalized smoothness conditions and establish a systematic analysis framework applicable to proximal gradient algorithms.

Against the above background, this paper focuses on the composite optimization problem (1) under generalized smoothness conditions, and systematically studies the convergence rate of the PGD. For problem (1), we rigorously establish the achievable convergence rate of the PGD method with a constant step size, provided that the function $f$ is $ℓ$ -smooth and convex. This result shows that the PGD method can still retain the same convergence rate as that under the classical Lipschitz smoothness assumption, even under the weaker generalized smoothness condition.

2. Preliminaries

This paper mainly conducts research in the finite-dimensional Euclidean space $ℝ^{d}$ . Let $X \subseteq ℝ^{d}$ be a non-empty convex set. For any $x, y \in ℝ^{d}$ , the inner product is denoted by $〈 x, y 〉$ , and the induced norm is defined as $‖ x ‖ : = \sqrt{〈 x, x 〉}$ . In this thesis, $B (x, R)$ denotes the Euclidean ball centered at $x$ with radius $R$ .

Definition 1 ( $ℓ$ -smoothness [21]). Let $f : X \to ℝ$ be a real-valued differentiable function. If there exists a non-decreasing continuous function $ℓ : [0, + \infty) \to (0, + \infty)$ such that the spectral norm of the Hessian matrix of $f$ satisfies

$‖ \nabla^{2} f (x) ‖ \leq ℓ (‖ \nabla f (x) ‖)$

almost everywhere on $X$ (with respect to the Lebesgue measure), then $f$ is called an $ℓ$ -smooth function.

Remark. When the function $ℓ$ is a constant function $ℓ (μ) = L$ , the definition of $ℓ$ -smoothness degenerates to the classical smoothness definition; when $ℓ$ is an affine function $ℓ (μ) = L_{0} + L_{1} μ$ , the $ℓ$ -smoothness corresponds to the $(L_{0}, L_{1})$ -smoothness definition [9].

Definition 2 (( $r$ , $ℓ$ )-smoothness). Let $f : X \to ℝ$ be a real-valued differentiable function. Suppose there exist a non-decreasing continuous function $ℓ : [0, + \infty) \to (0, + \infty)$ and a non-increasing continuous function $r : [0, + \infty) \to (0, + \infty)$ such that

1) $\forall x \in X$ , $B (x, r (‖ \nabla f (x) ‖)) \subseteq X$ ;

2) $\forall x_{1}, x_{2} \in B (x, r (‖ \nabla f (x) ‖))$ , $‖ \nabla f (x_{1}) - \nabla f (x_{2}) ‖ \leq ℓ (‖ \nabla f (x) ‖) ‖ x_{1} - x_{2} ‖$ ;

Then $f$ is called an $(r, ℓ)$ -smooth function.

Remark. Here, the function $r : [0, + \infty) \to (0, + \infty)$ is assumed to be nonincreasing and the function $ℓ : [0, + \infty) \to (0, + \infty)$ is assumed to be nondecreasing. This restriction is usually without loss of generality. When $r$ and $ℓ$ do not satisfy the required monotonicity, the desired monotonicity can be achieved by the following construction technique:

1) Replace the original function $r$ with $\hat{r} : = {inf}_{0 \leq v \leq u} r (v) \leq r (u)$ ;

2) Replace the original function $ℓ$ with the nonincreasing function $\hat{ℓ} : = {sup}_{0 \leq v \leq u} ℓ (v) \geq ℓ (u)$ .

Next, we elaborate on the intrinsic relationship between the two types of smooth functions. In fact, under appropriate assumptions, $ℓ$ -smooth functions and $(r, ℓ)$ -smooth functions are equivalent.

Proposition 1. If $f$ is $(r, ℓ)$ -smooth, then $f$ is also $ℓ$ -smooth; if $f$ is $ℓ$ -smooth and $f$ is a differentiable closed function on the open effective domain $X$ , then $f$ is an $(r, m)$ -smooth function, where $m (u) : = ℓ (u + a)$ , $r (u) : = a / m (u)$ , and the constant $a > 0$ is arbitrary.

The proof of the above proposition can be found in appendix A.2 of [21]. The objective function considered in this paper obviously satisfies the assumptions of Proposition 1. In view of the equivalence between $ℓ$ -smoothness and $(r, ℓ)$ -smoothness in the above framework, the subsequent theoretical analysis will mainly focus on $(r, ℓ)$ -generalized smooth functions.

Lemma 2 (Descent Lemma [23]). Let $f : X \to ℝ$ be a $M$ -smooth function, where $X$ is a convex set. Then for all $x, y \in X$ , we have

$f (y) \leq f (x) + 〈 \nabla f (x), y - x 〉 + \frac{M}{2} {‖ x - y ‖}^{2} .$

Lemma 3. [21] Suppose the differentiable function $f : X \to ℝ$ satisfies $(r, ℓ)$ -smoothness, and for all $x \in X$ , $‖ \nabla f (x) ‖ \leq C$ , Then

1) $B (x, r (C)) \subseteq X$ ;

2) For any $x_{1}, x_{2} \in B (x, r (C))$ ,

$‖ \nabla f (x_{1}) - \nabla f (x_{2}) ‖ \leq L_{f} ‖ x_{1} - x_{2} ‖,$ (2)

$f (x_{1}) \leq f (x_{2}) + 〈 \nabla f (x_{2}), x_{1} - x_{2} 〉 + \frac{L_{f}}{2} {‖ x_{1} - x_{2} ‖}^{2},$ (3)

where $L_{f} : = ℓ (C)$ denotes an effective smoothness constant.

Proof of the above lemma can be found in appendix A.3 of [21].

Remark. From Lemma 3, it is not difficult to see that when the gradient norm of an $(r, ℓ)$ -smooth function is bounded, it also satisfies the property of a classical smooth function in a local neighborhood $B (x, r (C))$ , i.e., the descent lemma.

Remark. From Proposition 1, the two classes of smooth functions are equivalent. Therefore, the conclusion of Lemma 3 obviously also holds for $ℓ$ -smooth functions. When $a : = C$ , it suffices to take $L_{f} = ℓ (2 C)$ and $r (C) = C / L_{f}$ to obtain the corresponding conclusion [21].

Lemma 4 (Nonexpansiveness of the Proximal Operator [23]). Let $g : ℝ^{d} \to ℝ \cup {+ \infty}$ be a closed proper convex function. Then for any $x, y \in ℝ^{d}$ , we have

$〈 x - y, {prox}_{g} (x) - {prox}_{g} (y) 〉 \geq {‖ {prox}_{g} (x) - {prox}_{g} (y) ‖}^{2},$ (4)

and

$‖ {prox}_{g} (x) - {prox}_{g} (y) ‖ \leq ‖ x - y ‖ .$

3. The Proximal Gradient Method

3.1. The Proximal Gradient Method

This paper mainly considers the framework of generalized smoothness assumptions, and investigates the iterative convergence rate of PGD for solving unconstrained composite convex optimization problems. In the following, this chapter will discuss the convex case, present and prove the convergence rate of PGD.

Before proceeding with the study, we first introduce an important fundamental assumption in the field of optimization algorithm research, namely, the existence of an optimal solution.

Assumption 1. Assume there exists a point $x^{*} \in X$ such that $F (x^{*}) = F^{*} = \inf_{x \in X} F (x)$ .

This chapter mainly considers the constant step-size version of PGD. Let the iteration step size be $\frac{1}{L}$ , where $L > 0$ . The iteration scheme is given by

$x^{t + 1} = {prox}_{\frac{1}{L} g} (x^{t} - \frac{1}{L} \nabla f (x^{t})),$ (5)

For simplicity, we define $T_{L} (x) : = {prox}_{\frac{1}{L} g} (x - \frac{1}{L} \nabla f (x))$ . Then the PGD iteration scheme is abbreviated as

$x^{t + 1} = T_{L} (x^{t}) .$

Next, we introduce an important tool, the gradient mapping, which is defined as follows.

Definition 3. When solving the objective problem (1) using PGD (5), combined with the definition of the operator $T_{L}$ , the mapping $G_{L} : X \to ℝ^{d}$ is defined as

$G_{L} (x) = L (x - T_{L} (x)), x \in X,$

and is called the gradient mapping.

Based on the above definition of the gradient mapping, the proximal gradient iteration can be equivalently transformed into a gradient descent iteration format, i.e.,

$x^{t + 1} = x^{t} - \frac{1}{L} G_{L} (x^{t}) .$

Through this structural reconstruction, it is not difficult to see that PGD can be regarded as a natural extension of the classical gradient descent method in composite optimization problems.

Since cocoercivity serves as an important theoretical guarantee for the convergence analysis of the PGD iteration framework. However, in the composite convex optimization problem (1) considered in this paper, the function $f$ is only $(r, ℓ)$ -smooth. Therefore, we need to verify whether the corresponding co-coercivity still holds when the function satisfies $(r, ℓ)$ -smoothness. Li et al. [21] pointed out that when a function is both convex and $(r, ℓ)$ -smooth, its gradient also possesses cocoercivity, as stated in the following lemma.

Lemma 5 (cocoercivity [21]). Let the function $f : X \to ℝ$ be an $(r, ℓ)$ -smooth convex function. Then for $\forall x \in X, y \in B (x, \frac{r (‖ \nabla f (x) ‖)}{2})$ , we have

$〈 \nabla f (x) - \nabla f (y), x - y 〉 \geq \frac{1}{ℓ (‖ \nabla f (x) ‖)} {‖ \nabla f (x) - \nabla f (y) ‖}^{2} .$ (6)

Furthermore, we will investigate the convergence rate of PGD iterations based on the concept of co-coercivity. Since we will utilize Lemma 4 in the convergence analysis of PGD, it requires that consecutive iterates $x^{k + 1}, x^{k}$ of PGD satisfy $‖ x^{k + 1} - x^{k} ‖ = \frac{1}{L} ‖ G_{L} (x^{k}) ‖ \leq r (C) / 2$ , Based on this, we prove that the norm of the gradient mapping $‖ G_{L} (x^{k}) ‖$ is monotonically decreasing.

Lemma 6. Let $f : X \to ℝ$ be an $(r, ℓ)$ -smooth differentiable convex function satisfying $‖ \nabla f (x) ‖ \leq C$ ; and let $g : ℝ^{d} \to ℝ \cup {+ \infty}$ be a proper closed convex function. If for $x \in X$ , $T_{L} (x) \in B (x, \frac{r (C)}{2}) \subseteq B (x, \frac{r (‖ \nabla f (x) ‖)}{2})$ , then

$‖ G_{L} (T_{L} (x)) ‖ \leq ‖ G_{L} (x) ‖,$ (7)

where $L \geq \frac{L_{f}}{2} = \frac{ℓ (C)}{2}$ .

Proof. Let $x \in X, y \in B (x, \frac{r (‖ \nabla f (x) ‖)}{2})$ . By Lemma 4, the proximal operator is nonexpansive. When we choose $x : = x - \frac{1}{L} \nabla f (x)$ , $y : = y - \frac{1}{L} \nabla f (y)$ respectively, Equation (4) becomes

$〈 T_{L} (x) - T_{L} (y), (x - \frac{1}{L} \nabla f (x)) - (y - \frac{1}{L} \nabla f (y)) 〉 \geq {‖ T_{L} (x) - T_{L} (y) ‖}^{2} .$

Since the mapping $T_{L} = Id - \frac{1}{L} G_{L}$ , the above inequality can be transformed into

$\begin{array}{l} 〈 (x - \frac{1}{L} G_{L} (x)) - (y - \frac{1}{L} G_{L} (y)), (x - \frac{1}{L} \nabla f (x)) - (y - \frac{1}{L} \nabla f (y)) 〉 \\ \geq {‖ (x - \frac{1}{L} G_{L} (x)) - (y - \frac{1}{L} G_{L} (y)) ‖}^{2} . \end{array}$

Because $L > 0$ , the above inequality can be simplified as

$〈 (x - \frac{1}{L} G_{L} (x)) - (y - \frac{1}{L} G_{L} (y)), (G_{L} (x) - \nabla f (x)) - (G_{L} (y) - \nabla f (y)) 〉 \geq 0,$

Further, by expanding the inner product and rearranging terms, we obtain

$\begin{matrix} 〈 G_{L} (x) - G_{L} (y), x - y 〉 \geq 〈 \nabla f (x) - \nabla f (y), x - y 〉 + \frac{1}{L} {‖ G_{L} (x) - G_{L} (y) ‖}^{2} \\ - \frac{1}{L} 〈 G_{L} (x) - G_{L} (y), \nabla f (x) - \nabla f (y) 〉 . \end{matrix}$ (8)

Since the function $f$ is an $(r, ℓ)$ -smooth differentiable convex function and satisfies $‖ \nabla f (x) ‖ \leq C$ , it follows from Lemma 5 that

$\begin{matrix} 〈 \nabla f (x) - \nabla f (y), x - y 〉 \geq \frac{1}{ℓ (‖ \nabla f (x) ‖)} {‖ \nabla f (x) - \nabla f (y) ‖}^{2} \\ \geq \frac{1}{ℓ (C)} {‖ \nabla f (x) - \nabla f (y) ‖}^{2}, \end{matrix}$

where the second inequality holds because $‖ \nabla f (x) ‖ \leq C$ and the function $ℓ$ is nondecreasing and continuous. Therefore, based on the above inequality, Equation (8) can be further transformed into

$\begin{array}{l} ℓ (C) 〈 G_{L} (x) - G_{L} (y), x - y 〉 \\ \geq \frac{ℓ (C)}{L} {‖ G_{L} (x) - G_{L} (y) ‖}^{2} + {‖ \nabla f (x) - \nabla f (y) ‖}^{2} \\ - \frac{ℓ (C)}{L} 〈 G_{L} (x) - G_{L} (y), \nabla f (x) - \nabla f (y) 〉, \end{array}$ (9)

For convenience, let $a : = G_{L} (x) - G_{L} (y)$ , $b : = \nabla f (x) - \nabla f (y)$ . Then the right-hand side of the above inequality can be further simplified as

$\frac{ℓ (C)}{L} {‖ a ‖}^{2} + {‖ b ‖}^{2} - \frac{ℓ (C)}{L} a^{T} b = \frac{4 L \cdot ℓ (C) - ℓ {(C)}^{2}}{4 L^{2}} {‖ a ‖}^{2} + {‖ \frac{ℓ (C)}{2 L} a - b ‖}^{2}$

Substituting the simplified result back into (9) yields

$ℓ (C) 〈 G_{L} (x) - G_{L} (y), x - y 〉 \geq \frac{4 L \cdot ℓ (C) - ℓ {(C)}^{2}}{4 L^{2}} {‖ G_{L} (x) - G_{L} (y) ‖}^{2} .$

Since the function $ℓ : [0, + \infty) \to (0, + \infty)$ , we have $ℓ (C) > 0$ always holds. Dividing both sides of the above inequality by $ℓ (C)$ , we obtain

$〈 G_{L} (x) - G_{L} (y), x - y 〉 \geq \frac{4 L - ℓ (C)}{4 L^{2}} {‖ G_{L} (x) - G_{L} (y) ‖}^{2} .$ (10)

Next, we estimate the left-hand side of (10). Using the Cauchy-Schwarz inequality, we obtain

$〈 G_{L} (x) - G_{L} (y), x - y 〉 \leq ‖ G_{L} (x) - G_{L} (y) ‖ ‖ x - y ‖,$

Therefore, combining the above inequality with (10) yields

$‖ G_{L} (x) - G_{L} (y) ‖ \leq \frac{4 L^{2}}{4 L - ℓ (C)} ‖ x - y ‖ .$ (11)

According to the definition of the gradient mapping $G_{L} (x) = L (x - T_{L} (x))$ , and taking $y = T_{L} (x)$ , we obtain

$x - y = x - T_{L} (x) = \frac{1}{L} G_{L} (x)$

and

$G_{L} (y) = L (y - T_{L} (y)) = G_{L} (T_{L} (x)) = L (T_{L} (x) - T_{L} (T_{L} (x))),$

Let $x^{+} : = T_{L} (x)$ , the last equality is abbreviated as $G_{L} (y) = G_{L} (x^{+}) = L (x^{+} - T_{L} (x^{+}))$ . Substituting these two equalities into (10), we obtain

$\begin{array}{l} 〈 G_{L} (x^{+}) - G_{L} (x), - \frac{1}{L} G_{L} (x) 〉 \geq \frac{4 L - ℓ (C)}{4 L^{2}} {‖ G_{L} (x^{+}) - G_{L} (x) ‖}^{2} \\ \Leftrightarrow - \frac{1}{L} (〈 G_{L} (x^{+}), G_{L} (x) 〉 - {‖ G_{L} (x) ‖}^{2}) \\ \geq \frac{4 L - ℓ (C)}{4 L^{2}} ({‖ G_{L} (x^{+}) ‖}^{2} - 2 〈 G_{L} (x^{+}), G_{L} (x) 〉 + {‖ G_{L} (x) ‖}^{2}) \\ \Leftrightarrow \frac{2 L - ℓ (C)}{2 L^{2}} 〈 G_{L} (x^{+}), G_{L} (x) 〉 \geq \frac{4 L - ℓ (C)}{4 L^{2}} {‖ G_{L} (x^{+}) ‖}^{2} - \frac{ℓ (C)}{4 L^{2}} {‖ G_{L} (x) ‖}^{2} . \end{array}$ (12)

1) If $L = \frac{ℓ (C)}{2}$ , Equation (12) can be simplified as

$\frac{1}{ℓ (C)} {‖ G_{L} (x) ‖}^{2} \geq \frac{1}{ℓ (C)} {‖ G_{L} (x^{+}) ‖}^{2},$

Multiplying both sides by $ℓ (C) > 0$ , we obtain

${‖ G_{L} (x) ‖}^{2} \geq {‖ G_{L} (x^{+}) ‖}^{2} .$

Further, by letting $x^{+} : = T_{L} (x)$ , we arrive at the final conclusion ${‖ G_{L} (x) ‖}^{2} \geq {‖ G_{L} (T_{L} (x)) ‖}^{2}$ .

2) Similarly, when $L \in (\frac{ℓ (C)}{2}, + \infty)$ , combining with Equation (12) and using the Cauchy-Schwarz inequality, we obtain

$\frac{2 L - ℓ (C)}{2 L^{2}} ‖ G_{L} (x^{+}) ‖ ‖ G_{L} (x) ‖ \geq \frac{4 L - ℓ (C)}{4 L^{2}} {‖ G_{L} (x^{+}) ‖}^{2} - \frac{ℓ (C)}{4 L^{2}} {‖ G_{L} (x) ‖}^{2},$ (13)

By rearranging terms and factoring, the above inequality becomes

$(‖ G_{L} (x) ‖ - ‖ G_{L} (x^{+}) ‖) (\frac{ℓ (C)}{4 L^{2}} ‖ G_{L} (x) ‖ + \frac{4 L - ℓ (C)}{4 L^{2}} ‖ G_{L} (x^{+}) ‖) \geq 0.$ (14)

(i) If $‖ G_{L} (x^{+}) ‖ = 0$ , by the non-negativity of the norm we have $‖ G_{L} (x) ‖ \geq 0$ . Furthermore, since $‖ G_{L} (x^{+}) ‖ = 0$ , it follows that $‖ G_{L} (x) ‖ \geq ‖ G_{L} (x^{+}) ‖$ , the conclusion holds.

(ii) If $‖ G_{L} (x^{+}) ‖ \neq 0$ , by the non-negativity of the norm we have $‖ G_{L} (x^{+}) ‖ > 0$ . In this case, $‖ G_{L} (x) ‖ > 0$ must also hold. If $‖ G_{L} (x) ‖ > 0$ were not true, then by the non-negativity of the norm we would have $‖ G_{L} (x) ‖ = 0$ . From (11), with $x^{+} : = T_{L} (x)$ and $G_{L} (x) = L (x - T_{L} (x))$ , we obtain

$\begin{matrix} ‖ G_{L} (x^{+}) - G_{L} (x) ‖ \leq \frac{4 L^{2}}{4 L - ℓ (C)} ‖ x^{+} - x ‖ = \frac{4 L^{2}}{4 L - ℓ (C)} ‖ T_{L} (x) - x ‖ \\ = \frac{4 L^{2}}{4 L - ℓ (C)} \cdot \frac{1}{L} ‖ G_{L} (x) ‖, \end{matrix}$

Since $‖ G_{L} (x) ‖ = 0$ , it follows that $‖ G_{L} (x^{+}) ‖ \leq 0$ , which contradicts the premise $‖ G_{L} (x^{+}) ‖ > 0$ . Therefore, given $L > \frac{ℓ (C)}{2}$ and $ℓ (C) > 0$ , we have $4 L - ℓ (C) > 0$ , and consequently

$\frac{ℓ (C)}{4 L^{2}} ‖ G_{L} (x) ‖ + \frac{4 L - ℓ (C)}{4 L^{2}} ‖ G_{L} (x^{+}) ‖ > 0.$

At this point, combining the above inequality with (14), it is easy to see that $‖ G_{L} (x) ‖ \geq ‖ G_{L} (x^{+}) ‖$ , so the conclusion holds.

□

Next, we present the key descent inequality in the convergence analysis.

Lemma 7 (Basic Proximal Gradient Inequality). Let $f : X \to ℝ$ be a differentiable convex function, where $X$ is an open set, and let $g : ℝ^{d} \to ℝ \cup {+ \infty}$ be a proper closed convex function. For any $x, y \in X$ and $L > 0$ , if

$f (T_{L} (x)) \leq f (y) + 〈 \nabla f (y), T_{L} (y) - y 〉 + \frac{L}{2} {‖ T_{L} (y) - y ‖}^{2},$ (15)

then

$F (x) - F (T_{L} (y)) \geq \frac{L}{2} {‖ x - T_{L} (y) ‖}^{2} - \frac{L}{2} {‖ x - y ‖}^{2} + h (x, y),$ (16)

where $h (x, y) = f (x) - f (y) - 〈 \nabla f (y), x - y 〉$ .

Proof. Define the function

$ϕ (u) : = f (y) + 〈 \nabla f (y), u - y 〉 + g (u) + \frac{L}{2} {‖ u - y ‖}^{2} .$ (17)

Clearly, $ϕ$ is a strongly convex function with strong convexity modulus $L$ . From the definition of $T_{L} (x)$ , we have

$\begin{matrix} T_{L} (y) = {prox}_{\frac{1}{L} g} (y - \frac{1}{L} \nabla f (y)) \\ = arg min_{u \in ℝ^{d}} {g (u) + \frac{L}{2} {‖ u - (y - \frac{1}{L} \nabla f (y)) ‖}^{2}} \\ = arg min_{u \in ℝ^{d}} ϕ (u) . \end{matrix}$ (18)

By strong convexity, we obtain

$ϕ (x) - ϕ (T_{L} (y)) \geq \frac{L}{2} {‖ x - T_{L} (y) ‖}^{2} .$ (19)

Using the convexity of $f$ , we have

$f (y) \geq f (T_{L} (y)) + 〈 \nabla f (y), y - T_{L} (y) 〉 .$

Letting $u : = T_{L} (y)$ and substituting it into (17), together with the above inequality, we get

$\begin{matrix} ϕ (T_{L} (y)) = f (y) + 〈 \nabla f (y), T_{L} (y) - y 〉 + \frac{L}{2} {‖ T_{L} (y) - y ‖}^{2} + g (T_{L} (y)) \\ \geq f (T_{L} (y)) + g (T_{L} (y)) = F (T_{L} (y)), \end{matrix}$

furthermore, from (19), it is not difficult to see that

$ϕ (x) - F (T_{L} (x)) \geq \frac{L}{2} {‖ x - T_{L} (y) ‖}^{2} .$

Next, substituting $u : = x$ into the expression for $ϕ (u)$ and combining with the above inequality, we obtain

$f (y) + 〈 \nabla f (x), x - y 〉 + g (x) + \frac{L}{2} {‖ x - y ‖}^{2} - F (T_{L} (y)) \geq \frac{L}{2} {‖ x - T_{L} (y) ‖}^{2},$

by rearranging terms, we get

$F (x) - F (T_{L} (y)) \geq \frac{L}{2} {‖ x - T_{L} (y) ‖}^{2} - \frac{L}{2} {‖ x - y ‖}^{2} + f (x) - f (y) - 〈 \nabla f (x), x - y 〉 .$

This completes the proof. □

Since $y$ is arbitrary, we may replace $y$ with $x$ , and thus the following corollary can be deduced from Lemma 7.

Corollary 1. Let $f : X \to ℝ$ be a continuously differentiable convex function, where $X$ is an open set, and let $g : ℝ^{d} \to ℝ \cup {+ \infty}$ be a proper closed convex function. For any $x, y \in X$ and $L > 0$ , if

$f (T_{L} (x)) \leq f (x) + 〈 \nabla f (x), T_{L} (x) - x 〉 + \frac{L}{2} {‖ T_{L} (x) - x ‖}^{2},$

then

$F (x) - F (T_{L} (x)) \geq \frac{1}{2 L} {‖ G_{L} (x) ‖}^{2} .$

Since the consecutive iterates $x^{t + 1}$ and $x^{t}$ of the proximal gradient method satisfy $x^{t + 1} = T_{L} (x^{t})$ , Corollary 1 directly establishes the sufficient descent property of the PGD algorithm. In what follows, we combine the above lemmas to study the convergence rate of PGD for solving convex problems.

3.2. Convergence Rate Analysis of the Proximal Gradient Method in the Convex Case

In this subsection, we consider the case where $f$ in the objective function is convex, and analyze the convergence of PGD iterations under the generalized smoothness assumption. Specifically, the convergence rate results obtained through theoretical analysis are as follows.

Theorem 8. Consider problem (1). Let $g : ℝ^{d} \to ℝ \cup {+ \infty}$ be a proper closed convex function, and let $f : X \to ℝ$ be a differentiable proper closed convex function, where $X$ is an open convex set and $dom g \subseteq X$ . Assume that $f$ is an $(r, ℓ)$ -smooth function, where $ℓ : [0, + \infty) \to (0, + \infty)$ is a nondecreasing continuous function and $r : [0, + \infty) \to (0, + \infty)$ is a nonincreasing continuous function. Moreover, suppose that $f$ is $C$ -Lipschitz continuous. If the parameter $L$ satisfies

$L \geq max {ℓ (C), 2 ‖ G_{L} (x_{0}) ‖ / r (C)},$

then

$F (x^{k}) - F^{*} \leq \frac{L {‖ x^{0} - x^{*} ‖}^{2}}{2 k} .$

Proof. Since $f$ is $C$ -Lipschitz continuous, for any $x, y \in X$ , we have

$‖ \nabla f (x) ‖ = lim_{y \to x} \frac{‖ f (y) - f (x) ‖}{| y - x |} \leq \frac{C | y - x |}{| y - x |} = C .$

Given that $f$ is convex and $(r, ℓ)$ -smooth, and $g$ is a proper closed convex function. First, from

$x^{t + 1} = T_{L} (x^{t})$

and $G_{L} (x) = L (x - T_{L} (x))$ , together with Lemma 6, we obtain

$‖ x^{t + 1} - x^{t} ‖ = ‖ T_{L} (x^{t}) - x^{t} ‖ = \frac{1}{L} ‖ G_{L} (x^{t}) ‖ \leq \frac{1}{L} ‖ G_{L} (x^{t - 1}) ‖ \leq \dots \leq \frac{1}{L} ‖ G_{L} (x^{0}) ‖ \leq \frac{r (C)}{2},$

which implies

$x^{t + 1} \in B (x^{t}, \frac{r (C)}{2}) .$

Secondly, by Lemma 3, the two consecutive iterates $x^{t + 1}, x^{t}$ of PGD always satisfy

$f (x^{t + 1}) \leq f (x^{t}) + 〈 \nabla f (x^{t}), x^{t + 1} - x^{t} 〉 + \frac{ℓ (C)}{2} {‖ x^{t + 1} - x^{t} ‖}^{2} .$ (20)

Finally, in Lemma 7, due to the arbitrariness of $L$ , we may set $L = ℓ (C)$ . Together with $x^{t + 1} = T_{L} (x^{t})$ , we have for any $t \in ℕ^{*}$ ,

$\frac{2}{ℓ (C)} (F (x^{*}) - F (x^{t + 1})) \geq {‖ x^{*} - x^{t + 1} ‖}^{2} - {‖ x^{*} - x^{t} ‖}^{2} + \frac{2}{ℓ (C)} h (x^{*}, x^{t}),$ (21)

where, by Assumption 1, $x^{*}$ is the optimal solution of the objective function, and the second inequality follows from the convexity of $f$ . Now, letting $t$ run over $0, 1, 2, \dots, k - 1$ and summing the inequalities, we obtain

$\frac{2}{ℓ (C)} \sum_{t = 0}^{k - 1} (F (x^{*}) - F (x^{t + 1})) \geq {‖ x^{*} - x^{t} ‖}^{2} - {‖ x^{*} - x^{0} ‖}^{2} .$

Multiplying both sides of the above inequality by $\frac{ℓ (C)}{2}$ , we get

$\sum_{t = 0}^{k - 1} (F (x^{*}) - F (x^{t})) \leq \frac{ℓ (C)}{2} {‖ x^{*} - x^{0} ‖}^{2} - \frac{ℓ (C)}{2} {‖ x^{*} - x^{k} ‖}^{2} \leq \frac{ℓ (C)}{2} {‖ x^{*} - x^{0} ‖}^{2} .$

From Corollary 1, we also have $F (x^{t + 1}) \leq F (x^{*})$ . Combining this with the above inequality yields

$\frac{2}{ℓ (C)} \sum_{t = 0}^{k - 1} {(F (x^{*}) - F (x^{t + 1}))}^{2} \geq {‖ x^{*} - x^{k} ‖}^{2} - {‖ x^{*} - x^{0} ‖}^{2} .$

Multiplying both sides of the above inequality by $\frac{ℓ (C)}{2}$ , we obtain

$\sum_{t = 0}^{k - 1} (F (x^{*}) - F (x^{t})) \leq \frac{ℓ (C)}{2} {‖ x^{*} - x^{0} ‖}^{2} - \frac{ℓ (C)}{2} {‖ x^{*} - x^{k} ‖}^{2} \leq \frac{ℓ (C)}{2} {‖ x^{*} - x^{0} ‖}^{2} .$

From Corollary 1, we also have $F (x^{t + 1}) \leq F (x^{t})$ . Combining this with the above inequality yields

$k (F (x^{k}) - F (x^{*})) \leq \sum_{t = 1}^{k - 1} (F (x^{t + 1}) - F (x^{*})) \leq \frac{ℓ (C)}{2} {‖ x^{*} - x^{0} ‖}^{2},$

which can be rearranged as

$F (x^{k}) - F (x^{*}) \leq \frac{ℓ (C) {‖ x^{*} - x^{0} ‖}^{2}}{2 k} \leq \frac{L {‖ x^{*} - x^{0} ‖}^{2}}{2 k} .$

This completes the proof. □

4. Conclusions

To address the strict limitation of the classic gradient Lipschitz smoothness assumption on the applicability of the PGD, this paper introduces a novel class of generalized smoothness conditions— $ℓ (\cdot)$ —smoothness, establishing a more general framework for analyzing composite convex optimization problems. Under this generalized smoothness framework, we systematically derive the convergence theory of the constant-stepsize PGD algorithm and rigorously prove that: even under the generalized smoothness assumption that relaxes the classic Lipschitz gradient continuity, the constant-stepsize PGD can still achieve the exact same sublinear convergence rate $O (1 / k)$ as in the classic smoothness scenario, provided that the function $f$ satisfies $ℓ (\cdot)$ —smoothness and Lipschitz continuity.

However, the analysis framework in this paper relies on the core assumption that the objective function $f$ satisfies gradient Lipschitz continuity. While this assumption guarantees the gradient Lipschitz continuity of $f$ in local regions and provides theoretical support for the convergence analysis of the constant-step-size PGD algorithm, it also significantly limits the scope of application of the method: for composite convex optimization problems that do not satisfy gradient Lipschitz continuity (such as scenarios with unbounded gradients or rapidly changing gradients), the sublinear convergence rate theory of the constant-step-size PGD algorithm established in this paper will no longer hold.

Acknowledgements

Sincere thanks to the members of JAMP for their professional performance, and special thanks to managing editor Hellen XU for a rare attitude of high quality.

Conflicts of Interest

The authors declare no conflicts of interest regarding the publication of this paper.

References

[1]	Shechtman, Y., Beck, A. and Eldar, Y.C. (2014) GESPAR: Efficient Phase Retrieval of Sparse Signals. IEEE Transactions on Signal Processing, 62, 928-938.[CrossRef]
[2]	Cai, J.F., Long, Y., Wen, R.X. and Ying, J. (2023) A Fast and Provable Algorithm for Sparse Phase Retrieval.[CrossRef]
[3]	Bai, Y., Wang, Y.X. and Liberty, E. (2019) ProxQuant: Quantized Neural Networks Via Proximal Operators. 7th International Conference on Learning Representations (ICLR), New Orleans, 6-9 May 2019, 1-20.
[4]	Vũ, B.C. (2013) A Splitting Algorithm for Dual Monotone Inclusions Involving Cocoercive Operators. Advances in Computational Mathematics, 38, 667-681.[CrossRef]
[5]	Rockafellar, R.T. (2001) Monotone Operators and the Proximal Point Algorithm. SIAM Journal on Control and Optimization, 14, 877-898.[CrossRef]
[6]	Faw, M., Tziotis, I., Caramanis, C. and Mokhtari, A. (2022) The Power of Adaptivity in SGD: Self-Tuning Step Sizes with Unbounded Gradients and Affine Variance. Proceedings of the 35th Conference on Learning Theory (COLT), London, 2-5 July 2022, 313-355.
[7]	De Marchi, A., Jia, X., Kanzow, C. and Mehlitz, P. (2023) Constrained Composite Optimization and Augmented Lagrangian Methods. Mathematical Programming, 201, 863-896.[CrossRef]
[8]	Jia, X., Kanzow, C. and Mehlitz, P. (2023) Convergence Analysis of the Proximal Gradient Method in the Presence of the Kurdyka-Łojasiewicz Property without Global Lipschitz Assumptions. SIAM Journal on Optimization, 33, 3038-3056.[CrossRef]
[9]	Zhang, J., He, T., Sra, S. and Jadbabaie, A. (2020) Why Gradient Clipping Accelerates Training: A Theoretical Justification for Adaptivity.[CrossRef]
[10]	Merity, S., Keskar, N.S. and Socher, R. (2018) Regularizing and Optimizing LSTM Language Models.[CrossRef]
[11]	He, K., Zhang, X., Ren, S. and Sun, J. (2016) Deep Residual Learning for Image Recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, 27-30 June 2016, 770-778.[CrossRef]
[12]	Mikolov, T. (2012) Statistical Language Models Based on Neural Networks. Brno University of Technology, Faculty of Information Technology. https://www.fit.vut.cz/study/phd-thesis/283/
[13]	Xian, W., Chen, Z. and Huang, H. (2024) Delving into the Convergence of Generalized Smooth Minimax Optimization. Communications in Mathematical Physics, 235, 54191-54211.
[14]	Hao, J., Gong, X. and Liu M. (2024) Bilevel Optimization under Unbounded Smoothness: A New Algorithm and Convergence Analysis. https://arxiv.org/abs/2401.09587
[15]	Vankov, D., Nedich, A. and Sankar, L. (2024) Generalized Smooth Variational Inequalities: Methods with Adaptive Stepsizes. Proceedings of the 41st International Conference on Machine Learning (ICML), Vienna, 21-27 July 2024, 49137-49170. https://proceedings.mlr.press/v235/vankov24a.html
[16]	Chen, Z., Zhou, Y., Liang, Y. and Lu, Z. (2023) Generalized-Smooth Nonconvex Optimization Is as Efficient as Smooth Nonconvex Optimization. Proceedings of the 40th International Conference on Machine Learning (ICML), Honolulu, 23-29 July 2023, 5396-5427. https://proceedings.mlr.press/v202/chen23v.html
[17]	Levy, D., Carmon, Y., Duchi, J.C. and Sidford, A. (2020) Large-Scale Methods for Distributionally Robust Optimization. https://arxiv.org/abs/2010.05893
[18]	Jin, J., Zhang, B., Wang, H. and Wang, L. (2021) Non-Convex Distributionally Robust Optimization: Non-Asymptotic Analysis. https://arxiv.org/abs/2110.12459
[19]	Cortés, J. (2006) Finite-time Convergent Gradient Flows with Applications to Network Consensus. Automatica, 42, 1993-2000.[CrossRef]
[20]	Fang, C., Li, C., Lin, Z. and Zhang, T. (2018) SPIDER: Near-Optimal Non-Convex Optimization via Stochastic Path-Integrated Differential Estimator. https://arxiv.org/pdf/1807.01695
[21]	Li, H., Tian, Y., Rakhlin, A. and Jadbabaie, A. (2023) Convex and Non-Convex Optimization under Generalized Smoothness. https://arxiv.org/pdf/2306.01264.pdf
[22]	Nesterov, Y.E. (1983) A Method for Solving the Convex Programming Problem with Convergence Rate O(1/k²). Proceedings of the USSR Academy of Sciences, 269, 543-547.
[23]	Beck, A. (2017) First-Order Methods in Optimization. Society for Industrial and Applied Mathematics.[CrossRef]

	[email protected]
	+86 18163351462 (WhatsApp)
	1655362766
	SCIRP WeChat

Journals Menu

Home

About SCIRP

Service

Policies