Research on the Proximal Gradient Method for Composite Optimization Problems under Generalized Smoothness Assumptions ()
1. Introduction
This paper considers the following composite convex optimization problem:
(1)
where
is a proper closed convex function that is differentiable on the open effective domain
,
is a proper closed convex function satisfying
.
Problem (1) has wide applications in fields such as compressed sensing, sparse phase retrieval [1] [2], network quantization [3], and machine learning. Currently, researchers have proposed various classical methods for solving problem (1), including the subgradient algorithm, splitting algorithms [4], and the proximal gradient method [5].
The proximal gradient method (PGD) combines gradient descent with the proximal operator to solve optimization problems with nonsmooth regularization terms by decomposing complex objective functions. Its iterative scheme is as follows:
where
is the step size parameter, and
denotes the proximal operator of the function
defined as
In the convergence theory analysis of traditional PGD, it is often required that the gradient satisfy Lipschitz continuity. Under this smoothness condition, if
is convex, the algorithm can achieve a sublinear convergence rate of
.
However, the classical smoothness condition of gradient Lipschitz continuity is relatively strict. For functions with rapidly changing curvature or uneven local geometric structures, this assumption is often difficult to satisfy. Particularly in machine learning and large-scale optimization, many objective functions, although differentiable or even twice differentiable, may not have globally Lipschitz continuous gradients. Even in some standard models, the global smoothness constant may be unbounded [6], such as the
-regression function. Additionally, when the function
is an augmented Lagrangian function or the dual function of the original problem (which is not necessarily uniformly convex), the objective function may also exhibit nonsmooth or non-uniformly smooth characteristics [7] [8]. Therefore, how to establish algorithm convergence theory under conditions weaker than classical smoothness remains an important question.
Significant progress has been made in research on weakening the classical smoothness assumption. In 2020, Zhang et al. [9], based on an in-depth study of the training process of deep neural networks such as LSTM [10] and ResNet [11], first proposed a non-uniform smoothness condition—
-smoothness—to characterize the dependence between the variation of the function gradient and the gradient norm. Compared with the classical smoothness condition of gradient Lipschitz continuity, this type of condition allows the local smoothness degree to vary with the position of the iteration point or the gradient norm, thus covering a broader class of functions, such as univariate polynomials and exponential functions. Under this generalized smoothness condition, Zhang et al. proved the convergence rate of gradient clipping [12] and demonstrated its advantages over gradient descent methods in neural network training. Since the introduction of this generalized smoothness condition, related research has been extensively developed in various fields, including minimax optimization [13], bilevel optimization [14], and variational inequalities [15].
On this basis, Chen et al. [16] further proposed a more general class of symmetric generalized smoothness conditions—
-symmetric
-smoothness. This concept generalizes existing smoothness definitions and covers many mainstream machine learning problems and important function classes, such as distributionally robust optimization [17] [18], higher-order polynomials, and exponential functions. Under this framework, the researchers established corresponding descent lemmas for different values of
. On this basis, to solve non-convex optimization problems, Chen et al. designed several deterministic normalized gradient descent algorithms [19], which achieve an optimal complexity of
. Meanwhile, they found that under the stochastic setting, the classic variance reduction algorithm SPIDER [20] can also achieve the optimal sample complexity of
.
In the same year, Li et al. [21] also generalized the
-smoothness condition by replacing the original affine function
with a more general non-decreasing continuous function
to characterize the dependence between the norm of the Hessian matrix and the gradient norm, thereby proposing a more generalized smoothness concept—
-smoothness. That is, there exists a non-decreasing continuous function
such that
satisfies
This generalization significantly expands the applicability of the generalized smoothness theory, enabling researchers to handle function classes with more complex curvature variations. Based on this new framework, Li et al. systematically analyzed the convergence rates of gradient descent methods in convex, strongly convex, and non-convex settings. Notably, under the condition that the objective function is convex, they further proved that Nesterov’s accelerated gradient method [22] achieves an optimal computational complexity of
, which exactly matches the optimal bound under the classical smoothness condition. In addition, they also conducted an in-depth study of stochastic non-convex optimization problems and, under the assumption of bounded variance, proved that stochastic gradient descent achieves an optimal complexity of
, which also matches its optimal bound under the classical smoothness condition.
Related Work
Overall, existing literature mainly focuses on smooth unconstrained problems, while studies for composite optimization problems are relatively limited. In particular, for the proximal gradient method, a fundamental algorithm for solving composite optimization problems, its theoretical analysis under the generalized smooth framework remains incomplete. Although the convergence analysis of proximal gradient methods based on the Kurdyka-Łojasiewicz (KL) property has successfully extended the theoretical framework to non-convex and non-globally Lipschitz settings [8], such methods still have obvious limitations: their convergence relies on line search strategies to ensure iterative descent, and they can only yield a unified sublinear convergence rate determined by the KL exponent, failing to recover the classical exact convergence rate under convex structures. Therefore, it is necessary to re-examine the algorithmic theory for optimization problems under generalized smoothness conditions and establish a systematic analysis framework applicable to proximal gradient algorithms.
Against the above background, this paper focuses on the composite optimization problem (1) under generalized smoothness conditions, and systematically studies the convergence rate of the PGD. For problem (1), we rigorously establish the achievable convergence rate of the PGD method with a constant step size, provided that the function
is
-smooth and convex. This result shows that the PGD method can still retain the same convergence rate as that under the classical Lipschitz smoothness assumption, even under the weaker generalized smoothness condition.
2. Preliminaries
This paper mainly conducts research in the finite-dimensional Euclidean space
. Let
be a non-empty convex set. For any
, the inner product is denoted by
, and the induced norm is defined as
. In this thesis,
denotes the Euclidean ball centered at
with radius
.
Definition 1 (
-smoothness [21]). Let
be a real-valued differentiable function. If there exists a non-decreasing continuous function
such that the spectral norm of the Hessian matrix of
satisfies
almost everywhere on
(with respect to the Lebesgue measure), then
is called an
-smooth function.
Remark. When the function
is a constant function
, the definition of
-smoothness degenerates to the classical smoothness definition; when
is an affine function
, the
-smoothness corresponds to the
-smoothness definition [9].
Definition 2 ((
,
)-smoothness). Let
be a real-valued differentiable function. Suppose there exist a non-decreasing continuous function
and a non-increasing continuous function
such that
1)
,
;
2)
,
;
Then
is called an
-smooth function.
Remark. Here, the function
is assumed to be nonincreasing and the function
is assumed to be nondecreasing. This restriction is usually without loss of generality. When
and
do not satisfy the required monotonicity, the desired monotonicity can be achieved by the following construction technique:
1) Replace the original function
with ;
2) Replace the original function
with the nonincreasing function .
Next, we elaborate on the intrinsic relationship between the two types of smooth functions. In fact, under appropriate assumptions,
-smooth functions and
-smooth functions are equivalent.
Proposition 1. If
is
-smooth, then
is also
-smooth; if
is
-smooth and
is a differentiable closed function on the open effective domain
, then
is an
-smooth function, where
,
, and the constant
is arbitrary.
The proof of the above proposition can be found in appendix A.2 of [21]. The objective function considered in this paper obviously satisfies the assumptions of Proposition 1. In view of the equivalence between
-smoothness and
-smoothness in the above framework, the subsequent theoretical analysis will mainly focus on
-generalized smooth functions.
Lemma 2 (Descent Lemma [23]). Let
be a
-smooth function, where
is a convex set. Then for all
, we have
Lemma 3. [21] Suppose the differentiable function
satisfies
-smoothness, and for all
,
, Then
1)
;
2) For any
,
(2)
(3)
where
denotes an effective smoothness constant.
Proof of the above lemma can be found in appendix A.3 of [21].
Remark. From Lemma 3, it is not difficult to see that when the gradient norm of an
-smooth function is bounded, it also satisfies the property of a classical smooth function in a local neighborhood
, i.e., the descent lemma.
Remark. From Proposition 1, the two classes of smooth functions are equivalent. Therefore, the conclusion of Lemma 3 obviously also holds for
-smooth functions. When
, it suffices to take
and
to obtain the corresponding conclusion [21].
Lemma 4 (Nonexpansiveness of the Proximal Operator [23]). Let
be a closed proper convex function. Then for any
, we have
(4)
and
3. The Proximal Gradient Method
3.1. The Proximal Gradient Method
This paper mainly considers the framework of generalized smoothness assumptions, and investigates the iterative convergence rate of PGD for solving unconstrained composite convex optimization problems. In the following, this chapter will discuss the convex case, present and prove the convergence rate of PGD.
Before proceeding with the study, we first introduce an important fundamental assumption in the field of optimization algorithm research, namely, the existence of an optimal solution.
Assumption 1. Assume there exists a point
such that
.
This chapter mainly considers the constant step-size version of PGD. Let the iteration step size be
, where
. The iteration scheme is given by
(5)
For simplicity, we define
. Then the PGD iteration scheme is abbreviated as
Next, we introduce an important tool, the gradient mapping, which is defined as follows.
Definition 3. When solving the objective problem (1) using PGD (5), combined with the definition of the operator
, the mapping
is defined as
and is called the gradient mapping.
Based on the above definition of the gradient mapping, the proximal gradient iteration can be equivalently transformed into a gradient descent iteration format, i.e.,
Through this structural reconstruction, it is not difficult to see that PGD can be regarded as a natural extension of the classical gradient descent method in composite optimization problems.
Since cocoercivity serves as an important theoretical guarantee for the convergence analysis of the PGD iteration framework. However, in the composite convex optimization problem (1) considered in this paper, the function
is only
-smooth. Therefore, we need to verify whether the corresponding co-coercivity still holds when the function satisfies
-smoothness. Li et al. [21] pointed out that when a function is both convex and
-smooth, its gradient also possesses cocoercivity, as stated in the following lemma.
Lemma 5 (cocoercivity [21]). Let the function
be an
-smooth convex function. Then for
, we have
(6)
Furthermore, we will investigate the convergence rate of PGD iterations based on the concept of co-coercivity. Since we will utilize Lemma 4 in the convergence analysis of PGD, it requires that consecutive iterates
of PGD satisfy
, Based on this, we prove that the norm of the gradient mapping
is monotonically decreasing.
Lemma 6. Let
be an
-smooth differentiable convex function satisfying
; and let
be a proper closed convex function. If for
,
, then
(7)
where
.
Proof. Let
. By Lemma 4, the proximal operator is nonexpansive. When we choose
,
respectively, Equation (4) becomes
Since the mapping
, the above inequality can be transformed into
Because
, the above inequality can be simplified as
Further, by expanding the inner product and rearranging terms, we obtain
(8)
Since the function
is an
-smooth differentiable convex function and satisfies
, it follows from Lemma 5 that
where the second inequality holds because
and the function
is nondecreasing and continuous. Therefore, based on the above inequality, Equation (8) can be further transformed into
(9)
For convenience, let
,
. Then the right-hand side of the above inequality can be further simplified as
Substituting the simplified result back into (9) yields
Since the function
, we have
always holds. Dividing both sides of the above inequality by
, we obtain
(10)
Next, we estimate the left-hand side of (10). Using the Cauchy-Schwarz inequality, we obtain
Therefore, combining the above inequality with (10) yields
(11)
According to the definition of the gradient mapping
, and taking
, we obtain
and
Let
, the last equality is abbreviated as
. Substituting these two equalities into (10), we obtain
(12)
1) If
, Equation (12) can be simplified as
Multiplying both sides by
, we obtain
Further, by letting
, we arrive at the final conclusion
.
2) Similarly, when
, combining with Equation (12) and using the Cauchy-Schwarz inequality, we obtain
(13)
By rearranging terms and factoring, the above inequality becomes
(14)
(i) If
, by the non-negativity of the norm we have
. Furthermore, since
, it follows that
, the conclusion holds.
(ii) If
, by the non-negativity of the norm we have
. In this case,
must also hold. If
were not true, then by the non-negativity of the norm we would have
. From (11), with
and
, we obtain
Since
, it follows that
, which contradicts the premise
. Therefore, given
and
, we have
, and consequently
At this point, combining the above inequality with (14), it is easy to see that
, so the conclusion holds.
□
Next, we present the key descent inequality in the convergence analysis.
Lemma 7 (Basic Proximal Gradient Inequality). Let
be a differentiable convex function, where
is an open set, and let
be a proper closed convex function. For any
and
, if
(15)
then
(16)
where
.
Proof. Define the function
(17)
Clearly,
is a strongly convex function with strong convexity modulus
. From the definition of
, we have
(18)
By strong convexity, we obtain
(19)
Using the convexity of
, we have
Letting
and substituting it into (17), together with the above inequality, we get
furthermore, from (19), it is not difficult to see that
Next, substituting
into the expression for
and combining with the above inequality, we obtain
by rearranging terms, we get
This completes the proof. □
Since
is arbitrary, we may replace
with
, and thus the following corollary can be deduced from Lemma 7.
Corollary 1. Let
be a continuously differentiable convex function, where
is an open set, and let
be a proper closed convex function. For any
and
, if
then
Since the consecutive iterates
and
of the proximal gradient method satisfy
, Corollary 1 directly establishes the sufficient descent property of the PGD algorithm. In what follows, we combine the above lemmas to study the convergence rate of PGD for solving convex problems.
3.2. Convergence Rate Analysis of the Proximal Gradient Method in the Convex Case
In this subsection, we consider the case where
in the objective function is convex, and analyze the convergence of PGD iterations under the generalized smoothness assumption. Specifically, the convergence rate results obtained through theoretical analysis are as follows.
Theorem 8. Consider problem (1). Let
be a proper closed convex function, and let
be a differentiable proper closed convex function, where
is an open convex set and
. Assume that
is an
-smooth function, where
is a nondecreasing continuous function and
is a nonincreasing continuous function. Moreover, suppose that
is
-Lipschitz continuous. If the parameter
satisfies
then
Proof. Since
is
-Lipschitz continuous, for any
, we have
Given that
is convex and
-smooth, and
is a proper closed convex function. First, from
and
, together with Lemma 6, we obtain
which implies
Secondly, by Lemma 3, the two consecutive iterates
of PGD always satisfy
(20)
Finally, in Lemma 7, due to the arbitrariness of
, we may set
. Together with
, we have for any
,
(21)
where, by Assumption 1,
is the optimal solution of the objective function, and the second inequality follows from the convexity of
. Now, letting
run over
and summing the inequalities, we obtain
Multiplying both sides of the above inequality by
, we get
From Corollary 1, we also have
. Combining this with the above inequality yields
Multiplying both sides of the above inequality by
, we obtain
From Corollary 1, we also have
. Combining this with the above inequality yields
which can be rearranged as
This completes the proof. □
4. Conclusions
To address the strict limitation of the classic gradient Lipschitz smoothness assumption on the applicability of the PGD, this paper introduces a novel class of generalized smoothness conditions—
—smoothness, establishing a more general framework for analyzing composite convex optimization problems. Under this generalized smoothness framework, we systematically derive the convergence theory of the constant-stepsize PGD algorithm and rigorously prove that: even under the generalized smoothness assumption that relaxes the classic Lipschitz gradient continuity, the constant-stepsize PGD can still achieve the exact same sublinear convergence rate
as in the classic smoothness scenario, provided that the function
satisfies
—smoothness and Lipschitz continuity.
However, the analysis framework in this paper relies on the core assumption that the objective function
satisfies gradient Lipschitz continuity. While this assumption guarantees the gradient Lipschitz continuity of
in local regions and provides theoretical support for the convergence analysis of the constant-step-size PGD algorithm, it also significantly limits the scope of application of the method: for composite convex optimization problems that do not satisfy gradient Lipschitz continuity (such as scenarios with unbounded gradients or rapidly changing gradients), the sublinear convergence rate theory of the constant-step-size PGD algorithm established in this paper will no longer hold.
Acknowledgements
Sincere thanks to the members of JAMP for their professional performance, and special thanks to managing editor Hellen XU for a rare attitude of high quality.