Difference-of-two-norms regularizations for Q-Lasso

The focus of this paper is on Q-Lasso, introduced by Alghamdi et al. (2013), which extends the Lasso of Tibshirani (1996). The closed convex subset Q of a Euclidean m-space, for m ∈ ℕ, is the set of errors when linear measurements are taken to recover a signal/image via the Lasso. Based on a recent work by Wang (2013), we are interested in two new penalty methods for Q-Lasso relying on two types of difference-of-convex-functions (DC for short) programming, where the DC objective functions are the difference of the ℓ1 and ℓσq norms and the difference of the ℓ1 and ℓr norms with r > 1. By means of a generalized q-term shrinkage operator exploiting the special structure of the ℓσq norm, we design a proximal gradient algorithm for handling the DC ℓ1 − ℓσq model. Then, based on a majorization scheme, we develop a majorized penalty algorithm for the DC ℓ1 − ℓr model. The convergence results of our new algorithms are presented as well. We would like to emphasize that extensive simulation results in the case Q = {b} show that these two new algorithms offer improved signal recovery performance and require reduced computational effort relative to state-of-the-art ℓ1 and ℓp (p ∈ (0, 1)) models; see Wang (2013). We also devise two DC Algorithms in the spirit of a paper in which an exact DC representation of the cardinality constraint is investigated, which also used the largest-q norm ℓσq and presented numerical results showing the efficiency of its DC Algorithm in comparison with methods using other penalty terms in the context of quadratic programming; see Jun-ya et al. (2017).


Introduction and preliminaries
The process of compressive sensing (CS) [8], which consists of encoding and decoding, has consolidated rapidly year after year with the blooming of large datasets, which are increasingly important and available. Encoding amounts to taking a set of (linear) measurements, b = Ax, where A is a matrix of size m × n. If m < n, we compress the signal x ∈ ℝⁿ; decoding is the task of recovering x from b, where x is assumed to be sparse. It can be formulated as an optimization problem, namely

min ‖x‖0 subject to Ax = b, (1.1)

where ‖·‖0 is the ℓ0 norm, which counts the number of nonzero entries of x, namely ‖x‖0 = |{i : xᵢ ≠ 0}|, with |·| being here the cardinality, i.e., the number of elements of a set. Hence minimizing the ℓ0 norm amounts to finding the sparsest solution. One of the difficulties in CS is solving the decoding problem above, since ℓ0 optimization is NP-hard. An approach that has gained popularity is to replace ℓ0 by the convex norm ℓ1, since it often gives a satisfactory sparse solution and has been applied in many different fields, such as geology and ultrasound imaging.
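A minimal numerical illustration of the encoding step b = Ax with m < n and of the ℓ0 count (the matrix sizes, the chosen support, and the function name `l0_norm` are ours, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)

# Encoding: take m < n linear measurements of a sparse signal x.
m, n = 8, 20
A = rng.standard_normal((m, n))

x = np.zeros(n)
x[[3, 11, 17]] = [1.5, -2.0, 0.7]   # a 3-sparse signal in R^n

b = A @ x                            # compressed measurements, b in R^m

def l0_norm(v, tol=1e-12):
    """Number of (numerically) nonzero entries of v."""
    return int(np.sum(np.abs(v) > tol))
```

Decoding then seeks the sparsest x consistent with b, which is the NP-hard problem (1.1).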
More recently, nonconvex metrics have been used as alternatives to ℓ1, especially the nonconvex metric ℓp for p ∈ (0, 1) in [6], which can be interpreted as a continuation strategy approximating ℓ0 as p → 0. A great deal of research has been conducted on ℓp problems, including all kinds of variants and related algorithms; see [4] and the references therein. Compared with the convex ℓ1 relaxation, the nonconvex ℓp problem is generally more difficult to handle. However, it was shown in [12] that the potential reduction method can solve this special nonconvex problem in polynomial time to any given accuracy.
Most recently, a majority of such sparsity-inducing functions have been unified under the notion of DC programming in [9], including log-sum, the smoothly clipped absolute deviation and the capped-ℓ1 penalty. Generally, a DC programming problem can be solved through a primal-dual convex relaxation algorithm that is well known in the DC programming literature [11]. Other algorithms have appeared for solving applied DC programming problems in the areas of finance and insurance, data analysis, machine learning, and signal processing. However, as noted in [18], most of the aforementioned DC programming approaches for sparse reconstruction mainly preserve the separability properties of both the ℓ0 and ℓ1 norms.
To begin with, let us recall that the lasso of Tibshirani [16] is given by the following minimization problem

min_{x∈ℝⁿ} ½‖Ax − b‖2² + γ‖x‖1, (1.3)

A being an m × n real matrix, b ∈ ℝᵐ and γ > 0 a tuning parameter. The latter is nothing else than the basis pursuit (BP) of Chen et al. [7], namely

min_{x∈ℝⁿ} ‖x‖1 such that Ax = b. (1.4)

However, the constraint Ax = b being inexact due to measurement errors, problem (1.4) can be reformulated as

min_{x∈ℝⁿ} ‖x‖1 subject to ‖Ax − b‖p ≤ ε, (1.5)

where ε > 0 is the tolerance level of errors and p is often 1, 2 or ∞. It is noticed in [1] that (1.5) can be rewritten as

min_{x∈ℝⁿ} ‖x‖1 subject to Ax ∈ Q, (1.6)

in the case when Q := B_ε(b), the closed ball in ℝᵐ with center b and radius ε. Now, when Q is a nonempty closed convex set of ℝᵐ and P_Q is the orthogonal projection from ℝᵐ onto Q, observing that the constraint is equivalent to the condition Ax − P_Q(Ax) = 0 leads to the following Lagrangian formulation

min_{x∈ℝⁿ} ½‖(I − P_Q)Ax‖2² + γ‖x‖1, (1.7)

γ > 0 being a Lagrangian multiplier. A link is also made in [1] with split feasibility problems [5], which consist in finding x satisfying

x ∈ C, Ax ∈ Q, (1.8)

with C and Q two nonempty closed convex subsets of ℝⁿ and ℝᵐ, respectively. An equivalent formulation of (1.8) as a minimization problem is

min_{x∈C} ½‖(I − P_Q)Ax‖2², (1.9)

and its ℓ1-regularization is

min_{x∈C} ½‖(I − P_Q)Ax‖2² + γ‖x‖1, (1.10)

with γ > 0 a regularization parameter. This convex relaxation approach has been frequently employed; see, for example, [1,20] and the references therein. As the level curves of ℓ1-ℓ2 are closer to those of ℓ0 than those of ℓ1, this motivated us in [14] to propose a regularization of split feasibility problems by means of the nonconvex ℓ1-ℓ2, namely

min_{x∈C} ½‖(I − P_Q)Ax‖2² + γ(‖x‖1 − ‖x‖2), (1.11)

and to present three algorithms with their convergence properties [14].
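The Lagrangian formulation (1.7) is easy to evaluate numerically when Q is the ball B_ε(b) of (1.6), since the projection onto a closed ball has a closed form. Below is a minimal sketch (the function names `project_ball` and `q_lasso_objective` are ours, for illustration):

```python
import numpy as np

def project_ball(z, center, eps):
    """Orthogonal projection of z onto the closed ball B_eps(center)."""
    d = z - center
    nd = np.linalg.norm(d)
    if nd <= eps:
        return z
    return center + (eps / nd) * d

def q_lasso_objective(x, A, center, eps, gamma):
    """0.5 * ||(I - P_Q) A x||^2 + gamma * ||x||_1  with  Q = B_eps(center)."""
    Ax = A @ x
    r = Ax - project_ball(Ax, center, eps)   # (I - P_Q) A x
    return 0.5 * r @ r + gamma * np.abs(x).sum()
```

When Ax already lies in Q, the residual term vanishes and the objective reduces to γ‖x‖1, in line with the constraint reformulation above.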
Unlike the separable sparsity-inducing functions involved in the aforementioned DC programming for problem (ℓ0), we are interested, in the first two sections of this work, in two specific types of DC programming with non-separable objective functions, which take the form of differences between two norms: the new notion ℓσq, denoting the sum of the q largest elements of a vector in magnitude (i.e., the ℓ1 norm of the best q-term approximation of a vector), introduced in [18], and the classical ℓr norm with r > 1. Obviously ℓσq and ℓr (r > 1) are regular convex norms. The corresponding DC programs are as follows:

min_{x∈ℝⁿ} (‖x‖1 − ε‖x‖σq : Ax = b), (1.12)

and

min_{x∈ℝⁿ} (‖x‖1 − ε‖x‖r : Ax = b), (1.13)

where ε ∈ (0, 1], ‖x‖σq is defined as the sum of the q largest elements of x in magnitude, q ∈ {1, 2, ..., n} and r > 1. We would like to emphasize that the following least-squares variants of (1.12) and (1.13) were studied in the recent work by Wang [18]:

min_{x∈ℝⁿ} ½‖Ax − b‖2² + μ(‖x‖1 − ε‖x‖σq), (1.14)

where μ > 0 and ε ∈ (0, 1), and

min_{x∈ℝⁿ} ½‖Ax − b‖2² + μ(‖x‖1 − ε‖x‖r), (1.15)

where r > 1 and ε ∈ (0, 1).
This paper proposes generalizations of these models to Q-Lasso, namely

min_{x∈ℝⁿ} ½‖(I − P_Q)Ax‖2² + μ(‖x‖1 − ε‖x‖σq),

where μ > 0 and ε ∈ (0, 1), as well as

min_{x∈ℝⁿ} ½‖(I − P_Q)Ax‖2² + μ(‖x‖1 − ε‖x‖r),

where r > 1 and ε ∈ (0, 1); our attention will be focused on the algorithmic aspect.
The rest of the paper is organized as follows. In Sections 2 and 3, two DC-penalty methods are proposed in place of conventional methods such as ℓ1 or ℓ1 − ℓ2 minimization, and their convergence to a stationary point is analyzed. The first iterative minimization method is based on the proximal gradient algorithm and the second is designed by means of the majorized penalty strategy. Furthermore, relying on the DCA (difference-of-convex algorithm), two other algorithms are proposed and their convergence results are established in Sections 3 and 4.

Proximal gradient algorithm
First, we recall that the subdifferential of a convex function f is given by

∂f(x) := {u ∈ ℝⁿ : f(y) ≥ f(x) + ⟨u, y − x⟩ ∀y ∈ ℝⁿ}. (2.1)

Each element of ∂f(x) is called a subgradient. When f(x) = ‖x‖1, we have

∂‖x‖1 = {u ∈ ℝⁿ : uᵢ = sign(xᵢ) if xᵢ ≠ 0, uᵢ ∈ [−1, 1] otherwise}. (2.2)

The indicator function of a set C ⊆ ℝⁿ is defined by i_C(x) = 0 if x ∈ C and +∞ otherwise. Moreover, the normal cone of a set C at x ∈ C, denoted by N_C(x), is defined as

N_C(x) := {u ∈ ℝⁿ : ⟨u, y − x⟩ ≤ 0 ∀y ∈ C}.

A connection between the above definitions is given by the key relation ∂i_C = N_C. In this section our interest is in solving the DC program

min_x f(x) := ½‖(I − P_Q)Ax‖2² + μ(‖x‖1 − ε‖x‖σq), (2.6)

where μ > 0 and ε ∈ (0, 1).
Similarly to the ℓ1 norm, ℓ2 norm, etc., we adopt the notation ‖x‖σq to denote the ℓσq norm defined a line below (1.13), and we design an iterative algorithm based both on a generalized q-term shrinkage operator and on the proximal gradient algorithm framework.
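A sketch of how ‖x‖σq can be computed, by sorting magnitudes and summing the q largest (the function name is ours); note the two extreme cases ‖x‖σ1 = ‖x‖∞ and ‖x‖σn = ‖x‖1:

```python
import numpy as np

def largest_q_norm(x, q):
    """||x||_{sigma_q}: the sum of the q largest entries of x in magnitude."""
    a = np.sort(np.abs(np.asarray(x, dtype=float)))[::-1]  # magnitudes, descending
    return float(a[:q].sum())
```

This O(n log n) sort is all that is needed; no optimization subproblem is involved in evaluating the norm itself.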
At this stage, observe that the restriction on ε guarantees that f(x) ≥ 0 for all x. To solve (2.6), we consider the following standard proximal gradient algorithm:
1. Initialization: let x⁰ be given and set L > λmax(AᵀA), with λmax(AᵀA) the maximal eigenvalue of AᵀA.
2. Iteration: for k = 0, 1, ..., compute

x^{k+1} = argmin_x ½‖x − (x^k − (1/L)Aᵀ(I − P_Q)Ax^k)‖2² + (μ/L)(‖x‖1 − ε‖x‖σq). (2.7)
Observe that subproblem (2.7) can be equivalently formulated as

min_x ½‖x − y^{k+1}‖2² + (μ/L)‖x‖1 − (με/L)‖x‖σq, with y^{k+1} = x^k − (1/L)Aᵀ(I − P_Q)Ax^k. (2.8)

Thus, it suffices to consider the solutions of the following minimization problem

min_x ½‖x − y‖2² + λ1‖x‖1 − λ2‖x‖σq, (2.9)

with a given vector y and positive numbers λ1 > λ2 > 0. An explicit solution of this problem is given by the following result; see [18].

Proposition 2.1. Let {i1, ..., in} be the indices such that |y_{i1}| ≥ |y_{i2}| ≥ ... ≥ |y_{in}|. Then the vector x defined by soft-thresholding the q largest-magnitude entries of y with threshold λ1 − λ2 and the remaining entries with threshold λ1, i.e.,

x_{ij} = sign(y_{ij}) max(|y_{ij}| − (λ1 − λ2), 0) for j ≤ q,
x_{ij} = sign(y_{ij}) max(|y_{ij}| − λ1, 0) for j > q,

is a solution of (2.9). The proximal operator above (called the generalized q-term shrinkage operator in [18]) allows us to write the algorithm as follows:

Proximal Gradient Algorithm:
1. Start: let x⁰ be given and set L > λmax(AᵀA), with λmax(AᵀA) the maximal eigenvalue of AᵀA.

2. For k = 0, 1, ...: compute y^{k+1} = x^k − (1/L)Aᵀ(I − P_Q)Ax^k.
3. Sort the entries of y^{k+1} in decreasing order of magnitude and apply the generalized q-term shrinkage operator with λ1 = μ/L and λ2 = με/L to obtain x^{k+1}.
End.

Now, we are in a position to show the following convergence result for the scheme (2.7):

Proposition 2.2. The sequence (x^k) generated by the Proximal Gradient Algorithm above converges to a stationary point of problem (2.6).
Proof. Recall that h(x) = ½‖(I − P_Q)Ax‖2² is differentiable and its gradient ∇h(x) = Aᵀ(I − P_Q)Ax is Lipschitz continuous with constant L̄ := λmax(AᵀA). By [3], Proposition A.24, we have

h(x^{k+1}) ≤ h(x^k) + ⟨∇h(x^k), x^{k+1} − x^k⟩ + (L̄/2)‖x^{k+1} − x^k‖2².

Combining this with the definition of x^{k+1}, we obtain

f(x^{k+1}) ≤ f(x^k) − ((L − L̄)/2)‖x^{k+1} − x^k‖2². (2.12)

Since L > L̄, we see immediately that f(x^{k+1}) ≤ f(x^k), and thus the sequence (f(x^k)) is convergent since f is a non-negative function. Furthermore, summing (2.12) from k = 0 to ∞ yields Σ_k ‖x^{k+1} − x^k‖2² < +∞. As a further consequence, since ‖x‖σq ≤ ‖x‖1, we note that

μ(1 − ε)‖x^k‖1 ≤ f(x^k) ≤ f(x⁰).

Since μ(1 − ε) > 0, the sequence (x^k) is bounded. Moreover, the objective function f is the sum of a quadratic term and a piecewise linear function, which ensures that f is semi-algebraic and hence satisfies the Kurdyka-Łojasiewicz inequality. [2], Theorem 5.1, then applies and yields that (x^k) converges to a stationary point of (2.6). □
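The Proximal Gradient Algorithm above can be sketched as follows, assuming Q = B_ε(b) so that the projection has a closed form; the shrinkage rule follows Proposition 2.1 with λ1 = μ/L and λ2 = με/L. The function names, the fixed iteration count in place of a convergence test, and the parameter name `eps_dc` (the DC parameter ε, distinct from the ball radius `eps`) are our simplifications:

```python
import numpy as np

def soft(v, t):
    """Soft-thresholding (prox of t*||.||_1)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def project_ball(z, center, eps):
    d = z - center
    nd = np.linalg.norm(d)
    return z if nd <= eps else center + (eps / nd) * d

def q_term_shrink(y, lam1, lam2, q):
    """Generalized q-term shrinkage: threshold the q largest-magnitude
    entries of y by lam1 - lam2 and the remaining ones by lam1."""
    idx = np.argsort(-np.abs(y))          # indices sorted by |y| descending
    x = np.empty_like(y)
    x[idx[:q]] = soft(y[idx[:q]], lam1 - lam2)
    x[idx[q:]] = soft(y[idx[q:]], lam1)
    return x

def prox_grad_qlasso(A, b, eps, mu, eps_dc, q, iters=300):
    """Proximal gradient iteration for the l1 - l_{sigma_q} model, Q = B_eps(b)."""
    n = A.shape[1]
    L = np.linalg.eigvalsh(A.T @ A).max() * 1.01   # any L > lambda_max(A^T A)
    x = np.zeros(n)
    for _ in range(iters):
        Ax = A @ x
        grad = A.T @ (Ax - project_ball(Ax, b, eps))  # gradient of h
        y = x - grad / L
        x = q_term_shrink(y, mu / L, mu * eps_dc / L, q)
    return x
```

Each iteration costs one gradient evaluation plus a sort, mirroring steps 2 and 3 of the algorithm.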

Majorized penalty algorithm
Consider the following minimization problem

min_x f̃(x) := ½‖(I − P_Q)Ax‖2² + μ(‖x‖1 − ε‖x‖r), (3.1)

where A ∈ ℝ^{m×n}, Q is a nonempty closed convex set of ℝᵐ, r > 1 and ε ∈ (0, 1). First, observe again that the condition on ε guarantees that f̃(x) ≥ 0 for all x. We now describe an algorithm for solving (3.1) based on the majorized penalty approach; see, for example, [18] and the references therein. Following the same lines as in [18], we start by constructing a majorization of f̃. To that end, let L > λmax(AᵀA); then for any x, y ∈ ℝⁿ, we have

½‖(I − P_Q)Ax‖2² ≤ ½‖(I − P_Q)Ay‖2² + ⟨Aᵀ(I − P_Q)Ay, x − y⟩ + (L/2)‖x − y‖2².

Moreover, invoking the convexity of the norm ‖x‖r and the definition of its subdifferential, we also have ‖x‖r ≥ ‖y‖r + ⟨g(y), x − y⟩ with g(y) ∈ ∂‖y‖r, where, for y ≠ 0,

g(y)ᵢ = sign(yᵢ)|yᵢ|^{r−1} / ‖y‖r^{r−1}. (3.2)

Hence, if we define

F(x, y) := ½‖(I − P_Q)Ay‖2² + ⟨Aᵀ(I − P_Q)Ay, x − y⟩ + (L/2)‖x − y‖2² + μ(‖x‖1 − ε‖y‖r − ε⟨g(y), x − y⟩),

then, for every x, y ∈ ℝⁿ, we get F(x, y) ≥ f̃(x) and F(y, y) = f̃(y). Starting with an initial iterate x⁰, the majorized penalty approach updates x^k by solving

x^{k+1} = argmin_x F(x, x^k). (3.4)

This leads to the following explicit formulation of x^{k+1} by means of the proximity (shrinkage) operator of ‖x‖1:

x^{k+1} = shrink(x^k − (1/L)(Aᵀ(I − P_Q)Ax^k − μεg(x^k)), μ/L). (3.5)

We summarize the algorithm as follows:

Majorized Penalty Algorithm:
1. Initialization: let x⁰ be given and set L > λmax(AᵀA).
2. Iteration: for k = 0, 1, ..., compute x^{k+1} by (3.5).
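The explicit shrinkage update of the Majorized Penalty Algorithm can be sketched as follows for the common case r = 2 and Q = B_ε(b), where a subgradient of ‖·‖2 is y/‖y‖2 away from the origin (and 0 is a valid choice at the origin). The function names, the fixed iteration count, and the parameter name `eps_dc` for the DC parameter ε are our simplifications:

```python
import numpy as np

def soft(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def project_ball(z, center, eps):
    d = z - center
    nd = np.linalg.norm(d)
    return z if nd <= eps else center + (eps / nd) * d

def l2_subgrad(y):
    """A subgradient of ||.||_2 at y (0 is a valid choice at the origin)."""
    ny = np.linalg.norm(y)
    return y / ny if ny > 0 else np.zeros_like(y)

def majorized_penalty(A, b, eps, mu, eps_dc, iters=300):
    """Majorized penalty iteration for the l1 - eps_dc*l2 model, Q = B_eps(b):
    x^{k+1} = soft(x^k - (grad h(x^k) - mu*eps_dc*g(x^k)) / L, mu / L)."""
    n = A.shape[1]
    L = np.linalg.eigvalsh(A.T @ A).max() * 1.01   # any L > lambda_max(A^T A)
    x = np.zeros(n)
    for _ in range(iters):
        Ax = A @ x
        grad = A.T @ (Ax - project_ball(Ax, b, eps))
        y = x - (grad - mu * eps_dc * l2_subgrad(x)) / L
        x = soft(y, mu / L)
    return x
```

Each iteration minimizes the majorizer F(·, x^k) exactly, which is what guarantees the monotone decrease of f̃ established below.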
The following proposition contains the convergence result of this Majorized Penalty Algorithm.

Proposition 3.1. Let (x^k) be the sequence generated by the Majorized Penalty Algorithm above. Then

(L/2)‖x^k − x^{k+1}‖2² ≤ f̃(x^k) − f̃(x^{k+1}).

Furthermore, the sequence (x^k) is bounded and any of its cluster points is a stationary point of problem (3.1).
Proof. Since x^{k+1} minimizes F(x, x^k), the first-order optimality condition gives

0 ∈ Aᵀ(I − P_Q)Ax^k + L(x^{k+1} − x^k) + μ∂‖x^{k+1}‖1 − μεg(x^k), (3.6)

g(x^k) being a subgradient of ‖x‖r at x^k. This, combined with the definition of the subdifferential of ‖x‖1 at x^{k+1} and the definition of F, leads, for any k ≥ 0, to

(L/2)‖x^k − x^{k+1}‖2² ≤ f̃(x^k) − f̃(x^{k+1}). (3.7)

Hence f̃(x^{k+1}) ≤ f̃(x^k), and thus the sequence (f̃(x^k)) is convergent since f̃ is a non-negative function. Furthermore, the sequence (x^k) satisfies Σ_{k=0}^∞ ‖x^{k+1} − x^k‖2² < +∞; indeed, this follows by summing (3.7) from k = 0 to ∞. Consequently, the sequence (x^k) is asymptotically regular, i.e., lim_{k→+∞} ‖x^k − x^{k+1}‖ = 0. On the other hand, the definition of f̃ leads, for any k ≥ 1, to

μ(‖x^k‖1 − ε‖x^k‖r) ≤ ½‖(I − P_Q)Ax^k‖2² + μ(‖x^k‖1 − ε‖x^k‖r) = f̃(x^k) ≤ f̃(x⁰).

Since ‖x^k‖1 ≥ ‖x^k‖r, we obtain μ(1 − ε)‖x^k‖r ≤ f̃(x⁰). This implies that (x^k) is bounded since 0 < ε < 1. To conclude, we prove that every cluster point of (x^k) is a stationary point of (3.1). Let x* be a cluster point of (x^k); then x* = lim_ν x^{k_ν}, (x^{k_ν}) being a subsequence of (x^k). Passing to the limit in (3.6) along the subsequence (x^{k_ν}), using the asymptotic regularity of (x^k) and the upper semicontinuity of (Clarke) subdifferentials, we obtain the desired result, namely

0 ∈ Aᵀ(I − P_Q)Ax* + μ∂‖x*‖1 − με∂‖x*‖r,

which is nothing else than the first-order optimality condition of (3.1). □

DCA algorithm
Now we turn our attention to a DC Algorithm (DCA), where the dual step at each iteration can be carried out efficiently thanks to the accessible subgradients of the largest-q norm ‖·‖σq and of the ‖·‖r norm. Recall that, to find critical points of f := φ − ψ, the DCA consists in designing sequences (x^k) and (y^k) by the following rules:

y^k ∈ ∂ψ(x^k),
x^{k+1} = argmin_{x∈ℝⁿ} (φ(x) − (ψ(x^k) + ⟨y^k, x − x^k⟩)).

Note that, by the definition of the subdifferential, we can write ψ(x^{k+1}) ≥ ψ(x^k) + ⟨y^k, x^{k+1} − x^k⟩. Since x^{k+1} minimizes φ(x) − (ψ(x^k) + ⟨y^k, x − x^k⟩), we also have

φ(x^{k+1}) − (ψ(x^k) + ⟨y^k, x^{k+1} − x^k⟩) ≤ φ(x^k) − ψ(x^k).

Combining the last two inequalities, we obtain

f(x^k) = φ(x^k) − ψ(x^k) ≥ φ(x^{k+1}) − (ψ(x^k) + ⟨y^k, x^{k+1} − x^k⟩) ≥ f(x^{k+1}).

Therefore, the DCA produces a monotonically decreasing sequence (f(x^k)), which converges as long as the objective function f is bounded below. Now, we can decompose the objective function in (2.6) as

min_x f(x) := φ(x) − ψ(x),

where μ > 0, ε ∈ (0, 1), φ(x) = ½‖(I − P_Q)Ax‖2² + μ‖x‖1 and ψ(x) = με‖x‖σq.
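The generic DCA scheme above can be illustrated on a one-dimensional toy DC function for which the convex subproblem has a closed-form solution. The example f(x) = x² − 2|x| (φ(x) = x², ψ(x) = 2|x|) is ours, chosen only to exhibit the monotone decrease of (f(x^k)):

```python
def dca_toy(x0, iters=20):
    """DCA on f(x) = x**2 - 2*abs(x), with phi(x) = x**2 and psi(x) = 2*abs(x).
    Dual step: y^k in d(psi)(x^k) = {2*sign(x^k)} for x^k != 0.
    Primal step: argmin_x x**2 - y^k*x has the closed form x = y^k / 2."""
    x = x0
    values = [x * x - 2 * abs(x)]
    for _ in range(iters):
        y = 2.0 if x > 0 else (-2.0 if x < 0 else 0.0)  # subgradient of psi
        x = y / 2.0                                      # exact convex subproblem
        values.append(x * x - 2 * abs(x))
    return x, values
```

Starting from any x0 > 0, the iterates reach the critical point x = 1 (with f(1) = −1), and the recorded objective values are non-increasing, as the inequalities above predict.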
At each iteration, the DCA solves the convex subproblem defined by linearizing the concave term −με‖x‖σq, until a convergence condition is satisfied. More precisely, we have

y^k ∈ με∂‖x^k‖σq,
x^{k+1} = argmin_{x∈ℝⁿ} ½‖(I − P_Q)Ax‖2² + μ‖x‖1 − (με‖x^k‖σq + ⟨y^k, x − x^k⟩).

In particular, if either φ or ψ is polyhedral, the DCA is said to be polyhedral and terminates in finitely many iterations [15]. Note that our proposed DCA is polyhedral, since the largest-q norm term −ε‖x‖σq can be expressed as a pointwise maximum of 2^q · C(n, q) linear functions; see [10]. On the other hand, a subgradient of ‖x‖σq at a point x^k is given (see, for example, [19]) by the vector y = (y1, ..., yn) with y_{ij} = sign(x^k_{ij}) for j = 1, ..., q and y_{ij} = 0 for j = q+1, ..., n, where y_{ij} denotes the element of y corresponding to x^k_{ij}. Observe that such a subgradient y ∈ ∂‖x^k‖σq can be computed efficiently by first sorting the elements |x^k_i| in decreasing order, namely |x^k_{i1}| ≥ |x^k_{i2}| ≥ ... ≥ |x^k_{in}|, and then assigning sign(x^k_{ij}) to y_{ij} for the indices corresponding to x^k_{i1}, ..., x^k_{iq} and 0 to the rest.
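The sort-and-sign computation just described can be sketched as follows (the function name is ours):

```python
import numpy as np

def largest_q_subgrad(x, q):
    """A subgradient of ||.||_{sigma_q} at x: sign(x_i) on the q
    largest-magnitude entries, 0 elsewhere (ties broken arbitrarily)."""
    x = np.asarray(x, dtype=float)
    idx = np.argsort(-np.abs(x))[:q]   # indices of the q largest magnitudes
    y = np.zeros_like(x)
    y[idx] = np.sign(x[idx])
    return y
```

By construction ⟨y, x⟩ = ‖x‖σq, which is the defining property of a subgradient of a norm at x.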
A subgradient y ∈ ∂‖x^k‖r is also available via formula (3.2), and the DCA in this context takes the following form:

y^k ∈ με∂‖x^k‖r,
x^{k+1} = argmin_{x∈ℝⁿ} ½‖(I − P_Q)Ax‖2² + μ‖x‖1 − (με‖x^k‖r + ⟨y^k, x − x^k⟩).

For the details of the DCA convergence properties, see [15].
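For r > 1 the ℓr norm is differentiable away from the origin, with gradient components sign(xᵢ)|xᵢ|^{r−1}/‖x‖r^{r−1}; a sketch of this computation (the function name is ours, and taking 0 at the origin is one valid subgradient choice):

```python
import numpy as np

def lr_grad(x, r):
    """Gradient of ||.||_r (r > 1) at a nonzero x:
    g_i = sign(x_i) * |x_i|**(r-1) / ||x||_r**(r-1)."""
    x = np.asarray(x, dtype=float)
    nrm = np.linalg.norm(x, ord=r)
    if nrm == 0:
        return np.zeros_like(x)      # 0 is a valid subgradient at the origin
    return np.sign(x) * np.abs(x) ** (r - 1) / nrm ** (r - 1)
```

As with the largest-q norm, the identity ⟨g, x⟩ = ‖x‖r holds, confirming g is a (sub)gradient of the norm at x.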

Concluding remarks
The focus of this paper is on Q-Lasso, relying on two new DC-penalty methods in place of conventional methods such as the ℓ1 or ℓ1 − ℓ2 minimization developed in [13,17] and [21]. Two iterative minimization methods, based on the proximal gradient algorithm and on the majorized penalty algorithm, are designed, and their convergence to a stationary point is proved. Furthermore, by means of the DC (difference-of-convex) Algorithm, two other algorithms are devised and their convergence results are also stated.