
 
                        International Institute of Finance, School of Management, University of Science and Technology of China, Hefei 230026, China
Received 19 November 2018; Revised 10 April 2019
Foundation items: Supported by the Fundamental Research Funds for the Central Universities and the National Natural Science Foundation of China (11601500, 11671374, 11771418)
Corresponding author: GUO Xiao, xiaoguo@ustc.edu.cn
Abstract: Breakdown point is regarded as an important measure of robustness in regression analysis. At the same time, sparse model estimation is a hot topic in data analysis. In the case of less attention to the breakdown point of robust estimates in nonlinear models, we study it in binary response models. We prove that the penalized estimate of logistic models always stays bounded, which means the finite explosive breakdown point of it is 1. Moreover, we give an upper bound of the implosive breakdown point of the slope parameter. Both simulation study and real data application verify this point while we use the approximation method and coordinate descent algorithm.
Keywords: breakdown pointlogistic regressionmaximum likelihood estimatorpenalizationrobust estimation
惩罚逻辑回归的击穿点
董玉林, 郭潇

 
                        中国科学技术大学管理学院国际金融研究院, 合肥 230026
摘要: 击穿点是估计未知参数稳健性的重要度量。在常规的低维逻辑回归模型中,极大似然估计的击穿点已有了广泛的研究,但少有对高维惩罚似然估计击穿点的分析。通过研究高维逻辑回归模型的惩罚似然估计的增加和替换击穿点来分析这个问题。特别地,证明惩罚极大似然估计的L2范数总是有界的,这表明有限样本的击穿点达到了最大值0.5。此外,还提供了斜率参数的内爆击穿点的上界。模拟学习很好地支持了该理论结果。
关键词: 击穿点逻辑回归极大似然估计正则化稳健估计
Breakdown point is one of the basic tools to measure the robustness of statistical estimation. It was firstly defined by Hample[1], depending on the specific distribution of the observed data set. A distribution-free finite-sample definition of the breakdown point was proposed in Ref.[2]. Intuitively, the breakdown point is the maximum proportion of outliers that a given sample may contain without spoiling the estimator completely. Obviously, higher breakdown point indicates more robustness of an estimator. This concept has been widely applied to the location, scale, and regression models to describe the resistance of outliers.
The study of breakdown point has made great progress over the recent 40 years. In linear regression models, it is well known that the breakdown point of least square method is 0, which means only one outlier may have a large effect on the estimator. Robust estimators with a high breakdown point of 0.5 have been developed, e.g. the MM-estimators[3], the τ-estimators[4]; the least median of squares (LMS) and least trimmed squares (LTS) estimators[5]. The breakdown properties of some nonlinear regression model estimators were also investigated in, e.g. Ref.[6]. In the area of logistic regression model, the breakdown property of the maximum likelihood estimator (MLE) is totally different from that of the linear model. Cox and Snell[7] argued that the commonly used MLE in logistic regression models is not robust. Christmann[8] proposed that high breakdown point estimators do not exist in logistic regression models, unless there are some specific conditions, such as p≤n. Similar statement was noticed by Ref. [9] if one replaced some observations to raw data set. Neykov and Muller[10] considered the finite-sample breakdown point of trimmed likelihood estimators in generalized linear models. Khan et al.[11] adopted the least trimmed absolute (LTA) estimator for logistic model fitting, in which LTA is a robust estimator with high breakdown since it outperforms M-estimator in case of a certain degree of contamination. Croux et al.[12] proved that classical MLE in logistic regression models always stays bounded if some outliers were added to the raw data set. However, the aforementioned methods requires low-dimensionality that n>p.
Due to the rapid development of science and technology and various ways of data collection, large-dimensional data is becoming more and more common and has attracted great attention. Regularization is powerful to solve high-dimensional problems, which can force the model be sparse, low rank, smooth and so on. Thus, a lot of penalized estimators have been established to select useful variables for high dimensional statistical linear modeling. Breiman[13] proposed the nonnegative garrote for subset regression. Tibshirani introduced lasso in Ref. [14], and LAD-lasso was discussed in Ref. [15]. Xie and Huang[16] applied the smoothly clipped absolute deviation (SCAD) penalty to achieve sparsity in the high-dimensional linear models. Therefore, breakdown point of penalized regression is becoming more important. Alfons et al.[17] focused on some sparse estimators in linear models, in which the breakdown point of sparse LTS is verified to be (n-h+1)/n, where h is the initial guess of non-contaminations, both lasso and LAD-lasso have the breakdown point of 1/n. Various penalized logistic regression estimators are also proposed as alternatives to the non-robust MLE in logistic models, but the breakdown point of penalized logistic regression model under high-dimensionality is not discussed yet. It can neither be derived from the results of the penalized linear model nor from the generalization of the results of the unpenalized logistic regression models.
We aim to compute the breakdown point of penalized logistic regression estimator while there are outliers in the observations. Traditional methods are often limited to n>p, but the collected data sets are often with sophisticated large scale, so we emphasis not only on n>p, but also on n < p in this paper. We have tried to add different penalty items to the original loss function while the model designs can be corrupted by outliers, either in replacement or in addition situations. It is verified that the explosive breakdown point of penalized MLE in logistic regression model attains its maximal value. Accurately speaking, we can still get bounded regression coefficients even if the pollution rate of the observed data reached 50%. Our simulation studies just illustrate this property. While classical estimators in linear regression models might be arbitrarily large while there is only one outlier, which is quite different. The upper bound of implosive breakdown point of logistic regression estimators is also computed. Moreover, our method is not limited to the logistic models, but also to the probit models.
The rest part of this article is arranged as follows. In Section 2, the replacement and addition explosive breakdown point of the penalized logistic regression estimator are obtained respectively. Furthermore, we also show the replacement implosive breakdown point of the penalized MLE. Section 3 discusses a fast convergent iterative algorithm and Section 4 performs the simulation studies. Finally, Section 5 concludes.
We introduce some necessary notations in this paper. If β=(β1, …, βp)T∈
1 Main resultsWe start with a brief introduction of the logistic regression model. Denote by X=(Xij)1≤i≤p, 1≤j≤n=(X1, …, Xn) the matrix of predictor variables, where n is the number of observations and p the number of variables. Let Y=(Y1, …, Yn)T be the vector of responses, where Yi is either 0 or 1, i=1, …, n.
The logistic regression model is given by
| ${\mu _i}: = p({Y_i} = 1|{\mathit{\boldsymbol{X}}_i}) = {F^{ - 1}}({\theta _i}),$ | (1) | 
M-estimation of the parameters is often obtained by minimizing a certain loss function. While it is not suitable to use the quadratic loss for classification problems. Let γ=(γ1, …, γp+1)T=(α; β). Giving the observed sample Z=(Z1, …, Zn), where Zi=(Xi; Yi), i=1, …, n. Specifically,
Quadratic loss: l(γ, Zi)=(Yi-θi)2/2;
Deviance loss:
| $l(\mathit{\boldsymbol{\gamma }},{\mathit{\boldsymbol{Z}}_i}) = - 2\{ {Y_i}{\rm{log}}({\mu _i}) + (1 - {Y_i}){\rm{log}}(1 - {\mu _i})\} ;$ | 
In this paper, we use Bernoulli deviance loss which is often used in exponential dispersion models or generalized linear models.
| $\begin{array}{*{20}{l}}{L(\mathit{\boldsymbol{\gamma }},\mathit{\boldsymbol{Z}}) = - \frac{2}{n}\sum\limits_{i = 1}^n {\{ {Y_i}{\rm{log}}(} {\mu _i}) + }\\{{\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} (1 - {Y_i}){\rm{log}}(1 - {\mu _i})\} .}\end{array}$ | (2) | 
| $\mathit{\pmb{\hat \gamma }} = \mathop {{\rm{argmin}}}\limits_\mathit{\boldsymbol{\gamma }} \{ L(\mathit{\boldsymbol{\gamma }},\mathit{\boldsymbol{Z}}) + {P_\lambda }(\mathit{\boldsymbol{\beta }})\} : = \mathop {{\rm{ argmin }}}\limits_\gamma Q(\mathit{\boldsymbol{\gamma }},\mathit{\boldsymbol{Z}}),$ | (3) | 
The components of Pλ(β) are always additive, which means
L1 penalty function,
| ${P_\lambda }(|{\beta _j}|) = \lambda |{\beta _j}|.$ | (4) | 
| ${P_\lambda }(|{\beta _j}|) = \lambda \beta _j^2/2.$ | (5) | 
| $\begin{array}{l}{\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {P_\lambda }(|{\beta _j}|) = \\\left\{ {\begin{array}{*{20}{l}}{\lambda |{\beta _j}|,}&{|{\beta _j}| \le \lambda ,}\\{\frac{{ - (1{\beta _j}{|^2} - 2a\lambda |{\beta _j}| + {\lambda ^2})}}{{2(a - 1)}},}&{\lambda < |{\beta _j}| \le a\lambda ,}\\{(a + 1){\lambda ^2}/2,}&{|{\beta _j}| > a\lambda ,}\end{array}} \right.\end{array}$ | (6) | 
To study the robustness of an estimator, Donoho and Huber[2] introduced two finite-sample versions of breakdown point, one is replacement breakdown point, the other is addition breakdown point. Generally speaking, the value is the smallest proportion of contamination that can lead the estimator's Euclidean norm to infinity or to zero, which means the estimator becomes completely unreliable. Rousseeuw and Leroy[21] held that breakdown point can not exceed 50%. Intuitively, if more than half of the data are polluted, we will not be able to distinguish the original distribution from the pollution distribution.
1.1 Explosivebreakdown point When the Euclidean norm of
| ${\varepsilon _\infty }(\mathit{\pmb{\hat \gamma }},{{\mathit{\boldsymbol{\tilde Z}}}^{{\rm{rep}}}}) = {\rm{min}}\left\{ {\frac{m}{n}:\mathop {{\rm{sup}}}\limits_{{{\mathit{\boldsymbol{\tilde Z}}}^{{\rm{rep}}}}} {{\left\| {\mathit{\pmb{\hat \gamma }}({{\mathit{\boldsymbol{\tilde Z}}}^{{\rm{rep}}}})} \right\|}_2} = \infty } \right\},$ | 
| ${\varepsilon _\infty }(\mathit{\pmb{\hat \gamma }},{{\mathit{\boldsymbol{\tilde Z}}}^{{\rm{add}}}}) = {\rm{min}}\left\{ {\frac{m}{{n + m}}:\mathop {{\rm{sup}}}\limits_{{{\mathit{\boldsymbol{\tilde Z}}}^{{\rm{ add }}}}} {{\left\| {\mathit{\pmb{\hat \gamma }}({{\mathit{\boldsymbol{\tilde Z}}}^{{\rm{rep}}}})} \right\|}_2} = \infty } \right\},$ | 
For penalized MLE in logistic models, we have the following theorem.
Theorem 1.1 For model (1), considering
| ${\varepsilon _\infty }(\mathit{\pmb{\hat \gamma }},{\mathit{\boldsymbol{\tilde Z}}^{{\rm{rep}}}}) = 0.5.$ | 
| $\begin{array}{*{20}{l}}{Q(\mathit{\boldsymbol{\gamma }},{{\mathit{\boldsymbol{\tilde Z}}}^{{\rm{ rep }}}}) = \frac{{ - 2}}{n}\sum\limits_{i = 1}^n {\{ {{\tilde Y}_i}{\rm{log}}(} {\mu _i}) + }\\{(1 - {{\tilde Y}_i}){\rm{log}}(1 - {\mu _i})\} + {P_\lambda }(\mathit{\boldsymbol{\beta }}).}\end{array}$ | 
| $Q(\mathit{\pmb{0}},{\mathit{\boldsymbol{\tilde Z}}^{{\rm{rep}}}}) = 2{\rm{log}}{\kern 1pt} {\kern 1pt} 2.$ | 
| $\begin{array}{*{20}{l}}{Q(\mathit{\boldsymbol{\gamma }},{{\mathit{\boldsymbol{\tilde Z}}}^{{\rm{rep}}}}) > {P_\lambda }(\mathit{\boldsymbol{\beta }}) = \lambda {{\left\| \mathit{\boldsymbol{\beta }} \right\|}_1} \ge c\lambda {{\left\| \mathit{\boldsymbol{\beta }} \right\|}_2}}\\{{\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} \ge 2{\rm{log}}2 + 1 > Q(\mathit{\pmb{0}},{{\mathit{\boldsymbol{\tilde Z}}}^{{\rm{rep}}}}).}\end{array}$ | 
| $\begin{array}{*{20}{l}}{Q(\mathit{\boldsymbol{\gamma }},{{\mathit{\boldsymbol{\tilde Z}}}^{{\rm{rep}}}}) > {P_\lambda }(\mathit{\boldsymbol{\beta }}) = \frac{\lambda }{2}{{\left\| \mathit{\boldsymbol{\beta }} \right\|}_2}}\\{{\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} \ge 2{\rm{log}}{\kern 1pt} {\kern 1pt} 2 + 1 > Q(\mathit{\pmb{0}},{{\mathit{\boldsymbol{\tilde Z}}}^{{\rm{rep}}}}).}\end{array}$ | 
| $\begin{array}{*{20}{l}}{Q(\mathit{\boldsymbol{\gamma }},{{\mathit{\boldsymbol{\tilde Z}}}^{{\rm{ rep }}}}) > {P_\lambda }(\mathit{\boldsymbol{\beta }}) > \frac{{(a + 1){\lambda ^2}}}{2}}\\{{\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} > 2{\rm{log}}{\kern 1pt} {\kern 1pt} 2 + 1 > Q(\mathit{\pmb{0}},{{\mathit{\boldsymbol{\tilde Z}}}^{{\rm{ rep }}}}).}\end{array}$ | 
We can see that the constant M only depends on p. So either for the case of n>p or the situation of n < p, Theorem 1.1 is true. We choose deviance loss in the proof of Theorem 1.1. If the loss term was replaced by exponential loss function, the process of proving would be similar, and the conclusion would be the same.
As mentioned above, there are two types of finite-sample breakdown point. Zuo[22] ever gave the quantitative relationship between replacement breakdown point and addition breakdown point of a large class of estimators whose breakdown points are independent of the configuration of X. One may think that the influence of adding outliers may be different from that some of the raw observations are replaced by outliers. From the proof process of Theorem 1.1, it can be seen whether the arbitral observations are replaced or added, it does not matter. If we added m outliers to the raw data set, we could conclude the following theorem.
Theorem 1.2 For model (1), considering
| ${\varepsilon _\infty }(\mathit{\pmb{\hat \gamma }},{\mathit{\boldsymbol{\tilde Z}}^{{\rm{add}}}}) = 0.5.$ | 
Theorem 1.1 and Theorem 1.2 show that the penalized MLE is very robust in binary logistic regression models. Moreover, for multiple classification models, the theorems are also applicable. In logistic regression models with binary data, we verified that the explosive breakdown point of penalized MLE in logistic regression models gets the biggest value 0.5, which is a good result.
1.2 Implosivebreakdown point Another direction to describe a meaningless estimator is ‖
| ${\varepsilon _0}(\mathit{\pmb{\hat \beta }},{\mathit{\boldsymbol{\tilde Z}}^{{\rm{rep}}}}) = {\rm{min}}\left\{ {\frac{m}{n}:\mathop {{\rm{inf}}}\limits_{{{\mathit{\boldsymbol{\tilde Z}}}^{{\rm{rep}}}}} {{\left\| {\mathit{\pmb{\hat \beta }}({{\mathit{\boldsymbol{\tilde Z}}}^{{\rm{rep}}}})} \right\|}_2} = 0} \right\}.$ | (7) | 
Theorem 1.3 In model (1), considering (3), the penalized MLE satisfies
| ${\varepsilon _0}(\mathit{\pmb{\hat \beta }},{\mathit{\boldsymbol{\tilde Z}}^{{\rm{rep}}}}) < \frac{{2p}}{n}.$ | 
First, |α+CekTβ|>N‖β‖2>Nδ.
Case one: α+CekTβ>Nδ=τ, considering the observation
| $\begin{array}{*{20}{l}}{Q(\mathit{\boldsymbol{\gamma }},{{\mathit{\boldsymbol{\tilde Z}}}^{{\rm{rep}}}}) > \frac{1}{n}l(\mathit{\boldsymbol{\gamma }},\mathit{\boldsymbol{\tilde Z}}_{k0}^{{\rm{rep}}}) = \frac{2}{n}{\rm{log}}(1 + {{\rm{e}}^{\alpha + \mathit{\boldsymbol{\tilde X}}_i^{\rm{T}}\mathit{\boldsymbol{\beta }}}})}\\{{\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} > \frac{2}{n}{\rm{log}}(1 + {{\rm{e}}^\tau }) = Q(\mathit{\pmb{0}},{{\mathit{\boldsymbol{\tilde Z}}}^{{\rm{rep}}}}).}\end{array}$ | 
| $\begin{array}{*{20}{l}}{Q(\mathit{\boldsymbol{\gamma }},{{\mathit{\boldsymbol{\tilde Z}}}^{{\rm{rep}}}}) > \frac{1}{n}l(\mathit{\boldsymbol{\gamma }},\mathit{\boldsymbol{\tilde Z}}_{k1}^{{\rm{rep}}}) = \frac{2}{n}{\rm{log}}(1 + {{\rm{e}}^{ - \alpha - \mathit{\boldsymbol{\tilde X}}_i^{\rm{T}}\mathit{\boldsymbol{\beta }}}})}\\{{\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} > \frac{2}{n}{\rm{log}}(1 + {{\rm{e}}^\tau }) = Q(\mathit{\pmb{0}},{{\mathit{\boldsymbol{\tilde Z}}}^{{\rm{rep}}}}).}\end{array}$ | 
Case one: βjm>0,
α+ejmTβ=α+Cβjm < N‖β‖2, so α≤N‖β‖2-Cβjm≤N‖β‖2-C‖β‖2/
| $\begin{array}{*{20}{l}}{\alpha + \mathit{\boldsymbol{X}}_i^{\rm{T}}\mathit{\boldsymbol{\beta }} \le \alpha + {{\left\| {{\mathit{\boldsymbol{X}}_i}} \right\|}_2}{{\left\| \mathit{\boldsymbol{\beta }} \right\|}_2}}\\{{\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} \le - (M + N){{\left\| \mathit{\boldsymbol{\beta }} \right\|}_2} + M{{\left\| \mathit{\boldsymbol{\beta }} \right\|}_2}}\\{{\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} = - N{{\left\| \mathit{\boldsymbol{\beta }} \right\|}_2} < - N\delta = - \tau .}\end{array}$ | 
| $\begin{array}{*{20}{l}}{Q(\mathit{\boldsymbol{\gamma }},{{\mathit{\boldsymbol{\tilde Z}}}^{{\rm{rep}}}}) > \frac{1}{n}l(\mathit{\boldsymbol{\gamma }},{\mathit{\boldsymbol{Z}}_{i1}}) = \frac{2}{n}{\rm{log}}(1 + {{\rm{e}}^{ - \alpha - \mathit{\boldsymbol{X}}_i^{\rm{T}}\beta }})}\\{{\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} > \frac{2}{n} - {\rm{log}}(1 + {e^\tau }) = Q(\mathit{\pmb{0}},{{\mathit{\boldsymbol{\tilde Z}}}^{{\rm{rep}}}}).}\end{array}$ | 
0>α+ejmTβ=α+Cβjm>-N‖β‖2, so α>-N‖β‖2-Cβjm>-N‖β‖2+C‖β‖2/
| $\begin{array}{*{20}{l}}{\alpha + \mathit{\boldsymbol{X}}_i^{\rm{T}}\mathit{\boldsymbol{\beta }} > \alpha - {{\left\| {{\mathit{\boldsymbol{X}}_i}} \right\|}_2}{{\left\| \mathit{\boldsymbol{\beta }} \right\|}_2}}\\{{\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} > (M + N){{\left\| \mathit{\boldsymbol{\beta }} \right\|}_2} - M{{\left\| \mathit{\boldsymbol{\beta }} \right\|}_2}}\\{{\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} = N{{\left\| \mathit{\boldsymbol{\beta }} \right\|}_2} > N\delta = \tau .}\end{array}$ | 
| $\begin{array}{*{20}{l}}{Q(\mathit{\boldsymbol{\gamma }},{{\mathit{\boldsymbol{\tilde Z}}}^{{\rm{rep}}}}) > \frac{1}{n}l(\mathit{\boldsymbol{\gamma }},{{\mathit{\boldsymbol{\tilde Z}}}_{i0}}) = \frac{2}{n}{\rm{log}}(1 + {{\rm{e}}^{\alpha + \mathit{\boldsymbol{X}}_i^{\rm{T}}\beta }})}\\{{\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} > \frac{2}{n}{\rm{log}}(1 + {{\rm{e}}^\tau }) = Q(\mathit{\pmb{0}},{{\mathit{\boldsymbol{\tilde Z}}}^{{\rm{rep}}}}).}\end{array}$ | 
Theorem 1.3 implies that the number of contaminated data cannot exceed twice the number of independent variables, which shows the non-robustness of the estimator. The standard error of a regression estimator would not explode when the Euclidean norm of the estimator tends to zero. Therefore, this kind of breakdown is more difficult to detect. Croux et al.[12] proved that MLE of logistic regression model breaks down to zero when adding several outliers to the data set. Here, we discuss penalized MLE in replacement situation, and our result is not bad.
2 AlgorithmIn this section, we concentrate on the computation of (3). Approaches applicable to linear models may computationally infeasible to nonlinear regression analysis. In our model, the objective function is nonlinear, the general normal equation is a transcendental equation. we can only solve the problem by numerical methods in stead of algebraic methods.
There are some fast algorithms for numerical computing. Efron et al.[23] introduced least angle regression. Balakrishnan and Madigan[24] presented the online algorithm and the MP algorithm for learning L1 logistic regression models; Wu and Lange[25] used the coordinated descent (CD) algorithm to solve lasso regressions. Friedman et al.[26] derived the generalized CD algorithm for convex optimization problems. CD was also mentioned in Ref.[27], in which it was shown to gain computational superiority. It is proved that the block CD algorithm has linear convergence rate in Ref.[28]. Saha and Tewari[29] proved that the CD algorithm has the convergence rate of O(1/p). Since this algorithm has many excellent properties, it is wise to use it in this paper.
An approximate solution can be calculated by convergent iterative algorithms. we try to transform the deviance loss function into approximate squared loss function through Taylor expansion. In this way, the regression coefficient can be calculated more conveniently. Note that for multiple classification regression problems, the Taylor expansion for the ordinary objective function would become more complex. In these situations, one should consider the use of other numerical methods, such as Newton method, gradient ascent method, among others.
An initial estimator is necessary for Taylor expansion. If the initial value was closer to the real parameter, the iteration rate would be faster and the result would be more accurate. After a large number of experiments, we found that γ0=0 is a good choice. For a given initial estimator γ0=(α0; β0), let θi0=α0+XiTβ0, i=1, …, n. Next, we develop the Taylor expansion of the loss function at θi0,
| $\begin{array}{l}L(\mathit{\boldsymbol{\gamma }},\mathit{\boldsymbol{Z}}) = \frac{1}{n}\sum\limits_{i = 1}^n l (\mathit{\boldsymbol{\gamma }},{\mathit{\boldsymbol{Z}}_i})\\ \approx \frac{1}{n}\sum\limits_{i = 1}^n {\left\{ {l({\mathit{\boldsymbol{\gamma }}_0},{\mathit{\boldsymbol{Z}}_i}) + {{\left. {\frac{{\partial l(\mathit{\boldsymbol{\gamma }},{\mathit{\boldsymbol{Z}}_i})}}{{\partial {\theta _i}}}} \right|}_{{\theta _{i0}}}}({\theta _i} - {\theta _{i0}}) + } \right.} \\\left. {{\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} \frac{1}{2}{{\left. {\frac{{{\partial ^2}l(\mathit{\boldsymbol{\gamma }},{\mathit{\boldsymbol{Z}}_i})}}{{\partial \theta _i^2}}} \right|}_{{\theta _{i0}}}}{{({\theta _i} - {\theta _{i0}})}^2}} \right\}\\ \approx \frac{1}{{2n}}\sum\limits_{i = 1}^n {l_{{\theta _{i0}}}^{\prime \prime }} {\left\{ {{\theta _{i0}} - \frac{{l_{{\theta _{i0}}}^\prime }}{{l_{{\theta _{i0}}}^{\prime \prime }}} - {\theta _i}} \right\}^2}\end{array}$ | 
If the loss function is deviance loss function, we would have l(γ, Zi)=-2{yilogμi+(1-yi)log(1-μi)}. Then for the logistic model and the probit model, we already have their distribution functions, it is easy to obtain their first order derivative function and two order derivative function.
For logit model, we have
| ${l_{{\theta _i}}^\prime = \frac{{\partial l(\mathit{\boldsymbol{\gamma }},{\mathit{\boldsymbol{Z}}_i})}}{{\partial {\theta _i}}} = 2({\mu _i} - {y_i}),}$ | 
| ${l_{{\theta _i}}^{\prime \prime } = \frac{{{\partial ^2}l(\mathit{\boldsymbol{\gamma }},{\mathit{\boldsymbol{Z}}_i})}}{{\partial \theta _i^2}} = 2{\mu _i}(1 - {\mu _i}).}$ | 
| $l_{{\theta _i}}^\prime = 2\frac{{{\mu _i}\mu _i^\prime - {y_i}\mu _i^\prime }}{{{\mu _i}(1 - {\mu _i})}},$ | 
| $\begin{array}{*{20}{l}}{l_{{\theta _i}}^{\prime \prime } = 2\frac{{(\mu _i^\prime \mu _i^\prime + {\mu _i}\mu _i^{\prime \prime } - {y_i}\mu _i^{\prime \prime })({\mu _i} - \mu _i^2)}}{{\mu _i^2{{(1 - {\mu _i})}^2}}} - }\\{{\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} 2\frac{{({\mu _i}\mu _i^\prime - {y_i}\mu _i^\prime )(\mu _i^\prime - 2{\mu _i}\mu _i^\prime )}}{{\mu _i^2{{(1 - {\mu _i})}^2}}},}\end{array}$ | 
Otherwise, if the loss function was exponential loss l(γ, Zi)=exp{-(yi-0.5)θi}, there is nothing to do with μi, so no matter how to construct ui, we have
| ${l_{{\theta _i}}^\prime = ({y_i} - 0.5){{\rm{e}}^{ - ({y_i} - 0.5){\theta _i}}},}$ | 
| ${l_{{\theta _i}}^{\prime \prime } = \frac{1}{4}{{\rm{e}}^{ - ({y_i} - 0.5){\theta _i}}}.}$ | 
| $\left\{ {\begin{array}{*{20}{l}}{\partial Q(\mathit{\boldsymbol{\gamma }},\mathit{\boldsymbol{Z}})/\partial \alpha = 0,}\\{\partial Q(\mathit{\boldsymbol{\gamma }},\mathit{\boldsymbol{Z}})/\partial {\beta _j} = 0,J = 1, \cdots ,p.}\end{array}} \right.$ | 
| ${\beta _1^{(k)} \in {\rm{arg}}{\kern 1pt} {\kern 1pt} {\rm{min}}Q\{ ({\alpha ^{(k - 1)}},\beta _2^{(k - 1)}, \cdots ,\beta _p^{(k - 1)}),\mathit{\boldsymbol{Z}}\} ,}$ | 
| ${\beta _2^{(k)} \in {\rm{arg}}{\kern 1pt} {\kern 1pt} {\rm{min}}Q\{ ({\alpha ^{(k)}},\beta _1^{(k)},\beta _3^{(k - 1)}, \cdots ,\beta _p^{(k - 1)}),\mathit{\boldsymbol{Z}}\} ,}$ | 
| $\begin{array}{l} \cdots \\\beta _p^{(k)} \in {\rm{arg}}{\kern 1pt} {\kern 1pt} {\rm{min}}Q\{ ({\alpha ^{(k)}},\beta _1^{(k)},\beta _2^{(k)}, \cdots ,\beta _{p - 1}^{(k)}),\mathit{\boldsymbol{Z}}\} ,\end{array}$ | 
| ${{\alpha ^{(k)}} \in {\rm{arg}}{\kern 1pt} {\kern 1pt} {\rm{min}}Q\{ (\beta _1^k,\beta _2^k,\beta _3^k, \cdots ,\beta _p^k),\mathit{\boldsymbol{Z}}\} .}$ | 
For L1 penalization,
| ${\mathord{\buildrel{\lower3pt\hbox{$\scriptscriptstyle\frown$}} \over \beta } _j} = \frac{{S\{ \sum\nolimits_{i = 1}^n {l_{{\theta _i}0}^{\prime \prime }} {x_{ij}}({\theta _{i0}} - \frac{{l_{{\theta _{i0}}}^\prime }}{{l_{{\theta _{i0}}}^{\prime \prime }}} - y_{i0}^{ - j}),\lambda \} }}{{\sum\nolimits_{i = 1}^n {l_{{\theta _i}0}^{\prime \prime }} x_{ij}^2}}.$ | (8) | 
| $S(\mu ,\lambda ) = \left\{ {\begin{array}{*{20}{l}}{\mu - \lambda ,}&{{\rm{ if}}{\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} \mu > \lambda ,}\\{0,}&{{\rm{ if}}{\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} |\mu | \le \lambda ,}\\{\mu + \lambda ,}&{{\rm{ if}}{\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} \mu < - \lambda .}\end{array}} \right.$ | 
| ${\mathord{\buildrel{\lower3pt\hbox{$\scriptscriptstyle\frown$}} \over \beta } _j} = \frac{{\sum\nolimits_{i = 1}^n {l_{{\theta _i}0}^{\prime \prime }} {x_{ij}}({\theta _{i0}} - \frac{{l_{{\theta _{i0}}}^\prime }}{{l_{{\theta _{i0}}}^{\prime \prime }}} - y_{i0}^{ - j})}}{{\sum\nolimits_{i = 1}^n {l_{{\theta _i}0}^{\prime \prime }} x_{ij}^2 + 1}}.$ | (9) | 
| $\hat \alpha = \frac{{\sum\nolimits_{i = 1}^n {l_{{\theta _{i0}}}^{\prime \prime }} ({y_i} - {x_i}\mathit{\pmb{\hat \beta }})}}{{\sum\nolimits_{i = 1}^n {l_{{\theta _{i0}}}^{\prime \prime }} }}.$ | (10) | 
3 SimulationIn this section, simulation studies on artificial data sets with different groups of n and p are presented. We construct two data configurations, include the case of n>p and the case of n < p.
The first data configuration is with n=200 and p=5, n/2, n-20 respectively, which is corresponding to the low-dimensional data set. Assume that X obeys the multivariate normal distribution, X~N(0, ∑), where the covariance matrix ∑=(∑ij)1≤i, j≤p is assigned to ∑ij=0.5|i-j|, which implies the multiple multicollinearity between predictor variables. Using the coefficient vector γ with α=0.2, β1=0.5, β2=-0.6, β3=1, β4=-0.8, β5=0.1, and βj=0 for j∈{1, 2, …, p+1}\{1, 2, 3, 4, 5}, the response variable Y is generated according to model (1).
Then, the second configuration represents the moderate case of high-dimensional. n=100 observations are generated and the value of p is n+10, 2n or 4n. The independent variables are subjected to the distributions N(0, ∑) with ∑ij=0.1|i-j|. For regression parameters, we assume that α=0.2, and β1=-0.7, β1=1, β1=-0.3, β1=0.5, β1=-0.6, while other slope coefficients βj=0. Y is also constructed according to model (1).
Denote by k the contaminated percentage, k=0, 25% or 50%. For each data set, we make different kinds of pollution.
1) No contamination;
2) Vertical outliers: k of the dependent variables in model (1) are changed, which means some Yi will become 1-Yi;
3) Leverage points: same as in (2), while for the k contaminated observations, we drawing the predictor variables from N(μ, ∑) instead of N(0, ∑), where μ=(20, …, 20).
As for shrinkage parameter λ, it is always unknown in advance. We know that the bigger value of λ, the higher punishment. More coefficients will be compressed to 0, which lead to greater bias. But if the value was too small, it could not reach the effect of variable selection. There are many ways to choose an optimal value, such as Bayes information criterion (BIC), Akaike info criterion (AIC), cross validation (CV), among others. In this paper, let MSE=
There are several termination rules widely adopted in an iterative algorithm, such as the function descent criterion, the distance criterion, the gradient criterion. At present article, we use the distance criterion. For each cycle of the coordinate descent algorithm, the specific form of stopping criterion is
| $\mathop {{\rm{max}}}\limits_{j = 1, \cdots ,p + 1} |\hat \gamma _j^{{\rm{ old }}} - \hat \gamma _j^{{\rm{ new }}}| < {10^{ - 3}}.$ | (11) | 
Given a parametric statistical model (sample space and sample distribution family), decision space and loss function, it is naturally desirable to specify a decision for each point in the sample space. We assume that the decision function is
| ${Y_i} = 1{\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\rm{if}}{\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} {\kern 1pt} p({Y_i} = 1|{\mathit{\boldsymbol{X}}_i}) > 0.5.$ | (12) | 
We polluted the raw data in different degrees and different types of outliers. Following are the partial simulation results and all the above simulations are performed in Matlab2016b.
Here, Fig. 1 is one of the cross validation diagram.
Fig. 1
|  | Download: JPG larger image | 
| Fig. 1 Cross validation result while n=200 and p=100 | |
Table 1 and Table 2 express the results of logistic regression model while n>p and n < p, averaged over 500 simulation runs. Here, let MR be the misclassification rate.
Table 1
|  Table 1 Logistic regression results while there are vertical outliers 
 | Table 1 Logistic regression results while there are vertical outliers | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Table 2
|  Table 2 Logistic regression results with different numbers of leverage points 
 | Table 2 Logistic regression results with different numbers of leverage points | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
As shown in Table 1, if there were vertical outliers in the data set, the regression coefficient always stayed bounded regardless of the extent of the first kind of pollution degree. At the same time, regression coefficients keep bounded no matter how many leverage points were in data set in Table 2. In general, penalized logistic regression is a very robust approach, which echoing the previous
Then, we change the link function to Probit link function. Table 3 shows the performance of probit regression, averaged over 500 simulation runs.
Table 3
|  Table 3 Probit regression results with different numbers of outliers 
 | Table 3 Probit regression results with different numbers of outliers | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Similarly, the MSE have a limit in Table 3, which means the regression estimator keeps bounded. In addition to some similarities between logistic and probit regression models, previous scholars found that the logistic regression coefficient is about 1.8 times bigger than that of the probit regression coefficient with the same data sets. Although the coefficients are different, the classification results are essentially similar. The meaning of the coefficient, the evaluation of the model and the hypothesis test are all similar.
Furthermore, we find that the choice of the loss functions has an almost negligible effect on the prediction result. This might be the reason why less discussion about the impact on different loss functions was talked in classification models.
4 DiscussionThis paper shows that penalized MLE has a high explosive breakdown point in binary logistic regression models. Accurately speaking, we can still get bounded regression coefficients even if the pollution rate of the observations reaches 50%. The property is shown by our simulation studies. In either logistic model or probit model, the regression estimators are robust in contaminated data sets. Also, we give the upper bound of implosive breakdown point of the slope parameter.
Theorem 1.3 gives the upper bound of implosive breakdown point of slope parameter. However, for robust estimator, the bigger the breakdown point, the better. Naturally, we consider whether the implosive breakdown point has a larger lower bound by changing the penalty item or changing the loss function. For example, sparse LTS method is pretty robust in linear models, can we learn from the idea of trimming? As we know, SCAD penalty is a famous penalty term as it satisfies sparsity mathematical conditions. So it may lead to unexpected gains if we use this punishment. In our future research, we will pay more attention to it.
While there is only a slight multi-collinearity between the explanatory variables, MLE is consistent, asymptotically valid, and asymptotically normal for n→+∞ under some weak assumptions. Fahrmeir and Kaufmann[31] did some research. With the increase of sample size, the standard error of parameter estimation will gradually decrease. After adding a penalty item, these properties may change. As for hypothesis test, we need more studies, not only on breakdown point, but also on the standard error.
References
| [1] | Hample F R. A general qualitative definition of robustness[J]. The Annals of Mathematical Statistics, 1971, 42: 1187-1896. DOI:10.1214/aoms/1177693235 | 
| [2] | Donoho D, Huber P J. The notion of breakdown point[C]//Bickel P J, Doksum K A, Hodges J L. A Festschrift for Eric Lehmann. Belmont: Wadsworth, 1983: 157-184. | 
| [3] | Yohai V J. High breakdown point and high efficiency robust esrimates for regression[J]. Tne Annals of Statistics, 1987, 15: 642-656. DOI:10.1214/aos/1176350366 | 
| [4] | Yohai V J, Zamar R H. High breakdown point estimators of regression by means of the minimization of an efficient scale[J]. Journal of American Statistical Association, 1988, 83: 406-613. DOI:10.1080/01621459.1988.10478611 | 
| [5] | Rousseeuw P. Least median of squares regression[J]. Publications of the American Statistical Association, 1984, 79: 871-880. DOI:10.1080/01621459.1984.10477105 | 
| [6] | Sakata S, White H. An alternative definition of finite-sample breakdown point with application to regression model estimators[J].  Journal of the American Statistical Association, 1995, 90: 1099-1106. | 
| [7] | Cox D R, Snell E J. The analysis of binary data[M]. Alexandria: Biometrics, 1970. | 
| [8] | Christmann A. Least median of weighted squares in logistic regression with large strata[J]. Biometrika, 1994, 81: 413-417. DOI:10.1093/biomet/81.2.413 | 
| [9] | Kunsch H R, Stefanski L A, Carroll R J. Conditionally unbiased bounded-influence estimation in general regression models, with applications to generalized linear models[J].  Journal of the American Statistical Association, 1981, 84: 460-466. | 
| [10] | Neykov N M, Muller C H. Breakdown point and computation of trimmed likelihood estimators in generalized linear models[C]//Dutter R, Filzmoser P, Gather U, et al. Developments in Robust Statistics. Berlin: Spring-Verlag, 2003: 277-286. | 
| [11] | Khan D M, Ihtisham S, Ali A, et al. An efficient and high breakdown estimation procedure for nonlinear regression models[J].  Pakistan Journal of Statistics, 2017, 33: 223-236. | 
| [12] | Croux C, Flandre C, Haesbroeck G. The breakdown behavior of the maximum likelihood estimator in the logistic regression model[J].  Statistics & Probability Letters, 2002, 60: 377-386. | 
| [13] | Breiman L. Better subset regression using the nonnegative garrote[J]. Technometrics, 1995, 37: 373-384. DOI:10.1080/00401706.1995.10484371 | 
| [14] | Tibshirani R. Regression shrinkage and selection via the lasso[J].  Journal of the Royal Statistical Society, 1996, 58: 267-288. | 
| [15] | Wang H, Li G, Jiang G. Robust regression shrinkage and consistent variable selection through the LAD-lasso[J]. Journal of Business and Economic Statistics, 2007, 25: 347-355. DOI:10.1198/073500106000000251 | 
| [16] | Xie H, Huang J. Scad-penalized regression in high-dimensional partially linear models[J]. The Annals of Statistics, 2009, 37: 673-296. DOI:10.1214/07-AOS580 | 
| [17] | Alfons A, Croux C, Gelper S. Sparse least trimmed squares regression for analyzing high-dimensional large data sets[J]. The Annals of Applied Statistics, 2013, 7: 226-248. DOI:10.1214/12-AOAS575 | 
| [18] | Albert A, Anderson J A. On the existence of maximum likelihood estimators in logistic regression models[J]. Biometrika, 1984, 71: 1-10. DOI:10.1093/biomet/71.1.1 | 
| [19] | Silvapulle M J. On the existence of maximum likelihood estimators for the binomial response models[J].  Journal of the Royal Statistical Society, 1981, 43: 310-313. | 
| [20] | Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties[J]. Journal of the American Statistical Association, 2001, 96: 1348-1360. DOI:10.1198/016214501753382273 | 
| [21] | Rousseeuw P J, Leroy A M. Robust regression and outlier detection[M]. America: Wiley-Interscience, 1987. | 
| [22] | Zuo Y. Some quantitative relationships between two types of finite-sample breakdown point[J].  Statistics & Probability Letters, 2001, 51: 369-375. | 
| [23] | Efron B, Hastie T, Johnstone I, et al. Least angle regression[J]. The Annals of Statistics, 2004, 32: 407-451. DOI:10.1214/009053604000000067 | 
| [24] | Balakrishnan S, Madigan D. Algorithms for sparse linear classifiers in the massive data setting[J].  Journal of Machine Learning Research, 2008, 9: 313-337. | 
| [25] | Wu T T, Lange K. Coordinate descent algorithms for lasso penalized regression[J]. The Annals of Applied Statistics, 2008, 2: 224-244. DOI:10.1214/07-AOAS147 | 
| [26] | Friedman J, Hastie T, Hofling H, et al. Pathwise coordinate optimization[J]. The Annals of Applied Statistics, 2007, 1: 302-332. DOI:10.1214/07-AOAS131 | 
| [27] | Zhang C, Zhang Z, Chai Y. Penalized bregman divergence estimation via coordinate descent[J].  Journal of the Iranian Statistical Society, 2011, 2: 125-140. | 
| [28] | Luo Z Q, Tseng P. Error bounds and convergence analysis of feasible descent methods:a general approach[J].  Annals of Operations Research, 1993, 46: 157-178. | 
| [29] | Saha A, Tewari A. On the nonasymptotic convergence of cyclic coordinate descent methods[J]. Siam Journal on Optimization, 2013, 23: 576-601. DOI:10.1137/110840054 | 
| [30] | Donoho D L, Johnstone I M. Ideal spatial adaptation by wavelet shrinkage[J]. Biometrika, 1994, 81: 425-455. DOI:10.1093/biomet/81.3.425 | 
| [31] | Fahrmeir L, Kaufmann H. Consistency and asymptotic normality of the maximum likelihood estimator in generalized linear models[J]. The Annals of Statistics, 1985, 13: 342-368. DOI:10.1214/aos/1176346597 | 
