Estimation of Distribution Algorithms as Logistic Regression Regularizers of Microarray Classifiers

C. Bielza; V. Robles; P. Larrañaga

doi:10.3414/ME9223


Methods Inf Med 2009; 48(03): 236-241
DOI: 10.3414/ME9223

Original Articles

Schattauer GmbH

Authors

C. Bielza¹, V. Robles², P. Larrañaga¹

¹Departamento de Inteligencia Artificial, Universidad Politécnica de Madrid, Spain
²Departamento de Arquitectura y Tecnología de Sistemas Informáticos, Universidad Politécnica de Madrid, Spain

Further Information

Publication History

31 March 2009

Publication Date: 17 January 2018 (online)

Summary

Objectives: The “large k (genes), small N (samples)” phenomenon complicates microarray classification with logistic regression. Indeterminacy of the maximum likelihood solutions, multicollinearity of the predictor variables and data over-fitting cause unstable parameter estimates. Moreover, the large number of predictor variables (genes) raises computational problems. Regularized logistic regression stands out as a solution. However, it brings difficulties of its own: an objective function that is hard to optimize mathematically and regularization parameters that require careful tuning.
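
For orientation, the kind of penalized objective referred to here is commonly written as below. This is a generic textbook form, not a formula from the article: the symbols $\beta$, $\eta_i$, $\lambda$ and the ridge penalty are assumptions chosen for illustration, with $k$ genes and $N$ samples as in the summary.

```latex
% Logistic log-likelihood for N samples and k genes (notation assumed)
\ell(\beta) = \sum_{i=1}^{N} \Bigl[ y_i \eta_i - \log\bigl(1 + e^{\eta_i}\bigr) \Bigr],
\qquad \eta_i = \beta_0 + \sum_{j=1}^{k} \beta_j x_{ij}

% Ridge-penalized version; \lambda \ge 0 must be tuned
\ell_{\lambda}(\beta) = \ell(\beta) - \lambda \sum_{j=1}^{k} \beta_j^{2}
```

When $k > N$ the unpenalized $\ell(\beta)$ typically has no unique finite maximizer, which is the indeterminacy mentioned above. Replacing $\beta_j^2$ with $|\beta_j|$ gives an L1 penalty, which is not differentiable at zero and is one reason such objectives are harder to optimize.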

Methods: We tackle these difficulties by introducing a new way of regularizing logistic regression. Estimation of distribution algorithms (EDAs), a kind of evolutionary algorithm, emerge as natural regularizers. Obtaining the regularized estimates of the logistic classifier amounts to maximizing the likelihood function with our EDA, without adding any penalty term. Likelihood penalties add a number of difficulties to the resulting optimization problems, and these vanish in our case. New estimates are simulated during the evolutionary process of the EDA in a way that guarantees their shrinkage while preserving the probabilistic dependence relationships learnt among them. The EDA process is embedded in an adapted recursive feature elimination procedure, thereby providing the genes that are the best markers for the classification.
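
The idea of maximizing an unpenalized likelihood with an EDA can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes a univariate Gaussian EDA (UMDAc-style, so no dependencies between coefficients are learnt), truncation selection, and a hypothetical `shrink` factor that scales the sampling mean toward zero, since the summary does not specify how shrinkage is guaranteed. All function and parameter names are illustrative, and the recursive feature elimination wrapper is omitted.

```python
import numpy as np

def log_likelihood(beta, X, y):
    """Unpenalized logistic log-likelihood; beta[0] is the intercept."""
    z = beta[0] + X @ beta[1:]
    # log(1 + exp(z)) evaluated stably via logaddexp
    return float(np.sum(y * z - np.logaddexp(0.0, z)))

def eda_logistic(X, y, pop_size=100, n_select=30, n_gen=50, shrink=0.9, seed=0):
    """Univariate Gaussian EDA that maximizes the likelihood.

    `shrink` is a hypothetical factor in (0, 1]: each generation samples
    around shrink * mean, which pulls candidates toward zero.
    """
    rng = np.random.default_rng(seed)
    d = X.shape[1] + 1                      # coefficients plus intercept
    mean, std = np.zeros(d), np.ones(d)
    best, best_fit = mean.copy(), log_likelihood(mean, X, y)
    for _ in range(n_gen):
        # Simulate a new population from the current Gaussian model
        pop = rng.normal(shrink * mean, std, size=(pop_size, d))
        fit = np.array([log_likelihood(b, X, y) for b in pop])
        # Truncation selection: keep the n_select fittest candidates
        elite = pop[np.argsort(fit)[-n_select:]]
        # Re-estimate the model from the selected individuals
        mean = elite.mean(axis=0)
        std = elite.std(axis=0) + 1e-6      # avoid a degenerate model
        if fit.max() > best_fit:
            best_fit, best = float(fit.max()), pop[fit.argmax()].copy()
    return best, best_fit

# Toy usage on synthetic data (not one of the article's data sets)
demo_rng = np.random.default_rng(1)
X_demo = demo_rng.normal(size=(40, 5))
y_demo = (X_demo[:, 0] - X_demo[:, 1] + 0.5 * demo_rng.normal(size=40) > 0).astype(float)
beta_hat, ll_hat = eda_logistic(X_demo, y_demo)
print(beta_hat.round(2), round(ll_hat, 2))
```

The objective here carries no penalty term; any shrinkage comes only from how new candidates are sampled. A model that also captures dependencies between coefficients, as the summary describes, would replace the independent Gaussians with a learnt joint Gaussian model.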

Results: The consistency with the literature and excellent classification performance achieved with our algorithm are illustrated on four microarray data sets: Breast, Colon, Leukemia and Prostate. Details on the last two data sets are available as supplementary material.

Conclusions: We have introduced a novel EDA-based logistic regression regularizer. It implicitly shrinks the coefficients during the EDA evolution process while optimizing the usual likelihood function. The approach is combined with a gene subset selection procedure and automatically tunes the required parameters. On microarray data sets, it yields sparse models that contain genes confirmed in the literature and that classify better than competing regularized methods.

Keywords

Logistic regression - regularization - estimation of distribution algorithms - DNA microarrays

