Methods Inf Med 2010; 49(03): 254-268
DOI: 10.3414/ME09-01-0010
Original Articles
Schattauer GmbH

Correlation-based Gene Selection and Classification Using Taguchi-BPSO

L.-Y. Chuang (1), C.-S. Yang (2, 3), K.-C. Wu (4), C.-H. Yang (5, 6)

1 Institute of Biotechnology and Chemical Engineering, I-Shou University, Kaohsiung, Taiwan
2 Institute of Biomedical Engineering, National Cheng Kung University, Tainan, Taiwan
3 Department of Plastic Surgery, Chiayi Christian Hospital, Chiayi, Taiwan
4 Department of Computer Science and Information Engineering, National Kaohsiung University of Applied Sciences, Kaohsiung, Taiwan
5 Department of Network Systems, Toko University, Chiayi, Taiwan
6 Department of Electronic Engineering, National Kaohsiung University of Applied Sciences, Kaohsiung, Taiwan

Publication History

received: 06 February 2009

accepted: 02 February 2009

Publication Date:
17 January 2018 (online)

Summary

Background: Microarray gene expression data have provided valuable results for a variety of problems and have contributed to advances in clinical medicine. However, microarray data characteristically have a high dimension and a small sample size, which makes it difficult for general classification methods to classify the samples correctly. Moreover, not every gene is relevant for distinguishing the sample classes. In order to analyze gene expression profiles correctly, feature (gene) selection is therefore crucial for the classification process, and an effective gene selection method is needed to eliminate irrelevant genes and decrease the classification error rate.

Objective: The purpose of gene expression analysis is to discriminate between classes of samples, and to predict the relative importance of each gene for sample classification.

Method: In this paper, correlation-based feature selection (CFS) and Taguchi-binary particle swarm optimization (TBPSO) were combined into a hybrid gene selection method, and the K-nearest neighbor (K-NN) method with leave-one-out cross-validation (LOOCV) served as the classifier for ten gene expression profiles.
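To illustrate the wrapper evaluation described above, the following minimal Python sketch (not the authors' implementation; all function and variable names are assumptions) shows how a binary particle encoding a gene subset can be scored by the LOOCV error rate of a K-NN classifier, together with one standard binary PSO update step. The Taguchi orthogonal-array refinement that distinguishes TBPSO from plain BPSO is omitted here.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

def loocv_knn_error(X, y, mask, k=1):
    # LOOCV error rate of a K-NN classifier restricted to the genes where mask == 1.
    selected = np.flatnonzero(mask)
    if selected.size == 0:
        return 1.0  # an empty gene subset receives the worst possible score
    knn = KNeighborsClassifier(n_neighbors=k)
    accuracy = cross_val_score(knn, X[:, selected], y, cv=LeaveOneOut()).mean()
    return 1.0 - accuracy

def bpso_update(position, velocity, pbest, gbest, w=0.9, c1=2.0, c2=2.0, rng=None):
    # One binary PSO step (Kennedy and Eberhart): the updated velocity is squashed
    # by a sigmoid and interpreted as the probability of setting each bit (gene) to 1.
    rng = np.random.default_rng() if rng is None else rng
    r1, r2 = rng.random(position.shape), rng.random(position.shape)
    velocity = w * velocity + c1 * r1 * (pbest - position) + c2 * r2 * (gbest - position)
    probability = 1.0 / (1.0 + np.exp(-velocity))
    new_position = (rng.random(position.shape) < probability).astype(int)
    return new_position, velocity

In the hybrid method, such a fitness function would be evaluated only on the genes pre-selected by the CFS filter, which keeps the swarm search space small.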

Results: Experimental results show that this hybrid method effectively simplifies feature selection by reducing the number of features needed. The proposed method achieved the lowest classification error rate on all ten gene expression data sets tested, and for six of these data sets a classification error rate of zero was reached.

Conclusion: The introduced method outperformed five other methods from the literature in terms of classification error rate. It could thus constitute a valuable tool for gene expression analysis in future studies.

 
  • References

  • 1 Wang X, Yang J, Teng X, Xia W, Jensen R. Feature selection based on rough sets and particle swarm optimization. Pattern Recognition Letters 2007; 28 (04) 459-471.
  • 2 Guyon I, Elisseeff A. An introduction to variable and feature selection. The Journal of Machine Learning Research 2003; 3: 1157-1182.
  • 3 Kohavi R, John GH. Wrappers for feature subset selection. Artificial Intelligence 1997; 97 1–2 273-324.
  • 4 Liu H, Motoda H. Feature selection for knowledge discovery and data mining. Boston: Kluwer Academic Publishers; 1998
  • 5 Liu X, Krishnan A, Mondry A. An entropy-based gene selection method for cancer classification using microarray data. BMC Bioinformatics 2005; 6 (01) 76.
  • 6 Kodaz H, Ozsen S, Arslan A, Gunes S. Medical application of information gain based artificial immune recognition system (AIRS): Diagnosis of thyroid disease. Expert Systems with Applications 2009; 36 (02) 3086-3092.
  • 7 Verron S, Tiplica T, Kobi A. Fault detection and identification with a new feature selection based on mutual information. Journal of Process Control 2008; 18 (05) 479-490.
  • 8 Hall MA. Correlation-based Feature Subset Selection for Machine Learning. PhD thesis, Department of Computer Science, University of Waikato; 1999
  • 9 Oh I-S, Lee J-S, Moon B-R. Hybrid genetic algorithms for feature selection. IEEE Transactions on Pattern Analysis and Machine Intelligence 2004; 26 (11) 1424-1437.
  • 10 Tahir MA, Bouridane A, Kurugollu F. Simultaneous feature selection and feature weighting using Hybrid Tabu Search/K-nearest neighbor classifier. Pattern Recognition Letters 2007; 28 (04) 438-446.
  • 11 Shi XH, Liang YC, Lee HP, Lu C, Wang LM. An improved GA and a novel PSO-GA-based hybrid algorithm. Information Processing Letters 2005; 93 (05) 255-261.
  • 12 Secrest BR, Lamont GB. Visualizing particle swarm optimization – Gaussian particle swarm optimization. In: Proceedings of the 2003 IEEE Swarm Intelligence Symposium, 2003 SIS ’03: 2003. pp 198-204.
  • 13 Chang TC, Tsai FC, Ke JH. Data mining and Taguchi method combination applied to the selection of discharge factors and the best interactive factor combination under multiple quality properties. The International Journal of Advanced Manufacturing Technology 2006; 31: 164-174.
  • 14 Sohn SY, Shin HW. Experimental study for the comparison of classifier combination methods. Pattern Recognition 2007; 40 (01) 33-40.
  • 15 Kwak N, Choi C-H. Input feature selection for classification problems. IEEE Transactions on Neural Networks 2002; 13 (01) 143-159.
  • 16 Chen W-C, Tai P-H, Wang M-W, Deng W-J, Chen C-T. A neural network-based approach for dynamic quality prediction in a plastic injection molding process. Expert Systems with Applications 2008; 35 (03) 843-849.
  • 17 Cover T, Hart P. Nearest neighbor pattern classification. IEEE Transactions on Information Theory 1967; 13 (01) 21-27.
  • 18 Fix E, Hodges J. Discriminatory Analysis. Nonpara-metric Discrimination: Consistency Properties. In: Technical Report. USAF School of Aviation Medicine, Randolph Field, TX.; 1951
  • 19 Tan S. An effective refinement strategy for KNN text classifier. Expert Systems with Applications 2006; 30 (02) 290-298.
  • 20 Cawley GC, Talbot NLC. Efficient leave-one-out cross-validation of kernel fisher discriminant classifiers. Pattern Recognition 2003; 36 (11) 2585-2592.
  • 21 Stone M. Cross-Validatory Choice and Assessment of Statistical Predictions. Journal of the Royal Statistical Society Series B, (Methodological) 1974; 36 (02) 111-147.
  • 22 Diaz-Uriarte R, Alvarez de Andres S. Gene selection and classification of microarray data using random forest. BMC Bioinformatics 2006; 7 (01) 3.
  • 23 Kennedy J, Eberhart RC. A discrete binary version of the particle swarm algorithm. In: 1997 IEEE International Conference on Systems, Man, and Cybernetics, 1997 “Computational Cybernetics and Simulation”: 1997; 1997: 4104-4108 vol. 4105.
  • 24 Wu Y, Wu A. Taguchi Methods for Robust Design. ASME Press New York; 2000
  • 25 Tsai J-T, Liu T-K, Chou J-H. Hybrid Taguchi-genetic algorithm for global numerical optimization. IEEE Transactions on Evolutionary Computation 2004; 8 (04) 365-377.
  • 26 Taguchi G, Chowdhury S, Taguchi S. Robust Engineering. New York, NY: McGraw-Hill; 2000
  • 27 Frank E, Hall M, Trigg L, Holmes G, Witten IH. Data mining in bioinformatics using Weka. Bioinformatics 2004; 20 (15) 2479-2481.
  • 28 Blake CL, Merz CJ. UCI repository of machine learning databases. In: Irvine, CA: University of California, Department of Information and Computer Science; 1998
  • 29 Conover WJ. Practical nonparametric statistics, 3rd ed. New York: Wiley & Sons Inc.; 1980
  • 30 Shi Y, Eberhart R. A modified particle swarm optimizer. In: The 1998 IEEE International Conference on Evolutionary Computation Proceedings, 1998 IEEE World Congress on Computational Intelligence, 1998. 1998. pp 69-73.
  • 31 Huang H-L, Lee C-C, Ho S-Y. Selecting a minimal number of relevant genes from microarray data to design accurate tissue classifiers. Biosystems 2007; 90 (01) 78-86.
  • 32 Huerta EB, Duval B, Hao J. A hybrid ga/svm approach for gene selection and classification of microarray data. Lecture Notes in Computer Science 2006; 3907: 34-44.
  • 33 Deb K, Raji Reddy A. Reliable classification of two-class cancer data using evolutionary algorithms. Biosystems 2003; 72 1–2 111-129.
  • 34 Okun O, Priisalu H. Dataset complexity in gene expression based cancer classification using ensembles of k-nearest neighbors. Artificial Intelligence in Medicine 2009; 45 (2–3) 151-162.
  • 35 Saeys Y, Inza I, Larranaga P. A review of feature selection techniques in bioinformatics. Bioinformatics 2007; 23 (19) 2507-2517.
  • 36 Wang Y, Tetko IV, Hall MA, Frank E, Facius A, Mayer KFX, Mewes HW. Gene selection from microarray data for cancer classification – a machine learning approach. Computational Biology and Chemistry 2005; 29 (01) 37-46.
  • 37 Inza I, Larranaga P, Blanco R, Cerrolaza AJ. Filter versus wrapper gene selection approaches in DNA microarray domains. Artificial Intelligence in Medicine 2004; 31 (02) 91-103.
  • 38 Xiong M, Fang X, Zhao J. Biomarker Identification by Feature Wrappers. Genome Research 2001; 11 (11) 1878-1887.
  • 39 Zhu Z, Ong Y-S, Dash M. Wrapper-Filter Feature Selection Algorithm Using a Memetic Framework. IEEE Transactions on Systems, Man, and Cybernetics, Part B. 2007; 37 (01) 70-76.
  • 40 Reunanen J, Guyon I, Elisseeff A. Overfitting in Making Comparisons Between Variable Selection Methods. Journal of Machine Learning Research 2003: 3.
  • 41 Loughrey J, Cunningham P. Overfitting in Wrapper-Based Feature Subset Selection: The Harder You Try the Worse it Gets. In: Research and Development in Intelligent Systems XXI. 2005 pp 33-43.
  • 42 Schaffer C. Overfitting avoidance as bias. Machine learning 1993; 10 (02) 153-178.
  • 43 Wolpert DH. On overfitting avoidance as bias. In: Santa Fe Institute: Technical Report SFI-TR-92-03-5001; 1993
  • 44 Yang CH, Huang CC, Wu KC, Chang HY. A Novel GA-Taguchi-Based Feature Selection Method. In: Intelligent Data Engineering and Automated Learning. Daejeon, South Korea; 2008. pp 112-119.
  • 45 Yang C-S, Chuang L-Y, Li J-C, Yang C-H. A novel BPSO approach for gene selection and classification of microarray data. In: IJCNN 2008 pp 2147-2152.
  • 46 Chuang L-Y, Chang H-W, Tu C-J, Yang C-H. Improved binary PSO for feature selection using gene expression data. Computational Biology and Chemistry 2008; 32 (01) 29-38.
  • 47 Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA. Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring. Science 1999; 286: 531-537.
  • 48 van ’t Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AAM, Mao M, Peterse HL, van der Kooy K, Marton MJ, Witteveen AT. et al. Gene expression profiling predicts clinical outcome of breast cancer. Nature 2002; 415 6871 530-536.
  • 49 Ross DT, Scherf U, Eisen MB, Perou CM, Rees C, Spellman P, Iyer V, Jeffrey SS, Van de Rijn M, Waltham M. et al. Systematic variation in gene expression patterns in human cancer cell lines. Nature Genetics 2000; 24 (03) 227-235.
  • 50 Ramaswamy S, Ross KN, Lander ES, Golub TR. A molecular signature of metastasis in primary solid tumors. Nature Genetics 2002; 33 (01) 49-54.
  • 51 Pomeroy SL, Tamayo P, Gaasenbeek M, Sturla LM, Angelo M, McLaughlin ME, Kim JYH, Goumnerova LC, Black PM, Lau C. et al. Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature 2002; 415 6870 436-442.
  • 52 Alon U, Barkai N, Notterman DA, Gish K, Ybarra S, Mack D, Levine AJ. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc Natl Acad Sci USA 1999; 96 (12) 6745-6750.
  • 53 Alizadeh AA, Eisen MB, Davis RE, Ma C, Lossos IS, Rosenwald A, Boldrick JC, Sabet H, Tran T, Yu X. et al. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 2000; 403 6769 503-511.
  • 54 Singh D, Febbo PG, Ross K, Jackson DG, Manola J, Ladd C, Tamayo P, Renshaw AA, D’Amico AV, Richie JP. et al. Gene expression correlates of clinical prostate cancer behavior. Cancer Cell 2002; 1 (02) 203-209.
  • 55 Khan J, Wei JS, Ringner M, Saal LH, Ladanyi M, Westermann F, Berthold F, Schwab M, Antonescu CR, Peterson C. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine 2001; 7: 673-679.