Methods Inf Med 2010; 49(04): 371-378
DOI: 10.3414/ME09-01-0009
Original Articles
Schattauer GmbH

Chi-square-based Scoring Function for Categorization of MEDLINE Citations

A. Kastrin
1   Institute of Medical Genetics, University Medical Centre Ljubljana, Ljubljana, Slovenia
B. Peterlin
1   Institute of Medical Genetics, University Medical Centre Ljubljana, Ljubljana, Slovenia
D. Hristovski
2   Institute for Biostatistics and Medical Informatics, Faculty of Medicine, University of Ljubljana, Ljubljana, Slovenia

Publication History

received: 04 February 2009

accepted: 22 January 2009

Publication Date:
17 January 2018 (online)

Summary

Objectives: Text categorization has been used in biomedical informatics to identify documents containing relevant topics of interest. We developed a simple method that uses a chi-square-based scoring function to determine the likelihood that a MEDLINE® citation contains a genetically relevant topic.

Methods: Our procedure requires the construction of a genetic and a nongenetic domain document corpus. We used the MeSH® descriptors assigned to MEDLINE citations for this categorization task. We compared the frequencies of MeSH descriptors between the two corpora using the chi-square test. A MeSH descriptor was considered a positive indicator if its relative observed frequency in the genetic domain corpus was greater than its relative observed frequency in the nongenetic domain corpus. The output of the proposed method is a list of scores for all citations, with the highest scores given to citations containing MeSH descriptors typical of the genetic domain.
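
The scoring idea described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, each citation is assumed to be a list of MeSH descriptor strings, and the abstract does not say how descriptor statistics are combined into a citation score, so summing signed chi-square values is an assumption.

```python
from collections import Counter


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # degenerate table: no evidence either way
    return n * (a * d - b * c) ** 2 / denom


def descriptor_weights(genetic_docs, nongenetic_docs):
    """Signed chi-square weight per descriptor (both corpora must be non-empty)."""
    n_gen, n_non = len(genetic_docs), len(nongenetic_docs)
    # Count each descriptor at most once per citation.
    gen_counts = Counter(d for doc in genetic_docs for d in set(doc))
    non_counts = Counter(d for doc in nongenetic_docs for d in set(doc))
    weights = {}
    for desc in set(gen_counts) | set(non_counts):
        a = gen_counts[desc]   # genetic citations with the descriptor
        b = n_gen - a          # genetic citations without it
        c = non_counts[desc]   # nongenetic citations with the descriptor
        d = n_non - c          # nongenetic citations without it
        chi2 = chi_square_2x2(a, b, c, d)
        # Positive indicator: relatively more frequent in the genetic corpus.
        sign = 1.0 if a / n_gen > c / n_non else -1.0
        weights[desc] = sign * chi2
    return weights


def score_citation(descriptors, weights):
    """Assumed aggregation: sum of signed weights; unseen descriptors add 0."""
    return sum(weights.get(d, 0.0) for d in set(descriptors))
```

Ranking citations by `score_citation` then places those whose descriptors are typical of the genetic corpus at the top; a cutoff on the score would turn the ranking into a binary decision.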

Results: The method was validated on a set of 734 manually annotated MEDLINE citations, where it achieved a predictive accuracy of 0.87 with 0.69 recall and 0.64 precision. We evaluated the method by comparing it to three machine-learning algorithms (support vector machines, decision trees, naïve Bayes). The differences were not statistically significant, indicating that our chi-square scoring performs as well as the compared machine-learning algorithms.
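
The three reported measures all derive from the counts of a binary confusion matrix. The sketch below shows the standard definitions; the function name is hypothetical, and the counts in the usage line are made up for illustration and are not the paper's validation data.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision and recall from binary confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # share of predicted positives that are correct
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # share of actual positives that are found
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


# Hypothetical counts, not taken from the paper.
print(classification_metrics(tp=8, fp=2, fn=2, tn=88))
```

When the positive class is a small share of the test set, accuracy is dominated by true negatives and can sit well above precision and recall, which is why all three are worth reporting together.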

Conclusions: We suggest that chi-square scoring is an effective way to help categorize MEDLINE citations. The algorithm is implemented in the BITOLA literature-based discovery support system as a preprocessor for the gene symbol disambiguation process.

 