Optimal discretization and selection of features by association rates of joint distributions
RAIRO - Operations Research - Recherche Opérationnelle, Volume 50 (2016) no. 2, pp. 437-449.

In this paper we propose a new method to measure the contribution of discretized features to supervised learning and discuss its applications to biological data analysis. We restrict the description and the experiments to the most representative case: discretization into two intervals, with samples belonging to two classes. To test the validity of the method, we measure the abundance of the different explanatory models that can be derived from a given set of binary features. We compare the performance of our algorithm with that of popular feature selection methods on three publicly available gene expression data sets. The comparison favours the proposed method.
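
The two-interval, two-class setting described above can be illustrated with a minimal sketch. The abstract does not define the association rate, so the score below is a stand-in (mutual information computed from the 2x2 empirical joint distribution of the binarized feature and the class) and is not the measure proposed in the paper. The function names and the toy data are hypothetical.

```python
from collections import Counter
from math import log2


def joint_distribution(bits, labels):
    """Empirical joint distribution of a binary feature and a binary class."""
    n = len(bits)
    counts = Counter(zip(bits, labels))
    return {(b, c): counts[(b, c)] / n for b in (0, 1) for c in (0, 1)}


def mutual_information(bits, labels):
    """Stand-in association score (NOT the paper's measure), in bits."""
    joint = joint_distribution(bits, labels)
    # Marginals of the binarized feature and of the class.
    p_b = {b: joint[(b, 0)] + joint[(b, 1)] for b in (0, 1)}
    p_c = {c: joint[(0, c)] + joint[(1, c)] for c in (0, 1)}
    score = 0.0
    for (b, c), p in joint.items():
        if p > 0:  # empty cells contribute nothing
            score += p * log2(p / (p_b[b] * p_c[c]))
    return score


def best_cut(values, labels):
    """Scan midpoints between consecutive distinct values; return (cut, score)."""
    distinct = sorted(set(values))
    best = (None, -1.0)
    for lo, hi in zip(distinct, distinct[1:]):
        cut = (lo + hi) / 2
        # Discretize into two intervals: at or below the cut, above the cut.
        bits = [1 if v > cut else 0 for v in values]
        score = mutual_information(bits, labels)
        if score > best[1]:
            best = (cut, score)
    return best


# Toy data, made up for illustration: one expression-like feature, two classes.
values = [0.2, 0.4, 0.5, 1.1, 1.3, 1.6]
labels = [0, 0, 0, 1, 1, 1]
cut, score = best_cut(values, labels)
print(cut, score)
```

Ranking features by the score of their best cut would give a simple filter-style selection; whether that matches the selection step of the paper cannot be determined from the abstract alone.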

DOI: 10.1051/ro/2015045
Classification: 62H30
Keywords: Features selection, discretization, data mining
Santoni, Daniele 1; Weitschek, Emanuel 1, 2; Felici, Giovanni 1

1 Institute for System Analysis and Computer Science “Antonio Ruberti”, National Research Council of Italy, Via dei Taurini 19, 00185 Rome, Italy
2 Department of Engineering, Uninettuno International University, Corso Vittorio Emanuele II, 39, 00186 Rome, Italy.
@article{RO_2016__50_2_437_0,
     author = {Santoni, Daniele and Weitschek, Emanuel and Felici, Giovanni},
     title = {Optimal discretization and selection of features by association rates of joint distributions},
     journal = {RAIRO - Operations Research - Recherche Op\'erationnelle},
     pages = {437--449},
     publisher = {EDP-Sciences},
     volume = {50},
     number = {2},
     year = {2016},
     doi = {10.1051/ro/2015045},
     zbl = {1341.62188},
     mrnumber = {3479881},
     language = {en},
     url = {http://archive.numdam.org/articles/10.1051/ro/2015045/}
}
Santoni, Daniele; Weitschek, Emanuel; Felici, Giovanni. Optimal discretization and selection of features by association rates of joint distributions. RAIRO - Operations Research - Recherche Opérationnelle, Volume 50 (2016) no. 2, pp. 437-449. doi : 10.1051/ro/2015045. http://archive.numdam.org/articles/10.1051/ro/2015045/

