r2VIM: A new variable selection method for random forests in genome-wide association studies

MOLLOY, ANNE

dc.contributor.author	MOLLOY, ANNE	en
dc.date.accessioned	2016-09-27T09:39:30Z
dc.date.available	2016-09-27T09:39:30Z
dc.date.issued	2016	en
dc.date.submitted	2016	en
dc.identifier.citation	Szymczak S, Holzinger E, Dasgupta A, Malley J.D, Molloy A.M, Mills J.L, Brody L.C, Stambolian D, Bailey-Wilson J.E, r2VIM: A new variable selection method for random forests in genome-wide association studies, BioData Mining, 9, 1, 2016, 7	en
dc.identifier.other	Y	en
dc.description.abstract	Background: Machine learning methods and in particular random forests (RFs) are a promising alternative to standard single SNP analyses in genome-wide association studies (GWAS). RFs provide variable importance measures (VIMs) to rank SNPs according to their predictive power. However, in contrast to the established genome-wide significance threshold, no clear criteria exist to determine how many SNPs should be selected for downstream analyses. Results: We propose a new variable selection approach, recurrent relative variable importance measure (r2VIM). Importance values are calculated relative to an observed minimal importance score for several runs of RF and only SNPs with large relative VIMs in all of the runs are selected as important. Evaluations on simulated GWAS data show that the new method controls the number of false-positives under the null hypothesis. Under a simple alternative hypothesis with several independent main effects it is only slightly less powerful than logistic regression. In an experimental GWAS data set, the same strong signal is identified while the approach selects none of the SNPs in an underpowered GWAS. Conclusions: The novel variable selection method r2VIM is a promising extension to standard RF for objectively selecting relevant SNPs in GWAS while controlling the number of false-positive results.	en
dc.description.sponsorship	This work was supported by the Intramural Research Programs of the National Human Genome Research Institute (NIH), National Institute of Arthritis and Musculoskeletal and Skin Diseases (NIH) and Center for Information Technology (NIH) and utilized the high-performance computational capabilities of the Biowulf Linux cluster at the National Institutes of Health (http://hpc.nih.gov). AMM was funded by National Institute of Child Health and Human Development grant N01HD33348 and DS was funded by National Eye Institute grant RO1EY020483. The authors acknowledge the contributions made by the study participants in the Trinity Student Study (TSS). The TSS GWAS work was supported in part by the Intramural Research Programs of the National Human Genome Research Institute, the Eunice Shriver National Institute of Child Health and Development of the National Institutes of Health (NIH) and the Health Research Board, Dublin, Ireland.	en
dc.format.extent	7	en
dc.relation.ispartofseries	BioData Mining	en
dc.relation.ispartofseries	9	en
dc.relation.ispartofseries	1	en
dc.rights	Y	en
dc.subject	Machine learning; Random forest; Variable selection; Variable importance; Genome-wide association study; Genetic SNP	en
dc.subject.lcsh	Machine learning; Random forest; Variable selection; Variable importance; Genome-wide association study; Genetic SNP	en
dc.title	r2VIM: A new variable selection method for random forests in genome-wide association studies	en
dc.type	Journal Article	en
dc.type.supercollection	scholarly_publications	en
dc.type.supercollection	refereed_publications	en
dc.identifier.peoplefinderurl	http://people.tcd.ie/amolloy	en
dc.identifier.rssinternalid	128047	en
dc.identifier.doi	http://dx.doi.org/10.1186/s13040-016-0087-3	en
dc.rights.ecaccessrights	openAccess
dc.identifier.rssuri	https://www.scopus.com/inward/record.uri?eid=2-s2.0-84958122861&partnerID=40&md5=092734e26f0391242213e810e7edc4c9	en
dc.identifier.orcid_id	0000-0002-1688-9049	en
dc.identifier.uri	http://hdl.handle.net/2262/77422
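
The selection rule summarized in the abstract (importance values taken relative to an observed minimal importance score, with a SNP kept only if its relative value is large in every RF run) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that "relative" means dividing each score by the absolute value of the run's minimum, and that "large" means meeting a user-chosen threshold; the function names and the `factor` parameter are hypothetical and do not come from the record.

```python
from typing import Dict, List, Set


def relative_importance(vim: Dict[str, float]) -> Dict[str, float]:
    """Scale one run's importance scores by |minimum score| of that run.

    Assumption: the minimal observed score serves as the noise scale.
    """
    scale = abs(min(vim.values()))
    if scale == 0:
        raise ValueError("minimal importance score is zero; cannot rescale")
    return {snp: score / scale for snp, score in vim.items()}


def select_recurrent(runs: List[Dict[str, float]], factor: float = 1.0) -> Set[str]:
    """Return SNPs whose relative importance is >= factor in every run.

    `runs` holds one mapping of SNP id -> importance score per RF run.
    `factor` is a hypothetical threshold on the relative importance.
    """
    selected = None
    for vim in runs:
        rel = relative_importance(vim)
        passing = {snp for snp, value in rel.items() if value >= factor}
        # Keep only SNPs that have passed in all runs seen so far.
        selected = passing if selected is None else selected & passing
    return selected if selected is not None else set()
```

The importance scores themselves would come from separate RF runs (for example, runs with different random seeds); how they are computed is outside this sketch.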

Files in this item

Name:: art%3A10.1186%2Fs13040-016-008 ...
Size:: 1.298Mb
Format:: PDF

Name:: license.txt
Size:: 3.419Kb
Format:: Text file

This item appears in the following Collection(s)

Clinical Medicine (Scholarly Publications)