ROSEFW-RF: The winner algorithm for the ECBDL'14 big data competition: An extremely imbalanced big data bioinformatics problem
- Submitting institution
- University of Newcastle upon Tyne
- Unit of assessment
- 11 - Computer Science and Informatics
- Output identifier
- 216886-176193-1292
- Type
- D - Journal article
- DOI
- 10.1016/j.knosys.2015.05.027
- Title of journal
- Knowledge-Based Systems
- Article number
- -
- First page
- 69
- Volume
- 87
- Issue
- -
- ISSN
- 0950-7051
- Open access status
- Compliant
- Month of publication
- June
- Year of publication
- 2015
- URL
- http://dx.doi.org/10.1016/j.knosys.2015.05.027
- Supplementary information
- -
- Request cross-referral to
- -
- Output has been delayed by COVID-19
- No
- COVID-19 affected output statement
- -
- Forensic science
- No
- Criminology
- No
- Interdisciplinary
- No
- Number of additional authors
- 5
- Research group(s)
- B - Interdisciplinary Computing and Complex Biosystems (ICOS)
- Citation count
- 64
- Proposed double-weighted
- No
- Reserve for an output with double weighting
- No
- Additional information
- This paper presents a big data methodology for a very challenging classification problem, contact map prediction, which arises in protein structure prediction (the longest-standing unsolved problem in computational biology). The data set is big in several respects: a large number of records (32M), a large number of variables (631), and a severe class imbalance (fewer than 2% positive examples). The paper shows how the final strategy was built up piece by piece over several phases of experiments, giving big data practitioners useful guidance for applying these techniques to their own data.
- Author contribution statement
- -
- Non-English
- No
- English abstract
- -