We are inspired by the rapid progress of machine learning methods in natural language processing (NLP) and in representation learning on multi-modal data. These fields have advanced in part because standardized benchmark datasets are available for comparing model performance on well-defined computational tasks.
Many machine learning methods developed for computer vision and natural language processing have been extended to analogous problems in computational biology. For example, a GPT language model (the model family behind ChatGPT) has recently been applied to reconstructing continuous language from fMRI recordings of people watching Disney movies. In the biological sciences, computational challenges have spurred machine learning advances by identifying important open problems and offering formalized competitions to the broad research community.
With the BICCN challenge, we seek to facilitate a formalized competition that makes headway on unsolved problems in computational biology arising from large-scale sequencing efforts of the brain.
We believe there are key traits of our challenge that will drive innovation:
- Modeling tasks are formally defined, each with a clear mathematical and biological interpretation.
- Cross-species datasets are publicly available in easy-to-access formats, including h5ad.
- Rigorous quantitative metrics are defined for each task to judge model success.
- Benchmark models are provided to identify improvement over current standards.
- State-of-the-art models are ranked on a public and continuously updated leaderboard.
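As a rough illustration of how the last three traits fit together, the sketch below ranks submissions by a task metric and reports each one's gain over a benchmark model. The team names, the scores, and the assumption that a higher score is better are all hypothetical; the challenge's actual metrics and leaderboard implementation are not described here.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    team: str     # hypothetical team or model name
    score: float  # hypothetical task metric; higher is assumed to be better


def rank_entries(entries, baseline_name):
    """Sort entries by score (descending) and attach each gain over the baseline."""
    baseline = next(e for e in entries if e.team == baseline_name)
    ranked = sorted(entries, key=lambda e: e.score, reverse=True)
    return [
        (rank, e.team, e.score, e.score - baseline.score)
        for rank, e in enumerate(ranked, start=1)
    ]


# Illustrative values only, not real challenge results.
entries = [
    Entry("benchmark", 0.50),
    Entry("team_a", 0.75),
    Entry("team_b", 0.25),
]

leaderboard = rank_entries(entries, baseline_name="benchmark")
for rank, team, score, gain in leaderboard:
    print(f"{rank}. {team}: score={score:.2f}, gain over benchmark={gain:+.2f}")
```

Including the benchmark model as an ordinary entry keeps it visible on the board, so an improvement over the current standard can be read directly from the gain column.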
Our goal is to host an open-source, community-driven challenge with flexible benchmarking of formalized tasks in cross-species, single-cell multi-omics analysis. We are interested in providing unbiased evaluation of novel methods for predicting cell type-specific cis-regulatory elements, in particular methods that incorporate data-driven factors such as cross-species comparisons of epigenetic signals and biological priors from domain experts.
Expectations on model sharing
Teams are expected to provide a one-page document describing their approach in sufficient detail that a person knowledgeable in the field would be able to replicate the results. A sufficient description of the team's approach lists all datasets used, states which biological priors were incorporated, and gives an overview of the predictive model.
Leaderboards will be hosted on the Leaderboard webpage. All code and methods are driven by broad input from the scientific community.