October 14, 12:00 PM - 1:00 PM
Identifying Genetic Variants in Human Genome Data With In-Memory Technology
12:00 PM - 12:20 PM
LEVEL: Advanced

By integrating a SNP calling algorithm into a column-store in-memory database, our approach benefits from built-in compression and parallelization techniques and from accessing data directly in main memory instead of on slower disk storage. We apply a statistical model that is sensitive to input data quality and compare our approach with GATK's UnifiedGenotyper. Results show that our approach is, on average, orders of magnitude faster while requiring less administrative effort.
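
To make the idea of a quality-sensitive statistical model concrete, here is a minimal Python sketch of a textbook diploid genotype likelihood that weights each observed base by its Phred quality score. It is an illustrative assumption, not the model used in this talk and not GATK's implementation; the function names and the pileup layout (a list of base and quality pairs) are hypothetical.

```python
import math

def phred_to_error(qual):
    """Convert a Phred quality score Q into an error probability 10^(-Q/10)."""
    return 10 ** (-qual / 10)

def base_prob(observed, allele, err):
    """P(observed base | true allele); errors are spread over the 3 other bases."""
    return 1 - err if observed == allele else err / 3

def genotype_log_likelihoods(pileup, ref, alt):
    """Log-likelihood of each diploid genotype given (base, quality) observations."""
    genotypes = {
        "ref/ref": (ref, ref),
        "ref/alt": (ref, alt),
        "alt/alt": (alt, alt),
    }
    scores = {}
    for name, (a1, a2) in genotypes.items():
        total = 0.0
        for base, qual in pileup:
            err = phred_to_error(qual)
            # Each read is assumed to come from either chromosome copy with equal chance.
            p = 0.5 * base_prob(base, a1, err) + 0.5 * base_prob(base, a2, err)
            total += math.log(p)
        scores[name] = total
    return scores

def call_genotype(pileup, ref, alt):
    """Return the genotype with the highest log-likelihood."""
    scores = genotype_log_likelihoods(pileup, ref, alt)
    return max(scores, key=scores.get)
```

Under this toy model, nine "A" reads at Q30 plus a single "G" read are called ref/ref when the "G" has quality 2, but ref/alt when it has quality 40, which shows how input data quality can change a call.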

Data Science Fights Ebola: Pre-Symptomatic Warning of Hemorrhagic Fever Infection
12:20 PM - 12:40 PM
LEVEL: Intermediate

In practice, Ebola and other hemorrhagic fevers are clinically invisible until the onset of fever, which occurs several days after infection. This study proposes methods to detect infection before fever appears. Using physiological signals from non-human primates exposed to either Marburg or Ebola virus, we build random forest classification models to predict infection status at any given time. In many cases, we can predict infection with high precision (>0.95) 24-48 hours before fever onset.

Itemset Mining for Risk Characterization in Environmental Health Epidemiology
12:40 PM - 1:00 PM
LEVEL: Intermediate

Asthma affects an increasing fraction of the population, especially children; air pollution is a known trigger and a possible cause. Researchers have tried to quantify the association between pollution levels and asthma incidence, but the problem is intrinsically difficult to solve with traditional statistical methods. In this presentation, we discuss the limitations of traditional methods and propose an alternative approach based on itemset mining, modified for risk characterization.
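
As a rough illustration of how itemset mining could be adapted to risk characterization, the Python sketch below enumerates combinations of exposure conditions and scores each one by relative risk instead of by frequency alone. This is an assumed, simplified formulation, not the presenters' method; the function name, the record layout (a set of exposure labels plus a case flag), and the thresholds are hypothetical.

```python
from itertools import combinations

def risk_itemsets(records, max_size=2, min_support=2):
    """Score exposure itemsets by relative risk.

    records: list of (set_of_exposure_labels, is_case) pairs.
    Returns (itemset, support, relative_risk) tuples, highest risk first.
    """
    items = sorted({item for exposures, _ in records for item in exposures})
    results = []
    for size in range(1, max_size + 1):
        for itemset in combinations(items, size):
            wanted = set(itemset)
            # Outcomes for records that contain every item in the itemset, and for the rest.
            exposed = [case for exposures, case in records if wanted <= exposures]
            unexposed = [case for exposures, case in records if not wanted <= exposures]
            if len(exposed) < min_support or not unexposed:
                continue  # too rare, or no comparison group
            risk_exposed = sum(exposed) / len(exposed)
            risk_unexposed = sum(unexposed) / len(unexposed)
            if risk_unexposed == 0:
                continue  # relative risk is undefined without unexposed cases
            results.append((itemset, len(exposed), risk_exposed / risk_unexposed))
    return sorted(results, key=lambda r: r[2], reverse=True)
```

Calling `risk_itemsets(records)` on a list of exposure records returns the combinations whose presence is associated with the largest increase in incidence. A real study would also need confidence intervals and control for confounders, which this sketch omits.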