Drugs, DNA, and Dinosaurs: Building High Quality Knowledge Bases with DeepDive

http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40208

Summary

  • KBs help with macroscopic questions
  • Probabilistic inference = algorithmic independence
  • Hardware-aware engine is our current approach

KBs help with macroscopic questions

The world’s scientific knowledge is, in principle, accessible. But pressing problems such as climate & biodiversity, health, and financial markets require macroscopic knowledge assembled from many individual findings, and we are still limited by human reading speed. Could we build a machine to read for us? Much of the relevant data is buried in tables and is not stated in a self-contained way.

PaleoDeepDive

The goal: extract paleobiological facts to build a higher-coverage fossil record. The aggressive approach: statistical inference over billions of variables.

PaleoDeepDive (machine-created)

  • 10x documents
  • 100x extractions

PaleoDB (human-created)

  • 329 volunteers
  • 13 years
  • 46K documents
  • 200+ papers, 17 in Nature/Science

Formation Precision

  • PaleoDB volunteers: 0.84
  • PaleoDeepDive: 0.94

PaleoDeepDive vs. PharmaDeepDive

The DeepDive framework is broadly usable: building PharmaDeepDive on top of it took far less effort than PaleoDeepDive.

  • PaleoDeepDive: 2 years, 1 CS student
  • PharmaDeepDive: 6 months, 1 BioE student

  • easier and cheaper the second time
  • features, not algorithms (see the feature sketch below)
  • the same framework carries over between domains
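
To make "features, not algorithms" concrete, below is a minimal, hypothetical sketch (in Python) of the kind of code an application writer supplies: a feature function over a candidate (taxon, formation) mention pair. The function name, the features, and the keyword list are illustrative assumptions, not DeepDive's actual API.

    # Hypothetical sketch: in a DeepDive-style system the application writer
    # supplies feature extractors over candidate extractions; learning and
    # inference are handled by the engine, not by application code.
    from typing import Iterator, List

    def taxon_formation_features(sentence: List[str],
                                 taxon_idx: int,
                                 formation_idx: int) -> Iterator[str]:
        """Emit string features for a candidate (taxon, formation) pair
        found in a tokenized sentence. All feature names are illustrative."""
        lo, hi = sorted((taxon_idx, formation_idx))
        between = sentence[lo + 1:hi]

        # Distance between the two mentions.
        yield f"NUM_TOKENS_BETWEEN={hi - lo}"

        # Words appearing between the mentions often signal the relation.
        for tok in between:
            yield f"WORD_BETWEEN={tok.lower()}"

        # Keywords commonly used when reporting fossil occurrences.
        if any(tok.lower() in {"from", "collected", "occurs"} for tok in between):
            yield "HAS_OCCURRENCE_KEYWORD"

    # Toy usage on a single sentence.
    sent = ("Tyrannosaurus specimens were collected from "
            "the Hell Creek Formation").split()
    print(list(taxon_formation_features(sent, taxon_idx=0, formation_idx=8)))

The engine turns these features into weighted factors and learns the weights, so switching domains mostly means switching feature functions.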

Probabilistic inference = algorithmic independence

The DeepDive Framework (Geo)

Data Acquisition -> State-of-the-art NLP -> Statistical Inference and Learning (DeepDive) -> Web Front-end
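
The statistical stage is where DeepDive differs from ad hoc pipelines: candidate extractions are grounded into a factor graph and marginal probabilities are computed by sampling, at the scale of billions of variables. Below is a minimal, self-contained Gibbs-sampling sketch over a toy factor graph of three Boolean variables; the variable names and factor weights are made up for illustration.

    # Minimal sketch of Gibbs sampling over a tiny factor graph of Boolean
    # variables. Real systems ground graphs with billions of variables; this
    # toy example only illustrates the style of computation involved.
    import math
    import random

    # Factors: (weight, variables). A factor contributes its weight to the
    # log-probability when all of its variables are True.
    factors = [
        (1.5, ["mention_is_taxon", "sentence_links_taxon_to_formation"]),
        (0.8, ["sentence_links_taxon_to_formation"]),
        (-2.0, ["mention_is_taxon", "mention_is_person"]),  # discourage both
    ]
    variables = {"mention_is_taxon": True,
                 "mention_is_person": False,
                 "sentence_links_taxon_to_formation": False}

    def conditional_prob_true(var: str) -> float:
        """P(var = True | all other variables), from the factors touching var."""
        def score(value: bool) -> float:
            assignment = dict(variables, **{var: value})
            return sum(w for w, vs in factors if all(assignment[v] for v in vs))
        return 1.0 / (1.0 + math.exp(score(False) - score(True)))

    random.seed(0)
    counts = {v: 0 for v in variables}
    sweeps = 10_000
    for _ in range(sweeps):
        for var in variables:          # resample each variable in turn
            variables[var] = random.random() < conditional_prob_true(var)
            counts[var] += variables[var]

    # Estimated marginal probabilities (the engine's output probabilities).
    for var, c in counts.items():
        print(f"P({var} = True) ~ {c / sweeps:.2f}")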

Meaningful probabilities, not just scores: the output probability should match the observed accuracy (e.g., extractions reported at 0.9 should be correct about 90% of the time).
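
One way to check that property is a calibration table: bin predictions by their output probability and compare each bin's average claimed probability with the fraction of those predictions that are actually correct. A minimal sketch, assuming you already have arrays of predicted probabilities and ground-truth labels:

    # Minimal calibration check: for well-calibrated output, predictions made
    # with probability ~p should be correct about a fraction p of the time.
    import numpy as np

    def calibration_table(probs: np.ndarray, labels: np.ndarray, n_bins: int = 10):
        """Per probability bin: (bin center, mean claimed prob, empirical accuracy, count)."""
        edges = np.linspace(0.0, 1.0, n_bins + 1)
        rows = []
        for lo, hi in zip(edges[:-1], edges[1:]):
            mask = (probs >= lo) & (probs < hi)
            if mask.any():
                rows.append(((lo + hi) / 2,
                             probs[mask].mean(),   # average claimed confidence
                             labels[mask].mean(),  # fraction actually correct
                             int(mask.sum())))
        return rows

    # Toy usage with simulated, roughly calibrated predictions.
    rng = np.random.default_rng(0)
    p = rng.uniform(size=5_000)
    y = (rng.uniform(size=5_000) < p).astype(float)   # correct with prob p
    for center, claimed, actual, n in calibration_table(p, y):
        print(f"bin {center:.2f}: claimed {claimed:.2f}, actual {actual:.2f}, n={n}")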

Hardware-aware engine is our current approach

To free users from worrying about algorithms, the inference engine itself has to be fast. Let the hardware do the work.

How do we run SGD in Parallel?

  • Naively there are many conflicts: concurrent workers all update … the model
  • Theorem: no locking is needed; lock-free updates still converge
  • High-level idea: the answer only has to be statistically correct, not bit-for-bit deterministic (see the sketch below)
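
The "no locking" result is the Hogwild!-style idea: let many threads update one shared weight vector with no synchronization at all, and accept occasional conflicting writes because the final answer only needs to be statistically correct. A minimal sketch on a least-squares problem (the problem, step size, and thread count are illustrative; in CPython the GIL also serializes much of this, so the point is the lock-free update pattern, not a speed claim):

    # Minimal sketch of lock-free parallel SGD: threads update a shared weight
    # vector with no locks; occasional conflicting writes are tolerated because
    # the answer only needs to be statistically correct.
    import threading
    import numpy as np

    rng = np.random.default_rng(0)
    d, n, lr = 20, 4_000, 0.01
    X = rng.normal(size=(n, d))
    w_true = rng.normal(size=d)
    y = X @ w_true + 0.01 * rng.normal(size=n)

    w = np.zeros(d)   # shared model, read and written by all threads, no lock

    def worker(rows: np.ndarray) -> None:
        """Run SGD over the given rows, updating the shared w without locking."""
        for i in rows:
            grad = (X[i] @ w - y[i]) * X[i]   # gradient of 0.5 * (x.w - y)^2
            w[:] -= lr * grad                 # unsynchronized in-place update

    threads = [threading.Thread(target=worker, args=(rng.permutation(n),))
               for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    print("distance to true weights:", np.linalg.norm(w - w_true))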

Are CPUs really so slow for CNNs?

“Coming soon”

Key Issue: Statistical versus Hardware Efficiency

Relaxing consistency requires navigating the … tradeoff: you get more updates per second (hardware efficiency), but each update can make less progress (statistical efficiency). See the simulation below.
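
Below is a minimal, self-contained simulation of the statistical side of that tradeoff (not DeepDive's engine): SGD where each gradient is computed on weights that are a few updates stale, a crude stand-in for relaxed consistency. The problem, step size, and staleness values are illustrative assumptions.

    # Crude model of relaxed consistency: each gradient is computed on weights
    # that are `staleness` updates old. More staleness means less progress per
    # update, even though real lock-free hardware gets more updates per second.
    import numpy as np

    rng = np.random.default_rng(0)
    d, n, steps, lr = 20, 2_000, 2_000, 0.01
    X = rng.normal(size=(n, d))
    w_true = rng.normal(size=d)
    y = X @ w_true

    def run(staleness: int) -> float:
        """Final mean squared error after `steps` stale-gradient SGD updates."""
        w = np.zeros(d)
        history = [w.copy()]              # past iterates, for stale reads
        for _ in range(steps):
            w_read = history[max(0, len(history) - 1 - staleness)]
            i = rng.integers(n)
            grad = (X[i] @ w_read - y[i]) * X[i]
            w = w - lr * grad
            history.append(w.copy())
        return float(np.mean((X @ w - y) ** 2))

    for s in (0, 2, 8):
        print(f"staleness={s}: final MSE = {run(s):.4f}")

The same number of updates buys less accuracy as staleness grows; whether the extra raw throughput pays for that loss is exactly the statistical-versus-hardware-efficiency question.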