Drugs, DNA, and Dinosaurs: Building High Quality Knowledge Bases with DeepDive

http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40208

Summary

  • KBs help with macroscopic questions
  • Probabilistic inference = algorithmic independence
  • Hardware-aware engine is our current approach

KBs help with macroscopic questions

The world’s scientific knowledge is, in principle, accessible. But pressing problems such as climate & biodiversity, health, and financial markets require macroscopic knowledge assembled from many individual findings, and we are still limited by human reading speed. Could we build a machine to read for us? Much of the relevant data is buried in tables and is not stated in a self-contained way.

PaleoDeepDive

The goal: extract paleobiological facts to build a higher-coverage fossil record. The aggressive approach: statistical inference over billions of variables.

PaleoDeepDive (machine-created)

  • 10x documents
  • 100x extractions

PaleoDB (human-created)

  • 329 volunteers
  • 13 years
  • 46K documents
  • 200+ papers, 17 in Nature/Science

Formation Precision

  • PaleoDB volunteers: 0.84
  • PaleoDeepDive: 0.94

PaleoDeepDive vs. PharmaDeepDive

The DeepDive framework is broadly usable: building PharmaDeepDive on top of it took far less effort than PaleoDeepDive.

  • PaleoDeepDive: 2 years, 1 CS student
  • PharmaDeepDive: 6 months, 1 BioE student

  • easier and cheaper the second time
  • features, not algorithms (see the feature sketch below)
  • the same framework carries over between domains
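
To make "features, not algorithms" concrete, below is a minimal, hypothetical sketch (in Python) of the kind of code an application writer supplies: a feature function over a candidate (taxon, formation) mention pair. The function name, the features, and the keyword list are illustrative assumptions, not DeepDive's actual API.

    # Hypothetical sketch: in a DeepDive-style system the application writer
    # supplies feature extractors over candidate extractions; learning and
    # inference are handled by the engine, not by application code.
    from typing import Iterator, List

    def taxon_formation_features(sentence: List[str],
                                 taxon_idx: int,
                                 formation_idx: int) -> Iterator[str]:
        """Emit string features for a candidate (taxon, formation) pair
        found in a tokenized sentence. All feature names are illustrative."""
        lo, hi = sorted((taxon_idx, formation_idx))
        between = sentence[lo + 1:hi]

        # Distance between the two mentions.
        yield f"NUM_TOKENS_BETWEEN={hi - lo}"

        # Words appearing between the mentions often signal the relation.
        for tok in between:
            yield f"WORD_BETWEEN={tok.lower()}"

        # Keywords commonly used when reporting fossil occurrences.
        if any(tok.lower() in {"from", "collected", "occurs"} for tok in between):
            yield "HAS_OCCURRENCE_KEYWORD"

    # Toy usage on a single sentence.
    sent = ("Tyrannosaurus specimens were collected from "
            "the Hell Creek Formation").split()
    print(list(taxon_formation_features(sent, taxon_idx=0, formation_idx=8)))

The engine turns these features into weighted factors and learns the weights, so switching domains mostly means switching feature functions.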

Probabilistic inference = algorithmic independence

The DeepDive Framework (Geo)

Data Acquisition -> State-of-the-art NLP -> Statistical Inference and Learning (DeepDive) -> Web Front-end
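
The statistical stage is where DeepDive differs from ad hoc pipelines: candidate extractions are grounded into a factor graph and marginal probabilities are computed by sampling, at the scale of billions of variables. Below is a minimal, self-contained Gibbs-sampling sketch over a toy factor graph of three Boolean variables; the variable names and factor weights are made up for illustration.

    # Minimal sketch of Gibbs sampling over a tiny factor graph of Boolean
    # variables. Real systems ground graphs with billions of variables; this
    # toy example only illustrates the style of computation involved.
    import math
    import random

    # Factors: (weight, variables). A factor contributes its weight to the
    # log-probability when all of its variables are True.
    factors = [
        (1.5, ["mention_is_taxon", "sentence_links_taxon_to_formation"]),
        (0.8, ["sentence_links_taxon_to_formation"]),
        (-2.0, ["mention_is_taxon", "mention_is_person"]),  # discourage both
    ]
    variables = {"mention_is_taxon": True,
                 "mention_is_person": False,
                 "sentence_links_taxon_to_formation": False}

    def conditional_prob_true(var: str) -> float:
        """P(var = True | all other variables), from the factors touching var."""
        def score(value: bool) -> float:
            assignment = dict(variables, **{var: value})
            return sum(w for w, vs in factors if all(assignment[v] for v in vs))
        return 1.0 / (1.0 + math.exp(score(False) - score(True)))

    random.seed(0)
    counts = {v: 0 for v in variables}
    sweeps = 10_000
    for _ in range(sweeps):
        for var in variables:          # resample each variable in turn
            variables[var] = random.random() < conditional_prob_true(var)
            counts[var] += variables[var]

    # Estimated marginal probabilities (the engine's output probabilities).
    for var, c in counts.items():
        print(f"P({var} = True) ~ {c / sweeps:.2f}")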

Meaningful probabilities, not just scores: the output probability should match the observed accuracy (e.g., extractions reported at 0.9 should be correct about 90% of the time).
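
One way to check that property is a calibration table: bin predictions by their output probability and compare each bin's average claimed probability with the fraction of those predictions that are actually correct. A minimal sketch, assuming you already have arrays of predicted probabilities and ground-truth labels:

    # Minimal calibration check: for well-calibrated output, predictions made
    # with probability ~p should be correct about a fraction p of the time.
    import numpy as np

    def calibration_table(probs: np.ndarray, labels: np.ndarray, n_bins: int = 10):
        """Per probability bin: (bin center, mean claimed prob, empirical accuracy, count)."""
        edges = np.linspace(0.0, 1.0, n_bins + 1)
        rows = []
        for lo, hi in zip(edges[:-1], edges[1:]):
            mask = (probs >= lo) & (probs < hi)
            if mask.any():
                rows.append(((lo + hi) / 2,
                             probs[mask].mean(),   # average claimed confidence
                             labels[mask].mean(),  # fraction actually correct
                             int(mask.sum())))
        return rows

    # Toy usage with simulated, roughly calibrated predictions.
    rng = np.random.default_rng(0)
    p = rng.uniform(size=5_000)
    y = (rng.uniform(size=5_000) < p).astype(float)   # correct with prob p
    for center, claimed, actual, n in calibration_table(p, y):
        print(f"bin {center:.2f}: claimed {claimed:.2f}, actual {actual:.2f}, n={n}")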

Hardware-aware engine is our current approach

To free users from worrying about algorithms, the inference engine itself has to be fast. Let the hardware do the work.

How do we run SGD in Parallel?

  • Naively there are many conflicts: concurrent workers all update … the model
  • Theorem: no locking is needed; lock-free updates still converge
  • High-level idea: the answer only has to be statistically correct, not bit-for-bit deterministic (see the sketch below)
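
The "no locking" result is the Hogwild!-style idea: let many threads update one shared weight vector with no synchronization at all, and accept occasional conflicting writes because the final answer only needs to be statistically correct. A minimal sketch on a least-squares problem (the problem, step size, and thread count are illustrative; in CPython the GIL also serializes much of this, so the point is the lock-free update pattern, not a speed claim):

    # Minimal sketch of lock-free parallel SGD: threads update a shared weight
    # vector with no locks; occasional conflicting writes are tolerated because
    # the answer only needs to be statistically correct.
    import threading
    import numpy as np

    rng = np.random.default_rng(0)
    d, n, lr = 20, 4_000, 0.01
    X = rng.normal(size=(n, d))
    w_true = rng.normal(size=d)
    y = X @ w_true + 0.01 * rng.normal(size=n)

    w = np.zeros(d)   # shared model, read and written by all threads, no lock

    def worker(rows: np.ndarray) -> None:
        """Run SGD over the given rows, updating the shared w without locking."""
        for i in rows:
            grad = (X[i] @ w - y[i]) * X[i]   # gradient of 0.5 * (x.w - y)^2
            w[:] -= lr * grad                 # unsynchronized in-place update

    threads = [threading.Thread(target=worker, args=(rng.permutation(n),))
               for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    print("distance to true weights:", np.linalg.norm(w - w_true))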

Are CPUs really so slow for CNNs?

“Coming soon”

Key Issue: Statistical versus Hardware Efficiency

Relaxing consistency requires navigating the … tradeoff: you get more updates per second (hardware efficiency), but each update can make less progress (statistical efficiency). See the simulation below.
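
Below is a minimal, self-contained simulation of the statistical side of that tradeoff (not DeepDive's engine): SGD where each gradient is computed on weights that are a few updates stale, a crude stand-in for relaxed consistency. The problem, step size, and staleness values are illustrative assumptions.

    # Crude model of relaxed consistency: each gradient is computed on weights
    # that are `staleness` updates old. More staleness means less progress per
    # update, even though real lock-free hardware gets more updates per second.
    import numpy as np

    rng = np.random.default_rng(0)
    d, n, steps, lr = 20, 2_000, 2_000, 0.01
    X = rng.normal(size=(n, d))
    w_true = rng.normal(size=d)
    y = X @ w_true

    def run(staleness: int) -> float:
        """Final mean squared error after `steps` stale-gradient SGD updates."""
        w = np.zeros(d)
        history = [w.copy()]              # past iterates, for stale reads
        for _ in range(steps):
            w_read = history[max(0, len(history) - 1 - staleness)]
            i = rng.integers(n)
            grad = (X[i] @ w_read - y[i]) * X[i]
            w = w - lr * grad
            history.append(w.copy())
        return float(np.mean((X @ w - y) ** 2))

    for s in (0, 2, 8):
        print(f"staleness={s}: final MSE = {run(s):.4f}")

The same number of updates buys less accuracy as staleness grows; whether the extra raw throughput pays for that loss is exactly the statistical-versus-hardware-efficiency question.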