Top Ten Pitfalls to Avoid in a SQL-on-Hadoop Implementation

http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/38546

Summary

  • SQL-on-Hadoop Benefits
    • Excellent price/performance
    • SQL access
    • Incredible pace of product advancements
  • SQL-on-Hadoop Pitfalls
    • Crowded, confusing market
      • Hive, TEZ, Vertica, Splice, Impala, Drill, Vectorwise, Phoenix, Spark SQL, Presto, HAWQ …
    • Relatively immature products
    • Difficult to determine product differences
  • Recommendations
    • Review Top 10 pitfalls
      1. Individual Lookups and Range Queries
      2. Fixing Bad Data
      3. Sharding
      4. Hotspotting
      5. SQL Coverage
      6. Concurrency
      7. Columnar
      8. Node Sizing
      9. Brittle ETL on Hadoop
      10. Cost-Based Optimizer
    • Truly understand your workloads
    • Do POCSs to verify

Pitfall #1: Individual Record Lookups and Range Queries

Issues

  • Support real-time apps?
    • Concurrent readers/writers?
    • e.g., TPC-C A.5

Impact

  • Data delays
  • Inability to support:
    • many users
    • many sources
    • real-time apps

Examples

  • Affected:
    • Hive
    • HAWQ
  • Unaffected:
    • Splice
    • Phoenix

Solution

  • Avoid analytic systems designed for full table scans
  • Choose systems designed for range queries
  • Verification: do query <30ms?

Pitfall #2: Fixing Bad Data

Issues

  • Fix bad data without reloading all data

Impact

  • Bad reports
  • Delays to reload entire data set

Examples

  • Affected:
    • Impala
    • Drill
  • Unaffected:
    • Splice

Solution

  • Ability to update and delete
  • Transactions for multi-row updates
  • Verification: update one record in <30ms?

Pitfall #3: Sharding

Issues

  • Right key to distribute data
  • Right shard size

Impact

  • Slow queries, esp. for large joins or aggregations

Examples

  • Affected:
    • Splice
    • Phoenix
  • Unaffected:
    • Hive

Solution

  • Requires design time decisions and/or tuning
  • Small shards = more parallelization
  • Fewer shards = less memory and compaction overhead

Pitfall #4: Hotspotting

Issues

  • Data too concentrated in a few nodes
    • Especially for time series data
    • e.g., order data

Impact

  • Slow queries
  • Poor parallelization

Examples

  • Affected:
    • Splice
    • Phoenix
  • Unaffected:
    • Hive
    • Impala

Solution

  • Key design
  • Salt keys, if range queries not important

Pitfall #5: SQL Coverage

Issues

  • Limited SQL dialects
    • Why can’t I do full join syntax like ON or OR?
    • Can I do a correlated sub-query?

Impact

  • Can’t run queries to meet business needs

Examples

  • Affected:
    • Hive
    • Impala
  • Unaffected:
    • Splice
    • HOWQ

Solution

  • Do your homework
  • Compile list of toughest queries and test

Pitfall #6: Concurrency

Issues

  • Power real-time apps
  • Support 1000s of readers/writers at once
  • Commit records across tables atomically

Impact

  • Inability to:
    • Power real-time apps
    • Handle many users
    • Support many input sources
    • Deliver reports as updates happen

Examples

  • Affected:
    • Hive
    • Drill
  • Unaffected:
    • Splice

Solution

  • Lockless, high-concurrency transactions
  • Verification: TPC-C benchmark

Pitfall #7: Columnar

Issues

  • OLAP queries slower
    • Especially large joins or aggregations
    • e.g., TPC-H Q6

Impact

  • Poor productivity
    • Queries may take minutes or hours instead of seconds

Examples

  • Affected:
    • Splice
    • Phoenix
  • Unaffected:
    • Vertica
    • Spark SQL

Solution

  • Columnar, but not all created equal:
    1. Columnar storage/compression (4x advantage)
    2. Columnar projection (10x)
    3. Columnar engine (e.g., vector processing, pipelining) (50x)

Pitfall #8: Node Sizing

Issues

  • Big boxes vs. pizza boxes?
  • SSDs?
  • Memory size?

Impact

  • Price/performance
  • Performance bottlenecks

Examples

  • Affected:
    • Every distributed system

Solution

  • Defalut: stick to pizza boxes with moderate memory (32-64 GB) and no SSDs unless a good reason
  • Do your homework – profile your workload
    • CPU bound? Momory? I/O? Network?

Pitfall #9: Brittle ETL on Hadoop

Issues

  • Restart ETL pipeline because of errors
  • Serialize ETL Pipeline

Impact

  • Miss ETL window
  • Delay reports to business users

Examples

  • Affected:
    • Hive
    • Impala
  • Unaffected:
    • Splice

Solution

  • Incremental updates with transactions to addredd data quality

Pitfall #10: Cost-Based Optimizer

Issues

  • Choose right join strategy
  • Choose right index
  • Choose right ordering

Impact

  • Poor performance = poor productivity
  • Manual tuning by DBAs

Examples

  • Affected:
    • Presto
    • Phoenix
  • Unaffected:
    • HAWQ
    • Vertica

Solution

  • Cost-based optimizer
  • However, maturity for OLAP queries may be issue for newer products like Hive, Impala, and Drill