Top Ten Pitfalls to Avoid in a SQL-on-Hadoop Implementation¶

http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/38546

Summary¶

SQL-on-Hadoop Benefits
- Excellent price/performance
- SQL access
- Incredible pace of product advancements
SQL-on-Hadoop Pitfalls
- Crowded, confusing market
  - Hive, TEZ, Vertica, Splice, Impala, Drill, Vectorwise, Phoenix, Spark SQL, Presto, HAWQ …
- Relatively immature products
- Difficult to determine product differences
Recommendations
- Review Top 10 pitfalls
  1. Individual Lookups and Range Queries
  2. Fixing Bad Data
  3. Sharding
  4. Hotspotting
  5. SQL Coverage
  6. Concurrency
  7. Columnar
  8. Node Sizing
  9. Brittle ETL on Hadoop
  10. Cost-Based Optimizer
- Truly understand your workloads
- Do POCSs to verify

Pitfall #1: Individual Record Lookups and Range Queries¶

Issues¶

Support real-time apps?
- Concurrent readers/writers?
- e.g., TPC-C A.5

Impact¶

Data delays
Inability to support:
- many users
- many sources
- real-time apps

Examples¶

Affected:
- Hive
- HAWQ
Unaffected:
- Splice
- Phoenix

Solution¶

Avoid analytic systems designed for full table scans
Choose systems designed for range queries
Verification: do query <30ms?

Pitfall #2: Fixing Bad Data¶

Issues¶

Fix bad data without reloading all data

Impact¶

Bad reports
Delays to reload entire data set

Examples¶

Affected:
- Impala
- Drill
Unaffected:
- Splice

Solution¶

Ability to update and delete
Transactions for multi-row updates
Verification: update one record in <30ms?

Pitfall #3: Sharding¶

Issues¶

Right key to distribute data
Right shard size

Impact¶

Slow queries, esp. for large joins or aggregations

Examples¶

Affected:
- Splice
- Phoenix
Unaffected:
- Hive

Solution¶

Requires design time decisions and/or tuning
Small shards = more parallelization
Fewer shards = less memory and compaction overhead

Pitfall #4: Hotspotting¶

Issues¶

Data too concentrated in a few nodes
- Especially for time series data
- e.g., order data

Impact¶

Slow queries
Poor parallelization

Examples¶

Affected:
- Splice
- Phoenix
Unaffected:
- Hive
- Impala

Solution¶

Key design
Salt keys, if range queries not important

Pitfall #5: SQL Coverage¶

Issues¶

Limited SQL dialects
- Why can’t I do full join syntax like ON or OR?
- Can I do a correlated sub-query?

Impact¶

Can’t run queries to meet business needs

Examples¶

Affected:
- Hive
- Impala
Unaffected:
- Splice
- HOWQ

Solution¶

Do your homework
Compile list of toughest queries and test

Note¶

Correlated subquery that Impala has issues with:
- http://community.cloudera.com/t5/Interactive-Short-cycle-SQL/Plans-for-Impala-to-support-correlated-subqueries/td-p/16620
Subquery restrictions:
- http://www.cloudera.com/content/cloudera/en/documentation/cloudera-impala/latest/topics/impala_subqueries.html

Pitfall #6: Concurrency¶

Issues¶

Power real-time apps
Support 1000s of readers/writers at once
Commit records across tables atomically

Impact¶

Inability to:
- Power real-time apps
- Handle many users
- Support many input sources
- Deliver reports as updates happen

Examples¶

Affected:
- Hive
- Drill
Unaffected:
- Splice

Solution¶

Lockless, high-concurrency transactions
Verification: TPC-C benchmark

Pitfall #7: Columnar¶

Issues¶

OLAP queries slower
- Especially large joins or aggregations
- e.g., TPC-H Q6

Impact¶

Poor productivity
- Queries may take minutes or hours instead of seconds

Examples¶

Affected:
- Splice
- Phoenix
Unaffected:
- Vertica
- Spark SQL

Solution¶

Columnar, but not all created equal:
1. Columnar storage/compression (4x advantage)
2. Columnar projection (10x)
3. Columnar engine (e.g., vector processing, pipelining) (50x)

Note¶

Who is How Columnar?
- http://skylandtech.net/who-is-how-columnar/

Pitfall #8: Node Sizing¶

Issues¶

Big boxes vs. pizza boxes?
SSDs?
Memory size?

Impact¶

Price/performance
Performance bottlenecks

Examples¶

Affected:
- Every distributed system

Solution¶

Defalut: stick to pizza boxes with moderate memory (32-64 GB) and no SSDs unless a good reason
Do your homework – profile your workload
- CPU bound? Momory? I/O? Network?

Pitfall #9: Brittle ETL on Hadoop¶

Issues¶

Restart ETL pipeline because of errors
Serialize ETL Pipeline

Impact¶

Miss ETL window
Delay reports to business users

Examples¶

Affected:
- Hive
- Impala
Unaffected:
- Splice

Solution¶

Incremental updates with transactions to addredd data quality

Pitfall #10: Cost-Based Optimizer¶

Issues¶

Choose right join strategy
Choose right index
Choose right ordering

Impact¶

Poor performance = poor productivity
Manual tuning by DBAs

Examples¶

Affected:
- Presto
- Phoenix
Unaffected:
- HAWQ
- Vertica

Solution¶

Cost-based optimizer
However, maturity for OLAP queries may be issue for newer products like Hive, Impala, and Drill

Slide¶

http://cdn.oreillystatic.com/en/assets/1/event/118/Top%20Ten%20Pitfalls%20to%20Avoid%20in%20a%20SQL-on-Hadoop%20Implementation%20Presentation.pptx