.. _150220_1600_top_ten_pitfalls_to_avoid_in_a_sql-on-hadoop_implementation:

===========================================================
Top Ten Pitfalls to Avoid in a SQL-on-Hadoop Implementation
===========================================================

http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/38546

-------
Summary
-------

* SQL-on-Hadoop Benefits

  - Excellent price/performance
  - SQL access
  - Incredible pace of product advancements

* SQL-on-Hadoop Pitfalls

  - Crowded, confusing market

    - Hive, TEZ, Vertica, Splice, Impala, Drill, Vectorwise, Phoenix, Spark SQL, Presto, HAWQ ...

  - Relatively immature products
  - Difficult to determine product differences

* Recommendations

  - Review Top 10 pitfalls

    #. Individual Lookups and Range Queries
    #. Fixing Bad Data
    #. Sharding
    #. Hotspotting
    #. SQL Coverage
    #. Concurrency
    #. Columnar
    #. Node Sizing
    #. Brittle ETL on Hadoop
    #. Cost-Based Optimizer

  - Truly understand your workloads
  - Do POCSs to verify


-------------------------------------------------------
Pitfall #1: Individual Record Lookups and Range Queries
-------------------------------------------------------

Issues
======

* Support real-time apps?

  - Concurrent readers/writers?
  - e.g., TPC-C A.5

Impact
======

* Data delays
* Inability to support:

  - many users
  - many sources
  - real-time apps

Examples
========

* Affected:

  - Hive
  - HAWQ

* Unaffected:

  - Splice
  - Phoenix

Solution
========

* Avoid analytic systems designed for full table scans
* Choose systems designed for range queries
* Verification: do query <30ms?


---------------------------
Pitfall #2: Fixing Bad Data
---------------------------

Issues
======

* Fix bad data without reloading all data

Impact
======

* Bad reports
* Delays to reload entire data set

Examples
========

* Affected:

  - Impala
  - Drill

* Unaffected:

  - Splice

Solution
========

* Ability to update and delete
* Transactions for multi-row updates
* Verification: update one record in <30ms?


--------------------
Pitfall #3: Sharding
--------------------

Issues
======

* Right key to distribute data
* Right shard size

Impact
======

* Slow queries, esp. for large joins or aggregations

Examples
========

* Affected:

  - Splice
  - Phoenix

* Unaffected:

  - Hive

Solution
========

* Requires design time decisions and/or tuning
* Small shards = more parallelization
* Fewer shards = less memory and compaction overhead


-----------------------
Pitfall #4: Hotspotting
-----------------------

Issues
======

* Data too concentrated in a few nodes

  - Especially for time series data
  - e.g., order data

Impact
======

* Slow queries
* Poor parallelization

Examples
========

* Affected:

  - Splice
  - Phoenix

* Unaffected:

  - Hive
  - Impala

Solution
========

* Key design
* Salt keys, if range queries not important


------------------------
Pitfall #5: SQL Coverage
------------------------

Issues
======

* Limited SQL dialects

  - Why can't I do full join syntax like ON or OR?
  - Can I do a correlated sub-query?

Impact
======

* Can't run queries to meet business needs

Examples
========

* Affected:

  - Hive
  - Impala

* Unaffected:

  - Splice
  - HOWQ

Solution
========

* Do your homework
* Compile list of toughest queries and test

Note
====

* Correlated subquery that Impala has issues with:

  - http://community.cloudera.com/t5/Interactive-Short-cycle-SQL/Plans-for-Impala-to-support-correlated-subqueries/td-p/16620

* Subquery restrictions:

  - http://www.cloudera.com/content/cloudera/en/documentation/cloudera-impala/latest/topics/impala_subqueries.html


-----------------------
Pitfall #6: Concurrency
-----------------------

Issues
======

* Power real-time apps
* Support 1000s of readers/writers at once
* Commit records across tables atomically

Impact
======

* Inability to:

  - Power real-time apps
  - Handle many users
  - Support many input sources
  - Deliver reports as updates happen

Examples
========

* Affected:

  - Hive
  - Drill

* Unaffected:

  - Splice

Solution
========

* Lockless, high-concurrency transactions
* Verification: TPC-C benchmark


--------------------
Pitfall #7: Columnar
--------------------

Issues
======

* OLAP queries slower

  - Especially large joins or aggregations
  - e.g., TPC-H Q6

Impact
======

* Poor productivity

  - Queries may take minutes or hours instead of seconds

Examples
========

* Affected:

  - Splice
  - Phoenix

* Unaffected:

  - Vertica
  - Spark SQL

Solution
========

* Columnar, but not all created equal:

  #. Columnar storage/compression (4x advantage)
  #. Columnar projection (10x)
  #. Columnar engine (e.g., vector processing, pipelining) (50x)

Note
====

* Who is How Columnar?

  - http://skylandtech.net/who-is-how-columnar/


-----------------------
Pitfall #8: Node Sizing
-----------------------

Issues
======

* Big boxes vs. pizza boxes?
* SSDs?
* Memory size?

Impact
======

* Price/performance
* Performance bottlenecks

Examples
========

* Affected:

  - Every distributed system

Solution
========

* Defalut: stick to pizza boxes with moderate memory (32-64 GB) and no SSDs unless a good reason
* Do your homework -- profile your workload

  - CPU bound? Momory? I/O? Network?


---------------------------------
Pitfall #9: Brittle ETL on Hadoop
---------------------------------

Issues
======

* Restart ETL pipeline because of errors
* Serialize ETL Pipeline

Impact
======

* Miss ETL window
* Delay reports to business users

Examples
========

* Affected:

  - Hive
  - Impala

* Unaffected:

  - Splice

Solution
========

* Incremental updates with transactions to addredd data quality


---------------------------------
Pitfall #10: Cost-Based Optimizer
---------------------------------

Issues
======

* Choose right join strategy
* Choose right index
* Choose right ordering

Impact
======

* Poor performance = poor productivity
* Manual tuning by DBAs

Examples
========

* Affected:

  - Presto
  - Phoenix

* Unaffected:

  - HAWQ
  - Vertica

Solution
========

* Cost-based optimizer
* However, maturity for OLAP queries may be issue for newer products like Hive, Impala, and Drill


-----
Slide
-----

http://cdn.oreillystatic.com/en/assets/1/event/118/Top%20Ten%20Pitfalls%20to%20Avoid%20in%20a%20SQL-on-Hadoop%20Implementation%20Presentation.pptx