We ran precisely the same open decision-support benchmark derived from TPC-DS as described in our previous testing, with queries categorized into Interactive, Reporting, and Deep Analytics buckets.
Because all tested engines except Impala lack a cost-based optimizer and predicate propagation, we ran the same queries that had been converted to SQL-style joins in the previous testing, and also manually propagated predicates where doing so was semantically equivalent. For consistency, we ran those same modified queries against Impala, although Impala produces identical results without these modifications.
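As an illustration of those two manual rewrites (the table and column names here are invented for the example, not taken from the actual benchmark queries), compare an implicit comma join with its converted, predicate-propagated form:

```python
# Hypothetical before/after of the two manual rewrites described above:
# (1) convert an implicit (comma) join to an explicit SQL-style JOIN, and
# (2) copy a filter across the join key where it is semantically equivalent,
#     so engines without predicate propagation can still prune early.
original = """
SELECT s.item_id, SUM(s.price)
FROM store_sales s, date_dim d
WHERE s.sold_date_sk = d.date_sk
  AND d.date_sk BETWEEN 2450815 AND 2450845
GROUP BY s.item_id
"""

rewritten = """
SELECT s.item_id, SUM(s.price)
FROM store_sales s
JOIN date_dim d ON s.sold_date_sk = d.date_sk
WHERE d.date_sk BETWEEN 2450815 AND 2450845
  AND s.sold_date_sk BETWEEN 2450815 AND 2450845  -- manually propagated
GROUP BY s.item_id
"""
```

An engine with predicate propagation would derive the second range condition on `store_sales` automatically from the equality join predicate.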
In the case of Shark, manual query hints were needed in addition to the modifications above to complete the query runs. Furthermore, Shark required more memory than was available in the cluster to run the Reporting and Deep Analytics queries on RDDs, so those queries could not be completed. We selected comparable file formats across all engines, consistently using Snappy compression to ensure apples-to-apples comparisons.
Furthermore, each engine was tested on the file format that ensures its best possible performance while keeping the comparison fair and consistent. (Note that native support for Parquet in both Shark and Presto is forthcoming.) Standard methodical testing techniques (multiple runs, tuning, and so on) were used for each of the engines involved.

Results

Single User

Impala on Parquet was the performance leader by a substantial margin, running on average 5x faster than its next-best alternative, Shark 0.
The two Hive-on-DAG implementations produced similar results, which is consistent with what one would expect given their highly similar designs. Presto is the youngest implementation of the four and is held back by the fact that it runs on RCFile, a much less effective columnar format than Parquet.
We look forward to re-running these benchmarks in a few months, when Presto runs on Parquet. Although these results are exciting in themselves, as previously explained we believe that measuring latency under a multi-user workload is a more valuable metric, because you would very rarely, if ever, commit your entire cluster to a single query at a time.
Multiple Users

In this test of a concurrent workload, we ran seven Interactive queries (q42, q52, q55, q63, q68, q73, q98) 10 times concurrently. To prevent all streams from running the same queries at the same time, the queries were run back-to-back in randomized order.
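A driver for such a workload can be sketched as follows. This is an illustrative harness, not the actual test tooling; `run_query` stands in for whatever submits a query to the engine under test and returns its latency.

```python
import random
import threading

# The seven Interactive queries used in the concurrency test.
INTERACTIVE = ["q42", "q52", "q55", "q63", "q68", "q73", "q98"]

def run_stream(stream_id, run_query, runs, results):
    """One simulated user: run every Interactive query back-to-back,
    in a freshly shuffled order on each pass, for `runs` passes."""
    rng = random.Random(stream_id)  # per-stream seed -> streams diverge
    for _ in range(runs):
        order = INTERACTIVE[:]
        rng.shuffle(order)          # avoid all streams running the same
        for q in order:             # query at the same moment
            results.append((stream_id, q, run_query(q)))

def drive(run_query, users=10, runs=10):
    """Launch `users` concurrent streams and collect (stream, query,
    latency) tuples once every stream has finished."""
    results = []
    threads = [
        threading.Thread(target=run_stream, args=(i, run_query, runs, results))
        for i in range(users)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

With `users=10` and `runs=10`, each stream issues all seven queries per pass, so the full run produces 700 query executions.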
Furthermore, because we could not run the full query set for Shark on RDDs, we used only the partitions necessary for the Interactive queries to do the single-user and 10-user comparisons. In this run, Impala widened its performance advantage, performing 9.
Throughput and Hardware Utilization

In the chart above, you can see that under the simulated load of 10 concurrent users, Impala slows down by 1. This performance difference translates directly into the quality of experience perceived by the BI user. At the saturation point, query latency is low and throughput is high.

Scalable in the number of users: adding users after saturation leads to proportionally increasing latency without compromising throughput.
Scalable in cluster size: adding hardware to the system leads to proportionally increasing throughput and decreasing latency. As you can see, for a given latency threshold, increasing the cluster size increases the number of users supported. The general shape of the graphs is identical for different thresholds. At a fixed size, a cluster can absorb more users at the cost of increasing latency, allowing it to support many users before exceeding a given threshold. This result is impressive.
It supports the contention that to maintain low query latency while adding more users, you can simply add more nodes to the cluster.

Saturation Throughput Scales with Cluster Size

Cluster saturation throughput is another important performance metric.
When there is a large number of users (that is, when the cluster is saturated), you want larger clusters to run through the queries proportionally faster.
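That expectation is just linear scaling at saturation. A toy model (all figures hypothetical, not measured values from these tests) makes the relationship explicit:

```python
def saturation_throughput(nodes, per_node_throughput_qpm):
    """At saturation every node is fully busy, so aggregate throughput
    (queries per minute) should grow roughly linearly with cluster size."""
    return nodes * per_node_throughput_qpm

def avg_latency(concurrent_users, throughput_qpm):
    # Little's law: average latency = concurrency / throughput.
    return concurrent_users / throughput_qpm

# Hypothetical: doubling a 10-node cluster that sustains 6 queries/min/node
# doubles saturation throughput, halving latency for the same 120 users.
t10 = saturation_throughput(10, 6)   # 60 queries/min
t20 = saturation_throughput(20, 6)   # 120 queries/min
```

The same model explains the user-scaling graphs: past saturation, latency grows in proportion to the number of concurrent users while throughput stays pinned at the saturation value.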
How Impala Scales for Business Intelligence: New Test Results - Cloudera Engineering Blog
The results indicate that this is indeed the case with Impala. Again, the results indicate that increasing the cluster size will allow you to increase the cluster saturation throughput.

Cluster Behavior as More Users Are Added

A different view of the data allows one to identify distinct cluster operating regions as more users are added.
The results below show query latency on the node cluster as we progressively add more users. There is initially a region where the cluster is under-utilized, and query latency remains constant even as we add more users.
At the extreme right of the graph, there is a saturation region, where adding more users results in proportionally longer query latency. There is also a transition region in between. The shape of the graph on the right side is important because it indicates gracefully degrading query latency.
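These regions can be captured by an idealized piecewise model (the function and its numbers are illustrative, not fitted to the measurements):

```python
def modeled_latency(users, base_latency, saturation_users):
    """Idealized two-region latency curve: flat while the cluster is
    under-utilized, then growing in proportion to the number of users
    once the cluster is saturated (excess queries wait their turn)."""
    if users <= saturation_users:
        return base_latency          # under-utilized region: constant
    return base_latency * users / saturation_users  # saturation region
```

For example, with a 2-second base latency and saturation at 10 users, the model stays at 2 seconds up to 10 users and reaches 4 seconds at 20 users, which is the graceful, proportional degradation described above rather than a pathological blow-up.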
So although fluctuations in real-life workloads can often take the cluster beyond saturation, in those conditions query latencies would not become pathologically high. Thus, Impala is efficiently utilizing the available hardware resources. The variation in CPU utilization is due to execution skew (more about that below). The graph on the right shows the memory utilization.

Why Admission Control is Necessary

Previously you saw that when we add more users to a cluster, we get gracefully degrading query latency.
We achieve this behavior by configuring Admission Control. When users submit queries beyond the limit, the cluster puts the additional queries in a waiting queue. That is how we achieve graceful latency degradation. The following is a conservative heuristic to set the admission control limit.
It requires running typical queries in the workload in a stand-alone fashion, then finding the per-node peak memory use from the query profiles. On larger clusters, the same queries result in lower values for the per-node peak memory use, and this heuristic will give higher limits for Admission Control.
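A sketch of that heuristic (the function name and the figures are hypothetical; in practice the per-node peaks come from the query profiles of stand-alone runs):

```python
def admission_control_limit(mem_per_node_bytes, per_node_peaks_bytes):
    """Conservative Admission Control limit: memory available to the
    query engine on each node, divided by the worst per-node peak memory
    observed when running the workload's typical queries stand-alone."""
    worst_case = max(per_node_peaks_bytes)
    return max(1, mem_per_node_bytes // worst_case)

GiB = 1 << 30
# Hypothetical: 96 GiB per node for queries; stand-alone peaks of 2/5/8 GiB.
limit = admission_control_limit(96 * GiB, [2 * GiB, 5 * GiB, 8 * GiB])  # 12
```

Sizing against the worst-case peak is what makes the heuristic conservative: most concurrent mixes will use less memory than the limit assumes. It also shows why larger clusters get higher limits, since the same queries spread their peak memory across more nodes, shrinking the divisor.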
Scaling Overhead and Execution Skew

A query engine that scales simplifies cluster planning and operations: to add more users while maintaining query latency, just add more nodes. However, the behavior is not ideal. In our tests, increasing the cluster size Nx resulted in a below-Nx increase in the number of users supported at the same latency.
One reason for this scaling overhead is skew in how the workload is executed. The graph below shows what happens when we vary the cluster size. On small clusters, there is almost no gap between maximum and minimum CPU utilization across the cluster. On large clusters, there is a large gap. The overall performance is bottlenecked on the slowest node.
The bigger the gap, the bigger the skew, and the bigger the scaling overhead. The primary source of execution skew is the fact-table scan (the HDFS scan operator). It arises out of uneven data placement across nodes.
The fact table is partitioned by date. Many of the queries filter by date ranges in a month, so all but 30 partitions will be pruned.
Most partitions have three blocks, and the average is 4.
The net effect is that after partition pruning, there will be around 30 partitions remaining, whose Parquet blocks will be scanned across the cluster. On a large cluster, most nodes will scan one or two blocks, while some nodes could end up scanning three or more blocks, or none at all.
This is a source of heavy skew. Smaller clusters would have on average more blocks per node, and statistically smooth out the node-to-node variation.
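A quick Monte Carlo sketch shows this effect. The block count is illustrative (roughly 30 surviving partitions at about three blocks each), and uniform random placement is only an approximation of HDFS block placement, but the trend it demonstrates is the one described above: the same pruned scan is relatively more skewed on a larger cluster.

```python
import random

def scan_skew(blocks, nodes, trials=500, seed=0):
    """Relative skew of the fact-table scan: ratio of the most-loaded
    node's block count to the ideal even share (blocks / nodes),
    averaged over random placements. The most-loaded node gates the
    scan, so this ratio approximates the scaling overhead."""
    rng = random.Random(seed)
    even_share = blocks / nodes
    total = 0.0
    for _ in range(trials):
        counts = [0] * nodes
        for _ in range(blocks):
            counts[rng.randrange(nodes)] += 1  # place block on random node
        total += max(counts) / even_share
    return total / trials

# ~90 blocks after pruning: on 10 nodes each node averages 9 blocks and
# the variation washes out; on 80 nodes the average is ~1.1 blocks, so a
# node with 3-4 blocks is several times slower than the ideal share.
small_cluster_skew = scan_skew(90, 10)
large_cluster_skew = scan_skew(90, 80)
```

This is the statistical smoothing argument in miniature: with more blocks per node, the relative deviation of the busiest node from the mean shrinks.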
We verified this behavior by examining the query profiles. The skew impacts all queries, and propagates through the rest of the query execution after the fact-table scan. In summary, a strong BI query engine:

Scales better to large clusters: as datasets and the number of users grow, clusters will also grow. Solutions that can prove themselves at large cluster sizes are better.

Achieves fast, interactive latency: this enables human specialists to explore the data, discover new ideas, then validate those ideas, all without waiting and losing their train of thought.

Makes efficient use of hardware, CPU as well as memory: one can always buy bigger machines and build larger clusters, but an efficient solution will support more users from the hardware available.

Adding more users should not require a complete redesign of the system, or a wholesale migration to a different platform.