```scala
scala> sql("select 1 + 1 + 1").explain(true)
== Parsed Logical Plan ==
'Project [unresolvedalias(((1 + 1) + 1), None)]
+- OneRowRelation$

== Analyzed Logical Plan ==
((1 + 1) + 1): int
Project [((1 + 1) + 1) AS ((1 + 1) + 1)#4]
+- OneRowRelation$

== Optimized Logical Plan ==
Project [3 AS ((1 + 1) + 1)#4]
+- OneRowRelation$

== Physical Plan ==
*Project [3 AS ((1 + 1) + 1)#4]
+- Scan OneRowRelation[]
```
# Logical Query Plan Optimizer

`Optimizer` is a `RuleExecutor` for logical plans. It uses a collection of logical plan optimizations.
Note: Catalyst is a Spark SQL framework for manipulating trees. It can work with trees of relational operators and expressions in logical plans before they end up as physical execution plans.
Spark 2.0 uses Catalyst’s tree manipulation library to build an extensible query plan optimizer with a number of query optimizations.
Catalyst supports both rule-based and cost-based optimization.
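A toy rewrite over the query from the top of this page makes the tree manipulation concrete. This is a minimal sketch only: the pattern folds just additions of two integer literals, while the Optimizer's real constant folding is far more general.

```scala
import org.apache.spark.sql.catalyst.expressions.{Add, Literal}

// Catalyst plans are trees, so transformAllExpressions can rewrite
// every matching expression node in the logical plan.
val analyzed = spark.sql("select 1 + 1 + 1").queryExecution.analyzed
val folded = analyzed transformAllExpressions {
  case Add(Literal(a: Int, _), Literal(b: Int, _)) => Literal(a + b)
}
// A single top-down pass folds only the innermost addition, turning
// ((1 + 1) + 1) into (2 + 1). That is one reason the Optimizer runs
// its rule batches repeatedly, up to a fixed point.
```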
## Collection of Logical Plan Optimizations — batches Method

```scala
batches: Seq[Batch]
```

`batches` returns a collection of logical plan optimization batches.
| Name | Description |
|------|-------------|
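As a hedged sketch of what such a batch looks like: a `RuleExecutor` groups rules into named batches, each with an execution strategy (`Once` or `FixedPoint`). `SketchOptimizer` and `NoopRule` below are illustrative names, not classes that ship with Spark.

```scala
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.{Rule, RuleExecutor}

// A hypothetical no-op logical plan rule, just to have a rule to batch.
object NoopRule extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan
}

// A hypothetical RuleExecutor with a single batch. Batch and FixedPoint
// are inherited from RuleExecutor; FixedPoint(100) re-runs the batch
// until the plan stops changing (or 100 iterations are reached).
object SketchOptimizer extends RuleExecutor[LogicalPlan] {
  override protected def batches: Seq[Batch] = Seq(
    Batch("Noop batch", FixedPoint(100), NoopRule)
  )
}
```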
## SparkOptimizer — The Default Logical Query Plan Optimizer

`SparkOptimizer` is the default logical query plan optimizer with the logical plan optimizations. It is available as the `optimizer` attribute of `SessionState`:

```scala
sparkSession.sessionState.optimizer
```
Note: `SparkOptimizer` is merely used to compute the optimized `LogicalPlan` for a `QueryExecution` (available as `optimizedPlan`).
`SparkOptimizer` requires a `SessionCatalog`, a `SQLConf`, and `ExperimentalMethods` with user-defined experimental methods.
Note: `SparkOptimizer`'s input `experimentalMethods` serves as an extension point for custom `ExperimentalMethods`.
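A minimal sketch of using that extension point, assuming the `spark` session of a Spark shell; `MyCustomRule` is a hypothetical no-op standing in for a real optimization.

```scala
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// A hypothetical no-op rule standing in for a real custom optimization.
object MyCustomRule extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan
}

// ExperimentalMethods is available as spark.experimental; rules added to
// extraOptimizations are picked up by SparkOptimizer as the User Provided
// Optimizers batch (see the list below).
spark.experimental.extraOptimizations = Seq(MyCustomRule)
```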
`SparkOptimizer` extends the `Optimizer` batches with the following batches:

- Optimize Metadata Only Query (as `OptimizeMetadataOnlyQuery`)
- Extract Python UDF from Aggregate (as `ExtractPythonUDFFromAggregate`)
- Prune File Source Table Partitions (as `PruneFileSourcePartitions`)
- User Provided Optimizers for the input user-defined `ExperimentalMethods`
You can see the result of executing `SparkOptimizer` on a query plan using the `optimizedPlan` attribute of `QueryExecution`.
```scala
// Applying two filters in sequence on purpose
// We want to kick the CombineTypedFilters optimizer in
val dataset = spark.range(10).filter(_ % 2 == 0).filter(_ == 0)

// optimizedPlan is a lazy value
// Only the first time you call it does it trigger the optimizations
// Subsequent calls end up with the cached, already-optimized result
// Use explain to trigger the optimizations again
scala> dataset.queryExecution.optimizedPlan
res0: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
TypedFilter <function1>, class java.lang.Long, [StructField(value,LongType,true)], newInstance(class java.lang.Long)
+- Range (0, 10, step=1, splits=Some(8))
```
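Per the last comment in the snippet, `explain` goes through the whole query planning, including the optimizations, again (console output omitted here):

```scala
// Re-triggers the optimizations (and the other planning phases)
// and prints all plans to the console.
dataset.explain(true)
```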
Tip: Enable logging for `SparkOptimizer` by adding the appropriate logger configuration to `conf/log4j.properties`. Refer to Logging.
## Further reading or watching

- (video) Modern Spark DataFrame and Dataset (Intermediate Tutorial) by Adam Breindel from Databricks.