Settings

The following are the settings used to configure Spark SQL applications.

You can set them on a SparkSession upon instantiation using the config method.

import org.apache.spark.sql.SparkSession
val spark: SparkSession = SparkSession.builder
  .master("local[*]")
  .appName("My Spark Application")
  .config("spark.sql.warehouse.dir", "c:/Temp") (1)
  .getOrCreate
1. Sets spark.sql.warehouse.dir for the Spark SQL session

spark.sql.warehouse.dir

spark.sql.warehouse.dir (default: ${system:user.dir}/spark-warehouse) is the default location of the Hive warehouse directory (using Derby) with managed databases and tables.

See also the official Hive Metastore Administration document.
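A minimal sketch of how the setting takes effect, assuming the SparkSession from the example above (the table name numbers is illustrative only): saving a DataFrame with saveAsTable stores it as a managed table under spark.sql.warehouse.dir.

import spark.implicits._

// The managed table ends up under the directory set via spark.sql.warehouse.dir
Seq((1, "one"), (2, "two")).toDF("id", "name")
  .write
  .saveAsTable("numbers")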

spark.sql.parquet.filterPushdown

spark.sql.parquet.filterPushdown (default: true) is a flag that controls the filter predicate push-down optimization for data sources using the parquet file format.
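A minimal sketch of toggling the optimization at runtime, assuming a SparkSession named spark and a placeholder path /tmp/people.parquet:

// Disable parquet filter push-down for this session
spark.conf.set("spark.sql.parquet.filterPushdown", "false")

// With push-down off, the filter is applied by Spark after rows are read
// rather than being pushed down to the parquet reader.
spark.read.parquet("/tmp/people.parquet")
  .filter("age > 21")
  .explain()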

spark.sql.catalogImplementation

spark.sql.catalogImplementation (default: in-memory) is an internal setting to select the active catalog implementation.

There are two possible values:

  • in-memory (default)

  • hive

Tip
You can enable Hive support in a SparkSession using enableHiveSupport builder method.
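A minimal sketch of creating a Hive-backed session, assuming Spark was built with Hive support (the application name is illustrative):

import org.apache.spark.sql.SparkSession

val spark: SparkSession = SparkSession.builder
  .master("local[*]")
  .appName("Hive-backed Catalog")
  .enableHiveSupport()  // sets spark.sql.catalogImplementation to hive
  .getOrCreate

// Should print "hive" rather than the default "in-memory"
println(spark.sparkContext.getConf.get("spark.sql.catalogImplementation", "in-memory"))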

spark.sql.shuffle.partitions

spark.sql.shuffle.partitions (default: 200) is the default number of partitions to use when shuffling data for joins or aggregations.
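A minimal sketch of lowering the number of shuffle partitions at runtime, assuming a SparkSession named spark (the value 8 is arbitrary):

import spark.implicits._

spark.conf.set("spark.sql.shuffle.partitions", "8")

// The groupBy below shuffles into 8 partitions instead of the default 200
val counts = Seq("a", "b", "a", "c").toDF("letter")
  .groupBy("letter")
  .count()

println(counts.rdd.getNumPartitions)  // expected: 8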

spark.sql.allowMultipleContexts

spark.sql.allowMultipleContexts (default: true) controls whether creating multiple SQLContexts/HiveContexts is allowed.

spark.sql.autoBroadcastJoinThreshold

spark.sql.autoBroadcastJoinThreshold (default: 10 * 1024 * 1024, i.e. 10 MB) configures the maximum size in bytes for a table that will be broadcast to all worker nodes when performing a join. If the size of a DataFrame, as estimated from the statistics of its logical plan, is at most this threshold, the DataFrame is broadcast for the join.

Negative values or 0 disable broadcasting.

Consult Broadcast Join for more information about the topic.
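A minimal sketch, assuming a SparkSession named spark: setting the threshold to -1 turns automatic broadcast joins off, while the broadcast function forces a specific DataFrame to be broadcast regardless of the threshold.

import org.apache.spark.sql.functions.broadcast
import spark.implicits._

// Disable automatic broadcast joins
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

val large = (1 to 1000).toDF("id")
val small = Seq((1, "one"), (2, "two")).toDF("id", "name")

// Explicitly broadcast the small side; the physical plan should show a broadcast join
large.join(broadcast(small), "id").explain()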

spark.sql.columnNameOfCorruptRecord

spark.sql.columnNameOfCorruptRecord - FIXME

spark.sql.dialect

spark.sql.dialect - FIXME

spark.sql.sources.default

spark.sql.sources.default (default: parquet) sets the default data source to use in input/output.

It is used when reading or writing data with DataFrameReader and DataFrameWriter, when creating an external table from a path (in Catalog.createExternalTable), and in the streaming DataStreamReader and DataStreamWriter.
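A minimal sketch, assuming a SparkSession named spark and a placeholder path /tmp/numbers: with the default value of parquet, save and load need no explicit format call.

import spark.implicits._

// Written as parquet because spark.sql.sources.default is parquet
Seq((1, "one"), (2, "two")).toDF("id", "name")
  .write
  .save("/tmp/numbers")

// Read back as parquet for the same reason
val df = spark.read.load("/tmp/numbers")
df.show()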

spark.sql.streaming.checkpointLocation

spark.sql.streaming.checkpointLocation is the default location for storing checkpoint data for continuously executing queries.
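A minimal sketch, assuming a SparkSession named spark and a socket source on localhost:9999 (both placeholders): with the setting defined, individual queries need not specify a checkpointLocation option.

// Default root directory for checkpoint data of streaming queries
spark.conf.set("spark.sql.streaming.checkpointLocation", "/tmp/checkpoints")

val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// No per-query checkpointLocation option is required here
val query = lines.writeStream
  .format("console")
  .start()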

spark.sql.codegen.wholeStage

spark.sql.codegen.wholeStage (default: true) controls whether a whole stage (of multiple operators) will be compiled into a single Java method (true) or not (false).
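A minimal sketch of comparing physical plans with the flag on and off, assuming a SparkSession named spark:

import spark.implicits._

val q = Seq(1, 2, 3).toDF("id").filter("id > 1")

spark.conf.set("spark.sql.codegen.wholeStage", "true")
q.explain()   // whole-stage-generated operators are marked with *

spark.conf.set("spark.sql.codegen.wholeStage", "false")
q.explain()   // the same operators without whole-stage code generation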
