Settings

The following are the settings used to configure Spark SQL applications.

You can set them on a SparkSession upon instantiation using the config method.

import org.apache.spark.sql.SparkSession
val spark: SparkSession = SparkSession.builder
  .master("local[*]")
  .appName("My Spark Application")
  .config("spark.sql.warehouse.dir", "c:/Temp") (1)
  .getOrCreate
1. Sets spark.sql.warehouse.dir for the Spark SQL session

spark.sql.warehouse.dir

spark.sql.warehouse.dir (default: ${system:user.dir}/spark-warehouse) is the default location of the Hive warehouse directory (using Derby) with managed databases and tables.

See also the official Hive Metastore Administration document.
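A minimal sketch of how the setting takes effect, assuming the SparkSession from the example above (the table name numbers is illustrative only): saving a DataFrame with saveAsTable stores it as a managed table under spark.sql.warehouse.dir.

import spark.implicits._

// The managed table ends up under the directory set via spark.sql.warehouse.dir
Seq((1, "one"), (2, "two")).toDF("id", "name")
  .write
  .saveAsTable("numbers")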

spark.sql.parquet.filterPushdown

spark.sql.parquet.filterPushdown (default: true) is a flag that controls the filter predicate push-down optimization for data sources using the parquet file format.
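A minimal sketch of toggling the optimization at runtime, assuming a SparkSession named spark and a placeholder path /tmp/people.parquet:

// Disable parquet filter push-down for this session
spark.conf.set("spark.sql.parquet.filterPushdown", "false")

// With push-down off, the filter is applied by Spark after rows are read
// rather than being pushed down to the parquet reader.
spark.read.parquet("/tmp/people.parquet")
  .filter("age > 21")
  .explain()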

spark.sql.catalogImplementation

spark.sql.catalogImplementation (default: in-memory) is an internal setting to select the active catalog implementation.

There are two possible values:

  • in-memory (default)

  • hive

Tip
You can enable Hive support in a SparkSession using enableHiveSupport builder method.
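A minimal sketch of creating a Hive-backed session, assuming Spark was built with Hive support (the application name is illustrative):

import org.apache.spark.sql.SparkSession

val spark: SparkSession = SparkSession.builder
  .master("local[*]")
  .appName("Hive-backed Catalog")
  .enableHiveSupport()  // sets spark.sql.catalogImplementation to hive
  .getOrCreate

// Should print "hive" rather than the default "in-memory"
println(spark.sparkContext.getConf.get("spark.sql.catalogImplementation", "in-memory"))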

spark.sql.shuffle.partitions

spark.sql.shuffle.partitions (default: 200) is the default number of partitions to use when shuffling data for joins or aggregations.
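A minimal sketch of lowering the number of shuffle partitions at runtime, assuming a SparkSession named spark (the value 8 is arbitrary):

import spark.implicits._

spark.conf.set("spark.sql.shuffle.partitions", "8")

// The groupBy below shuffles into 8 partitions instead of the default 200
val counts = Seq("a", "b", "a", "c").toDF("letter")
  .groupBy("letter")
  .count()

println(counts.rdd.getNumPartitions)  // expected: 8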

spark.sql.allowMultipleContexts

spark.sql.allowMultipleContexts (default: true) controls whether creating multiple SQLContexts/HiveContexts is allowed.

spark.sql.autoBroadcastJoinThreshold

spark.sql.autoBroadcastJoinThreshold (default: 10 * 1024 * 1024, i.e. 10 MB) configures the maximum size in bytes for a table that will be broadcast to all worker nodes when performing a join. If the size of a DataFrame, as estimated from the statistics of its logical plan, is at most this threshold, the DataFrame is broadcast for the join.

Negative values or 0 disable broadcasting.

Consult Broadcast Join for more information about the topic.
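A minimal sketch, assuming a SparkSession named spark: setting the threshold to -1 turns automatic broadcast joins off, while the broadcast function forces a specific DataFrame to be broadcast regardless of the threshold.

import org.apache.spark.sql.functions.broadcast
import spark.implicits._

// Disable automatic broadcast joins
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

val large = (1 to 1000).toDF("id")
val small = Seq((1, "one"), (2, "two")).toDF("id", "name")

// Explicitly broadcast the small side; the physical plan should show a broadcast join
large.join(broadcast(small), "id").explain()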

spark.sql.columnNameOfCorruptRecord

spark.sql.columnNameOfCorruptRecord - FIXME

spark.sql.dialect

spark.sql.dialect - FIXME

spark.sql.sources.default

spark.sql.sources.default (default: parquet) sets the default data source to use in input/output.

It is used when reading or writing data with DataFrameReader and DataFrameWriter, when creating an external table from a path (in Catalog.createExternalTable), and in the streaming DataStreamReader and DataStreamWriter.
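A minimal sketch, assuming a SparkSession named spark and a placeholder path /tmp/numbers: with the default value of parquet, save and load need no explicit format call.

import spark.implicits._

// Written as parquet because spark.sql.sources.default is parquet
Seq((1, "one"), (2, "two")).toDF("id", "name")
  .write
  .save("/tmp/numbers")

// Read back as parquet for the same reason
val df = spark.read.load("/tmp/numbers")
df.show()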

spark.sql.streaming.checkpointLocation

spark.sql.streaming.checkpointLocation is the default location for storing checkpoint data for continuously executing queries.
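A minimal sketch, assuming a SparkSession named spark and a socket source on localhost:9999 (both placeholders): with the setting defined, individual queries need not specify a checkpointLocation option.

// Default root directory for checkpoint data of streaming queries
spark.conf.set("spark.sql.streaming.checkpointLocation", "/tmp/checkpoints")

val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// No per-query checkpointLocation option is required here
val query = lines.writeStream
  .format("console")
  .start()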

spark.sql.codegen.wholeStage

spark.sql.codegen.wholeStage (default: true) controls whether a whole stage (of multiple operators) will be compiled into a single Java method (true) or not (false).
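A minimal sketch of comparing physical plans with the flag on and off, assuming a SparkSession named spark:

import spark.implicits._

val q = Seq(1, 2, 3).toDF("id").filter("id > 1")

spark.conf.set("spark.sql.codegen.wholeStage", "true")
q.explain()   // whole-stage-generated operators are marked with *

spark.conf.set("spark.sql.codegen.wholeStage", "false")
q.explain()   // the same operators without whole-stage code generation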
