import org.apache.spark.sql.SparkSession
val spark: SparkSession = SparkSession.builder
  .master("local[*]")
  .appName("My Spark Application")
  .config("spark.sql.warehouse.dir", "c:/Temp") // (1)
  .getOrCreate
(1) Sets spark.sql.warehouse.dir for the Spark SQL session
Settings
The following is a list of the settings used to configure Spark SQL applications.
You can set them in a SparkSession upon instantiation using the config method, as in the snippet above.
spark.sql.warehouse.dir
spark.sql.warehouse.dir (default: ${system:user.dir}/spark-warehouse) is the default location of the Hive warehouse directory (using Derby) with managed databases and tables.
See also the official Hive Metastore Administration document.
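A minimal sketch, assuming the SparkSession named spark from the snippet above, that reads the effective warehouse location back at runtime:

// Assumes an active SparkSession named spark.
// Reads the effective warehouse location from the session's runtime configuration.
println(spark.conf.get("spark.sql.warehouse.dir")) // e.g. c:/Temp, or <user.dir>/spark-warehouse by default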
spark.sql.parquet.filterPushdown
spark.sql.parquet.filterPushdown (default: true) is a flag that controls the filter predicate push-down optimization for data sources that use the Parquet file format.
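As a hedged illustration (the Parquet path below is hypothetical), the flag can be flipped per session and its effect inspected in the physical plan, where pushed-down predicates appear as PushedFilters:

// Assumes an active SparkSession named spark; /tmp/people.parquet is illustrative.
spark.conf.set("spark.sql.parquet.filterPushdown", "false") // disable push-down for this session
val adults = spark.read.parquet("/tmp/people.parquet").filter("age >= 18")
adults.explain() // with the flag off, the filter is not pushed down to the Parquet scan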
spark.sql.catalogImplementation
spark.sql.catalogImplementation (default: in-memory) is an internal setting to select the active catalog implementation.
There are two possible values:
- in-memory (default)
- hive
Tip: You can enable Hive support in a SparkSession using the enableHiveSupport builder method.
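A sketch of a Hive-enabled session, assuming the Hive classes are on the classpath:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .master("local[*]")
  .appName("Hive-enabled application")
  .enableHiveSupport() // switches spark.sql.catalogImplementation to hive
  .getOrCreate
println(spark.conf.get("spark.sql.catalogImplementation")) // should print hive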
spark.sql.shuffle.partitions
spark.sql.shuffle.partitions (default: 200) is the default number of partitions to use when shuffling data for joins or aggregations.
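A small sketch (the numbers are illustrative) that lowers the value for a local aggregation and checks the resulting number of post-shuffle partitions:

// Assumes an active SparkSession named spark.
import org.apache.spark.sql.functions.col

spark.conf.set("spark.sql.shuffle.partitions", "8")
val counts = spark.range(1000).groupBy(col("id") % 10).count()
println(counts.rdd.getNumPartitions) // 8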
spark.sql.allowMultipleContexts
spark.sql.allowMultipleContexts (default: true) controls whether creating multiple SQLContexts/HiveContexts is allowed.
spark.sql.autoBroadcastJoinThreshold
spark.sql.autoBroadcastJoinThreshold (default: 10 * 1024 * 1024) configures the maximum size in bytes for a table that will be broadcast to all worker nodes when performing a join. If the size of a DataFrame, as estimated from the statistics of its logical plan, is at most this value, the DataFrame is broadcast for the join.
Negative values or 0 disable broadcasting.
Consult Broadcast Join for more information about the topic.
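As a sketch of how the threshold is typically adjusted (the values are examples, not recommendations):

// Assumes an active SparkSession named spark.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1") // disable size-based broadcast joins
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", (50 * 1024 * 1024).toString) // allow tables up to 50 MB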
spark.sql.columnNameOfCorruptRecord
spark.sql.columnNameOfCorruptRecord…FIXME
spark.sql.dialect
spark.sql.dialect - FIXME
spark.sql.sources.default
spark.sql.sources.default (default: parquet) sets the default data source to use in input/output.
It is used when reading or writing data with DataFrameReader and DataFrameWriter, when creating an external table from a path (in Catalog.createExternalTable), and in the streaming DataStreamReader and DataStreamWriter.
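For illustration (the path is hypothetical), with the default left at parquet a format-less save and load round-trip uses the Parquet data source:

// Assumes an active SparkSession named spark; /tmp/numbers is illustrative.
spark.range(5).write.save("/tmp/numbers") // no format(...) given, so the default (Parquet) source is used
val numbers = spark.read.load("/tmp/numbers") // read back with the default source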
spark.sql.streaming.checkpointLocation
spark.sql.streaming.checkpointLocation is the default location for storing checkpoint data for continuously executing queries.
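A minimal sketch (the path is illustrative) of setting the default once per session instead of passing a checkpoint location to every query:

// Assumes an active SparkSession named spark.
spark.conf.set("spark.sql.streaming.checkpointLocation", "/tmp/checkpoints")

Individual streaming queries can still override it with the checkpointLocation option on DataStreamWriter.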