DAGScheduler — Stage-Oriented Scheduler

Note

The introduction that follows was highly influenced by the scaladoc of org.apache.spark.scheduler.DAGScheduler. As DAGScheduler is a private class it does not appear in the official API documentation. You are strongly encouraged to read the sources and only then read this and the related pages afterwards.

"Reading the sources", I say?! Yes, I am kidding!

Introduction

DAGScheduler is the scheduling layer of Apache Spark that implements stage-oriented scheduling. It transforms a logical execution plan (i.e. RDD lineage of dependencies built using RDD transformations) to a physical execution plan (using stages).

Figure 1. DAGScheduler Transforming RDD Lineage Into Stage DAG

After an action has been called, SparkContext hands over a logical plan to DAGScheduler that it in turn translates to a set of stages that are submitted as TaskSets for execution (see Execution Model).

dagscheduler rdd partitions job resultstage.png

Figure 2. Executing action leads to new ResultStage and ActiveJob in DAGScheduler

The fundamental concepts of DAGScheduler are jobs and stages (refer to Jobs and Stages respectively) that it tracks through internal registries and counters.

DAGScheduler works solely on the driver and is created as part of SparkContext’s initialization (right after TaskScheduler and SchedulerBackend are ready).

Figure 3. DAGScheduler as created by SparkContext with other services

DAGScheduler does three things in Spark (thorough explanations follow):

Computes an execution DAG, i.e. DAG of stages, for a job.
Determines the preferred locations to run each task on.
Handles failures due to shuffle output files being lost.

DAGScheduler computes a directed acyclic graph (DAG) of stages for each job, keeps track of which RDDs and stage outputs are materialized, and finds a minimal schedule to run jobs. It then submits stages to TaskScheduler.

In addition to coming up with the execution DAG, DAGScheduler also determines the preferred locations to run each task on, based on the current cache status, and passes the information to TaskScheduler.

Furthermore, it handles failures due to shuffle output files being lost, in which case old stages may need to be resubmitted. Failures within a stage that are not caused by shuffle file loss are handled by the TaskScheduler itself, which will retry each task a small number of times before cancelling the whole stage.

DAGScheduler uses an event queue architecture in which a thread can post DAGSchedulerEvent events, e.g. a new job or stage being submitted, that DAGScheduler reads and executes sequentially. See the section Internal Event Loop - dag-scheduler-event-loop.

DAGScheduler runs stages in topological order.

Table 1. `DAGScheduler` Internal Registries and Counters
Name	Description
`nextJobId`	The next job id counting from `0`. Used when `DAGScheduler` submits a job and a map stage, and runs an approximate job.
`nextStageId`	The next stage id counting from `0`. Used when `DAGScheduler` creates a shuffle map stage and a result stage. It is the key in stageIdToStage.
`stageIdToStage`	The lookup table for stages per their ids. Used when `DAGScheduler` creates a shuffle map stage, creates a result stage, cleans up job state and independent stages, is informed that a task is started, a taskset has failed, a job is submitted (to compute a `ResultStage`), a map stage was submitted, a task has completed or a stage was cancelled, updates accumulators, aborts a stage and fails a job and independent stages.
`jobIdToStageIds`	The lookup table of all stages per `ActiveJob` id
`shuffleIdToMapStage`	The lookup table of ShuffleMapStages per ShuffleDependency.
`jobIdToActiveJob`	The lookup table of `ActiveJob`s per job id.
`waitingStages`	The stages with parents to be computed
`runningStages`	The set of stages that are currently "running". A stage is added when submitMissingTasks gets executed (without first checking if the stage has not already been added).
`failedStages`	Stages that failed due to fetch failures (when a task fails with `FetchFailed` exception).
`activeJobs`	`ActiveJob` instances
`cacheLocs`	Registry of block locations per RDD and partition. The keys are RDD ids and the values are arrays indexed by partition numbers. Each array value is the set of locations where that RDD partition is cached, i.e. the BlockManagers of the blocks. `cacheLocs` is used when `DAGScheduler` is requested for the locations of the cache blocks of a RDD or clear them.
`failedEpoch`	The lookup table of lost executors and the epoch of the event.

Tip	Enable `INFO`, `DEBUG` or `TRACE` logging levels for `org.apache.spark.scheduler.DAGSchedule` logger to see what happens inside `DAGScheduler`. Add the following line to `conf/log4j.properties`: `log4j.logger.org.apache.spark.scheduler.DAGScheduler=TRACE` Refer to Logging.

DAGScheduler uses SparkContext, TaskScheduler, LiveListenerBus, MapOutputTracker and BlockManager for its services. However, at the very minimum, DAGScheduler takes a SparkContext only (and requests SparkContext for the other services).

DAGScheduler reports metrics about its execution (refer to the section Metrics).

When DAGScheduler schedules a job as a result of executing an action on a RDD or calling SparkContext.runJob() method directly, it spawns parallel tasks to compute (partial) results per partition.

Running Approximate Job — `runApproximateJob` Method

Caution

FIXME

`getPreferredLocsInternal` Internal Method

getPreferredLocsInternal(
  rdd: RDD[_],
  partition: Int,
  visited: HashSet[(RDD[_], Int)]): Seq[TaskLocation]

getPreferredLocsInternal…FIXME

`createResultStage` Internal Method

createResultStage(
  rdd: RDD[_],
  func: (TaskContext, Iterator[_]) => _,
  partitions: Array[Int],
  jobId: Int,
  callSite: CallSite): ResultStage

Caution

FIXME

`updateJobIdStageIdMaps` Method

Caution

FIXME

Creating DAGScheduler Instance

DAGScheduler(
  sc: SparkContext,
  taskScheduler: TaskScheduler,
  listenerBus: LiveListenerBus,
  mapOutputTracker: MapOutputTrackerMaster,
  blockManagerMaster: BlockManagerMaster,
  env: SparkEnv,
  clock: Clock = new SystemClock())

DAGScheduler takes a SparkContext, TaskScheduler, LiveListenerBus, MapOutputTrackerMaster, BlockManagerMaster, SparkEnv, and a Clock.

Note	`DAGScheduler` can reference all the services through a single SparkContext.

When created, DAGScheduler does the following (in order):

Creates DAGSchedulerSource metrics source,
Creates messageScheduler,
Creates eventProcessLoop,
Sets itself in the current TaskScheduler,
Initializes the internal registries and counters,
Starts eventProcessLoop.

LiveListenerBus Event Bus for SparkListenerEvents — `listenerBus` Property

listenerBus: LiveListenerBus

listenerBus is a LiveListenerBus to post scheduling events and is passed in when DAGScheduler is created.

`executorHeartbeatReceived` Method

executorHeartbeatReceived(
  execId: String,
  accumUpdates: Array[(Long, Int, Int, Seq[AccumulableInfo])],
  blockManagerId: BlockManagerId): Boolean

executorHeartbeatReceived posts a SparkListenerExecutorMetricsUpdate (to listenerBus) and informs BlockManagerMaster that blockManagerId block manager is alive (by posting BlockManagerHeartbeat).

Note	`executorHeartbeatReceived` is called when `TaskSchedulerImpl` handles `executorHeartbeatReceived`.

Cleaning Up After ActiveJob and Independent Stages — `cleanupStateForJobAndIndependentStages` Method

cleanupStateForJobAndIndependentStages(job: ActiveJob): Unit

cleanupStateForJobAndIndependentStages cleans up the state for job and any stages that are not part of any other job.

cleanupStateForJobAndIndependentStages looks the job up in the internal jobIdToStageIds registry.

If no stages are found, the following ERROR is printed out to the logs:

ERROR No stages registered for job [jobId]

Oterwise, cleanupStateForJobAndIndependentStages uses stageIdToStage registry to find the stages (the real objects not ids!).

For each stage, cleanupStateForJobAndIndependentStages reads the jobs the stage belongs to.

If the job does not belong to the jobs of the stage, the following ERROR is printed out to the logs:

ERROR Job [jobId] not registered for stage [stageId] even though that stage was registered for the job

If the job was the only job for the stage, the stage (and the stage id) gets cleaned up from the registries, i.e. runningStages, shuffleIdToMapStage, waitingStages, failedStages and stageIdToStage.

While removing from runningStages, you should see the following DEBUG message in the logs:

DEBUG Removing running stage [stageId]

While removing from waitingStages, you should see the following DEBUG message in the logs:

DEBUG Removing stage [stageId] from waiting set.

While removing from failedStages, you should see the following DEBUG message in the logs:

DEBUG Removing stage [stageId] from failed set.

After all cleaning (using stageIdToStage as the source registry), if the stage belonged to the one and only job, you should see the following DEBUG message in the logs:

DEBUG After removal of stage [stageId], remaining stages = [stageIdToStage.size]

The job is removed from jobIdToStageIds, jobIdToActiveJob, activeJobs registries.

The final stage of the job is removed, i.e. ResultStage or ShuffleMapStage.

Note	`cleanupStateForJobAndIndependentStages` is used in `handleTaskCompletion` when a `ResultTask` has completed successfully, failJobAndIndependentStages and markMapStageJobAsFinished.

Marking ShuffleMapStage Job Finished — `markMapStageJobAsFinished` Method

markMapStageJobAsFinished(job: ActiveJob, stats: MapOutputStatistics): Unit

markMapStageJobAsFinished marks the active job finished and notifies Spark listeners.

Internally, markMapStageJobAsFinished marks the zeroth partition finished and increases the number of tasks finished in job.

The job listener is notified about the 0th task succeeded.

The state of the job and independent stages are cleaned up.

Ultimately, SparkListenerJobEnd is posted to LiveListenerBus (as listenerBus) for the job, the current time (in millis) and JobSucceeded job result.

Note	`markMapStageJobAsFinished` is used in handleMapStageSubmitted and handleTaskCompletion.

Submitting Job — `submitJob` method

submitJob[T, U](
  rdd: RDD[T],
  func: (TaskContext, Iterator[T]) => U,
  partitions: Seq[Int],
  callSite: CallSite,
  resultHandler: (Int, U) => Unit,
  properties: Properties): JobWaiter[U]

submitJob creates a JobWaiter and posts a JobSubmitted event.

Figure 4. DAGScheduler.submitJob

Internally, submitJob does the following:

Checks whether partitions reference available partitions of the input rdd.
Increments nextJobId internal job counter.
Returns a 0-task JobWaiter when the number of partitions is zero.
Posts a JobSubmitted event and returns a JobWaiter.

You may see a IllegalArgumentException thrown when the input partitions references partitions not in the input rdd:

Attempting to access a non-existent partition: [p]. Total number of partitions: [maxPartitions]

Note	`submitJob` is called when `SparkContext` submits a job and `DAGScheduler` runs a job.

Note	`submitJob` assumes that the partitions of a RDD are indexed from 0 onwards in sequential order.

Submitting ShuffleDependency for Execution — `submitMapStage` Method

submitMapStage[K, V, C](
  dependency: ShuffleDependency[K, V, C],
  callback: MapOutputStatistics => Unit,
  callSite: CallSite,
  properties: Properties): JobWaiter[MapOutputStatistics]

submitMapStage creates a JobWaiter (that it eventually returns) and posts a MapStageSubmitted event to DAGScheduler’s Internal Event Bus).

Internally, submitMapStage increments nextJobId internal counter to get the job id.

submitMapStage then creates a JobWaiter (with the job id and with one artificial task that will however get completed only when the entire stage finishes).

submitMapStage announces the map stage submission application-wide (by posting a MapStageSubmitted to LiveListenerBus).

Note	A `MapStageSubmitted` holds the newly-created job id and `JobWaiter` with the input `dependency`, `callSite` and `properties` parameters.

submitMapStage returns the JobWaiter.

If the number of partition to compute is 0, submitMapStage throws a SparkException:

Can't run submitMapStage on RDD with 0 partitions

Note	`submitMapStage` is used when `SparkContext` submits a map stage for execution.

Posting StageCancelled Event — `cancelStage` Method

cancelStage(stageId: Int)

cancelJobGroup merely posts a StageCancelled event to the DAGScheduler’s Internal Event Bus.

Note	`cancelStage` is executed when a `SparkContext` is requested to cancel a stage.

Posting JobGroupCancelled Event — `cancelJobGroup` Method

cancelJobGroup(groupId: String): Unit

cancelJobGroup prints the following INFO message to the logs followed by posting a JobGroupCancelled event to the DAGScheduler’s Internal Event Bus.

INFO Asked to cancel job group [groupId]

Note	`cancelJobGroup` is executed when a `SparkContext` is requested to cancel a specified group of jobs.

Posting AllJobsCancelled Event — `cancelAllJobs` Method

cancelAllJobs(): Unit

cancelAllJobs merely posts a AllJobsCancelled event to the DAGScheduler’s Internal Event Bus.

Note	`cancelAllJobs` is executed when a `SparkContext` is requested to cancel all running and scheduled Spark jobs.

Posting BeginEvent Event — `taskStarted` Method

taskStarted(task: Task[_], taskInfo: TaskInfo)

taskStarted merely posts a BeginEvent event to the DAGScheduler’s Internal Event Bus.

Note	`taskStarted` is executed when a `TaskSetManager` starts a task.

Posting GettingResultEvent Event — `taskGettingResult` Method

taskGettingResult(taskInfo: TaskInfo)

taskGettingResult merely posts a GettingResultEvent event to the DAGScheduler’s Internal Event Bus.

Note	`taskGettingResult` is executed when a `TaskSetManager` gets notified about a task fetching result.

Reporting Task Ended Event (CompletionEvent Event) — `taskEnded` Method

taskEnded(
  task: Task[_],
  reason: TaskEndReason,
  result: Any,
  accumUpdates: Map[Long, Any],
  taskInfo: TaskInfo,
  taskMetrics: TaskMetrics): Unit

taskEnded simply posts a CompletionEvent event to the DAGScheduler’s Internal Event Bus.

Note	`taskEnded` is called when a `TaskSetManager` reports task completions, i.e. success or failure.

Tip	Read TaskMetrics.

Posting TaskSetFailed Event — `taskSetFailed` Method

taskSetFailed(
  taskSet: TaskSet,
  reason: String,
  exception: Option[Throwable]): Unit

taskSetFailed simply posts a TaskSetFailed to DAGScheduler’s Internal Event Bus.

Note	The input arguments of `taskSetFailed` are exactly the arguments of TaskSetFailed.

Note	`taskSetFailed` is executed when a `TaskSetManager` is aborted.

Posting ExecutorLost Event — `executorLost` Method

executorLost(execId: String, reason: ExecutorLossReason): Unit

executorLost simply posts a ExecutorLost event to DAGScheduler’s Internal Event Bus.

Posting ExecutorAdded Event — `executorAdded` Method

executorAdded(execId: String, host: String): Unit

executorAdded simply posts a ExecutorAdded event to DAGScheduler’s Internal Event Bus.

Posting JobCancelled Event — `cancelJob` Method

cancelJob(jobId: Int): Unit

cancelJob prints the following INFO message and posts a JobCancelled to DAGScheduler’s Internal Event Bus.

INFO DAGScheduler: Asked to cancel job [id]

Note	`cancelJob` is called when SparkContext and JobWaiter are requested to cancel a Spark job.

Finding Or Creating Missing Direct Parent ShuffleMapStages (For ShuffleDependencies of Input RDD) — `getOrCreateParentStages` Internal Method

getOrCreateParentStages(rdd: RDD[_], firstJobId: Int): List[Stage]

getOrCreateParentStages finds all direct parent ShuffleDependencies of the input rdd and then finds ShuffleMapStage stages for each ShuffleDependency.

Note	`getOrCreateParentStages` is used when `DAGScheduler` createShuffleMapStage and createResultStage.

Marking Stage Finished — `markStageAsFinished` Internal Method

markStageAsFinished(stage: Stage, errorMessage: Option[String] = None): Unit

Caution

FIXME

`messageScheduler` Single-Thread Executor

Caution

FIXME

Running Job — `runJob` Method

runJob[T, U](
  rdd: RDD[T],
  func: (TaskContext, Iterator[T]) => U,
  partitions: Seq[Int],
  callSite: CallSite,
  resultHandler: (Int, U) => Unit,
  properties: Properties): Unit

runJob submits an action job to the DAGScheduler and waits for a result.

Internally, runJob executes submitJob and then waits until a result comes using JobWaiter.

When the job succeeds, you should see the following INFO message in the logs:

INFO Job [jobId] finished: [callSite], took [time] s

When the job fails, you should see the following INFO message in the logs and the exception (that led to the failure) is thrown.

INFO Job [jobId] failed: [callSite], took [time] s

Note	`runJob` is used when `SparkContext` runs a job.

Finding or Creating New ShuffleMapStages for ShuffleDependency — `getOrCreateShuffleMapStage` Internal Method

getOrCreateShuffleMapStage(
  shuffleDep: ShuffleDependency[_, _, _],
  firstJobId: Int): ShuffleMapStage

getOrCreateShuffleMapStage finds or creates the ShuffleMapStage for the input ShuffleDependency.

Internally, getOrCreateShuffleMapStage finds the ShuffleDependency in shuffleIdToMapStage internal registry and returns one when found.

If no ShuffleDependency was available, getOrCreateShuffleMapStage finds all the missing shuffle dependencies and creates corresponding ShuffleMapStage stages (including one for the input shuffleDep).

Note	All the new `ShuffleMapStage` stages are associated with the input `firstJobId`.

Note	`getOrCreateShuffleMapStage` is used when `DAGScheduler` finds or creates missing direct parent ShuffleMapStages (for ShuffleDependencies of given RDD), getMissingParentStages (for ShuffleDependencies), is notified that `ShuffleDependency` was submitted, and checks if a stage depends on another.

Creating ShuffleMapStage for ShuffleDependency (Copying Shuffle Map Output Locations From Previous Jobs) — `createShuffleMapStage` Method

createShuffleMapStage(
  shuffleDep: ShuffleDependency[_, _, _],
  jobId: Int): ShuffleMapStage

createShuffleMapStage creates a ShuffleMapStage for the input ShuffleDependency and jobId (of a ActiveJob) possibly copying shuffle map output locations from previous jobs to avoid recomputing records.

Note

When a ShuffleMapStage is created, the id is generated (using nextStageId internal counter), rdd is from ShuffleDependency, numTasks is the number of partitions in the RDD, all parents are looked up (and possibly created), the jobId is given, callSite is the creationSite of the RDD, and shuffleDep is the input ShuffleDependency.

Internally, createShuffleMapStage first finds or creates missing parent ShuffleMapStage stages of the associated RDD.

Note	ShuffleDependency is associated with exactly one `RDD[Product2[K, V]]`.

createShuffleMapStage creates a ShuffleMapStage (with the stage id from nextStageId internal counter).

Note	The RDD of the new `ShuffleMapStage` is from the input ShuffleDependency.

createShuffleMapStage registers the ShuffleMapStage in stageIdToStage and shuffleIdToMapStage internal registries.

createShuffleMapStage calls updateJobIdStageIdMaps.

If MapOutputTrackerMaster tracks the input ShuffleDependency (because other jobs have already computed it), createShuffleMapStage requests the serialized ShuffleMapStage outputs, deserializes them and registers with the new ShuffleMapStage.

Note	MapOutputTrackerMaster was defined when `DAGScheduler` was created.

DAGScheduler MapOutputTrackerMaster containsShuffle.png

Figure 5. DAGScheduler Asks MapOutputTrackerMaster Whether Shuffle Map Output Is Already Tracked

If however MapOutputTrackerMaster does not track the input ShuffleDependency, you should see the following INFO message in the logs and createShuffleMapStage registers the ShuffleDependency with MapOutputTrackerMaster.

INFO Registering RDD [id] ([creationSite])

createShuffleMapStage returns the new ShuffleMapStage.

Note	`createShuffleMapStage` is executed only when `DAGScheduler` finds or creates parent `ShuffleMapStage` stages for a `ShuffleDependency`.

Clearing Cache of RDD Block Locations — `clearCacheLocs` Internal Method

clearCacheLocs(): Unit

clearCacheLocs clears the internal registry of the partition locations per RDD.

Note	`DAGScheduler` clears the cache while resubmitting failed stages, and as a result of JobSubmitted, MapStageSubmitted, CompletionEvent, ExecutorLost events.

Finding Missing ShuffleDependencies For RDD — `getMissingAncestorShuffleDependencies` Internal Method

getMissingAncestorShuffleDependencies(rdd: RDD[_]): Stack[ShuffleDependency[_, _, _]]

getMissingAncestorShuffleDependencies finds all missing shuffle dependencies for the given RDD traversing its dependency chain (aka RDD lineage).

Note	A missing shuffle dependency of a RDD is a dependency not registered in `shuffleIdToMapStage` internal registry.

Internally, getMissingAncestorShuffleDependencies finds direct parent shuffle dependencies of the input RDD and collects the ones that are not registered in shuffleIdToMapStage internal registry. It repeats the process for the RDDs of the parent shuffle dependencies.

Note	`getMissingAncestorShuffleDependencies` is used when `DAGScheduler` finds all `ShuffleMapStage` stages for a `ShuffleDependency`.

Finding Direct Parent Shuffle Dependencies of RDD — `getShuffleDependencies` Internal Method

getShuffleDependencies(rdd: RDD[_]): HashSet[ShuffleDependency[_, _, _]]

getShuffleDependencies finds direct parent shuffle dependencies for the given RDD.

spark DAGScheduler getShuffleDependencies.png

Figure 6. getShuffleDependencies Finds Direct Parent ShuffleDependencies (shuffle1 and shuffle2)

Internally, getShuffleDependencies takes the direct shuffle dependencies of the input RDD and direct shuffle dependencies of all the parent non-ShuffleDependencies in the dependency chain (aka RDD lineage).

Note	`getShuffleDependencies` is used when `DAGScheduler` finds or creates missing direct parent ShuffleMapStages (for ShuffleDependencies of given RDD) and finds all missing shuffle dependencies for a given RDD.

Failing Job and Single-Job Stages — `failJobAndIndependentStages` Internal Method

failJobAndIndependentStages(
  job: ActiveJob,
  failureReason: String,
  exception: Option[Throwable] = None): Unit

The internal failJobAndIndependentStages method fails the input job and all the stages that are only used by the job.

Internally, failJobAndIndependentStages uses jobIdToStageIds internal registry to look up the stages registered for the job.

If no stages could be found, you should see the following ERROR message in the logs:

ERROR No stages registered for job [id]

Otherwise, for every stage, failJobAndIndependentStages finds the job ids the stage belongs to.

If no stages could be found or the job is not referenced by the stages, you should see the following ERROR message in the logs:

ERROR Job [id] not registered for stage [id] even though that stage was registered for the job

Only when there is exactly one job registered for the stage and the stage is in RUNNING state (in runningStages internal registry), TaskScheduler is requested to cancel the stage’s tasks and marks the stage finished.

Note	`failJobAndIndependentStages` is called from handleJobCancellation and `abortStage`.

Note	`failJobAndIndependentStages` uses jobIdToStageIds, stageIdToStage, and runningStages internal registries.

Aborting Stage — `abortStage` Internal Method

abortStage(
  failedStage: Stage,
  reason: String,
  exception: Option[Throwable]): Unit

abortStage is an internal method that finds all the active jobs that depend on the failedStage stage and fails them.

Internally, abortStage looks the failedStage stage up in the internal stageIdToStage registry and exits if there the stage was not registered earlier.

If it was, abortStage finds all the active jobs (in the internal activeJobs registry) with the final stage depending on the failedStage stage.

At this time, the completionTime property (of the failed stage’s StageInfo) is assigned to the current time (millis).

All the active jobs that depend on the failed stage (as calculated above) and the stages that do not belong to other jobs (aka independent stages) are failed (with the failure reason being "Job aborted due to stage failure: [reason]" and the input exception).

If there are no jobs depending on the failed stage, you should see the following INFO message in the logs:

INFO Ignoring failure of [failedStage] because all jobs depending on it are done

Note	`abortStage` is used to handle `TaskSetFailed` event, when submitting a stage with no active job

Checking Out Stage Dependency on Given Stage — `stageDependsOn` Method

stageDependsOn(stage: Stage, target: Stage): Boolean

stageDependsOn compares two stages and returns whether the stage depends on target stage (i.e. true) or not (i.e. false).

Note	A stage `A` depends on stage `B` if `B` is among the ancestors of `A`.

Internally, stageDependsOn walks through the graph of RDDs of the input stage. For every RDD in the RDD’s dependencies (using RDD.dependencies) stageDependsOn adds the RDD of a NarrowDependency to a stack of RDDs to visit while for a ShuffleDependency it finds ShuffleMapStage stages for a ShuffleDependency for the dependency and the stage's first job id that it later adds to a stack of RDDs to visit if the map stage is ready, i.e. all the partitions have shuffle outputs.

After all the RDDs of the input stage are visited, stageDependsOn checks if the target's RDD is among the RDDs of the stage, i.e. whether the stage depends on target stage.

dag-scheduler-event-loop — DAGScheduler’s Internal Event Bus

eventProcessLoop is DAGScheduler’s event bus to which Spark (by submitJob) posts jobs to schedule their execution. Later on, TaskSetManager talks back to DAGScheduler to inform about the status of the tasks using the same "communication channel".

It allows Spark to release the current thread when posting happens and let the event loop handle events on a separate thread - asynchronously.

…IMAGE…FIXME

Caution

FIXME statistics? MapOutputStatistics?

Submitting Waiting Child Stages for Execution — `submitWaitingChildStages` Internal Method

submitWaitingChildStages(parent: Stage): Unit

submitWaitingChildStages submits for execution all waiting stages for which the input parent Stage is the direct parent.

Note	Waiting stages are the stages registered in `waitingStages` internal registry.

When executed, you should see the following TRACE messages in the logs:

TRACE DAGScheduler: Checking if any dependencies of [parent] are now runnable
TRACE DAGScheduler: running: [runningStages]
TRACE DAGScheduler: waiting: [waitingStages]
TRACE DAGScheduler: failed: [failedStages]

submitWaitingChildStages finds child stages of the input parent stage, removes them from waitingStages internal registry, and submits one by one sorted by their job ids.

Note	`submitWaitingChildStages` is executed when `DAGScheduler` submits missing tasks for stage and handles successful `ShuffleMapTask` completion.

Submitting Stage or Its Missing Parents for Execution — `submitStage` Internal Method

submitStage(stage: Stage)

submitStage is an internal method that DAGScheduler uses to submit the input stage or its missing parents (if there any stages not computed yet before the input stage could).

Note	`submitStage` is also used to resubmit failed stages.

submitStage recursively submits any missing parents of the stage.

Internally, submitStage first finds the earliest-created job id that needs the stage.

Note	A stage itself tracks the jobs (their ids) it belongs to (using the internal `jobIds` registry).

The following steps depend on whether there is a job or not.

If there are no jobs that require the stage, submitStage aborts it with the reason:

No active job for stage [id]

If however there is a job for the stage, you should see the following DEBUG message in the logs:

DEBUG DAGScheduler: submitStage([stage])

submitStage checks the status of the stage and continues when it was not recorded in waiting, running or failed internal registries. It simply exits otherwise.

With the stage ready for submission, submitStage calculates the list of missing parent stages of the stage (sorted by their job ids). You should see the following DEBUG message in the logs:

DEBUG DAGScheduler: missing: [missing]

When the stage has no parent stages missing, you should see the following INFO message in the logs:

INFO DAGScheduler: Submitting [stage] ([stage.rdd]), which has no missing parents

submitStage submits the stage (with the earliest-created job id) and finishes.

If however there are missing parent stages for the stage, submitStage submits all the parent stages, and the stage is recorded in the internal waitingStages registry.

Note	`submitStage` is executed when `DAGScheduler` submits missing parent map stages (of a stage) recursively or waiting child stages, resubmits failed stages, and handles JobSubmitted, MapStageSubmitted, or CompletionEvent events.

Fault recovery - stage attempts

A single stage can be re-executed in multiple attempts due to fault recovery. The number of attempts is configured (FIXME).

If TaskScheduler reports that a task failed because a map output file from a previous stage was lost, the DAGScheduler resubmits the lost stage. This is detected through a CompletionEvent with FetchFailed, or an ExecutorLost event. DAGScheduler will wait a small amount of time to see whether other nodes or tasks fail, then resubmit TaskSets for any lost stage(s) that compute the missing tasks.

Please note that tasks from the old attempts of a stage could still be running.

A stage object tracks multiple StageInfo objects to pass to Spark listeners or the web UI.

The latest StageInfo for the most recent attempt for a stage is accessible through latestInfo.

Block Location Discovery — `getCacheLocs` Internal Method

Note	RDD blocks matter only after a to avoid recomputing tasks.

DAGScheduler tracks which RDDs are cached (or persisted) to avoid "recomputing" them, i.e. redoing the map side of a shuffle. DAGScheduler remembers what ShuffleMapStages have already produced output files (that are stored in BlockManagers).

DAGScheduler is only interested in cache location coordinates, i.e. host and executor id, per partition of a RDD.

Caution

FIXME: A diagram, please

getCacheLocs(rdd: RDD[_]): IndexedSeq[Seq[TaskLocation]]

getCacheLocs returns TaskLocations for the partitions of the rdd (which correspond to block locations).

When called, getCacheLocs looks the rdd up in the cacheLocs internal registry (of partition locations per RDD).

Note	The size of the collection from `getCacheLocs` is the number of partitions in `rdd` RDD.

If the input rdd could not be found in the cacheLocs registry, getCacheLocs checks storage level of the RDD.

For NONE storage level, the result is a collection of empty locations.

val numPartitions = 2
scala> IndexedSeq.fill(numPartitions)(Nil)  // <-- that is what `getCacheLocs` returns for a unknown 2-partition RDD
res0: IndexedSeq[scala.collection.immutable.Nil.type] = Vector(List(), List())

For non-NONE storage levels, getCacheLocs requests locations from BlockManagerMaster (and maps BlockManagerIds to TaskLocations with the host and executor).

Note	The `BlockManagerMaster` was given when `DAGScheduler` was created.

Note	`getCacheLocs` requests locations from `BlockManagerMaster` using RDDBlockId with the RDD id and the partition indices (which implies that the order of the partitions matters to request proper blocks).

Note	`DAGScheduler` uses `TaskLocation`s (with host and executor) while BlockManagerMaster uses BlockManagerId (to track similar information, i.e. block locations).

Note	`getCacheLocs` is used when `DAGScheduler` calculates missing parent MapStages and getPreferredLocsInternal.

Preferred Locations

DAGScheduler computes where to run each task in a stage based on the preferred locations of its underlying RDDs, or the location of cached or shuffle data.

Adaptive Query Planning / Adaptive Scheduling

See SPARK-9850 Adaptive execution in Spark for the design document. The work is currently in progress.

DAGScheduler.submitMapStage method is used for adaptive query planning, to run map stages and look at statistics about their outputs before submitting downstream stages.

ScheduledExecutorService daemon services

DAGScheduler uses the following ScheduledThreadPoolExecutors (with the policy of removing cancelled tasks from a work queue at time of cancellation):

dag-scheduler-message - a daemon thread pool using j.u.c.ScheduledThreadPoolExecutor with core pool size 1. It is used to post a ResubmitFailedStages event when FetchFailed is reported.

They are created using ThreadUtils.newDaemonSingleThreadScheduledExecutor method that uses Guava DSL to instantiate a ThreadFactory.

Finding Missing Parent ShuffleMapStages For Stage — `getMissingParentStages` Internal Method

getMissingParentStages(stage: Stage): List[Stage]

getMissingParentStages finds missing parent ShuffleMapStages in the dependency graph of the input stage (using the breadth-first search algorithm).

Internally, getMissingParentStages starts with the stage's RDD and walks up the tree of all parent RDDs to find uncached partitions.

Note	A `Stage` tracks the associated RDD using `rdd` property.

Note	An uncached partition of a RDD is a partition that has `Nil` in the internal registry of partition locations per RDD (which results in no RDD blocks in any of the active BlockManagers on executors).

getMissingParentStages traverses the parent dependencies of the RDD and acts according to their type, i.e. ShuffleDependency or NarrowDependency.

Note	ShuffleDependency and NarrowDependency are the main top-level Dependencies.

For each NarrowDependency, getMissingParentStages simply marks the corresponding RDD to visit and moves on to a next dependency of a RDD or works on another unvisited parent RDD.

Note	NarrowDependency is a RDD dependency that allows for pipelined execution.

getMissingParentStages focuses on ShuffleDependency dependencies.

Note	ShuffleDependency is a RDD dependency that represents a dependency on the output of a ShuffleMapStage, i.e. shuffle map stage.

For each ShuffleDependency, getMissingParentStages finds ShuffleMapStage stages. If the ShuffleMapStage is not available, it is added to the set of missing (map) stages.

Note	A `ShuffleMapStage` is available when all its partitions are computed, i.e. results are available (as blocks).

Caution

FIXME…IMAGE with ShuffleDependencies queried

Note	`getMissingParentStages` is used when `DAGScheduler` submits missing parent `ShuffleMapStage`s (of a stage) and handles JobSubmitted and MapStageSubmitted events.

Computing Preferred Locations for Tasks and Partitions — `getPreferredLocs` Method

getPreferredLocs(rdd: RDD[_], partition: Int): Seq[TaskLocation]

Caution

FIXME Review + why does the method return a sequence of TaskLocations?

Note	Task ids correspond to partition ids.

Submitting Missing Tasks for Stage and Job — `submitMissingTasks` Internal Method

submitMissingTasks(stage: Stage, jobId: Int): Unit

submitMissingTasks is an internal method that…FIXME

When executed, submitMissingTasks prints the following DEBUG message out to the logs:

DEBUG DAGScheduler: submitMissingTasks([stage])

The input stage's pendingPartitions internal field is cleared (it is later filled out with the partitions to run tasks for).

The stage is asked for the indices of the partitions to compute (aka missing partitions).

submitMissingTasks adds the stage to runningStages internal registry.

submitMissingTasks informs OutputCommitCoordinator that a stage is started.

Note

The input maxPartitionId argument handed over to OutputCommitCoordinator depends on the type of the stage, i.e. ShuffleMapStage or ResultStage. ShuffleMapStage tracks the number of partitions itself (as numPartitions property) while ResultStage uses the internal RDD to find out the number.

For the missing partitions to compute, submitMissingTasks calculates the locality information associated with the missing partitions of the RDD (for the stage it is computed and partition indicies).

Note	The locality information of a RDD is called preferred locations.

In case of non-fatal exceptions at this time (while getting the locality information), submitMissingTasks creates a new stage attempt.

Note	A stage attempt is an internal property of a stage.

Despite the failure to submit any tasks, submitMissingTasks does announce that at least there was an attempt on LiveListenerBus by posting a SparkListenerStageSubmitted message.

Note	The Spark application’s LiveListenerBus is given when `DAGScheduler` is created.

submitMissingTasks then aborts the stage (with the reason being "Task creation failed" followed by the exception).

The stage is removed from the internal runningStages collection of stages and submitMissingTasks exits.

When no exception was thrown (while computing the locality information for tasks), submitMissingTasks creates a new stage attempt and announces it on LiveListenerBus by posting a SparkListenerStageSubmitted message.

Note	Yes, that is correct. Whether there was a task submission failure or not, `submitMissingTasks` creates a new stage attempt and posts a `SparkListenerStageSubmitted`. That makes sense, doesn’t it?

At that time, submitMissingTasks serializes the RDD (of the stage for which tasks are submitted for) and, depending on the type of the stage, the ShuffleDependency (for ShuffleMapStage) or the function (for ResultStage).

Note	`submitMissingTasks` uses a closure `Serializer` that `DAGScheduler` creates for the entire lifetime when it is created. The closure serializer is available through SparkEnv.

The serialized so-called task binary bytes are "wrapped" as a broadcast variable (to make it available for executors to execute later on).

Note	That exact moment should make clear how important broadcast variables are for Spark itself that you, a Spark developer, can use, too, to distribute data across the nodes in a Spark application in a very efficient way.

Any NotSerializableException exceptions lead to aborting the stage (with the reason being "Task not serializable: [exception]") and removing the stage from the internal runningStages collection of stages. submitMissingTasks exits.

Any non-fatal exceptions lead to aborting the stage (with the reason being "Task serialization failed" followed by the exception) and removing the stage from the internal runningStages collection of stages. submitMissingTasks exits.

With no exceptions along the way, submitMissingTasks computes a collection of tasks to execute for the missing partitions (of the stage).

submitMissingTasks creates a ShuffleMapTask or ResultTask for every missing partition of the stage being ShuffleMapStage or ResultStage, respectively. submitMissingTasks uses the preferred locations (computed earlier) per partition.

Caution

FIXME Image with creating tasks for partitions in the stage.

Any non-fatal exceptions lead to aborting the stage (with the reason being "Task creation failed" followed by the exception) and removing the stage from the internal runningStages collection of stages. submitMissingTasks exits.

If there are tasks to submit for execution (i.e. there are missing partitions in the stage), you should see the following INFO message in the logs:

INFO DAGScheduler: Submitting [size] missing tasks from [stage] ([rdd])

submitMissingTasks records the partitions (of the tasks) in the stage's pendingPartitions property.

Note	`pendingPartitions` property of the `stage` was cleared when `submitMissingTasks` started.

You should see the following DEBUG message in the logs:

DEBUG DAGScheduler: New pending partitions: [pendingPartitions]

submitMissingTasks submits the tasks to TaskScheduler for execution (with the id of the stage, attempt id, the input jobId, and the properties of the ActiveJob with jobId).

Note	A `TaskScheduler` was given when `DAGScheduler` was created.

Caution

FIXME What are the ActiveJob properties for? Where are they used?

submitMissingTasks records the submission time in the stage’s StageInfo and exits.

If however there are no tasks to submit for execution, submitMissingTasks marks the stage as finished (with no errorMessage).

You should see a DEBUG message that varies per the type of the input stage which are:

DEBUG DAGScheduler: Stage [stage] is actually done; (available: [isAvailable],available outputs: [numAvailableOutputs],partitions: [numPartitions])

DEBUG DAGScheduler: Stage [stage] is actually done; (partitions: [numPartitions])

for ShuffleMapStage and ResultStage, respectively.

In the end, with no tasks to submit for execution, submitMissingTasks submits waiting child stages for execution and exits.

Note	`submitMissingTasks` is called when `DAGScheduler` submits a stage for execution.

Stopping DAGScheduler — `stop` Method

stop(): Unit

stop stops the internal dag-scheduler-message thread pool, dag-scheduler-event-loop, and TaskScheduler.

`DAGSchedulerSource` Metrics Source

DAGScheduler uses Spark Metrics System (via DAGSchedulerSource) to report metrics about internal status.

Caution

FIXME What is DAGSchedulerSource?

The name of the source is DAGScheduler.

It emits the following numbers:

stage.failedStages - the number of failed stages
stage.runningStages - the number of running stages
stage.waitingStages - the number of waiting stages
job.allJobs - the number of all jobs
job.activeJobs - the number of active jobs

Updating Accumulators with Partial Values from Completed Tasks — `updateAccumulators` Internal Method

updateAccumulators(event: CompletionEvent): Unit

The private updateAccumulators method merges the partial values of accumulators from a completed task into their "source" accumulators on the driver.

Note	It is called by handleTaskCompletion.

For each AccumulableInfo in the CompletionEvent, a partial value from a task is obtained (from AccumulableInfo.update) and added to the driver’s accumulator (using Accumulable.++= method).

For named accumulators with the update value being a non-zero value, i.e. not Accumulable.zero:

stage.latestInfo.accumulables for the AccumulableInfo.id is set
CompletionEvent.taskInfo.accumulables has a new AccumulableInfo added.

Caution

FIXME Where are Stage.latestInfo.accumulables and CompletionEvent.taskInfo.accumulables used?

Settings

Table 2. Spark Properties
Spark Property	Default Value	Description
`spark.test.noStageRetry`	`false`	When enabled (i.e. `true`), task failures with `FetchFailed` exceptions will not cause stage retries, in order to surface the problem. Used for testing.

DAGScheduler — Stage-Oriented Scheduler

Introduction

Running Approximate Job — runApproximateJob Method

getPreferredLocsInternal Internal Method

createResultStage Internal Method

updateJobIdStageIdMaps Method

Creating DAGScheduler Instance

LiveListenerBus Event Bus for SparkListenerEvents — listenerBus Property

executorHeartbeatReceived Method

Cleaning Up After ActiveJob and Independent Stages — cleanupStateForJobAndIndependentStages Method

Marking ShuffleMapStage Job Finished — markMapStageJobAsFinished Method

Submitting Job — submitJob method

Submitting ShuffleDependency for Execution — submitMapStage Method

Posting StageCancelled Event — cancelStage Method

Posting JobGroupCancelled Event — cancelJobGroup Method

Posting AllJobsCancelled Event — cancelAllJobs Method

Posting BeginEvent Event — taskStarted Method

Posting GettingResultEvent Event — taskGettingResult Method

Reporting Task Ended Event (CompletionEvent Event) — taskEnded Method

Posting TaskSetFailed Event — taskSetFailed Method

Posting ExecutorLost Event — executorLost Method

Posting ExecutorAdded Event — executorAdded Method

Posting JobCancelled Event — cancelJob Method

Finding Or Creating Missing Direct Parent ShuffleMapStages (For ShuffleDependencies of Input RDD) — getOrCreateParentStages Internal Method

Marking Stage Finished — markStageAsFinished Internal Method

messageScheduler Single-Thread Executor

Running Job — runJob Method

Finding or Creating New ShuffleMapStages for ShuffleDependency — getOrCreateShuffleMapStage Internal Method

Creating ShuffleMapStage for ShuffleDependency (Copying Shuffle Map Output Locations From Previous Jobs) — createShuffleMapStage Method

Clearing Cache of RDD Block Locations — clearCacheLocs Internal Method

Finding Missing ShuffleDependencies For RDD — getMissingAncestorShuffleDependencies Internal Method

Finding Direct Parent Shuffle Dependencies of RDD — getShuffleDependencies Internal Method

Failing Job and Single-Job Stages — failJobAndIndependentStages Internal Method

Aborting Stage — abortStage Internal Method

Checking Out Stage Dependency on Given Stage — stageDependsOn Method

dag-scheduler-event-loop — DAGScheduler’s Internal Event Bus

Submitting Waiting Child Stages for Execution — submitWaitingChildStages Internal Method

Submitting Stage or Its Missing Parents for Execution — submitStage Internal Method

Fault recovery - stage attempts

Block Location Discovery — getCacheLocs Internal Method

Preferred Locations

Adaptive Query Planning / Adaptive Scheduling

ScheduledExecutorService daemon services

Finding Missing Parent ShuffleMapStages For Stage — getMissingParentStages Internal Method

Computing Preferred Locations for Tasks and Partitions — getPreferredLocs Method

Submitting Missing Tasks for Stage and Job — submitMissingTasks Internal Method

Stopping DAGScheduler — stop Method

DAGSchedulerSource Metrics Source

Updating Accumulators with Partial Values from Completed Tasks — updateAccumulators Internal Method

Settings

results matching ""

No results matching ""

Running Approximate Job — `runApproximateJob` Method

`getPreferredLocsInternal` Internal Method

`createResultStage` Internal Method

`updateJobIdStageIdMaps` Method

LiveListenerBus Event Bus for SparkListenerEvents — `listenerBus` Property

`executorHeartbeatReceived` Method

Cleaning Up After ActiveJob and Independent Stages — `cleanupStateForJobAndIndependentStages` Method

Marking ShuffleMapStage Job Finished — `markMapStageJobAsFinished` Method

Submitting Job — `submitJob` method

Submitting ShuffleDependency for Execution — `submitMapStage` Method

Posting StageCancelled Event — `cancelStage` Method

Posting JobGroupCancelled Event — `cancelJobGroup` Method

Posting AllJobsCancelled Event — `cancelAllJobs` Method

Posting BeginEvent Event — `taskStarted` Method

Posting GettingResultEvent Event — `taskGettingResult` Method

Reporting Task Ended Event (CompletionEvent Event) — `taskEnded` Method

Posting TaskSetFailed Event — `taskSetFailed` Method

Posting ExecutorLost Event — `executorLost` Method

Posting ExecutorAdded Event — `executorAdded` Method

Posting JobCancelled Event — `cancelJob` Method

Finding Or Creating Missing Direct Parent ShuffleMapStages (For ShuffleDependencies of Input RDD) — `getOrCreateParentStages` Internal Method

Marking Stage Finished — `markStageAsFinished` Internal Method

`messageScheduler` Single-Thread Executor

Running Job — `runJob` Method

Finding or Creating New ShuffleMapStages for ShuffleDependency — `getOrCreateShuffleMapStage` Internal Method

Creating ShuffleMapStage for ShuffleDependency (Copying Shuffle Map Output Locations From Previous Jobs) — `createShuffleMapStage` Method

Clearing Cache of RDD Block Locations — `clearCacheLocs` Internal Method

Finding Missing ShuffleDependencies For RDD — `getMissingAncestorShuffleDependencies` Internal Method

Finding Direct Parent Shuffle Dependencies of RDD — `getShuffleDependencies` Internal Method

Failing Job and Single-Job Stages — `failJobAndIndependentStages` Internal Method

Aborting Stage — `abortStage` Internal Method

Checking Out Stage Dependency on Given Stage — `stageDependsOn` Method

Submitting Waiting Child Stages for Execution — `submitWaitingChildStages` Internal Method

Submitting Stage or Its Missing Parents for Execution — `submitStage` Internal Method

Block Location Discovery — `getCacheLocs` Internal Method

Finding Missing Parent ShuffleMapStages For Stage — `getMissingParentStages` Internal Method

Computing Preferred Locations for Tasks and Partitions — `getPreferredLocs` Method

Submitting Missing Tasks for Stage and Job — `submitMissingTasks` Internal Method

Stopping DAGScheduler — `stop` Method

`DAGSchedulerSource` Metrics Source

Updating Accumulators with Partial Values from Completed Tasks — `updateAccumulators` Internal Method