A very complete introduction to the internals of Apache Spark. Highly recommended!
It is a full-day workshop (the video is almost 6 hours long), so you can use the checkpoints below to jump to the section you are interested in. The section I find most interesting reveals how they won the 2014 100 TB sorting challenge: watch from 4:49:00, Next Gen Shuffle.
YouTube: Advanced Apache Spark Training - Sameer Farooqui (Databricks)
Slides: Devops Advanced Class
Agenda and checkpoints:
- 1:30 Agenda
- 5:14 History of Spark
- 27:40 RDD fundamentals
- 1:20:23 Spark Runtime architecture and resource managers
- 2:49:24 Memory and Persistence
- 3:15:30 Serialization
- 3:19:50 Staging
- 3:42:00 Shuffle
- 3:55:00 Broadcast and accumulators
- 4:31:25 PySpark
- 4:49:00 Next Gen Shuffle
- 5:32:00 Spark Streaming
Extra video:
A Deeper Understanding of Spark Internals - Aaron Davidson (Databricks)