Apache Spark is a fast, in-memory data processing engine with an elegant development API that lets data workers efficiently execute algorithms that require iterative access to datasets, such as machine learning algorithms. Spark on YARN enables deep integration with Hadoop and other YARN-enabled workloads in the enterprise.
The posts below explore the basic concepts of Apache Spark and the first few steps needed to get started.
Table of Contents
- Introduction
- Configuring Hortonworks Sandbox on Azure
- Installing Apache Spark 1.3.1 on HDP 2.2.4.2
- Installing Apache Spark 1.2.0 on HDP 2.2
- Basics of programming Apache Spark
- A short primer on Scala
- Exploring Spark with Scala
- Using Hive and ORC with Apache Spark
- Installing and configuring Zeppelin
- Using IPython Notebook with Apache Spark
Saptak Sen
If you enjoyed this post, you should check out my book: Starting with Spark.