
Apache Spark
Apache Spark is a unified distributed computing engine that works across different workloads and platforms. Spark can connect to different platforms and process different data workloads using a variety of paradigms such as Spark Streaming, Spark ML, Spark SQL, and Spark GraphX.
Apache Spark is a fast in-memory data processing engine with elegant and expressive development APIs that allow data workers to efficiently execute streaming, machine learning, or SQL workloads requiring fast, interactive access to datasets.
Additional libraries built on top of the core enable workloads for streaming, SQL, graph processing, and machine learning. Spark ML, for instance, is designed for data science, and its abstractions make data science easier.
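To give a feel for that abstraction, the following is a minimal sketch (not from this text) of training a logistic regression model with the DataFrame-based Spark ML API. The tiny hand-made dataset, the object name, and the local master setting are purely illustrative:

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object SparkMlSketch {
  def main(args: Array[String]): Unit = {
    // Local-mode session purely for illustration
    val spark = SparkSession.builder()
      .appName("SparkMlSketch")
      .master("local[*]")
      .getOrCreate()

    // Tiny hand-made training set: (label, features)
    val training = spark.createDataFrame(Seq(
      (1.0, Vectors.dense(0.0, 1.1, 0.1)),
      (0.0, Vectors.dense(2.0, 1.0, -1.0)),
      (0.0, Vectors.dense(2.0, 1.3, 1.0)),
      (1.0, Vectors.dense(0.0, 1.2, -0.5))
    )).toDF("label", "features")

    // One estimator object captures the whole learning algorithm
    val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)
    val model = lr.fit(training)
    println(s"Coefficients: ${model.coefficients}")

    spark.stop()
  }
}
```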
Spark provides real-time streaming, queries, machine learning, and graph processing. Before Apache Spark, we had to use different technologies for different types of workloads: one for batch analytics, one for interactive queries, one for real-time stream processing, and yet another for machine learning algorithms. Apache Spark can handle all of these within a single engine, instead of relying on multiple technologies that are not always well integrated.
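To make the single-engine point concrete, here is a minimal sketch of batch-loading a file and then querying the same data interactively with SQL in one program. The file path, column names, and object name below are assumptions for illustration only:

```scala
import org.apache.spark.sql.SparkSession

object UnifiedWorkloadsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("UnifiedWorkloadsSketch")
      .master("local[*]")   // local mode for illustration
      .getOrCreate()

    // Batch: load a JSON file into a DataFrame (hypothetical path)
    val people = spark.read.json("/data/people.json")

    // Interactive query: the same data, queried with plain SQL
    people.createOrReplaceTempView("people")
    val adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
    adults.show()

    spark.stop()
  }
}
```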
Using Apache Spark, all of these types of workloads can be processed, and Spark supports Scala, Java, R, and Python for writing client programs.
Apache Spark is an open source distributed computing engine that has key advantages over the MapReduce paradigm:
- Uses in-memory processing as much as possible
- General-purpose engine for both batch and real-time workloads
- Compatible with YARN as well as Mesos
- Integrates well with HBase, Cassandra, MongoDB, HDFS, Amazon S3, and other filesystems and data sources (see the sketch after this list)
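As a sketch of that integration, the core DataFrame reader can pull from HDFS or Amazon S3 with nothing more than a path scheme. The paths, host, and bucket below are hypothetical, reading from S3 additionally assumes the hadoop-aws connector is on the classpath, and stores such as HBase, Cassandra, and MongoDB use separately packaged connectors:

```scala
import org.apache.spark.sql.SparkSession

object DataSourcesSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DataSourcesSketch")
      .getOrCreate()

    // HDFS: read a CSV file by URI (hypothetical path)
    val fromHdfs = spark.read
      .option("header", "true")
      .csv("hdfs://namenode:8020/data/events.csv")

    // Amazon S3: read Parquet via the s3a connector (hypothetical bucket)
    val fromS3 = spark.read.parquet("s3a://my-bucket/warehouse/events/")

    // Both arrive as DataFrames, so the downstream code is identical
    println(fromHdfs.count() + fromS3.count())

    spark.stop()
  }
}
```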
Spark was created at UC Berkeley back in 2009 and grew out of the project to build Mesos, a cluster management framework intended to support different kinds of cluster computing systems.
Hadoop and Apache Spark are both popular big data frameworks, but they don't really serve the same purposes. Hadoop provides distributed storage plus the MapReduce distributed computing framework, whereas Spark is a data processing framework that operates on distributed data storage provided by other technologies.
Spark is generally a lot faster than MapReduce because of the way it processes data. MapReduce operates on splits using disk operations, while Spark operates on the dataset much more efficiently; the main reason behind Apache Spark's performance improvement is its efficient off-heap, in-memory processing rather than sole reliance on disk-based computations.
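A minimal sketch of what in-memory processing buys you: caching a dataset so that repeated actions reuse memory-resident partitions instead of rescanning the input from disk each time. The file path and object name are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

object CachingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CachingSketch")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical input; each action below would otherwise rescan it from disk
    val logs = spark.read.textFile("/data/server.log")

    // Mark the dataset for in-memory storage; materialized on the first action
    logs.cache()

    // First action reads from disk and populates the cache...
    val errors = logs.filter(_.contains("ERROR")).count()
    // ...subsequent actions are served from memory
    val warnings = logs.filter(_.contains("WARN")).count()

    println(s"errors=$errors warnings=$warnings")
    spark.stop()
  }
}
```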
MapReduce's processing style can be sufficient if your data operations and reporting requirements are mostly static and batch processing fits your purposes. But if you need to run analytics on streaming data, or your processing requires multistage logic, you probably want to go with Spark.
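For the streaming case, here is a minimal Structured Streaming sketch along the lines of the standard word-count example: reading lines from a local socket and maintaining running counts as data arrives. The socket source, host, and port are illustrative choices, not a prescription:

```scala
import org.apache.spark.sql.SparkSession

object StreamingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("StreamingSketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Unbounded input: lines arriving on a local socket (illustrative source)
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // The same DataFrame operations as in batch code, applied to a stream
    val counts = lines.as[String]
      .flatMap(_.split(" "))
      .groupBy("value")
      .count()

    // Continuously print updated counts as new data arrives
    val query = counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()
  }
}
```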
The following is the Apache Spark stack:
