
A First Course in Spark

Introduction

  • Extends the MapReduce computing model
  • Performs computation in memory
  • Handles batch processing, iterative computation, interactive queries, and stream processing
    • Covering all of these workloads with one engine lowers maintenance costs
  • Provides APIs in Python, Java, Scala, and SQL, plus a rich set of built-in libraries
  • Integrates with Hadoop, Kafka, and other systems

Components

(Figure: the Spark stack — Spark SQL, Spark Streaming, MLlib, and GraphX built on top of Spark Core)

Spark Core

Spark Core contains the basic functionality of Spark, including components for task scheduling, memory management, fault recovery, interacting with storage systems, and more. Spark Core is also home to the API that defines resilient distributed datasets (RDDs), which are Spark’s main programming abstraction. RDDs represent a collection of items distributed across many compute nodes that can be manipulated in parallel. Spark Core provides many APIs for building and manipulating these collections.
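As a taste of the RDD API, here is a minimal sketch, not a definitive recipe: it runs Spark in local mode (names such as rdd-demo are illustrative), distributes a small local collection, and transforms it in parallel.

import org.apache.spark.{SparkConf, SparkContext}

// Minimal sketch: square a list of numbers in parallel (local mode).
val conf = new SparkConf().setAppName("rdd-demo").setMaster("local[*]")
val sc = new SparkContext(conf)

val nums = sc.parallelize(1 to 5)           // distribute a local collection as an RDD
val squares = nums.map(n => n * n)          // transformation: lazy, evaluated in parallel
println(squares.collect().mkString(", "))  // action: gathers results to the driver

sc.stop()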

Spark SQL

Spark SQL is Spark’s package for working with structured data. It allows querying data via SQL as well as the Apache Hive variant of SQL—called the Hive Query Language (HQL)—and it supports many sources of data, including Hive tables, Parquet, and JSON. Beyond providing a SQL interface to Spark, Spark SQL allows developers to intermix SQL queries with the programmatic data manipulations supported by RDDs in Python, Java, and Scala, all within a single application, thus combining SQL with complex analytics. This tight integration with the rich computing environment provided by Spark makes Spark SQL unlike any other open source data warehouse tool. Spark SQL was added to Spark in version 1.0.
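A minimal sketch of this (assuming Spark 2.x, the spark-sql artifact on the classpath, and a hypothetical input file people.json): load structured data, register it as a view, and query it with SQL.

import org.apache.spark.sql.SparkSession

// Sketch: load JSON, expose it to SQL, run a query.
val spark = SparkSession.builder().appName("sql-demo").master("local[*]").getOrCreate()

val people = spark.read.json("people.json")  // people.json is a hypothetical input
people.createOrReplaceTempView("people")     // make it queryable by name

spark.sql("SELECT name FROM people WHERE age >= 18").show()

spark.stop()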

Shark was an older SQL-on-Spark project out of the University of California, Berkeley, that modified Apache Hive to run on Spark. It has now been replaced by Spark SQL to provide better integration with the Spark engine and language APIs.

Spark Streaming

Spark Streaming is a Spark component that enables processing of live streams of data. Examples of data streams include logfiles generated by production web servers, or queues of messages containing status updates posted by users of a web service. Spark Streaming provides an API for manipulating data streams that closely matches the Spark Core’s RDD API, making it easy for programmers to learn the project and move between applications that manipulate data stored in memory, on disk, or arriving in real time. Underneath its API, Spark Streaming was designed to provide the same degree of fault tolerance, throughput, and scalability as Spark Core.
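A minimal sketch of the DStream API (assuming the spark-streaming artifact; the socket source can be fed with e.g. nc -lk 9999, and at least two local threads are needed so the receiver does not starve the processing):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Sketch: running word counts over text arriving on a local socket.
val conf = new SparkConf().setAppName("stream-demo").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(5))  // 5-second micro-batches

val lines = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()          // print each batch's counts

ssc.start()             // start receiving and processing
ssc.awaitTermination()  // block until stopped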

MLlib

Spark comes with a library containing common machine learning (ML) functionality, called MLlib. MLlib provides multiple types of machine learning algorithms, including classification, regression, clustering, and collaborative filtering, as well as supporting functionality such as model evaluation and data import. It also provides some lower-level ML primitives, including a generic gradient descent optimization algorithm. All of these methods are designed to scale out across a cluster.
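As a minimal sketch (assuming the spark-mllib artifact and an existing SparkContext sc, e.g. in the shell), here is k-means clustering over four 2-D points:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Sketch: cluster four 2-D points into two groups.
val points = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(1.0, 1.0),  // one cluster near the origin
  Vectors.dense(9.0, 8.0), Vectors.dense(8.0, 9.0)   // another far away
))
val model = KMeans.train(points, 2, 10)  // k = 2 clusters, at most 10 iterations
model.clusterCenters.foreach(println)    // roughly (0.5, 0.5) and (8.5, 8.5)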

GraphX

GraphX is a library for manipulating graphs (e.g., a social network’s friend graph) and performing graph-parallel computations. Like Spark Streaming and Spark SQL, GraphX extends the Spark RDD API, allowing us to create a directed graph with arbitrary properties attached to each vertex and edge. GraphX also provides various operators for manipulating graphs (e.g., subgraph and mapVertices) and a library of common graph algorithms (e.g., PageRank and triangle counting).
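A minimal sketch (assuming the spark-graphx artifact and an existing sc): build a tiny directed follow graph and run PageRank on it.

import org.apache.spark.graphx.{Edge, Graph}

// Sketch: three vertices with string properties, three directed edges.
val vertices = sc.parallelize(Seq((1L, "Alice"), (2L, "Bob"), (3L, "Carol")))
val edges = sc.parallelize(Seq(
  Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows"), Edge(3L, 1L, "follows")
))

val graph = Graph(vertices, edges)
graph.pageRank(0.001).vertices.collect().foreach(println)  // rank per vertex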

Cluster Managers

Under the hood, Spark is designed to efficiently scale up from one to many thousands of compute nodes. To achieve this while maximizing flexibility, Spark can run over a variety of cluster managers, including Hadoop YARN, Apache Mesos, and a simple cluster manager included in Spark itself called the Standalone Scheduler. If you are just installing Spark on an empty set of machines, the Standalone Scheduler provides an easy way to get started; if you already have a Hadoop YARN or Mesos cluster, however, Spark’s support for these cluster managers allows your applications to also run on them.
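For illustration only, the choice of cluster manager is expressed through spark-submit's --master flag; host names and app.jar below are placeholders:

bin/spark-submit --master local[*]          app.jar   # single machine, all cores
bin/spark-submit --master spark://host:7077 app.jar   # Standalone Scheduler
bin/spark-submit --master yarn              app.jar   # Hadoop YARN
bin/spark-submit --master mesos://host:5050 app.jar   # Apache Mesos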

Installation

  • Spark is written in Scala and runs on the JVM; Java 7+ is required
  • The Python API requires Python 2.6+ or Python 3.4+
  • Match Scala versions: Spark 1.6.2 pairs with Scala 2.10, Spark 2.0.0 with Scala 2.11

Download

http://spark.apache.org/downloads.html

A Hadoop cluster is not required; if you already have one, download the build that matches its Hadoop version.

Unpacking
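For example, assuming you downloaded the Spark 2.1.0 build for Hadoop 2.7:

tar -xzf spark-2.1.0-bin-hadoop2.7.tgz
cd spark-2.1.0-bin-hadoop2.7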

Directory Layout

  • README.md
    • Contains short instructions for getting started with Spark.
  • bin
    • Contains executable files that can be used to interact with Spark in various ways (e.g., the Spark shell, which we will cover later in this chapter).
  • core, streaming, python, …
    • Contains the source code of major components of the Spark project.
  • examples
    • Contains some helpful Spark standalone jobs that you can look at and run to learn about the Spark API.

Shell

  • Python Shell
    • bin/pyspark
  • Scala Shell
    • bin/spark-shell
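For a quick smoke test, start bin/spark-shell from the Spark directory (so that README.md resolves) and poke at the file; sc is the SparkContext the shell creates for you:

scala> val lines = sc.textFile("README.md")  // RDD of the file's lines
scala> lines.count()                         // number of lines
scala> lines.first()                         // first line of the file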

Development Environment Setup

Installing Scala

https://www.scala-lang.org/download/

Make sure the Scala version matches the Spark version you plan to use.

Installing IntelliJ IDEA

https://www.jetbrains.com/idea/#chooseYourEdition

You can apply for an education account.

Installing the Scala Plugin

File-Settings-Plugins, search for Scala, and install the plugin.

Creating a Project

File-New-Project-Scala-SBT

Again, make sure the versions match (Spark 2.1.0 and Scala 2.11.11 are used here).

Build Configuration

The Spark version to compile against must be declared. Append the following to build.sbt:

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.1.0"
)

Then rebuild (refresh) the project. The %% operator appends the Scala binary version, so sbt resolves the artifact spark-core_2.11.

Writing the Source

New-Scala Class, then change Kind from Class to Object

WordCount.scala

import org.apache.spark.{SparkContext, SparkConf}

/**
 * Created by root on 4/23/17.
 */
object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("wordcount")
    val sc = new SparkContext(conf)
    // Any local text file works as input; this path is specific to the demo machine.
    val input = sc.textFile("/home/hduser/Anaconda2-4.3.1-Linux-x86_64.sh")
    val words = input.flatMap(line => line.split(" "))
    val counts = words.map(word => (word, 1)).reduceByKey { case (x, y) => x + y }
    // saveAsTextFile returns Unit, so there is no value to bind.
    counts.saveAsTextFile("/home/hduser/scala_wordcount_demo_output")
    sc.stop()
  }
}

Packaging

File-Project Structure-Project Setting-Artifacts-Add-JAR-From modules with dependencies

Build-Build Artifacts-Build

Starting the Cluster

  • Start the master
    • sbin/start-master.sh
  • Start a worker
    • bin/spark-class org.apache.spark.deploy.worker.Worker spark://Ubuntu:7077
    • The master URL (spark://Ubuntu:7077 here) is shown in the web UI at localhost:8080
  • Submit the job
    • bin/spark-submit --master spark://Ubuntu:7077 --class WordCount /home/hduser/scala_demo.jar
    • scala_demo.jar is the JAR produced in the packaging step above
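If the job succeeds, saveAsTextFile writes the result as part-NNNNN files inside the output directory given in WordCount.scala, for example:

cat /home/hduser/scala_wordcount_demo_output/part-*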

TODO: RDDs

See Zaharia, M., et al., Learning Spark (O'Reilly, 2015).

References

  1. 慕课网 (imooc): http://www.imooc.com/learn/814
  2. Zaharia, M., et al. Learning Spark. O'Reilly Media, 2015.