[Rough paper review, for my own reference] Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing

2020. 8. 5. 13:56 | Computer Architecture/Cluster

Authors: Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica

 

[Abstract]

RDDs are motivated by two types of applications that current computing frameworks handle inefficiently: iterative algorithms and interactive data mining tools.

To achieve fault tolerance efficiently, RDDs provide a restricted form of shared memory, based on coarse-grained transformations rather than fine-grained updates to shared state.

We have implemented RDDs in a system called Spark, which we evaluate through a variety of user applications and benchmarks.

 

[1. Introduction]

Cluster computing frameworks let users write parallel computations using a set of high-level operators, without having to worry about work distribution and fault tolerance.

 

*Fault tolerance: the ability of a system to keep operating, fully or partially, even when faults occur in some of its components.

 

Current frameworks lack abstractions for leveraging distributed memory.

This makes them inefficient for an important class of emerging applications: those that reuse intermediate results across multiple computations.

 

Data reuse is common in many iterative machine learning and graph algorithms.

Another compelling use case is interactive data mining, where a user runs multiple ad-hoc queries on the same subset of the data.

 

Unfortunately, in most current frameworks, the only way to reuse data between computations (e.g., between two MapReduce jobs) is to write it to an external stable storage system, e.g., a distributed file system.

This incurs substantial overheads due to data replication, disk I/O, and serialization, which can dominate application execution times.

 

In this paper, we propose a new abstraction called resilient distributed datasets (RDDs) that enables efficient data reuse in a broad range of applications.

 

RDDs are fault-tolerant, parallel data structures that let users explicitly persist intermediate results in memory, control their partitioning to optimize data placement, and manipulate them using a rich set of operators.

 

The main challenge in designing RDDs is defining a programming interface that can provide fault tolerance efficiently.

 

With an interface based on fine-grained updates to shared state (as in distributed shared memory), the only ways to provide fault tolerance are to replicate the data across machines or to log updates across machines.

RDDs provide an interface based on coarse-grained transformations (e.g., map, filter and join) that apply the same operation to many data items. This allows them to efficiently provide fault tolerance by logging the transformations used to build a dataset (its lineage) rather than the actual data.

If a partition of an RDD is lost, the RDD has enough information about how it was derived from other RDDs to recompute just that partition. Thus, lost data can be recovered, often quite quickly, without requiring costly replication.

 

Although an interface based on coarse-grained transformations may at first seem limited, RDDs are a good fit for many parallel applications, because these applications naturally apply the same operation to multiple data items.

 

Spark (the system in which RDDs are implemented) is being used for research and production applications at UC Berkeley and several companies.

Spark provides a convenient language-integrated programming interface, similar to DryadLINQ, in the Scala programming language.

In addition, Spark can be used interactively to query big datasets from the Scala interpreter.

We believe that Spark is the first system that allows a general-purpose programming language to be used at interactive speeds for in-memory data mining on clusters.

 

[2. Resilient Distributed Datasets (RDDs)]

 

This section provides an overview of RDDs and their programming interface in Spark.

 

[2.1 RDD Abstraction]

Formally, an RDD is a read-only, partitioned collection of records.

RDDs can only be created through deterministic operations on either (1) data in stable storage or (2) other RDDs.

We call these operations transformations (the operations that "create" RDDs) to differentiate them from other operations on RDDs.

Examples of transformations include map, filter, and join.

RDDs do not need to be materialized at all times.

Instead, an RDD has enough information about how it was derived from other datasets (its lineage) to compute its partitions from data in stable storage.

This is a powerful property: in essence, a program cannot reference an RDD that it cannot reconstruct after a failure.

 

Finally, users can control two other aspects of RDDs: persistence and partitioning.

Users can indicate which RDDs they will reuse and choose a storage strategy for them (e.g., in-memory storage).

They can also ask that an RDD's elements be partitioned across machines based on a key in each record. This is useful for placement optimizations, such as ensuring that two datasets that will be joined together are hash-partitioned in the same way.

 

[2.2 Spark Programming Interface]

Spark exposes RDDs through a language-integrated API similar to DryadLINQ and FlumeJava, where each dataset is represented as an object and transformations are invoked using methods on these objects.

Programmers start by defining one or more RDDs through transformations on data in stable storage (e.g., map and filter).

They can then use these RDDs in actions, which are operations that return a value to the application or export data to a storage system.

Examples of actions: count (which returns the number of elements in the dataset), collect (which returns the elements themselves), and save (which outputs the dataset to a storage system).

Spark keeps persistent RDDs in memory by default, but it can spill them to disk if there is not enough RAM.

Users can also request other persistence strategies, such as storing the RDD only on disk or replicating it across machines, through flags to persist.

 

**HDFS: Hadoop Distributed File System

The Hadoop distributed file system, a distributed processing system that can serve as a replacement for the Google File System.

**Apache Hadoop: a free Java software framework that supports distributed applications running on large computer clusters capable of processing massive amounts of data.

**HDFS: a distributed, scalable file system written in Java for the Hadoop framework. HDFS stores large files split across multiple machines and achieves data reliability by replicating the data across multiple servers, so the hosts do not need RAID storage.

Hadoop is used by a wide range of companies; Yahoo and Facebook are known to use it heavily in many parts of their services, and data analytics companies build their platforms on top of it.

 

 

[2.2.1 Example: Console Log Mining]

A web service is producing errors → the operator wants to search terabytes of logs in the Hadoop filesystem (HDFS) to find the cause → using Spark, the operator can load just the error messages from the logs into RAM across a set of nodes and query them interactively → code

 

Image from this thesis paper.

 

** What is Lineage Graph in Spark?

The RDDs in Spark depend upon one or more other RDDs.

RDD Lineage (RDD operator graph or RDD dependency graph) actually is a graph of all the parent RDDs of an RDD.

 

**RDDs are immutable, distributed collections of elements of your data that can be stored in memory or on disk across a cluster of machines. The data is partitioned across the machines in your cluster and can be operated on in parallel with a low-level API that offers transformations and actions. RDDs are fault tolerant because they track lineage information, allowing lost data to be rebuilt automatically on failure.

 

**Apache Spark is an open-source cluster computing framework.

Image from this thesis paper.
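A reconstruction of the code shown in the image (following the paper's example; spark denotes the SparkContext, as in the paper's listing, and the hdfs:// path is a placeholder):

// Reconstruction of the three lines described below; in a modern spark-shell,
// the SparkContext is usually named sc rather than spark.
val lines  = spark.textFile("hdfs://...")             // line 1: RDD backed by an HDFS file
val errors = lines.filter(_.startsWith("ERROR"))      // line 2: filtered RDD of error lines
errors.persist()                                      // line 3: keep the error lines in RAM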

Line 1 defines an RDD backed by an HDFS file (as a collection of lines of text),

while line 2 derives a filtered RDD from it.

Line 3 then asks for errors to persist in memory so that it can be shared across queries.

Note that the argument to filter is Scala syntax for a closure.

At this point, no work has been performed on the cluster.

However, the user can now use the RDD in actions, e.g., to count the number of messages:

 

errors.count()

 

Spark will store the partitions of errors in memory, greatly speeding up subsequent computations on it.

Note that the base RDD, lines, is not loaded into RAM.

This is desirable because the error messages might only be a small fraction of the data (small enough to fit into memory).

 

** Spark RDD concepts and examples (Korean blog post):

https://kkn1220.tistory.com/126

 


 

Image from this thesis paper.

Finally, to illustrate how our model achieves fault tolerance, we show the lineage graph for the RDDs in our third query in Figure 1. In this query, we started with errors, the result of a filter on lines, and applied a further filter and map before running a collect. The Spark scheduler will pipeline the latter two transformations and send a set of tasks to compute them to the nodes holding the cached partitions of errors. In addition, if a partition of errors is lost, Spark rebuilds it by applying a filter on only the corresponding partition of lines.
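The third query itself, reconstructed from the paper's example and continuing the errors RDD defined above (the tab delimiter and field number 3 follow the paper's listing):

// Return the time fields of errors mentioning "HDFS", assuming time is field
// number 3 in a tab-separated format.
errors.filter(_.contains("HDFS"))
      .map(_.split('\t')(3))
      .collect()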

 

Image from this thesis paper.

** Spark reference material (Korean slides):

https://www.slideshare.net/KangDognhyun/apache-spark-70360736

 


- Fine-grained update support: e.g., similar to updating only a specific element of a table in a DBMS.

- Fault tolerance hurts performance: to guard against losing datasets held in RAM you must replicate them, computation has to stop while replicating, and checkpointing takes even longer.

- Data processing used to be done with Hadoop (disk-based) --> fine-grained updates are not supported.

  : Hadoop only supports file-to-file transformations, writing the changed data as a new file; files cannot be modified in place. This is similar to a DBMS where UPDATE cannot be used and a new table has to be created with a "create table as (select ...)" query.

 

** https://www.slideshare.net/yongho/rdd-paper-review

 

"What is the core of Spark? RDD! (RDD paper review)" (Korean slides)

Spark = RDD + Interface

RDDs (Resilient Distributed Datasets) are what you get when even RAM is used in a read-only fashion!

An RDD is never modified after it is created (immutable = read-only).

So if we just record how it was made, we can make it again --> recording only the lineage (how it was derived from its parents) is enough for fault tolerance.

While the transformations describing how the data will change are being defined, no actual computation takes place.

 

 

**GFS:

https://zzsza.github.io/data/2018/05/26/bigdata-preprocessing-skills/

 


** GFS: https://swalloow.github.io/map-reduce/

 


[2.3 Advantages of the RDD Model]

 

Image from this thesis paper.

To understand the benefits of RDDs as a distributed memory abstraction, we compare them against distributed shared memory (DSM) in Table 1.

In DSM systems, applications read and write to arbitrary locations in a global address space.

DSM is a very general abstraction, but this generality makes it harder to implement in an efficient and fault-tolerant manner on commodity clusters.

 

** coarse-grained vs fine-grained

https://icthuman.tistory.com/entry/FineGrained-vs-CoarseGrained

 


The distinction is based on how finely a single process is broken up.

Fine-grained) Work is broken into small pieces: the steps are split into many parts, each of which can be implemented as its own method, and methods can be grouped into a module. The advantage is flexible development; useful in flexible systems.

Coarse-grained) Work is done in large chunks: generally not the preferred approach in enterprise application design, but useful in distributed systems. Since parallel processing for big data (Spark, Akka, etc.) may be needed, a shift toward the coarse-grained approach is required.

transformation (e.g., map: m(a) --> b)

 

List of RDD benefits:

1.

The main difference between RDDs and DSM is that RDDs can only be created ("written") through coarse-grained transformations, while DSM allows reads and writes to each memory location.

This restricts RDDs to applications that perform bulk writes, but allows for more efficient fault tolerance.

2.

Furthermore, only the lost partitions of an RDD need to be recomputed upon failure, and they can be recomputed in parallel on different nodes, without having to roll back the whole program.

3.

A benefit of RDDs is that their immutable nature lets a system mitigate slow nodes (stragglers) by running backup copies of slow tasks as in MapReduce.

Backup tasks would be hard to implement with DSM, as the two copies of a task would access the same memory locations and interfere with each other's updates.

4.

RDDs provide two other benefits over DSM. 

First, in bulk operations on RDDs, a runtime can schedule tasks based on data locality to improve performance.

Second, RDDs degrade gracefully when there is not enough memory to store them, as long as they are only being used in scan-based operations.

Partitions that do not fit in RAM can be stored on disk and will provide similar performance to current data-parallel systems.

 

[2.4 Applications Not Suitable for RDDs]

As discussed in the Introduction, RDDs are best suited for batch applications that apply the same operation to all elements of a dataset.

In these cases, RDDs can efficiently remember each transformation as one step in a lineage graph and can recover lost partitions without having to log large amounts of data.

 

RDDs would be less suitable for applications that make asynchronous fine-grained updates to shared state, such as a storage system for a web application or an incremental web crawler.

Our goal is to provide an efficient programming model for batch analytics and leave these asynchronous applications to specialized systems.

 

[3. Spark Programming Interface]

Spark provides the RDD abstraction through a language-integrated API similar to DryadLINQ, in Scala, a statically typed functional programming language for the Java VM.

We chose Scala due to its combination of conciseness (which is convenient for interactive use) and efficiency (due to static typing). However, nothing about the RDD abstraction requires a functional language.

 

To use Spark, developers write a driver program that connects to a cluster of workers, as shown in Figure 2.

Image from this thesis paper.

The driver defines one or more RDDs and invokes actions on them.

Spark code on the driver also tracks the RDDs' lineage.

The workers are long-lived processes that can store RDD partitions in RAM across operations.

As we showed in the log mining example in Section 2.2.1, users provide arguments to RDD operations like map by passing closures (function literals).

 

Scala represents each closure as a Java object, and these objects can be serialized and loaded on another node to pass the closure across the network.

Scala also saves any variables bound in the closure as fields in the Java object.

For example, one can write code like var x = 5; rdd.map(_+x) to add 5 to each element of an RDD.
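As an illustration, a minimal closure-capture sketch, assuming sc is the SparkContext provided by spark-shell and the data is made up:

val rdd = sc.parallelize(Seq(1, 2, 3, 4))
var x = 5
val shifted = rdd.map(_ + x)                 // the closure captures x and is shipped to the workers
println(shifted.collect().mkString(", "))    // prints: 6, 7, 8, 9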

RDDs themselves are statically typed objects parametrized by an element type. 

For example, RDD[Int] is an RDD of integers.

However, most of our examples omit types since Scala supports type inference.

Although our method of exposing RDDs in Scala is conceptually simple, we had to work around issues with Scala's closure objects using reflection.

Nonetheless, we did not have to modify the Scala compiler.

 

[3.1 RDD Operations in Spark]

Image from this thesis paper.

Table 2 lists the main RDD transformations and actions available in Spark.

We give the signature of each operation, showing type parameters in square brackets.

 

transformations: lazy operations that define a new RDD.

actions: launch a computation to return a value to the program or write data to external storage.

join (and similar key-based operations): only available on RDDs of key-value pairs.

 

In addition to these operators, users can ask for an RDD to persist. 

Furthermore, users can get an RDD's partition order, which is represented by a Partitioner class, and partition another dataset according to it.

Operations such as groupByKey, reduceByKey and sort automatically result in a hash or range partitioned RDD.
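For illustration, a small sketch (with made-up data) showing that reduceByKey yields a partitioned RDD whose Partitioner can be inspected and reused; sc is an assumed SparkContext:

val pairs  = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 2)))
val counts = pairs.reduceByKey(_ + _)        // lazy: defines a new, hash-partitioned RDD
println(counts.partitioner)                  // typically Some(HashPartitioner)
println(counts.collect().toSeq)              // e.g., Seq((a,3), (b,1)) (order may vary)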

 

[3.2 Example Applications]

We complement the data mining example in Section 2.2.1 with two iterative applications: logistic regression and PageRank.

The latter also showcases how control of RDDs' partitioning can improve performance.

 

[3.2.1 Logistic Regression]

Ex) machine learning - a common classification algorithm

Image from this thesis paper.

We start by defining a persistent RDD called "points" as the result of a map transformation on a text file that parses each line of text into a Point object.

We then repeatedly run map and reduce on points to compute the gradient at each step by summing a function of the current w.

Keeping points in memory across iterations can yield a 20X speedup, as we show in Section 6.1.
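A sketch of the logistic regression example, reconstructed under assumptions: the Point class, the parsePoint input format, D, and ITERATIONS are not spelled out here and are hypothetical; sc is an assumed SparkContext.

case class Point(x: Array[Double], y: Double)   // feature vector and an assumed +/-1 label

val D = 10                                      // assumed feature dimension
val ITERATIONS = 10                             // assumed iteration count

def parsePoint(line: String): Point = {
  val tok = line.split(' ').map(_.toDouble)     // assumed "label f1 f2 ..." format
  Point(tok.drop(1), tok(0))
}

val points = sc.textFile("hdfs://...")          // placeholder path
               .map(parsePoint)
               .persist()                       // keep parsed points in memory across iterations

var w = Array.fill(D)(scala.util.Random.nextDouble())   // random initial weight vector

for (i <- 1 to ITERATIONS) {
  val gradient = points.map { p =>
    val margin = p.y * w.zip(p.x).map { case (wi, xi) => wi * xi }.sum
    val scale  = (1.0 / (1.0 + math.exp(-margin)) - 1.0) * p.y
    p.x.map(_ * scale)                          // this point's gradient contribution
  }.reduce((a, b) => a.zip(b).map { case (ai, bi) => ai + bi })
  w = w.zip(gradient).map { case (wi, gi) => wi - gi }   // gradient step
}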

 

[3.2.2 PageRank]

A more complex pattern of data sharing occurs in PageRank.

** https://sungmooncho.com/2012/08/26/pagerank/

 

"Google's PageRank algorithm, explained simply" (Korean blog post)

The core algorithm of Google's search engine.

Google uses the link structure and the anchor text of links.

 

Image from this thesis paper.
Image from this thesis paper.

The algorithm iteratively updates a rank for each document by adding up contributions from documents that link to it.

On each iteration, each document sends a contribution of r/n to its neighbors, where r is its rank and n is its number of neighbors.

It then updates its rank to α/N + (1 − α)∑ci, where the sum is over the contributions it received and N is the total number of documents.

We can write PageRank in Spark as above.
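A reconstruction of the listing shown in the image above; the input format, ITERATIONS, alpha, and the initial ranks are assumptions, and sc is an assumed SparkContext:

val ITERATIONS = 10
val alpha = 0.15

// (URL, list-of-outgoing-link-targets) pairs, kept in memory across iterations.
val links = sc.textFile("hdfs://...")                                     // placeholder path
              .map { line => val p = line.split("\\s+"); (p(0), p(1)) }   // assumed "url neighbor" lines
              .groupByKey()
              .persist()

val N = links.count()                          // total number of documents
var ranks = links.mapValues(_ => 1.0)          // assumed initial rank of 1 for every URL

for (i <- 1 to ITERATIONS) {
  // Each page sends a contribution of rank/n to each of its n neighbors.
  val contribs = links.join(ranks).flatMap {
    case (url, (outLinks, rank)) =>
      outLinks.map(dest => (dest, rank / outLinks.size))
  }
  // Sum contributions per URL and apply alpha/N + (1 - alpha) * sum.
  ranks = contribs.reduceByKey(_ + _)
                  .mapValues(sum => alpha / N + (1 - alpha) * sum)
}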

 

The program (code) above leads to the RDD lineage graph in Figure 3.

On each iteration, we create a new ranks dataset based on the contribs and ranks from the previous iteration and the static links dataset.

One interesting feature of this graph is that it grows longer with the number of iterations.

Thus, in a job with many iterations, it may be necessary to reliably replicate some of the versions of ranks to reduce fault recovery times.

The user can call persist with a RELIABLE flag to do this.

However, note that the links dataset does not need to be replicated, because partitions of it can be rebuilt efficiently by rerunning a map on blocks of the input file.

This dataset will typically be much larger than ranks, because each document has many links but only one number as its rank, so recovering it using lineage saves time over systems that checkpoint a program's entire in-memory state.

 

Finally, we can optimize communication in PageRank by controlling the partitioning of the RDDs. 

If we specify a partitioning of links (e.g., hash-partition the link lists by URL across nodes), we can partition ranks in the same way and ensure that the join operation between links and ranks requires no communication (as each URL's rank will be on the same machine as its link list).

 

Image from this thesis paper.
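A reconstruction of the partitioning call shown in the image; the HashPartitioner and the choice of 100 partitions are assumptions (the paper passes an unspecified myPartitioner):

import org.apache.spark.HashPartitioner

val links = sc.textFile("hdfs://...")                                     // placeholder path
              .map { line => val p = line.split("\\s+"); (p(0), p(1)) }   // (url, outLink) pairs
              .partitionBy(new HashPartitioner(100))                      // hash-partition link lists by URL
              .persist()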

After this initial call, the join operation between links and ranks will automatically aggregate the contributions for each URL to the machine that its link list is on, calculate its new rank there, and join it with its links.

This type of consistent partitioning across iterations is one of the main optimizations in specialized frameworks like Pregel.

RDDs let the user express this goal directly.

 

[4. Representing RDDs]

One of the challenges in providing RDDs as an abstraction is choosing a representation for them that can track lineage across a wide range of transformations.

We propose a simple graph-based representation for RDDs that facilitates these goals.

We have used this representation in Spark to support a wide range of transformations without adding special logic to the scheduler for each one, which greatly simplified the system design.

 

In a nutshell, we propose representing each RDD through a common interface that exposes five pieces of information:

a set of partitions, which are atomic pieces of the dataset;

a set of dependencies on parent RDDs;

a function for computing the dataset based on its parents; and metadata about its partitioning scheme and data placement.

For example, an RDD representing an HDFS file has a partition for each block of the file and knows which machines each block is on.
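A sketch of this common interface in Scala (names and types here are illustrative, not Spark's actual internal API):

trait Partition { def index: Int }
trait Dependency
trait Partitioner

trait RDDInterface[T] {
  def partitions: Seq[Partition]                        // atomic pieces of the dataset
  def dependencies: Seq[Dependency]                     // parent RDDs this one was derived from
  def iterator(p: Partition, parentIters: Seq[Iterator[_]]): Iterator[T]   // compute one partition
  def preferredLocations(p: Partition): Seq[String]     // data-placement hints (e.g., HDFS block hosts)
  def partitioner: Option[Partitioner]                  // hash/range partitioning metadata, if any
}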

 

The most interesting question in designing this interface is how to represent dependencies between RDDs.

narrow dependencies: where each partition of the parent RDD is used by at most one partition of the child RDD

wide dependencies: where multiple child partitions may depend on a single partition of the parent RDD.

 

map leads to a narrow dependency, while join leads to wide dependencies.

Image from this thesis paper.

This distinction is useful for two reasons.

First, narrow dependencies allow for pipelined execution on one cluster node, which can compute all the parent partitions.

Second, recovery after a node failure is more efficient with a narrow dependency, as only the lost parent partitions need to be recomputed, and they can be recomputed in parallel on different nodes.

In contrast, in a lineage graph with wide dependencies, a single failed node might cause the loss of some partition from all the ancestors of an RDD, requiring a complete re-execution.

 

This common interface for RDDs made it possible to implement most transformations in Spark in less than 20 lines of code.

Indeed, even new Spark users have implemented new transformations (e.g., sampling and various types of joins) without knowing the details of the scheduler.

We sketch some RDD implementations below.

HDFS files: The input RDDs in our samples have been files in HDFS.

 

map: Calling map on any RDD returns a MappedRDD object.

union: Calling union on two RDDs returns an RDD whose partitions are the union of those of the parents.

sample: Sampling is similar to mapping, except that the RDD stores a random number generator seed for each partition to deterministically sample parent records.

join: Joining two RDDs may lead to either two narrow dependencies (if they are both hash/range partitioned with the same partitioner), two wide dependencies, or a mix (if one parent has a partitioner and one does not).
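For illustration, a sketch of how map could be expressed against the RDDInterface sketch from Section 4 above (illustrative only, not Spark's actual MappedRDD code):

class MappedRDD[T, U](parent: RDDInterface[T], f: T => U) extends RDDInterface[U] {
  def partitions: Seq[Partition] = parent.partitions            // same partitions as the parent
  def dependencies: Seq[Dependency] = Seq(new Dependency {})    // a single narrow dependency on the parent
  def iterator(p: Partition, parentIters: Seq[Iterator[_]]): Iterator[U] =
    parentIters.head.asInstanceOf[Iterator[T]].map(f)           // apply f element by element
  def preferredLocations(p: Partition): Seq[String] = parent.preferredLocations(p)
  def partitioner: Option[Partitioner] = None                   // map does not preserve a key partitioner
}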

[5. Implementation]

We have implemented Spark in about 14,000 lines of Scala.

The system runs over the Mesos cluster manager, allowing it to share resources with Hadoop, MPI and other applications.

Each Spark program runs as a separate Mesos application, with its own driver (master) and workers, and resource sharing between these applications is handled by Mesos.

Spark can read data from any Hadoop input source (e.g., HDFS or HBase) using Hadoop's existing input plugin APIs, and runs on an unmodified version of Scala.

https://bcho.tistory.com/1027

 

"Understanding Apache Spark RDDs (Resilient Distributed DataSet) - #1" (Korean blog post)

[5.1 Job Scheduling]

Spark's scheduler uses our representation of RDDs, described in Section 4.

Whenever a user runs an action (e.g., count or save) on an RDD, the scheduler examines that RDD's lineage graph to build a DAG of stages to execute, as illustrated in Figure 5.

Image from this thesis paper.

Each stage contains as many pipelined transformations with narrow dependencies as possible.

The boundaries of the stages are the shuffle operations required for wide dependencies, or any already computed partitions that can short-circuit the computation of a parent RDD. 

The scheduler then launches tasks to compute missing partitions from each stage until it has computed the target RDD.

 

Our scheduler assigns tasks to machines based on data locality using delay scheduling.

If a task needs to process a partition that is available in memory on a node, we send it to that node.

Otherwise, if a task processes a partition for which the containing RDD provides preferred locations (e.g., an HDFS file), we send it to those.

For wide dependencies (i.e., shuffle dependencies), we currently materialize intermediate records on the nodes holding parent partitions to simplify fault recovery, much like MapReduce materializes map outputs.

If a task fails, we re-run it on another node as long as its stage's parents are still available.

If some stages have become unavailable (e.g., because an output from the "map side" of a shuffle was lost), we resubmit tasks to compute the missing partitions in parallel.

 

Finally, although all computations in Spark currently run in response to actions called in the driver program, we are also experimenting with letting tasks on the cluster (e.g., maps) call the lookup operation, which provides random access to elements of hash-partitioned RDDs by key.

In this case, tasks would need to tell the scheduler to compute the required partition if it is missing.

 

[5.2 Interpreter Integration]

Scala includes an interactive shell similar to those of Ruby and Python.

Given the low latencies attained with in-memory data, we wanted to let users run Spark interactively from the interpreter to query big datasets.

The Scala interpreter normally operates by compiling a class for each line typed by the user, loading it into the JVM, and invoking a function on it.

 

- What is the JVM? https://medium.com/@lazysoul/jvm-%EC%9D%B4%EB%9E%80-c142b01571f2

 


JVM is short for Java Virtual Machine; its role is to interpret Java bytecode for the underlying OS.

 

Ex. if the user types var x=5 followed by println(x), the interpreter defines a class called Line1 containing x and causes the second line to compile to println(Line1.getInstance().x).
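A rough illustration of the per-line wrapper classes described here (not the interpreter's exact generated code):

class Line1 { var x = 5 }
object Line1 {
  private val instance = new Line1
  def getInstance(): Line1 = instance
}

class Line2 { println(Line1.getInstance().x) }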

 

We made two changes to the interpreter in Spark:

1. Class shipping: To let the worker nodes fetch the bytecode for the classes created on each line, we made the interpreter serve these classes over HTTP.

2. Modified code generation: Normally, the singleton object created for each line of code is accessed through a static method on its corresponding class.

We modified the code generation logic to reference the instance of each line object directly.

 

Image from this thesis paper.

Figure 6 shows how the interpreter translates a set of lines typed by the user to Java objects after our changes.

We found the Spark interpreter to be useful in processing large traces obtained as part of our research and exploring datasets stored in HDFS.

We also plan to use it to run high-level query languages interactively, e.g., SQL.

 

[5.3. Memory Management]

Spark provides three options for storage of persistent RDDs:

1. in-memory storage as deserialized Java objects,

2. in-memory storage as serialized data,

3. on-disk storage.

 

Option 1 provides the fastest performance, because the Java VM can access each RDD element natively.

Option 2 lets users choose a more memory-efficient representation than Java object graphs when space is limited, at the cost of lower performance.

Option 3 is useful for RDDs that are too large to keep in RAM but costly to recompute on each use.
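For reference, the three options expressed with Spark's StorageLevel flags (modern API names; the paper describes them simply as flags to persist, and sc and the data here are assumptions):

import org.apache.spark.storage.StorageLevel

val rdd = sc.parallelize(1 to 1000000)
rdd.persist(StorageLevel.MEMORY_ONLY)         // 1. in-memory, deserialized Java objects
// rdd.persist(StorageLevel.MEMORY_ONLY_SER)  // 2. in-memory, serialized (more compact, slower access)
// rdd.persist(StorageLevel.DISK_ONLY)        // 3. on disk, for RDDs too large for RAM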

 

To manage the limited memory available, we use an LRU eviction policy at the level of RDDs.

When a new RDD partition is computed but there is not enough space to store it, we evict a partition from the least recently accessed RDD, unless this is the same RDD as the one with the new partition.

This is important because most operations will run tasks over an entire RDD, so it is quite likely that the partition already in memory will be needed in the future.

 

Finally, each instance of Spark on a cluster currently has its own separate memory space.

In future work, we plan to investigate sharing RDDs across instances of Spark through a unified memory manager.

 

[5.4 Support for Checkpointing]

Although lineage can always be used to recover RDDs after a failure, such recovery may be time-consuming for RDDs with long lineage chains.

Thus, it can be helpful to checkpoint some RDDs to stable storage.

In general, checkpointing is useful for RDDs with long lineage graphs containing wide dependencies, such as the rank datasets in our PageRank example.
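A sketch of checkpointing with today's Spark API (the paper exposes this through a flag to persist; the checkpoint() call shown here is the modern equivalent, and the path and data are placeholders):

sc.setCheckpointDir("hdfs://.../checkpoints")                       // placeholder directory on stable storage
val ranksLike = sc.parallelize(Seq(("url1", 1.0), ("url2", 1.0)))   // stand-in for a long-lineage RDD
ranksLike.checkpoint()                                              // request a checkpoint, cutting the lineage
ranksLike.count()                                                   // an action triggers the actual write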

 

[6. Evaluation]

We evaluated Spark and RDDs through a series of experiments on Amazon EC2, as well as benchmarks of user applications.

- Spark outperforms Hadoop by up to 20X in iterative machine learning and graph applications. The speedup comes from avoiding I/O and deserialization costs by storing data in memory as Java objects.

- Applications written by our users perform and scale well. In particular, we used Spark to speed up an analytics report that was running on Hadoop by 40X.

- When nodes fail, Spark can recover quickly by rebuilding only the lost RDD partitions.

- Spark can be used to query a 1 TB dataset interactively with latencies of 5-7 seconds.

We start by presenting benchmarks for iterative machine learning applications and PageRank against Hadoop.

We then evaluate fault recovery in Spark and behavior when a dataset does not fit in memory.

Finally, we discuss results for user applications and interactive data mining.

 

[6.1 Iterative Machine Learning Applications]

Logistic regression vs. k-means

We ran both algorithms for 10 iterations on 100 GB datasets using 25-100 machines.

The iteration time of k-means is dominated by computation, while logistic regression is less compute-intensive and thus more sensitive to time spent in deserialization and I/O.

 

We find that sharing data via RDDs greatly speeds up future iterations.

Image from this thesis paper.

First iterations.

All three systems read text input from HDFS in their first iterations.

 

[7. Discussion]

Although RDDs seem to offer a limited programming interface due to their immutable nature and coarse-grained transformations, we have found them suitable for a wide class of applications.

RDDs can express a surprising number of cluster programming models that have so far been proposed as separate frameworks, allowing users to compose these models in one program (e.g., run a MapReduce operation to build a graph, then run Pregel on it) and share data between them.

 
