Big Data Streaming Architecture Ecosystem Analysis

Functional Comparison and Performance Evaluation

Overview
- Streaming Core
- MISC
- Performance Benchmark
- Choose your weapon!

Execution Model + Fault Tolerance Mechanism

Frameworks compared: Apache Spark Streaming*, Apache Flink*, Apache Storm*, Apache Storm Trident*, Apache Gearpump*, Twitter Heron*

Execution Model
- Micro-Batch (Source -> Operator -> Sink): Apache Spark Streaming*, Apache Storm Trident*
- Continuous Streaming (Source -> Operator -> Sink): Apache Flink*, Apache Storm*, Apache Gearpump*, Twitter Heron*

This is the critical part, as it affects many features.

Fault Tolerance Mechanism
- Micro-Batch, checkpoint per batch: Apache Spark Streaming*, Apache Storm Trident*
- Continuous Streaming, checkpoint "per batch": Apache Flink*, Apache Gearpump*
- Continuous Streaming, ack per record: Apache Storm*, Twitter Heron*

[Diagram: one Source -> Operator -> Sink pipeline per mechanism. Labels: Driver, job status, Storage, JobManager/HDFS (id, offset, state), Acker, ack.]

Trade-off
- Ack per record: low latency, but high overhead and low throughput
- Checkpoint per batch: high latency, but low overhead and high throughput

Delivery Guarantee
- At least once: Ackers know whether a record was processed successfully. If it failed, the record is replayed. There is no state consistency guarantee.
- Exactly once: State is persisted in durable storage. The checkpoint is linked with the state storage per batch.
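The at-least-once behaviour described above can be illustrated with a small toy loop. This is only a sketch under stated assumptions, not how Storm or Heron implement their ackers: the function name and the `fail_once_on` parameter are hypothetical, and the failure is modelled as an ack that is lost after the state update has already happened.

```python
def process_with_replay(records, fail_once_on):
    """Toy at-least-once loop: a record whose ack is lost gets replayed.

    `fail_once_on` is a hypothetical set of records whose first ack is lost
    after the state update, which is how duplicates can arise.
    """
    counts = {}
    pending_failures = set(fail_once_on)
    queue = list(records)
    while queue:
        record = queue.pop(0)
        # The state update happens before the ack is confirmed.
        counts[record] = counts.get(record, 0) + 1
        if record in pending_failures:
            # Ack lost: the source replays the record once.
            pending_failures.discard(record)
            queue.append(record)
    return counts


counts = process_with_replay(["foo", "bar"], fail_once_on={"bar"})
print(counts)
```

Every record is counted at least once, but "bar" is counted twice, which is the missing state consistency guarantee in miniature.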

Native State Operator

| Framework | Native state API |
|---|---|
| Apache Spark Streaming* | Spark 1.5: updateStateByKey; Spark 1.6: mapWithState |
| Apache Flink* | Java API: ValueState, ListState, ReducingState; Scala API: mapWithState |
| Apache Storm* | KeyValueState |
| Apache Storm Trident* | persistentAggregate, State |
| Apache Gearpump* | persistState |
| Twitter Heron* | X (user maintained) |

API

Declarative (Apache Spark Streaming*, Apache Storm Trident*, Apache Flink*, Apache Gearpump*)
- Higher-order functions as operators (filter, mapWithState, ...)
- Logical plan optimization

    DataStream<String> text = env.readTextFile(params.get("input"));
    DataStream<Tuple2<String, Integer>> counts = text.flatMap(new Tokenizer()).keyBy(0).sum(1);

Data flow: "foo, foo, bar" -> "foo", "foo", "bar" -> {"foo": 1, "foo": 1, "bar": 1} -> {"foo": 2, "bar": 1}
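The word-count data flow on this slide can be traced step by step in plain Python. This is a minimal sketch with no streaming framework involved; the function name `word_count` is an illustration only, and each step is labelled with the operator it roughly corresponds to.

```python
from collections import Counter


def word_count(line):
    # flatMap: split the line into individual words
    words = [w.strip() for w in line.split(",")]
    # map: pair each word with an initial count of 1
    pairs = [(w, 1) for w in words]
    # keyBy + sum: add up the counts per word
    totals = Counter()
    for word, n in pairs:
        totals[word] += n
    return dict(totals)


result = word_count("foo, foo, bar")
print(result)
```

The printed dictionary matches the final stage of the data flow shown on the slide.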

Statistical (Apache Spark Streaming*, Apache Storm*, Twitter Heron*, Structured Streaming*)
- Data scientist friendly
- Dynamic type

Python:

    lines = ssc.textFileStream(params.get("input"))
    words = lines.flatMap(lambda line: line.split(","))
    pairs = words.map(lambda word: (word, 1))
    counts = pairs.reduceByKey(lambda x, y: x + y)
    counts.saveAsTextFiles(params.get("output"))

R:

    lines <- textFile(sc, "input")
    words <- flatMap(lines, function(line) { strsplit(line, " ")[[1]] })
    wordCount <- lapply(words, function(word) { list(word, 1L) })
    counts <- reduceByKey(wordCount, "+", 2L)

SQL (Apache Spark Streaming*, Apache Flink*, Structured Streaming, Apache Storm Trident*)

Fusion Style (Spark Streaming):

    InputDStream.transform((rdd: RDD[Order], time: Time) => {
      import sqlContext.implicits._
      // Register the batch as a temporary table so it can be queried with SQL
      rdd.toDF.registerTempTable("ORDERS")
      val SQL = "SELECT ID, UNIT_PRICE * QUANTITY AS TOTAL FROM ORDERS WHERE UNIT_PRICE * QUANTITY > 50"
      val largeOrderDF = sqlContext.sql(SQL)
      largeOrderDF.rdd
    })

Pure Style (Apache Storm* SQL):

    CREATE EXTERNAL TABLE ORDERS (ID INT PRIMARY KEY, UNIT_PRICE INT, QUANTITY INT)
      LOCATION "kafka://localhost:2181/brokers?topic=orders" TBLPROPERTIES '{...}'
    INSERT INTO LARGE_ORDERS SELECT ID, UNIT_PRICE * QUANTITY AS TOTAL FROM ORDERS WHERE UNIT_PRICE * QUANTITY > 50

    bin/storm sql XXXX.sql

Summary

| Framework | Compositional | Declarative | Python/R | SQL |
|---|---|---|---|---|
| Apache Spark Streaming* | X | √ | √ | √ |
| Apache Storm* | √ | X | √ | Limited: no aggregation, windowing, or joining |
| Apache Storm Trident* | X | √ | X | |
| Apache Gearpump* | √ | √ | X | X |
| Apache Flink* | X | √ | X | Limited: supports select, from, where, union |
| Twitter Heron* | √ | X | √ | X |

Runtime Model

- Single Task on Single Process (Twitter Heron*): each JVM process runs one task and connects with the local SM.
- Multi Tasks of Multi Applications on Single Process (Apache Flink*): the threads of one JVM process run tasks from application A and tasks from application B.
- Multi Tasks of Single Application on Single Process (Apache Spark Streaming*, Apache Storm*, Apache Storm Trident*, Apache Gearpump*), in two variants:
  - Single task on single thread
  - Multi tasks on single thread

[Diagrams: JVM processes containing threads, each thread running one or more tasks.]

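MISC
- Window Support
- Out-of-order Processing
- Memory Management
- Resource Management
- Web UI
- Community Maturity

Window Support

[Diagram: sliding, count, and session windows on a time axis t. A session window keeps growing while the space between events is smaller than the session gap.]

| Framework | Sliding Window | Count Window | Session Window |
|---|---|---|---|
| Apache Spark Streaming* | √ | X | X |
| Apache Storm* | √ | √ | X |
| Apache Storm Trident* | √ | √ | X |
| Apache Gearpump* | √ | X | X |
| Apache Flink* | √ | √ | √ |
| Twitter Heron* | X | X | X |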
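The session-window idea from the diagram (events closer together than the session gap belong to the same window) can be sketched as follows. This is a simplified illustration on a list of timestamps, not the implementation of any of the frameworks above; the function name and the sample values are assumptions.

```python
def session_windows(timestamps, gap):
    """Group event timestamps into sessions.

    An event joins the current session when the time since the previous
    event is smaller than `gap`; otherwise it starts a new session.
    """
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] < gap:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions


sessions = session_windows([1, 2, 3, 10, 11, 30], gap=5)
print(sessions)
```

With a gap of 5, the sample timestamps fall into three sessions, because the jumps from 3 to 10 and from 11 to 30 are both larger than the gap.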

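Out-of-order Processing

| Framework | Processing Time | Event Time | Watermark |
|---|---|---|---|
| Apache Spark Streaming* | √ | √ | X |
| Apache Storm* | √ | √ | √ |
| Apache Storm Trident* | √ | X | X |
| Apache Gearpump* | √ | √ | √ |
| Apache Flink* | √ | √ | √ |
| Twitter Heron* | √ | X | X |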
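To make the event time and watermark columns concrete, here is a toy sketch of one common watermark heuristic: the watermark trails the largest event time seen so far by a fixed bound, and an event that arrives with an event time below the watermark counts as late. The function name, the bound, and the sample events are assumptions for illustration; real frameworks differ in how they generate and propagate watermarks.

```python
def split_by_watermark(events, max_out_of_orderness):
    """Classify events, given in arrival order, as on time or late.

    events: list of (event_time, value) tuples.
    The watermark is the largest event time seen so far minus the bound.
    """
    max_seen = float("-inf")
    on_time, late = [], []
    for event_time, value in events:
        watermark = max_seen - max_out_of_orderness
        if event_time < watermark:
            late.append((event_time, value))
        else:
            on_time.append((event_time, value))
        max_seen = max(max_seen, event_time)
    return on_time, late


events = [(1, "a"), (5, "b"), (4, "c"), (10, "d"), (3, "e")]
on_time, late = split_by_watermark(events, max_out_of_orderness=2)
print(on_time, late)
```

The event with time 4 arrives out of order but is still within the bound, while the event with time 3 arrives after the watermark has advanced to 8 and is treated as late.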

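Memory Management

| Framework | JVM Manage | Self Manage on-heap | Self Manage off-heap |
|---|---|---|---|
| Apache Spark Streaming* | √ | √ | √ |
| Apache Flink* | √ | √ | √ |
| Apache Storm* | √ | X | X |
| Apache Gearpump* | √ | X | X |
| Twitter Heron* | √ | X | X |

Resource Management

| Framework | Standalone | YARN | Mesos |
|---|---|---|---|
| Apache Spark Streaming* | √ | √ | √ |
| Apache Storm* | √ | √ | √ |
| Apache Storm Trident* | √ | √ | √ |
| Apache Gearpump* | √ | √ | X |
| Apache Flink* | √ | √ | X |
| Twitter Heron* | √ | √ | √ |

Web UI

| Framework | Submit Jobs | Cancel Jobs | Inspect Jobs | Show Statistics | Show Input Rate | Check Exceptions | Inspect Config |
|---|---|---|---|---|---|---|---|
| Apache Spark Streaming* | X | √ | √ | √ | √ | √ | √ |
| Apache Storm* | X | √ | √ | √ | √ | √ | √ |
| Apache Gearpump* | √ | √ | √ | √ | √ | √ | √ |
| Apache Flink* | √ | √ | √ | √ | X | √ | √ |
| Twitter Heron* | X | X | √ | √ | √ | √ | √ |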

Community Maturity

| Framework | Initiation Time | Apache Top Project | Contributors |
|---|---|---|---|
| Apache Spark Streaming* | 2013 | 2014 | 926 |
| Apache Storm* | 2011 | 2014 | 219 |
| Apache Gearpump* | 2014 | Incubator | 21 |
| Apache Flink* | 2010 | 2015 | 208 |
| Twitter Heron* | 2014 | N/A | 44 |

[Chart: Past 1 Month Summary on GitHub (committers and commits) for Spark, Storm, Gearpump, Flink, and Heron. Source website: /apache/spark/pulse/monthly]

[Chart: Past 3 Months Summary on JIRA (created and resolved issues) for Spark, Storm, Gearpump, Flink, and Heron. Source website: /jira/secure/Dashboard.js]

HiBench 6.0

Test Philosophy
- "Lazy Benchmarking"
- Simple test cases infer practical use cases

Cluster Setup

Apache Kafka* Cluster (x3)
- CPU: 2 x Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz
- Mem: 128 GB
- Disk: 8 x HDD (1TB)
- Network: 10 Gbps

Test Cluster (x7)
- CPU: 2 x Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
- Core: 20 / 24
- Mem: 80 / 128 GB
- Disk: 8 x HDD (1TB)
- Network: 10 Gbps

Link between the two clusters: 10 Gbps

| Name | Version |
|---|---|
| Java | 1.8 |
| Scala | 2.11.7 |
| Apache Hadoop* | 2.6.2 |
| Apache Zookeeper* | 3.4.8 |
| Apache Kafka* | |
| Apache Spark* | 1.6.1 |
| Apache Storm* | 1.0.1 |
| Apache Flink* | 1.0.3 |
| Apache Gearpump* | 0.8.1 |

- Heron* requires a specific operating system (Ubuntu/CentOS/MacOS).
- Structured Streaming doesn't support a Kafka source yet (Spark 2.0).

Architecture

[Diagram: a Data Generator feeds Topic A on three Kafka Brokers. The Test Cluster (Standalone) has a Client, a Master, and Slaves with 20 cores and 80 GB of memory each. Topic A carries the In Time and Topic B carries the In Time and Out Time of each record. A Metrics Reader reads the topics and writes the Result to the File System.]
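The benchmark results later in the deck report P99 latency, which can be derived from the In Time and Out Time of each record. The sketch below uses the nearest-rank percentile method on synthetic data; it is an assumption about how such a figure could be computed, not a description of what the HiBench Metrics Reader actually does, and the function names are hypothetical.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def p99_latency(records):
    """records: list of (in_time, out_time) pairs in the same time unit."""
    latencies = [out_time - in_time for in_time, out_time in records]
    return percentile(latencies, 99)


# Synthetic records with latencies 1, 2, ..., 100
records = [(0, i) for i in range(1, 101)]
p99 = p99_latency(records)
print(p99)
```

With 100 synthetic latencies from 1 to 100, the 99th percentile by nearest rank is the 99th smallest value.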

Framework Configuration

| Framework | Related Configuration |
|---|---|
| Apache Spark Streaming* | 7 Executor, 140 Parallelism |
| Apache Flink* | 7 TaskManager, 140 Parallelism |
| Apache Storm* | 28 Worker, 140 KafkaSpout |
| Apache Gearpump* | 28 Executors, 140 KafkaSource |

Raw Input Data
- Kafka Topic Partition: 140
- Size Per Message (configurable): 200 bytes
- Raw Input Message Example:
  "0,6,nbizrgdziebsaecsecujfjcqtvnpcnxxwiopmddorcxnlijdizgoi,1991-06-10,0.115967035,Mozilla/5.0 (iPhone; U; CPU like Mac OS X) AppleWebKit/420.1 (KHTML like Gecko) Version/3.0 Mobile/4A93 Safari/419.3,YEM,YEM-AR,snowdrops,1"
- Strong Type: class UserVisit (ip, sessionId, browser)
- Keep feeding data at a specific rate for 5 minutes

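Data Input Rate

| Throughput | Message/Second | Kafka Producer Num |
|---|---|---|
| 40 KB/s | 0.2K | 1 |
| 400 KB/s | 2K | 1 |
| 4 MB/s | 20K | 1 |
| 40 MB/s | 200K | 1 |
| 80 MB/s | 400K | 1 |
| 400 MB/s | 2M | 10 |
| 600 MB/s | 3M | 15 |
| 800 MB/s | 4M | 20 |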
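The throughput column follows from the message rate and the 200-byte message size stated earlier. A short check of that arithmetic, assuming decimal units (1 KB = 1,000 bytes, 1 MB = 1,000,000 bytes) and using a hypothetical helper name:

```python
MESSAGE_SIZE_BYTES = 200  # size per message, as configured in the benchmark


def throughput_bytes_per_second(messages_per_second, message_size=MESSAGE_SIZE_BYTES):
    """Input throughput in bytes per second for a given message rate."""
    return messages_per_second * message_size


# Message rates from the table (0.2K to 4M messages per second)
rates = [200, 2_000, 20_000, 200_000, 400_000, 2_000_000, 3_000_000, 4_000_000]
throughputs = [throughput_bytes_per_second(r) for r in rates]
for rate, tp in zip(rates, throughputs):
    print(f"{rate:>9} msg/s -> {tp / 1_000_000:g} MB/s")
```

For example, 0.2K messages per second at 200 bytes each gives 40,000 bytes per second (40 KB/s), and 4M messages per second gives 800 MB/s.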

Let's start with the simplest case

Test Case: Identity

The application reads input data from Kafka and then writes the result to Kafka immediately; there is no complex business logic involved.

Result

[Chart: P99 Latency (s) versus Input Rate (MB/s) for Apache Spark*, Apache Flink*, Apache Storm* without Ack, and Apache Storm* with Ack.]

For more complete information about performance and benchmark results, visit /benchmarks.

Results have been estimated or simulated using internal Intel analysis or architecture simulation or modeling, and provided to you for informational purposes. Any differences in your system hardware, software or configuration may affect your actual performance.

Test Case: Repartition

Basically, this test case stands for the efficiency of data shuffle (network shuffle).

Result

[Chart: P99 Latency (s) versus Input Rate (MB/s) for Apache Spark*, Apache Flink*, Apache Storm* without Ack, Apache Gearpump*, and Apache Storm* with Ack.]

[Chart: Throughput (MB/s) versus Input Rate (MB/s) for Apache Spark*, Apache Flink*, Apache Storm* without Ack, Apache Gearpump*, and Apache Storm* with Ack.]

Observation

- Spark Streaming needs to schedule tasks with additional context. With a tiny batch interval, the overhead can be dramatically worse than in other frameworks.
- According to our test, the minimum batch interval of Spark is about 80 ms (140 tasks per batch); below that, the task scheduling delay keeps increasing.
- Repartition is heavy for every framework, but it is usually unavoidable.
- Latency of Gearpump is still quite low even under 800 MB/s input throughput.
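Why the scheduling delay keeps increasing below a minimum batch interval can be seen with a toy model: if each batch takes longer to process than the interval at which new batches arrive, every batch waits longer than the one before it. This is a simplified sketch, not Spark's scheduler; the 70 ms processing time is a hypothetical value chosen for illustration, and batches are assumed to run strictly one at a time.

```python
def scheduling_delays(batch_interval_ms, processing_time_ms, num_batches):
    """Delay before each batch can start when batches run one at a time."""
    delays = []
    free_at = 0  # time at which the previous batch finishes
    for i in range(num_batches):
        ready_at = i * batch_interval_ms  # time at which batch i is ready
        start = max(ready_at, free_at)
        delays.append(start - ready_at)
        free_at = start + processing_time_ms
    return delays


stable = scheduling_delays(batch_interval_ms=80, processing_time_ms=70, num_batches=5)
growing = scheduling_delays(batch_interval_ms=50, processing_time_ms=70, num_batches=5)
print(stable)
print(growing)
```

In the first case the interval exceeds the processing time and the delay stays at zero; in the second case the delay grows by 20 ms with every batch.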

Test Case: Stateful WordCount

- Native state operators are supported by all frameworks we evaluated.
- This case measures stateful operator performance plus Checkpoint/Acker cost.

Result

[Chart: P99 Latency (s) versus Input Rate (MB/s) for Apache Flink*, Apache Spark*, Apache Flink* without CP, Apache Storm*, and Apache Gearpump*.]

[Chart: Throughput (MB/s) versus Input Rate (MB/s) for Apache Spark*, Apache Flink*, Apache Storm*, and Gearpump*.]

Observation

- Exactly-once semantics usually require state management and checkpointing, but better guarantees come at a high cost.
- There is no obvious performance difference in Flink when switching fault tolerance on or off.
- Checkpoint mechanisms and storage play a critical role here.

Test Case: Window Based Aggregation

This test case manages a 10-second sliding window.

Result

[Chart: P99 Latency (s) versus Input Rate (MB/s) for Apache Spark*, Apache Flink*, and Storm*.]

[Chart: Throughput (MB/s) versus Input Rate (MB/s) for Apache Spark*, Apache Flink*, and Storm*.]
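A sliding-window aggregation of the kind this test case exercises can be sketched in a few lines. The window size of 10 matches the 10-second window above; the slide of 5, the function name, and the sample events are assumptions for illustration, and windows that would start before time 0 are dropped for simplicity.

```python
def sliding_window_counts(events, window_size, slide):
    """Count events per key in every sliding window that contains them.

    events: list of (timestamp, key) tuples.
    Returns {window_start: {key: count}} for windows [start, start + window_size).
    """
    result = {}
    for ts, key in events:
        # Earliest window start (a multiple of slide) whose window contains ts
        first = ((ts - window_size) // slide + 1) * slide
        start = max(first, 0)
        while start <= ts:
            window = result.setdefault(start, {})
            window[key] = window.get(key, 0) + 1
            start += slide
    return result


counts = sliding_window_counts([(1, "a"), (7, "a"), (12, "b")], window_size=10, slide=5)
print(counts)
```

Because the windows overlap, the event at time 7 is counted in both the window starting at 0 and the window starting at 5, which is part of what makes window aggregation more expensive than the identity case.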

So which streaming framework should I use?

Do your own benchmark

- HiBench: a cross-platform micro-benchmark suite for big data (/intel-hadoop/HiBench)
- Open source since 2012
- Better streaming benchmark support will be included in the next release [HiBench 6.0]

Legal Disclaimer

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

Intel does not control or
