




版權說明:本文檔由用戶提供并上傳,收益歸屬內容提供方,若內容存在侵權,請進行舉報或認領
文檔簡介
大數據流式架構生態分析Functional
ComparisonandPerformanceEvaluationOverviewStreaming
CoreMISCPerformance
BenchmarkChoose
your
weapon
!Execution
Model
+Fault
ToleranceMechanismApache
SparkStreaming*AapcheFlink*ApacheStorm*Apache
StormTrident*ApacheGearpump*TwitterHeron*Micro-BatchSourceOperatorSinkContinuousStreaming4SourceOperatorSinkAapcheFlink*Storm*Apache
StormTrident*ApacheGearpump*This
is
the
critical
part,
as
it
affects
manyfeaturesMicro-BatchCheckpoint
perABpaacthce
hSpark
Streaming*Continuous
StreamingCheckpoint
“per
Batch”AckerJobManager/
HDFSidoffsetstate
strackSourceOperatoSinkSourceOperatoSinkSourceOperatoSinkrrrDriverStorageStoragejob
statusHDFSidoffsetstate
strContinuousStreaming
AckAppacehre
Record
TwitterHeron*Storage5Low
LatencyHigh
LatencyHigh
OverheadLowThroughputLow
OverheadHighThroughput6AapcheFlink*Storm*Apache
StormTrident*ApacheGearpump*rHeron*Micro-BatchCheckpoint
perABpaacthce
hSpark
Streaming*Continuous
StreamingCheckpoint
“per
Batch”ContinuousStreaming
AckAppacehre
Record
TwitteDelivery
GuaranteeAt
least
onceExactly
once
Ackers
know
about
if
arecord
is
processedsuccessfully
or
not.
If
itfailed,
replay
it.
There
is
no
stateconsistency
guarantee.
State
is
persisted
indurable
storage
Checkpoint
is
linkedwithstate
storage
perBatchApache
SparkStreaming*AapcheFlink*ApacheStorm*Apache
StormTrident*ApacheGearpump*TwitterHeron*Native
StateOperatorYes*YesFlink
Java
API:ValueStateListStateReduceStateFlink
Scala
API:·
mapWithStateGearpump·
persistStateYesSpark
1.5:·
updateStateByKeySpark
1.6:·
mapWithStateTrident:persistentAggregateStateStorm:·
KeyValueStateHeron:X
UserMaintainApache
SparkStreaming*AapcheFlink*ApacheStorm*Apache
StormTrident*ApacheGearpump*TwitterHeron*APIDeclarativeHigher
order
function
as
operators
(filter,
mapWithState…)Logical
plan
optimizationDataStream<String>
text
=
env.readTextFile(params.get("input"));DataStream<Tuple2<String,
Integer>>
counts
=
text.flatMap(new
Tokenizer()).keyBy(0).sum(1);“foo,
foo,
bar”“foo”,
“foo”,
“bar”
{“foo”:
1,
“foo”:
1,
“ba{r“”f:o1o}”:
2,
“bar”:
1}11Apache
SparkStreaming*Apache
StormTrident*AapcheFlink*ApacheGearpumpStatistical
Data
scientistfriendlyDynamic
typePythonlines
=
ssc.textFileStream(params.get("input"))
words
=
lines.flatMap(lambda
line:
line.split(“,"))pairs
=
words.map(lambda
word:
(word,
1))counts
=
pairs.reduceByKey(lambda
x,
y:
x
+
y)counts.saveAsTextFiles(params.get("output"))Rlines
<-
textFile(sc,
“input”)words
<-
flatMap(lines,
function(line)
{strsplit(line,
“
”)[[1]]})wordCount
<-
lapply(words,
function(word){
list(word,
1L)}counts
<-
reduceByKey(wordCount,
“+”,
2L)StructuredStreaming*12Apache
SparkStreaming*ApacheStorm*TwitterHeron*ApacheStorm*SQLORDERS
(ID
INT
PRIMARY
KEY,
UNIT_PRICE
INT,
QUANTITYINT)LOCATION
"kafka://localhost:2181/brokers?topic=orders"TBLPROPERTIES
"{...}}‘INSERT
INTO
LARGE_ORDERS
SELECT
ID,
UNIT_PRICE
*QUANTITYAS
TOTAL
FROM
ORDERS
WHERE
UNIT_PRICE
*
QUANTITY
>50bin/
storm
sql
XXXX.
sqlInputDStream.transform((rdd:
RDD[Order],
time:
Time)
=>{import
sqlContext.implicits._rdd.toDF.registAsTempTableval
SQL
=
"SELECT
ID,
UNIT_PRICE
*
QUANTITY
ASTOTAL
FROM
ORDERS
WHERE
UNIT_PRICE
*QUANTITY
>
50"val
largeOrderDF
=
sqlContext.sql(SQL)largeOrderDF.toRDD})Fusion
StylePure
StyleCREATE
EXTERNAL
TABLE13Apache
SparkStreaming*AapcheFlink*StructuredStreamingApache
StormTrident*Summary14CompositionalDeclarativePython/RSQLApache
SparkStreaming*X√√√ApacheStorm*√X√NOT
supportaggregation,windowing
andjoiningApache
StormTrident*X√XApacheGearpump*√√XXAapcheFlink*X√XSupport
select,
from,where,
unionTwitterHeron*√X√XRuntimeModel
Multi
Tasks
of
Multi
Applications
on
SingleProcessJVMProcessConnectwithlocal
SMThreadThreadTaskSingle
Task
on
Single
ProcessTaskTaskJVMProcessThreadTaskTaskJVMProcessTaskThreadThreadtask
from
applicationAThreadThreadtask
from
applicationBTaskTaskJVMProcessThreadTaskConnectwithlocal
SMThread16TwitterAapcheFlink*Multi
Tasks
of
Single
application
on
SingleTaskTaskThreadTaskTaskTaskThreadJVMProcessThreThreadado
Multi
tasks
on
single
threadTaskTaskJVMProcessThreaThreaddTaskTaskJVMProcessThreadTaskThreadThreadTaskTaskTaskJVMProcess17PorSoicnegslse
task
on
single
thread
Apache
SparkStreaming*ApacheStorm*Apache
StormTrident*ApacheGearpump*Window
Support ●
Out-of-order
Processing ●
Memory
ManagementResource
Management ●
Web
UI
●
Community
MaturityWindowSupportsmaller
than
gapsession
gapttSliding
Window
CountWindowSession
WindowSliding
WindowCount
WindowSession
WindowApache
SparkStreaming*√XXApacheStorm*√√XApache
StormTrident*√√XApacheGearpump*√XXApache
Flink*√√√ApacheHeron*XXX19Out-of-orderProcessing20Processing
TimeEvent
TimeWatermarkApache
SparkStreaming*√√XApacheStorm*√√√√XXApache
StormTrident*ApacheGearpump*√√√AapcheFlink*√√√TwitterHeron*√XXMemoryManagement21JVM
ManageSelf
Manage
on-heapSelf
Manage
off-heap√√√Apache
SparkStreaming*√√√AapcheFlink*√XXApacheStorm*√XXApacheGearpump*√XXTwitterHeron*Resource
Management2StandaloneYARNMesos√√√Apache
SparkStreaming*√√√ApacheStorm*√√√Apache
StormTrident*ApacheGearpump*√√XAapcheFlink*√√XTwitterHeron*√√√Web
UI23SubmitJobsCancelJobsInspectJobsShowStatisticsShowInput
RateCheckExceptionsInspeConfApacheSparkStreaming*X√√√√√√ApacheStorm*X√√√√√√ApacheGearpump*√√√√√√√ApacheFlink*√√√√X√√TwitterHeron*XX√√√√√21612371615147725002000150010005000Resloved78013002341217184202151000800600
4002000Past
1
Months
Summary
on
GitHubCommittoCommitsrCommunityMaturityInitiationTimeApacheTopProjectContributorsApacheSparkStreaming*20132014926ApacheStorm*20112014219ApacheGearpump*2014Incubator21ApacheFlink*20102015208TwitterHeron*2014N/A44Spark
Storm20Gearpump
FlinkHeron
Source
website:/apache/spark/pulse/monthlyPast
3
Months
Summary
on
JIRACreatedSpark
StormGearpump
FlinkHeronSource
website:
/jira/secure/Dashboard.js24HiBench
6.0“Lazy
Benchmarking”Simple
test
case
infer
practical
use
caseTest
PhilosophicalCluster
SetupApache
Kafka*
ClusterCPU:
2
x
Intel(R)
Xeon(R)
CPU
E5-2699
v3@
2.30GHzMem:
128
GBDisk:
8
x
HDD
(1TB)Network:
10
Gbps10
GbpsTest
ClusterCPU:
2
x
Intel(R)
Xeon(R)
CPU
E5-2697
v2@
2.70GHzCore:
20
/
24Mem:
80
/
128
GBDisk:
8
x
HDD
(1TB
)Network:
10
Gbpsx7x3NameVersionJava1.8Scala2.11.7Apache
Hadoop*2.6.2Apache
Zookeeper*3.4.8Apache
Kafka*Apache
Spark*1.6.1Apache
Storm*1.0.1Apache
Flink*1.0.3Apache
Gearpump*0.8.1Apache
Heron*
require
specific
Operation
System
(Ubuntu/CentOS/MacOS)Structured
Streaming
doesn’t
support
Kafka
source
yet (Spark
2.0)27ArchitectureTestCluster
(Standalone)DataGeneratorMetrics
ReaderFileSystemKafkaBrokerKafkaBrokerKafkaBrokerClientMasterSlaveSlaveSlaveSlave20
Core80GMem20
Core80GMemSlave20
Core80GMem20
Core80GMemSlave20
Core80GMem20
Core80GMemSlave20
Core80GMemTopic
ATopic
BResultTopic
AIn
TimeOut
TimeOut
Time
–
In
TimeFrameworkConfigurationFrameworkRelated
Configuration7
Executor140
Parallelism7
TaskManager140
Parallelism28
Worker140
KafkaSpout28
Executors140
KafkaSource29Apache
SparkStreaming*ApacheStorm*AapcheFlink*ApacheGearpump*Raw
Input
DataKafka
Topic
Partition:
140Size
Per
Message
(configurable):
200
bytesRaw
Input
Message
Example:“0,6,nbizrgdziebsaecsecujfjcqtvnpcnxxwiopmddorcxnlijdizgoi,1991-06-10,0.115967035,Mozilla/5.0
(iPhone;
U;
CPUlike
Mac
OS
X)AppleWebKit/420.1
(KHTML
like
Gecko)
Version/3.0
Mobile/4A93Safari/419.3,YEM,YEM-AR,snowdrops,1”Strong
Type:
class
UserVisit
(ip,
sessionId,
browser)
Keep
feeding
data
at
specific
rate
for
5minutes5
minutes30Data
Input
RateThroughputMessage/SecondKafkaProducer
Num40KB/s0.2K1400KB/s2K14MB/s20K140MB/s200K180MB/s400K1400MB/s2M10600MB/s3M15800MB/s4M20Let"s
start
with
the
simplestcaseTest
Case:
IdentityThe
application
reads
input
data
from
Kafka
and
then
writes
resultto
Kafka
immediately,
there
is
no
complex
business
logic
involved.Result8765432100100700800P99
Latency
(s)200Apache
Spark*30
400
5000
Input
Ra6t0e0(MB/s)Apache
Flink*Apache
Storm*
without
Apache
Storm*
with
AckAckFor
more
complete
information
about
performance
and
benchmark
results,
visit
/benchmarks.Results
have
been
estimated
or
simulated
using
internal
Intel
analysis
or
architecture
simulation
or
modeling,
and
provided
to
you
for
informational
purposes.
Any
differencesin
your
system
hardware,
software
or
configuration
may
affect
your
actual
performance.Test
Case:
RepartitionBasically,
this
test
case
can
stand
for
the
efficiency
of
data
shuffle.NetworkShuffleResult01002003004000200800400
600Input
Rate
(MB/s)P99
Latency
(s)Apache
Spark*Apache
Flink*Apache
Storm*
withoutAck
Apache
Gearpump*Apache
Storm*
with
Ack0200400600800020040
60800Input
0Rate
(MB/s)0Throughput
(MB/s)Apache
Spark*Apache
Flink*Apache
Storm*
withoutAck
Apache
Gearpump*Apache
Storm*
with
AckFor
more
complete
information
about
performance
and
benchmark
results,
visit
/benchmarks.Results
have
been
estimated
or
simulated
using
internal
Intel
analysis
or
architecture
simulation
or
modeling,
and
provided
to
you
for
informational
purposes.
Any
differencesin
your
system
hardware,
software
or
configuration
may
affect
your
actual
performance.Observation
Spark
Streaming
need
to
schedule
task
with
additional
context.
Undertiny
batch
interval
case,
the
overhead
could
be
dramatic
worsecompared
to
other
frameworks.According
to
our
test,
minimum
Batch
Interval
of
Spark
is
about
80ms(140
tasks
per
batch),
otherwise
task
schedule
delay
will
keep
increasingRepartition
is
heavy
for
every
framework,
but
usually
it’s
unavoidable.
Latency
of
Gearpump
is
still
quite
low
even
under
800MB/sinput
throughput.Test
Case:
Stateful
WordCountNative
state
operator
is
supported
by
all
frameworks
weevaluated
Stateful
operator
performance
+
Checkpoint/AckercostResult0204060801002000800400
60P99
Latency
(s)InputRate
(MB/s)
0Apache
Flink*Apache
Spark*Apache
Flink*
without
CP Apache
Storm*Apache
Gearpump*80070060050040030020010000200800400
60Input
Rate
(MB/s)0Throughput
(MB/s)Apache
Spark* Apache
Flink*Apache
Storm*
Gearpump*For
more
complete
information
about
performance
and
benchmark
results,
visit
/benchmarks.Results
have
been
estimated
or
simulated
using
internal
Intel
analysis
or
architecture
simulation
or
modeling,
and
provided
to
you
for
informational
purposes.
Any
differencesin
your
system
hardware,
software
or
configuration
may
affect
your
actual
performance.Observation?Exactly-once
semantics
usually
require
state
management
and
checkpoint.But
better
guarantees
come
at
high
cost.
There
is
no
obvious
performance
difference
in
Flink
when
switchingfault
tolerance
on
or
off.Checkpoint
mechanisms
and
storages
play
a
critical
role
here.Test
Case:
Window
Based
AggregationThis
test
case
manages
a
10-seconds
slidingwindowResult2001801601401201008060402000200800400
60Input
Rate
(MB/s)
0P99
Latency
(s)Apache
Spark*Apache
Flink*Storm*60050040030020010000200800400
60Input
Rate
(MB/s)0Throughput
(MB/s)Apache
Spark*Apache
Flink*Storm*For
more
complete
information
about
performance
and
benchmark
results,
visit
/benchmarks.Results
have
been
estimated
or
simulated
using
internal
Intel
analysis
or
architecture
simulation
or
modeling,
and
provided
to
you
for
informational
purposes.
Any
differencesin
your
system
hardware,
software
or
configuration
may
affect
your
actual
performance.So
which
streaming
frameworkshould
I
use?Do
your
ownbenchmarkHiBench
:
a
cross
platforms
micro-benchmark
suite
for
bigdata
(/intel-hadoop/HiBench)Open
Source
since
2012Better
streaming
benchmark
supporting
will
be
included
in
nextrelease[HiBench
6.0]Legal
DisclaimerNo
license
(express
or
implied,
by
estoppel
or
otherwise)
to
any
intellectual
property
rights
is
granted
by
this
document.Intel
does
not
control
or
溫馨提示
- 1. 本站所有資源如無特殊說明,都需要本地電腦安裝OFFICE2007和PDF閱讀器。圖紙軟件為CAD,CAXA,PROE,UG,SolidWorks等.壓縮文件請下載最新的WinRAR軟件解壓。
- 2. 本站的文檔不包含任何第三方提供的附件圖紙等,如果需要附件,請聯系上傳者。文件的所有權益歸上傳用戶所有。
- 3. 本站RAR壓縮包中若帶圖紙,網頁內容里面會有圖紙預覽,若沒有圖紙預覽就沒有圖紙。
- 4. 未經權益所有人同意不得將文件中的內容挪作商業或盈利用途。
- 5. 人人文庫網僅提供信息存儲空間,僅對用戶上傳內容的表現方式做保護處理,對用戶上傳分享的文檔內容本身不做任何修改或編輯,并不能對任何下載內容負責。
- 6. 下載文件中如有侵權或不適當內容,請與我們聯系,我們立即糾正。
- 7. 本站不保證下載資源的準確性、安全性和完整性, 同時也不承擔用戶因使用這些下載資源對自己和他人造成任何形式的傷害或損失。
最新文檔
- 2024年演出經紀人之演出經紀實務模擬題庫(能力提升)
- 生物-2025年中考考前最后一卷試題押題猜想(陜西卷)
- 八年級英語上冊新教材解讀課件(人教版2024)
- 急性腹痛問診要點2025
- 河南省周口市扶溝縣2023-2024學年七年級下學期7月期末考試英語試題(含答案無聽力音頻及原文)
- 甘肅省酒泉市敦煌中學2024-2025學年高一上學期期中考試數學(B)試卷(含答案)
- 2025年云南中考數學第一次模擬試卷(無答案)
- 2025年廣東省廣州市花都區中考二模道德與法治試卷(含答案)
- 2025室內墻面涂料供貨合同樣本范文
- Tiagabine-d6-NO050328-d-sub-6-sub-生命科學試劑-MCE
- 2025河南開放大學人力資源管理050504期末在線考試答案
- 餐廳投資協議書
- 高二日語考試試卷及答案
- 鋼結構安裝施工記錄 - 副本
- 超市食品安全管理制度手冊
- 海鮮水餃供貨合同協議
- 公共組織績效評估-形考任務二(占10%)-國開(ZJ)-參考資料
- GA/T 2185-2024法庭科學步態信息采集通用技術規范
- 2024年河北省安平縣事業單位公開招聘村務工作者筆試題帶答案
- 非財務人員的財務管理方法與案例
- 2025《廣東省勞動合同書》
評論
0/150
提交評論