Speed Processing - EDSA Online Courses

Speed Processing
The next part to discuss in the lambda architecture is the speed layer. First, a messaging
infrastructure suitable for big data systems is needed to pass data through this layer in a way
that is scalable and fault-tolerant. Then, parallel stream processing is employed to handle
real-time updates to the system. Finally, a database with high write performance is needed to
handle frequent updates to the real-time view.

Big Data Messaging
As indicated in our architecture diagrams, a messaging system is needed to transport data to the
different modules of the big data architecture. It should be capable of handling large amounts of
data quickly and also provide measures to safeguard against data loss. The latter is often
achieved by message buffering, which also allows the playback of messages to components that
need to be restarted or for debugging purposes.
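The buffering and playback idea can be illustrated with a minimal sketch. The class `ReplayableBuffer` and its methods are hypothetical names chosen for this example and are not part of Kafka or any other library; the sketch only assumes that messages are kept in arrival order and addressed by an offset.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical append-only message buffer; consumers read by offset, so messages can be replayed. */
final class ReplayableBuffer {
    private final List<String> log = new ArrayList<>();

    /** Appends a message and returns its offset in the buffer. */
    synchronized long append(String message) {
        log.add(message);
        return log.size() - 1;
    }

    /** Returns all buffered messages from the given offset onwards (inclusive). */
    synchronized List<String> readFrom(long offset) {
        int start = (int) Math.max(0, Math.min(offset, log.size()));
        return new ArrayList<>(log.subList(start, log.size()));
    }

    public static void main(String[] args) {
        ReplayableBuffer buffer = new ReplayableBuffer();
        buffer.append("page-1");
        buffer.append("page-2");
        buffer.append("page-3");
        // A consumer that was restarted after handling offset 0 resumes at offset 1.
        System.out.println(buffer.readFrom(1));
    }
}
```

Because nothing is deleted when a message is read, a restarted component or a debugging session can ask for any earlier offset again.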
In our example application, we have chosen Kafka as a messaging system. Kafka is similar to the
Java Message Service (JMS). However, it is faster, scales within a computing cluster, allows as
many messages as desired, and saves messages persistently on hard disk.
Messaging systems are often based on the publish/subscribe model. This is a communication
model for loosely-coupled processes which are connected via named message queues or
channels. At each channel there can be both message producer processes and message consumer
processes. Producers generate messages and feed them into the channel. Consumers are
informed about messages and receive their contents. Communication is anonymous as
addressing works only by the queue name. Producers and consumers can register dynamically at
a queue.
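The publish/subscribe model can be sketched in a few lines of plain Java. `MiniBroker`, `subscribe`, and `publish` are hypothetical names for this illustration only; a real messaging system such as Kafka adds persistence, partitioning, and network transport on top of this idea.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

/** Hypothetical in-memory broker: producers and consumers only share a channel name. */
final class MiniBroker {
    private final Map<String, List<Consumer<String>>> subscribers = new HashMap<>();

    /** Registers a consumer at a named channel; this can happen at any time. */
    void subscribe(String channel, Consumer<String> consumer) {
        subscribers.computeIfAbsent(channel, k -> new ArrayList<>()).add(consumer);
    }

    /** Delivers a message to every consumer registered at the channel; returns how many there were. */
    int publish(String channel, String message) {
        List<Consumer<String>> consumers = subscribers.getOrDefault(channel, new ArrayList<>());
        for (Consumer<String> consumer : consumers) {
            consumer.accept(message);
        }
        return consumers.size();
    }

    public static void main(String[] args) {
        MiniBroker broker = new MiniBroker();
        List<String> received = new ArrayList<>();
        broker.subscribe("pages", received::add);
        broker.subscribe("pages", m -> received.add(m.toUpperCase()));
        broker.publish("pages", "hello");
        System.out.println(received);
    }
}
```

The producer never learns who consumed the message, which is the anonymity described above: the channel name is the only shared address.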

Parallel Stream Processing
A big data application usually makes use of several parallel data streams. In the speed layer,
these have to be processed quickly to give the user immediate access. This requirement,
however, means that buffering large parts of the input before running computations is not
feasible in this part of the architecture. Speed layer processing is best modelled as stream
processing in which each processor has only a constant amount of memory, independent of the
size of the data stream.
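As a small illustration of processing with constant memory, the following sketch keeps a running mean over a stream without storing any earlier values. The class name `RunningMean` is hypothetical and the example is not taken from the application discussed here.

```java
/** Running statistics over a stream: memory use does not grow with the number of items seen. */
final class RunningMean {
    private long count = 0;
    private double mean = 0.0;

    /** Folds one new value into the mean without storing earlier values. */
    void accept(double value) {
        count++;
        mean += (value - mean) / count;
    }

    long count() {
        return count;
    }

    double mean() {
        return mean;
    }

    public static void main(String[] args) {
        RunningMean stats = new RunningMean();
        for (double v : new double[] {2.0, 4.0, 6.0}) {
            stats.accept(v);
        }
        System.out.println(stats.count() + " values, mean " + stats.mean());
    }
}
```

Only two fields are kept regardless of how long the stream runs, which is the property the speed layer relies on.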
Apache Storm is a real-time computation system suitable for using computer clusters to process
unbounded streams of data. It is thus the speed layer equivalent of Hadoop for the batch layer.
The basic concept in Storm is the topology. A topology comprises the logic for a speed layer
function. It is analogous to a MapReduce job. Its components are streams (data sequences),
spouts (data sources), and bolts (processing units).
In Storm, data flows in so-called streams, which are defined as possibly unbounded sequences of
tuples. As in the approaches above, a tuple contains a fixed number of objects. Streams
can be named and connect spouts (the data sources in Storm) and bolts (the data processors).
Spouts feed data streams into the system and there are a number of predefined spouts in the
Storm libraries. For example, there is a KafkaSpout feeding data from the Kafka messaging
pipeline into Storm topologies. Each spout is informed whether a tuple that it has fed
into a topology has passed the topology (ACK) or failed (FAIL) within a predefined period of
time. This allows handling data loss by resending tuples or by other means.
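The bookkeeping behind ACK, FAIL, and resending can be sketched without Storm. The following plain-Java class is a hypothetical simplification, not Storm's API: `PendingTracker` and its methods are names invented for this example, and time is passed in explicitly to keep the sketch deterministic.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical bookkeeping of a reliable data source: unacknowledged tuples are replayed. */
final class PendingTracker {
    private static final class Pending {
        final String payload;
        final long emittedAtMillis;

        Pending(String payload, long emittedAtMillis) {
            this.payload = payload;
            this.emittedAtMillis = emittedAtMillis;
        }
    }

    private final Map<Long, Pending> pending = new HashMap<>();
    private final long timeoutMillis;

    PendingTracker(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Remembers a tuple when it is fed into the topology. */
    void emitted(long id, String payload, long nowMillis) {
        pending.put(id, new Pending(payload, nowMillis));
    }

    /** ACK: the tuple passed the topology and can be forgotten. */
    void ack(long id) {
        pending.remove(id);
    }

    /** Returns the payloads that were not acknowledged within the timeout. */
    List<String> dueForResend(long nowMillis) {
        List<String> replay = new ArrayList<>();
        for (Pending p : pending.values()) {
            if (nowMillis - p.emittedAtMillis >= timeoutMillis) {
                replay.add(p.payload);
            }
        }
        return replay;
    }

    public static void main(String[] args) {
        PendingTracker tracker = new PendingTracker(50);
        tracker.emitted(1L, "page-1", 0);
        tracker.emitted(2L, "page-2", 0);
        tracker.ack(1L); // page-1 passed the topology
        System.out.println(tracker.dueForResend(100)); // page-2 is still pending
    }
}
```

In this sketch a missing ACK and an explicit FAIL are treated alike: whatever is still pending after the timeout is handed back for another attempt.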
Bolts are programmable filters solving tasks like aggregation, analysis, and storage. They
receive tuples from input streams, process them, and may emit new tuples to output streams. A
bolt acknowledges (ACK) when a tuple has been received and processed. Spouts and bolts are
executed as many tasks across the cluster. The programmer specifies the degree of parallelism
per spout and per bolt. Storm then distributes and monitors the instances in the cluster. It is also
possible to simulate a local cluster on a single machine to allow local development and
debugging in Eclipse.
Similar to how data is distributed by Hadoop, Storm performs stream grouping to decide which
tuple goes to which task. The topology determines the stream grouping for every consumer.
There are several grouping options to select from:
• If shuffle grouping is selected, tuples are distributed at random.
• In fields grouping, consistent hashing on a subset of tuple fields governs tuple
  distribution.
• All grouping sends each tuple to all tasks.
• Global grouping always picks the task with the smallest id for a tuple.
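The four grouping options can be compared in a small sketch that returns the indices of the target tasks. This is not Storm code: `Groupings` is a hypothetical helper, tasks are assumed to be numbered from 0, and the fields variant uses a plain hash modulo the task count as a simplification of the hashing scheme.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Hypothetical task selection for the four groupings; each method returns target task indices. */
final class Groupings {
    /** Shuffle grouping: one task chosen at random. */
    static List<Integer> shuffle(int numTasks, Random random) {
        return Collections.singletonList(random.nextInt(numTasks));
    }

    /** Fields grouping: tuples with equal key fields always map to the same task. */
    static List<Integer> fields(int numTasks, Object... keyFields) {
        return Collections.singletonList(Math.floorMod(Arrays.hashCode(keyFields), numTasks));
    }

    /** All grouping: every task receives the tuple. */
    static List<Integer> all(int numTasks) {
        List<Integer> targets = new ArrayList<>();
        for (int i = 0; i < numTasks; i++) {
            targets.add(i);
        }
        return targets;
    }

    /** Global grouping: always the task with the smallest id, assumed here to be 0. */
    static List<Integer> global() {
        return Collections.singletonList(0);
    }

    public static void main(String[] args) {
        System.out.println("shuffle: " + shuffle(4, new Random()));
        System.out.println("fields:  " + fields(4, "Audi") + " and again " + fields(4, "Audi"));
        System.out.println("all:     " + all(4));
        System.out.println("global:  " + global());
    }
}
```

Fields grouping is the option to pick when a bolt keeps per-key state, because every tuple for a given key then reaches the same task.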
The following figure gives an overview of how to apply the concepts of Storm to create the speed
function for our example application.
Figure 1: Workflow overview for the speed layer of our example application in Storm.
In this application, Storm receives a stream of XML pages and creates a stream of postings
annotating them with metadata concerning joy or annoyance.
The first component of the workflow is a spout reading the pages from the Kafka messaging
pipeline. This spout is connected to a chain of bolts realising the subsequent computations:
• The PostingsDetectorBolt decomposes a page into a stream of postings.
• The EmotionAnnotatorBolt analyses each posting with regard to emotions.
• The MetaDataGeneratorBolt creates metadata to fill the database columns (Forum, emo,
  Date, ...).
• The CassandraJDBCBolt writes each posting which is annotated by metadata into the
  Cassandra database.
The spouts and bolts are combined into a topology as follows:
Figure 2: Constructing the Storm topology for our example.
Within our application, the bolts are stateless (they keep no memory between tuples) and are
connected with "shuffle" grouping. Let's look at some source code for a bolt. Here is the
EmotionAnnotatorBolt encapsulating the emotion detector.
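Bolt (bigdata-showcase-storm-emotions/..storm.bolts/EmotionAnnotaterBolt):
    // Imports assume Storm 1.x or later; older releases use the backtype.storm packages.
    import java.util.Map;
    import org.apache.storm.task.OutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.IRichBolt;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Tuple;

    public class EmotionAnnotatorBolt implements IRichBolt {
        private OutputCollector collector;
        private EmotionDetector detector;

        @Override
        public void prepare(Map map, TopologyContext topologyContext, OutputCollector collector) {
            // Is executed once
            this.collector = collector;
            this.detector = new EmotionDetector(); // Analysis class, same as in Hadoop/Cascading
        }

        @Override
        public void execute(Tuple tuple) {
            // Is executed for each arriving tuple, ...
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // Defines output stream(s)
            declarer.declare(new Fields("json")); // Tuple with a field named "json"
        }

        // cleanup() and getComponentConfiguration(), also required by IRichBolt, are omitted here.
    }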
Now let us have a closer look at the execute function to see how incoming tuples are processed
and new output tuples are created.
Bolt (bigdata-showcase-storm-emotions/..storm.bolts/EmotionAnnotaterBolt):
    @Override
    public void execute(Tuple tuple) {
        // posting is contained in tuple field "cas"
        String text = tuple.getStringByField("cas");
        // posting is analysed, result is returned as JSON
        DataTuple process = detector.process(new DataTuple(text));
        Values values = new Values();       // create a Storm tuple for output
        values.add(process.getString(0));   // with a field JSON text
        collector.emit(values);             // send output tuple to output stream
        collector.ack(tuple);               // Done!
    }
Similar to Cascading for Hadoop, there is a higher abstraction level on top of Storm called
Trident. It provides features such as batching, states, and higher level workflow operators.

Real-time Views
In speed processing, high write performance of the database is of the essence lest the database
become a bottleneck of the speed layer. Apache Cassandra is a big data database with high write
performance. It exemplifies many of the aspects relevant for real-time views. It scales
automatically, is flexibly extendable, and elastic. It is partition tolerant, has no single point of
failure and the consistency levels are customisable by the user. Most importantly, it has high
write performance.
Cassandra provides a distributed key-value store with basic operations to insert (PUT), retrieve
(GET), and delete data. These operations work based on timestamps as described above.
A central issue in distributed databases is data replication. In Cassandra, replication is
implemented as a hash ring. Computers are organised in a ring and each computer has an id.
The main replica of a data item is stored on a machine identified based on the hash value of the
data key. Additional replicas are stored clockwise in the ring. Cassandra provides two main
parameters to influence the replication strategy:
• Replication factor: This indicates how many replicas of each key-value pair are stored.
• Strategy class: This determines whether replicas are stored successively along the ring
  or whether the network topology is taken into account, so that replicas can be placed in
  different data centres or in different racks of a single data centre.
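The placement of replicas on a hash ring can be sketched as follows. This is a simplified model of successive placement along the ring, not Cassandra's implementation: `HashRing`, the integer ring positions, and the use of `String.hashCode()` are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Hypothetical hash ring: nodes sit at integer positions; replicas are placed clockwise. */
final class HashRing {
    private final TreeMap<Integer, String> ring = new TreeMap<>(); // position -> node name
    private final int ringSize;

    HashRing(int ringSize) {
        this.ringSize = ringSize;
    }

    void addNode(int position, String name) {
        ring.put(position, name);
    }

    /** Maps a key to a ring position and returns the nodes holding its replicas. */
    List<String> replicasFor(String key, int replicationFactor) {
        return replicasAt(Math.floorMod(key.hashCode(), ringSize), replicationFactor);
    }

    /** Main replica: first node at or after the position; further replicas follow clockwise. */
    List<String> replicasAt(int position, int replicationFactor) {
        List<String> result = new ArrayList<>();
        int wanted = Math.min(replicationFactor, ring.size());
        Integer current = ring.ceilingKey(position);
        while (result.size() < wanted) {
            if (current == null) {
                current = ring.firstKey(); // wrap around the ring
            }
            result.add(ring.get(current));
            current = ring.higherKey(current);
        }
        return result;
    }

    public static void main(String[] args) {
        HashRing ring = new HashRing(100);
        ring.addNode(0, "A");
        ring.addNode(25, "B");
        ring.addNode(50, "C");
        ring.addNode(75, "D");
        System.out.println(ring.replicasAt(60, 2)); // main replica on D, second replica wraps to A
    }
}
```

A topology-aware strategy would skip candidates on the same rack or in the same data centre while walking the ring; this sketch omits that step.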
All data is combined with a timestamp. The GET operation is evaluated such that it returns, for
a given key, the value with the newest timestamp among the replicas that could be read.
Timestamps are set by the client during the PUT operation.
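The rule that the newest timestamp wins can be written down in a few lines. The names `NewestWins` and `Versioned` are hypothetical; the sketch only models how one value is chosen from the replica responses, not how Cassandra contacts replicas.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

/** Hypothetical read resolution: among the replica responses, the newest timestamp wins. */
final class NewestWins {
    static final class Versioned {
        final String value;
        final long timestamp; // set by the client during the write

        Versioned(String value, long timestamp) {
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    /** Returns the value with the newest timestamp, or empty if no replica answered. */
    static Optional<String> get(List<Versioned> responses) {
        Versioned newest = null;
        for (Versioned v : responses) {
            if (newest == null || v.timestamp > newest.timestamp) {
                newest = v;
            }
        }
        return newest == null ? Optional.empty() : Optional.of(newest.value);
    }

    public static void main(String[] args) {
        List<Versioned> responses = Arrays.asList(
                new Versioned("old", 10), new Versioned("new", 20), new Versioned("mid", 15));
        System.out.println(get(responses).orElse("<no replica answered>"));
    }
}
```

Since the client supplies the timestamps, the outcome depends on reasonably synchronised client clocks.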
Cassandra can be configured to use the quorum concept to ensure consistency. In this case, write
operations are only considered successful if more than half of the replicas are written
successfully (acknowledged by the data nodes). Write operations are atomic on the row level,
i.e., writing or updating the values in a single row is considered a single write operation.
Applications must be designed to avoid concurrent updates. If present, concurrent updates must
not cause problems within the application. In our application example, Storm only writes to the
database and every line is written exactly once. The join server on the application backend only
reads from the database.
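The quorum condition amounts to simple arithmetic, sketched here with a hypothetical helper class `Quorum`; the method names are illustrative and not part of any Cassandra client API.

```java
/** Hypothetical quorum arithmetic: a write needs acknowledgements from more than half the replicas. */
final class Quorum {
    /** Smallest number of acknowledgements that is more than half of the replicas. */
    static int required(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    /** True if enough replicas acknowledged the write. */
    static boolean writeSucceeded(int acknowledgements, int replicationFactor) {
        return acknowledgements >= required(replicationFactor);
    }

    public static void main(String[] args) {
        System.out.println("replication factor 3 needs " + required(3) + " acknowledgements");
        System.out.println("2 of 4 acknowledged, success: " + writeSucceeded(2, 4));
    }
}
```

With a replication factor of 3, one replica may be unavailable and writes still succeed; with 4 replicas, exactly half is not enough.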
Even though Cassandra is a key-value store, it supports tables with columns and a custom
structured query language called CQL with SQL as its role model. Initially, columns are just
names attributed to the different elements of the data tuples, which are then assumed to be of
equal length. Since CQL 3 there is limited support for queries with WHERE clauses as in SQL.
However, such queries can only be answered sufficiently fast if the WHERE clause respects the
data partitioning. Under certain conditions, it is easy to extract consecutive data from a column.
In this context, the order of columns is important. In practice, this means that WHERE clauses can
include comparisons on columns other than the first but only as long as all previous key
component columns have already been identified with strict equality comparisons. The restriction
on the last given key component column can then be any sort of comparison.
To close our discussion, let us have a look at what some of the queries used in our example
application look like in CQL. First, we create a table for forum postings:
    CREATE TABLE postings (
        forum text,
        emo text,
        date text,
        id text,
        json text,
        PRIMARY KEY (forum, emo, date, id)
    );
Note how the primary key is given to support queries with WHERE clauses.
Now we can do queries on this table:
    SELECT COUNT(*) FROM postings;
    SELECT id, forum, emo, date FROM postings WHERE forum = 'Audi';
    SELECT id, forum, emo, date FROM postings WHERE
        forum = 'Audi' AND emo = 'pos' AND date > '200120101';
The last query shows how WHERE queries need to follow the order of columns in the primary
key. The larger-than condition on the date field is only possible because equality conditions are
given for all the preceding columns of the primary key.
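The ordering rule for WHERE clauses can be expressed as a small check. This is a sketch of the rule as stated in this text, not of Cassandra's actual query validation: `WhereClauseRule` and `isSupported` are hypothetical names, and only restrictions on primary key columns are modelled.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical check of the ordering rule for WHERE clauses on primary key columns. */
final class WhereClauseRule {
    /**
     * @param keyColumns   primary key columns in declaration order
     * @param restrictions key column -> comparison operator ("=", ">", "<", ...)
     */
    static boolean isSupported(List<String> keyColumns, Map<String, String> restrictions) {
        if (!keyColumns.containsAll(restrictions.keySet())) {
            return false; // restriction on a column outside the primary key
        }
        boolean moreAllowed = true;
        for (String column : keyColumns) {
            String operator = restrictions.get(column);
            if (operator == null) {
                moreAllowed = false; // gap: later key columns must stay unrestricted
            } else if (!moreAllowed) {
                return false; // a preceding key column lacks an equality restriction
            } else if (!operator.equals("=")) {
                moreAllowed = false; // a non-equality comparison must be the last restriction
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> key = Arrays.asList("forum", "emo", "date", "id");
        Map<String, String> query = new LinkedHashMap<>();
        query.put("forum", "=");
        query.put("emo", "=");
        query.put("date", ">");
        System.out.println("forum =, emo =, date > supported: " + isSupported(key, query));
    }
}
```

Applied to the postings table, a clause with `forum =` and `date >` but no restriction on `emo` is rejected by this check, whereas the three queries shown above pass.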