




BAD ACTORS IN NEWS REPORTING
TRACKING NEWS MANIPULATION BY STATE ACTORS
Christian Johnson | William Marcellino
The global spread of the coronavirus disease 2019 (COVID-19) created fertile ground for attempts to influence and destabilize different populations and countries. In response to this, RAND Corporation researchers conducted a proof-of-concept study for detecting these efforts at scale. Marrying a large-scale collection pipeline for global news with machine-learning and data analysis workflows, the RAND team found that both Russia and China appear to have employed information manipulation during the COVID-19 pandemic in service to their respective global agendas. This report is the second in a series of two reports; the first (Matthews, Migacheva, and Brown, 2021) focused on qualitative and descriptive analysis of the same data referred to in this report. Here, we describe our analytic workflows for detecting and documenting state-sponsored malign and subversive information efforts, and we report quantitative results that support the qualitative findings from the first report.
Introduction
As part of our analysis, we searched for both differences and similarities in the topics discussed by Russian, Chinese, and Western news media, and we found that conspiracy theories and geopolitical posturing were relatively common in Russian and Chinese news articles compared with Western (U.S. and UK) articles. The work we describe here lays the foundation for a robust protective capability that detects and sheds light on state-actor information manipulation and misconduct in the global arena.
Disinformation, Propaganda, and Truth Decay
The world is experiencing a crisis related to disagreements over the established truth, a phenomenon that RAND refers to as Truth Decay, a shift in public discourse away from facts and analysis that is caused by four interrelated drivers (Rich and Kavanagh, 2018):
- an increasing disagreement about facts and analytical interpretations of facts and data
- a blurring of the line between opinion and fact
- an increasing relative volume, and resulting influence, of opinion and personal experience over fact
- a declining trust in formerly respected sources of factual information.
Truth Decay is a serious threat to both domestic U.S. and international security, one that is being exacerbated by malign efforts from a variety of national bad actors. These ill-intentioned efforts to misuse information are labeled many ways: readers might have seen these efforts labeled as disinformation, misinformation, fake news, and information operations. For clarity and consistency’s sake, we use the definitions taken from Rich and Kavanagh, 2018, in the remainder of this paper. Our definition of conspiracy theories comes from Douglas et al., 2019. (See the Key Information Definitions box.)
KEY INFORMATION DEFINITIONS
Topic: Definition
Disinformation: False or misleading information spread intentionally, usually to achieve some political or economic objective, influence public attitudes, or hide the truth (a synonym for propaganda)
Misinformation: False or misleading information that is spread unintentionally, by error or mistake
Conspiracy theories: Information that attempts to explain the ultimate causes of significant social and political events and circumstances with claims of secret plots by two or more powerful actors
Fake news: Newspaper articles, television news shows, or other information disseminated through broadcast or social media that are intentionally based on falsehoods or that intentionally use misleading framing to offer a distorted narrative
News Manipulation from Both China and Russia
We found that during the COVID-19 pandemic, both Russia and China engaged in news manipulation that served their geopolitical goals.¹ Although English-language news media from both nations did engage in traditional reporting on COVID-19 (reporting on infection, death rates, and medical responses globally), they also conducted distinct media efforts that appear to be politically driven news manipulation. We found that Russian media advanced anti-U.S. conspiracy theories about the virus and that Chinese media advanced pro-China news that laundered Beijing’s reputation in terms of COVID-19 response. Additionally, we found that early in the pandemic, Russian media supported China’s effort to burnish its reputation.
In total, three main pillars of Chinese and Russian news about the COVID-19 pandemic were identified. First, unsurprisingly, Chinese and Russian news agencies reported on stories with broad interest, that is, news topics covered similarly by Western news agencies. Good examples of this pillar are articles describing the case numbers and deaths related to COVID-19.
The second pillar of news stories consists of articles that perform geopolitical reputation-laundering on behalf of Russia and China. Many Chinese news articles, for example, praise China’s handling of the pandemic and highlight its donations of aid to foreign countries. Interestingly, Russian news praises China in a similar way. Russian news also appeared to downplay the original COVID-19 outbreak in Wuhan. (We consider the interaction between these different pillars later in this report.)
Finally, Russian and Chinese news agencies promoted conspiracy theories regarding COVID-19 and the public health measures implemented to contain it. Examples of news in this pillar are the suggestion that COVID-19 is a bioweapon or otherwise engineered in a laboratory or the idea that contact-tracing efforts are part of an effort by government and technology companies to track citizens.
The success of our proof-of-concept study supports the idea that existing, off-the-shelf natural language processing methods can be used to make sense of news reporting by nation, at a global scale. These methods, linked to a scalable infrastructure that ingests news from around the world, could create a U.S.-supported capability to detect news manipulation at the nation-state level. In place of attempts to identify individual news stories or sources that are unreliable, such a capability could make manipulation of the broader news landscape publicly visible. Automated summarizing of a nation’s news output at an aggregate level would quickly uncover a manipulation effort, for example, the spreading of a conspiracy theory that contact-tracing programs are part of a government tracking effort. (This is a real example that Russian news sources spread and that our model detected.)
We have several reasons for choosing to focus our analysis on data aggregated at the nation-state level (as opposed to, for example, the individual news outlet level). First, we viewed this study as an extension of prior work looking at nation-state-level disinformation efforts (Marcellino, Johnson, et al., 2020; Marcellino, Marcinek, et al., 2020). These prior works looked at nation-state actors engaged in broad disinformation efforts to interfere with elections, and we looked specifically at state manipulation of news media during a pandemic. Second, key features that present themselves only at the national level were of interest: Most importantly, the United States and United Kingdom have robust, independent presses while Russia and China exert state control over their news media. A separate and equally compelling analysis would examine potential news disinformation within nations (for example, by partisan news sources in the United States). It is likely that such an analysis would find significant differences between individual outlets that are worth exploring, especially through the lens of political polarization in the United States; partisan news has previously been identified as a driver of Truth Decay (Rich and Kavanagh, 2018).
¹ By news manipulation, we mean that news articles were published to further the agenda of a state sponsor rather than to inform the public. These articles are therefore subject to pressures beyond the standard editorial control of a news agency.
A potential limitation of this work is that we focused only on English-language articles. Russia and China are not majority English-speaking, so we are comparing news stories aimed at domestic audiences (U.S. and UK) with ones aimed at foreign audiences (Russian and Chinese). Insofar as the news outlets are trying to influence English-speaking people, however, we feel that they can be usefully compared. Cross-linguistic comparison of domestically oriented reporting is another potential line of future research.
Given the effectiveness of combining existing off-the-shelf methods in our report, a public system for monitoring global news that detects and describes global news themes by nation is plausible. Such a system could help guard against Truth Decay efforts from malicious state actors. The system also could analyze additional sources of data, such as social media posts, to understand both the narratives being pushed and which ones take hold. More insight could also be garnered by performing deeper analysis at the individual news agency level: Different online communities are likely to respond differently to similar news stories, depending on which source they originate from, for example. More discussion of such a news monitoring system can be found in the Discussion section.
Methodology
Identifying disinformation in a large, complex dataset is not a simple task. The word disinformation is a catchall term used to refer to an array of different phenomena, from “fake news,” to opinion pieces masquerading as journalism, to legitimate news stories that heap inordinate attention on certain topics (while ignoring others). As described in the definitions box, disinformation is used to refer to the deliberate spreading of misleading or incorrect information; misinformation refers to honest but incorrect knowledge. However, the line between the two can sometimes be blurred; prior RAND work (Marcellino, Johnson, et al., 2020) showed that coordinated bot activity was likely used deliberately in the run-up to the 2020 U.S. presidential election to amplify authentic tweets and make them appear more popular than they really were (commonly called astroturfing) in an attempt to create a false impression of grassroots spread. Our goal, therefore, was not to detect disinformation per se, but to identify when and how Russian and Chinese news media appear to be manipulated by forces outside the normal news cycle and editorial processes. Because our dataset featured many articles from a variety of U.S. and UK media, we make the key assumption that newsworthy stories will be covered by these Western outlets; instances in which Russian and Chinese media cover stories that are qualitatively different from those covered by Western media are worthy of more scrutiny to determine whether they could be part of a disinformation campaign.
Computational techniques have previously been used by researchers to study dissemination of fake news, particularly on Twitter. Grinberg et al., 2019, demonstrated that fake news in the lead-up to the 2016 U.S. presidential election was seen and shared primarily by a relatively small number of Twitter users, consisting primarily of both highly conservative and cyborg accounts.² Using a similar methodology, Lazer et al., 2020, found that the same conclusions essentially held true for the spread of fake news related to COVID-19. Marcellino, Johnson, et al., 2020, used a different methodology to determine that bot-like accounts likely played a significant role in spreading far-right conspiracy theories and disinformation leading up to the 2020 election. In short, the available research suggests that much of the disinformation on social media is spread by a relatively small number of malign users.
These studies have mostly examined metadata and derived features to draw their conclusions instead of studying the language of disinformation itself.³ This paper builds on existing research to study not only metadata about news, but the actual content of the news itself. We hoped that understanding the topical themes being spread via disinformation would lead to new insights that cannot be seen simply by looking at user engagement on social networks, such as Twitter.
The first report in this series identified several key markers of disinformation in Russian and Chinese news: conspiracy theories, geopolitical posturing, and anti-U.S. messaging.
Although we hoped that a data-driven approach would replicate these findings, we sought to perform our analysis as blindly as possible; that is, we did not seek to confirm our suspicions and simply search the data to find conspiracy theories. Instead, we used algorithms to detect the dominant themes in the data and only then analyzed these themes to determine their content.
Our overall strategy, as mentioned earlier, rested on the idea that any disinformation published by Russian and Chinese news sources would be detectable because its content would differ meaningfully from the content in U.S. and UK news articles. Certainly, some differences in content are to be expected under a no-manipulation hypothesis: For example, Russian news sources might be more likely to cover stories about Eastern Europe than news from the United States, simply because of geographical proximity. However, we hypothesized that by inspecting these differences closely, we would be able to uncover patterns associated with manipulation. Ultimately, any differences between Western and non-Western news articles would also require human analysis to determine whether the differences were innocuous or malign.
Data Description
We used NewsAPI to collect all English-language articles from 43 news sources (nine of which are Russian, five Chinese, 27 U.S., and two UK) for the period January 1, 2020, through August 31, 2020, that featured either “coronavirus” or “COVID” in the text.⁴ This resulted in a total of 247,315 articles, the vast majority of them (230,865) from U.S./UK sources, with smaller numbers from Russian (14,309) and Chinese (2,141) sources. (We provide a more detailed breakdown of articles published by news outlet in the Appendix.)
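As a quick consistency check on the counts quoted above, the short sketch below sums the per-group article and source counts and prints each group's share of the corpus. The dictionary names are illustrative only; the numbers are the ones stated in the text.

```python
# Article and source counts as quoted in the Data Description.
articles = {"U.S./UK": 230_865, "Russian": 14_309, "Chinese": 2_141}
sources = {"Russian": 9, "Chinese": 5, "U.S.": 27, "UK": 2}

total_articles = sum(articles.values())
total_sources = sum(sources.values())

# Share of the corpus contributed by each group of outlets.
shares = {group: count / total_articles for group, count in articles.items()}

print(f"{total_articles} articles from {total_sources} sources")
for group, share in shares.items():
    print(f"{group}: {share:.1%}")
```

The totals match the 247,315 articles and 43 sources reported. The U.S./UK corpus is roughly 16 times the size of the Russian one, which in turn is roughly 7 times the size of the Chinese one, a skew that matters for the modeling choices described later.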
For our search period, the overall frequency of published articles with either term mentioned grew rapidly through January and February, reaching a peak in March and April. Article frequency by country of origin is shown over time in Figure 1. A similar pattern was seen in publishing frequency over time across U.S./UK, Russian, and Chinese sources, although Russian news sources appeared to publish somewhat less frequently in mid to late February. More-detailed analysis of this apparent Russian slowdown is described later in this report.
² A cyborg account is one that mixes automated bot activity with real human tweets.
³ Derived features refers to such things as the presence of fake news URLs in a Twitter feed.
⁴ NewsAPI is an application programming interface that allows users to automatically connect to and search a large database of news articles, including newswire services (an important advantage over such rival sources as LexisNexis). RAND has built a scalable infrastructure to retrieve, store, query, and then analyze very large news article datasets. This scalable architecture is a powerful tool that allows us to gather an enormous amount of news data for analysis, but it also has a constraint: We can collect only news articles from sources covered by the service, which does not include sources that are behind paywalls or otherwise restricted in access. For our study, in particular, only nine Russian and five Chinese sources in English are covered by NewsAPI.
FIGURE 1
Article Frequency over Time in 2020
[Figure: articles per day (logarithmic y-axis, 10⁻¹ to 10³) by month, January through August 2020, for U.S./UK, Russian, and Chinese news.]
NOTE: The moving seven-day average publishing rate is overlaid on each source as a solid line. Note that the y-axis is logarithmically scaled; we have about an order of magnitude fewer Russian news articles than U.S./UK, and about an order of magnitude fewer Chinese articles than Russian.
Because the COVID-19 pandemic was such an impactful worldwide event, we were not surprised to find that news stories about many other topics, such as those that were dominantly about economic or political stories, were also represented in our dataset because they also referenced the pandemic in some way. Indeed, a cursory examination of random articles in our dataset showed that the majority were focused on a different (nonpandemic) topic, although the pandemic played a significant role in many of these articles.
We decided to model how this assortment of different subjects varied across Russian, Chinese, and Western news media. If we could determine that certain topics were being discussed quite often by Russian or Chinese news but rarely by Western outlets, that would suggest the need for additional examination and might even be indicative of a malign effort to push certain narratives.
Natural language processing, the branch of machine learning that deals with unstructured text, has a variety of techniques for performing this kind of topic modeling. We decided to use Latent Dirichlet Allocation (LDA), a widely used method that simultaneously identifies the topics associated with each article, along with the words associated with each topic. LDA has the advantage of being relatively fast to perform on a large dataset and produces results that are easily understood and interpreted by humans. Our LDA model was built using gensim version 3.8.3 (Rehurek and Sojka, 2011), and text preprocessing was performed with the Natural Language Toolkit (Loper and Bird, 2002).
To supplement the LDA model, we also considered how to analyze the ways that news changed over time, using time-series clusters. The news cycle is constantly in flux, and malign foreign entities are liable to change narratives over time as different topics become more important. We therefore expected differences not only in what was being discussed by the news media, but when it was being reported. Our time-series model, described in more detail later in this report, yields clusters of words that rise and fall in frequency simultaneously. As in LDA, human inspection of the keywords is necessary to assign meaning to the clusters. We used the same preprocessing steps (stopword removal, stemming, and tokenization) for both our LDA and time-series clustering models.
Each of our methods (LDA and time-series modeling) yielded groups of articles and words associated with a particular topic. We found that both methods generated a reasonably small number of clusters (about 20–50 unique topics, depending on the method and whether the model was analyzing Russian, Chinese, or Western news articles) that could then be studied individually. We found that meaning was somewhat simpler to assign to the time-series clusters because the words in each cluster of interest were typically closely related. Moreover, spikes in the frequency of different words were usually easy to correlate with real-world events. On the other hand, the time-series clusters were observed to be somewhat biased toward rare or infrequent words. This made the time-series data more useful for identifying niche subjects like conspiracy theories but less useful for broader trends.
LDA Modeling
LDA identifies topics by the collection of words that appear within documents. It is a bag-of-words technique; that is, it considers only which words appear in a document, not the order in which they appear. The algorithm takes as its input a series of documents and a predetermined number of topics to identify, then assigns each word and each document to one or more topics in such a way as to produce topics as homogeneous as possible. For instance, the word Brexit appears almost exclusively in news articles about the United Kingdom and Europe, so that word is assigned to that topic, as are articles that use that word. Both documents and words can be assigned to multiple topics: It is easy to imagine a news article that discusses both the U.S. economy and the presidential election. However, LDA favors assigning a handful of topics at most to each word or document, under the assumption that each document has a relatively singular focus.
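To make the assignment step concrete, the sketch below fits LDA to four toy documents with collapsed Gibbs sampling in plain Python. This is an illustration only, not the workflow used in the study: gensim fits LDA with a variational method rather than Gibbs sampling, and the function name `fit_lda`, the hyperparameters `alpha` and `beta`, and the toy documents are all assumptions made for this example.

```python
import random
from collections import Counter

def fit_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA to tokenized documents with collapsed Gibbs sampling (toy scale)."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    # z[d][i] is the topic currently assigned to token i of document d.
    z = [[rng.randrange(num_topics) for _ in doc] for doc in docs]
    doc_topic = [Counter(zd) for zd in z]                # topic counts per document
    topic_word = [Counter() for _ in range(num_topics)]  # word counts per topic
    topic_total = [0] * num_topics
    for doc, zd in zip(docs, z):
        for w, k in zip(doc, zd):
            topic_word[k][w] += 1
            topic_total[k] += 1
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # Remove the token's current assignment from all counts.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # A topic is favored if the document and the word already use it.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta) / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                z[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    # Smoothed topic mixture for each document; each row sums to 1.
    theta = [
        [(doc_topic[d][t] + alpha) / (len(doc) + num_topics * alpha)
         for t in range(num_topics)]
        for d, doc in enumerate(docs)
    ]
    top_words = [
        [w for w, c in topic_word[t].most_common(3) if c > 0]
        for t in range(num_topics)
    ]
    return theta, top_words

# Hypothetical, already-preprocessed documents (word order is irrelevant to LDA).
docs = [
    ["virus", "case", "death", "hospital"],
    ["case", "death", "virus", "test"],
    ["oil", "price", "barrel", "market"],
    ["market", "oil", "price", "opec"],
]
theta, top_words = fit_lda(docs, num_topics=2)
for mixture in theta:
    print([round(p, 2) for p in mixture])
print(top_words)
```

The small `alpha` value plays the role described above: it rewards concentrating each document on a few topics rather than spreading it evenly across all of them.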
We trained three separate LDA models, one for each of our Russian, Chinese, and Western news datasets. Standard preprocessing techniques were applied to the data: The text of each article was stemmed and tokenized, and stopwords were removed.⁵
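The preprocessing steps can be sketched in a few lines of standard-library Python. This is a simplified stand-in rather than the toolchain used in the study: `STOPWORDS` is a tiny illustrative list, and `crude_stem` is a hypothetical suffix stripper standing in for a real stemmer such as the Porter Stemmer.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "and", "of", "a", "in", "to", "is", "are"}

def crude_stem(word):
    """Strip a few common suffixes. A toy stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase, tokenize on letter runs, drop stopwords, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]

a = preprocess("The virus is spreading in the city")
b = preprocess("In the city, the virus spreads")
print(a)                          # ['viru', 'spread', 'city']
print(Counter(a) == Counter(b))   # True
```

The two sentences differ in word order and inflection but reduce to the same bag of words, which is all that a bag-of-words model such as LDA sees.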
We chose to allocate 20 topics for each country. This choice was initially made because it seemed a reasonable compromise between capturing many niche topics and splitting larger topics into duplicates, but we also computed topic coherence for a range of eight to 28 topics. We found a modest peak in coherence at 20 topics, but the results were quite noisy and little significant improvement was seen for more or fewer topics. We concluded that 20 topics for each of our datasets was a reasonable choice.
We initially trained a single LDA model on all of our text instead of training three separate models. We found the resulting topics to be dominated by U.S. news articles, which makes sense because the dataset was heavily skewed toward U.S. outlets. For example, articles from Russian and Chinese sources were grouped together with many other articles from U.S. sources about U.S. foreign policy. Because we were specifically interested in the narratives being promoted by Russian and Chinese sources, we decided to build three separate LDA models, which immediately resulted in a far more granular look at the different themes in these articles of interest.
To assign a meaningful label to each topic, we then identified the 20 articles with the highest score associated with each of our 20 different topics and read each of the headlines.⁶ We also observed the 20 words most strongly associated with each topic. In most cases, reading the headlines was enough to quickly identify the topic of discussion.
In a few cases, topics were more difficult to discern. Sometimes this was because two separate article subjects were placed into a single topic. In our Western news category, for example, we found that international news stories and articles about Black Lives Matter (or BLM) protests were placed together in the same topic. This kind of conflation is not unexpected; the two topics share many keywords: “protest,” “state,” “law,” “people,” “security,” and so forth.
Meanwhile, we found several topics whose contents appeared to be duplicated in our Chinese dataset, probably because of the relatively small number of articles. The LDA algorithm had to split some relatively cohesive topics to reach a total of 20 topics. This was less likely to occur in our Western and Russian datasets because there were simply more major topics discussed.
After assigning a label to each of the topics, we cross-referenced the three categories of articles (U.S./UK, Russian, and Chinese) to see which topics were shared and which were unique. The results are described in a later section of this report.
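Once each topic has a human-assigned label, the cross-referencing step amounts to set arithmetic. The sketch below shows one way it could be done; the topic labels are invented placeholders for illustration and are not results from the study.

```python
# Hypothetical topic labels per corpus (placeholders, not the study's findings).
labels = {
    "western": {"case counts", "economy", "elections"},
    "russian": {"case counts", "economy", "bioweapon claims"},
    "chinese": {"case counts", "foreign aid"},
}

# Topics that appear in every corpus.
shared = set.intersection(*labels.values())

# Topics that appear in exactly one corpus.
unique = {}
for name, topics in labels.items():
    others = set().union(*(t for n, t in labels.items() if n != name))
    unique[name] = topics - others

print("shared:", shared)
for name, topics in unique.items():
    print(name, "only:", topics)
```

Topics that land in a single non-Western corpus are the ones this approach would flag for closer human reading; a topic present in two of the three corpora (here, "economy") is neither shared by all nor unique.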
Time-Series Clustering
To understand the time-series nature of the topics under discussion in our datasets, we set out to find clusters of words for which the frequencies change simultaneously. Given the nature of the news cycle, we expected that words associated with some topics would spike in their usage when that topic became of broad interest. Consider the words oil, barrel, OPEC, and price: One would expect that these words are used year-round but become more common when there is a newsworthy event involving oil markets (for example, if gas prices go up). The pattern of their frequency over the course of the year, then, is relatively unique, and we can discover other words that have the same time-series signature. We applied this technique only to the Russian and Western datasets; the Chinese dataset was too small to yield informative results.
⁵ Stemming refers to converting a word to its root form (e.g., “running” becomes “run”). Tokenization is the process of converting all words to lowercase and splitting them into individual word components (e.g., splitting the contraction “they’re” into “they” and “re”). Stopwords are words that are so common they add little information (e.g., “the,” “and,” “of”). We used the Porter Stemmer as implemented in the Natural Language Toolkit, the standard toolkit stopwords dictionary, and the gensim.utils.simple_preprocess tokenizer.
⁶ In virtually all cases, we found that 20 articles were sufficient to easily identify the topic being discussed.
To perform this analysis, we first identified the 5,000 most common words in our two datasets of interest. Text preprocessing was performed identically to the preprocessing for the LDA analysis. For each word, we then calculated its daily frequency by dividing the number of mentions of that word by the total number of words used across all articles in the associated dataset. To smooth out daily fluctuations, we then computed the five-day rolling average of these frequencies and then normalized each word’s frequency by dividing by the peak frequency across the entire time span. This latter step allows us to easily compare the time-series fluctuations of common words and uncommon words on an equal footing. With the time-series frequencies of each word in hand, we finally applied the OPTICS (Ordering Points To Identify the Clustering Structure) clustering algorithm to find words that cluster together.⁷ We chose OPTICS over more-common clustering algorithms for two main reasons: First, OPTICS does not require an input for the number of clusters, instead determining the number of clusters directly from the data; second, it classifies low-density points as noise and does not assign them to a cluster. With many words that were not topic-indicative in our data, this was a crucial feature that allowed us to minimize noise and produce meaningful clusters. A few other clustering algorithms have similar features, and we chose OPTICS both for its computational speed and for the (admittedly subjective) observed quality of the resulting clusters.
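The frequency normalization that precedes clustering can be sketched as follows. Two details are assumptions made for this example, since the text does not specify them: the denominator is taken to be the total word count for the same day, and the five-day window is trailing (and shorter at the start of the series). The function names and toy counts are illustrative.

```python
def daily_frequency(word_counts, total_words):
    """Mentions of one word per day divided by all words used that day."""
    return [c / t if t else 0.0 for c, t in zip(word_counts, total_words)]

def rolling_mean(values, window=5):
    """Trailing rolling average; the window is shorter for the first days."""
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

def peak_normalize(values):
    """Scale a series so that its maximum value is 1."""
    peak = max(values)
    return [v / peak for v in values] if peak > 0 else list(values)

def signature(word_counts, total_words, window=5):
    """Time-series signature of one word: frequency, smoothing, normalization."""
    freq = daily_frequency(word_counts, total_words)
    return peak_normalize(rolling_mean(freq, window))

# Toy data: a common word and a word ten times rarer with the same shape.
totals = [1000] * 7
common = [10, 10, 20, 40, 40, 20, 10]
rare = [1, 1, 2, 4, 4, 2, 1]

sig_common = signature(common, totals)
sig_rare = signature(rare, totals)
print([round(v, 3) for v in sig_common])
print([round(v, 3) for v in sig_rare])
```

After peak normalization the two words have matching signatures despite a tenfold difference in raw frequency, which is what places common and uncommon words on an equal footing. Each word's signature would then serve as one input point for a density-based clustering algorithm such as OPTICS.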
The OPTICS algorithm finds high-density clusters, meaning many words are not similar enough to any other word to warrant cluster membership. The words in each cluster, therefore, h