RAND, Bad Actors in News Reporting: Tracking News Manipulation by State Actors (English), 2021.11-20, final version
BAD ACTORS IN NEWS REPORTING

TRACKING NEWS MANIPULATION BY STATE ACTORS

Christian Johnson | William Marcellino

The global spread of the coronavirus disease 2019 (COVID-19) created fertile ground for attempts to influence and destabilize different populations and countries. In response to this, RAND Corporation researchers conducted a proof-of-concept study for detecting these efforts at scale. Marrying a large-scale collection pipeline for global news with machine-learning and data analysis workflows, the RAND team found that both Russia and China appear to have employed information manipulation during the COVID-19 pandemic in service to their respective global agendas. This report is the second in a series of two reports; the first (Matthews, Migacheva, and Brown, 2021) focused on qualitative and descriptive analysis of the same data referred to in this report. Here, we describe our analytic workflows for detecting and documenting state-sponsored malign and subversive information efforts, and we report quantitative results that support the qualitative findings from the first report.

Introduction

As part of our analysis, we searched for both differences and similarities in the topics discussed by Russian, Chinese, and Western news media, and we found that conspiracy theories and geopolitical posturing were relatively common in Russian and Chinese news articles compared with Western (U.S. and UK) articles. The work we describe here lays the foundation for a robust protective capability that detects and sheds light on state-actor information manipulation and misconduct in the global arena.

Disinformation, Propaganda, and Truth Decay

The world is experiencing a crisis related to disagreements over the established truth, a phenomenon that RAND refers to as Truth Decay, a shift in public discourse away from facts and analysis that is caused by four interrelated drivers (Rich and Kavanagh, 2018):

an increasing disagreement about facts and analytical interpretations of facts and data

a blurring of the line between opinion and fact

an increasing relative volume, and resulting influence, of opinion and personal experience over fact

a declining trust in formerly respected sources of factual information.

Truth Decay is a serious threat to both domestic U.S. and international security, one that is being exacerbated by malign efforts from a variety of national bad actors. These ill-intentioned efforts to misuse information are labeled many ways: readers might have seen these efforts labeled as disinformation, misinformation, fake news, and information operations. For clarity and consistency’s sake, we use the definitions taken from Rich and Kavanagh, 2018, in the remainder of this paper. Our definition of conspiracy theories comes from Douglas et al., 2019. (See the Key Information Definitions box.)


KEY INFORMATION DEFINITIONS

Topic: Definition

Disinformation: False or misleading information spread intentionally, usually to achieve some political or economic objective, influence public attitudes, or hide the truth (a synonym for propaganda)

Misinformation: False or misleading information that is spread unintentionally, by error or mistake

Conspiracy theories: Information that attempts to explain the ultimate causes of significant social and political events and circumstances with claims of secret plots by two or more powerful actors

Fake news: Newspaper articles, television news shows, or other information disseminated through broadcast or social media that are intentionally based on falsehoods or that intentionally use misleading framing to offer a distorted narrative

News Manipulation from Both China and Russia

We found that during the COVID-19 pandemic, both Russia and China engaged in news manipulation that served their geopolitical goals.[1] Although English-language news media from both nations did engage in traditional reporting on COVID-19 (reporting on infection, death rates, and medical responses globally), they also conducted distinct media efforts that appear to be politically driven news manipulation. We found that Russian media advanced anti-U.S. conspiracy theories about the virus and that Chinese media advanced pro-China news that laundered Beijing’s reputation in terms of COVID-19 response. Additionally, we found that early in the pandemic, Russian media supported China’s effort to burnish its reputation.

In total, three main pillars of Chinese and Russian news about the COVID-19 pandemic were identified. First, unsurprisingly, Chinese and Russian news agencies reported on stories with broad interest, that is, news topics covered similarly by Western news agencies. Good examples of this pillar are articles describing the case numbers and deaths related to COVID-19.

The second pillar of news stories consists of articles that perform geopolitical reputation-laundering on behalf of Russia and China. Many Chinese news articles, for example, praise China’s handling of the pandemic and highlight its donations of aid to foreign countries. Interestingly, Russian news praises China in a similar way. Russian news also appeared to downplay the original COVID-19 outbreak in Wuhan. (We consider the interaction between these different pillars later in this report.)

Finally, Russian and Chinese news agencies promoted conspiracy theories regarding COVID-19 and the public health measures implemented to contain it. Examples of news in this pillar are the suggestion that COVID-19 is a bioweapon or otherwise engineered in a laboratory or the idea that contact-tracing efforts are part of an effort by government and technology companies to track citizens.

The success of our proof-of-concept study supports the idea that existing, off-the-shelf natural language processing methods can be used to make sense of news reporting by nation, at a global scale. These methods, linked to a scalable infrastructure that ingests news from around the world, could create a U.S.-supported capability to detect news manipulation at the nation-state level. In place of attempts to identify individual news stories or sources that are unreliable, such a capability could make manipulation of the broader news landscape publicly visible. Automated summarizing of a nation’s news output at an aggregate level would quickly uncover a manipulation effort, for example, the spreading of a conspiracy theory that contact-tracing programs are part of a government tracking effort. (This is a real example that Russian news sources spread and that our model detected.)

We have several reasons for choosing to focus our analysis on data aggregated at the nation-state level (as opposed to, for example, the individual news outlet level). First, we viewed this study as an extension of prior work looking at nation-state level disinformation efforts (Marcellino, Johnson, et al., 2020; Marcellino, Marcinek, et al., 2020). These prior works looked at nation-state actors engaged in broad disinformation efforts to interfere with elections, and we looked specifically at state manipulation of news media during a pandemic. Second, key features that present themselves only at the national level were of interest: Most importantly, the United States and United Kingdom have robust, independent presses while Russia and China exert state control over their news media. A separate and equally compelling analysis would examine potential news disinformation within nations (for example, by partisan news sources in the United States). It is likely that such an analysis would find significant differences between individual outlets that are worth exploring, especially through the lens of political polarization in the United States; partisan news has previously been identified as a driver of Truth Decay (Rich and Kavanagh, 2018).

[1] By news manipulation, we mean that news articles were published to further the agenda of a state sponsor rather than to inform the public. These articles are therefore subject to pressures beyond the standard editorial control of a news agency.

A potential limitation of this work is that we focused only on English-language articles. Russia and China are not majority English-speaking, so we are comparing news stories aimed at domestic audiences (U.S. and UK) with ones aimed at foreign audiences (Russian and Chinese). Insofar as the news outlets are trying to influence English-speaking people, however, we feel that they can be usefully compared. Cross-linguistic comparison of domestically oriented reporting is another potential line of future research.

Given the effectiveness of combining existing off-the-shelf methods in our report, a public system for monitoring global news that detects and describes global news themes by nation is plausible. Such a system could help guard against Truth Decay efforts from malicious state actors. The system also could analyze additional sources of data, such as social media posts, to understand both the narratives being pushed and which ones take hold. More insight could also be garnered by performing deeper analysis at the individual news agency level: Different online communities are likely to respond differently to similar news stories, depending on which source they originate from, for example. More discussion of such a news monitoring system can be found in the Discussion section.

Methodology

Identifying disinformation in a large, complex dataset is not a simple task. The word disinformation is a catchall term used to refer to an array of different phenomena, from “fake news,” to opinion pieces masquerading as journalism, to legitimate news stories that heap inordinate attention on certain topics (while ignoring others). As described in the definitions box, disinformation is used to refer to the deliberate spreading of misleading or incorrect information; misinformation refers to honest but incorrect knowledge. However, the line between the two can sometimes be blurred; prior RAND work (Marcellino, Johnson, et al., 2020) showed that coordinated bot activity was likely used deliberately in the run-up to the 2020 U.S. presidential election to amplify authentic tweets and make them appear more popular than they really were (commonly called astroturfing) in an attempt to create a false impression of grassroots spread. Our goal, therefore, was not to detect disinformation per se, but to identify when and how Russian and Chinese news media appear to be manipulated by forces outside the normal news cycle and editorial processes. Because our dataset featured many articles from a variety of U.S. and UK media, we make the key assumption that newsworthy stories will be covered by these Western outlets; instances in which Russian and Chinese media cover stories that are qualitatively different from those covered by Western media are worthy of more scrutiny to determine whether they could be part of a disinformation campaign.

Computational techniques have previously been used by researchers to study dissemination of fake news, particularly on Twitter. Grinberg et al., 2019, demonstrated that fake news in the lead-up to the 2016 U.S. presidential election was seen and shared primarily by a relatively small number of Twitter users, consisting mainly of highly conservative and cyborg accounts.[2] Using a similar methodology, Lazer et al., 2020, found that the same conclusions essentially held true for the spread of fake news related to COVID-19. Marcellino, Johnson, et al., 2020, used a different methodology to determine that bot-like accounts likely played a significant role in spreading far-right conspiracy theories and disinformation leading up to the 2020 election. In short, the available research suggests that much of the disinformation on social media is spread by a relatively small number of malign users.

These studies have mostly examined metadata and derived features to draw their conclusions instead of studying the language of disinformation itself.[3] This paper builds on existing research to study not only metadata about news, but the actual content of the news itself. We hoped that understanding the topical themes being spread via disinformation would lead to new insights that cannot be seen simply by looking at user engagement on social networks, such as Twitter.

The first report in this series identified several key markers of disinformation in Russian and Chinese news: conspiracy theories, geopolitical posturing, and anti-U.S. messaging.

Although we hoped that a data-driven approach would replicate these findings, we sought to perform our analysis as blindly as possible; that is, we did not seek to confirm our suspicions and simply search the data to find conspiracy theories. Instead, we used algorithms to detect the dominant themes in the data and only then analyzed these themes to determine their content.

Our overall strategy, as mentioned earlier, rested on the idea that any disinformation published by Russian and Chinese news sources would be detectable because its content would differ meaningfully from the content in U.S. and UK news articles. Certainly, some differences in content are to be expected under a no-manipulation hypothesis: For example, Russian news sources might be more likely to cover stories about Eastern Europe than news from the United States, simply because of geographical proximity. However, we hypothesized that by inspecting these differences closely, we would be able to uncover patterns associated with manipulation. Ultimately, any differences between Western and non-Western news articles would also require human analysis to determine whether the differences were innocuous or malign.

Data Description

We used News API to collect all English-language articles from 43 news sources (nine of which are Russian, five Chinese, 27 U.S., and two UK) for the period January 1, 2020, through August 31, 2020, that featured either “coronavirus” or “COVID” in the text.[4] This resulted in a total of 247,315 articles, the vast majority of them (230,865) from U.S./UK sources, with smaller numbers from Russian (14,309) and Chinese (2,141) sources. (We provide a more detailed breakdown of articles published by news outlet in the Appendix.)

For our search period, the overall frequency of published articles with either term mentioned grew rapidly through January and February, reaching a peak in March and April. Article frequency by country of origin is shown over time in Figure 1. A similar pattern was seen in publishing frequency over time across U.S./UK, Russian, and Chinese sources, although Russian news sources appeared to publish somewhat less frequently in mid to late February. More-detailed analysis of this apparent Russian slowdown is described later in this report.

[2] A cyborg account is one that mixes automated bot activity with real human tweets.

[3] Derived features refers to such things as the presence of fake news URLs in a Twitter feed.

[4] News API is an application programming interface that allows users to automatically connect to and search a large database of news articles, including newswire services (an important advantage over such rival sources as LexisNexis). RAND has built a scalable infrastructure to retrieve, store, query, and then analyze very large news article datasets. This scalable architecture is a powerful tool that allows us to gather an enormous amount of news data for analysis, but it also has a constraint: We can collect only news articles from sources covered by the service, which does not include sources that are behind paywalls or otherwise restricted in access. For our study, in particular, only nine Russian and five Chinese sources in English are covered by News API.


FIGURE 1. Article Frequency over Time in 2020

[Figure: articles per day (logarithmic y-axis, 10^-1 to 10^3) by month, January through August 2020, for U.S./UK news, Russian news, and Chinese news.]

NOTE: The moving seven-day average publishing rate is overlaid on each source as a solid line. Note that the y-axis is logarithmically scaled; we have about an order of magnitude fewer Russian news articles than U.S./UK, and about an order of magnitude fewer Chinese articles than Russian.
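The seven-day moving average mentioned in the figure note can be illustrated with a minimal sketch. The function names, the sample dates, and the choice of a trailing (rather than centered) window are assumptions made for illustration; the report does not say how the average was computed.

```python
from collections import Counter
from datetime import date, timedelta

def daily_counts(pub_dates, start, end):
    """Count articles per calendar day, including days with zero articles."""
    counts = Counter(pub_dates)
    n_days = (end - start).days + 1
    return [counts.get(start + timedelta(days=i), 0) for i in range(n_days)]

def trailing_average(values, window=7):
    """Trailing moving average; early days average over the days available so far."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

# Hypothetical publication dates for one group of sources (not data from the report).
pub_dates = [date(2020, 1, 1), date(2020, 1, 1), date(2020, 1, 3), date(2020, 1, 8)]
counts = daily_counts(pub_dates, date(2020, 1, 1), date(2020, 1, 8))
smoothed = trailing_average(counts, window=7)
```

Plotting `smoothed` on a logarithmic axis would give a line comparable in spirit to the solid lines described in the note.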

Because the COVID-19 pandemic was such an impactful worldwide event, we were not surprised to find that news stories about many other topics, such as those that were dominantly about economic or political stories, were also represented in our dataset because they also referenced the pandemic in some way. In fact, a cursory examination of random articles in our dataset showed that the majority were focused on a different (nonpandemic) topic, although the pandemic played a significant role in many of these articles.

We decided to model how this assortment of different subjects varied across Russian, Chinese, and Western news media. If we could determine that certain topics were being discussed quite often by Russian or Chinese news but rarely by Western outlets, that would suggest the need for additional examination and might even be indicative of a malign effort to push certain narratives.

Natural language processing, the branch of machine learning that deals with unstructured text, has a variety of techniques for performing this kind of topic modeling. We decided to use Latent Dirichlet Allocation (LDA), a widely used method that simultaneously identifies the topics associated with each article, along with the words associated with each topic. LDA has the advantage of being relatively fast to perform on a large dataset and produces results that are easily understood and interpreted by humans. Our LDA model was built using gensim version 3.8.3 (Rehurek and Sojka, 2011), and text preprocessing was performed with the Natural Language Toolkit (Loper and Bird, 2002).

To supplement the LDA model, we also considered how to analyze the ways that news changed over time, using time-series clusters. The news cycle is constantly in flux, and malign foreign entities are liable to change narratives over time as different topics become more important. We therefore expected differences not only in what was being discussed by the news media, but when it was being reported. Our time-series model, described in more detail later in this report, yields clusters of words that rise and fall in frequency simultaneously. As in LDA, human inspection of the keywords is necessary to assign meaning to the clusters. We used the same preprocessing steps (stop word removal, stemming, and tokenization) for both our LDA and time-series clustering models.

Each of our methods, LDA and time-series modeling, yielded groups of articles and words associated with a particular topic. We found that both methods generated a reasonably small number of clusters (about 20 to 50 unique topics, depending on the method and whether the model was analyzing Russian, Chinese, or Western news articles) that could then be studied individually. We found that meaning was somewhat simpler to assign to the time-series clusters because the words in each cluster of interest were typically closely related. Moreover, spikes in the frequency of different words were usually easy to correlate with real-world events. On the other hand, the time-series clusters were observed to be somewhat biased toward rare or infrequent words. This made the time-series data more useful for identifying niche subjects like conspiracy theories but less useful for broader trends.

LDA Modeling

LDA identifies topics by the collection of words that appear within documents. It is a bag-of-words technique; that is, it considers only which words appear in a document, not the order in which they appear. The algorithm takes as its input a series of documents and a predetermined number of topics to identify, then assigns each word and each document to one or more topics in such a way as to produce topics as homogeneous as possible. For instance, the word Brexit appears almost exclusively in news articles about the United Kingdom and Europe, so that word is assigned to that topic, as are articles that use that word. Both documents and words can be assigned to multiple topics: It is easy to imagine a news article that discusses both the U.S. economy and the presidential election. However, LDA favors assigning a handful of topics at most to each word or document, under the assumption that each document has a relatively singular focus.
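The bag-of-words property described above can be made concrete with a small sketch. This is not LDA itself, only an illustration of the input representation; the helper name and the two toy documents are hypothetical.

```python
from collections import Counter

def bag_of_words(tokens):
    """Map each token to its count; token order is discarded."""
    return Counter(tokens)

# Two hypothetical documents with the same words in a different order.
doc_a = ["brexit", "vote", "delays", "trade", "talks"]
doc_b = ["trade", "talks", "delays", "brexit", "vote"]

# A bag-of-words model cannot tell these two documents apart.
same_representation = bag_of_words(doc_a) == bag_of_words(doc_b)
```

Because both documents reduce to the same word counts, any model that consumes only this representation treats them identically.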

We trained three separate LDA models, one for each of our Russian, Chinese, and Western news datasets. Standard preprocessing techniques were applied to the data: The text of each article was stemmed and tokenized, and stop words were removed.[5]
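The three preprocessing steps can be sketched in plain Python as follows. This is a simplified stand-in and not the toolchain used in the report: the stop word list is a tiny hypothetical sample, the tokenizer is a bare regular expression, and `crude_stem` only strips a few suffixes rather than implementing the Porter algorithm.

```python
import re

# Tiny illustrative stop word list (an assumption, not the NLTK dictionary).
STOP_WORDS = {"the", "and", "of", "a", "is", "are", "to", "in", "re"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def crude_stem(token):
    """Strip a few common suffixes. NOT the Porter algorithm, only a stand-in."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Tokenize, drop stop words, then stem what remains."""
    return [crude_stem(t) for t in tokenize(text) if t not in STOP_WORDS]

tokens = preprocess("They're tracking the reported cases of the outbreak")
```

In practice, a real stemmer and a full stop word list would replace the two stand-ins; the order of the steps is the part this sketch is meant to show.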

We chose to allocate 20 topics for each country. This choice was initially motivated because it seemed a reasonable compromise between capturing many niche topics and splitting larger topics into duplicates, but we also computed topic coherence for a range of eight to 28 topics. We found a modest peak in coherence at 20 topics, but the results were quite noisy and little significant improvement was seen for more or fewer topics. We concluded that 20 topics for each of our datasets was a reasonable choice.

We initially trained a single LDA model on all of our text instead of training three separate models. We found the resulting topics to be dominated by U.S. news articles, which makes sense because the dataset was heavily skewed toward U.S. outlets. For example, articles from Russian and Chinese sources were grouped together with many other articles from U.S. sources about U.S. foreign policy. Because we were specifically interested in the narratives being promoted by Russian and Chinese sources, we decided to build three separate LDA models, which immediately resulted in a far more granular look at the different themes in these articles of interest.

To assign a meaningful label to each topic, we then identified the 20 articles with the highest score associated with each of our 20 different topics and read each of the headlines.[6] We also observed the 20 words most strongly associated with each topic. In most cases, reading the headlines was enough to quickly identify the topic of discussion.
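Selecting the highest-scoring articles for a topic is a simple ranking step, sketched below. The data layout (a dictionary of per-article topic scores), the function name, and the sample headlines are all hypothetical; the report does not describe how the scores were stored.

```python
import heapq

def top_articles(doc_topic_scores, headlines, topic_id, n=20):
    """Return headlines of the n articles with the highest score for one topic."""
    scored = [
        (scores.get(topic_id, 0.0), doc_id)
        for doc_id, scores in doc_topic_scores.items()
    ]
    best = heapq.nlargest(n, scored)
    return [headlines[doc_id] for _, doc_id in best]

# Hypothetical per-article topic weights (topic id -> score), as a topic model might produce.
doc_topic_scores = {
    "a1": {0: 0.90, 1: 0.10},
    "a2": {0: 0.20, 1: 0.80},
    "a3": {0: 0.65, 1: 0.35},
}
headlines = {"a1": "Case counts rise", "a2": "Markets slide", "a3": "Deaths reported"}
top_two = top_articles(doc_topic_scores, headlines, topic_id=0, n=2)
```

A human reader would then scan the returned headlines to choose a label for the topic.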

In a few cases, topics were more difficult to discern. Sometimes this was because two separate article subjects were placed into a single topic. In our Western news category, for example, we found that international news stories and articles about Black Lives Matter (or BLM) protests were placed together in the same topic. This kind of conflation is not unexpected; the two topics share many keywords: “protest,” “state,” “law,” “people,” “security,” and so forth.

Meanwhile, we found several topics whose contents appeared to be duplicated in our Chinese dataset, probably because of the relatively small number of articles. The LDA algorithm had to split some relatively cohesive topics to reach a total of 20 topics. This was less likely to occur in our Western and Russian datasets because there were simply more major topics discussed.

After assigning a label to each of the topics, we cross-referenced the three categories of articles (U.S./UK, Russian, and Chinese) to see which topics were shared and which were unique. The results are described in a later section of this report.
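Once every topic carries a human-assigned label, the cross-referencing step amounts to set comparison. The sketch below assumes labels are plain strings that match exactly across corpora; the function name and the example labels are hypothetical and are not the report's results.

```python
def cross_reference(labels_by_corpus):
    """Split topic labels into those shared by every corpus and those unique to one."""
    all_sets = {name: set(labels) for name, labels in labels_by_corpus.items()}
    shared = set.intersection(*all_sets.values())
    unique = {}
    for name, labels in all_sets.items():
        others = set().union(*(s for other, s in all_sets.items() if other != name))
        unique[name] = labels - others
    return shared, unique

# Hypothetical labels for illustration only.
labels_by_corpus = {
    "us_uk": {"case counts", "economy", "elections"},
    "russian": {"case counts", "economy", "contact-tracing surveillance"},
    "chinese": {"case counts", "aid donations"},
}
shared, unique = cross_reference(labels_by_corpus)
```

Labels that appear in only one non-Western corpus are the candidates that would call for closer human reading.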

Time-Series Clustering

To understand the time-series nature of the topics under discussion in our datasets, we set out to find clusters of words for which the frequencies change simultaneously. Given the nature of the news cycle, we expected that words associated with some topics would spike in their usage when that topic became of broad interest. Consider the words oil, barrel, OPEC, and price: One would expect that these words are used year-round but become more common when there is a newsworthy event involving oil markets (for example, if gas prices go up). The pattern of their frequency over the course of the year, then, is relatively unique, and we can discover other words that have the same time-series signature. We applied this technique only to the Russian and Western datasets; the Chinese dataset was too small to yield informative results.

To perform this analysis, we first identified the 5,000 most common words in our two datasets of interest. Text preprocessing was performed identically to the preprocessing for the LDA analysis. For each word, we then calculated its daily frequency by dividing the number of mentions of that word by the total number of words used across all articles in the associated dataset. To smooth out daily fluctuations, we then computed the five-day rolling average of these frequencies and then normalized each word’s frequency by dividing by the peak frequency across the entire timespan. This latter step allows us to easily compare the time-series fluctuations of common words and uncommon words on an equal footing. With the time-series frequencies of each word in hand, we finally applied the OPTICS (Ordering Points To Identify the Clustering Structure) clustering algorithm to find words that cluster together.[7] We chose OPTICS over more-common clustering algorithms for two main reasons: First, OPTICS does not require an input for the number of clusters, instead determining the number of clusters directly from the data; second, it classifies low-density points as noise and does not assign them to a cluster. With many words that were not topic-indicative in our data, this was a crucial feature that allowed us to minimize noise and produce meaningful clusters. A few other clustering algorithms have similar features, and we chose OPTICS both for its computational speed and for the (admittedly subjective) observed quality of the resulting clusters.

[5] Stemming refers to converting a word to its root form (e.g., “running” becomes “run”). Tokenization is the process of converting all words to lowercase and splitting them into individual word components (e.g., splitting the contraction “they’re” into “they” and “re”). Stop words are words that are so common they add little information (e.g., “the,” “and,” “of”). We used the Porter Stemmer as implemented in the Natural Language Toolkit, the standard toolkit stop words dictionary, and the gensim.utils.simple_preprocess tokenizer.

[6] In virtually all cases, we found that 20 articles were sufficient to easily identify the topic being discussed.
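The frequency preparation that precedes the clustering step (daily frequency, five-day rolling average, peak normalization) can be sketched as follows. Several details are assumptions: the denominator is taken literally as one dataset-wide word total (a per-day total is another plausible reading), the rolling window is trailing rather than centered, and all function names and sample counts are hypothetical.

```python
def daily_frequency(daily_mentions, total_words):
    """Mentions of one word per day divided by the total word count of the dataset."""
    return [m / total_words for m in daily_mentions]

def rolling_mean(values, window=5):
    """Trailing rolling mean; the first days average over however many days exist."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

def peak_normalize(values):
    """Scale a series so its maximum is 1.0 (an all-zero series is returned unchanged)."""
    peak = max(values)
    return [v / peak for v in values] if peak > 0 else list(values)

def word_signature(daily_mentions, total_words, window=5):
    """Chain the three steps into one normalized time-series signature."""
    freq = daily_frequency(daily_mentions, total_words)
    return peak_normalize(rolling_mean(freq, window))

# Hypothetical daily mention counts: a common word and a rare word with the same shape.
common = word_signature([100, 100, 400, 400, 100, 100], total_words=1_000_000)
rare = word_signature([1, 1, 4, 4, 1, 1], total_words=1_000_000)
```

After peak normalization the two signatures coincide, which is the sense in which common and uncommon words are compared on an equal footing. The resulting vectors, one per word, would then be passed to a clustering algorithm such as OPTICS.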

The OPTICS algorithm finds high-density clusters, meaning many words are not similar enough to any other word to warrant cluster membership. The words in each cluster, therefore, h
