




Logistic Regression in Rare Events Data

We study rare events data, binary dependent variables with dozens to thousands of times fewer ones (events, such as wars, vetoes, cases of political activism, or epidemiological infections) than zeros (“nonevents”). In many literatures, these variables have proven difficult to explain and predict, a problem that seems to have at least two sources. First, popular statistical procedures, such as logistic regression, can sharply underestimate the probability of rare events. We recommend corrections that outperform existing methods and change the estimates of absolute and relative risks by as much as some estimated effects reported in the literature. Second, commonly used data collection strategies are grossly inefficient for rare events data. The fear of collecting data with too few events has led to data collections with huge numbers of observations but relatively few, and poorly measured, explanatory variables, such as in international conflict data with more than a quarter-million dyads, only a few of which are at war. As it turns out, more efficient sampling [...] wars) and a tiny fraction of nonevents (peace). This enables scholars to save as much as 99% of their (nonfixed) data collection costs or to collect much more meaningful explanatory variables. [...]

Authors' note: We thank James Fowler, Ethan Katz, and Mike Tomz for research assistance; Jim Alt, John Freeman, Kristian Gleditsch, Guido Imbens, Chuck Manski, Peter McCullagh, Walter Mebane, Jonathan Nagler, Bruce Russett, Ken Scheve, Phil Schrodt, Martin Tanner, and Richard Tucker for helpful suggestions; Scott Bennett, Kristian Gleditsch, Paul Huth, and Richard Tucker for data; and the National Science Foundation (SBR-9729884 and SBR-9753126), the Centers for Disease Control and Prevention (Division of Diabetes Translation), the National Institute on Aging (P01 AG17625-01), the World Health Organization, and the Center for Basic Research in the Social Sciences for research support. Software we wrote to implement the methods in this paper, called “ReLogit: Rare Events Logistic Regression,” is available for Stata and for Gauss from http://GKing.Harvard.Edu. We have written a companion piece to this article that overlaps this one: it excludes the mathematical proofs and other technical material, and has less general notation, but it includes empirical examples and more pedagogically oriented material (see King and Zeng 2000b; copy available at http://GKing.Harvard.Edu).

Copyright 2001 by the Society for Political [...]
Gary King and Langche Zeng

1 Introduction

We address problems in the statistical analysis of rare events data, binary dependent variables with dozens to thousands of times fewer ones (events, such as wars, coups, presidential vetoes, decisions of citizens to run for political office, or infections by uncommon diseases) than zeros (“nonevents”). (Of course, by trivial recoding, this definition [...] and related social sciences and perhaps most prevalent in international conflict (and other [...] to explain and predict, a problem we believe has a multiplicity of sources, including the two we address here: most popular statistical procedures, such as logistic regression, can sharply underestimate the probability of rare events, and commonly used data collection strategies are grossly inefficient.

First, although the statistical properties of linear regression models are invariant to the (unconditional) mean of the dependent variable, the same is not true for binary dependent variable models. The mean of a binary variable is the relative frequency of events in the data, which, in addition to the number of observations, constitutes the information content of the data set. We show that this often overlooked property of binary variable models has [...] biased in small samples (under about 200) is well documented in the statistical literature, but not as widely understood is that in rare events data the biases in probabilities can be substantively meaningful with sample sizes in the thousands and are in a predictable direction: estimated event probabilities are too small. A separate, and also overlooked, problem is that the almost-universally used method of computing probabilities of events in logit analysis is suboptimal in finite samples of rare events data, leading to errors in the same direction as biases in the coefficients. Applied researchers virtually never correct for the underestimation of event probabilities. These problems will be innocuous in some applications, but we offer simple Monte Carlo examples where the biases are as large as some estimated effects reported in the literature. We demonstrate how to correct for these problems and provide software to make the computation straightforward.

A second source of the difficulties in analyzing rare events lies in data collection. Given [...] better or additional variables. In rare events data, fear of collecting data sets with no events (and thus without variation on Y) has led researchers to choose very large numbers of observations with few, and in most cases poorly measured, explanatory variables. This is a reasonable choice, given the perceived constraints, but it turns out that far more efficient [...] ones and a small random sample of zeros and not lose consistency or even much efficiency relative to the full sample. This result drastically changes the optimal trade-off between more observations and better variables, enabling scholars to focus data collection efforts where they matter most.

As an example, we use all dyads (pairs of countries) for each year since World War II to generate a data set below with 303,814 observations, of which only 0.34%, or 1042 dyads, were at war. Data sets of this size are not uncommon in international relations, but they make data management difficult, statistical analyses time-consuming, and data collection expensive.[1] (Even the more common 5000–10,000 observation data sets are inconvenient to deal with if one has to collect variables for all the cases.) Moreover, most dyads involve countries with little relationship at all (say Burkina Faso and St. Lucia), much less with some realistic probability of going to war, and so there is a well-founded perception that many of the data are “nearly irrelevant” (Maoz and Russett 1993, p. 627). Indeed, many of the data have very little information content, which is why we can avoid collecting the vast [...] in political science designed to cope with this problem, such as selecting dyads that are “politically relevant” (Maoz and Russett 1993), are reasonable and practical approaches to a difficult problem, but they necessarily change the question asked, alter the population to which we are inferring, or require conditional analysis (such as only contiguous dyads or only those involving a major power). Less careful uses of these types of data selection [...] are biased. With appropriate easy-to-apply corrections, nearly 300,000 observations with zeros need not be collected or could even be deleted with only a minor impact on substantive conclusions.

With these procedures, scholars who wish to add new variables to an existing collection can save approximately 99% of the nonfixed costs in their data collection budget or can reallocate data collection efforts to generate a larger number of more informative and meaningful variables than would otherwise be possible.[2] Relative to some other fields in [...] of measurement over many years and have generated a large quantity of data. Selecting on the dependent variable in the way we suggest has the potential to build on these efforts, [...]

This procedure of selection on Y also addresses a long-standing controversy in the international conflict literature whereby qualitative scholars devote their efforts where the [...] In contrast, quantitative scholars are criticized for spending time analyzing very crude [...] (Bueno de Mesquita 1981; Geller and Singer 1998; Levy 1989; Rosenau 1976; Vasquez 1993). It [...] much more with the ones than the zeros, but researchers must be careful to avoid bias. Fortunately, the corrections are easy, and so the goals of both camps can be [...]

The main intended contribution of this paper is to integrate these two types of corrections, which have been studied mostly in isolation, and to clarify the largely unnoticed consequences of rare events data in this context. We also try to forge a critical link between [...] events bias, and standard error inconsistency, in a popular method of correcting selection on Y. This is useful when selecting on Y leads to smaller samples. We also provide an improved method of computing probability estimates, proofs of the equivalence of some leading econometric methods, and software to implement the methods developed. We offer [...] Empirical examples appear in our companion paper (King and Zeng 2000b).[3]

[1] Bennett and Stam (1998b) analyze a data set with 684,000 dyad-years and (1998a) have even developed sophisticated software for managing the larger, 1.2 million-dyad data set they distribute.

[2] The fixed costs involved in gearing up to collect data would be borne with either data collection strategy, and so selecting on the dependent variable as we suggest saves something less in research dollars than the fraction of observations not collected.

[3] We have found no discussion in political science of the effects of finite samples and rare events on logistic regression or of most of the methods we discuss that allow selection on Y. There is a brief discussion of one method of correcting selection on Y in asymptotic samples in Bueno de Mesquita and Lalman (1992, Appendix) and in an unpublished paper they cite that has recently become available (Achen 1999).
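To make the scale of the potential savings concrete, the following sketch redoes the arithmetic on the dyad figures quoted above. The totals (303,814 observations, 1042 at war) come from the text; the function names and the choice of how many zeros to keep per one are illustrative assumptions, not part of the original analysis, and the result speaks only to the share of observations collected, not to fixed costs.

```python
# Figures quoted in the text for the post-World War II dyad data.
N_TOTAL = 303_814   # dyad-year observations
N_EVENTS = 1_042    # observations at war (Y = 1)


def event_share(n_events, n_total):
    """Fraction of observations that are events (ones)."""
    return n_events / n_total


def fraction_collected(n_events, n_total, zeros_per_one):
    """Share of the full data set kept when all ones are collected together
    with `zeros_per_one` randomly sampled zeros per one (hypothetical design
    parameter, capped at the number of zeros that actually exist)."""
    zeros_available = n_total - n_events
    zeros_kept = min(zeros_per_one * n_events, zeros_available)
    return (n_events + zeros_kept) / n_total


if __name__ == "__main__":
    print(f"share of events: {event_share(N_EVENTS, N_TOTAL):.4%}")
    for k in (1, 2, 5):
        kept = fraction_collected(N_EVENTS, N_TOTAL, k)
        print(f"{k} zeros per one: keep {kept:.2%}, skip {1 - kept:.2%}")
```

Under these assumptions, keeping a small multiple of zeros per one leaves only a few percent of the original observations to collect, which is the order of magnitude behind the savings described in the text.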
2 Logistic Regression: Model and Notation

In logistic regression, a single outcome variable $Y_i$ ($i = 1, \ldots, n$) follows a Bernoulli probability function that takes on the value 1 with probability $\pi_i$ and 0 with probability $1 - \pi_i$. Then $\pi_i$ varies over the observations as an inverse logistic function of a vector $x_i$, which includes a constant and $k - 1$ explanatory variables:

\[ Y_i \sim \mathrm{Bernoulli}(Y_i \mid \pi_i), \qquad \pi_i = \frac{1}{1 + e^{-x_i\beta}}. \tag{1} \]

The Bernoulli has probability function $P(Y_i \mid \pi_i) = \pi_i^{Y_i}(1 - \pi_i)^{1 - Y_i}$. The unknown parameter $\beta = (\beta_0, \beta_1')'$ is a $k \times 1$ vector, where $\beta_0$ is a scalar constant term and $\beta_1$ is a vector with elements corresponding to the explanatory variables.

An alternative way to define the same model is by imagining an unobserved continuous variable $Y_i^*$ (e.g., the health of an individual or the propensity of a country to go to war) [...] $\mu_i$, which varies over the observations as a function of $x_i$. The model would be very close to a linear regression if $Y_i^*$ were observed:

\[ Y_i^* \sim \mathrm{Logistic}(Y_i^* \mid \mu_i), \qquad \mu_i = x_i\beta, \tag{2} \]

where $\mathrm{Logistic}(Y_i^* \mid \mu_i)$ is the one-parameter logistic probability density,

\[ P(Y_i^*) = \frac{e^{-(Y_i^* - \mu_i)}}{\left(1 + e^{-(Y_i^* - \mu_i)}\right)^2}. \tag{3} \]

Unfortunately, instead of observing $Y_i^*$, we see only its dichotomous realization, $Y_i$, where $Y_i = 1$ if $Y_i^* > 0$ and $Y_i = 0$ if $Y_i^* \le 0$. For example, if $Y_i^*$ measures health, $Y_i$ might be dead (1) or alive (0). If $Y_i^*$ were the propensity to go to war, $Y_i$ could be at war (1) or at peace (0). The model remains the same because

\[ \Pr(Y_i = 1 \mid \beta) = \pi_i = \Pr(Y_i^* > 0 \mid \beta) = \int_0^{\infty} \mathrm{Logistic}(Y_i^* \mid \mu_i)\, dY_i^* = \frac{1}{1 + e^{-x_i\beta}}, \]

which is exactly as in Eq. (1). We also know that the observation mechanism, which turns the continuous $Y_i^*$ into the dichotomous $Y_i$, generates most of the mischief. That is, we ran simulations trying to estimate $\beta$ from an observed $Y^*$ and model 2 and found that maximum-likelihood estimation of $\beta$ is approximately unbiased in small samples.

The parameters are estimated by maximum likelihood, with the likelihood [...]

\[ L(\beta \mid y) = \prod_{i=1}^{n} \pi_i^{Y_i}(1 - \pi_i)^{1 - Y_i}. \tag{4} \]

By taking logs and using Eq. (1), the log-likelihood simplifies to

\[ \ln L(\beta \mid y) = \sum_{\{Y_i = 1\}} \ln(\pi_i) + \sum_{\{Y_i = 0\}} \ln(1 - \pi_i) = -\sum_{i=1}^{n} \ln\left(1 + e^{(1 - 2Y_i)x_i\beta}\right) \tag{5} \]

(e.g., Greene 1993, p. 643). Maximum-likelihood logit analysis then works by finding the value of $\beta$ that gives the maximum value of this function, which we label $\hat{\beta}$. The asymptotic variance matrix, $V(\hat{\beta})$, is also retained to compute standard errors. When observations are selected randomly, or randomly within strata defined by some or all of the explanatory variables, $\hat{\beta}$ is consistent and asymptotically efficient ([...] collinearity among the columns in $X$ or perfect discrimination between zeros and ones).

That in rare events data ones are more statistically informative than zeros can be seen by studying the variance matrix,

\[ V(\hat{\beta}) = \left[ \sum_{i=1}^{n} \pi_i(1 - \pi_i)\, x_i' x_i \right]^{-1}. \tag{6} \]

The part of this matrix affected by rare events is the factor $\pi_i(1 - \pi_i)$. Most rare events [...] the logit model has some explanatory power, the estimate of $\pi_i$ among observations for which rare events are observed (i.e., for which $Y_i = 1$) will usually be larger [and closer to 0.5, because probabilities in rare events studies are usually very small (see Beck et al. 2000)] than among observations for which $Y_i = 0$. As a result, $\pi_i(1 - \pi_i)$ will usually be larger for ones than for zeros, and so the variance (its inverse) will be smaller. In this situation, additional ones will cause the variance to drop more and hence are more informative than additional zeros (see Imbens 1992, pp. 1207, 1209; Cosslett 1981a; Lancaster and Imbens 1996b).

Finally, we note that the quantity of interest in logistic regression is rarely the raw $\hat{\beta}$ output by most computer programs. Instead, scholars are normally interested in more direct functions of the probabilities. For example, absolute risk is the probability that an event occurs given chosen values of the explanatory variables, $\Pr(Y = 1 \mid X = x)$. The relative risk is the same probability relative to the probability of an event given some baseline values of $X$, e.g., $\Pr(Y = 1 \mid X = 1)/\Pr(Y = 1 \mid X = 0)$, the fractional increase in the risk. This quantity is frequently reported in the popular media (e.g., the probability of some forms of cancer increases by 50% if one stops exercising) and is common in many scholarly literatures. In political science, the term is not often used, but the measure is usually computed directly or studied implicitly. Also of considerable interest is the first difference (or attributable risk), the change in probability as a function of a change in [...], such as $\Pr(Y = 1 \mid X = 1) - \Pr(Y = 1 \mid X = 0)$. The first difference is usually most informative when measuring effects, whereas relative risk is dimensionless and so tends to be easier to compare across applications or time periods. Although scholars often argue about their relative merits (see Breslow and Day 1980, Chap. 2; and Manski 1999), reporting the two probabilities that make up each relative risk and each first difference is best when convenient.

3 How to Select on the Dependent Variable

We first distinguish among alternative data collection strategies and show how to adapt the logit model for each. Then, in Section 5, we build on these models to also allow rare event and finite sample corrections. This section discusses research design issues, and Section 4 considers the specific statistical corrections necessary.

3.1 Data Collection Designs

The usual strategy in econometrics is either random sampling, in which all observations $(X, Y)$ are selected at random, or exogenous stratified sampling, which allows $Y$ to be randomly selected within categories defined by $X$. Optimal statistical models are identical under these two sampling schemes. Indeed, in epidemiology, both are known under one name, cohort (or cross-sectional, to distinguish it from a panel) study.
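The logit quantities defined in Section 2 can be checked numerically. The sketch below, which uses only the Python standard library, evaluates the inverse logistic function of Eq. (1), computes the log-likelihood both as separate sums over ones and zeros and in the compact form with exponent $(1 - 2Y_i)x_i\beta$, and then derives a relative risk and a first difference. The data, the coefficient values, and all function names are made up for illustration; this is not the ReLogit implementation.

```python
import math


def inv_logit(xb):
    """Inverse logistic function: pi = 1 / (1 + exp(-x*beta)), as in Eq. (1)."""
    return 1.0 / (1.0 + math.exp(-xb))


def linear_predictor(x, beta):
    """x*beta for one observation; x includes the constant as its first entry."""
    return sum(xj * bj for xj, bj in zip(x, beta))


def loglik_split(X, y, beta):
    """Log-likelihood as ln(pi_i) summed over ones plus ln(1 - pi_i) over zeros."""
    total = 0.0
    for x, yi in zip(X, y):
        p = inv_logit(linear_predictor(x, beta))
        total += math.log(p) if yi == 1 else math.log(1.0 - p)
    return total


def loglik_compact(X, y, beta):
    """Same log-likelihood in the compact form -sum ln(1 + exp((1 - 2*Y_i) x_i beta))."""
    return -sum(
        math.log(1.0 + math.exp((1 - 2 * yi) * linear_predictor(x, beta)))
        for x, yi in zip(X, y)
    )


def relative_risk(x1, x0, beta):
    """Pr(Y=1 | x1) / Pr(Y=1 | x0)."""
    return inv_logit(linear_predictor(x1, beta)) / inv_logit(linear_predictor(x0, beta))


def first_difference(x1, x0, beta):
    """Pr(Y=1 | x1) - Pr(Y=1 | x0)."""
    return inv_logit(linear_predictor(x1, beta)) - inv_logit(linear_predictor(x0, beta))


# Hypothetical data: a constant plus one binary explanatory variable.
X_DEMO = [(1.0, 0.0), (1.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
Y_DEMO = [0, 1, 0, 0]
BETA_DEMO = (-4.0, 1.0)  # made-up coefficients giving small event probabilities

if __name__ == "__main__":
    print(loglik_split(X_DEMO, Y_DEMO, BETA_DEMO))
    print(loglik_compact(X_DEMO, Y_DEMO, BETA_DEMO))
    print(relative_risk((1.0, 1.0), (1.0, 0.0), BETA_DEMO))
    print(first_difference((1.0, 1.0), (1.0, 0.0), BETA_DEMO))
```

With small event probabilities such as these, the relative risk can be large while the first difference stays small, which is one reason the text recommends reporting the two underlying probabilities as well.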
When one of the values of $Y$ is rare in the population, considerable resources in data collection can be saved by randomly selecting within categories of $Y$. This is known in econometrics as choice-based or endogenous stratified sampling and in epidemiology as a case-control design (Breslow 1996); it is also useful for choosing qualitative case studies (King et al. 1994, Sect. 4.4.2). The strategy is to select on $Y$ by collecting observations (randomly or all those available) for which $Y = 1$ (the “cases”) and a random selection of observations for which $Y = 0$ (the “controls”). This sampling method is often supplemented with known or estimated prior knowledge of the population fraction of ones [...] explanatory variables are unavailable). Finally, case-cohort studies begin with [...] variables collected on a large cohort, and then subsample using all the ones and a random [...] variable to an existing collection, such as the dyadic data discussed above and analyzed [...] Verba [...] (1995) detailed study of activists, each of which was drawn from a larger random sample, with very few variables, of the entire U.S. population. In this paper, we use information on the population fraction of ones when it is available, and so the same models we describe apply to both case-control and case-cohort studies.

Many other hybrid data collection strategies have also been tried. For example, Bueno de Mesquita and Lalman's (1992) design is fairly close to a case-control study with “contaminated controls,” meaning that the “control” sample was from the whole population rather than only those observations for which $Y = 0$ (see Lancaster and Imbens 1996a). Although we do not analyze hybrid designs in this paper, our view is not that pure case-control sampling is appropriate for all political science studies of rare events. (For example, additional efficiencies might be gained by modifying a data collection strategy to fit variables that are easier to collect within regional or language clusters.) Rather, our argument is that scholars should consider a much wider range of potential sampling strategies, and associated statistical methods, than is now common. This paper focuses only on the leading alternative design which we believe has the potential to see widespread use in political science.

3.2 Problems to Avoid

Selecting on the dependent variable in the way we suggest has several pitfalls that should be carefully avoided. First, the sampling design for which the prior correction and weighting methods are appropriate requires independent random (or complete) selection of observations for which $Y = 1$ and $Y = 0$. This encompasses the case-control and case-cohort [...] selection, or via hybrid approaches, require different statistical methods.

Second, when selecting on $Y$, we must be careful not to select on $X$ differently for the two samples. The classic example is selecting all people in the local hospital with liver cancer ($Y = 1$) and a random selection of the U.S. population without liver cancer ($Y = 0$). The problem is that the sample of cancer patients selects on $Y = 1$ and implicitly on the inclination to seek health care, find the right medical specialist, have the right tests, etc. Not recognizing the implicit selection on $X$ is the problem here. Since the $Y = 0$ sample does not similarly select on the same explanatory variables, these data would induce selection bias. One solution in this example might be to select the $Y = 0$ sample from those who received the same liver cancer test but turned out not to have the disease. This design would yield valid inferences, albeit only for the health-conscious population with liver cancer-like symptoms. Another solution would be to measure and control for the omitted variables. This type of inadvertent selection on $X$ can be a serious problem in endogenous designs, just as selection on $Y$ can bias inferences in exogenous designs. Moreover, although in the social sciences random (or experimenter control over) assignment of the values of the explanatory variables for each unit is occasionally possible in exogenous or random sampling (and with a large $n$ is generally desirable since it rules out omitted variable bias), random assignment on $X$ is impossible in endogenous sampling. Fortunately, bias due to selection on $X$ is much easier to avoid in applications such as international conflict and related fields, since a clearly designated census of cases is normally available from which to draw a sample. Instead of relying on the decisions of subjects about whether to come to a hospital and take a test, the selection into the data set in our field can often be entirely determined by the investigator. See Holland and Rubin (1988).

Third, another problem with intentional selection on $Y$ is that valid exploratory data analysis can be more hazardous. In particular, one cannot use an explanatory variable as a dependent variable in an auxiliary analysis without special precautions (see Nagelkerke et al. 1995).

Finally, the optimal trade-off between collecting more observations versus better or [...] judgment calls and qualitative assessments. Fortunately, to help guide these decisions in fields like international relations we have large bodies of work on methods of quantitative measurement and, also, many qualitative studies that measure hard-to-collect variables for a small number of cases (such as leaders' perceptions).

[...] on the optimal trade-off between more observations and better variables. First, when zeros and ones are equally easy to collect, and an unlimited number of each are available, an “equal shares sampling design” (i.e., $\bar{y} = 0.5$) is optimal in a limited number of situations and close to optimal in a large number (Cosslett 1981b; Imbens 1992). This is a useful fact, but in fields like international relations, the number of observable ones (such as wars) is strictly limited, and so in most of our applications collecting all available or a large sample of ones is best. The only real decision, then, is how many zeros to collect in addition. If collecting zeros were costless, we should collect as many as we can get, since more data are always better. If collecting zeros is not costless, but not (much) more expensive than collecting ones, then one should collect more zeros than ones. However, since the marginal [...] to drop as the number of zeros passes the number of ones, we will not often want to collect more than (roughly) two to five times more zeros than ones. In general, the optimal number of zeros depends on how much more valuable the explanatory variables become with the resources saved by collecting fewer observations.

Finally, a useful practice is sequential, involving first the collection of all ones and (say) an equal number of zeros. Then, if the standard errors and confidence intervals are narrow enough, stop. Otherwise, continue to [...] explanatory variables sequentially as well, but this is not often the case.

4 Correcting Estimates for Selection on Y

Designs that select on $Y$ can be consistent and efficient but only with the appropriate statistical corrections. Sections 4.1 and 4.2 introduce the prior correction and weighting [...] for the logit model. [...] et al. (1985) have shown that two of these econometric methods are equivalent to prior correction for the logit model. In Appendix A, we explicate this result and then prove that the best econometric estimator in this tradition also reduces to the method of prior correction when the model is logit and the sampling probability, $E(\bar{y})$, is unknown. To our knowledge, this result has not appeared previously in the literature.

4.1 Prior Correction

Prior correction involves computing the usual logistic regression MLE and correcting the estimates based on prior information about the fraction of ones in the population, $\tau$, and the observed fraction of ones in the sample (or sampling probability), $\bar{y}$. Knowledge of $\tau$ can come from census data, a random sample from the population measuring $Y$ only, a case-cohort sample, or other sources.

In Appendix B, we try to elucidate this method by presenting a derivation of the method of prior correction for logit and most other statistical [...]. For the logit model, in any of the above sampling designs, the MLE $\hat{\beta}_1$ is a statistically consistent estimate of $\beta_1$ and the following corrected estimate is consistent for $\beta_0$:

\[ \hat{\beta}_0 - \ln\left[ \left( \frac{1 - \tau}{\tau} \right) \left( \frac{\bar{y}}{1 - \bar{y}} \right) \right], \tag{7} \]

which equals $\hat{\beta}_0$ only in randomly selected cross-sectional data. Of course, scholars are not normally interested in $\beta$ but rather in the probability that an event occurs, $\Pr(Y_i = 1 \mid \beta) = \pi_i = (1 + e^{-x_i\beta})^{-1}$, which requires good estimates of both $\beta_1$ and $\beta_0$. Epidemiologists and biostatisticians usually attribute prior correction to Prentice and Pyke (1979); econometricians attribute the result to Manski and Lerman (1977), who in turn credit an unpublished comment by Daniel McFadden. The result was well-known previously in the special case of all discrete covariates (e.g., Bishop et al. 1975, p. 63) and has been shown to apply to other multiplicative intercept models (Hsieh et al. 1985, p. 659).

Prior correction requires knowledge of the fraction of ones in the population, $\tau$. Fortunately, $\tau$ is straightforward to determine in international conflict data since the number of conflicts is the subject of the study and the denominator, the population of countries or dyads, is easy to count even if not entirely in the analysis.[4]

A key advantage of prior correction is ease of use. Any statistical software that can estimate logit coefficients can be used, and Eq. (7) is easy to apply to the intercept. If the functional form and explanatory variables are correct, estimates are consistent and asymptotically efficient. The chief disadvantage of prior correction is that if the model is misspecified, estimates of both $\beta_0$ and $\beta_1$ are slightly less robust than weighting (Xie and Manski 1989), a method to which we now turn.

4.2 Weighting

An alternative procedure is to weight the data to compensate for differences in the sample ($\bar{y}$) and population ($\tau$) fractions of ones induced by choice-based sampling. The resulting weighted exogenous sampling maximum-likelihood estimator (due to Manski and Lerman 1977) is relatively simple. Instead of maximizing the log-likelihood in Eq. (5), we maximize the weighted log-likelihood:

\[ \ln L_w(\beta \mid y) = w_1 \sum_{\{Y_i = 1\}} \ln(\pi_i) \; [\ldots] \]

[4] King and Zeng (2000a), building on results of Manski (1999), modify the methods in this paper for the situation when $\tau$ is unknown or partially known. King and Zeng use “robust Bayesian analysis” to specify classes of prior distributions on $\tau$, representing full or partial ignorance. For example, the user can specify that $\tau$ is completely unknown or known, with some probability, to lie only in a given interval. The result is classes of posterior distributions (instead of a single posterior) that, in many cases, provide informative estimates of quantities of [...]
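The prior correction of Eq. (7) is simple enough to sketch in a few lines. The example below assumes a logit has already been fit to a sample selected on $Y$ and that only the intercept needs adjusting, as the text describes. The coefficient values and the equal-shares sample ($\bar{y} = 0.5$) are hypothetical; $\tau$ is set from the dyad figures quoted in the introduction (1042 of 303,814). The function names are illustrative and are not taken from ReLogit.

```python
import math


def prior_corrected_intercept(beta0_hat, tau, ybar):
    """Apply Eq. (7): beta0_hat - ln[((1 - tau) / tau) * (ybar / (1 - ybar))].

    tau  : fraction of ones in the population
    ybar : fraction of ones in the (Y-selected) sample
    """
    if not (0.0 < tau < 1.0 and 0.0 < ybar < 1.0):
        raise ValueError("tau and ybar must lie strictly between 0 and 1")
    return beta0_hat - math.log(((1.0 - tau) / tau) * (ybar / (1.0 - ybar)))


def event_probability(x, beta0, beta1):
    """Pr(Y = 1 | x) = 1 / (1 + exp(-(beta0 + x * beta1)))."""
    xb = beta0 + sum(xj * bj for xj, bj in zip(x, beta1))
    return 1.0 / (1.0 + math.exp(-xb))


if __name__ == "__main__":
    tau = 1042 / 303814   # population fraction of ones, from the dyad figures
    ybar = 0.5            # hypothetical equal-shares sample
    beta0_hat = -0.5      # made-up uncorrected intercept
    beta1_hat = (0.8,)    # made-up slope; left unchanged by the correction

    beta0_corrected = prior_corrected_intercept(beta0_hat, tau, ybar)
    x = (1.0,)
    print("uncorrected probability:", event_probability(x, beta0_hat, beta1_hat))
    print("corrected probability:  ", event_probability(x, beta0_corrected, beta1_hat))
```

When the sample over-represents ones ($\bar{y} > \tau$), the correction lowers the intercept and therefore every estimated probability; when $\bar{y} = \tau$ the log term vanishes, matching the statement that the corrected estimate equals $\hat{\beta}_0$ only in randomly selected cross-sectional data.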