MicroarrayData
∗
DaxinJiang†
†
JianPei†‡MuraliRamanathan†ChunTang†AidongZhang†
‡
StateUniversityofNewYorkatBuffalo,USASimonFraserUniversity,CanadaEmail:{djiang3,jianpei}@cse.buffalo.edu,murali@acsu.buffalo.edu,{chuntang,azhang}@cse.buffalo.edu
ABSTRACT
Extensivestudieshaveshownthatminingmicroarraydatasetsisimportantinbioinformaticsresearchandbiomedicalapplications.Inthispaper,weexploreanoveltypeofgene-sample-timemicroarraydatasets,whichrecordstheexpres-sionlevelsofvariousgenesunderasetofsamplesduringaseriesoftimepoints.Inparticular,weproposetheminingofcoherentgeneclustersfromsuchdatasets.Eachclus-tercontainsasubsetofgenesandasubsetofsamplessuchthatthegenesarecoherentonthesamplesalongthetimeseries.Thecoherentgeneclustersmayidentifythesamplescorrespondingtosomephenotypes(e.g.,diseases),andsug-gestthecandidategenescorrelatedtothephenotypes.Wepresenttwoefficientalgorithms,namelytheSample-GeneSearchandtheGene-SampleSearch,tominethecompletesetofcoherentgeneclusters.Weempiricallyevaluatetheperformanceofourapproachesonbotharealmicroarraydatasetandsyntheticdatasets.Thetestresultshaveshownthatourapproachesarebothefficientandeffectivetofindmeaningfulcoherentgeneclusters.
CategoriesandSubjectDescriptors
H.2.8[DatabaseManagement]:DatabaseApplications—DataMining
GeneralTerms
Algorithms,Experimentation
Keywords
Bioinformatics,clustering,microarraydata
∗0234895,Thisresearchispartlysupportedby01A1.IIS-0308001andNIHgrant1NSFP20grantsDBI-dationsAllnecessarilyinthisopinions,findings,conclusionsandGM067650-recommen-reflectpapertheareviewsthoseoftheoffundingtheauthorsagencies.
anddonotPermissiontomakedigitalorhardcopiesofallorpartofthisworkfor
personalorclassroomuseisgrantedwithoutfeeprovidedthatcopiesarenotmadeordistributedforprofitorcommercialadvantageandthatcopiesbearthisnoticeandthefullcitationonthefirstpage.Tocopyotherwise,torepublish,topostonserversortoredistributetolists,requirespriorspecificpermissionand/orafee.
KDD’04,August22–25,2004,Seattle,Washington,USA.Copyright2004ACM1-58113-888-1/04/0008...$5.00.
1.INTRODUCTION
Therecentmicroarraytechnologycanmeasuretheex-pressionlevelsofthousandsofgenessimultaneously.Itisanimportantresearchprobleminbioinformaticsandclini-calresearchtoexplorethepatternsinmicroarraydatasets.Forexample,indrugdevelopment,geneexpressionpatternsmayreflectgene-levelresponsestodifferentdrugtreatmentsandprovidedeepinsightsintothenatureofthediseases.Mostmicroarraydatasetscanbedividedintotwocate-gories.Ontheonehand,thegene-timedatasetsrecordtheexpressionlevelsofvariousgenesduringimportantbiologi-calprocessesoveraseriesoftimepoints.Thegene-sampledatasetsaccounttheexpressionlevelsofvariousgenesacrossrelatedsamples.Bothgene-timedatasetsandgene-sampledatasetscanberepresentedbyan×lgeneexpressionmatrixofngenesandlsamples/timepoints.Inthegeneexpressionmatrix,therowsarethegenesandthecolumnsareeithersamples(ingene-sampledatasets)ororderedtimepoints(ingene-timedatasets),whileeachcellrepresentstheex-pressionlevelofacertaingeneonacertainsampleoratacertaintimepoint.
Withthelatestadvancesinthemicroarraytechnology,theexpressionlevelsofasetofgenesunderasetofsam-plescanbemonitoredsynchronicallyduringaseriesoftimepoints[20].Differentfromthepreviousgene-timeorgene-samplemicroarraydatasets,thesenewdatasetshavethreetypesofvariables:genes,samplesandtime.Wecallsuchdatagene-sample-timemicroarraydata,orGSTdataforshort.Figure1(a)elaboratesthestructureofaGSTmi-croarraydataset.
ssamplesj1sj2sj3emitsamplesgenesgi1genesm1111mm1121lgi2m121m1n1m1nlgi3(a)
(b)
Figure1:ThestructureofGSTmicroarraydataIngeneral,eachcellmki,jinaGSTdatasetrepresentstheexpressionlevelofgenegiundersamplesjattimepointtk.Interestingly,aGSTdatasetcanalsobeviewedasan×l
matrix,suchthateachcellmi,jcontainsthetimeserieswithrespecttogenegiundersamplesj,asshowninFigure1(b).Thepreviousstudiesongene-samplemicroarraydata(e.g.,[7,2,15,1])indicatethathighcorrelationsmayexistbe-tweenthegeneexpressionpatternsandsomediseases.ItisnaturaltoextendthesimilaranalysistoGSTmicroarraydata.Thatis,itisinterestingtoidentifyasubsetofgenesGandasubsetofsamplesSinaGSTmicroarraydatasetsuchthateachgeneg∈GhascoherentpatternsacrossthesamplesinSduringthetimeseries.Forexample,inFig-ure1,genegi1,gi2andgi3showcoherentpatternsacrosssamplessj1,sj2andsj3,respectively.Wecallsuchsubsetsofgenesandsamplesacoherentgenecluster.
Whatisthebiologicalmeaningofthecoherentgeneclus-ters?Thecoherentgeneclustersprovidevaluablehypothesisforbiologists.Thesamplesinaclustermaycorrespondtoaphenotype,suchasthepatientshavingadisease,whilethecorrespondingsetofgenesmaysuggestthecandidategenescorrelatedtothedisease.
Thefunctionsofgenesinanorganismarehighlycom-plicated.Therearetypicallymultiplecoherentclustersinadataset.Differentclustersmaycorrelatetodifferentpheno-types,suchasageandgender.Therefore,toavoidmissinganyvaluablehypothesis,itisnecessarytominealltheco-herentclustersinthedataset.
Manypreviousstudiesinvestigatetheminingofinterest-ingpatternsfrommicroarraymatrices.Forexample,vari-ousclusteringalgorithmscanidentifytheco-expressedgenesshowingcoherentpatternsduringthetime-series(e.g.,[17,12,15,8]).Moreover,bothsupervisedorunsupervisedap-proachesareproposedtopartitionthesamplesintohomo-geneousgroups(e.g.,[4,14,9,16]).Additionally,statisticalapproacheshavebeenproposedtovalidatethesignificanceoftheminingresults(e.g.,[3,18,5]).However,allprevi-ousstudiestargetatconventionalgene-timeorgene-samplemicroarraydatasets.Themodelsofclustersinthosepre-viousstudiesaredifferentfromourcoherentgeneclusterswhichdisclosethecorrelationamonggenes,samplesandtimepoints.Therefore,thosealgorithmscannotbeextendeddirectlytosolveourproblem.
Recently,thepattern-basedclusteringapproaches(e.g.,[19])havebeendevelopedtodiscoversubsetsofobjectsfollowingsimilarpatternsonsubsetsofattributes.Conceptually,apattern-basedclusterisacoherentgenecluster.IfwetreattheGSTmicroarraydatasetsasan×lmatrixoftimese-ries,asshowninFigure1(b),thenpattern-basedclustersandcoherentgeneclustersmayhavesomesimilarityatthefirstlook.However,insomepattern-basedapproaches,aclusterrequiresthateachpairofobjectsintheclustermustbecoherentoneachpairofattributes.Sucharequirementisoftentoostronginpractice.Ourcoherentgeneclusteringrelaxestheconstraintsamongtheobjects.Therefore,eachtraditionalpattern-basedclusterisacoherentgenecluster,butnotnecessarilytrueviceversa.
Inthispaper,wetackletheproblemofminingcoherentpatternsfromgene-sample-timemicroarraydatasetsandmakethefollowingcontributions.
First,weproposeamodelofcoherentgeneclustersinGSTmicroarraydatasets.Wejustifythatthemodelismeaningfulforbiomedicalresearch.
Second,weidentifythecomputationalchallengesandcon-ductasystematicresearchonminingcoherentgeneclus-tersfromGSTmicroarraydatasets.Wedeveloptwoap-
proaches,namelytheGene-SampleSearchandtheSample-GeneSearch,tominethecompletesetofcoherentgeneclus-ters.Weillustrateandcomparetheefficiencyandscalabilityofbothapproaches.
Last,weconductanextensiveempiricalevaluationonbothrealdatasetsandsyntheticdatasets.Ourresultsshowthatourproposedmethodscanfindcoherentgeneclustersinterestingtobiomedicalresearchfromrealdatasets.Theresultsonsyntheticdatasetsalsoshowthatouralgorithmsarebothefficientandscalable.
Therestofthepaperisorganizedasfollows.Section2definestheproblem.Section3describesthepreprocessingstepofcomputingthemaximalcoherentsamplessetsforeachindividualgene.Section4presentstwoalgorithmstominecoherentgeneclusters.InSection5,ourmethodsareevaluatedusingrealandsyntheticdatasets.RelatedworkisdiscussedinSection6.Section7concludesthepaper.
2.PROBLEMDESCRIPTION
GivenasetofngenesG-Set={g1,...,gn}andasetoflsamplesS-Set={s1,...,sl},wecanmeasuretheexpressionlevelsofthegenesonthesamples.Theresultsformacon-ventionaln×lmicroarraymatrixM={mi,j},wheremi,jistheexpressionlevelofgenegi(1≤i≤n)onsamplesj(1≤j≤l).Ifsuchmicroarrayexperimentsareconductedsynchronicallyonallgenesandallsamplesattimeinstantst1,...,tT,theresultsforman×l×TGSTmicroarraymatrixM={mti,j},where(1≤t≤T).
AGSTmicroarraymatrixM={mti,jasan×lmatrixM={m
}canalsobeviewed
i,j}suchthatmi,jisavector
ofm1i,j,...,mT
i,j.Hereafter,wedonotstrictlydistinguishthetwonotations.Instead,whenevermi,jiswritten,thevectorisreferredto.
Inthispaper,weareinterestedinfindingthosegenesthatarecoherentonasubsetofsamplesduringthewholetimeseries.Therearevariousmethodstomeasurethecorrela-tionbetweentwotimeseries.However,forgeneexpressiondata,usersareofteninterestedintheoveralltrendsoftheexpressionlevelsinsteadoftheabsolutemagnitudes.There-fore,wechoosethePearson’scorrelationcoefficientasthecoherencemeasure,sinceitisrobusttoshiftingandscalingpatterns[22].Specifically,giventwovectorsmi,j1andmi,j2ofgenegi,thecoherenceρ(mi,j1,mi,j2)isdefinedas
Tmti,jt
t=1(1−mi,j1)(mi,j2−mi,j2)Tt=1(mtT,i,j1)2t=1(mti,j21−mi,j2−mi,j2)mmtwherei,j=Ti,j
istheofgenegonsamplet=1Tmeanoftheexpressionlevelsisj.Thecorrelationcoefficientrangesbetween−1and1.Thelargerthevalue,themorecoherentarethetwovectors.
AgenegiiscoherentacrossasubsetofsamplesS⊆S-Set,ifgivenanypairofsamplessj1,sj2∈S,ρ(mi,j1,mi,j2)≥δ,whereδisaminimumcoherencethresholdspecifiedbytheuser.ForasubsetofgenesG⊆G-SetandasubsetofsamplesS⊆S-Set,ifeverygenegi∈GiscoherentacrosssamplesinS,wecallgenesetGcoherentonsamplesetS.(G×S)iscalledacoherentgenecluster.Acoherentgeneclusterhavingugenesandvsamplesissaida(u,v)-coherentgenecluster.
Trivially,foranygenegiandanysamplesj,({gi}×{sj})isa(1,1)-coherentgenecluster,and(G-Set×{sj})and({gi}×
S-Set)aretrivial(|G-Set|,1)-and(1,|S-Set|)-coherentgeneclusters,respectively.Toavoidthistriviality,werequirethatacoherentgeneclustershouldconsistofatleasttwogenesandtwosamples.
Givenacoherentgenecluster(G×S),foranysubsetsG
⊆GandS⊆S,(G×S)isalsoacoherentgenecluster.Toavoidsuchredundancy,acoherentgenecluster(G×S)ismaximalifthereexistsnoanyothercoherentgeneclus-ter(G×S)suchthatG⊆G,S⊆S.Moreover,ausermaynotbeinterestedinverysmallclusters,whichareoftenformedbychance.Thus,ausercanspecifytheminimumnumbersofgenesandsamplesinacluster.Generally,givenmingandminsasuserdefinedminimumgenesizeandsam-plesizethresholds,acluster(G×S)iscalledsignificantif|G|≥mingand|S|≥mins.
ProblemdefinitionGivenaGSTmicroarraymatrixM,aminimumcoherencethresholdδ,aminimumgenesizethresholdmingandaminimumsamplesizethresholdmins,theproblemofminingcoherentgeneclustersistofindthecompletesetofsignificantmaximalcoherentgeneclustersinMwithrespecttotheparameters.Hereafter,asignifi-cantmaximalcoherentgeneclusteriscalledacoherentgeneclusterforshort.
3.MAXIMALCOHERENTSAMPLESETS
Weproposetwoalgorithmscomputingmaximalcoherentgeneclusters.Inbothalgorithms,tocomputecoherentgeneclusters,weneedtocheckwhetherasubsetofgenesarecoherentonasubsetofsamples.Tofacilitatethetests,foreachgenegk,wecomputethesetsofsamplesSsuchthat(1)|S|≥mins;(2)gkiscoherentonS;and(3)thereexistsnosupersetS⊃SsuchthatgkisalsocoherentonS.Siscalledamaximalcoherentsamplesetofgk.Pleasenotethat,ingeneral,agenemayhavemorethanonemaximalcoherentsampleset.
Foragenegk,allofitsmaximalcoherentsamplesetscanbecomputedefficientlyusingthefollowing2-stepprocess.Inthefirststep,wetestwhethergenegkiscoherentoneachpairofsamples(si,sj).Abinarytrianglematrix{ci,j}ispopulated,where1≤i Unliketheconventionalcliqueproblemwherethecliqueofthemaximalsizeisfound,here,weneedtofindthecompletesetofmaximalcliquesinthegraph.ItiswellknownthattheconventionalcliqueproblemisNP-complete.Soistheproblemoffindingthecompletesetofmaximalcliques.Fortunately,therealGSTmicroarraydatasetsareoftensparseandthenumberofsamplesistypicallybelowonehun-dred.Foreachgene,thenumberofmaximalcliquesisquitesmallandthesamplescanoftenbepartitionedintoexclu-sivesmallsubsets.Ourexperimentalresultsshowthat,withefficientsearchandpruningtechniques,itisstillpracticaltofindthecompletesetofmaximalcliques.Inthefollowing, {} {a}{b}{c}{d}{a,b}{a,c}{a,d}{b,c}{b,d}{c,d}{a,b,c}{a,b,d}{a,c,d}{b,c,d}{a,b,c,d}Figure2:Enumerationofcombinationsofsamples.wewillshowhowtofindthemaximalcliquesofasamplesetbyadepth-firstsearchinasamplesetenumerationtree.GivenasetofsamplesS={s1,...,sl},theset2S(i.e.,allcombinationsofsamples)canbeenumeratedsystematically.Forexample,considerasetofsamplesS={a,b,c,d}.Thecompletesetofnon-emptycombinationsofsamplescanbedividedinto4exclusivesubsets:(1)theoneshavingsamplea;(2)theoneshavingsamplebbutnoa;(3)theoneshavingsamplecbutnoaorb;and(4){d}.TheyareshownastheimmediatechildrenoftherootinFigure2. Thesesubsetscanbefurtherpartitioned.Forexample,thefirstsubsetcanbefurtherdividedintothreeexclusivesub-subsets:(1)theoneshavingsamplesaandb;(2)theoneshavingsamplesaandcbutnob;and(3){a,d}. ThetreeshowninFigure2iscalledasetenumerationtree[13]withrespectto{a,b,c,d}.Itprovidesaconceptualtooltoenumeratethecompletesetofcombinationssystem-atically. Wecanconductarecursive,depth-firstsearchofthesam-plesetenumerationtreetodetectthemaximalcliquesofthesamples.|GivenasetofsamplesS,thesetenumerationtreehas2|Snodes.However,weneverneedtomaterializesuchatree.Instead,weonlyneedtokeepapathfromtherootofthetreetothenodewearesearchingasaworkingset,whichcontainsatmost(|S|+1)nodes. Clearly,inthesetenumerationtree,eachnodecontainsauniquesubsetofsamples.Thuswecanusethesubsetofsamplestorefertothenode.Atnode{si1,...,sik}(1≤i1<··· PruningRule3.1(Pruningsmallsamplesets).Atanodev={si1,...,sik},thesubtreeofvcanbeprunedif(k+|Tail|) Moreover,ifthesamplesatthecurrentnodeanditsTailaresubsumedbysomemaximalcoherentsamplesetfound Input:theGSTdataset,thecoherencethresholdδ; minimumsamplesizethresholdmins; Output:themaximalcoherentsamplesetsforeachgene;Method: foreachgeneggeneratematrixkdo ci,jforg//depth-firstsearch k;fori=1to(l−mins+1)do callsearch-clique({si},{si+1,...,sl});end-forend-forProcedure:search-clique(head,tail) //headrecordsthesamplesinthecurrentnode supposesiisthelastsampleinhead,removesamples sjfromtailsuchthatci,j=0;//Lemma3.1if(|head∪tail| //Pruning3.2 thenreturn;//aborttherecursivesearchiftail=∅thenoutputamaximalclique;else do letj=min{k|sk∈tail}andtail=tail−{sj};callsearch-clique(head∪{sj},tail);untiltail=∅;return; Figure3:Computingmaximalcoherentsamplesets.sofar,thentherecursivesearchcanalsobepruned,sinceitcannotleadtoanynewmaximalcoherentsampleset.PruningRule3.2(Pruningsubsumedsets).Atanodev={si1,...,sik},if{si1,...,sik}∪Tailisasubsetofsomemaximalcoherentsampleset,thenthesubtreeofthenodecanbepruned. Basedontheabovelemmaandpruningrules,theprepro-cessingalgorithmispresentedinFigure3.Forthereadersfamiliarwiththetechniquesofdepth-firstminingofmaxi-mal/closedfrequentpatterns,theideasofpruningheresharethesimilarspiritwiththepruninginfrequentcloseditemsetmining(e.g.,[10]).However,onekeydifferenceisthatthefrequentpatternminingconductscountingondatabases,andwedonotneedtoscanthedatabaseforanycountingoncethetrianglematrix{ci,j}ismaterialized.Thecorrect-nessofthepreprocessingalgorithmcanbeshown.Limitedbyspace,weomitthedetailshere. 4.THEMININGALGORITHMS Ana¨ıvemethodtofindthemaximalcoherentgeneclustersistotesteverypossiblecombinationofgenesandsamplesthoroughly.Afterallthecoherentgeneclustersarefound,wecanidentifyandreportthemaximalones.Thena¨ıvemethodisverycostlyandthusinfeasibleforrealdatasets.Forexample,supposewehave1,000genesand201000samples.Thena¨ıvemethodmayhavetosearchupto(2−1)×(220−1−20)≈1.12×10307combinations! Howcanwesearchthehugespaceefficientlyandpruneunpromisingsubspacessharply?Whencomputingthemaxi-malcoherentsamplesets(Figure3),wesystematicallyenu-meratecombinationsofsamplesinarecursivedepth-firstsearch,anddeveloptechniquestopruneunpromisingsub-spacesaggressively.Stimulatedbythesimilarspirit,here TheSample-GeneSearch depth-firstenumeratesubsetsofsamplesforeachsubsetofsamplesSdo findthemaximalsubsetsofgenesG s.t.G×Sisacoherentgenecluster; testwhether(G×S)isamaximalcoherentgene cluster; end-forTheGene-SampleSearch depth-firstenumeratesubsetsofgenesforeachsubsetofgenesGdo findthemaximalsubsetsofsamplesS s.t.G×Sisacoherentgenecluster; testwhether(G×S)isamaximalcoherentgene cluster; end-for Figure4:TheframeworksoftheSample-GeneSearchandtheGene-SampleSearch. wecanalsosystematicallyenumeratethecombinationsofgenesandsamples,andpruneunfruitfulcombinations. Basically,wehavetwoalternatives.Ontheonehand,wecanenumerateallcombinationsofsamplessystematically.Foreachsubsetofsamples,wecanfindthemaximalsubsetsofgenesthatformcoherentgeneclustersonthesamples,andcheckwhethertheclustersaremaximal.ThismethodiscalledtheSample-GeneSearch.Ontheotherhand,wecanletthegeneenumerationgofirst.Foreachsubsetofgenes,wefindthemaximalsubsetsofsamplesthatformcoherentgeneclusterswiththegenes,andcheckwhethertheclustersaremaximal.ThemethodiscalledtheGene-SampleSearch. TheframeworksoftheSample-GeneSearchandtheGene-SampleSearchareshowninFigure4.Properpruningtech-niquesshouldbedevelopedtopruneunpromisingcombina-tionsandsearchbranchesasearlyaspossible. AfirstlookatFigure4maysuggestthatthetwomethodsaresymmetric.However,sincegenesandsamplesarenotsymmetricintheproblem,thetechnicaldetailsareinfactsubstantiallydifferent.Weareminingcoherentgeneclustersonsamples.Aslongasthegenesrespondcoherentlyonthesamesubsetofsamples,theybelongtothesamecluster.However,theexpressionpatternsofdifferentgenesinthesameclusterononesamplecanbeverydifferent! 4.1Sample-GeneSearch IntheSample-GeneSearch,weneedtoaddressthefol-lowingissues.First,asweenumeratethecombinationsofsamplessystematically,foreachsubsetofsamples,howcanwefindthemaximalsetsofgenessuchthatthegenesarecoherentonthesamples?Second,duringthesamplesetenumeration,whichsamplesetscanbepruned?Third,sim-ilartothesituationinPruningrule3.2,canweidentifyandprunethesearchesthatcannotleadtoanypotentialmaximalcoherentclusters?.Last,howcanwedeterminewhetheracoherentgeneclusterissubsumedbytheothers?Weanswertheabovequestionsinthissubsection. 4.1.1MaximalCoherentGeneSetsforSampleSets ForeachcombinationofsamplesS,weneedtocomputethemaximalcoherentgenesetGSsuchthatthegenesinGS GeneMaximalcoherentsamplesetsg1{s1,s2,s3,s4,s5}g2{s1,s2,s4},{s1,s5}g3{s1,s2,s3,s4,s5}g4{s1,s2,s3},{s5,s6}g5{s1,s5,s6}(a)Themaximalcoherentsamplesetsforgenes SampleTheinvertedlists1{g1.b1,g2.b1,g2.b2,g3.b1,g4.b1,g5.b1}s2{g1.b1,g2.b1,g3.b1,g4.b1}s3{g1.b1,g3.b1,g4.b1}s4{g1.b1,g2.b1,g3.b1}s5{g1.b1,g2.b2,g3.b1,g4.b2,g5.b1}s6 {g4.b2,g5.b1}(b)Theinvertedlistsforsamples Figure5:Themaximalcoherentsamplesetsandtheinvertedlists. arecoherentonSandnopropersupersetG⊃GSalsohasthisproperty. Clearly,forageneg,ifthereexistsamaximalcoherentsamplesetSgsuchthatS⊆Sg,theng∈GS.Inotherwords,GScanbederivedbyonescanofthemaximalco-herentsamplesetsofallgenes.Ifamaximalcoherentsam-plesetisasupersetofS,thenthecorrespondinggenegisinsertedintoGS. Itisexpensivetoscanthecompletelistofmaximalco-herentsamplesetsofallgenesonceforeverycombinationofsamples.Anefficientsolutionistouseaninvertedlist.Supposewehave5genesand6samples.ThemaximalcoherentsamplesetsforeachgenearelistedinFigure5(a).Welabeleachmaximalcoherentsamplesetbythegenegk,andtheset-id,bj,inthegene.Forexample,geneg2hastwomaximalcoherentsamplesets,g2.b1={s1,s2,s4}andg2.b2={s1,s5}. Foreachsamples,wemakeuptheinvertedlistLsasthelistofallmaximalcoherentsamplesetscontainings,asshowninFigure5(b). Now,whenwewanttocomputethemaximalcoherentgenesetsforasubsetofsamples,say{s1,s2,s3},wedonotneedtosearchthecompletelistinFigure5(a).Instead,weonlyneedtogettheintersectionoftheinvertedlistsofthesampless1,s2ands3,whichis{g1.b1,g3.b1,g4.b1}.Bythisintersection,weknowthat{g1,g3,g4}isthemaximalcoherentgeneset. 4.1.2PruningIrrelevantSamples ForacombinationofsamplesS={si1,...,sik},wherei1<··· SimilartothesituationinPruningrule3.2,wecanprunetheunpromisingcombinationsthatcannotleadtoanynewmaximalcoherentgenecluster.Here,twopruningtech-niquescanbeapplied. Forexample,inourrunningexample(Figure5),sup-posewefindthemaximalcoherentgenecluster({g1,g3}×{s1,s2,s3,s5})beforewesearchsamplesetS={s1,s3}.ForS={s1,s3},Stail={s5}andGS∪Stail={g1,g3}.Thatis,bothS∪StailandGS∪Stailaresubsumedbythemaxi-malcoherentgenecluster.TherecursivesearchofScannotleadtoanymaximalcoherentgeneclusterandthuscanbepruned. Ingeneral,supposewearesearchingasampleS combination.Ifthereexistsamaximalcoherentgenecluster(G×S) foundbeforesuchthatS∪Stail⊆SandGS∪Stail⊆G, thenanyrecursivesearchfromS resultsinacoherentgeneclustersubsumedby(G×S),andthuscanbepruned. Moreover,ifthereexistsamaximalcoherentgenecluster(G×S)foundbeforesuchthatS⊆SandeverymaximalcoherentsamplesetcontainingSalsocontainsS,thentherecursivesearchofScannotleadtoanymaximalcoherentgeneclustereither,andthuscanbepruned.Forexample,supposewesearchthesamplesetS={s2}afterwefindthemaximalcoherentgenecluster({g1,g2,g3,g4}×{s1,s2})inourrunningexample(Figure5).FromFigure5(a),wecanseethateverymaximalcoherentsamplesetcontainings2alsocontainss1.Inotherwords,thereexistsnomaximalcoherentgeneclustercontainings2butnos1.Thus,thesearchofScanbepruned. 4.1.4Determiningmaximalcoherentgeneclusters WhenwesearchacombinationofsamplesS,weneedtocheckwhetherGS×Sisamaximalcoherentgenecluster.Welookatthosemaximalcoherentgeneclusters(G×S)suchthatS⊃S.Clearly,sinceweconductdepth-firstsearchinthesetenumerationtree,suchmaximalcoherentgeneclustersshouldbereportedeitherbeforeSissearched,orinthesubtreerootedatS. Basedontheabovediscussion,wehavetheSample-GeneSearchalgorithminFigure6. 4.2Gene-SampleSearch TheframeworkofGene-SampleSearchissymmetrictothatofSample-GeneSearch.Thatis,weenumeratethecombinationsofgenessystematically.Foreachcombinationofgenes,wecomputethemaximalsetsofsamplesthatthegenesarecoherenton.ManypruningtechniquesinSample-GeneSearchhavethesymmetricversionsinGene-SampleSearch.Limitedbyspace,weomitthedetailshere.Instead,weonlyfocusonthedifferencesbetweenthetwoapproaches. 4.2.1DeterminingCoherentGeneClusters Theconceptofcoherentsamplesetsforagenecanbegeneralizedforasetofgenes.GivenasetofgenesG,amaximalcoherentsamplesetwithrespecttoGisasetofsamplesSGsuchthat(1)genesinGarecoherentonSG;and(2)thereexistsnoS⊃SGthatsamplesinGarealsocoherentonS.Pleasenotethattherecanbemorethanonemaximalcoherentsamplesetforagivensetofgenes.Howcanwecomputethemaximalcoherentsamplesetsefficiently?Interestingly,SGcanbecomputedbysomesim-pleintersectionoperations.Forexample,supposeS{g1}= Input:themaximalcoherentsamplessetsforgenes, //fromthealgorithminFigure3 Output:thecompletesetofcoherentgeneclustersMethod: generatetheinvertedlistforsamples asdescribedinSection4.1.1;fori=1to(|S-Set|−mins)do letS={si}andStail={si+1,...,s|S-Set|};callrecursive-search(S,Stail);end-forProcedure:recursive-search(S,Ssamplesfromtail) removeirrelevantStailasdescribedin Section4.1.2; if(|S|+|Stail| inSasdescribedinSection4.1.1;ifScanbeprunedbythecriteriain Section4.1.3thenreturn;whileSletitail=∅do =min{j|sj∈Slettail=tail−{stail};i}; callrecursive-search(S∪{si},Stail);end-while derivethemaximalcoherentgenesetGGS; output(S×S)asamaximalcoherentgeneclusterif itisnotsubsumedbyanymaximalcoherentgeneclusterfoundbefore End Figure6:TheSample-GeneSearchalgorithm. {(s1,s2,s3)}andS{g2}={(s2,s3,s4),(s5,s7)}.Then,{s2,s3}istheonlymaximalsamplesetthatbothg1andg2arecoherenton.Thatis,S{gS1,g2}={(s2,s3)}.Inotherwords,wecanderive{g1,g2}from{s1,s2,s3}∩{s2,s3,s4}and{s1,s2,s3}∩{s5,s7}. Ingeneral,ifgenesetG=G1∪G2,thenSGcanbede-rivedfromSG1andSG2bythefunctionfind-max-coherent-sample-setsinFigure7. 4.2.2 PruningherentGeneIrrelevantClusters GenesandUnpromisingCo-SimilartotheideainSection4.1.2,wecanprunegenesthatcannotbeusedtoextendthecurrentcombinationofgenes.GivenasetofgenesG={gi1,...,gik},where1≤ii<··· Functionfind-max-coherent-sample-sets Input:genesetsG1andG2,setsofmaximalcoherent samplesetsSmaximalG1andScoherentG2; Output:thesamplesetsw.r.t.G1∪G2Method: letSG1∪G2=∅; foreachmaximalcoherentsamplesetSi∈SforeachmaximalcoherentsamplesetSG1do j∈SletSG2do k=Si∩Sj; if|Sk|≥minstheninsertSkintoSG1∪G2;end-forend-for foreachSk∈SG1∪G2do ifSthenkisapropersubsetofS∈SS−{SG1∪G2 G1∪G2=SG1∪G2k}; end-for outputSG1∪G2; Figure7:Computingmaximalcoherentsamplesetforasetofgenes. 4.2.3MergingCoherentGenesintheTailList Inourrunningexample,themaximalcoherentsamplesetswithrespecttogeneg1andg3areidentical.Then,foranycoherentgenecluster(G×S)suchthatg1∈G,(G∪{g3}×S)mustalsobeacoherentgenecluster.Thus,wecansearch{g1,ge}inashoot. Ingeneral,wehavethefollowingresult.Lemma4.1.WhensearchacombinationofgenesG,ifthereexistgenes{gj1,...,gjk}⊆GtailsuchthattheyarecoherentoneverymaximalcoherentsamplesetofG,thenthereexistsnomaximalcoherentgeneclustercontainingGbutno{gj1,...,gjk}. BasedonLemma4.1,wecanimmediatelymergegenes{gj1,...,gjk}toGatthecurrentnode,andthusshrinkthenumberofrecursions.Thecomputationtimeissavedaswell,sinceweonlyneedtocheckthecoherentgeneclusters,prunetheirrelevantgenesorunpromisinggeneclustersforallthesegenesinoneshoot. Inourexperimentsonrealdatasets,weobservemanygenescanbemergedbyLemma4.1.Thereal-worldGSTmicroarraydatasetsaretypicallysparseandgenesareco-herentonaquitesmallnumberofsamplesets.Asacon-sequence,theperformanceofGene-SampleSearchcanbeimprovedsubstantiallybythisoptimization. Onemayask,“DowehaveasymmetricpruningforSample-GeneSearch?”Wecanapplythesimilaroptimiza-tiontechniqueforsample-gene.Thatis,asamplesjismergedintocurrentcombinationofsamplesSaslongastheinvertedlistofSisasubsetofthatofsj.However,suchasituationisrareinpractice,sinceitisrarethat,fortwodifferentsamplesetsS1andS2,themaximalcoherentgenesetswithrespecttoS1andS2areidentical. Basedontheabovediscussion,wehavetheGene-SampleSearchalgorithmasshowninFigure8. 5.EXPERIMENTALRESULTS WeimplementedandtestedourapproachesonbotharealGSTmicroarraydatasetandsyntheticdatasets.ThesystemisimplementedinJava.ThetestsareconductedonaSunUltra10workstationwitha440MHzCPUand256MBmainmemory. Inputandoutput:sameastheSample-GeneSearch algorithm(Figure6)Method: fori=1to(|G-Set|−ming)do letG={gi}andGtail={gi+1,...,g|G-Set|};callrecursive-search(G,Gend-fortail);Procedure:recursive-search(G,GremoveirrelevantgenesfromGtail) tailasdescribedin Section4.2.2; if(|G|+|Gtail| Section4.2.3; foreachmaximalcoherentsamplesetSi∈SifSGdo icanbeprunedbythesecondcriteriain Section4.2.2thenremoveSifromSG; end-for if(SG=∅)thenreturn;whileGtail=∅do leti=min{j|gj∈GGtail};lettail=Gtail−{gi}; callrecursive-search(G∪{gi},Gtail);end-while foreachsamplesetSiinSoutput(G×SGdo i)asamaximalcoherentgene clusterifitisnotsubsumedbyanymaximalcoherentgeneclusterfoundbefore end-forEnd Figure8:TheGene-SampleSearchalgorithm. 5.1TheDataSets Therealdataset.Weusetherealgene-sample-timemi-croarraydatasetreportedin[20].Itconsistsofthemi-croarraymeasurementsof4,324genesin13multiplesclero-sis(MS)patientsbeforeandat1,2,4,8,24,48,120and 168hoursafterIFN-βtreatments.MSpatientsshowhetero-geneousresponsestoIFN-βtreatments.Forexample,thepatientswithrelapsingMSrespondbettertoIFN-βtreat-mentsthanthepatientswithprogressivediseasedo.How-ever,relapsingMSpatientsalsoexhibitconsiderableinter-individualheterogeneityintheirclinicalresponsestoIFN-βtherapies.Sofar,theeffectsofIFN-βtreatmentatthege-nomiclevelinhumansarepoorlyunderstood.Researchersareinterestedindistinguishingtheheterogeneousclinicalre-sponsetoIFN-βtherapyamongthepatients.Ontheotherhand,characterizedgeneexpressiondynamicscorrelatedtotheheterogeneousresponsespotentiallyhelpinexploringthecausingmechanismsatthemolecularlevel. Syntheticdata.Weobservethatthepreprocessing,i.e.,miningthemaximalcoherentsamplesetsforeachindivid-ualgene,isrelativelyfast.Themajorbottleneckinminingcoherentgeneclustersisinthelatterpart.Therefore,in-steadofgeneratingsyntheticGSTdatasets,wesimulatethetableofmaximalcoherentsamplesetsforgenessuchasinFigure5(a).Initially,anemptytableiscreated.Then,acertainnumberofcoherentgeneclusters(G×S)areran-domlygenerated.Foreachg∈G,Sisinsertedintothetableasonemaximalcoherentsamplesetwithrespecttog.Inadditiontothesizeofthesyntheticdataset,i.e.,thetotalnumberofgenesinG-SetandthenumberofsamplesinS-Set,thesyntheticdatageneratortakesthefollowingparameters:(1)k,thenumberofcoherentgeneclustersin thedataset;(2)maxgeneandmingene,themaximalandminimalnumbersofgenesinacoherentgenecluster,re-spectively;and(3)maxsampleandminsample,themaximalandminimalnumbersofsamplesinacluster,respectively.Wegeneratethedatasetsbysettingmingene=10andminsample=5.maxsampleissettothesamevalueof|S-Set|,andmaxgeneissetto1000.Inpractice,onlyasmallnumberofgenesarecorrelatedwithaphenotype[7].Whenthesizeofthedatasetgrows,weexpecttoseemoreco-herentgeneclusters.Tosimulatethesituation,wesetkto|G-Set|·|S-Set| 3000.5.2ResultsontheMSMicroarrayData WeapplyouralgorithmsontheMSmicroarraydatawithming=15,mins=3andδ=0.8.Intotal,21coherentgeneclustersarereported.Tobetterunderstandthemin-ingresults,wefeedthegenesineachclustertoOnto-Express(http://vortex.cs.wayne.edu/Projects.html)andobtainahi-erarchyoffunctionalannotationsintermsofGeneOntologyforeachcluster.Anexampleofgeneontologytreeforclus-ter17isshowninFigure9(a).Then,wefurtherinvestigatethegenesandsamplesintheclusters.Someinterestingob-servationsareobtainedasfollows. First,asexpected,themajoritygenesintheclustersareinvolvedincellularprocessesandphysiologicalprocesses,whilegenesinvolvedinotherbiologicalprocesses(e.g.,de-velopment,behaviorandvirallifecycle)arenothighlyrep-resented(Figure9(b)).Moreover,amongthegenesinvolvedincellularprocess,thoseinvolvedincellcommunicationandcellgrowthand/ormaintenancearepredominant(Fig-ure9(c)).SinceIFN-βisknowntohaveanti-proliferativeactivities,thehighpopulationofcellularprocessgenesin-volvedincellgrowthand/ormaintenanceisbiologicallyplausible. Second,abunchofgenes(e.g.,incluster10,genesH38522,AA704613,N92443,N75595,etc.)thatarewellknownfortranscriptionalsignalingandcellularsignalingcanbeidenti-fiedintheresultedclusters.Thosegenes,togetherwiththeothergenesinthesameclusterthatareunknownorpoorlyunderstood,mayserveasswitchesinthegeneticnetworkandhenceplayanessentialroleinthebiologicalprocesses.Thus,studyingthetime-seriesofthegenesinthecoherentgeneclustersmaygreatlyhelppeopleunderstandtheregu-latorymechanismsbehindtheresponsetoIFN-βtreatment.Last,coherentgeneclustersalsoconsistofdifferentgroup-ingsofsamples,whichprovidepromisinghypothesisfordif-ferentphenotypes.Forexample,intheMSmicroarraydata,theexpressiondatafrompatientswithdifferentresponsestoIFN-βtreatmentarecollected.Amongthe21reportedclus-ters,only2clusters(cluster10and17)consistof4samples,whileotherclustersonlyconsistoftheminsnumberofsam-ples.Therefore,thosetwoclustersmaybecomegoodcan-didatesfortargetphenotype.Basedontheclinicalrecordsofthepatients,andcombinedwiththegeneinformationfromtheclusters,theinterpretationofthesamplegroupsiscurrentlyunderinvestigation. 5.3EffectsoftheParameters Themaximalcoherentgeneclusterisdefinedwithrespecttothreeparameters,i.e.,theminimumnumberofgenesming,theminimumnumberofsamplesminsandtheco-herencethresholdδ.WetesttheeffectoftheparametersontherealGSTdataset.Figure10(a)showsthenumber A–CellularprocessB–DevelopmentC–Physiologicalprocesses D–BehaviorE–Virallifecycle 80Gene population (percent) 70 60 50 40 30 20 10 0 A 8.5 0.6 0.3 BCDECategory of biological process33.1 56.1 (b)Thedistributionofbiologicalprocess A–CellcommunicationB–CelldeathC–Cellgrowthormaintenance D–CellmotilityE–Celldifferentiation 60Gene population (percent) 50 40 30 20 12.0 10 0 A 3.4 BCDCategory of cellular process 0.0E 41.9 41.9 (a)Thegeneontologytreeforgenesincluster17 (c)Thedistributionofcellularprocess Figure9:Thegeneontologytreeandthedistributionoffunctionsforgenesincluster17. ofcoherentgeneclusterswhenmingvariesfrom5to100,mins=3andδ=0.8.Clearly,thenumberofcoherentgeneclustersinthedatasetdecreaseswhenmingincreases.Theresultconcurstheintuition:withalowermingvalue,wecancatchmoreclusterswithmoreorlessgenes.Asamatteroffact,withfixedminsandδ,letBibethecompletesetofcoherentblockswhenming=i.Then,wecanshowB1⊇...Bi⊇...Bn. Figure10(b)showsthenumberofcoherentgeneclusterswithrespecttovariousminswhenmingandδarefixedto10and0.8,respectively.Thisresultcanbeexplainedinawaysimilartothesituationofming.Figure10(c)showstheeffectofδonthenumberofcoherentgeneclustersinthedataset,withming=10andmins=3.Whenwelowerthecoherencethreshold,morecombinationsofsamplesare“coherent”bychancewithrespecttoaminimumofminggenes. Interestingly,thethreecurvesinFigure10sharesimilartrends.Thatis,whenthevalueoftheparameter(repre-sentedbytheXaxis)increases,thenumberofcoherentgeneclusters(representedbytheYaxis)goesdown.Thecurvedropssharplyuntila“knot”ismet,thenthecurvegoesstablytotheright.Forexample,wecanseethe“knots”ofming=20inFigure10(a),mins=4inFigure10(b)andδ=0.85inFigure10(c).These“knots”indicatethatthereexiststableandsignificantcoherentgeneclustersintherealdataset.Theyarehighlycorrelated,involvingastatisticallysignificantnumberofgenesandsamples.The“knots”alsosuggestthebestsettingsoftheparameterstoavoidthecoherentgeneclustersformedjustbychance. 5.4Scalability Wefirsttesttheefficiencyofthepreprocessing(algorithminFigure3)onvariousrandomsubsets(bysampling)oftherealmicroarraydataset.Thesizeofthesubsetsvariesfrom500to4324genes,andallthesamplesareincluded.Foreachsize,wesampled30subsetsandcalculatetheaverageruntime.Figure11(a)illustratesthescalabilityforthepre-processingstep.AswediscussedinSection3,therealGSTmicroarraydatasetsareoftensparse.Withtheefficientpruningtechniques,thepreprocessingalgorithmislinearlyscalabletothesizeofthedatasets. WethentestthescalabilityofbothGene-SampleSearchandSample-GeneSearchonsyntheticdatasets.Wesetmins=5,ming=10,andδ=0.8.Wefirstfixthenum-berofsamplesto30,andreporttheruntimewithrespecttonumberofgenes(Figure11(b)).Wecanseebothapproachesshowanapproximatelylinearscalabilitywithrespecttothenumberofgenes.Figure11(c)showsthescalabilityforbothapproachesunderdifferentsizesofsamplesets(from30to100),whenthenumberofgenesisfixedto3000.Wecanseebothapproachesscalewellwithrespecttothenumberofsamples.Sincethenumberofgenesforamicroarraydataistypicallybyfarlargerthanthatofthesamples,theenumer-ationofgenesismuchmoreexpensivethantheenumerationofsamples.ThisexplainswhytheSample-GeneSearchisfasterthantheGene-SampleSearch. 5.5TheEffectofLemma4.1 Lemma4.1canidentifythegenesgithatcanbemergedintothecurrentcombinationofgenes,andthuscanreduce 100 90 80 70 60 50 40 30 20 10 0 180 160 140 120 100 80 60 40 20 0 2 3 4 min_s 600 500 Number of MCBsNumber of MCBsNumber of MCBs 400 300 200 100 0 0 10 20 30 40 50 60 70 80 90 100 min_g 5 6 7 0.6 0.65 0.7 0.75 0.8 0.85 0.9 0.95 delta (a)Numberofclustersvs.ming mins=3,δ=0.8(b)Numberofclustersvs.mins ming=10,δ=0.8(c)Numberofclustersvs.δming=10,mins=3 Figure10:Theeffectsoftheparametersonthenumberofclusters. 2.2 2 1.8 1.6 1.4 1.2 1 0.8 0.6 0.4 0.2 0.5 300 250Time (Seconds) 300g-s searchs-g searchTime (Seconds) 250 200 150 100 50g-s searchs-g searchTime (Seconds) 200 150 100 50 1 1.5 2 2.5 3 3.5 4 4.5 0 3 4 5 6 7 8 9 10 0 30 40 50 60 70 80 90 100Number of genes (K)Number of genes (K)Number of samples (a)Preprocessing (b)Scalabilityw.r.t.numberofgenes(numberofsamples:30) Figure11:Scalabilityonlargedatasets. (c)Scalabilityw.r.t.numberofsamples(numberofgenes:3,000) thenumberofrecursionsinthemining.Weusesomesam-plesoftherealmicroarraydataset(eachsubsetcontains100to1000genesand12patients)tocomparetheperformanceoftheGene-SampleSearchwithandwithouttheoptimiza-tion.Thecomparisonisconductedinthreeaspects:(1)themaximalnumberofrecursionlevelsintheGene-SampleSearch;(2)thenumberofgenecombinationsintheGene-SampleSearch;and(3)theruntime.Figure12showstheresults.Wecanclearlyseethat(1)themaximalnumberofrecursionlevelscanbereducedsubstantially(Figure12(a));(2)withtheoptimization,thetotalnumberofgenecom-binationsneededtobecheckedgoesdownsharply(12(b));and(3)theruntimeismuchshorterwhentheoptimizationisapplied(12(c)).TheresultsstronglyconfirmthattheoptimizationiseffectiveforGene-SampleSearch. WealsoapplythespiritofLemma4.1ontheSample-GeneSearch,andconductsimilartests.However,wecanhardlyseeanysignificantimprovementbroughtbytheoptimiza-tion.AswediscussedinSection4.2.3,duetothesparsityofthemicroarraydata,manygenescanbemergedbecausetheyarecoherentonthesamesampleset.However,fewsampleswouldbemergedtogethersinceusuallythemaxi-malcoherentgenesetswithrespecttotwodifferentsamplesetsarenotidentical. 6.RELATEDWORK Thisresearchisrelatedtopreviousworkonclusteringconventionalgene-time/gene-samplemicroarraydataandfrequentitemsetmining. AsweintroducedinSection1,therehavebeentwocate-goriesofconventionalmicroarraydatasets:gene-timedatasetsandgene-sampledatasets.Forgene-timemicroarray data,variousalgorithms(e.g.[17,15,8])havefocusedonclusteringthegenes.Thatis,co-expressedgenesaregroupedbasedontheirexpressionpatternsduringthetimeseries.Ontheotherhand,differentapproaches(e.g.[4,21,16])havebeenproposedtopartitionthesamplesetstofindtheirmacroscopicphenotypesaswellastodetectinforma-tivegeneswhichmanifestthesamplepartition.However,alltheclustermodelsinthosepreviousstudiesaresubstantiallydifferentfromourcoherentgeneclusters.Asaconsequence,thosealgorithmscannotbeextendeddirectlytosolveourproblem. In[6],ChengandChurchintroducetheconceptofbiclus-tertomeasurethecoherencebetweengenesandconditions(eithertimeseriesorsamples).Givenasetofgenesandasetofconditions,abiclusterisasubsetofgenescoher-entwithasubsetofconditions.Yangetal.[22]proposeamove-basedalgorithmtofindbiclustersmoreefficiently.Bothalgorithmsin[6]and[22]adoptheuristicsearchap-proaches,andthuscannotguaranteetofindthecompletesetofbiclustersinthedataset. In[19],Wangetal.proposethemodelofpattern-basedcluster.GivenasubsetofobjectsOandasubsetofat-tributesA,pair(O,A)formsapattern-basedclusterifforanypairofobjectsx,y∈O,andanypairofattributesa,b∈A,thedifferenceofchangeofvaluesonattributesaandbbetweenobjectsxandyissmallerthanathresh-oldδ.Inarecentstudy[11],Peietal.hasproposedanefficientalgorithm,MaPletominethecompletesetofmax-imalpattern-basedclusters. Weborrowsomeimportantideasfrompreviousstudiesonfrequentitemsetmining.First,theframeworkofourapproachesissimilarinspirittothatofpattern-growth 400 350 300 250Level 2500Opt OffOpt OnNumber of nodes 140Opt OffOpt OnTime (Seconds) 2000 1500 1000 500 120 100 80 60 40 20 0Opt OffOpt On 200 150 100 50 0 200 300 400 500 600 700 800 900 1000Number of genes 0 200 300 400 500 600 700 800 900 1000Number of genes 200 300 400 500 600 700 800 900 1000 Number of genes (a)Maximalnumberof recursionlevels(b)Numberofgenecombinationssearched (c)Runtime Figure12:EffectofLemma4.1forGene-SampleSearch. methodsforfrequentpatternmining.Second,thepruningtechniquesinourapproachessharesomeinterestingsimi-laritieswiththemethodsofminingfrequentcloseditem-sets(e.g.,[10]).However,therearetwoessentialdifferencesbetweenthefrequentpatternminingmethodsandtheap-proachesdevelopedinthispaper.Ontheonehand,thecoherentgeneclustersareinherentlydifferentfromfrequentitemsets.Thus,thesimilaritybetweenthetwocategoriesofmethodsisonlyatthelevelofspirit(e.g.,setenumerationandpruning).Thetechnicaldetailsaredramaticallydiffer-ent.Ontheotherhand,newtechniquessuchastheinvertedlistsareadoptedtotackletheparticularmicroarraydata. 7.CONCLUSIONS Inthispaper,wehaveinvestigatedanoveltypeofgene-sample-timemicroarraydatasetsandproposeanewprob-lemofminingcoherentgeneclustersfromsuchdatasets.Wehaveconductedasystematicstudytodeveloptwomin-ingmethods:theSample-GeneSearchandtheGene-SampleSearch.Ourextensiveperformancestudyonbotharealmi-croarraydatasetandsyntheticdatasetsshowsthatthereexistinterestingandsignificantcoherentgeneclustersintherealdataset,andboththealgorithmshavegoodperfor-mance.Sincethenumberofgenesinthemicroarraydataistypicallybyfarlargerthanthenumberofsamples,theSample-GeneSearchusuallyoutperformstheGene-SampleSearch. 8.REFERENCES [1]AlizadehA.A.,etal.Distincttypesofdiffuselargeb-cell lymphomaidentifiedbygeneexpressionprofiling.Nature,Vol.403:503–511,February2000. [2]AlonU.,etal.Broadpatternsofgeneexpressionrevealed byclusteringanalysisoftumorandnormalcolontissuesprobedbyoligonucleotidearray.Proc.Natl.Acad.Sci.USA,Vol.96(12):6745–6750,June1999. [3]BlackM.A.andDoergeR.W.Calculationoftheminimum numberofreplicatespotsrequiredfordetectionofsignificantgeneexpressionfoldchangeinmicroarrayexperimentsBioinformatics,18:1609-1616,2002 [4]Ben-DorA.,etal.Classdiscoveryingeneexpressiondata. InProc.FifthAnnualInter.Conf.onComputationalMolecularBiology(RECOMB2001). [5]KerrM.K.andChurchillG.A.Bootstrappingcluster analysis:Assessingthereliabilityofconclusionsfrom microarrayexperiments.Proc.Natl.Acad.Sci.USA,Vol.98:8961-8965. [6]ChengY.andChurchG.M.Biclusteringofexpressiondata. ProceedingsoftheEighthInternationalConferenceonIntelligentSystemsforMolecularBiology(ISMB),2000. [7]GolubT.R.,etal.Molecularclassificationofcancer:Class discoveryandclasspredictionbygeneexpression monitoring.Science,Vol.286(15):531–537,October1999.[8]JiangD.,etal.InteractiveExplorationofCoherent PatternsinTime-SeriesGeneExpressionData.InKDD’03.[9]MolerE.J.,etal.AnalysisofMolecularProfileDataUsing GenerativeandDiscriminativeMethods.PhysiologicalGenomics,Vol.4(2):109–126,2000. [10]PeiJ.,etal.CLOSET:Anefficientalgorithmformining frequentcloseditemsets.InDMKD’00. [11]PeiJ.,etal.MaPle:AFastAlgorithmforMaximal Pattern-basedClusterin.InIEEEICDM’03.[12]HerwigR.,etal.Large-ScaleClusteringof cDNA-FingerprintingData.GenomeResearch,9:1093–1105,1999. [13]RymonR.Searchthroughsystematicsetenumeration.In Proc.1992Int.Conf.PrincipleofKnowledgeRepresentationandReasoning(KR’92). [14]DudoitS.,etal.Comparisonofdiscriminationmethodsfor theclassificationoftumorsusinggeneexpressiondata.J.oftheAmericanStatisticalAssociation,Vol.97,No.457.[15]TamayoP.,etal.Interpretingpatternsofgeneexpression withself-organizingmaps:Methodsandapplicationtohematopoieticdifferentiation.Proc.Natl.Acad.Sci.USA,Vol.96(6):2907–2912,March1999. [16]TangC.,etal.Miningphenotypesandinformativegenes fromgeneexpressiondata.InSIGKDD’03. [17]TavazoieS.,etal.Systematicdeterminationofgenetic networkarchitecture.NatureGenet,pages281–285,1999.[18]TusherV.G.,etal.SignificanceAnalysisofMicroarrays AppliedtotheIonizingRadiationResponse.Proc.Natl.Acad.Sci.USA,Vol.98(9):5116–5121,April2001. [19]WangH.,etal.ClusteringbyPatternSimilarityinLarge DataSets.InSIGMOD2002. [20]Weinstock-GuttmanB.,etal..Genomiceffectsof interferon-binmultiplesclerosispatients.AcceptedpendingrevisionsJournalofImmunology,2003. [21]XingE.P.andKarpR.M.Cliff:Clusteringof high-dimensionalmicroarraydataviaiterativefeaturefilteringusingnormalizedcuts.Bioinformatics,Vol.17(1):306–315,2001. [22]YangJ.,etal.δ-cluster:CapturingSubspaceCorrelationin aLargeDataSet.InICDE2002. 因篇幅问题不能全部显示,请点此查看更多更全内容