Mining coherent gene clusters from gene-sample-time microarray data

2020-12-17 来源：步旅网

MiningCoherentGeneClustersfromGene-Sample-Time

MicroarrayData

∗

DaxinJiang†

†

JianPei†‡MuraliRamanathan†ChunTang†AidongZhang†

‡

StateUniversityofNewYorkatBuﬀalo,USASimonFraserUniversity,CanadaEmail:{djiang3,jianpei}@cse.buﬀalo.edu,murali@acsu.buﬀalo.edu,{chuntang,azhang}@cse.buﬀalo.edu

ABSTRACT

Extensivestudieshaveshownthatminingmicroarraydatasetsisimportantinbioinformaticsresearchandbiomedicalapplications.Inthispaper,weexploreanoveltypeofgene-sample-timemicroarraydatasets,whichrecordstheexpres-sionlevelsofvariousgenesunderasetofsamplesduringaseriesoftimepoints.Inparticular,weproposetheminingofcoherentgeneclustersfromsuchdatasets.Eachclus-tercontainsasubsetofgenesandasubsetofsamplessuchthatthegenesarecoherentonthesamplesalongthetimeseries.Thecoherentgeneclustersmayidentifythesamplescorrespondingtosomephenotypes(e.g.,diseases),andsug-gestthecandidategenescorrelatedtothephenotypes.Wepresenttwoeﬃcientalgorithms,namelytheSample-GeneSearchandtheGene-SampleSearch,tominethecompletesetofcoherentgeneclusters.Weempiricallyevaluatetheperformanceofourapproachesonbotharealmicroarraydatasetandsyntheticdatasets.Thetestresultshaveshownthatourapproachesarebotheﬃcientandeﬀectivetoﬁndmeaningfulcoherentgeneclusters.

CategoriesandSubjectDescriptors

H.2.8[DatabaseManagement]:DatabaseApplications—DataMining

GeneralTerms

Algorithms,Experimentation

Keywords

Bioinformatics,clustering,microarraydata

∗0234895,Thisresearchispartlysupportedby01A1.IIS-0308001andNIHgrant1NSFP20grantsDBI-dationsAllnecessarilyinthisopinions,ﬁndings,conclusionsandGM067650-recommen-reﬂectpapertheareviewsthoseoftheoffundingtheauthorsagencies.

anddonotPermissiontomakedigitalorhardcopiesofallorpartofthisworkfor

personalorclassroomuseisgrantedwithoutfeeprovidedthatcopiesarenotmadeordistributedforproﬁtorcommercialadvantageandthatcopiesbearthisnoticeandthefullcitationontheﬁrstpage.Tocopyotherwise,torepublish,topostonserversortoredistributetolists,requirespriorspeciﬁcpermissionand/orafee.

1.INTRODUCTION

Therecentmicroarraytechnologycanmeasuretheex-pressionlevelsofthousandsofgenessimultaneously.Itisanimportantresearchprobleminbioinformaticsandclini-calresearchtoexplorethepatternsinmicroarraydatasets.Forexample,indrugdevelopment,geneexpressionpatternsmayreﬂectgene-levelresponsestodiﬀerentdrugtreatmentsandprovidedeepinsightsintothenatureofthediseases.Mostmicroarraydatasetscanbedividedintotwocate-gories.Ontheonehand,thegene-timedatasetsrecordtheexpressionlevelsofvariousgenesduringimportantbiologi-calprocessesoveraseriesoftimepoints.Thegene-sampledatasetsaccounttheexpressionlevelsofvariousgenesacrossrelatedsamples.Bothgene-timedatasetsandgene-sampledatasetscanberepresentedbyan×lgeneexpressionmatrixofngenesandlsamples/timepoints.Inthegeneexpressionmatrix,therowsarethegenesandthecolumnsareeithersamples(ingene-sampledatasets)ororderedtimepoints(ingene-timedatasets),whileeachcellrepresentstheex-pressionlevelofacertaingeneonacertainsampleoratacertaintimepoint.

Withthelatestadvancesinthemicroarraytechnology,theexpressionlevelsofasetofgenesunderasetofsam-plescanbemonitoredsynchronicallyduringaseriesoftimepoints[20].Diﬀerentfromthepreviousgene-timeorgene-samplemicroarraydatasets,thesenewdatasetshavethreetypesofvariables:genes,samplesandtime.Wecallsuchdatagene-sample-timemicroarraydata,orGSTdataforshort.Figure1(a)elaboratesthestructureofaGSTmi-croarraydataset.

ssamplesj1sj2sj3emitsamplesgenesgi1genesm1111mm1121lgi2m121m1n1m1nlgi3(a)

(b)

Figure1:ThestructureofGSTmicroarraydataIngeneral,eachcellmki,jinaGSTdatasetrepresentstheexpressionlevelofgenegiundersamplesjattimepointtk.Interestingly,aGSTdatasetcanalsobeviewedasan×l

matrix,suchthateachcellmi,jcontainsthetimeserieswithrespecttogenegiundersamplesj,asshowninFigure1(b).Thepreviousstudiesongene-samplemicroarraydata(e.g.,[7,2,15,1])indicatethathighcorrelationsmayexistbe-tweenthegeneexpressionpatternsandsomediseases.ItisnaturaltoextendthesimilaranalysistoGSTmicroarraydata.Thatis,itisinterestingtoidentifyasubsetofgenesGandasubsetofsamplesSinaGSTmicroarraydatasetsuchthateachgeneg∈GhascoherentpatternsacrossthesamplesinSduringthetimeseries.Forexample,inFig-ure1,genegi1,gi2andgi3showcoherentpatternsacrosssamplessj1,sj2andsj3,respectively.Wecallsuchsubsetsofgenesandsamplesacoherentgenecluster.

Whatisthebiologicalmeaningofthecoherentgeneclus-ters?Thecoherentgeneclustersprovidevaluablehypothesisforbiologists.Thesamplesinaclustermaycorrespondtoaphenotype,suchasthepatientshavingadisease,whilethecorrespondingsetofgenesmaysuggestthecandidategenescorrelatedtothedisease.

Thefunctionsofgenesinanorganismarehighlycom-plicated.Therearetypicallymultiplecoherentclustersinadataset.Diﬀerentclustersmaycorrelatetodiﬀerentpheno-types,suchasageandgender.Therefore,toavoidmissinganyvaluablehypothesis,itisnecessarytominealltheco-herentclustersinthedataset.

Manypreviousstudiesinvestigatetheminingofinterest-ingpatternsfrommicroarraymatrices.Forexample,vari-ousclusteringalgorithmscanidentifytheco-expressedgenesshowingcoherentpatternsduringthetime-series(e.g.,[17,12,15,8]).Moreover,bothsupervisedorunsupervisedap-proachesareproposedtopartitionthesamplesintohomo-geneousgroups(e.g.,[4,14,9,16]).Additionally,statisticalapproacheshavebeenproposedtovalidatethesigniﬁcanceoftheminingresults(e.g.,[3,18,5]).However,allprevi-ousstudiestargetatconventionalgene-timeorgene-samplemicroarraydatasets.Themodelsofclustersinthosepre-viousstudiesarediﬀerentfromourcoherentgeneclusterswhichdisclosethecorrelationamonggenes,samplesandtimepoints.Therefore,thosealgorithmscannotbeextendeddirectlytosolveourproblem.

Recently,thepattern-basedclusteringapproaches(e.g.,[19])havebeendevelopedtodiscoversubsetsofobjectsfollowingsimilarpatternsonsubsetsofattributes.Conceptually,apattern-basedclusterisacoherentgenecluster.IfwetreattheGSTmicroarraydatasetsasan×lmatrixoftimese-ries,asshowninFigure1(b),thenpattern-basedclustersandcoherentgeneclustersmayhavesomesimilarityattheﬁrstlook.However,insomepattern-basedapproaches,aclusterrequiresthateachpairofobjectsintheclustermustbecoherentoneachpairofattributes.Sucharequirementisoftentoostronginpractice.Ourcoherentgeneclusteringrelaxestheconstraintsamongtheobjects.Therefore,eachtraditionalpattern-basedclusterisacoherentgenecluster,butnotnecessarilytrueviceversa.

Inthispaper,wetackletheproblemofminingcoherentpatternsfromgene-sample-timemicroarraydatasetsandmakethefollowingcontributions.

First,weproposeamodelofcoherentgeneclustersinGSTmicroarraydatasets.Wejustifythatthemodelismeaningfulforbiomedicalresearch.

Second,weidentifythecomputationalchallengesandcon-ductasystematicresearchonminingcoherentgeneclus-tersfromGSTmicroarraydatasets.Wedeveloptwoap-

proaches,namelytheGene-SampleSearchandtheSample-GeneSearch,tominethecompletesetofcoherentgeneclus-ters.Weillustrateandcomparetheeﬃciencyandscalabilityofbothapproaches.

Last,weconductanextensiveempiricalevaluationonbothrealdatasetsandsyntheticdatasets.Ourresultsshowthatourproposedmethodscanﬁndcoherentgeneclustersinterestingtobiomedicalresearchfromrealdatasets.Theresultsonsyntheticdatasetsalsoshowthatouralgorithmsarebotheﬃcientandscalable.

Therestofthepaperisorganizedasfollows.Section2deﬁnestheproblem.Section3describesthepreprocessingstepofcomputingthemaximalcoherentsamplessetsforeachindividualgene.Section4presentstwoalgorithmstominecoherentgeneclusters.InSection5,ourmethodsareevaluatedusingrealandsyntheticdatasets.RelatedworkisdiscussedinSection6.Section7concludesthepaper.

2.PROBLEMDESCRIPTION

GivenasetofngenesG-Set={g1,...,gn}andasetoflsamplesS-Set={s1,...,sl},wecanmeasuretheexpressionlevelsofthegenesonthesamples.Theresultsformacon-ventionaln×lmicroarraymatrixM={mi,j},wheremi,jistheexpressionlevelofgenegi(1≤i≤n)onsamplesj(1≤j≤l).Ifsuchmicroarrayexperimentsareconductedsynchronicallyonallgenesandallsamplesattimeinstantst1,...,tT,theresultsforman×l×TGSTmicroarraymatrixM={mti,j},where(1≤t≤T).

AGSTmicroarraymatrixM={mti,jasan×lmatrixM󰀂={m󰀂

}canalsobeviewed

i,j}suchthatm󰀂i,jisavector

of󰀒m1i,j,...,mT

i,j󰀓.Hereafter,wedonotstrictlydistinguishthetwonotations.Instead,whenevermi,jiswritten,thevectorisreferredto.

Inthispaper,weareinterestedinﬁndingthosegenesthatarecoherentonasubsetofsamplesduringthewholetimeseries.Therearevariousmethodstomeasurethecorrela-tionbetweentwotimeseries.However,forgeneexpressiondata,usersareofteninterestedintheoveralltrendsoftheexpressionlevelsinsteadoftheabsolutemagnitudes.There-fore,wechoosethePearson’scorrelationcoeﬃcientasthecoherencemeasure,sinceitisrobusttoshiftingandscalingpatterns[22].Speciﬁcally,giventwovectorsmi,j1andmi,j2ofgenegi,thecoherenceρ(mi,j1,mi,j2)isdeﬁnedas

󰀂T󰀁mti,jt

t=1(1−mi,j1)(mi,j2−mi,j2)󰀂T󰀁t=1(mt󰀂T,i,j1)2t=1(mti,j21−mi,j2−mi,j2)m󰀂mtwherei,j=Ti,j

istheofgenegonsamplet=1Tmeanoftheexpressionlevelsisj.Thecorrelationcoeﬃcientrangesbetween−1and1.Thelargerthevalue,themorecoherentarethetwovectors.

AgenegiiscoherentacrossasubsetofsamplesS⊆S-Set,ifgivenanypairofsamplessj1,sj2∈S,ρ(mi,j1,mi,j2)≥δ,whereδisaminimumcoherencethresholdspeciﬁedbytheuser.ForasubsetofgenesG⊆G-SetandasubsetofsamplesS⊆S-Set,ifeverygenegi∈GiscoherentacrosssamplesinS,wecallgenesetGcoherentonsamplesetS.(G×S)iscalledacoherentgenecluster.Acoherentgeneclusterhavingugenesandvsamplesissaida(u,v)-coherentgenecluster.

Trivially,foranygenegiandanysamplesj,({gi}×{sj})isa(1,1)-coherentgenecluster,and(G-Set×{sj})and({gi}×

S-Set)aretrivial(|G-Set|,1)-and(1,|S-Set|)-coherentgeneclusters,respectively.Toavoidthistriviality,werequirethatacoherentgeneclustershouldconsistofatleasttwogenesandtwosamples.

Givenacoherentgenecluster(G×S),foranysubsetsG󰀂

⊆GandS󰀂⊆S,(G󰀂×S󰀂)isalsoacoherentgenecluster.Toavoidsuchredundancy,acoherentgenecluster(G×S)ismaximal󰀂ifthereexistsnoanyothercoherentgeneclus-ter(G󰀂×S)suchthatG⊆G󰀂,S⊆S󰀂.Moreover,ausermaynotbeinterestedinverysmallclusters,whichareoftenformedbychance.Thus,ausercanspecifytheminimumnumbersofgenesandsamplesinacluster.Generally,givenmingandminsasuserdeﬁnedminimumgenesizeandsam-plesizethresholds,acluster(G×S)iscalledsigniﬁcantif|G|≥mingand|S|≥mins.

ProblemdeﬁnitionGivenaGSTmicroarraymatrixM,aminimumcoherencethresholdδ,aminimumgenesizethresholdmingandaminimumsamplesizethresholdmins,theproblemofminingcoherentgeneclustersistoﬁndthecompletesetofsigniﬁcantmaximalcoherentgeneclustersinMwithrespecttotheparameters.Hereafter,asigniﬁ-cantmaximalcoherentgeneclusteriscalledacoherentgeneclusterforshort.

3.MAXIMALCOHERENTSAMPLESETS

Weproposetwoalgorithmscomputingmaximalcoherentgeneclusters.Inbothalgorithms,tocomputecoherentgeneclusters,weneedtocheckwhetherasubsetofgenesarecoherentonasubsetofsamples.Tofacilitatethetests,foreachgenegk,wecomputethesetsofsamplesSsuchthat(1)|S|≥mins;(2)gkiscoherentonS;and(3)thereexistsnosupersetS󰀂⊃SsuchthatgkisalsocoherentonS󰀂.Siscalledamaximalcoherentsamplesetofgk.Pleasenotethat,ingeneral,agenemayhavemorethanonemaximalcoherentsampleset.

Foragenegk,allofitsmaximalcoherentsamplesetscanbecomputedeﬃcientlyusingthefollowing2-stepprocess.Intheﬁrststep,wetestwhethergenegkiscoherentoneachpairofsamples(si,sj).Abinarytrianglematrix{ci,j}ispopulated,where1≤iOncethematrix{ci,j}ispopulated,theproblemofﬁnd-inggk’smaximalcoherentsamplesetscanbereducedtotheproblemofﬁndingallmaximalcliquesofsizeatleastminsingraphGk=(S-Set,E),where(si,sj)isanedgeinthegraphifandonlyifci,j=1.AcliqueSiscalledmaximalifthereexistsnoanyothercliqueS󰀂suchthatS⊂S󰀂.Pleasenotethattheremayexistmorethanonemaximalcliqueinagraph.

Unliketheconventionalcliqueproblemwherethecliqueofthemaximalsizeisfound,here,weneedtoﬁndthecompletesetofmaximalcliquesinthegraph.ItiswellknownthattheconventionalcliqueproblemisNP-complete.Soistheproblemofﬁndingthecompletesetofmaximalcliques.Fortunately,therealGSTmicroarraydatasetsareoftensparseandthenumberofsamplesistypicallybelowonehun-dred.Foreachgene,thenumberofmaximalcliquesisquitesmallandthesamplescanoftenbepartitionedintoexclu-sivesmallsubsets.Ourexperimentalresultsshowthat,witheﬃcientsearchandpruningtechniques,itisstillpracticaltoﬁndthecompletesetofmaximalcliques.Inthefollowing,

{}

{a}{b}{c}{d}{a,b}{a,c}{a,d}{b,c}{b,d}{c,d}{a,b,c}{a,b,d}{a,c,d}{b,c,d}{a,b,c,d}Figure2:Enumerationofcombinationsofsamples.wewillshowhowtoﬁndthemaximalcliquesofasamplesetbyadepth-ﬁrstsearchinasamplesetenumerationtree.GivenasetofsamplesS={s1,...,sl},theset2S(i.e.,allcombinationsofsamples)canbeenumeratedsystematically.Forexample,considerasetofsamplesS={a,b,c,d}.Thecompletesetofnon-emptycombinationsofsamplescanbedividedinto4exclusivesubsets:(1)theoneshavingsamplea;(2)theoneshavingsamplebbutnoa;(3)theoneshavingsamplecbutnoaorb;and(4){d}.TheyareshownastheimmediatechildrenoftherootinFigure2.

Thesesubsetscanbefurtherpartitioned.Forexample,theﬁrstsubsetcanbefurtherdividedintothreeexclusivesub-subsets:(1)theoneshavingsamplesaandb;(2)theoneshavingsamplesaandcbutnob;and(3){a,d}.

ThetreeshowninFigure2iscalledasetenumerationtree[13]withrespectto{a,b,c,d}.Itprovidesaconceptualtooltoenumeratethecompletesetofcombinationssystem-atically.

Wecanconductarecursive,depth-ﬁrstsearchofthesam-plesetenumerationtreetodetectthemaximalcliquesofthesamples.|GivenasetofsamplesS,thesetenumerationtreehas2|Snodes.However,weneverneedtomaterializesuchatree.Instead,weonlyneedtokeepapathfromtherootofthetreetothenodewearesearchingasaworkingset,whichcontainsatmost(|S|+1)nodes.

Clearly,inthesetenumerationtree,eachnodecontainsauniquesubsetofsamples.Thuswecanusethesubsetofsamplestorefertothenode.Atnode{si1,...,sik}(1≤i1<···Lemma3.1.Atnodev={si1,...,sik}ofthesamplesetenumerationtree,where(1≤i1<···Wecanpruneanysubtreethatcannotleadtoacoherentsamplesetofatleastminssamples.

PruningRule3.1(Pruningsmallsamplesets).Atanodev={si1,...,sik},thesubtreeofvcanbeprunedif(k+|Tail|)Forexample,forasetoflsamples,eventhecompletesetofsamplecombinationscanbedividedintolexclusivesubsetsasshownbefore,weonlyneedtosearchtheﬁrst(l−mins+1)subsets,sinceeachofthelast(mins−1)subsetscontainslessthanminssamples.

Moreover,ifthesamplesatthecurrentnodeanditsTailaresubsumedbysomemaximalcoherentsamplesetfound

Input:theGSTdataset,thecoherencethresholdδ;

minimumsamplesizethresholdmins;

Output:themaximalcoherentsamplesetsforeachgene;Method:

foreachgeneggeneratematrixkdo

ci,jforg//depth-ﬁrstsearch

k;fori=1to(l−mins+1)do

callsearch-clique({si},{si+1,...,sl});end-forend-forProcedure:search-clique(head,tail)

//headrecordsthesamplesinthecurrentnode

supposesiisthelastsampleinhead,removesamples

sjfromtailsuchthatci,j=0;//Lemma3.1if(|head∪tail|or(head∪tail⊂S)s.t.Sisamaximalclique

//Pruning3.2

thenreturn;//aborttherecursivesearchiftail=∅thenoutputamaximalclique;else

letj=min{k|sk∈tail}andtail=tail−{sj};callsearch-clique(head∪{sj},tail);untiltail=∅;return;

Figure3:Computingmaximalcoherentsamplesets.sofar,thentherecursivesearchcanalsobepruned,sinceitcannotleadtoanynewmaximalcoherentsampleset.PruningRule3.2(Pruningsubsumedsets).Atanodev={si1,...,sik},if{si1,...,sik}∪Tailisasubsetofsomemaximalcoherentsampleset,thenthesubtreeofthenodecanbepruned.

Basedontheabovelemmaandpruningrules,theprepro-cessingalgorithmispresentedinFigure3.Forthereadersfamiliarwiththetechniquesofdepth-ﬁrstminingofmaxi-mal/closedfrequentpatterns,theideasofpruningheresharethesimilarspiritwiththepruninginfrequentcloseditemsetmining(e.g.,[10]).However,onekeydiﬀerenceisthatthefrequentpatternminingconductscountingondatabases,andwedonotneedtoscanthedatabaseforanycountingoncethetrianglematrix{ci,j}ismaterialized.Thecorrect-nessofthepreprocessingalgorithmcanbeshown.Limitedbyspace,weomitthedetailshere.

4.THEMININGALGORITHMS

Ana¨ıvemethodtoﬁndthemaximalcoherentgeneclustersistotesteverypossiblecombinationofgenesandsamplesthoroughly.Afterallthecoherentgeneclustersarefound,wecanidentifyandreportthemaximalones.Thena¨ıvemethodisverycostlyandthusinfeasibleforrealdatasets.Forexample,supposewehave1,000genesand201000samples.Thena¨ıvemethodmayhavetosearchupto(2−1)×(220−1−20)≈1.12×10307combinations!

Howcanwesearchthehugespaceeﬃcientlyandpruneunpromisingsubspacessharply?Whencomputingthemaxi-malcoherentsamplesets(Figure3),wesystematicallyenu-meratecombinationsofsamplesinarecursivedepth-ﬁrstsearch,anddeveloptechniquestopruneunpromisingsub-spacesaggressively.Stimulatedbythesimilarspirit,here

TheSample-GeneSearch

depth-ﬁrstenumeratesubsetsofsamplesforeachsubsetofsamplesSdo

ﬁndthemaximalsubsetsofgenesG

s.t.G×Sisacoherentgenecluster;

testwhether(G×S)isamaximalcoherentgene

cluster;

end-forTheGene-SampleSearch

depth-ﬁrstenumeratesubsetsofgenesforeachsubsetofgenesGdo

ﬁndthemaximalsubsetsofsamplesS

s.t.G×Sisacoherentgenecluster;

testwhether(G×S)isamaximalcoherentgene

cluster;

end-for

Figure4:TheframeworksoftheSample-GeneSearchandtheGene-SampleSearch.

wecanalsosystematicallyenumeratethecombinationsofgenesandsamples,andpruneunfruitfulcombinations.

Basically,wehavetwoalternatives.Ontheonehand,wecanenumerateallcombinationsofsamplessystematically.Foreachsubsetofsamples,wecanﬁndthemaximalsubsetsofgenesthatformcoherentgeneclustersonthesamples,andcheckwhethertheclustersaremaximal.ThismethodiscalledtheSample-GeneSearch.Ontheotherhand,wecanletthegeneenumerationgoﬁrst.Foreachsubsetofgenes,weﬁndthemaximalsubsetsofsamplesthatformcoherentgeneclusterswiththegenes,andcheckwhethertheclustersaremaximal.ThemethodiscalledtheGene-SampleSearch.

TheframeworksoftheSample-GeneSearchandtheGene-SampleSearchareshowninFigure4.Properpruningtech-niquesshouldbedevelopedtopruneunpromisingcombina-tionsandsearchbranchesasearlyaspossible.

AﬁrstlookatFigure4maysuggestthatthetwomethodsaresymmetric.However,sincegenesandsamplesarenotsymmetricintheproblem,thetechnicaldetailsareinfactsubstantiallydiﬀerent.Weareminingcoherentgeneclustersonsamples.Aslongasthegenesrespondcoherentlyonthesamesubsetofsamples,theybelongtothesamecluster.However,theexpressionpatternsofdiﬀerentgenesinthesameclusterononesamplecanbeverydiﬀerent!

4.1Sample-GeneSearch

IntheSample-GeneSearch,weneedtoaddressthefol-lowingissues.First,asweenumeratethecombinationsofsamplessystematically,foreachsubsetofsamples,howcanweﬁndthemaximalsetsofgenessuchthatthegenesarecoherentonthesamples?Second,duringthesamplesetenumeration,whichsamplesetscanbepruned?Third,sim-ilartothesituationinPruningrule3.2,canweidentifyandprunethesearchesthatcannotleadtoanypotentialmaximalcoherentclusters?.Last,howcanwedeterminewhetheracoherentgeneclusterissubsumedbytheothers?Weanswertheabovequestionsinthissubsection.

4.1.1MaximalCoherentGeneSetsforSampleSets

ForeachcombinationofsamplesS,weneedtocomputethemaximalcoherentgenesetGSsuchthatthegenesinGS

GeneMaximalcoherentsamplesetsg1{s1,s2,s3,s4,s5}g2{s1,s2,s4},{s1,s5}g3{s1,s2,s3,s4,s5}g4{s1,s2,s3},{s5,s6}g5{s1,s5,s6}(a)Themaximalcoherentsamplesetsforgenes

SampleTheinvertedlists1{g1.b1,g2.b1,g2.b2,g3.b1,g4.b1,g5.b1}s2{g1.b1,g2.b1,g3.b1,g4.b1}s3{g1.b1,g3.b1,g4.b1}s4{g1.b1,g2.b1,g3.b1}s5{g1.b1,g2.b2,g3.b1,g4.b2,g5.b1}s6

{g4.b2,g5.b1}(b)Theinvertedlistsforsamples

Figure5:Themaximalcoherentsamplesetsandtheinvertedlists.

arecoherentonSandnopropersupersetG󰀂⊃GSalsohasthisproperty.

Clearly,forageneg,ifthereexistsamaximalcoherentsamplesetSgsuchthatS⊆Sg,theng∈GS.Inotherwords,GScanbederivedbyonescanofthemaximalco-herentsamplesetsofallgenes.Ifamaximalcoherentsam-plesetisasupersetofS,thenthecorrespondinggenegisinsertedintoGS.

Itisexpensivetoscanthecompletelistofmaximalco-herentsamplesetsofallgenesonceforeverycombinationofsamples.Aneﬃcientsolutionistouseaninvertedlist.Supposewehave5genesand6samples.ThemaximalcoherentsamplesetsforeachgenearelistedinFigure5(a).Welabeleachmaximalcoherentsamplesetbythegenegk,andtheset-id,bj,inthegene.Forexample,geneg2hastwomaximalcoherentsamplesets,g2.b1={s1,s2,s4}andg2.b2={s1,s5}.

Foreachsamples,wemakeuptheinvertedlistLsasthelistofallmaximalcoherentsamplesetscontainings,asshowninFigure5(b).

Now,whenwewanttocomputethemaximalcoherentgenesetsforasubsetofsamples,say{s1,s2,s3},wedonotneedtosearchthecompletelistinFigure5(a).Instead,weonlyneedtogettheintersectionoftheinvertedlistsofthesampless1,s2ands3,whichis{g1.b1,g3.b1,g4.b1}.Bythisintersection,weknowthat{g1,g3,g4}isthemaximalcoherentgeneset.

4.1.2PruningIrrelevantSamples

ForacombinationofsamplesS={si1,...,sik},wherei1<···Moreover,similartoPruningrule3.1,if|S|+|Stail|4.1.3PruningUnpromisingCoherentGeneClusters

SimilartothesituationinPruningrule3.2,wecanprunetheunpromisingcombinationsthatcannotleadtoanynewmaximalcoherentgenecluster.Here,twopruningtech-niquescanbeapplied.

Forexample,inourrunningexample(Figure5),sup-poseweﬁndthemaximalcoherentgenecluster({g1,g3}×{s1,s2,s3,s5})beforewesearchsamplesetS={s1,s3}.ForS={s1,s3},Stail={s5}andGS∪Stail={g1,g3}.Thatis,bothS∪StailandGS∪Stailaresubsumedbythemaxi-malcoherentgenecluster.TherecursivesearchofScannotleadtoanymaximalcoherentgeneclusterandthuscanbepruned.

Ingeneral,supposewearesearchingasampleS󰀂

combination.Ifthereexistsamaximalcoherentgenecluster(G×S)

foundbeforesuchthatS󰀂∪S󰀂tail⊆SandGS󰀁∪S󰀁tail⊆G,

thenanyrecursivesearchfromS󰀂

resultsinacoherentgeneclustersubsumedby(G×S),andthuscanbepruned.

Moreover,ifthereexistsamaximalcoherentgenecluster(G×S)foundbeforesuchthatS󰀂⊆SandeverymaximalcoherentsamplesetcontainingS󰀂alsocontainsS,thentherecursivesearchofS󰀂cannotleadtoanymaximalcoherentgeneclustereither,andthuscanbe󰀂pruned.Forexample,supposewesearchthesamplesetS={s2}afterweﬁndthemaximalcoherentgenecluster({g1,g2,g3,g4}×{s1,s2})inourrunningexample(Figure5).FromFigure5(a),wecanseethateverymaximalcoherentsamplesetcontainings2alsocontainss1.Inotherwords,thereexistsnomaximalcoherentgeneclustercontainings2butnos1.Thus,thesearchofS󰀂canbepruned.

4.1.4Determiningmaximalcoherentgeneclusters

WhenwesearchacombinationofsamplesS,weneedtocheckwhetherGS×Sisamaximalcoherentgenecluster.Welookatthosemaximalcoherentgeneclusters(G󰀂×S󰀂)suchthatS󰀂⊃S.Clearly,sinceweconductdepth-ﬁrstsearchinthesetenumerationtree,suchmaximalcoherentgeneclustersshouldbereportedeitherbeforeSissearched,orinthesubtreerootedatS.

Basedontheabovediscussion,wehavetheSample-GeneSearchalgorithminFigure6.

4.2Gene-SampleSearch

TheframeworkofGene-SampleSearchissymmetrictothatofSample-GeneSearch.Thatis,weenumeratethecombinationsofgenessystematically.Foreachcombinationofgenes,wecomputethemaximalsetsofsamplesthatthegenesarecoherenton.ManypruningtechniquesinSample-GeneSearchhavethesymmetricversionsinGene-SampleSearch.Limitedbyspace,weomitthedetailshere.Instead,weonlyfocusonthediﬀerencesbetweenthetwoapproaches.

4.2.1DeterminingCoherentGeneClusters

Theconceptofcoherentsamplesetsforagenecanbegeneralizedforasetofgenes.GivenasetofgenesG,amaximalcoherentsamplesetwithrespecttoGisasetofsamplesSGsuchthat(1)genesinGarecoherentonSG;and(2)thereexistsnoS󰀂⊃SGthatsamplesinGarealsocoherentonS󰀂.Pleasenotethattherecanbemorethanonemaximalcoherentsamplesetforagivensetofgenes.Howcanwecomputethemaximalcoherentsamplesetseﬃciently?Interestingly,SGcanbecomputedbysomesim-pleintersectionoperations.Forexample,supposeS{g1}=

Input:themaximalcoherentsamplessetsforgenes,

//fromthealgorithminFigure3

Output:thecompletesetofcoherentgeneclustersMethod:

generatetheinvertedlistforsamples

asdescribedinSection4.1.1;fori=1to(|S-Set|−mins)do

letS={si}andStail={si+1,...,s|S-Set|};callrecursive-search(S,Stail);end-forProcedure:recursive-search(S,Ssamplesfromtail)

removeirrelevantStailasdescribedin

Section4.1.2;

if(|S|+|Stail|derivetheintersectionofinvertedlistsforsamples

inSasdescribedinSection4.1.1;ifScanbeprunedbythecriteriain

Section4.1.3thenreturn;whileSletitail=∅do

=min{j|sj∈Slettail=tail−{stail};i};

callrecursive-search(S∪{si},Stail);end-while

derivethemaximalcoherentgenesetGGS;

output(S×S)asamaximalcoherentgeneclusterif

itisnotsubsumedbyanymaximalcoherentgeneclusterfoundbefore

End

Figure6:TheSample-GeneSearchalgorithm.

{(s1,s2,s3)}andS{g2}={(s2,s3,s4),(s5,s7)}.Then,{s2,s3}istheonlymaximalsamplesetthatbothg1andg2arecoherenton.Thatis,S{gS1,g2}={(s2,s3)}.Inotherwords,wecanderive{g1,g2}from{s1,s2,s3}∩{s2,s3,s4}and{s1,s2,s3}∩{s5,s7}.

Ingeneral,ifgenesetG=G1∪G2,thenSGcanbede-rivedfromSG1andSG2bythefunctionﬁnd-max-coherent-sample-setsinFigure7.

4.2.2

PruningherentGeneIrrelevantClusters

GenesandUnpromisingCo-SimilartotheideainSection4.1.2,wecanprunegenesthatcannotbeusedtoextendthecurrentcombinationofgenes.GivenasetofgenesG={gi1,...,gik},where1≤ii<···BasedonthesameideainSection4.1.3,wecanusethemaximalcoherentgeneclusterstoprunetheunpromisingcoherentgeneclusters.SupposewearesearchingagenecombinationG1.LetS1beonemaximalcoherentsamplesetwithrespecttoG1,i.e.,S1∈SG1.Ifthereexistsamax-imalcoherentgenecluster(G×S)suchthatS1⊆SandG1⊆G,thenS1shouldberemovedfromthelistofthemax-imalcoherentsamplesetsSG1,sinceitcannotleadtoanymaximalcoherentgenecluster.Moreover,ifSG1becomesemptyafterthepruning,thenG1shouldbepruned,sinceanyrecursivesearchfromG1cannotleadtoanymaximalcoherentgenecluster.

Functionﬁnd-max-coherent-sample-sets

Input:genesetsG1andG2,setsofmaximalcoherent

samplesetsSmaximalG1andScoherentG2;

Output:thesamplesetsw.r.t.G1∪G2Method:

letSG1∪G2=∅;

foreachmaximalcoherentsamplesetSi∈SforeachmaximalcoherentsamplesetSG1do

j∈SletSG2do

k=Si∩Sj;

if|Sk|≥minstheninsertSkintoSG1∪G2;end-forend-for

foreachSk∈SG1∪G2do

ifSthenkisapropersubsetofS󰀂∈SS−{SG1∪G2

G1∪G2=SG1∪G2k};

end-for

outputSG1∪G2;

Figure7:Computingmaximalcoherentsamplesetforasetofgenes.

4.2.3MergingCoherentGenesintheTailList

Inourrunningexample,themaximalcoherentsamplesetswithrespecttogeneg1andg3areidentical.Then,foranycoherentgenecluster(G×S)suchthatg1∈G,(G∪{g3}×S)mustalsobeacoherentgenecluster.Thus,wecansearch{g1,ge}inashoot.

Ingeneral,wehavethefollowingresult.Lemma4.1.WhensearchacombinationofgenesG,ifthereexistgenes{gj1,...,gjk}⊆GtailsuchthattheyarecoherentoneverymaximalcoherentsamplesetofG,thenthereexistsnomaximalcoherentgeneclustercontainingGbutno{gj1,...,gjk}.

BasedonLemma4.1,wecanimmediatelymergegenes{gj1,...,gjk}toGatthecurrentnode,andthusshrinkthenumberofrecursions.Thecomputationtimeissavedaswell,sinceweonlyneedtocheckthecoherentgeneclusters,prunetheirrelevantgenesorunpromisinggeneclustersforallthesegenesinoneshoot.

Inourexperimentsonrealdatasets,weobservemanygenescanbemergedbyLemma4.1.Thereal-worldGSTmicroarraydatasetsaretypicallysparseandgenesareco-herentonaquitesmallnumberofsamplesets.Asacon-sequence,theperformanceofGene-SampleSearchcanbeimprovedsubstantiallybythisoptimization.

Onemayask,“DowehaveasymmetricpruningforSample-GeneSearch?”Wecanapplythesimilaroptimiza-tiontechniqueforsample-gene.Thatis,asamplesjismergedintocurrentcombinationofsamplesSaslongastheinvertedlistofSisasubsetofthatofsj.However,suchasituationisrareinpractice,sinceitisrarethat,fortwodiﬀerentsamplesetsS1andS2,themaximalcoherentgenesetswithrespecttoS1andS2areidentical.

Basedontheabovediscussion,wehavetheGene-SampleSearchalgorithmasshowninFigure8.

5.EXPERIMENTALRESULTS

WeimplementedandtestedourapproachesonbotharealGSTmicroarraydatasetandsyntheticdatasets.ThesystemisimplementedinJava.ThetestsareconductedonaSunUltra10workstationwitha440MHzCPUand256MBmainmemory.

Inputandoutput:sameastheSample-GeneSearch

algorithm(Figure6)Method:

fori=1to(|G-Set|−ming)do

letG={gi}andGtail={gi+1,...,g|G-Set|};callrecursive-search(G,Gend-fortail);Procedure:recursive-search(G,GremoveirrelevantgenesfromGtail)

tailasdescribedin

Section4.2.2;

if(|G|+|Gtail|mergecoherentgenesinGtailasdescribedin

Section4.2.3;

foreachmaximalcoherentsamplesetSi∈SifSGdo

icanbeprunedbythesecondcriteriain

Section4.2.2thenremoveSifromSG;

end-for

if(SG=∅)thenreturn;whileGtail=∅do

leti=min{j|gj∈GGtail};lettail=Gtail−{gi};

callrecursive-search(G∪{gi},Gtail);end-while

foreachsamplesetSiinSoutput(G×SGdo

i)asamaximalcoherentgene

clusterifitisnotsubsumedbyanymaximalcoherentgeneclusterfoundbefore

end-forEnd

Figure8:TheGene-SampleSearchalgorithm.

5.1TheDataSets

Therealdataset.Weusetherealgene-sample-timemi-croarraydatasetreportedin[20].Itconsistsofthemi-croarraymeasurementsof4,324genesin13multiplesclero-sis(MS)patientsbeforeandat1,2,4,8,24,48,120and

168hoursafterIFN-βtreatments.MSpatientsshowhetero-geneousresponsestoIFN-βtreatments.Forexample,thepatientswithrelapsingMSrespondbettertoIFN-βtreat-mentsthanthepatientswithprogressivediseasedo.How-ever,relapsingMSpatientsalsoexhibitconsiderableinter-individualheterogeneityintheirclinicalresponsestoIFN-βtherapies.Sofar,theeﬀectsofIFN-βtreatmentatthege-nomiclevelinhumansarepoorlyunderstood.Researchersareinterestedindistinguishingtheheterogeneousclinicalre-sponsetoIFN-βtherapyamongthepatients.Ontheotherhand,characterizedgeneexpressiondynamicscorrelatedtotheheterogeneousresponsespotentiallyhelpinexploringthecausingmechanismsatthemolecularlevel.

Syntheticdata.Weobservethatthepreprocessing,i.e.,miningthemaximalcoherentsamplesetsforeachindivid-ualgene,isrelativelyfast.Themajorbottleneckinminingcoherentgeneclustersisinthelatterpart.Therefore,in-steadofgeneratingsyntheticGSTdatasets,wesimulatethetableofmaximalcoherentsamplesetsforgenessuchasinFigure5(a).Initially,anemptytableiscreated.Then,acertainnumberofcoherentgeneclusters(G×S)areran-domlygenerated.Foreachg∈G,Sisinsertedintothetableasonemaximalcoherentsamplesetwithrespecttog.Inadditiontothesizeofthesyntheticdataset,i.e.,thetotalnumberofgenesinG-SetandthenumberofsamplesinS-Set,thesyntheticdatageneratortakesthefollowingparameters:(1)k,thenumberofcoherentgeneclustersin

thedataset;(2)maxgeneandmingene,themaximalandminimalnumbersofgenesinacoherentgenecluster,re-spectively;and(3)maxsampleandminsample,themaximalandminimalnumbersofsamplesinacluster,respectively.Wegeneratethedatasetsbysettingmingene=10andminsample=5.maxsampleissettothesamevalueof|S-Set|,andmaxgeneissetto1000.Inpractice,onlyasmallnumberofgenesarecorrelatedwithaphenotype[7].Whenthesizeofthedatasetgrows,weexpecttoseemoreco-herentgeneclusters.Tosimulatethesituation,wesetkto|G-Set|·|S-Set|

3000.5.2ResultsontheMSMicroarrayData

WeapplyouralgorithmsontheMSmicroarraydatawithming=15,mins=3andδ=0.8.Intotal,21coherentgeneclustersarereported.Tobetterunderstandthemin-ingresults,wefeedthegenesineachclustertoOnto-Express(http://vortex.cs.wayne.edu/Projects.html)andobtainahi-erarchyoffunctionalannotationsintermsofGeneOntologyforeachcluster.Anexampleofgeneontologytreeforclus-ter17isshowninFigure9(a).Then,wefurtherinvestigatethegenesandsamplesintheclusters.Someinterestingob-servationsareobtainedasfollows.

First,asexpected,themajoritygenesintheclustersareinvolvedincellularprocessesandphysiologicalprocesses,whilegenesinvolvedinotherbiologicalprocesses(e.g.,de-velopment,behaviorandvirallifecycle)arenothighlyrep-resented(Figure9(b)).Moreover,amongthegenesinvolvedincellularprocess,thoseinvolvedincellcommunicationandcellgrowthand/ormaintenancearepredominant(Fig-ure9(c)).SinceIFN-βisknowntohaveanti-proliferativeactivities,thehighpopulationofcellularprocessgenesin-volvedincellgrowthand/ormaintenanceisbiologicallyplausible.

Second,abunchofgenes(e.g.,incluster10,genesH38522,AA704613,N92443,N75595,etc.)thatarewellknownfortranscriptionalsignalingandcellularsignalingcanbeidenti-ﬁedintheresultedclusters.Thosegenes,togetherwiththeothergenesinthesameclusterthatareunknownorpoorlyunderstood,mayserveasswitchesinthegeneticnetworkandhenceplayanessentialroleinthebiologicalprocesses.Thus,studyingthetime-seriesofthegenesinthecoherentgeneclustersmaygreatlyhelppeopleunderstandtheregu-latorymechanismsbehindtheresponsetoIFN-βtreatment.Last,coherentgeneclustersalsoconsistofdiﬀerentgroup-ingsofsamples,whichprovidepromisinghypothesisfordif-ferentphenotypes.Forexample,intheMSmicroarraydata,theexpressiondatafrompatientswithdiﬀerentresponsestoIFN-βtreatmentarecollected.Amongthe21reportedclus-ters,only2clusters(cluster10and17)consistof4samples,whileotherclustersonlyconsistoftheminsnumberofsam-ples.Therefore,thosetwoclustersmaybecomegoodcan-didatesfortargetphenotype.Basedontheclinicalrecordsofthepatients,andcombinedwiththegeneinformationfromtheclusters,theinterpretationofthesamplegroupsiscurrentlyunderinvestigation.

5.3EffectsoftheParameters

Themaximalcoherentgeneclusterisdeﬁnedwithrespecttothreeparameters,i.e.,theminimumnumberofgenesming,theminimumnumberofsamplesminsandtheco-herencethresholdδ.WetesttheeﬀectoftheparametersontherealGSTdataset.Figure10(a)showsthenumber

A–CellularprocessB–DevelopmentC–Physiologicalprocesses

D–BehaviorE–Virallifecycle

80Gene population (percent) 70 60 50 40 30 20 10 0

8.5

0.6

0.3

BCDECategory of biological process33.1

56.1

(b)Thedistributionofbiologicalprocess

A–CellcommunicationB–CelldeathC–Cellgrowthormaintenance

D–CellmotilityE–Celldiﬀerentiation

60Gene population (percent) 50 40 30 20

12.0

10 0

3.4

BCDCategory of cellular process

0.0E

41.9

(a)Thegeneontologytreeforgenesincluster17

(c)Thedistributionofcellularprocess

Figure9:Thegeneontologytreeandthedistributionoffunctionsforgenesincluster17.

ofcoherentgeneclusterswhenmingvariesfrom5to100,mins=3andδ=0.8.Clearly,thenumberofcoherentgeneclustersinthedatasetdecreaseswhenmingincreases.Theresultconcurstheintuition:withalowermingvalue,wecancatchmoreclusterswithmoreorlessgenes.Asamatteroffact,withﬁxedminsandδ,letBibethecompletesetofcoherentblockswhenming=i.Then,wecanshowB1⊇...Bi⊇...Bn.

Figure10(b)showsthenumberofcoherentgeneclusterswithrespecttovariousminswhenmingandδareﬁxedto10and0.8,respectively.Thisresultcanbeexplainedinawaysimilartothesituationofming.Figure10(c)showstheeﬀectofδonthenumberofcoherentgeneclustersinthedataset,withming=10andmins=3.Whenwelowerthecoherencethreshold,morecombinationsofsamplesare“coherent”bychancewithrespecttoaminimumofminggenes.

Interestingly,thethreecurvesinFigure10sharesimilartrends.Thatis,whenthevalueoftheparameter(repre-sentedbytheXaxis)increases,thenumberofcoherentgeneclusters(representedbytheYaxis)goesdown.Thecurvedropssharplyuntila“knot”ismet,thenthecurvegoesstablytotheright.Forexample,wecanseethe“knots”ofming=20inFigure10(a),mins=4inFigure10(b)andδ=0.85inFigure10(c).These“knots”indicatethatthereexiststableandsigniﬁcantcoherentgeneclustersintherealdataset.Theyarehighlycorrelated,involvingastatisticallysigniﬁcantnumberofgenesandsamples.The“knots”alsosuggestthebestsettingsoftheparameterstoavoidthecoherentgeneclustersformedjustbychance.

5.4Scalability

Weﬁrsttesttheeﬃciencyofthepreprocessing(algorithminFigure3)onvariousrandomsubsets(bysampling)oftherealmicroarraydataset.Thesizeofthesubsetsvariesfrom500to4324genes,andallthesamplesareincluded.Foreachsize,wesampled30subsetsandcalculatetheaverageruntime.Figure11(a)illustratesthescalabilityforthepre-processingstep.AswediscussedinSection3,therealGSTmicroarraydatasetsareoftensparse.Withtheeﬃcientpruningtechniques,thepreprocessingalgorithmislinearlyscalabletothesizeofthedatasets.

WethentestthescalabilityofbothGene-SampleSearchandSample-GeneSearchonsyntheticdatasets.Wesetmins=5,ming=10,andδ=0.8.Weﬁrstﬁxthenum-berofsamplesto30,andreporttheruntimewithrespecttonumberofgenes(Figure11(b)).Wecanseebothapproachesshowanapproximatelylinearscalabilitywithrespecttothenumberofgenes.Figure11(c)showsthescalabilityforbothapproachesunderdiﬀerentsizesofsamplesets(from30to100),whenthenumberofgenesisﬁxedto3000.Wecanseebothapproachesscalewellwithrespecttothenumberofsamples.Sincethenumberofgenesforamicroarraydataistypicallybyfarlargerthanthatofthesamples,theenumer-ationofgenesismuchmoreexpensivethantheenumerationofsamples.ThisexplainswhytheSample-GeneSearchisfasterthantheGene-SampleSearch.

5.5TheEffectofLemma4.1

Lemma4.1canidentifythegenesgithatcanbemergedintothecurrentcombinationofgenes,andthuscanreduce

100 90 80 70 60 50 40 30 20 10 0

180 160 140 120 100 80 60 40 20 0

min_s

600 500

Number of MCBsNumber of MCBsNumber of MCBs 400 300 200 100 0

0 10 20 30 40 50 60 70 80 90 100

min_g

5 6 7 0.6 0.65 0.7 0.75 0.8 0.85 0.9 0.95

delta

(a)Numberofclustersvs.ming

mins=3,δ=0.8(b)Numberofclustersvs.mins

ming=10,δ=0.8(c)Numberofclustersvs.δming=10,mins=3

Figure10:Theeﬀectsoftheparametersonthenumberofclusters.

2.2 2 1.8 1.6 1.4 1.2 1 0.8 0.6 0.4 0.2 0.5

300 250Time (Seconds) 300g-s searchs-g searchTime (Seconds) 250 200 150 100 50g-s searchs-g searchTime (Seconds) 200 150 100 50 1 1.5 2 2.5 3 3.5 4 4.5

0 3 4 5 6 7 8 9 10 0 30 40 50 60 70 80 90 100Number of genes (K)Number of genes (K)Number of samples

(a)Preprocessing

(b)Scalabilityw.r.t.numberofgenes(numberofsamples:30)

Figure11:Scalabilityonlargedatasets.

(c)Scalabilityw.r.t.numberofsamples(numberofgenes:3,000)

thenumberofrecursionsinthemining.Weusesomesam-plesoftherealmicroarraydataset(eachsubsetcontains100to1000genesand12patients)tocomparetheperformanceoftheGene-SampleSearchwithandwithouttheoptimiza-tion.Thecomparisonisconductedinthreeaspects:(1)themaximalnumberofrecursionlevelsintheGene-SampleSearch;(2)thenumberofgenecombinationsintheGene-SampleSearch;and(3)theruntime.Figure12showstheresults.Wecanclearlyseethat(1)themaximalnumberofrecursionlevelscanbereducedsubstantially(Figure12(a));(2)withtheoptimization,thetotalnumberofgenecom-binationsneededtobecheckedgoesdownsharply(12(b));and(3)theruntimeismuchshorterwhentheoptimizationisapplied(12(c)).TheresultsstronglyconﬁrmthattheoptimizationiseﬀectiveforGene-SampleSearch.

WealsoapplythespiritofLemma4.1ontheSample-GeneSearch,andconductsimilartests.However,wecanhardlyseeanysigniﬁcantimprovementbroughtbytheoptimiza-tion.AswediscussedinSection4.2.3,duetothesparsityofthemicroarraydata,manygenescanbemergedbecausetheyarecoherentonthesamesampleset.However,fewsampleswouldbemergedtogethersinceusuallythemaxi-malcoherentgenesetswithrespecttotwodiﬀerentsamplesetsarenotidentical.

6.RELATEDWORK

Thisresearchisrelatedtopreviousworkonclusteringconventionalgene-time/gene-samplemicroarraydataandfrequentitemsetmining.

AsweintroducedinSection1,therehavebeentwocate-goriesofconventionalmicroarraydatasets:gene-timedatasetsandgene-sampledatasets.Forgene-timemicroarray

data,variousalgorithms(e.g.[17,15,8])havefocusedonclusteringthegenes.Thatis,co-expressedgenesaregroupedbasedontheirexpressionpatternsduringthetimeseries.Ontheotherhand,diﬀerentapproaches(e.g.[4,21,16])havebeenproposedtopartitionthesamplesetstoﬁndtheirmacroscopicphenotypesaswellastodetectinforma-tivegeneswhichmanifestthesamplepartition.However,alltheclustermodelsinthosepreviousstudiesaresubstantiallydiﬀerentfromourcoherentgeneclusters.Asaconsequence,thosealgorithmscannotbeextendeddirectlytosolveourproblem.

In[6],ChengandChurchintroducetheconceptofbiclus-tertomeasurethecoherencebetweengenesandconditions(eithertimeseriesorsamples).Givenasetofgenesandasetofconditions,abiclusterisasubsetofgenescoher-entwithasubsetofconditions.Yangetal.[22]proposeamove-basedalgorithmtoﬁndbiclustersmoreeﬃciently.Bothalgorithmsin[6]and[22]adoptheuristicsearchap-proaches,andthuscannotguaranteetoﬁndthecompletesetofbiclustersinthedataset.

In[19],Wangetal.proposethemodelofpattern-basedcluster.GivenasubsetofobjectsOandasubsetofat-tributesA,pair(O,A)formsapattern-basedclusterifforanypairofobjectsx,y∈O,andanypairofattributesa,b∈A,thediﬀerenceofchangeofvaluesonattributesaandbbetweenobjectsxandyissmallerthanathresh-oldδ.Inarecentstudy[11],Peietal.hasproposedaneﬃcientalgorithm,MaPletominethecompletesetofmax-imalpattern-basedclusters.

Weborrowsomeimportantideasfrompreviousstudiesonfrequentitemsetmining.First,theframeworkofourapproachesissimilarinspirittothatofpattern-growth

400 350 300 250Level 2500Opt OffOpt OnNumber of nodes 140Opt OffOpt OnTime (Seconds) 2000 1500 1000 500 120 100 80 60 40 20 0Opt OffOpt On 200 150 100 50 0 200 300 400 500 600 700 800 900 1000Number of genes

0 200 300 400 500 600 700 800 900 1000Number of genes

200 300 400 500 600 700 800 900 1000

Number of genes

(a)Maximalnumberof

recursionlevels(b)Numberofgenecombinationssearched

(c)Runtime

Figure12:EﬀectofLemma4.1forGene-SampleSearch.

methodsforfrequentpatternmining.Second,thepruningtechniquesinourapproachessharesomeinterestingsimi-laritieswiththemethodsofminingfrequentcloseditem-sets(e.g.,[10]).However,therearetwoessentialdiﬀerencesbetweenthefrequentpatternminingmethodsandtheap-proachesdevelopedinthispaper.Ontheonehand,thecoherentgeneclustersareinherentlydiﬀerentfromfrequentitemsets.Thus,thesimilaritybetweenthetwocategoriesofmethodsisonlyatthelevelofspirit(e.g.,setenumerationandpruning).Thetechnicaldetailsaredramaticallydiﬀer-ent.Ontheotherhand,newtechniquessuchastheinvertedlistsareadoptedtotackletheparticularmicroarraydata.

7.CONCLUSIONS

Inthispaper,wehaveinvestigatedanoveltypeofgene-sample-timemicroarraydatasetsandproposeanewprob-lemofminingcoherentgeneclustersfromsuchdatasets.Wehaveconductedasystematicstudytodeveloptwomin-ingmethods:theSample-GeneSearchandtheGene-SampleSearch.Ourextensiveperformancestudyonbotharealmi-croarraydatasetandsyntheticdatasetsshowsthatthereexistinterestingandsigniﬁcantcoherentgeneclustersintherealdataset,andboththealgorithmshavegoodperfor-mance.Sincethenumberofgenesinthemicroarraydataistypicallybyfarlargerthanthenumberofsamples,theSample-GeneSearchusuallyoutperformstheGene-SampleSearch.

8.REFERENCES

[1]AlizadehA.A.,etal.Distincttypesofdiﬀuselargeb-cell

lymphomaidentiﬁedbygeneexpressionproﬁling.Nature,Vol.403:503–511,February2000.

[2]AlonU.,etal.Broadpatternsofgeneexpressionrevealed

byclusteringanalysisoftumorandnormalcolontissuesprobedbyoligonucleotidearray.Proc.Natl.Acad.Sci.USA,Vol.96(12):6745–6750,June1999.

[3]BlackM.A.andDoergeR.W.Calculationoftheminimum

numberofreplicatespotsrequiredfordetectionofsigniﬁcantgeneexpressionfoldchangeinmicroarrayexperimentsBioinformatics,18:1609-1616,2002

[4]Ben-DorA.,etal.Classdiscoveryingeneexpressiondata.

InProc.FifthAnnualInter.Conf.onComputationalMolecularBiology(RECOMB2001).

[5]KerrM.K.andChurchillG.A.Bootstrappingcluster

analysis:Assessingthereliabilityofconclusionsfrom

microarrayexperiments.Proc.Natl.Acad.Sci.USA,Vol.98:8961-8965.

[6]ChengY.andChurchG.M.Biclusteringofexpressiondata.

ProceedingsoftheEighthInternationalConferenceonIntelligentSystemsforMolecularBiology(ISMB),2000.

[7]GolubT.R.,etal.Molecularclassiﬁcationofcancer:Class

discoveryandclasspredictionbygeneexpression

monitoring.Science,Vol.286(15):531–537,October1999.[8]JiangD.,etal.InteractiveExplorationofCoherent

PatternsinTime-SeriesGeneExpressionData.InKDD’03.[9]MolerE.J.,etal.AnalysisofMolecularProﬁleDataUsing

GenerativeandDiscriminativeMethods.PhysiologicalGenomics,Vol.4(2):109–126,2000.

[10]PeiJ.,etal.CLOSET:Aneﬃcientalgorithmformining

frequentcloseditemsets.InDMKD’00.

[11]PeiJ.,etal.MaPle:AFastAlgorithmforMaximal

Pattern-basedClusterin.InIEEEICDM’03.[12]HerwigR.,etal.Large-ScaleClusteringof

cDNA-FingerprintingData.GenomeResearch,9:1093–1105,1999.

[13]RymonR.Searchthroughsystematicsetenumeration.In

Proc.1992Int.Conf.PrincipleofKnowledgeRepresentationandReasoning(KR’92).

[14]DudoitS.,etal.Comparisonofdiscriminationmethodsfor

theclassiﬁcationoftumorsusinggeneexpressiondata.J.oftheAmericanStatisticalAssociation,Vol.97,No.457.[15]TamayoP.,etal.Interpretingpatternsofgeneexpression

withself-organizingmaps:Methodsandapplicationtohematopoieticdiﬀerentiation.Proc.Natl.Acad.Sci.USA,Vol.96(6):2907–2912,March1999.

[16]TangC.,etal.Miningphenotypesandinformativegenes

fromgeneexpressiondata.InSIGKDD’03.

[17]TavazoieS.,etal.Systematicdeterminationofgenetic

networkarchitecture.NatureGenet,pages281–285,1999.[18]TusherV.G.,etal.SigniﬁcanceAnalysisofMicroarrays

AppliedtotheIonizingRadiationResponse.Proc.Natl.Acad.Sci.USA,Vol.98(9):5116–5121,April2001.

[19]WangH.,etal.ClusteringbyPatternSimilarityinLarge

DataSets.InSIGMOD2002.

[20]Weinstock-GuttmanB.,etal..Genomiceﬀectsof

interferon-binmultiplesclerosispatients.AcceptedpendingrevisionsJournalofImmunology,2003.

[21]XingE.P.andKarpR.M.Cliﬀ:Clusteringof

high-dimensionalmicroarraydataviaiterativefeatureﬁlteringusingnormalizedcuts.Bioinformatics,Vol.17(1):306–315,2001.

[22]YangJ.,etal.δ-cluster:CapturingSubspaceCorrelationin

aLargeDataSet.InICDE2002.

因篇幅问题不能全部显示，请点此查看更多更全内容

查看全文

全部栏目

Mining coherent gene clusters from gene-sample-time microarray data