您的当前位置:首页正文

Mining coherent gene clusters from gene-sample-time microarray data

2020-12-17 来源:步旅网
MiningCoherentGeneClustersfromGene-Sample-Time

MicroarrayData

DaxinJiang†

JianPei†‡MuraliRamanathan†ChunTang†AidongZhang†

StateUniversityofNewYorkatBuffalo,USASimonFraserUniversity,CanadaEmail:{djiang3,jianpei}@cse.buffalo.edu,murali@acsu.buffalo.edu,{chuntang,azhang}@cse.buffalo.edu

ABSTRACT

Extensivestudieshaveshownthatminingmicroarraydatasetsisimportantinbioinformaticsresearchandbiomedicalapplications.Inthispaper,weexploreanoveltypeofgene-sample-timemicroarraydatasets,whichrecordstheexpres-sionlevelsofvariousgenesunderasetofsamplesduringaseriesoftimepoints.Inparticular,weproposetheminingofcoherentgeneclustersfromsuchdatasets.Eachclus-tercontainsasubsetofgenesandasubsetofsamplessuchthatthegenesarecoherentonthesamplesalongthetimeseries.Thecoherentgeneclustersmayidentifythesamplescorrespondingtosomephenotypes(e.g.,diseases),andsug-gestthecandidategenescorrelatedtothephenotypes.Wepresenttwoefficientalgorithms,namelytheSample-GeneSearchandtheGene-SampleSearch,tominethecompletesetofcoherentgeneclusters.Weempiricallyevaluatetheperformanceofourapproachesonbotharealmicroarraydatasetandsyntheticdatasets.Thetestresultshaveshownthatourapproachesarebothefficientandeffectivetofindmeaningfulcoherentgeneclusters.

CategoriesandSubjectDescriptors

H.2.8[DatabaseManagement]:DatabaseApplications—DataMining

GeneralTerms

Algorithms,Experimentation

Keywords

Bioinformatics,clustering,microarraydata

∗0234895,Thisresearchispartlysupportedby01A1.IIS-0308001andNIHgrant1NSFP20grantsDBI-dationsAllnecessarilyinthisopinions,findings,conclusionsandGM067650-recommen-reflectpapertheareviewsthoseoftheoffundingtheauthorsagencies.

anddonotPermissiontomakedigitalorhardcopiesofallorpartofthisworkfor

personalorclassroomuseisgrantedwithoutfeeprovidedthatcopiesarenotmadeordistributedforprofitorcommercialadvantageandthatcopiesbearthisnoticeandthefullcitationonthefirstpage.Tocopyotherwise,torepublish,topostonserversortoredistributetolists,requirespriorspecificpermissionand/orafee.

KDD’04,August22–25,2004,Seattle,Washington,USA.Copyright2004ACM1-58113-888-1/04/0008...$5.00.

1.INTRODUCTION

Therecentmicroarraytechnologycanmeasuretheex-pressionlevelsofthousandsofgenessimultaneously.Itisanimportantresearchprobleminbioinformaticsandclini-calresearchtoexplorethepatternsinmicroarraydatasets.Forexample,indrugdevelopment,geneexpressionpatternsmayreflectgene-levelresponsestodifferentdrugtreatmentsandprovidedeepinsightsintothenatureofthediseases.Mostmicroarraydatasetscanbedividedintotwocate-gories.Ontheonehand,thegene-timedatasetsrecordtheexpressionlevelsofvariousgenesduringimportantbiologi-calprocessesoveraseriesoftimepoints.Thegene-sampledatasetsaccounttheexpressionlevelsofvariousgenesacrossrelatedsamples.Bothgene-timedatasetsandgene-sampledatasetscanberepresentedbyan×lgeneexpressionmatrixofngenesandlsamples/timepoints.Inthegeneexpressionmatrix,therowsarethegenesandthecolumnsareeithersamples(ingene-sampledatasets)ororderedtimepoints(ingene-timedatasets),whileeachcellrepresentstheex-pressionlevelofacertaingeneonacertainsampleoratacertaintimepoint.

Withthelatestadvancesinthemicroarraytechnology,theexpressionlevelsofasetofgenesunderasetofsam-plescanbemonitoredsynchronicallyduringaseriesoftimepoints[20].Differentfromthepreviousgene-timeorgene-samplemicroarraydatasets,thesenewdatasetshavethreetypesofvariables:genes,samplesandtime.Wecallsuchdatagene-sample-timemicroarraydata,orGSTdataforshort.Figure1(a)elaboratesthestructureofaGSTmi-croarraydataset.

ssamplesj1sj2sj3emitsamplesgenesgi1genesm1111mm1121lgi2m121m1n1m1nlgi3(a)

(b)

Figure1:ThestructureofGSTmicroarraydataIngeneral,eachcellmki,jinaGSTdatasetrepresentstheexpressionlevelofgenegiundersamplesjattimepointtk.Interestingly,aGSTdatasetcanalsobeviewedasan×l

matrix,suchthateachcellmi,jcontainsthetimeserieswithrespecttogenegiundersamplesj,asshowninFigure1(b).Thepreviousstudiesongene-samplemicroarraydata(e.g.,[7,2,15,1])indicatethathighcorrelationsmayexistbe-tweenthegeneexpressionpatternsandsomediseases.ItisnaturaltoextendthesimilaranalysistoGSTmicroarraydata.Thatis,itisinterestingtoidentifyasubsetofgenesGandasubsetofsamplesSinaGSTmicroarraydatasetsuchthateachgeneg∈GhascoherentpatternsacrossthesamplesinSduringthetimeseries.Forexample,inFig-ure1,genegi1,gi2andgi3showcoherentpatternsacrosssamplessj1,sj2andsj3,respectively.Wecallsuchsubsetsofgenesandsamplesacoherentgenecluster.

Whatisthebiologicalmeaningofthecoherentgeneclus-ters?Thecoherentgeneclustersprovidevaluablehypothesisforbiologists.Thesamplesinaclustermaycorrespondtoaphenotype,suchasthepatientshavingadisease,whilethecorrespondingsetofgenesmaysuggestthecandidategenescorrelatedtothedisease.

Thefunctionsofgenesinanorganismarehighlycom-plicated.Therearetypicallymultiplecoherentclustersinadataset.Differentclustersmaycorrelatetodifferentpheno-types,suchasageandgender.Therefore,toavoidmissinganyvaluablehypothesis,itisnecessarytominealltheco-herentclustersinthedataset.

Manypreviousstudiesinvestigatetheminingofinterest-ingpatternsfrommicroarraymatrices.Forexample,vari-ousclusteringalgorithmscanidentifytheco-expressedgenesshowingcoherentpatternsduringthetime-series(e.g.,[17,12,15,8]).Moreover,bothsupervisedorunsupervisedap-proachesareproposedtopartitionthesamplesintohomo-geneousgroups(e.g.,[4,14,9,16]).Additionally,statisticalapproacheshavebeenproposedtovalidatethesignificanceoftheminingresults(e.g.,[3,18,5]).However,allprevi-ousstudiestargetatconventionalgene-timeorgene-samplemicroarraydatasets.Themodelsofclustersinthosepre-viousstudiesaredifferentfromourcoherentgeneclusterswhichdisclosethecorrelationamonggenes,samplesandtimepoints.Therefore,thosealgorithmscannotbeextendeddirectlytosolveourproblem.

Recently,thepattern-basedclusteringapproaches(e.g.,[19])havebeendevelopedtodiscoversubsetsofobjectsfollowingsimilarpatternsonsubsetsofattributes.Conceptually,apattern-basedclusterisacoherentgenecluster.IfwetreattheGSTmicroarraydatasetsasan×lmatrixoftimese-ries,asshowninFigure1(b),thenpattern-basedclustersandcoherentgeneclustersmayhavesomesimilarityatthefirstlook.However,insomepattern-basedapproaches,aclusterrequiresthateachpairofobjectsintheclustermustbecoherentoneachpairofattributes.Sucharequirementisoftentoostronginpractice.Ourcoherentgeneclusteringrelaxestheconstraintsamongtheobjects.Therefore,eachtraditionalpattern-basedclusterisacoherentgenecluster,butnotnecessarilytrueviceversa.

Inthispaper,wetackletheproblemofminingcoherentpatternsfromgene-sample-timemicroarraydatasetsandmakethefollowingcontributions.

First,weproposeamodelofcoherentgeneclustersinGSTmicroarraydatasets.Wejustifythatthemodelismeaningfulforbiomedicalresearch.

Second,weidentifythecomputationalchallengesandcon-ductasystematicresearchonminingcoherentgeneclus-tersfromGSTmicroarraydatasets.Wedeveloptwoap-

proaches,namelytheGene-SampleSearchandtheSample-GeneSearch,tominethecompletesetofcoherentgeneclus-ters.Weillustrateandcomparetheefficiencyandscalabilityofbothapproaches.

Last,weconductanextensiveempiricalevaluationonbothrealdatasetsandsyntheticdatasets.Ourresultsshowthatourproposedmethodscanfindcoherentgeneclustersinterestingtobiomedicalresearchfromrealdatasets.Theresultsonsyntheticdatasetsalsoshowthatouralgorithmsarebothefficientandscalable.

Therestofthepaperisorganizedasfollows.Section2definestheproblem.Section3describesthepreprocessingstepofcomputingthemaximalcoherentsamplessetsforeachindividualgene.Section4presentstwoalgorithmstominecoherentgeneclusters.InSection5,ourmethodsareevaluatedusingrealandsyntheticdatasets.RelatedworkisdiscussedinSection6.Section7concludesthepaper.

2.PROBLEMDESCRIPTION

GivenasetofngenesG-Set={g1,...,gn}andasetoflsamplesS-Set={s1,...,sl},wecanmeasuretheexpressionlevelsofthegenesonthesamples.Theresultsformacon-ventionaln×lmicroarraymatrixM={mi,j},wheremi,jistheexpressionlevelofgenegi(1≤i≤n)onsamplesj(1≤j≤l).Ifsuchmicroarrayexperimentsareconductedsynchronicallyonallgenesandallsamplesattimeinstantst1,...,tT,theresultsforman×l×TGSTmicroarraymatrixM={mti,j},where(1≤t≤T).

AGSTmicroarraymatrixM={mti,jasan×lmatrixM󰀂={m󰀂

}canalsobeviewed

i,j}suchthatm󰀂i,jisavector

of󰀒m1i,j,...,mT

i,j󰀓.Hereafter,wedonotstrictlydistinguishthetwonotations.Instead,whenevermi,jiswritten,thevectorisreferredto.

Inthispaper,weareinterestedinfindingthosegenesthatarecoherentonasubsetofsamplesduringthewholetimeseries.Therearevariousmethodstomeasurethecorrela-tionbetweentwotimeseries.However,forgeneexpressiondata,usersareofteninterestedintheoveralltrendsoftheexpressionlevelsinsteadoftheabsolutemagnitudes.There-fore,wechoosethePearson’scorrelationcoefficientasthecoherencemeasure,sinceitisrobusttoshiftingandscalingpatterns[22].Specifically,giventwovectorsmi,j1andmi,j2ofgenegi,thecoherenceρ(mi,j1,mi,j2)isdefinedas

󰀂T󰀁mti,jt

t=1(1−mi,j1)(mi,j2−mi,j2)󰀂T󰀁t=1(mt󰀂T,i,j1)2t=1(mti,j21−mi,j2−mi,j2)m󰀂mtwherei,j=Ti,j

istheofgenegonsamplet=1Tmeanoftheexpressionlevelsisj.Thecorrelationcoefficientrangesbetween−1and1.Thelargerthevalue,themorecoherentarethetwovectors.

AgenegiiscoherentacrossasubsetofsamplesS⊆S-Set,ifgivenanypairofsamplessj1,sj2∈S,ρ(mi,j1,mi,j2)≥δ,whereδisaminimumcoherencethresholdspecifiedbytheuser.ForasubsetofgenesG⊆G-SetandasubsetofsamplesS⊆S-Set,ifeverygenegi∈GiscoherentacrosssamplesinS,wecallgenesetGcoherentonsamplesetS.(G×S)iscalledacoherentgenecluster.Acoherentgeneclusterhavingugenesandvsamplesissaida(u,v)-coherentgenecluster.

Trivially,foranygenegiandanysamplesj,({gi}×{sj})isa(1,1)-coherentgenecluster,and(G-Set×{sj})and({gi}×

S-Set)aretrivial(|G-Set|,1)-and(1,|S-Set|)-coherentgeneclusters,respectively.Toavoidthistriviality,werequirethatacoherentgeneclustershouldconsistofatleasttwogenesandtwosamples.

Givenacoherentgenecluster(G×S),foranysubsetsG󰀂

⊆GandS󰀂⊆S,(G󰀂×S󰀂)isalsoacoherentgenecluster.Toavoidsuchredundancy,acoherentgenecluster(G×S)ismaximal󰀂ifthereexistsnoanyothercoherentgeneclus-ter(G󰀂×S)suchthatG⊆G󰀂,S⊆S󰀂.Moreover,ausermaynotbeinterestedinverysmallclusters,whichareoftenformedbychance.Thus,ausercanspecifytheminimumnumbersofgenesandsamplesinacluster.Generally,givenmingandminsasuserdefinedminimumgenesizeandsam-plesizethresholds,acluster(G×S)iscalledsignificantif|G|≥mingand|S|≥mins.

ProblemdefinitionGivenaGSTmicroarraymatrixM,aminimumcoherencethresholdδ,aminimumgenesizethresholdmingandaminimumsamplesizethresholdmins,theproblemofminingcoherentgeneclustersistofindthecompletesetofsignificantmaximalcoherentgeneclustersinMwithrespecttotheparameters.Hereafter,asignifi-cantmaximalcoherentgeneclusteriscalledacoherentgeneclusterforshort.

3.MAXIMALCOHERENTSAMPLESETS

Weproposetwoalgorithmscomputingmaximalcoherentgeneclusters.Inbothalgorithms,tocomputecoherentgeneclusters,weneedtocheckwhetherasubsetofgenesarecoherentonasubsetofsamples.Tofacilitatethetests,foreachgenegk,wecomputethesetsofsamplesSsuchthat(1)|S|≥mins;(2)gkiscoherentonS;and(3)thereexistsnosupersetS󰀂⊃SsuchthatgkisalsocoherentonS󰀂.Siscalledamaximalcoherentsamplesetofgk.Pleasenotethat,ingeneral,agenemayhavemorethanonemaximalcoherentsampleset.

Foragenegk,allofitsmaximalcoherentsamplesetscanbecomputedefficientlyusingthefollowing2-stepprocess.Inthefirststep,wetestwhethergenegkiscoherentoneachpairofsamples(si,sj).Abinarytrianglematrix{ci,j}ispopulated,where1≤iOncethematrix{ci,j}ispopulated,theproblemoffind-inggk’smaximalcoherentsamplesetscanbereducedtotheproblemoffindingallmaximalcliquesofsizeatleastminsingraphGk=(S-Set,E),where(si,sj)isanedgeinthegraphifandonlyifci,j=1.AcliqueSiscalledmaximalifthereexistsnoanyothercliqueS󰀂suchthatS⊂S󰀂.Pleasenotethattheremayexistmorethanonemaximalcliqueinagraph.

Unliketheconventionalcliqueproblemwherethecliqueofthemaximalsizeisfound,here,weneedtofindthecompletesetofmaximalcliquesinthegraph.ItiswellknownthattheconventionalcliqueproblemisNP-complete.Soistheproblemoffindingthecompletesetofmaximalcliques.Fortunately,therealGSTmicroarraydatasetsareoftensparseandthenumberofsamplesistypicallybelowonehun-dred.Foreachgene,thenumberofmaximalcliquesisquitesmallandthesamplescanoftenbepartitionedintoexclu-sivesmallsubsets.Ourexperimentalresultsshowthat,withefficientsearchandpruningtechniques,itisstillpracticaltofindthecompletesetofmaximalcliques.Inthefollowing,

{}

{a}{b}{c}{d}{a,b}{a,c}{a,d}{b,c}{b,d}{c,d}{a,b,c}{a,b,d}{a,c,d}{b,c,d}{a,b,c,d}Figure2:Enumerationofcombinationsofsamples.wewillshowhowtofindthemaximalcliquesofasamplesetbyadepth-firstsearchinasamplesetenumerationtree.GivenasetofsamplesS={s1,...,sl},theset2S(i.e.,allcombinationsofsamples)canbeenumeratedsystematically.Forexample,considerasetofsamplesS={a,b,c,d}.Thecompletesetofnon-emptycombinationsofsamplescanbedividedinto4exclusivesubsets:(1)theoneshavingsamplea;(2)theoneshavingsamplebbutnoa;(3)theoneshavingsamplecbutnoaorb;and(4){d}.TheyareshownastheimmediatechildrenoftherootinFigure2.

Thesesubsetscanbefurtherpartitioned.Forexample,thefirstsubsetcanbefurtherdividedintothreeexclusivesub-subsets:(1)theoneshavingsamplesaandb;(2)theoneshavingsamplesaandcbutnob;and(3){a,d}.

ThetreeshowninFigure2iscalledasetenumerationtree[13]withrespectto{a,b,c,d}.Itprovidesaconceptualtooltoenumeratethecompletesetofcombinationssystem-atically.

Wecanconductarecursive,depth-firstsearchofthesam-plesetenumerationtreetodetectthemaximalcliquesofthesamples.|GivenasetofsamplesS,thesetenumerationtreehas2|Snodes.However,weneverneedtomaterializesuchatree.Instead,weonlyneedtokeepapathfromtherootofthetreetothenodewearesearchingasaworkingset,whichcontainsatmost(|S|+1)nodes.

Clearly,inthesetenumerationtree,eachnodecontainsauniquesubsetofsamples.Thuswecanusethesubsetofsamplestorefertothenode.Atnode{si1,...,sik}(1≤i1<···Lemma3.1.Atnodev={si1,...,sik}ofthesamplesetenumerationtree,where(1≤i1<···Wecanpruneanysubtreethatcannotleadtoacoherentsamplesetofatleastminssamples.

PruningRule3.1(Pruningsmallsamplesets).Atanodev={si1,...,sik},thesubtreeofvcanbeprunedif(k+|Tail|)Forexample,forasetoflsamples,eventhecompletesetofsamplecombinationscanbedividedintolexclusivesubsetsasshownbefore,weonlyneedtosearchthefirst(l−mins+1)subsets,sinceeachofthelast(mins−1)subsetscontainslessthanminssamples.

Moreover,ifthesamplesatthecurrentnodeanditsTailaresubsumedbysomemaximalcoherentsamplesetfound

Input:theGSTdataset,thecoherencethresholdδ;

minimumsamplesizethresholdmins;

Output:themaximalcoherentsamplesetsforeachgene;Method:

foreachgeneggeneratematrixkdo

ci,jforg//depth-firstsearch

k;fori=1to(l−mins+1)do

callsearch-clique({si},{si+1,...,sl});end-forend-forProcedure:search-clique(head,tail)

//headrecordsthesamplesinthecurrentnode

supposesiisthelastsampleinhead,removesamples

sjfromtailsuchthatci,j=0;//Lemma3.1if(|head∪tail|or(head∪tail⊂S)s.t.Sisamaximalclique

//Pruning3.2

thenreturn;//aborttherecursivesearchiftail=∅thenoutputamaximalclique;else

do

letj=min{k|sk∈tail}andtail=tail−{sj};callsearch-clique(head∪{sj},tail);untiltail=∅;return;

Figure3:Computingmaximalcoherentsamplesets.sofar,thentherecursivesearchcanalsobepruned,sinceitcannotleadtoanynewmaximalcoherentsampleset.PruningRule3.2(Pruningsubsumedsets).Atanodev={si1,...,sik},if{si1,...,sik}∪Tailisasubsetofsomemaximalcoherentsampleset,thenthesubtreeofthenodecanbepruned.

Basedontheabovelemmaandpruningrules,theprepro-cessingalgorithmispresentedinFigure3.Forthereadersfamiliarwiththetechniquesofdepth-firstminingofmaxi-mal/closedfrequentpatterns,theideasofpruningheresharethesimilarspiritwiththepruninginfrequentcloseditemsetmining(e.g.,[10]).However,onekeydifferenceisthatthefrequentpatternminingconductscountingondatabases,andwedonotneedtoscanthedatabaseforanycountingoncethetrianglematrix{ci,j}ismaterialized.Thecorrect-nessofthepreprocessingalgorithmcanbeshown.Limitedbyspace,weomitthedetailshere.

4.THEMININGALGORITHMS

Ana¨ıvemethodtofindthemaximalcoherentgeneclustersistotesteverypossiblecombinationofgenesandsamplesthoroughly.Afterallthecoherentgeneclustersarefound,wecanidentifyandreportthemaximalones.Thena¨ıvemethodisverycostlyandthusinfeasibleforrealdatasets.Forexample,supposewehave1,000genesand201000samples.Thena¨ıvemethodmayhavetosearchupto(2−1)×(220−1−20)≈1.12×10307combinations!

Howcanwesearchthehugespaceefficientlyandpruneunpromisingsubspacessharply?Whencomputingthemaxi-malcoherentsamplesets(Figure3),wesystematicallyenu-meratecombinationsofsamplesinarecursivedepth-firstsearch,anddeveloptechniquestopruneunpromisingsub-spacesaggressively.Stimulatedbythesimilarspirit,here

TheSample-GeneSearch

depth-firstenumeratesubsetsofsamplesforeachsubsetofsamplesSdo

findthemaximalsubsetsofgenesG

s.t.G×Sisacoherentgenecluster;

testwhether(G×S)isamaximalcoherentgene

cluster;

end-forTheGene-SampleSearch

depth-firstenumeratesubsetsofgenesforeachsubsetofgenesGdo

findthemaximalsubsetsofsamplesS

s.t.G×Sisacoherentgenecluster;

testwhether(G×S)isamaximalcoherentgene

cluster;

end-for

Figure4:TheframeworksoftheSample-GeneSearchandtheGene-SampleSearch.

wecanalsosystematicallyenumeratethecombinationsofgenesandsamples,andpruneunfruitfulcombinations.

Basically,wehavetwoalternatives.Ontheonehand,wecanenumerateallcombinationsofsamplessystematically.Foreachsubsetofsamples,wecanfindthemaximalsubsetsofgenesthatformcoherentgeneclustersonthesamples,andcheckwhethertheclustersaremaximal.ThismethodiscalledtheSample-GeneSearch.Ontheotherhand,wecanletthegeneenumerationgofirst.Foreachsubsetofgenes,wefindthemaximalsubsetsofsamplesthatformcoherentgeneclusterswiththegenes,andcheckwhethertheclustersaremaximal.ThemethodiscalledtheGene-SampleSearch.

TheframeworksoftheSample-GeneSearchandtheGene-SampleSearchareshowninFigure4.Properpruningtech-niquesshouldbedevelopedtopruneunpromisingcombina-tionsandsearchbranchesasearlyaspossible.

AfirstlookatFigure4maysuggestthatthetwomethodsaresymmetric.However,sincegenesandsamplesarenotsymmetricintheproblem,thetechnicaldetailsareinfactsubstantiallydifferent.Weareminingcoherentgeneclustersonsamples.Aslongasthegenesrespondcoherentlyonthesamesubsetofsamples,theybelongtothesamecluster.However,theexpressionpatternsofdifferentgenesinthesameclusterononesamplecanbeverydifferent!

4.1Sample-GeneSearch

IntheSample-GeneSearch,weneedtoaddressthefol-lowingissues.First,asweenumeratethecombinationsofsamplessystematically,foreachsubsetofsamples,howcanwefindthemaximalsetsofgenessuchthatthegenesarecoherentonthesamples?Second,duringthesamplesetenumeration,whichsamplesetscanbepruned?Third,sim-ilartothesituationinPruningrule3.2,canweidentifyandprunethesearchesthatcannotleadtoanypotentialmaximalcoherentclusters?.Last,howcanwedeterminewhetheracoherentgeneclusterissubsumedbytheothers?Weanswertheabovequestionsinthissubsection.

4.1.1MaximalCoherentGeneSetsforSampleSets

ForeachcombinationofsamplesS,weneedtocomputethemaximalcoherentgenesetGSsuchthatthegenesinGS

GeneMaximalcoherentsamplesetsg1{s1,s2,s3,s4,s5}g2{s1,s2,s4},{s1,s5}g3{s1,s2,s3,s4,s5}g4{s1,s2,s3},{s5,s6}g5{s1,s5,s6}(a)Themaximalcoherentsamplesetsforgenes

SampleTheinvertedlists1{g1.b1,g2.b1,g2.b2,g3.b1,g4.b1,g5.b1}s2{g1.b1,g2.b1,g3.b1,g4.b1}s3{g1.b1,g3.b1,g4.b1}s4{g1.b1,g2.b1,g3.b1}s5{g1.b1,g2.b2,g3.b1,g4.b2,g5.b1}s6

{g4.b2,g5.b1}(b)Theinvertedlistsforsamples

Figure5:Themaximalcoherentsamplesetsandtheinvertedlists.

arecoherentonSandnopropersupersetG󰀂⊃GSalsohasthisproperty.

Clearly,forageneg,ifthereexistsamaximalcoherentsamplesetSgsuchthatS⊆Sg,theng∈GS.Inotherwords,GScanbederivedbyonescanofthemaximalco-herentsamplesetsofallgenes.Ifamaximalcoherentsam-plesetisasupersetofS,thenthecorrespondinggenegisinsertedintoGS.

Itisexpensivetoscanthecompletelistofmaximalco-herentsamplesetsofallgenesonceforeverycombinationofsamples.Anefficientsolutionistouseaninvertedlist.Supposewehave5genesand6samples.ThemaximalcoherentsamplesetsforeachgenearelistedinFigure5(a).Welabeleachmaximalcoherentsamplesetbythegenegk,andtheset-id,bj,inthegene.Forexample,geneg2hastwomaximalcoherentsamplesets,g2.b1={s1,s2,s4}andg2.b2={s1,s5}.

Foreachsamples,wemakeuptheinvertedlistLsasthelistofallmaximalcoherentsamplesetscontainings,asshowninFigure5(b).

Now,whenwewanttocomputethemaximalcoherentgenesetsforasubsetofsamples,say{s1,s2,s3},wedonotneedtosearchthecompletelistinFigure5(a).Instead,weonlyneedtogettheintersectionoftheinvertedlistsofthesampless1,s2ands3,whichis{g1.b1,g3.b1,g4.b1}.Bythisintersection,weknowthat{g1,g3,g4}isthemaximalcoherentgeneset.

4.1.2PruningIrrelevantSamples

ForacombinationofsamplesS={si1,...,sik},wherei1<···Moreover,similartoPruningrule3.1,if|S|+|Stail|4.1.3PruningUnpromisingCoherentGeneClusters

SimilartothesituationinPruningrule3.2,wecanprunetheunpromisingcombinationsthatcannotleadtoanynewmaximalcoherentgenecluster.Here,twopruningtech-niquescanbeapplied.

Forexample,inourrunningexample(Figure5),sup-posewefindthemaximalcoherentgenecluster({g1,g3}×{s1,s2,s3,s5})beforewesearchsamplesetS={s1,s3}.ForS={s1,s3},Stail={s5}andGS∪Stail={g1,g3}.Thatis,bothS∪StailandGS∪Stailaresubsumedbythemaxi-malcoherentgenecluster.TherecursivesearchofScannotleadtoanymaximalcoherentgeneclusterandthuscanbepruned.

Ingeneral,supposewearesearchingasampleS󰀂

combination.Ifthereexistsamaximalcoherentgenecluster(G×S)

foundbeforesuchthatS󰀂∪S󰀂tail⊆SandGS󰀁∪S󰀁tail⊆G,

thenanyrecursivesearchfromS󰀂

resultsinacoherentgeneclustersubsumedby(G×S),andthuscanbepruned.

Moreover,ifthereexistsamaximalcoherentgenecluster(G×S)foundbeforesuchthatS󰀂⊆SandeverymaximalcoherentsamplesetcontainingS󰀂alsocontainsS,thentherecursivesearchofS󰀂cannotleadtoanymaximalcoherentgeneclustereither,andthuscanbe󰀂pruned.Forexample,supposewesearchthesamplesetS={s2}afterwefindthemaximalcoherentgenecluster({g1,g2,g3,g4}×{s1,s2})inourrunningexample(Figure5).FromFigure5(a),wecanseethateverymaximalcoherentsamplesetcontainings2alsocontainss1.Inotherwords,thereexistsnomaximalcoherentgeneclustercontainings2butnos1.Thus,thesearchofS󰀂canbepruned.

4.1.4Determiningmaximalcoherentgeneclusters

WhenwesearchacombinationofsamplesS,weneedtocheckwhetherGS×Sisamaximalcoherentgenecluster.Welookatthosemaximalcoherentgeneclusters(G󰀂×S󰀂)suchthatS󰀂⊃S.Clearly,sinceweconductdepth-firstsearchinthesetenumerationtree,suchmaximalcoherentgeneclustersshouldbereportedeitherbeforeSissearched,orinthesubtreerootedatS.

Basedontheabovediscussion,wehavetheSample-GeneSearchalgorithminFigure6.

4.2Gene-SampleSearch

TheframeworkofGene-SampleSearchissymmetrictothatofSample-GeneSearch.Thatis,weenumeratethecombinationsofgenessystematically.Foreachcombinationofgenes,wecomputethemaximalsetsofsamplesthatthegenesarecoherenton.ManypruningtechniquesinSample-GeneSearchhavethesymmetricversionsinGene-SampleSearch.Limitedbyspace,weomitthedetailshere.Instead,weonlyfocusonthedifferencesbetweenthetwoapproaches.

4.2.1DeterminingCoherentGeneClusters

Theconceptofcoherentsamplesetsforagenecanbegeneralizedforasetofgenes.GivenasetofgenesG,amaximalcoherentsamplesetwithrespecttoGisasetofsamplesSGsuchthat(1)genesinGarecoherentonSG;and(2)thereexistsnoS󰀂⊃SGthatsamplesinGarealsocoherentonS󰀂.Pleasenotethattherecanbemorethanonemaximalcoherentsamplesetforagivensetofgenes.Howcanwecomputethemaximalcoherentsamplesetsefficiently?Interestingly,SGcanbecomputedbysomesim-pleintersectionoperations.Forexample,supposeS{g1}=

Input:themaximalcoherentsamplessetsforgenes,

//fromthealgorithminFigure3

Output:thecompletesetofcoherentgeneclustersMethod:

generatetheinvertedlistforsamples

asdescribedinSection4.1.1;fori=1to(|S-Set|−mins)do

letS={si}andStail={si+1,...,s|S-Set|};callrecursive-search(S,Stail);end-forProcedure:recursive-search(S,Ssamplesfromtail)

removeirrelevantStailasdescribedin

Section4.1.2;

if(|S|+|Stail|derivetheintersectionofinvertedlistsforsamples

inSasdescribedinSection4.1.1;ifScanbeprunedbythecriteriain

Section4.1.3thenreturn;whileSletitail=∅do

=min{j|sj∈Slettail=tail−{stail};i};

callrecursive-search(S∪{si},Stail);end-while

derivethemaximalcoherentgenesetGGS;

output(S×S)asamaximalcoherentgeneclusterif

itisnotsubsumedbyanymaximalcoherentgeneclusterfoundbefore

End

Figure6:TheSample-GeneSearchalgorithm.

{(s1,s2,s3)}andS{g2}={(s2,s3,s4),(s5,s7)}.Then,{s2,s3}istheonlymaximalsamplesetthatbothg1andg2arecoherenton.Thatis,S{gS1,g2}={(s2,s3)}.Inotherwords,wecanderive{g1,g2}from{s1,s2,s3}∩{s2,s3,s4}and{s1,s2,s3}∩{s5,s7}.

Ingeneral,ifgenesetG=G1∪G2,thenSGcanbede-rivedfromSG1andSG2bythefunctionfind-max-coherent-sample-setsinFigure7.

4.2.2

PruningherentGeneIrrelevantClusters

GenesandUnpromisingCo-SimilartotheideainSection4.1.2,wecanprunegenesthatcannotbeusedtoextendthecurrentcombinationofgenes.GivenasetofgenesG={gi1,...,gik},where1≤ii<···BasedonthesameideainSection4.1.3,wecanusethemaximalcoherentgeneclusterstoprunetheunpromisingcoherentgeneclusters.SupposewearesearchingagenecombinationG1.LetS1beonemaximalcoherentsamplesetwithrespecttoG1,i.e.,S1∈SG1.Ifthereexistsamax-imalcoherentgenecluster(G×S)suchthatS1⊆SandG1⊆G,thenS1shouldberemovedfromthelistofthemax-imalcoherentsamplesetsSG1,sinceitcannotleadtoanymaximalcoherentgenecluster.Moreover,ifSG1becomesemptyafterthepruning,thenG1shouldbepruned,sinceanyrecursivesearchfromG1cannotleadtoanymaximalcoherentgenecluster.

Functionfind-max-coherent-sample-sets

Input:genesetsG1andG2,setsofmaximalcoherent

samplesetsSmaximalG1andScoherentG2;

Output:thesamplesetsw.r.t.G1∪G2Method:

letSG1∪G2=∅;

foreachmaximalcoherentsamplesetSi∈SforeachmaximalcoherentsamplesetSG1do

j∈SletSG2do

k=Si∩Sj;

if|Sk|≥minstheninsertSkintoSG1∪G2;end-forend-for

foreachSk∈SG1∪G2do

ifSthenkisapropersubsetofS󰀂∈SS−{SG1∪G2

G1∪G2=SG1∪G2k};

end-for

outputSG1∪G2;

Figure7:Computingmaximalcoherentsamplesetforasetofgenes.

4.2.3MergingCoherentGenesintheTailList

Inourrunningexample,themaximalcoherentsamplesetswithrespecttogeneg1andg3areidentical.Then,foranycoherentgenecluster(G×S)suchthatg1∈G,(G∪{g3}×S)mustalsobeacoherentgenecluster.Thus,wecansearch{g1,ge}inashoot.

Ingeneral,wehavethefollowingresult.Lemma4.1.WhensearchacombinationofgenesG,ifthereexistgenes{gj1,...,gjk}⊆GtailsuchthattheyarecoherentoneverymaximalcoherentsamplesetofG,thenthereexistsnomaximalcoherentgeneclustercontainingGbutno{gj1,...,gjk}.

BasedonLemma4.1,wecanimmediatelymergegenes{gj1,...,gjk}toGatthecurrentnode,andthusshrinkthenumberofrecursions.Thecomputationtimeissavedaswell,sinceweonlyneedtocheckthecoherentgeneclusters,prunetheirrelevantgenesorunpromisinggeneclustersforallthesegenesinoneshoot.

Inourexperimentsonrealdatasets,weobservemanygenescanbemergedbyLemma4.1.Thereal-worldGSTmicroarraydatasetsaretypicallysparseandgenesareco-herentonaquitesmallnumberofsamplesets.Asacon-sequence,theperformanceofGene-SampleSearchcanbeimprovedsubstantiallybythisoptimization.

Onemayask,“DowehaveasymmetricpruningforSample-GeneSearch?”Wecanapplythesimilaroptimiza-tiontechniqueforsample-gene.Thatis,asamplesjismergedintocurrentcombinationofsamplesSaslongastheinvertedlistofSisasubsetofthatofsj.However,suchasituationisrareinpractice,sinceitisrarethat,fortwodifferentsamplesetsS1andS2,themaximalcoherentgenesetswithrespecttoS1andS2areidentical.

Basedontheabovediscussion,wehavetheGene-SampleSearchalgorithmasshowninFigure8.

5.EXPERIMENTALRESULTS

WeimplementedandtestedourapproachesonbotharealGSTmicroarraydatasetandsyntheticdatasets.ThesystemisimplementedinJava.ThetestsareconductedonaSunUltra10workstationwitha440MHzCPUand256MBmainmemory.

Inputandoutput:sameastheSample-GeneSearch

algorithm(Figure6)Method:

fori=1to(|G-Set|−ming)do

letG={gi}andGtail={gi+1,...,g|G-Set|};callrecursive-search(G,Gend-fortail);Procedure:recursive-search(G,GremoveirrelevantgenesfromGtail)

tailasdescribedin

Section4.2.2;

if(|G|+|Gtail|mergecoherentgenesinGtailasdescribedin

Section4.2.3;

foreachmaximalcoherentsamplesetSi∈SifSGdo

icanbeprunedbythesecondcriteriain

Section4.2.2thenremoveSifromSG;

end-for

if(SG=∅)thenreturn;whileGtail=∅do

leti=min{j|gj∈GGtail};lettail=Gtail−{gi};

callrecursive-search(G∪{gi},Gtail);end-while

foreachsamplesetSiinSoutput(G×SGdo

i)asamaximalcoherentgene

clusterifitisnotsubsumedbyanymaximalcoherentgeneclusterfoundbefore

end-forEnd

Figure8:TheGene-SampleSearchalgorithm.

5.1TheDataSets

Therealdataset.Weusetherealgene-sample-timemi-croarraydatasetreportedin[20].Itconsistsofthemi-croarraymeasurementsof4,324genesin13multiplesclero-sis(MS)patientsbeforeandat1,2,4,8,24,48,120and

168hoursafterIFN-βtreatments.MSpatientsshowhetero-geneousresponsestoIFN-βtreatments.Forexample,thepatientswithrelapsingMSrespondbettertoIFN-βtreat-mentsthanthepatientswithprogressivediseasedo.How-ever,relapsingMSpatientsalsoexhibitconsiderableinter-individualheterogeneityintheirclinicalresponsestoIFN-βtherapies.Sofar,theeffectsofIFN-βtreatmentatthege-nomiclevelinhumansarepoorlyunderstood.Researchersareinterestedindistinguishingtheheterogeneousclinicalre-sponsetoIFN-βtherapyamongthepatients.Ontheotherhand,characterizedgeneexpressiondynamicscorrelatedtotheheterogeneousresponsespotentiallyhelpinexploringthecausingmechanismsatthemolecularlevel.

Syntheticdata.Weobservethatthepreprocessing,i.e.,miningthemaximalcoherentsamplesetsforeachindivid-ualgene,isrelativelyfast.Themajorbottleneckinminingcoherentgeneclustersisinthelatterpart.Therefore,in-steadofgeneratingsyntheticGSTdatasets,wesimulatethetableofmaximalcoherentsamplesetsforgenessuchasinFigure5(a).Initially,anemptytableiscreated.Then,acertainnumberofcoherentgeneclusters(G×S)areran-domlygenerated.Foreachg∈G,Sisinsertedintothetableasonemaximalcoherentsamplesetwithrespecttog.Inadditiontothesizeofthesyntheticdataset,i.e.,thetotalnumberofgenesinG-SetandthenumberofsamplesinS-Set,thesyntheticdatageneratortakesthefollowingparameters:(1)k,thenumberofcoherentgeneclustersin

thedataset;(2)maxgeneandmingene,themaximalandminimalnumbersofgenesinacoherentgenecluster,re-spectively;and(3)maxsampleandminsample,themaximalandminimalnumbersofsamplesinacluster,respectively.Wegeneratethedatasetsbysettingmingene=10andminsample=5.maxsampleissettothesamevalueof|S-Set|,andmaxgeneissetto1000.Inpractice,onlyasmallnumberofgenesarecorrelatedwithaphenotype[7].Whenthesizeofthedatasetgrows,weexpecttoseemoreco-herentgeneclusters.Tosimulatethesituation,wesetkto|G-Set|·|S-Set|

3000.5.2ResultsontheMSMicroarrayData

WeapplyouralgorithmsontheMSmicroarraydatawithming=15,mins=3andδ=0.8.Intotal,21coherentgeneclustersarereported.Tobetterunderstandthemin-ingresults,wefeedthegenesineachclustertoOnto-Express(http://vortex.cs.wayne.edu/Projects.html)andobtainahi-erarchyoffunctionalannotationsintermsofGeneOntologyforeachcluster.Anexampleofgeneontologytreeforclus-ter17isshowninFigure9(a).Then,wefurtherinvestigatethegenesandsamplesintheclusters.Someinterestingob-servationsareobtainedasfollows.

First,asexpected,themajoritygenesintheclustersareinvolvedincellularprocessesandphysiologicalprocesses,whilegenesinvolvedinotherbiologicalprocesses(e.g.,de-velopment,behaviorandvirallifecycle)arenothighlyrep-resented(Figure9(b)).Moreover,amongthegenesinvolvedincellularprocess,thoseinvolvedincellcommunicationandcellgrowthand/ormaintenancearepredominant(Fig-ure9(c)).SinceIFN-βisknowntohaveanti-proliferativeactivities,thehighpopulationofcellularprocessgenesin-volvedincellgrowthand/ormaintenanceisbiologicallyplausible.

Second,abunchofgenes(e.g.,incluster10,genesH38522,AA704613,N92443,N75595,etc.)thatarewellknownfortranscriptionalsignalingandcellularsignalingcanbeidenti-fiedintheresultedclusters.Thosegenes,togetherwiththeothergenesinthesameclusterthatareunknownorpoorlyunderstood,mayserveasswitchesinthegeneticnetworkandhenceplayanessentialroleinthebiologicalprocesses.Thus,studyingthetime-seriesofthegenesinthecoherentgeneclustersmaygreatlyhelppeopleunderstandtheregu-latorymechanismsbehindtheresponsetoIFN-βtreatment.Last,coherentgeneclustersalsoconsistofdifferentgroup-ingsofsamples,whichprovidepromisinghypothesisfordif-ferentphenotypes.Forexample,intheMSmicroarraydata,theexpressiondatafrompatientswithdifferentresponsestoIFN-βtreatmentarecollected.Amongthe21reportedclus-ters,only2clusters(cluster10and17)consistof4samples,whileotherclustersonlyconsistoftheminsnumberofsam-ples.Therefore,thosetwoclustersmaybecomegoodcan-didatesfortargetphenotype.Basedontheclinicalrecordsofthepatients,andcombinedwiththegeneinformationfromtheclusters,theinterpretationofthesamplegroupsiscurrentlyunderinvestigation.

5.3EffectsoftheParameters

Themaximalcoherentgeneclusterisdefinedwithrespecttothreeparameters,i.e.,theminimumnumberofgenesming,theminimumnumberofsamplesminsandtheco-herencethresholdδ.WetesttheeffectoftheparametersontherealGSTdataset.Figure10(a)showsthenumber

A–CellularprocessB–DevelopmentC–Physiologicalprocesses

D–BehaviorE–Virallifecycle

80Gene population (percent) 70 60 50 40 30 20 10 0

A

8.5

0.6

0.3

BCDECategory of biological process33.1

56.1

(b)Thedistributionofbiologicalprocess

A–CellcommunicationB–CelldeathC–Cellgrowthormaintenance

D–CellmotilityE–Celldifferentiation

60Gene population (percent) 50 40 30 20

12.0

10 0

A

3.4

BCDCategory of cellular process

0.0E

41.9

41.9

(a)Thegeneontologytreeforgenesincluster17

(c)Thedistributionofcellularprocess

Figure9:Thegeneontologytreeandthedistributionoffunctionsforgenesincluster17.

ofcoherentgeneclusterswhenmingvariesfrom5to100,mins=3andδ=0.8.Clearly,thenumberofcoherentgeneclustersinthedatasetdecreaseswhenmingincreases.Theresultconcurstheintuition:withalowermingvalue,wecancatchmoreclusterswithmoreorlessgenes.Asamatteroffact,withfixedminsandδ,letBibethecompletesetofcoherentblockswhenming=i.Then,wecanshowB1⊇...Bi⊇...Bn.

Figure10(b)showsthenumberofcoherentgeneclusterswithrespecttovariousminswhenmingandδarefixedto10and0.8,respectively.Thisresultcanbeexplainedinawaysimilartothesituationofming.Figure10(c)showstheeffectofδonthenumberofcoherentgeneclustersinthedataset,withming=10andmins=3.Whenwelowerthecoherencethreshold,morecombinationsofsamplesare“coherent”bychancewithrespecttoaminimumofminggenes.

Interestingly,thethreecurvesinFigure10sharesimilartrends.Thatis,whenthevalueoftheparameter(repre-sentedbytheXaxis)increases,thenumberofcoherentgeneclusters(representedbytheYaxis)goesdown.Thecurvedropssharplyuntila“knot”ismet,thenthecurvegoesstablytotheright.Forexample,wecanseethe“knots”ofming=20inFigure10(a),mins=4inFigure10(b)andδ=0.85inFigure10(c).These“knots”indicatethatthereexiststableandsignificantcoherentgeneclustersintherealdataset.Theyarehighlycorrelated,involvingastatisticallysignificantnumberofgenesandsamples.The“knots”alsosuggestthebestsettingsoftheparameterstoavoidthecoherentgeneclustersformedjustbychance.

5.4Scalability

Wefirsttesttheefficiencyofthepreprocessing(algorithminFigure3)onvariousrandomsubsets(bysampling)oftherealmicroarraydataset.Thesizeofthesubsetsvariesfrom500to4324genes,andallthesamplesareincluded.Foreachsize,wesampled30subsetsandcalculatetheaverageruntime.Figure11(a)illustratesthescalabilityforthepre-processingstep.AswediscussedinSection3,therealGSTmicroarraydatasetsareoftensparse.Withtheefficientpruningtechniques,thepreprocessingalgorithmislinearlyscalabletothesizeofthedatasets.

WethentestthescalabilityofbothGene-SampleSearchandSample-GeneSearchonsyntheticdatasets.Wesetmins=5,ming=10,andδ=0.8.Wefirstfixthenum-berofsamplesto30,andreporttheruntimewithrespecttonumberofgenes(Figure11(b)).Wecanseebothapproachesshowanapproximatelylinearscalabilitywithrespecttothenumberofgenes.Figure11(c)showsthescalabilityforbothapproachesunderdifferentsizesofsamplesets(from30to100),whenthenumberofgenesisfixedto3000.Wecanseebothapproachesscalewellwithrespecttothenumberofsamples.Sincethenumberofgenesforamicroarraydataistypicallybyfarlargerthanthatofthesamples,theenumer-ationofgenesismuchmoreexpensivethantheenumerationofsamples.ThisexplainswhytheSample-GeneSearchisfasterthantheGene-SampleSearch.

5.5TheEffectofLemma4.1

Lemma4.1canidentifythegenesgithatcanbemergedintothecurrentcombinationofgenes,andthuscanreduce

100 90 80 70 60 50 40 30 20 10 0

180 160 140 120 100 80 60 40 20 0

2

3

4

min_s

600 500

Number of MCBsNumber of MCBsNumber of MCBs 400 300 200 100 0

0 10 20 30 40 50 60 70 80 90 100

min_g

5 6 7 0.6 0.65 0.7 0.75 0.8 0.85 0.9 0.95

delta

(a)Numberofclustersvs.ming

mins=3,δ=0.8(b)Numberofclustersvs.mins

ming=10,δ=0.8(c)Numberofclustersvs.δming=10,mins=3

Figure10:Theeffectsoftheparametersonthenumberofclusters.

2.2 2 1.8 1.6 1.4 1.2 1 0.8 0.6 0.4 0.2 0.5

300 250Time (Seconds) 300g-s searchs-g searchTime (Seconds) 250 200 150 100 50g-s searchs-g searchTime (Seconds) 200 150 100 50 1 1.5 2 2.5 3 3.5 4 4.5

0 3 4 5 6 7 8 9 10 0 30 40 50 60 70 80 90 100Number of genes (K)Number of genes (K)Number of samples

(a)Preprocessing

(b)Scalabilityw.r.t.numberofgenes(numberofsamples:30)

Figure11:Scalabilityonlargedatasets.

(c)Scalabilityw.r.t.numberofsamples(numberofgenes:3,000)

thenumberofrecursionsinthemining.Weusesomesam-plesoftherealmicroarraydataset(eachsubsetcontains100to1000genesand12patients)tocomparetheperformanceoftheGene-SampleSearchwithandwithouttheoptimiza-tion.Thecomparisonisconductedinthreeaspects:(1)themaximalnumberofrecursionlevelsintheGene-SampleSearch;(2)thenumberofgenecombinationsintheGene-SampleSearch;and(3)theruntime.Figure12showstheresults.Wecanclearlyseethat(1)themaximalnumberofrecursionlevelscanbereducedsubstantially(Figure12(a));(2)withtheoptimization,thetotalnumberofgenecom-binationsneededtobecheckedgoesdownsharply(12(b));and(3)theruntimeismuchshorterwhentheoptimizationisapplied(12(c)).TheresultsstronglyconfirmthattheoptimizationiseffectiveforGene-SampleSearch.

WealsoapplythespiritofLemma4.1ontheSample-GeneSearch,andconductsimilartests.However,wecanhardlyseeanysignificantimprovementbroughtbytheoptimiza-tion.AswediscussedinSection4.2.3,duetothesparsityofthemicroarraydata,manygenescanbemergedbecausetheyarecoherentonthesamesampleset.However,fewsampleswouldbemergedtogethersinceusuallythemaxi-malcoherentgenesetswithrespecttotwodifferentsamplesetsarenotidentical.

6.RELATEDWORK

Thisresearchisrelatedtopreviousworkonclusteringconventionalgene-time/gene-samplemicroarraydataandfrequentitemsetmining.

AsweintroducedinSection1,therehavebeentwocate-goriesofconventionalmicroarraydatasets:gene-timedatasetsandgene-sampledatasets.Forgene-timemicroarray

data,variousalgorithms(e.g.[17,15,8])havefocusedonclusteringthegenes.Thatis,co-expressedgenesaregroupedbasedontheirexpressionpatternsduringthetimeseries.Ontheotherhand,differentapproaches(e.g.[4,21,16])havebeenproposedtopartitionthesamplesetstofindtheirmacroscopicphenotypesaswellastodetectinforma-tivegeneswhichmanifestthesamplepartition.However,alltheclustermodelsinthosepreviousstudiesaresubstantiallydifferentfromourcoherentgeneclusters.Asaconsequence,thosealgorithmscannotbeextendeddirectlytosolveourproblem.

In[6],ChengandChurchintroducetheconceptofbiclus-tertomeasurethecoherencebetweengenesandconditions(eithertimeseriesorsamples).Givenasetofgenesandasetofconditions,abiclusterisasubsetofgenescoher-entwithasubsetofconditions.Yangetal.[22]proposeamove-basedalgorithmtofindbiclustersmoreefficiently.Bothalgorithmsin[6]and[22]adoptheuristicsearchap-proaches,andthuscannotguaranteetofindthecompletesetofbiclustersinthedataset.

In[19],Wangetal.proposethemodelofpattern-basedcluster.GivenasubsetofobjectsOandasubsetofat-tributesA,pair(O,A)formsapattern-basedclusterifforanypairofobjectsx,y∈O,andanypairofattributesa,b∈A,thedifferenceofchangeofvaluesonattributesaandbbetweenobjectsxandyissmallerthanathresh-oldδ.Inarecentstudy[11],Peietal.hasproposedanefficientalgorithm,MaPletominethecompletesetofmax-imalpattern-basedclusters.

Weborrowsomeimportantideasfrompreviousstudiesonfrequentitemsetmining.First,theframeworkofourapproachesissimilarinspirittothatofpattern-growth

400 350 300 250Level 2500Opt OffOpt OnNumber of nodes 140Opt OffOpt OnTime (Seconds) 2000 1500 1000 500 120 100 80 60 40 20 0Opt OffOpt On 200 150 100 50 0 200 300 400 500 600 700 800 900 1000Number of genes

0 200 300 400 500 600 700 800 900 1000Number of genes

200 300 400 500 600 700 800 900 1000

Number of genes

(a)Maximalnumberof

recursionlevels(b)Numberofgenecombinationssearched

(c)Runtime

Figure12:EffectofLemma4.1forGene-SampleSearch.

methodsforfrequentpatternmining.Second,thepruningtechniquesinourapproachessharesomeinterestingsimi-laritieswiththemethodsofminingfrequentcloseditem-sets(e.g.,[10]).However,therearetwoessentialdifferencesbetweenthefrequentpatternminingmethodsandtheap-proachesdevelopedinthispaper.Ontheonehand,thecoherentgeneclustersareinherentlydifferentfromfrequentitemsets.Thus,thesimilaritybetweenthetwocategoriesofmethodsisonlyatthelevelofspirit(e.g.,setenumerationandpruning).Thetechnicaldetailsaredramaticallydiffer-ent.Ontheotherhand,newtechniquessuchastheinvertedlistsareadoptedtotackletheparticularmicroarraydata.

7.CONCLUSIONS

Inthispaper,wehaveinvestigatedanoveltypeofgene-sample-timemicroarraydatasetsandproposeanewprob-lemofminingcoherentgeneclustersfromsuchdatasets.Wehaveconductedasystematicstudytodeveloptwomin-ingmethods:theSample-GeneSearchandtheGene-SampleSearch.Ourextensiveperformancestudyonbotharealmi-croarraydatasetandsyntheticdatasetsshowsthatthereexistinterestingandsignificantcoherentgeneclustersintherealdataset,andboththealgorithmshavegoodperfor-mance.Sincethenumberofgenesinthemicroarraydataistypicallybyfarlargerthanthenumberofsamples,theSample-GeneSearchusuallyoutperformstheGene-SampleSearch.

8.REFERENCES

[1]AlizadehA.A.,etal.Distincttypesofdiffuselargeb-cell

lymphomaidentifiedbygeneexpressionprofiling.Nature,Vol.403:503–511,February2000.

[2]AlonU.,etal.Broadpatternsofgeneexpressionrevealed

byclusteringanalysisoftumorandnormalcolontissuesprobedbyoligonucleotidearray.Proc.Natl.Acad.Sci.USA,Vol.96(12):6745–6750,June1999.

[3]BlackM.A.andDoergeR.W.Calculationoftheminimum

numberofreplicatespotsrequiredfordetectionofsignificantgeneexpressionfoldchangeinmicroarrayexperimentsBioinformatics,18:1609-1616,2002

[4]Ben-DorA.,etal.Classdiscoveryingeneexpressiondata.

InProc.FifthAnnualInter.Conf.onComputationalMolecularBiology(RECOMB2001).

[5]KerrM.K.andChurchillG.A.Bootstrappingcluster

analysis:Assessingthereliabilityofconclusionsfrom

microarrayexperiments.Proc.Natl.Acad.Sci.USA,Vol.98:8961-8965.

[6]ChengY.andChurchG.M.Biclusteringofexpressiondata.

ProceedingsoftheEighthInternationalConferenceonIntelligentSystemsforMolecularBiology(ISMB),2000.

[7]GolubT.R.,etal.Molecularclassificationofcancer:Class

discoveryandclasspredictionbygeneexpression

monitoring.Science,Vol.286(15):531–537,October1999.[8]JiangD.,etal.InteractiveExplorationofCoherent

PatternsinTime-SeriesGeneExpressionData.InKDD’03.[9]MolerE.J.,etal.AnalysisofMolecularProfileDataUsing

GenerativeandDiscriminativeMethods.PhysiologicalGenomics,Vol.4(2):109–126,2000.

[10]PeiJ.,etal.CLOSET:Anefficientalgorithmformining

frequentcloseditemsets.InDMKD’00.

[11]PeiJ.,etal.MaPle:AFastAlgorithmforMaximal

Pattern-basedClusterin.InIEEEICDM’03.[12]HerwigR.,etal.Large-ScaleClusteringof

cDNA-FingerprintingData.GenomeResearch,9:1093–1105,1999.

[13]RymonR.Searchthroughsystematicsetenumeration.In

Proc.1992Int.Conf.PrincipleofKnowledgeRepresentationandReasoning(KR’92).

[14]DudoitS.,etal.Comparisonofdiscriminationmethodsfor

theclassificationoftumorsusinggeneexpressiondata.J.oftheAmericanStatisticalAssociation,Vol.97,No.457.[15]TamayoP.,etal.Interpretingpatternsofgeneexpression

withself-organizingmaps:Methodsandapplicationtohematopoieticdifferentiation.Proc.Natl.Acad.Sci.USA,Vol.96(6):2907–2912,March1999.

[16]TangC.,etal.Miningphenotypesandinformativegenes

fromgeneexpressiondata.InSIGKDD’03.

[17]TavazoieS.,etal.Systematicdeterminationofgenetic

networkarchitecture.NatureGenet,pages281–285,1999.[18]TusherV.G.,etal.SignificanceAnalysisofMicroarrays

AppliedtotheIonizingRadiationResponse.Proc.Natl.Acad.Sci.USA,Vol.98(9):5116–5121,April2001.

[19]WangH.,etal.ClusteringbyPatternSimilarityinLarge

DataSets.InSIGMOD2002.

[20]Weinstock-GuttmanB.,etal..Genomiceffectsof

interferon-binmultiplesclerosispatients.AcceptedpendingrevisionsJournalofImmunology,2003.

[21]XingE.P.andKarpR.M.Cliff:Clusteringof

high-dimensionalmicroarraydataviaiterativefeaturefilteringusingnormalizedcuts.Bioinformatics,Vol.17(1):306–315,2001.

[22]YangJ.,etal.δ-cluster:CapturingSubspaceCorrelationin

aLargeDataSet.InICDE2002.

因篇幅问题不能全部显示,请点此查看更多更全内容