Journal of Theoretical Biology 267 (2010) 272–275
journal homepage: www.elsevier.com/locate/yjtbi
doi:10.1016/j.jtbi.2010.09.007
© 2010 Elsevier Ltd. All rights reserved.

A high-accuracy protein structural class prediction algorithm using predicted secondary structural information

Tian Liu a,1, Cangzhi Jia b,*,1

a Department of Bioscience and Biotechnology, Dalian University of Technology, No. 2 Linggong Road, Dalian 116024, China
b Department of Mathematics, Northeast University, No. 11, Lane 3, Wenhua Road, Shenyang 0004, China

* Corresponding author. Tel./fax: +86 0411 84707245. E-mail address: cangzhijia@yahoo.com.cn (C. Jia).
1 These authors contributed equally to this work.

Article history: Received 15 June 2010; Received in revised form 2 September 2010; Accepted 2 September 2010; Available online 8 September 2010

Keywords: Protein structural class prediction; Secondary structure; Alternating frequency; Parallel and anti-parallel β-sheets; Support vector machine

Abstract

One major problem with the existing algorithms for the prediction of protein structural classes is low accuracy for proteins from the α/β and α+β classes. In this study, three novel features were rationally designed to model the differences between proteins from these two classes. In combination with other rationally designed features, an 11-dimensional vector prediction method was proposed. With this method, the overall prediction accuracy on the 25PDB dataset was 1.5% higher than that of the previous best-performing method, MODAS. Furthermore, the prediction accuracy for proteins from the α+β class on the 25PDB dataset was 5% higher than that of the previous best-performing method, SCPRED. The prediction accuracies obtained with the D675 and FC699 datasets were also improved.

1. Introduction

Information on the structural classes of proteins has proven important in many fields of bioinformatics (Chou, 2004, 2005; Kurgan and Homaeian, 2006; Costantini and Facchiano, 2009). There are 110,800 protein domains with known structural classes in the Structural Classification of Proteins (SCOP) database, and about 90% of them belong to the four major classes: all-α, all-β, α+β and α/β (Andreeva et al., 2004; Murzin et al., 1995). The classification of protein structures in SCOP is done manually, based on proteins with known tertiary structures. With the rapid development of genomics and proteomics, there has been an enormous accumulation of data on the amino acid sequences of proteins, and the manual method cannot keep up with the demand for rapid classification. Hence, over the past two decades, researchers have made sustained efforts toward the computational prediction of structural classes from the amino acid sequences of proteins. Computational prediction generally has two components: the feature vector and the classification algorithm. The existing sequence representation methods and classification algorithms have been extensively reviewed (Chou, 2005; Kurgan and Homaeian, 2006).

One of the deficiencies of current methods is low accuracy on datasets with low sequence similarity (Kurgan and Homaeian, 2006; Kurgan et al., 2008a, 2008b). This may be due to the use of information extracted only from the amino acid sequences, whereas the classification of protein structures is based on the contents and spatial arrangements of secondary structural elements. Thus, one way to improve prediction accuracy is to add secondary structural information to the feature vector. Following this idea, SCPRED (Kurgan et al., 2008b) and MODAS (Mizianty and Kurgan, 2009) were constructed. In SCPRED and MODAS, a large number of features were created based on both the amino acid sequences of proteins and the predicted secondary structural sequences obtained by PSI-PRED (Jones, 1999). These features were randomly combined and tested, and the best-performing combination was selected. When tested on the 25PDB dataset, which includes 1673 proteins with twilight-zone similarity, SCPRED and MODAS obtained an average overall prediction accuracy of 80.6% (Table 1), which is 17.9% higher than the most competitive method that uses only information extracted from amino acid sequences (Kurgan and Chen, 2007; Kurgan et al., 2008b; Mizianty and Kurgan, 2009). Although the overall prediction accuracy has been improved, the prediction accuracies for the α+β and α/β classes are still unsatisfactory (73.6% on average) (Table 1).

In this study, we tried to further improve the prediction accuracy in a different way. The 11 features were selected by knowledge-based rational design rather than random screening. Three of the features were specially designed to improve the prediction accuracies for proteins from
α+β and α/β classes. Prediction with an optimized support vector machine showed an improvement in prediction accuracy, especially for proteins from the α+β and α/β classes.

2. Materials and methods

2.1. Materials

Three widely used datasets with low sequence identity were used in this study to compare the accuracy of our prediction with those of existing prediction methods. The 25PDB dataset is the touchstone for protein structural class prediction owing to its large number of proteins (1673 proteins and domains) and the low identity among samples (average identity of 25%; Kurgan and Homaeian, 2006). The D675 dataset (675 proteins and domains with an average identity of 30%; Mizianty and Kurgan, 2009) and the FC699 dataset (699 proteins and domains with an average identity of 40%; Kurgan et al., 2008b) were also used. PSI-PRED (Jones, 1999) was used to predict the secondary structural sequences of the proteins from their amino acid sequences.

2.2. Feature vector

Proteins are manually classified into structural classes according to their 3-D structures in SCOP; hence, features derived from these structures might be applied directly to the prediction of protein structural classes. Based on this rationale, 11 features were designed to reflect the general contents and spatial arrangements of the secondary structural elements of a given protein sequence. Eight of them have been used in previous works (Kurgan et al., 2008a, 2008b).

Table 1
Comparison of prediction accuracies among different structural class prediction methods.

Dataset  Method      Accuracy (%)
                     all-α  all-β  α/β   α+β   overall
25PDB    SCPRED      92.6   80.1   74.0  71.0  79.7
         MODAS       92.3   83.7   81.2  68.3  81.4
         This paper  92.6   81.3   81.5  76.0  82.9
D675     SCPRED      89.1   81.8   90.4  58.2  79.5
         MODAS       89.9   81.8   84.2  65.9  80.0
         This paper  90.8   81.4   84.7  68.6  82.0
FC699    SCPRED      –      –      –     –     87.5
         MODAS       –      –      –     –     –
         This paper  97.7   88.0   89.1  84.2  89.6

Since the first proposed standard for protein structure classification is the content of the secondary structural elements (Chou, 2005), ConH and ConE were proposed to reflect the contents of H and E residues, respectively (Kurgan et al., 2008a, 2008b). Because the objects of structural classification are globular proteins, the lengths of rigid structural elements such as α-helices and β-strands affect the spatial arrangements of these elements. Thus MaxSegH (MaxSegE) and AvgSegH (AvgSegE) were proposed to reflect the length of the longest α-helix (β-strand) and the average length of α-helices (β-strands), respectively (Kurgan et al., 2008a, 2008b). Two composition moment features, CMVH and CMVE, were also proposed to reflect the spatial arrangement of the secondary structural elements (Kurgan et al., 2008a, 2008b).

Three novel features of the secondary structure were proposed on the basis of the structural characteristics of proteins from the α/β and α+β classes. In α/β proteins, the β-strands are usually separated from one another by α-helices, whereas in α+β proteins the α-helices and β-strands tend to occur in separate regions. Consequently, α-helices and β-strands alternate more frequently in α/β proteins than in α+β proteins (Fig. 1A). Therefore, the first feature was chosen as the alternating frequency of α-helices and β-strands (Altn). Taking the secondary structure sequence of 1E6B (residues 8–87) as an example (Fig. 1B and C), the α-helices and β-strands alternate five times (Altn = 5).

Considering that the β-strands in α/β proteins usually form parallel β-sheets, while the β-strands in α+β proteins usually form anti-parallel β-sheets, the second and third features are based on the number of β-strands that form parallel β-sheets (PnE) and the number of β-strands that form anti-parallel β-sheets (APnE), respectively (Fig. 1A). We proposed that if two β-strands (segments of E) are separated by an α-helix (a segment of H), these two β-strands form a parallel β-sheet; otherwise, they form an anti-parallel β-sheet. Taking the secondary structure sequence of 1E6B as an example (Fig. 1B), β-strand 1 and β-strand 2, as well as β-strand 2 and β-strand 3, are assumed to form parallel β-sheets, and β-strand 3 and β-strand 4 are assumed to form an anti-parallel β-sheet (Fig. 1B and C). So there are three β-strands that form parallel β-sheets (PnE = 3) and two β-strands that form anti-parallel β-sheets (APnE = 2) in the secondary structure sequence.

Based on these features, the 11-dimensional feature vector can be formally expressed as

\[ P = (p_1, p_2, \cdots, p_{11})^{T} \]

Fig. 1. (A) Structures of proteins from the α/β class (1BQC) and the α+β class (1BOB). (B) and (C) Graphical representation of the proposed determination of β-strands composing parallel β-sheets or anti-parallel β-sheets directly from protein secondary structural sequences. The protein (1E6B, residues 8–87) is shown as an example. The α-helices and β-strands are labeled from a to d and 1 to 4, respectively.
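The counting rule for Altn, PnE and APnE described in this section can be illustrated with a minimal Python sketch. It is not the authors' program. The function names, the decision to ignore coil (C) segments when ordering helices and strands, and the choice to count a strand at most once per sheet type are assumptions made here for illustration. The toy string is invented so that its segment order matches the counts reported for 1E6B; it is not the actual PSI-PRED output for that protein.

```python
from itertools import groupby


def segments(ss):
    """Collapse a secondary structure string (H, E, C) into (state, length) runs."""
    return [(state, len(list(run))) for state, run in groupby(ss)]


def novel_features(ss):
    """Return (Altn, PnE, APnE) for a secondary structure string.

    Assumptions of this sketch: coil segments are skipped when reading the
    order of helices and strands, and a strand is counted once per sheet
    type even if it pairs with both of its neighbouring strands.
    """
    # Keep only helix and strand segments, in sequence order.
    order = [state for state, _ in segments(ss) if state in "HE"]

    # Altn: number of switches between a helix segment and a strand segment.
    altn = sum(1 for a, b in zip(order, order[1:]) if a != b)

    # Consecutive strands with a helix between them are treated as parallel,
    # consecutive strands without a helix between them as anti-parallel.
    parallel, antiparallel = set(), set()
    strand_idx = -1
    helix_since_last_strand = False
    for state in order:
        if state == "H":
            helix_since_last_strand = True
            continue
        strand_idx += 1
        if strand_idx > 0:
            target = parallel if helix_since_last_strand else antiparallel
            target.update({strand_idx - 1, strand_idx})
        helix_since_last_strand = False

    return altn, len(parallel), len(antiparallel)


# Hypothetical toy string with segment order E H E H E E H H.
toy = "CEEEECCHHHHHCCEEEECCHHHHHHCEEEECCEEEECCHHHHCCHHHHC"
print(novel_features(toy))  # (5, 3, 2)
```

In this toy example strand 3 belongs to both a parallel pair (with strand 2) and an anti-parallel pair (with strand 4), which is why the two counts sum to more than the four strands present.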
In the feature vector P, the sequence length is denoted by N and the components are defined as follows.

(1) p1 and p2 represent the contents of H residues (ConH) and E residues (ConE), respectively, in the secondary structural sequence.
(2) p3 and p4 represent the normalized length of the longest α-helix (MaxSegH/N) and β-strand (MaxSegE/N), respectively.
(3) p5 and p6 represent the normalized average length of α-helices (AvgSegH/N) and β-strands (AvgSegE/N), respectively.
(4) p7 (CMVH) and p8 (CMVE) represent the composition moment vectors of H and E, respectively, which are formulated as

\[ p_7 = \frac{\sum_{j=1}^{n_H} n_{Hj}}{N(N-1)}, \qquad p_8 = \frac{\sum_{j=1}^{n_E} n_{Ej}}{N(N-1)} \]

where n_H and n_E are the total numbers of H and E residues in the secondary structure sequence, respectively, and n_{Hj} and n_{Ej} are the positions (in the secondary structure sequence) of the jth H residue and the jth E residue, respectively.
(5) p9 represents the normalized alternating frequency of α-helices and β-strands (Altn/N).
(6) p10 and p11 represent the proportions of parallel β-sheets and anti-parallel β-sheets, respectively, which are calculated as

\[ p_{10} = \frac{PnE}{PnE + APnE}, \qquad p_{11} = \frac{APnE}{PnE + APnE} \]

2.3. Classification algorithm construction

The support vector machine (SVM) has been successfully used in protein structural class prediction because of its high accuracy (Kurgan et al., 2008a, 2008b). The SVM classifier maps feature vectors into a multi-dimensional space by using a kernel function K, a ξ-insensitive loss function and a regularization parameter C. Here, the Gaussian kernel function $K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2)$ was chosen for its superiority in solving nonlinear problems compared with other kernel functions (Yuan et al., 2005). The SVM was parameterized through a grid search over γ and C values based on fifteen-fold cross-validation on the 25PDB dataset. The final classifier uses C = 362 and γ = 0.7.

3. Results

3.1. Prediction results and comparison with other methods

The prediction method was examined on the 25PDB, D675 and FC699 datasets by the jackknife test, and the prediction accuracies for proteins from the α+β and α/β classes were compared with those of the competing prediction methods on the same datasets. The results are summarized in Table 1.

The accuracy of protein structural class prediction has clearly been improved by SCPRED and MODAS through the use of protein secondary structural information, but the prediction accuracies for the α+β and α/β classes are still unsatisfactory (Kurgan et al., 2008a, 2008b). In this study, the prediction accuracies, especially for proteins from the α+β and α/β classes, were improved on the 25PDB dataset (Table 1). The overall accuracy of our prediction method was 82.9%, which is 1.5% higher than the previous best-performing method, MODAS, and 3.2% higher than the previous second best-performing method, SCPRED. For proteins from the α/β class, the accuracy of our prediction method on the 25PDB dataset was 81.5%, which is 0.3% higher than MODAS and 7.5% higher than SCPRED. For proteins from the α+β class, the accuracy of our prediction method on the 25PDB dataset was 76.0%, which is 7.7% higher than MODAS and 5.0% higher than SCPRED. When tested on the D675 and FC699 datasets, our method also performed better than SCPRED and MODAS, with overall accuracies of 82.0% and 89.6%, respectively (Table 1).

Table 2
Comparison of accuracies between the method that includes all 11 features and one that includes only eight features.

Dataset  Features                Accuracy (%)
                                 all-α  all-β  α/β   α+β   overall
25PDB    All features included   92.6   81.3   81.5  76.0  82.9
         p9, p10, p11 excluded   91.6   79.5   73.7  70.5  79.1
D675     All features included   90.8   81.4   85.2  68.6  81.1
         p9, p10, p11 excluded   90.8   80.1   85.2  60.6  78.7
FC699    All features included   97.7   88.0   89.1  84.2  89.6
         p9, p10, p11 excluded   97.7   87.4   88.9  82.9  89.3

3.2. Feature vector analysis

Unlike SCPRED and MODAS, the 11 features in our method were rationally designed to reflect the general content and spatial arrangement of the secondary structural elements of a given protein sequence. In particular, Altn, PnE and APnE were designed to improve the prediction accuracies for proteins from the α+β and α/β classes. The combination of these features proved effective in protein structural class prediction, especially for proteins from the α+β and α/β classes. The accuracies for proteins from these two classes in the 25PDB dataset clearly increased, and the increase in overall accuracy came mostly from the improvement in the prediction accuracies for these two classes (Table 1). For the D675 dataset, the increase in overall accuracy came mostly from the increase in accuracy for proteins from the α+β class (Table 1).

To further validate the contribution of Altn, PnE and APnE to the accuracy of this prediction method, predictions using only eight features (excluding p9, p10 and p11, which contain the information of Altn, PnE and APnE) were performed on the 25PDB, D675 and FC699 datasets. The results obtained with 8 features were compared with those obtained with 11 features (Table 2). The overall prediction accuracies on all three datasets declined after removal of p9, p10 and p11, but the extent of the decline varied among datasets. Removal of these features had the greatest effect on the 25PDB dataset, followed by the D675 dataset and then the FC699 dataset. The decline in overall accuracy was largely due to the decline in the accuracies for proteins from the α+β and α/β classes. For instance, the overall accuracy on the 25PDB dataset declined by 3.8%, whereas the accuracies for proteins from the α/β and α+β classes declined by 7.8% and 5.5%, respectively.

4. Discussion

One of the deficiencies of current protein structural class prediction methods is low accuracy on datasets with low sequence similarity (Kurgan and Homaeian, 2006; Kurgan et al., 2008a, 2008b). Recently, the overall prediction accuracies for these datasets have clearly been improved by adding secondary structural information, but the prediction accuracies for proteins from the α+β and α/β classes are still unsatisfactory (Kurgan et al., 2008b; Mizianty and Kurgan, 2009), possibly because most of the features are based on the contents of secondary structural elements. However, there are no significant differences in the
contents of α-helices and β-strands between proteins from the α/β class and the α+β class (Chou, 2005). Hence, features reflecting the spatial arrangements of secondary structural elements might be helpful for classifying proteins from the α/β and α+β classes. By adding features that quantify the collocation of helices and strands, MODAS increased the prediction accuracy for proteins from the α/β class to 81.2%, but the prediction accuracy for proteins from the α+β class did not improve (Mizianty and Kurgan, 2009). In proteins from the α/β class, the α-helices and β-strands alternate frequently and the β-strands usually form parallel β-sheets, whereas in proteins from the α+β class, the α-helices and β-strands appear successively and the β-strands usually form anti-parallel β-sheets (Fig. 1A). In this study, we gave the first simplification for predicting whether β-strands form parallel or anti-parallel β-sheets directly from protein secondary structure sequences. We proposed that if two β-strands are separated by an α-helix, they form a parallel β-sheet; otherwise, they form an anti-parallel β-sheet.

On the basis of these characteristics, three novel features were proposed in this study to classify proteins from the α/β and α+β classes. The first is the alternating frequency of α-helices and β-strands (Altn), and the other two are the number of β-strands composing parallel β-sheets (PnE) and the number of β-strands composing anti-parallel β-sheets (APnE). In combination with other knowledge-based features, a novel protein structural class prediction method with a total of 11 features was proposed. With this method, the overall prediction accuracy on the 25PDB dataset is 1.5% higher than that of the previous best-performing method, MODAS (Mizianty and Kurgan, 2009). Furthermore, the prediction accuracy for proteins from the α+β class in the 25PDB dataset is 5% higher than that of the previous best-performing method, SCPRED (Kurgan et al., 2008b). The prediction accuracies on the D675 and FC699 datasets are also improved. The overall prediction accuracies, as well as the prediction accuracies for the α+β and α/β classes, on all three datasets declined after removal of the features based on Altn, PnE and APnE. This result suggests that further mining of the spatial structural information implied in the protein secondary structural sequence will be an effective path to increasing the accuracy of protein structural class prediction.

Although the accuracy of protein structural class prediction increased to 82.9% on the 25PDB dataset in this work, it is still not high enough. This is partly because the real structures of proteins are much more complex than our theoretical model. Common secondary structural elements such as α-helices and β-strands are sometimes irregular. Moreover, some less common secondary structural elements, such as β-turns, β-bulges and 3₁₀-helices, were not included. In our future work, these factors will be taken into consideration to further improve the prediction accuracy.

5. Conclusion

In this work, a novel method for protein structural class prediction is reported. Both the overall prediction accuracy and the accuracies for proteins from the α+β and α/β classes are higher than those of previously reported methods. Furthermore, rational design on the basis of protein spatial structural information proved to be a successful approach to obtaining new features and, consequently, to improving the prediction accuracy.

Acknowledgements

The program file can be obtained by e-mail from the corresponding author. We express our thanks to Prof. Lukasz Kurgan and Dr. Marcin Mizianty for providing the datasets used in this paper. We also thank Dr. Alan K. Chang for reviewing the manuscript. The project was supported by the National Science Foundation of China (Youth Grant 10801026).

References

Andreeva, A., Howorth, D., Brenner, S.E., Hubbard, T.J., Chothia, C., Murzin, A.G., 2004. SCOP database in 2004: refinements integrate structure and sequence family data. Nucleic Acids Res. 32, D226–229.
Chou, K.C., 2004. Structural bioinformatics and its impact to biomedical science. Curr. Med. Chem. 11, 2105–2134.
Chou, K.C., 2005. Progress in protein structural class prediction and its impact to bioinformatics and proteomics. Curr. Protein Pept. Sci. 6, 423–436.
Costantini, S., Facchiano, A.M., 2009. Prediction of the protein structural class by specific peptide frequencies. Biochimie 91, 226–229.
Jones, D.T., 1999. Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol. 292, 195–202.
Kurgan, L.A., Homaeian, L., 2006. Prediction of structural classes for protein sequences and domains: impact of prediction algorithms, sequence representation and homology, and test procedures on accuracy. Pattern Recognit. 39, 2323–2343.
Kurgan, L.A., Chen, K., 2007. Prediction of protein structural class for the twilight zone sequences. Biochem. Biophys. Res. Commun. 357, 453–460.
Kurgan, L.A., Zhang, T., Zhang, H., Shen, S., Ruan, J., 2008a. Secondary structure-based assignment of the protein structural classes. Amino Acids 35, 551–564.
Kurgan, L.A., Cios, K., Chen, K., 2008b. SCPRED: accurate prediction of protein structural class for sequences of twilight-zone similarity with predicting sequences. BMC Bioinformat. 9, 226.
Mizianty, M.J., Kurgan, L.A., 2009. Modular prediction of protein structural classes from sequences of twilight-zone identity with predicting sequences. BMC Bioinformat. 10, 414.
Murzin, A.G., Brenner, S.E., Hubbard, T., Chothia, C., 1995. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247, 536–540.
Yuan, Z., Bailey, T.L., Teasdale, R.D., 2005. Prediction of protein B-factor profiles. Proteins 58, 905–912.
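As a supplementary illustration of Sections 2.2 and 2.3, the eight previously published features (p1 to p8) and the Gaussian kernel can be sketched in a few lines of Python. This is a hedged sketch, not the authors' program: the function names are hypothetical, ConH and ConE are assumed to be residue fractions (count divided by N), residue positions are assumed to be 1-based, and the input is assumed to be a string over H, E and C with at least two residues. Only the value γ = 0.7 is taken from the paper.

```python
import math
from itertools import groupby


def first_eight_features(ss):
    """Return [p1, ..., p8] for a secondary structure string over H, E, C.

    Order: ConH, ConE, MaxSegH/N, MaxSegE/N, AvgSegH/N, AvgSegE/N, CMVH, CMVE.
    Assumes len(ss) >= 2, fractions for ConH/ConE and 1-based positions.
    """
    n = len(ss)
    runs = [(state, len(list(run))) for state, run in groupby(ss)]
    feats = {}
    for s in "HE":
        lengths = [length for state, length in runs if state == s]
        positions = [j for j, state in enumerate(ss, start=1) if state == s]
        feats["Con" + s] = len(positions) / n
        feats["MaxSeg" + s] = max(lengths, default=0) / n
        feats["AvgSeg" + s] = (sum(lengths) / len(lengths) / n) if lengths else 0.0
        # Composition moment: sum of residue positions over N(N-1).
        feats["CMV" + s] = sum(positions) / (n * (n - 1))
    return [feats[name + s] for name in ("Con", "MaxSeg", "AvgSeg", "CMV") for s in "HE"]


def gaussian_kernel(x, y, gamma=0.7):
    """Gaussian (RBF) kernel exp(-gamma * ||x - y||^2) on two equal-length vectors."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))
```

For the invented string `"HHHCCEEEEC"` (N = 10), the H residues sit at positions 1 to 3 and the E residues at positions 6 to 9, so CMVH is 6/90 and CMVE is 30/90. In practice these eight values would be concatenated with p9, p10 and p11 and passed to an SVM implementation configured with the reported C and γ.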