LearningSemanticDefinitionsofOnlineInformationSources
MarkJamesCarmanCraigA.Knoblock
UniversityofSouthernCaliforniaInformationSciencesInstitute4676AdmiraltyWay
MarinadelRey,CA90292
mark@bradipo.netknoblock@isi.edu
Abstract
TheInternetcontainsaverylargenumberofinformationsourcesprovidingmanytypesofdatafromweatherforecaststotraveldealsandfinancialinformation.ThesesourcescanbeaccessedviaWeb-forms,WebServices,RSSfeedsandsoon.Inordertomakeautomateduseofthesesources,weneedtomodelthemsemantically,butwritingsemanticdescriptionsforWebServicesisbothtediousanderrorprone.Inthispaperweinvestigatetheproblemofautomaticallygeneratingsuchmodels.WeintroduceaframeworkforlearningDatalogdefinitionsofWebsources.Inordertolearnthesedefinitions,oursystemactivelyinvokesthesourcesandcomparesthedatatheyproducewiththatofknownsourcesofinformation.Itthenperformsaninductivelogicsearchthroughthespaceofplausiblesourcedefinitionsinordertolearnthebestpossiblesemanticmodelforeachnewsource.Inthispaperweperformanempiricalevaluationofthesystemusingreal-worldWebsources.Theevaluationdemonstratestheeffectivenessoftheapproach,showingthatwecanautomaticallylearncomplexmodelsforrealsourcesinreasonabletime.Wealsocompareoursystemwithacomplexschemamatchingsystem,showingthatourapproachcanhandlethekindsofproblemstackledbythelatter.
1.Introduction
Recentyearshaveseenanexplosioninthequantityandvarietyofinformationavailableonline.Onecanfindshoppingdata(pricesandavailabilityofgoods),geospatialdata(weatherforecasts,housinginformation),traveldata(flightpricingandstatus),financialdata(exchangeratesandstockquotes),andthatjustscratchesthesurfaceofwhatisavailable.Theaimofthisworkistomakeuseofthatvaststoreofinformation.
Astheamountofinformationhasincreased,sotoohasitsreuseacrossWebportalsandapplications.Developershaverealisedtheimportanceofmanagingcontentseparatelyfrompresentation,leadingtothedevelopmentofXMLasaself-describingdataformat.ContentinXMLisfareasiertomanipulatethanHTML,simplifyingintegrationacrossdifferentsources.Standardshavealsoemergedforprovidingprogrammaticaccesstodata(likeSOAPandREST)sothatdeveloperscaneasilybuildprograms(calledMash-Ups)thatcombinecontentfromdifferentsitesinreal-time.Manyportalsnowprovidesuchaccesstotheirdataandsomeevenprovidesyntacticdefinitions(inWSDL)oftheinputandoutputthesedatasourcesexpect.Missing,however,aresemanticdescriptionsofwhateachsourcedoes,whichisrequiredinordertosupportautomateddataintegration.
c2007AIAccessFoundation.Allrightsreserved.
Carman&Knoblock
1.1StructuredQuerying
Givenallthestructuredsourcesavailable,wewouldliketocombinethedatadynamicallytoanswerspecificuserrequests(asopposedtostaticallyinthecaseofMash-Ups).Dynamicdatarequestscanbeexpressedasqueriessuchasthoseshownbelow.Suchqueriesmayrequireaccesstomultiple(publiclyavailable)datasourcesandcombineinformationinwaysthatwerenotenvisagedbyitsproducers.
1.Tourism:Getpricesandavailabilityforallthreestarhotelswithin100kilometersofTrento,Italythatliewithin1kilometerofaskiresortthathasover1meterofsnow.2.Transportation:DeterminethetimethatIneedtoleaveworkinordertocatchabustotheairporttopickupmybrotherwhoisarrivingonQantasflight205.
3.DisasterPrevention:Findphonenumbersforallpeoplelivingwithin1mileofthecoastandbelow200feetofelevation.Itisclearevenfromthissmallsetofexamplesjusthowpowerfultheabilitytocombinedatafromdisparatesourcescanbe.Inordertogivethesequeriestoanautomatedsystem,wemustfirstexpressthemformallyinaquerylanguagesuchasSQLorDatalog(Ullman,1989).InDatalogthefirstquerymightlookasfollows:
q(hotel,price):-accommodation(hotel,3*,address),available(hotel,today,price),distance(address,Trento,Italy,dist1),dist1<100km,skiResort(resort,loc1),distance(address,loc1,dist2),
dist2<1km,snowCondiditions(resort,today,height),height>1m.
Theexpressionstatesthathotelandpricepairsaregeneratedbylookingupthreestarhotelsinarelationaltablecalledaccommodation,thencheckingthepricefortomorrownightinatablecalledavailable.TheaddressofeachhotelisusedtocalculatethedistancetoTrento,whichmustbelessthan100kilometers.ThequeryalsochecksthatthereisaskiResortwithin1kilometerofthehotel,andthatthesnowConditionsfortodayshowmorethan1meterofsnow.1.2Mediators
AsystemcapableofgeneratingaplantoanswersuchaqueryiscalledanInformationMediator(Wiederhold,1992).Inordertogenerateaplan,mediatorslookforsourcesthatarerelevanttothequery.Inthiscase,relevantsourcesmightbe:
1.TheItalianTourismWebsite:tofindallhotelsnear‘Trento,Italy’.2.ASkiSearchEngine:tofindskiresortsneareachhotel.
3.AWeatherProvider:tofindouthowmuchsnowhasfallenateachskiresort.Foramediatortoknowwhichsourcesarerelevant,itneedstoknowwhatinformationeachsourceprovides.WhileXMLdefinesthesyntax(formatting)usedbyasource,thesemantics(intendedmeaning)oftheinformationthesourceprovidesmustbedefinedseparately.ThiscanbedoneusingLocal-as-View(LAV)sourcedefinitionsinDatalog(Levy,2000).Essentially,sourcedefinitionsdescribequeriesthatifgiventoamediator,willreturnthesamedataasthesourceprovides.Exampledefinitionsareshownbelow.Thefirststates
2
LearningSemanticDefinitionsofInformationSourcesontheInternet
thatthesourcehotelSearchtakesfourvaluesasinput(inputsareprefixedbythe$-symbol),andreturnsalistofhotelswhichliewithinthegivendistanceoftheinputlocation.Foreachhotelitalsoreturnstheaddressaswellasthepriceforaroomonthegivendate.(NotethatthesourceonlyprovidesinformationforhotelsinItaly.)
hotelSearch($location,$distance,$rating,$date,hotel,address,price):-country(location,Italy),accommodation(hotel,rating,address),available(hotel,date,price),distance(address,location,dist1),dist1 Inordertogenerateaplanforansweringthequery,amediatorperformsaprocesscalledqueryreformulation(Levy,2000),wherebyittransformsthequeryintoanewqueryover(intermsof)therelevantinformationsources.1(Asourceisrelevantifitreferstothesamerelationsasthequery.)Theresultingplaninthiscaseisshownbelow. q(hotel,price):-hotelSearch(Trento,Italy,100km,3*,today,hotel,address,price),findSkiResorts(address,1km,resort,location), getSkiConditions(resort,today,height),height>1m. Inthiswork,thequestionsofinterestare:wheredoallthedefinitionsfortheseinformationsourcescomefromandmoreprecisely,whathappenswhenwewanttoaddnewsourcestothesystem?Isitpossibletogeneratethesesourcedefinitionsautomatically?1.3DiscoveringNewSources Intheexampleabove,themediatorknowsofasetofrelevantsourcestousetosuccessfullyanswerthequery.Ifinstead,oneofthosesourcesismissingordoesn’thavethedesiredscope(e.g.getSkiConditionsdoesn’tprovidedataforTrento),thenthemediatorfirstneedstodiscoverasourceprovidingthatinformation.Asthenumberandvarietyofinformationsourcesincrease,wewillundoubtedlyrelyonautomatedmethodsfordiscoveringthemandannotatingthemwithsemanticdescriptions.Inordertodiscoverrelevantsources,asystemmightinspectaserviceregistry2(suchasthosedefinedinUDDI),orperformkeyword-basedsearchoveraWebindex(suchasGoogleordel.icio.us).Theresearchcommunityhaslookedattheproblemofdiscoveringrelevantservices,developingtechniquesforclassifyingservicesintodifferentdomains(suchasweatherandflights)usingservicemetadata(Heß&Kushmerick,2003)andclusteringsimilarservicestogethertoimprovekeyword-basedsearch(Dong,Halevy,Madhavan,Nemes,&Zhang,2004).Thesetechniques,althoughuseful,arenotsufficientforautomatingserviceintegration. 1.Thecomplexityofqueryreformulationisknowntobeexponential,althoughefficientalgorithmsforperformingitdoexist(Pottinger&Halevy,2001). 2.Notethattechnically,aserviceisdifferentfromasource.Aserviceisaninterfaceprovidingaccesstomultipleoperations,eachofwhichmayprovideinformation.Ifanoperationdoesnotaffectthe“stateoftheworld”(e.g.bychargingsomebody’screditcard),thenwecallitaninformationsource.Forthispaper,however,weusethetermservicetoreferonlytoinformationsources. 3 Carman&Knoblock 1.4LabelingServiceInputsandOutputs Oncearelevantserviceisdiscovered,theproblemshiftstomodelingitsemantically.Model-ingsourcesbyhandcanbelaborious,soautomatingtheprocessmakessense.Sincedifferentservicesoftenprovidesimilaroroverlappingdata,itshouldbepossibletouseknowledgeofpreviouslymodeledservicestolearndescriptionsfornewones. Thefirststepinmodelingasourceistodeterminewhattypeofdataitrequiresasinputandproducesasoutput.Thisisdonebyassigningsemantictypes(likezipcode,telephonenumber,temperature,andsoon)totheattributesofaservice.Semantictypesrestrictthepossiblevaluesofanattributetoasubsetofthecorrespondingprimitivetype.Theresearchcommunityhasinvestigatedautomatingtheassignmentprocessbyviewingitasaclassificationproblem(Heß&Kushmerick,2003).Intheirsystem,HeßandKushmericktrainedaSupportVectorMachine(SVM)onmetadatadescribingdifferentsources.Thesystem,givenasourcesuchasthefollowing: getWeather($zip,temp) usesthelabelsgetWeather,zipandtemp(andanyotheravailablemetadata)toassigntypestotheinputandoutputattributes,e.g.:zip→zipcode,temp→temperature.Notethatadditionalmetadataisoftenusefulfordistinguishingbetweenpossibleassignments.(If,forexample,thenameoftheoperationhadbeenlistEmployees,thentempmayhavereferredtoatemporaryemployeeratherthanatemperature.) Insubsequentwork,researchersdevelopedamorecomprehensivesystemthatusedbothmetadataandoutputdatatoclassifyserviceattributes(Lerman,Plangprasopchok,&Knoblock,2006).Inthatsystem,aLogisticRegressionbasedclassifierfirstassignsse-mantictypestoinputparameters.Examplesofthoseinputtypesarethenusedtoinvoketheservice,andtheoutputisgiventoapattern-languagebasedclassifier,thatassignstypestotheoutputparameters.Theauthorsarguethatclassificationbasedonbothdataandmetadataisfarmoreaccuratethanthatbasedonmetadataalone.Usinganexample,itiseasytoseewhy.ConsiderthefollowingtuplesproducedbyourgetWeathersource: 90292,25◦C,10274,15◦C,60610,18◦C,... Giventhedata,theclassifiercanbecertainthattempreallydoesreferstoatemperature,andindeedcanevenassignittoamorespecifictype,temperatureC(inCelsius). Whiletheproblemofdeterminingthesemantictypesofaservice’sattributesisveryinteresting,andthereisroomforimprovementoncurrenttechniques,weassumeforthepurposesofthisworkthatithasalreadybeensolved.1.5GeneratingaDefinition Onceweknowtheparametertypes,wecaninvoketheservice,butwearestillunabletomakeuseofthedataitreturns.Todothat,weneedalsotoknowhowtheoutputattributesrelatetotheinput(i.e.adefinitionforthesource).Forexample,forthegetWeatherserviceweneedtoknowwhetherthetemperaturebeingreturnedisthecurrenttemperature,thepredictedhightemperaturefortomorrowortheaveragetemperatureforthistimeofyear.Suchrelationshipscanbedescribedbythefollowingdefinitions: getWeather($zip,temp):-currentTemp(zip,temp). getWeather($zip,temp):-forecast(zip,tomorrow,temp).getWeather($zip,temp):-averageTemp(zip,today,temp). 4 LearningSemanticDefinitionsofInformationSourcesontheInternet Therelationsusedinthesedefinitionswouldbedefinedinadomainontology(orschema).Inthispaperwedescribeasystemcapableoflearningwhichifanyofthesedefinitionsisthecorrect.Thesystemleverageswhatitknowsaboutthedomain,i.e.thedomainontologyandasetofknowninformationsources,tolearnwhatitdoesnotknow,namelytherelationshipbetweentheattributesofanewlydiscoveredsource.1.6Outline Thispaperpresentsacomprehensivetreatmentofmethodsforlearningsemanticdescrip-tionsofWebinformationsources.Itextendsourpreviousworkonthesubject(Carman&Knoblock,2007)bypresentingdetaileddescriptionsofthemethodsforbothenumeratingthesearchspaceandevaluatingtheindividualcandidatedefinitions.Weprovideadditionaldetailsregardingtheevaluationmethodologyandtheresultsgenerated. Thepaperisstructuredasfollows.Westartwithanexampletomotivatethesourceinductionproblemandthenformulatetheproblemconcisely.Wediscussourapproachtolearningdefinitionsforsourcesintermsofotherknownsourcesofinformation(section3).Wegivedetailsofthesearchprocedureforgeneratingcandidatedefinitions(section4)andoftheevaluationprocedureforscoringcandidatesduringsearch(section5).Wethendescribeextensionstothebasicalgorithm(section6)beforediscussingtheevaluationsetupandtheexperiments(section7),whichdemonstratethecapabilitiesofoursystem.Finally,wecontrastourapproachwithpriorwork. 2.Problem Wenowdescribeindetailtheproblemoflearningdefinitionsfornewlydiscoveredservices.Westartwithaconcreteexampleofwhatismeantbylearningasourcedefinition.Intheexampletherearefourtypesofdata(semantictypes),namely:zipcodes,distances,latitudesandlongitudes.Therearealsothreeknownsourcesofinformation.EachofthesourceshasadefinitioninDatalogasshownbelow.Thefirstservice,aptlynamedsource1,takesinazipcodeandreturnsthelatitudeandlongitudecoordinatesofitscentroid.Thesecondservicecalculatesthegreatcircledistance(theshortestdistanceovertheearth’ssurface)betweentwopairsofcoordinates,andthethirdconvertsadistancefromkilometresintomilesbymultiplyingtheinputbytheconstant1.609. source1($zip,lat,long):-centroid(zip,lat,long).source2($lat1,$long1,$lat2,$long2,dist):-greatCircleDist(lat1,long1,lat2,long2,dist). source3($dist1,dist2):-multiply(dist1,1.609,dist2). Thegoalinthisexampleistolearnadefinitionforanewlydiscoveredservice,calledsource4.Thisservicetakestwozipcodesasinputandreturnsadistancevalueasoutput: source4($zip1,$zip2,dist) Thesystemwewilldescribeusesthistypesignature(inputandoutputtypeinformation)tosearchforanappropriatedefinitionforthesource.Thedefinitiondiscoveredinthiscasemightbethefollowingconjunctionofcallstotheindividualsources: source4($zip1,$zip2,dist):-source1($zip1,lat1,long1),source1($zip2,lat2,long2), source2($lat1,$long1,$lat2,$long2,dist2),source3($dist2,dist). 5 Carman&Knoblock Thedefinitionstatesthatsource’soutputdistancecanbecalculatedfromtheinputzipcodes,bygivingthosezipcodestosource1,takingtheresultingcoordinatesandcalculatingthedistancebetweenthemusingsource2,andthenconvertingthatdistanceintomilesusingsource3.Totestwhetherthisdefinitioniscorrect,thesystemmustinvokeboththenewsourceandthedefinitiontoseeifthevaluesgeneratedagreewitheachother.Thefollowingtableshowssuchatest: $zip1802106060110005 $zip2902661520135555 dist(actual)842.37410.31899.50 dist(predicted)843.65410.83899.21 Inthetable,theinputzipcodeshavebeenselectedrandomlyfromasetofexamples,andtheoutputfromthesourceandthedefinitionareshownsidebyside.Sincetheoutputvaluesarequitesimilar,oncethesystemhasseenasufficientnumberofexamples,itcanbeconfidentthatithasfoundthecorrectsemanticdefinitionforthesource. Thedefinitionaboveisgivenintermsofthesourcerelations,butcouldalsohavebeenwrittenintermsofthedomainrelations(therelationsusedtodefinesources1to3).Toconvertthedefinitionintothatform,onesimplyneedstoreplaceeachsourcerelationbyitsdefinitionasfollows: source4($zip1,$zip2,dist):-centroid(zip1,lat1,long1),centroid(zip2,lat2,long2), greatCircleDist(lat1,long1,lat2,long2,dist2),multiply(dist1,1.609,dist2).Writteninthisway,thenewsemanticdefinitionmakessenseatanintuitivelevel:thesourceissimplycalculatingthedistanceinmilesbetweenthecentroidsofthetwozipcodes.2.1ProblemFormulation HavinggivenanexampleoftheSourceDefinitionInductionProblem,wenowdescribetheproblemmoreformally,butbeforedoingso,weintroducesomeconceptsandnotation.(Wenotethatourfocusinthispaperisonlearningdefinitionsforinformation-providing,asopposedto“world-altering”services.) •Thedomainofasemanticdata-typet,denotedD[t],isthe(possiblyinfinite)setofconstantvalues{c1,c2,...},whichconstitutethesetofvaluesforvariablesofthattype.ForexampleD[zipcode]={90210,90292,...} •Anattributeisapairlabel,semanticdata-type,e.g.zip1,zipcode.Thetypeofattributeaisdenotedtype(a)andthecorrespondingdomainD[type(a)]isabbreviatedtoD[a]. •Aschemeisanordered(finite)setofattributesa1,...,anwithuniquelabels,wherenisreferredtoasthearityofthescheme.Anexampleschememightbezip1:zipcode,zip2:zipcode,dist:distance.ThedomainofaschemeA,denotedD[A],istheCartesianproductofthedomainsoftheattributesinthescheme{D[a1]×...×D[an]},ai∈A. •AtupleoveraschemeAisanelementfromthesetD[A].Atuplecanberepresentedbyasetofname-valuepairs,suchas{zip1=90210,zip2=90292,dist=8.15} 6 LearningSemanticDefinitionsofInformationSourcesontheInternet •Arelationisanamedscheme,suchasairDistance(zip1,zip2,dist).Multiplere-lationsmaysharethesamescheme. •Anextensionofarelationr,denotedE[r],isasubsetofthetuplesinD[r].Forexam-ple,E[airDistance]mightbeatablecontainingthedistancebetweenallzipcodesinCalifornia.(Notethattheextensionofarelationmayonlycontaindistincttuples.)•AdatabaseinstanceoverasetofrelationsR,denotedI[R],isasetofextensions{E[r1],...,E[rn]},oneforeachrelationr∈R. •AquerylanguageLisaformallanguageforconstructingqueriesoverasetofrelations.WedenotethesetofallqueriesthatcanbewrittenusingthelanguageLoverthesetofrelationsRreturningtuplesconformingtoaschemeAasLR,A. •Theresultsetproducedbytheexecutionofaqueryq∈LR,AonadatabaseinstanceI[R]isdenotedEI[q]. •Asourceisarelations,withabindingpatternβs⊆s,whichdistinguishesinputattributesfromoutputattributes.(Theoutputattributesofasourcearedenotedby c≡s\\β.)thecomplementofthebindingpattern3,βss •AviewdefinitionforasourcesisaqueryvswritteninsomequerylanguageLR,s.TheSourceDefinitionInductionProblemisdefinedasatuple: T,R,L,S,V,s∗ whereTisasetofsemanticdata-types,Risasetofrelations,Lisaquerylanguage,Sisasetofknownsources,Visasetofviewdefinitions(oneforeachknownsource),ands∗isthenewsource(alsoreferredtoasthetarget). Eachsemantictypet∈TmustbeprovidedwithasetofexamplesvaluesEt⊆D[t].(WedonotrequiretheentiresetD[t],becausethedomainofmanytypesmaybepartiallyunknownortoolargetobeenumerated.)Inaddition,apredicateeqt(t,t)isavailableforcheckingequalitybetweenvaluesofeachsemantictypetohandlethecasewheremultipleserialisationsofavariablerepresentthesamevalue. Eachrelationr∈Risreferredtoasaglobalrelationordomainpredicateanditsextensionisvirtual,meaningthattheextensioncanonlybegeneratedbyinspectingeveryrelevantdatasource.ThesetofrelationsRmayincludesomeinterpretedpredicates,suchas≤,whoseextensionisdefinedandnotvirtual. ThelanguageLusedforconstructingqueriescouldbeanyquerylanguageincludingSQLandXQuery(theXMLQueryLanguage).InthispaperwewilluseaformofDatalog. Eachsources∈ShasanextensionE[s]whichisthecompletesetoftuplesthatcanbeproducedbythesource(atagivenmomentintime).Werequirethatthecorrespondingviewdefinitionvs∈V(writteninLR,s)isconsistentwiththesource,suchthat:E[s]⊆EI[vs],(whereI[R]isthecurrentvirtualdatabaseinstanceovertheglobalrelations).Notethatwedonotrequireequivalence,becausesomesourcesmayprovideincompletedata. Theviewdefinitionforthesourcetobemodeleds∗isunknown.ThesolutiontotheSourceDefinitionInductionProblemisaviewdefinitionv∗∈LR,s∗forthesources∗suchthatE[s∗]⊆EI[v∗],andthereisnootherviewdefinitionv∈LR,s∗thatbetterdescribes(providesatighterdefinitionfor)thesources∗,i.e.: ¬∃v∈LR,s∗s.t.E[s∗]⊆EI[v]∧|EI[v]|<|EI[v∗]| 3.The‘\\’-symboldenotessetdifference. 7 Carman&Knoblock Givenlimitedavailablecomputationandbandwidth,wenotethatitmaynotbepossibletoguaranteethatthisoptimalityconditionholdsforaparticularsolution;thusinthispaperwewillsimplystrivetofindthebestsolutionpossible.2.2ImplicitAssumptions Anumberofassumptionsareimplicitintheproblemformulation.Thefirstisthatthereexistsasystemcapableofdiscoveringnewsourcesandmoreimportantlyclassifying(togoodaccuracy)thesemantictypesoftheirinputandoutput.Systemscapableofdoingthiswerediscussedinsection1.4. Thesecondassumptionhastodowiththerepresentationofeachsourceasarelationalviewdefinition.MostsourcesontheInternetprovidetreestructuredXMLdata.Itmaynotalwaysbeobvioushowbesttoflattenthatdataintoasetofrelationaltuples,whilepreservingtheintendedmeaningofthedata.Consideratravelbookingsitewhichreturnsasetofflightoptionseachwithaticketnumber,apriceandalistofflightsegmentsthatconstitutetheitinerary.Onepossibilityforconvertingthisdataintoasetoftupleswouldbetobreakeachticketupintoindividualflightsegmenttuples(therebyobscuringtherelationshipbetweenpriceandthenumberofflightsegments).Anotherwouldbetocreateoneverylongtupleforeachticketwithroomforanumberofflightsegments(therebycreatingtupleswithmanynullvalues).Inthiscase,itisnotobviouswhich,ifeither,oftheseoptionsistobepreferred.Mostonlinedatasourcescan,however,bemodeledquitenaturallyasrelationalsources;andbyfirsttacklingtherelationalproblem,wecandeveloptechniquesthatcanlaterbeappliedtothemoredifficultsemi-structuredcase. Athirdassumptionisthatthesetofdomainrelationssufficesfordescribingthesourcetobemodeled.Forinstance,considerthecasewherethedomainmodelonlycontainsrelationsdescribingfinancialdata,andthenewsourceprovidesweatherforecasts.Obviously,thesystemwouldbeunabletofindanadequatedescriptionofthebehaviorofthesourceandwouldnotlearnamodelofit.Fromapracticalperspective,thislimitationisnotabigproblem,sinceausercanonlyrequestdata(writequeriestoamediator)usingtherelationsinthedomainmodelanyway.(Thusanysourcewhichcannotbedescribedusingthoserelationswouldnotbeneededtoansweruserrequests.)Inotherwords,theonusisonthedomainmodelertomodelsufficientrelationssoastobeabletodescribethetypesofqueriesausershouldbeabletoposetothesystemandconsequently,thetypesofsourcesthatshouldbeavailable.Thatsaid,aninterestingavenueforfutureresearchwouldbetoinvestigatetheproblemofautomating(atleastinpart)theprocessofexpandingthescopeofthedomainmodel(byaddingattributestorelations,orinventingnewones),basedonthetypesofsourcesdiscovered.2.3ProblemDiscussion Anumberofquestionsarisefromtheproblemformulation,thefirstbeingwherethedomainmodelcomesfrom.Inprinciple,thesetofsemantictypesandrelationscouldcomefrommanyplaces.Itcouldbetakenfromstandarddatamodelsforthedifferentdomains,oritmightjustbethesimplestmodelpossiblethataptlydescribesthesetofknownsources.Thedomainmodelmaythenevolveovertimeassourcesarediscoveredforwhichnoappropriatemodelcanbefound.Somewhatrelatedisthequestionofhowspecificthesemantictypes 8 LearningSemanticDefinitionsofInformationSourcesontheInternet oughttobe.Forexample,isitsufficienttohaveonesemantictypedistanceorshouldonedistinguishbetweendistanceinmetersanddistanceinfeet?Generallyspeaking,asemantictypeshouldbecreatedforeachattributethatissyntacticallydissimilartoallotherattributes.Forexample,aphonenumberandazipcodehaveverydifferentsyntax,thusoperationswhichacceptoneofthetypesasinputareunlikelytoaccepttheother.Inpractice,onemightcreateanewsemantictypewheneveratrainedclassifiercanrecognisethetypebasedonitssyntaxalone.Ingeneral,themoresemantictypesthereare,theharderthejobofthesystemclassifyingtheattributes,andtheeasierthejobofthesystemtaskedwithlearningadefinitionforthesource. Anotherquestiontobeconsiderediswherethedefinitionsfortheknownsourcescomefrom.Initiallysuchdefinitionswouldneedtobewrittenbyhand.Asthesystemlearnsdefinitionsfornewsources,theytoowouldbeaddedtothesetofknownsources,makingitpossibletolearnevermorecomplicateddefinitions. Inorderforthesystemtolearnadefinitionforanewsource,itmustbeabletoinvokethatsourceandthusneedsexamplesoftheinputtypes.Themorerepresentativethesetofexamplesavailable,themoreefficientandaccuratethelearningprocesswillbe.Aninitialsetofexampleswillneedtobeprovidedbythedomainmodeler.Then,asthesystemlearnsovertime,itwillgeneratealargenumberofexamplesofdifferentsemantictypes(asoutputfromvarioussources),whichcanberetainedforfutureuse. InformationIntegrationresearchhasreachedapointwheremediatortechnology4isbecomingmatureandpractical.Theneedtoinvolveahumaninthewritingofsourcedefinitionsis,however,theAchilles’Heelofsuchsystems.Thegainsinflexibilitythatcomewiththeabilitytodynamicallyreformulateuserqueriesareoftenpartiallyoffsetbythetimeandskillrequiredtowritedefinitionswhenincorporatingnewsources.Thusasystemcapableoflearningdefinitionsautomaticallycouldgreatlyenhancetheviabilityofmediatortechnology.Thismotivationaloneseemssufficientforpursuingtheproblem. 3.Approach TheapproachwetaketolearningsemanticmodelsforinformationsourcesontheWebistwofold.Firstly,wechoosetomodelsourcesusingthepowerfullanguageofconjunctivequeries.Secondly,weleveragethesetofknownsourcesinordertolearnadefinitionforthenewone.Inthissectionwediscusstheseaspectsinmoredetail.3.1ModelingLanguage ThesourcedefinitionlanguageListhehypothesislanguageinwhichnewdefinitionswillneedtobelearnt.Asisoftenthecaseinmachinelearning,wearefacedwithatrade-offwithrespecttotheexpressivenessofthislanguage.Ifthehypothesislanguageistoosimple,thenwemaynotbeabletomodelrealservicesusingit.Ontheotherhand,ifthelanguageisoverlycomplex,thenthespaceofpossiblehypotheseswillbesolargethatlearningwillnotbefeasible.ThelanguagewechooseisthatofconjunctivequeriesinDatalog,whichis 4.InfluentialInformationIntegrationsystemsincludeTSIMMIS(Garcia-Molina,Hammer,Ireland,Pa-pakonstantinou,Ullman,&Widom,1995),SIMS(Arens,Knoblock,&Shen,1996),InfoMaster(Duschka,1997),andAriadne(Knoblock,Minton,Ambite,Ashish,Muslea,Philpot,&Tejada,2001). 9 Carman&Knoblock ahighlyexpressiverelationalquerylanguage.Inthissectionwearguewhyalessexpressivelanguageisnotsufficientforourpurposes. ResearchersinterestedintheproblemofassigningsemanticstoWebServices(Heß&Kushmerick,2003)haveinvestigatedtheproblemofusingMachineLearningtechniquestoclassifyservices(basedonmetadatacharacteristics)intodifferentsemanticdomains,suchasweatherandflights,andtheoperationstheyprovideintodifferentclassesofoperation,suchasweatherForecastandflightStatus.Fromarelationalperspective,wecanconsiderthedifferentclassesofoperationsasrelations.Forinstance,considerthedefinitionbelow: source($zip,temp):-weatherForecast(zip,tomorrow,temp). ThesourceprovidesweatherdatabyselectingtuplesfromarelationcalledweatherForecast,whichhasthedesiredzipcodeanddateequaltotomorrow.Thisqueryisreferredtoasaselect-projectquerybecauseitsevaluationcanbeperformedusingtherelationaloperatorsselectionandprojection.Sofarsogood,wehavebeenabletouseasimpleclassifiertolearnasimpledefinitionforasource.Thelimitationimposedbythisrestricted(select-project)modelinglanguagebecomesobvious,however,whenweconsiderslightlymorecomplicatedsources.ConsiderasourcethatprovidesthetemperatureinFahrenheitaswellasCelsius.Inordertomodelsuchasourceusingaselect-projectquery,wewouldrequirethattheweatherForecastrelationbeextendedwithanewattributeasfollows: source($zip,tempC,tempF):-weatherForecast(zip,tomorrow,tempC,tempF). Themoreattributesthatcouldconceivablybereturnedbyaweatherforecastoperation(suchasdewpoint,humidity,temperatureinKelvin,latitude,etc.),thelongertherelationwillneedtobetocoverthemall.Better,inthiscase,wouldbetointroduceasecondrelationconvertCtoFthatmakesexplicittherelationshipbetweenthetemperaturevalues.If,inaddition,thesourcelimitsitsoutputtozipcodesinCalifornia,areasonabledefinitionforthesourcemightbe: source($zip,tempC,tempF):-weatherForecast(zip,tomorrow,tempC),convertCtoF(tempC,tempF),state(zip,California). Thisdefinitionisnolongerexpressedinthelanguageofselect-projectqueries,becauseitnowinvolvesmultiplerelationsandjoinsacrossthem.Thusfromthissimpleexample,weseethatmodelingservicesusingsimpleselect-projectqueriesisnotsufficientforourpurposes.Whatweneedareselect-project-joinqueries,alsoreferredtoasconjunctivequeries.5Thereaderhasalreadybeenintroducedtoexamplesofconjunctivequeriesthroughouttheprevioussections.ConjunctivequeriesformasubsetofthelogicalquerylanguageDatalogandcanbedescribedmoreformallyasfollows: AconjunctivequeryoverasetofrelationsRisanexpressionoftheform:q(X0):-r1(X1),r2(X2),...,rl(Xl). whereeachri∈RisarelationandXiisanorderedsetofvariablenamesofsizearity(ri).6Eachconjunctri(Xi)isreferredtoasaliteral.Thesetofvariables inthequery,denotedvars(q)=li=0Xi,consistsofdistinguishedvariablesX0(fromtheheadofthequery),andexistentialvariablesvars(q)\\X0,(which 5.Evaluatingaselect-project-joinqueryrequiresadditionalrelationaloperators:naturaljoinandrename.6.Notethataconjunctivequerycanalsobeexpressedinfirstorderlogicasfollows: l ∀X0∃Ys.t.r1(X1)∧r2(X2)∧...∧rl(Xl)→q(X0)whereX0∪Y=i=1Xi 10 LearningSemanticDefinitionsofInformationSourcesontheInternet onlyappearinthebody).Aconjunctivequeryissaidtobesafeifallthe distinguishedvariablesappearinthebody,i.e.X0⊆li=1Xi.3.2MoreExpressiveLanguages ModelingsourcesusingconjunctivequeriesimpliesthataggregateoperatorslikeMINandORDERcannotbeusedinsourcedefinitions.Thefunctionalityofmostsourcescanbedescribedwithoutsuchoperators.Somesourcescanonlybedescribedpoorly,however.Considerahotelsearchservicethatreturnsthe20closesthotelstoagivenlocation: hotelSearch($loc,hotel,dist):-accommodation(hotel,loc1),distance(loc,loc1,dist). Accordingtothedefinition,thesourceshouldreturnallhotelsregardlessofdistance.Onecannotexpressthefactthatonlytheclosesthotelswillbereturned.Thereasonfornotincludingaggregateoperatorsinthehypothesislanguageisthatthesearchspaceassociatedwithlearningdefinitionsisprohibitivelylarge.(Thusweleaveaggregateoperatorstofutureworkasdiscussedinsection9.2.) Similarly,sourcedefinitionscannotcontaindisjunction,whichrulesoutunionandre-cursivequeries.Again,thissimplifyingassumptionholdsformostinformationsourcesandgreatlyreducesthesearchspace.Itmeanshowever,thataweatherserviceprovidingforecastsonlyforcitiesintheUSandCanadawouldbemodeledas: s($city,temp):-forecast(city,country,tomorrow,temp). Sincethedefinitiondoesnotrestrictthedomainofthecountryattribute,whenconfrontedwitharequestfortheforecastinAustralia,amediatorwouldproceedtocalltheservice,oblivioustoanyrestrictiononthatattribute. Wealsodonotallownegationinthequeriesbecausesourcedefinitionsveryrarelyrequireit,soincludingitwouldneedlesslycomplicatethesearch.Inthoserarecaseswherethenegationofaparticularpredicateisusefulfordescribingcertaintypesofsources,thenegatedpredicatecanbeincluded(asadistinctpredicate)inthesearch.Forinstance,wemightuse“≥”todescribeasource,eventhoughstrictlyspeakingitisthenegationof“<”.3.3LeveragingKnownSources Ourapproachtotheproblemofdiscoveringsemanticdefinitionsfornewservicesistoleveragethesetofknownsourceswhenlearninganewdefinition.Broadlyspeaking,wedothisbyinvokingtheknownsources(inamethodicalmanner)toseeifanycombinationoftheinformationtheyprovidematchestheinformationprovidedbythenewsource.Fromapracticalperspective,thismeansinordertomodelanewlydiscoveredsourcesemantically,werequiresomeoverlapinthedatabeingproducedbythenewsourceandthesetofknownsources.Onewaytounderstandthisistoconsideranewsourceproducingweatherdata.Ifnoneoftheknownsourcesproduceanyweatherinformation,thenthereisnowayforthesystemtolearnwhetherthenewsourceisproducinghistoricalweatherdata,weatherforecasts-oreventhatitisdescribingweatheratall.(Inprinciple,onecouldtrytoguesswhattheserviceisdoingbasedonthetypesignaturealone,buttherewouldbenoguaranteethatthedefinitionwascorrect,makingitoflittleusetoamediator.)Giventhisoverlappingdatarequirement,onemightclaimthatthereislittlebenefitinincorporatingnewsources.Wedetailsomeofthereasonswhythisisnotthecasebelow. 11 Carman&Knoblock Themostobviousbenefitoflearningdefinitionsfornewsourcesisredundancy.Ifthesystemisabletolearnthatonesourceprovidesexactlythesameinformationasacurrentlyavailablesource,thenifthelattersuddenlybecomesunavailable,theformercanbeusedinitsplace.Forexample,ifamediatorknowsofoneweathersourceprovidingcurrentconditionsandlearnsthatasecondsourceprovidesthesameorsimilardata,thenifthefirstgoesdownforwhateverreason(perhapsbecauseanaccessquotahasbeenreached),weatherdatacanstillbeaccessedfromthesecond. Thesecondandperhapsmoreinterestingreasonforwantingtolearnadefinitionforanewsourceisthatthenewsourcemayprovidedatawhichliesoutsidethescopeof(orsimplywasnotpresentin)thedataprovidedbytheothersources.Forexample,consideraweatherservicewhichprovidestemperaturevaluesforzipcodesintheUnitedStates.Thenconsiderasecondsourcethatprovidesweatherforecastsforcitiesworldwide.Ifthesystemcanusethefirstsourcetolearnadefinitionforthesecond,theamountofinformationavailableforqueryingincreasesgreatly. Bindingconstraintsonaservicecanmakeaccessingcertaintypesofinformationdifficultorinefficient.Inthiscase,discoveringanewsourceprovidingthesameorsimilardatabutwithadifferentbindingpatternmayimproveperformance.Forexample,considerahotelsearchservicethatacceptsazipcodeandreturnsasetofhotelsalongwiththeirstarrating: hotelSearch($zip,hotel,rating,street,city,state):-accommodation(hotel,rating,street,city,state,zip). NowconsiderasimplequeryforthenamesandaddressesofallfivestarhotelsinCalifornia: q(hotel,street,city,zip):-accommodation(hotel,5*,street,city,California,zip).Answeringthisquerywouldrequirethousandsofcallstotheknownsource,oneforeveryzipcodeinCalifornia,andamediatorcouldonlyanswerthequeryiftherewasanothersourceprovidingthosezipcodes.Incontrast,ifthesystemhadlearntadefinitionforanewsourcewhichprovidesexactlythesamedatabutwithadifferentbindingpattern(suchastheonebelow),thenansweringthequerywouldrequireonlyonecalltoasource: hotelsByState($state,$rating,hotel,street,city,zip):-accommodation(hotel,rating,street,city,state,zip). Oftenthefunctionalityofacomplexsourcecanbedescribedintermsofacompositionofthefunctionalityprovidedbyothersimplerservices.Forinstance,considerthemotivatingexamplefromsection2,inwhichthefunctionalityprovidedbythenewsourcewastocalcu-latethedistanceinmilesbetweentwozipcodes.Thesamefunctionalitycouldbeachievedbyperformingfourdifferentcallstotheavailablesources.Inthatcase,thedefinitionlearntbythesystemmeantthatanyqueryregardingthedistancebetweenzipcodescouldbehan-dledmoreefficiently.Ingeneral,bylearningdefinitionsformorecomplicatedsourcesintermsofsimplerones,thesystemcanbenefitfromcomputation,optimisationandcachingabilitiesofservicesprovidingcomplexfunctionality. Finally,thenewlydiscoveredservicemaybefastertoaccessthantheknownsourcesprovidingsimilardata.Forinstance,considerageocodingservicethattakesinanaddressandreturnsthelatitudeandlongitudecoordinatesofthelocation.Becauseofthevarietyinalgorithmsusedtocalculatethecoordinates,it’snotunreasonableforsomegeocodingservicestotakeaverylongtime(upwardsofonesecond)toreturnaresult.Ifthesystemwereabletodiscoveranewsourceprovidingthesamegeocodingfunctionality,butusing 12 LearningSemanticDefinitionsofInformationSourcesontheInternet afasteralgorithm,thenitcouldlocateanddisplaymanymoreaddressesonamapinthesameamountoftime. 4.InducingDefinitions Inthissectionwedescribeanalgorithmforgeneratingcandidatedefinitionsforanewlydiscoveredsource.Thealgorithmformsthefirstphaseinagenerateandtestmethodologyforlearningsourcedefinitions.Wedeferdiscussionofthetestingphasetolaterinthepaper.Westartbybrieflydiscussingworkonrelationalrulelearningandthendescribehowouralgorithmbuildsupontheseideas.4.1InductiveLogicProgramming Thelanguageofconjunctivequeriesisarestrictedformoffirst-orderlogic.IntheMachineLearningcommunity,systemscapableoflearningmodelsusingfirst-orderrepresentationsarereferredtoasInductiveLogicProgramming(ILP)systemsorrelationalrulelearners.Becauseoftheexpressivenessofthemodelinglanguage,thecomplexityoflearningismuchhigherthanforpropositionalrulelearners(alsocalledattribute-valuelearners),whichformthebulkofMachineLearningalgorithms.Givenourrelationalmodelingofservices,manyofthetechniquesdevelopedinILPshouldalsoapplytoourproblem. TheFirstOrderInductiveLearner(foil)isawellknownILPsearchalgorithm(Cameron-Jones&Quinlan,1994).Itiscapableoflearningfirst-orderrulestodescribeatargetpred-icate,whichisrepresentedbyasetofpositiveexamples(tuplesoverthetargetrelation,denotedE+)andoptionallyalsoasetofnegativeexamples(E−).Thesearchforaviabledefinitioninfoilstartswithanemptyclause7andprogressivelyaddsliteralstothebody(antecedent)oftherule,therebymakingtherulemorespecific.Thisprocesscontinuesuntilthedefinition(denotedh)coversonlypositiveexamplesandnonegativeexamples: E+∩EI[h]=∅ andE−∩EI[h]=∅ Usuallyasetofrulesislearntinthismannerbyremovingthepositiveexamplescoveredbythefirstruleandrepeatingtheprocess.(Thesetofrulesistheninterpretedasaunionquery.)Searchinfoilisperformedinagreedybest-firstmanner,guidedbyaninformationgain-basedheuristic.Manyextensionstothebasicalgorithmexist,mostnotablythosethatcombinedeclarativebackgroundknowledgeinthesearchprocesssuchasfocl(Pazzani&Kibler,1992).Suchsystemsarecategorisedasperformingatopdownsearchbecausetheystartfromanemptyclause(themostgeneralrulepossible)andprogressivelyspecializetheclause.Bottomupapproaches,ontheotherhand,suchasgolem(Muggleton&Feng,1990),performaspecifictogeneralsearchstartingfromthepositiveexamplesofthetarget.4.2Search Wenowdescribetheactualsearchprocedureweusetogeneratecandidatedefinitionsforanewsource.Theprocedureisbasedonthetop-downsearchstrategyusedinfoil.Thealgorithmtakesasinputatypesignatureforthenewsourceandusesittoseedthesearchfor 7.WeusethetermsclauseandqueryinterchangeablytorefertoaconjunctivequeryinDatalog.Anemptyclauseisaquerywithoutanyliteralsinthebody(rightside)oftheclause. 13 Carman&Knoblock input:Apredicatesignatures∗output:Thebestscoringviewdefinitionvbest invoketargetwithsetofrandominputs;2vbest←emptyclauses∗;3addvbesttoemptyqueue; 4whilequeue=∅∧time() 17returnvbest; Algorithm1:Best-firstsearchthroughthespaceofcandidatesourcedefinitions.1 candidatedefinitions.(Wewillrefertothenewsourcerelationasthetargetpredicateandthesetofknownsourcerelationsassourcepredicates.)Thespaceofcandidatedefinitionsisenumeratedinabest-firstmanner,witheachcandidatetestedtoseeifthedataitreturnsissimilartothetarget.Pseudo-codedescribingtheprocedureisgiveninAlgorithm1.8 Thefirststepinthealgorithmistoinvokethenewsourcewitharepresentativesetofinputtuplestogenerateexamplesofoutputtuplesthatcharacterisethefunctionalityofthesource.Thissetofinvocationsmustincludepositiveexamples(invocationsforwhichoutputtupleswereproduced)andifpossible,alsonegativetuples(inputsforwhichnooutputwasreturned).Thealgorithm’sabilitytoinducethecorrectdefinitionforasourcedependsgreatlyonthenumberofpositiveexamplesavailable.Thusaminimumnumberofpositiveinvocationsofthesourceisimposed,meaningthatthealgorithmmayhavetoinvokethesourcerepeatedlyusingdifferentinputsuntilsufficientpositiveinvocationscanberecorded.Selectingappropriateinputvaluessoastosuccessfullyinvokeaserviceiseasiersaidthandone.Wedeferdiscussionoftheissuesanddifficultiesinvolvedinsuccessfullyinvokingthenewsourcetosection6.1,andassumeforthemomentthattheinductionsystemisabletogenerateatableofvaluesthatrepresentitsfunctionality. Thenextstepinthealgorithmistoinitialisethesearchbyaddinganemptyclausetothequeueofdefinitionstoexpand.Therestofthealgorithmissimplyabest-firstsearchprocedure.Ateachiterationthehighestscoringbutnotyetexpandeddefinition(denotedv0)isremovedfromthequeueandexpandedbyaddinganewpredicatetotheendofthe 8.Theimplementationofthisalgorithmusedintheexperimentsofsection7.3isavailableat:http://www.isi.edu/publications/licensed-sw/eidos/index.html 14 LearningSemanticDefinitionsofInformationSourcesontheInternet clause(seethenextsectionforanexample).Eachcandidategenerated(denotedv1)isaddedtothequeue.Thealgorithmthenprogressivelyconstrainsthecandidatebybindingvariablesofthenewlyaddedpredicate,(seesection4.4).Theevalfunction(seesection5.3)evaluatesthequalityofeachcandidateproduced.Theprocedurestopsconstrainingthecandidatewhenthechangeintheevaluationfunction(∆eval)dropstozero.Itthencomparesv1tothepreviousbestcandidatevbestandupdatesthelatteraccordingly. Inprinciplethealgorithmshouldterminatewhenthe“perfect”candidatedefinitionisdiscovered-onethatproducesexactlythesamedataasthetarget.Inpracticethatneveroccursbecausethesourcesareincomplete(don’tperfectlyoverlapwitheachother)andarenoisy.Insteadthealgorithmterminateswheneitherthequeuebecomesempty,atimelimitisreachedoramaximumnumberofiterationshasbeenperformed.4.3AnExample Wenowrunthroughanexampleoftheprocessofgeneratingcandidatedefinitions.Consideranewlydiscoveredsource,whichtakesinazipcodeandadistance,andreturnsallthezipcodesthatliewithinthegivenradius(alongwiththeirrespectivedistances).Thetargetpredicaterepresentingthissourceis: source5($zip1,$dist1,zip2,dist2) Nowassumethattherearetwoknownsources.Thefirstisthesourceforwhichthedefinitionwaslearntintheexamplefromsection2,namely: source4($zip1,$zip2,dist):-centroid(zip1,lat1,long1),centroid(zip2,lat2,long2), greatCircleDist(lat1,long1,lat2,long2,dist2),multiply(dist1,1.609,dist2).Thesecondsourceisn’tactuallyasourcebutaninterpretedpredicate: ≤(dist1,dist2). Thesearchforadefinitionforthenewsourcemightthenproceedasfollows.Thefirstdefinitiontobegeneratedistheemptyclause: source5($,$,,). Thenullcharacter()representsthefactthatnoneoftheinputsoroutputshaveanyrestrictionsplacedontheirvalues.Priortoaddingthefirstliteral(sourcepredicate),thesystemwillcheckwhetheranyoutputattributesechotheinputvalues.Inthiscase,giventhesemantictypes,twopossibilitiesneedtobechecked: source5($zip1,$,zip1,).source5($,$dist1,,dist1). Assumingneitherofthesepossibilitiesistrue(i.e.improvesthescore),thenliteralswillbeaddedoneatatimetorefinethedefinition.Aliteralisasourcepredicatewithanassignmentofvariablenamestoitsattributes.Anewdefinitionmustbecreatedforeverypossibleliteralthatincludesatleastonevariablealreadypresentintheclause.(Forthemomentweignoretheissueofbindingconstraintsonthesourcesbeingadded.)Thusmanycandidatedefinitionswouldbegenerated,includingthefollowing: source5($zip1,$dist1,,):-source4($zip1,$,dist1).source5($zip1,$,zip2,):-source4($zip1,$zip2,).source5($,$dist1,,dist2):-≤(dist1,dist2). 15 Carman&Knoblock Notethatthesemantictypesinthetypesignatureofthetargetpredicatelimitgreatlythenumberofcandidatedefinitionsthatcanbeproduced.Thesystemthenevaluateseachofthesecandidatesinturn,selectingthebestoneforfurtherexpansion.Assumingthefirstofthethreehasthebestscore,itwouldbeexpandedbyaddinganotherliteral,formingmorecomplicatedcandidatessuchasthefollowing: source5($zip1,$dist1,,dist2):-source4($zip1,$,dist1),≤(dist1,dist2).Thisprocesscontinuesuntilthesystemdiscoversadefinitionwhichperfectlydescribesthesource,orisforcedtobacktrackbecausenoliteralimprovesthescore.4.4IterativeExpansion Thesourcesusedinthepreviousexampleallhadrelativelylowarity.OntheInternetthisisrarelythecase,withmanysourcesproducingalargenumberofattributesofthesametype.Thisisaproblembecauseitcausesanexponentialnumberofdefinitionstobepossibleateachexpansionstep.Considerforinstanceastockpriceservice,whichprovidesthecurrent,high,low,market-openingandmarket-closingpricesforagiventickersymbol.Thetypesignatureforthatservicewouldbe: stockprice($ticker,price,price,price,price,price) Ifthedefinitiontowhichthispredicateistobeaddedalreadycontainskdistinctpricevariables,thenthenumberofwaysinwhichthepriceattributesofthenewrelationcanbe 55i assignedvariablenamesisi=0ik,whichisprohibitivelylargeevenformoderatek.9Tolimitthesearchspaceinthecaseofsuchhigh-aritypredicates,wefirstgeneratecandidateswithaminimalnumberofboundvariablesinthenewliteralandprogressivelyconstrainthebestperformingofthesedefinitionswithineachexpansion.(Higharitypredicatesarehandledinasimilarfashioninfoil,QuinlanandCameron-Jones,1993.)Forexample,considerusingthesourceabovetolearnadefinitionforanewsourcewithsignature: source6($ticker,price,price) Westartbyaddingliteralstoanemptydefinitionasbefore.Thistimethough,insteadofgeneratingaliteralforeverypossibleassignmentofvariablenamestotheattributesofeachrelation,wegenerateonlythesimplestassignmentssuchthatallofthebindingconstraintsaremet.(ThisistheexpandprocedurereferredtoinAlgorithm1.)Inthisexample,thetickersymbolinputofthestockpricesourcewouldneedtobebound,generatingasingledefinition: source6($tic,,):-stockprice($tic,,,,,). Thisdefinitionwouldbeevaluated,andthenmoreconstraineddefinitionsaregeneratedbyequatingavariablefromthesameliteraltoothervariablesfromtheclause.(ThisistheconstrainprocedurefromAlgorithm1.)Twosuchdefinitionsareshownbelow: source6($tic,pri1,):-stockprice($tic,pri1,,,,).source6($tic,pri1,):-stockprice($tic,,pri1,,,). Thebestofthesedefinitionswouldthenbeselectedandconstrainedfurther,generatingdefinitionssuchas: i 9.Intuitively,onecanassignvariable5namestoiattributesusingklabelsinkdifferentways.Onecanchooseiofthe5attributesiniways,andonecandothatfori=0,1,..,5.SeeWeber,Tausend,andStahl(1995)foradetaileddiscussionofthesizeofthehypothesisspaceinILP. 16 LearningSemanticDefinitionsofInformationSourcesontheInternet source6($tic,pri1,pri2):-stockprice($tic,,pri1,pri2,,).source6($tic,pri1,pri2):-stockprice($tic,,pri1,,pri2,). Inthisway,thebestscoringliteralcanbefoundwithouttheneedtoiterateoverallofthepossibleassignmentsofvariablestoattributes.4.5DomainPredicatesvs.SourcePredicates Intheexamplesofsections4.3and4.4,thedecisiontoperformsearchoverthesourcepredicatesratherthanthedomainpredicateswasmadeinanarbitraryfashion.10Inthissectionwejustifythatdecision.Ifoneweretoperformthesearchoverthedomainpredicatesratherthanthesourcepredicates,thentestingeachdefinitionwouldrequireanadditionalqueryreformulationstep.Forexample,considerthefollowingcandidatedefinitionforsource5containingthedomainpredicatecentroid: source5($zip1,$,,):-centroid(zip1,,). Inordertoevaluatethiscandidate,thesystemwouldneedtofirsttreatthedefinitionasaqueryandreformulateitintoasetofrewritings(thattogetherformaunionquery)overthevarioussourcesasfollows: source5($zip1,$,,):-source4($zip1,$,).source5($zip1,$,,):-source4($,$zip1,). Thisunionquerycanthenbeexecutedagainsttheavailablesources(inthiscasejustsource4)toseewhattuplesthecandidatedefinitionreturns.Inpracticehowever,ifthedefinitionsfortheknownsourcescontainmultipleliterals(astheynormallydo)andthedomainrelationsareofhigh-arity(astheyoftenare),thenthesearchoverthespaceofconjunctionsofdomainpredicatesisoftenmuchlargerthanthecorrespondingsearchoverthespaceofconjunctionsofsourcepredicates.Thisisbecausemultipleconjunctionsofdomainpredicates(candidatedefinitions)endupreformulatingtothesameconjunctionofsourcepredicates(unionqueries).Forexample,considerthefollowingcandidatedefinitionswrittenintermsofthedomainpredicates: source5($zip1,$,,):-centroid(zip1,lat,),greatCircleDist(lat,,,,).source5($zip1,$,,):-centroid(zip1,,lon),greatCircleDist(,lon,,,). source5($zip1,$,,):-centroid(zip1,lat,lon),greatCircleDist(lat,lon,,,).Allthreecandidateswouldreformulatetothesamequeryoverthesources(shownbelow),andthusareindistinguishablegiventhesourcesavailable. source5($zip1,$,,):-source4($zip1,$,). Ingeneral,thenumberofcandidatedefinitionsthatmaptothesamereformulationcanbeexponentialinthenumberofhiddenvariablespresentinthedefinitionsoftheknownsources.Forthisreason,wesimplifytheproblemandsearchthespaceofconjunctionsofsourcepredicates.Insomesense,performingthesearchoverthesourcepredicatescanbeseenasintroducinga“similarityheuristic”whichfocusesthesearchtowarddefinitionswithsimilarstructuretothedefinitionsoftheavailablesources.Wenotethatthedefinitionsproducedcan(andwill)laterbeconvertedtoqueriesovertheglobalpredicatesbyunfoldingand 10.Anoteonthedifferencebetweendomain,sourceandinterpretedpredicates.Domainpredicatesare inventedbyadomainexpertforuseinmodelingaparticularinformationdomain.Theydefineacommonschemathatcanbeusedfordescribinginformationfromdifferentsources.Sourcepredicatesrepresentthesourcesavailableinthesystem.Interpretedpredicates,(suchas“≤”),areaspecialtypeofdomainpredicate,thatcanbetreatedassourcepredicates,sincetheirmeaningisinterpreted(understood). 17 Carman&Knoblock possiblytighteningthemtoremoveredundancies.Wewilldiscusstheprocessoftighteningtheunfoldingsinsection6.6.4.6LimitingtheSearch Thesearchspacegeneratedbythistop-downsearchalgorithmmaybeverylargeevenforasmallnumberofsources.Theuseofsemantictypeslimitsgreatlythewaysinwhichvariableswithineachdefinitioncanbeequated(akathejoinpaths)andthusgoesalongwaytoreducethesizeofthesearchspace.Despitethisreduction,asthenumberofsourcesavailableincreases,thesearchspacebecomessolargethattechniquesforlimitingitmustbeused.Weemploysomestandard(andothernotsostandard)ILPtechniquesforlimitingthisspace.Suchlimitationsareoftenreferredtoasinductivesearchbiasorlanguagebias(N´edellec,Rouveirol,Ad´e,Bergadano,&Tausend,1996). Anobviouswaytolimitthesearchistorestrictthenumberofsourcepredicatesthatcanoccurinadefinition.Wheneveradefinitionreachesthemaximumlength,backtrackingcanbeperformed,allowingthesearchtoescapefromlocalminimathatmayresultfromgreedyenumeration.Theassumptionhereisthatshorterdefinitionsaremoreprobablethanlongerones,whichmakessensesinceserviceprovidersarelikelytoprovidedatainthesimplestformpossible.Moreover,thesimplerthedefinitionlearnt,themoreusefulitwillbetoamediator,sowedecidetotradecompleteness(theabilitytoexpresslongerdefinitions)forimprovedaccuracyovershorterdefinitions. Thesecondrestrictionplacedoncandidatedefinitionsistolimitthenumberoftimesthesamesourcepredicateappearsinagivencandidate.Thismakessensebecausethedefinitionsofrealservicestendnottocontainmanyrepeatedpredicates.Intuitively,thisisbecausemostservicesproviderawdatawithoutperformingmanycalculationsuponit.Repeateduseofthesamepredicateinadefinitionismoreusefulfordescribingsomeformofcalculationthanrawdataitself.(Exceptionstothisruleexist,forexamplepredicatesrepresentingunitconversionfunctionalitysuchasFahrenheittoCelsius,maynecessarilyoccurmultipletimesinthedefinitionofasource.) Thethirdrestrictionlimitsthecomplexityofthedefinitionsgeneratedbyreducingthenumberofliteralsthatdonotcontainvariablesfromtheheadoftheclause.Specifically,itlimitsthelevelofexistentialquantification(sometimesalsoreferredtoasthedepth,MuggletonandFeng,1990)ofeachvariableinaclause.Thislevelisdefinedtobezeroforalldistinguishedvariables(thoseappearingintheheadoftheclause).Forexistentialvariablesitisdefinedrecursivelyasoneplusthelowestlevelofanyvariableappearinginthesameliteral.Forexample,thecandidatedefinitionshownbelowhasamaximumexistentialquantificationlevelofthreebecausetheshortestpathfromthelastliteraltotheheadliteral(viajoinvariables)passesthroughtwootherliterals: source5($zip1,$,,):-source4($zip1,$,d1),source3($d1,d2),source3($d2,).Theeffectofthisbiasistoconcentratethesearcharoundsimplerbuthighlyconnecteddefinitions,whereeachliteraliscloselylinkedtotheinputandoutputofthesource. Thefourthrestrictionplacedonsourcedefinitionsisthattheyareexecutable.Morespecifically,itshouldbepossibletoexecutethemfromlefttoright,meaningthattheinputsofeachsourceappeareitherinthetargetpredicate(headoftheclause)orinoneoftheliteralstotheleftofthatliteral.Forexample,ofthetwocandidatedefinitionsshownbelow, 18 LearningSemanticDefinitionsofInformationSourcesontheInternet onlythefirstisexecutable.Theseconddefinitionisnot,becausezip2isusedasinputforsource4inthefirstliteral,withoutfirstbeingboundtoavalueintheheadoftheclause: source5($zip1,$,zip2,):-source4($zip1,$zip2,). source5($zip1,$,,):-source4($zip1,$zip2,dist1),source4($zip2,$zip1,dist1).Thisrestrictionservestwopurposes.Firstly,liketheotherbiases,itlimitsthesizeofthesearchspace.Secondly,itmakesiteasiertoevaluatethedefinitionsproduced.Intheory,onecouldstillevaluatetheseconddefinitionabovebygeneratinglotsofinputvaluesforzip2,butthatwouldrequirealotofinvocationsforminimalgain. Thelastrestrictionreducesthesearchspacebylimitingthenumberoftimesthesamevariablecanappearinanygivenliteralinthebodyoftheclause.Definitionsinwhichthesamevariableappearsmultipletimesinagivenliteral,suchasinthefollowingexamplewhichreturnsthedistancebetweenazipcodeanditself,arenotverycommoninpractice: source5($zip1,$,,dist2):-source4($zip1,$zip1,dist2). Explicitlypreventingsuchdefinitionsfrombeinggeneratedmakessensebecausesourcesrequiringthemaresorare,thatitisbettertoreducethesearchspaceexponentiallybyignoringthem,thantoexplicitlycheckforthemeachtime. 5.ScoringDefinitions Wenowproceedtotheproblemofevaluatingthecandidatedefinitionsgeneratedduringsearch.Thebasicideaistocomparetheoutputproducedbythesourcewiththeoutputproducedbythedefinitiononthesameinput.Themoresimilarthesetoftuplesproduced,thehigherthescoreforthecandidate.Thescoreisthenaveragedoverasetofdifferentinputtuplestoseehowwellthecandidatedefinitiondescribesthenewsource. Inthemotivatingexampleofsection2,thesourceforwhichadefinitionwasbeinglearnt(thedefinitionisrepeatedbelow)onlyproducedoneoutputtupledistforeveryinputtuplezip1,zip2: source4($zip1,$zip2,dist):-centroid(zip1,lat1,long1),centroid(zip2,lat2,long2),greatCircleDist(lat1,long1,lat2,long2,dist2),multiply(dist1,1.6093,dist2). Thisfactmadeitsimpletocomparetheoutputoftheservicewiththeoutputoftheinduceddefinition.Ingeneralhowever,thesourcetobemodeled(andthecandidatedefinitionsmodelingit)mayproducemultipleoutputtuplesforeachinputtuple.Takeforexamplesource5fromsection4.3,whichproducesthesetofoutputtupleszip2,dist2containingallthezipcodeswhichliewithinagivenradiusoftheinputzipcodezip1,dist1.Insuchcases,thesystemneedstocompareasetofoutputtupleswiththesetproducedbythedefinitiontoseeifanyofthetuplesarethesame.Sinceboththenewsourceandtheexistingsourcesmaynotbecomplete,thetwosetsmaysimplyoverlap,evenifthecandidatedefinitioncorrectlydescribesthenewsource.Assumingthatwecancountthenumberoftuplesthatarethesame,weneedameasurethattellsushowwellacandidatehypothesisdescribesthedatareturnedbyasource.Onesuchmeasureisthefollowing: 1|Os(i)∩Ov(i)| score(s,v,I)= |I|i∈I|Os(i)∪Ov(i)| 19 Carman&Knoblock wheresisthenewsource,visacandidatesourcedefinition,andI⊆D[βs]isthesetofinputtuplesusedtotestthesource(βsisthesetofinputattributesofsources).Os(i)denotesthesetoftuplesreturnedbythenewsourcewheninvokedwithinputtuplei.Ov(i)isthecorrespondingsetreturnedbythecandidatedefinition.Usingrelationalprojection(π)andselection(σ)operatorsandthenotationintroducedinsection2.1,thesesetscan crepresentstheoutputattributesofs.)bewrittenasfollows.(Notethatβs c(σβ=i(E[s]))Os(i)≡πβss and c(σβ=i(EI[v]))Ov(i)≡πβss Ifweviewthishypothesistestingasaninformationretrievaltask,wecanconsiderrecall tobethenumberofcommontuples,dividedbythenumberoftuplesproducedbythesource,andprecisiontobethenumberofcommontuplesdividedbythenumberoftuplesproducedbythedefinition.TheabovemeasuretakesbothprecisionandrecallintoaccountbycalculatingtheaverageJaccardsimilaritybetweenthesets.Thetablebelowgivesanexampleofhowthisscoreiscalculatedforeachinputtuple. inputtuplei∈Ia,bc,de,fg,hi,j actualoutputtuplesOs(i){x,y,x,z}{x,w,x,z}{x,w,x,y} ∅∅ predictedoutputtuplesOv(i){x,y}{x,w,x,y}{x,w,x,y}{x,y}∅ Jaccardsimilarity fortuplei1/21/310#undef! Thefirsttworowsofthetableshowinputsforwhichthepredictedandactualoutputtuplesoverlap.Inthethirdrow,thedefinitionproducesexactlythesamesetoftuplesasthesourcebeingmodeledandthusgetsthemaximumscore.Inthefourthrow,thedefinitionproducedatuple,whilethesourcedidn’t,sothedefinitionwaspenalised.Inthelastrow,thedefinitioncorrectlypredictedthatnotupleswouldbeoutputbythesource.Ourscorefunctionisundefinedatthispoint.Fromacertainperspectivethedefinitionshouldscorewellherebecauseithascorrectlypredictedthatnotuplesbereturnedforthatinput,butgivingahighscoretoadefinitionwhenitproducesnotuplescanbedangerous.Doingsomaycauseoverlyconstraineddefinitionsthatcangenerateveryfewoutputtuplestoscorewell.Atthesametime,lessconstraineddefinitionsthatarebetteratpredictingtheoutputtuplesonaveragemayscorepoorly.Forexample,considerasourcewhichreturnsweatherforecastsforzipcodesinLosAngeles: source($zip,temp):-forecast(zip,tomorrow,temp),UScity(zip,LosAngeles). Nowconsidertwocandidatedefinitionsforthesource.Thefirstreturnsthetemperatureforazipcode,whilethesecondreturnsthetemperatureonlyifitisbelow0◦C: v1($zip,temp):-forecast(zip,tomorrow,temp). v2($zip,temp):-forecast(zip,tomorrow,temp),temp<0◦C. Assumethatthesourceandcandidatesareinvokedusing20differentrandomlyselectedzipcodes.Formostzipcodes,thesourcewillnotreturnanyoutput,becausethezipcodewilllieoutsideofLosAngeles.Thefirstcandidatewilllikelyreturnoutputforallzipcodes,whilethesecondcandidatewould,likethesource,onlyrarelyproduceanyoutput.Thisisbecausethetemperatureinmostzipcodeswillbegreaterthanzero,andhasnothingtodo 20 LearningSemanticDefinitionsofInformationSourcesontheInternet withwhetherornotthezipcodeisinLosAngeles.Ifwescoredefinitionshighlywhentheycorrectlyproducenooutput,thesystemwoulderroneouslypreferthesecondcandidateoverthefirst(becausethelatteroftenproducesnooutput).Topreventthatfromhappening,wesimplyignoreinputsforwhichthedefinitioncorrectlypredictszerotuples.Thisisthesameassettingthescoretobetheaverageoftheothervalues. Returningourattentiontothetable,afterignoringthelastrow,theoverallscoreforthisdefinitionwouldbecalculatedas0.46.5.1PartialDefinitions Asthesearchproceedstowardthecorrectdefinitionfortheservice,manysemi-complete(unsafe)definitionswillbegenerated.Thesedefinitionswillnotproducevaluesforallattributesofthetargettuplebutonlyasubsetofthem.Forexample,thecandidate: source5($zip1,$dist1,zip2,):-source4($zip1,$zip2,dist1). producesonlyoneofthetwooutputattributesproducedbythesource.Thispresentsaproblem,becauseourscoreisonlydefinedoversetsoftuplescontainingalloftheoutputattributesofthenewsource.Onesolutionmightbetowaituntilthedefinitionsbecomesufficientlylongastoproducealloutputs,beforecomparingthemtoseewhichonebestdescribesthenewsource.Thereare,however,tworeasonswhythiswouldnotmakesense: •Thespaceofsafedefinitionsistoolargetoenumerate,andthusweneedtocomparepartialdefinitionstoguidethesearchtowardthecorrectdefinition. •Thebestdefinitionthatthesystemcangeneratemaywellbeapartialone,asthesetofknownsourcesmaynotbesufficienttocompletelymodelthesource. Thesimplestwaytocomputeascoreforapartialdefinitionistocomputethesamefunctionasbefore,butinsteadofusingtherawsourcetuples,projectingthemoverthesubsetofattributesthatareproducedbythedefinition.Thisrevisedscoreisshownbelow.(Notethattheprojectionisoverv\\βs,whichdenotesthesubsetofoutputattributesofswhichareproducedbytheviewdefinitionv.Notealsothattheprojectionisnotdistinct,i.e.multipleinstancesofthesametuplemaybeproduced.) 1|πv\\βs(Os(i))∩Ov(i)| score2(s,v,I)= |I|i∈I|πv\\βs(Os(i))∪Ov(i)| Thisrevisedscoreisnotveryusefulhowever,asitgivesanunfairadvantagetodefinitionsthatdonotproducealloftheoutputattributesofthesource.Thisisbecauseitisfareasiertocorrectlyproduceasubsetoftheoutputattributesthantoproduceallofthem.Considerforexamplethetwosourcedefinitionsshownbelow.Thetwodefinitionsareidenticalexceptthatthesecondreturnstheoutputdistancevaluedist2,whilethefirstdoesnot: source5($zip1,$dist1,zip2,):-source4($zip1,$zip2,dist2),≤(dist2,dist1).source5($zip1,$dist1,zip2,dist2):-source4($zip1,$zip2,dist2),≤(dist2,dist1).Sincethetwoareidentical,theprojectionoverthesubsetwillinthiscasereturnthesamenumberoftuples.Thismeansthatbothdefinitionswouldgetthesamescorealthoughtheseconddefinitionisclearlybetterthanthefirstsinceitproducesalloftherequiredoutputs. Weneedtobeabletopenalisepartialdefinitionsinsomewayfortheattributestheydon’tproduce.Onewaytodothisistofirstcalculatethesizeofthedomain|D[a]|ofeach 21 Carman&Knoblock ofthemissingattributes.Intheexampleabove,themissingattributeisthedistancevalue.Sincedistanceisacontinuousvalue,calculatingthesizeofitsdomainisnotobvious.Wecanapproximatethesizeofitsdomainby: |D[distance]|≈ max−minaccuracy whereaccuracyistheerror-boundondistancevalues.(Wewilldiscusserror-boundsfurtherinsection5.4.)Notethatthecardinalitycalculationmaybespecifictoeachsemantictype.Armedwiththedomainsize,wecanpenalisethescoreforthedefinitionbydividingitbytheproductofthesizeofthedomainsofalloutputattributesnotgeneratedbythedefinition.Inessence,wearesayingthatallpossiblevaluesfortheseextraattributeshavebeen“allowed”bythisdefinition.Thistechniqueissimilartotechniquesusedforlearningwithoutexplicitnegativeexamples(Zelle,Thompson,Califf,&Mooney,1995). c\\v,thusthepenaltyThesetofmissingoutputattributesisgivenbytheexpressionβs formissingattributesisjustthesizeofthedomainoftuplesofthatscheme,i.e.: c penalty=|D[βs\\v]| Usingthispenaltyvaluewecancalculateanewscore,whichtakesintoaccountthemissing attributes.Simplydividingtheprojectedscorebythepenaltywouldnotadheretotheintendedmeaningofcompensatingforthemissingattributevalues,andthusmayskewtheresults.Instead,wederiveanewscorebyintroducingtheconceptoftypeddompredicatesasfollows: Adompredicateforasemanticdata-typet,denoteddomt,isasinglearityrelationwhoseextensionissettobethedomainofthedatatype,i.e.E[domt]=D[t].Similarly,adompredicateforaschemeA,denoteddomA,isarelationoverAwhoseextensionisE[domA]=D[A]. DompredicateswereintroducedbyDuschkatohandletheproblemofqueryreformulationinthepresenceofsourceswithbindingconstraints(Duschka,1997).(Inthatworkthepredicateswerenottyped,althoughtypingwouldhaveresultedinamoreefficientalgo-rithm.)Hereweusethemtoconvertapartialdefinitionvintoasafe(complete)definitionv.Wecandothissimplybyaddingadompredicatetotheendoftheviewdefinitionthatgeneratesvaluesforthemissingattributes.Fortheexampleabove,vwouldbe: source5($zip1,$dist1,zip2,x):-source4($zip1,$zip2,dist2),≤(dist2,dist1),domdistance(x). wherexisanewvariableoftypedistance.Thenewviewdefinitionvissafe,becauseallthevariablesintheheadoftheclausealsoappearinthebody.Ingeneral,wecanturnanyunsafeviewdefinitionvintoasafedefinitionvbyappendingadompredicate c\\v(x1,...,xn),whereeachxiisadistinguishedvariable(fromtheheadoftheclause)domβs correspondingtoanoutputattributeofvthatwasn’tboundinv.Nowwecanusethiscompletedefinitiontocalculatethescoreasbefore: 1|Os(i)∩Ov(i)| score3(s,v,I)=score(s,v,I)= |I|i∈I|Os(i)∪Ov(i)| 22 LearningSemanticDefinitionsofInformationSourcesontheInternet whichcanberewritten(byexpandingthedenominator)asfollows: score3(s,v,I)= |Os(i)∩Ov(i)|1 |I|i∈I|Os(i)|+|Ov(i)|−|Os(i)∩Ov(i)| Wecanthenremovethereferencestovfromthisequationbyconsidering: c c\\v]=Ov(i)×D[βs\\v]Ov(i)=Ov(i)×E[domβs c\\v]|andthesizeoftheintersectionThusthesizeofthesetisgivenby|Ov(i)|=|Ov(i)||D[βs canbecalculatedbytakingtheprojectionovertheoutputattributesproducedbyv: c |Os(i)∩Ov(i)|=|πv\\βs(Os(i)∩Ov(i)×D[βs\\v])|=|πv\\βs(Os(i))∩Ov(i)| Substitutingthesecardinalitiesintothescorefunctiongivenabove,wearriveatthefollowing equationforthepenalisedscore: |πv\\βs(Os(i))∩Ov(i)|1 score3(s,v,I)=c\\v]|−|π|I|i∈I|Os(i)|+|Ov(i)||D[βsv\\βs(Os(i))∩Ov(i)| 5.2BindingConstraints Someofthecandidatedefinitionsgeneratedduringthesearchmayhavedifferentbinding constraintsfromthetargetpredicate.Forinstanceinthepartialdefinitionshownbelow,thevariablezip2isanoutputofthetargetsource,butaninputtosource4: source5($zip1,$dist1,zip2,):-source4($zip1,$zip2,dist1). Fromalogicalperspective,inordertotestthisdefinitioncorrectly,weneedtoinvokesource4witheverypossiblevaluefromthedomainofzipcodes.Doingthisisnotpracticalfortworeasons:firstly,thesystemmaynothaveacompletelistofzipcodesatitsdisposal.Secondlyandfarmoreimportantly,invokingsource4withthousandsofdifferentzipcodeswouldtakeaverylongtimeandwouldprobablyresultinthesystembeingblockedfromfurtheruseoftheservice.Soinsteadofinvokingthesamesourcethousandsoftimes,weapproximatethescoreforthisdefinitionbysamplingfromthedomainofzipcodesandinvokingthesourceusingthesampledvalues.Wethencompensateforthissamplingbyscaling(certaincomponentsof)thescorebytheratioofthesampledzipcodestotheentiredomain.Consideringtheexampleabove,ifwerandomlychooseasample(denotedδ[zipcode])ofsay20valuesfromthedomainofzipcodes,thenthesetoftuplesreturnedbythedefinitionwillneedtobescaledbyafactorof|D[zipcode]|/20. Amoregeneralequationforcomputingthescalingfactor(denotedSF)isshownbelow.Notethatthesamplingmayneedtobeperformedoverasetofattributes.(Hereβv\\βsdenotestheinputattributesofvwhichareoutputsofs.) SF= |D[βv\\βs]||δ[βv\\βs]| Wenowcalculatetheeffectofthisscalingfactorontheoverallscoreasfollows.Wedenote ˜v(i).ThisvaluethesetoftuplesreturnedbythedefinitiongiventhesampledinputasO 23 Carman&Knoblock whenscaledwillapproximatethesetoftuplesthatwouldhavebeenreturnedhadthedefinitionbeeninvokedwithallthepossiblevaluesfortheadditionalinputattributes: ˜v(i)|∗SF|Ov(i)|≈|O Assumingthesamplingisperformedrandomlyoverthedomainofpossiblevalues,the intersectionbetweenthetuplesproducedbythesourceandthedefinitionshouldscaleinthesameway.Thustheonlyfactornotaffectedbythescalinginthescoredefinedpreviouslyis|Os(i)|.Ifwedividethroughoutbythescalingfactorwehaveanewscoringfunction: ˜v(i)||πv\\βs(Os(i))∩O1 score4(s,v,I)= ˜v(i)||D[βc\\v]|−|πv\\β(Os(i))∩O˜v(i)||I|i∈I|Os(i)|/SF+|Oss Theproblemwiththisapproachisthatoftenthesampledsetofvaluesistoosmallandasa resultitdoesnotintersectwiththesetofvaluesreturnedbythesource,eventhoughalargersamplewouldhaveintersectedinsomeway.Thusoursamplingintroducesunfairdistortionsintothescoreforcertaindefinitions,causingthemtoperformpoorly.Forexample,consideragainsource5andassumethatforscalabilitypurposes,theserviceplacesalimitonthemaximumvaluefortheinputradiusdist1.(Thismakessense,asotherwisetheusercouldsettheinputradiustocovertheentireUS,andatupleforeverypossiblezipcodewouldneedtobereturned.)Nowconsiderthesamplingperformedabove.Ifwerandomlychooseonly20zipcodesfromthesetofallpossiblezipcodes,thechanceofthesamplecontainingazipcodewhichlieswithina300mileradiusofaparticularzipcode(inthemiddleofthedesert)isverylow.Moreover,evenifonepairofzipcodes(outof20)resultsinasuccessfulinvocation,thiswillnotbesufficientforlearningagooddefinitionfortheservice. Sotogetaroundthisproblemwebiasthesamplesuchthat,wheneverpossible,halfofthevaluesaretakenfrompositiveexamplesofthetarget(thosetuplesreturnedbythenewsource)andhalfaretakenfromnegativeexamples(thosetuplesnotreturnedbythesource).Bysamplingfrombothpositiveandnegativetuples,weguaranteethattheapproximationgeneratedwillbeasaccurateaspossiblegiventhelimitedsamplesize.Wedenotethesetofpositiveandnegativesamplesasδ+[βv\\βs]andδ−[βv\\βs],andusethesevaluestodefinepositiveandtotalscalingfactorsasshownbelow.(Thenumeratorforthepositivevaluesisdifferentfrombefore,asthesevalueshavebeentakenfromtheoutputofthenewsource.) SF+= |πβv\\βs(πv\\βs(Os(i)))| |δ+[βv\\βs]| Thetotalscalingfactoristhesamevalueasbefore,butcalculatedslightlydifferently: SF= |D[βv\\βs]| |δ+[βv\\βs]|+|δ−[βv\\βs]| Thescorecanthenbeapproximatedaccordinglybytakingintoaccountthesenewscalingfactors.Theintersectionnowneedstobescaledusingthepositivescalingfactor: ˜v(i)|∗SF+|πv\\βs(Os(i))∩Ov(i)|≈|πv\\βs(Os(i))∩O Thisnewscalingresultsinanewfunctionforevaluatingthequalityofaviewdefinition:˜v(i)|∗SF+|πv\\βs(Os(i))∩O1 score5(s,v,I)= c\\v]|∗SF−|π+˜v(i)||D[βs˜|I|i∈I|Os(i)|+|Ov\\βs(Os(i))∩Ov(i)|∗SF 24 LearningSemanticDefinitionsofInformationSourcesontheInternet 5.3FavouringShorterDefinitions Nowthatwehavederivedascoreforcomparingthedatathatasourceandcandidateproduce,wecandefinetheevaluationfunctionevalusedinAlgorithm1.Asmentionedinsection4.6,shorterdefinitionsforthetargetsourceshouldbepreferredoverlongerandpossiblylessaccurateones.Inaccordancewiththisprinciple,wescalethescorebythelengthofthedefinition,soastofavourshorterdefinitionsasfollows: eval(v)=ωlength(v)∗score5(s,v,I) Herelength(v)isthelengthoftheclauseandω<1isaweightingfactor.Settingtheweightingfactortobealittlelessthan1(suchas0.95)helpstoremovelogicallyredundantdefinitions,whichcansometimesbehardtodetect,butoftenreturnalmostexactlythesamescoreastheirshorterequivalent.Wewilldiscusstheproblemofgeneratingnon-redundantclausesinsection6.3. 5.4ApproximatingEquality Untilnow,wehaveignoredtheproblemofdecidingwhethertwotuplesproducedbythetargetsourceandthedefinitionarethesame.Sincedifferentsourcesmayserializedataindifferentwaysandatdifferentlevelsofaccuracy,wemustallowforsomeflexibilityinthevaluesthatthetuplescontain.Forinstance,intheexamplefromsection2,thedistancevaluesreturnedbythesourceanddefinitiondidnotmatchexactly,butwere“sufficientlysimilar”tobeacceptedasthesamevalue. Fornumerictypesliketemperatureordistanceitmakessensetouseanerrorbound(like±0.5◦C)orapercentageerror(suchas±1%)todecideiftwovaluescanbeconsideredthesame.Thisisbecausethesensingequipment(inthecaseoftemperature)orthealgorithm(inthecaseofdistance)willhavesomeerrorboundassociatedwiththevaluesitproduces.Werequirethatanerrorboundforeachnumerictypebeprovidedintheproblemspecification.(Ideally,theseboundswouldbelearntautomaticallyfromexamples.) Forcertainnominaltypeslikecompanynames,wherevalueslikeIBMCorporationandInternationalBusinessMachinesCorp.representthesamevalue,simplisticequalitycheckingusingexactorsubstringmatchesisnotsufficientfordecidingwhethertwovaluescorrespondtothesameentity.Inthiscase,stringeditdistancessuchastheJaroWinklerscoredobetteratdistinguishingstringsrepresentingthesameentityfromthoserepresentingdifferentones(Bilenko,Mooney,Cohen,Ravikumar,&Fienberg,2003).Amachinelearningclassifiercouldbetrainedonasetofsuchexamplestolearnwhichoftheavailablestringeditdistancesbestdistinguishesvaluesofthattypeandwhatthresholdtosetforacceptingapairasamatch.Werequirethatthispairofsimilaritymetricandthreshold(oranycombinationsofmetrics)beprovidedintheproblemspecification. Inothercases,enumeratedtypeslikemonthsoftheyearmightbeassociatedwithasimpleequalitycheckingprocedure,sothatvalueslikeJanuary,Janand1canbefoundequal.Theactualequalityprocedureusedwilldependonthesemantictypeandweassumeinthisworkthatsuchaprocedureisgivenintheproblemdefinition.Wenotethattheprocedureneednotbe100%accurate,butonlyprovideasufficientlevelofaccuracytoguidethesystemtowardthecorrectsourcedescription.Indeed,theequalityrulescouldalsobegeneratedofflinebytrainingaclassifier. 25 Carman&Knoblock Complextypessuchasdatepresentabiggerproblemwhenoneconsiderstherangeofpossibleserializations,includingvalueslike5/4/2006orThu,4May2006or2006-05-04.Insuchcasesspecializedfunctionsarenotonlyrequiredtocheckequalitybetweenvaluesbutalsotobreakthecomplextypesupintotheirconstituentparts(inthiscaseday,monthandyear).Thelatterwouldformpartofthedomainmodel. Insomecases,decidingwhethertwovaluesofthesametypecanbeconsideredequaldependsnotonlyonthetype,butalsoontherelationstheyareusedin.Considerthetworelationsshownbelow.Thefirstprovidesthelatitudeandlongitudecoordinatesofthecentroidforazipcode,whilethesecondreturnsthecoordinatesforaparticularaddress: centroid(zipcode,latitude,longitude) geocode(number,street,zipcode,latitude,longitude) Giventhedifferentwaysofcalculatingthecentroidofazipcode(includingusingthecenterofmassorthecenterofpopulationdensity)anerrorboundof500metersmightmakesenseforequatinglatitudeandlongitudecoordinates.Forageocodingservice,ontheotherhand,anerrorboundof50metersmaybemorereasonable.Ingeneral,sucherrorboundsshouldbeassociatedwiththesetofglobalrelations,insteadofjustthesemantictypes,andcouldbelearntaccordingly.Whentherelationscontainmultipleattributes,thentheproblemofdecidingwhethertwotuplesrefertothesameentityiscalledrecordlinkage(Winkler,1999).Anentirefieldofresearchisdevotedtotacklingthisproblem.Duetothecomplexityoftheproblemandthevarietyoftechniquesthathavebeendevelopedtohandleit,wedonotinvestigateitfurtherhere. 6.Extensions Inthissectionwediscussextensionstothebasicalgorithmneededforhandlingrealdatasources,aswellaswaystoreducethesizeofthehypothesisspaceandimprovethequalityofthedefinitionsproduced.6.1GeneratingInputs Thefirststepinthesourceinductionalgorithmistogenerateasetoftupleswhichwillrepresentthetargetrelationduringtheinductionprocess.Inotherwords,thesystemmusttrytoinvokethenewsourcetogathersomeexampledata.Doingthiswithoutbiasingtheinductionprocessiseasiersaidthandone.Thesimplestapproachtogeneratinginputvaluesistoselectconstantsatrandomfromthesetofexamplesgivenintheproblemspecification.Theproblemwiththisapproachisthatinsomecasesthenewsourcewillnotproduceanyoutputfortheselectedinputs.Insteadthesystemmayneedtoselectvaluesaccordingtosomedistributionoverthedomainofvaluesinorderforthesourcetoinvokecorrectly.Forexample,considerasourceprovidingpostsofusedcarsforsaleinacertainarea.Thesourcetakesthemakeofthecarasinput,andreturnscardetails: usedCars($make,model,year,price,phone) Althoughthereareoverahundreddifferentcarmanufacturersintheworld,onlyafewofthemproducethebulkofthecars.ThusinvokingthesourcewithvalueslikeFerrari,LotusandAstonMartinwillbelesslikelytoreturnanytuples,whencomparedwithmorecommonbrandssuchasFordandToyota(unlessthesourceisonlyprovidingdataforsportscarsofcourse).Ifadistributionoverpossiblevaluesisavailable,thesystemcanfirsttry 26 LearningSemanticDefinitionsofInformationSourcesontheInternet themorecommonvalues,ormoregenerally,itcanchoosevaluesfromthatsetaccordingtothedistribution.Inthisparticularexample,itmightnotbetoodifficulttoquerythesourcewithacompletesetofcarmanufacturersuntiloneoftheinvocationsreturnssomedata.Ingeneral,thesetofexamplesmaybeverylarge(suchasthe40,000+zipcodesintheUS)andthenumberof“interesting”valuesinthatset(theoneslikelytoreturnresults)maybeverysmall,inwhichcasetakingadvantageofpriorknowledgeaboutthedistributionofpossiblevaluesmakessense.Itshouldbenotedalsothatduringexecutionthesystemwillreceivealotofoutputdatafromthedifferentsourcesitaccesses.Thisdatacanberecordedtogeneratedistributionsoverpossiblevaluesforthedifferenttypes. Theproblemofgeneratingviableinputdataforanewsourcebecomesyetmoredifficultiftheinputrequiredisnotasinglevaluebutatupleofvalues.Inthiscasethesystemcanfirsttrytoinvokethesourcewithrandomcombinationsofattributevaluesfromtheexamplesofeachtype.Invokingsomesources(suchassource5)iseasybecausethereisnoexplicitrestrictiononthecombinationofinputvalues: source5($zip,$distance,zip,distance) Inothercases,suchasageocodingservicethecombinationofpossibleinputvaluesishighlyrestricted: USGeocoder($number,$street,$zipcode,latitude,longitude) Randomlyselectinginputvaluesindependentlyofoneanotherisunlikelytoresultinanysuccessfulinvocations.(Inorderfortheinvocationtosucceed,therandomlygeneratedtuplemustcorrespondtoanaddressthatactuallyexists.)Insuchcases,afterfailingtoinvokethesourceanumberoftimes,thesystemcantrytoinvokeothersources(suchasthehotellookupservicebelow),whichproducetuplescontainingtherequiredattributetypes: HotelSearch($city,hotel,number,street,zipcode) Ingeneral,thisprocessofinvokingsourcestogenerateinputforothersourcescanbechaineduntilasetofviableinputsisgenerated. Wenotethattheproblemofsynthesizingviableinputdataisitselfadifficultandinterestingresearchproblem.Ourcombinedapproachofutilizingvaluedistributionsandinvokingalternativeservicesperformswellinourexperiments(seesection7.3),butanareaoffutureworkistodevelopamoregeneralsolution.6.2DealingwithSources Inordertominimisesourceaccesses,whichcanbeveryexpensiveintermsofbothtimeandbandwidth,allrequeststotheindividualsourcesarecachedinalocalrelationaldatabase.Thisimplementationmeansthatthereisanimplicitassumptioninthisworkthattheoutputproducedbytheservicesisconstantforthedurationoftheinductionprocess.Thiscouldbeproblematiciftheservicebeingmodeledprovides(near)real-timedatawithanupdatefrequencyoflessthanthetimeittakestoinduceadefinition.Foraweatherpredictionservice,updatedhourly,thismaynotpresentmuchofaproblem,sincethedifferencebetweenpredictedtemperaturesmayvaryonlyslightlyfromoneupdatetothenext.Forareal-timeflightstatusserviceprovidingthecoordinatesofagivenaircrafteveryfiveminutes,thecachingmaybeproblematicasthelocationoftheplanewillvarygreatlyifittakes,forexample,onehourtoinduceadefinition.Intheoryonecouldtestforsuchvariationsystematicallybyperiodicallyinvokingthesamesourcewithapreviously 27 Carman&Knoblock successfulinputtupletoseeiftheoutputhaschanged,andupdatethecachingpolicyaccordingly. 6.3LogicalOptimisations Evaluatingdefinitionscanbeexpensivebothintermsoftime(waitingforsourcestore-turndata)andcomputation(calculatingjoinsoverlargetables).Thusitmakessensetocheckeachcandidateforredundancybeforeevaluatingit.Todecidewhichdefinitionsareredundant,weusetheconceptofquerycontainment: Aqueryq1∈LR,Aissaidtobecontainedinanotherqueryq2∈LR,AifforanydatabaseinstanceI,thesetoftuplesreturnedbythefirstqueryisasubsetofthosereturnedbythesecond,i.e.∀IEI[q1]⊆EI[q2].Wedenotecontainmentbyq1q2.Twoqueriesareconsideredlogicallyequivalentifq1q2∧q2q1.Fortheconjunctivequerieslearntinthispaper,testingforquerycontainmentreducestotheproblemoffindingacontainmentmapping(Chandra&Merlin,1977).11Wecanusethistesttodiscoverlogicallyequivalentdefinitionssuchasthefollowing,(whichcontainareorderingofthesameliterals): source($zip,temp):-getCentroid($zip,lat,lon),getConditions($lat,$lon,temp).source($zip,temp):-getConditions($lat,$lon,temp),getCentroid($zip,lat,lon).Suchequivalencecheckingcanbeperformedefficientlyifacanonicalorderingofpredicateandvariablenamesischosenapriori.Wheneverlogicallyequivalentdefinitionsaredis-covered,thesearchcanbacktrack,therebyavoidingentiresub-treesofequivalentclauses.Similarly,wecantestforandskiplogicallyredundantclausessuchasthefollowing(whichisequivalenttoashorterdefinitionwithoutthesecondliteral): source($zip,):-getCentroid($zip,lat,long),getCentroid($zip,lat,). Again,suchredundancycheckingcanbeperformedefficiently(Levy,Mendelzon,Sagiv,&Srivastava,1995)resultinginlittlecomputationaloverheadduringsearch.6.4FunctionalSources Moreinformationmaybeknownaboutthefunctionalityofcertainsourcesthanisexpressedbytheirsourcedefinitions.Forexample,sourceslikeMultiplyandConcatenate,whichareimplementedlocally,willbeknowntobecomplete.(Asourceisconsideredcomplete,ifitreturnsallofthetuplesimpliedbyitsdefinition,i.e.E[s]=EI[v].)Wheneversuchinformationisavailable,theinductionsystemcantakeadvantageofittoimprovesearchefficiency.Toexplainhow,wedefineaclassofsourcesthatwecallfunctionalsources,whicharecompleteandforanyinputtuplereturnexactlyoneoutputtuple.ThisisslightlymorerestrictivethanthestandardILPconceptofdeterminateliterals(Cameron-Jones&Quinlan,1994),whichforeveryinputtuplereturnatmostoneoutputtuple.MultiplyandConcatenatearebothexamplesoffunctionalsources.Thesystemtakesadvantageofthefactthatfunctionalsourcesplacenorestrictionsontheirinput.Wheneverafunctionalsourceisaddedtoacandidatedefinition,thescoreforthatdefinitiondoesn’tchangeprovidingallthesource’sinputsandnoneofitsoutputsarebound.(Thesetoftuplesreturnedbythenew 11.Ifthequeriescontaininterpretedpredicates,thencontainmenttestingisalittlemoreinvolved(Afrati, Li,&Mitra,2004). 28 LearningSemanticDefinitionsofInformationSourcesontheInternet definitionisthesameasbefore,butwithafewnewattributescorrespondingtotheoutputsofthesource.)Thusthenewdefinitiondoesnotneedtobeevaluated,butcanbeaddedtothequeue(ofdefinitionstoexpand)asis,whichbecomesparticularlyadvantageouswhenasource’sinputarityishigh.6.5Constants Constantsareoftenusedinsourcedescriptionstodefinethescopeofaservice.ConsideraweatherservicethatprovidesreportsforzipcodesonlyinCalifornia: calWeather($zip,$date,temp):-forecast(zip,date,temp),USstate(zip,California).IfamediatorreceivesaqueryaskingfortheforecastforChicago,itwillknowthatthissourceisnotrelevanttothequerysinceChicagoisnotinCalifornia.Althoughconstantsinsourcedescriptionscanbeveryuseful,simplyintroducingthemintothehypothesislanguagecouldcausethesearchspacetogrowprohibitively.(Forstates,thebranchingfactorwouldbe50,whileforzipcodesitwouldbeinexcessof40,000.)Obviouslyagenerateandtestmethodologydoesnotmakesensewhenthedomainofthesemantictypeislarge.Alternatively,onecanexplicitlycheckforrepeatedvaluesinthetuplesreturnedbythenewsource(i.e.forconstantsintheheadoftheclause)andinthejoinofthesourceanddefinitionrelations(i.e.forconstantsinthebodyoftheclause).Forexample,inthedefinitionbelowthejoinofthesourcerelationzip,date,tempwiththedefinitionrelationzip,date,temp,statewouldproduceonlytupleswithstateequaltoCalifornia.Sothatconstantcouldbeaddedtothedefinition. source($zip,$date,temp):-forecast(zip,date,temp),USstate(zip,state).Morecomplicateddetectionprocedureswouldberequiredfordiscoveringconstantsinin-terpretedpredicates(i.e.rangerestrictionsovernumericattributes).6.6Post-Processing Afteradefinitionhasbeenlearntforanewsource,itmaybepossibletotightenthatdefinitionbyremovinglogicalredundanciesfromitsunfolding.Considerthefollowingdefinitioninvolvingcallstotwohotelsources,onetocheckavailabilityandtheothertocheckitsrating: source($hotel,address,rating):-HotelAvailability($hotel,address,price),HotelRating($hotel,rating,address).Theunfoldingofthatdefinition(intermsofthedefinitionsofthehotelsources)containstworeferencestoanaccommodationrelation: source($hotel,address,rating):-accommodation(hotel,,address),available(hotel,today,price),accommodation(hotel,rating,address). Thefirstliteralisredundantandcanberemovedfromtheunfolding.Ingeneral,thesamerulesusedtodiscoverredundancyincandidatedefinitionscanbeusedtoremoveredundantliteralsfromtheunfolding.Moreover,sincethispost-processingstepneedstobeperformedonlyonce,timecanbespentsearchingformorecomplicatedformsofredundancy. 29 Carman&Knoblock 7.Evaluation Inthissectionwedescribeourevaluationofthesourceinductionalgorithm.Wefirstdescribetheexperimentalsetupusedandthentheexperimentsperformed.Finally,wecomparetheinductionalgorithmwithaparticularcomplexschemamatchingsystem.7.1ExperimentalSetup Thesourceinductionalgorithmdefinedinthispaperwasimplementedinasystemcalledeidos,whichstandsforEfficientlyInducingDefinitionsforOnlineSources.eidosim-plementsthetechniquesandoptimisationsdiscussedinsections4through6.(Certainextensionsfromsection6wereonlypartiallyimplemented:theimplementationcurrentlychecksforconstantsonlyintheheadoftheclauseanddoesnotperformanytighteningofthedefinitions.)AllcodewaswritteninJavaandaMySQLdatabasewasusedforcachingtheresultsofsourceinvocations. eidoswastestedon25differentproblemsinvolvingrealservicesfromseveraldomainsincludinghotels,financialdata,weatherandcars.Thedomainmodelusedwasthesameforeachproblemandincludedover70differentsemantictypes,rangingfromcommononeslikezipcodetomorespecifictypessuchasstocktickersymbols.Thedatamodelalsocontained36relations(excludinginterpretedpredicates),whichwereusedtomodel33differentservices.Allofthemodeledservicesarepubliclyavailableinformationsources.Wenoteherethatthedecisiontousethesamesetofknownsourcesforeachproblem(regardlessofthedomain)wasimportantinordertomakesurethatthetestswererealistic.Thisdecisionmadetheproblemmoredifficultthanthestandardschemamatching/mappingscenarioinwhichthesourceschemaischosen,becauseitprovidesdatathatisknownaprioritoberelevanttotheoutputschema. Inordertogiveabettersenseoftheproblemsettingandthecomplexityoftheknownsourcesavailable,welisttenbelow(orderedbyarity).Duetospacelimitationswedon’tshowthecompletelistnortheirdefinitions,onlytheinput/outputtypesforeachsource.Notethatallthesourcessharethesemantictypeslatitudeandlongitude,whichmeansthatthesearchspaceassociatedwiththesesourcesaloneisverylarge. 1 WeatherConditions($city,state,country,latitude,longitude,time,time,timeoffset,datetime,temperatureF,sky,pressureIn,direction,speedMph,humidity,temperatureF)2WeatherForecast($city,state,country,latitude,longitude,timeoffset,day,date, temperatureF,temperatureF,time,time,sky,direction,speedMph,humidity)3GetDistance($latitude,$longitude,$latitude,$longitude,distanceKm)4USGeocoder($street,$zipcode,city,state,latitude,longitude)5ConvertDMS($latitudeDMS,$longitudeDMS,latitude,longitude)6USGSEarthquakes(decimal,timestamp,latitude,longitude)7GetAirportCoords($iata,airport,latitude,longitude)8CountryCode($latitude,$longitude,countryAbbr)9GetCentroid($zipcode,latitude,longitude)10Altitude($latitude,$longitude,distanceM) Inordertoinducedefinitionsforeachproblem,thesource(andeachcandidatedefini-tion)wasinvokedatleast20timesusingrandominputs.Wheneverpossible,thesystemattemptedtogenerate10positiveexamplesofthesource(invocationsforwhichthesourcereturnedsometuples)and10negativeexamples(inputswhichproducednooutput).To 30 LearningSemanticDefinitionsofInformationSourcesontheInternet ensurethatthesearchterminated,thenumberofiterationsofthealgorithmincludingback-trackingstepswaslimitedto30.Asearchtimelimitof20minuteswasalsoimposed.Theinductivesearchbiasusedduringtheexperimentsisshownbelow,andaweightingfactor(definedinsection5.3)of0.9wasusedtodirectthesearchtowardshorterdefinitions. SearchBiasMaximumclauselength=7Maximumpredicaterepetition=2Maximumvariablelevel=5Executablecandidatesonly Novariablerepetitionwithinaliteral Intheexperiments,differentprocedureswereusedtodecideequalitybetweenvaluesofthesametypeasdiscussedinsection5.4.Someoftheequalityproceduresusedfordifferenttypesarelistedbelow.Theaccuracyboundsandthresholdsusedwerechosentomaximizeoverallperformanceofthelearningalgorithm.(Inpractice,ameta-learningalgorithmcouldbeusedtodeterminethebestaccuracyboundsfordifferentattributetypes.)Forallsemantictypesnotlistedbelow,substringmatching(checkingifonestringcontainedtheother)wasusedtotestequalitybetweenvalues. Typeslatitudes,longitudesdistances,speeds,temperatures,priceshumidity,pressure,degreesdecimals companies,hotels,airportsdates EqualityProcedureaccuracyboundof±0.002accuracyboundof±1%accuracyboundof±1.0accuracyboundof±0.1JaroWinklerscore≥0.85specialisedequalityprocedure Theexperimentswererunonadualcore3.2GHzPentium4with4GBofRAM(althoughmemorywasnotalimitingfactorinanyofthetests).ThesystemwasrunningWindows2003Server,JavaRuntimeEnvironment1.5andMySQL5.0.7.2EvaluationCriteria Inordertoevaluatetheinductionsystem,onewouldliketocompareforeachproblemthedefinitiongeneratedbythesystemwiththeidealdefinitionforthatsource(denotedvbestandv∗respectively).Inotherwords,wewouldlikeanevaluationfunction,whichratesthequalityofeachdefinitionproducedwithrespecttoahand-writtendefinitionforthesource(i.e.quality:vbest×v∗→[0,1]).Theproblemwiththisistwofold.Firstly,itisnotobvioushowtodefinesuchasimilarityfunctionoverconjunctivequeriesandmanydifferentpossibilitiesexist(seeMarkovandMarinchev,2000,foraparticularexample).Secondly,workingoutthebestdefinitionbyhand,whiletakingintoaccountthelimitationsofthedomainmodelandthefactthattheavailablesourcesarenoisy,incomplete,possiblylessaccurate,andevenserialisedataindifferentways,maybeextremelydifficult,ifevenpossible.Soinordertoevaluateeachofthediscovereddefinitions,weinsteadcountthenumberofcorrectlygeneratedattributesineachdefinition.Anattributeissaidtobecorrectlygenerated,if: 31 Carman&Knoblock •itisaninput,andthedefinitioncorrectlyrestrictsthedomainofpossiblevaluesforthatattribute,or •itisanoutput,andthedefinitioncorrectlypredictsitsvalueforgiveninputtuples.Considerthefollowingdefinitionthattakestwoinputvaluesandreturnsthedifferenceanditssquareroot(providingthedifferenceispositive): source($A,$B,C,D):-sum(B,C,A),product(D,D,C),A≥B. andimaginethattheinductionsystemmanagedtolearnonlythatthesourcereturnsthedifferencebetweentheinputandtheoutput,i.e.: source($A,$B,C,):-sum(B,C,A). WesaythatthefirstinputattributeAisnotcorrectlygeneratedasitisaninputandisnotconstrainedwithrespecttotheinputattributeB(theinequalityismissing).TheinputBisdeemedcorrectlygeneratedasitispresentinthesumrelation(onlyoneinputispenalisedforthemissinginequality).TheoutputCisdeemedcorrectlygeneratedwithrespecttotheinputs,andthemissingattributeDisn’tgeneratedatall.(Notethatiftheorderingofvariablesinthesumrelationhadbeendifferent,saysum(A,B,C),thenCwouldhavebeengenerated,butnotcorrectlygenerated.) Givenadefinitionforcorrectlygeneratedattributes,onecandefineexpressionsforprecisionandrecallovertheattributescontainedinasourcedefinition.Wedefineprecisiontobetheratioofcorrectlygeneratedattributestothetotalnumberofattributesgeneratedbyadefinition,i.e.: precision= #ofcorrectlygeneratedattributestotal#ofgeneratedattributes Wedefinerecalltobetheratioofgeneratedattributestothetotalnumberofattributesthatwouldhavebeengeneratedbytheidealdefinition,giventhesourcesavailable.(Insomecasesnosourcesareavailabletogeneratevaluesforanattributeinwhichcase,thatattributeisnotincludedinthecount.) recall= #ofcorrectlygeneratedattributes total#ofattributesthatshouldhavebeengenerated Notethatwedefinedprecisionandrecallattheschemalevelintermsoftheattributesinvolvedinasourcedefinition.Theycouldalsobedefinedatthedatalevelintermsofthetuplesbeingreturnedbythesourceandthedefinition.Indeed,theJaccardsimilarityusedtoscorecandidatedefinitionsisacombinationofdata-levelprecisionandrecallvalues.Thereasonforchoosingschemalevelmetricsinourevaluationisthattheybetterreflectthesemanticcorrectnessofthelearntdefinition,insofarastheyareindependentofthecompleteness(amountofoverlap)betweentheknownandtargetsources. Returningtoourexampleabove,theprecisionforthesimpledefinitionlearntwouldbe2/3andtherecallwouldbe2/4.Notethat,iftheproductrelationhadnotbeenavailableinourdomainmodel(inwhichcaseattributeDcouldneverhavebeengenerated),recallwouldhavebeenhigherat2/3. 32 LearningSemanticDefinitionsofInformationSourcesontheInternet 7.3Experiments Thedefinitionslearntbythesystemaredescribedbelow.Overallthesystemperformedverywellandwasabletolearntheintendeddefinition(ignoringmissingjoinvariablesandsuperfluousliterals)in19outofthe25problems.7.3.1GeospatialSources Thefirstsetofproblemsinvolvedninegeospatialdatasourcesprovidingavarietyoflocationbasedinformation.Thedefinitionslearntbythesystemarelistedbelow.Theyarereportedintermsofthesourcepredicatesratherthanthedomainrelations(i.e.theunfoldings)becausethecorrespondingdefinitionsaremuchshorter.Thismakesiteasiertounderstandhowwellthesearchalgorithmisperforming. 123 GetInfoByZip($zip0,cit1,sta2,_,tim4):-GetTimezone($sta2,tim4,_,_),GetCityState($zip0,cit1,sta2).GetInfoByState($sta0,cit1,zip2,_,tim4):-GetTimezone($sta0,tim4,_,_),GetCityState($zip2,cit1,sta0).GetDistanceBetweenZipCodes($zip0,$zip1,dis2):-GetCentroid($zip0,lat1,lon2),GetCentroid($zip1,lat4,lon5), GetDistance($lat1,$lon2,$lat4,$lon5,dis10),ConvertKm2Mi($dis10,dis2).GetZipCodesWithin($_,$dis1,_,dis3):-<(dis3,dis1). YahooGeocoder($str0,$zip1,cit2,sta3,_,lat5,lon6):-USGeocoder($str0,$zip1,cit2,sta3,lat5,lon6).GetCenter($zip0,lat1,lon2,cit3,sta4):-WeatherConditions($cit3,sta4,_,lat1,lon2,_,_,_,_,_,_,_,_,_,_,_),GetZipcode($cit3,$sta4,zip0). Earthquakes($_,$_,$_,$_,lat4,lon5,_,dec7,_):-USGSEarthquakes(dec7,_,lat4,lon5).USGSElevation($lat0,$lon1,dis2):-ConvertFt2M($dis2,dis1),Altitude($lat0,$lon1,dis1).CountryInfo($cou0,cou1,cit2,_,_,cur5,_,_,_,_):-GetCountryName($cou0,cou1),GoCurrency(cur5,cou0,_), WeatherConditions($cit2,_,cou1,_,_,_,_,_,_,_,_,_,_,_,_,_). 456 789 Thefirsttwosourcesprovideinformationaboutzipcodes,suchasthenameofthecity,thestateandthetimezone.Theydifferintheirbindingconstraints,withthefirsttakingazipcodeasinput,andthesecondtakingastate.Thesecondsourcereturnsmanyoutputtuplesperinputvalue,makingithardertolearnthedefinition,eventhoughthetwosourcesprovidelogicallythesameinformation.Theinduceddefinitionsarethebestpossiblegiventheknownsourcesavailable.(Noneofthemprovidedthemissingoutputattribute,atele-phonearea-code.)Thethirdsourcecalculatesthedistanceinmilesbetweentwozipcodes,(itisthesameassource4fromsection2).Thecorrectdefinitionwaslearntforthissource,butforthenextsource,whichreturnedzipcodeswithinagivenradius,areasonabledefini-tioncouldnotbelearntwithinthetimelimit.12Ignoringbindingconstraints,theintended 12.Therecallforthisproblemis1/4becausetheinputattributedis1isdeterminedtobetheonlycorrectly generatedattribute(itisconstrainedwithrespecttotheoutputattributedis3),whileallfourattributesshouldhavebeengenerated(theoutputattributedis3isnotgeneratedbythe Carman&Knoblock definitionwasthesameasthethird,butwitharestrictionthattheoutputdistancebelessthantheinputdistance.Thusitwouldhavebeenfareasierforeidostolearnadefinitionforthefourthsourceintermsofthethird.Indeed,whenthenewdefinitionforthethirdsourcewasaddedtothesetofknownsources,thesystemwasabletolearnthefollowing: 4’GetZipCodesWithin($zip0,$dis1,zip2,dis3):-GetDistanceBetweenZipCodes($zip0,$zip2,dis3),<(dis3,dis1). Theabilityofthesystemtoimproveitslearningabilityovertimeasthesetofknownsourcesincreasesisakeybenefitoftheapproach. SourcefiveisageocodingserviceprovidedbyYahoo.(Geocodingservicesmapad-dressestolatitudeandlongitudecoordinates.)eidoslearntthatthesamefunctionalitywasprovidedbyaservicecalledUSGeocoder.Sourcesixisasimpleserviceprovidingthelatitude/longitudecoordinatesandthecityandstateforagivenzipcode.Interestingly,thesystemlearntthatthesource’scoordinateswerebetterpredictedbyaweathercondi-tionsservice(discussedinsection7.3.3),thanbytheGetCentroidsourcefromthethirddefinition.Notethatwhenthenewsourcedefinitionisunfoldeditwillcontainextrane-ouspredicatesrelatedtoweatherinformation.13Theadditionalpredicatesdonotinterferewiththeusefulnessofthedefinition,however,asaqueryreformulationalgorithmwillstillusethesourcetoanswerthesamequeriesregardless.(Thusprecisionandrecallscoresarenotaffected.)Apost-processingsteptoremoveextraneouspredicatesispossible,butwouldrequireadditionalinformationinthedomainmodel.14Theseventhsourceprovidedearthquakedatawithinaboundingboxwhichittookasinput.Inthiscase,thesystemdiscoveredthatthesourcewasindeedprovidingearthquakedata,(lat4andlon5aretheco-ordinatesoftheearthquakeanddec7isitsmagnitude).Itdidn’tmanage,however,toworkouthowtheinputcoordinatesrelatedtotheoutput.Thenextsourceprovidedelevationdatainfeet,whichwasfoundtobesufficientlysimilartoknownaltitudedatainmetres.Finally,thesystemlearntadefinitionforasourceprovidinginformationaboutcountriessuchasthecurrencyused,andthenameofthecapitalcity.Sinceknownsourceswerenotavailabletoprovidethisinformation,thesystemendeduplearningthatweatherreportswereavailableforthecapitalofeachcountry.Problem123456789 #Candidates2524888350401115176 #Invocations50689804111361117613148151621487717728784 Time(s)85914449253242831872559 log10(Score)-1.36-1.08-0.750.25-0.45-7.61-6.87-8.58-5.77 Precision4/44/43/31/16/65/53/33/34/4 Recall4/44/43/31/46/65/53/93/34/4 13.Theunfoldingisshownbelow.Theconditionspredicatecouldberemovedwithoutaffectingitsmeaning: GetCenter($zip0,lat1,lon2,cit3,sta4):-municipality(cit3,sta4,zip2,tim3),country(,cou5,), northAmerica(cou5),centroid(zip2,lat1,lon2),conditions(lat1,lon2,,,,,,,,,,,),timezone(tim3,,),municipality(cit3,sta4,zip0,). 14.Inparticular,universallyquantifiedknowledgewouldbeneededinthedomainmodel,e.g.: ∀lat,lon∃x1,...,x11s.t.conditions(lat,long,x1,...,x11) 34 LearningSemanticDefinitionsofInformationSourcesontheInternet Thetableshowssomedetailsregardingthesearchperformedtolearneachofthedefi-nitionslistedabove.Foreachproblem,itshowsthenumberofcandidatesgeneratedpriortothewinningdefinition,alongwiththetimeandnumberofsourceinvocationsrequiredtolearnthedefinition.(Thelasttwovaluesshouldbeinterpretedwithcautionastheyarehighlydependentonthedelayinaccessingsources,andonthecachingofdatainthesystem.)Thescoresshowninthefifthcolumnareanormalisedversionofthescoringfunctionusedtocomparethedefinitionsduringsearch.(Normalisationinvolvedremovingthepenaltyappliedformissingoutputs.)Thescorescanbeverysmall,sothelogarithmofthevaluesisshown(hencethenegativevalues).15Thesescorescanbeinterpretedastheconfidencethesystemhasinthedefinitionsproduced.Thecloserthevalueistozero,thebetterthedefinition’sabilitytoproducethesametuplesasthesource.Weseethatthesystemwasfarmoreconfidentaboutthedefinitionsonethroughfive,thanthelatterones.16Thelasttwocolumnsgivetheprecisionandrecallvalueforeachproblem.Theaverageprecisionfortheseproblemswas100%.(Notethatahighprecisionvalueistobeexpected,giventhattheinductionalgorithmreliesonfindingmatchingtuplesbetweenthesourceanddefinition.)Theaveragerecallforthegeospatialproblemswasalsoveryhighat84%. 7.3.2FinancialSources Twosourcesweretestedthatprovidedfinancialdata.Thedefinitionsgeneratedbyeidosforthesesourcesareshownbelow. 10GetQuote($tic0,pri1,dat2,tim3,pri4,pri5,pri6,pri7,cou8,_,pri10,_,_,pri13,_,com15):-YahooFinance($tic0,pri1,dat2,tim3,pri4,pri5,pri6,pri7,cou8), GetCompanyName($tic0,com15,_,_),Add($pri5,$pri13,pri10),Add($pri4,$pri10,pri1).11YahooExchangeRate($_,$cur1,pri2,dat3,_,pri5,pri6):-GetCurrentTime(_,_,dat3,_),GoCurrency(cur1,cou5,pri2), GoCurrency(_,cou5,pri5),Add($pri2,$pri5,pri12),Add($pri2,$pri6,pri12). Thefirstfinancialserviceprovidedstockquoteinformation,andthesystemlearntthatthesourcereturnedexactlythesameinformationasastockmarketserviceprovidedbyYahoo.Itwasalsoabletoworkoutthatthepreviousday’scloseplustoday’schangewasequaltothecurrentprice.Thesecondsourceprovidedtherateofexchangebetweenthecurrenciesgivenasinput.Inthiscase,thesystemdidnotfarewell.Itwasunabletolearntheintendedresult,whichinvolvedcalculatingtheexchangeratebytakingtheratioofthevaluesforthefirstandsecondcurrency.Problem1011 #Candidates2844367 #Invocations1667116749 Time(s)387282 log10(Score)-8.13-9.84 Precision12/131/5 Recall12/121/4 Detailsregardingthesearchspacesforthetwoproblemsareshownabove.Theaverageprecisionandrecallfortheseproblemsweremuchlowerat56%and63%respectively,becausethesystemwasunabletolearntheintendeddefinitioninthesecondproblem. 15.Thepositivevalueforproblem4resultsfromanapproximationerror. 16.Lowscoresandperfectprecisionandrecall(problems6,8and9)indicateverylittleoverlapbetween thetargetandtheknownsources.Thefactthatthesystemlearnsthecorrectdefinitioninsuchcasesistestimonytotherobustnessoftheapproach. 35 Carman&Knoblock 7.3.3WeatherSources OntheInternet,therearetwotypesofweatherinformationservices,thosethatprovideforecastsforcomingdays,andthosethatprovidedetailsofcurrentweatherconditions.Intheexperiments,apairofsuchservicesprovidedbyWeather.comwereusedtolearndefini-tionsforanumberofotherweathersources.Thefirstsetofdefinitions,whichcorrespondtosourcesthatprovidecurrentweatherconditions,arelistedbelow: 12NOAAWeather($ica0,air1,_,_,sky4,tem5,hum6,dir7,spe8,_,pre10,tem11,_,_):-GetAirportInfo($ica0,_,air1,cit3,_,_), WeatherForecast($cit3,_,_,_,_,_,_,_,_,_,_,_,sky4,dir7,_,_), WeatherConditions($cit3,_,_,_,_,_,_,_,_,tem5,sky4,pre33,_,spe8,hum6,tem11),ConvertIn2mb($pre33,pre10). 13WunderGround($sta0,$cit1,tem2,_,_,pre5,pre6,sky7,dir8,spe9,spe10):-WeatherConditions($cit1,sta0,_,_,_,_,_,_,dat8,tem9,sky7,pre5,dir8,spe13,_,tem2),WeatherForecast($cit1,sta0,_,_,_,_,_,_,tem24,_,_,_,_,_,spe10,_), ConvertIn2mb($pre5,pre6),<(tem9,tem24),ConvertTime($dat8,_,_,_,_),<(spe9,spe13).14WeatherBugLive($_,cit1,sta2,zip3,tem4,_,_,dir7,_,_):-WeatherConditions($cit1,sta2,_,_,_,_,_,_,_,tem4,_,_,dir7,_,_,_),GetZipcode($cit1,$sta2,zip3). 15WeatherFeed($cit0,$_,tem2,_,sky4,tem5,_,_,pre8,lat9,_,_):-WeatherConditions($cit0,_,_,lat9,_,_,_,_,_,_,sky4,pre8,dir12,_,_,tem5),WeatherForecast($cit0,_,_,_,_,_,_,_,_,tem2,_,_,_,dir12,_,_). 16WeatherByICAO($ica0,air1,cou2,lat3,lon4,_,dis6,_,sky8,_,_,_,_):-Altitude($lat3,$lon4,dis6),GetAirportInfo($ica0,_,air1,cit6,_,cou8),WeatherForecast($cit6,_,cou8,_,_,_,_,_,_,_,_,_,sky8,_,_,_),GetCountryName($cou2,cou8). 17WeatherByLatLon($_,$_,_,_,_,lat5,lon6,_,dis8,_,_,_,_,_,_):-Altitude($lat5,$lon6,dis8). Inthefirstproblem,thesystemlearntthatsource12providedcurrentconditionsatairports,bycheckingtheweatherreportforthecitiesinwhicheachairportwaslocated.Thisparticularproblemdemonstratessomeoftheadvantagesoflearningdefinitionsfornewsourcesdescribedinsection3.3.Oncethedefinitionhasbeenlearnt,ifamediatorreceivesarequestforthecurrentconditionsatanairport,itcangenerateananswerforthatquerybyexecutingasinglecalltothenewlymodeledsource,(withoutneedingtofindanearbycity).Thesystemperformedwellonthenextthreesources(13to15)learningdefinitionswhichcovermostoftheattributesofeach.Onthelasttwoproblems,thesystemdidnotperformaswell.Inthecaseofsource16,thesystemspentmostofitstimelearningwhichattributesoftheairportwerebeingreturned(suchasitscountry,coordinates,elevation,etc.).Inthelastcase,thesystemwasonlyabletolearnthatthesourcewasreturningsomecoordinatesalongwiththeirelevation.Wenoteherethatdifferentsourcesmayprovidedataatdifferentlevelsofaccuracy.Thusthefactthatthesystemisunabletolearnadefinitionforaparticularsourcecouldsimplymeanthatthedatabeingreturnedbythatsourcewasn’tsufficientlyaccurateforthesystemtolabelitamatch. Inadditiontocurrentweatherfeeds,thesystemwasrunontwoproblemsinvolvingweatherforecastfeeds.Itdidverywellonthefirstproblem,matchingallbaroneoftheattributes(thecountry)andfindingthattheorderofthehighandlowtemperatureswas 36 LearningSemanticDefinitionsofInformationSourcesontheInternet inverted.Itdidwellalsoforthesecondproblem,learningadefinitionforthesourcethatproducedmostoftheoutputattributes. 18YahooWeather($zip0,cit1,sta2,_,lat4,lon5,day6,dat7,tem8,tem9,sky10):-WeatherForecast($cit1,sta2,_,lat4,lon5,_,day6,dat7,tem9,tem8,_,_,sky10,_,_,_),GetCityState($zip0,cit1,sta2). 19WeatherBugForecast($_,cit1,sta2,_,day4,sky5,tem6,_):-WeatherForecast($cit1,sta2,_,_,_,tim5,day4,_,tem6,_,tim10,_,sky5,_,_,_),WeatherConditions($cit1,_,_,_,_,tim10,_,tim5,_,_,_,_,_,_,_,_). Detailsregardingthenumberofcandidatesgeneratedinordertolearndefinitionsforthedifferentweathersourcesareshownbelow.Theaverageprecisionofthedefinitionsproducedwas91%,whiletheaveragerecallwas62%.Problem1213141516171819 #Candidates27719899819910245119116 #Invocations579426249975494676691387614857 Time(s)23360593029248410267591217 log10(Score)-2.92-6.35-13.37-6.48-29.69-26.71-5.74-12.56 Precision8/96/95/55/66/73/310/105/5 Recall8/116/105/85/106/93/1310/115/7 7.3.4HotelSources DefinitionswerealsolearntforsourcesprovidinghotelinformationfromYahoo,GoogleandtheUSFireAdministration.Thesedefinitionsareshownbelow. 20USFireHotelsByCity($cit0,_,_,sta3,zip4,cou5,_):-HotelsByZip($zip4,_,_,cit0,sta3,cou5). 21USFireHotelsByZip($zip0,_,_,cit3,sta4,cou5,_):-HotelsByZip($zip0,_,_,cit3,sta4,cou5). 22YahooHotel($zip0,$_,hot2,str3,cit4,sta5,_,_,_,_,_):-HotelsByZip($zip0,hot2,str3,cit4,sta5,_). 23GoogleBaseHotels($zip0,_,cit2,sta3,_,_,lat6,lon7,_):-WeatherConditions($cit2,sta3,_,lat6,lon7,_,_,_,_,_,_,_,_,_,_,_),GetZipcode($cit2,$sta3,zip0). Thesystemperformedwellonthreeoutofthefourproblems.Itwasunableinthetimeallocatedtodiscoveradefinitionforthehotelattributes(name,street,latitudeandlongi-tude)returnedbytheGoogleWebService.Theaverageprecisionfortheseproblemswas90%whiletheaveragerecallwas60%.Problem20212223 #Candidates16164395 #Invocations448189431374931 Time(s)4852821161 37 log10(Score)-4.00-2.56-2.81-7.50Precision4/44/45/53/5Recall4/64/65/93/6 Carman&Knoblock 7.3.5CarsandTrafficSources ThelastproblemsonwhichthesystemwastestedwereapairoftrafficrelatedWebServices.Thefirstservice,providedbyYahoo,reportedlivetrafficdata(suchasaccidentsandconstructionwork)withinagivenradiusoftheinputzipcode.Noknownsourceswereavailablewhichprovidedsuchinformation,sonotsurprisingly,thesystemwasunabletolearnadefinitionforthetrafficrelatedattributesofthatsource.(Instead,thesystemdiscoveredarelationshipbetweentheinputzipcodeandtheoutputlongitudethatwasn’tcorrect,soprecisionforthisproblemwaszero.) 24YahooTraffic($zip0,$_,_,lat3,lon4,_,_,_):-GetCentroid($zip0,_,lon4),CountryCode($lat3,$lon4,_).25YahooAutos($zip0,$mak1,dat2,yea3,mod4,_,_,pri7,_):-GoogleBaseCars($zip0,$mak1,_,mod4,pri7,_,_,yea3), ConvertTime($dat2,_,dat10,_,_),GetCurrentTime(_,_,dat10,_). Thesecondprobleminvolvedaclassifiedused-carlistingfromYahoothattookazipcodeandcarmanufacturerasinput.eidoswasabletolearnagooddefinitionforthatsource,takingadvantageofthefactthatsomeofthesamecars(definedbytheirmake,model,yearandprice)werealsolistedforsaleonGoogle’sclassifiedcarlisting.Problem2425 #Candidates 8155 #Invocations 29974405 Time(s)1065815 log10(Score)-11.21-5.29 Precision0/36/6 Recall0/46/6 Sincethesystemfailedonthefirstproblem(itfoundsomeincorrect/non-generalrelation-shipsbetweendifferentattributes),butsucceededonthesecondproblemtofindthebestpossibledefinition,theaverageprecisionandrecallfortheseproblemswereboth50%.7.3.6OverallResults Acrossthe25problems,eidosmanagedtogeneratedefinitionswithhighaccuracy(averageprecisionwas88%)andalargenumberofattributes(averagerecallwas69%).Theseresultsarepromising,especiallyconsideringthatallproblemsinvolvedrealdatasourceswithinsomecasesverysmalloverlapbetweenthedataproducedbythetargetandthatprovidedbytheknownsources(asevidencedbylowlogarithmicscores).Inadditiontominimaloverlap,manysourcesprovidedincompletetuples(i.e.tuplescontainingmultiple“NULL”or“N/A”values)aswellaserroneousorinaccuratedata,makingtheproblemallthemoredifficult.ThehighaverageprecisionandrecallleadustobelievethattheJaccardmeasureisdoingagoodjobofdistinguishingcorrectfromincorrectdefinitionsinthepresenceofdatasourcesthatarebothnoisy(inconsistent)andincomplete(missingtuplesandvalues). Comparingthedifferentdomains,onecanseethatthesystemperformedbetteronproblemswithfewerinputandoutputattributes(suchasthegeospatialproblems),whichwastobeexpectedgiventhattheresultingsearchspaceismuchsmaller. 38 LearningSemanticDefinitionsofInformationSourcesontheInternet 7.4EmpiricalComparison Havingdemonstratedtheeffectivenessofeidosinlearningdefinitionsforrealinformationservices,wenowshowthatthesystemiscapableofhandlingthesameproblemsasawell-knowncomplexschemamatchingsystem. TheiMAPsystem(Dhamanka,Lee,Doan,Halevy,&Domingos,2004)(discussedinsection8.4)isaschemamatcherthatcanlearncomplex(many-to-one)mappingsbetweentheconceptsofasourceandatargetschema.Itusesasetofspecialpurposesearcherstolearndifferenttypesofmappings.Theeidossystem,ontheotherhand,usesagenericsearchalgorithmtosolveacomparableproblem.Sincethetwosystemscanbemadetoperformasimilartask,weshowthateidosiscapableofrunningononeoftheproblemdomainsusedintheevaluationofiMAP.Wechosetheparticulardomainofonlinecricketdatabasesbecauseitistheonlyoneusedintheevaluationthatinvolvedaligningdatafromtwoindependentdatasources.(Allotherproblemsinvolvedgeneratingsyntheticdatabysplittingasingledatabaseintoasourceandtargetschema,whichwouldnothavebeenasinterestingforeidos.) Playerstatisticsfromtwoonlinecricketdatabases(cricketbase.comandcricinfo.com)wereusedintheexperiments.Sinceneitherofthesourcesprovidedprogrammaticaccesstotheirdata,thestatisticsdatawasextractedfromHTMLpagesandinsertedintoarelationaldatabase.Theextractionprocessinvolvedflatteningthedataintoarelationalmodelandasmallamountofdatacleaning.(TheresultingtablesaresimilarbutnotnecessarilyexactlythesameasthoseusedintheiMAPexperiments.)Thedatafromthetwowebsiteswasusedtocreatethreedatasourcesrepresentingeachwebsite.Thethreesourcesrepresentingcricinfo.comwerethenusedtolearndefinitionsforthethreesourcesrepresentingcricketbase.com.Otherknownsourceswereavailabletothesystem,includingfunctionalityforsplittingapartcomma-separatedlists,addingandmultiplyingnumbers,andsoon.Thedefinitionslearnttodescribethecricketbaseservicesareshownbelow: 1CricbasePlayers($cou0,nam1,_,dat3,_,unk5,unk6):-CricinfoPlayer($nam1,dat3,_,_,_,lis5,nam6,unk5,unk6),contains($lis5,cou0),CricinfoTest($nam6,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_). 2CricbaseTest($_,nam1,cou2,_,_,_,cou6,cou7,dec8,cou9,_,_,_,cou13,cou14,dec15,_,dec17,cou18,_,_):-CricinfoTest($nam1,_,_,cou2,cou6,cou18,cou7,_,dec8,dec15,_,cou9,_,_,_,cou14,cou13,_,_,_,_,dec17). 3CricbaseODI($_,nam1,cou2,_,_,_,cou6,cou7,dec8,cou9,cou10,_,cou12,cou13,_,dec15,dec16,dec17,cou18,cou19,_):-CricinfoODI($nam1,_,_,cou2,cou6,_,cou10,_,dec8,_,cou7,cou18,cou19,cou9,_,_,cou13,cou12,dec15,_,dec16,dec17). Thefirstsourceprovidedplayerprofilesbycountry.Thesecondandthirdsourcesprovideddetailedplayerstatisticsfortwodifferenttypesofcricket(TestandOne-Day-Internationalrespectively).Thesystemeasilyfoundthebestdefinitionforthefirstsource.Thedefinitioninvolvedlookingfortheplayer’scountryinalistofteamsthatheplayedfor.eidosdidnotperformquiteaswellonthesecondandthirdproblems.Thereweretworeasonsforthis.Firstly,thearityofthesesourceswasmuchhigherwithmanyinstancesofthesamesemantictype(countanddecimal),makingthespaceofpossiblealignmentsmuchlarger. 39 Carman&Knoblock (Becauseofthelargesearchspace,alongertimeoutof40minuteswasused.)Secondly,ahighfrequencyofnullvalues(theconstant“N/A”)inthedataforsomeofthefieldsconfusedthealgorithm,andmadeitharderforittodiscoveroverlappingtupleswithallofthedesiredattributes.Problem123 #Candidates19911623114 #Invocations376215174299 Time(s)43213192127 log10(Score)-3.95-4.70-6.28 Precision5/58/118/14 Recall5/58/168/16 Detailsofthesearchperformedtolearnthedefinitionsareshownabove.Theaverageprecisionfortheseproblemswas77%whiletheaveragerecallwaslowerat66%.ThesevaluesarecomparabletothequalityofthematchingsreportedforiMAP.17Theseresultsareverygood,consideringthateidossearchesinthespaceofmany-to-manycorrespondences,(tryingtodefinethesetoftargetattributescontemporaneously),whileiMAPsearchesthespacesofone-to-oneandmany-to-onecorrespondences.Moreover,eidosfirstinvokesthetargetsourcetogeneraterepresentativedata(atasknotperformedbyiMAP)andthenperformsagenericsearchforreasonabledefinitionswithoutrelyingonspecialisedsearchalgorithmsfordifferenttypesofattributes(asisdoneiniMAP). 8.RelatedWork InthissectionwedescribehowtheworkinthispaperrelatestoresearchperformedbytheMachineLearning,DatabaseandtheSemanticWebcommunities.Beforedoingthat,wedescribesomeearlyworkperformedbytheArtificialIntelligencecommunity.WealsodiscusshowouralgorithmdiffersfromstandardILPtechniques,andinparticularwhyadirectapplicationofsuchtechniqueswasnotpossibleforourproblem.8.1AnEarlyApproach ThefirstworkconcernedwithlearningmodelsfordescribingoperationsavailableontheInternetwasperformed(inthepre-XMLera)onaproblemcalledcategorytranslation(Perkowitz&Etzioni,1995;Perkowitz,Doorenbos,Etzioni,&Weld,1997).Thisproblemconsistedofanincompleteinternalworldmodelandanexternalinformationsourcewiththegoalbeingtocharacterizetheinformationsourceintermsoftheworldmodel.TheworldmodelconsistedofasetofobjectsO,whereeachobjecto∈Obelongedtoacertaincategory(e.g.people)andwasassociatedwithasetofattributesa1(o),...,an(o),madeupofstringsandotherobjects.Asimplerelationalinterpretationofthisworldmodelwouldconsidereachcategorytobearelation,andeachobjecttobeatuple.Theinformationsource,meanwhile,wasanoperationthattookinasinglevalueasinputandreturnedasingletupleasoutput.Thecategorytranslationproblemcanbeviewedasasimplificationofthesourcedefinitioninductionproblem,whereby: 17.Theactualvaluesforprecisionandrecallinthecricketdomainarenotquoted,butanaccuracyrange of68-92%forsimplematches(one-to-onecorrespondencesbetweensourceandtargetfields)and50-86%forcomplexmatches(many-to-onecorrespondences)acrosssyntheticandrealproblemswasgiven. 40 LearningSemanticDefinitionsofInformationSourcesontheInternet •Theextensionsoftheglobalrelationsareexplicit.(Thereisonesourceperglobalrelation,anditdoesn’thavebindingconstraints,i.e.R=S.) •Theinformationprovidedbythesourcesdoesnotchangeovertime. •Thenewsourcetakesasinglevalueasinputandreturnsasingletupleasoutput.Inordertofindsolutionstoinstancesofthecategorytranslationproblem,theauthorsem-ployedavariantofrelationalpath-finding(Richards&Mooney,1992),whichisanextensiononthefoilalgorithm,tolearnmodelsoftheexternalsource.Thetechniquedescribedinthispaperforsolvinginstancesofthesourceinductionproblemissimilarinthatittooisbasedonafoil-likeinductivesearchalgorithm.8.2DirectApplicationofILPtechniques ResearchersbecameinterestedinthefieldofInductiveLogicProgrammingintheearlynineties,resultinginanumberofdifferentILPsystemsbeingdevelopedincludingfoil(Cameron-Jones&Quinlan,1994),progol(Muggleton,1995)andaleph18.Ideally,onewouldliketoapplysuch“off-the-shelf”ILPsystemstothesourcedefinitioninductionproblem.Anumberofissues,however,limitthedirectapplicabilityofthesesystems.Theissuescanbesummarisedasfollows: •••• Extensionsoftheglobalrelationsarevirtual. Sourcesmaybeincompletewithrespecttotheirdefinitions.Explicitnegativeexamplesofthetargetarenotavailable.Sourcesmayserialiseconstantsindifferentways. ThefirstissuehastodowiththefactthatallILPsystemsassumethatthereisanextensionaldefinitionofthetargetpredicateandextensional(orinsomecasesintentional)definitionsofthe(source)predicatesthatwillbeusedinthedefinitionforthetarget.Inotherwords,theyassumethattablesalreadyexistinsomerelationaldatabasetorepresentboththenewsourceandtheknownsources.Inourcase,weneedtogeneratesuchtablesbyfirstinvokingtheserviceswithrelevantinputs.Onecouldenvisageinvokingeachofthesourceswitheverypossibleinputandusingtheresultingtablestoperforminduction.Suchadirectapproachwouldnotbefeasiblefortworeasons.Firstly,acompletesetofpossibleinputvaluesmaynotbeknowntothesystem.Secondly,evenifitispossibletogenerateacompletesetofviableinputstoaservice,itmaynotbepracticaltoquerythesourcewithsuchalargesetoftuples.Considersource4fromsection2,whichcalculatesthedistanceinmilesbetweentwozipcodes.Giventhatthereareover40,000zipcodesintheUS,generatinganextensionalrepresentationofthissourcewouldrequireperformingmorethanabillioninvocations!Performingsuchalargenumberofinvocationsdoesnotmakesensewhenasmallnumberofexampleinvocationswouldsufficeforcharacterisingthefunctionalityofthesource.Inthispaperwehavedevelopedanefficientalgorithmthatonlyqueriesthesourcesasneededinordertoevaluateindividualcandidatedefinitions. Thesecondissueregardingtheincompletenessofthesourcescausesaproblemwhenacandidateistobeevaluated.Sincethesetoftuplesreturnedbyeachknownsourcemayonly 18.SeetheAlephManualbyAshwinSrinivasan,whichisavailableat: http://web.comlab.ox.ac.uk/oucl/research/areas/machlearn/Aleph/aleph.html 41 Carman&Knoblock beasubsetofthoseimpliedbyitsowndefinition,sotoowillbethesetoftuplesreturnedbythecandidatehypothesiswhenexecutedagainstthosesources.Thismeansthatwhenthesystemtriestoevaluateahypothesisbycomparingthosetupleswiththeoutputofthenewsource,itcannotbesurethatatuplewhichisproducedbythenewsourcebutnotbythehypothesisisinfactnotlogicallyimpliedbyit.Thisfactistakenintoaccountinourevaluationfunctionforscoringcandidatedefinitions,discussedinsection5. Thethirdissueregardingthelackofexplicitnegativeexamplesforthetargetpredicatealsoaffectstheevaluationofcandidatehypotheses.Theclassicapproachtodealingwiththisproblemistoassumeaclosedworld,inwhichalltuples(overtheheadrelation)whicharenotexplicitlydeclaredtobepositivemustbenegative.Sincethenewsourcemayinfactbeincompletewithrespecttothebestpossibledefinitionforit,thisassumptiondoesnotnecessarilyhold.Inotherwords,justbecauseaparticulartupleisproducedwhenthecandidatedefinitionisexecutedandthatsametupleisnotreturnedbythenewsourcedoesnotnecessarilymeanthatthecandidatedefinitionisincorrect. Thefourthissuehastodowiththefactthatthedataprovidedbydifferentsourcesmayneedtobereconciled,inthesensethatdifferentserialisationsof(stringsrepresenting)thesamevalue(suchas“Monday”and“Mon”forinstance)mustberecognized.SinceILPsystemshavebeendesignedtooperateoverasingledatabasecontainingmultipletables,theissueofheterogeneityinthedataisnothandledbycurrentsystems.Insection5.4wediscussedhowthisheterogeneityisresolvedinoursystem.8.3MachineLearningApproaches SincetheadventofservicesontheInternet,researchershavebeeninvestigatingwaystomodelthemautomatically.Primarily,interesthascenteredonusingmachinelearningtechniquestoclassifytheinputandoutputtypesofaservice,soastofacilitateservicediscovery.Heß&KushmerickproposedusingaSupportVectorMachinetoclassifytheinputandoutputattributesintodifferentsemantictypesbasedonmetadataininterfacedescriptions(Heß&Kushmerick,2003,2004).Theirnotionofsemantictypes(suchaszipcode)asopposedtosyntactictypes(likeinteger)wentsomewaytowarddefiningthefunctionalitythatasourceprovides.Recently,otherresearchers(Lermanetal.,2006)proposedtheuseoflogisticregressionforassigningsemantictypestoinputparametersbasedonmetadata,andapatternlanguageforassigningsemantictypestotheoutputparametersbasedonthedatathesourceproduces.Thisworkonclassifyinginputandoutputattributesofaservicetosemantictypesformsaprerequisitefortheworkinthisarticle.Forthepurposesofthispaper,wehaveassumedthatthisproblemhasbeensolved. Inadditiontoclassifyingtheinput/outputattributesofservices,Heß&Kushmerickinvestigatedtheideaofclassifyingtheservicesthemselvesintodifferentservicetypes.Moreprecisely,theyusedthesameclassificationtechniquestoassignserviceinterfacestodiffer-entsemanticdomains(suchasweatherandflights)andtheoperationsthateachinterfaceprovidestodifferentclassesofoperation(suchasweatherForecastandflightStatus).Theresultingsourcedescription(hypothesis)languageislimitedtoselect-projectqueries,whicharenotsufficientlyexpressivetodescribemanyofthesourcesavailableontheInternet.Accordingtothatapproach,sinceeveryoperationmustbecharacterizedbyaparticu-laroperationclass,operationsthatprovideoverlapping(non-identical)functionalitywould 42 LearningSemanticDefinitionsofInformationSourcesontheInternet needtobeassigneddifferentclassesaswouldoperationswhichprovidecomposedfunction-ality(suchas,forexample,anoperationthatprovidesbothweatherandflightdata).Theneedforanexhaustivesetofoperationclasses(andaccompanyingtrainingdata)isamajorlimitationofthatapproach,notsharedbytheworkdescribedinthispaper,whichreliesonamoreexpressivelanguagefordescribingserviceoperations. Onewaytoeliminatetheneedforapredefinedsetofoperationtypesistouseunsuper-visedclusteringtechniquestogeneratethe(operation)classesautomaticallyfromexamples(WSDLdocuments).ThisideawasimplementedinasystemcalledWoogle(Dongetal.,2004).Thesystemclusteredserviceinterfacestogetherusingasimilarityscorebasedontheco-occurrenceofmetadatalabels.Itthentookadvantageoftheclustersproducedtoim-provekeyword-basedsearchforWebServices.Anadvantageofthisunsupervisedapproachisthatnolabeledtrainingdataisrequired,whichcanbetime-consumingtogenerate.Suchclusteringapproaches,however,whileusefulforservicediscovery,sufferthesamelimitationsasthepreviousapproachwhenitcomestoexpressiveness.8.4DatabaseApproaches Thedatabasecommunityhaslongbeeninterestedintheproblemofintegratingdatafromdisparatesources.Specifically,intheareasofdatawarehousing(Widom,1995)andin-formationintegration(Wiederhold,1996),researchersareinterestedinresolvingsemanticheterogeneitywhichexistsbetweendifferentdatabasessothatthedatacanbecombinedoraccessedviaasingleinterface.Theschemamappingproblemistheproblemofdeterminingamappingbetweentherelationscontainedinasourceschemaandaparticularrelationinatargetschema.Amappingdefinesatransformationwhichcanbeusedtopopulatethetargetrelationwithdatafromthesourceschema.Mappingsmaybearbitrarilycomplexprocedures,butingeneraltheywillbedeclarativequeriesinSQLorDatalog.Thecomplex-ityofthesequeriesmakestheschemamappingproblemfarmoredifficultthanthehighlyinvestigatedschemamatchingproblem(Rahm&Bernstein,2001),whichinvolvesfinding1-to-1correspondencesbetweenfieldsofasourceandtargetschema. Thesourcedefinitioninductionproblemcanbeviewedasatypeofschemamappingproblem,inwhichtheknownsourcesdefinethesourceschemaandtheunknownsourcespecifiesthetargetrelation.Inordertosolveaschemamappingproblem,onetypicallytakesadvantageofallavailableauxiliaryinformation(includingsourceandtargetdatainstances,labelsfromtherespectiveschemas,andsoon).Suchproblemsaregenerallysimpler,however,becausethedata(theextensionsoftherelations)inthesourceandtargetschemaareusuallyexplicitlyavailable.Insourceinduction,thatdataishiddenbehindaserviceinterface,whichhasbindingconstraints,andthedataitselfcanbeextremelylargeoreven(inthecaseofsourcesprovidingmathematicalfunctions)infinite.Thusmakingtheproblemconsiderablymoredifficult. TheschemaintegrationsystemCLIO(Yan,Miller,Haas,&Fagin,2001)helpsusersbuildSQLqueriesthatmapdatafromasourcetoatargetschema.InCLIO,foreignkeysandinstancedataareusedtogenerateintegrationrulessemi-automatically.SinceCLIOreliesheavilyonuserinvolvement,itdoesnotmakesensetocompareitdirectlywiththeautomatedsystemdevelopedinthispaper. 43 Carman&Knoblock Anothercloselyrelatedproblemisthatofcomplexschemamatching,thegoalofwhichistodiscovercomplex(many-to-one)mappingsbetweentworelationaltablesorXMLschemas.Thisproblemisfarmorecomplicatedthanbasic(one-to-one)schemamatchingbecause: •ThespaceofpossiblecorrespondencesbetweentherelationsisnolongertheCartesianproductofthesourceandtargetrelations,butthepowersetofthesourcerelationtimesthetargetrelation. •Many-to-onemappingsrequireamappingfunction,whichcanbesimplelikeconcate-nate(x,y,z),oranarbitrarilycomplexformulasuchasz=x2+y. TheiMAPsystem(Dhamankaetal.,2004)triestolearnsuchmany-to-onemappingsbe-tweentheconceptsofasetofsourcerelationsandatargetrelation.Itusesasetofspecialpurposesearcherstolearndifferenttypesofmappings(suchasmathematicalexpressions,unitconversionsandtime/datemanipulations).Itthenusesameta-heuristictocontrolthesearchbeingperformedbythedifferentspecialpurposesearchers.Ifoneviewsboththesourceschemaandthefunctionsavailableforuseinthemappings(suchasconcate-nate(x,y,z),add(x,y,z),etc.)asthesetofknownsourcesinthesourcedefinitioninductionproblem,thenthecomplexschemamatchingandsourceinductionproblemsaresomewhatsimilar.Themaindifferencesbetweentheproblemsare: •Thedataassociatedwiththesourceschemaisexplicit(andstatic)incomplexschemamatching,whileitishidden(anddynamic)insourceinduction. •Ingeneral,thesetofknownsourcesinasourceinductionproblemwillbemuchlarger(andthedatatheyprovidemaybelessconsistent),thanthesetofmappingfunctionsandsourcerelationsinacomplexschemamatchingproblem. Inthispaperwedevelopageneralframeworkforhandlingthesourceinductionproblem.SinceiMAPprovidesfunctionalitywhichissimilartothatofoursystem,weperformasimpleempiricalcomparisoninsection7.4.8.5SemanticWebApproach ThestatedgoaloftheSemanticWeb(Berners-Lee,Hendler,&Lassila,2001)istoenablemachineunderstandingofWebresources.Thisisdonebyannotatingthoseresourceswithsemanticallymeaningfulmetadata.ThustheworkdescribedinthispaperisverymuchinlinewiththeSemanticWeb,insofarasweareattemptingtodiscoversemanticallymeaningfuldefinitionsforonlineinformationsources.Defactostandardsforannotatingserviceswithsemanticmarkuphavebeenaroundforanumberofyears.Thesestandardsprovideserviceownerswithametadatalanguageforaddingdeclarativestatementstoserviceinterfacedescriptionsinanattempttodescribethesemanticsofeachserviceintermsofthefunctionality(e.g.abookpurchaseoperation)ordata(e.g.aweatherforecast)thatitprovides.Workontheselanguagesisrelatedtothisarticlefromtwoperspectives: •Itcanbeviewedasanalternativeapproachtogainingknowledgeastothesemanticsofanewlydiscoveredsource(providingithassemanticmetadataassociatedwithit).•SemanticWebServiceannotationlanguagescanbeseenasatargetlanguageforthesemanticdescriptionslearntinthispaper. 44 LearningSemanticDefinitionsofInformationSourcesontheInternet IfaWebServiceisalreadysemanticallyannotated,heterogeneitymaystillexistbetweentheontologyusedbytheserviceproviderandthatusedbytheconsumer,inwhichcasethelearningcapabilitiesdescribedinthispapermayberequiredtoreconcilethosedifferences.Moreimportantly,weareinterestedinthevastnumberofsourcesforwhichsemanticmarkupiscurrentlyunavailable.TheworkinthisarticlecomplementsthatoftheSemanticWebcommunitybyprovidingawayofautomaticallyannotatingsourceswithsemanticinformation;therebyrelievingserviceprovidersoftheburdenofmanuallyannotatingtheirservices.Oncelearnt,DatalogsourcedefinitionscanbeconvertedtoDescriptionLogic-basedrepresentationssuchasisusedinOWL-S(Martin,Paolucci,McIlraith,Burstein,McDermott,McGuinness,Parsia,Payne,Sabou,Solanki,Srinivasan,&Sycara,2004)andWSMO(Roman,Keller,Lausen,deBruijn,Lara,Stollberg,Polleres,Feier,Bussler,&Fensel,2005).ThereasonweuseDataloginthispaper(ratherthanDescriptionLogics)isthatmostmediator-basedintegrationsystemsrelyonitasarepresentationlanguage. 9.Discussion Inthispaperwehavepresentedacompletelyautomaticapproachtolearningdefinitionsforonlineservices.Ourapproachexploitsthedefinitionofsourcesthathaveeitherbeengiventothesystemorlearnedpreviously.Theresultingframeworkisasignificantadvanceoverpriorapproachesthathavefocusedonlearningonlytheinputsandoutputsortheclassofaservice.Wehavedemonstratedempiricallytheviabilityoftheapproach. Thekeycontributionofthisarticleisaprocedureforlearningsemanticdefinitionsforonlineinformationservicesthatis: •Fullyautomated:Definitionsarelearntinacompletelyautomatedmannerwithouttheneedforanyuserintervention. •Moreexpressive:Thequerylanguagefordefiningsourcesisthatofconjunctivequeries,whichisfarmoreexpressivethanpreviousattribute-valueapproaches. •Sufficientlyrobust:Theprocedureisabletolearndefinitionsinthepresenceofnoisyandincompletedata,andthusissufficientlyrobusttohandlerealdatasources. •Data-accessefficient:Theproceduresamplesdatafromlivesources,invokingthemsparinglyandonlyasrequired,makingithighlyefficientintermsofsourceaccesses.•Evolving:Theprocedure’sabilitytolearndefinitionsimprovesovertimeaseachnewdefinitionislearntandaddedtothesetofknownsources.9.1ApplicationScenarios Thereareanumberofdifferentapplicationscenariosforasystemthatiscapableoflearningdefinitionsforonlinesources.Theygenerallyinvolveprovidingsemanticdefinitionstodataintegrationsystems,whichthenexploitandintegratetheavailablesources. Themostobviousapplicationforourworkwouldbeasystem(depictedontheleftsideofFigure1)thatcrawlstheWeb,searchingforinformationsources.Uponfindingasource,thesystemwoulduseaclassifiertoassignsemantictypestoit,followedbytheinductivelearnertogenerateadefinitionforit.ThedefinitioncouldthenbeusedtoannotatethesourcefortheSemanticWeb,orbyamediatorforansweringqueries.Importantly,thisentireprocesscouldrunwithminimaluserinvolvement. 45 Carman&Knoblock Figure1:Architecturediagramsforthreedifferentapplicationscenarios. Amorechallengingapplicationscenario(showninthecenterofFigure1)wouldinvolvereal-timeservicediscovery.Considerthecasewhereamediatorisunabletoansweraparticularquerybecausethedesiredinformationliesoutofscopeofthesourcesavailable.Asearchisthenperformedbasedonthe“missingconjuncts”(relationnamesandconstants)fromthequeryusingaspecialisedWebServicesearchengine,suchasWoogle(Dongetal.,2004).Theservicesreturnedwouldbeannotatedwithsemantictypesand,ifpossible,semanticdefinitions.Afterthedefinitionsareprovidedtothemediator,itwouldcompletethequeryprocessingandreturnananswertotheuser.Thisscenariomayseemalittlefar-fetcheduntiloneconsidersaspecificexample:imagineauserinteractingwithageospatialbrowser(anonlineatlas).Iftheuserturnsonaparticularinformationlayer,suchasskiresorts,butnosourceisavailableforthecurrentfieldofview(of,forinstance,Italy),thennoresultswouldbedisplayed.Inthebackgroundasearchcouldbeperformedandanewsourcediscovered,whichprovidesskiresortsalloverEurope.Therelevantdatacouldthenbedisplayed,withtheuserunawarethatasearchhasbeenperformed. Perhapsthemostlikelyapplicationscenario(totherightofFigure1)forasourceinductionsystemwouldbeamixedinitiativeone.Inthiscaseahumanwouldannotatethedifferentoperationsofaserviceinterfacewithsemanticdefinitions.Atthesametime,thesystemwouldattempttoinducedefinitionsfortheremainingoperations,andprompttheuserwithsuggestionsforthem.Inthisscenariotheclassifiermaynotbeneeded,sinceattributesofthesamenameinthedifferentoperationswouldlikelyhavethesamesemantictype.Moreover,sincethedefinitionslearntbythesystemmayinsomecasescontainerroneousorsuperfluouspredicates,theusercouldalsobeinvolvedinaprocessofcheckingandimprovingthedefinitionsdiscovered. 46 LearningSemanticDefinitionsofInformationSourcesontheInternet 9.2OpportunitiesforFurtherResearch Anumberoffuturedirectionsforthisworkwillallowthesetechniquestobeappliedmorebroadly.Wenowdiscusstwosuchdirections,improvingthesearchalgorithmandextendingthequerylanguage. Asthenumberofknownsourcesgrows,sotoowillthesearchspace,anditwillbenec-essarytodevelopadditionalheuristicstobetterdirectthesearchtowardthebestdefinition.ManyheuristictechniqueshavebeendevelopedintheILPcommunityandsomemaybeapplicabletothesourceinductionproblem.Morepressingperhapsistheneedtodeveloparobustterminationconditionforhaltingthesearchoncea“sufficientlygood”definitionhasbeendiscovered.Asthenumberofavailablesourcesincreases,thesimpletimeoutusedintheexperimentswillbeineffectiveascertain(morecomplicated)definitionswillnecessarilytakelongertolearnthanothers. Anotherwaytoincreasetheapplicabilityofthisworkistoextendthequerylanguagesothatitbetterdescribesthesourcesavailable.Oftenonlinesourcesdonotreturnacompletesetofresultsbutrathercutoffthelistatsomemaximumcardinality.ForexampletheYahooHotelsourcedescribedinsection7.3.4returnsamaximumof20hotelsnearagivenlocation,andordersthemaccordingtodistance.Inthiscase,recognisingthespecificorderingonthetuplesproducedwouldbeveryusefultoamediator.Asecondusefulextensiontothequerylanguagewouldbetheabilitytodescribesourcesusingtheproceduralconstructif-then-else.Thisconstructisneededtodescribethebehaviourofsomesourcesoncertaininputs.Forexample,considertheYahooGeocoderfromsection7.3.1,whichtakesasinputatuplecontainingastreetname,number,andzipcode.Ifthegeocoderisunabletolocatethecorrespondingaddressinitsdatabase(becauseitdoesn’texist),insteadofreturningnotuples,itreturnsthecentroidofthezipcode.Describingsuchbehaviorisonlypossibleusingproceduralconstructs. Acknowledgments ThisresearchisbaseduponworksupportedinpartbytheDefenseAdvancedResearchProjectsAgency(DARPA),throughtheDepartmentoftheInterior,NBC,AcquisitionSer-vicesDivision,underContractNo.NBCHD030010.TheU.S.GovernmentisauthorizedtoreproduceanddistributereportsforGovernmentalpurposesnotwithstandinganycopy-rightannotationthereon.Theviewsandconclusionscontainedhereinarethoseoftheauthorsandshouldnotbeinterpretedasnecessarilyrepresentingtheofficialpoliciesorendorsements,eitherexpressedorimplied,ofanyoftheaboveorganizationsoranypersonconnectedwiththem. References Afrati,F.N.,Li,C.,&Mitra,P.(2004).Oncontainmentofconjunctivequeriesusingarith-meticcomparisions.In9thInternationalConferenceonExtendingDatabaseTechnol-ogy(EDBT2004)Heraklion-Crete,Greece. 47 Carman&Knoblock Arens,Y.,Knoblock,C.A.,&Shen,W.-M.(1996).Queryreformulationfordynamic informationintegration.JournalofIntelligentInformationSystems-SpecialIssueonIntelligentInformationIntegration,6(2/3),99–130.Berners-Lee,T.,Hendler,J.,&Lassila,O.(2001).Thesemanticweb.ScientificAmerican, 284(5),34–43.Bilenko,M.,Mooney,R.J.,Cohen,W.W.,Ravikumar,P.,&Fienberg,S.E.(2003). Adaptivenamematchingininformationintegration..IEEEIntelligentSystems,18(5),16–23.Cameron-Jones,R.M.,&Quinlan,J.R.(1994).Efficienttop-downinductionoflogic programs.SIGARTBulletin,5(1),33–42.Carman,M.J.,&Knoblock,C.A.(2007).Learningsemanticdescriptionsofwebinfor-mationsources.InProceedingsoftheTwentiethInternationalJointConferenceonArtificialIntelligence(IJCAI-07)Hyderabad,India.Chandra,A.K.,&Merlin,P.M.(1977).Optimalimplementationofconjunctivequeries inrelationaldatabases.InProceedingsofthe9thACMSymposiumonTheoryofComputing(STOC),pp.77–90Boulder,Colorado.Dhamanka,R.,Lee,Y.,Doan,A.,Halevy,A.,&Domingos,P.(2004).imap:Discovering complexsemanticmatchesbetweendatabaseschemas.InSIGMOD’04:Proceedingsofthe2004ACMSIGMODInternationalConferenceonManagementofData.Dong,X.,Halevy,A.Y.,Madhavan,J.,Nemes,E.,&Zhang,J.(2004).Simlaritysearch forwebservices.InProceedingsofVLDB.Duschka,O.M.(1997).QueryPlanningandOptimizationinInformationIntegration.Ph.D. thesis,DepartmentofComputerScience,StanfordUniversity.Garcia-Molina,H.,Hammer,J.,Ireland,K.,Papakonstantinou,Y.,Ullman,J.,&Widom, J.(1995).Integratingandaccessingheterogeneousinformationsourcesintsimmis.InProceedingsoftheAAAISymposiumonInformationGathering,pp.61-64.Heß,A.,&Kushmerick,N.(2003).Learningtoattachsemanticmetadatatowebservices. In2ndInternationalSemanticWebConference(ISWC).Heß,A.,&Kushmerick,N.(2004).Iterativeensembleclassificationforrelationaldata: Acasestudyofsemanticwebservices.In15thEuropeanConferenceonMachineLearning(ECML2004)Pisa,Italy.Springer.Knoblock,C.A.,Minton,S.,Ambite,J.L.,Ashish,N.,Muslea,I.,Philpot,A.,&Tejada, S.(2001).Theariadneapproachtoweb-basedinformationintegration.InternationalJournalofCooperativeInformationSystems,10(1-2),145–169.Lerman,K.,Plangprasopchok,A.,&Knoblock,C.A.(2006).Automaticallylabelingdata usedbywebservices.InProceedingsofthe21stNationalConferenceonArtificialIntelligence(AAAI). 48 LearningSemanticDefinitionsofInformationSourcesontheInternet Levy,A.Y.(2000).Logic-basedtechniquesindataintegration.InMinker,J.(Ed.),Logic-BasedArtificialIntelligence.KluwerPublishers.Levy,A.Y.,Mendelzon,A.O.,Sagiv,Y.,&Srivastava,D.(1995).Answeringqueriesusing views.InProceedingsofthe14thACMSIGACT-SIGMOD-SIGARTSymposiumonPrinciplesofDatabaseSystems,pp.95–104SanJose,Calif.Markov,Z.,&Marinchev,I.(2000).Metric-basedinductivelearningusingsemanticheight functions.InProceedingsofthe11thEuropeanConferenceonMachineLearning(ECML2000).Springer.Martin,D.,Paolucci,M.,McIlraith,S.,Burstein,M.,McDermott,D.,McGuinness,D., Parsia,B.,Payne,T.,Sabou,M.,Solanki,M.,Srinivasan,N.,&Sycara,K.(2004).Bringingsemanticstowebservices:Theowl-sapproach.InProceedingsoftheFirstInternationalWorkshoponSemanticWebServicesandWebProcessComposition(SWSWPC2004).Muggleton,S.,&Feng,C.(1990).Efficientinductionoflogicprograms.InProceedingsof the1stConferenceonAlgorithmicLearningTheory.Muggleton,S.(1995).InverseentailmentandProgol.NewGenerationComputing,Special issueonInductiveLogicProgramming,13(3-4),245–286.N´edellec,C.,Rouveirol,C.,Ad´e,H.,Bergadano,F.,&Tausend,B.(1996).Declarative biasinILP.InDeRaedt,L.(Ed.),AdvancesinInductiveLogicProgramming,pp.82–103.IOSPress.Pazzani,M.J.,&Kibler,D.F.(1992).Theutilityofknowledgeininductivelearning. MachineLearning,9,57–94.Perkowitz,M.,&Etzioni,O.(1995).Categorytranslation:Learningtounderstandinforma-tionontheinternet.InProceedingsoftheFourteenthInternationalJointConferenceonArtificialIntelligence(IJCAI-95).Perkowitz,M.,Doorenbos,R.B.,Etzioni,O.,&Weld,D.S.(1997).Learningtounder-standinformationontheinternet:Anexample-basedapproach.JournalofIntelligentInformationSystems,8(2),133–153.Pottinger,R.,&Halevy,A.Y.(2001).Minicon:Ascalablealgorithmforansweringqueries usingviews.VLDBJournal,10(2-3).Quinlan,J.R.,&Cameron-Jones,R.M.(1993).FOIL:Amidtermreport.InMachine Learning:ECML-93,EuropeanConferenceonMachineLearning,Proceedings,Vol.667,pp.3–20.Springer-Verlag.Rahm,E.,&Bernstein,P.(2001).Asurveyofapproachestoautomaticschemamatching. VLDBJournal,10(4).Richards,B.L.,&Mooney,R.J.(1992).Learningrelationsbypathfinding.InNational ConferenceonArtificialIntelligence,pp.50–55. 49 Carman&Knoblock Roman,D.,Keller,U.,Lausen,H.,deBruijn,J.,Lara,R.,Stollberg,M.,Polleres,A., Feier,C.,Bussler,C.,&Fensel,D.(2005).Webservicemodelingontology.AppliedOntology,1(1),77–106.Ullman,J.D.(1989).PrinciplesofDatabaseandKnowledge-BaseSystems,Vol.2.Com-puterSciencePress,Rockville,Maryland.Weber,I.,Tausend,B.,&Stahl,I.(1995).Languageseriesrevisited:Thecomplexityof hypothesisspacesinILP.InProceedingsofthe8thEuropeanConferenceonMachineLearning,Vol.912,pp.360–363.Springer-Verlag.Widom,J.(1995).Researchproblemsindatawarehousing.InCIKM’95:Proceedingsof thefourthInternationalConferenceonInformationandKnowledgeManagement,pp.25–30.ACMPress.Wiederhold,G.(1992).Mediatorsinthearchitectureoffutureinformationsystems.Com-puter,25(3),38–49.Wiederhold,G.(Ed.).(1996).IntelligentIntegrationofInformation.KluwerAcademic Publishers,BostonMA.Winkler,W.(1999).Thestateofrecordlinkageandcurrentresearchproblems.Tech.rep., StatisticalResearchDivision,U.S.BureauoftheCensus,Washington,DC.Yan,L.L.,Miller,R.J.,Haas,L.M.,&Fagin,R.(2001).Data-drivenunderstanding andrefinementofschemamappings.InSIGMOD’01:Proceedingsofthe2001ACMSIGMODInternationalConferenceonManagementofdata.Zelle,J.M.,Thompson,C.A.,Califf,M.E.,&Mooney,R.J.(1995).Inducinglogic programswithoutexplicitnegativeexamples.InProceedingsoftheFifthInternationalWorkshoponInductiveLogicProgramming. 50 因篇幅问题不能全部显示,请点此查看更多更全内容