Learning Semantic Definitions of Online Information Sources

2023-10-27 来源：步旅网

JournalofArtiﬁcialIntelligenceResearch30(2007)1-50Submitted11/06;published9/07

LearningSemanticDeﬁnitionsofOnlineInformationSources

MarkJamesCarmanCraigA.Knoblock

UniversityofSouthernCaliforniaInformationSciencesInstitute4676AdmiraltyWay

MarinadelRey,CA90292

mark@bradipo.netknoblock@isi.edu

Abstract

TheInternetcontainsaverylargenumberofinformationsourcesprovidingmanytypesofdatafromweatherforecaststotraveldealsandﬁnancialinformation.ThesesourcescanbeaccessedviaWeb-forms,WebServices,RSSfeedsandsoon.Inordertomakeautomateduseofthesesources,weneedtomodelthemsemantically,butwritingsemanticdescriptionsforWebServicesisbothtediousanderrorprone.Inthispaperweinvestigatetheproblemofautomaticallygeneratingsuchmodels.WeintroduceaframeworkforlearningDatalogdeﬁnitionsofWebsources.Inordertolearnthesedeﬁnitions,oursystemactivelyinvokesthesourcesandcomparesthedatatheyproducewiththatofknownsourcesofinformation.Itthenperformsaninductivelogicsearchthroughthespaceofplausiblesourcedeﬁnitionsinordertolearnthebestpossiblesemanticmodelforeachnewsource.Inthispaperweperformanempiricalevaluationofthesystemusingreal-worldWebsources.Theevaluationdemonstratestheeﬀectivenessoftheapproach,showingthatwecanautomaticallylearncomplexmodelsforrealsourcesinreasonabletime.Wealsocompareoursystemwithacomplexschemamatchingsystem,showingthatourapproachcanhandlethekindsofproblemstackledbythelatter.

1.Introduction

Recentyearshaveseenanexplosioninthequantityandvarietyofinformationavailableonline.Onecanﬁndshoppingdata(pricesandavailabilityofgoods),geospatialdata(weatherforecasts,housinginformation),traveldata(ﬂightpricingandstatus),ﬁnancialdata(exchangeratesandstockquotes),andthatjustscratchesthesurfaceofwhatisavailable.Theaimofthisworkistomakeuseofthatvaststoreofinformation.

Astheamountofinformationhasincreased,sotoohasitsreuseacrossWebportalsandapplications.Developershaverealisedtheimportanceofmanagingcontentseparatelyfrompresentation,leadingtothedevelopmentofXMLasaself-describingdataformat.ContentinXMLisfareasiertomanipulatethanHTML,simplifyingintegrationacrossdiﬀerentsources.Standardshavealsoemergedforprovidingprogrammaticaccesstodata(likeSOAPandREST)sothatdeveloperscaneasilybuildprograms(calledMash-Ups)thatcombinecontentfromdiﬀerentsitesinreal-time.Manyportalsnowprovidesuchaccesstotheirdataandsomeevenprovidesyntacticdeﬁnitions(inWSDL)oftheinputandoutputthesedatasourcesexpect.Missing,however,aresemanticdescriptionsofwhateachsourcedoes,whichisrequiredinordertosupportautomateddataintegration.

c2007AIAccessFoundation.Allrightsreserved.󰀂

Carman&Knoblock

1.1StructuredQuerying

Givenallthestructuredsourcesavailable,wewouldliketocombinethedatadynamicallytoanswerspeciﬁcuserrequests(asopposedtostaticallyinthecaseofMash-Ups).Dynamicdatarequestscanbeexpressedasqueriessuchasthoseshownbelow.Suchqueriesmayrequireaccesstomultiple(publiclyavailable)datasourcesandcombineinformationinwaysthatwerenotenvisagedbyitsproducers.

1.Tourism:Getpricesandavailabilityforallthreestarhotelswithin100kilometersofTrento,Italythatliewithin1kilometerofaskiresortthathasover1meterofsnow.2.Transportation:DeterminethetimethatIneedtoleaveworkinordertocatchabustotheairporttopickupmybrotherwhoisarrivingonQantasﬂight205.

3.DisasterPrevention:Findphonenumbersforallpeoplelivingwithin1mileofthecoastandbelow200feetofelevation.Itisclearevenfromthissmallsetofexamplesjusthowpowerfultheabilitytocombinedatafromdisparatesourcescanbe.Inordertogivethesequeriestoanautomatedsystem,wemustﬁrstexpressthemformallyinaquerylanguagesuchasSQLorDatalog(Ullman,1989).InDatalogtheﬁrstquerymightlookasfollows:

q(hotel,price):-accommodation(hotel,3*,address),available(hotel,today,price),distance(address,󰀛Trento,Italy󰀜,dist1),dist1<100km,skiResort(resort,loc1),distance(address,loc1,dist2),

dist2<1km,snowCondiditions(resort,today,height),height>1m.

Theexpressionstatesthathotelandpricepairsaregeneratedbylookingupthreestarhotelsinarelationaltablecalledaccommodation,thencheckingthepricefortomorrownightinatablecalledavailable.TheaddressofeachhotelisusedtocalculatethedistancetoTrento,whichmustbelessthan100kilometers.ThequeryalsochecksthatthereisaskiResortwithin1kilometerofthehotel,andthatthesnowConditionsfortodayshowmorethan1meterofsnow.1.2Mediators

AsystemcapableofgeneratingaplantoanswersuchaqueryiscalledanInformationMediator(Wiederhold,1992).Inordertogenerateaplan,mediatorslookforsourcesthatarerelevanttothequery.Inthiscase,relevantsourcesmightbe:

1.TheItalianTourismWebsite:toﬁndallhotelsnear‘Trento,Italy’.2.ASkiSearchEngine:toﬁndskiresortsneareachhotel.

3.AWeatherProvider:toﬁndouthowmuchsnowhasfallenateachskiresort.Foramediatortoknowwhichsourcesarerelevant,itneedstoknowwhatinformationeachsourceprovides.WhileXMLdeﬁnesthesyntax(formatting)usedbyasource,thesemantics(intendedmeaning)oftheinformationthesourceprovidesmustbedeﬁnedseparately.ThiscanbedoneusingLocal-as-View(LAV)sourcedeﬁnitionsinDatalog(Levy,2000).Essentially,sourcedeﬁnitionsdescribequeriesthatifgiventoamediator,willreturnthesamedataasthesourceprovides.Exampledeﬁnitionsareshownbelow.Theﬁrststates

LearningSemanticDefinitionsofInformationSourcesontheInternet

thatthesourcehotelSearchtakesfourvaluesasinput(inputsarepreﬁxedbythe$-symbol),andreturnsalistofhotelswhichliewithinthegivendistanceoftheinputlocation.Foreachhotelitalsoreturnstheaddressaswellasthepriceforaroomonthegivendate.(NotethatthesourceonlyprovidesinformationforhotelsinItaly.)

hotelSearch($location,$distance,$rating,$date,hotel,address,price):-country(location,Italy),accommodation(hotel,rating,address),available(hotel,date,price),distance(address,location,dist1),dist1findSkiResorts($address,$distance,resort,location):-skiResort(resort,location),distance(address,location,dist1),dist1getSkiConditions($resort,$date,height):-snowCondiditions(resort,date,height).

Inordertogenerateaplanforansweringthequery,amediatorperformsaprocesscalledqueryreformulation(Levy,2000),wherebyittransformsthequeryintoanewqueryover(intermsof)therelevantinformationsources.1(Asourceisrelevantifitreferstothesamerelationsasthequery.)Theresultingplaninthiscaseisshownbelow.

q(hotel,price):-hotelSearch(󰀛Trento,Italy󰀜,100km,3*,today,hotel,address,price),findSkiResorts(address,1km,resort,location),

getSkiConditions(resort,today,height),height>1m.

Inthiswork,thequestionsofinterestare:wheredoallthedeﬁnitionsfortheseinformationsourcescomefromandmoreprecisely,whathappenswhenwewanttoaddnewsourcestothesystem?Isitpossibletogeneratethesesourcedeﬁnitionsautomatically?1.3DiscoveringNewSources

Intheexampleabove,themediatorknowsofasetofrelevantsourcestousetosuccessfullyanswerthequery.Ifinstead,oneofthosesourcesismissingordoesn’thavethedesiredscope(e.g.getSkiConditionsdoesn’tprovidedataforTrento),thenthemediatorﬁrstneedstodiscoverasourceprovidingthatinformation.Asthenumberandvarietyofinformationsourcesincrease,wewillundoubtedlyrelyonautomatedmethodsfordiscoveringthemandannotatingthemwithsemanticdescriptions.Inordertodiscoverrelevantsources,asystemmightinspectaserviceregistry2(suchasthosedeﬁnedinUDDI),orperformkeyword-basedsearchoveraWebindex(suchasGoogleordel.icio.us).Theresearchcommunityhaslookedattheproblemofdiscoveringrelevantservices,developingtechniquesforclassifyingservicesintodiﬀerentdomains(suchasweatherandﬂights)usingservicemetadata(Heß&Kushmerick,2003)andclusteringsimilarservicestogethertoimprovekeyword-basedsearch(Dong,Halevy,Madhavan,Nemes,&Zhang,2004).Thesetechniques,althoughuseful,arenotsuﬃcientforautomatingserviceintegration.

1.Thecomplexityofqueryreformulationisknowntobeexponential,althougheﬃcientalgorithmsforperformingitdoexist(Pottinger&Halevy,2001).

2.Notethattechnically,aserviceisdiﬀerentfromasource.Aserviceisaninterfaceprovidingaccesstomultipleoperations,eachofwhichmayprovideinformation.Ifanoperationdoesnotaﬀectthe“stateoftheworld”(e.g.bychargingsomebody’screditcard),thenwecallitaninformationsource.Forthispaper,however,weusethetermservicetoreferonlytoinformationsources.

Carman&Knoblock

1.4LabelingServiceInputsandOutputs

Oncearelevantserviceisdiscovered,theproblemshiftstomodelingitsemantically.Model-ingsourcesbyhandcanbelaborious,soautomatingtheprocessmakessense.Sincediﬀerentservicesoftenprovidesimilaroroverlappingdata,itshouldbepossibletouseknowledgeofpreviouslymodeledservicestolearndescriptionsfornewones.

Theﬁrststepinmodelingasourceistodeterminewhattypeofdataitrequiresasinputandproducesasoutput.Thisisdonebyassigningsemantictypes(likezipcode,telephonenumber,temperature,andsoon)totheattributesofaservice.Semantictypesrestrictthepossiblevaluesofanattributetoasubsetofthecorrespondingprimitivetype.Theresearchcommunityhasinvestigatedautomatingtheassignmentprocessbyviewingitasaclassiﬁcationproblem(Heß&Kushmerick,2003).Intheirsystem,HeßandKushmericktrainedaSupportVectorMachine(SVM)onmetadatadescribingdiﬀerentsources.Thesystem,givenasourcesuchasthefollowing:

getWeather($zip,temp)

usesthelabelsgetWeather,zipandtemp(andanyotheravailablemetadata)toassigntypestotheinputandoutputattributes,e.g.:zip→zipcode,temp→temperature.Notethatadditionalmetadataisoftenusefulfordistinguishingbetweenpossibleassignments.(If,forexample,thenameoftheoperationhadbeenlistEmployees,thentempmayhavereferredtoatemporaryemployeeratherthanatemperature.)

Insubsequentwork,researchersdevelopedamorecomprehensivesystemthatusedbothmetadataandoutputdatatoclassifyserviceattributes(Lerman,Plangprasopchok,&Knoblock,2006).Inthatsystem,aLogisticRegressionbasedclassiﬁerﬁrstassignsse-mantictypestoinputparameters.Examplesofthoseinputtypesarethenusedtoinvoketheservice,andtheoutputisgiventoapattern-languagebasedclassiﬁer,thatassignstypestotheoutputparameters.Theauthorsarguethatclassiﬁcationbasedonbothdataandmetadataisfarmoreaccuratethanthatbasedonmetadataalone.Usinganexample,itiseasytoseewhy.ConsiderthefollowingtuplesproducedbyourgetWeathersource:

󰀛90292,25◦C󰀜,󰀛10274,15◦C󰀜,󰀛60610,18◦C󰀜,...

Giventhedata,theclassiﬁercanbecertainthattempreallydoesreferstoatemperature,andindeedcanevenassignittoamorespeciﬁctype,temperatureC(inCelsius).

Whiletheproblemofdeterminingthesemantictypesofaservice’sattributesisveryinteresting,andthereisroomforimprovementoncurrenttechniques,weassumeforthepurposesofthisworkthatithasalreadybeensolved.1.5GeneratingaDeﬁnition

Onceweknowtheparametertypes,wecaninvoketheservice,butwearestillunabletomakeuseofthedataitreturns.Todothat,weneedalsotoknowhowtheoutputattributesrelatetotheinput(i.e.adeﬁnitionforthesource).Forexample,forthegetWeatherserviceweneedtoknowwhetherthetemperaturebeingreturnedisthecurrenttemperature,thepredictedhightemperaturefortomorrowortheaveragetemperatureforthistimeofyear.Suchrelationshipscanbedescribedbythefollowingdeﬁnitions:

getWeather($zip,temp):-currentTemp(zip,temp).

getWeather($zip,temp):-forecast(zip,tomorrow,temp).getWeather($zip,temp):-averageTemp(zip,today,temp).

LearningSemanticDefinitionsofInformationSourcesontheInternet

Therelationsusedinthesedeﬁnitionswouldbedeﬁnedinadomainontology(orschema).Inthispaperwedescribeasystemcapableoflearningwhichifanyofthesedeﬁnitionsisthecorrect.Thesystemleverageswhatitknowsaboutthedomain,i.e.thedomainontologyandasetofknowninformationsources,tolearnwhatitdoesnotknow,namelytherelationshipbetweentheattributesofanewlydiscoveredsource.1.6Outline

Thispaperpresentsacomprehensivetreatmentofmethodsforlearningsemanticdescrip-tionsofWebinformationsources.Itextendsourpreviousworkonthesubject(Carman&Knoblock,2007)bypresentingdetaileddescriptionsofthemethodsforbothenumeratingthesearchspaceandevaluatingtheindividualcandidatedeﬁnitions.Weprovideadditionaldetailsregardingtheevaluationmethodologyandtheresultsgenerated.

Thepaperisstructuredasfollows.Westartwithanexampletomotivatethesourceinductionproblemandthenformulatetheproblemconcisely.Wediscussourapproachtolearningdeﬁnitionsforsourcesintermsofotherknownsourcesofinformation(section3).Wegivedetailsofthesearchprocedureforgeneratingcandidatedeﬁnitions(section4)andoftheevaluationprocedureforscoringcandidatesduringsearch(section5).Wethendescribeextensionstothebasicalgorithm(section6)beforediscussingtheevaluationsetupandtheexperiments(section7),whichdemonstratethecapabilitiesofoursystem.Finally,wecontrastourapproachwithpriorwork.

2.Problem

Wenowdescribeindetailtheproblemoflearningdeﬁnitionsfornewlydiscoveredservices.Westartwithaconcreteexampleofwhatismeantbylearningasourcedeﬁnition.Intheexampletherearefourtypesofdata(semantictypes),namely:zipcodes,distances,latitudesandlongitudes.Therearealsothreeknownsourcesofinformation.EachofthesourceshasadeﬁnitioninDatalogasshownbelow.Theﬁrstservice,aptlynamedsource1,takesinazipcodeandreturnsthelatitudeandlongitudecoordinatesofitscentroid.Thesecondservicecalculatesthegreatcircledistance(theshortestdistanceovertheearth’ssurface)betweentwopairsofcoordinates,andthethirdconvertsadistancefromkilometresintomilesbymultiplyingtheinputbytheconstant1.609.

source1($zip,lat,long):-centroid(zip,lat,long).source2($lat1,$long1,$lat2,$long2,dist):-greatCircleDist(lat1,long1,lat2,long2,dist).

source3($dist1,dist2):-multiply(dist1,1.609,dist2).

Thegoalinthisexampleistolearnadeﬁnitionforanewlydiscoveredservice,calledsource4.Thisservicetakestwozipcodesasinputandreturnsadistancevalueasoutput:

source4($zip1,$zip2,dist)

Thesystemwewilldescribeusesthistypesignature(inputandoutputtypeinformation)tosearchforanappropriatedeﬁnitionforthesource.Thedeﬁnitiondiscoveredinthiscasemightbethefollowingconjunctionofcallstotheindividualsources:

source4($zip1,$zip2,dist):-source1($zip1,lat1,long1),source1($zip2,lat2,long2),

source2($lat1,$long1,$lat2,$long2,dist2),source3($dist2,dist).

Carman&Knoblock

Thedeﬁnitionstatesthatsource’soutputdistancecanbecalculatedfromtheinputzipcodes,bygivingthosezipcodestosource1,takingtheresultingcoordinatesandcalculatingthedistancebetweenthemusingsource2,andthenconvertingthatdistanceintomilesusingsource3.Totestwhetherthisdeﬁnitioniscorrect,thesystemmustinvokeboththenewsourceandthedeﬁnitiontoseeifthevaluesgeneratedagreewitheachother.Thefollowingtableshowssuchatest:

$zip1802106060110005

$zip2902661520135555

dist(actual)842.37410.31899.50

dist(predicted)843.65410.83899.21

Inthetable,theinputzipcodeshavebeenselectedrandomlyfromasetofexamples,andtheoutputfromthesourceandthedeﬁnitionareshownsidebyside.Sincetheoutputvaluesarequitesimilar,oncethesystemhasseenasuﬃcientnumberofexamples,itcanbeconﬁdentthatithasfoundthecorrectsemanticdeﬁnitionforthesource.

Thedeﬁnitionaboveisgivenintermsofthesourcerelations,butcouldalsohavebeenwrittenintermsofthedomainrelations(therelationsusedtodeﬁnesources1to3).Toconvertthedeﬁnitionintothatform,onesimplyneedstoreplaceeachsourcerelationbyitsdeﬁnitionasfollows:

source4($zip1,$zip2,dist):-centroid(zip1,lat1,long1),centroid(zip2,lat2,long2),

greatCircleDist(lat1,long1,lat2,long2,dist2),multiply(dist1,1.609,dist2).Writteninthisway,thenewsemanticdeﬁnitionmakessenseatanintuitivelevel:thesourceissimplycalculatingthedistanceinmilesbetweenthecentroidsofthetwozipcodes.2.1ProblemFormulation

HavinggivenanexampleoftheSourceDeﬁnitionInductionProblem,wenowdescribetheproblemmoreformally,butbeforedoingso,weintroducesomeconceptsandnotation.(Wenotethatourfocusinthispaperisonlearningdeﬁnitionsforinformation-providing,asopposedto“world-altering”services.)

•Thedomainofasemanticdata-typet,denotedD[t],isthe(possiblyinﬁnite)setofconstantvalues{c1,c2,...},whichconstitutethesetofvaluesforvariablesofthattype.ForexampleD[zipcode]={90210,90292,...}

•Anattributeisapair󰀛label,semanticdata-type󰀜,e.g.󰀛zip1,zipcode󰀜.Thetypeofattributeaisdenotedtype(a)andthecorrespondingdomainD[type(a)]isabbreviatedtoD[a].

•Aschemeisanordered(ﬁnite)setofattributes󰀛a1,...,an󰀜withuniquelabels,wherenisreferredtoasthearityofthescheme.Anexampleschememightbe󰀛zip1:zipcode,zip2:zipcode,dist:distance󰀜.ThedomainofaschemeA,denotedD[A],istheCartesianproductofthedomainsoftheattributesinthescheme{D[a1]×...×D[an]},ai∈A.

•AtupleoveraschemeAisanelementfromthesetD[A].Atuplecanberepresentedbyasetofname-valuepairs,suchas{zip1=90210,zip2=90292,dist=8.15}

LearningSemanticDefinitionsofInformationSourcesontheInternet

•Arelationisanamedscheme,suchasairDistance(zip1,zip2,dist).Multiplere-lationsmaysharethesamescheme.

•Anextensionofarelationr,denotedE[r],isasubsetofthetuplesinD[r].Forexam-ple,E[airDistance]mightbeatablecontainingthedistancebetweenallzipcodesinCalifornia.(Notethattheextensionofarelationmayonlycontaindistincttuples.)•AdatabaseinstanceoverasetofrelationsR,denotedI[R],isasetofextensions{E[r1],...,E[rn]},oneforeachrelationr∈R.

•AquerylanguageLisaformallanguageforconstructingqueriesoverasetofrelations.WedenotethesetofallqueriesthatcanbewrittenusingthelanguageLoverthesetofrelationsRreturningtuplesconformingtoaschemeAasLR,A.

•Theresultsetproducedbytheexecutionofaqueryq∈LR,AonadatabaseinstanceI[R]isdenotedEI[q].

•Asourceisarelations,withabindingpatternβs⊆s,whichdistinguishesinputattributesfromoutputattributes.(Theoutputattributesofasourcearedenotedby

c≡s\\β.)thecomplementofthebindingpattern3,βss

•AviewdeﬁnitionforasourcesisaqueryvswritteninsomequerylanguageLR,s.TheSourceDeﬁnitionInductionProblemisdeﬁnedasatuple:

󰀛T,R,L,S,V,s∗󰀜

whereTisasetofsemanticdata-types,Risasetofrelations,Lisaquerylanguage,Sisasetofknownsources,Visasetofviewdeﬁnitions(oneforeachknownsource),ands∗isthenewsource(alsoreferredtoasthetarget).

Eachsemantictypet∈TmustbeprovidedwithasetofexamplesvaluesEt⊆D[t].(WedonotrequiretheentiresetD[t],becausethedomainofmanytypesmaybepartiallyunknownortoolargetobeenumerated.)Inaddition,apredicateeqt(t,t)isavailableforcheckingequalitybetweenvaluesofeachsemantictypetohandlethecasewheremultipleserialisationsofavariablerepresentthesamevalue.

Eachrelationr∈Risreferredtoasaglobalrelationordomainpredicateanditsextensionisvirtual,meaningthattheextensioncanonlybegeneratedbyinspectingeveryrelevantdatasource.ThesetofrelationsRmayincludesomeinterpretedpredicates,suchas≤,whoseextensionisdeﬁnedandnotvirtual.

ThelanguageLusedforconstructingqueriescouldbeanyquerylanguageincludingSQLandXQuery(theXMLQueryLanguage).InthispaperwewilluseaformofDatalog.

Eachsources∈ShasanextensionE[s]whichisthecompletesetoftuplesthatcanbeproducedbythesource(atagivenmomentintime).Werequirethatthecorrespondingviewdeﬁnitionvs∈V(writteninLR,s)isconsistentwiththesource,suchthat:E[s]⊆EI[vs],(whereI[R]isthecurrentvirtualdatabaseinstanceovertheglobalrelations).Notethatwedonotrequireequivalence,becausesomesourcesmayprovideincompletedata.

Theviewdeﬁnitionforthesourcetobemodeleds∗isunknown.ThesolutiontotheSourceDeﬁnitionInductionProblemisaviewdeﬁnitionv∗∈LR,s∗forthesources∗suchthatE[s∗]⊆EI[v∗],andthereisnootherviewdeﬁnitionv󰀄∈LR,s∗thatbetterdescribes(providesatighterdeﬁnitionfor)thesources∗,i.e.:

¬∃v󰀄∈LR,s∗s.t.E[s∗]⊆EI[v󰀄]∧|EI[v󰀄]|<|EI[v∗]|

3.The‘\\’-symboldenotessetdiﬀerence.

Carman&Knoblock

Givenlimitedavailablecomputationandbandwidth,wenotethatitmaynotbepossibletoguaranteethatthisoptimalityconditionholdsforaparticularsolution;thusinthispaperwewillsimplystrivetoﬁndthebestsolutionpossible.2.2ImplicitAssumptions

Anumberofassumptionsareimplicitintheproblemformulation.Theﬁrstisthatthereexistsasystemcapableofdiscoveringnewsourcesandmoreimportantlyclassifying(togoodaccuracy)thesemantictypesoftheirinputandoutput.Systemscapableofdoingthiswerediscussedinsection1.4.

Thesecondassumptionhastodowiththerepresentationofeachsourceasarelationalviewdeﬁnition.MostsourcesontheInternetprovidetreestructuredXMLdata.Itmaynotalwaysbeobvioushowbesttoﬂattenthatdataintoasetofrelationaltuples,whilepreservingtheintendedmeaningofthedata.Consideratravelbookingsitewhichreturnsasetofﬂightoptionseachwithaticketnumber,apriceandalistofﬂightsegmentsthatconstitutetheitinerary.Onepossibilityforconvertingthisdataintoasetoftupleswouldbetobreakeachticketupintoindividualﬂightsegmenttuples(therebyobscuringtherelationshipbetweenpriceandthenumberofﬂightsegments).Anotherwouldbetocreateoneverylongtupleforeachticketwithroomforanumberofﬂightsegments(therebycreatingtupleswithmanynullvalues).Inthiscase,itisnotobviouswhich,ifeither,oftheseoptionsistobepreferred.Mostonlinedatasourcescan,however,bemodeledquitenaturallyasrelationalsources;andbyﬁrsttacklingtherelationalproblem,wecandeveloptechniquesthatcanlaterbeappliedtothemorediﬃcultsemi-structuredcase.

Athirdassumptionisthatthesetofdomainrelationssuﬃcesfordescribingthesourcetobemodeled.Forinstance,considerthecasewherethedomainmodelonlycontainsrelationsdescribingﬁnancialdata,andthenewsourceprovidesweatherforecasts.Obviously,thesystemwouldbeunabletoﬁndanadequatedescriptionofthebehaviorofthesourceandwouldnotlearnamodelofit.Fromapracticalperspective,thislimitationisnotabigproblem,sinceausercanonlyrequestdata(writequeriestoamediator)usingtherelationsinthedomainmodelanyway.(Thusanysourcewhichcannotbedescribedusingthoserelationswouldnotbeneededtoansweruserrequests.)Inotherwords,theonusisonthedomainmodelertomodelsuﬃcientrelationssoastobeabletodescribethetypesofqueriesausershouldbeabletoposetothesystemandconsequently,thetypesofsourcesthatshouldbeavailable.Thatsaid,aninterestingavenueforfutureresearchwouldbetoinvestigatetheproblemofautomating(atleastinpart)theprocessofexpandingthescopeofthedomainmodel(byaddingattributestorelations,orinventingnewones),basedonthetypesofsourcesdiscovered.2.3ProblemDiscussion

Anumberofquestionsarisefromtheproblemformulation,theﬁrstbeingwherethedomainmodelcomesfrom.Inprinciple,thesetofsemantictypesandrelationscouldcomefrommanyplaces.Itcouldbetakenfromstandarddatamodelsforthediﬀerentdomains,oritmightjustbethesimplestmodelpossiblethataptlydescribesthesetofknownsources.Thedomainmodelmaythenevolveovertimeassourcesarediscoveredforwhichnoappropriatemodelcanbefound.Somewhatrelatedisthequestionofhowspeciﬁcthesemantictypes

LearningSemanticDefinitionsofInformationSourcesontheInternet

oughttobe.Forexample,isitsuﬃcienttohaveonesemantictypedistanceorshouldonedistinguishbetweendistanceinmetersanddistanceinfeet?Generallyspeaking,asemantictypeshouldbecreatedforeachattributethatissyntacticallydissimilartoallotherattributes.Forexample,aphonenumberandazipcodehaveverydiﬀerentsyntax,thusoperationswhichacceptoneofthetypesasinputareunlikelytoaccepttheother.Inpractice,onemightcreateanewsemantictypewheneveratrainedclassiﬁercanrecognisethetypebasedonitssyntaxalone.Ingeneral,themoresemantictypesthereare,theharderthejobofthesystemclassifyingtheattributes,andtheeasierthejobofthesystemtaskedwithlearningadeﬁnitionforthesource.

Anotherquestiontobeconsiderediswherethedeﬁnitionsfortheknownsourcescomefrom.Initiallysuchdeﬁnitionswouldneedtobewrittenbyhand.Asthesystemlearnsdeﬁnitionsfornewsources,theytoowouldbeaddedtothesetofknownsources,makingitpossibletolearnevermorecomplicateddeﬁnitions.

Inorderforthesystemtolearnadeﬁnitionforanewsource,itmustbeabletoinvokethatsourceandthusneedsexamplesoftheinputtypes.Themorerepresentativethesetofexamplesavailable,themoreeﬃcientandaccuratethelearningprocesswillbe.Aninitialsetofexampleswillneedtobeprovidedbythedomainmodeler.Then,asthesystemlearnsovertime,itwillgeneratealargenumberofexamplesofdiﬀerentsemantictypes(asoutputfromvarioussources),whichcanberetainedforfutureuse.

InformationIntegrationresearchhasreachedapointwheremediatortechnology4isbecomingmatureandpractical.Theneedtoinvolveahumaninthewritingofsourcedeﬁnitionsis,however,theAchilles’Heelofsuchsystems.Thegainsinﬂexibilitythatcomewiththeabilitytodynamicallyreformulateuserqueriesareoftenpartiallyoﬀsetbythetimeandskillrequiredtowritedeﬁnitionswhenincorporatingnewsources.Thusasystemcapableoflearningdeﬁnitionsautomaticallycouldgreatlyenhancetheviabilityofmediatortechnology.Thismotivationaloneseemssuﬃcientforpursuingtheproblem.

3.Approach

TheapproachwetaketolearningsemanticmodelsforinformationsourcesontheWebistwofold.Firstly,wechoosetomodelsourcesusingthepowerfullanguageofconjunctivequeries.Secondly,weleveragethesetofknownsourcesinordertolearnadeﬁnitionforthenewone.Inthissectionwediscusstheseaspectsinmoredetail.3.1ModelingLanguage

ThesourcedeﬁnitionlanguageListhehypothesislanguageinwhichnewdeﬁnitionswillneedtobelearnt.Asisoftenthecaseinmachinelearning,wearefacedwithatrade-oﬀwithrespecttotheexpressivenessofthislanguage.Ifthehypothesislanguageistoosimple,thenwemaynotbeabletomodelrealservicesusingit.Ontheotherhand,ifthelanguageisoverlycomplex,thenthespaceofpossiblehypotheseswillbesolargethatlearningwillnotbefeasible.ThelanguagewechooseisthatofconjunctivequeriesinDatalog,whichis

4.InﬂuentialInformationIntegrationsystemsincludeTSIMMIS(Garcia-Molina,Hammer,Ireland,Pa-pakonstantinou,Ullman,&Widom,1995),SIMS(Arens,Knoblock,&Shen,1996),InfoMaster(Duschka,1997),andAriadne(Knoblock,Minton,Ambite,Ashish,Muslea,Philpot,&Tejada,2001).

Carman&Knoblock

ahighlyexpressiverelationalquerylanguage.Inthissectionwearguewhyalessexpressivelanguageisnotsuﬃcientforourpurposes.

ResearchersinterestedintheproblemofassigningsemanticstoWebServices(Heß&Kushmerick,2003)haveinvestigatedtheproblemofusingMachineLearningtechniquestoclassifyservices(basedonmetadatacharacteristics)intodiﬀerentsemanticdomains,suchasweatherandﬂights,andtheoperationstheyprovideintodiﬀerentclassesofoperation,suchasweatherForecastandﬂightStatus.Fromarelationalperspective,wecanconsiderthediﬀerentclassesofoperationsasrelations.Forinstance,considerthedeﬁnitionbelow:

source($zip,temp):-weatherForecast(zip,tomorrow,temp).

ThesourceprovidesweatherdatabyselectingtuplesfromarelationcalledweatherForecast,whichhasthedesiredzipcodeanddateequaltotomorrow.Thisqueryisreferredtoasaselect-projectquerybecauseitsevaluationcanbeperformedusingtherelationaloperatorsselectionandprojection.Sofarsogood,wehavebeenabletouseasimpleclassiﬁertolearnasimpledeﬁnitionforasource.Thelimitationimposedbythisrestricted(select-project)modelinglanguagebecomesobvious,however,whenweconsiderslightlymorecomplicatedsources.ConsiderasourcethatprovidesthetemperatureinFahrenheitaswellasCelsius.Inordertomodelsuchasourceusingaselect-projectquery,wewouldrequirethattheweatherForecastrelationbeextendedwithanewattributeasfollows:

source($zip,tempC,tempF):-weatherForecast(zip,tomorrow,tempC,tempF).

Themoreattributesthatcouldconceivablybereturnedbyaweatherforecastoperation(suchasdewpoint,humidity,temperatureinKelvin,latitude,etc.),thelongertherelationwillneedtobetocoverthemall.Better,inthiscase,wouldbetointroduceasecondrelationconvertCtoFthatmakesexplicittherelationshipbetweenthetemperaturevalues.If,inaddition,thesourcelimitsitsoutputtozipcodesinCalifornia,areasonabledeﬁnitionforthesourcemightbe:

source($zip,tempC,tempF):-weatherForecast(zip,tomorrow,tempC),convertCtoF(tempC,tempF),state(zip,California).

Thisdeﬁnitionisnolongerexpressedinthelanguageofselect-projectqueries,becauseitnowinvolvesmultiplerelationsandjoinsacrossthem.Thusfromthissimpleexample,weseethatmodelingservicesusingsimpleselect-projectqueriesisnotsuﬃcientforourpurposes.Whatweneedareselect-project-joinqueries,alsoreferredtoasconjunctivequeries.5Thereaderhasalreadybeenintroducedtoexamplesofconjunctivequeriesthroughouttheprevioussections.ConjunctivequeriesformasubsetofthelogicalquerylanguageDatalogandcanbedescribedmoreformallyasfollows:

AconjunctivequeryoverasetofrelationsRisanexpressionoftheform:q(X0):-r1(X1),r2(X2),...,rl(Xl).

whereeachri∈RisarelationandXiisanorderedsetofvariablenamesofsizearity(ri).6Eachconjunctri(Xi)isreferredtoasaliteral.Thesetofvariables

󰀃

inthequery,denotedvars(q)=li=0Xi,consistsofdistinguishedvariablesX0(fromtheheadofthequery),andexistentialvariablesvars(q)\\X0,(which

5.Evaluatingaselect-project-joinqueryrequiresadditionalrelationaloperators:naturaljoinandrename.6.Notethataconjunctivequerycanalsobeexpressedinﬁrstorderlogic󰀃asfollows:

∀X0∃Ys.t.r1(X1)∧r2(X2)∧...∧rl(Xl)→q(X0)whereX0∪Y=i=1Xi

LearningSemanticDefinitionsofInformationSourcesontheInternet

onlyappearinthebody).Aconjunctivequeryissaidtobesafeifallthe

󰀃

distinguishedvariablesappearinthebody,i.e.X0⊆li=1Xi.3.2MoreExpressiveLanguages

ModelingsourcesusingconjunctivequeriesimpliesthataggregateoperatorslikeMINandORDERcannotbeusedinsourcedeﬁnitions.Thefunctionalityofmostsourcescanbedescribedwithoutsuchoperators.Somesourcescanonlybedescribedpoorly,however.Considerahotelsearchservicethatreturnsthe20closesthotelstoagivenlocation:

hotelSearch($loc,hotel,dist):-accommodation(hotel,loc1),distance(loc,loc1,dist).

Accordingtothedeﬁnition,thesourceshouldreturnallhotelsregardlessofdistance.Onecannotexpressthefactthatonlytheclosesthotelswillbereturned.Thereasonfornotincludingaggregateoperatorsinthehypothesislanguageisthatthesearchspaceassociatedwithlearningdeﬁnitionsisprohibitivelylarge.(Thusweleaveaggregateoperatorstofutureworkasdiscussedinsection9.2.)

Similarly,sourcedeﬁnitionscannotcontaindisjunction,whichrulesoutunionandre-cursivequeries.Again,thissimplifyingassumptionholdsformostinformationsourcesandgreatlyreducesthesearchspace.Itmeanshowever,thataweatherserviceprovidingforecastsonlyforcitiesintheUSandCanadawouldbemodeledas:

s($city,temp):-forecast(city,country,tomorrow,temp).

Sincethedeﬁnitiondoesnotrestrictthedomainofthecountryattribute,whenconfrontedwitharequestfortheforecastinAustralia,amediatorwouldproceedtocalltheservice,oblivioustoanyrestrictiononthatattribute.

Wealsodonotallownegationinthequeriesbecausesourcedeﬁnitionsveryrarelyrequireit,soincludingitwouldneedlesslycomplicatethesearch.Inthoserarecaseswherethenegationofaparticularpredicateisusefulfordescribingcertaintypesofsources,thenegatedpredicatecanbeincluded(asadistinctpredicate)inthesearch.Forinstance,wemightuse“≥”todescribeasource,eventhoughstrictlyspeakingitisthenegationof“<”.3.3LeveragingKnownSources

Ourapproachtotheproblemofdiscoveringsemanticdeﬁnitionsfornewservicesistoleveragethesetofknownsourceswhenlearninganewdeﬁnition.Broadlyspeaking,wedothisbyinvokingtheknownsources(inamethodicalmanner)toseeifanycombinationoftheinformationtheyprovidematchestheinformationprovidedbythenewsource.Fromapracticalperspective,thismeansinordertomodelanewlydiscoveredsourcesemantically,werequiresomeoverlapinthedatabeingproducedbythenewsourceandthesetofknownsources.Onewaytounderstandthisistoconsideranewsourceproducingweatherdata.Ifnoneoftheknownsourcesproduceanyweatherinformation,thenthereisnowayforthesystemtolearnwhetherthenewsourceisproducinghistoricalweatherdata,weatherforecasts-oreventhatitisdescribingweatheratall.(Inprinciple,onecouldtrytoguesswhattheserviceisdoingbasedonthetypesignaturealone,buttherewouldbenoguaranteethatthedeﬁnitionwascorrect,makingitoflittleusetoamediator.)Giventhisoverlappingdatarequirement,onemightclaimthatthereislittlebeneﬁtinincorporatingnewsources.Wedetailsomeofthereasonswhythisisnotthecasebelow.

Carman&Knoblock

Themostobviousbeneﬁtoflearningdeﬁnitionsfornewsourcesisredundancy.Ifthesystemisabletolearnthatonesourceprovidesexactlythesameinformationasacurrentlyavailablesource,thenifthelattersuddenlybecomesunavailable,theformercanbeusedinitsplace.Forexample,ifamediatorknowsofoneweathersourceprovidingcurrentconditionsandlearnsthatasecondsourceprovidesthesameorsimilardata,theniftheﬁrstgoesdownforwhateverreason(perhapsbecauseanaccessquotahasbeenreached),weatherdatacanstillbeaccessedfromthesecond.

Thesecondandperhapsmoreinterestingreasonforwantingtolearnadeﬁnitionforanewsourceisthatthenewsourcemayprovidedatawhichliesoutsidethescopeof(orsimplywasnotpresentin)thedataprovidedbytheothersources.Forexample,consideraweatherservicewhichprovidestemperaturevaluesforzipcodesintheUnitedStates.Thenconsiderasecondsourcethatprovidesweatherforecastsforcitiesworldwide.Ifthesystemcanusetheﬁrstsourcetolearnadeﬁnitionforthesecond,theamountofinformationavailableforqueryingincreasesgreatly.

Bindingconstraintsonaservicecanmakeaccessingcertaintypesofinformationdiﬃcultorineﬃcient.Inthiscase,discoveringanewsourceprovidingthesameorsimilardatabutwithadiﬀerentbindingpatternmayimproveperformance.Forexample,considerahotelsearchservicethatacceptsazipcodeandreturnsasetofhotelsalongwiththeirstarrating:

hotelSearch($zip,hotel,rating,street,city,state):-accommodation(hotel,rating,street,city,state,zip).

NowconsiderasimplequeryforthenamesandaddressesofallﬁvestarhotelsinCalifornia:

q(hotel,street,city,zip):-accommodation(hotel,5*,street,city,California,zip).Answeringthisquerywouldrequirethousandsofcallstotheknownsource,oneforeveryzipcodeinCalifornia,andamediatorcouldonlyanswerthequeryiftherewasanothersourceprovidingthosezipcodes.Incontrast,ifthesystemhadlearntadeﬁnitionforanewsourcewhichprovidesexactlythesamedatabutwithadiﬀerentbindingpattern(suchastheonebelow),thenansweringthequerywouldrequireonlyonecalltoasource:

hotelsByState($state,$rating,hotel,street,city,zip):-accommodation(hotel,rating,street,city,state,zip).

Oftenthefunctionalityofacomplexsourcecanbedescribedintermsofacompositionofthefunctionalityprovidedbyothersimplerservices.Forinstance,considerthemotivatingexamplefromsection2,inwhichthefunctionalityprovidedbythenewsourcewastocalcu-latethedistanceinmilesbetweentwozipcodes.Thesamefunctionalitycouldbeachievedbyperformingfourdiﬀerentcallstotheavailablesources.Inthatcase,thedeﬁnitionlearntbythesystemmeantthatanyqueryregardingthedistancebetweenzipcodescouldbehan-dledmoreeﬃciently.Ingeneral,bylearningdeﬁnitionsformorecomplicatedsourcesintermsofsimplerones,thesystemcanbeneﬁtfromcomputation,optimisationandcachingabilitiesofservicesprovidingcomplexfunctionality.

Finally,thenewlydiscoveredservicemaybefastertoaccessthantheknownsourcesprovidingsimilardata.Forinstance,considerageocodingservicethattakesinanaddressandreturnsthelatitudeandlongitudecoordinatesofthelocation.Becauseofthevarietyinalgorithmsusedtocalculatethecoordinates,it’snotunreasonableforsomegeocodingservicestotakeaverylongtime(upwardsofonesecond)toreturnaresult.Ifthesystemwereabletodiscoveranewsourceprovidingthesamegeocodingfunctionality,butusing

LearningSemanticDefinitionsofInformationSourcesontheInternet

afasteralgorithm,thenitcouldlocateanddisplaymanymoreaddressesonamapinthesameamountoftime.

4.InducingDeﬁnitions

Inthissectionwedescribeanalgorithmforgeneratingcandidatedeﬁnitionsforanewlydiscoveredsource.Thealgorithmformstheﬁrstphaseinagenerateandtestmethodologyforlearningsourcedeﬁnitions.Wedeferdiscussionofthetestingphasetolaterinthepaper.Westartbybrieﬂydiscussingworkonrelationalrulelearningandthendescribehowouralgorithmbuildsupontheseideas.4.1InductiveLogicProgramming

Thelanguageofconjunctivequeriesisarestrictedformofﬁrst-orderlogic.IntheMachineLearningcommunity,systemscapableoflearningmodelsusingﬁrst-orderrepresentationsarereferredtoasInductiveLogicProgramming(ILP)systemsorrelationalrulelearners.Becauseoftheexpressivenessofthemodelinglanguage,thecomplexityoflearningismuchhigherthanforpropositionalrulelearners(alsocalledattribute-valuelearners),whichformthebulkofMachineLearningalgorithms.Givenourrelationalmodelingofservices,manyofthetechniquesdevelopedinILPshouldalsoapplytoourproblem.

TheFirstOrderInductiveLearner(foil)isawellknownILPsearchalgorithm(Cameron-Jones&Quinlan,1994).Itiscapableoflearningﬁrst-orderrulestodescribeatargetpred-icate,whichisrepresentedbyasetofpositiveexamples(tuplesoverthetargetrelation,denotedE+)andoptionallyalsoasetofnegativeexamples(E−).Thesearchforaviabledeﬁnitioninfoilstartswithanemptyclause7andprogressivelyaddsliteralstothebody(antecedent)oftherule,therebymakingtherulemorespeciﬁc.Thisprocesscontinuesuntilthedeﬁnition(denotedh)coversonlypositiveexamplesandnonegativeexamples:

E+∩EI[h]=∅

andE−∩EI[h]=∅

Usuallyasetofrulesislearntinthismannerbyremovingthepositiveexamplescoveredbytheﬁrstruleandrepeatingtheprocess.(Thesetofrulesistheninterpretedasaunionquery.)Searchinfoilisperformedinagreedybest-ﬁrstmanner,guidedbyaninformationgain-basedheuristic.Manyextensionstothebasicalgorithmexist,mostnotablythosethatcombinedeclarativebackgroundknowledgeinthesearchprocesssuchasfocl(Pazzani&Kibler,1992).Suchsystemsarecategorisedasperformingatopdownsearchbecausetheystartfromanemptyclause(themostgeneralrulepossible)andprogressivelyspecializetheclause.Bottomupapproaches,ontheotherhand,suchasgolem(Muggleton&Feng,1990),performaspeciﬁctogeneralsearchstartingfromthepositiveexamplesofthetarget.4.2Search

Wenowdescribetheactualsearchprocedureweusetogeneratecandidatedeﬁnitionsforanewsource.Theprocedureisbasedonthetop-downsearchstrategyusedinfoil.Thealgorithmtakesasinputatypesignatureforthenewsourceandusesittoseedthesearchfor

7.WeusethetermsclauseandqueryinterchangeablytorefertoaconjunctivequeryinDatalog.Anemptyclauseisaquerywithoutanyliteralsinthebody(rightside)oftheclause.

Carman&Knoblock

input:Apredicatesignatures∗output:Thebestscoringviewdeﬁnitionvbest

invoketargetwithsetofrandominputs;2vbest←emptyclauses∗;3addvbesttoemptyqueue;

4whilequeue=∅∧time()0do9forallv2∈constrain(v1)do10insertv2intoqueue;11ifeval(v2)≥eval(v1)thenv1←v2;12end13end14ifeval(v1)≥eval(vbest)thenvbest←v1;15end16end

17returnvbest;

Algorithm1:Best-ﬁrstsearchthroughthespaceofcandidatesourcedeﬁnitions.1

candidatedeﬁnitions.(Wewillrefertothenewsourcerelationasthetargetpredicateandthesetofknownsourcerelationsassourcepredicates.)Thespaceofcandidatedeﬁnitionsisenumeratedinabest-ﬁrstmanner,witheachcandidatetestedtoseeifthedataitreturnsissimilartothetarget.Pseudo-codedescribingtheprocedureisgiveninAlgorithm1.8

Theﬁrststepinthealgorithmistoinvokethenewsourcewitharepresentativesetofinputtuplestogenerateexamplesofoutputtuplesthatcharacterisethefunctionalityofthesource.Thissetofinvocationsmustincludepositiveexamples(invocationsforwhichoutputtupleswereproduced)andifpossible,alsonegativetuples(inputsforwhichnooutputwasreturned).Thealgorithm’sabilitytoinducethecorrectdeﬁnitionforasourcedependsgreatlyonthenumberofpositiveexamplesavailable.Thusaminimumnumberofpositiveinvocationsofthesourceisimposed,meaningthatthealgorithmmayhavetoinvokethesourcerepeatedlyusingdiﬀerentinputsuntilsuﬃcientpositiveinvocationscanberecorded.Selectingappropriateinputvaluessoastosuccessfullyinvokeaserviceiseasiersaidthandone.Wedeferdiscussionoftheissuesanddiﬃcultiesinvolvedinsuccessfullyinvokingthenewsourcetosection6.1,andassumeforthemomentthattheinductionsystemisabletogenerateatableofvaluesthatrepresentitsfunctionality.

Thenextstepinthealgorithmistoinitialisethesearchbyaddinganemptyclausetothequeueofdeﬁnitionstoexpand.Therestofthealgorithmissimplyabest-ﬁrstsearchprocedure.Ateachiterationthehighestscoringbutnotyetexpandeddeﬁnition(denotedv0)isremovedfromthequeueandexpandedbyaddinganewpredicatetotheendofthe

8.Theimplementationofthisalgorithmusedintheexperimentsofsection7.3isavailableat:http://www.isi.edu/publications/licensed-sw/eidos/index.html

LearningSemanticDefinitionsofInformationSourcesontheInternet

clause(seethenextsectionforanexample).Eachcandidategenerated(denotedv1)isaddedtothequeue.Thealgorithmthenprogressivelyconstrainsthecandidatebybindingvariablesofthenewlyaddedpredicate,(seesection4.4).Theevalfunction(seesection5.3)evaluatesthequalityofeachcandidateproduced.Theprocedurestopsconstrainingthecandidatewhenthechangeintheevaluationfunction(∆eval)dropstozero.Itthencomparesv1tothepreviousbestcandidatevbestandupdatesthelatteraccordingly.

Inprinciplethealgorithmshouldterminatewhenthe“perfect”candidatedeﬁnitionisdiscovered-onethatproducesexactlythesamedataasthetarget.Inpracticethatneveroccursbecausethesourcesareincomplete(don’tperfectlyoverlapwitheachother)andarenoisy.Insteadthealgorithmterminateswheneitherthequeuebecomesempty,atimelimitisreachedoramaximumnumberofiterationshasbeenperformed.4.3AnExample

Wenowrunthroughanexampleoftheprocessofgeneratingcandidatedeﬁnitions.Consideranewlydiscoveredsource,whichtakesinazipcodeandadistance,andreturnsallthezipcodesthatliewithinthegivenradius(alongwiththeirrespectivedistances).Thetargetpredicaterepresentingthissourceis:

source5($zip1,$dist1,zip2,dist2)

Nowassumethattherearetwoknownsources.Theﬁrstisthesourceforwhichthedeﬁnitionwaslearntintheexamplefromsection2,namely:

source4($zip1,$zip2,dist):-centroid(zip1,lat1,long1),centroid(zip2,lat2,long2),

greatCircleDist(lat1,long1,lat2,long2,dist2),multiply(dist1,1.609,dist2).Thesecondsourceisn’tactuallyasourcebutaninterpretedpredicate:

≤(dist1,dist2).

Thesearchforadeﬁnitionforthenewsourcemightthenproceedasfollows.Theﬁrstdeﬁnitiontobegeneratedistheemptyclause:

source5($,$,,).

Thenullcharacter()representsthefactthatnoneoftheinputsoroutputshaveanyrestrictionsplacedontheirvalues.Priortoaddingtheﬁrstliteral(sourcepredicate),thesystemwillcheckwhetheranyoutputattributesechotheinputvalues.Inthiscase,giventhesemantictypes,twopossibilitiesneedtobechecked:

source5($zip1,$,zip1,).source5($,$dist1,,dist1).

Assumingneitherofthesepossibilitiesistrue(i.e.improvesthescore),thenliteralswillbeaddedoneatatimetoreﬁnethedeﬁnition.Aliteralisasourcepredicatewithanassignmentofvariablenamestoitsattributes.Anewdeﬁnitionmustbecreatedforeverypossibleliteralthatincludesatleastonevariablealreadypresentintheclause.(Forthemomentweignoretheissueofbindingconstraintsonthesourcesbeingadded.)Thusmanycandidatedeﬁnitionswouldbegenerated,includingthefollowing:

source5($zip1,$dist1,,):-source4($zip1,$,dist1).source5($zip1,$,zip2,):-source4($zip1,$zip2,).source5($,$dist1,,dist2):-≤(dist1,dist2).

Carman&Knoblock

Notethatthesemantictypesinthetypesignatureofthetargetpredicatelimitgreatlythenumberofcandidatedeﬁnitionsthatcanbeproduced.Thesystemthenevaluateseachofthesecandidatesinturn,selectingthebestoneforfurtherexpansion.Assumingtheﬁrstofthethreehasthebestscore,itwouldbeexpandedbyaddinganotherliteral,formingmorecomplicatedcandidatessuchasthefollowing:

source5($zip1,$dist1,,dist2):-source4($zip1,$,dist1),≤(dist1,dist2).Thisprocesscontinuesuntilthesystemdiscoversadeﬁnitionwhichperfectlydescribesthesource,orisforcedtobacktrackbecausenoliteralimprovesthescore.4.4IterativeExpansion

Thesourcesusedinthepreviousexampleallhadrelativelylowarity.OntheInternetthisisrarelythecase,withmanysourcesproducingalargenumberofattributesofthesametype.Thisisaproblembecauseitcausesanexponentialnumberofdeﬁnitionstobepossibleateachexpansionstep.Considerforinstanceastockpriceservice,whichprovidesthecurrent,high,low,market-openingandmarket-closingpricesforagiventickersymbol.Thetypesignatureforthatservicewouldbe:

stockprice($ticker,price,price,price,price,price)

Ifthedeﬁnitiontowhichthispredicateistobeaddedalreadycontainskdistinctpricevariables,thenthenumberofwaysinwhichthepriceattributesofthenewrelationcanbe

󰀂5󰀅5󰀁i

assignedvariablenamesisi=0ik,whichisprohibitivelylargeevenformoderatek.9Tolimitthesearchspaceinthecaseofsuchhigh-aritypredicates,weﬁrstgeneratecandidateswithaminimalnumberofboundvariablesinthenewliteralandprogressivelyconstrainthebestperformingofthesedeﬁnitionswithineachexpansion.(Higharitypredicatesarehandledinasimilarfashioninfoil,QuinlanandCameron-Jones,1993.)Forexample,considerusingthesourceabovetolearnadeﬁnitionforanewsourcewithsignature:

source6($ticker,price,price)

Westartbyaddingliteralstoanemptydeﬁnitionasbefore.Thistimethough,insteadofgeneratingaliteralforeverypossibleassignmentofvariablenamestotheattributesofeachrelation,wegenerateonlythesimplestassignmentssuchthatallofthebindingconstraintsaremet.(ThisistheexpandprocedurereferredtoinAlgorithm1.)Inthisexample,thetickersymbolinputofthestockpricesourcewouldneedtobebound,generatingasingledeﬁnition:

source6($tic,,):-stockprice($tic,,,,,).

Thisdeﬁnitionwouldbeevaluated,andthenmoreconstraineddeﬁnitionsaregeneratedbyequatingavariablefromthesameliteraltoothervariablesfromtheclause.(ThisistheconstrainprocedurefromAlgorithm1.)Twosuchdeﬁnitionsareshownbelow:

source6($tic,pri1,):-stockprice($tic,pri1,,,,).source6($tic,pri1,):-stockprice($tic,,pri1,,,).

Thebestofthesedeﬁnitionswouldthenbeselectedandconstrainedfurther,generatingdeﬁnitionssuchas:

9.Intuitively,onecanassignvariable󰀅5󰀁namestoiattributesusingklabelsinkdiﬀerentways.Onecanchooseiofthe5attributesiniways,andonecandothatfori=0,1,..,5.SeeWeber,Tausend,andStahl(1995)foradetaileddiscussionofthesizeofthehypothesisspaceinILP.

LearningSemanticDefinitionsofInformationSourcesontheInternet

source6($tic,pri1,pri2):-stockprice($tic,,pri1,pri2,,).source6($tic,pri1,pri2):-stockprice($tic,,pri1,,pri2,).

Inthisway,thebestscoringliteralcanbefoundwithouttheneedtoiterateoverallofthepossibleassignmentsofvariablestoattributes.4.5DomainPredicatesvs.SourcePredicates

Intheexamplesofsections4.3and4.4,thedecisiontoperformsearchoverthesourcepredicatesratherthanthedomainpredicateswasmadeinanarbitraryfashion.10Inthissectionwejustifythatdecision.Ifoneweretoperformthesearchoverthedomainpredicatesratherthanthesourcepredicates,thentestingeachdeﬁnitionwouldrequireanadditionalqueryreformulationstep.Forexample,considerthefollowingcandidatedeﬁnitionforsource5containingthedomainpredicatecentroid:

source5($zip1,$,,):-centroid(zip1,,).

Inordertoevaluatethiscandidate,thesystemwouldneedtoﬁrsttreatthedeﬁnitionasaqueryandreformulateitintoasetofrewritings(thattogetherformaunionquery)overthevarioussourcesasfollows:

source5($zip1,$,,):-source4($zip1,$,).source5($zip1,$,,):-source4($,$zip1,).

Thisunionquerycanthenbeexecutedagainsttheavailablesources(inthiscasejustsource4)toseewhattuplesthecandidatedeﬁnitionreturns.Inpracticehowever,ifthedeﬁnitionsfortheknownsourcescontainmultipleliterals(astheynormallydo)andthedomainrelationsareofhigh-arity(astheyoftenare),thenthesearchoverthespaceofconjunctionsofdomainpredicatesisoftenmuchlargerthanthecorrespondingsearchoverthespaceofconjunctionsofsourcepredicates.Thisisbecausemultipleconjunctionsofdomainpredicates(candidatedeﬁnitions)endupreformulatingtothesameconjunctionofsourcepredicates(unionqueries).Forexample,considerthefollowingcandidatedeﬁnitionswrittenintermsofthedomainpredicates:

source5($zip1,$,,):-centroid(zip1,lat,),greatCircleDist(lat,,,,).source5($zip1,$,,):-centroid(zip1,,lon),greatCircleDist(,lon,,,).

source5($zip1,$,,):-centroid(zip1,lat,lon),greatCircleDist(lat,lon,,,).Allthreecandidateswouldreformulatetothesamequeryoverthesources(shownbelow),andthusareindistinguishablegiventhesourcesavailable.

source5($zip1,$,,):-source4($zip1,$,).

Ingeneral,thenumberofcandidatedeﬁnitionsthatmaptothesamereformulationcanbeexponentialinthenumberofhiddenvariablespresentinthedeﬁnitionsoftheknownsources.Forthisreason,wesimplifytheproblemandsearchthespaceofconjunctionsofsourcepredicates.Insomesense,performingthesearchoverthesourcepredicatescanbeseenasintroducinga“similarityheuristic”whichfocusesthesearchtowarddeﬁnitionswithsimilarstructuretothedeﬁnitionsoftheavailablesources.Wenotethatthedeﬁnitionsproducedcan(andwill)laterbeconvertedtoqueriesovertheglobalpredicatesbyunfoldingand

10.Anoteonthediﬀerencebetweendomain,sourceandinterpretedpredicates.Domainpredicatesare

inventedbyadomainexpertforuseinmodelingaparticularinformationdomain.Theydeﬁneacommonschemathatcanbeusedfordescribinginformationfromdiﬀerentsources.Sourcepredicatesrepresentthesourcesavailableinthesystem.Interpretedpredicates,(suchas“≤”),areaspecialtypeofdomainpredicate,thatcanbetreatedassourcepredicates,sincetheirmeaningisinterpreted(understood).

Carman&Knoblock

possiblytighteningthemtoremoveredundancies.Wewilldiscusstheprocessoftighteningtheunfoldingsinsection6.6.4.6LimitingtheSearch

Thesearchspacegeneratedbythistop-downsearchalgorithmmaybeverylargeevenforasmallnumberofsources.Theuseofsemantictypeslimitsgreatlythewaysinwhichvariableswithineachdeﬁnitioncanbeequated(akathejoinpaths)andthusgoesalongwaytoreducethesizeofthesearchspace.Despitethisreduction,asthenumberofsourcesavailableincreases,thesearchspacebecomessolargethattechniquesforlimitingitmustbeused.Weemploysomestandard(andothernotsostandard)ILPtechniquesforlimitingthisspace.Suchlimitationsareoftenreferredtoasinductivesearchbiasorlanguagebias(N´edellec,Rouveirol,Ad´e,Bergadano,&Tausend,1996).

Anobviouswaytolimitthesearchistorestrictthenumberofsourcepredicatesthatcanoccurinadeﬁnition.Wheneveradeﬁnitionreachesthemaximumlength,backtrackingcanbeperformed,allowingthesearchtoescapefromlocalminimathatmayresultfromgreedyenumeration.Theassumptionhereisthatshorterdeﬁnitionsaremoreprobablethanlongerones,whichmakessensesinceserviceprovidersarelikelytoprovidedatainthesimplestformpossible.Moreover,thesimplerthedeﬁnitionlearnt,themoreusefulitwillbetoamediator,sowedecidetotradecompleteness(theabilitytoexpresslongerdeﬁnitions)forimprovedaccuracyovershorterdeﬁnitions.

Thesecondrestrictionplacedoncandidatedeﬁnitionsistolimitthenumberoftimesthesamesourcepredicateappearsinagivencandidate.Thismakessensebecausethedeﬁnitionsofrealservicestendnottocontainmanyrepeatedpredicates.Intuitively,thisisbecausemostservicesproviderawdatawithoutperformingmanycalculationsuponit.Repeateduseofthesamepredicateinadeﬁnitionismoreusefulfordescribingsomeformofcalculationthanrawdataitself.(Exceptionstothisruleexist,forexamplepredicatesrepresentingunitconversionfunctionalitysuchasFahrenheittoCelsius,maynecessarilyoccurmultipletimesinthedeﬁnitionofasource.)

Thethirdrestrictionlimitsthecomplexityofthedeﬁnitionsgeneratedbyreducingthenumberofliteralsthatdonotcontainvariablesfromtheheadoftheclause.Speciﬁcally,itlimitsthelevelofexistentialquantiﬁcation(sometimesalsoreferredtoasthedepth,MuggletonandFeng,1990)ofeachvariableinaclause.Thislevelisdeﬁnedtobezeroforalldistinguishedvariables(thoseappearingintheheadoftheclause).Forexistentialvariablesitisdeﬁnedrecursivelyasoneplusthelowestlevelofanyvariableappearinginthesameliteral.Forexample,thecandidatedeﬁnitionshownbelowhasamaximumexistentialquantiﬁcationlevelofthreebecausetheshortestpathfromthelastliteraltotheheadliteral(viajoinvariables)passesthroughtwootherliterals:

source5($zip1,$,,):-source4($zip1,$,d1),source3($d1,d2),source3($d2,).Theeﬀectofthisbiasistoconcentratethesearcharoundsimplerbuthighlyconnecteddeﬁnitions,whereeachliteraliscloselylinkedtotheinputandoutputofthesource.

Thefourthrestrictionplacedonsourcedeﬁnitionsisthattheyareexecutable.Morespeciﬁcally,itshouldbepossibletoexecutethemfromlefttoright,meaningthattheinputsofeachsourceappeareitherinthetargetpredicate(headoftheclause)orinoneoftheliteralstotheleftofthatliteral.Forexample,ofthetwocandidatedeﬁnitionsshownbelow,

LearningSemanticDefinitionsofInformationSourcesontheInternet

onlytheﬁrstisexecutable.Theseconddeﬁnitionisnot,becausezip2isusedasinputforsource4intheﬁrstliteral,withoutﬁrstbeingboundtoavalueintheheadoftheclause:

source5($zip1,$,zip2,):-source4($zip1,$zip2,).

source5($zip1,$,,):-source4($zip1,$zip2,dist1),source4($zip2,$zip1,dist1).Thisrestrictionservestwopurposes.Firstly,liketheotherbiases,itlimitsthesizeofthesearchspace.Secondly,itmakesiteasiertoevaluatethedeﬁnitionsproduced.Intheory,onecouldstillevaluatetheseconddeﬁnitionabovebygeneratinglotsofinputvaluesforzip2,butthatwouldrequirealotofinvocationsforminimalgain.

Thelastrestrictionreducesthesearchspacebylimitingthenumberoftimesthesamevariablecanappearinanygivenliteralinthebodyoftheclause.Deﬁnitionsinwhichthesamevariableappearsmultipletimesinagivenliteral,suchasinthefollowingexamplewhichreturnsthedistancebetweenazipcodeanditself,arenotverycommoninpractice:

source5($zip1,$,,dist2):-source4($zip1,$zip1,dist2).

Explicitlypreventingsuchdeﬁnitionsfrombeinggeneratedmakessensebecausesourcesrequiringthemaresorare,thatitisbettertoreducethesearchspaceexponentiallybyignoringthem,thantoexplicitlycheckforthemeachtime.

5.ScoringDeﬁnitions

Wenowproceedtotheproblemofevaluatingthecandidatedeﬁnitionsgeneratedduringsearch.Thebasicideaistocomparetheoutputproducedbythesourcewiththeoutputproducedbythedeﬁnitiononthesameinput.Themoresimilarthesetoftuplesproduced,thehigherthescoreforthecandidate.Thescoreisthenaveragedoverasetofdiﬀerentinputtuplestoseehowwellthecandidatedeﬁnitiondescribesthenewsource.

Inthemotivatingexampleofsection2,thesourceforwhichadeﬁnitionwasbeinglearnt(thedeﬁnitionisrepeatedbelow)onlyproducedoneoutputtuple󰀛dist󰀜foreveryinputtuple󰀛zip1,zip2󰀜:

source4($zip1,$zip2,dist):-centroid(zip1,lat1,long1),centroid(zip2,lat2,long2),greatCircleDist(lat1,long1,lat2,long2,dist2),multiply(dist1,1.6093,dist2).

Thisfactmadeitsimpletocomparetheoutputoftheservicewiththeoutputoftheinduceddeﬁnition.Ingeneralhowever,thesourcetobemodeled(andthecandidatedeﬁnitionsmodelingit)mayproducemultipleoutputtuplesforeachinputtuple.Takeforexamplesource5fromsection4.3,whichproducesthesetofoutputtuples󰀛zip2,dist2󰀜containingallthezipcodeswhichliewithinagivenradiusoftheinputzipcode󰀛zip1,dist1󰀜.Insuchcases,thesystemneedstocompareasetofoutputtupleswiththesetproducedbythedeﬁnitiontoseeifanyofthetuplesarethesame.Sinceboththenewsourceandtheexistingsourcesmaynotbecomplete,thetwosetsmaysimplyoverlap,evenifthecandidatedeﬁnitioncorrectlydescribesthenewsource.Assumingthatwecancountthenumberoftuplesthatarethesame,weneedameasurethattellsushowwellacandidatehypothesisdescribesthedatareturnedbyasource.Onesuchmeasureisthefollowing:

1󰀄|Os(i)∩Ov(i)|

score(s,v,I)=

|I|i∈I|Os(i)∪Ov(i)|

Carman&Knoblock

wheresisthenewsource,visacandidatesourcedeﬁnition,andI⊆D[βs]isthesetofinputtuplesusedtotestthesource(βsisthesetofinputattributesofsources).Os(i)denotesthesetoftuplesreturnedbythenewsourcewheninvokedwithinputtuplei.Ov(i)isthecorrespondingsetreturnedbythecandidatedeﬁnition.Usingrelationalprojection(π)andselection(σ)operatorsandthenotationintroducedinsection2.1,thesesetscan

crepresentstheoutputattributesofs.)bewrittenasfollows.(Notethatβs

c(σβ=i(E[s]))Os(i)≡πβss

and

c(σβ=i(EI[v]))Ov(i)≡πβss

Ifweviewthishypothesistestingasaninformationretrievaltask,wecanconsiderrecall

tobethenumberofcommontuples,dividedbythenumberoftuplesproducedbythesource,andprecisiontobethenumberofcommontuplesdividedbythenumberoftuplesproducedbythedeﬁnition.TheabovemeasuretakesbothprecisionandrecallintoaccountbycalculatingtheaverageJaccardsimilaritybetweenthesets.Thetablebelowgivesanexampleofhowthisscoreiscalculatedforeachinputtuple.

inputtuplei∈I󰀛a,b󰀜󰀛c,d󰀜󰀛e,f󰀜󰀛g,h󰀜󰀛i,j󰀜

actualoutputtuplesOs(i){󰀛x,y󰀜,󰀛x,z󰀜}{󰀛x,w󰀜,󰀛x,z󰀜}{󰀛x,w󰀜,󰀛x,y󰀜}

∅∅

predictedoutputtuplesOv(i){󰀛x,y󰀜}{󰀛x,w󰀜,󰀛x,y󰀜}{󰀛x,w󰀜,󰀛x,y󰀜}{󰀛x,y󰀜}∅

Jaccardsimilarity

fortuplei1/21/310#undef!

Theﬁrsttworowsofthetableshowinputsforwhichthepredictedandactualoutputtuplesoverlap.Inthethirdrow,thedeﬁnitionproducesexactlythesamesetoftuplesasthesourcebeingmodeledandthusgetsthemaximumscore.Inthefourthrow,thedeﬁnitionproducedatuple,whilethesourcedidn’t,sothedeﬁnitionwaspenalised.Inthelastrow,thedeﬁnitioncorrectlypredictedthatnotupleswouldbeoutputbythesource.Ourscorefunctionisundeﬁnedatthispoint.Fromacertainperspectivethedeﬁnitionshouldscorewellherebecauseithascorrectlypredictedthatnotuplesbereturnedforthatinput,butgivingahighscoretoadeﬁnitionwhenitproducesnotuplescanbedangerous.Doingsomaycauseoverlyconstraineddeﬁnitionsthatcangenerateveryfewoutputtuplestoscorewell.Atthesametime,lessconstraineddeﬁnitionsthatarebetteratpredictingtheoutputtuplesonaveragemayscorepoorly.Forexample,considerasourcewhichreturnsweatherforecastsforzipcodesinLosAngeles:

source($zip,temp):-forecast(zip,tomorrow,temp),UScity(zip,LosAngeles).

Nowconsidertwocandidatedeﬁnitionsforthesource.Theﬁrstreturnsthetemperatureforazipcode,whilethesecondreturnsthetemperatureonlyifitisbelow0◦C:

v1($zip,temp):-forecast(zip,tomorrow,temp).

v2($zip,temp):-forecast(zip,tomorrow,temp),temp<0◦C.

Assumethatthesourceandcandidatesareinvokedusing20diﬀerentrandomlyselectedzipcodes.Formostzipcodes,thesourcewillnotreturnanyoutput,becausethezipcodewilllieoutsideofLosAngeles.Theﬁrstcandidatewilllikelyreturnoutputforallzipcodes,whilethesecondcandidatewould,likethesource,onlyrarelyproduceanyoutput.Thisisbecausethetemperatureinmostzipcodeswillbegreaterthanzero,andhasnothingtodo

LearningSemanticDefinitionsofInformationSourcesontheInternet

withwhetherornotthezipcodeisinLosAngeles.Ifwescoredeﬁnitionshighlywhentheycorrectlyproducenooutput,thesystemwoulderroneouslypreferthesecondcandidateovertheﬁrst(becausethelatteroftenproducesnooutput).Topreventthatfromhappening,wesimplyignoreinputsforwhichthedeﬁnitioncorrectlypredictszerotuples.Thisisthesameassettingthescoretobetheaverageoftheothervalues.

Returningourattentiontothetable,afterignoringthelastrow,theoverallscoreforthisdeﬁnitionwouldbecalculatedas0.46.5.1PartialDeﬁnitions

Asthesearchproceedstowardthecorrectdeﬁnitionfortheservice,manysemi-complete(unsafe)deﬁnitionswillbegenerated.Thesedeﬁnitionswillnotproducevaluesforallattributesofthetargettuplebutonlyasubsetofthem.Forexample,thecandidate:

source5($zip1,$dist1,zip2,):-source4($zip1,$zip2,dist1).

producesonlyoneofthetwooutputattributesproducedbythesource.Thispresentsaproblem,becauseourscoreisonlydeﬁnedoversetsoftuplescontainingalloftheoutputattributesofthenewsource.Onesolutionmightbetowaituntilthedeﬁnitionsbecomesuﬃcientlylongastoproducealloutputs,beforecomparingthemtoseewhichonebestdescribesthenewsource.Thereare,however,tworeasonswhythiswouldnotmakesense:

•Thespaceofsafedeﬁnitionsistoolargetoenumerate,andthusweneedtocomparepartialdeﬁnitionstoguidethesearchtowardthecorrectdeﬁnition.

•Thebestdeﬁnitionthatthesystemcangeneratemaywellbeapartialone,asthesetofknownsourcesmaynotbesuﬃcienttocompletelymodelthesource.

Thesimplestwaytocomputeascoreforapartialdeﬁnitionistocomputethesamefunctionasbefore,butinsteadofusingtherawsourcetuples,projectingthemoverthesubsetofattributesthatareproducedbythedeﬁnition.Thisrevisedscoreisshownbelow.(Notethattheprojectionisoverv\\βs,whichdenotesthesubsetofoutputattributesofswhichareproducedbytheviewdeﬁnitionv.Notealsothattheprojectionisnotdistinct,i.e.multipleinstancesofthesametuplemaybeproduced.)

1󰀄|πv\\βs(Os(i))∩Ov(i)|

score2(s,v,I)=

|I|i∈I|πv\\βs(Os(i))∪Ov(i)|

Thisrevisedscoreisnotveryusefulhowever,asitgivesanunfairadvantagetodeﬁnitionsthatdonotproducealloftheoutputattributesofthesource.Thisisbecauseitisfareasiertocorrectlyproduceasubsetoftheoutputattributesthantoproduceallofthem.Considerforexamplethetwosourcedeﬁnitionsshownbelow.Thetwodeﬁnitionsareidenticalexceptthatthesecondreturnstheoutputdistancevaluedist2,whiletheﬁrstdoesnot:

source5($zip1,$dist1,zip2,):-source4($zip1,$zip2,dist2),≤(dist2,dist1).source5($zip1,$dist1,zip2,dist2):-source4($zip1,$zip2,dist2),≤(dist2,dist1).Sincethetwoareidentical,theprojectionoverthesubsetwillinthiscasereturnthesamenumberoftuples.Thismeansthatbothdeﬁnitionswouldgetthesamescorealthoughtheseconddeﬁnitionisclearlybetterthantheﬁrstsinceitproducesalloftherequiredoutputs.

Weneedtobeabletopenalisepartialdeﬁnitionsinsomewayfortheattributestheydon’tproduce.Onewaytodothisistoﬁrstcalculatethesizeofthedomain|D[a]|ofeach

Carman&Knoblock

ofthemissingattributes.Intheexampleabove,themissingattributeisthedistancevalue.Sincedistanceisacontinuousvalue,calculatingthesizeofitsdomainisnotobvious.Wecanapproximatethesizeofitsdomainby:

|D[distance]|≈

max−minaccuracy

whereaccuracyistheerror-boundondistancevalues.(Wewilldiscusserror-boundsfurtherinsection5.4.)Notethatthecardinalitycalculationmaybespeciﬁctoeachsemantictype.Armedwiththedomainsize,wecanpenalisethescoreforthedeﬁnitionbydividingitbytheproductofthesizeofthedomainsofalloutputattributesnotgeneratedbythedeﬁnition.Inessence,wearesayingthatallpossiblevaluesfortheseextraattributeshavebeen“allowed”bythisdeﬁnition.Thistechniqueissimilartotechniquesusedforlearningwithoutexplicitnegativeexamples(Zelle,Thompson,Caliﬀ,&Mooney,1995).

c\\v,thusthepenaltyThesetofmissingoutputattributesisgivenbytheexpressionβs

formissingattributesisjustthesizeofthedomainoftuplesofthatscheme,i.e.:

penalty=|D[βs\\v]|

Usingthispenaltyvaluewecancalculateanewscore,whichtakesintoaccountthemissing

attributes.Simplydividingtheprojectedscorebythepenaltywouldnotadheretotheintendedmeaningofcompensatingforthemissingattributevalues,andthusmayskewtheresults.Instead,wederiveanewscorebyintroducingtheconceptoftypeddompredicatesasfollows:

Adompredicateforasemanticdata-typet,denoteddomt,isasinglearityrelationwhoseextensionissettobethedomainofthedatatype,i.e.E[domt]=D[t].Similarly,adompredicateforaschemeA,denoteddomA,isarelationoverAwhoseextensionisE[domA]=D[A].

DompredicateswereintroducedbyDuschkatohandletheproblemofqueryreformulationinthepresenceofsourceswithbindingconstraints(Duschka,1997).(Inthatworkthepredicateswerenottyped,althoughtypingwouldhaveresultedinamoreeﬃcientalgo-rithm.)Hereweusethemtoconvertapartialdeﬁnitionvintoasafe(complete)deﬁnitionv󰀄.Wecandothissimplybyaddingadompredicatetotheendoftheviewdeﬁnitionthatgeneratesvaluesforthemissingattributes.Fortheexampleabove,v󰀄wouldbe:

source5($zip1,$dist1,zip2,x):-source4($zip1,$zip2,dist2),≤(dist2,dist1),domdistance(x).

wherexisanewvariableoftypedistance.Thenewviewdeﬁnitionv󰀄issafe,becauseallthevariablesintheheadoftheclausealsoappearinthebody.Ingeneral,wecanturnanyunsafeviewdeﬁnitionvintoasafedeﬁnitionv󰀄byappendingadompredicate

c\\v(x1,...,xn),whereeachxiisadistinguishedvariable(fromtheheadoftheclause)domβs

correspondingtoanoutputattributeofv󰀄thatwasn’tboundinv.Nowwecanusethiscompletedeﬁnitiontocalculatethescoreasbefore:

1󰀄|Os(i)∩Ov󰀁(i)|

score3(s,v,I)=score(s,v,I)=

|I|i∈I|Os(i)∪Ov󰀁(i)|

󰀄

LearningSemanticDefinitionsofInformationSourcesontheInternet

whichcanberewritten(byexpandingthedenominator)asfollows:

score3(s,v,I)=

|Os(i)∩Ov󰀁(i)|1󰀄

|I|i∈I|Os(i)|+|Ov󰀁(i)|−|Os(i)∩Ov󰀁(i)|

Wecanthenremovethereferencestov󰀄fromthisequationbyconsidering:

c\\v]=Ov(i)×D[βs\\v]Ov󰀁(i)=Ov(i)×E[domβs

c\\v]|andthesizeoftheintersectionThusthesizeofthesetisgivenby|Ov󰀁(i)|=|Ov(i)||D[βs

canbecalculatedbytakingtheprojectionovertheoutputattributesproducedbyv:

Substitutingthesecardinalitiesintothescorefunctiongivenabove,wearriveatthefollowing

equationforthepenalisedscore:

|πv\\βs(Os(i))∩Ov(i)|1󰀄

score3(s,v,I)=c\\v]|−|π|I|i∈I|Os(i)|+|Ov(i)||D[βsv\\βs(Os(i))∩Ov(i)|

5.2BindingConstraints

Someofthecandidatedeﬁnitionsgeneratedduringthesearchmayhavediﬀerentbinding

constraintsfromthetargetpredicate.Forinstanceinthepartialdeﬁnitionshownbelow,thevariablezip2isanoutputofthetargetsource,butaninputtosource4:

source5($zip1,$dist1,zip2,):-source4($zip1,$zip2,dist1).

Fromalogicalperspective,inordertotestthisdeﬁnitioncorrectly,weneedtoinvokesource4witheverypossiblevaluefromthedomainofzipcodes.Doingthisisnotpracticalfortworeasons:ﬁrstly,thesystemmaynothaveacompletelistofzipcodesatitsdisposal.Secondlyandfarmoreimportantly,invokingsource4withthousandsofdiﬀerentzipcodeswouldtakeaverylongtimeandwouldprobablyresultinthesystembeingblockedfromfurtheruseoftheservice.Soinsteadofinvokingthesamesourcethousandsoftimes,weapproximatethescoreforthisdeﬁnitionbysamplingfromthedomainofzipcodesandinvokingthesourceusingthesampledvalues.Wethencompensateforthissamplingbyscaling(certaincomponentsof)thescorebytheratioofthesampledzipcodestotheentiredomain.Consideringtheexampleabove,ifwerandomlychooseasample(denotedδ[zipcode])ofsay20valuesfromthedomainofzipcodes,thenthesetoftuplesreturnedbythedeﬁnitionwillneedtobescaledbyafactorof|D[zipcode]|/20.

Amoregeneralequationforcomputingthescalingfactor(denotedSF)isshownbelow.Notethatthesamplingmayneedtobeperformedoverasetofattributes.(Hereβv\\βsdenotestheinputattributesofvwhichareoutputsofs.)

SF=

|D[βv\\βs]||δ[βv\\βs]|

Wenowcalculatetheeﬀectofthisscalingfactorontheoverallscoreasfollows.Wedenote

˜v(i).ThisvaluethesetoftuplesreturnedbythedeﬁnitiongiventhesampledinputasO

Carman&Knoblock

whenscaledwillapproximatethesetoftuplesthatwouldhavebeenreturnedhadthedeﬁnitionbeeninvokedwithallthepossiblevaluesfortheadditionalinputattributes:

˜v(i)|∗SF|Ov(i)|≈|O

Assumingthesamplingisperformedrandomlyoverthedomainofpossiblevalues,the

intersectionbetweenthetuplesproducedbythesourceandthedeﬁnitionshouldscaleinthesameway.Thustheonlyfactornotaﬀectedbythescalinginthescoredeﬁnedpreviouslyis|Os(i)|.Ifwedividethroughoutbythescalingfactorwehaveanewscoringfunction:

˜v(i)||πv\\βs(Os(i))∩O1󰀄

score4(s,v,I)=

˜v(i)||D[βc\\v]|−|πv\\β(Os(i))∩O˜v(i)||I|i∈I|Os(i)|/SF+|Oss

Theproblemwiththisapproachisthatoftenthesampledsetofvaluesistoosmallandasa

resultitdoesnotintersectwiththesetofvaluesreturnedbythesource,eventhoughalargersamplewouldhaveintersectedinsomeway.Thusoursamplingintroducesunfairdistortionsintothescoreforcertaindeﬁnitions,causingthemtoperformpoorly.Forexample,consideragainsource5andassumethatforscalabilitypurposes,theserviceplacesalimitonthemaximumvaluefortheinputradiusdist1.(Thismakessense,asotherwisetheusercouldsettheinputradiustocovertheentireUS,andatupleforeverypossiblezipcodewouldneedtobereturned.)Nowconsiderthesamplingperformedabove.Ifwerandomlychooseonly20zipcodesfromthesetofallpossiblezipcodes,thechanceofthesamplecontainingazipcodewhichlieswithina300mileradiusofaparticularzipcode(inthemiddleofthedesert)isverylow.Moreover,evenifonepairofzipcodes(outof20)resultsinasuccessfulinvocation,thiswillnotbesuﬃcientforlearningagooddeﬁnitionfortheservice.

Sotogetaroundthisproblemwebiasthesamplesuchthat,wheneverpossible,halfofthevaluesaretakenfrompositiveexamplesofthetarget(thosetuplesreturnedbythenewsource)andhalfaretakenfromnegativeexamples(thosetuplesnotreturnedbythesource).Bysamplingfrombothpositiveandnegativetuples,weguaranteethattheapproximationgeneratedwillbeasaccurateaspossiblegiventhelimitedsamplesize.Wedenotethesetofpositiveandnegativesamplesasδ+[βv\\βs]andδ−[βv\\βs],andusethesevaluestodeﬁnepositiveandtotalscalingfactorsasshownbelow.(Thenumeratorforthepositivevaluesisdiﬀerentfrombefore,asthesevalueshavebeentakenfromtheoutputofthenewsource.)

SF+=

|πβv\\βs(πv\\βs(Os(i)))|

|δ+[βv\\βs]|

Thetotalscalingfactoristhesamevalueasbefore,butcalculatedslightlydiﬀerently:

SF=

|D[βv\\βs]|

|δ+[βv\\βs]|+|δ−[βv\\βs]|

Thescorecanthenbeapproximatedaccordinglybytakingintoaccountthesenewscalingfactors.Theintersectionnowneedstobescaledusingthepositivescalingfactor:

˜v(i)|∗SF+|πv\\βs(Os(i))∩Ov(i)|≈|πv\\βs(Os(i))∩O

Thisnewscalingresultsinanewfunctionforevaluatingthequalityofaviewdeﬁnition:˜v(i)|∗SF+|πv\\βs(Os(i))∩O1󰀄

score5(s,v,I)=

c\\v]|∗SF−|π+˜v(i)||D[βs˜|I|i∈I|Os(i)|+|Ov\\βs(Os(i))∩Ov(i)|∗SF

LearningSemanticDefinitionsofInformationSourcesontheInternet

5.3FavouringShorterDeﬁnitions

Nowthatwehavederivedascoreforcomparingthedatathatasourceandcandidateproduce,wecandeﬁnetheevaluationfunctionevalusedinAlgorithm1.Asmentionedinsection4.6,shorterdeﬁnitionsforthetargetsourceshouldbepreferredoverlongerandpossiblylessaccurateones.Inaccordancewiththisprinciple,wescalethescorebythelengthofthedeﬁnition,soastofavourshorterdeﬁnitionsasfollows:

eval(v)=ωlength(v)∗score5(s,v,I)

Herelength(v)isthelengthoftheclauseandω<1isaweightingfactor.Settingtheweightingfactortobealittlelessthan1(suchas0.95)helpstoremovelogicallyredundantdeﬁnitions,whichcansometimesbehardtodetect,butoftenreturnalmostexactlythesamescoreastheirshorterequivalent.Wewilldiscusstheproblemofgeneratingnon-redundantclausesinsection6.3.

5.4ApproximatingEquality

Untilnow,wehaveignoredtheproblemofdecidingwhethertwotuplesproducedbythetargetsourceandthedeﬁnitionarethesame.Sincediﬀerentsourcesmayserializedataindiﬀerentwaysandatdiﬀerentlevelsofaccuracy,wemustallowforsomeﬂexibilityinthevaluesthatthetuplescontain.Forinstance,intheexamplefromsection2,thedistancevaluesreturnedbythesourceanddeﬁnitiondidnotmatchexactly,butwere“suﬃcientlysimilar”tobeacceptedasthesamevalue.

Fornumerictypesliketemperatureordistanceitmakessensetouseanerrorbound(like±0.5◦C)orapercentageerror(suchas±1%)todecideiftwovaluescanbeconsideredthesame.Thisisbecausethesensingequipment(inthecaseoftemperature)orthealgorithm(inthecaseofdistance)willhavesomeerrorboundassociatedwiththevaluesitproduces.Werequirethatanerrorboundforeachnumerictypebeprovidedintheproblemspeciﬁcation.(Ideally,theseboundswouldbelearntautomaticallyfromexamples.)

Forcertainnominaltypeslikecompanynames,wherevalueslike󰀛IBMCorporation󰀜and󰀛InternationalBusinessMachinesCorp.󰀜representthesamevalue,simplisticequalitycheckingusingexactorsubstringmatchesisnotsuﬃcientfordecidingwhethertwovaluescorrespondtothesameentity.Inthiscase,stringeditdistancessuchastheJaroWinklerscoredobetteratdistinguishingstringsrepresentingthesameentityfromthoserepresentingdiﬀerentones(Bilenko,Mooney,Cohen,Ravikumar,&Fienberg,2003).Amachinelearningclassiﬁercouldbetrainedonasetofsuchexamplestolearnwhichoftheavailablestringeditdistancesbestdistinguishesvaluesofthattypeandwhatthresholdtosetforacceptingapairasamatch.Werequirethatthispairofsimilaritymetricandthreshold(oranycombinationsofmetrics)beprovidedintheproblemspeciﬁcation.

Inothercases,enumeratedtypeslikemonthsoftheyearmightbeassociatedwithasimpleequalitycheckingprocedure,sothatvalueslike󰀛January󰀜,󰀛Jan󰀜and󰀛1󰀜canbefoundequal.Theactualequalityprocedureusedwilldependonthesemantictypeandweassumeinthisworkthatsuchaprocedureisgivenintheproblemdeﬁnition.Wenotethattheprocedureneednotbe100%accurate,butonlyprovideasuﬃcientlevelofaccuracytoguidethesystemtowardthecorrectsourcedescription.Indeed,theequalityrulescouldalsobegeneratedoﬄinebytrainingaclassiﬁer.

Carman&Knoblock

Complextypessuchasdatepresentabiggerproblemwhenoneconsiderstherangeofpossibleserializations,includingvalueslike󰀛5/4/2006󰀜or󰀛Thu,4May2006󰀜or󰀛2006-05-04󰀜.Insuchcasesspecializedfunctionsarenotonlyrequiredtocheckequalitybetweenvaluesbutalsotobreakthecomplextypesupintotheirconstituentparts(inthiscaseday,monthandyear).Thelatterwouldformpartofthedomainmodel.

Insomecases,decidingwhethertwovaluesofthesametypecanbeconsideredequaldependsnotonlyonthetype,butalsoontherelationstheyareusedin.Considerthetworelationsshownbelow.Theﬁrstprovidesthelatitudeandlongitudecoordinatesofthecentroidforazipcode,whilethesecondreturnsthecoordinatesforaparticularaddress:

centroid(zipcode,latitude,longitude)

geocode(number,street,zipcode,latitude,longitude)

Giventhediﬀerentwaysofcalculatingthecentroidofazipcode(includingusingthecenterofmassorthecenterofpopulationdensity)anerrorboundof500metersmightmakesenseforequatinglatitudeandlongitudecoordinates.Forageocodingservice,ontheotherhand,anerrorboundof50metersmaybemorereasonable.Ingeneral,sucherrorboundsshouldbeassociatedwiththesetofglobalrelations,insteadofjustthesemantictypes,andcouldbelearntaccordingly.Whentherelationscontainmultipleattributes,thentheproblemofdecidingwhethertwotuplesrefertothesameentityiscalledrecordlinkage(Winkler,1999).Anentireﬁeldofresearchisdevotedtotacklingthisproblem.Duetothecomplexityoftheproblemandthevarietyoftechniquesthathavebeendevelopedtohandleit,wedonotinvestigateitfurtherhere.

6.Extensions

Inthissectionwediscussextensionstothebasicalgorithmneededforhandlingrealdatasources,aswellaswaystoreducethesizeofthehypothesisspaceandimprovethequalityofthedeﬁnitionsproduced.6.1GeneratingInputs

Theﬁrststepinthesourceinductionalgorithmistogenerateasetoftupleswhichwillrepresentthetargetrelationduringtheinductionprocess.Inotherwords,thesystemmusttrytoinvokethenewsourcetogathersomeexampledata.Doingthiswithoutbiasingtheinductionprocessiseasiersaidthandone.Thesimplestapproachtogeneratinginputvaluesistoselectconstantsatrandomfromthesetofexamplesgivenintheproblemspeciﬁcation.Theproblemwiththisapproachisthatinsomecasesthenewsourcewillnotproduceanyoutputfortheselectedinputs.Insteadthesystemmayneedtoselectvaluesaccordingtosomedistributionoverthedomainofvaluesinorderforthesourcetoinvokecorrectly.Forexample,considerasourceprovidingpostsofusedcarsforsaleinacertainarea.Thesourcetakesthemakeofthecarasinput,andreturnscardetails:

usedCars($make,model,year,price,phone)

Althoughthereareoverahundreddiﬀerentcarmanufacturersintheworld,onlyafewofthemproducethebulkofthecars.ThusinvokingthesourcewithvalueslikeFerrari,LotusandAstonMartinwillbelesslikelytoreturnanytuples,whencomparedwithmorecommonbrandssuchasFordandToyota(unlessthesourceisonlyprovidingdataforsportscarsofcourse).Ifadistributionoverpossiblevaluesisavailable,thesystemcanﬁrsttry

LearningSemanticDefinitionsofInformationSourcesontheInternet

themorecommonvalues,ormoregenerally,itcanchoosevaluesfromthatsetaccordingtothedistribution.Inthisparticularexample,itmightnotbetoodiﬃculttoquerythesourcewithacompletesetofcarmanufacturersuntiloneoftheinvocationsreturnssomedata.Ingeneral,thesetofexamplesmaybeverylarge(suchasthe40,000+zipcodesintheUS)andthenumberof“interesting”valuesinthatset(theoneslikelytoreturnresults)maybeverysmall,inwhichcasetakingadvantageofpriorknowledgeaboutthedistributionofpossiblevaluesmakessense.Itshouldbenotedalsothatduringexecutionthesystemwillreceivealotofoutputdatafromthediﬀerentsourcesitaccesses.Thisdatacanberecordedtogeneratedistributionsoverpossiblevaluesforthediﬀerenttypes.

Theproblemofgeneratingviableinputdataforanewsourcebecomesyetmorediﬃcultiftheinputrequiredisnotasinglevaluebutatupleofvalues.Inthiscasethesystemcanﬁrsttrytoinvokethesourcewithrandomcombinationsofattributevaluesfromtheexamplesofeachtype.Invokingsomesources(suchassource5)iseasybecausethereisnoexplicitrestrictiononthecombinationofinputvalues:

source5($zip,$distance,zip,distance)

Inothercases,suchasageocodingservicethecombinationofpossibleinputvaluesishighlyrestricted:

USGeocoder($number,$street,$zipcode,latitude,longitude)

Randomlyselectinginputvaluesindependentlyofoneanotherisunlikelytoresultinanysuccessfulinvocations.(Inorderfortheinvocationtosucceed,therandomlygeneratedtuplemustcorrespondtoanaddressthatactuallyexists.)Insuchcases,afterfailingtoinvokethesourceanumberoftimes,thesystemcantrytoinvokeothersources(suchasthehotellookupservicebelow),whichproducetuplescontainingtherequiredattributetypes:

HotelSearch($city,hotel,number,street,zipcode)

Ingeneral,thisprocessofinvokingsourcestogenerateinputforothersourcescanbechaineduntilasetofviableinputsisgenerated.

Wenotethattheproblemofsynthesizingviableinputdataisitselfadiﬃcultandinterestingresearchproblem.Ourcombinedapproachofutilizingvaluedistributionsandinvokingalternativeservicesperformswellinourexperiments(seesection7.3),butanareaoffutureworkistodevelopamoregeneralsolution.6.2DealingwithSources

Inordertominimisesourceaccesses,whichcanbeveryexpensiveintermsofbothtimeandbandwidth,allrequeststotheindividualsourcesarecachedinalocalrelationaldatabase.Thisimplementationmeansthatthereisanimplicitassumptioninthisworkthattheoutputproducedbytheservicesisconstantforthedurationoftheinductionprocess.Thiscouldbeproblematiciftheservicebeingmodeledprovides(near)real-timedatawithanupdatefrequencyoflessthanthetimeittakestoinduceadeﬁnition.Foraweatherpredictionservice,updatedhourly,thismaynotpresentmuchofaproblem,sincethediﬀerencebetweenpredictedtemperaturesmayvaryonlyslightlyfromoneupdatetothenext.Forareal-timeﬂightstatusserviceprovidingthecoordinatesofagivenaircrafteveryﬁveminutes,thecachingmaybeproblematicasthelocationoftheplanewillvarygreatlyifittakes,forexample,onehourtoinduceadeﬁnition.Intheoryonecouldtestforsuchvariationsystematicallybyperiodicallyinvokingthesamesourcewithapreviously

Carman&Knoblock

successfulinputtupletoseeiftheoutputhaschanged,andupdatethecachingpolicyaccordingly.

6.3LogicalOptimisations

Evaluatingdeﬁnitionscanbeexpensivebothintermsoftime(waitingforsourcestore-turndata)andcomputation(calculatingjoinsoverlargetables).Thusitmakessensetocheckeachcandidateforredundancybeforeevaluatingit.Todecidewhichdeﬁnitionsareredundant,weusetheconceptofquerycontainment:

Aqueryq1∈LR,Aissaidtobecontainedinanotherqueryq2∈LR,AifforanydatabaseinstanceI,thesetoftuplesreturnedbytheﬁrstqueryisasubsetofthosereturnedbythesecond,i.e.∀IEI[q1]⊆EI[q2].Wedenotecontainmentbyq1󰀟q2.Twoqueriesareconsideredlogicallyequivalentifq1󰀟q2∧q2󰀟q1.Fortheconjunctivequerieslearntinthispaper,testingforquerycontainmentreducestotheproblemofﬁndingacontainmentmapping(Chandra&Merlin,1977).11Wecanusethistesttodiscoverlogicallyequivalentdeﬁnitionssuchasthefollowing,(whichcontainareorderingofthesameliterals):

source($zip,temp):-getCentroid($zip,lat,lon),getConditions($lat,$lon,temp).source($zip,temp):-getConditions($lat,$lon,temp),getCentroid($zip,lat,lon).Suchequivalencecheckingcanbeperformedeﬃcientlyifacanonicalorderingofpredicateandvariablenamesischosenapriori.Wheneverlogicallyequivalentdeﬁnitionsaredis-covered,thesearchcanbacktrack,therebyavoidingentiresub-treesofequivalentclauses.Similarly,wecantestforandskiplogicallyredundantclausessuchasthefollowing(whichisequivalenttoashorterdeﬁnitionwithoutthesecondliteral):

source($zip,):-getCentroid($zip,lat,long),getCentroid($zip,lat,).

Again,suchredundancycheckingcanbeperformedeﬃciently(Levy,Mendelzon,Sagiv,&Srivastava,1995)resultinginlittlecomputationaloverheadduringsearch.6.4FunctionalSources

Moreinformationmaybeknownaboutthefunctionalityofcertainsourcesthanisexpressedbytheirsourcedeﬁnitions.Forexample,sourceslikeMultiplyandConcatenate,whichareimplementedlocally,willbeknowntobecomplete.(Asourceisconsideredcomplete,ifitreturnsallofthetuplesimpliedbyitsdeﬁnition,i.e.E[s]=EI[v].)Wheneversuchinformationisavailable,theinductionsystemcantakeadvantageofittoimprovesearcheﬃciency.Toexplainhow,wedeﬁneaclassofsourcesthatwecallfunctionalsources,whicharecompleteandforanyinputtuplereturnexactlyoneoutputtuple.ThisisslightlymorerestrictivethanthestandardILPconceptofdeterminateliterals(Cameron-Jones&Quinlan,1994),whichforeveryinputtuplereturnatmostoneoutputtuple.MultiplyandConcatenatearebothexamplesoffunctionalsources.Thesystemtakesadvantageofthefactthatfunctionalsourcesplacenorestrictionsontheirinput.Wheneverafunctionalsourceisaddedtoacandidatedeﬁnition,thescoreforthatdeﬁnitiondoesn’tchangeprovidingallthesource’sinputsandnoneofitsoutputsarebound.(Thesetoftuplesreturnedbythenew

11.Ifthequeriescontaininterpretedpredicates,thencontainmenttestingisalittlemoreinvolved(Afrati,

Li,&Mitra,2004).

LearningSemanticDefinitionsofInformationSourcesontheInternet

deﬁnitionisthesameasbefore,butwithafewnewattributescorrespondingtotheoutputsofthesource.)Thusthenewdeﬁnitiondoesnotneedtobeevaluated,butcanbeaddedtothequeue(ofdeﬁnitionstoexpand)asis,whichbecomesparticularlyadvantageouswhenasource’sinputarityishigh.6.5Constants

Constantsareoftenusedinsourcedescriptionstodeﬁnethescopeofaservice.ConsideraweatherservicethatprovidesreportsforzipcodesonlyinCalifornia:

calWeather($zip,$date,temp):-forecast(zip,date,temp),USstate(zip,California).IfamediatorreceivesaqueryaskingfortheforecastforChicago,itwillknowthatthissourceisnotrelevanttothequerysinceChicagoisnotinCalifornia.Althoughconstantsinsourcedescriptionscanbeveryuseful,simplyintroducingthemintothehypothesislanguagecouldcausethesearchspacetogrowprohibitively.(Forstates,thebranchingfactorwouldbe50,whileforzipcodesitwouldbeinexcessof40,000.)Obviouslyagenerateandtestmethodologydoesnotmakesensewhenthedomainofthesemantictypeislarge.Alternatively,onecanexplicitlycheckforrepeatedvaluesinthetuplesreturnedbythenewsource(i.e.forconstantsintheheadoftheclause)andinthejoinofthesourceanddeﬁnitionrelations(i.e.forconstantsinthebodyoftheclause).Forexample,inthedeﬁnitionbelowthejoinofthesourcerelation󰀛zip,date,temp󰀜withthedeﬁnitionrelation󰀛zip,date,temp,state󰀜wouldproduceonlytupleswithstateequaltoCalifornia.Sothatconstantcouldbeaddedtothedeﬁnition.

source($zip,$date,temp):-forecast(zip,date,temp),USstate(zip,state).Morecomplicateddetectionprocedureswouldberequiredfordiscoveringconstantsinin-terpretedpredicates(i.e.rangerestrictionsovernumericattributes).6.6Post-Processing

Afteradeﬁnitionhasbeenlearntforanewsource,itmaybepossibletotightenthatdeﬁnitionbyremovinglogicalredundanciesfromitsunfolding.Considerthefollowingdeﬁnitioninvolvingcallstotwohotelsources,onetocheckavailabilityandtheothertocheckitsrating:

source($hotel,address,rating):-HotelAvailability($hotel,address,price),HotelRating($hotel,rating,address).Theunfoldingofthatdeﬁnition(intermsofthedeﬁnitionsofthehotelsources)containstworeferencestoanaccommodationrelation:

source($hotel,address,rating):-accommodation(hotel,,address),available(hotel,today,price),accommodation(hotel,rating,address).

Theﬁrstliteralisredundantandcanberemovedfromtheunfolding.Ingeneral,thesamerulesusedtodiscoverredundancyincandidatedeﬁnitionscanbeusedtoremoveredundantliteralsfromtheunfolding.Moreover,sincethispost-processingstepneedstobeperformedonlyonce,timecanbespentsearchingformorecomplicatedformsofredundancy.

Carman&Knoblock

7.Evaluation

Inthissectionwedescribeourevaluationofthesourceinductionalgorithm.Weﬁrstdescribetheexperimentalsetupusedandthentheexperimentsperformed.Finally,wecomparetheinductionalgorithmwithaparticularcomplexschemamatchingsystem.7.1ExperimentalSetup

Thesourceinductionalgorithmdeﬁnedinthispaperwasimplementedinasystemcalledeidos,whichstandsforEﬃcientlyInducingDeﬁnitionsforOnlineSources.eidosim-plementsthetechniquesandoptimisationsdiscussedinsections4through6.(Certainextensionsfromsection6wereonlypartiallyimplemented:theimplementationcurrentlychecksforconstantsonlyintheheadoftheclauseanddoesnotperformanytighteningofthedeﬁnitions.)AllcodewaswritteninJavaandaMySQLdatabasewasusedforcachingtheresultsofsourceinvocations.

eidoswastestedon25diﬀerentproblemsinvolvingrealservicesfromseveraldomainsincludinghotels,ﬁnancialdata,weatherandcars.Thedomainmodelusedwasthesameforeachproblemandincludedover70diﬀerentsemantictypes,rangingfromcommononeslikezipcodetomorespeciﬁctypessuchasstocktickersymbols.Thedatamodelalsocontained36relations(excludinginterpretedpredicates),whichwereusedtomodel33diﬀerentservices.Allofthemodeledservicesarepubliclyavailableinformationsources.Wenoteherethatthedecisiontousethesamesetofknownsourcesforeachproblem(regardlessofthedomain)wasimportantinordertomakesurethatthetestswererealistic.Thisdecisionmadetheproblemmorediﬃcultthanthestandardschemamatching/mappingscenarioinwhichthesourceschemaischosen,becauseitprovidesdatathatisknownaprioritoberelevanttotheoutputschema.

Inordertogiveabettersenseoftheproblemsettingandthecomplexityoftheknownsourcesavailable,welisttenbelow(orderedbyarity).Duetospacelimitationswedon’tshowthecompletelistnortheirdeﬁnitions,onlytheinput/outputtypesforeachsource.Notethatallthesourcessharethesemantictypeslatitudeandlongitude,whichmeansthatthesearchspaceassociatedwiththesesourcesaloneisverylarge.

WeatherConditions($city,state,country,latitude,longitude,time,time,timeoffset,datetime,temperatureF,sky,pressureIn,direction,speedMph,humidity,temperatureF)2WeatherForecast($city,state,country,latitude,longitude,timeoffset,day,date,

temperatureF,temperatureF,time,time,sky,direction,speedMph,humidity)3GetDistance($latitude,$longitude,$latitude,$longitude,distanceKm)4USGeocoder($street,$zipcode,city,state,latitude,longitude)5ConvertDMS($latitudeDMS,$longitudeDMS,latitude,longitude)6USGSEarthquakes(decimal,timestamp,latitude,longitude)7GetAirportCoords($iata,airport,latitude,longitude)8CountryCode($latitude,$longitude,countryAbbr)9GetCentroid($zipcode,latitude,longitude)10Altitude($latitude,$longitude,distanceM)

Inordertoinducedeﬁnitionsforeachproblem,thesource(andeachcandidatedeﬁni-tion)wasinvokedatleast20timesusingrandominputs.Wheneverpossible,thesystemattemptedtogenerate10positiveexamplesofthesource(invocationsforwhichthesourcereturnedsometuples)and10negativeexamples(inputswhichproducednooutput).To

LearningSemanticDefinitionsofInformationSourcesontheInternet

ensurethatthesearchterminated,thenumberofiterationsofthealgorithmincludingback-trackingstepswaslimitedto30.Asearchtimelimitof20minuteswasalsoimposed.Theinductivesearchbiasusedduringtheexperimentsisshownbelow,andaweightingfactor(deﬁnedinsection5.3)of0.9wasusedtodirectthesearchtowardshorterdeﬁnitions.

SearchBiasMaximumclauselength=7Maximumpredicaterepetition=2Maximumvariablelevel=5Executablecandidatesonly

Novariablerepetitionwithinaliteral

Intheexperiments,diﬀerentprocedureswereusedtodecideequalitybetweenvaluesofthesametypeasdiscussedinsection5.4.Someoftheequalityproceduresusedfordiﬀerenttypesarelistedbelow.Theaccuracyboundsandthresholdsusedwerechosentomaximizeoverallperformanceofthelearningalgorithm.(Inpractice,ameta-learningalgorithmcouldbeusedtodeterminethebestaccuracyboundsfordiﬀerentattributetypes.)Forallsemantictypesnotlistedbelow,substringmatching(checkingifonestringcontainedtheother)wasusedtotestequalitybetweenvalues.

Typeslatitudes,longitudesdistances,speeds,temperatures,priceshumidity,pressure,degreesdecimals

companies,hotels,airportsdates

EqualityProcedureaccuracyboundof±0.002accuracyboundof±1%accuracyboundof±1.0accuracyboundof±0.1JaroWinklerscore≥0.85specialisedequalityprocedure

Theexperimentswererunonadualcore3.2GHzPentium4with4GBofRAM(althoughmemorywasnotalimitingfactorinanyofthetests).ThesystemwasrunningWindows2003Server,JavaRuntimeEnvironment1.5andMySQL5.0.7.2EvaluationCriteria

Inordertoevaluatetheinductionsystem,onewouldliketocompareforeachproblemthedeﬁnitiongeneratedbythesystemwiththeidealdeﬁnitionforthatsource(denotedvbestandv∗respectively).Inotherwords,wewouldlikeanevaluationfunction,whichratesthequalityofeachdeﬁnitionproducedwithrespecttoahand-writtendeﬁnitionforthesource(i.e.quality:vbest×v∗→[0,1]).Theproblemwiththisistwofold.Firstly,itisnotobvioushowtodeﬁnesuchasimilarityfunctionoverconjunctivequeriesandmanydiﬀerentpossibilitiesexist(seeMarkovandMarinchev,2000,foraparticularexample).Secondly,workingoutthebestdeﬁnitionbyhand,whiletakingintoaccountthelimitationsofthedomainmodelandthefactthattheavailablesourcesarenoisy,incomplete,possiblylessaccurate,andevenserialisedataindiﬀerentways,maybeextremelydiﬃcult,ifevenpossible.Soinordertoevaluateeachofthediscovereddeﬁnitions,weinsteadcountthenumberofcorrectlygeneratedattributesineachdeﬁnition.Anattributeissaidtobecorrectlygenerated,if:

Carman&Knoblock

•itisaninput,andthedeﬁnitioncorrectlyrestrictsthedomainofpossiblevaluesforthatattribute,or

•itisanoutput,andthedeﬁnitioncorrectlypredictsitsvalueforgiveninputtuples.Considerthefollowingdeﬁnitionthattakestwoinputvaluesandreturnsthediﬀerenceanditssquareroot(providingthediﬀerenceispositive):

source($A,$B,C,D):-sum(B,C,A),product(D,D,C),A≥B.

andimaginethattheinductionsystemmanagedtolearnonlythatthesourcereturnsthediﬀerencebetweentheinputandtheoutput,i.e.:

source($A,$B,C,):-sum(B,C,A).

WesaythattheﬁrstinputattributeAisnotcorrectlygeneratedasitisaninputandisnotconstrainedwithrespecttotheinputattributeB(theinequalityismissing).TheinputBisdeemedcorrectlygeneratedasitispresentinthesumrelation(onlyoneinputispenalisedforthemissinginequality).TheoutputCisdeemedcorrectlygeneratedwithrespecttotheinputs,andthemissingattributeDisn’tgeneratedatall.(Notethatiftheorderingofvariablesinthesumrelationhadbeendiﬀerent,saysum(A,B,C),thenCwouldhavebeengenerated,butnotcorrectlygenerated.)

Givenadeﬁnitionforcorrectlygeneratedattributes,onecandeﬁneexpressionsforprecisionandrecallovertheattributescontainedinasourcedeﬁnition.Wedeﬁneprecisiontobetheratioofcorrectlygeneratedattributestothetotalnumberofattributesgeneratedbyadeﬁnition,i.e.:

precision=

#ofcorrectlygeneratedattributestotal#ofgeneratedattributes

Wedeﬁnerecalltobetheratioofgeneratedattributestothetotalnumberofattributesthatwouldhavebeengeneratedbytheidealdeﬁnition,giventhesourcesavailable.(Insomecasesnosourcesareavailabletogeneratevaluesforanattributeinwhichcase,thatattributeisnotincludedinthecount.)

recall=

#ofcorrectlygeneratedattributes

total#ofattributesthatshouldhavebeengenerated

Notethatwedeﬁnedprecisionandrecallattheschemalevelintermsoftheattributesinvolvedinasourcedeﬁnition.Theycouldalsobedeﬁnedatthedatalevelintermsofthetuplesbeingreturnedbythesourceandthedeﬁnition.Indeed,theJaccardsimilarityusedtoscorecandidatedeﬁnitionsisacombinationofdata-levelprecisionandrecallvalues.Thereasonforchoosingschemalevelmetricsinourevaluationisthattheybetterreﬂectthesemanticcorrectnessofthelearntdeﬁnition,insofarastheyareindependentofthecompleteness(amountofoverlap)betweentheknownandtargetsources.

Returningtoourexampleabove,theprecisionforthesimpledeﬁnitionlearntwouldbe2/3andtherecallwouldbe2/4.Notethat,iftheproductrelationhadnotbeenavailableinourdomainmodel(inwhichcaseattributeDcouldneverhavebeengenerated),recallwouldhavebeenhigherat2/3.

LearningSemanticDefinitionsofInformationSourcesontheInternet

7.3Experiments

Thedeﬁnitionslearntbythesystemaredescribedbelow.Overallthesystemperformedverywellandwasabletolearntheintendeddeﬁnition(ignoringmissingjoinvariablesandsuperﬂuousliterals)in19outofthe25problems.7.3.1GeospatialSources

Theﬁrstsetofproblemsinvolvedninegeospatialdatasourcesprovidingavarietyoflocationbasedinformation.Thedeﬁnitionslearntbythesystemarelistedbelow.Theyarereportedintermsofthesourcepredicatesratherthanthedomainrelations(i.e.theunfoldings)becausethecorrespondingdeﬁnitionsaremuchshorter.Thismakesiteasiertounderstandhowwellthesearchalgorithmisperforming.

123

GetInfoByZip($zip0,cit1,sta2,_,tim4):-GetTimezone($sta2,tim4,_,_),GetCityState($zip0,cit1,sta2).GetInfoByState($sta0,cit1,zip2,_,tim4):-GetTimezone($sta0,tim4,_,_),GetCityState($zip2,cit1,sta0).GetDistanceBetweenZipCodes($zip0,$zip1,dis2):-GetCentroid($zip0,lat1,lon2),GetCentroid($zip1,lat4,lon5),

GetDistance($lat1,$lon2,$lat4,$lon5,dis10),ConvertKm2Mi($dis10,dis2).GetZipCodesWithin($_,$dis1,_,dis3):-<(dis3,dis1).

YahooGeocoder($str0,$zip1,cit2,sta3,_,lat5,lon6):-USGeocoder($str0,$zip1,cit2,sta3,lat5,lon6).GetCenter($zip0,lat1,lon2,cit3,sta4):-WeatherConditions($cit3,sta4,_,lat1,lon2,_,_,_,_,_,_,_,_,_,_,_),GetZipcode($cit3,$sta4,zip0).

Earthquakes($_,$_,$_,$_,lat4,lon5,_,dec7,_):-USGSEarthquakes(dec7,_,lat4,lon5).USGSElevation($lat0,$lon1,dis2):-ConvertFt2M($dis2,dis1),Altitude($lat0,$lon1,dis1).CountryInfo($cou0,cou1,cit2,_,_,cur5,_,_,_,_):-GetCountryName($cou0,cou1),GoCurrency(cur5,cou0,_),

WeatherConditions($cit2,_,cou1,_,_,_,_,_,_,_,_,_,_,_,_,_).

456

789

Theﬁrsttwosourcesprovideinformationaboutzipcodes,suchasthenameofthecity,thestateandthetimezone.Theydiﬀerintheirbindingconstraints,withtheﬁrsttakingazipcodeasinput,andthesecondtakingastate.Thesecondsourcereturnsmanyoutputtuplesperinputvalue,makingithardertolearnthedeﬁnition,eventhoughthetwosourcesprovidelogicallythesameinformation.Theinduceddeﬁnitionsarethebestpossiblegiventheknownsourcesavailable.(Noneofthemprovidedthemissingoutputattribute,atele-phonearea-code.)Thethirdsourcecalculatesthedistanceinmilesbetweentwozipcodes,(itisthesameassource4fromsection2).Thecorrectdeﬁnitionwaslearntforthissource,butforthenextsource,whichreturnedzipcodeswithinagivenradius,areasonabledeﬁni-tioncouldnotbelearntwithinthetimelimit.12Ignoringbindingconstraints,theintended

12.Therecallforthisproblemis1/4becausetheinputattributedis1isdeterminedtobetheonlycorrectly

generatedattribute(itisconstrainedwithrespecttotheoutputattributedis3),whileallfourattributesshouldhavebeengenerated(theoutputattributedis3isnotgeneratedbythe33

Carman&Knoblock

deﬁnitionwasthesameasthethird,butwitharestrictionthattheoutputdistancebelessthantheinputdistance.Thusitwouldhavebeenfareasierforeidostolearnadeﬁnitionforthefourthsourceintermsofthethird.Indeed,whenthenewdeﬁnitionforthethirdsourcewasaddedtothesetofknownsources,thesystemwasabletolearnthefollowing:

4’GetZipCodesWithin($zip0,$dis1,zip2,dis3):-GetDistanceBetweenZipCodes($zip0,$zip2,dis3),<(dis3,dis1).

Theabilityofthesystemtoimproveitslearningabilityovertimeasthesetofknownsourcesincreasesisakeybeneﬁtoftheapproach.

SourceﬁveisageocodingserviceprovidedbyYahoo.(Geocodingservicesmapad-dressestolatitudeandlongitudecoordinates.)eidoslearntthatthesamefunctionalitywasprovidedbyaservicecalledUSGeocoder.Sourcesixisasimpleserviceprovidingthelatitude/longitudecoordinatesandthecityandstateforagivenzipcode.Interestingly,thesystemlearntthatthesource’scoordinateswerebetterpredictedbyaweathercondi-tionsservice(discussedinsection7.3.3),thanbytheGetCentroidsourcefromthethirddeﬁnition.Notethatwhenthenewsourcedeﬁnitionisunfoldeditwillcontainextrane-ouspredicatesrelatedtoweatherinformation.13Theadditionalpredicatesdonotinterferewiththeusefulnessofthedeﬁnition,however,asaqueryreformulationalgorithmwillstillusethesourcetoanswerthesamequeriesregardless.(Thusprecisionandrecallscoresarenotaﬀected.)Apost-processingsteptoremoveextraneouspredicatesispossible,butwouldrequireadditionalinformationinthedomainmodel.14Theseventhsourceprovidedearthquakedatawithinaboundingboxwhichittookasinput.Inthiscase,thesystemdiscoveredthatthesourcewasindeedprovidingearthquakedata,(lat4andlon5aretheco-ordinatesoftheearthquakeanddec7isitsmagnitude).Itdidn’tmanage,however,toworkouthowtheinputcoordinatesrelatedtotheoutput.Thenextsourceprovidedelevationdatainfeet,whichwasfoundtobesuﬃcientlysimilartoknownaltitudedatainmetres.Finally,thesystemlearntadeﬁnitionforasourceprovidinginformationaboutcountriessuchasthecurrencyused,andthenameofthecapitalcity.Sinceknownsourceswerenotavailabletoprovidethisinformation,thesystemendeduplearningthatweatherreportswereavailableforthecapitalofeachcountry.Problem123456789

#Candidates2524888350401115176

#Invocations50689804111361117613148151621487717728784

Time(s)85914449253242831872559

log10(Score)-1.36-1.08-0.750.25-0.45-7.61-6.87-8.58-5.77

Precision4/44/43/31/16/65/53/33/34/4

Recall4/44/43/31/46/65/53/93/34/4

13.Theunfoldingisshownbelow.Theconditionspredicatecouldberemovedwithoutaﬀectingitsmeaning:

GetCenter($zip0,lat1,lon2,cit3,sta4):-municipality(cit3,sta4,zip2,tim3),country(,cou5,),

northAmerica(cou5),centroid(zip2,lat1,lon2),conditions(lat1,lon2,,,,,,,,,,,),timezone(tim3,,),municipality(cit3,sta4,zip0,).

14.Inparticular,universallyquantiﬁedknowledgewouldbeneededinthedomainmodel,e.g.:

∀lat,lon∃x1,...,x11s.t.conditions(lat,long,x1,...,x11)

LearningSemanticDefinitionsofInformationSourcesontheInternet

Thetableshowssomedetailsregardingthesearchperformedtolearneachofthedeﬁ-nitionslistedabove.Foreachproblem,itshowsthenumberofcandidatesgeneratedpriortothewinningdeﬁnition,alongwiththetimeandnumberofsourceinvocationsrequiredtolearnthedeﬁnition.(Thelasttwovaluesshouldbeinterpretedwithcautionastheyarehighlydependentonthedelayinaccessingsources,andonthecachingofdatainthesystem.)Thescoresshownintheﬁfthcolumnareanormalisedversionofthescoringfunctionusedtocomparethedeﬁnitionsduringsearch.(Normalisationinvolvedremovingthepenaltyappliedformissingoutputs.)Thescorescanbeverysmall,sothelogarithmofthevaluesisshown(hencethenegativevalues).15Thesescorescanbeinterpretedastheconﬁdencethesystemhasinthedeﬁnitionsproduced.Thecloserthevalueistozero,thebetterthedeﬁnition’sabilitytoproducethesametuplesasthesource.Weseethatthesystemwasfarmoreconﬁdentaboutthedeﬁnitionsonethroughﬁve,thanthelatterones.16Thelasttwocolumnsgivetheprecisionandrecallvalueforeachproblem.Theaverageprecisionfortheseproblemswas100%.(Notethatahighprecisionvalueistobeexpected,giventhattheinductionalgorithmreliesonﬁndingmatchingtuplesbetweenthesourceanddeﬁnition.)Theaveragerecallforthegeospatialproblemswasalsoveryhighat84%.

7.3.2FinancialSources

Twosourcesweretestedthatprovidedﬁnancialdata.Thedeﬁnitionsgeneratedbyeidosforthesesourcesareshownbelow.

10GetQuote($tic0,pri1,dat2,tim3,pri4,pri5,pri6,pri7,cou8,_,pri10,_,_,pri13,_,com15):-YahooFinance($tic0,pri1,dat2,tim3,pri4,pri5,pri6,pri7,cou8),

GetCompanyName($tic0,com15,_,_),Add($pri5,$pri13,pri10),Add($pri4,$pri10,pri1).11YahooExchangeRate($_,$cur1,pri2,dat3,_,pri5,pri6):-GetCurrentTime(_,_,dat3,_),GoCurrency(cur1,cou5,pri2),

GoCurrency(_,cou5,pri5),Add($pri2,$pri5,pri12),Add($pri2,$pri6,pri12).

Theﬁrstﬁnancialserviceprovidedstockquoteinformation,andthesystemlearntthatthesourcereturnedexactlythesameinformationasastockmarketserviceprovidedbyYahoo.Itwasalsoabletoworkoutthatthepreviousday’scloseplustoday’schangewasequaltothecurrentprice.Thesecondsourceprovidedtherateofexchangebetweenthecurrenciesgivenasinput.Inthiscase,thesystemdidnotfarewell.Itwasunabletolearntheintendedresult,whichinvolvedcalculatingtheexchangeratebytakingtheratioofthevaluesfortheﬁrstandsecondcurrency.Problem1011

#Candidates2844367

#Invocations1667116749

Time(s)387282

log10(Score)-8.13-9.84

Precision12/131/5

Recall12/121/4

Detailsregardingthesearchspacesforthetwoproblemsareshownabove.Theaverageprecisionandrecallfortheseproblemsweremuchlowerat56%and63%respectively,becausethesystemwasunabletolearntheintendeddeﬁnitioninthesecondproblem.

15.Thepositivevalueforproblem4resultsfromanapproximationerror.

16.Lowscoresandperfectprecisionandrecall(problems6,8and9)indicateverylittleoverlapbetween

thetargetandtheknownsources.Thefactthatthesystemlearnsthecorrectdeﬁnitioninsuchcasesistestimonytotherobustnessoftheapproach.

Carman&Knoblock

7.3.3WeatherSources

OntheInternet,therearetwotypesofweatherinformationservices,thosethatprovideforecastsforcomingdays,andthosethatprovidedetailsofcurrentweatherconditions.Intheexperiments,apairofsuchservicesprovidedbyWeather.comwereusedtolearndeﬁni-tionsforanumberofotherweathersources.Theﬁrstsetofdeﬁnitions,whichcorrespondtosourcesthatprovidecurrentweatherconditions,arelistedbelow:

12NOAAWeather($ica0,air1,_,_,sky4,tem5,hum6,dir7,spe8,_,pre10,tem11,_,_):-GetAirportInfo($ica0,_,air1,cit3,_,_),

WeatherForecast($cit3,_,_,_,_,_,_,_,_,_,_,_,sky4,dir7,_,_),

WeatherConditions($cit3,_,_,_,_,_,_,_,_,tem5,sky4,pre33,_,spe8,hum6,tem11),ConvertIn2mb($pre33,pre10).

13WunderGround($sta0,$cit1,tem2,_,_,pre5,pre6,sky7,dir8,spe9,spe10):-WeatherConditions($cit1,sta0,_,_,_,_,_,_,dat8,tem9,sky7,pre5,dir8,spe13,_,tem2),WeatherForecast($cit1,sta0,_,_,_,_,_,_,tem24,_,_,_,_,_,spe10,_),

ConvertIn2mb($pre5,pre6),<(tem9,tem24),ConvertTime($dat8,_,_,_,_),<(spe9,spe13).14WeatherBugLive($_,cit1,sta2,zip3,tem4,_,_,dir7,_,_):-WeatherConditions($cit1,sta2,_,_,_,_,_,_,_,tem4,_,_,dir7,_,_,_),GetZipcode($cit1,$sta2,zip3).

15WeatherFeed($cit0,$_,tem2,_,sky4,tem5,_,_,pre8,lat9,_,_):-WeatherConditions($cit0,_,_,lat9,_,_,_,_,_,_,sky4,pre8,dir12,_,_,tem5),WeatherForecast($cit0,_,_,_,_,_,_,_,_,tem2,_,_,_,dir12,_,_).

16WeatherByICAO($ica0,air1,cou2,lat3,lon4,_,dis6,_,sky8,_,_,_,_):-Altitude($lat3,$lon4,dis6),GetAirportInfo($ica0,_,air1,cit6,_,cou8),WeatherForecast($cit6,_,cou8,_,_,_,_,_,_,_,_,_,sky8,_,_,_),GetCountryName($cou2,cou8).

17WeatherByLatLon($_,$_,_,_,_,lat5,lon6,_,dis8,_,_,_,_,_,_):-Altitude($lat5,$lon6,dis8).

Intheﬁrstproblem,thesystemlearntthatsource12providedcurrentconditionsatairports,bycheckingtheweatherreportforthecitiesinwhicheachairportwaslocated.Thisparticularproblemdemonstratessomeoftheadvantagesoflearningdeﬁnitionsfornewsourcesdescribedinsection3.3.Oncethedeﬁnitionhasbeenlearnt,ifamediatorreceivesarequestforthecurrentconditionsatanairport,itcangenerateananswerforthatquerybyexecutingasinglecalltothenewlymodeledsource,(withoutneedingtoﬁndanearbycity).Thesystemperformedwellonthenextthreesources(13to15)learningdeﬁnitionswhichcovermostoftheattributesofeach.Onthelasttwoproblems,thesystemdidnotperformaswell.Inthecaseofsource16,thesystemspentmostofitstimelearningwhichattributesoftheairportwerebeingreturned(suchasitscountry,coordinates,elevation,etc.).Inthelastcase,thesystemwasonlyabletolearnthatthesourcewasreturningsomecoordinatesalongwiththeirelevation.Wenoteherethatdiﬀerentsourcesmayprovidedataatdiﬀerentlevelsofaccuracy.Thusthefactthatthesystemisunabletolearnadeﬁnitionforaparticularsourcecouldsimplymeanthatthedatabeingreturnedbythatsourcewasn’tsuﬃcientlyaccurateforthesystemtolabelitamatch.

Inadditiontocurrentweatherfeeds,thesystemwasrunontwoproblemsinvolvingweatherforecastfeeds.Itdidverywellontheﬁrstproblem,matchingallbaroneoftheattributes(thecountry)andﬁndingthattheorderofthehighandlowtemperatureswas

LearningSemanticDefinitionsofInformationSourcesontheInternet

inverted.Itdidwellalsoforthesecondproblem,learningadeﬁnitionforthesourcethatproducedmostoftheoutputattributes.

18YahooWeather($zip0,cit1,sta2,_,lat4,lon5,day6,dat7,tem8,tem9,sky10):-WeatherForecast($cit1,sta2,_,lat4,lon5,_,day6,dat7,tem9,tem8,_,_,sky10,_,_,_),GetCityState($zip0,cit1,sta2).

19WeatherBugForecast($_,cit1,sta2,_,day4,sky5,tem6,_):-WeatherForecast($cit1,sta2,_,_,_,tim5,day4,_,tem6,_,tim10,_,sky5,_,_,_),WeatherConditions($cit1,_,_,_,_,tim10,_,tim5,_,_,_,_,_,_,_,_).

Detailsregardingthenumberofcandidatesgeneratedinordertolearndeﬁnitionsforthediﬀerentweathersourcesareshownbelow.Theaverageprecisionofthedeﬁnitionsproducedwas91%,whiletheaveragerecallwas62%.Problem1213141516171819

#Candidates27719899819910245119116

#Invocations579426249975494676691387614857

Time(s)23360593029248410267591217

log10(Score)-2.92-6.35-13.37-6.48-29.69-26.71-5.74-12.56

Precision8/96/95/55/66/73/310/105/5

Recall8/116/105/85/106/93/1310/115/7

7.3.4HotelSources

DeﬁnitionswerealsolearntforsourcesprovidinghotelinformationfromYahoo,GoogleandtheUSFireAdministration.Thesedeﬁnitionsareshownbelow.

20USFireHotelsByCity($cit0,_,_,sta3,zip4,cou5,_):-HotelsByZip($zip4,_,_,cit0,sta3,cou5).

21USFireHotelsByZip($zip0,_,_,cit3,sta4,cou5,_):-HotelsByZip($zip0,_,_,cit3,sta4,cou5).

22YahooHotel($zip0,$_,hot2,str3,cit4,sta5,_,_,_,_,_):-HotelsByZip($zip0,hot2,str3,cit4,sta5,_).

23GoogleBaseHotels($zip0,_,cit2,sta3,_,_,lat6,lon7,_):-WeatherConditions($cit2,sta3,_,lat6,lon7,_,_,_,_,_,_,_,_,_,_,_),GetZipcode($cit2,$sta3,zip0).

Thesystemperformedwellonthreeoutofthefourproblems.Itwasunableinthetimeallocatedtodiscoveradeﬁnitionforthehotelattributes(name,street,latitudeandlongi-tude)returnedbytheGoogleWebService.Theaverageprecisionfortheseproblemswas90%whiletheaveragerecallwas60%.Problem20212223

#Candidates16164395

#Invocations448189431374931

Time(s)4852821161

log10(Score)-4.00-2.56-2.81-7.50Precision4/44/45/53/5Recall4/64/65/93/6

Carman&Knoblock

7.3.5CarsandTrafficSources

ThelastproblemsonwhichthesystemwastestedwereapairoftraﬃcrelatedWebServices.Theﬁrstservice,providedbyYahoo,reportedlivetraﬃcdata(suchasaccidentsandconstructionwork)withinagivenradiusoftheinputzipcode.Noknownsourceswereavailablewhichprovidedsuchinformation,sonotsurprisingly,thesystemwasunabletolearnadeﬁnitionforthetraﬃcrelatedattributesofthatsource.(Instead,thesystemdiscoveredarelationshipbetweentheinputzipcodeandtheoutputlongitudethatwasn’tcorrect,soprecisionforthisproblemwaszero.)

24YahooTraffic($zip0,$_,_,lat3,lon4,_,_,_):-GetCentroid($zip0,_,lon4),CountryCode($lat3,$lon4,_).25YahooAutos($zip0,$mak1,dat2,yea3,mod4,_,_,pri7,_):-GoogleBaseCars($zip0,$mak1,_,mod4,pri7,_,_,yea3),

ConvertTime($dat2,_,dat10,_,_),GetCurrentTime(_,_,dat10,_).

Thesecondprobleminvolvedaclassiﬁedused-carlistingfromYahoothattookazipcodeandcarmanufacturerasinput.eidoswasabletolearnagooddeﬁnitionforthatsource,takingadvantageofthefactthatsomeofthesamecars(deﬁnedbytheirmake,model,yearandprice)werealsolistedforsaleonGoogle’sclassiﬁedcarlisting.Problem2425

#Candidates

8155

#Invocations

29974405

Time(s)1065815

log10(Score)-11.21-5.29

Precision0/36/6

Recall0/46/6

Sincethesystemfailedontheﬁrstproblem(itfoundsomeincorrect/non-generalrelation-shipsbetweendiﬀerentattributes),butsucceededonthesecondproblemtoﬁndthebestpossibledeﬁnition,theaverageprecisionandrecallfortheseproblemswereboth50%.7.3.6OverallResults

Acrossthe25problems,eidosmanagedtogeneratedeﬁnitionswithhighaccuracy(averageprecisionwas88%)andalargenumberofattributes(averagerecallwas69%).Theseresultsarepromising,especiallyconsideringthatallproblemsinvolvedrealdatasourceswithinsomecasesverysmalloverlapbetweenthedataproducedbythetargetandthatprovidedbytheknownsources(asevidencedbylowlogarithmicscores).Inadditiontominimaloverlap,manysourcesprovidedincompletetuples(i.e.tuplescontainingmultiple“NULL”or“N/A”values)aswellaserroneousorinaccuratedata,makingtheproblemallthemorediﬃcult.ThehighaverageprecisionandrecallleadustobelievethattheJaccardmeasureisdoingagoodjobofdistinguishingcorrectfromincorrectdeﬁnitionsinthepresenceofdatasourcesthatarebothnoisy(inconsistent)andincomplete(missingtuplesandvalues).

Comparingthediﬀerentdomains,onecanseethatthesystemperformedbetteronproblemswithfewerinputandoutputattributes(suchasthegeospatialproblems),whichwastobeexpectedgiventhattheresultingsearchspaceismuchsmaller.

LearningSemanticDefinitionsofInformationSourcesontheInternet

7.4EmpiricalComparison

Havingdemonstratedtheeﬀectivenessofeidosinlearningdeﬁnitionsforrealinformationservices,wenowshowthatthesystemiscapableofhandlingthesameproblemsasawell-knowncomplexschemamatchingsystem.

TheiMAPsystem(Dhamanka,Lee,Doan,Halevy,&Domingos,2004)(discussedinsection8.4)isaschemamatcherthatcanlearncomplex(many-to-one)mappingsbetweentheconceptsofasourceandatargetschema.Itusesasetofspecialpurposesearcherstolearndiﬀerenttypesofmappings.Theeidossystem,ontheotherhand,usesagenericsearchalgorithmtosolveacomparableproblem.Sincethetwosystemscanbemadetoperformasimilartask,weshowthateidosiscapableofrunningononeoftheproblemdomainsusedintheevaluationofiMAP.Wechosetheparticulardomainofonlinecricketdatabasesbecauseitistheonlyoneusedintheevaluationthatinvolvedaligningdatafromtwoindependentdatasources.(Allotherproblemsinvolvedgeneratingsyntheticdatabysplittingasingledatabaseintoasourceandtargetschema,whichwouldnothavebeenasinterestingforeidos.)

Playerstatisticsfromtwoonlinecricketdatabases(cricketbase.comandcricinfo.com)wereusedintheexperiments.Sinceneitherofthesourcesprovidedprogrammaticaccesstotheirdata,thestatisticsdatawasextractedfromHTMLpagesandinsertedintoarelationaldatabase.Theextractionprocessinvolvedﬂatteningthedataintoarelationalmodelandasmallamountofdatacleaning.(TheresultingtablesaresimilarbutnotnecessarilyexactlythesameasthoseusedintheiMAPexperiments.)Thedatafromthetwowebsiteswasusedtocreatethreedatasourcesrepresentingeachwebsite.Thethreesourcesrepresentingcricinfo.comwerethenusedtolearndeﬁnitionsforthethreesourcesrepresentingcricketbase.com.Otherknownsourceswereavailabletothesystem,includingfunctionalityforsplittingapartcomma-separatedlists,addingandmultiplyingnumbers,andsoon.Thedeﬁnitionslearnttodescribethecricketbaseservicesareshownbelow:

1CricbasePlayers($cou0,nam1,_,dat3,_,unk5,unk6):-CricinfoPlayer($nam1,dat3,_,_,_,lis5,nam6,unk5,unk6),contains($lis5,cou0),CricinfoTest($nam6,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_).

2CricbaseTest($_,nam1,cou2,_,_,_,cou6,cou7,dec8,cou9,_,_,_,cou13,cou14,dec15,_,dec17,cou18,_,_):-CricinfoTest($nam1,_,_,cou2,cou6,cou18,cou7,_,dec8,dec15,_,cou9,_,_,_,cou14,cou13,_,_,_,_,dec17).

3CricbaseODI($_,nam1,cou2,_,_,_,cou6,cou7,dec8,cou9,cou10,_,cou12,cou13,_,dec15,dec16,dec17,cou18,cou19,_):-CricinfoODI($nam1,_,_,cou2,cou6,_,cou10,_,dec8,_,cou7,cou18,cou19,cou9,_,_,cou13,cou12,dec15,_,dec16,dec17).

Theﬁrstsourceprovidedplayerproﬁlesbycountry.Thesecondandthirdsourcesprovideddetailedplayerstatisticsfortwodiﬀerenttypesofcricket(TestandOne-Day-Internationalrespectively).Thesystemeasilyfoundthebestdeﬁnitionfortheﬁrstsource.Thedeﬁnitioninvolvedlookingfortheplayer’scountryinalistofteamsthatheplayedfor.eidosdidnotperformquiteaswellonthesecondandthirdproblems.Thereweretworeasonsforthis.Firstly,thearityofthesesourceswasmuchhigherwithmanyinstancesofthesamesemantictype(countanddecimal),makingthespaceofpossiblealignmentsmuchlarger.

Carman&Knoblock

(Becauseofthelargesearchspace,alongertimeoutof40minuteswasused.)Secondly,ahighfrequencyofnullvalues(theconstant“N/A”)inthedataforsomeoftheﬁeldsconfusedthealgorithm,andmadeitharderforittodiscoveroverlappingtupleswithallofthedesiredattributes.Problem123

#Candidates19911623114

#Invocations376215174299

Time(s)43213192127

log10(Score)-3.95-4.70-6.28

Precision5/58/118/14

Recall5/58/168/16

Detailsofthesearchperformedtolearnthedeﬁnitionsareshownabove.Theaverageprecisionfortheseproblemswas77%whiletheaveragerecallwaslowerat66%.ThesevaluesarecomparabletothequalityofthematchingsreportedforiMAP.17Theseresultsareverygood,consideringthateidossearchesinthespaceofmany-to-manycorrespondences,(tryingtodeﬁnethesetoftargetattributescontemporaneously),whileiMAPsearchesthespacesofone-to-oneandmany-to-onecorrespondences.Moreover,eidosﬁrstinvokesthetargetsourcetogeneraterepresentativedata(atasknotperformedbyiMAP)andthenperformsagenericsearchforreasonabledeﬁnitionswithoutrelyingonspecialisedsearchalgorithmsfordiﬀerenttypesofattributes(asisdoneiniMAP).

8.RelatedWork

InthissectionwedescribehowtheworkinthispaperrelatestoresearchperformedbytheMachineLearning,DatabaseandtheSemanticWebcommunities.Beforedoingthat,wedescribesomeearlyworkperformedbytheArtiﬁcialIntelligencecommunity.WealsodiscusshowouralgorithmdiﬀersfromstandardILPtechniques,andinparticularwhyadirectapplicationofsuchtechniqueswasnotpossibleforourproblem.8.1AnEarlyApproach

TheﬁrstworkconcernedwithlearningmodelsfordescribingoperationsavailableontheInternetwasperformed(inthepre-XMLera)onaproblemcalledcategorytranslation(Perkowitz&Etzioni,1995;Perkowitz,Doorenbos,Etzioni,&Weld,1997).Thisproblemconsistedofanincompleteinternalworldmodelandanexternalinformationsourcewiththegoalbeingtocharacterizetheinformationsourceintermsoftheworldmodel.TheworldmodelconsistedofasetofobjectsO,whereeachobjecto∈Obelongedtoacertaincategory(e.g.people)andwasassociatedwithasetofattributes󰀛a1(o),...,an(o)󰀜,madeupofstringsandotherobjects.Asimplerelationalinterpretationofthisworldmodelwouldconsidereachcategorytobearelation,andeachobjecttobeatuple.Theinformationsource,meanwhile,wasanoperationthattookinasinglevalueasinputandreturnedasingletupleasoutput.Thecategorytranslationproblemcanbeviewedasasimpliﬁcationofthesourcedeﬁnitioninductionproblem,whereby:

17.Theactualvaluesforprecisionandrecallinthecricketdomainarenotquoted,butanaccuracyrange

of68-92%forsimplematches(one-to-onecorrespondencesbetweensourceandtargetﬁelds)and50-86%forcomplexmatches(many-to-onecorrespondences)acrosssyntheticandrealproblemswasgiven.

LearningSemanticDefinitionsofInformationSourcesontheInternet

•Theextensionsoftheglobalrelationsareexplicit.(Thereisonesourceperglobalrelation,anditdoesn’thavebindingconstraints,i.e.R=S.)

•Theinformationprovidedbythesourcesdoesnotchangeovertime.

•Thenewsourcetakesasinglevalueasinputandreturnsasingletupleasoutput.Inordertoﬁndsolutionstoinstancesofthecategorytranslationproblem,theauthorsem-ployedavariantofrelationalpath-ﬁnding(Richards&Mooney,1992),whichisanextensiononthefoilalgorithm,tolearnmodelsoftheexternalsource.Thetechniquedescribedinthispaperforsolvinginstancesofthesourceinductionproblemissimilarinthatittooisbasedonafoil-likeinductivesearchalgorithm.8.2DirectApplicationofILPtechniques

ResearchersbecameinterestedintheﬁeldofInductiveLogicProgrammingintheearlynineties,resultinginanumberofdiﬀerentILPsystemsbeingdevelopedincludingfoil(Cameron-Jones&Quinlan,1994),progol(Muggleton,1995)andaleph18.Ideally,onewouldliketoapplysuch“oﬀ-the-shelf”ILPsystemstothesourcedeﬁnitioninductionproblem.Anumberofissues,however,limitthedirectapplicabilityofthesesystems.Theissuescanbesummarisedasfollows:

••••

Extensionsoftheglobalrelationsarevirtual.

Sourcesmaybeincompletewithrespecttotheirdeﬁnitions.Explicitnegativeexamplesofthetargetarenotavailable.Sourcesmayserialiseconstantsindiﬀerentways.

TheﬁrstissuehastodowiththefactthatallILPsystemsassumethatthereisanextensionaldeﬁnitionofthetargetpredicateandextensional(orinsomecasesintentional)deﬁnitionsofthe(source)predicatesthatwillbeusedinthedeﬁnitionforthetarget.Inotherwords,theyassumethattablesalreadyexistinsomerelationaldatabasetorepresentboththenewsourceandtheknownsources.Inourcase,weneedtogeneratesuchtablesbyﬁrstinvokingtheserviceswithrelevantinputs.Onecouldenvisageinvokingeachofthesourceswitheverypossibleinputandusingtheresultingtablestoperforminduction.Suchadirectapproachwouldnotbefeasiblefortworeasons.Firstly,acompletesetofpossibleinputvaluesmaynotbeknowntothesystem.Secondly,evenifitispossibletogenerateacompletesetofviableinputstoaservice,itmaynotbepracticaltoquerythesourcewithsuchalargesetoftuples.Considersource4fromsection2,whichcalculatesthedistanceinmilesbetweentwozipcodes.Giventhatthereareover40,000zipcodesintheUS,generatinganextensionalrepresentationofthissourcewouldrequireperformingmorethanabillioninvocations!Performingsuchalargenumberofinvocationsdoesnotmakesensewhenasmallnumberofexampleinvocationswouldsuﬃceforcharacterisingthefunctionalityofthesource.Inthispaperwehavedevelopedaneﬃcientalgorithmthatonlyqueriesthesourcesasneededinordertoevaluateindividualcandidatedeﬁnitions.

Thesecondissueregardingtheincompletenessofthesourcescausesaproblemwhenacandidateistobeevaluated.Sincethesetoftuplesreturnedbyeachknownsourcemayonly

18.SeetheAlephManualbyAshwinSrinivasan,whichisavailableat:

http://web.comlab.ox.ac.uk/oucl/research/areas/machlearn/Aleph/aleph.html

Carman&Knoblock

beasubsetofthoseimpliedbyitsowndeﬁnition,sotoowillbethesetoftuplesreturnedbythecandidatehypothesiswhenexecutedagainstthosesources.Thismeansthatwhenthesystemtriestoevaluateahypothesisbycomparingthosetupleswiththeoutputofthenewsource,itcannotbesurethatatuplewhichisproducedbythenewsourcebutnotbythehypothesisisinfactnotlogicallyimpliedbyit.Thisfactistakenintoaccountinourevaluationfunctionforscoringcandidatedeﬁnitions,discussedinsection5.

Thethirdissueregardingthelackofexplicitnegativeexamplesforthetargetpredicatealsoaﬀectstheevaluationofcandidatehypotheses.Theclassicapproachtodealingwiththisproblemistoassumeaclosedworld,inwhichalltuples(overtheheadrelation)whicharenotexplicitlydeclaredtobepositivemustbenegative.Sincethenewsourcemayinfactbeincompletewithrespecttothebestpossibledeﬁnitionforit,thisassumptiondoesnotnecessarilyhold.Inotherwords,justbecauseaparticulartupleisproducedwhenthecandidatedeﬁnitionisexecutedandthatsametupleisnotreturnedbythenewsourcedoesnotnecessarilymeanthatthecandidatedeﬁnitionisincorrect.

Thefourthissuehastodowiththefactthatthedataprovidedbydiﬀerentsourcesmayneedtobereconciled,inthesensethatdiﬀerentserialisationsof(stringsrepresenting)thesamevalue(suchas“Monday”and“Mon”forinstance)mustberecognized.SinceILPsystemshavebeendesignedtooperateoverasingledatabasecontainingmultipletables,theissueofheterogeneityinthedataisnothandledbycurrentsystems.Insection5.4wediscussedhowthisheterogeneityisresolvedinoursystem.8.3MachineLearningApproaches

SincetheadventofservicesontheInternet,researchershavebeeninvestigatingwaystomodelthemautomatically.Primarily,interesthascenteredonusingmachinelearningtechniquestoclassifytheinputandoutputtypesofaservice,soastofacilitateservicediscovery.Heß&KushmerickproposedusingaSupportVectorMachinetoclassifytheinputandoutputattributesintodiﬀerentsemantictypesbasedonmetadataininterfacedescriptions(Heß&Kushmerick,2003,2004).Theirnotionofsemantictypes(suchaszipcode)asopposedtosyntactictypes(likeinteger)wentsomewaytowarddeﬁningthefunctionalitythatasourceprovides.Recently,otherresearchers(Lermanetal.,2006)proposedtheuseoflogisticregressionforassigningsemantictypestoinputparametersbasedonmetadata,andapatternlanguageforassigningsemantictypestotheoutputparametersbasedonthedatathesourceproduces.Thisworkonclassifyinginputandoutputattributesofaservicetosemantictypesformsaprerequisitefortheworkinthisarticle.Forthepurposesofthispaper,wehaveassumedthatthisproblemhasbeensolved.

Inadditiontoclassifyingtheinput/outputattributesofservices,Heß&Kushmerickinvestigatedtheideaofclassifyingtheservicesthemselvesintodiﬀerentservicetypes.Moreprecisely,theyusedthesameclassiﬁcationtechniquestoassignserviceinterfacestodiﬀer-entsemanticdomains(suchasweatherandﬂights)andtheoperationsthateachinterfaceprovidestodiﬀerentclassesofoperation(suchasweatherForecastandﬂightStatus).Theresultingsourcedescription(hypothesis)languageislimitedtoselect-projectqueries,whicharenotsuﬃcientlyexpressivetodescribemanyofthesourcesavailableontheInternet.Accordingtothatapproach,sinceeveryoperationmustbecharacterizedbyaparticu-laroperationclass,operationsthatprovideoverlapping(non-identical)functionalitywould

LearningSemanticDefinitionsofInformationSourcesontheInternet

needtobeassigneddiﬀerentclassesaswouldoperationswhichprovidecomposedfunction-ality(suchas,forexample,anoperationthatprovidesbothweatherandﬂightdata).Theneedforanexhaustivesetofoperationclasses(andaccompanyingtrainingdata)isamajorlimitationofthatapproach,notsharedbytheworkdescribedinthispaper,whichreliesonamoreexpressivelanguagefordescribingserviceoperations.

Onewaytoeliminatetheneedforapredeﬁnedsetofoperationtypesistouseunsuper-visedclusteringtechniquestogeneratethe(operation)classesautomaticallyfromexamples(WSDLdocuments).ThisideawasimplementedinasystemcalledWoogle(Dongetal.,2004).Thesystemclusteredserviceinterfacestogetherusingasimilarityscorebasedontheco-occurrenceofmetadatalabels.Itthentookadvantageoftheclustersproducedtoim-provekeyword-basedsearchforWebServices.Anadvantageofthisunsupervisedapproachisthatnolabeledtrainingdataisrequired,whichcanbetime-consumingtogenerate.Suchclusteringapproaches,however,whileusefulforservicediscovery,suﬀerthesamelimitationsasthepreviousapproachwhenitcomestoexpressiveness.8.4DatabaseApproaches

Thedatabasecommunityhaslongbeeninterestedintheproblemofintegratingdatafromdisparatesources.Speciﬁcally,intheareasofdatawarehousing(Widom,1995)andin-formationintegration(Wiederhold,1996),researchersareinterestedinresolvingsemanticheterogeneitywhichexistsbetweendiﬀerentdatabasessothatthedatacanbecombinedoraccessedviaasingleinterface.Theschemamappingproblemistheproblemofdeterminingamappingbetweentherelationscontainedinasourceschemaandaparticularrelationinatargetschema.Amappingdeﬁnesatransformationwhichcanbeusedtopopulatethetargetrelationwithdatafromthesourceschema.Mappingsmaybearbitrarilycomplexprocedures,butingeneraltheywillbedeclarativequeriesinSQLorDatalog.Thecomplex-ityofthesequeriesmakestheschemamappingproblemfarmorediﬃcultthanthehighlyinvestigatedschemamatchingproblem(Rahm&Bernstein,2001),whichinvolvesﬁnding1-to-1correspondencesbetweenﬁeldsofasourceandtargetschema.

Thesourcedeﬁnitioninductionproblemcanbeviewedasatypeofschemamappingproblem,inwhichtheknownsourcesdeﬁnethesourceschemaandtheunknownsourcespeciﬁesthetargetrelation.Inordertosolveaschemamappingproblem,onetypicallytakesadvantageofallavailableauxiliaryinformation(includingsourceandtargetdatainstances,labelsfromtherespectiveschemas,andsoon).Suchproblemsaregenerallysimpler,however,becausethedata(theextensionsoftherelations)inthesourceandtargetschemaareusuallyexplicitlyavailable.Insourceinduction,thatdataishiddenbehindaserviceinterface,whichhasbindingconstraints,andthedataitselfcanbeextremelylargeoreven(inthecaseofsourcesprovidingmathematicalfunctions)inﬁnite.Thusmakingtheproblemconsiderablymorediﬃcult.

TheschemaintegrationsystemCLIO(Yan,Miller,Haas,&Fagin,2001)helpsusersbuildSQLqueriesthatmapdatafromasourcetoatargetschema.InCLIO,foreignkeysandinstancedataareusedtogenerateintegrationrulessemi-automatically.SinceCLIOreliesheavilyonuserinvolvement,itdoesnotmakesensetocompareitdirectlywiththeautomatedsystemdevelopedinthispaper.

Carman&Knoblock

Anothercloselyrelatedproblemisthatofcomplexschemamatching,thegoalofwhichistodiscovercomplex(many-to-one)mappingsbetweentworelationaltablesorXMLschemas.Thisproblemisfarmorecomplicatedthanbasic(one-to-one)schemamatchingbecause:

•ThespaceofpossiblecorrespondencesbetweentherelationsisnolongertheCartesianproductofthesourceandtargetrelations,butthepowersetofthesourcerelationtimesthetargetrelation.

•Many-to-onemappingsrequireamappingfunction,whichcanbesimplelikeconcate-nate(x,y,z),oranarbitrarilycomplexformulasuchasz=x2+y.

TheiMAPsystem(Dhamankaetal.,2004)triestolearnsuchmany-to-onemappingsbe-tweentheconceptsofasetofsourcerelationsandatargetrelation.Itusesasetofspecialpurposesearcherstolearndiﬀerenttypesofmappings(suchasmathematicalexpressions,unitconversionsandtime/datemanipulations).Itthenusesameta-heuristictocontrolthesearchbeingperformedbythediﬀerentspecialpurposesearchers.Ifoneviewsboththesourceschemaandthefunctionsavailableforuseinthemappings(suchasconcate-nate(x,y,z),add(x,y,z),etc.)asthesetofknownsourcesinthesourcedeﬁnitioninductionproblem,thenthecomplexschemamatchingandsourceinductionproblemsaresomewhatsimilar.Themaindiﬀerencesbetweentheproblemsare:

•Thedataassociatedwiththesourceschemaisexplicit(andstatic)incomplexschemamatching,whileitishidden(anddynamic)insourceinduction.

•Ingeneral,thesetofknownsourcesinasourceinductionproblemwillbemuchlarger(andthedatatheyprovidemaybelessconsistent),thanthesetofmappingfunctionsandsourcerelationsinacomplexschemamatchingproblem.

Inthispaperwedevelopageneralframeworkforhandlingthesourceinductionproblem.SinceiMAPprovidesfunctionalitywhichissimilartothatofoursystem,weperformasimpleempiricalcomparisoninsection7.4.8.5SemanticWebApproach

ThestatedgoaloftheSemanticWeb(Berners-Lee,Hendler,&Lassila,2001)istoenablemachineunderstandingofWebresources.Thisisdonebyannotatingthoseresourceswithsemanticallymeaningfulmetadata.ThustheworkdescribedinthispaperisverymuchinlinewiththeSemanticWeb,insofarasweareattemptingtodiscoversemanticallymeaningfuldeﬁnitionsforonlineinformationsources.Defactostandardsforannotatingserviceswithsemanticmarkuphavebeenaroundforanumberofyears.Thesestandardsprovideserviceownerswithametadatalanguageforaddingdeclarativestatementstoserviceinterfacedescriptionsinanattempttodescribethesemanticsofeachserviceintermsofthefunctionality(e.g.abookpurchaseoperation)ordata(e.g.aweatherforecast)thatitprovides.Workontheselanguagesisrelatedtothisarticlefromtwoperspectives:

•Itcanbeviewedasanalternativeapproachtogainingknowledgeastothesemanticsofanewlydiscoveredsource(providingithassemanticmetadataassociatedwithit).•SemanticWebServiceannotationlanguagescanbeseenasatargetlanguageforthesemanticdescriptionslearntinthispaper.

LearningSemanticDefinitionsofInformationSourcesontheInternet

IfaWebServiceisalreadysemanticallyannotated,heterogeneitymaystillexistbetweentheontologyusedbytheserviceproviderandthatusedbytheconsumer,inwhichcasethelearningcapabilitiesdescribedinthispapermayberequiredtoreconcilethosediﬀerences.Moreimportantly,weareinterestedinthevastnumberofsourcesforwhichsemanticmarkupiscurrentlyunavailable.TheworkinthisarticlecomplementsthatoftheSemanticWebcommunitybyprovidingawayofautomaticallyannotatingsourceswithsemanticinformation;therebyrelievingserviceprovidersoftheburdenofmanuallyannotatingtheirservices.Oncelearnt,DatalogsourcedeﬁnitionscanbeconvertedtoDescriptionLogic-basedrepresentationssuchasisusedinOWL-S(Martin,Paolucci,McIlraith,Burstein,McDermott,McGuinness,Parsia,Payne,Sabou,Solanki,Srinivasan,&Sycara,2004)andWSMO(Roman,Keller,Lausen,deBruijn,Lara,Stollberg,Polleres,Feier,Bussler,&Fensel,2005).ThereasonweuseDataloginthispaper(ratherthanDescriptionLogics)isthatmostmediator-basedintegrationsystemsrelyonitasarepresentationlanguage.

9.Discussion

Inthispaperwehavepresentedacompletelyautomaticapproachtolearningdeﬁnitionsforonlineservices.Ourapproachexploitsthedeﬁnitionofsourcesthathaveeitherbeengiventothesystemorlearnedpreviously.Theresultingframeworkisasigniﬁcantadvanceoverpriorapproachesthathavefocusedonlearningonlytheinputsandoutputsortheclassofaservice.Wehavedemonstratedempiricallytheviabilityoftheapproach.

Thekeycontributionofthisarticleisaprocedureforlearningsemanticdeﬁnitionsforonlineinformationservicesthatis:

•Fullyautomated:Deﬁnitionsarelearntinacompletelyautomatedmannerwithouttheneedforanyuserintervention.

•Moreexpressive:Thequerylanguagefordeﬁningsourcesisthatofconjunctivequeries,whichisfarmoreexpressivethanpreviousattribute-valueapproaches.

•Suﬃcientlyrobust:Theprocedureisabletolearndeﬁnitionsinthepresenceofnoisyandincompletedata,andthusissuﬃcientlyrobusttohandlerealdatasources.

•Data-accesseﬃcient:Theproceduresamplesdatafromlivesources,invokingthemsparinglyandonlyasrequired,makingithighlyeﬃcientintermsofsourceaccesses.•Evolving:Theprocedure’sabilitytolearndeﬁnitionsimprovesovertimeaseachnewdeﬁnitionislearntandaddedtothesetofknownsources.9.1ApplicationScenarios

Thereareanumberofdiﬀerentapplicationscenariosforasystemthatiscapableoflearningdeﬁnitionsforonlinesources.Theygenerallyinvolveprovidingsemanticdeﬁnitionstodataintegrationsystems,whichthenexploitandintegratetheavailablesources.

Themostobviousapplicationforourworkwouldbeasystem(depictedontheleftsideofFigure1)thatcrawlstheWeb,searchingforinformationsources.Uponﬁndingasource,thesystemwoulduseaclassiﬁertoassignsemantictypestoit,followedbytheinductivelearnertogenerateadeﬁnitionforit.ThedeﬁnitioncouldthenbeusedtoannotatethesourcefortheSemanticWeb,orbyamediatorforansweringqueries.Importantly,thisentireprocesscouldrunwithminimaluserinvolvement.

Carman&Knoblock

Figure1:Architecturediagramsforthreediﬀerentapplicationscenarios.

Amorechallengingapplicationscenario(showninthecenterofFigure1)wouldinvolvereal-timeservicediscovery.Considerthecasewhereamediatorisunabletoansweraparticularquerybecausethedesiredinformationliesoutofscopeofthesourcesavailable.Asearchisthenperformedbasedonthe“missingconjuncts”(relationnamesandconstants)fromthequeryusingaspecialisedWebServicesearchengine,suchasWoogle(Dongetal.,2004).Theservicesreturnedwouldbeannotatedwithsemantictypesand,ifpossible,semanticdeﬁnitions.Afterthedeﬁnitionsareprovidedtothemediator,itwouldcompletethequeryprocessingandreturnananswertotheuser.Thisscenariomayseemalittlefar-fetcheduntiloneconsidersaspeciﬁcexample:imagineauserinteractingwithageospatialbrowser(anonlineatlas).Iftheuserturnsonaparticularinformationlayer,suchasskiresorts,butnosourceisavailableforthecurrentﬁeldofview(of,forinstance,Italy),thennoresultswouldbedisplayed.Inthebackgroundasearchcouldbeperformedandanewsourcediscovered,whichprovidesskiresortsalloverEurope.Therelevantdatacouldthenbedisplayed,withtheuserunawarethatasearchhasbeenperformed.

Perhapsthemostlikelyapplicationscenario(totherightofFigure1)forasourceinductionsystemwouldbeamixedinitiativeone.Inthiscaseahumanwouldannotatethediﬀerentoperationsofaserviceinterfacewithsemanticdeﬁnitions.Atthesametime,thesystemwouldattempttoinducedeﬁnitionsfortheremainingoperations,andprompttheuserwithsuggestionsforthem.Inthisscenariotheclassiﬁermaynotbeneeded,sinceattributesofthesamenameinthediﬀerentoperationswouldlikelyhavethesamesemantictype.Moreover,sincethedeﬁnitionslearntbythesystemmayinsomecasescontainerroneousorsuperﬂuouspredicates,theusercouldalsobeinvolvedinaprocessofcheckingandimprovingthedeﬁnitionsdiscovered.

LearningSemanticDefinitionsofInformationSourcesontheInternet

9.2OpportunitiesforFurtherResearch

Anumberoffuturedirectionsforthisworkwillallowthesetechniquestobeappliedmorebroadly.Wenowdiscusstwosuchdirections,improvingthesearchalgorithmandextendingthequerylanguage.

Asthenumberofknownsourcesgrows,sotoowillthesearchspace,anditwillbenec-essarytodevelopadditionalheuristicstobetterdirectthesearchtowardthebestdeﬁnition.ManyheuristictechniqueshavebeendevelopedintheILPcommunityandsomemaybeapplicabletothesourceinductionproblem.Morepressingperhapsistheneedtodeveloparobustterminationconditionforhaltingthesearchoncea“suﬃcientlygood”deﬁnitionhasbeendiscovered.Asthenumberofavailablesourcesincreases,thesimpletimeoutusedintheexperimentswillbeineﬀectiveascertain(morecomplicated)deﬁnitionswillnecessarilytakelongertolearnthanothers.

Anotherwaytoincreasetheapplicabilityofthisworkistoextendthequerylanguagesothatitbetterdescribesthesourcesavailable.Oftenonlinesourcesdonotreturnacompletesetofresultsbutrathercutoﬀthelistatsomemaximumcardinality.ForexampletheYahooHotelsourcedescribedinsection7.3.4returnsamaximumof20hotelsnearagivenlocation,andordersthemaccordingtodistance.Inthiscase,recognisingthespeciﬁcorderingonthetuplesproducedwouldbeveryusefultoamediator.Asecondusefulextensiontothequerylanguagewouldbetheabilitytodescribesourcesusingtheproceduralconstructif-then-else.Thisconstructisneededtodescribethebehaviourofsomesourcesoncertaininputs.Forexample,considertheYahooGeocoderfromsection7.3.1,whichtakesasinputatuplecontainingastreetname,number,andzipcode.Ifthegeocoderisunabletolocatethecorrespondingaddressinitsdatabase(becauseitdoesn’texist),insteadofreturningnotuples,itreturnsthecentroidofthezipcode.Describingsuchbehaviorisonlypossibleusingproceduralconstructs.

Acknowledgments

ThisresearchisbaseduponworksupportedinpartbytheDefenseAdvancedResearchProjectsAgency(DARPA),throughtheDepartmentoftheInterior,NBC,AcquisitionSer-vicesDivision,underContractNo.NBCHD030010.TheU.S.GovernmentisauthorizedtoreproduceanddistributereportsforGovernmentalpurposesnotwithstandinganycopy-rightannotationthereon.Theviewsandconclusionscontainedhereinarethoseoftheauthorsandshouldnotbeinterpretedasnecessarilyrepresentingtheoﬃcialpoliciesorendorsements,eitherexpressedorimplied,ofanyoftheaboveorganizationsoranypersonconnectedwiththem.

References

Afrati,F.N.,Li,C.,&Mitra,P.(2004).Oncontainmentofconjunctivequeriesusingarith-meticcomparisions.In9thInternationalConferenceonExtendingDatabaseTechnol-ogy(EDBT2004)Heraklion-Crete,Greece.

Carman&Knoblock

Arens,Y.,Knoblock,C.A.,&Shen,W.-M.(1996).Queryreformulationfordynamic

informationintegration.JournalofIntelligentInformationSystems-SpecialIssueonIntelligentInformationIntegration,6(2/3),99–130.Berners-Lee,T.,Hendler,J.,&Lassila,O.(2001).Thesemanticweb.ScientiﬁcAmerican,

284(5),34–43.Bilenko,M.,Mooney,R.J.,Cohen,W.W.,Ravikumar,P.,&Fienberg,S.E.(2003).

Adaptivenamematchingininformationintegration..IEEEIntelligentSystems,18(5),16–23.Cameron-Jones,R.M.,&Quinlan,J.R.(1994).Eﬃcienttop-downinductionoflogic

programs.SIGARTBulletin,5(1),33–42.Carman,M.J.,&Knoblock,C.A.(2007).Learningsemanticdescriptionsofwebinfor-mationsources.InProceedingsoftheTwentiethInternationalJointConferenceonArtiﬁcialIntelligence(IJCAI-07)Hyderabad,India.Chandra,A.K.,&Merlin,P.M.(1977).Optimalimplementationofconjunctivequeries

inrelationaldatabases.InProceedingsofthe9thACMSymposiumonTheoryofComputing(STOC),pp.77–90Boulder,Colorado.Dhamanka,R.,Lee,Y.,Doan,A.,Halevy,A.,&Domingos,P.(2004).imap:Discovering

complexsemanticmatchesbetweendatabaseschemas.InSIGMOD’04:Proceedingsofthe2004ACMSIGMODInternationalConferenceonManagementofData.Dong,X.,Halevy,A.Y.,Madhavan,J.,Nemes,E.,&Zhang,J.(2004).Simlaritysearch

forwebservices.InProceedingsofVLDB.Duschka,O.M.(1997).QueryPlanningandOptimizationinInformationIntegration.Ph.D.

thesis,DepartmentofComputerScience,StanfordUniversity.Garcia-Molina,H.,Hammer,J.,Ireland,K.,Papakonstantinou,Y.,Ullman,J.,&Widom,

J.(1995).Integratingandaccessingheterogeneousinformationsourcesintsimmis.InProceedingsoftheAAAISymposiumonInformationGathering,pp.61-64.Heß,A.,&Kushmerick,N.(2003).Learningtoattachsemanticmetadatatowebservices.

In2ndInternationalSemanticWebConference(ISWC).Heß,A.,&Kushmerick,N.(2004).Iterativeensembleclassiﬁcationforrelationaldata:

Acasestudyofsemanticwebservices.In15thEuropeanConferenceonMachineLearning(ECML2004)Pisa,Italy.Springer.Knoblock,C.A.,Minton,S.,Ambite,J.L.,Ashish,N.,Muslea,I.,Philpot,A.,&Tejada,

S.(2001).Theariadneapproachtoweb-basedinformationintegration.InternationalJournalofCooperativeInformationSystems,10(1-2),145–169.Lerman,K.,Plangprasopchok,A.,&Knoblock,C.A.(2006).Automaticallylabelingdata

usedbywebservices.InProceedingsofthe21stNationalConferenceonArtiﬁcialIntelligence(AAAI).

LearningSemanticDefinitionsofInformationSourcesontheInternet

Levy,A.Y.(2000).Logic-basedtechniquesindataintegration.InMinker,J.(Ed.),Logic-BasedArtiﬁcialIntelligence.KluwerPublishers.Levy,A.Y.,Mendelzon,A.O.,Sagiv,Y.,&Srivastava,D.(1995).Answeringqueriesusing

views.InProceedingsofthe14thACMSIGACT-SIGMOD-SIGARTSymposiumonPrinciplesofDatabaseSystems,pp.95–104SanJose,Calif.Markov,Z.,&Marinchev,I.(2000).Metric-basedinductivelearningusingsemanticheight

functions.InProceedingsofthe11thEuropeanConferenceonMachineLearning(ECML2000).Springer.Martin,D.,Paolucci,M.,McIlraith,S.,Burstein,M.,McDermott,D.,McGuinness,D.,

Parsia,B.,Payne,T.,Sabou,M.,Solanki,M.,Srinivasan,N.,&Sycara,K.(2004).Bringingsemanticstowebservices:Theowl-sapproach.InProceedingsoftheFirstInternationalWorkshoponSemanticWebServicesandWebProcessComposition(SWSWPC2004).Muggleton,S.,&Feng,C.(1990).Eﬃcientinductionoflogicprograms.InProceedingsof

the1stConferenceonAlgorithmicLearningTheory.Muggleton,S.(1995).InverseentailmentandProgol.NewGenerationComputing,Special

issueonInductiveLogicProgramming,13(3-4),245–286.N´edellec,C.,Rouveirol,C.,Ad´e,H.,Bergadano,F.,&Tausend,B.(1996).Declarative

biasinILP.InDeRaedt,L.(Ed.),AdvancesinInductiveLogicProgramming,pp.82–103.IOSPress.Pazzani,M.J.,&Kibler,D.F.(1992).Theutilityofknowledgeininductivelearning.

MachineLearning,9,57–94.Perkowitz,M.,&Etzioni,O.(1995).Categorytranslation:Learningtounderstandinforma-tionontheinternet.InProceedingsoftheFourteenthInternationalJointConferenceonArtiﬁcialIntelligence(IJCAI-95).Perkowitz,M.,Doorenbos,R.B.,Etzioni,O.,&Weld,D.S.(1997).Learningtounder-standinformationontheinternet:Anexample-basedapproach.JournalofIntelligentInformationSystems,8(2),133–153.Pottinger,R.,&Halevy,A.Y.(2001).Minicon:Ascalablealgorithmforansweringqueries

usingviews.VLDBJournal,10(2-3).Quinlan,J.R.,&Cameron-Jones,R.M.(1993).FOIL:Amidtermreport.InMachine

Learning:ECML-93,EuropeanConferenceonMachineLearning,Proceedings,Vol.667,pp.3–20.Springer-Verlag.Rahm,E.,&Bernstein,P.(2001).Asurveyofapproachestoautomaticschemamatching.

VLDBJournal,10(4).Richards,B.L.,&Mooney,R.J.(1992).Learningrelationsbypathﬁnding.InNational

ConferenceonArtiﬁcialIntelligence,pp.50–55.

Carman&Knoblock

Roman,D.,Keller,U.,Lausen,H.,deBruijn,J.,Lara,R.,Stollberg,M.,Polleres,A.,

Feier,C.,Bussler,C.,&Fensel,D.(2005).Webservicemodelingontology.AppliedOntology,1(1),77–106.Ullman,J.D.(1989).PrinciplesofDatabaseandKnowledge-BaseSystems,Vol.2.Com-puterSciencePress,Rockville,Maryland.Weber,I.,Tausend,B.,&Stahl,I.(1995).Languageseriesrevisited:Thecomplexityof

hypothesisspacesinILP.InProceedingsofthe8thEuropeanConferenceonMachineLearning,Vol.912,pp.360–363.Springer-Verlag.Widom,J.(1995).Researchproblemsindatawarehousing.InCIKM’95:Proceedingsof

thefourthInternationalConferenceonInformationandKnowledgeManagement,pp.25–30.ACMPress.Wiederhold,G.(1992).Mediatorsinthearchitectureoffutureinformationsystems.Com-puter,25(3),38–49.Wiederhold,G.(Ed.).(1996).IntelligentIntegrationofInformation.KluwerAcademic

Publishers,BostonMA.Winkler,W.(1999).Thestateofrecordlinkageandcurrentresearchproblems.Tech.rep.,

StatisticalResearchDivision,U.S.BureauoftheCensus,Washington,DC.Yan,L.L.,Miller,R.J.,Haas,L.M.,&Fagin,R.(2001).Data-drivenunderstanding

andreﬁnementofschemamappings.InSIGMOD’01:Proceedingsofthe2001ACMSIGMODInternationalConferenceonManagementofdata.Zelle,J.M.,Thompson,C.A.,Caliﬀ,M.E.,&Mooney,R.J.(1995).Inducinglogic

programswithoutexplicitnegativeexamples.InProceedingsoftheFifthInternationalWorkshoponInductiveLogicProgramming.

因篇幅问题不能全部显示，请点此查看更多更全内容

查看全文

全部栏目

Learning Semantic Definitions of Online Information Sources