您的当前位置:首页正文

Learning Semantic Definitions of Online Information Sources

2023-10-27 来源:步旅网
JournalofArtificialIntelligenceResearch30(2007)1-50Submitted11/06;published9/07

LearningSemanticDefinitionsofOnlineInformationSources

MarkJamesCarmanCraigA.Knoblock

UniversityofSouthernCaliforniaInformationSciencesInstitute4676AdmiraltyWay

MarinadelRey,CA90292

mark@bradipo.netknoblock@isi.edu

Abstract

TheInternetcontainsaverylargenumberofinformationsourcesprovidingmanytypesofdatafromweatherforecaststotraveldealsandfinancialinformation.ThesesourcescanbeaccessedviaWeb-forms,WebServices,RSSfeedsandsoon.Inordertomakeautomateduseofthesesources,weneedtomodelthemsemantically,butwritingsemanticdescriptionsforWebServicesisbothtediousanderrorprone.Inthispaperweinvestigatetheproblemofautomaticallygeneratingsuchmodels.WeintroduceaframeworkforlearningDatalogdefinitionsofWebsources.Inordertolearnthesedefinitions,oursystemactivelyinvokesthesourcesandcomparesthedatatheyproducewiththatofknownsourcesofinformation.Itthenperformsaninductivelogicsearchthroughthespaceofplausiblesourcedefinitionsinordertolearnthebestpossiblesemanticmodelforeachnewsource.Inthispaperweperformanempiricalevaluationofthesystemusingreal-worldWebsources.Theevaluationdemonstratestheeffectivenessoftheapproach,showingthatwecanautomaticallylearncomplexmodelsforrealsourcesinreasonabletime.Wealsocompareoursystemwithacomplexschemamatchingsystem,showingthatourapproachcanhandlethekindsofproblemstackledbythelatter.

1.Introduction

Recentyearshaveseenanexplosioninthequantityandvarietyofinformationavailableonline.Onecanfindshoppingdata(pricesandavailabilityofgoods),geospatialdata(weatherforecasts,housinginformation),traveldata(flightpricingandstatus),financialdata(exchangeratesandstockquotes),andthatjustscratchesthesurfaceofwhatisavailable.Theaimofthisworkistomakeuseofthatvaststoreofinformation.

Astheamountofinformationhasincreased,sotoohasitsreuseacrossWebportalsandapplications.Developershaverealisedtheimportanceofmanagingcontentseparatelyfrompresentation,leadingtothedevelopmentofXMLasaself-describingdataformat.ContentinXMLisfareasiertomanipulatethanHTML,simplifyingintegrationacrossdifferentsources.Standardshavealsoemergedforprovidingprogrammaticaccesstodata(likeSOAPandREST)sothatdeveloperscaneasilybuildprograms(calledMash-Ups)thatcombinecontentfromdifferentsitesinreal-time.Manyportalsnowprovidesuchaccesstotheirdataandsomeevenprovidesyntacticdefinitions(inWSDL)oftheinputandoutputthesedatasourcesexpect.Missing,however,aresemanticdescriptionsofwhateachsourcedoes,whichisrequiredinordertosupportautomateddataintegration.

c2007AIAccessFoundation.Allrightsreserved.󰀂

Carman&Knoblock

1.1StructuredQuerying

Givenallthestructuredsourcesavailable,wewouldliketocombinethedatadynamicallytoanswerspecificuserrequests(asopposedtostaticallyinthecaseofMash-Ups).Dynamicdatarequestscanbeexpressedasqueriessuchasthoseshownbelow.Suchqueriesmayrequireaccesstomultiple(publiclyavailable)datasourcesandcombineinformationinwaysthatwerenotenvisagedbyitsproducers.

1.Tourism:Getpricesandavailabilityforallthreestarhotelswithin100kilometersofTrento,Italythatliewithin1kilometerofaskiresortthathasover1meterofsnow.2.Transportation:DeterminethetimethatIneedtoleaveworkinordertocatchabustotheairporttopickupmybrotherwhoisarrivingonQantasflight205.

3.DisasterPrevention:Findphonenumbersforallpeoplelivingwithin1mileofthecoastandbelow200feetofelevation.Itisclearevenfromthissmallsetofexamplesjusthowpowerfultheabilitytocombinedatafromdisparatesourcescanbe.Inordertogivethesequeriestoanautomatedsystem,wemustfirstexpressthemformallyinaquerylanguagesuchasSQLorDatalog(Ullman,1989).InDatalogthefirstquerymightlookasfollows:

q(hotel,price):-accommodation(hotel,3*,address),available(hotel,today,price),distance(address,󰀛Trento,Italy󰀜,dist1),dist1<100km,skiResort(resort,loc1),distance(address,loc1,dist2),

dist2<1km,snowCondiditions(resort,today,height),height>1m.

Theexpressionstatesthathotelandpricepairsaregeneratedbylookingupthreestarhotelsinarelationaltablecalledaccommodation,thencheckingthepricefortomorrownightinatablecalledavailable.TheaddressofeachhotelisusedtocalculatethedistancetoTrento,whichmustbelessthan100kilometers.ThequeryalsochecksthatthereisaskiResortwithin1kilometerofthehotel,andthatthesnowConditionsfortodayshowmorethan1meterofsnow.1.2Mediators

AsystemcapableofgeneratingaplantoanswersuchaqueryiscalledanInformationMediator(Wiederhold,1992).Inordertogenerateaplan,mediatorslookforsourcesthatarerelevanttothequery.Inthiscase,relevantsourcesmightbe:

1.TheItalianTourismWebsite:tofindallhotelsnear‘Trento,Italy’.2.ASkiSearchEngine:tofindskiresortsneareachhotel.

3.AWeatherProvider:tofindouthowmuchsnowhasfallenateachskiresort.Foramediatortoknowwhichsourcesarerelevant,itneedstoknowwhatinformationeachsourceprovides.WhileXMLdefinesthesyntax(formatting)usedbyasource,thesemantics(intendedmeaning)oftheinformationthesourceprovidesmustbedefinedseparately.ThiscanbedoneusingLocal-as-View(LAV)sourcedefinitionsinDatalog(Levy,2000).Essentially,sourcedefinitionsdescribequeriesthatifgiventoamediator,willreturnthesamedataasthesourceprovides.Exampledefinitionsareshownbelow.Thefirststates

2

LearningSemanticDefinitionsofInformationSourcesontheInternet

thatthesourcehotelSearchtakesfourvaluesasinput(inputsareprefixedbythe$-symbol),andreturnsalistofhotelswhichliewithinthegivendistanceoftheinputlocation.Foreachhotelitalsoreturnstheaddressaswellasthepriceforaroomonthegivendate.(NotethatthesourceonlyprovidesinformationforhotelsinItaly.)

hotelSearch($location,$distance,$rating,$date,hotel,address,price):-country(location,Italy),accommodation(hotel,rating,address),available(hotel,date,price),distance(address,location,dist1),dist1findSkiResorts($address,$distance,resort,location):-skiResort(resort,location),distance(address,location,dist1),dist1getSkiConditions($resort,$date,height):-snowCondiditions(resort,date,height).

Inordertogenerateaplanforansweringthequery,amediatorperformsaprocesscalledqueryreformulation(Levy,2000),wherebyittransformsthequeryintoanewqueryover(intermsof)therelevantinformationsources.1(Asourceisrelevantifitreferstothesamerelationsasthequery.)Theresultingplaninthiscaseisshownbelow.

q(hotel,price):-hotelSearch(󰀛Trento,Italy󰀜,100km,3*,today,hotel,address,price),findSkiResorts(address,1km,resort,location),

getSkiConditions(resort,today,height),height>1m.

Inthiswork,thequestionsofinterestare:wheredoallthedefinitionsfortheseinformationsourcescomefromandmoreprecisely,whathappenswhenwewanttoaddnewsourcestothesystem?Isitpossibletogeneratethesesourcedefinitionsautomatically?1.3DiscoveringNewSources

Intheexampleabove,themediatorknowsofasetofrelevantsourcestousetosuccessfullyanswerthequery.Ifinstead,oneofthosesourcesismissingordoesn’thavethedesiredscope(e.g.getSkiConditionsdoesn’tprovidedataforTrento),thenthemediatorfirstneedstodiscoverasourceprovidingthatinformation.Asthenumberandvarietyofinformationsourcesincrease,wewillundoubtedlyrelyonautomatedmethodsfordiscoveringthemandannotatingthemwithsemanticdescriptions.Inordertodiscoverrelevantsources,asystemmightinspectaserviceregistry2(suchasthosedefinedinUDDI),orperformkeyword-basedsearchoveraWebindex(suchasGoogleordel.icio.us).Theresearchcommunityhaslookedattheproblemofdiscoveringrelevantservices,developingtechniquesforclassifyingservicesintodifferentdomains(suchasweatherandflights)usingservicemetadata(Heß&Kushmerick,2003)andclusteringsimilarservicestogethertoimprovekeyword-basedsearch(Dong,Halevy,Madhavan,Nemes,&Zhang,2004).Thesetechniques,althoughuseful,arenotsufficientforautomatingserviceintegration.

1.Thecomplexityofqueryreformulationisknowntobeexponential,althoughefficientalgorithmsforperformingitdoexist(Pottinger&Halevy,2001).

2.Notethattechnically,aserviceisdifferentfromasource.Aserviceisaninterfaceprovidingaccesstomultipleoperations,eachofwhichmayprovideinformation.Ifanoperationdoesnotaffectthe“stateoftheworld”(e.g.bychargingsomebody’screditcard),thenwecallitaninformationsource.Forthispaper,however,weusethetermservicetoreferonlytoinformationsources.

3

Carman&Knoblock

1.4LabelingServiceInputsandOutputs

Oncearelevantserviceisdiscovered,theproblemshiftstomodelingitsemantically.Model-ingsourcesbyhandcanbelaborious,soautomatingtheprocessmakessense.Sincedifferentservicesoftenprovidesimilaroroverlappingdata,itshouldbepossibletouseknowledgeofpreviouslymodeledservicestolearndescriptionsfornewones.

Thefirststepinmodelingasourceistodeterminewhattypeofdataitrequiresasinputandproducesasoutput.Thisisdonebyassigningsemantictypes(likezipcode,telephonenumber,temperature,andsoon)totheattributesofaservice.Semantictypesrestrictthepossiblevaluesofanattributetoasubsetofthecorrespondingprimitivetype.Theresearchcommunityhasinvestigatedautomatingtheassignmentprocessbyviewingitasaclassificationproblem(Heß&Kushmerick,2003).Intheirsystem,HeßandKushmericktrainedaSupportVectorMachine(SVM)onmetadatadescribingdifferentsources.Thesystem,givenasourcesuchasthefollowing:

getWeather($zip,temp)

usesthelabelsgetWeather,zipandtemp(andanyotheravailablemetadata)toassigntypestotheinputandoutputattributes,e.g.:zip→zipcode,temp→temperature.Notethatadditionalmetadataisoftenusefulfordistinguishingbetweenpossibleassignments.(If,forexample,thenameoftheoperationhadbeenlistEmployees,thentempmayhavereferredtoatemporaryemployeeratherthanatemperature.)

Insubsequentwork,researchersdevelopedamorecomprehensivesystemthatusedbothmetadataandoutputdatatoclassifyserviceattributes(Lerman,Plangprasopchok,&Knoblock,2006).Inthatsystem,aLogisticRegressionbasedclassifierfirstassignsse-mantictypestoinputparameters.Examplesofthoseinputtypesarethenusedtoinvoketheservice,andtheoutputisgiventoapattern-languagebasedclassifier,thatassignstypestotheoutputparameters.Theauthorsarguethatclassificationbasedonbothdataandmetadataisfarmoreaccuratethanthatbasedonmetadataalone.Usinganexample,itiseasytoseewhy.ConsiderthefollowingtuplesproducedbyourgetWeathersource:

󰀛90292,25◦C󰀜,󰀛10274,15◦C󰀜,󰀛60610,18◦C󰀜,...

Giventhedata,theclassifiercanbecertainthattempreallydoesreferstoatemperature,andindeedcanevenassignittoamorespecifictype,temperatureC(inCelsius).

Whiletheproblemofdeterminingthesemantictypesofaservice’sattributesisveryinteresting,andthereisroomforimprovementoncurrenttechniques,weassumeforthepurposesofthisworkthatithasalreadybeensolved.1.5GeneratingaDefinition

Onceweknowtheparametertypes,wecaninvoketheservice,butwearestillunabletomakeuseofthedataitreturns.Todothat,weneedalsotoknowhowtheoutputattributesrelatetotheinput(i.e.adefinitionforthesource).Forexample,forthegetWeatherserviceweneedtoknowwhetherthetemperaturebeingreturnedisthecurrenttemperature,thepredictedhightemperaturefortomorrowortheaveragetemperatureforthistimeofyear.Suchrelationshipscanbedescribedbythefollowingdefinitions:

getWeather($zip,temp):-currentTemp(zip,temp).

getWeather($zip,temp):-forecast(zip,tomorrow,temp).getWeather($zip,temp):-averageTemp(zip,today,temp).

4

LearningSemanticDefinitionsofInformationSourcesontheInternet

Therelationsusedinthesedefinitionswouldbedefinedinadomainontology(orschema).Inthispaperwedescribeasystemcapableoflearningwhichifanyofthesedefinitionsisthecorrect.Thesystemleverageswhatitknowsaboutthedomain,i.e.thedomainontologyandasetofknowninformationsources,tolearnwhatitdoesnotknow,namelytherelationshipbetweentheattributesofanewlydiscoveredsource.1.6Outline

Thispaperpresentsacomprehensivetreatmentofmethodsforlearningsemanticdescrip-tionsofWebinformationsources.Itextendsourpreviousworkonthesubject(Carman&Knoblock,2007)bypresentingdetaileddescriptionsofthemethodsforbothenumeratingthesearchspaceandevaluatingtheindividualcandidatedefinitions.Weprovideadditionaldetailsregardingtheevaluationmethodologyandtheresultsgenerated.

Thepaperisstructuredasfollows.Westartwithanexampletomotivatethesourceinductionproblemandthenformulatetheproblemconcisely.Wediscussourapproachtolearningdefinitionsforsourcesintermsofotherknownsourcesofinformation(section3).Wegivedetailsofthesearchprocedureforgeneratingcandidatedefinitions(section4)andoftheevaluationprocedureforscoringcandidatesduringsearch(section5).Wethendescribeextensionstothebasicalgorithm(section6)beforediscussingtheevaluationsetupandtheexperiments(section7),whichdemonstratethecapabilitiesofoursystem.Finally,wecontrastourapproachwithpriorwork.

2.Problem

Wenowdescribeindetailtheproblemoflearningdefinitionsfornewlydiscoveredservices.Westartwithaconcreteexampleofwhatismeantbylearningasourcedefinition.Intheexampletherearefourtypesofdata(semantictypes),namely:zipcodes,distances,latitudesandlongitudes.Therearealsothreeknownsourcesofinformation.EachofthesourceshasadefinitioninDatalogasshownbelow.Thefirstservice,aptlynamedsource1,takesinazipcodeandreturnsthelatitudeandlongitudecoordinatesofitscentroid.Thesecondservicecalculatesthegreatcircledistance(theshortestdistanceovertheearth’ssurface)betweentwopairsofcoordinates,andthethirdconvertsadistancefromkilometresintomilesbymultiplyingtheinputbytheconstant1.609.

source1($zip,lat,long):-centroid(zip,lat,long).source2($lat1,$long1,$lat2,$long2,dist):-greatCircleDist(lat1,long1,lat2,long2,dist).

source3($dist1,dist2):-multiply(dist1,1.609,dist2).

Thegoalinthisexampleistolearnadefinitionforanewlydiscoveredservice,calledsource4.Thisservicetakestwozipcodesasinputandreturnsadistancevalueasoutput:

source4($zip1,$zip2,dist)

Thesystemwewilldescribeusesthistypesignature(inputandoutputtypeinformation)tosearchforanappropriatedefinitionforthesource.Thedefinitiondiscoveredinthiscasemightbethefollowingconjunctionofcallstotheindividualsources:

source4($zip1,$zip2,dist):-source1($zip1,lat1,long1),source1($zip2,lat2,long2),

source2($lat1,$long1,$lat2,$long2,dist2),source3($dist2,dist).

5

Carman&Knoblock

Thedefinitionstatesthatsource’soutputdistancecanbecalculatedfromtheinputzipcodes,bygivingthosezipcodestosource1,takingtheresultingcoordinatesandcalculatingthedistancebetweenthemusingsource2,andthenconvertingthatdistanceintomilesusingsource3.Totestwhetherthisdefinitioniscorrect,thesystemmustinvokeboththenewsourceandthedefinitiontoseeifthevaluesgeneratedagreewitheachother.Thefollowingtableshowssuchatest:

$zip1802106060110005

$zip2902661520135555

dist(actual)842.37410.31899.50

dist(predicted)843.65410.83899.21

Inthetable,theinputzipcodeshavebeenselectedrandomlyfromasetofexamples,andtheoutputfromthesourceandthedefinitionareshownsidebyside.Sincetheoutputvaluesarequitesimilar,oncethesystemhasseenasufficientnumberofexamples,itcanbeconfidentthatithasfoundthecorrectsemanticdefinitionforthesource.

Thedefinitionaboveisgivenintermsofthesourcerelations,butcouldalsohavebeenwrittenintermsofthedomainrelations(therelationsusedtodefinesources1to3).Toconvertthedefinitionintothatform,onesimplyneedstoreplaceeachsourcerelationbyitsdefinitionasfollows:

source4($zip1,$zip2,dist):-centroid(zip1,lat1,long1),centroid(zip2,lat2,long2),

greatCircleDist(lat1,long1,lat2,long2,dist2),multiply(dist1,1.609,dist2).Writteninthisway,thenewsemanticdefinitionmakessenseatanintuitivelevel:thesourceissimplycalculatingthedistanceinmilesbetweenthecentroidsofthetwozipcodes.2.1ProblemFormulation

HavinggivenanexampleoftheSourceDefinitionInductionProblem,wenowdescribetheproblemmoreformally,butbeforedoingso,weintroducesomeconceptsandnotation.(Wenotethatourfocusinthispaperisonlearningdefinitionsforinformation-providing,asopposedto“world-altering”services.)

•Thedomainofasemanticdata-typet,denotedD[t],isthe(possiblyinfinite)setofconstantvalues{c1,c2,...},whichconstitutethesetofvaluesforvariablesofthattype.ForexampleD[zipcode]={90210,90292,...}

•Anattributeisapair󰀛label,semanticdata-type󰀜,e.g.󰀛zip1,zipcode󰀜.Thetypeofattributeaisdenotedtype(a)andthecorrespondingdomainD[type(a)]isabbreviatedtoD[a].

•Aschemeisanordered(finite)setofattributes󰀛a1,...,an󰀜withuniquelabels,wherenisreferredtoasthearityofthescheme.Anexampleschememightbe󰀛zip1:zipcode,zip2:zipcode,dist:distance󰀜.ThedomainofaschemeA,denotedD[A],istheCartesianproductofthedomainsoftheattributesinthescheme{D[a1]×...×D[an]},ai∈A.

•AtupleoveraschemeAisanelementfromthesetD[A].Atuplecanberepresentedbyasetofname-valuepairs,suchas{zip1=90210,zip2=90292,dist=8.15}

6

LearningSemanticDefinitionsofInformationSourcesontheInternet

•Arelationisanamedscheme,suchasairDistance(zip1,zip2,dist).Multiplere-lationsmaysharethesamescheme.

•Anextensionofarelationr,denotedE[r],isasubsetofthetuplesinD[r].Forexam-ple,E[airDistance]mightbeatablecontainingthedistancebetweenallzipcodesinCalifornia.(Notethattheextensionofarelationmayonlycontaindistincttuples.)•AdatabaseinstanceoverasetofrelationsR,denotedI[R],isasetofextensions{E[r1],...,E[rn]},oneforeachrelationr∈R.

•AquerylanguageLisaformallanguageforconstructingqueriesoverasetofrelations.WedenotethesetofallqueriesthatcanbewrittenusingthelanguageLoverthesetofrelationsRreturningtuplesconformingtoaschemeAasLR,A.

•Theresultsetproducedbytheexecutionofaqueryq∈LR,AonadatabaseinstanceI[R]isdenotedEI[q].

•Asourceisarelations,withabindingpatternβs⊆s,whichdistinguishesinputattributesfromoutputattributes.(Theoutputattributesofasourcearedenotedby

c≡s\\β.)thecomplementofthebindingpattern3,βss

•AviewdefinitionforasourcesisaqueryvswritteninsomequerylanguageLR,s.TheSourceDefinitionInductionProblemisdefinedasatuple:

󰀛T,R,L,S,V,s∗󰀜

whereTisasetofsemanticdata-types,Risasetofrelations,Lisaquerylanguage,Sisasetofknownsources,Visasetofviewdefinitions(oneforeachknownsource),ands∗isthenewsource(alsoreferredtoasthetarget).

Eachsemantictypet∈TmustbeprovidedwithasetofexamplesvaluesEt⊆D[t].(WedonotrequiretheentiresetD[t],becausethedomainofmanytypesmaybepartiallyunknownortoolargetobeenumerated.)Inaddition,apredicateeqt(t,t)isavailableforcheckingequalitybetweenvaluesofeachsemantictypetohandlethecasewheremultipleserialisationsofavariablerepresentthesamevalue.

Eachrelationr∈Risreferredtoasaglobalrelationordomainpredicateanditsextensionisvirtual,meaningthattheextensioncanonlybegeneratedbyinspectingeveryrelevantdatasource.ThesetofrelationsRmayincludesomeinterpretedpredicates,suchas≤,whoseextensionisdefinedandnotvirtual.

ThelanguageLusedforconstructingqueriescouldbeanyquerylanguageincludingSQLandXQuery(theXMLQueryLanguage).InthispaperwewilluseaformofDatalog.

Eachsources∈ShasanextensionE[s]whichisthecompletesetoftuplesthatcanbeproducedbythesource(atagivenmomentintime).Werequirethatthecorrespondingviewdefinitionvs∈V(writteninLR,s)isconsistentwiththesource,suchthat:E[s]⊆EI[vs],(whereI[R]isthecurrentvirtualdatabaseinstanceovertheglobalrelations).Notethatwedonotrequireequivalence,becausesomesourcesmayprovideincompletedata.

Theviewdefinitionforthesourcetobemodeleds∗isunknown.ThesolutiontotheSourceDefinitionInductionProblemisaviewdefinitionv∗∈LR,s∗forthesources∗suchthatE[s∗]⊆EI[v∗],andthereisnootherviewdefinitionv󰀄∈LR,s∗thatbetterdescribes(providesatighterdefinitionfor)thesources∗,i.e.:

¬∃v󰀄∈LR,s∗s.t.E[s∗]⊆EI[v󰀄]∧|EI[v󰀄]|<|EI[v∗]|

3.The‘\\’-symboldenotessetdifference.

7

Carman&Knoblock

Givenlimitedavailablecomputationandbandwidth,wenotethatitmaynotbepossibletoguaranteethatthisoptimalityconditionholdsforaparticularsolution;thusinthispaperwewillsimplystrivetofindthebestsolutionpossible.2.2ImplicitAssumptions

Anumberofassumptionsareimplicitintheproblemformulation.Thefirstisthatthereexistsasystemcapableofdiscoveringnewsourcesandmoreimportantlyclassifying(togoodaccuracy)thesemantictypesoftheirinputandoutput.Systemscapableofdoingthiswerediscussedinsection1.4.

Thesecondassumptionhastodowiththerepresentationofeachsourceasarelationalviewdefinition.MostsourcesontheInternetprovidetreestructuredXMLdata.Itmaynotalwaysbeobvioushowbesttoflattenthatdataintoasetofrelationaltuples,whilepreservingtheintendedmeaningofthedata.Consideratravelbookingsitewhichreturnsasetofflightoptionseachwithaticketnumber,apriceandalistofflightsegmentsthatconstitutetheitinerary.Onepossibilityforconvertingthisdataintoasetoftupleswouldbetobreakeachticketupintoindividualflightsegmenttuples(therebyobscuringtherelationshipbetweenpriceandthenumberofflightsegments).Anotherwouldbetocreateoneverylongtupleforeachticketwithroomforanumberofflightsegments(therebycreatingtupleswithmanynullvalues).Inthiscase,itisnotobviouswhich,ifeither,oftheseoptionsistobepreferred.Mostonlinedatasourcescan,however,bemodeledquitenaturallyasrelationalsources;andbyfirsttacklingtherelationalproblem,wecandeveloptechniquesthatcanlaterbeappliedtothemoredifficultsemi-structuredcase.

Athirdassumptionisthatthesetofdomainrelationssufficesfordescribingthesourcetobemodeled.Forinstance,considerthecasewherethedomainmodelonlycontainsrelationsdescribingfinancialdata,andthenewsourceprovidesweatherforecasts.Obviously,thesystemwouldbeunabletofindanadequatedescriptionofthebehaviorofthesourceandwouldnotlearnamodelofit.Fromapracticalperspective,thislimitationisnotabigproblem,sinceausercanonlyrequestdata(writequeriestoamediator)usingtherelationsinthedomainmodelanyway.(Thusanysourcewhichcannotbedescribedusingthoserelationswouldnotbeneededtoansweruserrequests.)Inotherwords,theonusisonthedomainmodelertomodelsufficientrelationssoastobeabletodescribethetypesofqueriesausershouldbeabletoposetothesystemandconsequently,thetypesofsourcesthatshouldbeavailable.Thatsaid,aninterestingavenueforfutureresearchwouldbetoinvestigatetheproblemofautomating(atleastinpart)theprocessofexpandingthescopeofthedomainmodel(byaddingattributestorelations,orinventingnewones),basedonthetypesofsourcesdiscovered.2.3ProblemDiscussion

Anumberofquestionsarisefromtheproblemformulation,thefirstbeingwherethedomainmodelcomesfrom.Inprinciple,thesetofsemantictypesandrelationscouldcomefrommanyplaces.Itcouldbetakenfromstandarddatamodelsforthedifferentdomains,oritmightjustbethesimplestmodelpossiblethataptlydescribesthesetofknownsources.Thedomainmodelmaythenevolveovertimeassourcesarediscoveredforwhichnoappropriatemodelcanbefound.Somewhatrelatedisthequestionofhowspecificthesemantictypes

8

LearningSemanticDefinitionsofInformationSourcesontheInternet

oughttobe.Forexample,isitsufficienttohaveonesemantictypedistanceorshouldonedistinguishbetweendistanceinmetersanddistanceinfeet?Generallyspeaking,asemantictypeshouldbecreatedforeachattributethatissyntacticallydissimilartoallotherattributes.Forexample,aphonenumberandazipcodehaveverydifferentsyntax,thusoperationswhichacceptoneofthetypesasinputareunlikelytoaccepttheother.Inpractice,onemightcreateanewsemantictypewheneveratrainedclassifiercanrecognisethetypebasedonitssyntaxalone.Ingeneral,themoresemantictypesthereare,theharderthejobofthesystemclassifyingtheattributes,andtheeasierthejobofthesystemtaskedwithlearningadefinitionforthesource.

Anotherquestiontobeconsiderediswherethedefinitionsfortheknownsourcescomefrom.Initiallysuchdefinitionswouldneedtobewrittenbyhand.Asthesystemlearnsdefinitionsfornewsources,theytoowouldbeaddedtothesetofknownsources,makingitpossibletolearnevermorecomplicateddefinitions.

Inorderforthesystemtolearnadefinitionforanewsource,itmustbeabletoinvokethatsourceandthusneedsexamplesoftheinputtypes.Themorerepresentativethesetofexamplesavailable,themoreefficientandaccuratethelearningprocesswillbe.Aninitialsetofexampleswillneedtobeprovidedbythedomainmodeler.Then,asthesystemlearnsovertime,itwillgeneratealargenumberofexamplesofdifferentsemantictypes(asoutputfromvarioussources),whichcanberetainedforfutureuse.

InformationIntegrationresearchhasreachedapointwheremediatortechnology4isbecomingmatureandpractical.Theneedtoinvolveahumaninthewritingofsourcedefinitionsis,however,theAchilles’Heelofsuchsystems.Thegainsinflexibilitythatcomewiththeabilitytodynamicallyreformulateuserqueriesareoftenpartiallyoffsetbythetimeandskillrequiredtowritedefinitionswhenincorporatingnewsources.Thusasystemcapableoflearningdefinitionsautomaticallycouldgreatlyenhancetheviabilityofmediatortechnology.Thismotivationaloneseemssufficientforpursuingtheproblem.

3.Approach

TheapproachwetaketolearningsemanticmodelsforinformationsourcesontheWebistwofold.Firstly,wechoosetomodelsourcesusingthepowerfullanguageofconjunctivequeries.Secondly,weleveragethesetofknownsourcesinordertolearnadefinitionforthenewone.Inthissectionwediscusstheseaspectsinmoredetail.3.1ModelingLanguage

ThesourcedefinitionlanguageListhehypothesislanguageinwhichnewdefinitionswillneedtobelearnt.Asisoftenthecaseinmachinelearning,wearefacedwithatrade-offwithrespecttotheexpressivenessofthislanguage.Ifthehypothesislanguageistoosimple,thenwemaynotbeabletomodelrealservicesusingit.Ontheotherhand,ifthelanguageisoverlycomplex,thenthespaceofpossiblehypotheseswillbesolargethatlearningwillnotbefeasible.ThelanguagewechooseisthatofconjunctivequeriesinDatalog,whichis

4.InfluentialInformationIntegrationsystemsincludeTSIMMIS(Garcia-Molina,Hammer,Ireland,Pa-pakonstantinou,Ullman,&Widom,1995),SIMS(Arens,Knoblock,&Shen,1996),InfoMaster(Duschka,1997),andAriadne(Knoblock,Minton,Ambite,Ashish,Muslea,Philpot,&Tejada,2001).

9

Carman&Knoblock

ahighlyexpressiverelationalquerylanguage.Inthissectionwearguewhyalessexpressivelanguageisnotsufficientforourpurposes.

ResearchersinterestedintheproblemofassigningsemanticstoWebServices(Heß&Kushmerick,2003)haveinvestigatedtheproblemofusingMachineLearningtechniquestoclassifyservices(basedonmetadatacharacteristics)intodifferentsemanticdomains,suchasweatherandflights,andtheoperationstheyprovideintodifferentclassesofoperation,suchasweatherForecastandflightStatus.Fromarelationalperspective,wecanconsiderthedifferentclassesofoperationsasrelations.Forinstance,considerthedefinitionbelow:

source($zip,temp):-weatherForecast(zip,tomorrow,temp).

ThesourceprovidesweatherdatabyselectingtuplesfromarelationcalledweatherForecast,whichhasthedesiredzipcodeanddateequaltotomorrow.Thisqueryisreferredtoasaselect-projectquerybecauseitsevaluationcanbeperformedusingtherelationaloperatorsselectionandprojection.Sofarsogood,wehavebeenabletouseasimpleclassifiertolearnasimpledefinitionforasource.Thelimitationimposedbythisrestricted(select-project)modelinglanguagebecomesobvious,however,whenweconsiderslightlymorecomplicatedsources.ConsiderasourcethatprovidesthetemperatureinFahrenheitaswellasCelsius.Inordertomodelsuchasourceusingaselect-projectquery,wewouldrequirethattheweatherForecastrelationbeextendedwithanewattributeasfollows:

source($zip,tempC,tempF):-weatherForecast(zip,tomorrow,tempC,tempF).

Themoreattributesthatcouldconceivablybereturnedbyaweatherforecastoperation(suchasdewpoint,humidity,temperatureinKelvin,latitude,etc.),thelongertherelationwillneedtobetocoverthemall.Better,inthiscase,wouldbetointroduceasecondrelationconvertCtoFthatmakesexplicittherelationshipbetweenthetemperaturevalues.If,inaddition,thesourcelimitsitsoutputtozipcodesinCalifornia,areasonabledefinitionforthesourcemightbe:

source($zip,tempC,tempF):-weatherForecast(zip,tomorrow,tempC),convertCtoF(tempC,tempF),state(zip,California).

Thisdefinitionisnolongerexpressedinthelanguageofselect-projectqueries,becauseitnowinvolvesmultiplerelationsandjoinsacrossthem.Thusfromthissimpleexample,weseethatmodelingservicesusingsimpleselect-projectqueriesisnotsufficientforourpurposes.Whatweneedareselect-project-joinqueries,alsoreferredtoasconjunctivequeries.5Thereaderhasalreadybeenintroducedtoexamplesofconjunctivequeriesthroughouttheprevioussections.ConjunctivequeriesformasubsetofthelogicalquerylanguageDatalogandcanbedescribedmoreformallyasfollows:

AconjunctivequeryoverasetofrelationsRisanexpressionoftheform:q(X0):-r1(X1),r2(X2),...,rl(Xl).

whereeachri∈RisarelationandXiisanorderedsetofvariablenamesofsizearity(ri).6Eachconjunctri(Xi)isreferredtoasaliteral.Thesetofvariables

󰀃

inthequery,denotedvars(q)=li=0Xi,consistsofdistinguishedvariablesX0(fromtheheadofthequery),andexistentialvariablesvars(q)\\X0,(which

5.Evaluatingaselect-project-joinqueryrequiresadditionalrelationaloperators:naturaljoinandrename.6.Notethataconjunctivequerycanalsobeexpressedinfirstorderlogic󰀃asfollows:

l

∀X0∃Ys.t.r1(X1)∧r2(X2)∧...∧rl(Xl)→q(X0)whereX0∪Y=i=1Xi

10

LearningSemanticDefinitionsofInformationSourcesontheInternet

onlyappearinthebody).Aconjunctivequeryissaidtobesafeifallthe

󰀃

distinguishedvariablesappearinthebody,i.e.X0⊆li=1Xi.3.2MoreExpressiveLanguages

ModelingsourcesusingconjunctivequeriesimpliesthataggregateoperatorslikeMINandORDERcannotbeusedinsourcedefinitions.Thefunctionalityofmostsourcescanbedescribedwithoutsuchoperators.Somesourcescanonlybedescribedpoorly,however.Considerahotelsearchservicethatreturnsthe20closesthotelstoagivenlocation:

hotelSearch($loc,hotel,dist):-accommodation(hotel,loc1),distance(loc,loc1,dist).

Accordingtothedefinition,thesourceshouldreturnallhotelsregardlessofdistance.Onecannotexpressthefactthatonlytheclosesthotelswillbereturned.Thereasonfornotincludingaggregateoperatorsinthehypothesislanguageisthatthesearchspaceassociatedwithlearningdefinitionsisprohibitivelylarge.(Thusweleaveaggregateoperatorstofutureworkasdiscussedinsection9.2.)

Similarly,sourcedefinitionscannotcontaindisjunction,whichrulesoutunionandre-cursivequeries.Again,thissimplifyingassumptionholdsformostinformationsourcesandgreatlyreducesthesearchspace.Itmeanshowever,thataweatherserviceprovidingforecastsonlyforcitiesintheUSandCanadawouldbemodeledas:

s($city,temp):-forecast(city,country,tomorrow,temp).

Sincethedefinitiondoesnotrestrictthedomainofthecountryattribute,whenconfrontedwitharequestfortheforecastinAustralia,amediatorwouldproceedtocalltheservice,oblivioustoanyrestrictiononthatattribute.

Wealsodonotallownegationinthequeriesbecausesourcedefinitionsveryrarelyrequireit,soincludingitwouldneedlesslycomplicatethesearch.Inthoserarecaseswherethenegationofaparticularpredicateisusefulfordescribingcertaintypesofsources,thenegatedpredicatecanbeincluded(asadistinctpredicate)inthesearch.Forinstance,wemightuse“≥”todescribeasource,eventhoughstrictlyspeakingitisthenegationof“<”.3.3LeveragingKnownSources

Ourapproachtotheproblemofdiscoveringsemanticdefinitionsfornewservicesistoleveragethesetofknownsourceswhenlearninganewdefinition.Broadlyspeaking,wedothisbyinvokingtheknownsources(inamethodicalmanner)toseeifanycombinationoftheinformationtheyprovidematchestheinformationprovidedbythenewsource.Fromapracticalperspective,thismeansinordertomodelanewlydiscoveredsourcesemantically,werequiresomeoverlapinthedatabeingproducedbythenewsourceandthesetofknownsources.Onewaytounderstandthisistoconsideranewsourceproducingweatherdata.Ifnoneoftheknownsourcesproduceanyweatherinformation,thenthereisnowayforthesystemtolearnwhetherthenewsourceisproducinghistoricalweatherdata,weatherforecasts-oreventhatitisdescribingweatheratall.(Inprinciple,onecouldtrytoguesswhattheserviceisdoingbasedonthetypesignaturealone,buttherewouldbenoguaranteethatthedefinitionwascorrect,makingitoflittleusetoamediator.)Giventhisoverlappingdatarequirement,onemightclaimthatthereislittlebenefitinincorporatingnewsources.Wedetailsomeofthereasonswhythisisnotthecasebelow.

11

Carman&Knoblock

Themostobviousbenefitoflearningdefinitionsfornewsourcesisredundancy.Ifthesystemisabletolearnthatonesourceprovidesexactlythesameinformationasacurrentlyavailablesource,thenifthelattersuddenlybecomesunavailable,theformercanbeusedinitsplace.Forexample,ifamediatorknowsofoneweathersourceprovidingcurrentconditionsandlearnsthatasecondsourceprovidesthesameorsimilardata,thenifthefirstgoesdownforwhateverreason(perhapsbecauseanaccessquotahasbeenreached),weatherdatacanstillbeaccessedfromthesecond.

Thesecondandperhapsmoreinterestingreasonforwantingtolearnadefinitionforanewsourceisthatthenewsourcemayprovidedatawhichliesoutsidethescopeof(orsimplywasnotpresentin)thedataprovidedbytheothersources.Forexample,consideraweatherservicewhichprovidestemperaturevaluesforzipcodesintheUnitedStates.Thenconsiderasecondsourcethatprovidesweatherforecastsforcitiesworldwide.Ifthesystemcanusethefirstsourcetolearnadefinitionforthesecond,theamountofinformationavailableforqueryingincreasesgreatly.

Bindingconstraintsonaservicecanmakeaccessingcertaintypesofinformationdifficultorinefficient.Inthiscase,discoveringanewsourceprovidingthesameorsimilardatabutwithadifferentbindingpatternmayimproveperformance.Forexample,considerahotelsearchservicethatacceptsazipcodeandreturnsasetofhotelsalongwiththeirstarrating:

hotelSearch($zip,hotel,rating,street,city,state):-accommodation(hotel,rating,street,city,state,zip).

NowconsiderasimplequeryforthenamesandaddressesofallfivestarhotelsinCalifornia:

q(hotel,street,city,zip):-accommodation(hotel,5*,street,city,California,zip).Answeringthisquerywouldrequirethousandsofcallstotheknownsource,oneforeveryzipcodeinCalifornia,andamediatorcouldonlyanswerthequeryiftherewasanothersourceprovidingthosezipcodes.Incontrast,ifthesystemhadlearntadefinitionforanewsourcewhichprovidesexactlythesamedatabutwithadifferentbindingpattern(suchastheonebelow),thenansweringthequerywouldrequireonlyonecalltoasource:

hotelsByState($state,$rating,hotel,street,city,zip):-accommodation(hotel,rating,street,city,state,zip).

Oftenthefunctionalityofacomplexsourcecanbedescribedintermsofacompositionofthefunctionalityprovidedbyothersimplerservices.Forinstance,considerthemotivatingexamplefromsection2,inwhichthefunctionalityprovidedbythenewsourcewastocalcu-latethedistanceinmilesbetweentwozipcodes.Thesamefunctionalitycouldbeachievedbyperformingfourdifferentcallstotheavailablesources.Inthatcase,thedefinitionlearntbythesystemmeantthatanyqueryregardingthedistancebetweenzipcodescouldbehan-dledmoreefficiently.Ingeneral,bylearningdefinitionsformorecomplicatedsourcesintermsofsimplerones,thesystemcanbenefitfromcomputation,optimisationandcachingabilitiesofservicesprovidingcomplexfunctionality.

Finally,thenewlydiscoveredservicemaybefastertoaccessthantheknownsourcesprovidingsimilardata.Forinstance,considerageocodingservicethattakesinanaddressandreturnsthelatitudeandlongitudecoordinatesofthelocation.Becauseofthevarietyinalgorithmsusedtocalculatethecoordinates,it’snotunreasonableforsomegeocodingservicestotakeaverylongtime(upwardsofonesecond)toreturnaresult.Ifthesystemwereabletodiscoveranewsourceprovidingthesamegeocodingfunctionality,butusing

12

LearningSemanticDefinitionsofInformationSourcesontheInternet

afasteralgorithm,thenitcouldlocateanddisplaymanymoreaddressesonamapinthesameamountoftime.

4.InducingDefinitions

Inthissectionwedescribeanalgorithmforgeneratingcandidatedefinitionsforanewlydiscoveredsource.Thealgorithmformsthefirstphaseinagenerateandtestmethodologyforlearningsourcedefinitions.Wedeferdiscussionofthetestingphasetolaterinthepaper.Westartbybrieflydiscussingworkonrelationalrulelearningandthendescribehowouralgorithmbuildsupontheseideas.4.1InductiveLogicProgramming

Thelanguageofconjunctivequeriesisarestrictedformoffirst-orderlogic.IntheMachineLearningcommunity,systemscapableoflearningmodelsusingfirst-orderrepresentationsarereferredtoasInductiveLogicProgramming(ILP)systemsorrelationalrulelearners.Becauseoftheexpressivenessofthemodelinglanguage,thecomplexityoflearningismuchhigherthanforpropositionalrulelearners(alsocalledattribute-valuelearners),whichformthebulkofMachineLearningalgorithms.Givenourrelationalmodelingofservices,manyofthetechniquesdevelopedinILPshouldalsoapplytoourproblem.

TheFirstOrderInductiveLearner(foil)isawellknownILPsearchalgorithm(Cameron-Jones&Quinlan,1994).Itiscapableoflearningfirst-orderrulestodescribeatargetpred-icate,whichisrepresentedbyasetofpositiveexamples(tuplesoverthetargetrelation,denotedE+)andoptionallyalsoasetofnegativeexamples(E−).Thesearchforaviabledefinitioninfoilstartswithanemptyclause7andprogressivelyaddsliteralstothebody(antecedent)oftherule,therebymakingtherulemorespecific.Thisprocesscontinuesuntilthedefinition(denotedh)coversonlypositiveexamplesandnonegativeexamples:

E+∩EI[h]=∅

andE−∩EI[h]=∅

Usuallyasetofrulesislearntinthismannerbyremovingthepositiveexamplescoveredbythefirstruleandrepeatingtheprocess.(Thesetofrulesistheninterpretedasaunionquery.)Searchinfoilisperformedinagreedybest-firstmanner,guidedbyaninformationgain-basedheuristic.Manyextensionstothebasicalgorithmexist,mostnotablythosethatcombinedeclarativebackgroundknowledgeinthesearchprocesssuchasfocl(Pazzani&Kibler,1992).Suchsystemsarecategorisedasperformingatopdownsearchbecausetheystartfromanemptyclause(themostgeneralrulepossible)andprogressivelyspecializetheclause.Bottomupapproaches,ontheotherhand,suchasgolem(Muggleton&Feng,1990),performaspecifictogeneralsearchstartingfromthepositiveexamplesofthetarget.4.2Search

Wenowdescribetheactualsearchprocedureweusetogeneratecandidatedefinitionsforanewsource.Theprocedureisbasedonthetop-downsearchstrategyusedinfoil.Thealgorithmtakesasinputatypesignatureforthenewsourceandusesittoseedthesearchfor

7.WeusethetermsclauseandqueryinterchangeablytorefertoaconjunctivequeryinDatalog.Anemptyclauseisaquerywithoutanyliteralsinthebody(rightside)oftheclause.

13

Carman&Knoblock

input:Apredicatesignatures∗output:Thebestscoringviewdefinitionvbest

invoketargetwithsetofrandominputs;2vbest←emptyclauses∗;3addvbesttoemptyqueue;

4whilequeue=∅∧time()0do9forallv2∈constrain(v1)do10insertv2intoqueue;11ifeval(v2)≥eval(v1)thenv1←v2;12end13end14ifeval(v1)≥eval(vbest)thenvbest←v1;15end16end

17returnvbest;

Algorithm1:Best-firstsearchthroughthespaceofcandidatesourcedefinitions.1

candidatedefinitions.(Wewillrefertothenewsourcerelationasthetargetpredicateandthesetofknownsourcerelationsassourcepredicates.)Thespaceofcandidatedefinitionsisenumeratedinabest-firstmanner,witheachcandidatetestedtoseeifthedataitreturnsissimilartothetarget.Pseudo-codedescribingtheprocedureisgiveninAlgorithm1.8

Thefirststepinthealgorithmistoinvokethenewsourcewitharepresentativesetofinputtuplestogenerateexamplesofoutputtuplesthatcharacterisethefunctionalityofthesource.Thissetofinvocationsmustincludepositiveexamples(invocationsforwhichoutputtupleswereproduced)andifpossible,alsonegativetuples(inputsforwhichnooutputwasreturned).Thealgorithm’sabilitytoinducethecorrectdefinitionforasourcedependsgreatlyonthenumberofpositiveexamplesavailable.Thusaminimumnumberofpositiveinvocationsofthesourceisimposed,meaningthatthealgorithmmayhavetoinvokethesourcerepeatedlyusingdifferentinputsuntilsufficientpositiveinvocationscanberecorded.Selectingappropriateinputvaluessoastosuccessfullyinvokeaserviceiseasiersaidthandone.Wedeferdiscussionoftheissuesanddifficultiesinvolvedinsuccessfullyinvokingthenewsourcetosection6.1,andassumeforthemomentthattheinductionsystemisabletogenerateatableofvaluesthatrepresentitsfunctionality.

Thenextstepinthealgorithmistoinitialisethesearchbyaddinganemptyclausetothequeueofdefinitionstoexpand.Therestofthealgorithmissimplyabest-firstsearchprocedure.Ateachiterationthehighestscoringbutnotyetexpandeddefinition(denotedv0)isremovedfromthequeueandexpandedbyaddinganewpredicatetotheendofthe

8.Theimplementationofthisalgorithmusedintheexperimentsofsection7.3isavailableat:http://www.isi.edu/publications/licensed-sw/eidos/index.html

14

LearningSemanticDefinitionsofInformationSourcesontheInternet

clause(seethenextsectionforanexample).Eachcandidategenerated(denotedv1)isaddedtothequeue.Thealgorithmthenprogressivelyconstrainsthecandidatebybindingvariablesofthenewlyaddedpredicate,(seesection4.4).Theevalfunction(seesection5.3)evaluatesthequalityofeachcandidateproduced.Theprocedurestopsconstrainingthecandidatewhenthechangeintheevaluationfunction(∆eval)dropstozero.Itthencomparesv1tothepreviousbestcandidatevbestandupdatesthelatteraccordingly.

Inprinciplethealgorithmshouldterminatewhenthe“perfect”candidatedefinitionisdiscovered-onethatproducesexactlythesamedataasthetarget.Inpracticethatneveroccursbecausethesourcesareincomplete(don’tperfectlyoverlapwitheachother)andarenoisy.Insteadthealgorithmterminateswheneitherthequeuebecomesempty,atimelimitisreachedoramaximumnumberofiterationshasbeenperformed.4.3AnExample

Wenowrunthroughanexampleoftheprocessofgeneratingcandidatedefinitions.Consideranewlydiscoveredsource,whichtakesinazipcodeandadistance,andreturnsallthezipcodesthatliewithinthegivenradius(alongwiththeirrespectivedistances).Thetargetpredicaterepresentingthissourceis:

source5($zip1,$dist1,zip2,dist2)

Nowassumethattherearetwoknownsources.Thefirstisthesourceforwhichthedefinitionwaslearntintheexamplefromsection2,namely:

source4($zip1,$zip2,dist):-centroid(zip1,lat1,long1),centroid(zip2,lat2,long2),

greatCircleDist(lat1,long1,lat2,long2,dist2),multiply(dist1,1.609,dist2).Thesecondsourceisn’tactuallyasourcebutaninterpretedpredicate:

≤(dist1,dist2).

Thesearchforadefinitionforthenewsourcemightthenproceedasfollows.Thefirstdefinitiontobegeneratedistheemptyclause:

source5($,$,,).

Thenullcharacter()representsthefactthatnoneoftheinputsoroutputshaveanyrestrictionsplacedontheirvalues.Priortoaddingthefirstliteral(sourcepredicate),thesystemwillcheckwhetheranyoutputattributesechotheinputvalues.Inthiscase,giventhesemantictypes,twopossibilitiesneedtobechecked:

source5($zip1,$,zip1,).source5($,$dist1,,dist1).

Assumingneitherofthesepossibilitiesistrue(i.e.improvesthescore),thenliteralswillbeaddedoneatatimetorefinethedefinition.Aliteralisasourcepredicatewithanassignmentofvariablenamestoitsattributes.Anewdefinitionmustbecreatedforeverypossibleliteralthatincludesatleastonevariablealreadypresentintheclause.(Forthemomentweignoretheissueofbindingconstraintsonthesourcesbeingadded.)Thusmanycandidatedefinitionswouldbegenerated,includingthefollowing:

source5($zip1,$dist1,,):-source4($zip1,$,dist1).source5($zip1,$,zip2,):-source4($zip1,$zip2,).source5($,$dist1,,dist2):-≤(dist1,dist2).

15

Carman&Knoblock

Notethatthesemantictypesinthetypesignatureofthetargetpredicatelimitgreatlythenumberofcandidatedefinitionsthatcanbeproduced.Thesystemthenevaluateseachofthesecandidatesinturn,selectingthebestoneforfurtherexpansion.Assumingthefirstofthethreehasthebestscore,itwouldbeexpandedbyaddinganotherliteral,formingmorecomplicatedcandidatessuchasthefollowing:

source5($zip1,$dist1,,dist2):-source4($zip1,$,dist1),≤(dist1,dist2).Thisprocesscontinuesuntilthesystemdiscoversadefinitionwhichperfectlydescribesthesource,orisforcedtobacktrackbecausenoliteralimprovesthescore.4.4IterativeExpansion

Thesourcesusedinthepreviousexampleallhadrelativelylowarity.OntheInternetthisisrarelythecase,withmanysourcesproducingalargenumberofattributesofthesametype.Thisisaproblembecauseitcausesanexponentialnumberofdefinitionstobepossibleateachexpansionstep.Considerforinstanceastockpriceservice,whichprovidesthecurrent,high,low,market-openingandmarket-closingpricesforagiventickersymbol.Thetypesignatureforthatservicewouldbe:

stockprice($ticker,price,price,price,price,price)

Ifthedefinitiontowhichthispredicateistobeaddedalreadycontainskdistinctpricevariables,thenthenumberofwaysinwhichthepriceattributesofthenewrelationcanbe

󰀂5󰀅5󰀁i

assignedvariablenamesisi=0ik,whichisprohibitivelylargeevenformoderatek.9Tolimitthesearchspaceinthecaseofsuchhigh-aritypredicates,wefirstgeneratecandidateswithaminimalnumberofboundvariablesinthenewliteralandprogressivelyconstrainthebestperformingofthesedefinitionswithineachexpansion.(Higharitypredicatesarehandledinasimilarfashioninfoil,QuinlanandCameron-Jones,1993.)Forexample,considerusingthesourceabovetolearnadefinitionforanewsourcewithsignature:

source6($ticker,price,price)

Westartbyaddingliteralstoanemptydefinitionasbefore.Thistimethough,insteadofgeneratingaliteralforeverypossibleassignmentofvariablenamestotheattributesofeachrelation,wegenerateonlythesimplestassignmentssuchthatallofthebindingconstraintsaremet.(ThisistheexpandprocedurereferredtoinAlgorithm1.)Inthisexample,thetickersymbolinputofthestockpricesourcewouldneedtobebound,generatingasingledefinition:

source6($tic,,):-stockprice($tic,,,,,).

Thisdefinitionwouldbeevaluated,andthenmoreconstraineddefinitionsaregeneratedbyequatingavariablefromthesameliteraltoothervariablesfromtheclause.(ThisistheconstrainprocedurefromAlgorithm1.)Twosuchdefinitionsareshownbelow:

source6($tic,pri1,):-stockprice($tic,pri1,,,,).source6($tic,pri1,):-stockprice($tic,,pri1,,,).

Thebestofthesedefinitionswouldthenbeselectedandconstrainedfurther,generatingdefinitionssuchas:

i

9.Intuitively,onecanassignvariable󰀅5󰀁namestoiattributesusingklabelsinkdifferentways.Onecanchooseiofthe5attributesiniways,andonecandothatfori=0,1,..,5.SeeWeber,Tausend,andStahl(1995)foradetaileddiscussionofthesizeofthehypothesisspaceinILP.

16

LearningSemanticDefinitionsofInformationSourcesontheInternet

source6($tic,pri1,pri2):-stockprice($tic,,pri1,pri2,,).source6($tic,pri1,pri2):-stockprice($tic,,pri1,,pri2,).

Inthisway,thebestscoringliteralcanbefoundwithouttheneedtoiterateoverallofthepossibleassignmentsofvariablestoattributes.4.5DomainPredicatesvs.SourcePredicates

Intheexamplesofsections4.3and4.4,thedecisiontoperformsearchoverthesourcepredicatesratherthanthedomainpredicateswasmadeinanarbitraryfashion.10Inthissectionwejustifythatdecision.Ifoneweretoperformthesearchoverthedomainpredicatesratherthanthesourcepredicates,thentestingeachdefinitionwouldrequireanadditionalqueryreformulationstep.Forexample,considerthefollowingcandidatedefinitionforsource5containingthedomainpredicatecentroid:

source5($zip1,$,,):-centroid(zip1,,).

Inordertoevaluatethiscandidate,thesystemwouldneedtofirsttreatthedefinitionasaqueryandreformulateitintoasetofrewritings(thattogetherformaunionquery)overthevarioussourcesasfollows:

source5($zip1,$,,):-source4($zip1,$,).source5($zip1,$,,):-source4($,$zip1,).

Thisunionquerycanthenbeexecutedagainsttheavailablesources(inthiscasejustsource4)toseewhattuplesthecandidatedefinitionreturns.Inpracticehowever,ifthedefinitionsfortheknownsourcescontainmultipleliterals(astheynormallydo)andthedomainrelationsareofhigh-arity(astheyoftenare),thenthesearchoverthespaceofconjunctionsofdomainpredicatesisoftenmuchlargerthanthecorrespondingsearchoverthespaceofconjunctionsofsourcepredicates.Thisisbecausemultipleconjunctionsofdomainpredicates(candidatedefinitions)endupreformulatingtothesameconjunctionofsourcepredicates(unionqueries).Forexample,considerthefollowingcandidatedefinitionswrittenintermsofthedomainpredicates:

source5($zip1,$,,):-centroid(zip1,lat,),greatCircleDist(lat,,,,).source5($zip1,$,,):-centroid(zip1,,lon),greatCircleDist(,lon,,,).

source5($zip1,$,,):-centroid(zip1,lat,lon),greatCircleDist(lat,lon,,,).Allthreecandidateswouldreformulatetothesamequeryoverthesources(shownbelow),andthusareindistinguishablegiventhesourcesavailable.

source5($zip1,$,,):-source4($zip1,$,).

Ingeneral,thenumberofcandidatedefinitionsthatmaptothesamereformulationcanbeexponentialinthenumberofhiddenvariablespresentinthedefinitionsoftheknownsources.Forthisreason,wesimplifytheproblemandsearchthespaceofconjunctionsofsourcepredicates.Insomesense,performingthesearchoverthesourcepredicatescanbeseenasintroducinga“similarityheuristic”whichfocusesthesearchtowarddefinitionswithsimilarstructuretothedefinitionsoftheavailablesources.Wenotethatthedefinitionsproducedcan(andwill)laterbeconvertedtoqueriesovertheglobalpredicatesbyunfoldingand

10.Anoteonthedifferencebetweendomain,sourceandinterpretedpredicates.Domainpredicatesare

inventedbyadomainexpertforuseinmodelingaparticularinformationdomain.Theydefineacommonschemathatcanbeusedfordescribinginformationfromdifferentsources.Sourcepredicatesrepresentthesourcesavailableinthesystem.Interpretedpredicates,(suchas“≤”),areaspecialtypeofdomainpredicate,thatcanbetreatedassourcepredicates,sincetheirmeaningisinterpreted(understood).

17

Carman&Knoblock

possiblytighteningthemtoremoveredundancies.Wewilldiscusstheprocessoftighteningtheunfoldingsinsection6.6.4.6LimitingtheSearch

Thesearchspacegeneratedbythistop-downsearchalgorithmmaybeverylargeevenforasmallnumberofsources.Theuseofsemantictypeslimitsgreatlythewaysinwhichvariableswithineachdefinitioncanbeequated(akathejoinpaths)andthusgoesalongwaytoreducethesizeofthesearchspace.Despitethisreduction,asthenumberofsourcesavailableincreases,thesearchspacebecomessolargethattechniquesforlimitingitmustbeused.Weemploysomestandard(andothernotsostandard)ILPtechniquesforlimitingthisspace.Suchlimitationsareoftenreferredtoasinductivesearchbiasorlanguagebias(N´edellec,Rouveirol,Ad´e,Bergadano,&Tausend,1996).

Anobviouswaytolimitthesearchistorestrictthenumberofsourcepredicatesthatcanoccurinadefinition.Wheneveradefinitionreachesthemaximumlength,backtrackingcanbeperformed,allowingthesearchtoescapefromlocalminimathatmayresultfromgreedyenumeration.Theassumptionhereisthatshorterdefinitionsaremoreprobablethanlongerones,whichmakessensesinceserviceprovidersarelikelytoprovidedatainthesimplestformpossible.Moreover,thesimplerthedefinitionlearnt,themoreusefulitwillbetoamediator,sowedecidetotradecompleteness(theabilitytoexpresslongerdefinitions)forimprovedaccuracyovershorterdefinitions.

Thesecondrestrictionplacedoncandidatedefinitionsistolimitthenumberoftimesthesamesourcepredicateappearsinagivencandidate.Thismakessensebecausethedefinitionsofrealservicestendnottocontainmanyrepeatedpredicates.Intuitively,thisisbecausemostservicesproviderawdatawithoutperformingmanycalculationsuponit.Repeateduseofthesamepredicateinadefinitionismoreusefulfordescribingsomeformofcalculationthanrawdataitself.(Exceptionstothisruleexist,forexamplepredicatesrepresentingunitconversionfunctionalitysuchasFahrenheittoCelsius,maynecessarilyoccurmultipletimesinthedefinitionofasource.)

Thethirdrestrictionlimitsthecomplexityofthedefinitionsgeneratedbyreducingthenumberofliteralsthatdonotcontainvariablesfromtheheadoftheclause.Specifically,itlimitsthelevelofexistentialquantification(sometimesalsoreferredtoasthedepth,MuggletonandFeng,1990)ofeachvariableinaclause.Thislevelisdefinedtobezeroforalldistinguishedvariables(thoseappearingintheheadoftheclause).Forexistentialvariablesitisdefinedrecursivelyasoneplusthelowestlevelofanyvariableappearinginthesameliteral.Forexample,thecandidatedefinitionshownbelowhasamaximumexistentialquantificationlevelofthreebecausetheshortestpathfromthelastliteraltotheheadliteral(viajoinvariables)passesthroughtwootherliterals:

source5($zip1,$,,):-source4($zip1,$,d1),source3($d1,d2),source3($d2,).Theeffectofthisbiasistoconcentratethesearcharoundsimplerbuthighlyconnecteddefinitions,whereeachliteraliscloselylinkedtotheinputandoutputofthesource.

Thefourthrestrictionplacedonsourcedefinitionsisthattheyareexecutable.Morespecifically,itshouldbepossibletoexecutethemfromlefttoright,meaningthattheinputsofeachsourceappeareitherinthetargetpredicate(headoftheclause)orinoneoftheliteralstotheleftofthatliteral.Forexample,ofthetwocandidatedefinitionsshownbelow,

18

LearningSemanticDefinitionsofInformationSourcesontheInternet

onlythefirstisexecutable.Theseconddefinitionisnot,becausezip2isusedasinputforsource4inthefirstliteral,withoutfirstbeingboundtoavalueintheheadoftheclause:

source5($zip1,$,zip2,):-source4($zip1,$zip2,).

source5($zip1,$,,):-source4($zip1,$zip2,dist1),source4($zip2,$zip1,dist1).Thisrestrictionservestwopurposes.Firstly,liketheotherbiases,itlimitsthesizeofthesearchspace.Secondly,itmakesiteasiertoevaluatethedefinitionsproduced.Intheory,onecouldstillevaluatetheseconddefinitionabovebygeneratinglotsofinputvaluesforzip2,butthatwouldrequirealotofinvocationsforminimalgain.

Thelastrestrictionreducesthesearchspacebylimitingthenumberoftimesthesamevariablecanappearinanygivenliteralinthebodyoftheclause.Definitionsinwhichthesamevariableappearsmultipletimesinagivenliteral,suchasinthefollowingexamplewhichreturnsthedistancebetweenazipcodeanditself,arenotverycommoninpractice:

source5($zip1,$,,dist2):-source4($zip1,$zip1,dist2).

Explicitlypreventingsuchdefinitionsfrombeinggeneratedmakessensebecausesourcesrequiringthemaresorare,thatitisbettertoreducethesearchspaceexponentiallybyignoringthem,thantoexplicitlycheckforthemeachtime.

5.ScoringDefinitions

Wenowproceedtotheproblemofevaluatingthecandidatedefinitionsgeneratedduringsearch.Thebasicideaistocomparetheoutputproducedbythesourcewiththeoutputproducedbythedefinitiononthesameinput.Themoresimilarthesetoftuplesproduced,thehigherthescoreforthecandidate.Thescoreisthenaveragedoverasetofdifferentinputtuplestoseehowwellthecandidatedefinitiondescribesthenewsource.

Inthemotivatingexampleofsection2,thesourceforwhichadefinitionwasbeinglearnt(thedefinitionisrepeatedbelow)onlyproducedoneoutputtuple󰀛dist󰀜foreveryinputtuple󰀛zip1,zip2󰀜:

source4($zip1,$zip2,dist):-centroid(zip1,lat1,long1),centroid(zip2,lat2,long2),greatCircleDist(lat1,long1,lat2,long2,dist2),multiply(dist1,1.6093,dist2).

Thisfactmadeitsimpletocomparetheoutputoftheservicewiththeoutputoftheinduceddefinition.Ingeneralhowever,thesourcetobemodeled(andthecandidatedefinitionsmodelingit)mayproducemultipleoutputtuplesforeachinputtuple.Takeforexamplesource5fromsection4.3,whichproducesthesetofoutputtuples󰀛zip2,dist2󰀜containingallthezipcodeswhichliewithinagivenradiusoftheinputzipcode󰀛zip1,dist1󰀜.Insuchcases,thesystemneedstocompareasetofoutputtupleswiththesetproducedbythedefinitiontoseeifanyofthetuplesarethesame.Sinceboththenewsourceandtheexistingsourcesmaynotbecomplete,thetwosetsmaysimplyoverlap,evenifthecandidatedefinitioncorrectlydescribesthenewsource.Assumingthatwecancountthenumberoftuplesthatarethesame,weneedameasurethattellsushowwellacandidatehypothesisdescribesthedatareturnedbyasource.Onesuchmeasureisthefollowing:

1󰀄|Os(i)∩Ov(i)|

score(s,v,I)=

|I|i∈I|Os(i)∪Ov(i)|

19

Carman&Knoblock

wheresisthenewsource,visacandidatesourcedefinition,andI⊆D[βs]isthesetofinputtuplesusedtotestthesource(βsisthesetofinputattributesofsources).Os(i)denotesthesetoftuplesreturnedbythenewsourcewheninvokedwithinputtuplei.Ov(i)isthecorrespondingsetreturnedbythecandidatedefinition.Usingrelationalprojection(π)andselection(σ)operatorsandthenotationintroducedinsection2.1,thesesetscan

crepresentstheoutputattributesofs.)bewrittenasfollows.(Notethatβs

c(σβ=i(E[s]))Os(i)≡πβss

and

c(σβ=i(EI[v]))Ov(i)≡πβss

Ifweviewthishypothesistestingasaninformationretrievaltask,wecanconsiderrecall

tobethenumberofcommontuples,dividedbythenumberoftuplesproducedbythesource,andprecisiontobethenumberofcommontuplesdividedbythenumberoftuplesproducedbythedefinition.TheabovemeasuretakesbothprecisionandrecallintoaccountbycalculatingtheaverageJaccardsimilaritybetweenthesets.Thetablebelowgivesanexampleofhowthisscoreiscalculatedforeachinputtuple.

inputtuplei∈I󰀛a,b󰀜󰀛c,d󰀜󰀛e,f󰀜󰀛g,h󰀜󰀛i,j󰀜

actualoutputtuplesOs(i){󰀛x,y󰀜,󰀛x,z󰀜}{󰀛x,w󰀜,󰀛x,z󰀜}{󰀛x,w󰀜,󰀛x,y󰀜}

∅∅

predictedoutputtuplesOv(i){󰀛x,y󰀜}{󰀛x,w󰀜,󰀛x,y󰀜}{󰀛x,w󰀜,󰀛x,y󰀜}{󰀛x,y󰀜}∅

Jaccardsimilarity

fortuplei1/21/310#undef!

Thefirsttworowsofthetableshowinputsforwhichthepredictedandactualoutputtuplesoverlap.Inthethirdrow,thedefinitionproducesexactlythesamesetoftuplesasthesourcebeingmodeledandthusgetsthemaximumscore.Inthefourthrow,thedefinitionproducedatuple,whilethesourcedidn’t,sothedefinitionwaspenalised.Inthelastrow,thedefinitioncorrectlypredictedthatnotupleswouldbeoutputbythesource.Ourscorefunctionisundefinedatthispoint.Fromacertainperspectivethedefinitionshouldscorewellherebecauseithascorrectlypredictedthatnotuplesbereturnedforthatinput,butgivingahighscoretoadefinitionwhenitproducesnotuplescanbedangerous.Doingsomaycauseoverlyconstraineddefinitionsthatcangenerateveryfewoutputtuplestoscorewell.Atthesametime,lessconstraineddefinitionsthatarebetteratpredictingtheoutputtuplesonaveragemayscorepoorly.Forexample,considerasourcewhichreturnsweatherforecastsforzipcodesinLosAngeles:

source($zip,temp):-forecast(zip,tomorrow,temp),UScity(zip,LosAngeles).

Nowconsidertwocandidatedefinitionsforthesource.Thefirstreturnsthetemperatureforazipcode,whilethesecondreturnsthetemperatureonlyifitisbelow0◦C:

v1($zip,temp):-forecast(zip,tomorrow,temp).

v2($zip,temp):-forecast(zip,tomorrow,temp),temp<0◦C.

Assumethatthesourceandcandidatesareinvokedusing20differentrandomlyselectedzipcodes.Formostzipcodes,thesourcewillnotreturnanyoutput,becausethezipcodewilllieoutsideofLosAngeles.Thefirstcandidatewilllikelyreturnoutputforallzipcodes,whilethesecondcandidatewould,likethesource,onlyrarelyproduceanyoutput.Thisisbecausethetemperatureinmostzipcodeswillbegreaterthanzero,andhasnothingtodo

20

LearningSemanticDefinitionsofInformationSourcesontheInternet

withwhetherornotthezipcodeisinLosAngeles.Ifwescoredefinitionshighlywhentheycorrectlyproducenooutput,thesystemwoulderroneouslypreferthesecondcandidateoverthefirst(becausethelatteroftenproducesnooutput).Topreventthatfromhappening,wesimplyignoreinputsforwhichthedefinitioncorrectlypredictszerotuples.Thisisthesameassettingthescoretobetheaverageoftheothervalues.

Returningourattentiontothetable,afterignoringthelastrow,theoverallscoreforthisdefinitionwouldbecalculatedas0.46.5.1PartialDefinitions

Asthesearchproceedstowardthecorrectdefinitionfortheservice,manysemi-complete(unsafe)definitionswillbegenerated.Thesedefinitionswillnotproducevaluesforallattributesofthetargettuplebutonlyasubsetofthem.Forexample,thecandidate:

source5($zip1,$dist1,zip2,):-source4($zip1,$zip2,dist1).

producesonlyoneofthetwooutputattributesproducedbythesource.Thispresentsaproblem,becauseourscoreisonlydefinedoversetsoftuplescontainingalloftheoutputattributesofthenewsource.Onesolutionmightbetowaituntilthedefinitionsbecomesufficientlylongastoproducealloutputs,beforecomparingthemtoseewhichonebestdescribesthenewsource.Thereare,however,tworeasonswhythiswouldnotmakesense:

•Thespaceofsafedefinitionsistoolargetoenumerate,andthusweneedtocomparepartialdefinitionstoguidethesearchtowardthecorrectdefinition.

•Thebestdefinitionthatthesystemcangeneratemaywellbeapartialone,asthesetofknownsourcesmaynotbesufficienttocompletelymodelthesource.

Thesimplestwaytocomputeascoreforapartialdefinitionistocomputethesamefunctionasbefore,butinsteadofusingtherawsourcetuples,projectingthemoverthesubsetofattributesthatareproducedbythedefinition.Thisrevisedscoreisshownbelow.(Notethattheprojectionisoverv\\βs,whichdenotesthesubsetofoutputattributesofswhichareproducedbytheviewdefinitionv.Notealsothattheprojectionisnotdistinct,i.e.multipleinstancesofthesametuplemaybeproduced.)

1󰀄|πv\\βs(Os(i))∩Ov(i)|

score2(s,v,I)=

|I|i∈I|πv\\βs(Os(i))∪Ov(i)|

Thisrevisedscoreisnotveryusefulhowever,asitgivesanunfairadvantagetodefinitionsthatdonotproducealloftheoutputattributesofthesource.Thisisbecauseitisfareasiertocorrectlyproduceasubsetoftheoutputattributesthantoproduceallofthem.Considerforexamplethetwosourcedefinitionsshownbelow.Thetwodefinitionsareidenticalexceptthatthesecondreturnstheoutputdistancevaluedist2,whilethefirstdoesnot:

source5($zip1,$dist1,zip2,):-source4($zip1,$zip2,dist2),≤(dist2,dist1).source5($zip1,$dist1,zip2,dist2):-source4($zip1,$zip2,dist2),≤(dist2,dist1).Sincethetwoareidentical,theprojectionoverthesubsetwillinthiscasereturnthesamenumberoftuples.Thismeansthatbothdefinitionswouldgetthesamescorealthoughtheseconddefinitionisclearlybetterthanthefirstsinceitproducesalloftherequiredoutputs.

Weneedtobeabletopenalisepartialdefinitionsinsomewayfortheattributestheydon’tproduce.Onewaytodothisistofirstcalculatethesizeofthedomain|D[a]|ofeach

21

Carman&Knoblock

ofthemissingattributes.Intheexampleabove,themissingattributeisthedistancevalue.Sincedistanceisacontinuousvalue,calculatingthesizeofitsdomainisnotobvious.Wecanapproximatethesizeofitsdomainby:

|D[distance]|≈

max−minaccuracy

whereaccuracyistheerror-boundondistancevalues.(Wewilldiscusserror-boundsfurtherinsection5.4.)Notethatthecardinalitycalculationmaybespecifictoeachsemantictype.Armedwiththedomainsize,wecanpenalisethescoreforthedefinitionbydividingitbytheproductofthesizeofthedomainsofalloutputattributesnotgeneratedbythedefinition.Inessence,wearesayingthatallpossiblevaluesfortheseextraattributeshavebeen“allowed”bythisdefinition.Thistechniqueissimilartotechniquesusedforlearningwithoutexplicitnegativeexamples(Zelle,Thompson,Califf,&Mooney,1995).

c\\v,thusthepenaltyThesetofmissingoutputattributesisgivenbytheexpressionβs

formissingattributesisjustthesizeofthedomainoftuplesofthatscheme,i.e.:

c

penalty=|D[βs\\v]|

Usingthispenaltyvaluewecancalculateanewscore,whichtakesintoaccountthemissing

attributes.Simplydividingtheprojectedscorebythepenaltywouldnotadheretotheintendedmeaningofcompensatingforthemissingattributevalues,andthusmayskewtheresults.Instead,wederiveanewscorebyintroducingtheconceptoftypeddompredicatesasfollows:

Adompredicateforasemanticdata-typet,denoteddomt,isasinglearityrelationwhoseextensionissettobethedomainofthedatatype,i.e.E[domt]=D[t].Similarly,adompredicateforaschemeA,denoteddomA,isarelationoverAwhoseextensionisE[domA]=D[A].

DompredicateswereintroducedbyDuschkatohandletheproblemofqueryreformulationinthepresenceofsourceswithbindingconstraints(Duschka,1997).(Inthatworkthepredicateswerenottyped,althoughtypingwouldhaveresultedinamoreefficientalgo-rithm.)Hereweusethemtoconvertapartialdefinitionvintoasafe(complete)definitionv󰀄.Wecandothissimplybyaddingadompredicatetotheendoftheviewdefinitionthatgeneratesvaluesforthemissingattributes.Fortheexampleabove,v󰀄wouldbe:

source5($zip1,$dist1,zip2,x):-source4($zip1,$zip2,dist2),≤(dist2,dist1),domdistance(x).

wherexisanewvariableoftypedistance.Thenewviewdefinitionv󰀄issafe,becauseallthevariablesintheheadoftheclausealsoappearinthebody.Ingeneral,wecanturnanyunsafeviewdefinitionvintoasafedefinitionv󰀄byappendingadompredicate

c\\v(x1,...,xn),whereeachxiisadistinguishedvariable(fromtheheadoftheclause)domβs

correspondingtoanoutputattributeofv󰀄thatwasn’tboundinv.Nowwecanusethiscompletedefinitiontocalculatethescoreasbefore:

1󰀄|Os(i)∩Ov󰀁(i)|

score3(s,v,I)=score(s,v,I)=

|I|i∈I|Os(i)∪Ov󰀁(i)|

󰀄

22

LearningSemanticDefinitionsofInformationSourcesontheInternet

whichcanberewritten(byexpandingthedenominator)asfollows:

score3(s,v,I)=

|Os(i)∩Ov󰀁(i)|1󰀄

|I|i∈I|Os(i)|+|Ov󰀁(i)|−|Os(i)∩Ov󰀁(i)|

Wecanthenremovethereferencestov󰀄fromthisequationbyconsidering:

c

c\\v]=Ov(i)×D[βs\\v]Ov󰀁(i)=Ov(i)×E[domβs

c\\v]|andthesizeoftheintersectionThusthesizeofthesetisgivenby|Ov󰀁(i)|=|Ov(i)||D[βs

canbecalculatedbytakingtheprojectionovertheoutputattributesproducedbyv:

c

|Os(i)∩Ov󰀁(i)|=|πv\\βs(Os(i)∩Ov(i)×D[βs\\v])|=|πv\\βs(Os(i))∩Ov(i)|

Substitutingthesecardinalitiesintothescorefunctiongivenabove,wearriveatthefollowing

equationforthepenalisedscore:

|πv\\βs(Os(i))∩Ov(i)|1󰀄

score3(s,v,I)=c\\v]|−|π|I|i∈I|Os(i)|+|Ov(i)||D[βsv\\βs(Os(i))∩Ov(i)|

5.2BindingConstraints

Someofthecandidatedefinitionsgeneratedduringthesearchmayhavedifferentbinding

constraintsfromthetargetpredicate.Forinstanceinthepartialdefinitionshownbelow,thevariablezip2isanoutputofthetargetsource,butaninputtosource4:

source5($zip1,$dist1,zip2,):-source4($zip1,$zip2,dist1).

Fromalogicalperspective,inordertotestthisdefinitioncorrectly,weneedtoinvokesource4witheverypossiblevaluefromthedomainofzipcodes.Doingthisisnotpracticalfortworeasons:firstly,thesystemmaynothaveacompletelistofzipcodesatitsdisposal.Secondlyandfarmoreimportantly,invokingsource4withthousandsofdifferentzipcodeswouldtakeaverylongtimeandwouldprobablyresultinthesystembeingblockedfromfurtheruseoftheservice.Soinsteadofinvokingthesamesourcethousandsoftimes,weapproximatethescoreforthisdefinitionbysamplingfromthedomainofzipcodesandinvokingthesourceusingthesampledvalues.Wethencompensateforthissamplingbyscaling(certaincomponentsof)thescorebytheratioofthesampledzipcodestotheentiredomain.Consideringtheexampleabove,ifwerandomlychooseasample(denotedδ[zipcode])ofsay20valuesfromthedomainofzipcodes,thenthesetoftuplesreturnedbythedefinitionwillneedtobescaledbyafactorof|D[zipcode]|/20.

Amoregeneralequationforcomputingthescalingfactor(denotedSF)isshownbelow.Notethatthesamplingmayneedtobeperformedoverasetofattributes.(Hereβv\\βsdenotestheinputattributesofvwhichareoutputsofs.)

SF=

|D[βv\\βs]||δ[βv\\βs]|

Wenowcalculatetheeffectofthisscalingfactorontheoverallscoreasfollows.Wedenote

˜v(i).ThisvaluethesetoftuplesreturnedbythedefinitiongiventhesampledinputasO

23

Carman&Knoblock

whenscaledwillapproximatethesetoftuplesthatwouldhavebeenreturnedhadthedefinitionbeeninvokedwithallthepossiblevaluesfortheadditionalinputattributes:

˜v(i)|∗SF|Ov(i)|≈|O

Assumingthesamplingisperformedrandomlyoverthedomainofpossiblevalues,the

intersectionbetweenthetuplesproducedbythesourceandthedefinitionshouldscaleinthesameway.Thustheonlyfactornotaffectedbythescalinginthescoredefinedpreviouslyis|Os(i)|.Ifwedividethroughoutbythescalingfactorwehaveanewscoringfunction:

˜v(i)||πv\\βs(Os(i))∩O1󰀄

score4(s,v,I)=

˜v(i)||D[βc\\v]|−|πv\\β(Os(i))∩O˜v(i)||I|i∈I|Os(i)|/SF+|Oss

Theproblemwiththisapproachisthatoftenthesampledsetofvaluesistoosmallandasa

resultitdoesnotintersectwiththesetofvaluesreturnedbythesource,eventhoughalargersamplewouldhaveintersectedinsomeway.Thusoursamplingintroducesunfairdistortionsintothescoreforcertaindefinitions,causingthemtoperformpoorly.Forexample,consideragainsource5andassumethatforscalabilitypurposes,theserviceplacesalimitonthemaximumvaluefortheinputradiusdist1.(Thismakessense,asotherwisetheusercouldsettheinputradiustocovertheentireUS,andatupleforeverypossiblezipcodewouldneedtobereturned.)Nowconsiderthesamplingperformedabove.Ifwerandomlychooseonly20zipcodesfromthesetofallpossiblezipcodes,thechanceofthesamplecontainingazipcodewhichlieswithina300mileradiusofaparticularzipcode(inthemiddleofthedesert)isverylow.Moreover,evenifonepairofzipcodes(outof20)resultsinasuccessfulinvocation,thiswillnotbesufficientforlearningagooddefinitionfortheservice.

Sotogetaroundthisproblemwebiasthesamplesuchthat,wheneverpossible,halfofthevaluesaretakenfrompositiveexamplesofthetarget(thosetuplesreturnedbythenewsource)andhalfaretakenfromnegativeexamples(thosetuplesnotreturnedbythesource).Bysamplingfrombothpositiveandnegativetuples,weguaranteethattheapproximationgeneratedwillbeasaccurateaspossiblegiventhelimitedsamplesize.Wedenotethesetofpositiveandnegativesamplesasδ+[βv\\βs]andδ−[βv\\βs],andusethesevaluestodefinepositiveandtotalscalingfactorsasshownbelow.(Thenumeratorforthepositivevaluesisdifferentfrombefore,asthesevalueshavebeentakenfromtheoutputofthenewsource.)

SF+=

|πβv\\βs(πv\\βs(Os(i)))|

|δ+[βv\\βs]|

Thetotalscalingfactoristhesamevalueasbefore,butcalculatedslightlydifferently:

SF=

|D[βv\\βs]|

|δ+[βv\\βs]|+|δ−[βv\\βs]|

Thescorecanthenbeapproximatedaccordinglybytakingintoaccountthesenewscalingfactors.Theintersectionnowneedstobescaledusingthepositivescalingfactor:

˜v(i)|∗SF+|πv\\βs(Os(i))∩Ov(i)|≈|πv\\βs(Os(i))∩O

Thisnewscalingresultsinanewfunctionforevaluatingthequalityofaviewdefinition:˜v(i)|∗SF+|πv\\βs(Os(i))∩O1󰀄

score5(s,v,I)=

c\\v]|∗SF−|π+˜v(i)||D[βs˜|I|i∈I|Os(i)|+|Ov\\βs(Os(i))∩Ov(i)|∗SF

24

LearningSemanticDefinitionsofInformationSourcesontheInternet

5.3FavouringShorterDefinitions

Nowthatwehavederivedascoreforcomparingthedatathatasourceandcandidateproduce,wecandefinetheevaluationfunctionevalusedinAlgorithm1.Asmentionedinsection4.6,shorterdefinitionsforthetargetsourceshouldbepreferredoverlongerandpossiblylessaccurateones.Inaccordancewiththisprinciple,wescalethescorebythelengthofthedefinition,soastofavourshorterdefinitionsasfollows:

eval(v)=ωlength(v)∗score5(s,v,I)

Herelength(v)isthelengthoftheclauseandω<1isaweightingfactor.Settingtheweightingfactortobealittlelessthan1(suchas0.95)helpstoremovelogicallyredundantdefinitions,whichcansometimesbehardtodetect,butoftenreturnalmostexactlythesamescoreastheirshorterequivalent.Wewilldiscusstheproblemofgeneratingnon-redundantclausesinsection6.3.

5.4ApproximatingEquality

Untilnow,wehaveignoredtheproblemofdecidingwhethertwotuplesproducedbythetargetsourceandthedefinitionarethesame.Sincedifferentsourcesmayserializedataindifferentwaysandatdifferentlevelsofaccuracy,wemustallowforsomeflexibilityinthevaluesthatthetuplescontain.Forinstance,intheexamplefromsection2,thedistancevaluesreturnedbythesourceanddefinitiondidnotmatchexactly,butwere“sufficientlysimilar”tobeacceptedasthesamevalue.

Fornumerictypesliketemperatureordistanceitmakessensetouseanerrorbound(like±0.5◦C)orapercentageerror(suchas±1%)todecideiftwovaluescanbeconsideredthesame.Thisisbecausethesensingequipment(inthecaseoftemperature)orthealgorithm(inthecaseofdistance)willhavesomeerrorboundassociatedwiththevaluesitproduces.Werequirethatanerrorboundforeachnumerictypebeprovidedintheproblemspecification.(Ideally,theseboundswouldbelearntautomaticallyfromexamples.)

Forcertainnominaltypeslikecompanynames,wherevalueslike󰀛IBMCorporation󰀜and󰀛InternationalBusinessMachinesCorp.󰀜representthesamevalue,simplisticequalitycheckingusingexactorsubstringmatchesisnotsufficientfordecidingwhethertwovaluescorrespondtothesameentity.Inthiscase,stringeditdistancessuchastheJaroWinklerscoredobetteratdistinguishingstringsrepresentingthesameentityfromthoserepresentingdifferentones(Bilenko,Mooney,Cohen,Ravikumar,&Fienberg,2003).Amachinelearningclassifiercouldbetrainedonasetofsuchexamplestolearnwhichoftheavailablestringeditdistancesbestdistinguishesvaluesofthattypeandwhatthresholdtosetforacceptingapairasamatch.Werequirethatthispairofsimilaritymetricandthreshold(oranycombinationsofmetrics)beprovidedintheproblemspecification.

Inothercases,enumeratedtypeslikemonthsoftheyearmightbeassociatedwithasimpleequalitycheckingprocedure,sothatvalueslike󰀛January󰀜,󰀛Jan󰀜and󰀛1󰀜canbefoundequal.Theactualequalityprocedureusedwilldependonthesemantictypeandweassumeinthisworkthatsuchaprocedureisgivenintheproblemdefinition.Wenotethattheprocedureneednotbe100%accurate,butonlyprovideasufficientlevelofaccuracytoguidethesystemtowardthecorrectsourcedescription.Indeed,theequalityrulescouldalsobegeneratedofflinebytrainingaclassifier.

25

Carman&Knoblock

Complextypessuchasdatepresentabiggerproblemwhenoneconsiderstherangeofpossibleserializations,includingvalueslike󰀛5/4/2006󰀜or󰀛Thu,4May2006󰀜or󰀛2006-05-04󰀜.Insuchcasesspecializedfunctionsarenotonlyrequiredtocheckequalitybetweenvaluesbutalsotobreakthecomplextypesupintotheirconstituentparts(inthiscaseday,monthandyear).Thelatterwouldformpartofthedomainmodel.

Insomecases,decidingwhethertwovaluesofthesametypecanbeconsideredequaldependsnotonlyonthetype,butalsoontherelationstheyareusedin.Considerthetworelationsshownbelow.Thefirstprovidesthelatitudeandlongitudecoordinatesofthecentroidforazipcode,whilethesecondreturnsthecoordinatesforaparticularaddress:

centroid(zipcode,latitude,longitude)

geocode(number,street,zipcode,latitude,longitude)

Giventhedifferentwaysofcalculatingthecentroidofazipcode(includingusingthecenterofmassorthecenterofpopulationdensity)anerrorboundof500metersmightmakesenseforequatinglatitudeandlongitudecoordinates.Forageocodingservice,ontheotherhand,anerrorboundof50metersmaybemorereasonable.Ingeneral,sucherrorboundsshouldbeassociatedwiththesetofglobalrelations,insteadofjustthesemantictypes,andcouldbelearntaccordingly.Whentherelationscontainmultipleattributes,thentheproblemofdecidingwhethertwotuplesrefertothesameentityiscalledrecordlinkage(Winkler,1999).Anentirefieldofresearchisdevotedtotacklingthisproblem.Duetothecomplexityoftheproblemandthevarietyoftechniquesthathavebeendevelopedtohandleit,wedonotinvestigateitfurtherhere.

6.Extensions

Inthissectionwediscussextensionstothebasicalgorithmneededforhandlingrealdatasources,aswellaswaystoreducethesizeofthehypothesisspaceandimprovethequalityofthedefinitionsproduced.6.1GeneratingInputs

Thefirststepinthesourceinductionalgorithmistogenerateasetoftupleswhichwillrepresentthetargetrelationduringtheinductionprocess.Inotherwords,thesystemmusttrytoinvokethenewsourcetogathersomeexampledata.Doingthiswithoutbiasingtheinductionprocessiseasiersaidthandone.Thesimplestapproachtogeneratinginputvaluesistoselectconstantsatrandomfromthesetofexamplesgivenintheproblemspecification.Theproblemwiththisapproachisthatinsomecasesthenewsourcewillnotproduceanyoutputfortheselectedinputs.Insteadthesystemmayneedtoselectvaluesaccordingtosomedistributionoverthedomainofvaluesinorderforthesourcetoinvokecorrectly.Forexample,considerasourceprovidingpostsofusedcarsforsaleinacertainarea.Thesourcetakesthemakeofthecarasinput,andreturnscardetails:

usedCars($make,model,year,price,phone)

Althoughthereareoverahundreddifferentcarmanufacturersintheworld,onlyafewofthemproducethebulkofthecars.ThusinvokingthesourcewithvalueslikeFerrari,LotusandAstonMartinwillbelesslikelytoreturnanytuples,whencomparedwithmorecommonbrandssuchasFordandToyota(unlessthesourceisonlyprovidingdataforsportscarsofcourse).Ifadistributionoverpossiblevaluesisavailable,thesystemcanfirsttry

26

LearningSemanticDefinitionsofInformationSourcesontheInternet

themorecommonvalues,ormoregenerally,itcanchoosevaluesfromthatsetaccordingtothedistribution.Inthisparticularexample,itmightnotbetoodifficulttoquerythesourcewithacompletesetofcarmanufacturersuntiloneoftheinvocationsreturnssomedata.Ingeneral,thesetofexamplesmaybeverylarge(suchasthe40,000+zipcodesintheUS)andthenumberof“interesting”valuesinthatset(theoneslikelytoreturnresults)maybeverysmall,inwhichcasetakingadvantageofpriorknowledgeaboutthedistributionofpossiblevaluesmakessense.Itshouldbenotedalsothatduringexecutionthesystemwillreceivealotofoutputdatafromthedifferentsourcesitaccesses.Thisdatacanberecordedtogeneratedistributionsoverpossiblevaluesforthedifferenttypes.

Theproblemofgeneratingviableinputdataforanewsourcebecomesyetmoredifficultiftheinputrequiredisnotasinglevaluebutatupleofvalues.Inthiscasethesystemcanfirsttrytoinvokethesourcewithrandomcombinationsofattributevaluesfromtheexamplesofeachtype.Invokingsomesources(suchassource5)iseasybecausethereisnoexplicitrestrictiononthecombinationofinputvalues:

source5($zip,$distance,zip,distance)

Inothercases,suchasageocodingservicethecombinationofpossibleinputvaluesishighlyrestricted:

USGeocoder($number,$street,$zipcode,latitude,longitude)

Randomlyselectinginputvaluesindependentlyofoneanotherisunlikelytoresultinanysuccessfulinvocations.(Inorderfortheinvocationtosucceed,therandomlygeneratedtuplemustcorrespondtoanaddressthatactuallyexists.)Insuchcases,afterfailingtoinvokethesourceanumberoftimes,thesystemcantrytoinvokeothersources(suchasthehotellookupservicebelow),whichproducetuplescontainingtherequiredattributetypes:

HotelSearch($city,hotel,number,street,zipcode)

Ingeneral,thisprocessofinvokingsourcestogenerateinputforothersourcescanbechaineduntilasetofviableinputsisgenerated.

Wenotethattheproblemofsynthesizingviableinputdataisitselfadifficultandinterestingresearchproblem.Ourcombinedapproachofutilizingvaluedistributionsandinvokingalternativeservicesperformswellinourexperiments(seesection7.3),butanareaoffutureworkistodevelopamoregeneralsolution.6.2DealingwithSources

Inordertominimisesourceaccesses,whichcanbeveryexpensiveintermsofbothtimeandbandwidth,allrequeststotheindividualsourcesarecachedinalocalrelationaldatabase.Thisimplementationmeansthatthereisanimplicitassumptioninthisworkthattheoutputproducedbytheservicesisconstantforthedurationoftheinductionprocess.Thiscouldbeproblematiciftheservicebeingmodeledprovides(near)real-timedatawithanupdatefrequencyoflessthanthetimeittakestoinduceadefinition.Foraweatherpredictionservice,updatedhourly,thismaynotpresentmuchofaproblem,sincethedifferencebetweenpredictedtemperaturesmayvaryonlyslightlyfromoneupdatetothenext.Forareal-timeflightstatusserviceprovidingthecoordinatesofagivenaircrafteveryfiveminutes,thecachingmaybeproblematicasthelocationoftheplanewillvarygreatlyifittakes,forexample,onehourtoinduceadefinition.Intheoryonecouldtestforsuchvariationsystematicallybyperiodicallyinvokingthesamesourcewithapreviously

27

Carman&Knoblock

successfulinputtupletoseeiftheoutputhaschanged,andupdatethecachingpolicyaccordingly.

6.3LogicalOptimisations

Evaluatingdefinitionscanbeexpensivebothintermsoftime(waitingforsourcestore-turndata)andcomputation(calculatingjoinsoverlargetables).Thusitmakessensetocheckeachcandidateforredundancybeforeevaluatingit.Todecidewhichdefinitionsareredundant,weusetheconceptofquerycontainment:

Aqueryq1∈LR,Aissaidtobecontainedinanotherqueryq2∈LR,AifforanydatabaseinstanceI,thesetoftuplesreturnedbythefirstqueryisasubsetofthosereturnedbythesecond,i.e.∀IEI[q1]⊆EI[q2].Wedenotecontainmentbyq1󰀟q2.Twoqueriesareconsideredlogicallyequivalentifq1󰀟q2∧q2󰀟q1.Fortheconjunctivequerieslearntinthispaper,testingforquerycontainmentreducestotheproblemoffindingacontainmentmapping(Chandra&Merlin,1977).11Wecanusethistesttodiscoverlogicallyequivalentdefinitionssuchasthefollowing,(whichcontainareorderingofthesameliterals):

source($zip,temp):-getCentroid($zip,lat,lon),getConditions($lat,$lon,temp).source($zip,temp):-getConditions($lat,$lon,temp),getCentroid($zip,lat,lon).Suchequivalencecheckingcanbeperformedefficientlyifacanonicalorderingofpredicateandvariablenamesischosenapriori.Wheneverlogicallyequivalentdefinitionsaredis-covered,thesearchcanbacktrack,therebyavoidingentiresub-treesofequivalentclauses.Similarly,wecantestforandskiplogicallyredundantclausessuchasthefollowing(whichisequivalenttoashorterdefinitionwithoutthesecondliteral):

source($zip,):-getCentroid($zip,lat,long),getCentroid($zip,lat,).

Again,suchredundancycheckingcanbeperformedefficiently(Levy,Mendelzon,Sagiv,&Srivastava,1995)resultinginlittlecomputationaloverheadduringsearch.6.4FunctionalSources

Moreinformationmaybeknownaboutthefunctionalityofcertainsourcesthanisexpressedbytheirsourcedefinitions.Forexample,sourceslikeMultiplyandConcatenate,whichareimplementedlocally,willbeknowntobecomplete.(Asourceisconsideredcomplete,ifitreturnsallofthetuplesimpliedbyitsdefinition,i.e.E[s]=EI[v].)Wheneversuchinformationisavailable,theinductionsystemcantakeadvantageofittoimprovesearchefficiency.Toexplainhow,wedefineaclassofsourcesthatwecallfunctionalsources,whicharecompleteandforanyinputtuplereturnexactlyoneoutputtuple.ThisisslightlymorerestrictivethanthestandardILPconceptofdeterminateliterals(Cameron-Jones&Quinlan,1994),whichforeveryinputtuplereturnatmostoneoutputtuple.MultiplyandConcatenatearebothexamplesoffunctionalsources.Thesystemtakesadvantageofthefactthatfunctionalsourcesplacenorestrictionsontheirinput.Wheneverafunctionalsourceisaddedtoacandidatedefinition,thescoreforthatdefinitiondoesn’tchangeprovidingallthesource’sinputsandnoneofitsoutputsarebound.(Thesetoftuplesreturnedbythenew

11.Ifthequeriescontaininterpretedpredicates,thencontainmenttestingisalittlemoreinvolved(Afrati,

Li,&Mitra,2004).

28

LearningSemanticDefinitionsofInformationSourcesontheInternet

definitionisthesameasbefore,butwithafewnewattributescorrespondingtotheoutputsofthesource.)Thusthenewdefinitiondoesnotneedtobeevaluated,butcanbeaddedtothequeue(ofdefinitionstoexpand)asis,whichbecomesparticularlyadvantageouswhenasource’sinputarityishigh.6.5Constants

Constantsareoftenusedinsourcedescriptionstodefinethescopeofaservice.ConsideraweatherservicethatprovidesreportsforzipcodesonlyinCalifornia:

calWeather($zip,$date,temp):-forecast(zip,date,temp),USstate(zip,California).IfamediatorreceivesaqueryaskingfortheforecastforChicago,itwillknowthatthissourceisnotrelevanttothequerysinceChicagoisnotinCalifornia.Althoughconstantsinsourcedescriptionscanbeveryuseful,simplyintroducingthemintothehypothesislanguagecouldcausethesearchspacetogrowprohibitively.(Forstates,thebranchingfactorwouldbe50,whileforzipcodesitwouldbeinexcessof40,000.)Obviouslyagenerateandtestmethodologydoesnotmakesensewhenthedomainofthesemantictypeislarge.Alternatively,onecanexplicitlycheckforrepeatedvaluesinthetuplesreturnedbythenewsource(i.e.forconstantsintheheadoftheclause)andinthejoinofthesourceanddefinitionrelations(i.e.forconstantsinthebodyoftheclause).Forexample,inthedefinitionbelowthejoinofthesourcerelation󰀛zip,date,temp󰀜withthedefinitionrelation󰀛zip,date,temp,state󰀜wouldproduceonlytupleswithstateequaltoCalifornia.Sothatconstantcouldbeaddedtothedefinition.

source($zip,$date,temp):-forecast(zip,date,temp),USstate(zip,state).Morecomplicateddetectionprocedureswouldberequiredfordiscoveringconstantsinin-terpretedpredicates(i.e.rangerestrictionsovernumericattributes).6.6Post-Processing

Afteradefinitionhasbeenlearntforanewsource,itmaybepossibletotightenthatdefinitionbyremovinglogicalredundanciesfromitsunfolding.Considerthefollowingdefinitioninvolvingcallstotwohotelsources,onetocheckavailabilityandtheothertocheckitsrating:

source($hotel,address,rating):-HotelAvailability($hotel,address,price),HotelRating($hotel,rating,address).Theunfoldingofthatdefinition(intermsofthedefinitionsofthehotelsources)containstworeferencestoanaccommodationrelation:

source($hotel,address,rating):-accommodation(hotel,,address),available(hotel,today,price),accommodation(hotel,rating,address).

Thefirstliteralisredundantandcanberemovedfromtheunfolding.Ingeneral,thesamerulesusedtodiscoverredundancyincandidatedefinitionscanbeusedtoremoveredundantliteralsfromtheunfolding.Moreover,sincethispost-processingstepneedstobeperformedonlyonce,timecanbespentsearchingformorecomplicatedformsofredundancy.

29

Carman&Knoblock

7.Evaluation

Inthissectionwedescribeourevaluationofthesourceinductionalgorithm.Wefirstdescribetheexperimentalsetupusedandthentheexperimentsperformed.Finally,wecomparetheinductionalgorithmwithaparticularcomplexschemamatchingsystem.7.1ExperimentalSetup

Thesourceinductionalgorithmdefinedinthispaperwasimplementedinasystemcalledeidos,whichstandsforEfficientlyInducingDefinitionsforOnlineSources.eidosim-plementsthetechniquesandoptimisationsdiscussedinsections4through6.(Certainextensionsfromsection6wereonlypartiallyimplemented:theimplementationcurrentlychecksforconstantsonlyintheheadoftheclauseanddoesnotperformanytighteningofthedefinitions.)AllcodewaswritteninJavaandaMySQLdatabasewasusedforcachingtheresultsofsourceinvocations.

eidoswastestedon25differentproblemsinvolvingrealservicesfromseveraldomainsincludinghotels,financialdata,weatherandcars.Thedomainmodelusedwasthesameforeachproblemandincludedover70differentsemantictypes,rangingfromcommononeslikezipcodetomorespecifictypessuchasstocktickersymbols.Thedatamodelalsocontained36relations(excludinginterpretedpredicates),whichwereusedtomodel33differentservices.Allofthemodeledservicesarepubliclyavailableinformationsources.Wenoteherethatthedecisiontousethesamesetofknownsourcesforeachproblem(regardlessofthedomain)wasimportantinordertomakesurethatthetestswererealistic.Thisdecisionmadetheproblemmoredifficultthanthestandardschemamatching/mappingscenarioinwhichthesourceschemaischosen,becauseitprovidesdatathatisknownaprioritoberelevanttotheoutputschema.

Inordertogiveabettersenseoftheproblemsettingandthecomplexityoftheknownsourcesavailable,welisttenbelow(orderedbyarity).Duetospacelimitationswedon’tshowthecompletelistnortheirdefinitions,onlytheinput/outputtypesforeachsource.Notethatallthesourcessharethesemantictypeslatitudeandlongitude,whichmeansthatthesearchspaceassociatedwiththesesourcesaloneisverylarge.

1

WeatherConditions($city,state,country,latitude,longitude,time,time,timeoffset,datetime,temperatureF,sky,pressureIn,direction,speedMph,humidity,temperatureF)2WeatherForecast($city,state,country,latitude,longitude,timeoffset,day,date,

temperatureF,temperatureF,time,time,sky,direction,speedMph,humidity)3GetDistance($latitude,$longitude,$latitude,$longitude,distanceKm)4USGeocoder($street,$zipcode,city,state,latitude,longitude)5ConvertDMS($latitudeDMS,$longitudeDMS,latitude,longitude)6USGSEarthquakes(decimal,timestamp,latitude,longitude)7GetAirportCoords($iata,airport,latitude,longitude)8CountryCode($latitude,$longitude,countryAbbr)9GetCentroid($zipcode,latitude,longitude)10Altitude($latitude,$longitude,distanceM)

Inordertoinducedefinitionsforeachproblem,thesource(andeachcandidatedefini-tion)wasinvokedatleast20timesusingrandominputs.Wheneverpossible,thesystemattemptedtogenerate10positiveexamplesofthesource(invocationsforwhichthesourcereturnedsometuples)and10negativeexamples(inputswhichproducednooutput).To

30

LearningSemanticDefinitionsofInformationSourcesontheInternet

ensurethatthesearchterminated,thenumberofiterationsofthealgorithmincludingback-trackingstepswaslimitedto30.Asearchtimelimitof20minuteswasalsoimposed.Theinductivesearchbiasusedduringtheexperimentsisshownbelow,andaweightingfactor(definedinsection5.3)of0.9wasusedtodirectthesearchtowardshorterdefinitions.

SearchBiasMaximumclauselength=7Maximumpredicaterepetition=2Maximumvariablelevel=5Executablecandidatesonly

Novariablerepetitionwithinaliteral

Intheexperiments,differentprocedureswereusedtodecideequalitybetweenvaluesofthesametypeasdiscussedinsection5.4.Someoftheequalityproceduresusedfordifferenttypesarelistedbelow.Theaccuracyboundsandthresholdsusedwerechosentomaximizeoverallperformanceofthelearningalgorithm.(Inpractice,ameta-learningalgorithmcouldbeusedtodeterminethebestaccuracyboundsfordifferentattributetypes.)Forallsemantictypesnotlistedbelow,substringmatching(checkingifonestringcontainedtheother)wasusedtotestequalitybetweenvalues.

Typeslatitudes,longitudesdistances,speeds,temperatures,priceshumidity,pressure,degreesdecimals

companies,hotels,airportsdates

EqualityProcedureaccuracyboundof±0.002accuracyboundof±1%accuracyboundof±1.0accuracyboundof±0.1JaroWinklerscore≥0.85specialisedequalityprocedure

Theexperimentswererunonadualcore3.2GHzPentium4with4GBofRAM(althoughmemorywasnotalimitingfactorinanyofthetests).ThesystemwasrunningWindows2003Server,JavaRuntimeEnvironment1.5andMySQL5.0.7.2EvaluationCriteria

Inordertoevaluatetheinductionsystem,onewouldliketocompareforeachproblemthedefinitiongeneratedbythesystemwiththeidealdefinitionforthatsource(denotedvbestandv∗respectively).Inotherwords,wewouldlikeanevaluationfunction,whichratesthequalityofeachdefinitionproducedwithrespecttoahand-writtendefinitionforthesource(i.e.quality:vbest×v∗→[0,1]).Theproblemwiththisistwofold.Firstly,itisnotobvioushowtodefinesuchasimilarityfunctionoverconjunctivequeriesandmanydifferentpossibilitiesexist(seeMarkovandMarinchev,2000,foraparticularexample).Secondly,workingoutthebestdefinitionbyhand,whiletakingintoaccountthelimitationsofthedomainmodelandthefactthattheavailablesourcesarenoisy,incomplete,possiblylessaccurate,andevenserialisedataindifferentways,maybeextremelydifficult,ifevenpossible.Soinordertoevaluateeachofthediscovereddefinitions,weinsteadcountthenumberofcorrectlygeneratedattributesineachdefinition.Anattributeissaidtobecorrectlygenerated,if:

31

Carman&Knoblock

•itisaninput,andthedefinitioncorrectlyrestrictsthedomainofpossiblevaluesforthatattribute,or

•itisanoutput,andthedefinitioncorrectlypredictsitsvalueforgiveninputtuples.Considerthefollowingdefinitionthattakestwoinputvaluesandreturnsthedifferenceanditssquareroot(providingthedifferenceispositive):

source($A,$B,C,D):-sum(B,C,A),product(D,D,C),A≥B.

andimaginethattheinductionsystemmanagedtolearnonlythatthesourcereturnsthedifferencebetweentheinputandtheoutput,i.e.:

source($A,$B,C,):-sum(B,C,A).

WesaythatthefirstinputattributeAisnotcorrectlygeneratedasitisaninputandisnotconstrainedwithrespecttotheinputattributeB(theinequalityismissing).TheinputBisdeemedcorrectlygeneratedasitispresentinthesumrelation(onlyoneinputispenalisedforthemissinginequality).TheoutputCisdeemedcorrectlygeneratedwithrespecttotheinputs,andthemissingattributeDisn’tgeneratedatall.(Notethatiftheorderingofvariablesinthesumrelationhadbeendifferent,saysum(A,B,C),thenCwouldhavebeengenerated,butnotcorrectlygenerated.)

Givenadefinitionforcorrectlygeneratedattributes,onecandefineexpressionsforprecisionandrecallovertheattributescontainedinasourcedefinition.Wedefineprecisiontobetheratioofcorrectlygeneratedattributestothetotalnumberofattributesgeneratedbyadefinition,i.e.:

precision=

#ofcorrectlygeneratedattributestotal#ofgeneratedattributes

Wedefinerecalltobetheratioofgeneratedattributestothetotalnumberofattributesthatwouldhavebeengeneratedbytheidealdefinition,giventhesourcesavailable.(Insomecasesnosourcesareavailabletogeneratevaluesforanattributeinwhichcase,thatattributeisnotincludedinthecount.)

recall=

#ofcorrectlygeneratedattributes

total#ofattributesthatshouldhavebeengenerated

Notethatwedefinedprecisionandrecallattheschemalevelintermsoftheattributesinvolvedinasourcedefinition.Theycouldalsobedefinedatthedatalevelintermsofthetuplesbeingreturnedbythesourceandthedefinition.Indeed,theJaccardsimilarityusedtoscorecandidatedefinitionsisacombinationofdata-levelprecisionandrecallvalues.Thereasonforchoosingschemalevelmetricsinourevaluationisthattheybetterreflectthesemanticcorrectnessofthelearntdefinition,insofarastheyareindependentofthecompleteness(amountofoverlap)betweentheknownandtargetsources.

Returningtoourexampleabove,theprecisionforthesimpledefinitionlearntwouldbe2/3andtherecallwouldbe2/4.Notethat,iftheproductrelationhadnotbeenavailableinourdomainmodel(inwhichcaseattributeDcouldneverhavebeengenerated),recallwouldhavebeenhigherat2/3.

32

LearningSemanticDefinitionsofInformationSourcesontheInternet

7.3Experiments

Thedefinitionslearntbythesystemaredescribedbelow.Overallthesystemperformedverywellandwasabletolearntheintendeddefinition(ignoringmissingjoinvariablesandsuperfluousliterals)in19outofthe25problems.7.3.1GeospatialSources

Thefirstsetofproblemsinvolvedninegeospatialdatasourcesprovidingavarietyoflocationbasedinformation.Thedefinitionslearntbythesystemarelistedbelow.Theyarereportedintermsofthesourcepredicatesratherthanthedomainrelations(i.e.theunfoldings)becausethecorrespondingdefinitionsaremuchshorter.Thismakesiteasiertounderstandhowwellthesearchalgorithmisperforming.

123

GetInfoByZip($zip0,cit1,sta2,_,tim4):-GetTimezone($sta2,tim4,_,_),GetCityState($zip0,cit1,sta2).GetInfoByState($sta0,cit1,zip2,_,tim4):-GetTimezone($sta0,tim4,_,_),GetCityState($zip2,cit1,sta0).GetDistanceBetweenZipCodes($zip0,$zip1,dis2):-GetCentroid($zip0,lat1,lon2),GetCentroid($zip1,lat4,lon5),

GetDistance($lat1,$lon2,$lat4,$lon5,dis10),ConvertKm2Mi($dis10,dis2).GetZipCodesWithin($_,$dis1,_,dis3):-<(dis3,dis1).

YahooGeocoder($str0,$zip1,cit2,sta3,_,lat5,lon6):-USGeocoder($str0,$zip1,cit2,sta3,lat5,lon6).GetCenter($zip0,lat1,lon2,cit3,sta4):-WeatherConditions($cit3,sta4,_,lat1,lon2,_,_,_,_,_,_,_,_,_,_,_),GetZipcode($cit3,$sta4,zip0).

Earthquakes($_,$_,$_,$_,lat4,lon5,_,dec7,_):-USGSEarthquakes(dec7,_,lat4,lon5).USGSElevation($lat0,$lon1,dis2):-ConvertFt2M($dis2,dis1),Altitude($lat0,$lon1,dis1).CountryInfo($cou0,cou1,cit2,_,_,cur5,_,_,_,_):-GetCountryName($cou0,cou1),GoCurrency(cur5,cou0,_),

WeatherConditions($cit2,_,cou1,_,_,_,_,_,_,_,_,_,_,_,_,_).

456

789

Thefirsttwosourcesprovideinformationaboutzipcodes,suchasthenameofthecity,thestateandthetimezone.Theydifferintheirbindingconstraints,withthefirsttakingazipcodeasinput,andthesecondtakingastate.Thesecondsourcereturnsmanyoutputtuplesperinputvalue,makingithardertolearnthedefinition,eventhoughthetwosourcesprovidelogicallythesameinformation.Theinduceddefinitionsarethebestpossiblegiventheknownsourcesavailable.(Noneofthemprovidedthemissingoutputattribute,atele-phonearea-code.)Thethirdsourcecalculatesthedistanceinmilesbetweentwozipcodes,(itisthesameassource4fromsection2).Thecorrectdefinitionwaslearntforthissource,butforthenextsource,whichreturnedzipcodeswithinagivenradius,areasonabledefini-tioncouldnotbelearntwithinthetimelimit.12Ignoringbindingconstraints,theintended

12.Therecallforthisproblemis1/4becausetheinputattributedis1isdeterminedtobetheonlycorrectly

generatedattribute(itisconstrainedwithrespecttotheoutputattributedis3),whileallfourattributesshouldhavebeengenerated(theoutputattributedis3isnotgeneratedbythe33

Carman&Knoblock

definitionwasthesameasthethird,butwitharestrictionthattheoutputdistancebelessthantheinputdistance.Thusitwouldhavebeenfareasierforeidostolearnadefinitionforthefourthsourceintermsofthethird.Indeed,whenthenewdefinitionforthethirdsourcewasaddedtothesetofknownsources,thesystemwasabletolearnthefollowing:

4’GetZipCodesWithin($zip0,$dis1,zip2,dis3):-GetDistanceBetweenZipCodes($zip0,$zip2,dis3),<(dis3,dis1).

Theabilityofthesystemtoimproveitslearningabilityovertimeasthesetofknownsourcesincreasesisakeybenefitoftheapproach.

SourcefiveisageocodingserviceprovidedbyYahoo.(Geocodingservicesmapad-dressestolatitudeandlongitudecoordinates.)eidoslearntthatthesamefunctionalitywasprovidedbyaservicecalledUSGeocoder.Sourcesixisasimpleserviceprovidingthelatitude/longitudecoordinatesandthecityandstateforagivenzipcode.Interestingly,thesystemlearntthatthesource’scoordinateswerebetterpredictedbyaweathercondi-tionsservice(discussedinsection7.3.3),thanbytheGetCentroidsourcefromthethirddefinition.Notethatwhenthenewsourcedefinitionisunfoldeditwillcontainextrane-ouspredicatesrelatedtoweatherinformation.13Theadditionalpredicatesdonotinterferewiththeusefulnessofthedefinition,however,asaqueryreformulationalgorithmwillstillusethesourcetoanswerthesamequeriesregardless.(Thusprecisionandrecallscoresarenotaffected.)Apost-processingsteptoremoveextraneouspredicatesispossible,butwouldrequireadditionalinformationinthedomainmodel.14Theseventhsourceprovidedearthquakedatawithinaboundingboxwhichittookasinput.Inthiscase,thesystemdiscoveredthatthesourcewasindeedprovidingearthquakedata,(lat4andlon5aretheco-ordinatesoftheearthquakeanddec7isitsmagnitude).Itdidn’tmanage,however,toworkouthowtheinputcoordinatesrelatedtotheoutput.Thenextsourceprovidedelevationdatainfeet,whichwasfoundtobesufficientlysimilartoknownaltitudedatainmetres.Finally,thesystemlearntadefinitionforasourceprovidinginformationaboutcountriessuchasthecurrencyused,andthenameofthecapitalcity.Sinceknownsourceswerenotavailabletoprovidethisinformation,thesystemendeduplearningthatweatherreportswereavailableforthecapitalofeachcountry.Problem123456789

#Candidates2524888350401115176

#Invocations50689804111361117613148151621487717728784

Time(s)85914449253242831872559

log10(Score)-1.36-1.08-0.750.25-0.45-7.61-6.87-8.58-5.77

Precision4/44/43/31/16/65/53/33/34/4

Recall4/44/43/31/46/65/53/93/34/4

13.Theunfoldingisshownbelow.Theconditionspredicatecouldberemovedwithoutaffectingitsmeaning:

GetCenter($zip0,lat1,lon2,cit3,sta4):-municipality(cit3,sta4,zip2,tim3),country(,cou5,),

northAmerica(cou5),centroid(zip2,lat1,lon2),conditions(lat1,lon2,,,,,,,,,,,),timezone(tim3,,),municipality(cit3,sta4,zip0,).

14.Inparticular,universallyquantifiedknowledgewouldbeneededinthedomainmodel,e.g.:

∀lat,lon∃x1,...,x11s.t.conditions(lat,long,x1,...,x11)

34

LearningSemanticDefinitionsofInformationSourcesontheInternet

Thetableshowssomedetailsregardingthesearchperformedtolearneachofthedefi-nitionslistedabove.Foreachproblem,itshowsthenumberofcandidatesgeneratedpriortothewinningdefinition,alongwiththetimeandnumberofsourceinvocationsrequiredtolearnthedefinition.(Thelasttwovaluesshouldbeinterpretedwithcautionastheyarehighlydependentonthedelayinaccessingsources,andonthecachingofdatainthesystem.)Thescoresshowninthefifthcolumnareanormalisedversionofthescoringfunctionusedtocomparethedefinitionsduringsearch.(Normalisationinvolvedremovingthepenaltyappliedformissingoutputs.)Thescorescanbeverysmall,sothelogarithmofthevaluesisshown(hencethenegativevalues).15Thesescorescanbeinterpretedastheconfidencethesystemhasinthedefinitionsproduced.Thecloserthevalueistozero,thebetterthedefinition’sabilitytoproducethesametuplesasthesource.Weseethatthesystemwasfarmoreconfidentaboutthedefinitionsonethroughfive,thanthelatterones.16Thelasttwocolumnsgivetheprecisionandrecallvalueforeachproblem.Theaverageprecisionfortheseproblemswas100%.(Notethatahighprecisionvalueistobeexpected,giventhattheinductionalgorithmreliesonfindingmatchingtuplesbetweenthesourceanddefinition.)Theaveragerecallforthegeospatialproblemswasalsoveryhighat84%.

7.3.2FinancialSources

Twosourcesweretestedthatprovidedfinancialdata.Thedefinitionsgeneratedbyeidosforthesesourcesareshownbelow.

10GetQuote($tic0,pri1,dat2,tim3,pri4,pri5,pri6,pri7,cou8,_,pri10,_,_,pri13,_,com15):-YahooFinance($tic0,pri1,dat2,tim3,pri4,pri5,pri6,pri7,cou8),

GetCompanyName($tic0,com15,_,_),Add($pri5,$pri13,pri10),Add($pri4,$pri10,pri1).11YahooExchangeRate($_,$cur1,pri2,dat3,_,pri5,pri6):-GetCurrentTime(_,_,dat3,_),GoCurrency(cur1,cou5,pri2),

GoCurrency(_,cou5,pri5),Add($pri2,$pri5,pri12),Add($pri2,$pri6,pri12).

Thefirstfinancialserviceprovidedstockquoteinformation,andthesystemlearntthatthesourcereturnedexactlythesameinformationasastockmarketserviceprovidedbyYahoo.Itwasalsoabletoworkoutthatthepreviousday’scloseplustoday’schangewasequaltothecurrentprice.Thesecondsourceprovidedtherateofexchangebetweenthecurrenciesgivenasinput.Inthiscase,thesystemdidnotfarewell.Itwasunabletolearntheintendedresult,whichinvolvedcalculatingtheexchangeratebytakingtheratioofthevaluesforthefirstandsecondcurrency.Problem1011

#Candidates2844367

#Invocations1667116749

Time(s)387282

log10(Score)-8.13-9.84

Precision12/131/5

Recall12/121/4

Detailsregardingthesearchspacesforthetwoproblemsareshownabove.Theaverageprecisionandrecallfortheseproblemsweremuchlowerat56%and63%respectively,becausethesystemwasunabletolearntheintendeddefinitioninthesecondproblem.

15.Thepositivevalueforproblem4resultsfromanapproximationerror.

16.Lowscoresandperfectprecisionandrecall(problems6,8and9)indicateverylittleoverlapbetween

thetargetandtheknownsources.Thefactthatthesystemlearnsthecorrectdefinitioninsuchcasesistestimonytotherobustnessoftheapproach.

35

Carman&Knoblock

7.3.3WeatherSources

OntheInternet,therearetwotypesofweatherinformationservices,thosethatprovideforecastsforcomingdays,andthosethatprovidedetailsofcurrentweatherconditions.Intheexperiments,apairofsuchservicesprovidedbyWeather.comwereusedtolearndefini-tionsforanumberofotherweathersources.Thefirstsetofdefinitions,whichcorrespondtosourcesthatprovidecurrentweatherconditions,arelistedbelow:

12NOAAWeather($ica0,air1,_,_,sky4,tem5,hum6,dir7,spe8,_,pre10,tem11,_,_):-GetAirportInfo($ica0,_,air1,cit3,_,_),

WeatherForecast($cit3,_,_,_,_,_,_,_,_,_,_,_,sky4,dir7,_,_),

WeatherConditions($cit3,_,_,_,_,_,_,_,_,tem5,sky4,pre33,_,spe8,hum6,tem11),ConvertIn2mb($pre33,pre10).

13WunderGround($sta0,$cit1,tem2,_,_,pre5,pre6,sky7,dir8,spe9,spe10):-WeatherConditions($cit1,sta0,_,_,_,_,_,_,dat8,tem9,sky7,pre5,dir8,spe13,_,tem2),WeatherForecast($cit1,sta0,_,_,_,_,_,_,tem24,_,_,_,_,_,spe10,_),

ConvertIn2mb($pre5,pre6),<(tem9,tem24),ConvertTime($dat8,_,_,_,_),<(spe9,spe13).14WeatherBugLive($_,cit1,sta2,zip3,tem4,_,_,dir7,_,_):-WeatherConditions($cit1,sta2,_,_,_,_,_,_,_,tem4,_,_,dir7,_,_,_),GetZipcode($cit1,$sta2,zip3).

15WeatherFeed($cit0,$_,tem2,_,sky4,tem5,_,_,pre8,lat9,_,_):-WeatherConditions($cit0,_,_,lat9,_,_,_,_,_,_,sky4,pre8,dir12,_,_,tem5),WeatherForecast($cit0,_,_,_,_,_,_,_,_,tem2,_,_,_,dir12,_,_).

16WeatherByICAO($ica0,air1,cou2,lat3,lon4,_,dis6,_,sky8,_,_,_,_):-Altitude($lat3,$lon4,dis6),GetAirportInfo($ica0,_,air1,cit6,_,cou8),WeatherForecast($cit6,_,cou8,_,_,_,_,_,_,_,_,_,sky8,_,_,_),GetCountryName($cou2,cou8).

17WeatherByLatLon($_,$_,_,_,_,lat5,lon6,_,dis8,_,_,_,_,_,_):-Altitude($lat5,$lon6,dis8).

Inthefirstproblem,thesystemlearntthatsource12providedcurrentconditionsatairports,bycheckingtheweatherreportforthecitiesinwhicheachairportwaslocated.Thisparticularproblemdemonstratessomeoftheadvantagesoflearningdefinitionsfornewsourcesdescribedinsection3.3.Oncethedefinitionhasbeenlearnt,ifamediatorreceivesarequestforthecurrentconditionsatanairport,itcangenerateananswerforthatquerybyexecutingasinglecalltothenewlymodeledsource,(withoutneedingtofindanearbycity).Thesystemperformedwellonthenextthreesources(13to15)learningdefinitionswhichcovermostoftheattributesofeach.Onthelasttwoproblems,thesystemdidnotperformaswell.Inthecaseofsource16,thesystemspentmostofitstimelearningwhichattributesoftheairportwerebeingreturned(suchasitscountry,coordinates,elevation,etc.).Inthelastcase,thesystemwasonlyabletolearnthatthesourcewasreturningsomecoordinatesalongwiththeirelevation.Wenoteherethatdifferentsourcesmayprovidedataatdifferentlevelsofaccuracy.Thusthefactthatthesystemisunabletolearnadefinitionforaparticularsourcecouldsimplymeanthatthedatabeingreturnedbythatsourcewasn’tsufficientlyaccurateforthesystemtolabelitamatch.

Inadditiontocurrentweatherfeeds,thesystemwasrunontwoproblemsinvolvingweatherforecastfeeds.Itdidverywellonthefirstproblem,matchingallbaroneoftheattributes(thecountry)andfindingthattheorderofthehighandlowtemperatureswas

36

LearningSemanticDefinitionsofInformationSourcesontheInternet

inverted.Itdidwellalsoforthesecondproblem,learningadefinitionforthesourcethatproducedmostoftheoutputattributes.

18YahooWeather($zip0,cit1,sta2,_,lat4,lon5,day6,dat7,tem8,tem9,sky10):-WeatherForecast($cit1,sta2,_,lat4,lon5,_,day6,dat7,tem9,tem8,_,_,sky10,_,_,_),GetCityState($zip0,cit1,sta2).

19WeatherBugForecast($_,cit1,sta2,_,day4,sky5,tem6,_):-WeatherForecast($cit1,sta2,_,_,_,tim5,day4,_,tem6,_,tim10,_,sky5,_,_,_),WeatherConditions($cit1,_,_,_,_,tim10,_,tim5,_,_,_,_,_,_,_,_).

Detailsregardingthenumberofcandidatesgeneratedinordertolearndefinitionsforthedifferentweathersourcesareshownbelow.Theaverageprecisionofthedefinitionsproducedwas91%,whiletheaveragerecallwas62%.Problem1213141516171819

#Candidates27719899819910245119116

#Invocations579426249975494676691387614857

Time(s)23360593029248410267591217

log10(Score)-2.92-6.35-13.37-6.48-29.69-26.71-5.74-12.56

Precision8/96/95/55/66/73/310/105/5

Recall8/116/105/85/106/93/1310/115/7

7.3.4HotelSources

DefinitionswerealsolearntforsourcesprovidinghotelinformationfromYahoo,GoogleandtheUSFireAdministration.Thesedefinitionsareshownbelow.

20USFireHotelsByCity($cit0,_,_,sta3,zip4,cou5,_):-HotelsByZip($zip4,_,_,cit0,sta3,cou5).

21USFireHotelsByZip($zip0,_,_,cit3,sta4,cou5,_):-HotelsByZip($zip0,_,_,cit3,sta4,cou5).

22YahooHotel($zip0,$_,hot2,str3,cit4,sta5,_,_,_,_,_):-HotelsByZip($zip0,hot2,str3,cit4,sta5,_).

23GoogleBaseHotels($zip0,_,cit2,sta3,_,_,lat6,lon7,_):-WeatherConditions($cit2,sta3,_,lat6,lon7,_,_,_,_,_,_,_,_,_,_,_),GetZipcode($cit2,$sta3,zip0).

Thesystemperformedwellonthreeoutofthefourproblems.Itwasunableinthetimeallocatedtodiscoveradefinitionforthehotelattributes(name,street,latitudeandlongi-tude)returnedbytheGoogleWebService.Theaverageprecisionfortheseproblemswas90%whiletheaveragerecallwas60%.Problem20212223

#Candidates16164395

#Invocations448189431374931

Time(s)4852821161

37

log10(Score)-4.00-2.56-2.81-7.50Precision4/44/45/53/5Recall4/64/65/93/6

Carman&Knoblock

7.3.5CarsandTrafficSources

ThelastproblemsonwhichthesystemwastestedwereapairoftrafficrelatedWebServices.Thefirstservice,providedbyYahoo,reportedlivetrafficdata(suchasaccidentsandconstructionwork)withinagivenradiusoftheinputzipcode.Noknownsourceswereavailablewhichprovidedsuchinformation,sonotsurprisingly,thesystemwasunabletolearnadefinitionforthetrafficrelatedattributesofthatsource.(Instead,thesystemdiscoveredarelationshipbetweentheinputzipcodeandtheoutputlongitudethatwasn’tcorrect,soprecisionforthisproblemwaszero.)

24YahooTraffic($zip0,$_,_,lat3,lon4,_,_,_):-GetCentroid($zip0,_,lon4),CountryCode($lat3,$lon4,_).25YahooAutos($zip0,$mak1,dat2,yea3,mod4,_,_,pri7,_):-GoogleBaseCars($zip0,$mak1,_,mod4,pri7,_,_,yea3),

ConvertTime($dat2,_,dat10,_,_),GetCurrentTime(_,_,dat10,_).

Thesecondprobleminvolvedaclassifiedused-carlistingfromYahoothattookazipcodeandcarmanufacturerasinput.eidoswasabletolearnagooddefinitionforthatsource,takingadvantageofthefactthatsomeofthesamecars(definedbytheirmake,model,yearandprice)werealsolistedforsaleonGoogle’sclassifiedcarlisting.Problem2425

#Candidates

8155

#Invocations

29974405

Time(s)1065815

log10(Score)-11.21-5.29

Precision0/36/6

Recall0/46/6

Sincethesystemfailedonthefirstproblem(itfoundsomeincorrect/non-generalrelation-shipsbetweendifferentattributes),butsucceededonthesecondproblemtofindthebestpossibledefinition,theaverageprecisionandrecallfortheseproblemswereboth50%.7.3.6OverallResults

Acrossthe25problems,eidosmanagedtogeneratedefinitionswithhighaccuracy(averageprecisionwas88%)andalargenumberofattributes(averagerecallwas69%).Theseresultsarepromising,especiallyconsideringthatallproblemsinvolvedrealdatasourceswithinsomecasesverysmalloverlapbetweenthedataproducedbythetargetandthatprovidedbytheknownsources(asevidencedbylowlogarithmicscores).Inadditiontominimaloverlap,manysourcesprovidedincompletetuples(i.e.tuplescontainingmultiple“NULL”or“N/A”values)aswellaserroneousorinaccuratedata,makingtheproblemallthemoredifficult.ThehighaverageprecisionandrecallleadustobelievethattheJaccardmeasureisdoingagoodjobofdistinguishingcorrectfromincorrectdefinitionsinthepresenceofdatasourcesthatarebothnoisy(inconsistent)andincomplete(missingtuplesandvalues).

Comparingthedifferentdomains,onecanseethatthesystemperformedbetteronproblemswithfewerinputandoutputattributes(suchasthegeospatialproblems),whichwastobeexpectedgiventhattheresultingsearchspaceismuchsmaller.

38

LearningSemanticDefinitionsofInformationSourcesontheInternet

7.4EmpiricalComparison

Havingdemonstratedtheeffectivenessofeidosinlearningdefinitionsforrealinformationservices,wenowshowthatthesystemiscapableofhandlingthesameproblemsasawell-knowncomplexschemamatchingsystem.

TheiMAPsystem(Dhamanka,Lee,Doan,Halevy,&Domingos,2004)(discussedinsection8.4)isaschemamatcherthatcanlearncomplex(many-to-one)mappingsbetweentheconceptsofasourceandatargetschema.Itusesasetofspecialpurposesearcherstolearndifferenttypesofmappings.Theeidossystem,ontheotherhand,usesagenericsearchalgorithmtosolveacomparableproblem.Sincethetwosystemscanbemadetoperformasimilartask,weshowthateidosiscapableofrunningononeoftheproblemdomainsusedintheevaluationofiMAP.Wechosetheparticulardomainofonlinecricketdatabasesbecauseitistheonlyoneusedintheevaluationthatinvolvedaligningdatafromtwoindependentdatasources.(Allotherproblemsinvolvedgeneratingsyntheticdatabysplittingasingledatabaseintoasourceandtargetschema,whichwouldnothavebeenasinterestingforeidos.)

Playerstatisticsfromtwoonlinecricketdatabases(cricketbase.comandcricinfo.com)wereusedintheexperiments.Sinceneitherofthesourcesprovidedprogrammaticaccesstotheirdata,thestatisticsdatawasextractedfromHTMLpagesandinsertedintoarelationaldatabase.Theextractionprocessinvolvedflatteningthedataintoarelationalmodelandasmallamountofdatacleaning.(TheresultingtablesaresimilarbutnotnecessarilyexactlythesameasthoseusedintheiMAPexperiments.)Thedatafromthetwowebsiteswasusedtocreatethreedatasourcesrepresentingeachwebsite.Thethreesourcesrepresentingcricinfo.comwerethenusedtolearndefinitionsforthethreesourcesrepresentingcricketbase.com.Otherknownsourceswereavailabletothesystem,includingfunctionalityforsplittingapartcomma-separatedlists,addingandmultiplyingnumbers,andsoon.Thedefinitionslearnttodescribethecricketbaseservicesareshownbelow:

1CricbasePlayers($cou0,nam1,_,dat3,_,unk5,unk6):-CricinfoPlayer($nam1,dat3,_,_,_,lis5,nam6,unk5,unk6),contains($lis5,cou0),CricinfoTest($nam6,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_,_).

2CricbaseTest($_,nam1,cou2,_,_,_,cou6,cou7,dec8,cou9,_,_,_,cou13,cou14,dec15,_,dec17,cou18,_,_):-CricinfoTest($nam1,_,_,cou2,cou6,cou18,cou7,_,dec8,dec15,_,cou9,_,_,_,cou14,cou13,_,_,_,_,dec17).

3CricbaseODI($_,nam1,cou2,_,_,_,cou6,cou7,dec8,cou9,cou10,_,cou12,cou13,_,dec15,dec16,dec17,cou18,cou19,_):-CricinfoODI($nam1,_,_,cou2,cou6,_,cou10,_,dec8,_,cou7,cou18,cou19,cou9,_,_,cou13,cou12,dec15,_,dec16,dec17).

Thefirstsourceprovidedplayerprofilesbycountry.Thesecondandthirdsourcesprovideddetailedplayerstatisticsfortwodifferenttypesofcricket(TestandOne-Day-Internationalrespectively).Thesystemeasilyfoundthebestdefinitionforthefirstsource.Thedefinitioninvolvedlookingfortheplayer’scountryinalistofteamsthatheplayedfor.eidosdidnotperformquiteaswellonthesecondandthirdproblems.Thereweretworeasonsforthis.Firstly,thearityofthesesourceswasmuchhigherwithmanyinstancesofthesamesemantictype(countanddecimal),makingthespaceofpossiblealignmentsmuchlarger.

39

Carman&Knoblock

(Becauseofthelargesearchspace,alongertimeoutof40minuteswasused.)Secondly,ahighfrequencyofnullvalues(theconstant“N/A”)inthedataforsomeofthefieldsconfusedthealgorithm,andmadeitharderforittodiscoveroverlappingtupleswithallofthedesiredattributes.Problem123

#Candidates19911623114

#Invocations376215174299

Time(s)43213192127

log10(Score)-3.95-4.70-6.28

Precision5/58/118/14

Recall5/58/168/16

Detailsofthesearchperformedtolearnthedefinitionsareshownabove.Theaverageprecisionfortheseproblemswas77%whiletheaveragerecallwaslowerat66%.ThesevaluesarecomparabletothequalityofthematchingsreportedforiMAP.17Theseresultsareverygood,consideringthateidossearchesinthespaceofmany-to-manycorrespondences,(tryingtodefinethesetoftargetattributescontemporaneously),whileiMAPsearchesthespacesofone-to-oneandmany-to-onecorrespondences.Moreover,eidosfirstinvokesthetargetsourcetogeneraterepresentativedata(atasknotperformedbyiMAP)andthenperformsagenericsearchforreasonabledefinitionswithoutrelyingonspecialisedsearchalgorithmsfordifferenttypesofattributes(asisdoneiniMAP).

8.RelatedWork

InthissectionwedescribehowtheworkinthispaperrelatestoresearchperformedbytheMachineLearning,DatabaseandtheSemanticWebcommunities.Beforedoingthat,wedescribesomeearlyworkperformedbytheArtificialIntelligencecommunity.WealsodiscusshowouralgorithmdiffersfromstandardILPtechniques,andinparticularwhyadirectapplicationofsuchtechniqueswasnotpossibleforourproblem.8.1AnEarlyApproach

ThefirstworkconcernedwithlearningmodelsfordescribingoperationsavailableontheInternetwasperformed(inthepre-XMLera)onaproblemcalledcategorytranslation(Perkowitz&Etzioni,1995;Perkowitz,Doorenbos,Etzioni,&Weld,1997).Thisproblemconsistedofanincompleteinternalworldmodelandanexternalinformationsourcewiththegoalbeingtocharacterizetheinformationsourceintermsoftheworldmodel.TheworldmodelconsistedofasetofobjectsO,whereeachobjecto∈Obelongedtoacertaincategory(e.g.people)andwasassociatedwithasetofattributes󰀛a1(o),...,an(o)󰀜,madeupofstringsandotherobjects.Asimplerelationalinterpretationofthisworldmodelwouldconsidereachcategorytobearelation,andeachobjecttobeatuple.Theinformationsource,meanwhile,wasanoperationthattookinasinglevalueasinputandreturnedasingletupleasoutput.Thecategorytranslationproblemcanbeviewedasasimplificationofthesourcedefinitioninductionproblem,whereby:

17.Theactualvaluesforprecisionandrecallinthecricketdomainarenotquoted,butanaccuracyrange

of68-92%forsimplematches(one-to-onecorrespondencesbetweensourceandtargetfields)and50-86%forcomplexmatches(many-to-onecorrespondences)acrosssyntheticandrealproblemswasgiven.

40

LearningSemanticDefinitionsofInformationSourcesontheInternet

•Theextensionsoftheglobalrelationsareexplicit.(Thereisonesourceperglobalrelation,anditdoesn’thavebindingconstraints,i.e.R=S.)

•Theinformationprovidedbythesourcesdoesnotchangeovertime.

•Thenewsourcetakesasinglevalueasinputandreturnsasingletupleasoutput.Inordertofindsolutionstoinstancesofthecategorytranslationproblem,theauthorsem-ployedavariantofrelationalpath-finding(Richards&Mooney,1992),whichisanextensiononthefoilalgorithm,tolearnmodelsoftheexternalsource.Thetechniquedescribedinthispaperforsolvinginstancesofthesourceinductionproblemissimilarinthatittooisbasedonafoil-likeinductivesearchalgorithm.8.2DirectApplicationofILPtechniques

ResearchersbecameinterestedinthefieldofInductiveLogicProgrammingintheearlynineties,resultinginanumberofdifferentILPsystemsbeingdevelopedincludingfoil(Cameron-Jones&Quinlan,1994),progol(Muggleton,1995)andaleph18.Ideally,onewouldliketoapplysuch“off-the-shelf”ILPsystemstothesourcedefinitioninductionproblem.Anumberofissues,however,limitthedirectapplicabilityofthesesystems.Theissuescanbesummarisedasfollows:

••••

Extensionsoftheglobalrelationsarevirtual.

Sourcesmaybeincompletewithrespecttotheirdefinitions.Explicitnegativeexamplesofthetargetarenotavailable.Sourcesmayserialiseconstantsindifferentways.

ThefirstissuehastodowiththefactthatallILPsystemsassumethatthereisanextensionaldefinitionofthetargetpredicateandextensional(orinsomecasesintentional)definitionsofthe(source)predicatesthatwillbeusedinthedefinitionforthetarget.Inotherwords,theyassumethattablesalreadyexistinsomerelationaldatabasetorepresentboththenewsourceandtheknownsources.Inourcase,weneedtogeneratesuchtablesbyfirstinvokingtheserviceswithrelevantinputs.Onecouldenvisageinvokingeachofthesourceswitheverypossibleinputandusingtheresultingtablestoperforminduction.Suchadirectapproachwouldnotbefeasiblefortworeasons.Firstly,acompletesetofpossibleinputvaluesmaynotbeknowntothesystem.Secondly,evenifitispossibletogenerateacompletesetofviableinputstoaservice,itmaynotbepracticaltoquerythesourcewithsuchalargesetoftuples.Considersource4fromsection2,whichcalculatesthedistanceinmilesbetweentwozipcodes.Giventhatthereareover40,000zipcodesintheUS,generatinganextensionalrepresentationofthissourcewouldrequireperformingmorethanabillioninvocations!Performingsuchalargenumberofinvocationsdoesnotmakesensewhenasmallnumberofexampleinvocationswouldsufficeforcharacterisingthefunctionalityofthesource.Inthispaperwehavedevelopedanefficientalgorithmthatonlyqueriesthesourcesasneededinordertoevaluateindividualcandidatedefinitions.

Thesecondissueregardingtheincompletenessofthesourcescausesaproblemwhenacandidateistobeevaluated.Sincethesetoftuplesreturnedbyeachknownsourcemayonly

18.SeetheAlephManualbyAshwinSrinivasan,whichisavailableat:

http://web.comlab.ox.ac.uk/oucl/research/areas/machlearn/Aleph/aleph.html

41

Carman&Knoblock

beasubsetofthoseimpliedbyitsowndefinition,sotoowillbethesetoftuplesreturnedbythecandidatehypothesiswhenexecutedagainstthosesources.Thismeansthatwhenthesystemtriestoevaluateahypothesisbycomparingthosetupleswiththeoutputofthenewsource,itcannotbesurethatatuplewhichisproducedbythenewsourcebutnotbythehypothesisisinfactnotlogicallyimpliedbyit.Thisfactistakenintoaccountinourevaluationfunctionforscoringcandidatedefinitions,discussedinsection5.

Thethirdissueregardingthelackofexplicitnegativeexamplesforthetargetpredicatealsoaffectstheevaluationofcandidatehypotheses.Theclassicapproachtodealingwiththisproblemistoassumeaclosedworld,inwhichalltuples(overtheheadrelation)whicharenotexplicitlydeclaredtobepositivemustbenegative.Sincethenewsourcemayinfactbeincompletewithrespecttothebestpossibledefinitionforit,thisassumptiondoesnotnecessarilyhold.Inotherwords,justbecauseaparticulartupleisproducedwhenthecandidatedefinitionisexecutedandthatsametupleisnotreturnedbythenewsourcedoesnotnecessarilymeanthatthecandidatedefinitionisincorrect.

Thefourthissuehastodowiththefactthatthedataprovidedbydifferentsourcesmayneedtobereconciled,inthesensethatdifferentserialisationsof(stringsrepresenting)thesamevalue(suchas“Monday”and“Mon”forinstance)mustberecognized.SinceILPsystemshavebeendesignedtooperateoverasingledatabasecontainingmultipletables,theissueofheterogeneityinthedataisnothandledbycurrentsystems.Insection5.4wediscussedhowthisheterogeneityisresolvedinoursystem.8.3MachineLearningApproaches

SincetheadventofservicesontheInternet,researchershavebeeninvestigatingwaystomodelthemautomatically.Primarily,interesthascenteredonusingmachinelearningtechniquestoclassifytheinputandoutputtypesofaservice,soastofacilitateservicediscovery.Heß&KushmerickproposedusingaSupportVectorMachinetoclassifytheinputandoutputattributesintodifferentsemantictypesbasedonmetadataininterfacedescriptions(Heß&Kushmerick,2003,2004).Theirnotionofsemantictypes(suchaszipcode)asopposedtosyntactictypes(likeinteger)wentsomewaytowarddefiningthefunctionalitythatasourceprovides.Recently,otherresearchers(Lermanetal.,2006)proposedtheuseoflogisticregressionforassigningsemantictypestoinputparametersbasedonmetadata,andapatternlanguageforassigningsemantictypestotheoutputparametersbasedonthedatathesourceproduces.Thisworkonclassifyinginputandoutputattributesofaservicetosemantictypesformsaprerequisitefortheworkinthisarticle.Forthepurposesofthispaper,wehaveassumedthatthisproblemhasbeensolved.

Inadditiontoclassifyingtheinput/outputattributesofservices,Heß&Kushmerickinvestigatedtheideaofclassifyingtheservicesthemselvesintodifferentservicetypes.Moreprecisely,theyusedthesameclassificationtechniquestoassignserviceinterfacestodiffer-entsemanticdomains(suchasweatherandflights)andtheoperationsthateachinterfaceprovidestodifferentclassesofoperation(suchasweatherForecastandflightStatus).Theresultingsourcedescription(hypothesis)languageislimitedtoselect-projectqueries,whicharenotsufficientlyexpressivetodescribemanyofthesourcesavailableontheInternet.Accordingtothatapproach,sinceeveryoperationmustbecharacterizedbyaparticu-laroperationclass,operationsthatprovideoverlapping(non-identical)functionalitywould

42

LearningSemanticDefinitionsofInformationSourcesontheInternet

needtobeassigneddifferentclassesaswouldoperationswhichprovidecomposedfunction-ality(suchas,forexample,anoperationthatprovidesbothweatherandflightdata).Theneedforanexhaustivesetofoperationclasses(andaccompanyingtrainingdata)isamajorlimitationofthatapproach,notsharedbytheworkdescribedinthispaper,whichreliesonamoreexpressivelanguagefordescribingserviceoperations.

Onewaytoeliminatetheneedforapredefinedsetofoperationtypesistouseunsuper-visedclusteringtechniquestogeneratethe(operation)classesautomaticallyfromexamples(WSDLdocuments).ThisideawasimplementedinasystemcalledWoogle(Dongetal.,2004).Thesystemclusteredserviceinterfacestogetherusingasimilarityscorebasedontheco-occurrenceofmetadatalabels.Itthentookadvantageoftheclustersproducedtoim-provekeyword-basedsearchforWebServices.Anadvantageofthisunsupervisedapproachisthatnolabeledtrainingdataisrequired,whichcanbetime-consumingtogenerate.Suchclusteringapproaches,however,whileusefulforservicediscovery,sufferthesamelimitationsasthepreviousapproachwhenitcomestoexpressiveness.8.4DatabaseApproaches

Thedatabasecommunityhaslongbeeninterestedintheproblemofintegratingdatafromdisparatesources.Specifically,intheareasofdatawarehousing(Widom,1995)andin-formationintegration(Wiederhold,1996),researchersareinterestedinresolvingsemanticheterogeneitywhichexistsbetweendifferentdatabasessothatthedatacanbecombinedoraccessedviaasingleinterface.Theschemamappingproblemistheproblemofdeterminingamappingbetweentherelationscontainedinasourceschemaandaparticularrelationinatargetschema.Amappingdefinesatransformationwhichcanbeusedtopopulatethetargetrelationwithdatafromthesourceschema.Mappingsmaybearbitrarilycomplexprocedures,butingeneraltheywillbedeclarativequeriesinSQLorDatalog.Thecomplex-ityofthesequeriesmakestheschemamappingproblemfarmoredifficultthanthehighlyinvestigatedschemamatchingproblem(Rahm&Bernstein,2001),whichinvolvesfinding1-to-1correspondencesbetweenfieldsofasourceandtargetschema.

Thesourcedefinitioninductionproblemcanbeviewedasatypeofschemamappingproblem,inwhichtheknownsourcesdefinethesourceschemaandtheunknownsourcespecifiesthetargetrelation.Inordertosolveaschemamappingproblem,onetypicallytakesadvantageofallavailableauxiliaryinformation(includingsourceandtargetdatainstances,labelsfromtherespectiveschemas,andsoon).Suchproblemsaregenerallysimpler,however,becausethedata(theextensionsoftherelations)inthesourceandtargetschemaareusuallyexplicitlyavailable.Insourceinduction,thatdataishiddenbehindaserviceinterface,whichhasbindingconstraints,andthedataitselfcanbeextremelylargeoreven(inthecaseofsourcesprovidingmathematicalfunctions)infinite.Thusmakingtheproblemconsiderablymoredifficult.

TheschemaintegrationsystemCLIO(Yan,Miller,Haas,&Fagin,2001)helpsusersbuildSQLqueriesthatmapdatafromasourcetoatargetschema.InCLIO,foreignkeysandinstancedataareusedtogenerateintegrationrulessemi-automatically.SinceCLIOreliesheavilyonuserinvolvement,itdoesnotmakesensetocompareitdirectlywiththeautomatedsystemdevelopedinthispaper.

43

Carman&Knoblock

Anothercloselyrelatedproblemisthatofcomplexschemamatching,thegoalofwhichistodiscovercomplex(many-to-one)mappingsbetweentworelationaltablesorXMLschemas.Thisproblemisfarmorecomplicatedthanbasic(one-to-one)schemamatchingbecause:

•ThespaceofpossiblecorrespondencesbetweentherelationsisnolongertheCartesianproductofthesourceandtargetrelations,butthepowersetofthesourcerelationtimesthetargetrelation.

•Many-to-onemappingsrequireamappingfunction,whichcanbesimplelikeconcate-nate(x,y,z),oranarbitrarilycomplexformulasuchasz=x2+y.

TheiMAPsystem(Dhamankaetal.,2004)triestolearnsuchmany-to-onemappingsbe-tweentheconceptsofasetofsourcerelationsandatargetrelation.Itusesasetofspecialpurposesearcherstolearndifferenttypesofmappings(suchasmathematicalexpressions,unitconversionsandtime/datemanipulations).Itthenusesameta-heuristictocontrolthesearchbeingperformedbythedifferentspecialpurposesearchers.Ifoneviewsboththesourceschemaandthefunctionsavailableforuseinthemappings(suchasconcate-nate(x,y,z),add(x,y,z),etc.)asthesetofknownsourcesinthesourcedefinitioninductionproblem,thenthecomplexschemamatchingandsourceinductionproblemsaresomewhatsimilar.Themaindifferencesbetweentheproblemsare:

•Thedataassociatedwiththesourceschemaisexplicit(andstatic)incomplexschemamatching,whileitishidden(anddynamic)insourceinduction.

•Ingeneral,thesetofknownsourcesinasourceinductionproblemwillbemuchlarger(andthedatatheyprovidemaybelessconsistent),thanthesetofmappingfunctionsandsourcerelationsinacomplexschemamatchingproblem.

Inthispaperwedevelopageneralframeworkforhandlingthesourceinductionproblem.SinceiMAPprovidesfunctionalitywhichissimilartothatofoursystem,weperformasimpleempiricalcomparisoninsection7.4.8.5SemanticWebApproach

ThestatedgoaloftheSemanticWeb(Berners-Lee,Hendler,&Lassila,2001)istoenablemachineunderstandingofWebresources.Thisisdonebyannotatingthoseresourceswithsemanticallymeaningfulmetadata.ThustheworkdescribedinthispaperisverymuchinlinewiththeSemanticWeb,insofarasweareattemptingtodiscoversemanticallymeaningfuldefinitionsforonlineinformationsources.Defactostandardsforannotatingserviceswithsemanticmarkuphavebeenaroundforanumberofyears.Thesestandardsprovideserviceownerswithametadatalanguageforaddingdeclarativestatementstoserviceinterfacedescriptionsinanattempttodescribethesemanticsofeachserviceintermsofthefunctionality(e.g.abookpurchaseoperation)ordata(e.g.aweatherforecast)thatitprovides.Workontheselanguagesisrelatedtothisarticlefromtwoperspectives:

•Itcanbeviewedasanalternativeapproachtogainingknowledgeastothesemanticsofanewlydiscoveredsource(providingithassemanticmetadataassociatedwithit).•SemanticWebServiceannotationlanguagescanbeseenasatargetlanguageforthesemanticdescriptionslearntinthispaper.

44

LearningSemanticDefinitionsofInformationSourcesontheInternet

IfaWebServiceisalreadysemanticallyannotated,heterogeneitymaystillexistbetweentheontologyusedbytheserviceproviderandthatusedbytheconsumer,inwhichcasethelearningcapabilitiesdescribedinthispapermayberequiredtoreconcilethosedifferences.Moreimportantly,weareinterestedinthevastnumberofsourcesforwhichsemanticmarkupiscurrentlyunavailable.TheworkinthisarticlecomplementsthatoftheSemanticWebcommunitybyprovidingawayofautomaticallyannotatingsourceswithsemanticinformation;therebyrelievingserviceprovidersoftheburdenofmanuallyannotatingtheirservices.Oncelearnt,DatalogsourcedefinitionscanbeconvertedtoDescriptionLogic-basedrepresentationssuchasisusedinOWL-S(Martin,Paolucci,McIlraith,Burstein,McDermott,McGuinness,Parsia,Payne,Sabou,Solanki,Srinivasan,&Sycara,2004)andWSMO(Roman,Keller,Lausen,deBruijn,Lara,Stollberg,Polleres,Feier,Bussler,&Fensel,2005).ThereasonweuseDataloginthispaper(ratherthanDescriptionLogics)isthatmostmediator-basedintegrationsystemsrelyonitasarepresentationlanguage.

9.Discussion

Inthispaperwehavepresentedacompletelyautomaticapproachtolearningdefinitionsforonlineservices.Ourapproachexploitsthedefinitionofsourcesthathaveeitherbeengiventothesystemorlearnedpreviously.Theresultingframeworkisasignificantadvanceoverpriorapproachesthathavefocusedonlearningonlytheinputsandoutputsortheclassofaservice.Wehavedemonstratedempiricallytheviabilityoftheapproach.

Thekeycontributionofthisarticleisaprocedureforlearningsemanticdefinitionsforonlineinformationservicesthatis:

•Fullyautomated:Definitionsarelearntinacompletelyautomatedmannerwithouttheneedforanyuserintervention.

•Moreexpressive:Thequerylanguagefordefiningsourcesisthatofconjunctivequeries,whichisfarmoreexpressivethanpreviousattribute-valueapproaches.

•Sufficientlyrobust:Theprocedureisabletolearndefinitionsinthepresenceofnoisyandincompletedata,andthusissufficientlyrobusttohandlerealdatasources.

•Data-accessefficient:Theproceduresamplesdatafromlivesources,invokingthemsparinglyandonlyasrequired,makingithighlyefficientintermsofsourceaccesses.•Evolving:Theprocedure’sabilitytolearndefinitionsimprovesovertimeaseachnewdefinitionislearntandaddedtothesetofknownsources.9.1ApplicationScenarios

Thereareanumberofdifferentapplicationscenariosforasystemthatiscapableoflearningdefinitionsforonlinesources.Theygenerallyinvolveprovidingsemanticdefinitionstodataintegrationsystems,whichthenexploitandintegratetheavailablesources.

Themostobviousapplicationforourworkwouldbeasystem(depictedontheleftsideofFigure1)thatcrawlstheWeb,searchingforinformationsources.Uponfindingasource,thesystemwoulduseaclassifiertoassignsemantictypestoit,followedbytheinductivelearnertogenerateadefinitionforit.ThedefinitioncouldthenbeusedtoannotatethesourcefortheSemanticWeb,orbyamediatorforansweringqueries.Importantly,thisentireprocesscouldrunwithminimaluserinvolvement.

45

Carman&Knoblock

Figure1:Architecturediagramsforthreedifferentapplicationscenarios.

Amorechallengingapplicationscenario(showninthecenterofFigure1)wouldinvolvereal-timeservicediscovery.Considerthecasewhereamediatorisunabletoansweraparticularquerybecausethedesiredinformationliesoutofscopeofthesourcesavailable.Asearchisthenperformedbasedonthe“missingconjuncts”(relationnamesandconstants)fromthequeryusingaspecialisedWebServicesearchengine,suchasWoogle(Dongetal.,2004).Theservicesreturnedwouldbeannotatedwithsemantictypesand,ifpossible,semanticdefinitions.Afterthedefinitionsareprovidedtothemediator,itwouldcompletethequeryprocessingandreturnananswertotheuser.Thisscenariomayseemalittlefar-fetcheduntiloneconsidersaspecificexample:imagineauserinteractingwithageospatialbrowser(anonlineatlas).Iftheuserturnsonaparticularinformationlayer,suchasskiresorts,butnosourceisavailableforthecurrentfieldofview(of,forinstance,Italy),thennoresultswouldbedisplayed.Inthebackgroundasearchcouldbeperformedandanewsourcediscovered,whichprovidesskiresortsalloverEurope.Therelevantdatacouldthenbedisplayed,withtheuserunawarethatasearchhasbeenperformed.

Perhapsthemostlikelyapplicationscenario(totherightofFigure1)forasourceinductionsystemwouldbeamixedinitiativeone.Inthiscaseahumanwouldannotatethedifferentoperationsofaserviceinterfacewithsemanticdefinitions.Atthesametime,thesystemwouldattempttoinducedefinitionsfortheremainingoperations,andprompttheuserwithsuggestionsforthem.Inthisscenariotheclassifiermaynotbeneeded,sinceattributesofthesamenameinthedifferentoperationswouldlikelyhavethesamesemantictype.Moreover,sincethedefinitionslearntbythesystemmayinsomecasescontainerroneousorsuperfluouspredicates,theusercouldalsobeinvolvedinaprocessofcheckingandimprovingthedefinitionsdiscovered.

46

LearningSemanticDefinitionsofInformationSourcesontheInternet

9.2OpportunitiesforFurtherResearch

Anumberoffuturedirectionsforthisworkwillallowthesetechniquestobeappliedmorebroadly.Wenowdiscusstwosuchdirections,improvingthesearchalgorithmandextendingthequerylanguage.

Asthenumberofknownsourcesgrows,sotoowillthesearchspace,anditwillbenec-essarytodevelopadditionalheuristicstobetterdirectthesearchtowardthebestdefinition.ManyheuristictechniqueshavebeendevelopedintheILPcommunityandsomemaybeapplicabletothesourceinductionproblem.Morepressingperhapsistheneedtodeveloparobustterminationconditionforhaltingthesearchoncea“sufficientlygood”definitionhasbeendiscovered.Asthenumberofavailablesourcesincreases,thesimpletimeoutusedintheexperimentswillbeineffectiveascertain(morecomplicated)definitionswillnecessarilytakelongertolearnthanothers.

Anotherwaytoincreasetheapplicabilityofthisworkistoextendthequerylanguagesothatitbetterdescribesthesourcesavailable.Oftenonlinesourcesdonotreturnacompletesetofresultsbutrathercutoffthelistatsomemaximumcardinality.ForexampletheYahooHotelsourcedescribedinsection7.3.4returnsamaximumof20hotelsnearagivenlocation,andordersthemaccordingtodistance.Inthiscase,recognisingthespecificorderingonthetuplesproducedwouldbeveryusefultoamediator.Asecondusefulextensiontothequerylanguagewouldbetheabilitytodescribesourcesusingtheproceduralconstructif-then-else.Thisconstructisneededtodescribethebehaviourofsomesourcesoncertaininputs.Forexample,considertheYahooGeocoderfromsection7.3.1,whichtakesasinputatuplecontainingastreetname,number,andzipcode.Ifthegeocoderisunabletolocatethecorrespondingaddressinitsdatabase(becauseitdoesn’texist),insteadofreturningnotuples,itreturnsthecentroidofthezipcode.Describingsuchbehaviorisonlypossibleusingproceduralconstructs.

Acknowledgments

ThisresearchisbaseduponworksupportedinpartbytheDefenseAdvancedResearchProjectsAgency(DARPA),throughtheDepartmentoftheInterior,NBC,AcquisitionSer-vicesDivision,underContractNo.NBCHD030010.TheU.S.GovernmentisauthorizedtoreproduceanddistributereportsforGovernmentalpurposesnotwithstandinganycopy-rightannotationthereon.Theviewsandconclusionscontainedhereinarethoseoftheauthorsandshouldnotbeinterpretedasnecessarilyrepresentingtheofficialpoliciesorendorsements,eitherexpressedorimplied,ofanyoftheaboveorganizationsoranypersonconnectedwiththem.

References

Afrati,F.N.,Li,C.,&Mitra,P.(2004).Oncontainmentofconjunctivequeriesusingarith-meticcomparisions.In9thInternationalConferenceonExtendingDatabaseTechnol-ogy(EDBT2004)Heraklion-Crete,Greece.

47

Carman&Knoblock

Arens,Y.,Knoblock,C.A.,&Shen,W.-M.(1996).Queryreformulationfordynamic

informationintegration.JournalofIntelligentInformationSystems-SpecialIssueonIntelligentInformationIntegration,6(2/3),99–130.Berners-Lee,T.,Hendler,J.,&Lassila,O.(2001).Thesemanticweb.ScientificAmerican,

284(5),34–43.Bilenko,M.,Mooney,R.J.,Cohen,W.W.,Ravikumar,P.,&Fienberg,S.E.(2003).

Adaptivenamematchingininformationintegration..IEEEIntelligentSystems,18(5),16–23.Cameron-Jones,R.M.,&Quinlan,J.R.(1994).Efficienttop-downinductionoflogic

programs.SIGARTBulletin,5(1),33–42.Carman,M.J.,&Knoblock,C.A.(2007).Learningsemanticdescriptionsofwebinfor-mationsources.InProceedingsoftheTwentiethInternationalJointConferenceonArtificialIntelligence(IJCAI-07)Hyderabad,India.Chandra,A.K.,&Merlin,P.M.(1977).Optimalimplementationofconjunctivequeries

inrelationaldatabases.InProceedingsofthe9thACMSymposiumonTheoryofComputing(STOC),pp.77–90Boulder,Colorado.Dhamanka,R.,Lee,Y.,Doan,A.,Halevy,A.,&Domingos,P.(2004).imap:Discovering

complexsemanticmatchesbetweendatabaseschemas.InSIGMOD’04:Proceedingsofthe2004ACMSIGMODInternationalConferenceonManagementofData.Dong,X.,Halevy,A.Y.,Madhavan,J.,Nemes,E.,&Zhang,J.(2004).Simlaritysearch

forwebservices.InProceedingsofVLDB.Duschka,O.M.(1997).QueryPlanningandOptimizationinInformationIntegration.Ph.D.

thesis,DepartmentofComputerScience,StanfordUniversity.Garcia-Molina,H.,Hammer,J.,Ireland,K.,Papakonstantinou,Y.,Ullman,J.,&Widom,

J.(1995).Integratingandaccessingheterogeneousinformationsourcesintsimmis.InProceedingsoftheAAAISymposiumonInformationGathering,pp.61-64.Heß,A.,&Kushmerick,N.(2003).Learningtoattachsemanticmetadatatowebservices.

In2ndInternationalSemanticWebConference(ISWC).Heß,A.,&Kushmerick,N.(2004).Iterativeensembleclassificationforrelationaldata:

Acasestudyofsemanticwebservices.In15thEuropeanConferenceonMachineLearning(ECML2004)Pisa,Italy.Springer.Knoblock,C.A.,Minton,S.,Ambite,J.L.,Ashish,N.,Muslea,I.,Philpot,A.,&Tejada,

S.(2001).Theariadneapproachtoweb-basedinformationintegration.InternationalJournalofCooperativeInformationSystems,10(1-2),145–169.Lerman,K.,Plangprasopchok,A.,&Knoblock,C.A.(2006).Automaticallylabelingdata

usedbywebservices.InProceedingsofthe21stNationalConferenceonArtificialIntelligence(AAAI).

48

LearningSemanticDefinitionsofInformationSourcesontheInternet

Levy,A.Y.(2000).Logic-basedtechniquesindataintegration.InMinker,J.(Ed.),Logic-BasedArtificialIntelligence.KluwerPublishers.Levy,A.Y.,Mendelzon,A.O.,Sagiv,Y.,&Srivastava,D.(1995).Answeringqueriesusing

views.InProceedingsofthe14thACMSIGACT-SIGMOD-SIGARTSymposiumonPrinciplesofDatabaseSystems,pp.95–104SanJose,Calif.Markov,Z.,&Marinchev,I.(2000).Metric-basedinductivelearningusingsemanticheight

functions.InProceedingsofthe11thEuropeanConferenceonMachineLearning(ECML2000).Springer.Martin,D.,Paolucci,M.,McIlraith,S.,Burstein,M.,McDermott,D.,McGuinness,D.,

Parsia,B.,Payne,T.,Sabou,M.,Solanki,M.,Srinivasan,N.,&Sycara,K.(2004).Bringingsemanticstowebservices:Theowl-sapproach.InProceedingsoftheFirstInternationalWorkshoponSemanticWebServicesandWebProcessComposition(SWSWPC2004).Muggleton,S.,&Feng,C.(1990).Efficientinductionoflogicprograms.InProceedingsof

the1stConferenceonAlgorithmicLearningTheory.Muggleton,S.(1995).InverseentailmentandProgol.NewGenerationComputing,Special

issueonInductiveLogicProgramming,13(3-4),245–286.N´edellec,C.,Rouveirol,C.,Ad´e,H.,Bergadano,F.,&Tausend,B.(1996).Declarative

biasinILP.InDeRaedt,L.(Ed.),AdvancesinInductiveLogicProgramming,pp.82–103.IOSPress.Pazzani,M.J.,&Kibler,D.F.(1992).Theutilityofknowledgeininductivelearning.

MachineLearning,9,57–94.Perkowitz,M.,&Etzioni,O.(1995).Categorytranslation:Learningtounderstandinforma-tionontheinternet.InProceedingsoftheFourteenthInternationalJointConferenceonArtificialIntelligence(IJCAI-95).Perkowitz,M.,Doorenbos,R.B.,Etzioni,O.,&Weld,D.S.(1997).Learningtounder-standinformationontheinternet:Anexample-basedapproach.JournalofIntelligentInformationSystems,8(2),133–153.Pottinger,R.,&Halevy,A.Y.(2001).Minicon:Ascalablealgorithmforansweringqueries

usingviews.VLDBJournal,10(2-3).Quinlan,J.R.,&Cameron-Jones,R.M.(1993).FOIL:Amidtermreport.InMachine

Learning:ECML-93,EuropeanConferenceonMachineLearning,Proceedings,Vol.667,pp.3–20.Springer-Verlag.Rahm,E.,&Bernstein,P.(2001).Asurveyofapproachestoautomaticschemamatching.

VLDBJournal,10(4).Richards,B.L.,&Mooney,R.J.(1992).Learningrelationsbypathfinding.InNational

ConferenceonArtificialIntelligence,pp.50–55.

49

Carman&Knoblock

Roman,D.,Keller,U.,Lausen,H.,deBruijn,J.,Lara,R.,Stollberg,M.,Polleres,A.,

Feier,C.,Bussler,C.,&Fensel,D.(2005).Webservicemodelingontology.AppliedOntology,1(1),77–106.Ullman,J.D.(1989).PrinciplesofDatabaseandKnowledge-BaseSystems,Vol.2.Com-puterSciencePress,Rockville,Maryland.Weber,I.,Tausend,B.,&Stahl,I.(1995).Languageseriesrevisited:Thecomplexityof

hypothesisspacesinILP.InProceedingsofthe8thEuropeanConferenceonMachineLearning,Vol.912,pp.360–363.Springer-Verlag.Widom,J.(1995).Researchproblemsindatawarehousing.InCIKM’95:Proceedingsof

thefourthInternationalConferenceonInformationandKnowledgeManagement,pp.25–30.ACMPress.Wiederhold,G.(1992).Mediatorsinthearchitectureoffutureinformationsystems.Com-puter,25(3),38–49.Wiederhold,G.(Ed.).(1996).IntelligentIntegrationofInformation.KluwerAcademic

Publishers,BostonMA.Winkler,W.(1999).Thestateofrecordlinkageandcurrentresearchproblems.Tech.rep.,

StatisticalResearchDivision,U.S.BureauoftheCensus,Washington,DC.Yan,L.L.,Miller,R.J.,Haas,L.M.,&Fagin,R.(2001).Data-drivenunderstanding

andrefinementofschemamappings.InSIGMOD’01:Proceedingsofthe2001ACMSIGMODInternationalConferenceonManagementofdata.Zelle,J.M.,Thompson,C.A.,Califf,M.E.,&Mooney,R.J.(1995).Inducinglogic

programswithoutexplicitnegativeexamples.InProceedingsoftheFifthInternationalWorkshoponInductiveLogicProgramming.

50

因篇幅问题不能全部显示,请点此查看更多更全内容