smushing

I am collecting more notes on smushing, which will be a key factor in the global Semantic Web: it connects annotations that different people have made about the same resource.

A typical smushing algorithm would be:

  • take a large datastore DS that contains a set of triples Tset = {Ta, Tb, Tc, …}
  • iterate through the known InverseFunctionalProperties IFPset = {Ia, Ib, Ic, …}
  • for each InverseFunctionalProperty Iy that occurs in Tset as a predicate, check whether smushing is needed:
  • find all triples TxIy whose predicate is Iy and group them by object value, since subjects that share a value for Iy denote the same resource
  • find one triple Txc in TxIy whose subject is a grounding resource / canonical resource (see below)
  • take the subject Sx of Txc and aggregate the triples of all other subjects in TxIy onto Sx, i.e. change the subject of those triples to Sx
  • add owl:sameAs triples to connect all Subjects(TxIy) to Sx
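
The steps above can be sketched roughly as follows. This is only a minimal illustration, not the gnowsis implementation: it assumes triples are plain (subject, predicate, object) tuples of strings, the function name smush and the choose_canonical parameter are made up for this example, and the default choice of canonical subject (the lexicographically smallest one) is a placeholder. A union-find structure is used so that subjects linked through different IFP values still end up in one cluster.

```python
from collections import defaultdict

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def smush(triples, ifps, choose_canonical=min):
    """Smush a collection of (s, p, o) tuples on the given IFPs.

    `triples` must be re-iterable (list or set). Returns a pair
    (smushed_triples, same_as_triples), both as sets.
    """
    parent = {}  # union-find forest over subjects that carry an IFP

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # Subjects that share the same value for an IFP are the same resource.
    first_subject = {}
    for s, p, o in triples:
        if p in ifps:
            key = (p, o)
            if key in first_subject:
                union(first_subject[key], s)
            else:
                first_subject[key] = s
                find(s)  # register the subject

    # Collect the clusters and pick one canonical subject per cluster.
    clusters = defaultdict(set)
    for s in list(parent):
        clusters[find(s)].add(s)
    canonical = {}
    for members in clusters.values():
        target = choose_canonical(members)
        for member in members:
            canonical[member] = target

    # Rewrite subjects only, as in the steps above; objects are left alone.
    smushed = {(canonical.get(s, s), p, o) for s, p, o in triples}
    same_as = {(s, OWL_SAME_AS, c) for s, c in canonical.items() if s != c}
    return smushed, same_as
```

Passing a different choose_canonical function is where the heuristics for the canonical resource would plug in.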

The problem: when a set of triples TxIy has several subjects that should be the same, as defined by the IFP, you have to choose which subject is the “canonical” one that receives all the triples.

There are different approaches to find the canonical resource:

  • pick one at random
  • prefer the resource that is annotated in a special ontology (e.g. prefer SKOS concepts over foaf:Persons)
  • prefer the more public resource (googlefight; public URLs win over private URIs)
  • prefer the best-annotated resource (the resource with the most triples; beware, this makes single resources self-amplifying)
  • prefer the resource with the shortest / the longest URI
  • prefer named resources over anonymous resources (this is very important: you must not smush onto anonymous nodes)
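
Several of these preferences can be combined into one ranking. The sketch below is just one possible combination and rests on assumptions of mine: blank nodes are written with a "_:" prefix, "public" is approximated by an http(s) scheme, and the order of the criteria (public first, then most triples, then shortest URI) is arbitrary. The function name make_chooser is hypothetical.

```python
from collections import Counter


def make_chooser(triples):
    """Build a function that picks the canonical subject from a set of candidates."""
    # Number of triples per subject, used for "best annotated".
    counts = Counter(s for s, _, _ in triples)

    def choose(candidates):
        # Never smush onto an anonymous node (assumed "_:" prefix).
        named = [c for c in candidates if not c.startswith("_:")]
        if not named:
            raise ValueError("no named resource to smush onto")

        def rank(uri):
            is_public = uri.startswith(("http://", "https://"))
            # Lower tuples win: public first, then more triples,
            # then shorter URI, then the URI itself as a stable tie-breaker.
            return (not is_public, -counts[uri], len(uri), uri)

        return min(named, key=rank)

    return choose
```

Raising an error when only anonymous candidates exist is one way to honour the last rule; minting a fresh URI for the cluster would be another.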

Another question is what to do with the result of the smushing. There are different approaches:

  1. store the smushing in an extra graph
  2. delete the old triples, add the smushing
  3. add the smushing in addition to the old triples (tricky)

Each has obvious advantages and disadvantages. For gnowsis I would prefer (1), smushing into an extra graph, which is similar to (3) but separates the data.
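
To illustrate approach (1), here is a small sketch under my own assumptions: the store is modelled as a dict from graph name to a set of (s, p, o) tuples, and the mapping from duplicate subject to canonical subject is already known. The function name and the graph names are made up. Only the rewritten triples and the owl:sameAs links go into the extra graph; the source graph stays untouched.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def smush_into_extra_graph(store, source, target, canonical):
    """Write smushed triples for `source` into the separate graph `target`.

    `store` maps graph names to sets of (s, p, o) tuples; `canonical` maps
    each duplicate subject to its canonical subject.
    """
    derived = set()
    for s, p, o in store[source]:
        c = canonical.get(s, s)
        if c != s:
            derived.add((c, p, o))          # the triple, moved onto Sx
            derived.add((s, OWL_SAME_AS, c))  # link from the old subject
    # The source graph is not modified; the smushing lives in its own graph.
    store.setdefault(target, set()).update(derived)
    return derived
```

With this layout, undoing a smushing run amounts to dropping the extra graph, which is the main reason to keep the data separate.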

In gnowsis we have the problem of incremental smushing: we crawl thousands of emails per day and would like to smush the persons in the address fields, but only for the new messages.
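
One possible way to approach this, sketched under assumptions and not something gnowsis does today: keep an index from (IFP, value) to the canonical subject between crawls, and check only the triples of each new batch against it. The class name IncrementalSmusher is hypothetical, and the sketch is deliberately simplified: the first subject seen for a value stays canonical, and two already known persons are never merged after the fact.

```python
class IncrementalSmusher:
    """Smush new batches of (s, p, o) tuples against an index of known IFP values."""

    def __init__(self, ifps):
        self.ifps = set(ifps)
        self.index = {}  # (ifp, value) -> canonical subject, kept between batches

    def add_batch(self, triples):
        """Return (rewritten_triples, alias) for one batch of new triples."""
        triples = list(triples)
        alias = {}  # new subject -> canonical subject already in the index
        for s, p, o in triples:
            if p in self.ifps:
                # Register the value, or fetch the subject that already owns it.
                known = self.index.setdefault((p, o), s)
                if known != s:
                    alias[s] = known
        # Only the new triples are rewritten; earlier batches are not touched.
        rewritten = [(alias.get(s, s), p, o) for s, p, o in triples]
        return rewritten, alias
```

Each call costs time proportional to the size of the new batch, not to the whole datastore; the price is that the index has to be persisted alongside the data.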

I have also posted this algorithm on the ESW wiki, where you can comment on it.

new diploma thesis topics available

I updated my diploma thesis website, where I offer topics related to the Semantic Desktop as Master's theses.

http://www.dfki.uni-kl.de/~sauermann/projekte/index.html

It always takes so much time to write down my ideas and the things that will be needed. The next student starts a thesis in October, so there are still places open.

Here are the current topics. Well, I should add another one for GUI!

matatour 2005

We went sailing in Croatia (Croatia on Wikipedia)!

What is there to write? Never again a beach holiday; sailing a yacht is the ideal.

The key facts in brief:
Our ship was a Sun Odyssey 54: 54 feet long, about 17 meters. This model is the largest standard ship from the Jeanneau shipyard, and larger than all the other yachts you can usually charter. The boat is called “Torro del Mar” and we chartered it through ecker yachting. Interestingly, you can buy our boat here.

The whole trip was organized by Matthias, for which we are all very grateful; everything worked out perfectly. The tactic of taking the best boat there is paid off completely. The yacht was very comfortable but also sporty.

We set off from Trogir on 2 July 2005 and sailed to several islands from there, returning to Trogir on Friday, 8 July 2005.

A typical day went roughly like this:

  • get up
  • have breakfast below deck or on deck
  • leave the harbor
  • set the sails
  • switch on the autopilot
  • sunbathe and drink beer
  • sail somewhere
  • barbecue, swim, eat out, go sightseeing
  • moor in a harbor or drop anchor
  • drink & go to sleep