PhD step1: integrating data into the semantic desktop

I will be blogging about my Semantic Web PhD for the next months, until I am finished. First, you will learn what I did and do, and perhaps you can copy something for your own thesis or point me to information I missed. Critique, positive and negative, is warmly welcome.

The first part of my dissertation will be about integrating data into the semantic desktop. The problem at hand is that we face data from different sources (files, e-mail, websites) and with different formats (PDF, e-mails, JPG, perhaps some RDF in a FOAF file, or an iCalendar file), and these can change frequently or never. Faced with all this lovely data that can be of use for the user, we are eager to represent it as RDF. There are two main problems when transforming data to RDF:

  1. find an RDF representation for the data (an RDF(S) or OWL vocabulary)
  2. find a URI identifying the data

I have experienced that the second question is far harder to solve. While it is quite easy to find an RDF(S) vocabulary for e-mails, MP3s, or people (and if you don’t find one on schemaweb.info, btw the only website I never had to bookmark because it's so obvious, you make up the vocab yourself), finding the correct URI to identify the resource can be a longer task.
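Once both problems are solved, emitting the triples is the easy part. Here is a minimal sketch with Jena; the mp3 vocabulary namespace and the file URI are invented for the example:

import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.rdf.model.Property;
import com.hp.hpl.jena.rdf.model.Resource;

public class DescribeMp3 {
    public static void main(String[] args) {
        // problem 1: the vocabulary (invented namespace; take a real one from schemaweb.info if it exists)
        String MP3 = "http://example.org/mp3#";
        Model model = ModelFactory.createDefaultModel();
        Property artist = model.createProperty(MP3, "artist");
        Property title = model.createProperty(MP3, "title");

        // problem 2: which URI identifies this file? here: a plain file URL
        Resource song = model.createResource("file://c:/documents/leobard/music/numb.mp3");
        song.addProperty(artist, "U2");
        song.addProperty(title, "Numb");

        model.write(System.out, "N-TRIPLE");
    }
}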

The trickiest part is identifying files or things crawled from a bigger database like the Thunderbird address book. For files, there are some possibilities; all of them have been used by me or others.

You can skip this section about typical URIs for files; it's just an example of what implications the URI may have.

  • file://c:/document/leobard/documents/myfile.txt This is the easiest way, because it is conformant with all other desktop apps. Firefox will open this URL, Java knows it, it's good. The problems are: what if you move the file? You lose the annotations, which can be fixed. Second, the URI is not world-unique: two people can have the same file at the same place. Also, it is not possible to use this URI in the Semantic Web at large, because the server part is missing.
  • http://desktop.leobard.net/~leobard/data/myfile.txt Assume you have an HTTP daemon running on your machine, like Apple's OS X does, and assume you have the domain name leobard.net and register your desktop at the DNS entry desktop.leobard.net; then you could host your files at this address. Using access rights, you could block all access to the files, but still open some for friends. Great. But people usually don't run HTTP servers on their machines, nor do they own namespaces, nor are their desktops reachable on public IP addresses; rather, they are behind NAT.
  • urn:file-id:243234-234234-2342342-234 Semantic Web researchers love this one. You use a hash or something else to identify the file, and then have a mapping from the URI to the real location. Systems like kspaces.net used this scheme. It is OK for identifying files, but it loses all the nicety of URLs, which can actually locate the file as well.

So, after this excursion we know that it is not straightforward to identify files with a URI. We tried the first two approaches, but I am not happy with them; perhaps I will blog the latest findings regarding URIs some time.
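To make the third option concrete, here is a minimal sketch of how such a content-hash URN could be minted; the urn:file-id scheme is just the example from above, and the class is made up:

import java.io.FileInputStream;
import java.io.InputStream;
import java.security.MessageDigest;

public class FileUriMinter {
    // mint a location-independent URN from the file content (hypothetical scheme)
    public static String mintUrn(String path) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        InputStream in = new FileInputStream(path);
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) > 0) {
            md.update(buf, 0, n);
        }
        in.close();
        StringBuffer urn = new StringBuffer("urn:file-id:");
        byte[] digest = md.digest();
        for (int i = 0; i < digest.length; i++) {
            // two hex digits per byte
            urn.append(Integer.toHexString((digest[i] & 0xff) | 0x100).substring(1));
        }
        return urn.toString();
    }
}

Note that a content hash changes whenever the file is edited, so such a system still needs the mapping from URN to current location, plus an update strategy.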

On with metadata integration. Four years ago I needed a way to extract metadata from MP3s, Microsoft Outlook and other files, so I created something called “File Adapters”. They worked very elegantly: you post a query asking for the mp3:title of a given file URI and get the answer “Numb”. This was done by analysing the subject URI (file://…) and then invoking the right adapter. The adapter looked at the predicate and extracted only that, very neat. BUT after two years, around 2004, I realised that I need an index of all data anyway to do cool SPARQL queries, because the question “?x mp3:artist ‘U2′” was not possible: for such queries, you need a central index like Google Desktop or Mac’s Quicksilver (ahh, I mean Spotlight) has. For this, the adapters are still usable, because they can extract the triples bit by bit. But if you crawl anyway, you can simplify the whole thing drastically. That is what we found out the hard way, by implementing it and seeing that the interested students who helped out had many problems with the complicated adapter stuff, but were quite quick at writing crawlers.

We have written this up in a paper: “Leo Sauermann, Sven Schwarz: Gnowsis Adapter Framework: Treating Structured Data Sources as Virtual RDF Graphs. In Proceedings of the ISWC 2005.” (bibtex here). Shortly after finishing this paper (May 2005?), I came to the conclusion that writing these crawlers is a problem that many other people have, so I asked the people from x-friend if they would want to do this together with me, but they didn’t answer. I then contacted the Aduna people, who make Autofocus, and, even better for us, they agreed to cooperate on writing adapters and suggested calling the project Aperture. We looked at what we did before and then merged our approaches, basically taking the code Aduna had before and putting it into Aperture.
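To illustrate the adapter idea, here is a minimal sketch; the interface and class names are made up and are not the real gnowsis API:

import java.util.HashMap;
import java.util.Map;

// sketch of the “File Adapter” dispatch idea (hypothetical names)
interface Adapter {
    // extract just the value of one predicate for one resource
    String extract(String subjectUri, String predicateUri);
}

class AdapterRegistry {
    private final Map<String, Adapter> adaptersByPrefix = new HashMap<String, Adapter>();

    void register(String uriPrefix, Adapter adapter) {
        adaptersByPrefix.put(uriPrefix, adapter);
    }

    // analyse the subject URI and invoke the right adapter for it
    String query(String subjectUri, String predicateUri) {
        for (Map.Entry<String, Adapter> e : adaptersByPrefix.entrySet()) {
            if (subjectUri.startsWith(e.getKey())) {
                return e.getValue().extract(subjectUri, predicateUri);
            }
        }
        return null; // no adapter is responsible for this URI
    }
}

The inverse question, “which subjects have mp3:artist ‘U2’?”, cannot be answered this way without calling every adapter on every file, which is exactly why the central index wins.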

What we have now is the result of an experiment which showed me that accessing the data live was slower and more complicated than using an index, and that the easiest way to fill the index is crawling.

The problem that is still unsolved is that the Semantic Web is not designed to be crawled. It should consist of distributed information sources that are accessed through technologies like SPARQL. So, at some point in the future we will have to rethink what information should be crawled and what not, because it is already available as a SPARQL endpoint. And then it is going to be tricky to distribute the SPARQL queries amongst many such endpoints, but that will surely be solved by our Semantic Web services friends.
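For flavour, here is a hedged sketch of what asking one such endpoint could look like over the SPARQL protocol; the endpoint URL and the mp3 vocabulary namespace are invented for the example:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLEncoder;

public class SparqlEndpointClient {
    public static void main(String[] args) throws Exception {
        // hypothetical endpoint and namespace, for illustration only
        String endpoint = "http://desktop.leobard.net/sparql";
        String query =
            "PREFIX mp3: <http://example.org/mp3#> " +
            "SELECT ?x WHERE { ?x mp3:artist \"U2\" }";

        // the SPARQL protocol passes the query as an HTTP parameter
        URL url = new URL(endpoint + "?query=" + URLEncoder.encode(query, "UTF-8"));
        URLConnection conn = url.openConnection();
        conn.setRequestProperty("Accept", "application/sparql-results+xml");

        BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(line); // raw SPARQL result XML
        }
        in.close();
    }
}

Distributing such a query over many endpoints, and knowing which endpoint can answer what, is the part that still has to be solved.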

Doing a dissertation on Semantic Desktop at DFKI

I am in the midst of writing my dissertation about “Information representation on the Semantic Desktop”, something I have been doing every day for three years and which I really like doing.

Thomas Roth-Berghofer has written a nice little story on how he sees doctorate studies at our research company. It's written in German.

I stumbled into this science business when I found no other way to continue work on gnowsis, and after two years of doing it, I got somehow used to it and learned the “way of the force”, at least a little. Still, the biggest problem is writing. As you may notice, I don’t care much about grammar or cool wording. So, it's a long endeavour to do this, and to do real science is even harder. Coming more from the systems engineering side (my diploma was done at the distributed systems group in Vienna), my work focuses on the engineering science: how can we build the Semantic Web? Today I am meeting my doctorate “supervisor” and colleague Thomas Roth-Berghofer to check my status.

I promise I will blog more frequently about my scientific work from now on, for example about my various publications.

SemWeb introductory websites by Richard Cyganiak

Cygri posted some websites that show how the Semantic Web may work.

He collects them using the del.icio.us tag “semwebintro”, which I copy from him, so you will find a list with more contributions (maybe also yours?) here:

del.icio.us/tag/semwebintro

The material is practically oriented and is not bloated by theoretical papers on what wishful thingies you might do sometime in the future, given a hypothetical Semantic Web. I like it as it is: showcases, demos, FAQ, TimBL.

hacking for nepomuk: getting SOAP to run in Eclipse’s OSGI

We are building the desktop semantic web server for Nepomuk at the moment, and I had a look at how to use OSGI to start Java services as SOAP services.

In the end, we will start an RDF database, some ontology matchers, text indexing, data crawling (Aperture and Beagle++) and many other things using this code, so wait a little and you will get a really cool Semantic Desktop platform. If everything works fine, it should be cooler than gnowsis 😉

The code will be open-source in December or January, but if you are really interested, I may bundle this as a zipfile for you (it's not Nepomuk relevant, it's only a hassle with Eclipse).

UPDATE (12.10.2006): The odyssey below is really odd. Today Christopher Tuot informed me where the compiled bundles of Knopflerfish are: they are in knopflerfish.org\osgi\jars\axis-osgi\axis-osgi_all-0.1.0.jar. I saw them but didn’t realize they were bundles. So if I had just imported this bundle into the Eclipse plugins, it would probably have worked within one hour. Although I still don’t know how to add these precompiled bundles to a developing project like ours.

Here is my odyssey:

Running SOAP services from OSGI

_The plan was to run SOAP services from OSGI, as announced:_

  • use Tomcat as webserver (an HTTP server) on port X
  • wrap Tomcat as OSGI service (done already by Eclipse)
  • start services as web-applications (done by Tomcat)
  • let components (example: comp-rdfdatabase) start inside this VM and register objects as services: the component registers a web-service (SOAP server object) using AXIS

Mikhail has already submitted some services for points one and two; this package might help me:
https://dev.nepomuk.semanticdesktop.org/browser/trunk/java/org.ungoverned.osgi.bundle.http

_Result: It works, checkout the newest NEPOMUK from https://dev.nepomuk.semanticdesktop.org/repos/trunk/java/_

It took me exactly three hours; here are the steps I took.

Step: read the documentation provided by Christopher Tuot.

Good to read; it sounds exactly like what we need:

  • publish SOAP services
  • use SOAP services as client
  • create WSDL files

Step: checkout the SVN sources using eclipse from the SVN

Ha, the SVN is exactly where the doc came from; that's easy:

I decide to check out the whole SOAP branch somewhere on my disk, outside Eclipse. I go for the whole package, because there is an ANT file inside the parent folder of the sub-packages, which seems to indicate that they all belong together.

svn checkout https://www.knopflerfish.org/svn/knopflerfish.org/trunk/osgi/bundles_opt/soap

It's 5.81 MB, 195 files; that's nothing.

Step: Legal Check – is the license ok?

At this point, I check if Knopflerfish has a compatible license. They use BSD, OK, no problem here.

Step: look inside axis-osgi

  • this package contains AXIS and WSDL stuff, very small, neat.
  • the bundle.manifest doesn’t require that many Knopflerfish things. These two imports sound bad:
    • Import-Package: org.knopflerfish.service.log, org.knopflerfish.service.axis

Step: import to eclipse/Nepomuk

I try to run the build.xml in the main soap dir. It fails; it needs other Knopflerfish dependencies like “commons-logging”.

  • I decide to ignore this and try to import it myself using Eclipse.
  • I think again and check out commons-logging from Knopflerfish
  • 68 KB, 29 files. Nothing.
  • OK, now I have to stop: “java.io.FileNotFoundException: C:\ant\bundlebuild_include.xml (The system cannot find the path specified)”. The thingy seems to have weird dependencies. But wait, perhaps it just needs all of Knopflerfish
  • I download the latest knopflerfish source distro, to get this running quicker (doing this via SVN may cause heavy server load for these guys, I avoid it)
  • At this moment, I notice I downloaded OSGI Knopflerfish already on 27.10.2005, when Mikhail Kotelnikov and I first talked about OSGI 🙂
    • I delete my old knopflerfish 1.3.4 and replace it with the newer 2.0
  • I notice they did not include the optional bundles (which SOAP is in), and I move my previous SVN checkout to the right place: knopflerfish.org\osgi\bundles_opt
  • ANT still fails: BCEL is missing. They said I needed it, but I never trust them. Downloading BCEL.
  • OK – knopflerfish compiles: knopflerfish.org\osgi\bundles_opt\soap>ant
    • BUILD SUCCESSFUL
    • Total time: 19 seconds
  • The generated output is in knopflerfish.org\osgi\out and knopflerfish.org\osgi\out\jars

PROBLEMS: OK, after all this rubble, I go for the graphical user interface and just start Knopflerfish to get the HTTP services running. This works, and I can install HTTP and AXIS with some clicks on the bundles there: knopflerfish.org\osgi>java -jar framework.jar

Step: rethink how to get Knopflerfish stuff working in Eclipse

All that hassle tells me: someone must have done this before.

I decide to make new plugins for Eclipse OSGI, using the Eclipse IDE, and copy the sources and manifest files from Knopflerfish.

At this point, I realise that the Knopflerfish people use Eclipse to code Knopflerfish, so I install the Eclipse IDE plugin to see what it can do for me:

Revelation: all this Knopflerfish testing was useless

but I learned a lot.

I found no straightforward way to compile Knopflerfish into plugins that can be used conveniently from inside Eclipse, so I just take the source code of the plugins and make new Eclipse plugins from that, copying the manifest files into the Eclipse manifests.

  • I copy the code now
  • this took me about 30 minutes, but it surely was quicker than all of the above.
  • if Knopflerfish has great new stuff, we cannot easily put it into Nepomuk until someone finds a way to bundle their bundles as plugins for Equinox/Eclipse. But 30 minutes of effort is affordable.

Step: finishing this and starting a SOAP service

OK, by now I have these OSGI bundles in my Eclipse:

  • the SOAP service in org.knopflerfish.bundle.axis-soap
  • the log service in org.knopflerfish.log
  • the rdfrepository in org.semanticdesktop.nepomuk.comp.rdfrepository

All of them seem to work; I only get one NullPointerException when starting the axis-soap:

java.lang.NullPointerException
	at org.knopflerfish.bundle.axis.Activator.setupAxis(Activator.java:109)

This is easy: it wants to find the AXIS configuration in the resource resources/axis/server-config.wsdd

I add this to the build.properties of the soap bundle, using Eclipse's classy graphical editor. Rocks.

bin.includes = META-INF/,\
...
resources/axis/

It seems the Eclipse build works differently from the Knopflerfish build process, so I move the contents of resources to the root of the plug-in; most importantly, resources/axis/… is now axis/…

  • this makes the product not start; OSGI does not even show the console. We had this before. I delete the “run…” config for the product and start again; OK, it starts.
  • now it doesn’t start because commons-logging is missing. I notice that we probably have the commons logger already in Mikhail’s logging plugin and faithfully delete this plugin dependency from the manifest.
  • start/stop

Step: Commit all to SVN

  • I commit the knopflerfish wrapped packages to our Nepomuk SVN. We can fix this later ™
  • I also updated the server product and the server app to start SOAP now

_DONE: We can start Java objects as SOAP services_
Go here to see the started SOAP services:

Go here to see the automatically created WSDL files for our example RDF repository component:

Summary

All in all this took three hours. It was hard, but not impossible. I would reckon it proves that our SOAP-in-OSGI approach can rock so hard the keyboards will fly. No guarantee that everything will work out, but this is the only code you have to write to start a SOAP service now:


// inside your Bundle Activator
import java.util.Hashtable;
import org.osgi.framework.BundleActivator;
import org.osgi.framework.BundleContext;
import org.osgi.framework.ServiceReference;

public class Activator implements BundleActivator {
    private BundleContext context;

    public void start(BundleContext context) throws Exception {
        this.context = context;
        // assume RDFRepositoryImpl is your Java object that should be accessible via SOAP
        ServiceReference srRA = registerObject("remoteFW", new RDFRepositoryImpl(context));
    }

    public void stop(BundleContext context) throws Exception {}

    // the axis-soap bundle exposes every service that carries a SOAP.service.name property
    private ServiceReference registerObject(String name, Object obj) {
        Hashtable ht = new Hashtable();
        ht.put("SOAP.service.name", name);
        return context.registerService(obj.getClass().getName(), obj, ht).getReference();
    }
}
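For completeness, here is a sketch of the client side using the plain AXIS 1 call API. The endpoint port and path are assumptions based on typical AXIS deployments, and getSize is a made-up method name; use one of your repository's real methods:

import java.net.URL;
import org.apache.axis.client.Call;
import org.apache.axis.client.Service;

public class SoapClientSketch {
    public static void main(String[] args) throws Exception {
        // the axis-soap bundle publishes the service under its SOAP.service.name
        URL endpoint = new URL("http://localhost:8080/axis/services/remoteFW");

        Service service = new Service();
        Call call = (Call) service.createCall();
        call.setTargetEndpointAddress(endpoint);
        call.setOperationName("getSize"); // hypothetical operation

        Object result = call.invoke(new Object[] {});
        System.out.println("result: " + result);
    }
}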

Java is great …

I wondered how to make the Semantic Web fly, so I wandered around looking for possible deployments of Java-based semweb stuff (many semweb apps are written in Java). While reading about webhosting and possibilities to host services like gnowsis as a web application, stumbling around in the world of Tomcat web-hosters, a discussion with a nice line caught my eye:

As someone said, “java is great for engineering next generation solutions to enable maximization of developer income by means of enhanced buzzword use”.

Point taken. Naively I wondered who that someone might be and googled for the phrase “java is …”, resulting in an estimated 3,220,000 buzzword bullshitters. Oh, I forgot to quote the quote; searching for “java is …” with quotes, it boils down to exactly one someone who said it.

What can you learn? As many people buzzword-pump-fill their web writings with the above terms, they may be highly paid Java people. I can only say:

Java is great for engineering next generation solutions to enable maximization of developer income by means of enhanced buzzword use.

Five issues to think about before you try to do real AI

Gunnar Grimnes pointed me to this article by Roger C. Schank.

Note the best part here, “Five issues to think about before you try to do real AI”, which is related to my other rants about researchers who should dig into engineering:

1. Real problems are needed for prototyping. We cannot keep working in toy domains. Real problems identify real users with real needs. This changes what the interactions with the program will be considerably and must be part of the original design.

2. Real knowledge that real domain experts have must be found and stored. This does not mean interviewing them and asking for the rules that they use and ignoring everything else that fails to fit. Real experts have real experiences, contradictory viewpoints, exceptions, confusions, and the ability to have an intuitive feel for a problem. Getting at these issues is critical. It is possible to build interesting systems that do not know what they know. Expertise can be captured in video, stored and indexed in a sound way, and retrieved without having to fully represent the content of that expertise (e.g., the ASK TOM system (Schank, Ferguson, Birnbaum, Barger, Greising, 1991). Such a system would be full of AI ideas, interesting to interact with, and not wholly intelligent but a far sight better than systems that did not have such knowledge available.

3. Software engineering is harder than you think. I can’t emphasize strongly enough how true this is. AI had better deal with the problem.

4. Everyone wants to do research. One serious problem in AI these days is that we keep producing researchers instead of builders. Every new Ph.D. recipient, it seems, wants to continue to work on some obscure small problem whose solution will benefit some mythical program that no one will ever write. We are in danger of creating a generation of computationally sophisticated philosophers. They will have all the usefulness and employability of philosophers as well.

5. All that matters is tool building. This may seem like an odd statement considering my comments about the expert system shell game. However, ultimately we will not be able to build each new AI system from scratch. When we start to build useful systems the second one should be easier to build than the first, and we should be able to train non-AI experts to build them. This doesn’t mean that these tools will allow everyone to do AI on their personal computers. It does mean that certain standard architectures should evolve for capturing and finding knowledge. From that point of view the shell game people were right, they just put the wrong stuff in the shell. The shell should have had expert knowledge about various domains in it, available to make the next system in that domain that much easier to build.

Aduna goes open source

Aduna, the company that is known for guided exploration of company information, and of course for programming the Sesame server at openrdf.org, and for Autofocus, and and and … they have switched their business model to open source.

Read their statement on open source.

They have also changed their name to “Aduna-Software”, putting the focus on what they do (software).

This continues what I have experienced working together with them on Aperture, the framework I have been using in the latest gnowsis system to crawl data from various file formats and data sources. We developed it together; they use it for the new version of Autofocus.

Aperture has raised interest; Henry Story from Sun blogged about how important it is to have projects like Aperture to make the semantic web run.

A new website was created to separate open-source services from company issues. This seems to be their open-source portal: aduna-software.org

and look, they use their own tech to crawl their own website and have a search interface.

So, best wishes to Aduna-Software from my side, and I am looking forward to the release of Sesame2.

quote: semantic web based research isn’t working

Zack Rosen blogs about why RDF research sucks and has written a mail to the Simile mailing list for comments. No comments from me, but general agreement on his suggestions for a way out:

So what can we do about it?

1. Researchers need to stop thinking of themselves as researchers and start thinking of themselves as implementors.
2. Research institutes need to join forces with emerging businesses looking to adopt semantic technology. This breaks the current model of business / research institute collaboration since startups do not have money to contribute to fund research, but tough noogies.
3. Researchers need to build their tools in real-world development environments, i.e. as modules for LAMP web-publishing tools such as Drupal and WordPress. They need to find more organizational partners to deploy their solutions. They need to do something other than build widgets.

joining the Nepoverse

Getting up and reading news this morning, still thinking about yesterday's ramblings about how we could benefit from our ideas, TRB's greetings reached me at the right moment:

Welcome to the Nepoverse, by Thomas Roth-Berghofer
Yesterday morning I woke up with this greeting on my mind, a greeting to all those who are interested in the goal of the Nepomuk project: the Social Semantic Desktop. And it got even better: the Nepoverse did not exist in the Googleverse. Until now!
As you may know, we–the Nepomukians–strive for providing you with new tools for better working with (your) knowledge. We want to change the way we, as knowledge workers, live in and with the digital world, not only by providing cool Nepomuked applications and a feature-rich toolbox, but by building a community around the Social Semantic Desktop. Thus, we are shaping our own universe, don’t we?

Yes Thomas, you are right. We want that, and I need that. I don't feel exactly like it is “our own universe”, but would put it more like Stefan Decker often tells the story: Nepomuk is the seed of a community; it starts at one point and gets bigger in circles, bigger, bigger, circling, …

As I said yesterday night:
Our discipline is a crossover, we need results from artificial intelligence, web 2.0, usability, personalization, databases, data integration, software engineering . . .

And what I should have said then was: we got Nepomuk. There are many people in this project that make exactly this crossover possible, through their different characters and backgrounds.

If you now wonder what we are all blogging about, see the