Ever tried to convert data into RDF? Extract something from iCalendar or an MP3 file and then use a bit of RDF? Have it all in a graph? Then you may be interested in how to choose your weapons wisely: if you want a fast and easy way to do RDF integration, follow Patrick Stickler and his URIQA ideas.
Today I had another day of fighting with gnowsis, my desktop integration framework. The task was to extract data from MS-Outlook on demand. Both the request format and the output format are RDF, and I used iCalendar/RDF as the ontology.
Gnowsis wraps all read access to Outlook in a Jena Model: Outlook is represented as a Jena Model, and each item in Outlook gets a URL and is an RDF resource. So, for example, a query like
SELECT ?x, ?y WHERE (<rdfp://leo.gnowsis.com/msoutlook/appointment/
00000000B2CDC30BFF2EED4ABA9C61436A07FE3384002000> ?x ?y)
returns RDF/XML like this: QueryResult.xml (xml, 1 KB)
As you may note, there are some “in-between resources” where you would usually find anonymous resources: the properties dtstart and dtend have as their objects the (normally anonymous) resources “…#Start” and “…#End”.
This helps with Jena internals. My Jena model is dynamic and has no storage backend: whenever a query hits the model, it searches for resources and creates new triples. Anonymous nodes in between would break this model. If I got an anonymous object for dtstart, I could not easily send a follow-up request for the cal:dateTime value inside it.
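To make the point about named in-between resources concrete, here is a minimal Python sketch. It is not gnowsis code (gnowsis is Java on top of Jena); the record layout, the example URL, the sample values and the function name `triples_for` are all assumptions made up for illustration.

```python
# Hypothetical stand-in for one Outlook appointment; not real gnowsis data.
APPOINTMENTS = {
    "rdfp://example.invalid/msoutlook/appointment/42": {
        "summary": "Project meeting",
        "start": "2004-03-01T10:00:00",
        "end": "2004-03-01T11:00:00",
    },
}


def triples_for(subject):
    """Create the triples for one subject on demand; nothing is stored."""
    base, _, fragment = subject.partition("#")
    record = APPOINTMENTS.get(base)
    if record is None:
        return []
    if fragment == "":
        # The appointment itself points at named in-between resources.
        return [
            (subject, "cal:summary", record["summary"]),
            (subject, "cal:dtstart", base + "#Start"),
            (subject, "cal:dtend", base + "#End"),
        ]
    if fragment == "Start":
        return [(subject, "cal:dateTime", record["start"])]
    if fragment == "End":
        return [(subject, "cal:dateTime", record["end"])]
    return []
```

Because the object of dtstart has a name, a second call such as `triples_for(appointment_url + "#Start")` can ask for its cal:dateTime. With an anonymous node there would be no identifier to pass into that second call.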
At this point, consider reading some example RDF/iCal. If you don’t have any clue what I am talking about, check out TestTermin.rdf. It is a longer version of the same VEvent entry.
So how can I get the anonymous nodes, and will it be efficient?
This is a flaw in gnowsis: each triple is generated by a Java class that represents the property in the triple. Triples containing the “summary” property are created by a corresponding “summary” Java class, which inspects the subject and finds the matching object value (and vice versa).
So gnowsis adapts every property with a Java class. This gets too big when an ontology uses many properties and classes. iCalendar is already too much for me to program.
So enter Patrick Stickler and his URIQA
Most applications need, on the client side, only data about one or a few resources. You want to see an email, a person, an event. Then you may want to write an email, add the event to your calendar or do similar stuff.
To extract the needed data from a Jena Model you will, with RDQL, need many queries even for a single resource, because RDQL has no optional joins. This is a problem, and other people have it too. Read, for example, The Veudas Announcement.
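The effect of missing optional joins can be shown with a small Python sketch. This is not RDQL or Jena, only a toy matcher over a list of tuples; the `ex:` identifiers, the sample triples and the `cal:location` property are assumptions for illustration.

```python
# Hypothetical triples; note the appointment has no cal:location.
TRIPLES = [
    ("ex:appointment1", "cal:summary", "Project meeting"),
    ("ex:appointment1", "cal:dtstart", "ex:appointment1#Start"),
]


def objects(triples, subject, predicate):
    """One single-pattern query: all objects for (subject, predicate, ?o)."""
    return [o for (s, p, o) in triples if s == subject and p == predicate]


def conjunctive(triples, subject, predicates):
    """Every pattern must match, as in a query language without optional joins."""
    row = {}
    for predicate in predicates:
        found = objects(triples, subject, predicate)
        if not found:
            return None  # one missing property loses the whole result row
        row[predicate] = found[0]
    return row


wanted = ["cal:summary", "cal:dtstart", "cal:location"]
# The combined query returns nothing because cal:location is absent.
print(conjunctive(TRIPLES, "ex:appointment1", wanted))
# Workaround: one query per property, keeping whatever exists.
print({p: objects(TRIPLES, "ex:appointment1", p) for p in wanted})
```

The workaround in the last line is exactly the "many queries for a single resource" pattern: the client cannot know in advance which properties exist, so it has to ask for each one separately.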
The solution is to get bigger chunks of RDF in one request.
What is an RDF chunk and how do I get it?
When you want to work with RDF data, you normally need the data about one resource (ideally, the resource is identified by a URL or is otherwise downloadable). If the resource points to other resources via RDF triples, you may want to load those other resources, too.
For example, if I have an appointment with JohnDoe, I may start at the appointment and then load the resource containing data about JohnDoe.
So when I load the chunks, how would I load them and what should they contain?
Most people will want “RDF-Subgraphs”, and I greatly prefer them over “variable-bound result sets” (like RDQL gives you). You get such a subgraph from your RDF server via a protocol. One possible protocol is URIQA; you can also do it with Joseki or Sesame.
And what should be contained in the RDF-Subgraph? Patrick Stickler has an answer that matches my wishes: he defined the Concise Bounded Description.
Everything that I need immediately, with the option to get more if I want.
A concise bounded description of a resource is a body of knowledge about a named resource which does not include any explicit knowledge about any other named resource.
More about the Concise Bounded Description is on the URIQA page.
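As a rough illustration of that definition, the following Python sketch collects a description from a plain list of triples: it takes every statement whose subject is the resource and follows objects only when they are anonymous nodes. It is a simplified reading of the quoted sentence, not Patrick Stickler's reference algorithm, and it leaves out anything else the full definition may require. The `_:` prefix for anonymous nodes, the `ex:` names and the sample graph are assumptions.

```python
def is_blank(node):
    """Assumed convention for this sketch: anonymous node labels start with '_:'."""
    return isinstance(node, str) and node.startswith("_:")


def concise_bounded_description(triples, resource):
    """Collect the statements about `resource`, following anonymous nodes only."""
    description = []
    seen = set()
    queue = [resource]
    while queue:
        subject = queue.pop()
        if subject in seen:
            continue
        seen.add(subject)
        for triple in triples:
            s, _p, o = triple
            if s != subject:
                continue
            description.append(triple)
            if is_blank(o):
                queue.append(o)  # anonymous: its statements belong to this chunk
    return description


# Hypothetical sample graph: an appointment that points at a named person.
GRAPH = [
    ("ex:appointment1", "cal:summary", "Project meeting"),
    ("ex:appointment1", "cal:dtstart", "_:start"),
    ("_:start", "cal:dateTime", "2004-03-01T10:00:00"),
    ("ex:appointment1", "cal:attendee", "ex:JohnDoe"),
    ("ex:JohnDoe", "foaf:name", "John Doe"),
]

for statement in concise_bounded_description(GRAPH, "ex:appointment1"):
    print(statement)
```

The chunk for the appointment contains the link to ex:JohnDoe but not John Doe's name. That is the "option to get more if I want": the client asks for the description of ex:JohnDoe in a second request.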
So Concise Bounded Descriptions are the kind of chunks we would like to retrieve from any RDF host or RDF publishing application. Why this is so good, I will tell you now.
Using Chunks instead of accessing single triples has several implications:
- Easier to write extraction algorithms
- Much faster and more efficient
- Addressable chunks
- Access is restricted to chunks; single triples cannot be retrieved
Now for a deeper look at the implications:
1 Easy Extraction algorithms
For gnowsis, I needed to wrap every single rdfs:Property and rdfs:Class. This is good for some applications but not for all. Complicated tasks are better done by “hand-written” extractors, programs like the many Perl/Python scripts that convert stuff to RDF. You can find many examples written by TimBL, DanC and others at SWAP and some in cal-space, for example fromIcal.py.
Most of these converters and tools supply single resources: they convert a single file or something similar.
It is easier to write RDF integration for “chunks” of RDF; the many existing adapters, and my own experience, bear this out.
2 Much Faster and Efficient
I have written adapters to extract RDF from the filesystem, MP3 files, iCal and Outlook. I have gone so far as to write a specific Java class for every property and class around. This gives me a big overhead. Especially when you use ActiveX or SOAP bridges to communicate with your data sources, you will have a problem.
So it is better to write an extractor very “near” to the source data. Imagine writing an MS-Outlook data extractor in Java or as a C++ DLL: which will be faster? Surely the C++ DLL.
But once you have written your adapter as a DLL, how can the DLL understand queries when they come from a surrounding framework like Jena or Sesame?
So the best is to say: DLL, give me the Concise Bounded Description of Resource X.
Then the DLL can extract all the stuff and return a chunk of RDF/XML. And if you are really clever, you have built it in a way that lets you pass the XML directly to a calling client.
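The "one call in, one serialized chunk out" interface could look roughly like this. The sketch is in Python rather than a C++ DLL, and it emits N-Triples instead of RDF/XML to stay short; the `describe` function, the record store, the example URL and the `CAL` namespace are placeholders invented for illustration.

```python
CAL = "http://example.invalid/cal#"  # placeholder namespace, not the real iCalendar one

# Hypothetical source data that the extractor sits right next to.
RECORDS = {
    "rdfp://example.invalid/msoutlook/appointment/42": {
        "summary": 'Review "chunks" draft',
        "location": "Room 1",
    },
}


def escape_literal(text):
    """Escape backslashes, quotes and line breaks inside a literal."""
    return (text.replace("\\", "\\\\").replace('"', '\\"')
                .replace("\n", "\\n").replace("\r", "\\r"))


def describe(uri):
    """Return the whole description of `uri` as one N-Triples string."""
    record = RECORDS.get(uri, {})
    lines = [
        '<%s> <%s%s> "%s" .' % (uri, CAL, name, escape_literal(value))
        for name, value in sorted(record.items())
    ]
    return "\n".join(lines)


print(describe("rdfp://example.invalid/msoutlook/appointment/42"))
```

The caller never sees a query language: it hands over one identifier and gets back text it can forward to the client unchanged.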
If you are implementing an Enterprise Integration Server and need RDF data (read-only), this can be a very neat way to do it. Just use any protocol and Concise Bounded Descriptions.
3 Addressable chunks
A main problem in the Semantic Web is the question: How in the world will I get information about the resource X?
Many people propose to use indirect addressing:
To get a FOAF of Person X, search for
?X <foaf:mbox> <mailto:email@example.com>
This is a way to address people with email address “firstname.lastname@example.org”. Hm. But which server shall I run this query against? www.test.com? smtp.test.com? And which interface should I use? http, smtp, rdfp, uriqa, … ?
So if you use indirect addressing, you have to think of a search engine or something like a central registry (which is OK; I respect that, and I know many people who do it and I like them).
But I prefer another approach, where every resource is addressable by URL. (Perhaps I will write something about this, too lazy now).
It is much easier to work with RDF-Chunks when the resource identifier URI is also a URL and contains information about protocol and server where I can get the Concise Bounded Description of the chunk.
It is even good when you are only working on desktop integration of data on a single host: there may be a Sesame server and a Joseki server running happily on your machine, so how will you decide which server to contact when you need something?
So I propose: use the power of URLs and fill them with information about how to get data about the resources.
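A small Python sketch shows how much routing information such a URL already carries. It uses the standard `urllib.parse.urlsplit` function; the helper name `where_to_ask` is made up, and the example URL is the one from the query at the top of this post, assuming its two lines join into a single URL.

```python
from urllib.parse import urlsplit


def where_to_ask(resource_url):
    """Split a resource URL into the protocol, server and local path to request."""
    parts = urlsplit(resource_url)
    return parts.scheme, parts.netloc, parts.path


print(where_to_ask(
    "rdfp://leo.gnowsis.com/msoutlook/appointment/"
    "00000000B2CDC30BFF2EED4ABA9C61436A07FE3384002000"
))
```

The first two parts of the result tell a client which protocol handler to use and which host to contact. A `mailto:` identifier, by contrast, yields an empty server part, which is the indirect-addressing problem described above.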
4 Only chunk access, no single triples
If you work only on Concise Bounded Description chunks and cannot query a server for individual triples, this may be inefficient! Yep, it may be, especially if you have “directory” resources, for example a folder resource that is connected to thousands of files.
In this case your adapters may always have to build the whole chunk and then return it to you, so you can filter it on the client side. But this is not the last word. You can use an adaptive framework for such queries (like the gnowsis framework). Or you can use another query interface only to do the selection of resources.
For example, the selection of “all appointments I have in the next seven days” may be a complicated search in your iCalendar data. If you are a badass hacker, you may want to write a query engine that does the trick for you. As long as you support SeRQL or RDQL, I am happy hitting your server with queries.
When I then have my list of needed Resource Identifiers, I can use URIQA or any other Concise Bounded Description compatible server to get my RDF chunks.
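The two steps, selecting identifiers first and fetching chunks second, might be sketched like this in Python. Both "servers" are faked with dictionaries; the identifiers, dates, and the function names `select_upcoming`, `fetch_chunk` and `upcoming_chunks` are assumptions for illustration, and a real client would talk to a query engine and a CBD-capable server instead.

```python
from datetime import datetime, timedelta

# Hypothetical index used by the query interface: identifier -> start time.
STARTS = {
    "ex:appointment1": datetime(2004, 3, 2, 10, 0),
    "ex:appointment2": datetime(2004, 3, 20, 9, 0),
}

# Hypothetical chunk store standing in for a CBD-capable server.
CHUNKS = {
    "ex:appointment1": [("ex:appointment1", "cal:summary", "Project meeting")],
    "ex:appointment2": [("ex:appointment2", "cal:summary", "Dentist")],
}


def select_upcoming(now, days=7):
    """Step 1: the query interface only returns resource identifiers."""
    limit = now + timedelta(days=days)
    return sorted(uri for uri, start in STARTS.items() if now <= start < limit)


def fetch_chunk(uri):
    """Step 2: one request per identifier returns the whole description."""
    return CHUNKS.get(uri, [])


def upcoming_chunks(now):
    """Combine both steps: select identifiers, then load each chunk."""
    return {uri: fetch_chunk(uri) for uri in select_upcoming(now)}


print(upcoming_chunks(datetime(2004, 3, 1)))
```

The selection logic can be as clever as the server likes, while the client code for loading data stays the same for every kind of resource.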
To sum it up
Building RDF aggregators is not an easy task. I have tried it in many different ways, with varying degrees of success. If you want an easy approach that works, think about concise bounded descriptions.