Annotating files – but where to store the metadata?

An interesting thread about file metadata for KDE got my attention: Portable Meta-Information. I waited a month until it cooled down and re-read it to draw my own conclusions.

The author, zwabel, correctly identified the problem that the Semantic Desktop must be compatible with the past – and with the future!

I think, for the future, we need to find a way to keep the user’s data together, so it is as persistent and approachable as the files themselves:
– When the user copies his photo archive or backs it up to a CD, no matter what application he uses, meta-information like ratings, comments, or tags has to move together with the photos
– When the user has a fresh install, and copies his photo archive from a CD to the disk, the meta-information for the photos should be just there
– User-generated meta-data should _never_ be lost just because a file/directory was renamed, a mount-point changed, or whatever
– User-generated meta-data should not be lost when a file completely unrelated to the item is damaged or deleted (database)
– In 20 years, when KDE4 has long been history and I find an old photo backup CD, the meta-data should still be readable

zwabel then suggested storing the metadata, in addition to the central store (which NEPOMUK needs for the search engine and which is essential anyway), in a multitude of “.meta” files that live in the same directory as the files they describe. For the file picture1.png, the metadata would be in picture1.png.meta. I think this is a pragmatic idea and would say:

Let’s store it in picture1.png.rdf

As serialization, I suggest the W3C RDF standard, which we use in the central NEPOMUK store anyway (in the database) and which has well-readable, standardized serialization formats: either XML or a plain-text format. For Linux-geek compatibility, I suggest the plain-text format. For example, to add authorship information about picture1.png, it would be:

@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
<>  dc:creator "Dave Beckett" ;
    dc:date "2002-07-31" ;
    dc:publisher "ILRT, University of Bristol" ;
    dc:title "Dave Beckett's Home Page" .

Note that <> is a known shortcut for “this”; the equivalent in RDF/XML is rdf:about=””.
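A sidecar file like this can also be generated by a small script. The following is a minimal sketch in Python, not NEPOMUK code: the function names dublin_core_turtle and turtle_literal are made up for illustration, and pointing the subject at <picture1.png> (a relative IRI that resolves to the file next to the sidecar) instead of <> is my assumption, so that the statements describe the picture rather than the metadata file itself.

```python
from typing import Mapping
from urllib.parse import quote

DC_NS = "http://purl.org/dc/elements/1.1/"


def turtle_literal(value: str) -> str:
    """Quote a string as a Turtle literal, escaping the characters
    that may not appear raw inside a short "..." string."""
    escaped = (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def dublin_core_turtle(file_name: str, fields: Mapping[str, str]) -> str:
    """Return Turtle text with Dublin Core statements about file_name.

    file_name is a plain file name; it is percent-encoded and written
    as a relative IRI (hypothetical convention, see the text above).
    """
    if not fields:
        raise ValueError("need at least one Dublin Core field")
    pairs = [
        f"    dc:{name} {turtle_literal(value)}"
        for name, value in fields.items()
    ]
    return (
        f"@prefix dc: <{DC_NS}> .\n"
        f"<{quote(file_name)}>\n" + " ;\n".join(pairs) + " .\n"
    )


# Example: the text that could go into picture1.png.rdf
print(dublin_core_turtle(
    "picture1.png",
    {"creator": "Dave Beckett", "date": "2002-07-31"},
))
```

For real use, an RDF library would be the safer choice, because it takes care of all escaping rules and can also read the files back.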

Sebastian Trüg argues in a way that leaves both ways, database and filesystem, open for the future:
“you need a database anyway. Thus, in the end, the only solution I see at the moment is a kind of copy wrapper that makes sure metadata is copied with the file. Then one could also send information like a person or a project to a friend and the system would pick up all interesting metadata.”
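Such a “copy wrapper” could look roughly like this. It is only a sketch under assumptions: copy_with_metadata is a hypothetical helper, not an existing KDE or NEPOMUK API, and the sidecar suffixes are simply the candidates discussed in this post.

```python
import shutil
from pathlib import Path

# Assumed sidecar naming: picture1.png -> picture1.png.turtle (or .rdf, .meta)
SIDECAR_SUFFIXES = (".turtle", ".rdf", ".meta")


def copy_with_metadata(src, dst_dir, suffixes=SIDECAR_SUFFIXES):
    """Copy src into dst_dir together with any sidecar metadata
    files found next to it. Returns the list of created paths."""
    src = Path(src)
    dst_dir = Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)

    # Copy the file itself first (copy2 also keeps timestamps).
    copied = [Path(shutil.copy2(src, dst_dir / src.name))]

    # Then copy every sidecar that exists for this file.
    for suffix in suffixes:
        sidecar = src.with_name(src.name + suffix)
        if sidecar.is_file():
            copied.append(Path(shutil.copy2(sidecar, dst_dir / sidecar.name)))
    return copied
```

A call like copy_with_metadata("photos/picture1.png", "/media/backup/photos") would then take picture1.png.turtle along if it exists. Updating the central store after the copy is left out here.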

So – how do we format the metadata inside the files? The same way we do in the RDF repository of NEPOMUK, where we use the NIE and NAO ontologies. Pushing Dublin Core is also a good way to go, but do it the W3C way, standardized.

Using the RDF encoding of Dublin Core with, for example, Turtle/N3 as the serialization format gives us a rock-solid, W3C-standardized (or at least well-implemented) approach.

Because the world is not perfect and needs many possible ways to evolve, we can store the metadata redundantly now, in as many places as possible – but in one format. For freedesktop and NEPOMUK, RDF is the best choice, in my (not so humble) opinion. It is serializable, it can be stored in a database, and it can be hosted on the web. No other standard has this. It is already embedded in PDF in the XMP format.

I propose “.turtle” files to indicate that it’s an RDF/Turtle serialization, but if you insist, “.rdf” is also fine with me (although it implies RDF/XML storage, which is a bit sluggish), and “.meta” is also fine with me if you store RDF/Turtle inside. Making up a new micro format would be stupid.

My Summary:

  • storing it in the filesystem is nice, but not a killer-argument. It works ™ by just storing it in the central nepomuk repository for 90% of all use cases, so start hacking applications that help the users save time and improve their user experience with what is there today.
  • do not store it in .meta, but in .turtle, which is the rock-solid industry standard by W3C and human-readable and a simple microformat-like text format (smoother than xml)
  • do also store it however possible in the files themselves, so as not to block out others. Use EXIF fields, use XMP fields in PDF, use ID3v2 fields, use that metadata!
  • do also index it in the central search engine, be it nepomuk or beagle++ (beagle++ is the rdf-enabled beagle, check it out if you are not aware of it)
  • storing it in metadata file attributes (xattr/channels/…) is the goal, but I propose to extend these standards with RDF to achieve cross-system compatibility. What worked for the web may also work here.

OrganiK project: working on testdata collection

As blogged in January, Gunnar, Remzi and I are working for DFKI on the OrganiK project. As true hard bloggin’ scientists, we keep on reporting.

In the next two weeks, I will gather an exhaustive test-data collection of texts that we use for ontology learning. I hope to gather around 10,000 documents from various sources that have a topic overlap. We need e-mails, office documents (contracts, etc.) and news documents. There are a lot of test data sets out there; the question now is which one to pick. Also, in OrganiK we have SME partners who could provide some data.

After this, the next step will be to create a taxonomy learning module that analyses the documents and semi-automatically (or fully automatically) creates a taxonomy out of them for future classification. If it’s fully automatic, I expect that the taxonomy will have probabilistic elements in it (“it thinks that this is a customer, but only 60%”). If we work with a probabilistic model throughout the whole project, we can rank everything all the time; maybe this will reduce human work. We will see.
Does anyone have experience with taxonomies that have a weight added? It’s similar to a TF/IDF rank.
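For readers who do not have the TF/IDF rank in mind, here is a small sketch of the classic weighting. The function name tf_idf, the whitespace tokenization and the plain log(N/df) variant are my assumptions for illustration; this is not the OrganiK module, and a weighted taxonomy would attach such scores to concepts rather than to raw words.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document, where
    weight = (term count / document length) * log(N / document frequency)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = [
    "customer contract offer",
    "customer invoice",
    "customer meeting notes",
]
for doc, scores in zip(docs, tf_idf(docs)):
    print(doc, "->", scores)
```

A term that occurs in every document (“customer” here) gets weight 0, while rarer terms rank higher; that is the ranking effect I have in mind for the taxonomy.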

Permanent Breakfast 1.5.2009: opening breakfast

As every year, we will again have breakfast in Kaiserslautern on 1 May at Stiftsplatz! It is part of the global Permanent Breakfast art project.

Friday, 1 May 2009
10:00, Stiftsplatz
until about 12:00

The rules:

  • it has to look like a formal breakfast (no picnics)
  • it is about participating – everyone helps out
  • bring a little more than you need for yourself, so there is enough for passers-by and friends who forgot something

What you can bring:

  • tables
  • chairs
  • breakfast
  • hats
  • parasols
  • lamps, small radios: it should feel like a living room

Breakfast

Past fun:

Having breakfast: Björn, Florian and Elisabeth
View from above

Make a note of the date, it is worth it! Don’t keep it a secret, and leave a comment here (below) if you are coming.

Kinetic Sculpture Race has its 41st anniversary! How about one in Vienna/Klosterneuburg?

“Adults having fun so children will want to grow older”,
Hobart Brown, glorious founder of the race.

en.wikipedia.org/wiki/Kinetic_Sculpture_Race
A Kinetic sculpture race is an organized contest of human-powered amphibious all-terrain works of art. The original event, the Kinetic Grand Championship in Humboldt County, California, is also called the “Triathlon of the Art World” because art and engineering are combined with physical endurance during a three day cross country race that includes sand, mud, pavement, a bay crossing, a river crossing and major hills.

This year is the 41st anniversary! The first race started in 1969 in hippie California. How about holding a race this year in Klosterneuburg, next to Vienna? We have wild cross-country terrain, road, and water. I would like to find a few mates for a team to create “Free Ellie”, and you should also make a team! It’s fun, and it teaches you artwork and handicrafts.

and look at all the fun it can be after you do it for many years:

I am looking for alternative-minded people who would like to help out, co-organize, form teams, and carry on the spirit of this family-friendly event in Europe. Are you in? Write to me or comment below.

Our design is intentionally simple so that it’s possible to build it within a week – would you like to ride in this? Then help build it!
Free Ellie

Update: I made a Facebook group to kickstart this: http://www.facebook.com/home.php#/group.php?gid=67272838900

See me speak at webinale 2009

Webinale 09 is the premier German conference about Web 2.0. 70 speakers over two days, on all relevant topics: web technology, scaling, running services, marketing, business, future trends, RIA, mobile web, social networks and communities. Various hands-on sessions to learn about building iPhone apps, Air/Flex, etc. A startup day to see and meet the next Xing or Facebook. And Facebook and Xing groups to do some boo-haa already today.

webinale09

Two years ago I spoke about the state of the semantic web; this year I will again speak for 40 minutes on the current state of the semantic web or, as we call it, “the web of linked data”.

See me speak on 26.5.2009 from 10:30 to 11:30 at the UFO-lookalike congress center in Berlin about the fact that even Barack Obama’s new administration does the semantic web now, and other bits. Free your data!

The UFO-lookalike congress center:
berlin congress center

deadline surfing

A colleague from a related research institute just expressed the pressure we all experience when facing EU proposal deadlines:
“Sorry, can we reschedule to later? I am currently deadline surfing for the call deadline tomorrow”.

Deadline surfing, of course, means: To have around 20 man-days of work build up behind you, and 5 workdays in front of you. While you wade through the doable tasks in front of you, more work piles up behind you faster and faster, pushing you towards the deadline. Then, the wave breaks, either you surf straight out of it (unbelievable) or you crash and fall into the whitewater (which experienced deadline surfers call the “stuck inside a washing machine mayhem”). The deadline arrives, washes every crashed surfer on shore, while the experienced riders swim out to catch the next set. Once the debris is washed from the beach, the wildlife of scientific work continues.

Let me illustrate the process:
deadlinesurfing
sources, cc-by

In the graph, we compare two typical people being approached by a deadline which they are going to surf. Orange is the prepared and experienced surfer: when he sees the work coming, he gets on top of it early and then rides it at the bottom of the curve, gaining momentum and keeping the work well behind him. Finally, he elegantly finishes before the deadline and turns his board around, before the whitewater of accusations and last-minute panic crushes him. Not so the blue surfer. He waits a bit too long at the beginning and is caught too fast by the work, which tips him over. Unable to stay in front of the work, he ends up in the whitewater of accusations and last-minute panic.

Further illustrations:

A knowledge worker riding the perfect deadline, excellent sports:
(c) dude crush, flickr

Waited too long to start working, now trying to get away from the deadline, clearly visible for everyone still working (not a good exit; you should dive underwater so that they don’t notice your wipeout):
(c) vaguely artistic

Even a small deadline can trip you (the wave is about the size of a local gov funding contract, or a NOE):
(c) coast guard bm

A team of two knowledge workers stuck right on the deadline. Bob, the lower one, is tripped by the tasks slipping away under him; David, the upper one, is crashing over him because he depended on Bob’s input for the cost calculation:
(c) localsurfer

A sole knowledge project manager writing the final deliverable for a 15 million EUR IP project that is already under close surveillance by the PO. The double tripping wave means that half the project members invested their money in stocks and expensive Mediterranean “research visits”, which makes it impossible to meet cost statements (and all accounts receivable):
(c) soulsurfer3 on flickr

I conclude:
“I love deadlines. I love the whooshing noise they make as they go by.”
Douglas Adams

p.s.: this is of course related to the deadline of IST calls tomorrow.

PhD step5: burning the last draft

After submitting my PhD in January, I continue my long-term effort to blog about my PhD.

PhD Burning
PhD Burning - Whisky and Leo

Once you have submitted your PhD, according to an old Scottish tradition, you burn the last printed draft in the woods. Gunnar Grimnes and I adhered to that Scottish tradition on the 10th of January 2009. It also includes drinking a lot of alcoholic beverages.

PhD Burning

The tradition also includes defending the thesis quickly: you say something like “I made a PhD on helping people remember, and it is great.” – the attackers (your dudes) then shout “It is shit – burn it!”. Do that, and drink alcohol.

Please comment below, blog it, or contact me if you also burnt your PhD. Use the Flickr tag phdburning.

Design for the Other 90%

“The majority of the world’s designers focus all their efforts on developing products and services exclusively for the richest 10% of the world’s customers. Nothing less than a revolution in design is needed to reach the other 90%.”
—Dr. Paul Polak, International Development Enterprises

Design for the Other 90% is an exhibition currently on view at the Centers for Disease Control and Prevention through May 29, 2009, and online.

There must be a use for technology for the Other 90%. And they are a market – for products that help them improve life, or even stay alive. Products such as the Q-Drum:
q-drum

http://other90.cooperhewitt.org/Design/q-drum

It can hold up to 50 liters of water and be used to transport that water over long distances. It is affordable and saves time and money, all in one clever design.

I remember a keynote at the I-Know conference in Graz (I think 2005, and I forgot the speaker’s name – is there a program online?) which was not about knowledge management, but about how computers are used in rural India. One case was to examine eye patients remotely via a webcam – the patient sits in front of the only computer in the village and looks into the webcam, while the examining doctor sits somewhere else and gives a diagnosis. This cuts travel costs and saves money (and improves health). So, there is a market in the low-income population for life-improving products.

SemVox DFKI Startup combines Ontologies with voice interaction

“So, computer, please find me all documents that contain research information about a drug that can cure cancer, developed anywhere in the world” – this is a classic question we would like to ask a computer. Actually, it’s so classic that it is defined as an example in the 1992 version of the TREC test data.

The DFKI spin-off SemVox may provide something that helps realize this. They are combining ontologies with speech interaction:
The SemVox technology enables the user to employ various applications without having to resort to traditional operating concepts such as keyboards or remote controls. Using our technology the user is free to choose between a number of modalities such as speech, gestures, keyboard or mouse or a combination thereof.

semvox logo

Their technology incorporates a heterogeneous set of modules that can be remixed to allow different application scenarios. Part of their demos is to tell the computer to “find me an action film”. Nice side-effect: using the speech-synthesis module offered by SVOX, the computer will talk back to you (press release in German).

So – this is a next step towards the semantic web, as Vint Cerf has put it:
I’m almost certain you’ll see products emerging that will allow you to orally interact with the network
Sure, it is nearly here, and you can buy the tools for it off-the-shelf. And I guess SemVox is open for investors 🙂

What is really funny is that today we are very close to actually answering the questions defined as scientific goals in 1992 (for example here, page 64; I was not able to find the original TREC-1 set).

I have seen the SemVox system live at CeBIT; I was demoing NEPOMUK (and advertising my gnowsis.com startup) 5 meters away from them, and we had great fun demoing our products to each other. Here is a picture of Jan, one of the founders:
Jan Schehl, Semvox