Annotating files – but where to store the metadata?

An interesting thread about file metadata for KDE got my attention: Portable Meta-Information. I waited a month until it cooled down and re-read it to draw my own conclusions.

The author, zwabel, correctly identified the problem: the Semantic Desktop must be compatible with the past – and with the future!

I think, for the future, we need to find a way to keep the user's data together, so it is as persistent and approachable as the files themselves:
– When the user copies his photo archive or backs it up to a CD, no matter what application he uses, meta-information like ratings, comments, or tags has to move together with the photos
– When the user has a fresh install, and copies his photo archive from a CD to the disk, the meta-information for the photos should be just there
– User-generated meta-data should _never_ be lost just because a file/directory was renamed, a mount-point changed, or whatever
– User-generated meta-data should not be lost when a file completely unrelated to the item (the database, for example) is damaged or deleted
– In 20 years, when KDE4 is history for a long time, and I find an old photo backup CD, the meta-data should still be readable

zwabel then suggested storing the metadata not only in the central store (which NEPOMUK needs for the search engine and which is essential anyway) but also in a multitude of “.meta” files that live in the same directory as the files they describe. For the file picture1.png, the metadata would be in picture1.png.meta. I think this is a pragmatic idea and would say:

Let's store it in picture1.png.rdf

As serialization, I suggest the W3C RDF standard, which we use in the central NEPOMUK store anyway (in the database) and which has well-readable, standardized serialization formats in either XML or plain text. For Linux-geek compatibility, I suggest the plain-text format. For example, authorship information about picture1.png would look like this:

@prefix dc: <http://purl.org/dc/elements/1.1/>.
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.  
<>  dc:creator "Dave Beckett";
dc:date "2002-07-31";
dc:publisher "ILRT, University of Bristol";
dc:title "Dave Beckett's Home Page" .

Note that <> is a known shortcut for “this”; the equivalent in RDF/XML is rdf:about=””.
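To make the sidecar idea concrete, here is a minimal Python sketch that serializes a few Dublin Core properties as Turtle and works out the sidecar name for a file. It is only an illustration under assumptions, not how NEPOMUK does it: the function names (`to_turtle`, `sidecar_path`) are made up, the ".rdf" suffix is simply the one proposed above, and the property names are assumed to be plain Dublin Core element names. One deliberate deviation: the subject is written as a relative IRI naming the sibling file (`<picture1.png>`) instead of `<>`, on the assumption that inside a separate sidecar file `<>` would resolve to the sidecar itself rather than to the picture.

```python
from pathlib import Path
from typing import Mapping
from urllib.parse import quote

DC_NS = "http://purl.org/dc/elements/1.1/"


def escape_literal(value: str) -> str:
    """Escape a string for use inside a double-quoted Turtle literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def to_turtle(file_name: str, properties: Mapping[str, str]) -> str:
    """Serialize Dublin Core properties about a sibling file as Turtle.

    Assumption: property names are simple DC element names such as
    "creator" or "date"; they are not validated here.
    """
    lines = [f"@prefix dc: <{DC_NS}> ."]
    statements = [
        f'dc:{name} "{escape_literal(value)}"'
        for name, value in properties.items()
    ]
    if statements:
        # Relative IRI: resolved against the sidecar's own location,
        # so it points at the file lying next to it.
        subject = f"<{quote(file_name)}>"
        lines.append(f"{subject} " + " ;\n    ".join(statements) + " .")
    return "\n".join(lines) + "\n"


def sidecar_path(path: Path, suffix: str = ".rdf") -> Path:
    """picture1.png -> picture1.png.rdf (the suffix is an assumption)."""
    return path.with_name(path.name + suffix)


def write_sidecar(path: Path, properties: Mapping[str, str]) -> Path:
    """Write the Turtle sidecar next to the described file."""
    target = sidecar_path(path)
    target.write_text(to_turtle(path.name, properties), encoding="utf-8")
    return target
```

Calling `write_sidecar(Path("picture1.png"), {"creator": "Dave Beckett"})` would create picture1.png.rdf in the same directory. A real implementation would more likely use an RDF library than hand-built strings.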

Sebastian Trüg also argues in a way that leaves both options, database and filesystem, open for the future:
“you need a database anyway. Thus, in the end, the only solution I see at the moment is a kind of copy wrapper that makes sure metadata is copied with the file. Then one could also send information like a person or a project to a friend and the system would pick up all interesting metadata.”
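The “copy wrapper” Sebastian mentions could be sketched roughly like this in Python. This is my own illustration, not anything that exists in KDE or NEPOMUK: the function name `copy_with_metadata` and the ".rdf" sidecar suffix are assumptions, and a real wrapper would also have to talk to the central store.

```python
import shutil
from pathlib import Path

# Assumption: metadata lives in "<file name>.rdf" next to the file.
SIDECAR_SUFFIX = ".rdf"


def copy_with_metadata(src, dst) -> Path:
    """Copy a file and, if present, its metadata sidecar.

    `dst` may be a target file name or an existing directory.
    Returns the path of the copied file.
    """
    src = Path(src)
    dst = Path(dst)
    if dst.is_dir():
        dst = dst / src.name

    # copy2 also preserves timestamps where the platform allows it.
    shutil.copy2(src, dst)

    sidecar = src.with_name(src.name + SIDECAR_SUFFIX)
    if sidecar.is_file():
        shutil.copy2(sidecar, dst.with_name(dst.name + SIDECAR_SUFFIX))
    return dst
```

If the sidecar names its file explicitly, a rename during the copy would also need the subject inside the sidecar to be updated; this sketch does not handle that.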

So – how do we format the metadata inside the files? The same way as in the RDF repository of NEPOMUK, where we use the NIE and NAO ontologies. Pushing Dublin Core is also a good way to go, but do it the W3C way, standardized.

Using the RDF encoding of Dublin Core with, for example, Turtle/N3 as the serialization format gives us a rock-solid, W3C-standardized (or at least well-implemented) approach.

Because the world is not perfect and needs many possible ways to evolve, we can store the metadata redundantly in as many places as possible – but in one format. For freedesktop and NEPOMUK, RDF is the best choice, in my (not so humble) opinion. It is serializable, it can be stored in a database, and it can be hosted on the web. No other standard has this. It is already embedded in PDF in the XMP format.

I propose “.turtle” files to indicate that it's the RDF/Turtle serialization, but if you insist, “.rdf” is also fine with me (though it implies RDF/XML storage, which is a bit sluggish), and “.meta” is also fine with me if you store RDF/Turtle inside. Making up a new microformat would be stupid.

My Summary:

  • storing it in the filesystem is nice, but not a killer-argument. It works ™ by just storing it in the central nepomuk repository for 90% of all use cases, so start hacking applications that help the users save time and improve their user experience with what is there today.
  • do not store it in .meta, but in .turtle, which is the rock-solid industry standard by W3C and human-readable and a simple microformat-like text format (smoother than xml)
  • do also store it however possible in the files themselves, so as not to block out others. Use EXIF fields, use XMP fields in PDF, use ID3v2 fields, use that metadata!
  • do also index it in the central search engine, be it nepomuk or beagle++ (beagle++ is the rdf-enabled beagle, check it out if you are not aware of it)
  • storing it in metadata file attributes (xattr/channels/…) is the goal, but I propose to extend these standards with RDF to achieve cross-system compatibility. What worked for the web may also work here.
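As a rough idea of what the last point could look like on Linux, here is a small Python sketch that puts a Turtle snippet into an extended attribute and reads it back. The attribute name `user.rdf.turtle` is purely hypothetical (no standard defines it), and so are the two function names. `os.setxattr` and `os.getxattr` only exist on Linux, and many filesystems limit how large a value can be, so the sketch reports failure instead of assuming it works.

```python
import os
from typing import Optional

# Hypothetical attribute name, chosen for illustration only.
XATTR_NAME = "user.rdf.turtle"


def store_turtle_xattr(path: str, turtle: str) -> bool:
    """Try to store Turtle text in an extended attribute.

    Returns False when the platform or filesystem does not support it
    (or the value is too large), True on success.
    """
    if not hasattr(os, "setxattr"):
        return False
    try:
        os.setxattr(path, XATTR_NAME, turtle.encode("utf-8"))
    except OSError:
        return False
    return True


def load_turtle_xattr(path: str) -> Optional[str]:
    """Read the Turtle text back, or None if it is not available."""
    if not hasattr(os, "getxattr"):
        return None
    try:
        return os.getxattr(path, XATTR_NAME).decode("utf-8")
    except OSError:
        return None
```

Because the call can fail on any given system, a sidecar file or the central repository would still be needed as a fallback.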