OrganiK project: working on test data collection

As blogged in January, Gunnar, Remzi and I are working for DFKI on the OrganiK project. As true hard bloggin’ scientists, we keep on reporting.

In the next two weeks, I will gather an exhaustive test data collection of texts that we will use for ontology learning. I hope to gather around 10,000 documents from various sources that overlap in topic. We need e-mails, office documents (contracts, etc.) and news documents. There are a lot of test data sets out there; the question now is which one to pick. Also, in OrganiK we have SME partners who could provide some data.

After this, the next step will be to create a taxonomy learning module that analyses the documents and semi-automatically (or fully automatically) creates a taxonomy out of them for future classification. If it's fully automatic, I expect that the taxonomy will have probabilistic elements in it (“it thinks that this is a customer, but only 60%”). If we work with a probabilistic model throughout the whole project, we can rank everything all the time, and maybe this will reduce human work. We will see.
Does anyone have experience with taxonomies that have weights added? It's similar to a TF/IDF rank.
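
To make the idea of a weighted taxonomy a bit more concrete, here is a minimal sketch in Python of what I have in mind. Nothing of this exists in OrganiK yet: the class `WeightedAssertion`, the functions `rank` and `needs_review`, the 0.8 threshold and the example terms are all made up for illustration. Each proposed "term is-a concept" statement carries a confidence, everything can be ranked by it, and only the uncertain statements would go to a human.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WeightedAssertion:
    """A proposed taxonomy entry with a confidence attached (hypothetical)."""
    term: str          # term found in the documents
    concept: str       # taxonomy concept the learner proposes for it
    confidence: float  # learner's confidence, between 0.0 and 1.0


def rank(assertions):
    """Return the assertions sorted by confidence, highest first."""
    return sorted(assertions, key=lambda a: a.confidence, reverse=True)


def needs_review(assertions, threshold=0.8):
    """Return the ranked assertions that fall below the (assumed) threshold."""
    return [a for a in rank(assertions) if a.confidence < threshold]


# Invented example data, not taken from any real document set.
assertions = [
    WeightedAssertion("ACME Ltd", "Customer", 0.60),
    WeightedAssertion("service contract", "Contract", 0.95),
    WeightedAssertion("John Smith", "Supplier", 0.30),
]

for a in rank(assertions):
    print(f"{a.term} is-a {a.concept} ({a.confidence:.0%})")
```

With a cut-off like this, a person would only have to look at the entries below the threshold instead of the whole taxonomy. Where the threshold should sit is an open question.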