Case Studies–Research Ontologies (March 15):
This seems to be the session with technical problems. I was intending to use my machine, but the picture from my machine is rendered in a very awful way on this screen, so I’m glad to use Syd’s for my presentation on, basically, the EDM [Europeana Data Model].
This is what I’m intending to do: I’m trying to give you an idea of what this actually is and what it is not. I’m trying to show you a few examples of how we model objects in context. I will be talking a little bit about how EDM is related to the linked open data paradigm. I’ll say a few words on RDF–potential problems, limitations. And to end with, I’ll show you an example of an ongoing project, and how we think scholars might make use of the EDM.
So, the EDM–where does it come from? Basically, it comes from Europeana, which was a reaction to what, back in 2005, the French perceived this way: the Google rat that was devouring European culture. This triggered a violent reaction by the former librarian of France, Jean-Noël Jeanneney, who perceived this as a challenge, as Europe being challenged by Google. This, in turn, triggered a political reaction, which, as you all know, led to the creation of our first prototype, which was launched in 2008 and immediately crashed quite spectacularly. Which was helpful, in a way, by the way. But since then, this first Europeana portal has been growing and currently represents up to 20 million objects. Still, this is basically just another portal. Nothing really spectacular or innovative about it; it’s just a lot of objects represented.
These objects are represented in a model which we chose because we had to be quick: the Commission had promised in 2005 something that would happen in 2008. They didn’t know what they were talking about, so we came up with a very simplistic solution: the so-called Europeana Semantic Elements (ESE), which I’m not going into in detail. Euphemistically, you could call it simple and robust, but, basically, it had lots of limitations, and we probably shouldn’t ever have called it “semantic,” because that’s definitely what it was not.
This is what the ESE is in essence: it’s what we all know–at least, the librarians among us–a typical record-based model to represent metadata.
Now, right after the launch of the portal, we started working on a different model, and that’s the EDM I want to talk about, which basically is another metadata model currently in the course of replacing ESE. And it’s a model, to be a bit more precise, for making statements about digital representations of cultural heritage objects. That’s how I would define it. It’s not about the actual objects. It’s about digital representations, and it’s a model for making statements–potentially conflicting statements. The second objective of the model is to enable contextualization of such representations: contextualization in the sense of making this part of the linked data paradigm.
Now, for all the TEI people in the room: it’s not an object model. It’s not competing with any of the object models. It may be combined with them, but it’s not a model for objects, and it’s not a record model like the ones we know from the library environment. It is an RDF graph-based model, and that has consequences that would be interesting for our workshop here. Basically, trying to simplify a point that was made yesterday already: the XML models we know basically model knowledge as trees, ordered trees, with all the problems that ‘stem’ from that. The schema language is…these are using no elements and types. They [the models] know this concept of validation, and, in a way, contain a prescriptive aspect that you don’t have in RDF.
Graphically represented, rather, it looks like this, and it models knowledge as graphs, not necessarily directed graphs. RDF has classes and properties. It has an inbuilt model of inheritance, and that way enables some kind of very simple deterministic inference, which makes all the difference, as I would see.
This is something I will skip because I don’t have time. [Slide labeled: EDM: Requirements and Design Principles]. There were a number of requirements and design principles, such as distinctions between object and representation, the object properties and metadata attributes, the necessity to have several perspectives of an object represented, support for composite objects, standard format for metadata with specializing options, standard format for vocabularies with specializing options, and the maximum reuse of existing standards.
We did that [Slide labeled: EDM Standards] in three areas.
We have really drawn on SKOS [Simple Knowledge Organization System] for knowledge organization/representation. We draw on the DCMI Metadata Terms, which is a much richer term set than the Dublin Core attributes. And the one we draw on most heavily is OAI-ORE (Object Reuse and Exchange), which is a graph-based model for representing complex objects on the web as aggregations of web resources.
[Slide labeled: EDM Classes]. All of that resulted in a class model, which is deliberately simple. We only model the upper levels of what you can actually model, and the assumption is that all of these classes here can be specialized in the way I’m trying to indicate with the event here, for instance; you could prune the CIDOC CRM E5 hierarchy below that event class here. You could make similar choices for most of the upper classes, and the same applies to the properties. This is a rather simplistic property model, and it’s meant to be top-level so as to enable specialization by communities.
Now, you get the Mona Lisa again [Slide labeled: Mona Lisa: French Ministry of Culture], because this is the example we use in the primer, and we saw it used yesterday already. This is what it looks like on the web currently. This is how it looks at the French Ministry of Culture. And, by the way, there is a second representation of this at the Louvre, so we need to have two views of this. This is what it looks like, basically, in the EDM.
[In Mona Lisa] there is a cultural heritage object which we’re referring to that has digital representations (the thumbnails you’ve seen on the website). It has a proxy that binds all statements that have been made by one instance of this representation. All of this is pulled together by the aggregation node, which makes it a representation entity within Europeana.
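[Editorial sketch] The structure just described can be written down as plain subject-predicate-object tuples. The following Python fragment is only an illustration under stated assumptions: the property names (`edm:aggregatedCHO`, `edm:hasView`, `ore:proxyFor`, `ore:proxyIn`) are recalled from the EDM and OAI-ORE vocabularies and should be checked against the current specifications, while every `ex:` identifier, the helper `statements_by_provider`, and the title values are hypothetical.

```python
# Plain (subject, predicate, object) tuples; no RDF library is assumed.
# All ex: identifiers and literal values below are hypothetical.
CHO = "ex:item/MonaLisa"  # the cultural heritage object both providers describe

triples = [
    # Provider 1: aggregation, digital view, and proxy carrying its statements
    ("ex:aggregation/louvre", "rdf:type", "ore:Aggregation"),
    ("ex:aggregation/louvre", "edm:aggregatedCHO", CHO),
    ("ex:aggregation/louvre", "edm:hasView", "ex:thumb/louvre.jpg"),
    ("ex:proxy/louvre", "rdf:type", "ore:Proxy"),
    ("ex:proxy/louvre", "ore:proxyFor", CHO),
    ("ex:proxy/louvre", "ore:proxyIn", "ex:aggregation/louvre"),
    ("ex:proxy/louvre", "dc:title", "La Joconde"),
    # Provider 2: a second, possibly conflicting, view of the same object
    ("ex:aggregation/culture", "rdf:type", "ore:Aggregation"),
    ("ex:aggregation/culture", "edm:aggregatedCHO", CHO),
    ("ex:aggregation/culture", "edm:hasView", "ex:thumb/culture.jpg"),
    ("ex:proxy/culture", "rdf:type", "ore:Proxy"),
    ("ex:proxy/culture", "ore:proxyFor", CHO),
    ("ex:proxy/culture", "ore:proxyIn", "ex:aggregation/culture"),
    ("ex:proxy/culture", "dc:title", "Portrait de Mona Lisa"),
]

STRUCTURAL = {"rdf:type", "ore:proxyFor", "ore:proxyIn"}


def statements_by_provider(triples, cho):
    """Group descriptive statements about `cho` by the aggregation
    whose proxy made them."""
    proxies = {s for s, p, o in triples if p == "ore:proxyFor" and o == cho}
    proxy_to_agg = {s: o for s, p, o in triples
                    if p == "ore:proxyIn" and s in proxies}
    grouped = {}
    for s, p, o in triples:
        if s in proxies and p not in STRUCTURAL:
            grouped.setdefault(proxy_to_agg[s], []).append((p, o))
    return grouped


views = statements_by_provider(triples, CHO)
for aggregation, statements in views.items():
    print(aggregation, statements)
```

The point of the sketch is that the two titles never collide: each statement hangs off a proxy, and the proxy ties it to the aggregation of the provider that made it.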
Now, using this model, you can do all sorts of things; you can do semantic enrichment. [For instance:] you can link to an agent that’s modeled on the web, work with time spans, link to external SKOS concepts. You can model events and things that happen in time using this model, such as the event of the creation of the Mona Lisa at a given place, in a given time. Or, again, the event of the Mona Lisa having been stolen by François Le Premier (Francis I). You can model events. You can model more complex objects. You can model part-to-whole hierarchical relations, order among parts of objects, or, as in this example here, derivation and versioning relations. Like here: the two editions that can be modeled as the one being derivative of the other.
Last example: Here again, a MARC record in its sheer beauty, which in Gallica looks like this. [Slide Labeled: Les Fleurs du Mal: EDM.] This is the way it looks in the digitized environment. This is how it looks in the EDM: the physical thing, the proxy with all the metadata, the digital representations, and the semantic contexts. I use purple for things that are outside Europeana but which we link to, and then there is the aggregation that binds all of this together. We have confirmed feasibility/viability of this model in several workshops with all of the communities involved. It’s already deployed at data.europeana.eu. That’s our linked data pilot. Specifications and [primer] publications are already available, and for the more technical people among you, there is EuropeanaLabs (http://europeanalabs.eu/), with development documentation and a related ontology. So, if you want to reuse the OWL, that’s possible from there.
Now, this brings me to the second objective of the model, which, in my eyes, is the more important one: the contextualization bit. If you take again here the core of an EDM aggregation, this can have context. The aggregation is created by an agent. The proxy may be linked to a VIAF, a SKOS, or a GeoNames entity. That way, on a higher level, you get an image of Europeana being organized on two levels: the bottom level of networked object representations, which would be modeled in the EDM, and the level above, the semantic network that would be used for contextualizing these objects. And the user is free to navigate horizontally and vertically, thus creating a user experience that’s radically different from librarian OPAC environments.
[Slide Labeled: …and the Big Picture: The Semantic Data Layer.] The charm of this [model] is that this architecture, which we originally designed as a stand-alone architecture, can be plugged in[to] this big cloud [Linked Open Data Cloud]. This is the linked data cloud that most of you will have seen in several presentations during the last month. This is the biggest contextualization resource in the world, and Europeana is now part of this. This was important for us: a lesson learned, that we do not build another librarian silo but try to be a part of this bigger one.
[Slide Labeled: Aggregations and Context: Calculating Closeness.] Now, if you do things this way–and this is where I link to the workshop–you get all sorts of interesting questions. To start with, a resource aggregation: where does it actually start? Where does it end? Very difficult questions.
Or, a similar question: what actually constitutes the document boundaries in such an RDF aggregation environment? I’m using colors here. That makes it intuitive. There are no colors in RDF. So, you have to find some criteria to actually constitute the boundaries. Or, again, a really challenging one: which of these nodes was connected to another one at a given time? How to version all of this…it’s a real challenge.
Or, again, if you take two resources, the purple one here and the orange one here, you have context which is directly related to them: first-order links. We would agree on this being context of these resources. But then there are other resources linked to these through secondary links. Are these still a part of this object’s context? And, if we try to version things, do we have to version this as a part of the object’s context? What about all of these things here that are somehow related through secondary relations and are shared between these two objects: which of them belongs to which, and which is the context of which? If you add the time vector to this again, how do you version all of this? That can be a challenge. But you get some new opportunities.
You can do some sort of very simplistic reasoning on RDF, as you all know. If you have, for instance, this triple here that says that La Joconde is a painting, and another that says that paintings are a subclass of artistic work, we can infer that La Joconde must be an artistic work. That’s possible for a machine to do. This creates opportunities in the sense of very simple deterministic reasoning that may have some potential of enabling novel digital heuristics.
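[Editorial sketch] The La Joconde inference can be reproduced in a few lines of Python. This is a minimal illustration of the RDFS subclass rule over plain tuples, not a full reasoner; the `ex:` names and the function `infer_types` are hypothetical.

```python
def infer_types(triples):
    """Apply the RDFS subclass rule until no new rdf:type triples appear:
    if X rdf:type C and C rdfs:subClassOf D, then X rdf:type D."""
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        # Snapshot both lists before adding, so the set is not
        # modified while it is being iterated.
        subclass = [(s, o) for s, p, o in inferred if p == "rdfs:subClassOf"]
        types = [(s, o) for s, p, o in inferred if p == "rdf:type"]
        for instance, cls in types:
            for sub, sup in subclass:
                new = (instance, "rdf:type", sup)
                if cls == sub and new not in inferred:
                    inferred.add(new)
                    changed = True
    return inferred


facts = {
    ("ex:LaJoconde", "rdf:type", "ex:Painting"),
    ("ex:Painting", "rdfs:subClassOf", "ex:ArtisticWork"),
}
closure = infer_types(facts)
print(("ex:LaJoconde", "rdf:type", "ex:ArtisticWork") in closure)
```

The loop repeats until a pass adds nothing, so longer subclass chains are followed as well. Nothing probabilistic is involved, which is what "deterministic" means here.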
Such an approach shares the pros and cons of RDF. There are pros: it’s simple, it’s lightweight, and it avoids all of the complications and heavy logic approaches of former semantic work. It’s robust, atomistic, and it seems to be scalable. But it has limitations. It has the limitations we talked about yesterday–namely, the limitation of the triple syntax when it comes to expressing things like provenance and versioning without doing systematic reification, which the linked data people would like to avoid.
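[Editorial sketch] To make the cost of reification concrete, the fragment below attaches provenance to one statement in two ways. The first uses the standard RDF reification vocabulary (`rdf:Statement`, `rdf:subject`, `rdf:predicate`, `rdf:object`); the second adds a fourth element naming a graph, in the style of named graphs. The `ex:` identifiers, the provenance values, and the helper `reify` are hypothetical, and the tuples only stand in for real RDF serializations.

```python
base = ("ex:LaJoconde", "dc:creator", "ex:LeonardoDaVinci")


def reify(triple, statement_id, provenance):
    """Describe `triple` as a resource of its own so that further
    statements (here: provenance) can be made about it."""
    s, p, o = triple
    described = [
        (statement_id, "rdf:type", "rdf:Statement"),
        (statement_id, "rdf:subject", s),
        (statement_id, "rdf:predicate", p),
        (statement_id, "rdf:object", o),
    ]
    described += [(statement_id, prop, value) for prop, value in provenance.items()]
    return described


# Reification: one original triple becomes four, plus one per provenance fact.
reified = reify(base, "ex:stmt1",
                {"dc:source": "ex:Louvre", "dcterms:created": "2011-01-01"})

# Named-graph style: the triple stays intact and gains a graph name;
# provenance is then stated once about the graph, not per statement.
quad = base + ("ex:graph/louvre",)
graph_provenance = [("ex:graph/louvre", "dc:source", "ex:Louvre")]

print(len(reified), quad)
```

Doing the first variant systematically multiplies the size of the data several times over, which is one reason why it tends to be avoided.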
So, we had [a] discussion on Named Graphs, as we mentioned yesterday already. And we have more complex problems, like the question I was just asking Daniel: how do we deal with things like similarity? How do we express similarity between resources? Or, again, ambiguity, because this is the most important limitation in my eyes: RDF triple syntax is definitely limited to denotative modes of signification. Anything more complex than mere denotation, [e.g.,] “a has this name,” is impossible to do in RDF.
And then, of course, there is this huge LOD linking resource potential, but it’s dirty data. It has big quality issues, and these need to be taken into account. Now, considering all of this, we kicked off a project with this logo two weeks ago.
This is the new logo. It’s called DM2E: “digitized manuscripts to Europeana.” You see a reference to Europeana there, of course, in the logo. You also see, at least as intended, a reference to the RDF model.
This is what it basically does: it provides digitized manuscripts to Europeana. It integrates existing technical building blocks into a production chain that would enable migration from any kind of structured source to RDF, to the EDM, and then automated contextualization, which is, again, a bit [of what] Daniel was alluding to: how do you automatically create links between an object representation and semantic contexts? It has a third work-package that may be interesting for this community here, where we go into the way humanities research could eventually build on EDM and more RDF data, eventually generating digital heuristics and making these data as well as these heuristics available to specialized environments, like the one we’ll be building there, which looks a bit like this:
[Slide Labeled WP3: Digital Humanities Related Engineering.]
To the left, you have all the structured to semi-structured data sources we can take as input, Europeana being one of them. In the middle, you have a platform; a prototype of this already exists under the name of Muruca, which was developed in the Discovery Project. It basically enables digital curation workflows: semantic annotation, collation, text mining, data linking, mixing, augmenting. It basically has the potential of being used by all sorts of communities, but in this work-package, we try to experiment with it in a digital humanities community, so as to combine things like what we have here. This is a fragment of a Wittgenstein manuscript, which we can contextualize: we can say it has been created at a given date. It has a web presence that is part of another portal. It has an author in VIAF [viaf.org] who is the author of something else, and who was living at a given place when creating this fragment. And the idea is to combine this kind of information available from the EDM together with the actual digitized objects, which are not part of Europeana. [It would be used] in an environment that would use ontological, granular representations of what John Unsworth has called “scholarly primitives,” and what Tobias Blanke and Mark Hedges published on again last year, to build what I would call a social semantic scholarly graph in which the EDM data would be extended with RDF statements. Here are just a few examples of what you could come up [with] there: Version A is a successor of Version B, or scribe Y copied from scribe Z, or Statement 1 on the meta-level contradicts Statement 2.
The verbs in these sentences would be modeled ontologically, and the question actually is what we could obtain from inferencing on such a graph, and what kind of scholarship this would eventually enable. And [the question is] what the limitations of such an approach actually are, because we’ve been speculating on this for quite some time, so we’d just like to do a large-scale experiment. There’s some related reading.
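[Editorial sketch] The example statements from the talk can be written as identified triples, so that a meta-level statement such as "contradicts" has something to point at. The fragment below is a guess at the general shape only, not the DM2E design: the property names (`ex:isSuccessorOf`, `ex:copiedFrom`, `ex:contradicts`), the statement identifiers, and the filtering policy in `uncontested` are all hypothetical.

```python
# Object-level scholarly statements, each with its own identifier.
statements = {
    "ex:stmt1": ("ex:versionA", "ex:isSuccessorOf", "ex:versionB"),
    "ex:stmt2": ("ex:versionB", "ex:isSuccessorOf", "ex:versionA"),
    "ex:stmt3": ("ex:scribeY", "ex:copiedFrom", "ex:scribeZ"),
}

# Meta-level statements are triples whose subject and object are statements.
meta = [("ex:stmt1", "ex:contradicts", "ex:stmt2")]


def uncontested(statements, meta):
    """One naive policy: keep only statements that take no part in any
    recorded contradiction, so that a simple reasoner never sees both
    sides of a dispute."""
    contested = set()
    for subject, predicate, obj in meta:
        if predicate == "ex:contradicts":
            contested.update((subject, obj))
    return {sid: t for sid, t in statements.items() if sid not in contested}


safe = uncontested(statements, meta)
print(safe)
```

Setting contested statements aside is only one possible policy, and it throws away exactly the disagreements a scholar may care about most; it is shown here to make the question about inferencing on such a graph more concrete.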
[Thomas Stäcker] As to the quality issue, I think it’s a very crucial point for the scholarly community, for the digital community. I wonder what approach you would choose to solve this problem. Would you please […] on this issue, because I think it’s very simple. You’re accepting that sort of data […]
[Gradmann] The good news is that we don’t need to solve this problem as a digital humanities community, because it’s part of a bigger agenda. There’s this whole linked open data community that’s quite visible in the web architecture today, and which has recognized that solving this quality problem is essential, and all the discussions we have about […] and its use. So I would be optimistic in the sense that, if that problem cannot be solved by that larger community, we have a really big problem on the web as a whole, which we then share with the web as a whole. But the choice is to make our environment a part of the generic web architecture.
[Allen Renear] Could you say more about going beyond simple denotation?
[Gradmann] Well, that may be the crucial point: whether or not that’s actually possible. Because the RDF model, like so many computer science-inspired models, has a built-in semiological limitation. It’s in a way a nominalist regression on a high technical level. It’s just about things having names. That’s why I’m unsure whether it’s possible in aggregating lots of RDF statements to transcend this limitation to denotation in producing an approximation to, for instance, connotative modes of signification.
No one knows. But at least I’d like us to work with some consciousness of this problem.
[Allen Renear] Is the sort of thing you’re referring to that the some of the […]?
[Gradmann] In allowing for contradictions and ambiguity, we may get somewhere. The question is how to deal with these on an inferencing level, because the inferencing machines that we have are of course built on non-ambiguous and non-contradicting sets of statements.
[Maximilian Schich] As one of the persons who probably has one of the largest variety of source data sets, you’re probably in a position to actually evaluate the, say, you know, how does the source data fit to the genre? Do people expect too much? Do people pick low-hanging fruits or, oh, they’re too complicated and stuff like that? Do you see something like a feedback process where one could say “this is how we get the data, and now, to make it better, we have to go back to the providers and say, you know, you have to accept . . .” because . . . I mean, would this work? Do you know how this could work? Nobody likes the pathologist who says “it could be great if you go ahead and do this or that, but–.”
[Gradmann] We have such a feedback mechanism built into the DM2E processing chain. The idea is that the digital humanities scholars working in the specialized environment would be able to feed back to the data providers, or to those who did the automated contextualization, that they’ve done or not done useful things. So, there is a feedback loop built into the project, which is small scale. Although we could do the same thing on the Europeana level, it depends on whether Europeana accepts this idea of being complemented by another specialized platform that would provide functionality Europeana doesn’t want to provide on a portal level. We have Europeana as a partner in this project, and the idea is that sustainability of this project’s results would be obtained by transferring results to Europeana. So, there is this perspective of growing what we do here as a small-scale approach in the project to Europeana as a whole, but that’s a perspective.
[Flanders] I wonder whether there’s a way in which the contents of scholarly data sets, historical materials, could also serve as context. In other words, the Women Writers Project has however many documents that presumably attest to some information that you’re trying to convey. What in practical terms would be involved in using that kind of data?
[Gradmann] That’s one of the crucial technical questions. In more generic terms, it’s about the coexistence of the RDF and, for instance, the TEI data, and how you relate these to each other. How do you point from an RDF statement to a given point in the TEI-encoded document, for instance? This is something we’d like to find out. It’s part of the work package. There are RDF representations of TEI around, and this may be one of the ways to go. But this is something we would be very happy to have advice on from others. I’m wondering if there have been experiments here by people in the room regarding the coexistence of RDF and TEI-based models. That’s one of the crucial questions.
[Julia Flanders] I guess, just to add to my comment, I was also wondering whether projects that use TEI internally, or other markup systems internally, might broadcast or expose in an RDF form. And whether that would provide another kind of avenue.
[Gradmann] I would be careful with the term “expose.” If that means replication of data, no. But, if that means giving the same data a second face…
[Flanders] Like OAI.
[Gradmann] …that would be great.
[Speaker 3] Ok. Thanks again.
[Gradmann] Thank you.