Alexander Czmiel, “The Person Data Repository”

Case Studies–Historical Archives (March 16):

Alexander Czmiel, “The Person Data Repository” (slides, video)

[00:00]

[Alexander Czmiel] Hi. I don’t have to look too far from this perspective. What I’m going to show you now is something that might look, at first glance, like … similar to something … to what Daniel [Pitti] showed us yesterday. It is similar, but at the same time it’s something completely different. [In a lower voice, partly to himself.] This thing is crazy…. [To the audience.] Oh, I don’t have a … [refers to the screen behind him.] Okay; give me a second. I have to remember my password…. [To the person adjusting the microphone.] Oh, yeah, I … thank you! [To the audience.] What do you see now? [Laughs to a reaction. Indistinct exchange of words off the mike.] So before I’m going to present the […] I want to start with a question … with a general question. The question, “why, why are we modeling?” I’m coming off from this pragmatical [sic] approach of data modeling. So if we never ever had to leave this room again, this would be the answer: “we model data, because it’s fun.” But as soon as we enter this elevator and go out of the street, we encounter reality suddenly, and then we need a little more pragmatical [sic] approach, because we want to do something with our data. So from a more pragmatical [sic] point of view, we want to … we do data modeling, because we want to structure information: to think clearer, to do better research, to gain knowledge, and, in the end, hopefully wisdom. Or, to say with easier [to understand] words, to “process, use, and understand data in better ways.” If you cannot process data, it’s useless. You don’t need a data model if you can’t process it. To have [a] successful data model, a reliable data model, you have to think a lot before you start it; it must be well-designed, and it has to have a designated use, a certain purpose. 
And to … further the success, you should provide tools … to generate, access, edit, analyse, visualize, and, in the end, to publish the data as well, so … and, of course, for people who don’t want to see any angle brackets and don’t care about data modeling at all. So these are the things we had to bear in mind when we started our project at Person Data Repository [PDR].

[03:29]

It’s a context. It’s funded by the DFG, where the runtime for two years [sic], and it ended last October … which means we applied for the second phase, and … although we don’t have a concrete answer now, we’re slightly optimistic that we can continue for three more years. The project is embedded in the work of the Telota-Initiative working crew, which is roughly the [DH?] working crew of the Berlin-Brandenburg Academy of Sciences and Humanities–the institute where come from [sic]–and to which I will refer from now on just as The Academy; otherwise, my time is running away [sic]. And you see a list of names, which are the names of persons who were involved or still are involved in this fabulous work. The aims of this project are first, of course, building a repository of biographic information–person data. We restricted the time period to the nineteenth century for two reasons. The first reason is obvious: we only had two years’ time so we had … we had to focus on something. The second reason is that many of the projects … the long-term projects hosted by the Academy do research in the period of the nineteenth century so they produce a lot of biographical information … or biographic information. So these are the two reasons. But all these projects produced this data in different formats. We had very heterogeneous data from different projects. Heterogeneous in this case does not necessarily mean different XML formats; it means Word documents, text files, databases, even books that we had to digitize. A second, very important point is cooperation and implementation of the network. We want to set up more repositories all over the world and build a distributed system. So far we have one sister repository, so you cannot really talk of a network, and it’s set up in Rome at the German Historic[al] Institute there. The other or last important point or aim is open access: everything we provide, the tools, the services, and the data, is for free so please use it.
Take our data, take our tools, and take our source code, and improve it or join us. So far now, after two years, we have about 90,000 person objects, which isn’t equal [to] 90,000 persons. This data comes from ten different projects, and they’re all doing research within the same period. That means there’s an overlap so that 90,000 person objects are referring to not 90,000 persons–I cannot say how many persons. We have 400,000 aspect [object]s, which is the information about a person, from circa 4,000 sources. So in total we have to deal with circa half a million objects from these ten different projects. And at [sic] this diagram you can probably see–or probably can’t see them; I cannot–the distribution of the aspects, the information, that peaks in the middle of the nineteenth century.

[07:41]

So that’s the way/how biographic information modeling is done traditionally. You have normalized names put in alphabetical order, and then when you found a person you’re interested in, you can read the information about it. Or you cannot do with this kind of information as … I have [to] work with very sophisticated research questions. For instance … or for example, if you want to know who signed the Declaration of Independence, you can just … yeah, I know, I’m sure you know, but … [Laughs.] but what if you want to know who signed the Declaration of Independence and went to school together probably, and met years later at a university at another place. Then you get stuck with this model, but you can use our model, because it’s much more flexible. So there it is. Our data model. Three objects, that it is. It’s quite simple, which makes it very flexible. That’s what we needed, and what we wanted; we wanted a simple model which is flexible enough to represent high complexity. So we have these three objects which I will explain in more detail further on: the persons, the aspects, and references. The references are the sources.

[09:45]

The important point for … the important points for us were [that they] had to be flexible, transparent, must … every time it must be clear where the information comes from, from which source, and who was responsible for extracting this information from a [sic] source. And it had to be extensible. Why? I know you surely think that three objects are not enough, and you’re probably right. During the first project phase we realized that, yes, we needed more. And these more things we still need from our places–corporate bodies, obvious. Even objects. Our corporate partner in Rome is doing research on musicians and they want to describe musical instruments … and events. Although no one knows what an event is, there is a need to describe them. So … this is how these three objects stick together. But I didn’t mention the identifier, but this is clear that every object has to have an identifier. So … then we have a person: the green circle. A person itself doesn’t contain any information; it just has an identifier or more identifier[s]. And the aspects: you have the actual information extracted from the source, and the aspects pointing to the source where the information comes from. So this offers the opportunity to give [Stumbles over the word.] contradictory [Laughs.] information as well. So if Aspect One says [he]’s born on March 1, and Aspect Three says it’s [sic] born on May 1, and you can … both have point[ed] to the same … to the same person/object, and it’s [up] to the … to the researcher to decide which one he prefers and which source he trusts more.
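[Editorial illustration, not part of the talk.] The three objects described above, and the way two contradictory aspects can point to the same person, can be sketched roughly as follows. This is only a minimal sketch under stated assumptions: the class names, field names, and sample sources are hypothetical and are not taken from the PDR schema.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Reference:
    """A source: a manuscript, a book, a database, and so on."""
    ref_id: str
    title: str


@dataclass
class Person:
    """Carries identifiers only; what is known about it lives in aspects."""
    pdr_id: str
    other_ids: dict[str, str] = field(default_factory=dict)  # e.g. {"PND": "..."}


@dataclass
class Aspect:
    """One statement taken from one source, attributed to whoever entered it."""
    aspect_id: str
    text: str              # wording as found in the source
    person_ids: list[str]  # one aspect may point to several person objects
    reference_id: str      # the source the statement comes from
    responsible: str       # who extracted the statement


def aspects_for(person_id: str, aspects: list[Aspect]) -> list[Aspect]:
    """Return every aspect that points to the given person object."""
    return [a for a in aspects if person_id in a.person_ids]


# Hypothetical sample data: two sources that disagree about a birth date.
sources = [
    Reference("r1", "Hypothetical parish register"),
    Reference("r2", "Hypothetical biographical dictionary"),
]
person = Person("p1")
aspects = [
    Aspect("a1", "born on March 1", ["p1"], "r1", "editor_a"),
    Aspect("a3", "born on May 1", ["p1"], "r2", "editor_b"),
]

# Both statements stay in the data; the researcher decides which to trust.
for a in aspects_for("p1", aspects):
    print(a.text, "| source:", a.reference_id, "| entered by:", a.responsible)
```

Nothing is merged or overwritten in this sketch: each aspect keeps its own source and responsible person, which matches the transparency requirement stated in the talk.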

[11:44]

So now let’s have a look at these objects–the data. We start with the reference object, which is the only object we use in existing modeling scheme–the Metadata Object Description Language [MODS]–which is fine for us for describing digital objects. So the reference object describes the source of an information [sic], which can be a manuscript, a book, a person, a database, anything. And first, it was important that from the aspects we are always able to reconstruct the source. That we not [only] have the persons, but we can [also] reconstruct where the information comes from. There’s the … the person object which can have one or more identifiers and that’s it. Well, in the XML example, there’s this element record, which is for administrative purposes, where the users start to generate this object, when it was generated, and when it was … revised, or the … revision statement. The identifiers: of course, every person object as every object in this data model has a PDR identifier from the person … the repository, but he could have many more from the PND or the LCCN or VIAF or whatever; or you can introduce your own authority records. Like the “MUSICI”-Project in Rome does. This is obviously, in their authority record, person number 65, which I don’t know who he is or who she is–it’s just the object–because who this person is is described by the aspect object.

[13:57]

The aspect object … this holds the information concerning a person, and this information is taken directly from the source. It’s a string–how it is represented from the source with the same spelling. And then you add semantic information on a metadata level where you can use our predefined set of classifications, or you can extend it, or you can use your own set. We use text for this…. So you can say, for example, we have the information for a person. This person is a philosopher. So we want to take this as occupation and hobby, and then leave the decision and the interpretation “if being a philosopher is an occupation or a hobby” to the scholar. So we don’t label the information directly. And the aspect objects additionally have geographical statements, time statements, relationships to persons and other objects … other aspects, and the context, of course, as I said, the source it comes from, the certainty and the responsibility [i.e. identity of the aspect object generator], which means who am I [sic] the user who generated this aspect object. This is exactly what I just did: just formalized in a diagram with blue boxes. And this is an example of an XML serialization. All these objects are described–the aspect [objects] and person objects are described in a[n] XML schema. So the information … the actual information is in the lower part of this document in the notification element. There you have the string how it comes directly from the source. And this string you can annotate with further markup, further annotation, which is aligned at [sic] TEI, but on a very flat level. We don’t allow too much nesting at this point for processing reasons.

[16:47]

Another big problem: identity. We have these person objects but we don’t know who they are. So we identify the persons via identifier–the PND or LCCN authority records. So here we have three persons: P1, P2, and P3. P1 and P2 have the same ID–the same identifier–the same PND number. So that means that the blue box, the aspect in [sic] the far left, also belongs to P2 and not just to P1. If you get … go one step further, and you identify that P2 and P3 are also the same persons via LCCN, for example, then you know that every aspect belonging to P1 is also relevant to P3; that’s because [they’re] the same person. But the … I know [Laughs.], it’s not … it’s not that easy. [Laughs again.] But the scholar or the user or the researcher doesn’t have to believe this. He can decide, “Yeah, no, I don’t trust you; I know it’s not true.” So the decision … this is just the model. What you make with the model, how you use with the model, is on the scholar. This is a concrete example of Rudolf Virchow. On the left you see the data sources–where the information comes from–then you see the extracted aspects pointing to the person objects, and then the identifiers [as] I just described how we suggest identity. So what are we doing with this all? Now the repository is coming [up]. We put this all in the Fedora Repository, connected with a Lucene Solr-based search index. And we provide conversion services for different kinds of sources that … are scripts, which convert all this heterogeneous data into an aspect model. And we provide exchange services … for interoperability reasons. We have software at Tool with a very nice user interface everybody can use to collect data. It’s synchronized with the repository. You can work offline. When you synchronize it, it’s updated, and you can search the data. We provide publishing service, which means a webservice–some kind of API and a direct access to the repository as well.
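[Editorial illustration, not part of the talk.] The identity suggestion in the P1, P2, P3 example can be read as grouping person objects that are connected through shared authority identifiers. The sketch below uses a small union-find structure to do that. It is an assumption about how such a suggestion could be computed, not a description of the PDR implementation, and the function name and identifier values are made up.

```python
from collections import defaultdict


def suggest_identity(persons: dict[str, set[tuple[str, str]]]) -> list[set[str]]:
    """Group person objects that are linked through shared identifiers.

    `persons` maps a person object id to its (authority, value) identifiers.
    The result is a suggestion only; nothing is merged.
    """
    parent = {p: p for p in persons}

    def find(x: str) -> str:
        # Follow parent links to the group representative (with path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_seen: dict[tuple[str, str], str] = {}
    for pid, idents in persons.items():
        for ident in idents:
            if ident in first_seen:
                # Same identifier seen before: join the two groups.
                parent[find(pid)] = find(first_seen[ident])
            else:
                first_seen[ident] = pid

    groups: dict[str, set[str]] = defaultdict(set)
    for pid in persons:
        groups[find(pid)].add(pid)
    return list(groups.values())


# Hypothetical identifiers mirroring the example: P1 and P2 share a PND
# number, P2 and P3 share an LCCN, P4 shares nothing.
persons = {
    "P1": {("PND", "pnd-0001")},
    "P2": {("PND", "pnd-0001"), ("LCCN", "lccn-0042")},
    "P3": {("LCCN", "lccn-0042")},
    "P4": {("PND", "pnd-0999")},
}
groups = suggest_identity(persons)
print(sorted(sorted(g) for g in groups))
```

Because the person objects themselves are left untouched, a scholar who distrusts one of the links can simply ignore the suggested group, which is the point made in the talk.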

[20:08]

So, now, this is our dream to have a network of different repositories for different scholars [who can] provide data and can use the data of the scholars, where everything is connected with each other. As I said, we now are two repositories, and we’re looking forward to be[ing] more [sic] in the future. So how can you use the PDR? As I just mentioned, we have this very cool archive editor, which is connected and synchronized with the repository which helps you to … to get direct access to authority records, and which helps you to model … follow directly this data model without knowing how this model works. And when a scholar comes to us and says, “Yeah, I want to do something. I have data, I have personal information which I want to collect in a way.” We say, “Yeah, we have the proper tools; take it!” [sic]. And then they say, “Wow, that’s so impressive! That’s so cool, so powerful! What can you do with it? But I’m not willing to learn it, because I don’t need all of these features. What I … what I need is just a table to put in a name, a place of birth, a date of birth, occupation, education, all this [sort of] stuff….” And then [we] say, “Yeah….” And in two years you come back to us and say you want to do research on your data, and it’s not possible, because it’s not powerful enough what you have there. So … what we did then [is that] we met in the middle and built a simplified user interface, which doesn’t offer all the features but reduces the [shock] factor [of the first version]. So I hope people are using more and more.

[22:23]

We have a website, of course, where you can…–this is just an example website. It’s unfortunately not yet available for [sic] the public, but I can show you during lunchbreak–where you can exactly do these kinds of query, or research query, that I mentioned before. We have this PIT, the PDR interface tools, which are webservices. You can directly access the repository. I wanted to show you this live, but I think I’m already late, am I [sic]? Yeah, okay, yeah, I have two or more slides. [Laughs.] You might have wanted a…. I didn’t show you a network visualization, because this is what people are [always] thinking first off when talking about personal information: “Wow, you can build networks!” Well, it’s not that easy, because we have so many different kinds of relation between people that you can, say yeah, you can build a network of … or you can build a professional network or correspondence network. Or even imagine Wendell sitting here, has been cited by Trevor, and Laurent is writing his son an e-mail about it. So what [are] the relation[s] between these persons? What is … [What] could the network look like?

[23:58]

So we have other things–not networks yet; we’re working on it. We have the archive editor, which is well documented and comes with a multilingual user interface. It’s available in German, Italian, and in English. We have a website for querying sophisticated research questions. We have these conversion scripts, which do, while transforming data, automatic pattern recognition for dates and places, which works quite well if your place is located in Prussia of the nineteenth century. We have interfaces, as I mentioned, and we have something alien–we have documentation. And for the second project phase, we want to get rid of these time boundaries: no limits anymore. We want to collect data from any period of history. We want to extend the data model, as I mentioned, and try to … to improve automatic identity recognition. We want to support workflows. This is something which came from … scholars. They said, “Yeah, I want to start [my work] with the archive editor, and I want to always have the feeling that it’s … that I have support, that I know which revision it is from … from starting reading the source until publishing the research results.” They want [to] be treated very well so we want [to] support this workflow. This is what TextScript [does] very well. We want to improve interoperability so we want to export in different formats–like TEI, RDF–to become [part?] of the Linked Open Data [LOD] world. And we want to export JSON. We want to do much much more visualizations. I didn’t show you any visualizations, because this isn’t integrated into PIT. We think about jQuery Plugin that if you’re building a digital resource, you have persons in your material. You can just link this or identify this person with an identifier and point to the person there in the repository, and then you get back all the information [on] this person or just the information you want, and then you can show it with this jQuery Plugin on the fly.
And we’re looking for more cooperations to build this network of person data repositories, so please join us. Get your own instance of a person data repository. Let’s build this network of person data repositories, and then, lunch break. [Laughter, applause, and a few exchanges.]
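[Editorial illustration, not part of the talk.] The talk mentions conversion scripts that perform automatic pattern recognition for dates while transforming source text. How the project does this is not described, so the following is only a guess at the general technique: a regular expression that picks German-style dates such as "1. März 1821" or "05.09.1902" out of a string. The pattern, the month table, and the function name are all assumptions, and the year range is deliberately limited to 1800 to 1999.

```python
import re
from datetime import date

# German month names, lower-cased for lookup.
MONTHS = {
    "januar": 1, "februar": 2, "märz": 3, "april": 4, "mai": 5, "juni": 6,
    "juli": 7, "august": 8, "september": 9, "oktober": 10, "november": 11,
    "dezember": 12,
}

# day "." then either a numeric month with "." or a month name, then a year.
DATE_PATTERN = re.compile(
    r"\b(\d{1,2})\.\s*(?:(\d{1,2})\.|([A-Za-zäöüÄÖÜ]+))\s*(1[89]\d{2})\b"
)


def find_dates(text: str) -> list[date]:
    """Return all recognizable dates in `text`, in order of appearance."""
    found = []
    for day, month_num, month_name, year in DATE_PATTERN.findall(text):
        month = int(month_num) if month_num else MONTHS.get(month_name.lower())
        if month is None:
            continue  # the word after the day was not a month name
        try:
            found.append(date(int(year), month, int(day)))
        except ValueError:
            continue  # impossible date such as 31 February
    return found


sample = "geboren am 1. März 1821 in Berlin, gestorben 05.09.1902"
print(find_dates(sample))
```

A real conversion script would need many more patterns (abbreviated months, uncertain or partial dates, other languages); this sketch only shows the shape of the approach.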

[27:02]
[Syd Bauman] Really cool stuff. I love playing with this idea both of, you know, describing people and dealing with the problem of identifying people, disambiguating them, […] But … a couple of questions about models since this is a data modeling workshop. I didn’t catch when you have an aspect, can an aspect apply to one person/contact?

[Czmiel] Oh, no, no, of course not. You have–if I may take yours as an example–if the … if you have the information that is: “If [this information says that] Mr. [Wendell] Piez attended a data modeling workshop like in 2012, then you’re not sure if it’s Wendell Piez or his father. You can … you have to search the statement, then you can [make] sure, “It’s probably Wendell, but perhaps it was his father.”

[Wendell Piez] I think he’s watching. [Laughter.]

[Czmiel] Oh! [Gestures.] I think my parents are watching, too, so…. [Laughter] And you can do this as well with the information like the signing of the Declaration of Independence and pointing it to all the person objects who are part of this event. This is more [of] an event than a … [Gestures and smiles.].

[Syd Bauman] So … so to represent that we are all here, the fear of ambiguity is what led you to using one aspect to say … to point to twenty people, whatever we are. So we have, “Forty aspects attended this workshop.”

[Czmiel] No, we have one aspect pointing to forty person objects.

[Bauman] So we do have one?

[Czmiel] Yeah.

[Bauman] Only one.

[Czmiel] Yeah.

[Daniel Pitti] And that aspect would have to be the space, a corporate body, which is to say that … that one aspect we interpret, I would say, would be a corporate body, which is to say all of us scattered here for a certain range of time under the domain of Knowledge Organization and Data Modeling [KODM].

[Czmiel] But then you have the problem, “Is this a corporate body or is this an event?” So we thought … that’s the reason why we thought we [wouldn’t] do any prediction of any information. We’d just say that it’s an aspect describing a person. And then a scholar can decide, “Is this an event or….”

[Pitti] The library world doesn’t need some conference, as they classify it, because they define the world, and you have to put everything into the slide as a corporate body.

[Julia Flanders] And I hope they have a really good definition for “conference” so we know when we’re in conference, or workshop, or a symposium, or a meeting that’s gone unofficial, or … [Nods smiling.]

[Pitti] They didn’t, they didn’t begin with that conference or expedition or … if you were at the tail of that expedition, then you would be….

[Flanders] Right.

[Piez] Well, what Alex is suggesting is something analogous to […]

[Flanders] Exactly.

[Piez] … from those assertions to those more nebulous categories.

[Flanders] So in fact we might have a conference-like description of this thing, we might also have a meeting-like description, which treats each of us as individuals or something like that. That would be nice. Um, Max [Schich] and then…. [Points to Jim Kuhn.]

[30:38]
[Maximilian Schich] One thing … I think … the point you have on your list is like “export,” right? I think that’s something which is missing a little in digital humanities projects, and that actually should enable you to solve a lot of other problems on your list. So the export is actually I think the most important thing, because if you do networks and visualization, people come up and say, “One tool with one or [a] couple of other ones … we’ll do it one way or another,” right? But you … [Czmiel asks for clarification.] If you … So you had this export to TEI, RDF, right? [Czmiel] Yeah [….] link to [data cloud?], JSON, whatever. And so … And then the visualization maps […] So there’s one thing about network visualization, for example: that we can easily visualize a network on a small level, which would basically boil down to the same thing that a user interface does, right? Because what you see is actually a small network, which is bent together to some test, but if you want to analyze, say, the whole social network structure of this whole thing … including all the [link types?] it’s very likely that you will have orders of magnitude better results if you just throw it to the social network crowd, which is huge by the way–and just let them mess with it, and then one guy says…. [Czmiel: That’s basically what we do.] So, basically, to have that on the agenda to do something really really interesting, and then have a feedback to the model, which is something very interesting. That’s something which doesn’t really exist, but just people analyze things, and….

[32:17]
[Czmiel] There are … these are two points. The one thing is what you said: re-exporters re-export the data, and anyone can use it, so if you want to use it, if you want to do analyze the whole thing, please do so; we would appreciate it. The other thing is we want to test our model. Therefore, we’re doing these visualizations that … to see if this works, or [conforms to what] we thought beforehand. That we thought that, “We want to model this certain way, because we want to … visualize maps, networks, timelines, stuff like that. And we want to do this from a humanities, or digital humanities point of view, and it’s like doing digital humanities research. [Laughs; a few indistinct exchanges.]

[33:17]
[Jim Kuhn] In one of your earlier slides I noticed a DOI associated [Loud coughing.] with something, but I wasn’t sure whether it was for record, describing a person or…. What gets a DOI in your system? For instance, if I submit a Julia Flanders [to the] system. [Indistinct question from Czmiel.] Yeah…. How do you disambiguate, how do you merge items?

[Czmiel] You never … you do not merge items; no. Just pointing.

[Kuhn] So there might be fifteen Julia Flanderses in that network.

[Czmiel] Exactly. But they’re all [the] same. Who knows if these fifty Julia Flanders are all the same? Who knows? You can only do research on it and then decide, “This Julia Flanders is identical to that, that, and that, but [not to] these.” So you cannot go [and] put all data together, because [we would not feel confident doing such a thing].

[34:23]
[Thomas Stäcker] Just two questions. The first is rather a practical one: what is the status of the data? [Did] you capture this from publications of the Academy services for your presentation? [Czmiel: Yeah.] And how accurate are these data? And the second question leads to the position you offer in your data model. So this is a very simple, so to speak, data model. And what might happen if you got a PND or GND number that you have access to not accurate information in terms of ontology–how can you manage to integrate this into your data model? So this is a problem of consistency of data. If you ask the crowd to give you information, then your model may be not efficient enough given this position that the crowd gives you. This a little concerns me.

[Czmiel] At what point [is this] not efficient enough?

[Stäcker] So for instance, say, say something about the SKOS preferred label. So you have various labels for the person. And in PND you have this preferred label. [Czmiel: Yeah.] And … how do you define this in your data model?

[Czmiel] You generate an aspect. [Stäcker: Well, you have to qualify your aspect….] Yeah, you have to qualify your aspect and say, “This”–you have a name … let’s say Wendell–sorry Wendell! [Laughter. Piez jokes back.]–let’s say Wendell; Wendell Piez is the name. Then you have an aspect with a semantic statement that says, “It’s the long name of the PND.” And you have the qualified … the certainty. You say, “We trust German National Library so he is it.”

[36:27]
[Daniel Pitti] But we trust you to trust them. I think that’s part of the issue. Who…. [Czmiel: Yeah, of course, of course….] What … what is the witness? Is anybody a witness? And are they trustworthy?

[Schich] That’s very interesting…at [census.gov?]  the metadata point to the … … Nevertheless Nature and Science papers get away with saying, “I have the […] population data from [census.com?] which is basically there is no reason why that should be more accurate or awesome than somebody else’s estimate, if they only have twenty estimates. But it’s basically boiling down to that, right? The Gettys or […] census, which is also […] group project–they have a statement where they say, “If you produce data, it’s actually the opinion of the curator.” So we can actually say, “We have, say, twenty-seven aspects that are called Julia Flanders,” but actually our preferred ID for them is the PND one, right? Or vice versa: you could say, therefore, “This ID of the preferred Julia Flanders aspect is actually this one.” Which helps you if you then go and look for the data, right? Or you look into the data. And you want to visualize something; you have to, for example, decide on which label you actually put on that note, right? And….

[37:53]
[Czmiel] That’s true, but the problem is different people prefer different things. [Schich: Yeah, yeah, but….] We don’t give any preference.

[Schich] No…. To give an example: there are people [on] Wikipedia [who] are assigned like Einstein as a scientist, but who might also be a piano player, or a […]. And then the question is … as you said, it’s pretty clear for Einstein, of course, and for Wolfgang Amadeus Mozart, composer, that this is just one guy, even though there might be other people who call themselves like that. But for most of the other people … for Julia Flanders, there are 37 Julia Flanders whom we didn’t know. [Czmiel: Yeah.] And that’s a really really big problem if you want to be interpretive in terms of analysis.

[Czmiel] But we cannot solve this problem; we can just provide the information.

[Confused exchanges.]

[Pitti] One, one problem with that though is at the current time, the people who’re moving the indents through the arena are the owners of intellectual property. So the accuracy and reliability of being able to identify someone accurately as a legal entity with rights becomes extremely critical, so it doesn’t become a matter of competition…. [Voice becomes indistinct.] And that’s [Gestures.] Now there are things going on that are all over the place that they’re looking for very reliable identification of a person…. That’s the intellectual property [side]. On the other side–the dark side–of course you have governments who’re very interested in positive identification.

[Czmiel] Actually, at first you should be interested in what identity means. Because sometimes I have the feeling that when I’m in the audience I’m a very different person than when I’m standing in front of the audience. So….

[Laughter. A few indistinct exchanges, and more laughter.]

[Flanders] I think it’s lunchtime. Thank you.
