Case Studies–Research Ontologies (March 15):
Daniel Pitti, “EAC-CPF” (video)
[Daniel Pitti] I am going to talk about the Social Networks and Archival Context project, and that is actually not what I came here to talk about. I was going to give you the jaded view of someone who has spent close to thirty years modeling data in one context or another, but a lot of people expressed a lot of interest in this project. And while I am not going to go into the data modeling in any explicit way, I do think it raises all kinds of issues about the use of data models, and about what you can do with data that is smarter than the raw data we were talking about before.
I really do not like this display that PowerPoint is giving me because I cannot see the…and I cannot do anything about it. Let me stand out here; I will switch and stand out here. The first thing I would like to talk about is the funding timeline. The project is funded by a grant from the National Endowment for the Humanities. It is a two-year project; it began two years ago, so it will formally end next month, but at the end I will talk about what might happen after that period. The people involved are myself and Worthy Martin at the University of Virginia, Ray Larson at the University of California, Berkeley, and Brian Tingle and Adrian Turner at the California Digital Library in Oakland, California.
[delay adjusting slides]
[Daniel Pitti] The project’s objectives, succinctly stated: on the one hand, the audience and the community I am interested in are archivists and manuscript librarians, having been involved for a long time in transforming what was an analog-based descriptive system into a machine-readable descriptive system. Part of what I am up to here is furthering that process of creating a machine-readable environment for archival description. This particular step involves taking the description of the people that create and are documented in archival records and separating it from the description of the records as such. But beyond that, it is also designed for the users of archives, which is to say many of you in this room and a fairly large proportion of the scholarly world that is interested in things human, as well as a large proportion of the general public.
So the first thing, in terms of enhancement, is providing integrated access to archival resources created by and documenting a particular person, corporate body, or family. That is one of the holy grails: unified access. The other thing we are doing is providing the social and professional context within which these people lived and worked, and you will see this demonstrated if I do not run out of time.
The data that we are working with is around 30,000 archival finding aids, or guides, encoded in Encoded Archival Description (EAD). In addition to that we have authority records from the Library of Congress and authority records from the Union List of Artist Names from the Getty, and finally, where that slide says 5 million, we are now working with 16 million Virtual International Authority File (VIAF) records, which we use for matching purposes and to help in identity resolution.
So the processing we are engaged in here is essentially done in three steps: the first at Virginia, the second at Berkeley, and the third at the California Digital Library. What we are doing is extracting information, the names and, when we can, the descriptions of these corporate bodies, persons, or families from the archival description, and assembling that into Encoded Archival Context–CPF records. The CPF stands for corporate bodies, persons, and families. Then we match those records against one another in a process called identity resolution, and it is a bit tricky: if you have the same name string or similar name strings, are they for the same person or not? When we decide that they are for the same person, we merge them together, keeping track of all the information and where we got it.
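The identity-resolution step described above can be sketched in miniature. The following Python sketch clusters similar name strings; the normalization rules and similarity threshold here are illustrative assumptions for demonstration, not the project's actual algorithm.

```python
from difflib import SequenceMatcher

def normalize(name: str) -> str:
    """Lowercase, strip punctuation, and sort tokens so that
    'Bush, Vannevar' and 'Vannevar Bush' compare as equal."""
    cleaned = "".join(c if c.isalnum() or c.isspace() else " " for c in name.lower())
    return " ".join(sorted(cleaned.split()))

def name_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two normalized name strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def merge_candidates(names, threshold=0.9):
    """Greedily cluster name strings whose similarity exceeds the threshold;
    each cluster stands for one putative entity after merging."""
    clusters = []
    for name in names:
        for cluster in clusters:
            if name_similarity(name, cluster[0]) >= threshold:
                cluster.append(name)  # deemed the same entity: merge
                break
        else:
            clusters.append([name])  # no match: a new entity
    return clusters
```

Real identity resolution, as discussed later in the talk, weighs additional contextual evidence rather than relying on the name string alone.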
Subsequent to that, they are matched against, so far mostly, the Virtual International Authority File, where we go through the same identity resolution process and acquire and aggregate the additional data we find there. So I just laid out the key challenge in that second bullet point. Then finally, taking the resulting set of matched, merged, and enhanced records, we are creating a prototype historical resource and access system, which presents the historical data and the social and professional networks within which the people existed, with links built to archive, library, and museum resources.
The source data, as I have mentioned, just to say a bit more about it, is these EAD-encoded finding aids created by archivists and manuscript librarians. As they currently stand, they intermix the description of records with the descriptions of the creators. The description of the creators, from an archival point of view, is thought of as providing context for understanding the records. You have these records that emanate from one source, and in order to interpret and understand them you need to know something about who created them and the context in which they were created. Some of these have very, very detailed descriptions of the creators. They vary widely: while the archival community has been engaged in an ongoing process of becoming more standardized and normalized, with more and more of a consensus, the quality of practice is irregular. If you go to the Library of Congress, the finding aids there are created by the divine guidance of God, and most everyone else does something far more human than what they do at the Library of Congress. Many of the names are given but not identified as such: some of the names are tagged and easily found, while in other cases the names are just there in context. Some of those contexts are natural language; others are a very strange context that is not natural language, but the names are there.
One of the things we are focusing on is papers, private papers, that have correspondence in them, so we are particularly interested in whether this person corresponded with that person, because then we can establish as a fact that there is some social or professional relation between them. In other cases we cannot identify that, and I drop it into an ambiguous category: they are merely associated. Sometimes those associations are purely intellectual. If someone collected Dante, for example, you will see a link from this twentieth-century person to Dante, and you clearly know they did not know one another personally, but there is an intellectual relation between the two.
I am going to speed this up a little because I want to show you the demo, which is what most people want to see. This is a bit on archival records. One of the key things to know about them is that records are the byproduct of people living and doing their work. Allen [Renear] brought up the word intentionality yesterday, and went on to qualify exactly what he meant by it, which was not what most of us assumed. Quite frequently, when we are doing a digital humanities project, we are intentionally building that project, but in the process of doing so we are sending emails to people, writing grant proposals, and doing all kinds of other things that serve the purpose of building it. So the project out there is the intentionally built thing, and these other things are byproducts of us following through on that intention. The archive of the digital humanities project is, in fact, all those records that document its creation. I was trying there to make this whole notion relevant to this group; I hope that succeeded.
The underlying standard, in addition to EAD, is a schema called Encoded Archival Context, CPF, and again it is designed for describing corporate bodies, persons, and families. It is based on the International Council on Archives standard ISAAR (CPF), and an international group of people is behind its design. Just a touch of what the tagging looks like: this is the identity section. What you are seeing at the top is an entity type, which says it is a person here, and then you have a name entry, in this case for [Robert] Oppenheimer. There is no such thing in this system as an explicitly identified authorized form; you just have different names for the same entity, but there are ways to designate which of those names is the one you prefer within your catalogue. Then things like exist dates; I will probably skip down past the local description. Language used by the person, or languages used by the person. Occupation, or occupations, that the person had. One of the things you will find frequently in these descriptions is what we call a chronological list, which is a sequence of rows with a date, a place, and an event in them. There are ways of extending this: if you want to type it in such a way as to participate in a conceptually modeled world, you could do it. You can create timelines out of this; we have not gotten to doing that yet. And this would be a case of linking one person to another person. The role here of using a FRBR entity is to say that at the other end of this is a person: corresponded with, a relation entry, a resource relation.
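To make the tagging just described concrete, here is a Python sketch that assembles a minimal EAC-CPF-style record for the Oppenheimer example. The element names (entityType, nameEntry, existDates) follow the EAC-CPF tag library as described in the talk, but the structure is a simplified illustration, not a schema-valid record.

```python
import xml.etree.ElementTree as ET

# Minimal EAC-CPF-style fragment: an identity section naming the entity,
# plus a description section with exist dates. Simplified sketch only.
cpf = ET.Element("cpfDescription")

identity = ET.SubElement(cpf, "identity")
ET.SubElement(identity, "entityType").text = "person"
name_entry = ET.SubElement(identity, "nameEntry")
ET.SubElement(name_entry, "part").text = "Oppenheimer, J. Robert"

description = ET.SubElement(cpf, "description")
exist_dates = ET.SubElement(description, "existDates")
ET.SubElement(exist_dates, "date").text = "1904-1967"

print(ET.tostring(cpf, encoding="unicode"))
```

Note that, as the talk says, nothing here marks one nameEntry as the authorized form; a record can simply carry several nameEntry elements for the same entity.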
Okay, so quickly, year one extraction. This is a little bit out of date because I put these slides together a few months ago and we have done a bit more processing, but in essence we have extracted close to 200,000 names from the 30,000 finding aids, and then whittled those down to about a—well, you will see when I show you the prototype; I cannot remember the exact number. Okay, now what is next in all of this? It is actually judgment day for me: I will find out today whether the next step goes forward. If so, we will vastly expand both the research agenda and the source data and have at least a bit of an international flavor, adding data from the UK and a little bit of data from France. Also taking place in parallel is an Institute of Museum and Library Services funded project where we are going to try to transform what we are doing into a sustainable national cooperative program, so I have a series of meetings set up to begin writing a blueprint for what that would mean. And finally, since I am running out of time here, I will have to cut and paste this since it is not active as a link, and unfortunately it does not look very good because it was not designed for such low resolution.
So close to 200,000 names were reduced to about 130,000, which you can go in and search. We have filtered some out here because a lot of things end up being just a name and not much else, linked to one resource. But for some of the more interesting people: I like to do Vannevar Bush, since he is the hero of linking, of thinking machines, of the web; everyone looks at him as the father. So here is Vannevar Bush. I can see here that he was an engineer, a physicist, a public official, a science administrator. I can read a biography of his life. I can go off to the right and, given the limited source data here, he is identified in nine archival collections. He is the creator of this collection here at the Library of Congress, and he is referenced in eight more in addition to that. He is also associated with these forty-six people and these nineteen corporate bodies. Additional resources are found here. Linked data is here, so you can go off to a VIAF record, and we have DBpedia as well, so you can go off there. And then this is in the early stages, but everyone seems to like this.
Here he is mapped into a radial graph. The interesting thing about Vannevar Bush is that you can find him connected with Robert Oppenheimer, the father of the atomic bomb, whom you will find related to T. S. Eliot, whom you will find connected to Marx: not Karl, but Groucho Marx. That makes Groucho Marx two degrees of separation from the father of the atomic bomb. I have no idea what that means, so I will stop there. Questions?
[Maximilian Schich] There is this problem of normalizing people. Right? It is obviously one of the huge elephants in the room. So the other elephant in the room—
[Daniel Pitti] Literally and figuratively.
[Maximilian Schich] I’m sorry?
[Daniel Pitti] Normalizing people —
[Maximilian Schich] Yeah, yeah. So it is the same data. We have some other elephants, like locations for example.
[Daniel Pitti] Yeah.
[Maximilian Schich] So it is a very similar problem and it could also be useful to that. So are you dealing with that problem?
[Daniel Pitti] Yes, it falls into the category of identity resolution: if you have two strings that you have reason to believe may be for the same entity, then you try to algorithmically determine whether they are for the same entity, whether it is geographic or a person or a corporate body. The other thing is that, while we are not dealing with this yet for geographic names, the underlying encoding framework will support it. What we want to do is take the timelines where we have events like "was born here," "did this here, and did this here." We will process those and match them against a geographic name authority file, do the same things we are doing with the other names, and then pick up the coordinate information. And then the idea is that we take a chronological list and do a mash-up with a timeline and a map.
[Maximilian Schich] My other question is […] just like yesterday you had not […] yourself.
[Daniel Pitti] Yeah.
[Maximilian Schich] Is this a typical kind of […] data models always use the ones that always have the most data. Where the problem is if you take twenty percent where you have eighty percent of the data […] make some solution which may totally modify to the wrong table of names where you have […]
[Daniel Pitti] And under-identified, say, with no birth date or death date or anything else. There are a few ways you can do this from a processing point of view, and one of them is the context out of which you retrieved the name to begin with. Let us say you have a name string that shows up with this physicist and a similar name string that shows up with this physicist. The fact that each of them is associated with a physicist may or may not help you tip the balance towards identity resolution. Typically what you do is get match candidates and then look for evidence which you can weigh, and if you have enough accumulated weight in terms of the additional evidence, you can slide the scale. You also have to review and do recall and relevance evaluation of what you are doing in order to tune your algorithms. But you can get it pretty accurate; the people at OCLC Research have gotten quite good at this.
[Maximilian Schich] Are you doing this automatically? The assurance that if you have different –
[Daniel Pitti] Yes, at this point it is all being done algorithmically by Ray Larson.
[Maximilian Schich] How is it done?
[Daniel Pitti] I can give you some citations, not off the top of my head, to some of Ray’s work but there is also a lot of literature on the subject. And it involves reading all kinds of mathematical symbols at some point if you get into it.
[Maximilian Schich] That is very interesting for data models. If you go down to the second most frequent thing: after having only one document, it would be two documents. And people will have two countries with which they are associated, and that would probably be twenty or thirty percent of the data. And for this twenty or thirty percent of the data you have […].
And data models are more [tough?]. They do not usually pick up, say, how do I actually encode that I cannot decide whether it should go this way or the other.
[Daniel Pitti] Yes. Well, one of the things that we would like to do and have not done yet is encode a relation between two entities that says "may be the same as." Then, if you started a national cooperative, those would be flagged for a human being to come in and see whether they could find additional data and resolve it. And you could easily envision a future in which, if you get all that off the ground, which we will see, you add a social computing network to it so that the public can come in and make suggestions and say, "that is my aunt so-and-so, who was born in County Cork in this year," blah, blah, blah.
[Michael Sperberg-McQueen] Question about the converse problem in […] and how to deal with it. You talk about how you start with the assumption that names listed over here and the names listed over here are […] and you merge them—
[Daniel Pitti] Yes, I know where you are going with this, Michael.
[Michael Sperberg-McQueen] And I do not know enough about archival practice to know whether it happens in finding aids, but it happens in prosopographic work: you have—
[Daniel Pitti] People make mistakes.
[Michael Sperberg-McQueen] You think it is one person, and you later decide it is two or three. You do not have very much data, so you go into Holmes and get three names, probably the same guy. Later you discover one is a justice of the Supreme Court and the other is a poet.
[Daniel Pitti] And another is an archivist.
[Michael Sperberg-McQueen] So how do you handle that?
[Daniel Pitti] Well, from our perspective, keeping two names apart when you are in doubt is better than merging them together, because a wrong merge obscures it. So one of the things is that you want to make sure your matching and merging are reliable. One of the things we will be doing, God willing, in the next phase is to have an outside person go through this and systematically check. And OCLC Research is making available to us some software for quality assessment, and that is an absolutely critical piece of this. The whole issue of identity resolution, of course, is absolutely critical; it is a very hot topic in the IP world right now as they begin to implement ISNI and ORCID and other things, so that piece of it is—
[Syd Bauman] If I understood correctly, a large part of the project is to take large datasets from disparate sources and kind of pull them in. I do not recognize all the sources, but the ones I do recognize do not share a common format, and I do not think they really share a common data model behind the format. How did you—
[Daniel Pitti] Well, which of the formats? What I am extracting all the data from at the moment is EAD, and again, if all goes well, I will also have some MARC archival descriptions, about two million of them.
[Speaker] But someone in the Library of Congress will hand you a record that does not look like an EAD record. How—
[Daniel Pitti] Well, the Library of Congress is not going to do that, but WorldCat is going to give me MARC bibliographic records, and we will write a separate process for extracting the names and assembling the EAC-CPF records. So it is just a different mapping.
[Speaker] So any particular interesting observations on the different data models that you have had to suck in?
[Daniel Pitti] Mostly we are dealing so far with EAD and EAC-CPF, and then the Getty has its own, and we have deciphered that and have mapping models for it. VIAF has its own for doing its clustered records. The way this is all done, the data is very large, and Ray has to index all of this stuff; Cheshire is the state-of-the-art XML indexing tool. At the moment, the last thing he did was map over the schema for the VIAF records, for example; all the data is coming in XML. And his latest development is that he has created n-gram indices for the 16 million cluster records.
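To illustrate what an n-gram index over name records looks like, here is a tiny character-trigram index in Python. It is in the spirit of, though far simpler than, the Cheshire indices over VIAF clusters mentioned above; the structure and thresholds are illustrative assumptions.

```python
from collections import defaultdict

def ngrams(s, n=3):
    """Character n-grams of a name string, lowercased and edge-padded."""
    padded = f"  {s.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

class NgramIndex:
    """Tiny character n-gram index for fuzzy name lookup: a misspelled
    query still shares most n-grams with the name it meant."""
    def __init__(self):
        self.postings = defaultdict(set)  # n-gram -> ids of names containing it
        self.names = []

    def add(self, name):
        doc_id = len(self.names)
        self.names.append(name)
        for gram in ngrams(name):
            self.postings[gram].add(doc_id)

    def candidates(self, query, min_overlap=2):
        """Names sharing at least min_overlap n-grams with the query."""
        counts = defaultdict(int)
        for gram in ngrams(query):
            for doc_id in self.postings[gram]:
                counts[doc_id] += 1
        return [self.names[d] for d, c in counts.items() if c >= min_overlap]
```

The payoff is that a lookup touches only the postings for the query's n-grams rather than scanning all 16 million records, which is why such indices scale to VIAF-sized data.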
[Syd Bauman] Cluster?
[Daniel Pitti] Yeah, the […] cluster would be authority records from different places around the world that are deemed, in identity resolution, to be for the same entity, and they aggregate them.
[Stefan Gradmann] I would like to come back to the issue of people being identical. For instance, […] are the same or not the same. Once you get them to geographical entities for instance like […] things are not exactly the same as they are in this—
[Speaker] Could you speak up a bit?
[Stefan Gradmann] I said as long as you are […conditions you can take?] are binary […]. Whereas geographical entities for instance have different degrees of similarity.
[Daniel Pitti] Right. Identity resolution is a whole different challenge there.
[Stefan Gradmann] So what are your plans for coping with those considering that there is a big discussion in the data community how to say that and how to do that.
[Daniel Pitti] In a certain sense we are completely punting on that. It is not part of our problem set as such; I mean, it is part of your problem set. Whatever you figure out, let me know. This is collaborative.
[Fotis Jannidis] So we are already ten minutes over time and so please short questions, short answers.
[Susan Schreibman] I wondered whether with the visualizations you have, is there a way for me to [interpret that?] to show, so I can see exactly that relationship.
[Daniel Pitti] Not in that particular radial graph, but essentially the relations we have are pulled out into a stack of RDF triples. We have a SPARQL endpoint, so if you know how to write SPARQL you can go in and query it.
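Since the relations are exposed as RDF triples behind a SPARQL endpoint, the kind of question a user would ask can be illustrated with a stdlib-only sketch of triple-pattern matching, the core operation a SPARQL query such as `SELECT ?x WHERE { ?x ex:correspondedWith ex:VannevarBush }` performs. The predicate and entity names below are hypothetical, not the project's actual vocabulary.

```python
# Toy triple store: (subject, predicate, object) tuples standing in for
# the project's RDF triples. Names here are hypothetical examples.
TRIPLES = [
    ("Oppenheimer", "correspondedWith", "VannevarBush"),
    ("Eliot", "correspondedWith", "Oppenheimer"),
    ("GrouchoMarx", "associatedWith", "Eliot"),
]

def match(pattern, triples=TRIPLES):
    """Return variable bindings for every triple fitting the pattern.
    Pattern terms starting with '?' are variables (assumed distinct);
    all other terms must match the triple exactly."""
    results = []
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                binding[term] = value
            elif term != value:
                break  # constant term does not match this triple
        else:
            results.append(binding)
    return results
```

A real SPARQL engine joins many such patterns, but each basic graph pattern reduces to exactly this matching step.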
[Speaker] And as a user everything […] once I see something, to say “let me see how that happened.”
[Daniel Pitti] Yes, next generation.
[Speaker] I have a question concerning this visualization as well. The lines between the entities are they qualified in some way or are they just..?
[Daniel Pitti] They have a simple qualification. They are not color coded.
[Speaker] The ones entities mentioned in the record of the—
[Daniel Pitti] Well when we can determine we will say “corresponded with.” Otherwise if we don’t know we just merely say “associated with.”
[Speaker] “Corresponded with” actually means they did know each other.
[Daniel Pitti] They exchanged letters.
[Speaker] That is the two things you have?
[Daniel Pitti] Yeah, those are the only two things we have. Let us say we put this into a cooperative and come up with a nice list of possible relations based on some nice ontology; then it could be differentiated. But there is not enough data available to us to be more refined in any reliable way.
[Fotis Jannidis] Thanks a lot.