Case Studies–Historical Archives (March 16):
[Julia Flanders] Okay. Next up we have three case studies focusing on historical archives starting with Douglas Knox.
[Douglas Knox] I’ve really enjoyed all the discussion we’ve had and the fact that we all come from different perspectives. Some of you may share my story. In my experience, five simple words are really all anyone needs to get started, often unwittingly, grappling with data modeling issues. Oh, my spacebar is not working. I’m sorry. Let me try this again. [Pause]
“How hard can it be?” That’s what has gotten me started many times. And it’s a very productive question, if you ask it in the right rhetorical spirit. [technical pause]
[Pause for technical issue, 03:25-03:40]
How’s this? Okay. Thank you. We see the population. Something’s happening. We’ve got two points, we’ve got one county, and we can start building a story or speculating what’s behind this. Well, one of the interesting things about Portage County, though, is that it’s a different county in some sense. The coordinates stream has a “date modified” value, let’s say, in Allen’s [Renear] terms. So, is it really the same county or isn’t it? What we’re seeing here in 1840, that’s not a sloppy arrow. Portage County in 1840 really did consist of two noncontiguous areas. They had been contiguous. Another county got stamped out to the west of it and separated them. By 1850, it moved. It didn’t really just kind of slide up there. If I showed you what happened in the 1840s, there was a big Portage County — Portage County was much bigger. It encompassed both of these areas and went all the way to the Northern border of Wisconsin.
So boundaries do make a difference. And in some way, identity makes a difference. This is a problem only because we commit to Portage County. That slide was borrowed from John Long, who’s the editor of Atlas of Historical County Boundaries. This was a project of the Newberry Library that was completed in 2010. It took 20 years. It documents every change in counties in the United States from the 1630s to the year 2000, state by state. I worked at the Newberry for 15 years until recently and was just down the hallway from John and his colleagues for many years. There’s the Newberry.
The project actually goes back more than 20 years. It has its roots in the 1970s. Lester Cappon was a historian and archivist, an early American historian who went to the Newberry and started the Atlas of Early American History; John [Long] was the assistant editor of that, and Barbara Petchenik was cartographic editor. They were concerned to tell historical stories, to do the research very well, and to do the cartography well. And in the course of that work, they discovered that, although they thought there would be resources on where counties were, there just weren’t any. So they started a new project. In the late 1970s, John started a project that was, interestingly, both digital and print-based, working with the cartography lab at the University of Wisconsin to start documenting where counties were based on primary research. They produced five volumes covering fourteen states. They produced data files that are on deposit with ICPSR. You can download them today, but John [Long] says “don’t”: there’s better data available now. That project ran into difficulties, methodological and technological, which I’ll talk about in a minute. I’m giving you a brief project overview and then we’ll get to the data modeling specifics.
[In the late 1980s, John [Long] started a new project, having learned from the earlier problems. That was the 20-year project that initially produced books. In the 1990s they produced 19 volumes, covering 24 states and the District of Columbia. A lot happened in the 1990s. By the 2000s the publisher was no longer interested in publishing this kind of book. But as we all know, computing got much better on the desktop in the 1990s. GIS software was available. So they increasingly moved to digital methods. After the year 2000 they became their own digital publisher. I was involved with a piece of the project that retrospectively digitized the books, but the core team, especially at the end, was John Long, Peggy Sinko, Emily Kelley, and Pete Siczewicz, along with a number of others who helped. They were really incredibly dedicated, so I’m mainly talking about their work.
Why did it take so long? Counties are creatures of the states, or, in the early years, of the territories, the territorial governments. So the research that went into this was really text-based. They didn’t start with maps. They certainly consulted basemaps, but they had to start with laws, the laws of each state. So they would go to the law library and come back with xeroxes. This is what a typical primary source looks like: “Be it enacted by the general assembly of the state of Missouri that all that portion of territory bounded as follows…” Then there’s a description of various points on the landscape. Then they say that all of that territory, back to the beginning point, shall compose one county. Then they assign a name to that county and it exists.
So this, I think, is performative speech, as people have been saying. This is intentionality of a kind. I think this is an example of modeling being something that we inherit. The counties would not exist if the state legislatures didn’t invent them. So in a way, it’s not quite right to say that the model is not the thing. Here there is no thing until you model it, or until they modeled it. And it’s based on earlier models, actually. I guess the other thing I would say is that some of these points are points in the landscape, but they’re also things like the line between townships 43 and 44. There’s a whole Northwest Ordinance behind that. If you’ve flown over the middle part of the United States, there’s a grid that’s evident in the land, which is the result of earlier acts of modeling that surveyed that land and imposed a grid on it, and that’s part of a reference system that they’re building this on.
When the 1970s and early 80s project saw this, they decided to model this as a set of line segments. They said we’ll figure out what the atomic segments are, then we’ll be able to say what the coordinates are that define that from such and such a date to another date. On the left hand side it’s this county and on the right hand side it’s some other county. That made a lot of sense back then because coordinates were expensive. It was really hard to manage this kind of thing. As I mentioned, this was a project at the Newberry in Chicago, in collaboration with the cartography lab at the University of Wisconsin. So the method was, the Newberry would come up with the coordinates. There would be a couple of editors doing quality control, sitting down with pages of numbers like this and reading digit by digit to each other to make sure that they were looking at the same thing. They would ship that off to Wisconsin, which would put them into a computer, run it through a plotter, and send the plot back to Chicago. So the feedback loop was very slow for this. I have a lot of sympathy for why this was difficult and why this approach didn’t quite work.
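The segment approach described above can be sketched as a small data structure: each atomic boundary segment carries its coordinates, a validity date range, and the county on each side. This is a hypothetical illustration of the idea, not the project’s actual file format; all names, dates, and coordinates are invented.

```python
from dataclasses import dataclass

@dataclass
class BoundarySegment:
    """One atomic line segment shared by two counties, valid for a date range."""
    coords: list   # polyline vertices as (lon, lat) pairs
    start: str     # first date the segment is in force (ISO format)
    end: str       # last date the segment is in force (ISO format)
    left: str      # county on the left-hand side of the segment
    right: str     # county on the right-hand side of the segment

# Hypothetical example: one shared edge between two counties.
seg = BoundarySegment(
    coords=[(-89.0, 44.5), (-89.0, 45.0)],
    start="1840-01-01",
    end="1849-12-31",
    left="Portage",
    right="Marathon",
)
```

Storing each edge once, with the counties on either side as attributes, is what made coordinates affordable in the 1970s: no boundary ever had to be digitized twice.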
When the project started up again in the late 1980s they dropped digital methods altogether and they focused on getting the methodology right for just knowing what they were describing. There would be a piece of tracing paper — this is a step that they called “historical compilation.” You can see the xeroxes of the laws are there. The historical compiler would read the laws, would trace out what they thought the laws were describing based on their research, and try to cram as many of those counties as possible onto one piece of tracing paper to be sure that they were getting the right coverage, giving the whole story, and that it all made sense.
So this actually means modeling it so that you can ask several questions at once, modeling it in several different ways, and that’s really key to why this method works. They were doing this for print publications so that they could produce pages like this, that would tell the story of a single county through time, with the idea that you would look at McHenry County in Illinois, let’s say, and you’d want the chronology of legislative events–you see that at the top. That includes all legislative events, not just ones that are mappable. There are things that happen that aren’t mappable. Sometimes they would rename something. They’d just redefine what the boundaries are without really making a difference in how it’s mapped, especially at this scale. They also give you the maps, and they’re drawing these by hand, so they don’t really care about redundant line segments–they’re cheap.
So this is one output from the historical compilation. The other is making sure that we have a comprehensive story of the area. Sometimes there are contradictions. So, there were areas of overlap. If it was an area of overlap that came from the compiler misunderstanding what they were reading, they could catch that and correct it. But sometimes the legislature made a mistake. They got it wrong. Sometimes it took them years to catch up with their own error, but they eventually acknowledged it and corrected it. And sometimes there were disputes where different states would contest the same area and it would need to be resolved through negotiation or through legislative acts. This is the kind of contradiction that needs to be modeled. So, we can’t simply correct the original mistake and just pretend that it didn’t happen. We need a metamodel around the original model that says, you know, it’s inconsistent but it’s an inconsistency that has to be recorded, that’s now a permanent part of the record.
You’ll also see up there to the northeast of the yellow area, NCA3, NCA4. Those are non-county areas. So in the course of historical compilation, the compilers noticed that there were areas that the legislatures didn’t name and didn’t care about a lot, except for when they could carve new counties out of them once the population reached a certain density. But in order to make those stories make sense, they had to invent and really model new things that you could tell stories about. So you can look up Non-County Area 3 in these books and get an invented story that is useful because it really gives us something with appropriate temporal and spatial coverage for the whole state.
Behind the scenes, the data looks like this: you can download all of it as shapefiles and as text files. There are chronologies and indexes. There’s a lot of textual data that went into this project. It’s not just geospatial. I guess I would draw your attention to the identifier. The ID field is a project-invented identifier. They found that there really was nothing existing that could be used. The first column there is the FIPS code, the Federal Information Processing Standard. Those are used in federal data sets to refer to counties. The first two digits are the state and then three digits for the county. But if you look at them historically, they don’t have much of a historical sense. In the United States, the federal government would issue the data set and make the numbers match alphabetical order. If alphabetical order changed, they would just assign new numbers. So they don’t really serve as identifiers over time, which shocks me, but that’s the way it is.
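The FIPS structure described here, two digits for the state followed by three for the county, is easy to illustrate. A minimal sketch; the sample code 17031 is Cook County, Illinois:

```python
def split_fips(code: str):
    """Split a five-digit FIPS county code into (state, county) parts."""
    if len(code) != 5 or not code.isdigit():
        raise ValueError("FIPS county codes are five digits")
    return code[:2], code[2:]

# 17031: state 17 (Illinois), county 031 (Cook County)
state, county = split_fips("17031")
```

The structure is perfectly regular, which is exactly why, as Knox notes, the numbers are tempting as identifiers even though they are not stable across historical data sets.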
The version number is blank in some cases because versions are assigned only to the events that are mappable. But the full record, the model here, is of legislative events even more than it is of county boundaries, in some ways. The other thing I would draw attention to in the creation-or-change field is how redundant it is. There’s a row here for every county to record its perspective. So you’ll see that Campbell gained from Anderson and Overton; Anderson lost to Campbell; Overton lost to Campbell. This made sense for book production. It worked for a process where you had a spreadsheet or a database. I think it would be interesting, if the project were starting today, to think about how you would model that as a relationship and then generate these descriptions from it. I can say that there are a lot of exceptions and a lot of idiosyncratic stories in that creation-or-change field that would be very difficult to model, and impractical to do in that way. Something would be lost by doing it that way. Every one of these changes has a citation. There are something like 17,000 different shapes. Ultimately there are about 3,000 counties in the US today, a bit more than that, but that’s why it took 20 years. There’s a lot there.
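The redundant gained-from/lost-to rows could, as Knox speculates, be generated from a single relationship record. A minimal sketch of that idea; the event structure and wording are hypothetical, and it deliberately ignores the idiosyncratic exception stories he says would resist this treatment:

```python
def expand(event):
    """Generate one per-county row from a single transfer relationship.

    event is a hypothetical record like
    {"to": "Campbell", "from": ["Anderson", "Overton"]}.
    """
    # The gaining county gets one row describing all its sources...
    rows = [(event["to"], "gained from " + " and ".join(event["from"]))]
    # ...and each losing county gets a row from its own perspective.
    for loser in event["from"]:
        rows.append((loser, "lost to " + event["to"]))
    return rows

rows = expand({"to": "Campbell", "from": ["Anderson", "Overton"]})
```

One relationship record thus yields the three redundant descriptions in the table, which is the trade-off Knox raises: generation removes redundancy, but flattens any free-text nuance in the original field.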
Part of the process here, to kind of back up: there are many complexities to this. Part of the process is building up an understanding of what a county is, then doing the research and finding something that frustrates that understanding, then adapting the model. This is an example. This is from the South Carolina chronology and is one memorable instance of that kind of thing. South Carolina has counties, but historically it also had parishes and districts. Sometimes these had the same name but different geographies, and they coexisted at different times. When you get to South Carolina, you have to add a new column to your database, add a new layer to your interface, and rethink the model.
So, to get to the question of my title, I think the pattern here is that there’s a thing that changes. We want to tell a story about counties, a story of change over time, which is what anyone doing historical research does. You want to have some way of referring to that thing, and some property of it that captures the change, like how the population of Portage County changes. But we need to know that we’re talking about the same Portage County. Part of this is building ways of representing time into our models. We can add a column for start date and end date. But I think there’s also an issue of how we historicize the model and allow for the fact that the thing we want to talk about, what a county is, itself has a history. That’s going to need something else to encompass it.
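The start-date/end-date approach can be sketched as a small temporal table with a lookup function. Each row is one version of a county: the same enduring identity, a different extent, valid for a date range. All identifiers and dates below are invented for illustration, not the real Portage County chronology:

```python
import datetime as dt

# Hypothetical versioned county table: one row per dated version.
versions = [
    {"id": "wi_portage", "version": 1,
     "start": dt.date(1836, 12, 7), "end": dt.date(1844, 2, 23)},
    {"id": "wi_portage", "version": 2,
     "start": dt.date(1844, 2, 24), "end": dt.date(1850, 12, 31)},
]

def version_at(rows, county_id, date):
    """Return the version number in force for a county on a date, or None."""
    for row in rows:
        if row["id"] == county_id and row["start"] <= date <= row["end"]:
            return row["version"]
    return None

v = version_at(versions, "wi_portage", dt.date(1845, 6, 1))
```

This handles boundary change within one identity, but, as Knox says, it cannot by itself express the deeper problem: that the concept "county" the rows presuppose also has a history.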
I had some fun drawing up a list of things that could change over time, of various kinds, related to this matter. I kind of set this up with some “gotchas,” with some surprises. I played a bit of a con where I showed you the population first and then said, “you didn’t know that the boundaries changed first, but they did.” Now we know that boundaries change, so counties should be easy now, right? We’ve got the data. So, I feel comfortable. But if you look at population characteristics, those too change over time. So I’m imagining taking something like the historical census browser and expressing it in RDF as a set of assertions, thinking that all we need are identifiers. Well, we could easily have URIs for these things. The technology for doing that might be easier than the conceptual work of, “do we want to do that? Which of these are the same and which are different, in different years?” This is what exercises historians I know. If we don’t model this right, I think there will be difficulties. So I think I’ll leave it with that.
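One way the URI idea might look is to give each census-year slice of a county its own identifier, linked back to one enduring county identity, so the temporal qualification lives in the assertions rather than being hidden. A sketch only: the URI scheme, predicate names, and population figure are all invented, and the conceptual question Knox raises (which slices count as "the same" county) is exactly what this code does not decide.

```python
# Hypothetical base namespace for minted county URIs.
BASE = "http://example.org/county/"

def triples_for(county_id, year, population):
    """Emit subject-predicate-object assertions for one census observation."""
    version_uri = f"{BASE}{county_id}/{year}"
    return [
        # Tie the dated slice back to the enduring identity.
        (version_uri, "isVersionOf", BASE + county_id),
        (version_uri, "censusYear", str(year)),
        (version_uri, "population", str(population)),
    ]

t = triples_for("wi_portage", 1850, 1250)
```

Minting the URIs is the easy part; deciding when two years' slices share an `isVersionOf` target is the historians' work.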
[Kari Kraus] So you had the slide about things that might change over time. What might you say about things that are stable over time?
[Knox] Actually, all of them could be stable over time, right? Misinterpretations might be stable over time. We might have a model of misinterpretation that says “well I’ve seen that kind of thing before, and now that I know how to accommodate it. . . “
[Kraus] So the things on that side could just as easily appear on the other side as well? The things that are fixed are also liable to change . . .
[Knox] Sure. Sure.
[Syd Bauman] But what about the set of things you know won’t change over time? Are there any in that list or is it just an empty slot?
[Knox] I won’t make any commitments to that myself, but I think. . . you never know.
[Elena Pierazzo] We have some research going on at King’s [College London] with the Ph.D. students. One of my Ph.D. students at King’s is doing research on some Sami material. In the eighteenth century, the boundaries between the Norwegian, Swedish, and Finnish areas weren’t fixed on paper. At a certain point you were in Norway; at a certain point you were in Sweden; but the Sami in particular were nowhere, because they were nomadic. So at a certain point the authorities decided that they needed to have boundaries somewhere. The way they decided to do that was to interview the people and ask, “Where do you feel you belong?”, “Where do you think you are?” Because that was the only way to find out whether you think you live in Norway or you think you live in Sweden. They asked them to define the area to which they thought they belonged in order to describe the territory. The student’s attempt is to take this description, which was transcribed into another language, so there are a lot of filters [here], and try to plot it on a map. Which is a fascinating approach. He stopped at a point where he couldn’t go on, because certain information was missing. At a certain point, they described three borders out of four, and one of them is not described, it’s just given. They say, “you know, you go on until that ends, until the wood ends.” This was the description. And the woods aren’t there anymore, so we don’t know exactly. So sometimes the question is like, “I am modeling the impossibility.” Where, if you don’t have the information, you cannot model. Do you have problems like that?
[Knox] Yes, and my colleagues who did the historical compilation would be much better at answering this with concrete detail. But I know that in the books, they had dotted lines sometimes and they could say “this is approximate.” With the GIS files, it’s much harder to do that.
[Pierazzo] […] the possibility of declaring, many times, “we don’t know.”
[Knox] Another issue is that in the books, they could draw a line with a pen, knowing that the scale is one to five hundred thousand. So the pen width on the landscape is pretty wide. But with GIS, you add another decimal point and it’s absurdly precise without really having any meaning.
[Laurent Romary] I think this talk is essential from the point of view of data modeling, in identifying what you’ve been doing as a resource for many other researchers in the humanities. Because of that, maybe, you could identify within your data modeling activities what the entry points would be if you were to put this resource online: like each stable state of a county. So, for instance, someone annotating a corpus of letters from a given period referring to a specific county could actually point to this object while annotating his resource. I think this is also pointing to what you said, “let us.” And I think part of our data modeling thinking should be about those stable entities we want to bring into the communities from various fields, so that we can actually network them, for whatever reason.
[Knox] Yeah, and for this project that’s one of the intentions from the beginning: to be a reference resource for further study.
[Allen Renear] Right, so, as I said the other day, I think that modeling change is a tremendous challenge in data modeling. (23:58) It plays out differently in these two categories that I guess we’ve merged here: pragmatic modeling and theoretic modeling. In pragmatic modeling you have to move on, you have other things to do, and you must come up with a solution to the problem of change in a reasonable amount of time, with a reasonable amount of effort and money, so that you can move forward. In theoretic data modeling, you can’t move on. There’s a sense, it’s extreme to put it this way, in which you can’t move on. The whole point, The Whole Point [emphasis], is to solve this problem, so there’s no moving on until you solve it. It’s really hard. That list you put up of things that change strikes me as, each one of them, things that do not change. I just noticed a few that may. But at first I look at them and say “yeah, here’s the problem,” which is that none of them change. Notice that we’re completely opposite from the discussion taking place over here. We say that they change, but as soon as we use the unforgiving logic-based data description languages that have become popular, like RDF etc., we box ourselves in and are unable to represent change. Part of the problem is that these are cultural objects. The problem of change is, I think, primarily one that faces us because we’re dealing with cultural objects. It’s cultural objects that have a particular kind of nature that makes them difficult to understand as changing, in the sense implied by your title. Your title suggests that if there is change, there’s an underlying subject that changes. There is a thing that changes. It’s an Aristotelian view.
[Knox] I really enjoyed your talk the other day and thought a lot about this. I’m still a bit puzzled by this question of . . . I think here, with state legislatures, we have intentionality, in a naive sense at least, I would say. The legislature intends that the county be the same and assigns it different boundaries and says “it is the same thing and it’s different, and the fact that its geographic attributes are different doesn’t matter to us.” Which might lead to modeling problems, but I think if we want to model what they intended, that’s . . . yeah, that just makes my mind boggle.
[Julia Flanders] I just want to add one question to this, which is: does it help if, in acknowledging the kind of human, intuitive sense in which the county is the thing we’re interested in, that there’s a thing there that’s changing, is it possible to handle these changes by regularity? In other words, by qualifying the county with time, with whatever, and saying in effect “there’s this one” and “there’s also this one,” and then describing the relationship between them in a way that acknowledges how the relationship isn’t one of “intuitive identity.” Is that sloppy, or . . . ?
[Knox] I think that’s the intent of the whole project, and as pragmatic modelers, I think that’s what they did. The other thing I would say in response to the pragmatic versus theoretic is, as pragmatic modelers they didn’t call themselves data modelers, they called themselves “editors.” But I think that’s a good word to have in mind for the pragmatic end of it.
[Maximilian Schich] […] I think that with this kind of granularity question, you can look at cities, which the census also covers, right? You have the same problem, because New York City changes tremendously from one year to the other, because Brooklyn is just a county… […] You have the same problems, but at a different granularity level. But then, on that granularity level, you could go back to the counties, because the distribution of the towns within the counties might be so striking, right? This county, which changes shape completely, may have 95% of the population in one place where it overlaps. The main solution is in the data. The other thing is that there’s not only a problem of change; there’s also a problem of plurality in the data modeling, all of a sudden. Because if you look at the ethnicities: who has collected this data? That is sure to be racist. That’s something which we have with much more data right now, from the African war diaries, or the WikiLeaks, or genocide data in Vietnam. There are people working on this, and I think this is a very strong data modeling problem. Because in the end, that goes all the way to crime. . . we say […] is shocked […] Auschwitz prison records, which is a data model, because it’s […] where you enter this stuff. There’s something where we can actually do something really dangerous.
[Knox] The more the modeling is an act by the authority of the state, the easier it is to model, and the more we need that provenance to wrap it, so that we say that it comes from somewhere.
[Wendell Piez] I’ve corroborated all of this. I mean, it’s all about Frankenstein again. The thing is, what’s so interesting about this list to me is not just that you have things that can change. You also have things that are acting on one another, right? A judicial decision changes something else. This goes to your point about intentionality and performances. These things are defined by virtue of other entities that we’re also trying to understand and map. Those relationships are such that the very definition of a county, what constitutes a county as a county, is implicated in the definition of something else, by way of that something else’s definition of itself. This connects to what Max[imilian Schich] is saying, because this network of mutual implications includes the historian. The problem, and of course historians and historiographers are very much concerned with this, is that history can be seen as an objective study of the past, or it can be seen as an ideological project to justify something we want to believe.
[Knox] Thank you. Yeah. That’s right.
[Piez] So where is it that you’ve found the boundaries in all of this?
[Knox] I really appreciate that question because that’s another way I thought of ending up. The answer to the question “What is the thing that changes?” should be followed by: what is the thing that changes as a result of our act of modeling, as a rhetorical and a social intervention in the world? I think that’s your question and it’s a great question.
[Stephen Ramsay] This is a great presentation. It’s also a terrifying one. And it’s terrifying because I think we have a tendency to look at it and say that this is a particular domain in which these kinds of problems surface, when in fact it may be the case that this is the problem no matter what you do; this just brings it forward more powerfully. It’s exactly for the reason that Allen [Renear] said: we don’t really have a robust account of state identity in our data models, period. I’m inclined to say, and I don’t know if this is what Allen [Renear] said, but it’s what I’ll say, that one of our problems is that we look at our data and say “that list is a set of mutable values over time,” when in fact what it may be is a set of immutable values, and the identity of the object is some slice of that collective set of immutable properties. That’s a very different way of thinking about objects and labeling. But it may be one that actually has implications not only for this kind of problem but for the rest of data modeling. It may be that everything is sort of immersed in this Heraclitean fire and we need to bring that forward, or else our. . . and again, I accept the difference between pragmatic and theoretical modeling and so forth.
[Knox] In response to that, I was really interested in and troubled by your distinction yesterday between noun people and verb people. I thought initially that I must be a verb person because I wanted to talk about change, but then I thought, wait a minute, change is a noun: “change over time” is noun, preposition, noun. I think that’s what the historical method is, actually: to look at the thing that changes and figure out how to stabilize it through narrative, which is what. . .
[Jean Bauer] This is just an amazing project, and also thank you so much for making all that data available, so the rest of us can go and play with it and bring it into our projects.
[Knox] I should say that that’s to the credit of the Newberry Public Library. The project was funded by the National Endowment for the Humanities.
[Bauer] Thank you to all involved. But just one other thing, to pick up on a couple of other threads. One of the interesting things about this particular data set is that it’s also referencing a period of an expanding land empire, as what had been the British colonial [empire] becomes the United States, moving across time. Since you’re only dealing with states, it’s not as big of an issue as if you were dealing with territories, but even so, when something becomes a “state” there’s a question about the population, who may or may not be considered citizens for any number of reasons, and about deciding who lives within those lines, who may not consider themselves part of the United States at all. So, as another thing to worry about, we’re looking at shapefiles for imperial systems. There are plenty of people living east of a particular line who might hotly contest that they are, in fact, part of the system that draws a line around them.
[Knox] If you download the whole set you’ll find some boundaries for Spanish territory, but American Indian history is not formally represented here. And in some ways, it’s all over it. What would be the ethics of formalizing it, that’s also an interesting question.
[Fotis Jannidis] Well, probably this is very obvious, but in the philosophy of personal identity, there’s this proposal that says “you can’t just describe identity by defining specific features.” One of the […] proposals was to say “it’s probably the narrative about change.” So all these things can change, but they belong together because there’s a narrative keeping them together, saying “this was the thing. . .”, something constructing the object. That is probably the best way.
[Knox] Yeah I don’t know what that would look like, but what if the narrative were the primitive?
[Fotis Jannidis] You would break down the narrative to have, say, events which change the states. One of these states, actually.
[Maximilian Schich] I think that’s a very interesting point. I think we often think about each point and each record we have as one opinion, and there are differences of opinion. The timelines change, so you want to put all of those together, like a chain of prose, on a timeline. But actually, it would be interesting to drop all these little differences of opinion onto the same plane and ask what the merits of these timelines are on the same plane. The question is: do you preserve something like open shapefiles, or something like that? Because obviously there were a lot of one… There wasn’t one in the McHenry sample and the analytic sample; there was one reticle which was open in the analog version. Do you have something like that in the data?
[Knox] I think that’s Chicago bias. That’s the lake. I guess we just kind of take it for granted.