Open discussion: Key themes

Open Discussion – Key Themes (March 15) (video)

[00:05]

[Julia Flanders] I thought about some of the issues that seem to be swimming about – on the surface or under the surface – of the discussions during the course of the day yesterday. And I came up with this initial set of provocations I guess, more than questions, to get us started this morning for the opening hour. And I want to sort of gloss them because they are a little bit abbreviated and telegraphic. Then we can pick them up in any order that you would like but I would like to suggest that if the first two are interesting we might spend a little time on them.

The archival or generalizable data versus personal data question really had to do with what we saw as an emerging distinction between data that’s being created for others to use: data that’s being created with a sense of social responsibility or professional responsibility with the goal that it become a kind of generalized stratum of information that can be shared and used. And we were contrasting that in our mind with data that’s been created by an individual or a project for the internal consumption of that individual or project. In other words, it is data that serves some specific research need at a specific time and may be more opportunistic or more locally focused. And it seemed to us I think, and Fotis [Jannidis] you can correct me if I am going in the wrong direction here, but it seemed to us that the discussion yesterday had focused to a large extent on that first category and there might be interesting things we could learn by thinking about the second category. You know, data modeling opportunities or desiderata that might emerge in that more private space of modeling.

[02:06]

The second point here is really another way of thinking about that. There were a number of comments during the course of the day yesterday from Laurent [Romary], from Desmond [Schmidt], from a few others pointing out this question of whether it’s possible to have, in effect, pre-modeled or un-modeled data, or very lightly modeled. The list of possible cooking metaphors here [on the slide] is an index of my own uncertainty of exactly how little modeling we are talking about. And I think it is interesting to reflect on whether in practice or perhaps philosophically it is possible to have data, or let us say observations, which are purely un-modeled or completely raw. Or whether it is possible through some means to agree upon some modeling that serves as if it were raw. In other words, modeling that is so unobjectionable or so universal that it is unproblematic, that it does not take the data in any particular direction. And I think that these first two questions in some ways kind of play together a little bit and I’d be interested in hearing what people have to say. But it also strikes me that that second question would benefit from some real examples. So Laurent for example at lunchtime was talking to Elena [Pierazzo] and me about astronomical data and the way that astronomers deal with the observational layer and the slightly more cooked layer, let us say. And I thought that was a really interesting example and one that I realized I knew nothing at all about and I think that we can probably adduce other examples that would kind of give us a set of points on a continuum or on a plane that would suggest what kinds of modeling we are talking about, what kinds of possibilities for non-modeling we are talking about.

[04:00]

Anyway, so that is what those first two points are trying to get at. And I guess the third question here about the motives, the pragmatically driven modeling, modeling towards an analytical goal or a particular functional goal, let's say, versus modeling which is purely intellectually driven: that's something that also I think came out of the discussion yesterday, and could benefit from some examples and also from some further explanation about what kinds of different directions those types of modeling intentions take us in. The fourth and fifth questions are questions which had been implicit in the opening panel discussion; they were on the agenda for that panel, but the discussion so quickly took off with the first question that we never really got to them. So I want to put them back on the list here. Basically what they are asking is whether the kinds of data modeling research that we are doing in the humanities have application outwards. In other words, whether there are people who can learn from what we are doing. And the implication of what Laurent opened by saying, that humanities data poses hard problems, is especially interesting. That might lead us to some ideas there.

[05:32]

The fifth question is essentially the same question restated a different way. And then obviously if there are other questions we’d like to put on the agenda, now’s a good time to do it. So yes, Laurent?

[05:45]
[Laurent Romary] I wanted to put the first three questions together and come back to the thing we lost during the day, which is basically the introduction given by Wendell [Piez] and this lego metaphor. In a way, how much should we orient our discussion toward the identification of the tool box for the researcher, which would allow us to identify how many of those bricks we need to have in archival format, as opposed to a lot of specialized bricks like […] yesterday, to actually have personal data. And you would have more of the same kind of reading grid for the second question. So raw data or, how do you call that, parboiled, blanched maybe.

[06:44]
[Julia Flanders] Like the instant […]

[06:48]
[Laurent Romary] Like they just put in hot water and… [feel?] that. But it would just be taking a few blocks of that. What are the basic blocks that you would offer to prepare a very flat edition that could be used afterwards by other researchers? And the subquestion is on the same thing: exploratory means, really, okay, taking one of your potatoes and putting that in funny shapes to actually create a soup that is directly derived from the original recipe. I'm going from lego to the kitchen, but you see what I mean? This is the central aspect, because we really need to identify what those basic concepts are, and my favorite example, because you asked for an example, is really lexical data, where from a flat scan of a dictionary you can build that structure at a very deep level according to the kind of observations you want or the needs with other data sources you will have elsewhere. And that's probably a possible orientation.

[07:58]
[Julia Flanders] Other examples…?

[08:00]
[Elena Pierazzo] Yesterday you mentioned a […] at lunch of what the raw data for the text content of a document can be, and we mentioned the possibility that it could be just a digitization, a photograph of the pages. I would like to discuss this point, in the sense that a page of a manuscript is […] when it was produced as a unit, because manuscripts are made of folios, not of pages or openings or gatherings or orderings. So this thing that we think of as raw, because before we start editing, before we start doing anything, starts with a photograph. Actually there is already a big assumption that we make, mostly as digitizers, as librarians: when we take a picture of a page we present the object as something it was never conceived of. So this is something that has generated lots of discussion on the major lists, digital humanities people asking what these floating pages mean. And there is a lot of uneasiness in that community about this form of digitization, which does not represent the object, the material object, at all and is misleading in many cases.

[09:38]
[Julia Flanders] So I am going to ask you to say a little bit more? I am wondering whether the problem with the manuscript page for example as a unit of rawness is that it is—is there some other thing that the manuscript editing community, let us say, could agree upon and that just happens to not be it? Or is it that there is no unit from which one could then build all the other higher level units? In other words, there’s not any single raw thing, like the individual surface or something like that.

[10:00] [Speaker] Like a primitive or something.

[10:02]
[Julia Flanders] Right, exactly. Is there a primitive there, or is it that there is no possibility of a primitive because of fundamental differences in the way that members of that community need to think about the data?

[10:10]
[Elena Pierazzo] I think more the latter. There is a […] to want to tell you that it is required, that it is the only unit that makes some sense; someone would say the folio, or the bi-folio, even more so. So that's the point. The point is that there is only one model at the moment. I call it a model because these photographs do answer some research question: for instance, reading a text from afar in a way that you cannot read it even from a close inspection with the naked eye. So there is some benefit from that digitization; that is only one of the models possible. I'm saying that there is a possibility of more models. And what […] I am aiming at with my intervention is that even when you think about the minimal approach you can take, which is just a photograph, you are already making an assumption and already making an interpretation of what you think this object is. So –

[11:21]
[Julia Flanders] So those are tomatoes that come with the basil leaf already in.

[11:30]
[Elena Pierazzo] Yes, you have the constant basil.

[11:35]
[Fotis Jannidis] Yes, well I am not going to put this in a kitchen metaphor, but I think we have to discuss this, because "raw" is misleading. There is no "raw" in the sense of primitives which are irreducible items or something like that. But you could say that we have things that are strong: at specific times in history, we have very strong assumptions in the culture about what constitutes a unit. For example a book: all our knowledge of how to handle a book, and the fact that it is delimited in many ways from its surroundings, makes it a unit. And then to say, okay, I am cutting this in half or anything else would be misleading; if we just concentrate on half of the book it would be against these assumptions. But I think if we talk about raw data we really talk about the social construction of the objects we want to reconstruct in the medium of the digital. And then many of these problems disappear, if you rephrase it like that, because we say, "okay, we have a social construction." It is always a construction. There is nothing behind it. But then you can reconstruct the construction. That is very nice.

[12:40]
[Wendell Piez] I would accept that and amplify that, and yet also point out that although maybe the conceptual problem disappears, the practical problem opens up in the most dramatic form. Because to come back to what Laurent was just saying, part of the issue here is that if we cannot even agree on what the primitives are, then we come back to the issue of getting our hands dirty. Who is it that gets to define the model? Am I the scholar defining the model that other scholars, other readers, other students will be consumers of? Or am I going to be operating a canning factory and they will just use my product? Or are we all engaged in this? And are they also at liberty, in effect, to unpack and undo and redo and completely reinvent the way I work? And I think that within the tradition of the humanities, although we understand that there is going to be a lot of the first activity (I mean, publishers do reasonably important work), nevertheless there need to be opportunities for the latter activity, where it is not just the ancient manuscript as my raw material; it is my conception of the manuscript and my representation of that manuscript in the various forms […] that I have available to me. So it comes back to the architectures of the lower levels of the system, where we have got bit maps and we have got encoded texts, and we have got the raw material at that level, which needs to be handled and managed and understood. And I think that this is really tied up with the issue of the individual workload and methodological practice of the scholar and the student, and also with pedagogical problems and how we get into this, because nobody gets into this knowing everything, right? We kind of have to figure it out as we go along, which means that earlier on maybe we need to assemble a few packages of instant rice or whatever, but as we get more sophisticated we are not going to be doing that anymore. We are going to be doing the real thing. And so there are plenty of roles and opportunities for both approaches: where I am creating something that is prepackaged and is going to be downloaded on the phone when you read it on the train, and yet at the same time there also needs to be the sharing and distribution of much, much more fluid probing. I mean, in the abstract, things that do not necessarily have a defined shape, which you as a user, as a receiver, are then going to complete. Alright? So we have this huge spectrum of problems that we need to address there.

[15:36]
[Douglas Knox] Following up on what Wendell was saying about workflow: I've been thinking about the politics of workflow, and a lot of what we have said assumes that individual scholars are using digital resources that are created by libraries or institutions that know how to create these things and think about modeling. But I think more and more there are a lot of self-created, scholar-created digital representations of material of interest. I am most familiar with historians who will go to archives. Many archives now allow cameras in the reading room, so scholars will have thousands of photographs on their hard drives organized in various ways in folders, and there is a certain ad hoc modeling that takes place that way. There is a lot of transcription that goes on in Microsoft Word and even in Excel spreadsheets, and there is a certain kind of modeling that goes on there. And some of it is really part of graduate methodological training, where you write down what box and folder you are looking at when you are transcribing, for example. And I think that is worth considering in this context, because a lot of these things we wouldn't think of as data. It's rotting in people's attics, more or less, but it might be of value.

[16:51]
[Maximilian Schich] I think it makes very good sense to talk about this distinction between archival and generalizable and personal, because I think in private, as researchers, we work with personal data all the time. Especially if you collect information and you try to come up with something new, not doing something inside your official project, what usually happens is you take some data which comes in some standard and then you basically mix it up, you bend it and whatever, and all of a sudden you also find the stuff that is not modeled in that data. For example, in a lot of personal data sets, location is poorly modeled. So basically you will have different notations for different locations and stuff like that. That is even worse if you go to publicly aggregated sources such as Wikipedia, where everybody can enter whatever he or she wants in square brackets. And basically this kind of normalization thing is something where we say, "okay, Ulan Bator is really Ulaanbaatar, it's the same thing." But that's not necessarily done by the person entering data. And it can be a mistake, it can be that the person does not have time. So I think for a scholarly process of data modeling we need some minimal standard, and I think "minimal standard" is actually very, very minimal. It is a little bit like in a scholarly publication, where you need something to refer to, right? The natural sciences do not do that: even if it is a 450-page monograph, in Nature you would see a reference only to the monograph. But in the humanities we actually say, "okay, page 432 notes this and that, especially this and that sentence," or something like that sometimes.

[18:34]

And I think we need to […] If somebody points out something in the data, meaning talking about something, say the shadow on the wall, we have to – if that is a scholarly discussion – we have to expect to say, "Okay, if you want to talk about that shadow, please give it a URI or something, an identifier. And if you can, point out the equivalence in some public data model, because if you come up with, say, 'idiosyncratic data model: shadow' this is something which I can work with, but if you can say 'Dublin Core: shadow' (if that exists), please put it in." And it does not really matter which one you choose; you can mix up 300 different data models as long as it is one of the known ones. But if your concept of "shadow" or your concept of art, say, is different from Dublin Core, then please do your own thing. And that is also something which is important. But as a minimal standard we can also say the minimum thing you need to do is just tag and provide an ID, because all the rest can be done by somebody else. I can say Ulan Bator and Ulaanbaatar are the same thing even if you do not know. But if you do not give it an ID, I cannot put the ID into the text of somebody else if it is published somewhere. That is the problem.
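
A minimal sketch, in Python, of the tag-and-identify practice Max describes: every mention gets an identifier, and equivalences such as "Ulan Bator" = "Ulaanbaatar" can then be asserted later, by somebody else. The equivalence table, the URI shapes, and the function name are illustrative assumptions, not an established standard.

```python
# Minimal standard: tag a mention and give it an ID; the rest can be done
# by somebody else later. The URIs below are hypothetical placeholders.
EQUIVALENCES = {
    "ulan bator": "http://example.org/id/ulaanbaatar",
    "ulaanbaatar": "http://example.org/id/ulaanbaatar",
}

def identify(mention):
    """Return a canonical identifier for a mention, minting one if unknown."""
    key = mention.strip().lower()
    return EQUIVALENCES.setdefault(
        key, "http://example.org/id/" + key.replace(" ", "-")
    )

print(identify("Ulan Bator") == identify("Ulaanbaatar"))  # True: same ID
```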

And I think in a discussion like this, that is obviously the case for archival and generalizable data models, while on the local level practitioners or researchers do not learn how to actually build data sets which they can just publish in something like a database journal. And then they can work from there and make it better and better, because no one is going to normalize their data for every purpose, right?

[20:28]
[Julia Flanders] So it sounds like there is a strong sort of socialization of pedagogical […].

[20:50]
[Maximilian Schich] Yeah, I think so.

[20:54]
[Susan Schreibman] I was just going to add another point: I imagine the personal today will become the public tomorrow. Just as we collect archives — I mean, I have had an archive offer to buy my reading archive, my personal archive on reading. I suspect that when they buy that, they are going to get a lot of digital stuff as well: all of those images that we are now allowed to take in the library of archived things, that is valuable, and they will be getting all of that digital stuff too. So in a way does that go back to, I guess, what we are going to be speaking about later, in terms particularly of postgraduate education? Should we be teaching people, in research methods, how to organize and categorize their personal digital records? Not only for themselves, so they can find things on their computers later, but also to make them more public, whether you do it or somebody else does it.

[21:30]
[Julia Flanders] And also whether or not we can really address the problem that Elena has pointed out of the sort of pre-aggregation of things, the premature aggregation of things. I wonder, and this is a case where I think it would be very useful to have examples, whether it might be possible to at least do as little pre-aggregation as possible. And going back to Laurent's suggestion: as part of the good practice that people are learning and adopting in their graduate training, we might say, "if you are going to digitize something, at least put one surface per file or something. Make sure there's metadata at the level of the individual thing rather than something bigger, so that it can be decomposed and treated in as modular a way as possible."

[22:22]
[Susan Schreibman] But on the other hand […] typical archives keep the ordering of the creator, and so it would not […] as well that I put these things in this folder for some purpose that is clear […] the item level. But because it's digital, an archive in theory should be able to do both: preserve one ordering and allow me to see it in any other ordering I choose, being able to decompose at that object level any way you would like.

[22:56]
[Julia Flanders] I see a lot of people nodding.

[22:59]
[Kari Kraus] That's a canonical archival principle. What is the term for it? Respect des fonds or something?

[23:13]
[Speaker] Respect des fonds.

[23:14]
[Susan Schreibman] Yes, which […] in a much more embracing way than archivists [who have to do it?] with physical archives.

[23:22] [Daniel Pitti] It is frequently misunderstood now what the sense of "original order" is. It's more a reflection of how these records were used, and so the order of the records reflects that usage. It does not necessarily mean a particular physical arrangement as such.

[23:45] [Julia Flanders] But the overall principle is that the personal is interesting and also –

[24:02]
[Daniel Pitti] I mean, this would be an interesting dialog to engage in with archivists. There is the AIMS project, which was in Virginia. Basically this is an extremely hot topic in the archival world: how do you go about collecting the personal papers or private papers of individuals? And it's a really difficult thing. Imagine for example that you have a zip file with 5,000 files in it and you have no text telling you what is there. How does one go about appraising what is there and deciding what is what and how to describe it? Not to mention that they get boxes with 3.5-inch disks in them, and on and on and on, and things written in WordPerfect 1.0 and so on.

[25:05]
[Julia Flanders] Okay, Elke [Teich] had a comment.

[25:18]
[Elke Teich] Okay, my comment relates back to Max's and Fotis's. Back to the unit. I'm a linguist, so for me things are probably quite easy. But if the linguists did not agree on what the unit is that we are actually looking at, we could not do anything. So we agree what the units are: they are words, they are sentences, things like that. If we did not agree on this you could get nowhere. I mean, basically it is a compromise as well. So you may not agree with my "word" exactly, so it is an approximation. But we need to do this. So this goes back to what you said: what are these units? I do not know, but we habitually talk about them.

[26:27]
[Laurent Romary] It's very important because it has to do with the question of semantic interoperability, which we did not tackle so much yesterday but probably should. It is not so much that I am agreeing; it is that once you look at a primary source with a certain perspective, if you decide to identify words there, you need to be able to refer to a kind of semantic repository of concepts where you could say: I mean "word" in the sense of, let's say, the <w> in the TEI, if this is your background, or what have you; I mean, there could be others. But at least you would be able to have some semantic standpoints where you can actually articulate whatever combinations of lego bricks you have afterwards. You make sure, I mean this is the little drawing, you say "this is the length and width of what I mean by this kind of lego brick." If you do something else it is another concept.
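
A small sketch of what such a pointer into a shared semantic repository might look like: an annotation records not only the span it marks but which published concept of "word" it intends. The TEI <w> element is real; the record layout and the reference string here are illustrative assumptions.

```python
# An annotation that declares which shared concept it relies on, so that
# "word" here can be articulated against other people's "word" later.
# The fields and the concept reference format are hypothetical.
annotation = {
    "source": "manuscript-page-3.txt",  # hypothetical source file
    "span": (112, 118),                 # character offsets into that file
    "label": "word",
    "concept": "TEI:w",                 # "word" in the sense of the TEI <w> element
}
print(annotation["label"], "as defined by", annotation["concept"])
```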

[27:11]
[Elke Teich] That's right. And to come back to Max's point: only if you have that unit can you address it. But this is what you need to do.

[27:18]
[Julia Flanders] I want to just observe that this seems to be a crucial question, and I wonder whether different communities would answer your question of whether we have basic agreement differently. I mean, I saw some people kind of nodding and some people going, "uhhh." And I wonder how many of us think we are working in communities where there is basic agreement at this point? And one person's hand goes up. And is everyone else feeling that they work in a community where there is not basic agreement? How many of us think we're in communities where basic agreement is possible but we just haven't worked hard enough yet?

[28:01]
[Fotis Jannidis] I think that the question has to be rephrased, because all linguists agree that they have not agreed on a definition of a word, but they […] talk about words. So if we agree on a word, is the definition mmm-mmm and not umm-umm, and can we agree on this? Because of some identity politics or something, they cannot. On the other hand, they talk about words at length, so obviously this is a very important unit, and it hasn't been developed by linguists: it is a cultural unit, and they take it from culture and try to come up with definitions. So I think we should try to focus. If we are talking about this archival side of data modeling, we should acknowledge that there is a cultural [outset?] and we are trying to reconstruct it in a way, and to define it in a more formalized way than the culture itself usually does. And this is the hard work that we are doing: it is not that there is nothing out there and we are totally free to do what we want.

[28:43]
[Julia Flanders] Elena has been waiting…

[28:45]
[Elena Pierazzo] There is also a community for which even talking about words is not possible. And in fact, for forty of the last sixty years, in the field of critical paleography, they cannot agree on what the kind of thing on the page is, what we should call it: grapheme, character, [brief?] or whatever. So they have not even started to talk about that; they keep screaming at each other. There is a committee that has been working on the dictionary for six years. They have not even agreed on the first letter. They are still screaming about it. But what I wanted to say is that yes, these primitives, these things, they are all models. They are models, and there is one characteristic of models: they are based on selection. Every model is based on the selection of features, and this selection is cultural, as you say, surely. So I would not really spend that much more time saying which is the primitive, because those primitives are always cultural and all […] selection. You just need to be clear what they are. And offer alternatives, possibly more than one: no one model will generalize over all the scholarship. A single one of those models will make the scholarship wrong, because they are all affected by the fact that we choose some features […] others. And my features will influence the models. So do not oblige us to say that there is such a thing as a more primitive unit, because they're all affected by our understanding of these units.

[30:37]
[Wendell Piez] Yes, you guys think this is bad? It is much worse, because in the work that I do, which is largely not-for-profit commercial publishing, the situation we have is that people believe they agree much more than they actually do. And in fact I see a lot of this in the academic environment as well. The situation is as if everyone got together with their legos […] and they were not actually commensurable; they were knock-off legos, not real legos. They do not really fit together very well. People are sort of building things and cramming them together. They are sort of mundane, they do not really work, but they work well enough, right? So it is not simply a matter of coming up from the bottom to agree on what our primitives are. It is also understanding and identifying where we actually are not agreeing even when we think we are, or when we are using the same terms to mean different things. And this is something that we only discover in the instance when things do not really work very well. I mean, we have been talking about this quite a lot in the TEI community in relation to the problem of data interchange, which is a core problem. And so it is as if you have to go both directions at once: you have to come up from the bottom and also down from the top.

[31: 57]
[Paul Caton] So the situation we face is kind of the reverse of the one you are talking about. Okay, you know you do not agree, so you agree that you are going to disagree, but you will settle on the compromise; you start from that feeling of disagreement. Whereas we start from the feeling of "well, we all know what a document is." But if I asked people here privately, without conferring with anyone else, if we put all the papers and stuff in the middle of the room and said "how many documents are there in this room right now?" I guarantee the answers would be different, because you have got to come back and say, "well, it depends on what you mean by document." And we never get to that point where we say, "what is a document?" But when we do, then we find the differences.

[32:41]
[Desmond Schmidt] I just want to make the point that when people talk, depending on the volume of their voice it may be easier or more difficult to hear them. Could we possibly have a microphone that moves around a little bit? There was a table there at one point.

The thing is I cannot hear people who are very close to me if they are not facing me and I think other people are having the same problem.

[33:16]
[Julia Flanders] Yes, I wonder if you might be able to move in a little bit because I think the microphone is pointing forward so you are not getting good pickup for the stream.

[33:27]
[Elena Pierazzo] You need to talk into the microphone as well?

[33:30]
[Desmond Schmidt] Or to me.  You do not need to talk to the microphone.

[33:34]
[Julia Flanders] What we mean is if you are having trouble hearing you can move forward but I do not think these microphones necessarily move.

[33:45]
[Jean Bauer] Apparently they are having a similar problem on the live streaming with picking us up so somebody just chimed in on twitter to say—

[33:56]
[Julia Flanders] Would it be possible to just turn up that mic?

[34:06]
[Videographer] I’m picking up everything fine on here. I’m having no trouble.

[34:09]
[Julia Flanders] Okay, so it is really—

[34:10]
[Videographer] If people could turn somewhat to face towards us that usually makes things much clearer.

[34:15]
[Julia Flanders] Okay, great. So if you could all just sort of make sure to speak up loudly, clearly and in that direction.

[34:20]
[Desmond Schmidt] Okay, sure. I would just like to move on to the second one. We have actually covered a lot of the first one, and in a sense it rolls into the second one, which is just about what you can do with plain text. It is amazing, in fact, what you can do with plain text. There is a massive amount of information in there, even if you just look at the proximity of words. I worked for a few years on a program called Leximancer, which just broke the text up into sentences and then counted the distances between words, and we fed it all sorts of things in different languages. For example, we fed it ancient Greek. It knew nothing at all about any languages. Extraordinarily, it would map out the concepts, the words, the meaning of the text. It would put all the words that had the same semantic derivation together. So it would, as it were, "understand" the language incredibly well, group all the inflections and so on together, and say this concept here and this concept here. I think it is amazing what you can do; you can get that information in your personal data just by reading ordinary, plain, raw text. You don't necessarily need any model.
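
Not Leximancer itself, whose algorithm is more sophisticated than this, but a toy Python sketch of the general idea Desmond describes: split plain text into sentences and count which words co-occur, with no linguistic model of the language at all. The naive sentence splitter and tokenizer here are deliberate simplifying assumptions.

```python
import re
from collections import Counter
from itertools import combinations

def cooccurrences(text):
    """Count how often pairs of words appear in the same sentence."""
    pairs = Counter()
    for sentence in re.split(r"[.!?]+", text.lower()):
        words = sorted(set(re.findall(r"\w+", sentence)))
        pairs.update(combinations(words, 2))
    return pairs

sample = "The ship left the harbour at dawn. At dawn the harbour was empty."
print(cooccurrences(sample).most_common(3))  # pairs like ('at', 'dawn') rank highest
```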

[35:05]
[Kari Kraus] So that is a great point. Going back to the linguistics example, where they do indeed have well-established units and primitives: the concept of the phoneme, even though different cultures have their own phonemic systems, is highly, highly fertile and generative. I think part of the reason computational linguistics is the vibrant, creative field it is, is because they do work from acknowledged primitives or units. And I think we can better explore pluralities if we have a common point of reference, if the pluralities emerge in part by having a common frame.

So I'm thinking back to Steve's point about constraints yesterday and how generative constraints can be. And I think that's the function the units serve in computational linguistics. It goes back to the conversation we had yesterday about discrete versus continuous representations and how important discrete representations are. I think there was some ambiguity about how arbitrary or capricious those discrete representations were, on the one hand, versus motivated on the other. I think they are cultural, they are contingent, but I do not think they are arbitrary or capricious. And so really digging into the philosophical questions about how we carve artifacts or phenomena at the joints to create those discrete representations is important work. And I think one of the ways you can get at this pedagogically, as a research community, is to do more of the things people were saying before: Susan's point about the private leading into the public. I'm thinking about Dan Cohen's notion of the hidden archive, the way Zotero exposes the hidden archive. We could do this in our own work as a way of exposing the intermediate genres we produce, the intermediate artifacts we produce as we're trying to design data models or ontologies, to make that work of carving at the joints explicit. And also to think about how we can better document our data models so that they're shareable and, crucially, revisable.

[37:58]
[Julia Flanders] I think it is interesting that we have moved from the idea of whether there is a kind of base unmodeled layer to a discussion which I think is addressing the modeling layer, I mean things like the TEI or other modeling systems. Am I hearing this wrong? I guess I am wondering, picking up on Desmond and Laurent, whether it seems possible and useful to try to pursue the idea of a kind of substratum, a base layer that would leave aside the questions of culture and so forth. I mean not completely, but to some extent. Or whether that is simply a red herring, to go back to […]'s point.

[38:54]
[Laurent Romary] There is nothing like raw data.

[38:56] [Julia Flanders] Right, nothing like raw data. But for example your astronomical observations, which it sounded as if you were positioning as being, if not raw, then pre…

[39:08] [Laurent Romary] No, I am just putting the discussion we had at lunch in context, because you were saying, well, it is already elaborated data you have in astronomy, in comparison to us, where we have scans and do transcription and the like. And basically astronomers are doing exactly the same thing. The issue of having an initial photograph or observation of all the wavelengths is exactly like Elena said: it is a stance with regard to reality. You put your telescope at a certain focal length, with certain filters, at a certain time of the day. You've got a photograph, and then you annotate it to identify the objects, which are original to you. You link that to central databases of standard objects, or you apply some tools, data mining tools, some specific variations on that. And everything is primary data or raw data relative to the next stage, basically. So you are never really raw.

[40:01]
[Julia Flanders] Right, okay. So the question for us might be what corresponds to that furthest back?

[40:07]
[Laurent Romary] When you’ve done the whole you have already processed the part.

[40:12]
[Maximilian Schich] I think this kind of raw versus not-raw discussion is a little bit artificial, because of course data comes in some form even if it is raw. It is just like a tuna steak you buy at Whole Foods: it is raw, but it has also been processed. It has been cut into a square and stuff like that. The point is that there is data which we receive to analyze, and then we have to do something with it, and what comes out is, say, a more modeled or differently modeled version. And I would like to go back to this question: what are the consequences of humanities data modeling for other areas?

So there is one thing: in many branches of the humanities, we are actually in the situation that it is totally not clear what our item or unit is. Look at something like the drawings collection of the Uffizi: are we talking about the drawings of the circle of Sangallo, or are we talking about the drawings of […], or are we talking about the actual books which exist right now? Are we talking about the books the pages were bound in, in a different kind of configuration, before? So this is the work: basically, identifying the unit. And multiple researchers might come up with multiple different units. That is actual work in some parts of the humanities. Whereas, I agree, in linguistics […] have a unit, then you can do other things, and it might be very, very interesting to actually work with that.

But then look at other areas. One example is human mobility research, currently heavily based on cell phone CDRs (call detail records), which are basically the billing records. You have two cell phone numbers; you can throw away the target number, because you are only interested in one cell phone number and the location of the tower where the phone call was made. And basically people always talk about human mobility, but it is actually the mobility of cell phones, right? Because obviously in, whatever, Kinshasa, ten people share a cell phone because it is too expensive. In Sicily people might do […] and basically have three cell phones to talk to different friends, and maybe they share one cell phone with their cousin who does the toto when they're not there. And stuff like that. So this kind of thing is not a problem for them, even though as a humanities person, if you listen to these talks, it is the first thing you think about. It is like, "What? They are talking about people? But they are talking about cell phones actually." All the data they have is cell phones. And so I think we are not going to provide the solution for them, because if you point it out, people get annoyed; they do not see the problem. Within the limits of their interest, looking at the 50,000 most connected people in the country, it is not a problem, right? Because there are only two or three, probably, who will share their cell phone. I think once they get to the point where they say, "oh, there is this idiosyncrasy, I have to look into that," and they figure out, "oh, this family is using one cell phone," they get the problem by themselves, and that is the moment where we have to say: "Here is our hunch. Here is what the arts and humanities have said; we have solutions for this problem." But I think we cannot go out and tell them "oh, the arts and humanities have the answer," because these people are working with data which is orders of magnitude more massive than our data. So I think it does not make sense to say "we have the special problem, we are much more complicated." These guys are complicated too. So probably it makes sense to come up with a solution on our kind of platform and then have systematic talks, the way network scientists do at NetSci conferences, where people come together from the social sciences, economics, physics, biology, whatever, and they listen to talks from each other. It is constructed so that you listen to all these different things. And there you learn: "Oh, biologists have a really good solution for working with sampling bias. These people have a really good solution for working with data models," and stuff like that. That is the kind of procedure we need to establish, and that is not in place at all, is it?

[44:45]
[Syd Bauman] So, I am first going to hark back a little bit and respond to Desmond's very valid point, and then I'll take advantage of Max and move forward on the questions. First, Desmond said something really interesting which is worth examining a little bit, which is that sometimes the simpler we keep things, the better off we are. The quintessential example is Professor Eugene Charniak here at Brown, who during the late 1980s was one of the golden young CS professors of AI, of this artificial intelligence stuff, and he is still around, still doing AI, but now he says, "you know, the simpler I keep the model and the stupider I make the algorithm, the better the results." That is paraphrasing a little, of course, but I think that is interesting stuff. But there is one point where I take issue with Desmond, and that is the idea that just taking the words and looking at word-frequency kind of stuff is not using a data model. To me that is using a data model. It is perhaps a simplistic model, but it's still a data model, and an interesting one: the interesting bit is just a set of words, and words are bounded by spaces, or whatever the model may be.

[46:04] But then, to take advantage of what Max just said: before, I thought about moving to the next questions (where can we see consequences, and what are the problems we could tackle for other domains?), and part of me has a perhaps inappropriate knee-jerk kind of reaction to say, "I do not want to go there." I do not want to know what the effects of my thoughts about data modeling here and now might be, because they might be so massive. We have evidence that we in the humanities at times have an enormous effect on the world's data modeling thought, XML being the primary example. This technology has swept the world by storm; every field uses XML, sometimes in ridiculous ways. And it originated because you want to represent text in illegal places and stuff like that. The TEI extended pointer mechanism, which became XPointer, is another example of where a data model in the humanities kind of took over and expanded and did lots of interesting stuff that I for one could never have predicted, and would have felt stymied if I had. Maybe because I'm too small-minded, I don't know.

[47:04]
[Jim Kuhn] I think one possible way to rephrase some of these questions might be as the difference between encoding for discovery, or data modeling for discovery, which is the private, personal side, and encoding for meaning, or data modeling for meaning, which is maybe where we take things when we say, "okay, I think I am done. Here are my results. Here is something that is publishable." And I think the hermeneutic spiral sometimes does not loop back. So if we end up as archivists with this idea that we should respect the original order, often we get a jumble of half-formed thoughts and things that are still being worked through. But if we encourage looping back to the model that you worked through as you were exploring your sub-domain, identifying the key points as "hey, these are the important bits," then when we are archiving games, for instance, we have a trust-based mechanism for saying, "okay, these are the bits that are important." I do not know how that happens, but we have talked about the hermeneutic process and data modeling as an organic process. Sooner or later all of these projects will become static objects to study in themselves. And maybe we just need to encourage each other to remember that.

[48:45]
[Fotis Jannidis] I just wanted to raise again a question that we talked about yesterday: the question "what is data modeling?", referring to Desmond's conclusion, and Syd's. I am not sure whether the data model Desmond was talking about, where data is just a stream of bytes or words and so on, is one; probably you could say that on a specific level that's a data model. But on the other hand you could say my data model is implied in the algorithms I use to data mine my text. And so finding these concepts and the links between the concepts assumes something about the things—

[49:20]
[Desmond Schmidt] It is just proximity. The information is in the language.

[49:23]
[Fotis Jannidis] Yes, but you are saying that the information is in the language, so that is your data model. Your data model is "proximity is worthwhile to research and to measure, so I can then link the concepts, because it's not nonsense." So you have this assumption about it. The question is: is "data model" a good word for that? Or is it something completely different? Should we talk about data modeling just for entering things into the archive, where we put specific labels on specific parts or store them in a specific way, as opposed to this kind of intellectual data model, or model of data?

[50:21] [Desmond Schmidt] I am not excluding that at all. I think it is perfectly okay to have personal data, data directed towards a particular kind of output. That is fine. But a layered model of markup and properties in addition to plain text is what I am suggesting: keeping the two separate, or merging them as required in different combinations.
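
A minimal Python sketch of the kind of layered arrangement Desmond is suggesting, as I read it: the plain text is stored untouched, and markup and properties live in separate standoff layers of character ranges that can be applied or ignored independently. The layer names and data layout are illustrative assumptions, not his actual format.

```python
text = "It was a dark and stormy night."

# Interpretive layers kept apart from the text as (start, end, property) ranges.
layers = {
    "structure": [(0, 31, "sentence")],
    "analysis": [(9, 13, "adjective"), (18, 24, "adjective")],
}

def apply_layer(text, ranges):
    """Materialize one standoff layer over the unmodified base text."""
    return [(text[start:end], prop) for start, end, prop in ranges]

print(apply_layer(text, layers["analysis"]))
# [('dark', 'adjective'), ('stormy', 'adjective')]
```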

[50:41]
[Fotis Jannidis] No, actually I was taking your comment up to explore a different kind of thought: how do we describe what we are doing with data as researchers, compared to storing data in a specific way and saying "this is the author's name" and so on, that kind of data model? I think this is just, in a way, a question of how much extension, or over-boiling, the concept of data modeling can bear.

[51:03]
[Julia Flanders] So the idea that the models also inhere in practices or in procedures as well as in […]

[51:09]
[Wendell Piez] Yes, I might want to pick that up, if only not to say that much about it; my thoughts are not very formed, but it was something that came to mind yesterday at a couple of points. We can talk about data modeling in the sense that we inscribe information about, or information into, our information: there is a way in which we actually shape and organize information and model in that way. But there is also modeling to be performed in the way you are describing, which has to do with the processes, and the designs of the processes, that operate over our data, and what we represent in the form of libraries and APIs and interfaces and other ways of actually working with the data, which are not necessarily considered formally part of the data and yet are completely intrinsic to what we are actually involved in and engaging in. So we have to keep in mind that it is not either all one or all the other; it is both at the same time.

[52:18]
[Syd Bauman] Just to back up the idea that there are some simplistic models which we kind of take for granted here that really are data models: there's a stream of words that I want to study. For all the models we've been talking about so far, no one has mentioned order. But in fact the order of things matters. If we argue about what the units are: for some data models, once you've decided on the units, the order of the units is important, and in some other models it's not. I mean, if I hand Desmond a pile of words but I sort them first, I've pretty much screwed up everything he wants to do. A data model that permits that would not work for him. And the same is true, I presume, for Elena's pages or folios: if you take them out of order, they may not be as useful to her.
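
A two-line illustration of Syd's point, under the naive whitespace tokenization assumed earlier: both values below contain exactly the same units, but the second model has silently discarded their order, and with it precisely the proximity information Desmond's approach depends on.

```python
words = "the quick brown fox jumps over the lazy dog".split()
bag = sorted(words)  # a model that keeps the units but forgets their order

print(words)  # ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
print(bag)    # ['brown', 'dog', 'fox', 'jumps', 'lazy', 'over', 'quick', 'the', 'the']
```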

[53:25]
[Michael Sperberg-McQueen] I wanted to ask you, if I understand correctly, Fotis, you are suggesting maybe there’s a level at which it is not worth calling it a data model. Maybe we should reserve data model for things that are more formal. First, I want to point out that there is a way in which I think your suggestion is a recapitulation of history, but then I want to ask you if you really think it is tenable because I do not see how to do that. I will go ahead and confess that now. I do not think we can find a floor beneath which there is no point in calling something a data model. I think I am agreeing with Elena. I think it is theory all the way down. You will not see anything if you do not have some predisposition to see something.

There is however an interesting historical [paradigm?]. No one talked about database models until [Edgar F. Codd] produced the paper defining the relational model. And then, in retrospect, the discussion of the relational model, with the presumably inevitable claims that earlier databases did not really have a model at all, led to the explicit formulation of the data model inherent in the hierarchical databases that preceded Codd's work. And so there is a sense in which I think we use the term "data model" for our attempts to make our assumptions explicit. And when they are implicit, in particular when we are not aware of them, we are going to be tempted to say "there is no model here." But my inclination is to believe there is always a model there; there are models that we are not yet conscious of, because we are only talking to people who share these assumptions.

[55:30]
[Fotis Jannidis] I think we have two separate problems. One is the extension of the concept of data modeling. My understanding from computer science, from the books I look at, is that it has a rather limited extension, because next to data modeling you have process modeling and you have algorithms and so on. And I wanted to say that if we use it coming from this direction, obviously there are parts of what we are doing where it is not sensible or useful to talk about data modeling.

The other thing is, with data modeling, are there primitives? I disagree with Elena on this point, because I think there are social constructions out there […]. We are scholars, we are looking at the world, but we are not creating this part of the world. And then we can reconstruct these social constructions, and we can do it better or worse, and that's what I was pointing to. It's not that at some point we get at the raw data without any filters; it does not get rawer than the social construction, but that is where we get it. And we very often have this difference between the word and the phoneme. The word is out there in the wild. The other is a construction by people, by humans, and they say something that is worldwide and [try?] to delimit it. And then we construct this, and it can be useful, but it is not out there. Many of our constructions of our primitives are out there in the wild. A book, for example; I am not sure about the document. I think documents are not out there.

[57:36]
[Desmond Schmidt] While I agree with Elena and Michael about the interpretation that goes into every representation we make of reality, there is a difference. We cannot just equate something where we have put loads and loads of subjective judgments into the text with a plain text representation, which contains much, much less interpretation and is more usable. It is really a sliding scale of reusability: the less we put in the text, the more other people can use it, and the more we put into it, the less usable it becomes. It cannot just be used as a club, to say it is all interpretation, let us not care about it, let us just put everything in because there is nothing we can do about it, it is always interpretation.

[58:31]
[Julia Flanders] I find myself in sympathy with that point as well. It seems to me that the philosophical point is so… it smites you, and then you feel disabled, or I feel disabled. I say, well, if it is all theory, what can I do? I think the point that degree matters is enabling, not just in a practical sense but also in a social sense. In other words, even though we may differ in our sense of… there might be very useful and interesting debates about whether one can imagine rawness, or whether rawness exists and, if so, in what way. Leaving all of that philosophical discussion aside, there might still remain a very useful question for us as a community, or for many "us-es" as many communities: a discussion concerning what the useful degrees are. In other words, for this community, can we say what a useful set of agreed-upon, acknowledged social conventions is, to let us continue with our research? And I think that those discussions might very well be bounded by discipline or by community of practice or something like that. But they would very much arise from the fact that degree does matter. It's not simply an all-or-nothing, theory-or-no-theory kind of situation.

[59:55]
[Wendell Piez] I agree with that, except I do not see why that agreement needs to be general across a discipline or across any bounded community, for anything other than pragmatic reasons local to them. I mean, this is what I was talking about yesterday with my sliding scale: what we need, what we need to have, what we need more of, is the ability to move up and down, so we can work with lower levels where we have greater potential and less actual commitment, and yet at the same time move up the scale to greater commitment, greater constraints, greater power, which also leaves us fewer opportunities to do things that used to be potential but are no longer potential because of the commitments we made. And I think this speaks directly to the thing about constraints that Steve mentioned yesterday, about how constraints are bad and good at the same time; that's the nature of a constraint, that it gives you power. And in some senses I think that communities can actually be defined, are defined, by the particular agreement we make within a community to work within a particular set of constraints. Yet at the same time I also think that these are constantly being renegotiated, and they need to be renegotiated, and it's the development of the tools over time that gives us the ability to work more flexibly up and down. And that is really important, because the minute you get locked into one place and say that is the place we want to be, that is the point where somebody else comes in and says, wait a second, I can do something different if I go up or if I go down, if I go somewhere else. And that is fruitful.

[1:01:32]
[Julia Flanders] It is a good time for a break, but there were two sort of urgent-looking hands up, so maybe some final comments from Michael and Elena.

[1:01:41]
[Michael Sperberg-McQueen] I agree with Desmond's point: it is a question of degree. And the reason I take out this club, to say it is all theory all the way down, is to attack the idea that there is some natural level, a point at which you can say, "that is all objective. That is okay. There is no subjective stuff in there." I do not believe it for texts; I do not believe it for the astronomical data. The notion that the temperature measurements that the astronomers are recording are somehow pretheoretical or atheoretical does not hold up to even ten minutes' reading of [Thomas] Kuhn, let alone other philosophies of science. And as Wendell says, sometimes being able to move up is important. It is quite true that more people can use a lower-level representation that encodes less of our understanding, because the parts of it that are understood are more likely to be widely shared. But if Syd and I agree on more than just a sequence of characters, if we agree on the existence of paragraphs and the identity of the paragraphs and chapter headings and so forth in a document, we can make much better use of each other's texts if we encode that information. And I'll point out that one of the reasons that SGML and XML distinguish markup from content is precisely to make it possible to exchange rich data with people who don't share your preconceptions. If someone says, "oh, I cannot use this dictionary because they have encoded their theories of etymology into it," and they have encoded them in XML, then my inference is not that these people have encoded subjective information into the dictionary but that the user does not know how to use the data, because the whole point of XML is that those assumptions are reified and I can filter them out. I do not have to look at them. They are easily marked. It is easy to turn them off and make them invisible to my processing. So if I want to process it at a lower level, I can.
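
A small sketch of the filtering Michael describes, using Python's standard XML library on an invented dictionary entry: because the encoder's etymological theory is reified as explicit markup, a user who rejects it can strip that layer and process the text at a lower level. The element names are illustrative, not a real dictionary schema.

```python
import xml.etree.ElementTree as ET

# An invented entry where the encoder's interpretive layer (the etymology)
# is explicit markup rather than being fused into the running text.
entry = ET.fromstring(
    "<entry>"
    "<form>model</form>"
    "<etym>from Latin modulus</etym>"
    "<sense>a representation of a thing</sense>"
    "</entry>"
)

# Filter out the layer we do not trust, then read what remains as plain text.
for etym in entry.findall("etym"):
    entry.remove(etym)

print(" ".join(entry.itertext()))  # "model a representation of a thing"
```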

[1:04:19]
[Elena Pierazzo] Yes, I agree that there are degrees of complexity captured in your model, but I do not think that "degree matters" can be associated with a notion of value, as if the fewer things we put in, the more valuable the model. I think that each model, the simplest as much as the most complex, answers specific scholarly questions; the more naked, the lower-level the model, the fewer questions it can answer, or there are specific questions it can answer that some other model cannot. So if exchangeability is one of the purposes of the model, then so be it. But sometimes that is not the point. The point is to answer the question you have, and that is what the model is for. Sometimes the simplest model does not answer the complex question we want to answer, so you need to add to it; sometimes you do not. And there is no good or bad. There are just different questions.

[1:05:10]
[Julia Flanders] Well that sounds like a perfect segue to eating and drinking. Thank you all very much.
