Gregor Middell, “On the Value of Comparing Truly Remarkable Texts”

Case Studies–Critical Editions (March 14):

Gregor Middell, “On the Value of Comparing Truly Remarkable Texts” (video)

[00:00]

[Fotis Jannidis] I’m sorry to say that the next speaker, Malte [Rehbein], isn’t able to come, so this gives us more time to discuss with Gregor [Middell].

[00:35]
[Gregor Middell] Okay. This presentation will appear to some as kind of a flashback, inasmuch as a lot of it is a practical angle on what Wendell [Piez] said in the keynote about multiple hierarchies. My talk today—

[Interruption due to microphone difficulties; exchange of words with audience.]

Okay. Thank you so much. Right, a flashback to Wendell’s presentation inasmuch as this one will also be about multiple hierarchies and markup problems. My day-to-day work is on the digital edition of Goethe’s Faust–a genetic edition, that is–and I am a technical research assistant in this project. One of the main challenges of this genetic edition of Goethe’s Faust is that we are trying to describe Goethe’s text from multiple perspectives. These are only five that I try to represent in this diagram, but there are many, many more that we can think of. The two on the left hand side, the document markup and textual markup, I’ll talk about in a little more detail in a second, but the general idea is that to deliver a really sincere edition of Goethe’s Faust we not only want to deliver a reading text or some story about how the text came into being; we also want to describe things like metadata, for example, that we got from the archives where the manuscripts are held. Or we want to link the text to images like, for example, the illustrations that we have for Goethe’s Faust. And, more importantly, we want to deliver a very faithful description of the records that we find in the archives, namely the documents. So let me just take two of these perspectives, namely the document-oriented perspective and the textual perspective of our edition, and show you some of the challenges that we have in marking those up and representing them in our edition. In German editorial theory, when you deliver a historical-critical or historisch-kritische edition, you’re asked to make a clear distinction between what you find in the archive, the record, on the one hand (the Befund) and your interpretation of that record (the Deutung), or what we could simply call a clear, constituted reading text. And the problem that we face structurally–I tried to draw [it] up here–is that the reading text, or the interpretation of what we find in the archives, is more or less very regular. So [any]body who has done some typography and text layout knows what’s displayed on the right hand side, namely the classical block-level/inline-level distinction of a text: you have block-level elements that you can lay out vertically from top to bottom, and then you have inline elements–words and all these things–that line up left to right or right to left, depending on what your writing direction is. So this is what you ultimately want to end up with; this is your textual perspective on the edition. What you see on the left hand side, also a bit abstract, is a documentary view, and this doesn’t adhere to this clear structure at all, or does not necessarily adhere to this structure.

[03:56]

Obviously you also have certain zones or areas in the manuscript that you can subsume or subordinate to the textual idea of having a vertical and horizontal layout. But then there are also other characteristics that you have to describe completely differently. So some areas line up in a certain way. Some textual sediments are grouped together spatially. The writing direction changes; you can rotate text, for example. There are strikethroughs and other artefacts that you want to represent that don’t adhere to textual order but are more of a graphical nature…. But both constitute, or both describe in some way, the same text; it’s just that the one is more truthful or faithful to the manuscript, the record, and the other one has more resemblance [to] the text, the reading text. If you want to encode those phenomena–and here I take three perspectives–what you end up with as a data model is something resembling this, and it might look very familiar to you, because it’s a very common or well-known data model in markup theory. I tried to abstract away from an extant text and take a very simple text where you just have three lines that you want to describe from different perspectives. So on the left hand side what you see is a documentary perspective of the text. So you might have a text that is placed on your manuscript in two different zones. So let’s say two different lines, A and C, that are in the middle of the page. And then in the margin, let’s say the right hand margin, you have a third line, the line B, that comes in there. And if you want to describe that spatially, you could do it in a structure like [the one] printed down the left hand side. Let me change the perspective of the text, and you see now what I want to actually describe is the content structure of the text. Let’s say it’s a drama or a verse text. The order of the text as well as the markup changes, so what once stood in the margin, the line B, now comes to stand between the lines A and C, because it might have been an insertion done later on in the text. That might be the second perspective. And just to add to that complexity, you could have a third perspective that now looks at the chronology of the whole text and says, “Okay, A and C have been written in the first stage; in the second stage, line B was written down.”

[06:10]

From the markup perspective–and that’s the top layer–you can encode this with XML without any problems. And the data model underlying this markup structure is a classical DOM model made up of nodes and connections that you see in the middle there, and they’re structured quite differently. The interesting thing though is that all these different structures refer to more or less the same text. So our lines A, B, and C don’t really change, or only change slightly, depending on the perspective. What really changes is the structure or the interpretation of that text. This data model might seem familiar to you, because it’s a classical GODDAG structure that Michael Sperberg-McQueen and Claus Huitfeldt developed quite some time ago and that is by now a digital humanities standard. The main problem with this data model is not so much that it’s not well understood or that we would think of it as something completely unthinkable in terms of text, but that it’s really, really hard to use in a practical manner. So the question that we haven’t really answered yet is: how do we encode such a data model in a way that is really efficient? How do you process it? So what means–storage mechanisms like databases, query languages–do you use to actually work with such a data structure?
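
A minimal sketch of that shared-content idea–one set of text nodes, several independent hierarchies pointing at them–might look like this in Python (the class names and perspectives are purely illustrative, not the Faust project’s actual data model):

```python
# Minimal GODDAG-like sketch: the text nodes ("A", "B", "C") exist only once,
# while several independent hierarchies point at them.

class TextNode:
    """A piece of text content, shared by all perspectives."""
    def __init__(self, content):
        self.content = content

class StructureNode:
    """A markup element belonging to exactly one perspective (hierarchy)."""
    def __init__(self, name, perspective, children=()):
        self.name = name
        self.perspective = perspective
        self.children = list(children)

    def leaves(self):
        for child in self.children:
            if isinstance(child, TextNode):
                yield child
            else:
                yield from child.leaves()

# The shared text content.
a, b, c = TextNode("A"), TextNode("B"), TextNode("C")

# Documentary perspective: A and C sit in one zone, B in the margin.
documentary = StructureNode("surface", "documentary", [
    StructureNode("zone", "documentary", [a, c]),
    StructureNode("margin", "documentary", [b]),
])

# Textual perspective: the reading order is A, B, C.
textual = StructureNode("text", "textual", [a, b, c])

# Genetic perspective: A and C belong to stage 1, B to stage 2.
genetic = StructureNode("genesis", "genetic", [
    StructureNode("stage-1", "genetic", [a, c]),
    StructureNode("stage-2", "genetic", [b]),
])

# Each hierarchy yields its own ordering of the very same text nodes.
for root in (documentary, textual, genetic):
    print(root.perspective, [leaf.content for leaf in root.leaves()])
```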

[07:23]

Our first problem in the edition was encoding: so how do you encode it? The TEI makes a couple of recommendations on how you should encode such multiple hierarchies over a given text. And all those propositions are very well-thought-[out] and have been applied multiple times in different editions. But they are basically workarounds. My impression when you read through these different propositions–and I think it’s in Chapter 20 [of the TEI Guidelines]–is that you can solve the problem with some trade-offs–that’s what a workaround is about–but my main questions are these: First of all, in an edition that’s really about multiple perspectives, which perspective should you actually choose as the dominating one? So even if you subordinate different perspectives–let’s say the documentary one or the genetic perspective–should the textual perspective really be the one that dominates your encoding? And is this a wise decision if all those different perspectives on the text should [actually] stand in their own right? And the second question, which might be of specific relevance to this workshop, is: aren’t we just shifting complexity? So if we say on the encoding level, “Okay, we work around the deficiencies or the constraints of a specific data model, namely our tree-like data structures,” aren’t we just shifting the complexity to the processing, where we then again have to deal with this problem in some way–a problem which we don’t see in the encoding but then have in the processing?
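
To make that trade-off concrete, here is a small sketch (the element names are made up, not the project’s or the Guidelines’ actual tags): nesting two perspectives directly produces overlapping, ill-formed XML, while a milestone-style workaround keeps one dominant hierarchy and flattens the other into empty elements.

```python
import xml.etree.ElementTree as ET

# Overlapping hierarchies cannot be expressed by nesting: the documentary
# <zone> and the textual <l> (verse line) cross each other here.
overlapping = "<text><zone><l>A</zone> B</l><l>C</l></text>"

# Milestone workaround: the textual hierarchy dominates; the documentary
# zone boundaries are reduced to empty "milestone" elements.
milestones = (
    "<text>"
    "<l><zoneStart/>A<zoneEnd/> B</l>"
    "<l>C</l>"
    "</text>"
)

for label, doc in [("overlapping", overlapping), ("milestones", milestones)]:
    try:
        ET.fromstring(doc)
        print(label, "-> well-formed")
    except ET.ParseError as err:
        print(label, "-> not well-formed:", err)
```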

[08:49]

So what we ended up with in the Faust edition was instead to do something that is also recommended by the TEI but not very popular among editors: we transcribed the text several times. So every manuscript gets a transcription from the documentary perspective and then from the textual perspective, and then the question obviously remains–and that’s the main reason why it’s not very popular–how do you synchronize? How do you relate these different transcriptions of the same text? And what we turned to–and that’s what I tried to allude to in the title–what we did was we collated the texts against each other. So take a first look from a very schematic perspective. You have the same text, A, B, C, and let’s first of all assume it’s the same text. So the same order of the tokens or lines and no difference between the texts. And what you can see structurally is that the text A, B, C on the left hand side has been marked up in a certain way. And the same text A, B, C has been marked up in a different way in a different document. So these are the two XML documents that I have at hand. So what we need is a correlation between the elements in both transcriptions that are actually the same. So I want to have a correlation between the As, the Bs, and the Cs, and what I end up with as a result is a data model that is more complex than the isolated transcripts of my document. And interestingly that’s exactly what collation is doing. Automatic collation of text does nothing other than correlate the things in the text that are the same, in order to find out what’s different. Or, schematically, and a bit more from the perspective of collation: what it does is sequence alignment.

[10:27]

On the left hand side you see a schematic collation result. You have to read it from top to bottom. So you have the first text, which reads ABCD, and then, let’s say, a second text that reads ACDB. And what sequence alignment algorithms do is to introduce gaps into those sequences, depicted by hyphens, so that the same tokens line up; so that, for example, we can see that the A is actually the same in the two texts or that the C occurs in both of them. And what I also tried to show you is that you can actually detect things like tokens being moved around. So what you gain by applying collation to this problem is not only the nice effect that you can correlate things in the text that are the same, but you also get a certain kind of fuzziness or flexibility in terms of how you mark up your text. So in the GODDAG model, where you assume that the textual content is actually the same so that you can construct several data models or hierarchies or schemas over the same text, this is the kind of constraint that you don’t necessarily have to adhere to anymore if you apply collation to the problem, because then all of a sudden you can leave certain things out of one perspective. We have that in the Faust edition, for example: when an archivist writes something down on the manuscript, we would like to transcribe that, because it gives us hints about how this manuscript was treated. So it’s definitely part of the text from a documentary perspective. We obviously have to leave out that part of the text as soon as it comes to the reading text, the textual transcript of it. You can still cope with that in some way with TEI markup means, but with collation it becomes much easier, because all that happens in the alignment of the text is that this particular part is just left out and not aligned with any part in a different perspective.
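
A rough sketch of that gap-introducing alignment, using only Python’s standard library (difflib here is a simple stand-in; CollateX and bioinformatics aligners use more sophisticated algorithms and treat transpositions differently):

```python
from difflib import SequenceMatcher

def align(tokens_a, tokens_b, gap="-"):
    """Line up two token sequences, inserting gaps so that
    matching tokens end up in the same column."""
    row_a, row_b = [], []
    matcher = SequenceMatcher(a=tokens_a, b=tokens_b, autojunk=False)
    for op, a1, a2, b1, b2 in matcher.get_opcodes():
        span_a, span_b = tokens_a[a1:a2], tokens_b[b1:b2]
        width = max(len(span_a), len(span_b))
        row_a += span_a + [gap] * (width - len(span_a))
        row_b += span_b + [gap] * (width - len(span_b))
    return row_a, row_b

# The schematic example from the slide: ABCD against ACDB.
top, bottom = align(list("ABCD"), list("ACDB"))
print(" ".join(top))      # A B C D -
print(" ".join(bottom))   # A - C D B
```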

[12:15]

The other nice thing is that this whole problem of sequence alignment is actually a very well-known problem in computer science, so we can take advantage of existing algorithms in bioinformatics, for example, or take advantage of existing solutions in textual collation for philologists to make that work. I can show you a very short demo. Because collation is so crucial to our project, what we’re actually doing is contributing to collation software that is currently in the works, called CollateX. And there are obviously other solutions to that; I’m just showing this one as an example. And CollateX’s main objective is not necessarily to develop a really good sequence alignment algorithm, although that is one of its tasks, but what we really want to achieve is being able to collate any kind of text, be it a markup text or a non-markup text, to actually achieve that kind of correlation. So what I can show you right now here in this demo is just the textual collation: you can hand in, let’s say, three texts to the collator, and what it ends up with is a representation of the things in the text that are actually the same or different, or you can have a tabular display of the differences and commonalities between the different texts. But the main point that I want to stress is that you can do it automatically. So to correlate the two things you don’t necessarily have to read both texts and meticulously go through them; you can leave that task to a computer. And, more so, if you can do it for XML documents or any markup document, you don’t only get the correlations between the words, so your collator doesn’t only say “chases” appears three times in my manuscripts, but it also tells you something about the markup context of this word “chases” in the different manuscripts, so you get the correlation.
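
For illustration, the Python port of CollateX exposes roughly this kind of interface (a hedged sketch of its documented plain-witness example; the web service shown in the demo has its own API, and the exact signatures may differ between versions):

```python
# Sketch using the Python port of CollateX (pip install collatex);
# the witnesses and sigla are invented, and the API may vary by version.
from collatex import Collation, collate

collation = Collation()
collation.add_plain_witness("W1", "the quick brown fox jumps over the dog")
collation.add_plain_witness("W2", "the brown fox jumps over the lazy dog")
collation.add_plain_witness("W3", "the quick fox jumps over the lazy dog")

# collate() aligns the witnesses and returns an alignment table showing,
# per witness, which tokens are shared and where the witnesses diverge.
print(collate(collation))
```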

[14:01]

And that’s what we do in the Faust edition: we collate our different transcripts of the same text against each other and end up with an architecture that somewhat resembles the one you see on this slide. So instead of having a classical XML database that you would put behind a dynamic edition, what we actually have is a graph database where all these different transcripts are stored. But [they] are not stored as separate documents–as you would normally have one XML document for the documentary transcript and one for the reading text–they’re actually stored in the graph database, and they’re interconnected. So we color the edges, and one color more or less corresponds to one particular schema or one way to transcribe a text. And what the collation algorithm lets us do is that texts are only represented once in the database. So if you have a word or a verse, it’s only one node in the database, and it gets referred to by different schemas or different ways to describe it. And now the whole problem of switching views or switching representations between different data models or schemas of the same text becomes a traversal problem in the graph database. So if I want to have the textual perspective on a text, I pull, let’s say, the blue-colored nodes, including the edges, and push them to the browser to display. If I want to switch the perspective for a particular word in the edition, I take that node, go back to the graph database, and ask, “So in what different-colored hierarchies are you contained?” and switch the perspective to then show, let’s say, a documentary perspective or a genetic one.
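
Sketched very roughly, with a plain in-memory stand-in for the graph database (the node names, edge list, and “color” labels are illustrative, not the project’s schema), extracting one perspective is just a traversal that follows edges of a single color:

```python
# Text nodes are stored once; each perspective ("color") only adds edges.
nodes = {
    "surface": {}, "zone": {}, "margin": {},      # documentary elements
    "text": {}, "stage-1": {}, "stage-2": {},     # textual / genetic elements
    "A": {"content": "A"}, "B": {"content": "B"}, "C": {"content": "C"},
}

edges = [
    # (parent, child, color)
    ("surface", "zone", "documentary"), ("zone", "A", "documentary"),
    ("zone", "C", "documentary"), ("surface", "margin", "documentary"),
    ("margin", "B", "documentary"),
    ("text", "A", "textual"), ("text", "B", "textual"), ("text", "C", "textual"),
    ("stage-1", "A", "genetic"), ("stage-1", "C", "genetic"),
    ("stage-2", "B", "genetic"),
]

def children(node, color):
    return [child for parent, child, c in edges if parent == node and c == color]

def traverse(root, color):
    """Depth-first traversal that follows only edges of one color."""
    result, stack = [], [root]
    while stack:
        node = stack.pop()
        if "content" in nodes[node]:
            result.append(nodes[node]["content"])
        stack.extend(reversed(children(node, color)))
    return result

print(traverse("text", "textual"))          # ['A', 'B', 'C']
print(traverse("surface", "documentary"))   # ['A', 'C', 'B']

# Switching the perspective of a single word means asking which
# differently colored hierarchies contain it:
print([c for _, child, c in edges if child == "B"])
# ['documentary', 'textual', 'genetic']
```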

[15:43]

The main problem with this approach is that, while it’s nice from a modelling perspective–having different perspectives on texts and different hierarchies, or multiple hierarchies, is what we wanted to achieve for quite some time–it’s computationally complex. Reading this structure, querying it, works very well; traversing such graphs is something that databases can do very fast by now. But we have a problem with our graph updates. Every time we add a new perspective to the text, we have to manipulate the existing ones, depending on the granularity with which we mark it up. So just imagine one node represents a line, and all of a sudden some linguist comes along and introduces part-of-speech tagging, and this line gets split up into multiple words. So we have to go to the database and split up that one node that constitutes one line into multiple words and introduce a hierarchy, which is computationally more expensive than a different approach that Wendell [Piez] proposed, namely ranges. Ranges are much nicer inasmuch as you can add them to a text independently of the structure or the schema that has been applied to the text beforehand. So if you say the red line is one kind of markup or one structure over my text–so A and B have some markup meaning, C, D, E have a different one–it’s quite easy to add a blue or a green layer on top of that without interfering with the existing one. And querying is also much easier, because those ranges work very well with relational databases and existing technology. So that’s what we’re currently doing–and I won’t talk about it at length, because Wendell has talked about it much longer and much more sophisticatedly than I could. I won’t talk about the LMNL model and the recursive markup thing and all [such] things. All I want to say is that we have some practical problems with this graph database model, problems that you only come up with, or that you only encounter, when you actually implement the model instead of just thinking about what would be a proper model for representing text in a digital medium.
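
A small sketch of that range idea–standoff annotations over character offsets, stored relationally so that new layers never disturb old ones (the table and column names are invented for illustration):

```python
import sqlite3

# The shared text; annotation layers refer only to character offsets in it.
text = "A B C D E"

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE annotation (
        layer TEXT,       -- "red", "blue", "green" in the slide's terms
        name  TEXT,       -- what the range means within that layer
        start INTEGER,    -- inclusive character offset into the text
        stop  INTEGER     -- exclusive character offset
    )
""")

# The "red" layer groups A+B and C+D+E; it was there first.
conn.executemany(
    "INSERT INTO annotation VALUES (?, ?, ?, ?)",
    [("red", "group-1", 0, 3), ("red", "group-2", 4, 9)],
)

# A "blue" layer is added later, without touching the red one at all.
conn.execute("INSERT INTO annotation VALUES ('blue', 'line', 0, 9)")

# Querying is plain SQL: which ranges, in any layer, cover offset 2 (the "B")?
rows = conn.execute(
    "SELECT layer, name, start, stop FROM annotation "
    "WHERE start <= ? AND ? < stop",
    (2, 2),
).fetchall()
for layer, name, start, stop in rows:
    print(layer, name, repr(text[start:stop]))
# red group-1 'A B'
# blue line 'A B C D E'
```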

[17:44]

So our preliminary conclusions from that project, with reference to this data modelling workshop, are threefold. First, modelling texts independently of a specific encoding or markup format and its specific data model–in this case a DOM–allows us to cope better with their inherent complexity. Second, the combination of established encoding practices and experimental computational approaches facilitates a gradual increase in complexity. We still use the TEI encoding standards to mark up our texts; it’s just that we separate different ways of describing the text into different transcripts. So we can start off with one perspective and say that what we describe right now is a text, but we can add a different perspective later on by just collating it and adding it to the existing model. So we have “separation of concerns,” which is a really neat feature of the data model: you can separate different concerns and have different approaches to the text. And the last point is maybe a very simple one, but one that I would like to stress: modelling texts is no different from modelling in other application domains, inasmuch as it must take conceptual as well as computational aspects into account. Thank you.

[19:08]
[Fotis Jannidis] The last point especially could provoke comments.

[19:12 ]
Stephen [Ramsay] I’m really interested in that very last sentence, because, you know, if you conceive of things as graphs–and I think that’s actually kind of something that comes up in the modelling community, like, “I know, we need a graph”–when things get complicated enough, you know, we go for that most glorious of all data structures, and then we go to process it, and we discover that it’s the best way to destroy Java, you know, for example, Java virtual machines and so forth. So when you say computational aspects, I wonder which thing we’re talking about. I mean: are we talking about computational tractability, which is, at least, philosophically possible for the most complex cyclic graph you can name, or are we talking about the practical exigencies of the systems we have now? Because that second one sounds like something we might ignore. The first one sounds like … you see the distinction I’m making? I mean, there’s “Are we making things that are computationally tractable?” That’s one question. But the other is, “Are we making things that we can actually build and run easily, and the programmers will tell us,” right? I’m wondering which one you are referring to in that sentence, because I heard both concerns. And if you were a modeller, it sounds like, “We shouldn’t give so much attention to whether my existing server hardware can handle it or whether I have enough memory,” and yet in practical terms of course I have to worry about it.

[20:49]
[Gregor Middell] I think I’m referring more to the second part. I mean, the main risk of this project, to begin with, was to go forward with multiple transcriptions of the same text, because already in the TEI Guidelines it’s put down that this is one of the most sincere ways of describing or representing multiple hierarchies in a text. The only problem is that we don’t know how to correlate the different views, and therefore that option is basically ruled out, and then there comes a list of workarounds for how to represent that. And we had to develop this collation approach over the course of the project. So for us it was, first of all, a practicability issue. So what can we do with the text? XML delivers for free not only encoding; it delivers validation for free, it delivers a transformation language–although not a very aesthetic one, at least in my opinion–and infrastructure like databases. So do we really want to find a substitute for all these different offerings, or is there a way to gradually migrate to something more complex?

[21:48]
[Fotis Jannidis] Thanks…. Max[imilian Schich]….

[21:50]
[Maximilian Schich] I think … this kind of argument … graph search [has been] around since the 1960s, but there is an exponentially, literally exponentially, growing field over the last ten years which deals with graphs. And one interesting thing, if your texts are one-dimensional in a sense, where there […]

[22:13]
[An audience member] Yeah, yeah, of course. [Intermittent laughter.]

[22:14]
[Maximilian Schich] If you compare two sets of architectural drawings, right? Where basically you have, say, two sections for a building, which cross, and then you have to play that kind of game where usually art historians used to do hierarchical descriptions of a building, which usually are, like, main building, then floors and rooms and whatever, and then you have … a hard time to actually fit that, right? It’s a very, very similar problem, and you could … basically think of it like texts which have knots in their string, right? So there are, like, trees which are strings which are knotted up, and then you have a hard time, because the gaps would be huge. You have so many gaps [An audience member coughs.] some gaps. But nevertheless you could actually use that graph and do some metric science on it and actually measure how badly or how well you do. So, for example, in this case, if your description of the hierarchic building does not fit the structure of the documents, you can actually see, by the distributions on both sides–the probability distributions–how good your actual classification is. Because if it’s really good, it will be exponential, and if it’s bad, it will be […]. So basically these kinds of things are … it’s an alley, I think, that we have to go down. We have to not only measure the data and come up with a nice data model of reality but actually then measure that data model and say, “Okay, how well are we doing? Are there better ways? What among all the possibilities are the best examples of how to describe that particular feature?”

[23:49]
[Gregor Middell] I skipped one slide, partly because there is some copyright issue: this picture was actually drawn by Wendell in Amsterdam, I think. You made that picture, right?

[23:57]
[Wendell Piez] Mhm.

[23:58]
[Gregor Middell] So it’s not from me, and I left out the credit, but I like this picture very much. It describes all the different theoretical approaches to multiple hierarchies and markup theory in a very spatial kind of way. At that time, right.

[24:11]
[An audience member speaks indistinctly.]

[24:12]
[Middell] At the time. And there you can see that the GODDAG kind of model, the graph-based model, is somewhat in the middle, and the range-based model down there. But these are two main ways of describing text, and each of them has certain computational features, as I tried to explain in this very short presentation. So I’m not very sure whether … or let’s put it this way: just because I can apply certain graph algorithms to graph-like structures doesn’t necessarily mean that that’s my main point of application for texts. I can see how, if I have a graph representation of my markup, I can run these analyses, but the current challenges we had in the edition were much, much simpler: basically, how are we able to write all these different tools evaluating this markup without constantly working around the milestone-based markup that TEI enforces on us? So it was a pure practicality issue. Added to that, GODDAG structures by now are expressed in all different types of formats. So there’s an Italian project, for example, which tries to express GODDAG structures with RDF. And while I don’t really see the point in using this very verbose and kind of triple-based model to describe a GODDAG, they can show very well which applications they can build on top of this model to make it work. It’s just not the representation I would use.

[25:37]
[Desmond Schmidt] I’d like to respond to Steve’s objection about the efficiency of the graph representation. I understand what you’re saying, and indeed when I first looked at this problem of representing a multi-version text as a graph, I came to the exact same conclusion as you: that you could so easily slide into a situation where you’re computing an NP-complete problem [an audience member responds, “Sure ….”], and it would never, ever work. So I discarded that, and I decided to use embedded markup when I published that paper in Literary and Linguistic Computing in 2006 [sic; 2010 seems to be the correct publication date.]. Then I was shown a text by Domenico Fiormonte that was utter spaghetti. It was ten or eleven different drafts of an Italian poem, all written on the same piece of paper, and I realized there was no way with any amount of markup and attributes and links and so on [that] you could possibly ever represent that, and I tried for several months to do it and couldn’t. So I went back to the model, and I came to a different conclusion: if you constrain the graph, you can prove mathematically that you can compute it in a certain time, and the French team who worked on this–MEDITE, [by Julien] Bourdaillet–proved that you can do it in linear time. You can merge at least two versions, and when you have multiple versions–the worst case is quadratic–you can get O(N log N) for the merging operation using a greedy algorithm. It’s true that if you include transpositions the thing becomes NP-complete, but if you have a heuristic algorithm which gets a pretty … good fit, you can do it in very reasonable time. And I think also—I was a little confused when Gregor talked about LMNL in the same breath as the graph model. I think if you separate the two and use the standoff properties, which Wendell has described as a way of marking up the text in layers separate from the versions, you’ve got a complete system that doesn’t use embedded markup. You’ve got the versions … of markup, and that gives you a very good representation, a flexible representation where you can have building blocks, texts, and versions to produce an output, which can also be efficient. You can process Wendell’s representation into HTML quite efficiently. So I don’t think the efficiency problem is a show-stopper.

[27:41]
[Stephen Ramsay] Of course, obviously, a tree is a […] graph.

[27:45]
[Desmond Schmidt] A tree is a graph.

[27:46]
[Ramsay] A tree is an example … a tree is an example of a graph that we’ve screwed with so we don’t run into NP-hard problems. [Laughs.]

[27:54]
[Schmidt] [He probably reaffirms a previous point.]

[27:55]
[Schmidt] No, I understand, I understand….

[27:57]
[Schich] But I think that’s an interesting point. So there are these old kinds of arguments that if a graph is too large, you have to reduce it to a tree in order to draw it, or forget about drawing it.

[28:08]
[Ramsay] But that’s not true. That’s his [i.e. Schmidt’s] point.

[28:10]
[Schich] Yeah, and that kind of notion, which actually is still [the] convention–I don’t know, one of the most famous […] guys notes it in a review book from 1999–is disproved constantly, because the graphs we can draw grow larger and larger by orders of magnitude per year. And, I think, that’s an interesting point; so, I think, in terms of graph drawing, in terms of graph analysis, we are in the age of [G. J.] Chaitin, and we’re still not, you know, at Gregor [sic] or something; there is still stuff to come.

[28:41]
[Jannidis] One question … are you talking about the data model at the level of how researchers, or any other human being from the humanities, could think about the object, or is it on a different level? [As] I understand it, for the computer scientist the data model would be at the logical level–or has it ever caused you to switch the view of how people in the humanities should think about the object?

[29:15]
[Middell] Maybe you can answer the question. I’m not very sure about that, because the model I use, the GODDAG model, [was] published in ’92, ’93 or [addressing Sperberg-McQueen] … no, it’s 2000-something, 2003? I’m not quite sure.

[29:35]
[Daniel Pitti] [The] paper’s in 2002.

[29:37]
[Middell] 2002.

[29:39]
[Sperberg-McQueen] The volume came out then.

[29:43]
[Middell] Yeah, so the model isn’t some new invention or something where I would say this is the technical kind of perspective on things; it seems the natural perspective … for humanists to think about these multiple hierarchies at exactly the same time. What I thought was striking is that it hasn’t been implemented–that basically the workarounds for the problem are much more popular than actually finding an implementation for that very clean model of multiple hierarchies….

[30:15]
[Wendell Piez] Yeah, I … I don’t want to change direction, because I’m soaking all of this up and learning a tremendous amount just from the differences, but I do think that there’s an issue that Gregor mentioned very briefly which I think is very important, and I want to include it in the discussion–and to the rest of you guys who are … thinking about implementing this–which has to do with the … design and development process itself and the workflow of the researcher who is actually investigating a text and working with it. And one of the things he mentioned … towards the end of the talk was that one of the things that XML in its current form gives us is that we can develop a schema and then distribute and share it, and somebody who has come into this new has some guidance, has some hints, has some … the framework is already in place, right? And one of the things that has sort of always been lurking in the back of my mind is a question that I really don’t know the answer to–which of course I spoke to this morning, but again without any real … any concrete ideas about how this should work–which is: if we do set aside this idea that this whole commitment to the monolithic hierarchy is a necessary thing and instead begin to do things that we understand are necessary in order to produce adequate representations of text, then what are we going to do on the side of actually building the system which allows those points of entry in that progressive, you know … currently the process of development that everyone has to go through? You know: can we have schema languages that deal adequately with multiple hierarchies and arbitrary overlap, which even […]

[32:00]
[Middell] I’m not the expert on schema languages, but the ones that I know, and that validate, let’s say, the texts that we use in the Faust project, are based on context-free grammars and on one single hierarchy that’s validated. And … so to put it another way, when I talked to Andreas Witt about that problem–and he also did a lot of research in this field–he said, “If you want to have a substitute for XML as a technology (as a markup technology), you don’t only have to find an encoding or a certain markup language, [but also] you have to get the validation and the transformation language and the query language right in order to have a full substitute.” So just delivering an encoding that could possibly express multiple hierarchies is not enough, because you don’t get the syntax completion in XML editors, you don’t get the validation part–all those things we are used to by now when we edit texts. My hope, basically–if that idea could be developed a bit further–would be that this merging aspect of texts would be more or less automatic, so that you would have something like a text repository or a database where you can ingest your texts or import them, and they might validate, because they adhere to a single hierarchy; they are proper XML texts. But behind the scenes some collation algorithm then kicks in and says, “Well, this text I have seen already; it’s very similar to that other one.” And then it connects the two and tells you, “By the way, the one text you put into the system has already been marked up to a certain [point] in a different markup system. Take a look at it; compare it.”

[33:36]
[Jannidis] Over here.

[33:38]
[Elke Teich] It’s really a [small one]…. Did you mention which query language you used for querying? And the other part of the question is: what kind of people have queried this resource? Except for you.

[33:54]
[Middell] What kind of query language? So … we tried two approaches. The first … approach that we tried, and that I showed you in this graphic here, is based on a new graph database called Neo4j. And the query model that we used for that graph is traversal, so there’s no inherent query language. There are query languages for Neo4j, but what we basically do is traverse the graph under certain constraints, like: Which edges are we able to traverse? In what order through the documents? etc., etc. And that didn’t perform very well–not because of the querying; the query was actually quite fast, and on normal PC hardware you could traverse up to five million nodes per second, so that went very well. The problem was the update. So what we actually use right now is this model–this range-based model–and there we use a custom query language that operates on this range model; it’s basically a translator from a custom predicate-based first-order-logic query language to SQL.
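
Purely for illustration (the project’s actual predicate language is not shown in the talk), such a translator might rewrite a simple containment predicate into SQL over a range table like the one sketched earlier:

```python
# Hypothetical translation of a containment predicate, e.g.
#   contains(layer, offset)
# into parameterized SQL over a standoff range table (illustrative names).

def translate_contains(layer, offset):
    """Return SQL asking which ranges of one annotation layer
    contain the given character offset."""
    sql = ("SELECT name FROM annotation "
           "WHERE layer = ? AND start <= ? AND ? < stop")
    return sql, (layer, offset, offset)

sql, params = translate_contains("textual", 42)
print(sql)
print(params)
```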

[34:53]

Regarding scalability–I think that’s what your question relates to (“how many persons have queried that?”)–the Faust edition is currently not live, so we don’t have many users–many power users–querying that repository, but the source code that we use to implement that model is currently in use in a different project in the USA. It’s about the collation software Juxta–I don’t know whether you know it. And the upcoming version of Juxta (1.6), I think, will be web-based, so it will be a client-server solution where the whole collation is done on the server side and in parallel. So whether that scales [or not] we’ll see in one to two months, because then this model will basically power the service.

[35:37]
[Michael Sperberg-McQueen] Can you go back to the … the other slide? … Yes, this one. Can you expand, without boring these people out of their minds, a little bit … explain to me a little bit the relation between the colors and the GODDAG structure? I’m not sure I follow.

[35:56]
[Middell] Okay. The color basically expresses one single hierarchy of nodes, or one clear tree structure over the nodes. So the constraint is that one color basically constitutes a tree. So multiple parentage in the GODDAG model means that one node has a parent relationship to two … nodes with different colors. [It] could be the same node or could also be different nodes, but the main point is that the color is different.

[36:25]
[Sperberg-McQueen] Okay, but in that case it sounds more to me as though you have implemented both types of trees in the style of the database community rather than the GODDAG structure, because GODDAG structures … though … I may be wrong. We should talk more later. Thank you.

[36:50]
[Jannidis] More comments, more comments … or questions, especially…. what are we … good? Time? Julia [Flanders]!

[37:24]
[Flanders] Ah, you know, I’m putting my computer down in case there are more comments or questions. Just a few minutes.

[37:28]
[Jannidis] Yeah, but I thought it’s a good…. [Voice becomes indistinct.] Thank you. [Applause.]
