Jan Christoph Meister, “Tagging in the cloud. A data model for collaborative markup”

Theoretical Perspectives II (March 15):

Jan Christoph Meister, “Tagging in the cloud. A data model for collaborative markup” (video)

[00:00]

[Jan Christoph Meister] I must confess that I don't really know whether I really belong in this slot – "Theoretical Perspectives II" – because I think quite a lot of the ideas I want to share with you will be very practical observations, or practical conclusions drawn from my involvement in text analysis and in using and developing CATMA, which is short for Computer Aided Textual Markup and Analysis, and its successors. But I'll try to emphasize the theoretical aspect of this. CATMA prides itself on being an integrated system, and that means that – [tech issues, laughter] – it's an integrated tool in that it's designed to support the full cycle of the workflow from markup to analysis. Perhaps the most important design principle and design goal in the development of CATMA was to design something that's as intuitive as possible. I use it in my work in class with students who have absolutely no idea of the digital humanities: what it is, what you need it for, and so forth. All you need to be able to do is click a mouse, basically. CATMA tries to achieve this goal by offering the user a very easily usable environment. You can see it here [on the slide]. What the user knows is that she can underline something, can highlight something, can then define a tag on the fly, or can use a predefined tag and assign it to a string. CATMA allows, among other things, for overlapping markup, for ambiguous markup, for dissent in markup, and so forth. I'll say more about how this works. What you see here is basically our main screen, in which users do their tagging. On the right, you see a markup tree, a tag-set tree. Tag sets can be imported and exported. As I said, they can be expanded as we work with the document. You can enable or disable elements by ticking the boxes, so it's highly customizable.

[03:26]

Where does it come from? Well, it started out as a project to basically re-implement and reengineer TACT, and those of you who’ve been treated to TACT know why we love it, why we hate it. TACT has been extremely useful, and I think it’s still one of the programs I love most. Unfortunately, I cannot run it any longer on this new machine. I haven’t got an emulator. But my students don’t mind, rightfully so, I think.

[04:05]

This gives you an idea of the project context. So we started out by looking at TACT again, and then moved to the new implementation, CATMA. That's what you've just seen and what the current tool is based on. We're currently busy with the third step, called CLÉA, the "Collaborative Literature Éxploration and Annotation." And if you should ask why there's an accent on the "E" in exploration: for no other reason than to confuse the recipients of our grant proposal, which was Google. We wanted to demonstrate to them that there are indeed characters that are a bit more difficult than straightforward ones. They appreciated that, and gave us money to do this. CLÉA is meant to be fully browser-based and Java-based. It can run in a client-server environment, but it can also run on your own machine. It's backed up by a repository function which we're currently developing. It's supposed to interface directly with Google Books so that you can import anything from Google Books, provided they actually give us access to their material. To be honest, I don't think it's going to happen. Against the background of current legal developments, they're extremely careful about making promises. Be this as it may, that's not really essential for any repository we want to use. More important, perhaps, is a prototype machine-learning module, which we're also developing, because we want to actually export the markup which users generate, and we have the hope that, over time, using machine-learning algorithms, we will be able to identify some of the lower-level, repetitive tasks and create something we call the "heuristic machine" that will make suggestions. The system will give you, or will possibly produce, feedback in the form of "Hey, I've discovered something. You might want to tag this as X, Y, Z. Or you might want to check whether I'm right or not." We don't have the dream of automating higher-level markup, by any means. I don't think that's ever going to happen.

[06:58]

The next step, as of next year, will be […] CLÉA; we've just attracted funding for that. It will have visualization functionality and a more robust machine-learning-driven heuristic preprocessing engine, and then… let's see how far we get with that.

[07:21]

Let's take a step back from this project history and look at more conceptual issues. When we started out developing CATMA against the background of TACT, we tried to understand why we're actually doing this. That's the "what?" and "why?" of markup. Following on from the various contributions that have been made in the field to distinguish types of markup, this is a variant on them. I think that, in terms of how CATMA and CLÉA are developed, we're oscillating somewhere between the performative function and the discursive function of markup. On the performative side, I'd place what we know as procedural markup, which instructs the machine or the human being how to process the text in a basic way. On the other end of the spectrum, you have discursive markup, which enables readers to interpret the text – and there, of course, you have semantics. Somewhere in between you have declarative markup, which is sitting on the fence. And that's really what we have encountered in our normal practice as literary scholars. So we're oscillating between these two poles.

[08:47]

If you approach markup from the point of view of somebody who's interested mainly in hermeneutics, as I happen to be, you arrive at the must-haves of discursive markup. These are, I believe, the following. One, it must facilitate collaborative and non-deterministic annotation: it must allow for multiple markup, for overlap, and for concurrent tagging. Two, it must conceptualize markup as dynamic and recursive: it must allow for extensibility, for multiple – and even contradictory – markup, and for seamless integration of markup and analysis [that supports the hermeneutic loop] back and forth.

[09:36]

Which brings us to an attempt to develop a taxonomy of markup types and of the data models underlying what we're doing. My starting point here is a remark by Allen [Renear], among others, who wrote this wonderful sentence: "There's no such thing as 'no markup.'" (Coombs, Renear, and DeRose, "Markup Systems and the Future of Scholarly Text Processing," CACM, November 1987). So it's not as if we start with something that doesn't have markup; we have some type of markup that's implicit in the text, which we might call opaque. An opaque data model is there, but no one has actually realized that it's there. Then we have inline and deterministic markup, which is normally linear. Here [on the slide] you see my very quick attempt to mark up the beginning of the sentence – it's the first word in the sentence. Now, how do we distinguish the first word? Normally the sentence before it is clearly cut off with a full stop and, for anything in the English language, with the wonderful device of capitalization. If that doesn't work – as with proper names – you need word classes. So here's another attempt [in which] you have nested inline deterministic markup. You have the word class – adverb, in this case – and you then have the sentence start. Linearity becomes sequential, and slightly more complicated. The next step would really be to venture into standoff and descriptive markup, which takes a relational shape where you have three features that all point to the same string: the first five characters of this sentence. And then comes the big leap to discursive markup, where suddenly you have things such as overlap. If you look at the standoff notation at the beginning, the brackets cover different ranges – one goes from 1 to 5, one from 1 to 8, and one from 1 to 38 – so different markup is applied to various string sections, but the first five characters fall within all of them. This becomes discursive because it is, in a sense, modeling a discourse on what we have just read in this sentence. We have a different focus underlying each and every one of these annotations.
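
The progression just described ends with standoff, discursive markup: annotations live outside the text and reference it by character offsets, so they can overlap freely. The following is a minimal sketch of that idea in Python; the tag names and ranges are illustrative, not CATMA's actual format.

```python
# Standoff annotations: each one is (start_offset, end_offset, tag),
# stored outside the text and free to overlap.

text = "Quickly the storm gathered over the harbour."

annotations = [
    (0, 7,  "word_class:adverb"),   # "Quickly"
    (0, 7,  "sentence_start"),      # same range, a different feature
    (0, 44, "sentence"),            # overlaps both of the above
]

def covering(annotations, start, end):
    """Return every annotation whose range contains [start, end)."""
    return [a for a in annotations if a[0] <= start and a[1] >= end]

# All three annotations apply to the first word, as in the slide's example:
for s, e, tag in covering(annotations, 0, 7):
    print(tag, "->", text[s:e])
```

The point of the sketch is that nothing in this representation forbids two ranges from overlapping or from attaching contradictory features to the same string, which is exactly what inline markup struggles with.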

[12:15]

How do we implement all of this in CATMA? Well, we use a text-range-based model, where a tag references a text range with offsets, and we use external standoff markup, where the markup is stored separately. That way we allow overlap and facilitate tagging by multiple users, and the markup itself – meaning the aggregate – can be exchanged, reapplied, exploited, and so on. I could call up a text which you have marked up, but I don't necessarily have to use all of your markup; I can use just a subset of tags, or what have you. Our main device is feature structures, which are TEI-based in a way – though to really re-import this into a stringently defined TEI environment is something that we haven't tackled yet. [It is] something that we have left out for the time being.
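
The exchange-and-subset behavior just described can be sketched as follows: each user's markup is a separate external collection of tag references, and a reader aggregates any subset of users and tags. These structures and names are hypothetical, not CATMA's implementation.

```python
# Sketch of the text-range-based standoff model: per-user markup documents
# that reference the same source text by offsets, merged on demand.

from collections import namedtuple

TagRef = namedtuple("TagRef", ["tag", "start", "end", "user"])

markup_by_user = {
    "alice": [TagRef("metaphor", 10, 25, "alice"),
              TagRef("narrator", 0, 40, "alice")],
    "bob":   [TagRef("metaphor", 12, 30, "bob")],  # overlaps alice's range
}

def aggregate(markup_by_user, users, tags=None):
    """Merge markup from selected users, optionally restricted to a tag subset."""
    refs = [r for u in users for r in markup_by_user.get(u, [])]
    if tags is not None:
        refs = [r for r in refs if r.tag in tags]
    return sorted(refs, key=lambda r: (r.start, r.end))

# Combine two users' markup but keep only the "metaphor" tags:
subset = aggregate(markup_by_user, ["alice", "bob"], tags={"metaphor"})
```

Because the source text is never touched, any number of such markup documents can coexist, overlap, or contradict one another.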

[13:32]

Here again is an example of overlapping markup in practice. You see we have marked up Wendell's [Piez] name – his first name and surname – and [his] affiliation. On the right-hand side [of the slide] you see the keynote speaker affiliation tag; that's the one with the longest extension here. But there are two other ones, both name tags [with a] keynote speaker tag. This is what it actually looks like in an XML file and, as you can see, the tag itself has an ID. If you then look at the markup that has been generated, an ID pops up – and there are two other ones, and they refer to the same text range. Don't ask me to translate these back into something that makes sense; all I know is that it does point back to these tags. Or at least it did yesterday.

[14:37]

This is the architecture that we need to make something like this happen collaboratively. It looks very ambitious, but we are almost there. The more interesting question, though, is how we can model what we do when we do this sort of thing collaboratively. Here you have four people who have been marking up the same text, and let us just assume that these four colleagues of ours were collaborating, probably at the same time, on the same document. So this is going to be a very interesting process. The model behind this is what I propose to call an "n metadata sets to 1 object data instance" model. You have a text at the bottom; that is the object data. I am deviating here from how "data" has normally been used in the metadata context, because I believe the source object is the data: object data. On the left-hand side, in that bracket, we have multiple tagging documents, generated in standard standoff fashion. And the bracket on the right symbolizes that within markup there can be contradiction. So it's not necessarily well-structured, completely defined, and so forth. The metadata on the object data that we generate through various functions again assumes a procedural, hermeneutic development. And we have these tag sets in between. The question becomes: does that model suffice to explain collaborative markup? I don't think so. Because the interesting thing in our hermeneutic activity is that it's recursive; it runs on itself. So we need a model that takes into account the fact that we mark up the text, and then we mark up the markup of the text, and then we analyze the markup, and so forth. The model for this, I think, is one that we should look at.

[17:15]

Let me give another example of where this actually occurs, because I believe it happens much earlier than we tend to [think]. Here we have the CATMA query builder, which allows you to define a query on a text or on the markup that you have generated from the text. I start off by defining an object-data query: I want to find a word that ends with "ez." In the second step, I refine this query by adding another constraint. This constraint is one that actually calls up a tag – not the object text but a tag – in this case, the keynote speaker affiliation tag. And we all know what is going to be produced as a result – you already see it – and you hit the box at the bottom here and, of course, we get what we have already seen, namely "Wendell Piez Mulberry Technologies, Inc." What we have just done is cross the divide between an object-data query and a metadata query. So what I have actually done is turn this whole model around a bit and operate on object data and metadata at the same time. This is where I believe recursion already starts. So not necessarily on a higher level – a completed markup, which is then discussed and analyzed and so forth – it actually happens during textual analysis, over time, at a very basic level. That is why we need a dynamic model to take account of this, and hence what I propose we call the "n metadata sets to 1 object instance" model to the power of n, and so forth. Don't take this too seriously. It's a metaphor for the type of model that we need.
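
The two-step query just described – an object-data pattern refined by a metadata constraint – can be sketched like this. The tag name and the standoff structure are illustrative assumptions, not CATMA's query language.

```python
# Object-data query (words ending in "ez") refined by a metadata constraint
# (the match must fall inside a range carrying a given tag).

import re

text = "Wendell Piez Mulberry Technologies, Inc."
# Standoff metadata: tag name -> list of (start, end) ranges it covers.
tags = {"keynote_speaker_affiliation": [(0, 40)]}

def query(text, pattern, within_tag=None):
    hits = [(m.start(), m.end()) for m in re.finditer(pattern, text)]
    if within_tag:  # metadata constraint: hit must lie inside a tagged range
        ranges = tags.get(within_tag, [])
        hits = [h for h in hits if any(s <= h[0] and h[1] <= e for s, e in ranges)]
    return [text[s:e] for s, e in hits]

print(query(text, r"\w+ez\b", within_tag="keynote_speaker_affiliation"))
```

The first filter runs on the object data alone; adding `within_tag` crosses the divide into the metadata, which is the recursion the talk is pointing at.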

[19:34]

This, in essence, is what I get out of a very practical project once I start to reconsider that project in terms of its conceptual architecture. Thanks a lot.

[19:48]

[Applause]

[19:58]
[Fotis Jannidis] May I ask how the system reacts to changes to the text?

[20:06]

[Jan Christoph Meister] OK, yes: a big problem, of course, in standoff markup; the solution that we are now implementing is [clear?] because it [is a user request that we’ve?] received quite often.  We are still not touching the source text as such. We are documenting the changes and versioning in the header. It’s going to be written into the header of the source document but the main text will remain untouched.

[20:40]
[Fotis Jannidis] And you’re building from it from the [screen?] manuscript and then […] you will be able to build a new […] representation […]

[20:45]
[Jan Christoph Meister] Exactly. Because it is, of course, crucial that we can always point back to the initial source document.

[20:54]
[Paul Caton] Your main text stream… what if you have, say, variants? Are those built into the text stream, or are they also something in standoff? Sorry, I should step back with my question: how do you decide what this base text is that you start from?

[21:16]
[Meister] Well, that's decided by the user. The user calls up the document and says, "This is the document I want to tag." I guess the question that would be of real interest to me is what happens when you and I tag different editions of the same text. How do we cross-reference those? Would that be – ?

[21:39]
[Paul Caton] No, no, sorry, I realize now that your users are, for example, getting them from Google and so on.  How do you control, for example, where the start and end is?  A document can be many things, like those introductory texts that have absolutely nothing to do with what one is going to tag.  I am just very interested: what counts as the beginning of a text?

[22:05]
[Meister] We don’t control that; the user controls that.  They decide, “I want to work with this text and, for me, I am going to skip the introduction.”  It’s not for us to control.

[22:22]
[Desmond Schmidt] Could you explain the data model a bit more clearly? I always thought that CATMA was simply the old standoff model, where you take the tags out of the text [and] represent them externally in a single set, with direct offsets into the text. That means you generally don't attempt to edit – you could edit the text, but you'd have to adjust the offsets – so you can have only one set of markup at a time, and the markup has a syntax. That is to say, if there is nesting, it obeys the rules of the syntax. Now is that true or false? Just clarify.

[22:57]
[Meister] CATMA, as it stands, is not yet a truly collaborative system, because it's a standalone desktop application. But you can already import somebody else's markup and amalgamate it with your own markup, plus tagsets at the same time. So it is sort of a workaround for what the next version, CLÉA (Collaborative Literature Éxploration and Annotation), will actually do via the repository function.

[23:26]
[Desmond Schmidt] How does that work? If I have a syntax for a play structure and someone else has a syntax for a metrical structure and they put it on top of mine, then first of all I haven't got a grammar for the total document – I don't have a combined structural and metrical grammar. I don't have a grammar that can handle overlap, because you haven't guaranteed that the elements are well formed in the merged document. I might have a metrical structure which overlaps the end of the line, for example. How do you handle that when you merge two tagsets?

[24:00]
[Meister] The overlapping is not an issue.  I don’t really see–

[24:09]
[Desmond Schmidt] Is there a syntax governing the tags? That is the first question I asked: whether the standard standoff markup was a standard copy by commonality and range –

[24:20]
[Meister] Yes, there is a syntax to the tags; I didn't include that here. This is basically what the tag looks like; this is the tag definition. And this is what the tag referencing is going to look like. So that is basically the syntax. But the nice thing about this is that you can actually nest those tags if you want something more complicated. We have introduced something which we call the tag property, which you can then add; that is actually the device that allows us to nest things. Also, there is a distinction between structural markup and content markup. In the original preprocessing, when you read in a new document for the first time, some of this structural markup is generated automatically. It's not a very powerful function at the moment, but in the next version we'll be able to improve on that. So, for example: line breaks, chapters, etc.
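
The answer above distinguishes tag definitions from tag references and mentions a "tag property" that enables nesting. A hypothetical sketch of how that might look as data – the field names, IDs, and the "parent" property are assumptions for illustration, not CATMA's actual tag syntax:

```python
# Tag definitions vs. tag references, with a property slot expressing nesting.

tag_definitions = {
    "t01": {"name": "speech", "properties": {}},
    "t02": {"name": "irony",  "properties": {"parent": "t01"}},  # nested via property
}

# Tag references point a definition at a text range (standoff).
tag_references = [
    {"tag_id": "t01", "start": 0,  "end": 120},
    {"tag_id": "t02", "start": 40, "end": 60},  # lies inside the speech range
]

def nesting_chain(tag_id):
    """Walk parent properties to show how nesting is expressed."""
    chain = [tag_id]
    while "parent" in tag_definitions[chain[-1]]["properties"]:
        chain.append(tag_definitions[chain[-1]]["properties"]["parent"])
    return chain
```

The design point is that nesting lives in the tag definitions, not in the text, so the syntax never forces the ranges themselves into a single hierarchy.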

[25:33]
[Alexander Czmiel] So, what you are doing, basically, is LMNL.

[25:43]
[Thomas Stäcker] Just another question [on] this issue with Google. The terms of use of Google, as far as I remember, say that access to the full text of Google is only through services of Google. So do you download this text? Do you have a special agreement with Google?

[26:08]
[Meister] We are still negotiating.  The original promise by Google was that they would actually provide an API to tap into their data directly.  Then they retracted on that and said “No, we can’t do it.  But what we will give you is [that] you can define a subset of texts and you will get full access to this.  Plus to the bitmaps.  To everything.  We will have it on our servers.”

[26:39]
[Thomas Stäcker] I have another question.  If the text is still with Google, how can you secure that the text will remain stable in relation to a markup?

[26:52]
[Meister] By stealing it and copying it into our repository.

[Laughter]

[26:59]
[Meister] No, to your point: that is a practical problem that will have to be tackled, but we actually haven't accounted for it just yet. Thanks for pointing this out. Of course, this is an issue which we will have to deal with, but so far we are only encountering it in an experimental phase. As long as we are not going public with this or making it publicly available, we're fine. But of course we will soon hit this wall, and we will probably have to look to other document resources that provide a sustainable solution.

[27:45]
[Maximilian Schich] Isn’t that [the same?] with German copyright as for images you can have, for example, use by […] . You would use mainly within German copyright because it’s research.

[27:58]
[Meister] Yes, and for now the test sample that we will use consists of texts that are outside of copyright. So we really make sure that we are not in the grey zone. But of course you don't know what your users are going to do.

[28:30]
[Wendell Piez] You know, of course, that for some time I have tracked and really admired your work, not just on a technical level but also the rationales that you offer and the theoretical framework – which leads me to say, no, this is not on the wrong track, because I think the theoretical implications are really very clear in terms of where we take markup technologies. The kinds of things that we intend to do with markup technologies – that we have actually intended to do for a very long time, but have not really had the ability to do because of the way the technology evolved. And of course, as Alex [Czmiel] pointed out, there is a very close connection – in fact, once you start with the recursive annotation model, yes, you're talking about LMNL. I am looking for opportunities there, because that synergy may be very productive as it unfolds. So my question to you, in terms of what LMNL offers – just a very simple model of arbitrary annotations and ranges: you can label ranges, but you can also have anonymous ranges and annotations, you can attach arbitrary properties to ranges and annotations, you can annotate annotations, you can get your recursion. Is there anything that you know you need that LMNL doesn't really give you?

[29:50]
[Meister] No, I think that most of it is actually there. We haven't investigated that systematically, I must say; we have just approached this from a very pragmatic point of view. I must also confess that a lot of the developments and extensions that we have added to the system are actually the result of user feedback. I'll give you one example: the one that we are working on now is [that] our users want the capability to comment on their annotations. And they want to do that freestyle. So they want to be able to write into their tag, "I developed this tag after–"

[30:29]
[Wendell Piez] Julia [Flanders] wouldn’t let me do that, though.

[Laughter]

[30:41]
[Syd Bauman] Part of the question is to test whether I understand what you're doing or not. If I understand correctly, this is a really nice instantiation of the aphorism, "One man's data is another man's metadata." There is nothing new and original about being able to search metadata separately from data; any e-mail interface lets you search on both metadata fields and the text of your email at the same time. But if I understand correctly, you are using the same data model – in the abstract, the big data model – to mark them both up simultaneously, and permitting the user to cross over and treat them the same, or treat them differently, too. Can I search for things only in the metadata, or only in the data, and then do extracts and different studies on just that part?

[31:44]
[Meister] Oh sure, sure. You might want to run a query just on the metadata – no problem. But conceptually, I think the interesting thing, at least to me, was to realize that the crossing of this divide happens at a very fundamental level, whereas I always thought that this switch from the basic operation to the meta operation, this going recursive, actually only happens much later – you know, when the finite product is there and I analyze it and criticize it and then go to the meta level. But we are doing that constantly. With such a simple query we are actually doing that already. That's why I think we need a data model that takes account of that.

[32:38]
[Fotis Jannidis] Could you talk about – or maybe you have talked about this – time protection. Could you use this in your model too? And in, say, syntax: obviously people try to annotate the same thing, but they have differences of opinion. How do you approach and visualize this difference?

[33:01]
[Meister] Ok, that is a very good point and I should have mentioned that.  We are thinking of actually extending the tag syntax by putting a timestamp or by using the identifier to point to some file that preserves the timestamp.  So I could call up, let’s say, I could amalgamate your markup and my markup, and your markup might have been produced during three different sessions on three consecutive days. And every tag produced on day one will have a specific timestamp, day two, and day three.  But it will [also] identify you as the IP owner, so to speak.  That is one aspect we’re taking care of in the repository, by the way, because it’s an intellectual property issue when you start exchanging markup; who owns this markup? And my markup will have the same identifiers in terms of ownership, timestamps, and so forth.  Then if you want to run a study on this, for some reason, of course it is there; you have it in the markup.
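
The answer above describes tag instances that carry a timestamp and an owner identifier, so that markup amalgamated from several users and sessions stays attributable. A minimal sketch of that idea; the field names and values are illustrative, not CATMA's actual extended tag syntax.

```python
# Timestamped, ownership-carrying tag instances, as described in the answer.

from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class TagInstance:
    tag: str
    start: int
    end: int
    owner: str            # intellectual-property owner of this piece of markup
    timestamp: datetime   # session in which the tag was produced

# Two contributors tag the same range in different sessions:
session1 = TagInstance("irony", 100, 140, "jcm", datetime(2012, 3, 13, 10, 5))
session2 = TagInstance("irony", 100, 140, "fj",  datetime(2012, 3, 14, 9, 30))

def by_owner(instances, owner):
    """Filter amalgamated markup down to one contributor's tags."""
    return [t for t in instances if t.owner == owner]

mine = by_owner([session1, session2], "jcm")
```

With ownership and time recorded per tag, a later study can separate contributors, order their sessions, or compare their disagreements over the same range.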

[34:09]
[Fotis Jannidis] I would have to infer a kind of schema from this and then compare schemas, which we talked about– (Gestures to audience member)

[34:17]
[Michael Sperberg-McQueen] Just a side question: you are using feature structure markup much more extensively than any other project I know.  I wonder if you have found any use for or need for the feature system declaration.

[34:40]
[Meister] Yes, it actually is there. (changes to “tag definition” slide)

[34:50]
[Michael Sperberg-McQueen] Ok, thank you.

[35:00]
[Fotis Jannidis] Okay, thank you very much.
