Elke Teich, “Analyzing linguistic variation: From corpus query towards feature discovery”

Theoretical Perspectives III (March 16):

Elke Teich, “Analyzing linguistic variation: From corpus query towards feature discovery” (slides, video)


[Elke Teich] Like very many people, I’m not exactly going to be doing what I said I was going to do. Not very surprising, but anyhow, I’ll start.

I stand here as a linguist. I don’t stand here as a computational linguist or a computer scientist or what have you. I stand here as a linguist. So my main interest is, let’s say, linguistic description of objects I’m interested in. Being here at Brown University is a great honor for various reasons. I mean it’s a great honor to be with you here together and to have these discussions. But for me as a linguist and also as a computational linguist, it’s a particular honor because some great people work here or did work here who have been very influential in the field of linguistics in general, or in computational linguistics and corpus linguistics. One of these people was mentioned by Syd [Bauman] yesterday. That was Eugene Charniak. I would say Eugene Charniak is probably the initiator of statistical natural language processing. Now, we’ve had this approach to natural language processing for maybe the last fifteen, twenty years or so. Before that, everything was rule-based and symbolically oriented and stuff like that. When it happened it came as a great shock to everybody. Maybe a similar shock to Fotis’s [Jannidis] shock: “Ok, how can it be when I count the words, I come up with the same classifications of the text I’ve been thinking about for years?” Anyhow, that’s computational linguistics. These models, they’re not data models. I don’t know what to call things anymore after three days. But they’ve been very influential. And for me as a linguist, as shocked as I was when I first learned that you can do everything with statistics, it’s become a very helpful tool. I’m going to show you some of this in the [following] talk. So that’s that — that’s computational linguistics.


But then there’s another person who worked at Brown University who actually owns a number of patents I found yesterday on the net. This is Henry Kucera. Kucera has also been very influential in the field of computational linguistics but also corpus linguistics, because Kucera and his colleague Nelson Francis, they compiled the so-called “Brown Corpus.” This was the corpus that was compiled here at Brown University in the late 1960s, and a small corpus by today’s standards, just one million tokens. At the time, it was supposed to give a snippet of then-contemporary American English, collecting texts or text extracts from a thousand samples of texts and covering fifteen different registers or text types, if you want. Now, this resource was also quite, not influential, but perhaps very useful. It was very useful to have this resource for computational linguistics. A lot of computational linguistic tools and programs have just grabbed this corpus and trained their programs on this corpus. But it was also important at the same time in corpus linguistics because it was the first resource of its kind [that was] representing something like contemporary American English. Corpus linguists used this corpus and extended it. There’s also a British-English version of it, then there are later, diachronically later, versions of texts from the 1990s. So this became the base for a diachronic corpus of British and American English.


Having said that, I think that there are many threads in my talk. There will be two of them at least, but maybe a third one — I don’t know if I can take it to the end. [The first is] the role of computational linguistic products in our work, such as parsers and taggers and things like that. Yeah, the practice of doing something, approaching a linguistic question with corpus linguistic methods. But before I start showing this work I’m doing currently in my group, let me try to pull things together again a little bit. So, I’m a linguist, I said, so as a linguist I’m a bit different from other philologists. I guess I’m different from Fotis [Jannidis] as a literary studies person. I’m not a historian. And so on and so on. But what is it that we’re all interested in? So at first sight it seems as though there are many differences, and one difference is because, as a literary studies person like Fotis [Jannidis], I’m interested in textual artifacts that have some particular social-cultural value that are valued very much in a culture. Like, for example, this poem. When I’m in the arts or artistry then it’s the same thing; I’m interested in this unique object and I want to say something about it.


As a linguist, it’s different. I don’t care, you know? I take any object that is text. I don’t talk about pictures. But for a linguist, it looks a bit more mundane, and we come back to the food metaphor. As a linguist, I’m interested in things like this. Maybe you can’t see properly, but there are letters in this soup. It is a letter-and-carrot soup. When I first look at the text, I don’t care. Well, that’s not exactly true — I don’t care so much. It’s a text, and maybe I have some hypothesis about the text, but maybe I don’t. So first of all I just consider it as an unordered unit of things. My first urge, the first thing I want to do, is that I want to put some order [into it]. I want to create some order. If I apply it to the soup, I want to do something like this. Actually, this also happens to me with the carpet and I keep staring at the carpet, trying to find some pattern. So this is the first urge. It’s not the ultimate thing I want to do, but it’s something I feel that I need to do.


So, let me try to pull this together again. I think what still unites us no matter [our backgrounds] as literary studies people or historians is that we’re interested in this semiotic artifact. It’s not a physics phenomenon or anything like that. It’s a semiotic artifact; it’s created by humans. All of us are interested somehow in these three aspects of the semiotic artifact. We’re interested in the relations between these three aspects of the semiotic artifact. The difference is, I think, that when you’re a literary studies person or arts person or arts historian — maybe even a library science person — you’re more interested in this part of the triangle [points to the link between “thought” and “referent” — both are vertices on a triangle that also bears the word “symbol” at its other corner]. You’re interested in thought, in the concept, maybe in the social. The thought could also be the social. It’s not necessarily the cognitive. It could also be the social, and something in the world, which is perhaps the referent of that. When you’re a linguist, I think you are more on this side. You’re very much interested in the symbol in the sense of “signifiant.” So you like playing around with that. When you see these letters on the floor, you want to do something to them. You’re very cautious in actually making this link to the thought. But you would never start with a thought [as a linguist]. Rarely. At least, not if you’re the type of linguist that I am, though there are and there will be [other types]. So, we can think about this a little bit. Maybe one more sentence about this. When I come back to Fotis’s [Jannidis] dilemma, when you find something at the level of the symbol — you have counted the words — some unit that may seem quite arbitrary to you, then how do you relate this to the level of thought in this model? I see the dilemma. I also have it.


The question then is, for linguists: How can I get from the unordered to the ordered soup and beyond it? This is what I’m going to be talking about. So I would start from something like this [a slide called “processing pipeline”] or a collection of this. It’s just a common practice now in corpus linguistics; I would apply some tools to automatically or also to manually annotate these texts. Maybe I would even already call that a corpus. So I annotate them and then I get another version of the corpus which has been annotated, which has been enriched with information, and perhaps it also has various formats. We’re going to speak about formats as well in a second. Okay, well — so far, so good. So this part, it’s quite well understood and no one questions that this is how you do it. But then how do you proceed? Well, you’ve got the corpus there. It has been enriched with other types of information. But what do you want to do? What you want to do is an analysis. The annotation isn’t the analysis. You just take it for granted. You don’t care about it. You rely on the part-of-speech taggers, syntactic parsers, lemmatizers, tokenizers, and so on, which have their own models. But then, what do you do? You have to find in this annotated resource some interesting or relevant features for what you’re actually looking for, your analysis goal. You have to find these features, you have to extract instances of these features. Then you have a set of instances, and once you have them, you want to inspect them further. You also want to evaluate them with respect to an initial question you have had. Okay, so this is a basic process, but this part of the process is much harder. It’s much harder and it’s much harder to instantiate. Okay, I’ll make this more concrete with some examples now.
What I’m interested in or have been interested in for the last couple of years is linguistic variation, not in the sense of dialect variation or social variation, strictly speaking, but what we call register variation. That’s variation in the language according to the use [to which] the language is put. Now this could be, let’s say, scientific language, [which] is a certain set of registers in this terminology. Or conversation is some kind of a register, and so on. Now, in register theory, or in what we observe in language, what makes a register at the symbol level, at the level of the code, is that a register is characterized by a cluster of associated features, which are typically lexical and grammatical features, and which have a greater than random tendency to co-occur. So this is how we observe a register. Now, registers are typically relatively stable in time. You can observe that registers diachronically are relatively stable, but when you look at the repertoire of registers in a language, that will change. The situations in which we use the language will change as well. So that’s just as a background. Now the concrete project we’re conducting with my group is that we’re interested in how these registers emerge. How do they come about? What’s happening when new registers, new ways of saying, come about? We investigate the context in which new scientific fields or academic fields/disciplines emerge through the contact of some existing disciplines. We think of this as a contact situation where, I don’t know, for example computer science and biology come into contact and form a new discipline, which we call bioinformatics. Or you have computer science and linguistics, thus you have computational linguistics. We call these contact disciplines.


So, we’re interested [in the question]: what are the linguistic properties of these contact disciplines? When you take bioinformatics, purely linguistically speaking, is it more like computer science, is it more like biology, or is it something completely different? What we need to do is compare the discourses that are produced by these disciplines. What we’re after are the similarities and differences between them, and also asking whether these new emerging disciplines create their own language. We have to be able to say whether they create something that you can call distinctive. Distinctive of the others. Okay, so that’s a concrete context. Now, if you have this question — and here I come again to corpus linguistics — what is the kind of resource you need to be able to investigate this question? Of course, you have to build a corpus that consists of scientific texts. What we’ve done [is that] we built a corpus of just English journal articles, so we don’t consider any other text types, as it were. Just journal articles — no project reports, no patents, or whatever. Just journal articles. We cover nine disciplines, which you can see here. We have two time slices in the corpus. We have the early 2000s and we have the 1970s and 80s, because this is about the time when some of these disciplines emerged. Eventually we want to investigate how linguistically these registers have evolved. Here’s how we do that: what we typically do, then, is that we’re interested in the relationship within triples in this corpus. For instance, the relationship between computational linguistics, linguistics, and computer science, or the relationship between bioinformatics, computer science, and biology, linguistically speaking. I’m not going to talk about the diachronic aspect of this. I’m just talking about the discipline aspect of this.


Now, in terms of encoding, the sources we got the corpus from were .pdf files, and you then can imagine what we had to do to the .pdf files in order to get something out of them that was processable in some sense. We have different versions of the corpus in terms of different formats: plain texts, HTML, and XML versions. One version is encoded in the format of the Corpus Query Processor (CQP), and I’ll come back to that. Now, one type of information we encode for the corpus is bibliographical data. So you can see here the discipline. Also the logical structure of the documents. Then, in terms of the linguistic units, sentences, tokens, and so on, and some linguistic categories that can be annotated automatically. One of the formats that’s very useful to us is this — it’s a very simple format, as you can see here — that’s the format used by this Corpus Query Processor, which was developed at the University of Stuttgart. We basically just have a tab-separated format where we encode the values for the token, then lemma, and the part of speech (PoS). We could add others. It’s a very simple format. Now, which analysis methods are needed to work on this resource? Well, like I said, what we have to do is . . . the whole thing is an exercise of comparison between the different sub-corpora, detecting their similarities and differences. So in that sense it’s very similar to what Fotis [Jannidis] showed us before on his data. So what we want to compare things according to is register on one hand, and eventually also time, which is the diachronic aspect. What we compare things in terms of are lexical-grammatical features. I’ll show you examples of this in a minute. We want to compare lexical-grammatical features, or distributions of lexical-grammatical features, in some context, the register or the time. Now here I’m only showing the register.
So what we work with in doing these comparisons, we calculate the relative similarity or difference in terms of comparing probability distributions, distributions of feature probabilities, across these corpora. That’s always the very same thing you do here. Let me go to some examples. One of the big questions is how do you get interesting and relevant features that will bring out differences between these sub-corpora, if they exist. Because this is what we’re interested in. Now the selection of features is essentially inspired by linguistic theory or linguistic description. Things other people have said before that could be interesting, could be a distinctive feature that’s worth looking at. One area in which you suspect there could be interesting differences — when you look at these disciplines, they’re very different, bioinformatics and mechanical engineering, or linguistics and electrical engineering — is: What are the stances and evaluations that are expressed in these texts? Because it’s not like scientific text doesn’t express stance and evaluation — on the contrary.
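
To make the tab-separated format described above more tangible, here is a minimal Python sketch that reads one token per line with the three columns named in the talk (token, lemma, part of speech). The sample sentence, the Penn-Treebank-style tag names, the sentence markers on their own lines, and the names `Token` and `parse_vertical` are assumptions for illustration; they are not taken from the project's actual files.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    word: str
    lemma: str
    pos: str


# Invented sample: one token per line, columns separated by tabs.
# The <s> ... </s> lines marking a sentence are an assumption about the layout.
SAMPLE = (
    "<s>\n"
    "It\tit\tPRP\n"
    "is\tbe\tVBZ\n"
    "interesting\tinteresting\tJJ\n"
    "that\tthat\tIN\n"
    "registers\tregister\tNNS\n"
    "change\tchange\tVBP\n"
    ".\t.\t.\n"
    "</s>\n"
)


def parse_vertical(text: str) -> List[Token]:
    """Turn tab-separated lines into Token records, skipping markup lines."""
    tokens = []
    for line in text.splitlines():
        # Skip blank lines and structural markup such as <s> or </s>.
        if not line.strip() or line.startswith("<"):
            continue
        word, lemma, pos = line.split("\t")
        tokens.append(Token(word, lemma, pos))
    return tokens


tokens = parse_vertical(SAMPLE)
for tok in tokens:
    print(tok.word, tok.lemma, tok.pos)
```

Adding a further annotation layer would simply mean another tab-separated column and another field in `Token`, which is one reason such a simple format is convenient.
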
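
One way to picture the comparison of feature distributions across sub-corpora is the following Python sketch. The talk does not name the measure that was used, so the Jensen-Shannon divergence here is purely an assumption chosen for illustration, and the counts for the two sub-corpora are invented.

```python
import math
from typing import Dict


def relative_frequencies(counts: Dict[str, int]) -> Dict[str, float]:
    """Convert raw feature counts into a probability distribution."""
    total = sum(counts.values())
    return {feature: n / total for feature, n in counts.items()}


def jensen_shannon(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint."""
    features = set(p) | set(q)
    # Mixture distribution: the average of p and q for every feature.
    m = {f: 0.5 * (p.get(f, 0.0) + q.get(f, 0.0)) for f in features}

    def kl_to_mixture(a: Dict[str, float]) -> float:
        return sum(a[f] * math.log2(a[f] / m[f]) for f in a if a[f] > 0)

    return 0.5 * kl_to_mixture(p) + 0.5 * kl_to_mixture(q)


# Invented counts of three evaluative adverbs in two hypothetical sub-corpora.
biology = relative_frequencies({"interestingly": 60, "obviously": 10, "clearly": 30})
bioinformatics = relative_frequencies({"interestingly": 20, "obviously": 30, "clearly": 50})

print(round(jensen_shannon(biology, bioinformatics), 4))
```

A value near 0 would suggest that the two sub-corpora use the feature in much the same way; larger values point to a difference worth inspecting further.
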


Okay, now this is a big question. It’s a very abstract question. How can we tease it out from the material? So eventually, I need to find some kind of lexical or grammatical feature that expresses the particular values for stance and evaluation. One pattern that colleagues of ours have investigated, that’s basically Susan Hunston at [the University of] Birmingham, they found that there’s a particular pattern that’s quite interesting. “It is [adjective] to” or “it is [adjective] that.” Now, when you look at this pattern, it’s very interesting. When you just look at the grammatical pattern, you’ll hardly find any expression that is not evaluative. It’s like this grammatical pattern very much attracts evaluative adjectives — it’s quite interesting, yeah? So you’ve seen these examples, you’ve had time to look at them, and you can think about them. There are two kinds of stance here: an epistemic stance would [say] something like “obviously” or “it’s obvious that,” whereas an attitudinal stance [would say something] like “it’s interesting that” or “interestingly.” And there are two ways, lexically-grammatically speaking, that you can express this sort of evaluation: that is, to use a modal adverb, or to use this pattern. The question is: are there differences across these registers we have here or not? Hopefully there are some interesting differences. What we need to do [is] extract instances of these patterns. We calculate the distribution of these patterns, and then what we could do [is] we could do a statistical test on whether this is actually not a random kind of distribution that we see, but that it actually means something. Just a quick technical thing: extraction of these patterns is done by regular expression, so the CQP, the Corpus Query Processor, is just a regular-expression kind of query language. I can tell you that linguists need, want, love regular expressions. Don’t give them a query tool that doesn’t do regular expressions. We need them. We won’t be happy without them.
This is what you get back. Of course, you count these things. You don’t have to read this. This is a table which gives you the numbers for the occurrences across corpora and across types of expressions. You can then calculate a statistical test using the table. [And,] to make it more human-readable, [here are] the relative frequencies of these. What you have here in the orange and yellow shades [are] these modal adverbs, and in the blue shades is the pattern. I’ll just pick out one thing for you. What you see here, for example, that’s biology — it uses a lot of the adverbial, the adverb, and among the different types of meaning that are expressed, it uses very much “interestingly.” This is relative to the others. It doesn’t say it actually occurs absolutely speaking very often, but relatively speaking it uses those quite a lot. Now, when you compare this [ratio] to this newer discipline that has kind of emerged from biology in cooperation with computer science, if you want, then you can see a different pattern here. When you test these things statistically, you can see it’s significant. So you can explore this further. It’s actually quite interesting. Now let me give one more example and then I probably need to rush.
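
The extraction step described above can be imitated in a few lines of Python over tagged tokens. This is only a sketch under stated assumptions: the token triples, the Penn-Treebank-style tags, and the function name `find_it_is_adj` are invented for illustration, and the CQP query quoted in the comment is an approximation that has not been run against the project's corpus.

```python
from collections import Counter
from typing import List, Tuple

TaggedToken = Tuple[str, str, str]  # (word, lemma, part of speech)

# Invented sample; tags follow Penn Treebank conventions (an assumption).
TAGGED: List[TaggedToken] = [
    ("It", "it", "PRP"), ("is", "be", "VBZ"), ("interesting", "interesting", "JJ"),
    ("that", "that", "IN"), ("results", "result", "NNS"), ("differ", "differ", "VBP"),
    (".", ".", "."),
    ("It", "it", "PRP"), ("is", "be", "VBZ"), ("important", "important", "JJ"),
    ("to", "to", "TO"), ("note", "note", "VB"), ("this", "this", "DT"),
    (".", ".", "."),
    ("It", "it", "PRP"), ("is", "be", "VBZ"), ("a", "a", "DT"),
    ("test", "test", "NN"), (".", ".", "."),
]


def find_it_is_adj(tokens: List[TaggedToken]) -> List[Tuple[str, str]]:
    """Find 'it' + form of 'be' + adjective + 'to'/'that' in a token sequence.

    A CQP query for the same pattern might look roughly like
    [word="it"%c] [lemma="be"] [pos="JJ.*"] [word="to|that"%c]
    """
    hits = []
    for i in range(len(tokens) - 3):
        first, second, third, fourth = tokens[i:i + 4]
        if (
            first[0].lower() == "it"
            and second[1] == "be"
            and third[2].startswith("JJ")
            and fourth[0].lower() in {"to", "that"}
        ):
            hits.append((third[0].lower(), fourth[0].lower()))
    return hits


hits = find_it_is_adj(TAGGED)
adjective_counts = Counter(adjective for adjective, _ in hits)
print(hits)
print(adjective_counts)
```

Counting the matched adjectives per sub-corpus yields the kind of frequency table on which a statistical test can then be run.
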
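
The talk does not say which statistical test was applied to the frequency table, so the following Python sketch assumes a Pearson chi-square test on a small contingency table, which is a common choice for such counts. The numbers are invented, and the p-value shortcut is valid only for one degree of freedom (a 2×2 table).

```python
import math
from typing import List, Tuple


def chi_square(table: List[List[int]]) -> Tuple[float, int]:
    """Pearson chi-square statistic and degrees of freedom for a count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


def p_value_one_dof(statistic: float) -> float:
    """Upper-tail probability of a chi-square value with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2))


# Invented counts: rows are two disciplines, columns are
# (modal adverb, "it is ADJ to/that" pattern).
counts = [
    [10, 20],
    [20, 10],
]

statistic, dof = chi_square(counts)
p_value = p_value_one_dof(statistic)
print(round(statistic, 3), dof, round(p_value, 4))
```

A small p-value would indicate that the difference between the two rows is unlikely to be a random fluctuation, which is the sense of "significant" in the passage above.
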


It’s the same idea again. It’s using slightly different methods in the end, but it’s the same idea. Now, also, what we could be interested in [while] looking at these discourses which these disciplines create, we could also be interested in how the actors in these communities and the authors of these papers, but more generally the actors, how do they construe themselves? What do they think they do? One indicator you could use here, it’s just one, [is that] you could check for the pattern, “we do.” “What do we do?” “We” plus a verb, any verb. So it’s the same thing: you extract instances of a pattern, you look at the distribution or you make other calculations on frequency distributions. In this case, what we used was automatic text classification using a support vector machine. What you get from this are two things: when you apply this classifier, you can see the instances of verbs that are used, and used with a particular frequency, of course; then you can also rank these. When we compare computer science, computational linguistics, and linguistics, you can see here, yeah? Computer science tends to act very formally. We prove, we show, we obtain, and so on. Computational linguistics these days, that’s the 2000s, it acts very experimentally, if you want: examine, implement, use. Linguistics is a bit of a mixture of things. Linguists tend to propose, suggest, argue a lot, I guess. They can also feel and see. This is not, you know — this is really distinctive. Another thing that’s interesting. . . It’s another product of the classification you can look at, and the product you can look at, which comes about by the automatic classification, is what we call the confusion matrix. When you apply a classifier, you never really get 100% accuracy. So the automatic classifier takes these feature distributions and the individual texts and it makes a prediction about, given these features, which class a particular text belongs to. This is never 100% correct.
You can look at the confusion matrix and see what the error is that happens and where the error happens. When you look at this — and again you’re comparing computer science, computational linguistics, and linguistics — where do most of the misclassifications occur? Now most of the misclassifications occur actually through computational linguistics. So that’s the discipline that is caught, somehow, perhaps, in between computer science and linguistics. So most of these misclassifications occur for computational linguistics and only very few, these are the green numbers, occur between computer science and linguistics. Does that then mean that computational linguistics is actually in between the two? Also, the other thing that you can see is how computational linguistics is very often misclassified as linguistics. More often than it’s misclassified as computer science. Linguistics is more often misclassified as computational linguistics than computer science is misclassified as computational linguistics. Does that mean that computational linguistics is in between but still closer to linguistics? I’m not sure. We have to explore this further. But this is a product that is created from the classification that you could then still explore to find an answer to your question.
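
To show how such a confusion matrix is put together, here is a small Python sketch. The gold labels and predictions are invented for illustration and do not reproduce the results from the slides; the classifier itself (a support vector machine in the talk) is not included, only the bookkeeping on its output.

```python
from collections import Counter
from typing import List, Sequence

LABELS = ["CS", "CL", "LING"]  # computer science, computational linguistics, linguistics

# Invented example: the true discipline of nine texts and a classifier's guesses.
gold = ["CS", "CS", "CS", "CL", "CL", "CL", "LING", "LING", "LING"]
predicted = ["CS", "CS", "CL", "CL", "LING", "CL", "LING", "CL", "LING"]


def confusion_matrix(
    gold: Sequence[str], predicted: Sequence[str], labels: Sequence[str]
) -> List[List[int]]:
    """Rows are the true classes, columns the predicted classes."""
    pairs = Counter(zip(gold, predicted))
    return [[pairs[(g, p)] for p in labels] for g in labels]


def accuracy(matrix: List[List[int]]) -> float:
    """Share of texts on the diagonal, i.e. classified correctly."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


matrix = confusion_matrix(gold, predicted, LABELS)
for label, row in zip(LABELS, matrix):
    print(f"{label:>5}", row)
print("accuracy:", round(accuracy(matrix), 3))
```

The off-diagonal cells are the interesting part for the question raised above: in this invented example every error involves the CL class, either as the true or as the predicted label, and none occurs directly between CS and LING.
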


I think I’ll stop here with the examples. I have more that are interesting. I said that we’ve also marked up the document structure so that you can bring in the document structure as a factor if you want, for these things that are differently distributed according to the sections of the document and so on. But let me try to wrap this up. So this is the picture, yeah? As a linguist, you have a soup and you want to bring some order into it. Maybe more. You have particular processing pipelines to do this, but you have to build them. That’s part of the process. Now what’s the modeling that is involved here? There is a lot of computational modeling, of course, but that’s not a modeling we do. We, as linguists, we use things that were modeled by computer scientists or by computational linguists. These models are influenced by a lot of things: by probability theory, information theory, linguistic theory, formal grammar and so on. Maybe we don’t have to know so much about them, but we should know a little bit about them. We should know at least that we can rely on them. We should be able to trust them in the sense that we should know how to make sure that we find the best part-of-speech tagger for our purposes or the best syntactic parser for our purposes. This is also some knowledge that I think a digital humanist who’s doing work like we’re doing should have. Okay, formal grammar comes into play, you know, like regular expressions, context-free grammars, they come in, for example, at the end of the query language, so we should also know a little bit about this. What can they do for us? At the end of the analysis, in the very end, there’s statistics, there’s probability theory. I’m sorry, we also have to know something about that, I assume. We have to learn. Now, in terms of tools, the pipeline is of course plugging tubes together, so various tubes have been plugged together here.
I guess the creative process in this is not so much, well, it’s also a technical challenge — you have to plug these tubes together. But it’s also [for you] to think about these workflows you create and which tools can create the right workflows for you.


Okay, now requirements on data modeling for linguistics. Maybe it’s not data modeling, but now I don’t know. So I just left it there. Okay, data modeling comes into play in a narrower sense, of course, when you make the corpus. You have to encode the corpus. You have to compile it. You have to think about the formats and you have to think about the types of information you include in the corpus. One thing that’s important here is that, I think to make the corpus addressable, you do have to identify the relevant objects of your study. We talked about what are the objects of our study, how do we perhaps also decompose a particular object of study, what are the units we’re looking at. I think we do need some agreement on what these units are that we’re looking at. Then there’s computational processing. Well, there’s a lot of task-specific models in this workflow. Rather than one overarching model … I wouldn’t know at this moment how I would construe an overarching model for all of this. I don’t know. But there’s different, various models in it. Each of these models, at least for computational processing, they can be tested for adequacy. That’s another aspect of a model, I think. A model has to be testable. Corpus analysis tools: back in the 1990s corpus linguists still thought that they could build the ideal corpus tools. There were a number of projects in Europe that tried to do this. Like the MATE or the NITE project, for those of you who have heard about them. At the beginning of the 2000s, this awareness developed that it’s probably not possible to build this ideal tool. Instead we have task-specific tools, then we build pipelines like this. For that, of course, we have to make sure that the tools are compatible in some way. That’s the trade-off, if you want. Nowadays people in corpus linguistics who are concerned with building tools, this is what they do.
They think about the processing pipelines and how they make tools compatible in the sense of some interchange language or pivot language that we can use between the tools and so on. I actually like this because this also gives recognition to the diversity of approaches and tools you can apply to one object. You may have a favorite part-of-speech tagger, and if they’re all pluggable, you can use it. If you have just one monolithic tool, you have no choice. I actually favor this.
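
The idea of pluggable tools that share an interchange representation can be sketched in Python as follows. Everything here is a toy under stated assumptions: the dictionary-per-token representation stands in for a real pivot format, and `tokenize`, `toy_lemmatizer`, `toy_tagger`, `other_tagger`, and `run_pipeline` are hypothetical names, not existing tools.

```python
from typing import Callable, Dict, List

Token = Dict[str, str]                       # shared representation between tools
Step = Callable[[List[Token]], List[Token]]  # every tool reads and returns tokens


def tokenize(text: str) -> List[Token]:
    """Whitespace tokenizer producing the shared token representation."""
    return [{"word": w} for w in text.split()]


def toy_lemmatizer(tokens: List[Token]) -> List[Token]:
    """Stand-in lemmatizer: lowercases the word form."""
    return [{**t, "lemma": t["word"].lower()} for t in tokens]


def toy_tagger(tokens: List[Token]) -> List[Token]:
    """Stand-in tagger: only tells punctuation apart from everything else."""
    return [{**t, "pos": "X" if t["word"].isalnum() else "PUNCT"} for t in tokens]


def other_tagger(tokens: List[Token]) -> List[Token]:
    """A second stand-in tagger that can replace toy_tagger without other changes."""
    return [{**t, "pos": "PRON" if t["word"].lower() == "we" else "OTHER"} for t in tokens]


def run_pipeline(tokens: List[Token], steps: List[Step]) -> List[Token]:
    """Apply each tool in turn; tools only need to agree on the token format."""
    for step in steps:
        tokens = step(tokens)
    return tokens


result = run_pipeline(tokenize("We prove theorems ."), [toy_lemmatizer, toy_tagger])
swapped = run_pipeline(tokenize("We prove theorems ."), [toy_lemmatizer, other_tagger])
print(result)
print(swapped)
```

Because each step only depends on the shared token format, exchanging one tagger for a preferred one is a change to the list of steps and nothing else, which is the advantage over a monolithic tool described above.
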


Now, let’s go back here. Now anyhow, so whatever the object is [that] you have, I think it might be a soup or it might be a poem or a picture, I do think that one has to think seriously about the smaller bits and pieces that it consists of, and that we can agree to some extent what these are. Because only if we can agree can we also share the properties we want to assign to them. I finish with a quote from the web page describing Eugene Charniak, who says, and you can try to translate this to humanities, “Eugene Charniak is interested in programming computers to understand language so that they’ll be programmed to perform such tasks as answering questions and holding a conversation.” Well, how big is that? It’s quite big, and we have similar big things, I think. And then this goes on, this goes far beyond our current capabilities. Research proceeds by dividing the problems up into manageable sub plans. Ok, that’s it.


[Syd Bauman] Just a quick thought about the last slide, this quote by Professor Charniak, who’s wonderful. You said “you program computers so that they’d be able to perform such tasks as answering questions — how big is that?” I thought, I don’t know, let’s ask Watson. Although Watson doesn’t answer questions, he provides the question for the answer, so I don’t know.

[Fotis Jannidis] You said before that when I expressed that I’d been astonished by this relationship between gender and specifics…I was astonished years ago, but now I’m not. Let me talk about what you’re looking at. Actually, the things you’re looking at are . . . the register, the definition of register, implies statistical features. So you’re correlating informational statistics. I just wanted to point out…So, that may seem —

[Teich] Thank you. You’re absolutely right, thank you.

[Maximilian Schich] I have a question about this: you could consider a corpus as being one concept in the data model, in a sense. And you measure the heterogeneity between the corpora. But some people say that the heterogeneity in subgroups is always larger than the heterogeneity between subgroups. So that means that within computational linguistics, the heterogeneity of the researchers who are in that pool would be larger than the heterogeneity between the corpora. So the thing is, and I think it’s a general problem we have with data models, is that we define certain things which are supposedly discrete or independent, but then we find out properties between them we haven’t measured. But we’re not really sure if that in there the same things have happened. So a classic example is Iraq versus Turkey: you can measure the difference or the similarity, but then you don’t talk about Kurds, which are obviously a big part of the similarity, for example. Is there a way to…you could do the same analysis on […], for example.

[Teich] Yes, you could do this […]

[Michael Sperberg-McQueen] The slide with the bar graphs showing biology and bioinformatics made me wonder: can you go back to it? Bioinformatics had such an extreme difference from the others that I immediately began wondering, how large is the population of speakers of these registers or writers of these registers? Is it possible that it is just as with linguists writing in Massachusetts, who tend to sound alike because they all sound like one guy? I’m wondering if it’s possible for influential writers to affect the sound of an interdisciplinary field more easily than a large field. And now I see that I remembered it exactly wrong — it’s biology that’s the extreme outlier. But I’m still interested in the relative size of the populations. Do you have any sense of them?

[Teich] No, I don’t. I’m sorry. It’s a good question.

[Julia Flanders] I see this and I think, can I give you DHQ? Because I think it’s an interesting corpus exactly on a boundary layer, but I don’t know about the two other corpora after that.

[Jean Bauer] Yeah I was wondering about doing this for the digital humanities but then I was wondering how many disciplines actually go into digital humanities. From what I understand, from what you’re studying, it’s more or less two fields overlapping or coming together to create a new field. Whereas with the digital humanities, we have…it depends on how you count. I’m wondering how that would play into your model. If you have historians, literary scholars, computer scientists, computational linguists, and a couple of other groups and fields, whether you see as much differentiation.  Can you see a basic divide into two nodes or do you think it would actually mess with the model?

[Teich] You could have more. I don’t see a problem with having more than three, or one or two. I’d have to think about it in terms of how you design —

[Julia Flanders] We’d need to give the journal a few more years to accumulate. We have time for one more question.

[Syd Bauman] It’s very interesting — it’s worth looking at the “model” of selection here. Both Max’s [Schich] point and Michael’s [Sperberg-McQueen] point drive at that: it may well be that if you had chosen three other biology journals instead of the three you could get your hands on because they have been digitized or they are freely available or whatever, you’d find different results. It also might be the case that you find exactly the same results, which also seems interesting. How much is the selection model actually influencing our thoughts and how we interpret the outcome? I think DHQ and other journals or sources of data that are less uniform in their approach, if they gave us the same results, would give us a kind of taste of how much that affects [the results].

[Julia Flanders] Well thank you very, very much. That was fascinating.

