Each Monday and Thursday, an abstract from one of the symposium participants will be posted to facilitate discussion. We welcome your comments!
Featured Abstract: “Comparing representations of and operations on overlap”
Claus Huitfeldt, University of Bergen
Overlapping document structures have been studied by markup theorists for more than twenty years. A large number of solutions has been proposed. Some of the proposals are based on XML, others not. Some are proposals for use of alternate serial forms or data models, and some for stand-off markup. Algorithms for transformations between the different forms have also been proposed. Even so, there are few systematic comparative studies of the various proposals, and there seems to be little consensus on what is the best approach.
The aim of the MLCD Overlap Corpus (MOC) is to make it easier to compare the different
proposals by providing concrete examples of documents marked up according to a variety of proposed solutions. The examples are intended to range from small, constructed documents to full-length, real texts. We believe that the provision of such different parallel representations of the same texts in various formats may serve a number of purposes.
Many of the proposals for markup of overlapping structures are not fully worked out, or not well documented, or known only from scattered examples. Encoding a larger body of different texts according to each of the proposed solutions may help resolving unclarities or shed new light on difficulties about the proposals themselves.
Running or developing software to perform various operations on the same data represented in different forms may also help in finding out which forms are optimal for which operations. Some operations, even though well understood for non-overlapping data, may turn out not to be clearly defined for overlapping data.
Finally, a parallel corpus may serve as reference data for work on translations between the various formats, for testing conversion algorithms, and for developing performance tests for software.