(Re-) presenting texts
Discussion paper for the ALS workshop session on fieldtexts
October 1999
Nick Thieberger
Representation of oral texts in writing has been one of the goals of the descriptive linguist, in the texts accompanying grammars, and especially in the return of recorded material to the community. Texts are the grounding of our linguistic analysis and a point to which we can refer others, and which others can use to verify our results. In this brief presentation I want to assess the potential for computers to do the work of representing our texts along with a number of ancillary tasks. We need to think of our work as data management and look at tools that deal with data in other disciplines.
What do I really want from a text in a computer?
I want to be able to:
a) link a text file to an audio file in chunks (intonation units for example) so that I can access the audio via a textual index.
b) not have to segment the audiofile into smaller units to match the text chunks.
c) present the data in the text file in various ways to suit the audience. The possibilities for presentation include straight text in the source language only; text marked up to show prosodic features; interlinear glossing; text and free gloss; and any of these options in combination with any other.
d) retrieve information via the text that includes the audio link (a concordance of the text for example so that each word of the text can be heard in context).
e) retrieve information about prosodic features of the text that have been encoded within it.
f) retrieve information about parts of speech in context.
g) link from particular parts of my texts to external objects (pictures or videos).
h) share the data and all of its encoded attributes with colleagues.
i) store the data safely beyond my lifetime.
To achieve these outcomes I need to have a way of addressing my texts and linking them to other objects in a standardised way that will endure and that others will also be using. I also need to be able to address levels of interpretation of the texts: at the most basic level I want a morphological analysis and gloss. I also want to be able to capture the hierarchical nature of a text (Simon 1998:19) to reflect which components of the text are subordinate or superordinate to others (including the inheritance of attributes through a text; for example, the speaker's identity needs to be a feature of all of their speech).
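The inheritance of attributes through a text can be illustrated with a small sketch. It is written in Python purely for illustration, and the element names (TEXT, TURN, S) and the attributes (speaker, title) are hypothetical, not taken from any standard discussed here. A sentence that carries no speaker of its own takes the speaker of the nearest enclosing unit.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: element and attribute names are invented for illustration.
SAMPLE = """
<TEXT id="t1" title="demo">
  <TURN speaker="A">
    <S id="s1">first sentence</S>
    <S id="s2">second sentence</S>
  </TURN>
  <TURN speaker="B">
    <S id="s3" speaker="C">quoted sentence</S>
  </TURN>
</TEXT>
"""

def inherited(element, name, parents):
    """Return the nearest value of attribute `name`, looking at the element
    itself and then at each ancestor in turn; None if nothing defines it."""
    node = element
    while node is not None:
        if name in node.attrib:
            return node.attrib[name]
        node = parents.get(node)
    return None

root = ET.fromstring(SAMPLE)
# ElementTree keeps no parent pointers, so build a child -> parent map once.
parents = {child: parent for parent in root.iter() for child in parent}

# Speaker of every sentence, whether stated locally or inherited.
speakers = {s.get("id"): inherited(s, "speaker", parents) for s in root.iter("S")}
print(speakers)
```

The point of the design is that the speaker is stated once, on the superordinate unit, and a local value overrides it only where needed.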
Computers are notorious for saving time once you have invested huge amounts of time and money into learning how to use them. But, if you are going to transcribe tapes, it makes sense that this one task of transcription should result in a tool that allows you to find all examples of a particular word or morpheme and to hear each in context, rather than just a cassette on your shelf and a time coded transcript. Clearly there must be room for redefining your analysis of the text, so the system must also allow changes to the text.
Earlier uses of computers for textual representation were concerned solely with presentation of the text on paper. The ability to manipulate material on the screen was an advance on reworking Gestetner sheets, but it did not use the power of the computer to generalise from our linguistic knowledge. Soon afterwards we had tools for interlinearising (Transcript, IT), for concordancing (Conc), and then for multimedia (HyperCard on Macintosh computers was an early example).
Peter and Jay Read's 'Long time olden time' (Read & Read 1993) included texts and digitised audio on CD. Thieberger's Australia's languages (1994) includes a story in Warnman, in which each sentence is presented as the corresponding sound is played (based on Randy Vallentine's template Rook). Both of these early examples of multimedia were built with specific software (HyperCard) that only ran on Macintosh computers, and required that the data be incorporated into the end product in a way that does not easily allow its reuse in other media.
To get the Warnman story into HyperCard format required the usual first steps of recording, transcribing and editing the result. The audio cassette was digitised and the text was glossed (using the SIL software IT). This work was then cut and pasted into HyperCard, and the audio resource inserted. Once completed, the product was only available on Macintosh computers. It permitted linking between parts of the story and from items within the story to a grammatical sketch of the language. A mouseclick on a noun presented a page discussing nouns, a click on a verb illustrated verb paradigms, and so on. The links were provided by a lookup table in which each word of the text was located, together with its word-class (this information came from the IT output). When a user clicked on a word, the program consulted the lookup table and then followed the link to the appropriate page. While preferable to constructing each link manually, this method is still lacking in that crucial computer element, extensibility.
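The lookup-table mechanism can be sketched as follows. This is Python rather than HyperCard's own scripting language, and it is not the code of the original stack: the words, word-class labels and page names are all invented placeholders.

```python
# Hypothetical word -> word-class table, standing in for the IT output.
WORD_CLASS = {
    "wordA": "noun",
    "wordB": "verb",
    "wordC": "noun",
}

# Hypothetical word-class -> page of the grammatical sketch.
SKETCH_PAGE = {
    "noun": "Nouns",
    "verb": "Verb paradigms",
}

def page_for(word, default="Index"):
    """Return the sketch page that a click on `word` should open."""
    key = word.strip(".,;:!?\"'")        # ignore punctuation attached to the word
    word_class = WORD_CLASS.get(key)     # None if the word is not in the table
    return SKETCH_PAGE.get(word_class, default)

print(page_for("wordA"))     # a noun
print(page_for("wordB."))    # a verb followed by a full stop
print(page_for("unknown"))   # falls back to the default page
```

In a scheme of this kind the links live in the program's table and not in the text itself, so they cannot travel with the data to other software.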
One of the many buzzwords of the new technologies, extensibility refers to the ability to extend the use of the data and to ensure its use by others in the future regardless of the kind of computer or software they may use.
These HyperCard examples effectively lock the story into a particular format and provide no way to export the linked text and sound. Ideally we need a way of producing stories that will not lock the texts into particular software, and will not require an additional layer of effort on our part to get the texts into publishable form both as a book and as linked audio. We need a way to link our transcript to the digitised sound. These transcripts can then be the basis for our publication of the story.
There is software that emulates a transcribing machine on the computer (e.g. MicroNotePad or Soundwalker). It allows you to play the digitised sound slowly, and to 'rewind' and repeat the last heard sound. However, the key element of linking sound and text is absent from this software (except for a minor ability to set markers in the file).
Some sound-manipulation software includes a labeling function, linking labeled sections of the sound file to a textual marker. Usually this function is not exportable, so the labels only work inside the given software package and the result lacks extensibility.
What we need is an international standard for document encoding that will include reference to other media. Linguists clearly are not the only ones facing this problem, which is really an issue of data management. The World Wide Web is based on encoded text in a fairly standard format, HTML. Despite some years of development, HTML and the software used to interpret it (browsers) have yet to come to a standard way of dealing with digitised sound. The true international standard for textual mark-up is Standard Generalized Markup Language (SGML), and the implementation of it for various text types has been undertaken by the Text Encoding Initiative (TEI). These standards are complex and difficult to work with. For some years now large corporations and government departments have been engaging with SGML and producing and archiving documents. For the under-resourced linguist who wants to work on a language rather than a mark-up language, it is overkill.
The compromise looks like being Extensible Markup Language (XML), a standard that will be (we are told) accepted by all browsers (as it is by Internet Explorer 5), and also by word-processing software like Microsoft Word. The advantage of XML over HTML is that it can describe elements within a document, and that it has been designed specifically to deal with multimedia. HTML just tells the browser how to format the document (it is procedural), but XML (like SGML) declares the structure and elements of a document, leaving formatting to a style sheet. The same document can appear in a browser in a number of different ways, depending on the style sheet. But the structure of the document will not change.
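The separation of structure from presentation can be illustrated by analogy. The sketch below uses Python functions in place of real style sheets, so it shows only the principle and not how a browser applies a style sheet. The GLOSS element and the placeholder text are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# One structured sentence; GLOSS is a hypothetical element holding a free gloss.
DOC = (
    '<S id="s1">'
    "<TRANSCR>source words here</TRANSCR>"
    "<GLOSS>free translation here</GLOSS>"
    "</S>"
)

def source_only(sentence):
    """Presentation 1: straight text in the source language only."""
    return sentence.findtext("TRANSCR")

def text_and_gloss(sentence):
    """Presentation 2: source text with the free gloss indented beneath it."""
    return "{}\n    '{}'".format(sentence.findtext("TRANSCR"),
                                 sentence.findtext("GLOSS"))

# Each "style" is a separate function; the document itself never changes.
STYLES = {"source": source_only, "gloss": text_and_gloss}

def render(xml_text, style):
    """Parse the document and present it in the requested style."""
    return STYLES[style](ET.fromstring(xml_text))

print(render(DOC, "source"))
print(render(DOC, "gloss"))
```

Adding a further presentation means adding one more function to STYLES, with no change to the encoded text.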
XML has another advantage in allowing the user to define tags rather than being restricted to a given set. 'But isn't this a breach of the international standard?' I hear you say, in your relentless quest for consistency in a conference paper. Indeed it would be, but the function of each user-defined tag is set out in a separate accompanying document, called a document type definition (DTD), which is effectively a phrase structure grammar of the XML document.
How can we use XML in our everyday work as linguists? The ideal is software which writes the XML for us. We should continue to do what we do, transcribing audio material for example, and the result should be a text file linked to an audio file. I have been working with SoundIndex, software produced by Michel Jacobson as part of a project with Boyd Michailovsky (LACITO/CNRS) and John B. Lowe (LACITO/CNRS; University of California, Berkeley), which used to be demonstrated at the following URL: http://lacito.vjf.cnrs.fr/Archivag/EXEMPLE/STARTe.htm
The software was available from: http://www.multimania.com/jacobson/SI/Index.htm
With SoundIndex you listen to the digitised audio and watch a soundwave display. By selecting part of the display and then typing and selecting the transcript, you are able to establish an index of links between the transcript and the sound file. This index is then combined with the transcript using a Perl script and the output is an XML encoded file in which each example sentence (or whatever length of text you have selected) appears together with the start and end point of the relevant part of the digitised sound file (expressed in seconds, to fractions of a millisecond):
....<S id="s89"><TRANSCR> Me komam ukano preg sa kir nlaken ipi <EMPH>nafsan</EMPH> iskei nen kin tiawi ke fo tafnau tesa.</TRANSCR><AUDIO start="454.9000" end="465.0200"></AUDIO></S>
<S id="s90"><TRANSCR>Tesa<LIT=tsa> ipreg tenamrun nen kin ikerkerai itakel me tenen kin ipi rait ko tiawi ukano preg kerkerai<LIT=kerkrai> kir.</TRANSCR><AUDIO start="465.0800" end="473.9799"></AUDIO></S>....
This output provides explicit links between the written text and the digitised sound file, while maintaining each in its own format. Formatting is not an issue at this stage of the analysis, and is dealt with by reference to the explicitly stated structure of the document rather than by individual formatting commands. As Simon says: "The lure of the WYSIWYG ('what you see is what you get') word processors for building a linguistic database (like a dictionary) must be avoided at all costs when 'what you see is all you get.'" (1998:23)
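Output of this kind can be read back by any XML-aware tool. The following Python sketch is offered only as an illustration and is not part of SoundIndex. It takes sentence s89 from the sample, wraps it in a hypothetical TEXT root element so that the fragment parses as a complete document, and assumes that the start and end values are seconds (which their size suggests). The frame rate of 44100 is likewise an assumption.

```python
import xml.etree.ElementTree as ET

# Sentence s89 from the sample, wrapped in a hypothetical <TEXT> root.
SAMPLE = (
    '<TEXT><S id="s89"><TRANSCR> Me komam ukano preg sa kir nlaken ipi '
    "<EMPH>nafsan</EMPH> iskei nen kin tiawi ke fo tafnau tesa.</TRANSCR>"
    '<AUDIO start="454.9000" end="465.0200"></AUDIO></S></TEXT>'
)

def read_sentences(xml_text):
    """Yield (id, transcript, start, end) for each <S> element."""
    for s in ET.fromstring(xml_text).iter("S"):
        # itertext() also collects text inside nested tags such as <EMPH>.
        transcript = "".join(s.find("TRANSCR").itertext()).strip()
        audio = s.find("AUDIO")
        yield (s.get("id"), transcript,
               float(audio.get("start")), float(audio.get("end")))

def frame_range(start, end, frames_per_second):
    """Convert offsets (assumed to be seconds) into frame positions within
    the single, unsegmented audio file."""
    return round(start * frames_per_second), round(end * frames_per_second)

sentences = list(read_sentences(SAMPLE))
for sid, transcript, start, end in sentences:
    print(sid, frame_range(start, end, 44100), transcript)
```

Because the offsets point into the one audio file, the audio never has to be cut into smaller files to match the text chunks.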
Do we really need to know all of this? Can't we just keep typing our work in a word processor and listening to our tapes on a cassette player? Clearly not. If we want to archive our tapes securely, or if we want to be able to make the data more readily accessible, both for ourselves as we analyse it, and for others later on, then we need these new tools. We have a responsibility to maintain this data in the best possible form, especially in instances where we are the only custodians of recorded material for a particular language. It should be part of our professional role that we keep up with the most appropriate way to deal with the data we collect.
In fact, the tools discussed here will make our work more efficient. Rather than having separate types of data files (transcript of the tape, interlinearised version of the transcript, concordance of the transcript...), we should look forward to incrementally encoding a text file in which all the conditions listed in (a-i) above will be met.
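As an illustration of deriving one of these views from the encoded text instead of keeping it as a separate file, a concordance that carries the audio link (condition d) might be built as below. This is a Python sketch under stated assumptions: the sentence records are taken to have been extracted from the XML already, the LIT annotations of the sample are left out, and splitting on runs of word characters is a simplification that a real orthography may not allow.

```python
import re
from collections import defaultdict

# (id, transcript, start, end) for the two sample sentences, LIT notes omitted.
SENTENCES = [
    ("s89", "Me komam ukano preg sa kir nlaken ipi nafsan iskei nen kin "
            "tiawi ke fo tafnau tesa.", 454.90, 465.02),
    ("s90", "Tesa ipreg tenamrun nen kin ikerkerai itakel me tenen kin ipi "
            "rait ko tiawi ukano preg kerkerai kir.", 465.08, 473.9799),
]

def build_concordance(sentences):
    """Map each word form to the sentences containing it, keeping the audio
    offsets so that every occurrence can be heard in context."""
    index = defaultdict(list)
    for sid, transcript, start, end in sentences:
        entry = (sid, start, end, transcript)
        for word in re.findall(r"\w+", transcript.lower()):
            if entry not in index[word]:   # list each sentence once per word
                index[word].append(entry)
    return index

concordance = build_concordance(SENTENCES)
for sid, start, end, transcript in concordance["tiawi"]:
    print(sid, start, end, transcript)
```

The concordance is regenerated whenever the transcript is corrected, so it cannot drift out of step with the text in the way a separately maintained file can.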
I strongly believe that we need to work with existing software and send suggestions to its creators to extend its uses. The temptation to create new software applications that will do exactly what we want is to be resisted most strenuously in the Australian scene where there is simply not enough money. There is a great deal involved in making software that works properly under all conditions and we usually do not have the resources to carry it through to completion.
If we use the international standard (XML) then we will be able to piggy-back on the capital investment of large corporations, and our software development can be at the level of interfaces to XML documents.
References
Read, P. & Read, J. 1993. Long time olden time. Sydney: Firmware Publishing.
Simon, G.F. 1998. The nature of linguistic data and the requirements of a computing environment for linguistic research. In Lawler, J. & Dry, H.A. (eds), Using computers in linguistics: a practical guide. London; New York: Routledge. 10-25.
Thieberger, N. 1994. Australia's languages: Australian indigenous languages information stacks. Canberra: AIATSIS.