12 DEVELOPING TEXT STANDARDS

Susan Hockey

Centre for Electronic Texts in the Humanities

Rutgers and Princeton Universities

New Brunswick, NJ

Text handling has so far played a relatively minor role in museum documentation and

information handling systems, but as we move towards the age of multimedia and

network-based information systems, the need for standard formats for text is increasingly

apparent. Multimedia includes the linking of objects and descriptions of objects to

primary and secondary textual sources and the same textual source may be linked to

many different objects. Taking the multimedia environment to its logical conclusion, we can envisage a distributed multimedia system where information is held in different

locations and is accessed over the network and used for different purposes, whilst still

being linked together. Some agreed commonality of format for that information, whether

it is text, image or sound, is therefore essential for the usefulness, development and

longevity of the data. This paper discusses the particular nature of text and methods of

handling it, and examines current thinking in text standards. The term 'electronic text' will

be used to mean anything textual in electronic form.

Why text in electronic form?

It is useful to begin with a survey of the uses of electronic texts, or rather the reasons why

textual material is put into electronic form. The most obvious of these is of course the preparation of printed copy by wordprocessing, desktop publishing, text formatting or typesetting. Even today, in most cases, the electronic text is considered a means to an

end, the end being an elegantly printed catalogue or an annotated bibliography and the

like. The copy is usually prepared by a wordprocessing program followed by desktop

publishing, and often much effort is expended in creating a printed page which is

aesthetically pleasing, contains all the right characters and has footnotes and other

annotations exactly where they are wanted. Each wordprocessing program has its own

format of text and its own method of handling footnotes, extra characters and the like.

Text formatters and typesetting programs provide similar facilities by commands

embedded in the text. Each of these again has its own text format. Anyone who has edited

any collection of material is used to the problems of bringing together incompatible

wordprocessor formats, but even if solutions are available for this problem, we need also

to consider how these documents may be kept for future updating or how they may be

merged with material from other sources at some later date.

Electronic text is also created for structured databases and retrieval systems, where the

primary function is to search the texts for instances of specific words or terms. Software

for interactive full-text retrieval is widely used in many applications. Typical programs

include BASIS, BRS/Search, and STATUS. Each of these includes an indexing module

which must be used before any text can be searched. Searches are performed on the index

and so the response to any search is only as good as the index. These programs provide

some choices in building an index (the treatment of stop words, for example), but hyphens, apostrophes,

foreign words, Roman numerals and the like often cause problems since decisions on how

to handle these must be made at the indexing stage. Variant spellings of the same word

are also usually indexed separately, which means that all forms have to be given as a

search term. Further questions are raised by ambiguous punctuation, for example where a

full stop can indicate the end of a sentence, a decimal point or an abbreviation.

These programs tend to respond first by indicating how many documents are hits, but in

some types of data it is not clear what constitutes a document. It could be a complete text,

or one page or a section of a document. There is also often an assumption that every text is

either a bibliography with a recognised structure (author, title etc) or consists of

paragraphs each of which consists of sentences. This is not the case for inscriptions,

papyri, poetry and many other types of information. A more flexible and extensible data

structure is needed.

Text is often also stored in a relational database. Even with variable length fields, a

predefined structure such as this often becomes limiting as more data is acquired. The

sorting and indexing facilities in relational databases are simple and often not adequate to

handle historical or multi-lingual material. Very few of these packages acknowledge a

date before 1900 AD or a non-decimal currency or a name which does not follow Western

European conventions. Designing a relational database requires a good deal of prior

knowledge of the relationships between the items of information. This is often not the case

with text (and with other historical material) when a database is being created for the

purposes of establishing those relationships. It can therefore be argued that both

conventional interactive retrieval systems and structured databases are constricting. The

data has to be made to fit the software when what is needed is software to fit the data.

Hypertext and multimedia do provide a better means of modelling data in that, with good

hypertext software, there are no restrictions in defining the relationships between the

elements in the data. Yet many hypertext systems do not, in my view, pay enough

attention to the textual part. Effort is concentrated on images, sound and video and the

text is seen as subsidiary material. Yet, more often than not, it is the text which provides

the link to other related scholarship which is not included in the particular multimedia

system. Tools for manipulating the text are not well provided and where they are

provided, they tend to concentrate on changing the display of the text (font change, sizes

etc) rather than helping the user understand it.

Markup and the multi-purpose electronic text

To be of any use at all, an electronic text needs markup or encoding embedded within it. A

text without any markup cannot be manipulated effectively by computer software.

Markup is used to specify areas of text which are to be searched by a retrieval system (e.g. all titles, or all the text written by author Smith). It also provides further information about material which is retrieved, for example the identifiers of all the documents which are hits or whether the item retrieved is in a title or part of a subject index. Wordprocessing

and typesetting programs use typographic markup, for example italic to identify titles, or

bold face or a larger size of type to identify headings. However italic may also be used for

other features, for example foreign words in a text or for emphasised words. If the titles

within that same text are to be searched, it becomes impossible to distinguish them from

foreign or emphasised words. The history of encoding and markup shows the

development of many different schemes. For the kinds of text analysis which scholars in

the humanities have been performing for years, e.g. concordances and text retrieval,

schemes were developed to denote the canonical reference structure of scholarly texts. The

MARC record is an encoding scheme for the description of bibliographic and related

information. For formatting and printing, a parallel group of markup schemes was

developed, most notably that used by the typesetting program TEX, as well as Scribe,

TROFF and some typesetter-specific schemes. The internal format used by some

wordprocessing programs is not unlike these markup schemes. The inevitable result of

this plethora of encoding schemes has been time wasted on conversion, lack of adequate

documentation, texts unusable for any purpose other than that for which they were

originally created, and the inability to extend the coding in a text. Creating an electronic

text is not a trivial task. It therefore makes sense to create text which can be interchanged

between different applications and be used for new, and as yet undiscovered purposes,

without the need for constant reformatting and reorganisation of the data.

The incompatibility of markup systems has led to the development of generalised markup

in the form of the Standard Generalised Markup Language (SGML) which now provides a

means of creating such a text. SGML became an international standard in 1986. It is not

itself an encoding scheme. It is a meta-language in which markup codes or tags can be

defined. The principle of SGML is 'descriptive' not 'prescriptive', that is, it provides a

means of describing or marking the components of a text. The processes which are to be

performed on the text are functions of whatever computer program operates on the text.

Typical components of a text may be title, author, paragraph, sentence, document number,

or they may be features such as quotations, lists, names, addresses, dates etc. The

components are marked by encoding tags within the texts and what is called an 'SGML

application' provides the set of tags for one application area.

At one level, SGML considers a text to be composed simply of streams of symbols, which

are known as 'entities'. An entity is any named bit of text and an entity definition

associates a name with a bit of text. One use for entities is to encode characters which are

not on the keyboard. For example the entity reference &beta; could be used for the Greek letter beta, or &alef; for the Hebrew letter alef. Although this may seem a clumsy way of

encoding non-standard characters, it is needed for transmission across all networks and a

computer program can convert from other machine-specific character formats to entity

references. A second use is for expanding abbreviations, for example &TEI; for Text

Encoding Initiative.
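As a sketch of how this works (the declaration syntax is standard SGML; the surrounding sentence is invented for illustration), an entity is declared once, usually in the document type declaration, and can then be referenced wherever it is needed:

<!ENTITY TEI "Text Encoding Initiative">

... the recommendations of the &TEI; cover both prose and verse ...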

At a higher level a text is composed of objects of various kinds, which are known as

'elements'. These identify the various components of a text, which are whatever the

compiler of the text chooses to encode. Each element is marked by a start and end tag. For

example:

... the novel <title>Pride and Prejudice</title> is associated with ...

Here the title of the novel is tagged as a title. Angle brackets delimit the tags with the end

tag beginning with </.

Attributes may be associated with elements to give further information about the element.

For example, for the tag <chapter>,

<chapter n=3> ... text of chapter ... </chapter>

to give the number of the chapter, or for the tag <name>, attributes such as:

type - the type of name
normal - the normalised form
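A minimal sketch of how such a name might be tagged (the attribute values here are purely illustrative, not prescribed by any particular tag set):

<name type="person" normal="SmithJ">Jack Smyth</name>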

This would enable an index of personal names to be made in which Jack Smyth would be

listed under SmithJ. Attributes can also be used extensively for cross-references, which are

resolved into concrete references only when the text is processed. In simple terms this is

how SGML handles hypertextual information.
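A sketch of the usual mechanism (the element and attribute names here are hypothetical): one element carries a unique identifier and another element points to it, and the pointer is turned into a page reference, a highlighted region or a hypertext jump only by the program which processes the text:

<figure id="fig3"> ... description of an object ... </figure>
... the vase shown in <ref target="fig3">figure 3</ref> was found at ...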

The SGML Document Type Definition (DTD) defines the elements which make up a

document, giving relationships between them. This is called the content model. A very

simple example could be a play which has a content model of acts which are composed of

scenes which are in turn composed of speeches. The DTD is used by an SGML 'parser' to

validate a document, by checking that it conforms to the model which has been defined.
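For the play just described, the relevant part of a DTD might look something like the following sketch (the element names are chosen for illustration; the paired '- -' markers state that neither start nor end tags may be omitted, and '+' means 'one or more'):

<!ELEMENT play - - (act+)>
<!ELEMENT act - - (scene+)>
<!ELEMENT scene - - (speech+)>
<!ELEMENT speech - - (#PCDATA)>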

One important characteristic of SGML is that tags which are SGML-conformant can be

used to describe non-SGML material. This means that images or sound or anything else

which is not text can be embedded in an SGML-encoded text or that an SGML-encoded

text can contain pointers to non-textual material stored elsewhere. SGML therefore

provides an extremely flexible shell for multimedia information as well as text alone. The

Perseus project (Mylonas 1992) is one excellent example of how SGML-encoded text can

be built into a multimedia system.

The Text Encoding Initiative

The requirements of the museum community and the work developed by the Text

Encoding Initiative (TEI) project overlap in several areas. The TEI is a major international

project to develop an SGML tag set for use by the humanities and language industries. It is

sponsored by the Association for Computers and the Humanities, the Association for

Computational Linguistics and the Association for Literary and Linguistic Computing.

Following a planning meeting in November 1987, funding was provided by the National

Endowment for the Humanities, the Commission of the European Communities and the

Andrew W. Mellon Foundation. The TEI published the first draft of its guidelines in July

1990 and this has been substantially revised and expanded. The second draft is initially

being published in fascicles on the TEI's electronic discussion list TEI-L@UICVM. DTDs

are also available from the TEI-L fileserver.

For the first draft of the guidelines the TEI set up four Working Committees. The

Committee on Text Documentation, with expertise in librarianship and archive

management, was charged with the problems of labelling a text with in-file encoding of

cataloguing information about the electronic text itself, its source and the relationship

between the two. The Committee on Text Representation addressed issues concerning

character sets, the logical structure of a text and the physical features represented in the

source material. The Committee on Text Analysis and Interpretation devised ways of

incorporating analytic and interpretive information in a text which can include more than

one interpretation of the same component of a text. The Committee on Syntax and

Metalanguage Issues worked on producing recommendations on how SGML should best

be used.

For the second draft of the guidelines, a number of small work groups addressed specific

areas in more detail. These included character sets, text criticism, hypermedia, formulae

and tables, language corpora, physical description, verse, performance texts, literary

prose, linguistic analysis, spoken texts, historical studies, dictionaries, computational

lexica and terminological data.

The TEI proposals give guidance both on what features to encode and how to encode

them. Although the TEI is proposing between 250 and 300 different tags, very few indeed

are absolutely required. The basic philosophy is 'if you want to encode this feature, do it

this way'. Sufficient information is provided for the DTDs to be extended by users. TEI

conformant texts consist of a TEI header, which provides the documentation or labelling,

followed by the text.

The TEI header is believed to be the first systematic attempt to provide in-file

documentation of an electronic text. An outline of the header, showing the four major

sections, follows:

<teiHeader>
<fileDesc> ... </fileDesc>
<encodingDesc> ... </encodingDesc>
<profileDesc> ... </profileDesc>
<revisionDesc> ... </revisionDesc>
</teiHeader>

A key objective in the design of the TEI header was to ensure that some elements could

map directly on to fields in a MARC record so that catalogue records can be created

automatically from the header. A further advantage of using SGML to document an

electronic text is that the same syntax is used for the documentation and for the text itself.

It can be processed by the same software and maintained in the same way.
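As an illustration of the idea (the element names resemble those of the TEI header outlined above; the content is invented), the bibliographic part of a header from which a catalogue record could be generated might look like:

<fileDesc>
<titleStmt><title>Pride and Prejudice: a machine-readable transcription</title></titleStmt>
<sourceDesc><bibl>Prepared from the first edition of 1813</bibl></sourceDesc>
</fileDesc>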

The future of SGML

The future for SGML now looks very promising. It is gaining acceptance in many circles. It

offers a sound intellectual basis for encoding text which can be used for many purposes

and thus provides what Rubinsky (1993) so aptly describes as the 'underground tunnels',

which are needed to build the electronic information age. The development of SGML

software now must catch up with the applications. SGML-aware software is needed at

several stages. There are various software tools to aid conversion from non-SGML

encoded text. An SGML parser validates the tagging to ensure that it is processable. There

is still a need for good SGML-based browsing software. DynaText from Electronic Book

Technologies of Rhode Island is the only true SGML browser/searcher that I am aware of.

PAT from Open Text Corporation of Waterloo searches text which contains SGML-like

tags but does not require them to conform to a DTD. Other efforts to create SGML

browsers are developed from the relational database model used by SQL. The choice of software tools will soon grow. It makes sense now for anyone creating

electronic text to use SGML if they want that text to last.

 
