12 DEVELOPING TEXT STANDARDS
Susan Hockey
Centre for Electronic Texts in the Humanities
Rutgers and Princeton Universities
New Brunswick, NJ
Text handling has so far played a relatively minor role in museum documentation and
information handling systems, but as we move towards the age of multimedia and
network-based information systems, the need for standard formats for text is increasingly
apparent. Multimedia includes the linking of objects and descriptions of objects to
primary and secondary textual sources and the same textual source may be linked to
many different objects. Taking the multimedia environment to its logical extension we can
envisage the distributed multimedia system where information is held in different
locations and is accessed over the network and used for different purposes, whilst still
being linked together. Some agreed commonality of format for that information, whether
it is text, image or sound,
is therefore essential for the usefulness, development and longevity of the data. This paper discusses the particular nature of text and methods of
handling it, and examines current thinking in text standards. The term 'electronic text' will
be used to mean anything textual in electronic form.
Why text in electronic form?
It is useful to begin with a survey of the uses of electronic texts, or rather the reasons why
textual material is put into electronic form. The most obvious of these is of course the preparation of printed copy by wordprocessing, desktop publishing, text formatting or typesetting. Even today, in most cases, the electronic text is considered as a means to an end, the end being an elegantly printed catalogue or an annotated bibliography and the like. The copy is usually prepared by a wordprocessing program followed by desktop publishing, and often much effort is expended in creating a printed page which is aesthetically pleasing, contains all the right characters and has footnotes and other annotations exactly where they are wanted. Each wordprocessing program has its own format of text and its own method of handling footnotes, extra characters and the like. Text formatters and typesetting programs provide similar facilities by commands embedded in the text. Each of these again has its own text format. Anyone who has edited any collection of material is used to the problems of bringing together incompatible wordprocessor formats, but even if solutions are available for this problem, we need also to consider how these documents may be kept for future updating or how they may be
merged with material from other sources at some later date.
Electronic text
is also created for structured databases and retrieval systems, where the primary function is to search the texts for instances of specific words or terms. Software for interactive full-text retrieval is widely used in many applications. Typical programs include BASIS, BRS/Search, and STATUS. Each of these includes an indexing module which must be used before any text can be searched. Searches are performed on the index and so the response to any search is only as good as the index. These programs provide some choices in building an index, e.g. in stop words etc, but hyphens, apostrophes, foreign words, Roman numerals and the like often cause problems since decisions on how to handle these must be made at the indexing stage. Variant spellings of the same word are also usually indexed separately, which means that all forms have to be given as a
search term. Further questions are raised by ambiguous punctuation, for example where a
full stop can indicate the end of a sentence, a decimal point or an abbreviation.
These programs tend to respond first by indicating how many documents are hits, but in
some types of data it is not clear what constitutes a document. It could be a complete text,
or one page or a section of a document. There is also often an assumption that every text is
either a bibliography with a recognised structure (author, title etc) or consists of
paragraphs each of which consists of sentences. This is not the case for inscriptions,
papyri, poetry and many other types of information. A more flexible and extensible data
structure is needed.
Text is often also stored in a relational database. Even with variable length fields, a
predefined structure such as this often becomes limiting
as more data is acquired. The sorting and indexing facilities in relational databases are simple and often not adequate to
handle historical or multi-lingual material. Very few of these packages acknowledge a
date before
1900 AD or a non-decimal currency or a name which does not follow Western European conventions. Designing a relational database requires a good deal of prior
knowledge of the relationships between the items of information. This is often not the case
with text (and with other historical material) when a database
is being created for the purposes of establishing those relationships. It can therefore be argued that both
conventional interactive retrieval systems and structured databases are constricting. The
data has to be made to fit the software when what is needed is software to fit the data.
Hypertext and multimedia do provide a better means of modelling data in that, with good
hypertext software, there are no restrictions in defining the relationships between the
elements in the data. Yet many hypertext systems do not, in my view, pay enough
attention to the textual part. Effort is concentrated on images, sound and video and the
text is seen as subsidiary material. Yet, more often than not, it
is the text which provides the link to other related scholarship which is not included in the particular multimedia system. Tools for manipulating the text are not well provided and where they are
provided, they tend to concentrate on changing the display of the text (font change, sizes
etc) rather than helping the user understand it.
Markup and the multi-purpose electronic text
To be of any use at all, an electronic text needs markup or encoding embedded within it. A
text without any markup cannot be manipulated effectively by computer software.
Markup
is used to specify areas of text which are to be searched by a retrieval system (e.g. all titles, or all the text written by author Smith). It also provides further information about
material which is retrieved, for example the identifiers of all the documents which are hits
or whether the item retrieved is in a title or part of a subject index. Wordprocessing
and typesetting programs use typographic markup, for example italic to identify titles, or
bold face or a larger size of type to identify headings. However italic may also be used for
other features, for example foreign words in a text or for emphasised words. If the titles
within that same text are to be searched, it becomes impossible to distinguish them from
foreign or emphasised words.
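By way of illustration (the fragment below is invented; the tags are of the descriptive kind discussed in the following sections), the three uses of italic can be kept apart by marking what each piece of text is rather than how it should appear:

    The review praised <title>Middlemarch</title>, quoted the phrase
    <foreign>mutatis mutandis</foreign> and insisted that the novel was
    <emph>not</emph> a romance.

A retrieval program can then confine a search for titles to the <title> elements, while a formatting program remains free to print all three in italic.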
The history of encoding and markup shows the development of many different schemes. For the kinds of text analysis which scholars in
the humanities have been performing for years, e.g. concordances and text retrieval,
schemes were developed to denote the canonical reference structure of scholarly texts. The
MARC record is an encoding
scheme for the description of bibliographic and related information. For formatting and printing, a parallel group of markup schemes was
developed, most notably that used by the typesetting program TEX, as well as
Scribe, TROFF and some typesetter-specific schemes. The internal format used by some
wordprocessing programs is not unlike these markup schemes. The inevitable result of
this plethora of encoding schemes
has been time wasted on conversion, lack of adequate documentation, texts unusable for any purpose other than that for which they were
originally created, and the inability to extend the coding in a text. Creating an electronic
text is not a trivial task. It therefore makes sense to create text which can be interchanged
between different applications and
be used for new, and as yet undiscovered purposes, without the need for constant reformatting and reorganisation of the data.
The incompatibility of markup systems has led to the development of generalised markup
in the form of the Standard Generalised Markup Language (SGML) which now provides a means of creating such a text. SGML became an international standard in 1986. It is not itself an encoding scheme. It is a meta-language in which markup codes or tags can be defined. The principle of SGML is 'descriptive' not 'prescriptive', that is, it provides a
means of describing or marking the components of a text. The processes which are to be
performed on the text are functions of whatever computer program operates on the text.
Typical components of a text may be title, author, paragraph, sentence, document number,
or they may be features such as quotations, lists, names, addresses, dates etc. The
components are marked by encoding tags within the texts and what is called an 'SGML
application' provides the set of tags for one application area.
At one level, SGML considers a text to be composed simply of streams of symbols, which
are known as 'entities'. An entity is any named bit of text and an entity definition associates a name with a bit of text. One use for entities is to encode characters which are not on the keyboard. For example the entity reference &beta; could be used for the Greek letter beta, or &alef; for the Hebrew letter alef. Although this may seem a clumsy way of
encoding non-standard characters, it is needed for transmission across all networks and a
computer program can convert from other machine-specific character formats to entity
references. A second use is for expanding abbreviations, for example &TEI; for Text
Encoding Initiative.
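As a small illustrative sketch (the declarations are invented placeholders, not taken from any published entity set), an entity is defined once and then referenced wherever it is needed:

    <!ENTITY TEI  "Text Encoding Initiative">
    <!ENTITY alef "[Hebrew letter alef]">
    ...
    The &TEI; Guidelines allow the letter &alef; to be recorded unambiguously.

When the text is processed, each entity reference is replaced by the declared value, and a local program can substitute the true character on systems able to display it.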
At a higher level a text is composed of objects of various kinds, which are known as
'elements'. These identify the various components of a text, which are whatever the
compiler of the text chooses to encode. Each element is marked by a start and end tag. For
example:
    ... the novel <title>Pride and Prejudice</title> is associated with ...
Here the title of the novel is tagged as a title. Angle brackets delimit the tags with the end tag beginning with </.
Attributes may be associated with elements to give further information about the element.
For example, for the tag <chapter>,
    <chapter n=3> ... text of chapter ... </chapter>
to give the number of the chapter, or for the tag <name>
    type   - type of name
    normal - normalised form
This would enable an index of personal names to be made in which Jack Smyth would be listed
under Smith, J. Attributes can also be used extensively for cross-references, which are resolved into concrete references only when the text is processed. In simple terms this is how SGML handles hypertextual information.
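The following invented fragment (the element and attribute names follow TEI conventions, but the values are made up) shows both uses of attributes, normalisation and cross-referencing:

    <name type=person normal="Smith, J">Jack Smyth</name> ...
    <note id=N42>On the provenance of the amphora ...</note> ...
    as argued in <ref target=N42>note 42</ref>

An indexing program sorts on the value of the normal attribute, while the target attribute is matched against the id of another element only when the text is processed, which is what makes the reference a live link rather than a fixed page number.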
The SGML Document Type Definition
(DTD) defines the elements which make up a document, giving relationships between them. This is called the content model. A very
simple example could be a play which has a content model of acts which are composed of
scenes which are in turn composed of speeches. The DTD is used by an SGML 'parser' to
validate a document, by checking that it conforms to the model which has been defined.
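A DTD for this simple play model might contain element declarations of roughly the following form (a sketch only; the '- -' indicators say that neither the start nor the end tag may be omitted, and attribute declarations are left out):

    <!ELEMENT play   - - (act+)>
    <!ELEMENT act    - - (scene+)>
    <!ELEMENT scene  - - (speech+)>
    <!ELEMENT speech - - (#PCDATA)>

A parser reading a document against these declarations would report, for example, a speech appearing outside a scene as an error.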
One important characteristic of SGML is that tags which are SGML-conformant can be
used to describe non-SGML material. This means that images or sound or anything else
which is not text can be embedded in an SGML-encoded text or that an SGML-encoded
text can contain pointers to non-textual material stored elsewhere. SGML therefore
provides an extremely flexible shell for multimedia information as well as text alone. The
Perseus project (Mylonas 1992) is one excellent example of how SGML-encoded text can
be built into a multimedia system.
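One mechanism which SGML provides for this, sketched here with invented names (it is not intended to represent the Perseus encoding), is to declare the non-SGML material as an external data entity in a declared notation and to point to it from an attribute; the sketch assumes a <figure> element whose entity attribute has been declared to be of type ENTITY:

    <!NOTATION tiff   SYSTEM "Tag Image File Format">
    <!ENTITY   vase01 SYSTEM "images/vase01.tif" NDATA tiff>
    ...
    <figure entity=vase01>
    <figDesc>Red-figure amphora, side A</figDesc>
    </figure>

The image itself is never parsed as SGML; the text simply records, in a system-independent way, that it belongs at this point.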
The Text Encoding Initiative
The requirements of the museum community and the work developed by the Text
Encoding Initiative (TEI) project overlap in several areas. The TEI is a major international
project to develop an SGML tag set for use by the humanities and language industries. It is
sponsored by the Association for Computers and the Humanities, the Association for
Computational Linguistics and the Association for Literary and Linguistic Computing.
Following a planning meeting in November 1987, funding was provided by the National
Endowment for the Humanities, the Commission of the European Communities and the
Andrew W. Mellon Foundation. The TEI published the first draft of its guidelines in July
1990 and this has been substantially revised and expanded. The second draft is initially
being published in fascicles on the TEI's electronic discussion list TEI-L@UICVM.
DTDs are also available from the TEI-L fileserver.
For the first draft of the guidelines the TEI set up four Working Committees. The
Committee on Text Documentation, with expertise in librarianship and archive
management, was charged with the problems of labelling a text with in-file encoding of
cataloguing information about the electronic text itself, its source and the relationship
between the two. The Committee on Text Representation addressed issues concerning
character sets, the logical structure of a text and the physical features represented in the
source material. The Committee on Text Analysis and Interpretation devised ways of
incorporating analytic and interpretive information in a text which can include more than
one interpretation on the same component of a text. The Committee on Syntax and
Metalanguage Issues worked on producing recommendations on how SGML should best
be used.
For the second draft of the guidelines, a number of small work groups addressed specific
areas in more detail. These included character sets, text criticism, hypermedia, formulae
and tables, language corpora, physical description, verse, performance texts, literary
prose, linguistic analysis, spoken texts, historical studies, dictionaries, computational
lexica and terminological data.
The TEI proposals give guidance both on what features to encode and how to encode
them. Although the TEI is proposing between
250 and 300 different tags, very few indeed are absolutely required. The basic philosophy is 'if you want to encode this feature, do it this way'. Sufficient information is provided for the DTDs to be extended by users. TEI
conformant texts consist of a TEI header, which provides the documentation or labelling,
followed by the text.
The TEI header is believed to be the first systematic attempt to provide in-file
documentation of an electronic text. An outline of the header, showing the four major
sections, follows:
    <teiHeader>
      <fileDesc> ... </fileDesc>
      <encodingDesc> ... </encodingDesc>
      <profileDesc> ... </profileDesc>
      <revisionDesc> ... </revisionDesc>
    </teiHeader>
A key objective in the design of the TEI header was to ensure that some elements could
map directly on to fields in a MARC record so that catalogue records can be created
automatically from the header. A further advantage of using SGML to document an
electronic text
is that the same syntax is used for the documentation and for the text itself. It can be processed by the same software and maintained in the same way.
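As an invented fragment (the element names are those of the TEI file description; the content is imaginary), the part of the header which maps most directly onto a catalogue record might read:

    <fileDesc>
      <titleStmt>
        <title>Pride and Prejudice: an electronic transcription</title>
        <author>Austen, Jane</author>
      </titleStmt>
      <publicationStmt>
        <distributor>name of the distributing archive</distributor>
      </publicationStmt>
      <sourceDesc>
        <bibl>details of the printed edition from which the text was transcribed</bibl>
      </sourceDesc>
    </fileDesc>

The title, author and publication details correspond naturally to fields in a MARC record, which is what makes automatic generation of catalogue records feasible.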
The future of SGML
The future for SGML now looks very promising. It is gaining acceptance in many circles. It
offers a sound intellectual basis for encoding text which can be used for many purposes
and thus provides what Rubinsky (1993) so aptly describes as the 'underground tunnels',
which are needed to build the electronic information age. The development of SGML
software now must catch up with the applications. SGML-aware software
is needed at several stages. There are various software tools to aid conversion from non-SGML
encoded text. An SGML parser validates the tagging to ensure that it
is processable. There is still a need for good SGML-based browsing software. Dynatext from Electronic Book
Technologies of Rhode Island is the only true SGML browser/searcher that I am aware of.
PAT from Open Text Corporation of Waterloo searches text which contains SGML-like
tags but does not require them to conform to a DTD. Other efforts to create SGML
browsers are based on SQL and are developed from the relational database model used by
SQL. The choice of software tools will soon grow. It makes sense now for anyone creating
electronic text to use SGML if they want that text to last.