Title: | Electronic texts in the Humanities: A coming of age |
Authors: | Hockey, Susan |
Keywords: | Electronic texts Humanities research |
Issue Date: | 1994 |
Publisher: | Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign |
Citation Information: | Hockey, S. (1994) Electronic texts in the humanities: a coming of age. In B. Sutton (ed) Literary texts in an electronic age: Scholarly implications and library services [papers presented at the 1994 Clinic on Library applications of Data Processing, April 10-12, 1994]: 21-34. |
Series Name / Report no.: | Literary texts in an electronic age: Scholarly implications and library services [papers presented at the 1994 Clinic on Library applications of Data Processing, April 10-12, 1994] |
Abstract / Summary: | Electronic texts have been used for research and teaching in the humanities ever since the end of the 1940s. This paper charts the development of various applications in literary computing including concordances, text retrieval, stylistic studies, scholarly editing, and metrical analyses. Many electronic texts now exist as a by-product of these activities. Efforts to use these texts for new applications led to the need for a common encoding scheme, which has now been developed in the form of the Text Encoding Initiative's implementation of the Standard Generalized Markup Language (SGML), and to the need for commonly used procedures for documenting electronic texts, which are just beginning to emerge. The need to separate data from software is now better understood, and the variety of CD-ROM-based text and software packages currently available is posing significant problems of support for libraries as well as delivering only partial solutions to many scholarly requirements. Attention is now turning to research towards more advanced network-based delivery mechanisms. |
URI: | http://hdl.handle.net/2142/392 |
ISBN: | 0878450963 |
ISSN: | 0069-4789 |
Type of Resource: | text |
Genre of Resource: | conference paper |
Publication Status: | published or submitted for publication |
Appears in Collections: | 1994: Literary texts in an electronic age: scholarly implications and library services |
Extracted on 25/11/2008 from: https://www.ideals.uiuc.edu/handle/2142/392
THE ARTICLE:
SUSAN HOCKEY
Center for Electronic Texts in the Humanities
Rutgers and Princeton Universities
New Brunswick, New Jersey
Electronic Texts in the Humanities: A Coming of Age

ABSTRACT
Electronic texts have been used for research and teaching in the
humanities ever since the end of the 1940s. This paper charts the
development of various applications in literary computing including
concordances, text retrieval, stylistic studies, scholarly editing, and
metrical analyses.
Many electronic texts now exist as a by-product of these activities. Efforts to use these texts for new applications led to the need for a common encoding scheme, which has now been developed in the form of the Text Encoding Initiative's implementation of the Standard Generalized Markup Language (SGML), and to the need for commonly used procedures for documenting electronic texts, which are just beginning to emerge. The need to separate data from software is now better understood, and the variety of CD-ROM-based text and software packages currently available is posing significant problems of support for libraries as well as delivering only partial solutions to many scholarly requirements. Attention is now turning to research towards more advanced network-based delivery mechanisms.
INTRODUCTION
It is now forty-five years since Father Roberto Busa started work on the first-ever humanities electronic text project to compile a concordance to the works of St. Thomas Aquinas and related authors (Busa 1974-). Since that time, many other electronic text projects have begun, and a body of knowledge and expertise has gradually evolved. Many lessons have been learned from these activities, and it is now possible to make some realistic projections for the future development of electronic text usage in the humanities. Until recently, almost all work has been done on electronic transcriptions of text rather than on digitized images. The discussion in this paper will concentrate on transcriptions, which are referred to as text, but the implications for images will be noted briefly.
The focus of the paper is on primary source material in the humanities. This can be literary text, which is prose, verse, or drama, or a combination of these. It may also be documentary and take the form of letters, memoranda, charters, transcripts of speeches, papyri, inscriptions, newspapers, and the like. Other texts are studied for linguistic purposes, notably collections of text forming language corpora and early dictionaries. Many humanities texts are complex in nature, and the interpretation of the complex features within them is often the subject of scholarly debate. Some texts contain several natural languages and/or writing systems. Others have variant spellings, critical apparatus with variant readings, marginalia, editorial emendations, and annotations, as well as complex and sometimes parallel canonical referencing schemes. An adequate representation of these features is needed for scholarly analysis.
APPLICATIONS IN LITERARY COMPUTING

The earliest and most obvious application was the production of printed word indexes and concordances, often with associated frequency lists. A word index is a list of words in a text where each word (keyword) is accompanied by a reference indicating the location of the occurrences of that word in the text. In a concordance, each occurrence of each word is also accompanied by some surrounding context, which may be a few words or up to several lines. A word frequency list shows the number of times that each word occurs. Words would normally appear in alphabetical order, but they could also be alphabetized or sorted by their endings, which is useful for the study of morphology or rhyme schemes, or in frequency order where the most common words or the hapax legomena (once-occurring words) can easily be seen. Specialized concordances show words listed by their references (for example, by speaker within a play) or sorted according to the words before or after the keyword, or by the number of letters they contain. It can be seen that the production of concordances was typically a mechanical batch process that could generate vast amounts of printout.
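The mechanics of these two basic tools can be illustrated with a short sketch. The following Python fragment is illustrative only and does not reproduce any of the historical programs mentioned here; it builds a word-frequency list and a simple keyword-in-context concordance from a plain text.

# Illustrative sketch only (not any historical concordance program): a word
# frequency list and a simple keyword-in-context concordance.
import re
from collections import Counter

def tokenize(text):
    """Split a text into lowercase word tokens; punctuation handling is crude."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def frequency_list(tokens):
    """(word, count) pairs, most common first; hapax legomena end up last."""
    return Counter(tokens).most_common()

def concordance(tokens, keyword, context=5):
    """Each occurrence of keyword with `context` words of surrounding text."""
    lines = []
    for i, word in enumerate(tokens):
        if word == keyword:
            left = " ".join(tokens[max(0, i - context):i])
            right = " ".join(tokens[i + 1:i + 1 + context])
            lines.append(f"{left} [{keyword}] {right}")
    return lines

if __name__ == "__main__":
    sample = "To be, or not to be, that is the question."
    tokens = tokenize(sample)
    print(frequency_list(tokens))
    for line in concordance(tokens, "be"):
        print(line)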
Early on, attention was also paid to defining the alphabetical order for sorting words in a variety of languages, for example, transcriptions of Greek and Russian as well as Spanish where ch, ll, and rr are separate letters of the alphabet. Ways of dealing with hyphens, apostrophes, accented characters, editorial emendations, and the like were soon devised, and in most cases, the choice was left to the user. A major strength of two of the most widely used concordance and retrieval programs today, Micro-OCP and TACT, is their flexibility in alphabet definitions. More detail on alphabetization and different types of concordances may be found in Howard-Hill (1979), Hockey (1980), and Sinclair (1991).
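The idea of a user-defined alphabet can be sketched as follows. The sketch is not the actual mechanism of Micro-OCP or TACT; it simply treats the Spanish digraphs as single letters that sort after their plain counterparts.

# Illustrative sketch of a user-defined sorting alphabet: Spanish "ch", "ll",
# and "rr" are treated as single letters sorting after "c", "l", and "r".
SPANISH_ORDER = [
    "a", "b", "c", "ch", "d", "e", "f", "g", "h", "i", "j", "k", "l", "ll",
    "m", "n", "o", "p", "q", "r", "rr", "s", "t", "u", "v", "w", "x", "y", "z",
]
RANK = {letter: i for i, letter in enumerate(SPANISH_ORDER)}

def spanish_key(word):
    """Turn a word into a list of ranks, consuming digraphs before single letters."""
    word = word.lower()
    key, i = [], 0
    while i < len(word):
        if word[i:i + 2] in RANK:          # try a digraph first
            key.append(RANK[word[i:i + 2]])
            i += 2
        else:
            key.append(RANK.get(word[i], len(SPANISH_ORDER)))
            i += 1
    return key

words = ["callar", "cocina", "chico", "coche", "lluvia", "luna"]
print(sorted(words, key=spanish_key))
# "chico" sorts after all plain c- words; "lluvia" after all plain l- words.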
By the mid-1950s, a number of other concordance-based projects had begun. Brandwood's (1956) work on Plato formed the basis of a stylistic study. In France, plans for the Trésor de la Langue Française, a vast collection of literary works since the time of the revolution, began in 1959 to aid the production of the new French dictionary (Quemada 1959). These texts form the basis of the ARTFL (American Research on the Treasury of the French Language) database at the University of Chicago. Other groups or projects of note in the 1960s include Howard-Hill's (1969) Oxford Shakespeare Concordances, word frequency counts of Swedish (Gothenburg) (Allén 1970), Classical Latin texts at Liège (Delatte and Evrard 1961), Medieval Latin in Louvain-la-Neuve (Tombeur 1973), and work on various Italian texts at Pisa under the direction of Antonio Zampolli (1973). At that time, the only means of input was uppercase-only punched cards or, sometimes, paper tape. Burton (1981a, 1981b, 1981c, 1982) describes these projects and others in her history of concordance making from Father Busa until the 1970s, which makes interesting reading.
The interactive text retrieval programs that we use today are a derivative of concordances, since what they actually search is a precompiled index or concordance of the text. Besides their obvious application as a reference tool, concordance and text retrieval programs can be used for a variety of scholarly applications, one of the earliest of which was the study of style and the investigation of disputed authorship. The mechanical study of style pre-dates computers by a long time. Articles by T. C. Mendenhall at the end of the last century describe his investigations into the style of Shakespeare, Bacon, Marlowe, and many other authors, using what seems to have been the first-ever word-counting machine. Mendenhall (1901, 101-2) notes

the excellent and entirely satisfactory manner in which the heavy task of counting was performed by the [two] ladies who undertook it. ... The operation of counting was greatly facilitated by the construction of a simple counting machine by which the registration of a word of any given number of letters was made by touching a button marked with that number.
Mendenhall's findings were not without interest, since he discovered that Shakespeare has more words of four letters than any other length, whereas almost all other authors peak at three. Many other stylistic studies have based their investigations on the usage of common words, or function words. These are independent of content, and authors often use them unconsciously. Synonyms have also been studied, as have collocations or pairs of words occurring close together. The work of Mosteller and Wallace (1964) on the Federalist Papers is generally considered to be a classic authorship study, since the twelve disputed papers were known by external evidence to be either by Hamilton or by Madison and there was also a lot of other material of known authorship (Hamilton or Madison) on the same subject matter. A study of common words showed that Hamilton prefers "while," whereas Madison almost always uses "whilst." Other words favored by one or the other of them included "enough" and "upon."
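The measures involved are simple enough to sketch. The fragment below is illustrative only; it is neither Mendenhall's counting machine nor Mosteller and Wallace's statistical model, and the sample texts are invented. It computes a word-length profile and the rate of a marker word per thousand running words.

# Illustrative sketch: word-length distribution and marker-word rates.
import re
from collections import Counter

def words(text):
    return re.findall(r"[a-z]+", text.lower())

def word_length_profile(text):
    """Count how many words of each letter-length a text contains."""
    return Counter(len(w) for w in words(text))

def rate_per_thousand(text, marker):
    """Occurrences of a marker word per 1,000 running words."""
    tokens = words(text)
    return 1000 * tokens.count(marker) / len(tokens) if tokens else 0.0

# Hypothetical comparison of two candidate authors' samples:
sample_a = "while the treaty holds, while commerce grows"
sample_b = "whilst the union endures, whilst the states agree"
print(word_length_profile(sample_a))
print(rate_per_thousand(sample_a, "while"), rate_per_thousand(sample_b, "whilst"))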
Anthony Kenny's (1978) investigation of the Aristotelian Ethics was based on function words, which he divided into categories, such as particles and prepositions, derived from his reading of printed concordances. He was able to show that the usage of common words in three books that appear in both the Nicomachean and the Eudemian Ethics is closer to the Eudemian Ethics. More recently, John Burrows's (1987) examination of Jane Austen's novels has become something of a landmark study in literary computing. By analyzing their usage of common words, he was able to show gender differences in the characters in the novels and to characterize their idiolects. These and similar studies employ some simple statistical methodologies for which Kenny (1982) is a useful introduction. They also show the need to index every word in the text and to distinguish between homographic forms.
Concordances can also be a valuable tool for the historical lexicographer, and several large textbases were originally compiled for this purpose. The Dictionary of Old English (DOE) in Toronto created the complete Corpus of Old English, which totals some three million words. Lexicographers at the DOE have created complete concordances of all this corpus and select citations from the concordances for the dictionary entries (Healey 1989). The most frequent word in Old English occurs about 15,000 times, and it was just possible for a lexicographer to read all the concordance entries for it. This is obviously not feasible for much larger corpora such as the Trésor de la Langue Française. A notable modern example of what has become known as corpus lexicography is Collins's COBUILD English Dictionary, which was compiled using a twenty-million-word corpus of English (Sinclair 1987).
Other electronic texts have been created for the analysis of meter
and rhyme schemes. In the 1960s, scansion programs existed for Greek
and Latin hexameter verse (Ott 1973). Metrical dictionaries were compiled
for authors as diverse as Hopkins (Dilligan and Bender 1973) and
Euripides (Philippides 1981).
Sound patterns have been studied in Homer (Packard 1974), some German poets (Chisholm 1981), and Dante (Robey 1987).
The traditional scholarly editing process has also led to the creation of some electronic texts. In simple terms, this process has consisted of collating the manuscripts, establishing the textual tradition, compiling an authoritative text, compiling the critical apparatus, and then printing the text. In the 1960s, computer programs to collate manuscripts began to appear, and it was soon realized that collation could not be treated as a completely automatic process and that, because of the lineation, verse was easier to deal with than prose. Robinson's (forthcoming) COLLATE program was developed after a study of earlier systems. It has a graphical user interface and is by far the most flexible collation program.
Many early humanities projects were hampered by design forced upon them by the limitations of hardware and software. Until disk storage became more widely available in the 1970s, texts and associated material were stored on magnetic tape, which could only be accessed sequentially. Disk storage allowed random access, but data were still constrained within the structures of database programs, particularly relational databases where the information is stored as a set of rectangular tables and is viewed as such by the user. Very little humanities-oriented information fits this format without some restructuring, which, more often than not, results in some loss of information.

Hypertext has provided a solution to data modeling for the humanities. It offers flexible data structures and provides a web of interrelated information, which can be annotated by the user if desired. An obvious application in the humanities is the presentation of primary and secondary material together. Images, sound, and video can be incorporated to aid the interpretation of the text. The traditional scholarly edition can be represented very effectively as a hypertext, but hypertext is a more obvious medium for presenting multiple versions of a text without privileging any particular one of them (Bornstein 1993). Other experiments have used hypertext to model the narrative structure of literature with a view to helping students understand it better (Sutherland forthcoming).
ELECTRONIC TEXTS TODAY

Many of the electronic texts that are in existence today were created as a by-product of research projects such as those described above. Large collections of text have been assembled by a few research institutes, mostly in Europe where public money has been provided for the study of language and its relation to the cultural heritage. Most other texts have been compiled by individuals for their own projects. These texts reflect the interests of those research groups or individuals, and it is perhaps questionable as to how many of them can be used for other scholarly purposes. These texts are ASCII files, not files that have been indexed for use by specific programs. Initial estimates show that 90 to 95 percent of texts fall into this category. For a variety of reasons, few of them have been made available for other scholars to use, and these scholars may find that they are not well suited to their purposes.

However, it was soon realized that considerable time and effort is required to create a good electronic text. Many existing texts have been keyboarded, and this is still the normal means of input. Optical character recognition (OCR) of some material became feasible in the early 1980s, but in general, it is only suitable for modern printed material. OCR systems tend to have difficulty with material printed before the end of the last century, newspapers, or anything else where the paper causes the ink to bleed, as well as material containing footnotes and marginalia, nonstandard characters and words in italic, or small capitals. Those systems that are trainable can be more suitable for humanities material, but these require some skill on the part of the operator. Hockey (1986) and the collection of papers assembled by the Netherlands Historical Data Archive (1993) give further information. More importantly, OCR also generates only a typographic representation or markup of the text, whereas experience with using texts has shown that this is inadequate for most kinds of processing and analysis. Most large data entry projects are choosing to have their data keyed, which allows some markup to be inserted at that time.
Recognizing the need to preserve electronic texts, the Oxford Text Archive (OTA) was established in 1976 to "offer scholars long term storage and maintenance of their electronic archives free of charge." It has amassed a large collection of electronic texts in many different formats and is committed to maintaining them on behalf of their depositors. Depending on the conditions determined by their depositors, OTA texts are made available to other individuals for research and teaching purposes at little cost. However, there is no guarantee of accuracy, and users of OTA texts are encouraged to send any updated versions that they may have created back to Oxford. Proud (1989) reports on the findings of a British Library sponsored project to review the Oxford Text Archive.
There have been a few systematic attempts to create or collect and archive texts for general-purpose scholarly use. The most notable one for a specific language is the Thesaurus Linguae Graecae (TLG), which began at Irvine, California, in 1972. It is now nearing completion of a databank of almost seventy million words of Classical Greek (Brunner 1991). The texts are distributed on a CD-ROM that contains plain ASCII files. They are not indexed in any way. In the late 1980s, the Packard Humanities Institute (PHI) compiled a complementary CD-ROM of all Classical Latin, which is about eight million words. The Women Writers' Project at Brown University is building a textbase of women's writing in English from 1330 to 1830, which contains many texts that are not readily accessible elsewhere. Begun in the 1980s, the Dartmouth Dante Project (DDP) is aiming to make available the text of the Divine Comedy and all major commentaries. The texts are stored and indexed using BRS-Search and can be accessed via Telnet to lib.dartmouth.edu (at the prompt, type "connect dante").
A few other collections of text should be noted here. The Istituto di Linguistica Computazionale in Pisa has a large collection of literary and nonliterary works in Italian. Institutes funded by the German government at Bonn and Mannheim have been building text collections for many years. Bar-Ilan University in Israel is the home of the Responsa Project, and the Hebrew Academy in Jerusalem also has a substantial collection. Material in Welsh and other Celtic languages has been built up at Aberystwyth and elsewhere. The International Computer Archive of Modern English at Oslo concentrates on English-language corpora, and groups in various English-speaking countries are compiling corpora of their own usage of English. The British National Corpus is nearing completion of a hundred-million-word corpus of written and spoken English. Many other similar activities exist. The Georgetown University Center for Text and Technology maintains a catalog of projects and institutes that hold electronic texts but not the texts themselves. This catalog can be accessed most easily by Gopher to guvax.georgetown.edu. Lancashire (1991) is the most comprehensive source of information in print about humanities computing projects in general.
The Rutgers Inventory of Machine-Readable Texts in the Humanities is the only attempt to catalog existing electronic texts using standard bibliographic procedures (Hoogcarspel 1994). The Inventory is held on the Research Libraries Information Network (RLIN) and is maintained by the Center for Electronic Texts in the Humanities (CETH). It contains entries for many of the texts in the Oxford Text Archive, plus material from a number of other sources. The Inventory is now being developed by CETH staff who have prepared extensive guidelines for cataloging monographic electronic text files using the Anglo-American Cataloguing Rules, 2d ed. (AACR2), and RLIN.

In the last few years,
more electronic texts have begun to be made available by publishers or software vendors. These are the texts that are more likely now to be found in libraries. They are mostly CD-ROMs and are usually packaged with specific retrieval software. Examples include the Global Jewish Database on CD-ROM, the New Oxford English Dictionary on CD-ROM, the CETEDOC CD-ROM of the Early Christian Fathers, and the WordCruncher disk of American literature. The CD-ROM versions of the English Poetry Full-Text Database and Patrologia Latina published by Chadwyck-Healey also fall into this category, although these texts are also available on magnetic tape for use with other software. Oxford University Press also publishes electronic texts, which are ASCII files. Their texts are particularly well documented, and most can be used with the Micro-OCP concordance program, which they also publish.
Some of these packaged products are relatively easy to use, but prospective purchasers might want to be aware of a number of issues before they launch into acquiring many of them. Almost every one of these products has its own user interface and query language. They are mostly designed for scholarly applications on what are complex texts. Therefore, it can take some time to understand their capabilities and to learn how to use them. If this proliferation of products continues, the cost of supporting them will not be insignificant. Librarians are not normally expected to show patrons how to read books, but they can expect to spend some considerable time in learning how to use these resources and showing them to users. Those that are easy to use may not satisfy many scholarly requirements. For example, on the WordCruncher CD-ROM, which is one of the easiest to use, the texts have been indexed in such a simple way that there is no way to distinguish between I in act and scene numbers (e.g., Act I) and the pronoun I. Several of these products are designed for the individual scholar to use on his or her own machine rather than for access by many people. They provide good facilities for storing search requests for future use, but this is not much help if twenty other people have stored new requests or modified existing ones in between. Another issue is just what words have been indexed and how. A response to any search request is only as good as the words that have been indexed. In some cases, this seems to have been determined by software developers who have little understanding of the nature of the material and the purposes for which it might be used. Other institutions have chosen to acquire texts in ASCII format and provide network access to them, usually with Open Text's PAT system. In this case, the burden of deciding what to index falls on the librarian, who is thus assuming some responsibility for the intellectual content of the material.
CREATING ELECTRONIC TEXTS FOR THE FUTURE

Creating an electronic text is a time-consuming and expensive process, and it therefore makes sense to invest for the future when doing it. Texts that are created specifically for one software program often cannot easily be used with others. The need to separate data from software is now well recognized. Data that are kept in an archival form independent of any hardware and software stand a much better chance of lasting for a long time because they can be moved from one system to another and because they can be used for different purposes and applications.
Experience has shown that an archival text needs markup and documentation for it to be of any use in the future. Markup makes explicit for computer processing things that are implicit to the human reader of a text. Markup is needed to identify the structural components of a text (chapter, stanza, act, scene, title) and enables specific areas or subsets of text to be searched and text that has been retrieved to be identified by references or other locators. It may also be used to encode analytic and interpretive features. Many humanities texts are complex in nature, and many different markup schemes have been created to encode their properties. Ones that have been in common use are COCOA, which is used by Micro-OCP and TACT, the beta code used by the Thesaurus Linguae Graecae, and the three-level referencing system used by WordCruncher. These markup schemes concentrate on the structure of a text, as opposed to schemes such as TeX and troff, which contain formatting instructions.
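The way structural reference markup lets retrieved text be identified by its location can be sketched as follows. The tag style is COCOA-like (a category letter and a value in angle brackets); the category letters and the sample lines are illustrative and are not taken from any real encoded text or program.

# Illustrative sketch: track current structural references (e.g. act, speaker)
# while scanning a text marked up with COCOA-like tags.
import re

TAG = re.compile(r"<([A-Z])\s+([^>]+)>")

def label_words(lines):
    """Yield (references, word) pairs, where references holds the current
    value of each category seen so far."""
    refs = {}
    for line in lines:
        for category, value in TAG.findall(line):
            refs[category] = value            # update the current reference
        for word in TAG.sub(" ", line).split():
            yield dict(refs), word

sample = [
    "<A 1> <S HAMLET>",
    "To be or not to be",
    "<S OPHELIA>",
    "Good my lord",
]
for ref, word in label_words(sample):
    print(ref, word)
# Every word of "Good my lord" is attributed to act 1, speaker OPHELIA.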
Following a planning meeting in 1987, a major international effort
to create guidelines for encoding electronic texts was launched by the
Association for Computers and the Humanities, the Association for
Computational Linguistics, and the Association for Literary and Linguistic
Computing. This project, known as the Text Encoding Initiative (TEI), brought together groups of scholars, librarians, and computer professionals to examine many different types of texts and to compile a comprehensive list of the features within those texts.
The TEI soon determined that the Standard Generalized Markup Language (SGML) was a sound basis for the development of the new encoding scheme. SGML became an international standard in 1986. It is a metalanguage within which encoding schemes can be defined. It is descriptive rather than prescriptive and thus can form the basis of the reusable text. It permits multiple and possibly conflicting views to be encoded within the same text. It is incremental so that new encodings can be added to a text without detriment to what is already there. SGML-encoded texts are also ASCII files, and so their longevity can be assured. The TEI's application of SGML is very wide ranging. It provides base tag sets for prose, verse, drama, dictionaries, transcripts of speech, and terminological data. To these can be added tag sets for textual criticism, transcription of primary sources, language corpora, formulae and tables, graphics, hypermedia, analytical tools, and names and dates. The application has been designed so that other tag sets can be added later. The first definitive version of the TEI guidelines has very recently been published (Sperberg-McQueen and Burnard 1994).
Many existing electronic texts have little or no documentation associated with them. Often, it is difficult to establish what the text is, where it came from, and, in a few cases, even what language it is in. There seem to be two main reasons for this. In some cases, the text was created by an individual who was so familiar with that text that he or she did not find it necessary to record any documentation about it. In other cases, the person who created the text did not have any model to follow for documenting the text and thus recorded only minimal information about it. Where documentation does exist, it is in many different formats, making the task of compiling information about electronic texts extremely difficult.
As part of its recommendations, the TEI has proposed an electronic text file header to meet the needs of librarians who will manage the texts, scholars who will use them, and computer software developers who will write programs to operate on them. The TEI header consists of a set of SGML elements that include bibliographic details of the electronic text and the source from which it was taken, information about the principles that governed the encoding of the text, any classificatory material, and a revision history that records the changes made to the text.
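A rough sketch of the kind of header described is given below, generated by a short Python fragment. The element names are indicative of the TEI scheme and should be checked against the published guidelines (Sperberg-McQueen and Burnard 1994); the bibliographic content shown is invented.

# Illustrative sketch of a TEI-style file header assembled as an SGML string;
# element names should be verified against the TEI Guidelines, and the content
# here is hypothetical.
HEADER_TEMPLATE = """<teiHeader>
  <fileDesc>
    <titleStmt><title>{title} : an electronic transcription</title></titleStmt>
    <publicationStmt><distributor>{distributor}</distributor></publicationStmt>
    <sourceDesc><bibl>{source}</bibl></sourceDesc>
  </fileDesc>
  <encodingDesc>
    <editorialDecl><p>{editorial}</p></editorialDecl>
  </encodingDesc>
  <revisionDesc>
    <change>{change}</change>
  </revisionDesc>
</teiHeader>"""

def make_header(title, distributor, source, editorial, change):
    """Fill the template; a real header would be validated against the TEI DTD."""
    return HEADER_TEMPLATE.format(title=title, distributor=distributor,
                                  source=source, editorial=editorial, change=change)

print(make_header(
    title="Pride and Prejudice",
    distributor="A hypothetical text archive",
    source="Keyed from a nineteenth-century printed edition (details would go here)",
    editorial="Long s normalized; end-of-line hyphenation removed",
    change="1994-04-10: header added",
))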
DIGITAL IMAGING
Many of the lessons learned from the creation and use of electronic texts can also be applied to digital imaging of manuscripts and textual material. The potential of digital imaging for preservation and access is now being exploited in numerous projects. From this point of view, the archival role is obviously very important. Most of the cost in digital imaging is in taking the object to and from the camera, and so it makes sense to digitize at the highest resolution possible. Storing the image in a proprietary format linked to some specific software will lead to all the same problems that have been experienced with text stored in a proprietary indexing program. It will not be possible to guarantee that the image will be accessible in the future or that it can be used for other purposes. Documentation and provenance information are just as important for images. SGML can be used to describe material that is not itself textual. The TEI header would need only a slight modification to be used for images and offers a route to using both text and image together. The TEI's hypertext mechanisms allow pointers from the text to the image and can form the basis of a system that operates on the transcription of the text but displays the image to the user.
ANALYSIS TOOLS

Experience of working with electronic literary texts has highlighted a number of analysis tools and features that have been found to be useful. The most obvious is the need to index every word and not to have a stop list. This is important for many stylistic and linguistic studies that have concentrated on the usage of common words. It also avoids the omission of some homographic forms; for example, the English auxiliary verbs "will" and "might" are also nouns. The punctuation is often important in early printed texts, and some scholars may want to search on that. In other languages, it provides a simple key to the examination of the ends of sentences, for example, clausulae in Classical Latin. Words that are not in the main language of the texts need to be indexed separately to avoid homographs such as "font" in English and French, or "canes" in English and Latin. The ability to search on the ends of words is also useful, particularly for verse and in languages that inflect heavily. A very small number of resources provide an index by endings. For others, this kind of search can take some time as it can only be handled by a sequential search on the word index. A good text will also have structural encoding, and the user may want to have the option of restricting proximity searches to within certain structural boundaries or allowing them to extend beyond a boundary. For example, finding "tree" within ten words of "flower" may not be useful if "tree" is the last word of a chapter and "flower" occurs at the beginning of the next chapter.
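These requirements can be sketched in a few lines of Python. The index format below, in which every word is recorded with its chapter and running position, is illustrative only and is not the indexing scheme of any product mentioned here.

# Illustrative sketch: index every word (no stop list) with its chapter and
# running position, so that ending searches and chapter-bounded proximity
# searches are possible.
from collections import defaultdict

def build_index(chapters):
    """Map each word to a list of (chapter, running_position) occurrences;
    positions run continuously through the whole text."""
    index = defaultdict(list)
    pos = 0
    for chapter, text in chapters:
        for word in text.lower().split():
            index[word].append((chapter, pos))
            pos += 1
    return index

def ends_with(index, suffix):
    """Sequential scan of the word list for a given ending (e.g. for rhymes)."""
    return sorted(w for w in index if w.endswith(suffix))

def near(index, word1, word2, distance=10, same_chapter=True):
    """Pairs of occurrences of word1 and word2 within `distance` running words,
    optionally required to fall within the same chapter."""
    hits = []
    for ch1, p1 in index.get(word1, []):
        for ch2, p2 in index.get(word2, []):
            if abs(p1 - p2) <= distance and (not same_chapter or ch1 == ch2):
                hits.append(((ch1, p1), (ch2, p2)))
    return hits

chapters = [(1, "the old tree stood by the gate"), (2, "a flower grew by the wall")]
index = build_index(chapters)
print(ends_with(index, "er"))         # e.g. "flower"
print(near(index, "tree", "flower"))  # empty: the words fall in different chapters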
There has not been as much progress in the development of tools to analyze text. Essentially, we are still able to search text only by specifying strings of characters, possibly linked by Boolean operators, whereas most users are interested in concepts, themes, and the like. String searches cannot effectively disambiguate homographic forms, for example, "bank" as in money bank as opposed to "bank" of the river or the verb "bank" (used of an airplane), or Latin "canes" as "dogs" or "you will sing."
Computer programs to perform morphological analysis, lemmatization, syntactic analysis, and parsing have been used experimentally for some time, but our understanding of these is still only partial. The most successful parsing programs claim accuracy of about 95 percent. Morphological analysis has been done reasonably well for some languages, for example, Ancient Greek, but there are no widely available general-purpose programs that are suitable for literature. Father Busa recognized the need to lemmatize his concordance to St. Thomas Aquinas in order to make it more useful to scholars, but this was done manually, which is still the only way to ensure accurate data. In Busa (1992), he reflects on the lack of intellectual progress and on how little the computer can still do.

Because of its nature, literature is harder to deal with than many other types of text, and there have been relatively few attempts to apply more sophisticated language analysis algorithms to it. After years of working with rule-based systems, researchers in computational linguistics are turning to the compilation of large-scale lexical resources and knowledge bases for use by natural language understanding systems.
The usual method has been to create an electronic version of a printed dictionary and restructure that within the computer as a lexical database that contains morphological analyses, lemmas, frequent collocations, and other information that would help to disambiguate homographic words. However, printed dictionaries are designed for humans, not for computers, to use. They exist to document the language and thus contain many citations for uncommon usages of words but very few (in proportion to their occurrences) of usual usages. A computer program must look every word up in the dictionary and thus needs more information about common words. This has led to the current interest in language corpora, which are large bodies of text from which information can be derived to augment electronic dictionaries. In many ways, this development represents another coming of age, since the initial methodologies used by computational linguists to analyze large corpora are concordance based and are very similar to those that have been used in literary computing for many years. Once information about word usage has been derived, it can be encoded within the text (using SGML markup) and used to train and refine future programs, which will eventually perform more accurate analyses. We can only hope that this coming of age will lead to better access technologies and to the computer doing more for us.

REFERENCES
Allén, Sture. 1970. Vocabulary Data Processing. In Proceedings of the International
Conference on Nordic and General Linguistics, ed. Hreinn Benediktsson, 235-61.
Reykjavik: Visindafelag Islendiga.
Bornstein, George. 1993. What is the Text of a Poem by Yeats? In Palimpsest: Editorial
Theory in the Humanities, ed. George Bornstein and Ralph G. Williams, 167-93.
Ann Arbor: University of Michigan Press.
Brandwood, Leonard. 1956. Analysing Plato's Style with an Electronic Computer. Bulletin
of the Institute of Classical Studies 3: 45-54.
Brunner, Theodore F. 1991. The Thesaurus Linguae Graecae: Classics and the Computer.
Library Hi Tech 9(1): 61-67.
Burrows, J. F. 1987. Computation into Criticism: A Study of Jane Austen's Novels and
an Experiment in Method. Oxford: Clarendon Press.
Burton, Dolores M. 1981a. Automated Concordances and Word Indexes: The Fifties. Computers and the Humanities 15(1): 1-14.
Burton, Dolores M. 1981b. Automated Concordances and Word Indexes: The Early Sixties and the Early Centers. Computers and the Humanities 15(2): 83-100.
Burton, Dolores M. 1981c. Automated Concordances and Word Indexes: The Process, the Programs, and the Products. Computers and the Humanities 15(3): 139-54.
Burton, Dolores M. 1982. Automated Concordances and Word-Indexes: Machine Decisions and Editorial Revisions. Computers and the Humanities 16(4): 195-218.
Busa, Roberto, S. J. 1974-. Index Thomisticus. Stuttgart-Bad Cannstatt: Frommann-Holzboog.
Busa, Roberto, S. J. 1992. Half a Century of Literary Computing: Towards a 'New' Philology. (Reports from Colloquia at Tübingen). Literary & Linguistic Computing 7(1): 69-73.
Chisholm, David. 1981. Prosodic Approaches to Twentieth-Century Verse. ALLC Journal 2(1): 34-40.
Delatte, L., and E. Evrard. 1961. Un Laboratoire d'Analyse Statistique des Langues Anciennes à l'Université de Liège. L'Antiquité Classique 30: 429-44.
Dilligan, R. J., and T. K. Bender. 1973. The Lapses of Time: A Computer-Assisted Investigation of English Prosody. In The Computer and Literary Studies, ed. A. J.
Aitken, Richard W. Bailey, and N. Hamilton-Smith, 239-52. Edinburgh: Edinburgh
University Press.
Healey, Antoinette diPaolo. 1989. The Corpus of the Dictionary of Old English: Its
Delimitation, Compilation and Application. Paper presented at the Fifth Annual
Conference of the UW Centre for the New Oxford English Dictionary, St. Catherine's College, Oxford, England.
Hockey, Susan. 1980. A Guide to Computer Applications in the Humanities. London:
Duckworth; Baltimore: Johns Hopkins University Press.
Hockey, Susan. 1986. OCR: The Kurzweil Data Entry Machine. Literary & Linguistic Computing 1(2): 63-67.
Hoogcarspel, Annelies. 1994. The Rutgers Inventory of Machine-Readable Texts in the
Humanities. Information Technology and Libraries 13(1): 27-34.
Howard-Hill, T. H. 1969. The Oxford Old-Spelling Shakespeare Concordances. Studies
in Bibliography 22: 143-64.
Howard-Hill, T. H. 1979. Literary Concordances: A Guide to the Preparation of Manual
and Computer Concordances. Oxford: Pergamon Press.
Kenny, Anthony. 1978. The Aristotelian Ethics: A Study of the Relationship between
the Eudemian and Nicomachean Ethics of Aristotle. Oxford: Oxford University Press.
Kenny, Anthony. 1982. The Computation of Style. Oxford: Pergamon Press.
Lancashire, Ian, ed. 1991. The Humanities Computing Yearbook 1989-90: A
Comprehensive Guide to Software and Other Resources. Oxford: Clarendon Press.
Mendenhall, T. C. 1901. A Mechanical Solution of a Literary Problem. Popular Science Monthly 60 (December): 97-105.
Mosteller, Frederick, and David L. Wallace. 1964. Inference and Disputed Authorship:
The Federalist. Reading, Mass.: Addison-Wesley.
Netherlands Historical Data Archive. 1993. Optical Character Recognition in the
Historical Discipline: Proceedings of an International Workgroup. Netherlands
Historical Data Archive, Nijmegen Institute for Cognition and Information.
Ott, Wilhelm. 1973. Metrical Analysis of Latin Hexameter: The Automation of a
Philological Research Project. In Linguistica Matematica e Calcolatori (Atti del
Convegno e della Prima Scuola Internazionale, Pisa 1970), ed. Antonio Zampolli,
379-90. Florence: Leo S. Olschki.
Packard, David W. 1974. Sound-Patterns in Homer. Transactions of the American
Philological Association 104: 239-60.
Philippides, Dia Mary L. 1981. The Iambic Trimeter of Euripides. New York: Arno Press.
Proud, Judith K. 1989. The Oxford Text Archive. British Library Research and
Development Report. London: British Library.
Quemada, Bernard. 1959. La Mécanisation dans les Recherches Lexicologiques. Cahiers
de Lexicologie 1: 7-46.
Robey, David. 1987. Sound and Sense in the Divine Comedy. Literary & Linguistic Computing 2(2): 108-15.
Robinson, Peter M. W. Forthcoming. COLLATE: A Program for Interactive Collation
of Large Manuscript Traditions. In Research in Humanities Computing, vol. 3, ed.
Susan Hockey and Nancy Ide, 32-45. Oxford: Oxford University Press.
Sinclair, John M., ed. 1987. Looking Up: An Account of the COBUILD Project in Lexical Computing. London: Collins.
Sinclair, John M. 1991. Corpus, Concordance, Collocation. Oxford: Oxford University
Press.
Sperberg-McQueen, C. M., and Lou Burnard, eds. 1994. Guidelines for the Encoding
and Interchange of Electronic Texts. Chicago and Oxford: Association for Computers
and the Humanities, Association for Computational Linguistics, Association for
Literary and Linguistic Computing.
Sutherland, Kathryn. Forthcoming. Waiting for Connections: Hypertexts, Multiplots,
and the Engaged Reader. In Research in Humanities Computing, vol. 3, ed. Susan
Hockey and Nancy Ide, 46-58. Oxford: Oxford University Press.
Tombeur, P. 1973. Research Carried Out at the Centre de Traitement Electronique des
Documents of the Catholic University of Louvain. In The Computer and Literary
Studies, ed. A. J. Aitken, Richard W. Bailey, and N. Hamilton-Smith, 335-40.
Edinburgh: Edinburgh University Press.
Zampolli, Antonio. 1973. La Section Linguistique du CNUCE. In Linguistica Matematica
e Calcolatori (Atti del Convegno e della Prima Scuola Internazionale, Pisa 1970),
ed. Antonio Zampolli, 133-99. Florence: Leo S. Olschki.
Extracted on 25/1/2008 from: http://www.ideals.uiuc.edu/bitstream/2142/392/2/Hockey.pdf