Obtaining Electronic Texts
The availability of
electronic texts may open new expanses for literary research, and text
processing with computers can answer a series of questions that otherwise could
be answered only with extremely painstaking research.
About two decades ago,
scholars started making significant use of computers to analyze documents.
Texts, then, were often not available in electronic form; researchers had to
key in the texts themselves (or hire someone to enter them).
Entering documents by
typing them is a slow and tedious process, and it is difficult to ensure
accuracy. Frequently, a text would be keyed twice (ideally, each version would
be done by a different typist). Then the two files would be compared, and any
differences reconciled with the original.
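The comparison step in double keying is easy to mechanize. The following is a minimal sketch, assuming each typist's version is available as a plain string; the function name `keying_differences` and the two sample strings are illustrative only, and Python's standard `difflib` module stands in for whatever collation tool was actually used.

```python
import difflib


def keying_differences(version_a, version_b):
    """Return (lines_from_a, lines_from_b) pairs where two keyed versions disagree."""
    a_lines = version_a.splitlines()
    b_lines = version_b.splitlines()
    matcher = difflib.SequenceMatcher(None, a_lines, b_lines)
    differences = []
    for tag, a_start, a_end, b_start, b_end in matcher.get_opcodes():
        if tag != "equal":
            # Each disagreement must still be checked by hand against the original.
            differences.append((a_lines[a_start:a_end], b_lines[b_start:b_end]))
    return differences


# Two hypothetical keyings of the same lines; the second ends with a semicolon.
typist_one = "Shall I compare thee to a summer's day?\nThou art more lovely and more temperate:"
typist_two = "Shall I compare thee to a summer's day?\nThou art more lovely and more temperate;"

for from_a, from_b in keying_differences(typist_one, typist_two):
    print("A:", from_a)
    print("B:", from_b)
```

The program only locates disagreements; deciding which typist was right still requires the printed original.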
Using this process of
keying multiple versions and then collating and correcting them, I had a file
containing the sonnets of Shakespeare prepared. The project ended up consuming
more than four times the labor I had projected. I
intended to have the complete works of Jane Austen entered by such a process,
but it turned out to be too expensive to complete. It was also difficult to
employ accurate typists who were willing to enter multiple copies of novels;
they got bored and quit.
It is possible to create
text files with a scanner, but only very recently has the process become both
relatively accurate and relatively inexpensive. Scanners still cannot be
dependably used for texts that are badly printed (or, of course, handwritten).
There have been a number of
remarkable developments in recent years that make texts far more commonly
accessible. In fact, almost any document a researcher might reasonably want may
soon be accessible in electronic form. Over the years, the Oxford Text Archive
has made hundreds and hundreds of text files of literary works available.
Although its current
holdings are modest, Project Gutenberg has as its goal to make 10,000 of the
most used books available in electronic form by the year 2000. There will be
little or no charge for using the Project's files. Some organizations have
compiled their own specialized collections, such as those of the American
Philosophical Association's Subcommittee on Philosophy and Electronic Texts.
Some kinds of
representative texts have been collected and distributed as a corpus of written
(or spoken) language. The London-Lund Corpus contains about half a million
words of spoken English. The Brown Corpus contains over one million running
words of newspaper articles and other published texts. The Helsinki Corpus
contains one and one-half million words. The Kolhapur Corpus has one million
running words of Indian English. The LOB Corpus has over one million words, and
it is available in a version with descriptive tags. All of these corpora are
available on CD-ROM.
There are organizations and
centers that help create and standardize texts and
coordinate their international dissemination. The Center
for Electronic Texts in the Humanities established by Rutgers and
The Text Encoding
Initiative (TEI) is devoted to developing guidelines for the encoding and interchange of electronic texts. The TEI has put an
enormous amount of effort into delineating exactly how the Standard Generalized Markup Language (SGML) can be used to describe the nature
and features of an electronic text.
It appears that in the
future, SGML tags will commonly be inserted into texts in such a way that
computer programs can not only catalog the general
characteristics of a document, but also make very specific determinations of
how the language functions. For example, an electronic version of the works of
Shakespeare that contains SGML tags could be used to determine not only which
words are used by Shakespeare, but also (for example) which words are contained
in speeches by both Iago and Lear in act two.
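A query of this kind might be sketched as follows. The example assumes a simplified, well-formed XML rendering of the markup (Python's standard library does not parse SGML directly); the element and attribute names (`corpus`, `play`, `act`, `sp`, `who`, `n`) are assumptions made for illustration rather than a schema confirmed here, and the speech text is invented placeholder wording, not Shakespeare's.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical tagged corpus with placeholder speech text.
SAMPLE = """
<corpus>
  <play title="Othello">
    <act n="2">
      <sp who="Iago">placeholder honest words here</sp>
      <sp who="Cassio">placeholder other words</sp>
    </act>
  </play>
  <play title="King Lear">
    <act n="2">
      <sp who="Lear">placeholder honest speech</sp>
    </act>
    <act n="3">
      <sp who="Lear">words from a later act</sp>
    </act>
  </play>
</corpus>
"""


def words_spoken(root, speaker, act):
    """Collect the distinct words in one speaker's speeches within a given act."""
    words = set()
    for act_el in root.iter("act"):
        if act_el.get("n") != act:
            continue
        for speech in act_el.iter("sp"):
            if speech.get("who") == speaker:
                text = "".join(speech.itertext())
                words.update(re.findall(r"[a-z']+", text.lower()))
    return words


root = ET.fromstring(SAMPLE)
shared = words_spoken(root, "Iago", "2") & words_spoken(root, "Lear", "2")
print(sorted(shared))
```

Because the tags record speaker and act, the program never has to guess where a speech begins or ends; Lear's speech in act three is excluded simply because its enclosing tag says so.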
The electronic distribution
of many texts is limited by copyright laws, but increasingly the copyright
owners are making their holdings available themselves. There are commercial
packages filled with the full texts of reference works, novels, and newspapers.
Oxford University Press is marketing electronic versions of its editions of
works by Austen, Chaucer, Coleridge, Dickens, Shakespeare, Wordsworth, and
others. Publishing companies will soon realize that the availability of
electronic editions will increase sales of standard printed editions:
researchers (especially university faculty and students) will use the
electronic versions for research, but they will need the corresponding standard
printed edition for reference and classroom use.
Presumably, all commercial
publishers now use computerized typesetting equipment. Therefore, it should be
very easy for them to produce electronic versions of texts that they issue in
printed form. (However, I was told by an employee of one publishing company
that it was common practice to erase the typesetting tapes or disks as soon as
the book was printed -- in order to reuse the tapes or disks and thus save
money!)
Chadwyck-Healey, Inc.,
has unbelievably ambitious plans for electronic publishing. According to their
announcement letter, The English Poetry Full-Text Database will include
"every letter of every word of every poem" of "English poetry
spanning the Anglo-Saxon period (
As if the English Poetry
Database were not enough, Chadwyck-Healey will also
publish the Patrologia Latina Database: a compilation
of 221 volumes of late ancient and early Latin literature. Chadwyck-Healey's
prices for these text databases are substantial, but they are only fractions of
the costs of scholars producing the text files for themselves.
Electronic texts are
currently available on several media: magnetic floppy disks are most
commonly used for distribution of shorter works; larger files are usually
stored on CD-ROM discs or, less frequently, on nine-track magnetic tapes. The
future will probably offer far larger-capacity media, but direct computer
processing of enormous files using high-speed communication networks similar to
the Internet may become the standard.
Using Electronic Texts for Research
Computer analysis of
electronic texts can make it easy to answer a series of questions that
otherwise can be answered only by intuition, guess, or uncommonly mind-numbing
research.
For example, Stuart Tave contended that Jane Austen, in her novels,
carefully discriminated between the use of two words
that are often considered synonyms: "amiable" and
"agreeable." (See note 1.) Following a distinction made by one
of the characters in Emma (see note 2), Tave
maintained that "agreeable" is used to describe a mere show of
surface manners, while "amiable" is a quality based on excellent
internal character.
Tave described the context of several
appearances of each word and mentioned a number of others. He then assumed that
every use of the words in Austen's novels (especially in Emma and Pride
and Prejudice) maintained the same distinction.
Using the electronic texts
of the novels, a computer program that searches for specific words and displays
their context indicates that in Emma, "agreeable" appears 49
times, and "amiable" appears 35 times; in Pride and Prejudice,
"agreeable" appears 41 times, and "amiable" appears 39
times. Upon examination of the 164 occurrences of these words in context, it
appears that Tave is absolutely correct; the words
are indeed carefully distinguished.
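A program of the kind described, one that finds a word and displays each occurrence in context, can be quite short. The following is a minimal keyword-in-context sketch, not the program actually used for these counts; the function name `keyword_in_context`, the `width` parameter, and the sample sentence are all invented for illustration.

```python
import re


def keyword_in_context(text, keyword, width=30):
    """Return one context line per whole-word occurrence of keyword."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    flat = " ".join(text.split())  # collapse line breaks so context reads cleanly
    lines = []
    for match in pattern.finditer(flat):
        left = flat[max(0, match.start() - width):match.start()]
        right = flat[match.end():match.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}}[{match.group()}]{right}")
    return lines


# Invented sample text, not a quotation from Austen.
sample = ("She was agreeable in company. He was truly amiable; "
          "others found him merely agreeable, and a few disagreeable.")

for line in keyword_in_context(sample, "agreeable", width=20):
    print(line)
```

Matching on word boundaries keeps "disagreeable" out of the count for "agreeable"; whether inflected or derived forms should be counted is a decision the researcher has to make before trusting the totals.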
Upon freshly rereading
Austen's Pride and Prejudice, I got the idea that the words
"love" and "affection" were similarly used in a very
precise way. Based on noting a handful of occurrences, I thought that young men
and women used the word "love" only of parents or brothers and
sisters; if they wanted to talk about the emotion connected with courtship,
they spoke of "affection" (at least until they were engaged). This
was my idea based on one rereading of the novel. I had noticed perhaps a dozen
occurrences of the words, and they seemed to support my thesis.
A computer program searched
for each word and displayed its context; there are 92 occurrences of
"love" and 58 occurrences of "affection." Careful
examination of the 150 passages indicates absolutely no support for my
position: the two words were used almost interchangeably.
Obviously, it is extremely
difficult, if not absolutely impossible, to notice the exact usage of a hundred
or more occurrences of one or more words during the reading of a novel of
several hundred pages. A computer using the electronic texts of novels brings
all such occurrences together in their context for a researcher to examine.
Almost always, computer
analysis of lengthy texts produces surprises by assembling data that run
counter to intuitions about the works. It is quite easy to count the number of
words of direct dialogue and compare that with the total number of words. It
turns out that some novels that are commonly thought of as filled with talking
(such as those of Jane Austen) contain proportionally less dialogue than some
that are often thought of as nearly all narrative.
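The dialogue count can be sketched in a few lines, under a strong simplifying assumption: that direct speech is always enclosed in matched pairs of straight double quotation marks. Real editions often use single or curly quotation marks, or carry a quotation across paragraphs, so this is an illustration of the idea rather than a dependable measuring tool. The function name `dialogue_share` and the sample sentence are invented.

```python
import re

# A word is a run of letters, optionally with internal apostrophes (e.g. "don't").
WORD = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")


def dialogue_share(text):
    """Return (words inside double quotes, total words) for a text."""
    quoted = re.findall(r'"([^"]*)"', text)
    dialogue_words = sum(len(WORD.findall(passage)) for passage in quoted)
    total_words = len(WORD.findall(text))
    return dialogue_words, total_words


# Invented sample text.
sample = 'He paused. "I will go," she said, "but not today." Then silence fell.'

spoken, total = dialogue_share(sample)
print(f"{spoken} of {total} words are direct dialogue")
```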
It can be interesting to
observe how writers use color. John Keats sometimes
associates rich colors with emotion, while Joseph
Conrad rarely does so. Again, there can be surprises. Since she often deals
with intellectual rather than physical landscapes, George Eliot's novels might
be supposed to have less color than, say, those of
Nathaniel Hawthorne. In fact, Eliot uses about twice as many words for color as
Roger Murray has shown that
if poems written in
Nancy Ide
has observed that William Blake's massive poem "The Four Zoas" has an intricate narrative that uses Blake's
obscure personal mythology, and thus it would be expected that the meaning of
the poem is incomprehensible for most readers. However, she documents that the
poem has patterns of images that make a powerful impact even on a naive reader.
(See note 4.)
An analysis of the
vocabularies of writers and of their word and syntax preferences may indicate
answers to questions of authorship attribution of texts. Perhaps due to their
use of inductive logic, such studies have not always been very persuasive. John
Burrows, who has published several fascinating studies testing characteristics
of texts and authorship with computers, said that although "they will
never be entitled to claim certainty," literary researchers "can undoubtedly
help to identify the authors of doubtful texts." (See note 5.)
Computer generation of
word-frequency lists, concordances, indexes, and collocation data for texts can
make textual research much easier. It may seem to be trivial to count the
numbers of personal pronouns in texts. Yet, a comparison of such counts could
be interesting for the novels of Jane Austen and Joseph Conrad if, as has been
said, there is no scene in Austen's novels in which a woman is not present and
there is never a scene in a novel by Conrad in which a man is not present.
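A pronoun count of this sort is a small extension of a word-frequency list. The sketch below assumes the text is a plain string; the pronoun sets and the function name `pronoun_counts` are choices made for illustration (a fuller study would have to decide, for instance, how to treat "her" as possessive versus object), and the sample sentence is invented.

```python
import re
from collections import Counter

FEMININE = {"she", "her", "hers", "herself"}
MASCULINE = {"he", "him", "his", "himself"}


def pronoun_counts(text):
    """Count third-person singular feminine and masculine pronouns in a text."""
    frequencies = Counter(re.findall(r"[a-z]+", text.lower()))
    return {
        "feminine": sum(frequencies[word] for word in FEMININE),
        "masculine": sum(frequencies[word] for word in MASCULINE),
    }


# Invented sample sentence.
sample = "She gave him her book, and he thanked her."
print(pronoun_counts(sample))
```

The intermediate `Counter` is itself a word-frequency list, so the same pass over the text can serve both purposes.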
Having
texts in electronic form can greatly simplify finding specific words and
passages. Where
does Shakespeare say, "The first thing we do let's kill all the
lawyers"? And is that an accurate quotation? (According to the Wells and
Taylor edition of the Works that is available in electronic edition, this is an
accurate quotation from act 4, scene 2, line 78 of 2
Henry VI.) How many times does Shakespeare mention lawyers? (The words
"lawyer" or "lawyers" are used eleven times.)
Not only can passages be
found easily in electronic texts, but when they are found, the relevant lines
can be blocked and moved directly into another document with many word
processors. Avoiding rekeying passages makes research faster and less tedious,
and it helps assure accuracy.
The ease with which
electronic texts can be searched and data collected prompts questions for
research that would not otherwise be considered. In a play by Shakespeare, do
the heroes or the villains
get to talk more? What is the minimum number of actors (assuming unlimited
doubling) needed to perform a given play? (Such a question can be answered by
determining which characters are on stage at the same time based on stage
directions and on characters' lines.)
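The doubling question can be treated as a small graph problem: two characters cannot share an actor if they are ever on stage together. The sketch below assumes the on-stage information has already been extracted into a list of sets, one per scene or stage moment; the function name `actor_bounds` and the character names are hypothetical. It reports two numbers: the largest group on stage at once (no cast can be smaller) and the size of a cast found by a simple greedy assignment (which is sufficient but not guaranteed to be the smallest possible). Practical limits such as time for costume changes are ignored.

```python
from itertools import combinations


def actor_bounds(scenes):
    """Return (lower_bound, greedy_upper_bound) on the actors a play needs."""
    lower = max((len(on_stage) for on_stage in scenes), default=0)

    # Record which characters ever share the stage.
    conflicts = {}
    for on_stage in scenes:
        for name in on_stage:
            conflicts.setdefault(name, set())
        for first, second in combinations(on_stage, 2):
            conflicts[first].add(second)
            conflicts[second].add(first)

    # Greedy assignment: take the busiest characters first and give each the
    # lowest-numbered actor not already playing someone they appear with.
    actor_of = {}
    for name in sorted(conflicts, key=lambda n: (-len(conflicts[n]), n)):
        taken = {actor_of[other] for other in conflicts[name] if other in actor_of}
        actor = 0
        while actor in taken:
            actor += 1
        actor_of[name] = actor

    upper = max(actor_of.values(), default=-1) + 1
    return lower, upper


# Hypothetical play: no scene has more than two characters on stage,
# yet every pair of characters meets at some point.
scenes = [{"A", "B"}, {"B", "C"}, {"A", "C"}]
print(actor_bounds(scenes))
```

The sample shows why the largest scene alone does not settle the question: only two characters are ever on stage together, but three actors are still required.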
There are many interesting
things for scholars to analyze in literature when they have texts in electronic
form. Books and journals will increasingly cover new territory in which study
is made possible by the availability of electronic texts.
1 Stuart M. Tave, Some Words of Jane Austen (Chicago: University
of Chicago Press, 1973), pp. 116-131.
2 Jane Austen, Emma,
ed. R. W. Chapman (London: Oxford University Press, 1933), p. 149.
3 Roger
4 Nancy M. Ide, "A Statistical Measure of Theme and
Structure," Computers and the Humanities, 23:4-5 (August-October,
1989), 277-283.
5 J. F. Burrows, "Not
Unless You Ask Nicely: The Interpretative Nexus Between Analysis and
Information," Literary and Linguistic Computing, 7:2 (1992), 103.
Eric Johnson is Professor
of English and Dean of the
“© a.r.e.a./Dr.Vicente Forés López http://www.uv.es/%7Efores/mainframeuvp.html”