Electronic Texts and their Use for Literary Research

Article 5:

by Eric Johnson

Version 2.1: first published as "Electronic Shakespeare: Making Texts Compute," in Computer-Assisted Research Forum, 1.3 (Spring/Summer, 1993), 1-3.

Obtaining Electronic Texts

The availability of electronic texts may open new expanses for literary research, and text processing with computers can answer a series of questions that otherwise could be answered only with extremely painstaking research.

About two decades ago, scholars started making significant use of computers to analyze documents. Texts, then, were often not available in electronic form; researchers had to key in the texts themselves (or hire someone to enter them).

Entering documents by typing them is a slow and tedious process, and it is difficult to ensure accuracy. Frequently, a text would be keyed twice (ideally, each version would be done by a different typist). Then the two files would be compared, and any differences reconciled with the original.
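The comparison step of double keying can be sketched in a few lines of Python. This is a minimal illustration and not the procedure actually used at the time: the function name `keying_differences` is hypothetical, and the sketch assumes both typists kept the same line breaks as the printed original.

```python
import difflib


def keying_differences(first_keying, second_keying):
    """Return (line number, first version, second version) for each
    stretch of lines on which the two keyed files disagree."""
    first_lines = first_keying.splitlines()
    second_lines = second_keying.splitlines()
    matcher = difflib.SequenceMatcher(a=first_lines, b=second_lines, autojunk=False)
    differences = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # both typists agree here; nothing to reconcile
        differences.append(
            (i1 + 1, "\n".join(first_lines[i1:i2]), "\n".join(second_lines[j1:j2]))
        )
    return differences


# Two hypothetical keyings of the same three lines; the second has one slip.
typist_a = "first line\nsecond line\nthird line"
typist_b = "first line\nsecond lime\nthird line"
for line_number, a, b in keying_differences(typist_a, typist_b):
    print(f"line {line_number}: {a!r} versus {b!r}")
```

Each reported line still has to be checked against the printed original, since the program can show only that the typists disagree, not which of them is right.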

Using this process of keying multiple versions and then collating and correcting them, I had a file containing the sonnets of Shakespeare prepared. The project ended up consuming more than four times the labor I had projected. I intended to have the complete works of Jane Austen entered by the same process, but it proved too expensive to complete, and it was difficult to employ accurate typists who were willing to enter multiple copies of novels: they got bored and quit.

It is possible to create text files with a scanner, but only very recently has the process become both relatively accurate and relatively inexpensive. Scanners still cannot be dependably used for texts that are badly printed (or, of course, handwritten).

There have been a number of remarkable developments in recent years that make texts far more commonly accessible. In fact, almost any document a researcher might reasonably want may soon be accessible in electronic form. Over the years, the Oxford Text Archive has made hundreds and hundreds of text files of literary works available.

Although its current holdings are modest, Project Gutenberg has as its goal to make 10,000 of the most used books available in electronic form by the year 2000. There will be little or no charge for using the Project's files. Some organizations have compiled their own specialized collections, such as those of the American Philosophical Association's Subcommittee on Philosophy and Electronic Texts.

Some kinds of representative texts have been collected and distributed as a corpus of written (or spoken) language. The London-Lund Corpus contains about half a million words of spoken English. The Brown Corpus contains over one million running words of newspaper articles and other published texts. The Helsinki Corpus contains one and one-half million words. The Kolhapur Corpus has one million running words of Indian English. The LOB Corpus has over one million words, and it is available in a version with descriptive tags. All of these corpora are available on CD-ROM.

There are organizations and centers that help create and standardize texts and coordinate their international dissemination. The Center for Electronic Texts in the Humanities established by Rutgers and Princeton Universities and the Georgetown Center for Text and Technology will advise and catalog the production and distribution of electronic texts.

The Text Encoding Initiative (TEI) is devoted to developing guidelines for the encoding and interchange of electronic texts. The TEI has devoted an enormous amount of effort to delineate exactly how the Standard Generalized Markup Language (SGML) can be used to describe the nature and features of an electronic text.

It appears that in the future, SGML tags will commonly be inserted into texts in such a way that computer programs can not only catalog the general characteristics of a document, but also make very specific determinations of how the language functions. For example, an electronic version of the works of Shakespeare that contains SGML tags could be used to determine not only which words are used by Shakespeare, but also (for example) which words are contained in speeches by both Iago and Lear in act two.
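A query of this kind can be sketched as follows. The sketch makes several assumptions that should be stated plainly: it uses well-formed XML in place of SGML so that Python's standard library can parse it, the element and attribute names (`corpus`, `play`, `act`, `sp`, `who`, `l`) are hypothetical and only loosely modeled on TEI practice, and the lines are invented placeholders, not quotations from the plays.

```python
import re
import xml.etree.ElementTree as ET

# Invented sample markup; the wording of the lines is placeholder text.
SAMPLE = """
<corpus>
  <play title="Othello">
    <act n="2">
      <sp who="Iago"><l>placeholder words about honest service</l></sp>
      <sp who="Cassio"><l>other placeholder words</l></sp>
    </act>
  </play>
  <play title="King Lear">
    <act n="2">
      <sp who="Lear"><l>more placeholder talk of service and duty</l></sp>
    </act>
    <act n="3">
      <sp who="Lear"><l>honest storm</l></sp>
    </act>
  </play>
</corpus>
"""


def words_spoken(root, speaker, act):
    """Collect the distinct words in speeches by `speaker` inside acts numbered `act`."""
    words = set()
    for act_element in root.iter("act"):
        if act_element.get("n") != act:
            continue
        for speech in act_element.iter("sp"):
            if speech.get("who") == speaker:
                text = " ".join(speech.itertext())
                words.update(re.findall(r"[a-z']+", text.lower()))
    return words


root = ET.fromstring(SAMPLE)
shared = words_spoken(root, "Iago", "2") & words_spoken(root, "Lear", "2")
print(sorted(shared))
```

Because the tags record who speaks and in which act, the word "honest" (spoken by the sample Lear only in act three) is correctly left out of the result.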

The electronic distribution of many texts is limited by copyright laws, but increasingly the copyright owners are making their holdings available themselves. There are commercial packages filled with the full texts of reference works, novels, and newspapers. Oxford University Press is marketing electronic versions of its editions of works by Austen, Chaucer, Coleridge, Dickens, Shakespeare, Wordsworth, and others. Publishing companies will soon realize that the availability of electronic editions will increase sales of standard printed editions: researchers (especially university faculty and students) will use the electronic versions for research, but they will need the corresponding standard printed edition for reference and classroom use.

Presumably, all commercial publishers now use computerized typesetting equipment. Therefore, it should be very easy for them to produce electronic versions of texts that they issue in printed form. (Although, I was told by an employee of one publishing company that it was common practice to erase the typesetting tapes or disks as soon as the book was printed -- in order to reuse the tapes or disks and thus save money!)

Chadwyck-Healey, Inc., has unbelievably ambitious plans for electronic publishing. According to their announcement letter, The English Poetry Full-Text Database will include "every letter of every word of every poem" of "English poetry spanning the Anglo-Saxon period (600 A.D.) through the nineteenth century." This corpus of English poetry will be encoded, presumably with SGML tags.

As if the English Poetry Database were not enough, Chadwyck-Healey will also publish the Patrologia Latina Database: a compilation of 221 volumes of late ancient and early Latin literature. Chadwyck-Healey's prices for these text databases are substantial, but they are only fractions of the costs of scholars producing the text files for themselves.

Electronic texts are currently available on several media: magnetic floppy disks are most commonly used for distribution of shorter works; larger files are usually stored on CD-ROM discs or, less frequently, on nine-track magnetic tapes. The future will probably offer far larger-capacity media, but direct computer processing of enormous files over high-speed communication networks similar to the Internet may become the standard.

Using Electronic Texts for Research

Computer analysis of electronic texts can make it easy to answer a series of questions that otherwise can be answered only by intuition, guess, or uncommonly mind-numbing research.

For example, Stuart Tave contended that Jane Austen, in her novels, carefully discriminated between two words that are often considered synonyms: "amiable" and "agreeable." (See note 1.) Following a distinction made by one of the characters in Emma (see note 2), Tave maintained that "agreeable" is used to describe a mere show of surface manners, while "amiable" is a quality based on excellent internal character.

Tave described the context of several appearances of each word and mentioned a number of others. He then assumed that every use of the words in Austen's novels (especially in Emma and Pride and Prejudice) maintained the same distinction.

Using the electronic texts of the novels, a computer program that searches for specific words and displays their context indicates that in Emma, "agreeable" appears 49 times, and "amiable" appears 35 times; in Pride and Prejudice, "agreeable" appears 41 times, and "amiable" appears 39 times. Upon examination of the 164 occurrences of these words in context, it appears that Tave is absolutely correct; the words are indeed carefully distinguished.

Upon freshly rereading Austen's Pride and Prejudice, I got the idea that the words "love" and "affection" were similarly used in a very precise way. Based on noting a handful of occurrences, I thought that young men and women used the word "love" only of parents or brothers and sisters; if they wanted to talk about the emotion connected with courtship, they spoke of "affection" (at least until they were engaged). This was my idea based on one rereading of the novel. I had noticed perhaps a dozen occurrences of the words, and they seemed to support my thesis.

A computer program searched for each word and displayed its context; there are 92 occurrences of "love" and 58 occurrences of "affection." Careful examination of the 150 passages indicates absolutely no support for my position: the two words were used almost interchangeably.

Obviously, it is extremely difficult, if not absolutely impossible, to notice the exact usage of a hundred or more occurrences of one or more words during the reading of a novel of several hundred pages. A computer using the electronic texts of novels brings all such occurrences together in their context for a researcher to examine.
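A keyword-in-context search of the kind described above can be sketched in Python. This is a minimal illustration, not the program used for the Austen searches; the function name `keyword_in_context` and the `width` parameter are hypothetical, and the sample sentence is invented.

```python
import re


def keyword_in_context(text, keyword, width=30):
    """Return one display line per whole-word occurrence of `keyword`,
    with up to `width` characters of context on each side."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    flat = " ".join(text.split())  # collapse line breaks so the context reads continuously
    lines = []
    for match in pattern.finditer(flat):
        left = flat[max(0, match.start() - width):match.start()]
        right = flat[match.end():match.end() + width]
        lines.append(f"{left:>{width}} [{match.group(0)}] {right}")
    return lines


# Invented sample text, not a quotation from the novels.
sample = ("She was amiable. He was merely agreeable, not Amiable; "
          "his amiableness was doubtful.")
for line in keyword_in_context(sample, "amiable", width=20):
    print(line)
```

The whole-word match means that "amiableness" is not counted as an occurrence of "amiable"; whether related forms should be included is a decision for the researcher, and the counts will change accordingly.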

Almost always, computer analysis of lengthy texts produces surprises by assembling data that run counter to intuitions about the works. It is quite easy to count the number of words of direct dialogue and compare that with the total number of words. It turns out that some novels commonly thought of as filled with talking (such as those of Jane Austen) contain proportionally less dialogue than some novels often thought of as nearly all narrative.
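The dialogue count can be sketched as below. The sketch assumes that direct speech is enclosed in balanced double quotation marks (straight or curly), which is a property of a particular edition and not of novels in general; the function name `dialogue_share` and the sample sentence are invented for illustration.

```python
import re

# A word is a run of letters, optionally joined by an apostrophe or hyphen.
WORD = re.compile(r"[A-Za-z]+(?:['’-][A-Za-z]+)*")
# Text between straight double quotes, or between curly double quotes.
QUOTED = re.compile(r'"([^"]*)"|“([^”]*)”')


def dialogue_share(text):
    """Return (words of dialogue, total words, dialogue as a fraction of the total)."""
    total = len(WORD.findall(text))
    spoken = sum(len(WORD.findall(a or b)) for a, b in QUOTED.findall(text))
    return spoken, total, (spoken / total if total else 0.0)


sample = 'It was a fine day. "I am glad," said Anne, "that you came."'
spoken, total, share = dialogue_share(sample)
print(f"{spoken} of {total} words are dialogue ({share:.0%})")
```

Editions that mark speech with single quotation marks, or that leave a long speech unclosed across paragraphs, would need a different pattern.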

It can be interesting to observe how writers use color. John Keats sometimes associates rich colors with emotion, while Joseph Conrad rarely does so. Again, there can be surprises. Since she often deals with intellectual rather than physical landscapes, George Eliot's novels might be supposed to have less color than, say, those of Nathaniel Hawthorne. In fact, Eliot uses about twice as many words for color as Hawthorne. Shakespeare's plays contain fewer words for color than Hawthorne's novels, but perhaps plays always have fewer words for color than novels.
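Comparisons of this sort depend on a list of color words and on normalizing for the length of each text. A minimal sketch follows; the word list `COLOR_WORDS` is an assumption made for illustration (it is not the list behind the figures above), and the rate is expressed per thousand words.

```python
import re
from collections import Counter

# Hypothetical word list; a real study would justify each inclusion.
COLOR_WORDS = {
    "red", "crimson", "scarlet", "blue", "green", "yellow", "golden",
    "purple", "white", "black", "grey", "gray", "brown",
}


def color_profile(text):
    """Return (count of each color word, color words per thousand words)."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(word for word in words if word in COLOR_WORDS)
    rate = 1000 * sum(counts.values()) / len(words) if words else 0.0
    return counts, rate


counts, rate = color_profile("The red rose and the white rose lay on green grass.")
print(counts.most_common(), round(rate, 1))
```

A raw count cannot tell a color from a surname such as Brown or from "green" in the sense of inexperienced, so the occurrences still need to be read in context before any conclusion is drawn.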

Roger Murray has shown that if poems written in England between 1780 and 1820 are classified by topics such as pathos, love, and childhood, it appears that some types of words are used with far greater frequency in poems of one topic than in those of another topic. As might be expected, poems about childhood have a great number of words that concern education: "book," "learn," "read," "school," and so on. It may or may not be expected that poems about pathos have a high frequency of travel words: "journey," "return," "roam," and "traveller." It does not seem obvious that poems about love should have lots of words that suggest time: "today," "tomorrow," "daily," "twilight," and "dusk." (See note 3.)

Nancy Ide has observed that William Blake's massive poem "The Four Zoas" has an intricate narrative that uses Blake's obscure personal mythology, and thus it would be expected that the meaning of the poem is incomprehensible for most readers. However, she documents that the poem has patterns of images that make a powerful impact even on a naive reader. (See note 4.)

An analysis of the vocabularies of writers and of their word and syntax preferences may indicate answers to questions of authorship attribution of texts. Perhaps due to their use of inductive logic, such studies have not always been very persuasive. John Burrows, who has published several fascinating studies testing characteristics of texts and authorship with computers, said that although "they will never be entitled to claim certainty," literary researchers "can undoubtedly help to identify the authors of doubtful texts." (See note 5.)

Computer generation of word-frequency lists, concordances, indexes, and collocation data for texts can make textual research much easier. It may seem to be trivial to count the numbers of personal pronouns in texts. Yet, a comparison of such counts could be interesting for the novels of Jane Austen and Joseph Conrad if, as has been said, there is no scene in Austen's novels in which a woman is not present and there is never a scene in a novel by Conrad in which a man is not present.
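Both the pronoun counts and simple collocation data can be sketched with Python's standard library. The names `pronoun_tally` and `collocates`, the two pronoun lists, and the sample sentence are assumptions made for illustration; the window of four words on either side is an arbitrary default.

```python
import re
from collections import Counter

FEMININE = {"she", "her", "hers", "herself"}
MASCULINE = {"he", "him", "his", "himself"}


def tokenize(text):
    """Lower-case the text and split it into words."""
    return re.findall(r"[a-z']+", text.lower())


def pronoun_tally(words):
    """Count feminine and masculine third-person singular pronouns."""
    return {
        "feminine": sum(word in FEMININE for word in words),
        "masculine": sum(word in MASCULINE for word in words),
    }


def collocates(words, node, window=4):
    """Count the words that appear within `window` words of each occurrence of `node`."""
    counts = Counter()
    for i, word in enumerate(words):
        if word == node:
            start = max(0, i - window)
            counts.update(words[start:i] + words[i + 1:i + 1 + window])
    return counts


words = tokenize("She gave him her book; he thanked her.")
print(pronoun_tally(words))
print(collocates(words, "her", window=1).most_common())
```

A full word-frequency list for the same text is simply `Counter(words).most_common()`.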

Having texts in electronic form can greatly simplify finding specific words and passages. Where does Shakespeare say, "The first thing we do let's kill all the lawyers"? And is that an accurate quotation? (According to the Wells and Taylor edition of the Works that is available in electronic edition, this is an accurate quotation from act 4, scene 2, line 78 of 2 Henry VI.) How many times does Shakespeare mention lawyers? (The words "lawyer" or "lawyers" are used eleven times.)

Not only can passages be found easily in electronic texts, but when they are found, the relevant lines can be blocked and moved directly into another document with many word processors. Avoiding rekeying passages makes research faster and less tedious, and it helps assure accuracy.

The ease with which electronic texts can be searched and data collected prompts questions for research that would not otherwise be considered. In a play by Shakespeare, do the heroes or the villains get to talk more? What is the minimum number of actors (assuming unlimited doubling) needed to perform a given play? (Such a question can be answered by determining which characters are on stage at the same time based on stage directions and on characters' lines.)
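The doubling question can be framed as a small computation once the on-stage groupings have been extracted. The sketch below is one possible approach under stated assumptions, not an established method: it takes as input a list of sets, each holding the characters on stage together at some moment, and it ignores practical limits such as the time needed for a costume change. The largest group gives a lower bound on the cast; a greedy assignment gives an upper bound. When the two agree the answer is exact; otherwise the true minimum lies between them. The character names are invented.

```python
from itertools import combinations


def casting_bounds(moments):
    """Return (lower bound, upper bound, one workable casting) for a play.

    `moments` is a list of sets of characters who share the stage."""
    lower = max((len(group) for group in moments), default=0)

    # Two characters conflict if they are ever on stage together.
    conflicts = {}
    for group in moments:
        for name in group:
            conflicts.setdefault(name, set())
        for a, b in combinations(group, 2):
            conflicts[a].add(b)
            conflicts[b].add(a)

    # Greedy assignment: cast the most constrained characters first and give
    # each the lowest-numbered actor not already playing a conflicting part.
    actor_of = {}
    for name in sorted(conflicts, key=lambda n: (-len(conflicts[n]), n)):
        taken = {actor_of[other] for other in conflicts[name] if other in actor_of}
        actor = 0
        while actor in taken:
            actor += 1
        actor_of[name] = actor

    upper = max(actor_of.values(), default=-1) + 1
    return lower, upper, actor_of


# Hypothetical play with five characters.
moments = [
    {"King", "Queen", "Fool"},
    {"King", "Messenger"},
    {"Queen", "Messenger"},
    {"Fool", "Ghost"},
]
lower, upper, casting = casting_bounds(moments)
print(lower, upper, casting)
```

In this invented example three actors suffice for five characters: the actor playing the Fool can double as the Messenger, and the actor playing the King can double as the Ghost.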

There are many interesting things for scholars to analyze in literature when they have texts in electronic form. Books and journals will increasingly cover new territory in which study is made possible by the availability of electronic texts.

NOTES

1 Stuart M. Tave, Some Words of Jane Austen (Chicago: University of Chicago Press, 1973), pp. 116-131.

2 Jane Austen, Emma, Ed. R. W. Chapman (London: Oxford University Press, 1933), p. 149.

3 Roger Murray, "Poetry and Collocation," Style, 14:3 (Summer 1980), 216-234.

4 Nancy M. Ide, "A Statistical Measure of Theme and Structure," Computers and the Humanities, 23:4-5 (August-October, 1989), 277-283.

5 J. F. Burrows, "Not Unless You Ask Nicely: The Interpretative Nexus Between Analysis and Information," Literary and Linguistic Computing, 7:2 (1992), 103.

Eric Johnson is Professor of English and Dean of the College of Liberal Arts at Dakota State University, Madison, SD 57042 U.S.A. He has published more than one hundred articles and reviews about computers, writing, and literary study. He can be reached by electronic mail as JohnsonE@jupiter.dsu.edu