The Editing and Structure of Open Source Shakespeare
Moby Shakespeare’s texts collectively can be called a diplomatic edition of a critical edition: They are an edition produced by faithfully reproducing another edition, which was formed by conflating the folios and quartos. However, the texts could not be used “as is” if they were going to be fed into a database on their way to becoming Open Source Shakespeare.
The first challenge was to get the texts into a uniform order. The human eye can easily ignore small differences in formatting; a computer is far less forgiving. Sometimes the ends of lines were terminated with a paragraph break, sometimes two. Act and scene changes were indicated differently in different texts, and so on.
There was also the question of what to do with material that lies outside the characters’ spoken lines. I removed the dramatis personae at the beginning of each play and entered the character descriptions into a separate database table, so they can be seen in the play’s home page, but remain distinct from the text.
In editing the texts themselves, I made some minor changes for the sake of consistency. For instance, the Moby texts indent certain stage directions if they fall at the end of a line, and sometimes, a stage direction is indented by many spaces. This seems arbitrary, and although it may be following a convention in the printed texts, it adds nothing to either comprehension or aesthetics. For the most part, those spaces have been removed.
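The kind of clean-up described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `normalize` is hypothetical, and stripping every leading space is a blunter rule than the selective editing actually applied to the Moby texts.

```python
import re


def normalize(text):
    """Give a raw text uniform line endings, no indentation, no blank lines."""
    # Treat Windows and old Mac line endings the same as Unix ones.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Drop leading and trailing spaces, including indented stage directions.
    lines = [line.strip() for line in text.split("\n")]
    # Collapse runs of two or more line breaks into a single break.
    collapsed = re.sub(r"\n{2,}", "\n", "\n".join(lines))
    return collapsed.strip("\n") + "\n"
```

After this pass, every line ends with exactly one break, so a later program never has to guess whether a blank line means anything.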
In the course of preparing the texts for the parser (about which more in a moment), many miscellaneous formatting errors came to light. Visitors found others after the site’s release, including less visually obvious flaws such as the assignment of a particular line to the wrong character (an error that was sometimes my fault, but usually the fault of the original Moby text). In all likelihood, many other errors remain in the 28,000 lines. Because there are over 860,000 words in the texts, I judged that my time would be more profitably spent on the site’s tools, and so errors are fixed as users report them.
When I prepared the texts, I made them readable by humans, but in a consistent format meant to be read by a machine. Specifically, they were intended for a parser, a program that reads a text and does something useful with it. In this case, the parser splits the texts into individual lines, determines their attributes, and feeds them into a database. (See Appendix B for a sample of the texts’ final format.)
I developed the parser at the same time I was feeding it the texts. I started with one play (King Lear) and wrote the first-generation version of the parser. As I formatted the texts, I improved the parser’s performance and power. For example, at first the parser did nothing other than read each line and figure out which character it belonged to, adding act and scene information as well. Because it was easy enough to determine how many words and characters were in each line, I then programmed the parser to capture that information and store those values in the database.
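To make the parser's job concrete, here is a minimal Python sketch of the same idea. It is not the Perl parser used for OSS, and the input format is an assumption made up for this example (it is not the format shown in Appendix B): act and scene markers sit on their own lines, and each speech starts with the speaker's name followed by a period.

```python
import re
from dataclasses import dataclass

# Hypothetical input format, invented for this sketch.
SAMPLE = """\
ACT 1
SCENE 1
Kent. I thought the king had more affected the Duke of Albany than Cornwall.
Gloucester. It did always seem so to us.
"""

MARKER = re.compile(r"^(ACT|SCENE) (\d+)$")
SPEECH = re.compile(r"^([A-Z][A-Za-z ]*)\. (.+)$")


@dataclass
class Line:
    act: int
    scene: int
    speaker: str
    text: str
    word_count: int
    char_count: int


def parse(source):
    """Split a formatted text into Line records ready for a database."""
    act = scene = 0
    lines = []
    for raw in source.splitlines():
        raw = raw.strip()
        if not raw:
            continue
        marker = MARKER.match(raw)
        if marker:
            # A new act resets the scene counter; a new scene keeps the act.
            if marker.group(1) == "ACT":
                act, scene = int(marker.group(2)), 0
            else:
                scene = int(marker.group(2))
            continue
        speech = SPEECH.match(raw)
        if speech:
            speaker, text = speech.groups()
            lines.append(
                Line(act, scene, speaker, text, len(text.split()), len(text))
            )
    return lines
```

Calling `parse(SAMPLE)` yields two records, each carrying its act, scene, speaker, and the word and character counts that would be stored alongside the line.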
There are four search options in OSS: partial-word, exact-word, stemmed, and phonetic. Every online text search function will search for all or part of a word. That is, when a user searches for the word play, the function will find play, but also playing and replay. Finding an exact match, which would exclude playing and replay, is not ubiquitous in online text searches, but it is common and useful, so OSS can do it. There were two additional inexact, or “fuzzy,” search methods that intrigued me, stemmed searches and phonetic (sound-alike) searches, which are rarely used. I started experimenting with these searches to see if I could incorporate them.
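The difference between the first two options is easy to show in code. The following Python sketch is an illustration only: the function names and the sample lines are invented here, and OSS itself runs its searches through SQL and PHP rather than Python.

```python
import re


def partial_word_search(keyword, lines):
    """Return lines containing the keyword anywhere, even inside longer words."""
    needle = keyword.lower()
    return [line for line in lines if needle in line.lower()]


def exact_word_search(keyword, lines):
    """Return lines containing the keyword as a whole word only."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    return [line for line in lines if pattern.search(line)]


# Invented sample lines, not taken from the plays.
LINES = ["the play begins", "they are playing", "a replay of the scene"]
```

With these lines, a partial-word search for play returns all three, while an exact-word search returns only the first.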
The Porter stemming algorithm is a venerable method of determining the stems of words by applying a fixed set of suffix-stripping rules. It removes inflections from words, so playing, played, and plays are converted to the synthetic stem plai. But it has no idea that is and was are conjugated forms of be (though it will identify being as derived from the same stem).
Another standard linguistic programming method is the Metaphone algorithm. This method forms a sound value from a word by stripping the vowels out of it and then converting similar-sounding consonants into a common consonant. Porter and Metaphone are widely documented on the Internet, and you can find ready-made code for them written in many programming languages. That is important, because in OSS, the texts are sent through a parser written in one language (Perl), extracted through another language (SQL), and displayed through a third (PHP).
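The vowel-stripping and consonant-merging idea can be shown with a toy function. The Python sketch below is emphatically not Metaphone, which has many more rules (for digraphs, silent letters, and initial vowels, among others). The consonant groups and the name `phonetic_key` are assumptions chosen for illustration.

```python
VOWELS = set("aeiouy")

# Hypothetical consonant groups: every letter in a group maps to one code.
SOUND_GROUPS = {
    "b": "P", "p": "P",
    "c": "K", "g": "K", "k": "K", "q": "K",
    "d": "T", "t": "T",
    "f": "F", "v": "F",
    "s": "S", "z": "S",
}


def phonetic_key(word):
    """Build a crude sound key: drop vowels, merge similar consonants."""
    key = []
    for ch in word.lower():
        if not ch.isalpha() or ch in VOWELS:
            continue
        code = SOUND_GROUPS.get(ch, ch.upper())
        # Skip a code that merely repeats the previously kept code.
        if not key or key[-1] != code:
            key.append(code)
    return "".join(key)
```

Even this crude key shows both the appeal and the danger of the approach: their, there, and they're all reduce to THR, but guild, called, could, cold, glad, killed, and quality all reduce to KLT.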
Once I gathered the code necessary to build stemming and phonetic searches, some choices presented themselves. To run a phonetic search, for example, you have to perform the following steps:
1. Convert the user-supplied keywords into phonetic values;
2. Build a database query based on those values; and
3. Execute the query in a reasonable amount of time.
I could think of two ways to perform step 3. First, the query could retrieve all of the lines in the scope that the user specifies – which could include all the works, and all 28,000 lines – and march through the results one-by-one, converting every word into phonetic values and comparing them with the user’s requested words. This is horrendously inefficient: Every stemmed or phonetic query would consume about 8-10 megabytes of memory, making it impossible to run more than a few queries simultaneously from different users. The execution time could balloon to as much as 5 minutes.
The second option was to calculate separate stemmed and phonetic lines for each natural language line, and store all three lines in the same database record. This makes the execution time identical to the exact-word search, i.e., less than 10 seconds. Figure 16 below illustrates how this looks inside the database. Note the words played and government, which are correctly stemmed to plai and govern, respectively; however, the words his and prologue are incorrectly assumed to be the inflected forms of the nonexistent stems hi and prologu.
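The second option can be sketched with Python's built-in sqlite3 module. Everything named here is an assumption for illustration: the table and column names are hypothetical, and `toy_stem` is a stand-in that strips three suffixes, not the Porter algorithm (so it produces play where Porter produces plai). A phonetic column would be filled and queried the same way.

```python
import sqlite3


def toy_stem(word):
    """Stand-in for a real stemmer: strips a few common suffixes only."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def stem_line(text):
    """Lower-case a line, trim punctuation, and stem every word."""
    words = [w.strip(".,;:!?'\"").lower() for w in text.split()]
    return " ".join(toy_stem(w) for w in words if w)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE lines (id INTEGER PRIMARY KEY, plain_text TEXT, stem_text TEXT)"
)


def add_line(text):
    """Store the natural-language line and its precomputed stemmed form."""
    conn.execute(
        "INSERT INTO lines (plain_text, stem_text) VALUES (?, ?)",
        (text, stem_line(text)),
    )


def stemmed_search(keyword):
    """Stem the keyword once, then match it against the stored stem column."""
    stem = toy_stem(keyword.lower())
    # Padding both sides with spaces makes the match whole-stem only.
    rows = conn.execute(
        "SELECT plain_text FROM lines "
        "WHERE ' ' || stem_text || ' ' LIKE ? ORDER BY id",
        (f"% {stem} %",),
    )
    return [row[0] for row in rows]
```

The stemming cost is paid once, when a line is added. At query time only the user's keyword is stemmed, so the search is an ordinary text match against a stored column.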
Of the two fuzzy search options, the stemming algorithm appears to be more useful. Metaphone identifies their, there, and they’re as homophones, but for finding certain words, it is useless. To cite one egregious example, searching for guild returns called, could, cold, glad, killed, and quality. Porter stemming has its limitations, particularly with irregular verbs, but it will generally perform as expected. The best way to link an inflected word with its root would be through a brute-force approach: Take at least 100,000 English words, annotated with pronunciations, stems, and any other value worth attaching, and put them in a database table. Then, when the parser is processing the texts, it can look up each word and it will not have to make an educated guess for the stem and the pronunciation – the parser can find that information in the table. Doing that would be simple, but the problem is obtaining the word list, and verifying its quality. Ian Lancashire suggested this approach in 1992:
…with some information not commonly found in traditional paper editions, software can transform texts automatically into normalized or lemmatized forms. One such kind of apparatus suitable for an electronic edition is an alphabetical table of word-forms in a text, listed with possible parts-of-speech and inflectional or morphological information, normalized forms, and dictionary lemmas. With such an additional file, software might then ‘tag’ the text with these features and then transform it automatically into a normalized text or a text where grammatical roles replace the words they describe. Such transformations have useful roles to play in authorship studies and stylistic analysis (Lancashire, “Public-Domain”).
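A brute-force lookup of this kind is simple to express. In the Python sketch below, the five-entry table and the names `WORD_TABLE` and `lemma_for` are hypothetical; a usable table would need on the order of 100,000 verified entries, which is exactly the hard part.

```python
# Hypothetical miniature word table, standing in for a large verified list.
WORD_TABLE = {
    "is": {"lemma": "be"},
    "was": {"lemma": "be"},
    "being": {"lemma": "be"},
    "played": {"lemma": "play"},
    "his": {"lemma": "his"},
}


def lemma_for(word, fallback):
    """Look the word up in the table; guess with `fallback` only if absent."""
    entry = WORD_TABLE.get(word.lower())
    if entry is not None:
        return entry["lemma"]
    return fallback(word.lower())
```

Here `fallback` could be any algorithmic stemmer, so the table overrides the guess for irregular forms such as was, and the algorithm still covers words the table lacks.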
After ten or twelve plays, the text formatting was more or less standardized and complete, and it was just a question of re-formatting the remaining works. Act and scene changes had their own separate lines, so the parser would know where they were. At first, stage directions were a separate category of lines. I found that this was unnecessary, as they could be assigned to a “character” with the identifier of xxx in the database.
Two issues, one minor and one fairly significant, remain with the texts and the database that stores them. There are a small but not inconsiderable number of lines that are attributed to more than one character. Some are marked “Both,” and the speakers are easy to identify from the context. But what to do about lines marked “All”? Should they be attributed to every single character on the stage? Presumably – but how do you determine who is on stage, given the paucity of stage directions in the original texts? That requires editorial discernment that I do not have. Further, since one of my goals was to finish this project before my natural death, I did not want to painstakingly go through hundreds of lines with multiple speakers and figure out who was saying what. Also, this would require increasing the complexity of the database, because each line is assigned to one speaker, and one speaker only (indicated by the field “CharID” in Figure 16). Changing that would mean re-engineering several database tables, as well as all of the pages which use those tables’ data. In the end, every time a line was marked as “Both” or “All,” I created a new character in that play called “Both” or “All.” Not the most satisfactory arrangement, but good enough.
The other issue is fairly significant and noticeable. Between Acts IV and V of Henry IV, Part 2, King Henry IV dies. Until that point, the Moby text refers to “Prince Hal,” and then after his coronation, he is “King Henry V.” Making a computer understand that transition is tricky, for reasons similar to the multi-character lines described above. There is only one name for each character, just as there is only one character for each line. You could have two different characters for Henry, one for Prince Hal and one for the king. If a user wanted to search all of Henry’s lines for the word happy, he would have to know that the same person’s lines were split into two different characters, and perform the search accordingly. That seems too much to expect of the casual user.
So there is still one name for each character, which makes for several goofy-looking passages of dialogue. Take a look at this passage in Henry IV, Part 2, Act 4, Scene 5:
Henry IV. But wherefore did he take away the crown?
[Re-enter PRINCE HENRY]
Lo where he comes. Come hither to me, Harry.
Depart the chamber, leave us here alone.
Exeunt all but the KING and the PRINCE
Henry V. I never thought to hear you speak again.
The choice came down to three possibilities: 1) keeping the character names consistent, no matter whether their name or rank changed, which might cause a small amount of confusion for some readers; 2) crippling the utility of the search function and frustrating users; or 3) re-engineering major portions of the database and re-writing the pages which use them. As with multi-character lines, the amount of time and effort necessary to do proper name changes was not proportional to the results, and I took option number one.
Once the text formatting and parser functions were in a workable status, it was just a question of repeating the same procedure for each play. This is the final procedure for adding a work:
1. Manually enter the character information into the database, including character descriptions. The database also records character abbreviations, so the parser will know that Ham. corresponds to the character of Hamlet.
2. Remove all extraneous information at the beginning of the play (frontispiece, character information, notes, etc.).
3. Perform several search-and-replace operations to properly mark the stage directions, act and scene indicators, and character lines.
4. Eyeball the text, searching for obvious errors.
5. Run the parser on the text. Each time the parser comes across an error, it halts and reports the line number where it choked. The line is then amended.
6. Repeat step 5 until there are no more errors.
7. Display the play on the testbed Web site, again looking for errors that a computer might not catch but a human would see.
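The run, halt, and amend loop in the procedure above might look like the following Python sketch. The names `ParseError`, `ABBREVIATIONS`, and `check_text` are hypothetical, as is the input format (speaker abbreviation first, stage directions in square brackets). The only error it detects is an unknown speaker abbreviation, whereas the real parser presumably caught other problems as well.

```python
class ParseError(Exception):
    """Raised at the first line the checker cannot handle."""

    def __init__(self, line_number, message):
        super().__init__(f"line {line_number}: {message}")
        self.line_number = line_number


# Hypothetical abbreviation table, as it might be loaded from the database.
ABBREVIATIONS = {"Ham.": "Hamlet", "Hor.": "Horatio"}


def check_text(source):
    """Return the number of speeches, or halt at the first unknown speaker."""
    count = 0
    for number, raw in enumerate(source.splitlines(), start=1):
        line = raw.strip()
        # Blank lines and bracketed stage directions have no speaker.
        if not line or line.startswith("["):
            continue
        abbreviation = line.split(" ", 1)[0]
        if abbreviation not in ABBREVIATIONS:
            raise ParseError(number, f"unknown speaker {abbreviation!r}")
        count += 1
    return count
```

Halting at the first failure, rather than collecting every error, matches the workflow described: fix the reported line, run again, and repeat until the text passes cleanly.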
This procedure might seem very complex, and indeed it took many hours to perfect. However, the last fifteen or sixteen plays went very quickly, as it was just a question of repeating the same process over and over. I got to the point where I could finish one or two plays an hour, depending on how many discrepancies there were in the texts.
Next, I moved on to the poems and sonnets. Since I had been working on plays thus far, my database’s schema reflected the structure of a play: Each had an entry in the Plays table, and each play had Acts, Scenes, and Lines. I could have kept using this format behind the scenes, as this schema is largely hidden from the user. But I “universalized” the database schema instead. Plays became Works, Acts became Sections, Scenes became Chapters, and Lines became Paragraphs. Any literary work could be broken into smaller elements by a parser and stored in this schema, if it were used in another project.
The poems are heterogeneous in format, but they were easy to convert, as their structure was fairly simple compared to a play: there are no stage directions, and all of the lines were assigned to a “character” called “Shakespeare.” I decided to treat the sonnets as a single work with one section and 154 chapters.
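The universalized hierarchy, and the way the sonnets fit into it, can be sketched with Python's sqlite3 module. The four table names and CharID come from the text; every other column name, and the function `build_sonnets`, are assumptions made for this example rather than the actual OSS schema.

```python
import sqlite3

# Table names follow the text; most column names are hypothetical.
SCHEMA = """
CREATE TABLE Works (
    WorkID TEXT PRIMARY KEY,
    Title  TEXT NOT NULL
);
CREATE TABLE Sections (
    SectionID INTEGER PRIMARY KEY,
    WorkID    TEXT NOT NULL REFERENCES Works(WorkID),
    Number    INTEGER NOT NULL
);
CREATE TABLE Chapters (
    ChapterID INTEGER PRIMARY KEY,
    SectionID INTEGER NOT NULL REFERENCES Sections(SectionID),
    Number    INTEGER NOT NULL
);
CREATE TABLE Paragraphs (
    ParagraphID INTEGER PRIMARY KEY,
    ChapterID   INTEGER NOT NULL REFERENCES Chapters(ChapterID),
    CharID      TEXT NOT NULL,
    PlainText   TEXT NOT NULL
);
"""


def build_sonnets(conn, count=154):
    """Create the schema and register the sonnets as one work, one section."""
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO Works VALUES ('sonnets', 'Sonnets')")
    section_id = conn.execute(
        "INSERT INTO Sections (WorkID, Number) VALUES ('sonnets', 1)"
    ).lastrowid
    # Each sonnet becomes a chapter, numbered 1 through `count`.
    conn.executemany(
        "INSERT INTO Chapters (SectionID, Number) VALUES (?, ?)",
        [(section_id, n) for n in range(1, count + 1)],
    )
```

A play would use the same tables with several sections (acts) and chapters (scenes), which is the point of the generic names.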
The final texts of Open Source Shakespeare do differ somewhat from the Moby edition, though the differences are not substantive. OSS adds a through line-numbering (TLN) system, which means that within each play, the line numbering starts at the beginning and continues through to the end, without restarting the numbering at act and scene divisions. The Norton edition uses TLN, as do other electronic editions such as the Internet Shakespeare Editions; the Variorum Handbook mandates TLN (Variorum 22). The advantage of TLN is that from the line number, you get a rough idea of where the line falls in the play. Scene-by-scene numbering shows where a line falls within a particular scene. In my opinion, TLN is the better system overall, because the length of the plays differs much less than that of individual scenes, and thus what it conveys is more useful. The Variorum Handbook and others number the titles of the play as “0,” or “0.1, 0.2” etc. for multi-line titles. In OSS, the play titles are considered attributes of the play, not a part of it. Act and scene indicators are also removed from the text itself, although the scene’s setting (e.g., “Another part of the forest”) is captured and stored as an attribute of the scene.
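The two numbering schemes differ only in when the counter resets, as this Python sketch shows. The function name `number_lines` and the input shape (a list of act, scene, and line-text tuples) are assumptions for illustration. Titles and act or scene indicators are not passed in, since OSS treats them as attributes rather than numbered lines.

```python
def number_lines(scenes):
    """Attach both a scene-relative number and a through line number (TLN).

    `scenes` is a list of (act, scene, [line texts]) tuples in play order.
    """
    numbered = []
    tln = 0
    for act, scene, texts in scenes:
        # The scene-relative counter restarts at 1 for every scene...
        for scene_line, text in enumerate(texts, start=1):
            # ...while the TLN counter runs on through the whole play.
            tln += 1
            numbered.append(
                {
                    "act": act,
                    "scene": scene,
                    "scene_line": scene_line,
                    "tln": tln,
                    "text": text,
                }
            )
    return numbered
```

Storing both values costs little and lets a page display whichever numbering a reader expects.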