New TEI XML digital editions by the Perseus Project

Plutarch, Athenaeus, Elegy and Iambus, the Greek Anthology, Lucian and the Scaife Digital Library – 1.6 million words of Open Content Greek
(on Stoa.org by Gregory Crane)

The Perseus Digital Library is pleased to publish TEI XML digital editions for Plutarch, Athenaeus, the Greek Anthology, and for most of Lucian.

This increases the available Plutarch from roughly 100,000 to the surviving 1,150,000 words. Athenaeus and the Greek Anthology are new within the Perseus Digital Library, with roughly 270,000 and 160,000 words of Greek. The 13,000 words for J. M. Edmonds’ Elegy and Iambus include both the surviving poetic quotations and major contexts in which these poems are quoted. The 200,000 words of Lucian represent roughly 70% of the surviving works attributed to that author. In all, this places more than 1.6 million words of Greek in circulation.

The Need for Open Content Source Texts

It has been a decade since we published new Greek sources. There is nothing glamorous about digitizing source texts, and there are many more exciting research projects to explore as Classics in particular and the Humanities in general reinvent themselves within the digital world. Nevertheless, in working with our colleagues, we have come to the conclusion that the most important desideratum for the study of Greek is a library of Greek source texts that can be used and repurposed freely. Machine-readable texts are our Genome. We have therefore undertaken to help fill this vacuum. Support from various sources – including the National Endowment for the Humanities (NEH), the Mellon Foundation, the Institute of Museum and Library Services, the UK’s Joint Information Systems Committee (JISC), the Deutsche Forschungsgemeinschaft (DFG), and the Cantus Foundation – put us in a position where we could begin to contribute new Greek sources. A Digital Humanities Grant from Google helped complete the work published here and will allow us to release more Greek (http://googleblog.blogspot.com/2010/07/our-commitment-to-digital-humanities.html).

Our goal is not simply to provide services such as morphologically aware searching but to provide the field with Greek texts that anyone can reedit, annotate, and modify as they wish. We offer these texts both because they are useful as they stand and as raw material on which students of Greek can build. We look forward to seeing versions of these texts in Chicago’s Philologic, the Center for Hellenic Studies’ First Thousand Years of Greek, and many other environments.

Creative Commons License

This is not the first time that these authors have been placed in digital form but this is the first time that they have been published under an open content license. All of the print sources are in the public domain. In creating these new digital editions we have chosen to apply a Creative Commons (CC) (http://creativecommons.org/) license with minimal restrictions. We have removed the non-commercial restriction that we adopted in March 2006 when we first began making our XML source texts available under a CC license. We expect those who use these digital texts to attribute their source to Perseus and to make any changes that they make to these texts available under the same conditions. Perseus will provide credit for any changes that it integrates into its versions of these texts. Projects such as Chicago’s Perseus under Philologic, the Open University’s Hestia Project, and the Center for Hellenic Studies’ First Thousand Years of Greek, as well as (outside of Classics) the Nameless Shakespeare and the University of Richmond’s Digital Scholarship Lab have already used sources contributed by Perseus as the starting point for additional work. We hope to see such efforts expand further in the future.

We have not created these digital editions to generate revenue or to underlie a proprietary service. In the now classic 1922 student textbook Argonauts of the Western Pacific, the Polish anthropologist Bronislaw Malinowski described with wonder and admiration the complex system of gift exchange that obtained among the Trobriand people of the Kiriwina islands. Students of Homer and of Archaic Greek culture will find in this society echoes of the generosity and gift-giving in which Greeks took pride. We take these texts out of the sphere of market exchange. We offer them both as a gift and as a challenge for students of Greek to improve what we have done. You may use texts to make money but you must share your versions of these texts as gifts to others.

Generations of scholars worked on these texts and it is our privilege to make these sources available to students of Greek at all levels and throughout the world. Copyright and licensing restrictions have prevented us from drawing upon the most recent editions for these works – we particularly mention Konrat Ziegler, the editor of Plutarch. Ziegler was in 1938 sentenced to two years imprisonment for helping a Jewish friend escape Nazi Germany and, after his release, hid the daughter of a Jewish friend.

We are confident that the contributions of recent scholars such as Ziegler will find their just place in subsequent versions of these digital sources. Our goal was to create digital sources with which those who love Greek could work and on which they could build without fear. You should not have to worry that a project director will cut off your access to the sources on which your research depends. Scholars should not have to fear lawsuits from commercial publishers or their agents for working with public domain data that was digitized with federal money but is now protected by a proprietary license and used to generate commercial revenue. You are free to change these texts and to create new works. You are free to act as students and as scholars, enabling the words of these ancient texts to take life again within the minds of our contemporaries and of future generations. The Trobriand islanders whom Malinowski knew would have understood the spirit of scholarship immediately. This is not mere gimwali, the game of commercial exchange, but kula, an exchange of gifts and a challenge to the generosity of those who make use of these texts.

Preservation and Curation

We consider preservation to be a problem that is, for practical purposes, solved. The texts that we are publishing now – as well as all the texts and other objects of persistent value in Perseus – are parts of the permanent digital collections at Tufts University and will be preserved, along with other university collections, by the Digital Collections and Archives and whatever organizations may succeed it. The best thing that scholars can do is to create objects that librarians can preserve, for it is libraries that have preserved our collections for generations in the past and are designed to do so in the future. Our digital repositories cannot yet work very well with the content of these XML files but they are quite capable of preserving the files as sequences of bits. At the same time, the open license means that anyone can replicate these sources and that there can be many copies of these sources outside of the preservation systems that our librarians develop.

Curation involves modification and improvement of the content. This can involve formal transformations (e.g., the conversion, hopefully automatic, to a future version of the Text Encoding Initiative (TEI) Guidelines). A great deal of work needs to be done. Some of this work is fairly basic in nature and is easily defined. But most of the work to be done is open-ended and involves the evolution of truly digital annotations that subsume and supersede the outmoded instruments of print editions.

All of the print editions on which we draw are in the public domain and most are available for free download as PDF files either from Archive.org or from Google Books. We have encoded the page numbers of the print sources in the digital files so that readers can compare digital editions with their print sources and can consult the textual notes for any given passage (as well as the introduction and other information).

Textual notes

The decision not to enter textual notes warrants additional explanation. Classicists have bemoaned the absence of variant readings in their digital source texts since the Thesaurus Linguae Graecae (TLG) began development almost forty years ago, but no project has emerged to create a comprehensive database of variants.

We at Perseus have worked on textual variants over the years. In the 1990s, we used the then new TEI Guidelines to create dynamic editions in which readers could compare different versions of the same text. We chose to begin work with English sources because these had a broader immediate audience and thus seemed better suited as a demonstration project.

Hilary Binda created a digital edition for the surviving plays of Christopher Marlowe. For works with minor textual variants (such as Dido or Tamburlaine the Great) she encoded the variants in a machine actionable form.

<sp who="myce">
<speaker>Mycetes</speaker>
<l>Brother, I see your meaning well enough.</l>
<l>And thorough your
<app>
<lem>Planets</lem>
<rdg wit="Coll">plainness</rdg>
</app>
I perceive you thinke,</l>
</sp>

Marlowe was appealing because his work not only allowed us to examine the problems of representing variants on a single more-or-less uniform source but also challenged us to think about cases where more than one very different version exists. Two versions of Doctor Faustus survive and neither can be easily reduced to the other nor to a single text. In this case Binda encoded links between the versions so that we could compare the texts to each other. David A. Smith, now a member of the Computer Science Faculty at UMass Amherst, then developed visualization tools within the Perseus Website so that readers could dynamically explore both minor variants and the two versions of Doctor Faustus.

Since we were not able to convert variants into a machine actionable form, we included textual notes as footnotes, with the idea that others could then systematize this data (e.g., http://www.perseus.tufts.edu/hopper/text?doc=Perseus:text:1999.02.0010:text=Catil.). But even when we have a clean transcription of the textual notes, we often cannot match notes to the relevant sections of the text, much less parse the notes themselves (Boschetti 2007). Human editors are much less consistent in their practice than they realize – a phenomenon that surprised and slowed those who, a generation ago, created a digital version of the Oxford English Dictionary (Raymond and Tompa 1987). The TEI, recognizing this problem, created for print dictionaries a much looser document type definition than that recommended for born-digital dictionaries. In a significant number of cases, we were not able to understand the notes ourselves – a phenomenon reported also by members of the Homer Multitext Project.

With the rise of vast collections with high-resolution images of each page in a printed book, the situation changed. Scholars can simply consult page images of print resources. We need searchable text so that we can find the variants on a page. During 2006/2007, support from the NSF allowed Gordon Stewart, then a recent BA in Classics and now a PhD candidate at Princeton, to publish an evaluation of optical character recognition (OCR) software for Classical Greek. Stewart used systems optimized for modern Greek and trained them for Classical Greek – in effect, Stewart trained the systems to ignore accents and found that, for a modern typeface such as that used by the Loeb Classical Library, OCR could generate text that captured the alphabetic data (i.e., unaccented Greek) with an accuracy of 99.7% and with accuracy of 98.5% for a range of Greek fonts (Stewart et al. 2007).
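To make the idea of "alphabetic data (i.e., unaccented Greek)" concrete, here is a minimal sketch of how accented Greek can be reduced to its unaccented letters before OCR output is compared with a reference transcription. It is an illustration only, not the procedure Stewart used; the function name `strip_diacritics` is an assumption, and the approach relies on Unicode decomposition in Python's standard `unicodedata` module.

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Remove accents, breathings and other combining marks from Greek text."""
    # NFD splits each precomposed letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Keep only characters that are not combining marks.
    bare = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", bare)

print(strip_diacritics("μῆνιν ἄειδε θεά"))  # μηνιν αειδε θεα
```

Note that this also removes the iota subscript, which Unicode treats as a combining mark; a stricter comparison might want to keep it.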

Stewart also measured the number of Greek words that only occurred in the textual notes. An exploratory survey of 10 Oxford Classical Text and Teubner Editions showed that 14% of the Greek words on a given page only occurred in textual notes. For Loeb editions, which report only what the editor considers to be the most important variants, he found that 4% of the Greek words only appeared in the notes.

The results of these measures are significant. Even when we have perfect transcriptions of a reconstructed text, we only have at most 96% of the relevant text. Put another way, scholars will find more of the relevant data searching text that contains a few errors but also includes output for the source text and variants.

Methods to perform OCR on Greek have also improved. During the 2008/2009 academic year, Federico Boschetti, working as a member of the Mellon-funded Cybereditions Project, was able to generate accurate results for accented Greek, and he developed methods to compare the output of multiple OCR engines to detect and correct errors (Boschetti et al. 2009). Accents remain challenging: accuracy for accented Greek has not exceeded 97%. The vast majority of these errors, however, involve accentuation and many of these can be corrected automatically.

In light of the above, we have decided on a two-fold approach. First, we continue to create accurate transcriptions of the reconstructed texts but we include as well page numbers that become dynamic links to the digital images of those pages in collections such as Google Books or the Internet Archive. Second, we will rely upon OCR-generated text for searching the textual notes themselves.

But if automated methods do not lend themselves to analyzing textual notes they are much more successful at comparing different versions of the same text. The venerable diff utility in Unix has provided serviceable text comparison software for a generation. Programs such as Collate and the Versioning Machine have been developed to support humanists working with different versions of the same text.

For twenty-one of Plutarch’s Moralia we have provided, in addition to Bernardakis’ Teubner Edition, the Greek text that Babbitt prints in the Loeb Classical Library so that readers can experiment with text comparison systems such as Collate and the Versioning Machine.
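For readers who want to try such comparisons without installing a dedicated tool, the following is a minimal sketch of a word-level comparison using `difflib` from the Python standard library. It is not how Collate or the Versioning Machine work internally; the function name `word_diff` is an assumption, and the two sample readings are the Marlowe variants quoted earlier rather than Plutarch.

```python
import difflib

def word_diff(text_a: str, text_b: str):
    """Return (operation, words_in_a, words_in_b) for every span that differs."""
    words_a, words_b = text_a.split(), text_b.split()
    matcher = difflib.SequenceMatcher(a=words_a, b=words_b, autojunk=False)
    changes = []
    for op, a1, a2, b1, b2 in matcher.get_opcodes():
        if op != "equal":  # keep only replacements, insertions and deletions
            changes.append((op, " ".join(words_a[a1:a2]), " ".join(words_b[b1:b2])))
    return changes

edition_a = "And thorough your Planets I perceive you thinke,"
edition_b = "And thorough your plainness I perceive you thinke,"
print(word_diff(edition_a, edition_b))  # [('replace', 'Planets', 'plainness')]
```

Comparing by word rather than by character keeps the report close to what an apparatus records, at the cost of treating a one-letter difference as a whole-word change.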

Even when we have only error-filled OCR text, we may have enough intact strings to align texts with one another, so that scholars can compare page images of the same chunk of text in many editions. Ultimately, our editions can and should include the full history of the text, including not only manuscripts and other witnesses but also printed editions and published conjectures. Scholars can then browse, mine, and visualize this data according to their needs. These needs include not only reconstruction of the source text but also seeing how printed texts of important works changed over time and analyzing the relationship between editions. The Homer Multitext Project and Aeschylus Project, led by Vittorio Citti, Federico Boschetti, Francesco Mambrini, Matteo Romanello, and others, are among those efforts that are exploring such data driven editions.
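The notion of "intact strings" can be illustrated with a small sketch that finds runs of identical words shared by a clean transcription and a noisy OCR version; such runs could serve as anchors for alignment. This is an assumption about one possible approach, not a description of any Perseus tool. The function name `shared_anchors`, the three-word threshold, and the corrupted sample line are all invented for the example.

```python
import difflib

def shared_anchors(clean: str, noisy: str, min_words: int = 3):
    """Return (clean_index, noisy_index, words) for shared runs of min_words or more."""
    clean_words, noisy_words = clean.split(), noisy.split()
    matcher = difflib.SequenceMatcher(a=clean_words, b=noisy_words, autojunk=False)
    anchors = []
    for block in matcher.get_matching_blocks():
        if block.size >= min_words:  # ignore short matches that may be coincidental
            run = " ".join(clean_words[block.a:block.a + block.size])
            anchors.append((block.a, block.b, run))
    return anchors

clean_line = "Brother, I see your meaning well enough."
ocr_line = "Br0ther, I see your meaning we11 enough."  # invented OCR errors
print(shared_anchors(clean_line, ocr_line))  # [(1, 1, 'I see your meaning')]
```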

English Translations

We have included English translations for some but not all of the Greek texts that we are publishing. In part, this reflects the fact that it is much easier to digitize English translations than Greek texts. While we hope to be able to add the translations, members of the community could download the source texts for Athenaeus, the Greek Anthology, and Lucian, and create clean XML from the OCR.

How these digital editions were produced

OCR software, applied to scanned images of the print sources, produced the raw material for these digital editions. In correcting these texts we worked closely with our colleagues at Digital Divide Data (DDD), a non-profit company in Cambodia developed to engage workers in the global economy. Members of the DDD team quickly refined the raw OCR output, marking the boundaries between headers, text, and notes, while correcting and marking the citation data in the source texts. We then analyzed the document structures and encoded them in TEI XML. We also drew upon the Morpheus Greek morphological analysis system to identify words with possible errors in the Greek OCR – from 5 to 15% of the words in each text. Members of the DDD team then went through and added corrections. We are grateful to Linda Thomas, the US representative for DDD, and to Sambo Sdok, our colleague in Cambodia who bore with our requests and whose team toiled to make freely available to the world a part of our shared cultural heritage. Rashmi Singhal, the lead programmer at Perseus from 2007 through late 2010, developed the workflow and did a great deal of work improving the TEI XML, while Bridget Almas, who succeeded Rashmi in November 2010, has improved the markup and loaded the new texts into the Perseus Digital Library system. Lisa Cerrato has for years managed the operation as a whole, scanning books where sources from Google or the Internet Archive were unavailable or illegible, correcting and adding XML markup to the English translations of Plutarch, and performing innumerable tasks that make this work possible.
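As a rough illustration of the error-flagging step described above, the sketch below treats any token that is absent from a list of accepted forms as a possible OCR error. The `KNOWN_FORMS` set and the `flag_suspect_words` function are hypothetical stand-ins; they do not reproduce Morpheus or its interface, and the corrupted sample word is invented.

```python
# Hypothetical stand-in for a morphological analyzer: a tiny set of accepted
# Beta Code forms. A real workflow would consult an analyzer such as Morpheus.
KNOWN_FORMS = {"w(/sper", "o)/noi", "mega/lois", "a)/xqesi", "teiro/menoi"}

def flag_suspect_words(text, known_forms=KNOWN_FORMS):
    """Return (position, token) for every token not found in known_forms."""
    suspects = []
    for position, raw in enumerate(text.split()):
        token = raw.strip(",.:;")  # drop punctuation attached to the word
        if token and token not in known_forms:
            suspects.append((position, token))
    return suspects

# Invented OCR output with one corrupted word ("mega/1ois" for "mega/lois").
ocr_line = "w(/sper o)/noi mega/1ois a)/xqesi teiro/menoi,"
print(flag_suspect_words(ocr_line))  # [(2, 'mega/1ois')]
```

A list like this only tells a human corrector where to look; a flagged word may be a genuine but rare form rather than an error.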

Readers of these electronic texts will, for now, find evidence of their origin from OCR software. The words of the text still contain some errors but most remaining problems are with punctuation and encoding. Error detection software does a much better job of identifying spelling errors than missing colons and commas, while random apostrophes, commas and periods are still to be found – usually a testimony to specks on the scanned page image.

Other errors involve markup. We have not always included the paragraph breaks of original editions (nor were these necessarily a high priority for us). We often include more than one citation scheme for each text (e.g., both book/chapter/section numbers and Stephanus pages). It is easy to detect when we jump from section 3 to section 5 but harder to detect when we have missed the citation marker for section 5 where 5 is the last section marker in a chapter.
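The asymmetry between the two kinds of citation error can be shown with a short sketch. It assumes section markers have already been extracted as integers for one chapter; the function name `find_numbering_gaps` is hypothetical.

```python
def find_numbering_gaps(section_numbers):
    """Return (previous, current) pairs where numbering does not advance by one."""
    gaps = []
    for previous, current in zip(section_numbers, section_numbers[1:]):
        if current != previous + 1:
            gaps.append((previous, current))
    return gaps

# A missing marker in the middle of a chapter shows up as a jump.
print(find_numbering_gaps([1, 2, 3, 5, 6]))  # [(3, 5)]

# A missing final marker leaves no jump, so nothing is reported even if
# the print chapter actually had a section 5.
print(find_numbering_gaps([1, 2, 3, 4]))  # []
```

Catching the second case would need outside information, such as the expected number of sections per chapter taken from another edition.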

We wanted to provide texts that would distinguish between quotations of external sources and the core text – students of Plutarch or Athenaeus need to be able to filter out quotations of Homer or Greek Comedy when they are analyzing the prose of these authors. We have therefore labeled quotations of poetry wherever possible. Our colleagues at DDD are not experts in Classical Greek and there were times when the distinction between poetry and prose was not entirely clear. We have corrected many instances where quoted poetry was not marked and where prose was marked as poetry, but more surely remain. We have begun to use the TEI QUOTE tag to indicate a quotation that comes from some other source.

Authors such as Athenaeus often quote passages from drama and a number of poems in the Greek Anthology contain speakers. We have only begun to provide the TEI markup for speeches and speakers.

Likewise, we tried to use the TEI Q tag to mark quotations within the text such as:

“O Solon,” he said, “I do not think this is wise.”

The goal is to facilitate the analysis of different linguistic registers such as narrative and conversational prose – the Powell lexicon of Herodotus, for example, distinguishes words that appear in quoted text from those that appear in the narrative. This distinction often breaks down, however, especially when we find, particularly in dialogues, long speeches that quickly shift in style to expository prose.

Readers will quickly see that the distinction between Q and QUOTE tags is far from consistent. Those analyzing these texts will, however, get good results if they assume that any tag that contains within it lines of poetry (marked with L or LG tags) is not part of the main narrative and comes from an external source. Quotations and paraphrases from prose sources in authors such as Plutarch and Athenaeus are a much more complex topic.
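The heuristic just described can be sketched in a few lines using `xml.etree.ElementTree` from the Python standard library. The sketch assumes lowercase element names without an XML namespace, and assumes that verse lines are tagged `l` (optionally grouped in `lg`); the function names and the sample paragraph are invented for illustration.

```python
import xml.etree.ElementTree as ET

QUOTE_TAGS = {"q", "quote"}   # assumed element names, no namespace
VERSE_TAGS = ("l", "lg")      # assumed tags for verse lines and line groups

def contains_verse(elem):
    """True if the element has any verse line or line group beneath it."""
    return any(elem.find(f".//{tag}") is not None for tag in VERSE_TAGS)

def narrative_text(xml_source):
    """Return the text with every verse-bearing quotation left out."""
    root = ET.fromstring(xml_source)
    parts = []

    def collect(elem):
        if elem.tag in QUOTE_TAGS and contains_verse(elem):
            return  # external poetry: skip the whole element
        if elem.text:
            parts.append(elem.text)
        for child in elem:
            collect(child)
            if child.tail:  # text after a child belongs to the parent
                parts.append(child.tail)

    collect(root)
    return " ".join("".join(parts).split())

sample = ("<p>He said <q>O Solon, I do not think this is wise.</q> Homer agrees: "
          "<quote><l>a line of verse</l></quote> So the story goes.</p>")
print(narrative_text(sample))
# He said O Solon, I do not think this is wise. Homer agrees: So the story goes.
```

Direct speech without verse is kept, so this filter separates quoted poetry from prose but does not separate narrative from conversational registers.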

Beyond interpretation and representation of previous sources

Much of what has been described so far involves interpreting print sources and then representing them in machine actionable form. As we realize more of the possibilities of digital publication, we soon find ourselves adding information that is not implicit on the print page (such as the syntactic analyses found in the Greek and Latin Treebanks and available at http://nlp.perseus.tufts.edu/syntax/treebank/). At some point, we may have a perfect transcription of the print source but the digital edition has its own logic.

If we take seriously the issue of identifying quotations and paraphrases, then we rapidly move beyond encoding features implicit in the print source and into the study of historical sources and of fragmentary historians. The PhiloGrid Project (funded by NEH and JISC) and the Cybereditions Project (funded by Mellon) allowed Monica Berti, a Classicist and editor of Greek fragmentary historians, and Matteo Romanello, a Digital Classicist, to spend six months studying the opportunities and challenges involved in working with Greek and Latin works which survive only insofar as surviving works quote them. Most Greek and Latin sources, in fact, only survive in fragmentary forms, which can include verbatim quotations, paraphrases or allusions. The German eAqua Project helped Computer Science PhD-candidate Marco Büchler spend six months at Perseus as well, where he began a collaboration with Monica Berti, applying methods to detect text reuse, developed to find quotations of Plato, to the analysis of fragmentary authors (Berti and Büchler 2010). None of this work is feasible, however, unless the researchers are able to analyze and republish in annotated form the source texts that quote, paraphrase and cite lost works. Digital editions of fragmentary authors must be hypertextual databases that link reconstructed fragments to the various sources in which they occur (Berti et al. 2009). Print editions of fragmentary authors are static collections of excerpts. Editors of fragmentary works must have access to digital versions of the sources that preserve those fragments.

We therefore chose to enter all of Plutarch and Athenaeus precisely because these authors quote, paraphrase, and cite thousands of passages from works that no longer survive. We have published these works so that not only Marco, Matteo and Monica but all students of fragmentary authors and of Plutarch and Athenaeus can use them (for more on the importance of quotations within Athenaeus see Braund and Wilkins 2000, Lenfant 2007, and Jacob 2001).

While fragmentary collections should, in our view, consist of annotated links on top of authors such as Plutarch and Athenaeus, we cannot work only with Plutarch and Athenaeus. In order to experiment with a collection that references a more comprehensive set of sources for particular authors (as opposed to mining exhaustive references to authors from selected sources), we have included a digital version of J. M. Edmonds’ edition of Elegy and Iambus. In this case, we have used embedded TEI CIT tags to represent two basic structures. A CIT tag marked with an identifier marks what Edmonds has designated as the fragment. This CIT is placed within a larger CIT structure that represents the content from the surviving author who quotes this fragment.

For example:

<div2 type="elegiac" n="6,7">
<cit id="tlg-0266.cit.23"><quote>e)s timwri/as de\ a(\s u(/brizon e)s tou\s *messhni/ous *turtai/w| pepoihme/na e)sti/n:
<cit id="tlg-0266.cit.24"><quote>
<l>w(/sper o)/noi mega/lois a)/xqesi teiro/menoi,</l>
<l>desposu/noisi fe/rontes a)nagkai/hs u(/po lugrh=s</l>
<l>h(/misu panto\s o(/son<note n="p.66.n.3"/> karpo\n a)/roura fe/rei.</l></quote><bibl>CURFRAG.tlg-0266.4</bibl></cit>
<p>o(/ti de\ kai\ sumpenqei=n e)/keito au)toi=s a)na/gkh, dedh/lwken e)n tw=|de:</p>
<cit id="tlg-0266.cit.25"><quote>
<l>despo/ta=s oi)mw/zontes o(mw=s a)/loxoi/ te kai\ au)toi/,</l>
<l>eu)=te tin' ou)lome/nh moi=ra ki/xoi qana/tou.</l></quote> <bibl>CURFRAG.tlg-0266.5</bibl></cit>
</quote> <bibl>Paus. 4. 15.5</bibl></cit>
</div2>

Thus in the passage above, a larger CIT represents an excerpt from Pausanias which contains two quotations of Tyrtaeus.
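To show how this nesting can be read by a program, here is a minimal sketch that pairs each fragment with the passage that quotes it. It uses `xml.etree.ElementTree` from the Python standard library on a shortened copy of the passage (with plain ASCII quotation marks around attribute values); the function name `fragments_with_sources` and the dictionary keys are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the passage above; a raw string keeps Beta Code backslashes.
SAMPLE = r"""<div2 type="elegiac" n="6,7">
<cit id="tlg-0266.cit.23"><quote>e)s timwri/as de\ a(\s u(/brizon e)s tou\s *messhni/ous *turtai/w| pepoihme/na e)sti/n:
<cit id="tlg-0266.cit.24"><quote>
<l>w(/sper o)/noi mega/lois a)/xqesi teiro/menoi,</l>
</quote><bibl>CURFRAG.tlg-0266.4</bibl></cit>
<cit id="tlg-0266.cit.25"><quote>
<l>despo/ta=s oi)mw/zontes o(mw=s a)/loxoi/ te kai\ au)toi/,</l>
</quote> <bibl>CURFRAG.tlg-0266.5</bibl></cit>
</quote> <bibl>Paus. 4. 15.5</bibl></cit>
</div2>"""

def fragments_with_sources(xml_source):
    """Pair each inner CIT (a fragment) with the outer CIT's citation."""
    root = ET.fromstring(xml_source)
    results = []
    for outer in root.findall("cit"):              # excerpt from the quoting author
        quoted_in = outer.findtext("bibl")         # e.g. the Pausanias reference
        for inner in outer.findall("quote/cit"):   # fragments nested in the excerpt
            lines = ["".join(line.itertext()).strip() for line in inner.iter("l")]
            results.append({
                "id": inner.get("id"),
                "fragment": inner.findtext("bibl"),
                "quoted_in": quoted_in,
                "lines": lines,
            })
    return results

for entry in fragments_with_sources(SAMPLE):
    print(entry["id"], entry["fragment"], "in", entry["quoted_in"])
# tlg-0266.cit.24 CURFRAG.tlg-0266.4 in Paus. 4. 15.5
# tlg-0266.cit.25 CURFRAG.tlg-0266.5 in Paus. 4. 15.5
```

The sketch handles only one level of nesting, which matches the two structures described above; deeper nesting would need a recursive walk.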

We chose to include the Greek Anthology both because of its inherent interest and because it constitutes a technical challenge analogous to that of fragmentary authors. Support from the Cybereditions Project has allowed Alison Babeu, the Digital Librarian at Perseus, to develop an extensible and growing catalogue of Greek and Latin sources. Reference works such as the TLG Canon are not so much bibliographies as they are checklists of editions currently included in the TLG. They are thus closer in spirit and substance to the lists of cited editions at the start of the Liddell Scott Jones Lexica and Lewis and Short for Greek and Latin. Library catalogues, by contrast, provide catalogue records and unique identifiers for authors, and these work in large and complex systems, but library catalogues focus on authors and subjects of whole books. The Greek Anthology provides an example of the kind of work that emerging library systems need to support. Readers are often less interested in books and pages and want instead to find all works attributed to a particular author, whether these are prose speeches or epigrams scattered throughout a larger collection. The Digital Greek Anthology illustrates how such a work can be structured.

Where Plutarch and Athenaeus quote and paraphrase many sources, the Greek Anthology is not a work by a single author but a collection of works by many different poets. While we have followed the Loeb as a source text we have generally followed the attributions of poems to individual authors as they appear in the Beckby edition (Beckby 1965-1968). Others may wish to add competing attributions, while access to an unencumbered text will help researchers apply computational methods to the author attribution problem.

The work on Plutarch, Athenaeus, Elegy and Iambus, and the Greek Anthology builds on, and provides the raw material to continue, work with fragmentary authors in the NEH/JISC PhiloGrid Project. A summer 2010 Google Digital Humanities Award contributed to the digitization of Athenaeus and the Greek Anthology, and has allowed us to begin adding new authors to the Perseus collections. The works of Lucian published here represent the first offerings in a series of new authors made possible by Google.

The Scaife Digital Library

These authors do not just expand what Perseus can offer but are also preliminary offerings for the Scaife Digital Library, a distributed collection of open content named after the late Ross Scaife. Hellespont, a new project funded by NEH and DFG, will allow us to upgrade our texts to TEI P5 and, even more importantly, to revise and document the markup. The texts published here will be the first such texts, with the rest of the Perseus collections following. Support from the Bamboo Project is also allowing us to work on the infrastructure needed to dynamically integrate sources from Perseus with those from other projects.

References

Hermann Beckby. (1965-1968). Anthologia Graeca. München : Heimeran.

Monica Berti, Matteo Romanello, Alison Babeu, Gregory Crane. (2009). “Collecting Fragmentary Authors in a Digital Library (Greek Fragmentary Historians).” In Proceedings of the 9th ACM/IEEE-CS joint conference on Digital libraries (JCDL 2009), pages 259-262. http://www.perseus.tufts.edu/publications/JCDL09_sp.pdf

Monica Berti and Marco Büchler. (2010). “Fragmentary Texts and Digital Collections of Fragmentary Authors.” Digital Classicist 2010 Works in Progress Seminar, http://www.digitalclassicist.org/wip/wip2010-08mb.pdf

Federico Boschetti. (2007). “Methods to Extend Greek and Latin Corpora with Variants and Conjectures: Mapping Critical Apparatuses onto Reference Text.” In CL 2007: Proceedings of the Corpus Linguistics Conference (27-30 July 2007) http://corpus.bham.ac.uk/corplingproceedings07/paper/150_Paper.pdf

Federico Boschetti, Matteo Romanello, Alison Babeu, David Bamman, Gregory Crane. (2009). “Improving OCR Accuracy for Classical Critical Editions.” In Proceedings of the 13th European Conference on Research and Advanced Technology for Digital Libraries (ECDL 2009), pages 156-167, Corfu Greece: Springer Verlag, 2009-09. http://www.perseus.tufts.edu/publications/ecdl2009-preprint.pdf

David Braund and John Wilkins (eds.). (2000). Athenaeus and His World. Reading Greek Culture in the Roman Empire. Exeter: University of Exeter Press.

Christian Jacob. (2001). “Ateneo o il Dedalo Delle Parole.” in L. Canfora (ed.), Ateneo. I Deipnosofisti. Roma: Salerno Editrice.

Dominique Lenfant (ed.) (2007). Athénée et les Fragments d’historiens. Paris: De Boccard.

Darrell R. Raymond and Frank Wm. Tompa. (1987). “Hypertext and the New Oxford English Dictionary.” In HYPERTEXT ’87: Proceedings of the ACM conference on Hypertext, http://portal.acm.org/citation.cfm?id=317438.

Gordon Stewart, Gregory Crane, and Alison Babeu. (2007). “A New Generation of Textual Corpora: Mining Corpora from Very Large Collections.” In Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries (JCDL 2007), pages 356-365, Vancouver, British Columbia: ACM Digital Library, 2007. http://hdl.handle.net/10427/14853
