TEI Encoding Guidelines for Keyboarding Vendors

University of Virginia Library

About This Document

This document describes the markup practices for keyboarding/encoding vendors to follow when producing TEI documents for the University of Virginia Library.

Title: TEI Encoding Guidelines for Keyboarding Vendors
Author: Greg Murray, University of Virginia Library
URL: http://pogo.lib.virginia.edu/dlps/public/text/vendor/vendor.html
Last modified: March 2, 2009


I. Introduction

The DTD

The University of Virginia Library (U.Va. Library) uses a customization of TEI (version P4).

Note: This document is not a tutorial on TEI; it assumes prior experience with TEI encoding. It describes only the encoding practices specific to U.Va. Library.

Obtaining the DTD:

Invoking the DTD: The main DTD driver file is tei2.dtd. The required U.Va.-specific modification files are uva-dl-tei.ent and uva-dl-tei.dtd. Together they constitute a customization of TEI P4. To invoke the DTD, refer to the main DTD driver file in the DOCTYPE declaration, and include ENTITY declarations for the modification files in the internal subset:

<!DOCTYPE TEI.2 SYSTEM "tei-p4/tei2.dtd" [
<!ENTITY % TEI.extensions.ent SYSTEM "uva-dl-tei/uva-dl-tei.ent">
<!ENTITY % TEI.extensions.dtd SYSTEM "uva-dl-tei/uva-dl-tei.dtd">
]>

As this example declaration shows, the TEI P4 files (which are unmodified — they remain exactly as originally downloaded from www.tei-c.org) and the U.Va. Library modification files can reside in different directories.

uva-dl-tei vs. TEI Lite: Encoders familiar with TEI Lite (or with unmodified TEI P4) will notice that our DTD (uva-dl-tei) is generally more rigorous or strict than TEI Lite. For example:

uva-dl-tei also adds several convenience elements not available in TEI Lite:


General Guidelines

XML: All documents should be encoded in the XML expression of TEI (not SGML).

The transcription: The electronic text should contain an exact character-for-character transcription of the text of the print source.

Including all content: With very few exceptions (see Exceptions immediately below), all content from the print source must be included in the electronic text. All textual data must be included in the transcription, and all non-textual data must be included in the markup as figures.

Exceptions: The only exceptions to the "include all content" rule are:

Providing notes: While transcribing and encoding even the most typical, straightforward materials, some problems or uncertainties are likely to arise. Any unusual circumstances encountered in the transcription or encoding process should be noted in a document that gets delivered to U.Va. Library along with the encoded XML files themselves. We prefer to receive one such "Notes" document for each electronic text (rather than one document for an entire encoding project). The document should include information on these kinds of situations:


II. Major Structure

Essential Structure

Follow the essential structural markup common to most TEI documents:

<TEI.2>
    <teiHeader>
        . . . [metadata section supplied by U.Va. Library to the keyboarding vendor]
    </teiHeader>
    <text>
        <front>
            . . . [front matter: title page, acknowledgments, dedication, table of contents, preface, etc.]
        </front>
        <body>
            . . . [main body of text]
        </body>
        <back>
            . . . [back matter: endnotes, index, appendices, etc.]
        </back>
    </text>
</TEI.2>

TEI header: The <teiHeader> element will be supplied by U.Va. Library to the keyboarding vendor.

Composite texts: In rare cases, U.Va. Library will request that a particular document be marked as a composite text, in which the usual <body> element is replaced with the <group> element, which then contains multiple <text> elements, each with its own <front>, <body>, and <back>. (This can occur with anthologies or collected works, where each work has its own front and/or back matter.)

<TEI.2>
    <teiHeader>
        . . . [metadata section supplied by U.Va. Library to the keyboarding vendor]
    </teiHeader>
    <text>
        <front> . . . [front matter for the collection] </front>
        <group>
            <text>
                <front> . . . [front matter of first text] </front>
                <body> . . . [main body of first text] </body>
                <back> . . . [back matter of first text] </back>
            </text>
            <text>
                <front> . . . [front matter of second text] </front>
                <body> . . . [main body of second text] </body>
                <back> . . . [back matter of second text] </back>
            </text>
        </group>
        <back> . . . [back matter for the collection] </back>
    </text>
</TEI.2>

This kind of structure should be used only when specifically requested by U.Va. Library for a particular document.


Structural Divisions

Starting with <div1>: Although TEI Lite allows them, we do not use the <div> or <div0> elements. Instead, top-level structural divisions must be encoded as <div1>.

All content within a div: The <front>, <body>, and <back> elements must contain only <div1> elements. That is, all content must be enclosed by a <div1> or lower-level division, not placed directly within the <front>, <body>, or <back> elements. This holds true even for page breaks.

Determining division boundaries: When determining the start- and end-points of divisions and the nesting of lower-level divisions, refer to the printed table of contents for the work. Typically the table of contents is an accurate guide to the hierarchical structure of the work.

type attribute: Each <div#> tag must have a type attribute indicating the kind of division. If the division has no obvious type, the generic term "section" may be used; if "section" has already been used for a higher-level division, use "subsection". Typical type values for divisions within <body> are:

Some type values are appropriate within <body> or <front>, depending on context:

Others are appropriate within <body> or <back>, depending on context:

n attribute: If the division is numbered or otherwise labeled in the print source, record the number or label in the n attribute. Typically the number associated with a division is obvious from the division's header. If the division does not have a number associated with it, do not include the n attribute.

[example of a straightforward div structure]
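
As a minimal sketch of the type and n rules above, a numbered chapter containing one unnumbered lower-level division might be encoded as follows. The headings and paragraph text are invented placeholders, and "chapter" is assumed here to be an accepted type value:

```xml
<body>
    <div1 type="chapter" n="I">
        <!-- n records the number printed in the source -->
        <head>CHAPTER I.</head>
        <p>. . .</p>
        <div2 type="section">
            <!-- no number in the print source, so no n attribute -->
            <head>Early Years</head>
            <p>. . .</p>
        </div2>
    </div1>
</body>
```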

Division headings: A division will usually (though not always) have a heading that announces it. The heading should, of course, be marked with <head>.

It is fairly common for a division to have more than one heading. In such cases, use multiple <head> elements, and include the type attribute to distinguish the headings and identify their roles. Use the same type values used for the <titlePart> element, namely "main", "sub", "desc" (for descriptive), and "alt" (for alternative).

When a division has only one <head>, it is necessarily the main one, so the type attribute is unnecessary and should not be used.

[example]
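
The following sketch illustrates the two cases described above: a division with two headings, each given a type, and a division with a single heading and therefore no type. The heading text is invented for illustration, and which heading counts as "main" or "sub" remains a judgment call for the encoder:

```xml
<div1 type="chapter" n="II">
    <!-- two headings: distinguish their roles with type -->
    <head type="main">The Journey North</head>
    <head type="sub">A Narrative in Three Parts</head>
    <p>. . .</p>
</div1>
<div1 type="chapter" n="III">
    <!-- a single heading: omit the type attribute -->
    <head>The Return</head>
    <p>. . .</p>
</div1>
```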

Half-titles, fly-titles, and divisional titles: A common feature in many books is a separate page containing the title of the work (or the title of a section of the work). There are three main types of such features:

[example showing a fly-title and a divisional title]

Informal divisions: It is fairly common (especially in poems, but also in prose works) to see informal divisions, indicated by a string of asterisks or periods, or by a horizontal line. Normally, such features do not indicate the beginning of a new <div#>. Instead, mark them as <ornament/> elements.

For strings of asterisks, periods, etc., set the type attribute to "characters", and include the characters as the content of the <ornament> element. For example:

<ornament type="characters">*&emsp;*&emsp;*&emsp;*&emsp;*&emsp;*&emsp;*&emsp;*</ornament>

For horizontal lines, set type to "line" and leave the content of the element empty:

<ornament type="line"/>

The same approach should be used for printer's ornaments.

Quotations of verse within prose: In cases where a brief section of verse is quoted within a predominantly prose text, the verse should be marked as a block quotation (<q>), not as a distinct <div#>.


Front and Back Matter

Use <div1> to mark the main sections of the front matter and back matter. The exception to this rule is the main title page of the work, for which the <titlePage> element is used. Use the type attribute on the <div1> element to indicate the type of division. Typical values for front matter include:

Typical type values for back matter include:

Title Page

Within <front>, the title page is, of course, marked up using the <titlePage> element rather than <div1>.

Title types: When using the <titlePart> element to mark the parts of the title, include the type attribute, assigning one of these values:

Include verso: The content on the verso (reverse side) of the title page should be included within the <titlePage> element.

Volume information: The volume information should go in a <titlePart type="volume">. This is true even if the volume information is separated from the title by the byline or other elements (<titlePart> is allowed outside <docTitle>).

. . . </docTitle>
<byline>By <docAuthor>BOOKER T. WASHINGTON</docAuthor></byline>
<titlePart type="volume">VOLUME I</titlePart>

[example: complete title page with verso]
[another example]


III. Genres

Letters

When encoding letters, prefaces, and other such personal writings, use the appropriate elements for the opening and closing sections:

Openers and closers, in turn, contain one or more of these elements:

[example: letter with opener and closer, including a postscript]


Verse

Line groups: All verse — including poems without distinct stanzas, as well as verse quoted within a block of prose — should be encoded with the <lg> element. The type attribute is required; if the lines of verse have no obvious grouping (such as "stanza"), use type="group".

Indentation: If a line of verse is indented more than the surrounding lines, use <l rend="indent">...</l>. [example]

Line breaks: When encoding verse it is important to distinguish between logical lines of verse and the physical presentation of those lines on the printed page. In cases where a line of verse is too long to fit on the printed page, and for that reason is continued on a second line, use <l> to mark the logical line of verse and <lb/> to mark the physical line break. [example]
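
The three verse rules above can be sketched together in a single stanza. The lines of verse are invented placeholders, and the exact placement of <lb/> should follow the print source:

```xml
<lg type="stanza">
    <l>A first line of verse, printed flush left,</l>
    <!-- indented more than the surrounding lines -->
    <l rend="indent">a second line, indented in the source,</l>
    <!-- one logical line that wraps onto a second printed line -->
    <l>a third line that is too long to fit within the measure of the <lb/>
    printed page</l>
</lg>
```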


Drama

Use the standard TEI elements for encoding dramatic works:


Newspapers

[Newspaper markup example]

Files and file naming: Each issue (each day) of the newspaper should be encoded as a single <TEI.2> document contained in a single XML file. For the Cavalier Daily project, each file should be named "CavDaily_yyyymmdd.xml". For example, the issue for March 11, 1969 would be named:

CavDaily_19690311.xml

In cases where the vendor creates page images from microfilm, the page images should be named with the same base filename as the issue to which the page images pertain, followed by a two-digit page sequence number. For example:

CavDaily_19690311_01.tif
CavDaily_19690311_02.tif
CavDaily_19690311_03.tif
CavDaily_19690311_04.tif

ENTITY declaration: When declaring the DTD, include an ENTITY declaration named NEWSPAPER with value "INCLUDE". This enables the newspaper-specific features of the DTD, which are disabled by default.

<!DOCTYPE TEI.2 SYSTEM "tei-p4/tei2.dtd" [
<!ENTITY % TEI.extensions.ent SYSTEM "uva-dl-tei/uva-dl-tei.ent">
<!ENTITY % TEI.extensions.dtd SYSTEM "uva-dl-tei/uva-dl-tei.dtd">
<!ENTITY % NEWSPAPER "INCLUDE">
]>

TEI header: For the Cavalier Daily project, use the following TEI header template. This template can be inserted into the TEI file without any modifications or enhancements:

<teiHeader>
<fileDesc>
<titleStmt>
<title>The Cavalier Daily</title>
</titleStmt>
<publicationStmt>
<publisher>University of Virginia Library</publisher>
</publicationStmt>
<sourceDesc><bibl/></sourceDesc>
</fileDesc>
<profileDesc>
<langUsage>
<language id="eng">English</language>
</langUsage>
</profileDesc>
</teiHeader>

Page layout: The columnar layout of a newspaper page is typically very complex, with multiple levels of column breaks. The column breaks should not be recorded. Instead, all page breaks must be recorded, and the articles and other content on that page should be transcribed in the order in which they appear on the page, moving primarily from top to bottom and secondarily from left to right.

In other words, transcribe all the pieces along the top of the page first, moving left to right; then the pieces in the various mid-sections (there will usually be more than one level) of the page, moving left to right; and finally the pieces along the bottom of the page, moving left to right.

[Example of page layout sequence]

Associating related sections: Because newspaper articles are often broken up and printed in two (or more) sections on different pages, it is necessary to associate the first section of the article with the subsequent section(s). Use the id, next, and prev attributes on <div#> to achieve this.

<!-- start tag for first part of article -->
<div1 type="article" id="a1.2" next="a3.1">

<!-- start tag for second part of article -->
<div1 type="article" id="a3.1" prev="a1.2">

When assigning identifiers for the id attribute, use this scheme:

For example, the first article (or partial article) on the first page should be assigned id="a1.1". On the third page, the fourth article (or partial article) to be encoded for that page (moving from left to right and from top to bottom, as instructed above) should be assigned id="a3.4". Please assign an ID to all <div1 type="article"> elements, even if the article is not split into two non-contiguous sections. (Other division types do not require an ID.)

When an article is broken into two or more sections on different pages, each section of the article must be enclosed in its own <div1> element. This is necessary because other articles or partial articles will occur between the two sections.

The "jump" line -- the phrase indicating where to look for the continuation of the article -- should be encoded in a <ref> element, with a target attribute containing the ID of the <div1> that contains the article continuation.

<ab type="ref"><ref target="a3.1">(see Players, p. 3)</ref></ab>
</div1>
<!-- Intervening content... -->
<div1 type="article" id="a3.1" prev="a1.2">
<pb n="3"/>
<head type="main">Players Present Grievances Tonight</head>
<ab type="ref" rend="center"><ref target="a1.2">(continued from p.1)</ref></ab>

Photographs: When encoding photographs, illustrations, or other graphic elements using <figure>, the <figure> should be placed in its own <div1> if the photograph is not contained within a particular article. (See example markup, photo on page 1 with caption "Bill Gibson Encourages His Players...".) If the photograph appears within the text of a particular article (usually contained within a single column of text), then the <figure> should be encoded directly within the <div#> in which it occurs. (See markup example, photo on page 3 with caption "Coach Bill Gibson".)

The credit or byline accompanying a photograph or other graphic should be encoded with a <byline> element within <figure>, for example:

<figure>
<byline>Joe Smith</byline>
<head>THE ROTUNDA AT DAWN</head>
<p>Students converse on steps as sun rises.</p>
</figure>

Division types: When encoding newspapers, the most common kind of major structural division will be "article" (not "story", which is intended for encoding prose fiction short stories).

<div1 type="article">

A brief section (typically only one or two paragraphs) enclosed in a box and placed at the end of a column of text is called a "filler" and should be marked as <div1 type="filler">. (See newspaper markup example for two examples of this.)

The main heading on the top of the first page, showing the name of the newspaper, is called a "nameplate" and should be marked as <div1 type="nameplate">. The details of publication should be marked as follows:

volume number (sometimes expressed as a year, as in "79th Year"): <num type="volume" value="...">...</num>
place of publication: <name type="place">...</name>
date of publication: <date value="yyyy-mm-dd">...</date>
issue number: <num type="number" value="...">...</num>

<text>
<body>
<div1 type="nameplate">
<pb/>
<head>THE CAVALIER DAILY</head>
<ab><num type="volume" value="79">79th YEAR</num> 
<name type="place">UNIVERSITY OF VIRGINIA, CHARLOTTESVILLE</name>, 
<date value="1969-03-11">TUESDAY, MARCH 11, 1969</date> 
<num type="number" value="92">NUMBER 92</num></ab>
</div1>

The blocks of content appearing on the editorial page of the newspaper — which is almost always the second page of the newspaper — require special division types:

For the Cavalier Daily project, most issues will have a "University Notices" section and a classified advertisements section. These sections are handled similarly. The "University Notices" typically appear first and should be marked as <div1 type="univ-notices">. Each category, such as "TODAY" or "MISCELLANEOUS", should be marked as <div2 type="section">, and then each individual notice should be marked with <p>. For example:

<div1 type="univ-notices">
<head><i>University Notices</i></head>
<div2 type="section">
<head>TODAY</head>
<p>FELLOWSHIP of Christian Ath- <lb/>
letes meeting at 7:30 p.m., Wesley <lb/>
Foundation.</p>

Similarly, the classifieds section should be marked as <div1 type="classifieds">. Each category, such as "FOR SALE" or "WANTED", should be marked as <div2 type="section">, and then each individual classified ad should be marked as <p>. For example:

<div1 type="classifieds">
<head>CLASSIFIEDS</head>
<div2 type="section">
<head>FOR RENT</head>
<p>Apartment for rent...</p>

If, but only if, none of the division types described above is appropriate to a particular block of content, use the generic type "section" (or, if "section" has already been used for a higher-level division, use "subsection").

Special kinds of gaps: When encoding newspapers, some kinds of content should be excluded from the electronic transcription (due to copyright restrictions or other editorial reasons not pertaining to physical damage to the print source). Rather than using <gap desc="..." reason="editorial"/> for these gaps, use the special convenience elements provided by the DTD:

<ad/> for advertisements
<cartoon/> for cartoons appearing in a single frame or box
<comicStrip/> for cartoons appearing in multiple frames or boxes
<puzzle/> for crossword or other puzzles
<wireArticle/> for articles with a wire-service credit
<wirePhoto/> for photographs with a wire-service credit

Wire-service articles and photographs are identified by one of the following phrases in the byline or dateline:

Such phrases are usually enclosed in parentheses. The following article dateline is typical:

WASHINGTON (UPI)

Wire-service articles are a special case. Instead of replacing the entire article with a <wireArticle/> element (in the spot where a <div1 type="article"> element would go if it were not a wire-service article), we want to capture the headline, but not the article content. For example:

<div1 type="article" id="a1.4">
<head>Finch Warns Cut-Off <lb/>
Of Grants to Rioters</head>
<wireArticle/>
<p/>
</div1>

[Newspaper markup example]


Journal Articles

[No special encoding requirements for journal articles at this time.]


Encyclopedias and Dictionaries

Encyclopedias

Encyclopedia entries typically consist mainly of prose paragraphs and do not normally pose any special markup issues. Each encyclopedia entry is a <div#> containing one or more headings followed by paragraphs.

Dictionaries

Dictionary entries should be encoded using the TEI additional tagset for print dictionaries. For detailed information on the use of these elements, see Chapter 12, "Print Dictionaries," in the TEI Guidelines.

To enable the dictionary-specific features of the DTD, declare an entity named DICTIONARY with value "INCLUDE", like this:

<!DOCTYPE TEI.2 SYSTEM "tei-p4/tei2.dtd" [
<!ENTITY % TEI.extensions.ent SYSTEM "uva-dl-tei/uva-dl-tei.ent">
<!ENTITY % TEI.extensions.dtd SYSTEM "uva-dl-tei/uva-dl-tei.dtd">
<!ENTITY % DICTIONARY "INCLUDE">
]>

In the simplest case, a dictionary entry has minimal grammatical information and only one definition:

After-night, n.   The time after it becomes night.

<entry>
<form><orth><b>After-night,</b></orth></form>
<gramGrp><pos><i>n.</i></pos></gramGrp>
<def>The time after it becomes night.</def>
</entry>

An entry may include alternative spellings for the main word, in which case the alternative spellings should be marked with <orth type="alt">:

Again, conj.   Agen; agin: By the time that, untill: "I'll have
        it ready agin you come."

<entry rend="hang">
<form><orth><b>Again,</b></orth></form>
<gramGrp><pos><i>conj.</i></pos></gramGrp>
<form><orth type="alt"><i>Agen; agin:</i></orth></form>
<def>By the time that, untill:</def>
<eg><q rend="inline">"I'll have <lb/>
it ready <i>agin</i> you come."</q></eg>
</entry>

The preceding example also includes a usage example, marked with <eg>. The <eg> element does not allow character data; instead, <eg> must contain <q> (for examples with no attributed source) or <cit> (for examples that include an attribution of the author or source text).

Note: Because U.Va. Library normally uses <q> only for block quotations, when using <q> in a dictionary entry please indicate <q rend="inline">, as shown above.

More complex dictionary entries may include more than one form of the same word — that is, multiple homographs (words identical in spelling but different in meaning or pronunciation), each marked with <hom>. Entries may also include more than one meaning for the same word, in which case the information (definitions, examples, etc.) for each meaning should be grouped as a <sense>. If the senses are labeled with numbers or letters in the print source, include the label in the n attribute:

Against, prep.   In resistance to; or defense from "They
        marched against the Spaniards." (2.) Opposite. "Over
        against a point called Sandy Point." Against, conj. "Keep
        'em against I come."

<entry rend="hang">
<hom>
<form><orth><b>Against,</b></orth></form>
<gramGrp><pos><i>prep.</i></pos></gramGrp>
<sense>
<def>In resistance to; or defense from</def>
<eg><q rend="inline">"They <lb/>
marched <i>against</i> the Spaniards."</q></eg>
</sense>
<sense n="2">
(2.) <def>Opposite.</def>
<eg><q rend="inline">"Over <lb/>
<i>against</i> a point called Sandy Point."</q></eg>
</sense>
</hom>
<hom>
<form><orth>Against,</orth></form>
<gramGrp><pos><i>conj.</i></pos></gramGrp>
<eg><q rend="inline">"Keep <lb/>
'em <i>against</i> I come."</q></eg>
</hom>
</entry>

[page image for the preceding examples]

In cases where words with identical spellings (homographs) receive separate entries in the dictionary (rather than being included within a single entry), each entry should be marked as an <entry> as usual, but then the group of entries should be wrapped in a <superEntry> element. [example]
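
A minimal sketch of this arrangement, using an invented headword ("Bank") and elided definitions, might look like the following. Each homograph keeps its own <entry>, and <superEntry> only groups them:

```xml
<superEntry>
    <!-- first homograph, printed as its own entry -->
    <entry>
        <form><orth><b>Bank,</b></orth></form>
        <gramGrp><pos><i>n.</i></pos></gramGrp>
        <def>. . .</def>
    </entry>
    <!-- second homograph, printed as a separate entry -->
    <entry>
        <form><orth><b>Bank,</b></orth></form>
        <gramGrp><pos><i>v.</i></pos></gramGrp>
        <def>. . .</def>
    </entry>
</superEntry>
```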


IV. Block-level Features

Block Quotations

The <q> element should be used only for quotations that are set off typographically from the surrounding text, not for quotations that are printed in the same typographic style and indicated only by double quotation marks. A quotation is set off when it shows one or more of these typographic changes:

The <q> element should never be used to replace quotation marks. If the quotation is both set off from the surrounding text and enclosed in quotation marks, use the <q> element and also include the quotation marks.

[example]

Block quotations with openers and closers: If a block quotation contains an opener and/or closer, as in the case of quoted letters, newspaper articles, etc., use the <quotedLetter> element. (The <q> element does not allow <opener> or <closer>.) [example]

See also Letters


Figures

Captions and associated text: When using the <figure> element to indicate non-textual content (illustrations, photographs, maps, etc.), use the <head> element to record the caption of the figure (if any). Use the <p> element to record text (if any) that is associated with the figure but is not part of the caption. [example]

Printer's ornaments: Printer's ornaments do not qualify as figures. Instead, ornaments should be marked with:

<ornament type="ornament"/>

See also Informal divisions.


Tables

Header cells: For cells that contain a label or heading, rather than data, use <cell role="label">. (For cells containing data, there is no need to include the role attribute; "data" is the default.)

Spanning rows or columns: If a cell occupies more than one row or column, use the rows or cols attribute, respectively, on the <cell> start-tag. (This is equivalent to the use of the rowspan and colspan attributes on <td> in HTML.)
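
The role, rows, and cols attributes can be sketched together in a small three-column table. The cell contents are invented placeholders; only the attribute usage is the point of the example:

```xml
<table>
    <row>
        <cell role="label">Year</cell>
        <!-- one heading spanning two columns -->
        <cell role="label" cols="2">Enrollment</cell>
    </row>
    <row>
        <!-- one data cell spanning two rows -->
        <cell rows="2">1900</cell>
        <cell>Fall</cell>
        <cell>. . .</cell>
    </row>
    <row>
        <cell>Spring</cell>
        <cell>. . .</cell>
    </row>
</table>
```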

Tables vs. lists: In some cases, the choice between <table> and <list> may not be obvious, but typically any items of text that are intended to line up vertically should be encoded as a <table>. A table of contents, list of illustrations, etc. should almost always be marked up as a table. Ask U.Va. Library for further guidance if needed.

[example]


Lists

Note that lists can be nested (a list <item> can contain a <list>). A common use of nested lists is for indexes where each entry contains indented sub-entries. [example]
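
As a sketch, an index entry with indented sub-entries could be encoded with a nested list like this (the index terms and page numbers are invented for illustration):

```xml
<list>
    <item>Agriculture, 12
        <!-- indented sub-entries become a nested list inside the item -->
        <list>
            <item>crops, 14</item>
            <item>tools, 19</item>
        </list>
    </item>
    <item>Architecture, 27</item>
</list>
```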


Notes

Note reference vs. note body: By note reference we mean the anchor point for the annotation within the flow of the main text, typically indicated with a superscript number or symbol. By note body we mean the content of the annotation. The most common locations for the note body are:

Marking the note reference

In cases where the note reference is indicated by a number or symbol, as is almost always true of footnotes and endnotes, use <ref> to encode the note reference. [example]

In cases where no number or other referencing symbol is present — as is common for marginal notes, where the physical placement of the note on the page indicates which line or paragraph the annotation refers to — use <ptr/> to supply an anchor point for the annotation. [example]

Whether using <ref> or <ptr/>, always use the target attribute, the value of which must match the id attribute of the corresponding <note>.

Marking the note body

Use <note> to encode the note body.

Location within the XML document: With the exception of endnotes, which are already located in a separate section and should not be moved, the <note> element should occur at the point of the note's attachment in the main text — that is, immediately after the <ref> or <ptr/> element.

Note symbols: When the note body includes the referencing symbol (a number, *, †, etc.), record this symbol using the <ns> (note symbol) element as the first element within <note>.

[example]
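
A minimal sketch of a numbered footnote follows. The ID "n1" and the place value "foot" are assumptions made for illustration, as is the sentence text; use the ID scheme and place values that U.Va. Library specifies:

```xml
<p>The claim was first made in 1820.<ref target="n1">1</ref>
<!-- the note body follows its reference; <ns> records the printed symbol -->
<note id="n1" place="foot"><ns>1</ns> See the author's preface.</note>
The paragraph then continues . . .</p>
```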

Required attributes: Always include the id attribute, which must contain an ID that is unique within the XML document, and the place attribute, indicating the placement of the note on the printed page:

Note: When creating IDs for notes, use a simple, human-readable numbering scheme. For notes that are already numbered in the print source, include the number in the ID. For example:

Unanchored notes: If the note body has no corresponding referencing symbol (that is, the note reference is encoded with <ptr/> rather than <ref>, as is typical of marginal notes), include the anchored attribute with a value of "no". [example]
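
As a sketch, a marginal note with no printed reference symbol might be encoded as follows. The ID "m1" and the place value "margin" are assumptions for illustration and should be replaced with the values U.Va. Library specifies:

```xml
<!-- <ptr/> supplies the anchor point; the note is marked as unanchored -->
<p><ptr target="m1"/><note id="m1" place="margin" anchored="no">Arrival at the
coast.</note> The paragraph to which the marginal note is attached . . .</p>
```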

Multiple references to a single note: In cases where a printed note (typically a footnote) is pointed to by more than one note reference, the printed note should be transcribed once (immediately following the first <ref> element), with the remaining <ref> elements pointing to that single <note> element. The <note> element should not be repeated for each <ref> element. [example]
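
The following sketch shows two note references sharing one footnote. The ID "n2", the place value "foot", and the text are illustrative assumptions:

```xml
<p>First passage.<ref target="n2">*</ref>
<!-- the note is transcribed once, after the first reference -->
<note id="n2" place="foot"><ns>*</ns> Text of the shared footnote.</note>
. . . Second passage.<ref target="n2">*</ref></p>
<!-- the second <ref> points to the same note; the note is not repeated -->
```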


Other Features

Arguments, bibliographic citations, epigraphs, and trailers should be encoded as such using the appropriate TEI elements.

For example, an epigraph containing a quotation and a citation of its source should be marked up in this manner:

<epigraph>
<cit>
<q>"I have sworn upon the altar of God <lb/>
eternal hostility against every form of tyranny <lb/>
over the mind of man."</q>
<bibl><author> &mdash; <i>Thomas Jefferson.</i></author></bibl>
</cit>
</epigraph>

An argument (a leading section containing a summary of the content that follows it) is often presented as a series of topics separated by long dashes. An argument should be marked as an <argument>, not as a second <head>. [example]
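
A minimal sketch of an argument following a chapter heading is shown below. The heading and topics are invented for illustration, and "chapter" is assumed to be an accepted type value:

```xml
<div1 type="chapter" n="III">
    <head>CHAPTER III.</head>
    <!-- the summary of topics is an argument, not a second heading -->
    <argument>
        <p>Departure from home&mdash;The voyage&mdash;Arrival in port.</p>
    </argument>
    <p>. . .</p>
</div1>
```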


V. Phrase-level Features

Changes in Typeface

With one exception, changes in typeface should be marked with the appropriate physical element, not with a logical element such as <emph>, <title>, <term>, <mentioned>, etc. The exception is foreign phrases, which should be marked with <foreign> (see Foreign Phrases). When marking changes in typeface, use the following elements:

Note on <smcap>: Text that is printed in small caps should be transcribed using both upper-case and lower-case letters, not all upper-case letters. [example]

When using <hi> for changes in typeface, use one of the following values for the rend attribute:

[example of gothic typeface]


Alignment and Indentation

When indicating alignment or indentation, use the globally available rend attribute, either on structural elements (<p>, <l>, <cell>, <item>, etc.) or on <hi>, as appropriate to the situation.

For indicating alignment, the rend value may be:

For indicating indentation, rend may be:

Default alignment: Some elements have a presumed or default alignment and do not normally require explicit alignment markup:

These elements should contain alignment markup only when the layout of the element on the printed page differs from the defaults listed above.


Foreign Phrases

Using <foreign>: Words or phrases that are both (a) typographically distinct (usually in italics), and (b) not in the main language of the text (almost always English), should be marked with the <foreign> element. Whenever possible, include the lang attribute, using one of the standard ISO 639-2 three-character language codes. Occasionally, the language will not be obvious, in which case encode the phrase with <foreign> but without the lang attribute. Commonly used ISO 639-2 codes include:

fre    French
ger    German
grc    Greek, ancient (to 1453)
gre    Greek, modern (1453- )
heb    Hebrew
ita    Italian
lat    Latin
rus    Russian
spa    Spanish

Each language identified by a lang attribute (on <foreign>, or on any other element) must be declared in a <language> element within <teiHeader><profileDesc><langUsage> in order for the XML document to validate.

[example]

Retaining typographic distinction: Using the <foreign> element does not obviate the need to encode the change in typeface. Since foreign phrases are usually italicized, typical markup for a foreign phrase will be: <foreign lang="..."><i>...</i></foreign>
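
The two requirements above (declaring the language in the header and retaining the typeface markup) can be sketched as follows. The Latin phrase and the sentence around it are invented for illustration:

```xml
<!-- in the TEI header: declare every language used in a lang attribute -->
<langUsage>
    <language id="eng">English</language>
    <language id="lat">Latin</language>
</langUsage>

<!-- in the text: <foreign> identifies the language, <i> records the italics -->
<p>The committee was formed <foreign lang="lat"><i>ad hoc</i></foreign>.</p>
```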

Not roman but not Asian: Languages such as Greek, Hebrew, and Russian fall into a special category. They require non-roman characters, but they are alphabetic, not ideographic. If the language is within the vendor's capabilities, the foreign phrase should be transcribed using either the appropriate character entities, or XML character references with Unicode hexadecimal values:

Greek      Use the iso-grk1.ent character entities, supplemented as needed by the accented characters in iso-grk2.ent
Hebrew     Use the Hebrew block of Unicode (0590 - 05FF): &#x05D0; for aleph, &#x05D1; for beth, etc.
Russian    Use the Cyrillic block of Unicode (0400 - 04FF)

If the language is not within the vendor's capabilities, omit the characters from the transcription and use the <gap/> element.

Ask U.Va. Library for further guidance if needed.

See also Gaps and Uncertainties: <gap/> and Special Characters.


Punctuation

Standard keyboard punctuation: Most common punctuation characters can and should be represented using their normal keyboard characters:

Use of entities: Other marks of punctuation must be represented using their standard character entities:

An ellipsis (a series of dots or asterisks indicating deliberately omitted text) should be indicated by a series of keyboard-character periods or asterisks. Simply use the same number of periods or asterisks used in the print source.

If the print source contains an exceptionally long space that needs to be preserved (for example, to indicate a word deliberately omitted by the author), use a series of &emsp; (em space) entities.
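For illustration, with invented sentences: the first paragraph below keeps a four-dot ellipsis exactly as it might be printed, and the second preserves a long blank standing in for an omitted name. The number of &emsp; entities (three here) is an arbitrary choice for the sketch; judge it from the width of the space in the print source.

```xml
<!-- Ellipsis: same number of periods as in the print source -->
<p>He hesitated.... and then went on.</p>

<!-- Exceptionally long space marking a deliberately omitted name -->
<p>I dined yesterday with Mr. &emsp;&emsp;&emsp; of Richmond.</p>
```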

Other marks of punctuation are available in the iso-num.ent, iso-pub.ent, and iso-tech.ent character entity sets. See Special Characters below.

Spacing between sentences: Use one space character between sentences, not two, regardless of the apparent spacing in the print source.


VI. Reference Systems

Page Breaks

Use <pb/> to mark page breaks — that is, to mark the point at which a page begins.

Always at top of page: The <pb/> element should always be placed at the top or beginning of the page, regardless of the position of the printed page number in the print source.

Page numbers: If the page contains a printed page number, record it in the n attribute; if not, do not include the n attribute.
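A short sketch of the two cases, with an invented page number:

```xml
<!-- Page with a printed page number -->
<pb n="43"/>

<!-- Page with no printed page number: no n attribute -->
<pb/>
```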

Always within a div: Page breaks must be placed within a <div#> element, never between divisions. Therefore, when a division starts on a new page, the <pb/> is the first element in the division, immediately following the opening <div#> tag (preceding even the division <head>, if there is one). For example:

</div2>
<div2 type="chapter" n="II">
<pb/>
<head>II&mdash;APPELLATIONS.</head>

Blank pages: There must be one <pb/> element for every page in the set of page images for the work. This is true even for pages that have no text or other printed content on them.

If the blank page occurs between divs, place the blank page's <pb/> element at the end of the preceding div (as its last element), not at the start of the new div.

<!-- end of last chapter --></p>
<pb/> <!-- blank page between last chapter and bibliography -->
</div1>
</body>
<back>
<div1 type="bibliography">
<pb/>
<head>BIBLIOGRAPHY</head>

Running page headers: Normally running page headers should be excluded from the electronic text (see Exceptions). In some cases, however, U.Va. Library will specifically request that the running headers be preserved for a particular book. To encode the running headers, use <fw type="header"> within <pb>:

<pb n="99"><fw type="header">APPEAL TO CHURCHES OF MASSACHUSETTS</fw></pb>

Column Breaks

Use the <cb/> element to mark column breaks — that is, to mark the point at which a column of text begins. Of course, many books have a single-column layout, in which case it is not necessary to mark the column at all. Other materials, however, such as dictionaries, encyclopedias, newspapers, and journals, are commonly printed in multiple columns and require the use of <cb/> to indicate the layout.

Always at top: Like page breaks (<pb/>), which should always be placed at the top of the page, <cb/> should always mark the top or beginning of the column of text.

n attribute: For <cb/>, use the n attribute to record the number of the column on the page. If each page contains two columns, the first (leftmost) column on each page is <cb n="1"/>, and the second column is <cb n="2"/>.
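As a sketch of a two-column layout running across two pages (page numbers and text are invented placeholders), the column numbering restarts at 1 on each page:

```xml
<pb n="210"/>
<cb n="1"/>
<p>Text of the left column.</p>
<cb n="2"/>
<p>Text of the right column.</p>
<pb n="211"/>
<cb n="1"/>
<p>Left column of the next page.</p>
<cb n="2"/>
<p>Right column of the next page.</p>
```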

Mixed column layouts: In cases where the number of columns changes mid-page, use the <cols/> element to indicate the point at which the number of columns changes. Use the n attribute to indicate the number of columns in the section that follows the <cols/> tag. For example, if the page layout shifts from single-column to double-column in the midst of the page, use <cols n="2"/> to indicate the point at which double-column layout begins (and then use <cb n="1"/> and <cb n="2"/> to mark the columns, as usual). At the point where the layout shifts back to single-column text, use <cols n="1"/> (after which no <cb/> elements are necessary, since the layout is single-column). [example]
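The sequence described above might be sketched as follows. The page number and paragraph text are invented, and the placement of <cols/> between paragraphs is an assumption about where the layout happens to change:

```xml
<pb n="12"/>
<p>Single-column text at the top of the page.</p>
<!-- Double-column layout begins here -->
<cols n="2"/>
<cb n="1"/>
<p>Text of the left column.</p>
<cb n="2"/>
<p>Text of the right column.</p>
<!-- Layout returns to a single column; no further <cb/> needed -->
<cols n="1"/>
<p>Single-column text resumes.</p>
```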

Note: A division <head> followed by a multi-column layout does not indicate a mixed-column layout and does not require <cols n="..."/>. [example]


Line Breaks

Line breaks in running prose should be preserved in the electronic transcription by marking the end of each printed line with the standard <lb/> element.
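A brief sketch with an invented sentence, following the pattern used in the examples elsewhere in this document, where each printed line ends with <lb/> and the final line of the paragraph is closed by </p>:

```xml
<p>The road ran straight across the <lb/>
plain for several miles before it <lb/>
reached the river.</p>
```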


Special Considerations

Gaps and Uncertainties

The <unclear> and <gap/> elements should be used as follows:

<unclear>: Use <unclear> to mark passages that cannot be transcribed with certainty, as happens when a letter/word/phrase is physically present on the page but is unreadable (due to a printing error, physical damage to the page such as readers' marks, or a bad scan).

When marking an illegible letter/word/phrase as unclear, transcribe the readable characters, omit the unreadable characters, and mark the entire word or phrase with <unclear>...</unclear>. For example:

Correct:

lost, wounded or captured in a fruitless and <lb/>
<unclear>opeless</unclear> assault. <lb/> 

Wrong:

lost, wounded or captured in a fruitless and <lb/>
<unclear/>opeless assault. <lb/> 

Note: If one or more illegible characters come immediately before or after an end-of-line hyphen, include the entire hyphenated word within the <unclear> element. For example:

Correct:

cleared his throat, and quietly withdrew, <unclear>maintain- <lb/>
ing</unclear> to the last his unprejudiced demeanour.</p>

Wrong:

cleared his throat, and quietly withdrew, maintain- <lb/>
<unclear>ing</unclear> to the last his unprejudiced demeanour.</p>

If a word or phrase is so illegible that none of its characters can be read with certainty, use an empty <unclear/> element to mark the point at which the word or phrase occurs:

instead of his sargeant; and therefore no regiment <lb/>
<unclear/> to be seen in which there are not soldiers in <lb/>

<gap/>: Use <gap/> to mark any section (character, word, passage, page, etc.) that is being omitted from the transcription. There are two reasons for such omissions: the section is missing (as happens with torn or missing pages), or it has been excluded deliberately for editorial reasons. In particular, a block of non-Western characters in a language outside the vendor's capabilities should be marked as a <gap/>. Use one <gap/> element for each unbroken section of content that is being excluded from the electronic transcription. (See also Special Characters.)

When using the <gap/> element, include the desc (short for "description") and reason attributes. The reason attribute accepts these values:

Examples:

<gap desc="Chinese characters" reason="editorial"/>

<gap desc="page 43, line 17 to end of page" reason="damage"/>

Arbitrary Sections

In texts with complex structure or layout, the encoder is likely to encounter block-level sections or phrase-level passages that are difficult to fit into any of the standard TEI elements. In such cases, it may be best to take advantage of TEI's elements for arbitrary sections:

<ab>     anonymous block, for block-level sections
<seg>    arbitrary segment, for phrase-level passages

Both of these elements accept the type attribute with any value (no predefined vocabulary).

Although these elements should be used sparingly, they are very useful when genuinely needed.

IMPORTANT: It is better to use <ab> or <seg>, when appropriate, than to inject inappropriate markup — such as <div#> elements that do not truly reflect the major structural divisions of the work, or <p> elements that are not really paragraphs — for the sake of "making it parse."
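A hedged sketch of each element in use. The text and both type values ("imprint-note" and "telegram") are hypothetical, chosen only to show that the type attribute is free-form:

```xml
<!-- Block-level: a printed block that is neither a paragraph nor a division -->
<ab type="imprint-note">Printed for the Author, and sold by the Booksellers.</ab>

<!-- Phrase-level: a passage within a paragraph that no more specific element fits -->
<p>The reply was brief: <seg type="telegram">ARRIVE TUESDAY STOP</seg></p>
```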

If a work contains a particularly problematic feature for which the preferred encoding is not clear, ask U.Va. Library for further guidance.


Special Characters

Character-entity sets are not declared in the external DTD. Instead, you will need to declare and invoke any entity sets required by a given document in the document's internal subset:

<?xml version='1.0'?>
<!DOCTYPE TEI.2 SYSTEM '[path]/tei2.dtd' [
<!ENTITY % TEI.extensions.ent SYSTEM '[path]/uva-dl-tei.ent'>
<!ENTITY % TEI.extensions.dtd SYSTEM '[path]/uva-dl-tei.dtd'>

<!ENTITY % ISOlat1 SYSTEM '[path]/iso-lat1.ent'>%ISOlat1;
<!ENTITY % ISOlat2 SYSTEM '[path]/iso-lat2.ent'>%ISOlat2;
<!ENTITY % ISOnum  SYSTEM '[path]/iso-num.ent'>%ISOnum;
<!ENTITY % ISOpub  SYSTEM '[path]/iso-pub.ent'>%ISOpub;
<!ENTITY % ISOtech SYSTEM '[path]/iso-tech.ent'>%ISOtech;
]>

The usual ISO 8879 (SGML) entity sets are included with the uva-dl-tei DTD files (see The DTD above). Please do not use your own, local versions of the ISO 8879 entity sets, as our versions may include corrections, as well as a supplementary set containing characters not available in the standard sets (uva-supp.ent).
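Further sets can presumably be invoked in the same way. The lines below sketch additional declarations for the internal subset, using the file names mentioned in this document; ISOgrk1 and ISOgrk2 are the conventional names for those sets, while "UVAsupp" is a hypothetical parameter-entity name chosen for this example:

```xml
<!-- Additional lines for the internal subset, following the pattern above -->
<!ENTITY % ISOgrk1 SYSTEM '[path]/iso-grk1.ent'>%ISOgrk1;
<!ENTITY % ISOgrk2 SYSTEM '[path]/iso-grk2.ent'>%ISOgrk2;
<!-- "UVAsupp" is a hypothetical name for the U.Va. supplementary set -->
<!ENTITY % UVAsupp SYSTEM '[path]/uva-supp.ent'>%UVAsupp;
```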

Use the named (mnemonic) entities from the U.Va.-supplied entity sets whenever possible. In working with XML character-entity sets, you may want to refer to the following resources:

Characters not in a standard entity set: If a character is not in one of the U.Va.-supplied entity sets (don't forget to check uva-supp.ent as well as the ISO 8879 sets), it may nevertheless be available as a Unicode character. In such cases, identify the correct Unicode character and declare it as an entity with an appropriate human-readable name.
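As a minimal sketch, suppose the print source uses the long s and that character is not already defined in any of the supplied sets (an assumption made only for this example). The entity name "longs" is hypothetical; the declaration goes in the internal subset alongside the other ENTITY declarations:

```xml
<!DOCTYPE TEI.2 SYSTEM '[path]/tei2.dtd' [
<!ENTITY % TEI.extensions.ent SYSTEM '[path]/uva-dl-tei.ent'>
<!ENTITY % TEI.extensions.dtd SYSTEM '[path]/uva-dl-tei.dtd'>

<!-- Hypothetical entity name; U+017F is LATIN SMALL LETTER LONG S -->
<!ENTITY longs "&#x017F;">
]>
```

The entity can then be used in the transcription like any other named entity, for example Congre&longs;s.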

Characters not in Unicode: In some cases, a particular character may not be available in Unicode. It is usually possible, however, to create the needed character by using the Unicode combining diacritics.

[example]
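One possible sketch, assuming the print source contains a lowercase m with a macron above it, for which Unicode has no single precomposed character. The entity name "mmacr" is hypothetical; the combining mark follows the base letter:

```xml
<!-- Hypothetical entity name: "m" followed by U+0304 COMBINING MACRON -->
<!ENTITY mmacr "m&#x0304;">
```

In the transcription, a word printed with that character would then be keyed as, for example, co&mmacr;on.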

Any such entity declarations outside the U.Va.-supplied entity sets should be noted in the encoder's notes to U.Va. Library (see Providing notes).


Final Steps

Line endings: If the XML files contain two-byte line endings (carriage return + linefeed), please convert them to Unix-style, one-byte line endings (linefeed character) before delivering the finished files to U.Va. Library.