scottkleinman / aeme

AEME Development Repo

aeme's Issues

Encoding punctus elevatus and the caesura

Compare these transcriptions of the punctus elevatus (other entities have been removed for readability):

(1)

<l>Sire sire þis womman seide&punctelev; housebonde i nabbe nanne</l>

(2)

<l>Sire sire þis womman seide &punctelev;<shadowGap type="caesura"/> housebonde i nabbe nanne</l>

Transcription 1 does not designate the punctus elevatus as marking the caesura. It also silently collapses the space prior to the punctuation mark so that it will render according to modern conventions for punctuation. (The TEI Guidelines state that this is sometimes done in diplomatic transcriptions.) This allows for easy rendering, as described in our current Guidelines, though the punctuation mark should be enclosed in <pc> tags.

Transcription 2 uses <shadowGap/> merely as a caesura marker, not for rendering punctuation. Sharon suggests that <seg type="caesura">&punctelev;</seg> would be better, and I agree. However, I would suggest <pc type="caesura">&punctelev;</pc> for compatibility with the rendering instructions in our Guidelines. (One issue is that we have to include the TEI analysis module or at least clone <pc> from there.)

So this is a good opportunity to reconsider (A) whether the system in our Guidelines works and (B) whether we want to encode the caesura at this stage in the game.

How to indicate the extent of unclear text

Should we be using dots in <unclear> to indicate the number of unclear characters? Or is there an attribute like @extent that can be used (in which case the element would presumably be <unclear/>)?

@skgoetz : yes, @extent exists for <unclear>, and one could use an empty element

@scottkleinman: Just to be clear (ha! ha!), that's yes to using @extent and no dots?

Alternative to <shadowGap>

An alternative to <shadowGap/> and <shadowGap> </shadowGap>:

MUFI has code points for SOFT HYPHEN and DELETE characters, which may be used by the stylesheet to add or delete spaces. Here are some possible uses:

atones (uses SOFT HYPHEN)

This is more concise than at<shadowGap> </shadowGap>ones or

<choice>
    <orig>atones</orig>
    <reg>at ones</reg>
</choice>

Likewise:

bi&del;fore (uses DELETE)

This is more concise than bi<shadowGap/>fore or

<choice>
    <orig>bi fore</orig>
    <reg>bifore</reg>
</choice>

Is this a misuse of these characters? Is it better to use an element for this purpose?

Unicode encoding fails when texts are cut

Sample error message: UnicodeEncodeError: 'ascii' codec can't encode character u'\u8b00' in position 0: ordinal not in range(128)
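The message above is typical of Python falling back to the ASCII codec when non-ASCII text is written out. The following is only a minimal sketch of one common remedy, naming the output encoding explicitly; the function names `cut_text` and `save_chunks` and the file-naming pattern are hypothetical and are not taken from the project's cutting code.

```python
import io
import os
import tempfile


def cut_text(text, size):
    """Split a Unicode string into consecutive chunks of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def save_chunks(chunks, directory):
    """Write each chunk to its own UTF-8 file and return the file paths."""
    paths = []
    for number, chunk in enumerate(chunks, start=1):
        # Hypothetical naming scheme: chunk001.txt, chunk002.txt, ...
        path = os.path.join(directory, "chunk{:03d}.txt".format(number))
        # Naming the encoding avoids any implicit fallback to the ASCII codec.
        with io.open(path, "w", encoding="utf-8") as handle:
            handle.write(chunk)
        paths.append(path)
    return paths


if __name__ == "__main__":
    folder = tempfile.mkdtemp()
    print(save_chunks(cut_text(u"\u8b00 \u00fe", 1), folder))
```

Whether this applies depends on where the real code encodes its output; the sketch only shows the general pattern.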

Named entity markup, 1 name multiple entities

Consider the following (slightly edited) lines:

    <l>[Seinte Anne] hadde euerech [housebonde] aftur oþur ; for heo was iwedded þrie</l>
    <l>Bi euerech of heom ane douȝter heo hadde ; and euerech hieȝte <name type="person" 
         ref="???">Marie</name></l>

How would we handle the @ref, given that the name here refers to three separate entities?

Side question: What was Anne thinking?

<choice> for words or letters

Compare:

<choice>
    <orig>Man</orig>
    <reg>man</reg>
</choice>

<choice>
   <orig>M</orig>
    <reg>m</reg>
</choice>an

@skgoetz prefers the former "because I think in terms of words, not letters. Less flippantly, because I’d argue that this scribe thought in terms of words as well—semantic weight, not only whether a capital letter would look nice here. It’s an example of how encoding is an editorial act."

If I understand this point, both transcriptions should be acceptable, but as editors we should adopt the method that best reflects our interpretation of what caused the scribe to capitalise "man" (in the case of this example).

Accepting Horstmann's corrections

If we accept Horstmann's corrections, should we tag them with <reg resp="#bioHO001">, <reg resp="#bioHO001">, etc.? Or is it enough to indicate in notes where a Horstmann reading is of particular interest?

Markup for collapsing word divisions in the critical layer

Currently, our Guidelines describe the use of <shadowGap/> to bring together divided morphemes into more modern word forms. For instance:

i<shadowGap/>leue a<shadowGap/>ne

would read "ileue ane".

An alternative using TEI-native elements is

i<sic> </sic>leue a<sic> </sic>ne

Now that we have some experience coding, we should begin re-considering our practices.

Re-numbering lines to account for lost text

Our current automatic line numbering basically corresponds to Horstmann's numbering, but we may want to re-number lines based on the assumed number of lost lines in missing folios.

@skgoetz replies: interesting, and possible to sort out near the end, not while post-processing.

IDs for places

Places should probably be given IDs. Sharon suggests borrowing an existing schema (MTP uses METS and a database ID), and I've assigned myself to investigate possibilities. One complication is the changing boundaries of counties: "Oxfordshire" will have different meanings based on the context in which it is used.

Add "enc" to the @type values for <note>

To be used for encoding issues.

Dealing with tricky word transpositions

At the top of f. 52r (near the end of the second line) there is a really tricky word transposition. Any thoughts on how to deal with it? Here is my best effort so far:

<l xml:id="bodllaudmisc108.3l.131"><hi rend="touched-red">A</hi><sg type="ln"/>ke <choice>
    <orig>seint</orig>
    <reg>Seint</reg>
  </choice>
  <name type="person" ref="#bioMatthew0001">Matheu</name> ne fo&rrot;<sg/>ȝat nouȝt <hi
    rend="touched-red">þ</hi>at <choice>
    <orig>Maide</orig>
    <reg>maide</reg>
  </choice><pc> &punctelev; </pc>þei he <seg xml:id="transpose1a"><add place="sublinear"
  >&caret;</add><metamark function="transposition" rend="line" place="supralinear"/>
    were <add place="sublinear">&caret;</add></seg>
  <seg xml:id="transpose1b">d<metamark function="transposition" rend="line"
  place="supralinear"/>ed<add place="sublinear">&caret;</add></seg></l>
  <listTranspose>
    <transpose>
      <ptr target="#transpose1b"/>
      <ptr target="#transpose1a"/>
    </transpose>
  </listTranspose>

Namespace for <shadowGap>

Syd Bauman had a look at our code to see why it was not validating for upload to TAPAS. He was duly impressed:

"I like that your schema was properly generated from an ODD! That should probably be in a non-TEI namespace, though."

The non-TEI namespace issue was not the reason for the validation error, but it does raise the question of our use of namespaces again. Something to think about for the future, especially in conjunction with issue #34.

addSpan and delSpan

Doc needs updating to specify what the final <anchor/> looks like. It will need an @xml:id; I suggest also using @type ("addSpan" and "delSpan" would work nicely), or my XSL won't be able to distinguish that anchor easily from others.

House citation style

Bibliography should be cited using a house style. I suggest Chicago 16th, and Sharon seconds this. I would further add that bibliographical references should simply be placed in <bibl> (rather than using more structured markup) as follows:

<listBibl>
    <bibl xml:id="Horstmann1873">
        Horstmann, Carl. <hi rend="it">Leben Jesu: ein Fragment, und Kindheit Jesu,
        &amp;c.</hi>. Münster: Regensberg, 1873.
    </bibl>
</listBibl>

Note that titles have to be tagged to render as italic using this method.

For application see issue #5.

<respons> in <titleStmt>

Our current <teiHeader> contains a <respons> element inside the <titleStmt>. Is this necessary, or is all the relevant information in the <respStmt>?

Correspondence Table for Ker Dates

Sharon will create a table showing date ranges that correspond to Ker's dating system. This will be helpful for encoding dates more quickly.

Markup of Ornamental Capitals

How do we mark up the extent of ornamental capitals where part of the letter extends over more lines than the rest? For instance, if the main part of an "A" extends over two lines but the left "leg" extends downward for another four, is the letter 2 lines or 6 lines high?

@skgoetz adds: To what extent do we encode details of individual non-inhabited capitals when they follow a pattern established in the MS, versus describing them in decoNote and being done with it? My draft assignment indicates two-line ones as blue but doesn’t note the red tracery for that reason.

Encoding named entities with "of"

How do we handle names like "Herebard of Boseham"? Is it

<name type="person" ref="#bioHerebardOfBoseham">Herebard</name> of 
<name type="place" ref="#locBoseham">Boseham</name>

<name type="person" ref="#bioHerebardOfBoseham">Herebard of Boseham</name>

<name type="person" ref="#bioHerebardOfBoseham">Herebard of 
    <name type="place" ref="#locBoseham">Boseham</name>
</name>

Encoding multiple letters underdotted for deletion

How to encode this phenomenon?

@skgoetz suggests encoding all letters as a single deletion unless the individual dots are the result of separate acts of marking for deletion.

I suggest that this practice be adopted but that this issue remain open until it can be documented in the Guidelines.

Entity label bugs

Hidot and middot have the same code point in the xml entity declarations. Also, change the label for open and close parentheses to correct MUFI labels lpar and rpar.
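A clash of this kind could be caught automatically. The sketch below is one possible check, assuming the declarations take the form `<!ENTITY name "&#xNNNN;">`; the entity names and code points in `SAMPLE` are invented for illustration and are not the project's actual declarations.

```python
import re
from collections import defaultdict

# Matches declarations of the assumed form <!ENTITY name "&#xNNNN;">
ENTITY_DECL = re.compile(r'<!ENTITY\s+(\S+)\s+"&#x([0-9A-Fa-f]+);"\s*>')

# Invented declarations: "alpha" and "beta" deliberately share a code point.
SAMPLE = """
<!ENTITY alpha "&#xE000;">
<!ENTITY beta "&#xE000;">
<!ENTITY gamma "&#xE001;">
"""


def shared_code_points(dtd_text):
    """Return {code point: [entity names]} for code points declared more than once."""
    names_by_code = defaultdict(list)
    for name, code in ENTITY_DECL.findall(dtd_text):
        names_by_code[int(code, 16)].append(name)
    return {code: names for code, names in names_by_code.items() if len(names) > 1}


if __name__ == "__main__":
    for code, names in sorted(shared_code_points(SAMPLE).items()):
        print("U+{:04X}: {}".format(code, ", ".join(names)))
```

Declarations written in another style (decimal references, single quotes) would need a broader pattern.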

DIMEV URL for the Whole Manuscript of Laud Misc. 108

Our current template has the URL for The Life of Christ. A URL for the entire MS in DIMEV should be located.

Accents over "i"

I have distinguished the scribe's regular acute accent over "i" from instances where he uses a simple minim by using "í" for the former and "i" for the latter. I have not used <choice> to give a separate reading in the critical layer, trusting the stylesheet to convert all instances of "í" to "i". Is this the best approach?

Need to develop guidelines for <facsimile>

And do the transcription, of course.

header trouble

oXygen isn't doing any validation for me at all, due (I think) to the ENTITYs defined before the <teiHeader>. Basically, it thinks that is such a big problem that it focuses just on that; no other errors get flagged.

I gather that this decision (i.e., pre-<teiHeader> entities) is a P4 thing, and that P5 wants those things defined inside the header. And I did try creating a P4 document and pasting the text in, to no avail.

Can someone explain the thinking to me, for my peace of mind, and also suggest an option that will allow me to make use of the validation?

Roman Numerals

Our guidelines seemed to me to be poised between advice to regularise Roman numerals only with Arabic equivalents and advice to regularise Roman numerals with both Arabic equivalents and words. I got rid of the latter in the draft of 4.0 with some reluctance. I now see that PPEA regularise only on words, e.g.

<choice>
    <orig>.xij.</orig>
    <reg>twelf</reg>
</choice>

I don't like this because the word is not registered as a number. Our Guidelines say to use the <num> element with an Arabic value. The equivalent would thus be:

<num value="12">
    <choice>
        <orig>.xij.</orig>
        <reg>twelf</reg>
    </choice>
</num>

Any issues with making this our standard? I checked, and <num> can contain <choice>.
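To illustrate what the wrapper buys us, here is a small Python sketch (standing in for whatever the stylesheet would do) that reads the numeric value and both readings from the markup above. It assumes no TEI namespace on the elements, which will not be true of a full project file, and the function name `read_num` is hypothetical.

```python
import xml.etree.ElementTree as ET

# The <num> example from this issue, without a namespace for simplicity.
SAMPLE = """<num value="12">
    <choice>
        <orig>.xij.</orig>
        <reg>twelf</reg>
    </choice>
</num>"""


def read_num(xml_text):
    """Return (numeric value, original reading, regularised reading) for a <num>."""
    num = ET.fromstring(xml_text)
    value = int(num.get("value"))
    orig = num.findtext("choice/orig")
    reg = num.findtext("choice/reg")
    return value, orig, reg


if __name__ == "__main__":
    print(read_num(SAMPLE))
```

With the PPEA-style encoding (no <num>), only the two strings would be available and the number itself would have to be inferred.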

Line numbers in the SEL

Our current Guidelines state:

The format of @xml:id [for <l>] will be the manuscript abbreviation plus the <msItem> number plus the line number.

I want to clarify that the <msItem> is the value for @n, not @xml:id. This is potentially problematic because we have the following possibilities for, say, line 1 of St Dunstan:

<!-- Based on @n of <msItem> -->
<l xml:id="laudmisc108.0003.615"> <!-- Line 615 of the SEL -->
<l xml:id="laudmisc108.0003b.1"> <!-- Line 1 of St Dunstan -->

<!-- Based on @xml:id of <msItem> -->
<l xml:id="laudmisc108.mc0003.615"> <!-- Line 615 of the SEL -->
<l xml:id="laudmisc108.mc00005.1"> <!-- Line 1 of St Dunstan -->

We should clarify how we should construct xml:id values. I'm inclined to go for the @n values and to re-start lineation with each vita, but I thought I'd open this to discussion before updating the Guidelines.
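For discussion purposes, the @n-based option can be stated as a tiny function. This is only a sketch of the pattern "manuscript abbreviation plus <msItem> @n plus line number"; the function name `line_id` is hypothetical, and the sample values are the ones used in the examples above.

```python
def line_id(ms_abbrev, item_n, line_no):
    """Build an xml:id for <l> from the manuscript abbreviation, the @n value
    of the <msItem>, and a line number that restarts with each item."""
    if line_no < 1:
        raise ValueError("line numbers start at 1")
    return "{}.{}.{}".format(ms_abbrev, item_n, line_no)


if __name__ == "__main__":
    # Line 1 of St Dunstan under the @n-based scheme
    print(line_id("laudmisc108", "0003b", 1))
```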

Add custom `@ht` attribute to `<hi>`

For discussion, see issue #21.

External references with ampersand in the url

I'm trying to code <ref target="http://www.hrionline.ac.uk/mwm/browse?type=ms&id=118" type="MWM">http://www.hrionline.ac.uk/mwm/browse?type=ms&id=118</ref> (where "MWM" is Manuscripts of the West Midlands). However, the ampersand in @target does not validate. If I change it to &amp;, it validates, but, when I paste that into my browser, the page is not found.

How exactly should we handle this situation?
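A short Python sketch may help separate the two layers involved. In XML source the ampersand has to be written as an escaped entity, but any XML parser hands back the plain `&` when it reads the attribute, so the escaped form is only a problem when it is copied by hand from the source file into a browser. The snippet uses only standard-library calls; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

url = "http://www.hrionline.ac.uk/mwm/browse?type=ms&id=118"

# quoteattr() escapes "&" as "&amp;" and wraps the value in quotation marks,
# giving an attribute value that is well-formed XML.
ref_xml = "<ref type=\"MWM\" target={0}>MWM</ref>".format(quoteattr(url))

# Parsing the element reverses the escaping: the application sees the real URL.
parsed = ET.fromstring(ref_xml)
recovered = parsed.get("target")

if __name__ == "__main__":
    print(ref_xml)
    print(recovered)
```

So the escaped form in the file and the working URL in the browser are the same value seen before and after parsing.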

Tagging God, Christ, and the Devil as named entities

Should we be using <name type="person"> for God, the various forms of Christ, and the Devil? I've done it for one short text, but it's potentially a lot of tagging for longer texts.

Requirements for upload to TAPAS

At present, TAPAS will not validate DTD subsets of entities or non-TEI namespaces. Since we are not using the latter at the moment, the only issue for us is our list of entities. Until they get this problem fixed (it is a PHP problem), Syd Bauman advises us to pre-process our texts prior to upload using the Unix/Linux xmllint command:

$ xmllint --noent INPUT.xml > OUTPUT.xml

This will generate a version of INPUT.xml that does not have a DTD subset and where each entity reference has been replaced by the actual Unicode character. (I.e., "&thorn;" is replaced not by "&#xFE;", but rather by a U+00FE character that looks like a thorn in some fonts and a "I can't display this character" box in others.)

Not ideal, but hopefully they'll come up with a solution soon.
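For anyone without xmllint to hand, a roughly similar expansion can be sketched in Python 3 with the standard library, since its parser resolves entities declared in the internal DTD subset and drops the DOCTYPE on output. This is an illustrative stand-in, not a replacement for the command Syd recommends; the one-entity declaration in `SAMPLE` is a made-up miniature of our entity list, and `expand_entities` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature document with a single entity in the internal subset.
SAMPLE = """<!DOCTYPE l [
  <!ENTITY thorn "&#xFE;">
]>
<l>&thorn;is womman</l>"""


def expand_entities(xml_text):
    """Parse the document and serialise it again with entities expanded.

    The result has no DOCTYPE, and each entity reference has become the
    Unicode character it stood for.
    """
    root = ET.fromstring(xml_text)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(expand_entities(SAMPLE))
```

Note that comments and processing instructions outside the root element are not preserved by this approach.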

doc: @glyphHeight

This is tiny, but in the Ornamental Capitals and Pilcrows section, it ought to be noted that @glyphHeight is optional (not only @glyphWidth).

Representation of final stressed "e"

Should we represent French stressed final "e" with an acute in the critical transcription?

@skgoetz replies: to me this is, “Are we editing in the French tradition?” (No.)

Let me clarify the question. This refers to words of French origin like "prive" and "beaute", which have the French stressed tense "e" rather than the unstressed English schwa in the second syllable. It is common practice to mark these with an acute accent in student editions as a reading aid. We have the option to do so on our critical layer. So this is really a question about whether we want to encode that student aid.

Encoding abbreviations

Consider the following transcriptions of the word "prophetes":

(1)

<ex>pro</ex>phetes

(2)

<choice>
    <abbr>&pflour;</abbr>
    <expan>pro</expan>
</choice>
phetes

(3)

<choice>
    <orig>&pflour;</orig>
    <reg>pro</reg>
</choice>
phetes

(4)

<choice>
    <orig>&pflour;phetes</orig>
    <reg>prophetes</reg>
</choice>

Our grant narrative states that users will be able to toggle abbreviated and expanded forms in the diplomatic layer, which I think implies that only the expanded forms will be visible in the critical layer. By this logic, the encoding requires <choice>; Transcription 1 does not have a representation of the abbreviation. Transcriptions 2 and 3 both allow the stylesheet to choose the appropriate form but have different semantic implications. Transcription 2 marks the p-flourish as an abbreviation, which can be replaced by an expansion in the diplomatic layer. The critical layer would implicitly show the expanded form, but it is not marked specifically for this layer. Transcription 3 does that, but the diplomatic layer would then have to borrow (semantically) inappropriately from the "regularised" form. Reluctantly, I would choose Transcription 2 over Transcription 3. Transcription 4--the whole word approach--would require some more sophisticated stylesheet manipulation to allow toggling of just the &pflour; portion, and I think this would be unwieldy at best. A possible resolution is:

(5)

<choice>
    <orig><am>&pflour;</am>phetes</orig>
    <reg><ex>pro</ex>phetes</reg>
</choice>

A variant of Transcription 2 might be

(6)

<choice>
    <am>&pflour;</am> or <abbr>&pflour;</abbr>
    <ex>pro</ex>
</choice>
phetes

Since <ex> is explicitly an editorial expansion, it semantically embodies the critical representation. All of the above transcriptions validate, but am I missing some other ways this issue could be addressed? Should we have some best practices in the Guidelines for handling this situation?
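To make the layer logic concrete, here is a rough Python sketch of how a processor might pick one side of a <choice> for each layer, using the shape of Transcription 5. It is a stand-in for the stylesheet, not a description of it: the function name `render` is hypothetical, the elements are un-namespaced, and "[pflour]" is a plain-text placeholder for the &pflour; entity so that the sample parses without the entity declarations.

```python
import xml.etree.ElementTree as ET

# Transcription 5, with "[pflour]" standing in for the &pflour; entity.
SAMPLE = (
    "<l><choice>"
    "<orig><am>[pflour]</am>phetes</orig>"
    "<reg><ex>pro</ex>phetes</reg>"
    "</choice> seide</l>"
)


def render(elem, layer):
    """Return the text of `elem` for the 'diplomatic' or 'critical' layer.

    The diplomatic layer skips <reg>; the critical layer skips <orig>.
    """
    if layer not in ("diplomatic", "critical"):
        raise ValueError("layer must be 'diplomatic' or 'critical'")
    skip = "reg" if layer == "diplomatic" else "orig"
    parts = [elem.text or ""]
    for child in elem:
        if child.tag != skip:
            parts.append(render(child, layer))
        # Text following a child belongs to the parent, so it is always kept.
        parts.append(child.tail or "")
    return "".join(parts)


if __name__ == "__main__":
    line = ET.fromstring(SAMPLE)
    print(render(line, "diplomatic"))
    print(render(line, "critical"))
```

Toggling abbreviations within the diplomatic layer would need a further rule for <am> and <ex>, which this sketch does not attempt.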

Abbreviation of Jesus Christ

In line 75 of the Life of Christ, there is an instance of "Ihu" with a bar between "h" and "u". The online Book of Margery Kempe uses "Ih̅u" to represent something similar to this: i.e. "h" with a macron. However, in the Laud instance, the macron seems to me to be exactly in between the "h" and the "u" (it is not a crossed "h"). Perhaps someone else should take a look and comment on my reading.

In the same line, the scribe's abbreviation of "Christ" has what appears to be a minim over the "c", for which I've used "". Again, someone should confirm this reading, and, if it is confirmed, we need to think about the use of this entity to represent the abbreviation mark since it is a Private Use character.

Index of Middle English Verse refs in msContents and elsewhere

Although our Guidelines specify that we will provide IMEV references, it seems reasonable to use the digital version DIMEV and just provide URLs to that. NB. It may still be necessary to use IMEV if we are unable to find a reference in DIMEV (I have not found one for The Life of Christ). But DIMEV should be our default.

Locus values and DIMEV refs for individual SEL vitae

We need to fill this information in. Sharon, this is assigned to you but we could re-assign it to students if you don't want to do this.

Add edition bibliography for each <msItem>

Our template file currently only lists external bibliography for the Life of Christ. We should add print editions for others (at least the most recent scholarly edition).

For citation style, see issue #6.

Folio number discrepancies

Folio numbers are taken from Laing and DIMEV. They do not correspond to the Bodleian-designated foliation. We need to resolve this.

Add possibility of multiple values to @type in <note>

It appears that this is not valid in the current schema.

Do we need to use <lb/> at the beginning of an <l> element?

That is: <l><lb/>Beginning of the line...</l>

@skgoetz:
I wouldn’t, as you know…. See first example at http://www.tei-c.org/release/doc/tei-p5-doc/en/html/CO.html#COVE

I'm not convinced by this example, the logic of which is described thus: "By convention, the start of a metrical line implies the start of a typographic line". Of course, we are not dealing with typographic lines, but, as an Anglo-Saxonist, I am used to the start of a metrical line not implying the start of a line in the manuscript.

This is not to say that we have to include <lb/> at the beginning of each metrical line where it does correspond to a new line in the manuscript. It just means that we have to be able to render both types of text in the diplomatic layer. Is a stylesheet up to the task?

Spelling of saints' names

I suggest using David Hugh Farmer, The Oxford Dictionary of Saints, 5th ed. (Oxford: Oxford University Press, 2003) for the spellings of saints' names. This is available online from Oxford Reference, and most university libraries will likely have a subscription. So owners of earlier editions shouldn't need to purchase a new one on the off chance that the spellings have changed between editions.

Duplicate xml:id values validate

I just noticed that multiple tags like <pb n="15r" xml:id="pb0009"/> with the same xml:id validate in oXygen. Perhaps the schema is not set up to enforce unique IDs?
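Until the schema question is settled, duplicates can be found with a quick script. The following Python sketch uses only the standard library; the function name `duplicate_ids` and the three-element sample are illustrative, with the repeated value taken from the example above.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

SAMPLE = (
    "<text>"
    '<pb n="15r" xml:id="pb0009"/>'
    '<pb n="15v" xml:id="pb0009"/>'
    '<pb n="16r" xml:id="pb0010"/>'
    "</text>"
)


def duplicate_ids(xml_text):
    """Return a sorted list of xml:id values that occur more than once."""
    root = ET.fromstring(xml_text)
    counts = Counter(
        elem.get(XML_ID) for elem in root.iter() if elem.get(XML_ID) is not None
    )
    return sorted(value for value, count in counts.items() if count > 1)


if __name__ == "__main__":
    print(duplicate_ids(SAMPLE))
```

A file that relies on the project's entity declarations would need those declarations present (or expanded beforehand) for the parse to succeed.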

<sic> and <corr/> or <del> for underdotted deletions?

Consider this transcription of "sunfolei" (with i underdotted for deletion):

sunfole
<choice>
    <sic>i</sic>
    <corr/>
</choice>

Here the "i" is deleted in the critical layer by the empty <corr/> element, but there is no indication that the "i" is underdotted. An alternative approach is the following:

sunfole<del rend="subpointed" hand="#bioUN001">i</del>

I think these are very close to being formally the same since <del> is implicitly a correction. However, to provide rendering information for the dot, you would need an entity or you would need to use <sic rend="subpointed">. However, @hand is not possible on <sic> without customisation. <sic> seems to me to imply the main hand of the passage, anyway, but <del hand=""> provides some more flexibility for deletion by later hands. Overall, I think <del> is a better bet.

Superscripted abbreviation markers

I recently came across "with" abbreviated "w^t". Is the most efficient way to encode this as follows?

<choice>
    <orig>w<am rend="superscript">t</am></orig>
    <reg>w<ex>ith</ex></reg>
</choice>

I don't want to use a combining "t" character since the "t" is not directly above the preceding character, and embedding <hi> inside <am> is not allowed in our current schema.

Use of o2 for pilcrow and "o" for ornamental capitals

Is there any issue with using o2 for a pilcrow?

@skgoetz : why does PPEA use o in “o2”, anyway? If a letter at all, ought it not to be a2 for a two-line A and so on? (For the prose Brut I used “cap2” because the CDATA tells you which letter it is; “o2” has been bugging me a bit for weeks.)

@scottkleinman: I silently asked the same question in drafting the guidelines. I assumed that "o" was short for ornamental and (unhappily) adopted it from PPEA because I thought it was something that could be changed easily. I'd be happy with "a2", "b2", etc. or "cap2", if people want to change.

Capitalisation at the beginnings of verse lines

For the diplomatic layer: We need to come up with a consistent system for handling the first letters of lines, for which the capitalisation is ambiguous.

For the critical layer: Do we follow the manuscript or do we follow the convention of capitalising all verse lines? We should be consistent.

<publicationStmt> date value prior to publication

Until we actually go public, what should be the value of the <date> element in <publicationStmt>? Sharon adds: Do we have an estimated date of publication?

Markup of "i" and "j"

This is a suggested enhancement to the Guidelines.

In most cases words with Modern English "j" like "joy" and "Jericho" will have an "i" in the manuscript. These words should be coded as follows:

<choice>
    <orig>i</orig>
    <reg>j</reg> 
</choice>

<choice>
    <orig>ioye</orig>
    <reg>joye</reg> 
</choice>

See issue #23 for choosing between these possibilities.

í varia

The Blaise scribe has two different ways of putting an acute accent mark over an /i/: a straight line, which is fine, and a rounded accent (about 45º of an arc), which I can't find any code for in the MUFI guidelines.

You can see examples of these in fol. 228v on lines 3 (not counting header) and 7, respectively. I cannot detect a case or meaning difference, but if you look through the instances, the distinction looks very purposeful. For that reason, and because I'm punctuation-obsessed, I'm inclined to encode the latter type separately.

What do you think?
Can I specify my own ENTITY for this?

[ALSO: How do I label this issue as a question? I can't figure it out.]