A specification for formatting interlinear glossed texts in a way that is computationally parseable

Home Page: https://scription.digitallinguistics.io

License: MIT License

scription dlx digital-linguistics linguistics documentary-linguistics language-documentation language digital-humanities scription-files glosses

scription's Introduction

Scription

This document specifies a simple text format for representing linguistic texts as interlinear glossed examples. This format is known as scription, a term coined by Patrick J. Hall (University of California, Santa Barbara). It makes data quick to enter, is easily read by humans, and is easily converted to other formats used in documentary linguistics.

At its simplest, a scription file is just a basic interlinear gloss. Below is a valid scription file containing a single utterance in Chitimacha:

waxdungu qasi
waxt-qungu qasi
day-one    man
one day a man

However, the scription format supports much more complicated interlinear glosses, as well as the ability to specify metadata about the text. Click the example link below to view a slightly more complex scription file.

View the example scription file.

The complete specification for formatting valid scription files is given below.

Note: You may also be interested in the scription2dlx JavaScript library, which converts scription files to the Data Format for Digital Linguistics (DaFoDiL).

Cite this format using the following model:

Hieber, Daniel W. 2012. digitallinguistics/scription. DOI:10.5281/zenodo.2595548

File Extension / Media Type

Scription files should be treated as plain text files (text/plain) and given the .txt extension. Using other extensions such as .scription or .text is not recommended.

Header

Each scription file may begin with a header containing metadata about the text, between two triple dashes (---). For example:

---
title: How the world began
---

The header content should consist of metadata about the text, in YAML format. The properties included in the header must use the field names recommended for linguistic texts specified by the Data Format for Digital Linguistics, with the exception that the utterances property must NOT be included. Some examples of attributes that users might include are the title, abbreviation, and dateRecorded properties.

If present, the header may not be empty. At a minimum, a title property is required.
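As a rough illustration of the header rules above, the sketch below separates the header from the rest of the file. The function name `split_header` is hypothetical, and the sketch assumes that the opening `---` is the very first line of the file. It deliberately leaves the YAML parsing itself to a YAML library.

```python
def split_header(text):
    """Return (header_text, body). header_text is None when the file has no header.

    Hypothetical helper: assumes the opening --- is the first line of the file.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return None, text
    for index in range(1, len(lines)):
        if lines[index].strip() == "---":
            header = "\n".join(lines[1:index])
            body = "\n".join(lines[index + 1:])
            return header, body
    raise ValueError("header opened with --- but was never closed")


sample = "---\ntitle: How the world began\n---\n\nwaxdungu qasi\none day a man\n"
header, body = split_header(sample)
print(header)  # title: How the world began
```

The returned header text can then be handed to a YAML parser and checked for the required title property.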

Interlinear Gloss Schema

Each text has an interlinear gloss schema that tells readers or parsers what each line in an utterance represents. The interlinear gloss schema is always inferred from the first utterance in the text. Subsequent utterances are then assumed to follow the same schema unless otherwise specified.

Users can specify an interlinear gloss schema using backslash codes at the beginning of each line in an utterance, followed by one or more spaces or tabs, and then the data for that line. Consider the following example text:

\txn   ninakupenda
\m     ni-na-ku-pend-a
\gl    1SG.SUBJ-PRES-2SG.OBJ-love-IND
\tln   I love you

ninaenda
ni-na-end-a
1SG-PRES-go-IND
I am going

This text has 2 utterances, separated by a blank line. The lines in the first utterance are preceded by backslash codes indicating the function of each line. This schema tells readers and parsers that the lines in this utterance are a phonemic transcription of the utterance (\txn), followed by a morphemic analysis (\m) and glosses (\gl), and finally a free translation (\tln). The second utterance is then assumed to follow the same schema, so it does not need backslash codes.

By default (that is, when the first utterance carries no backslash codes), an utterance with only 2 lines is assumed to follow this schema:

\txn
\tln

An utterance with 3 lines is assumed to follow this schema:

\m
\gl
\tln

An utterance with 4 lines is assumed to follow this schema:

\txn
\m
\gl
\tln
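The three default schemas above amount to a lookup by line count. The following is a minimal sketch of that lookup; the names `DEFAULT_SCHEMAS` and `default_schema` are hypothetical. Skipping the metadata line (#) when counting, and raising an error for any other line count, are assumptions of this sketch rather than requirements stated here.

```python
# Default schemas keyed by the number of lines in the first utterance.
DEFAULT_SCHEMAS = {
    2: ["txn", "tln"],
    3: ["m", "gl", "tln"],
    4: ["txn", "m", "gl", "tln"],
}


def default_schema(utterance_lines):
    """Pick the default schema for an utterance that has no backslash codes."""
    # Assumption: the metadata line (#) does not count towards the schema.
    data_lines = [line for line in utterance_lines if not line.startswith("#")]
    count = len(data_lines)
    if count not in DEFAULT_SCHEMAS:
        # Assumption: no default is defined for other line counts.
        raise ValueError(f"no default schema for an utterance with {count} lines")
    return DEFAULT_SCHEMAS[count]


print(default_schema(["waxdungu qasi", "one day a man"]))  # ['txn', 'tln']
```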

The complete list of supported backslash codes is listed in the Lines section. If a backslash code appears more than once in a schema, each instance must have a language or orthography specified. (For example, an utterance with both \tln-en and \tln-es would be valid, but an utterance with \tln and \tln-es would not be valid.) Editors and parsers may support additional backslash codes, but other editors and parsers are not required to support them. Parsers which encounter invalid backslash codes should throw an error. When parsers encounter an undefined backslash code, however, they should not throw an error; parsers should pass through the data unchanged if possible, or ignore it otherwise.
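The rule about repeated backslash codes can be checked mechanically. The sketch below is one possible check, with the hypothetical name `untagged_duplicates`; it takes the codes of a schema without their leading backslash. Exempting the note code (n) from the check is an assumption, based on the fact that note lines may be repeated.

```python
from collections import Counter


def untagged_duplicates(codes):
    """Return the codes that repeat a line type without a language/orthography tag.

    codes: schema codes without the leading backslash, e.g. ["txn", "tln-en"].
    """
    line_types = [code.split("-", 1)[0] for code in codes]
    counts = Counter(line_types)
    offending = []
    for code, line_type in zip(codes, line_types):
        repeated = counts[line_type] > 1
        has_tag = "-" in code
        # Assumption: note lines (n) may repeat without a tag.
        if repeated and not has_tag and line_type != "n":
            offending.append(code)
    return offending


print(untagged_duplicates(["txn", "tln-en", "tln-es"]))  # []
print(untagged_duplicates(["txn", "tln", "tln-es"]))     # ['tln']
```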

Each backslash code must consist of a backslash \, followed immediately by the code indicating the type of line (ex: gl, txn), and optionally a hyphen followed by an abbreviation or ISO language tag, depending on the line. Apart from the leading backslash and the hyphens that introduce or appear within the tag, backslash codes may only contain basic Latin letters (A-Z, a-z; no diacritics) and digits (0-9). Some examples of backslash codes are below:

  • \gl - The glosses line
  • \txn-practical - The phonemic transcription line, in the practical orthography for the language
  • \tln-es - The translation line, in Spanish
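The shape of a backslash code described above can be expressed as a regular expression. The sketch below is one possible reading of it; the names `CODE_LINE` and `parse_coded_line` are hypothetical, and allowing several hyphen-separated parts in the tag (as in \txn-x-practical) is an assumption.

```python
import re

# Backslash, line type, optional hyphenated tag, then optional data after spaces/tabs.
CODE_LINE = re.compile(
    r"^\\(?P<type>[A-Za-z0-9]+)"
    r"(?:-(?P<tag>[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*))?"
    r"(?:[ \t]+(?P<data>.*))?$"
)


def parse_coded_line(line):
    """Return (line_type, tag, data), or None if the line has no backslash code."""
    if not line.startswith("\\"):
        return None
    match = CODE_LINE.match(line)
    if match is None:
        # Invalid backslash codes should cause an error.
        raise ValueError(f"invalid backslash code in line: {line!r}")
    return match.group("type"), match.group("tag"), match.group("data") or ""


print(parse_coded_line(r"\txn-practical ninakupenda"))
# ('txn', 'practical', 'ninakupenda')
```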

If one line in an utterance includes a backslash code, all the other lines in that utterance must have one as well, with two exceptions: the metadata line never starts with a backslash code (it must always start with #), and the note line may always carry its backslash code (\n), even when the other lines have none. Barring these exceptions, parsers should throw an error if they encounter an utterance where only some of the lines begin with backslash codes.
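A parser might enforce this all-or-none rule along the following lines. This is only a sketch: the function name `uses_backslash_codes` and the pattern used to recognize note lines are hypothetical.

```python
import re

# A note line: \n, optionally followed by a hyphenated tag, then whitespace or end of line.
NOTE_LINE = re.compile(r"\\n(?:-[A-Za-z0-9-]+)?(?:[ \t]|$)")


def uses_backslash_codes(utterance_lines):
    """Return True if the utterance is coded, False if not; raise if it is mixed."""
    relevant = [
        line for line in utterance_lines
        if not line.startswith("#")       # metadata line is exempt
        and not NOTE_LINE.match(line)     # note line is exempt
    ]
    coded = [line.startswith("\\") for line in relevant]
    if any(coded) and not all(coded):
        raise ValueError("only some lines in this utterance have backslash codes")
    return any(coded)


print(uses_backslash_codes([r"\txn hujambo", r"\gl", r"\tln hello"]))  # True
```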

If an individual utterance in a text follows a different schema than the one specified in the first utterance, the user must indicate the function of each line by including the backslash code at the beginning of the line. This is most useful when a specific utterance requires an extra line in the interlinear gloss, for whatever reason.

As an example, consider a scription file using the default interlinear gloss schema. Most utterances will look something like this:

kˀiht-ik
want-1SG
I want

If, however, one utterance contains a morphophonological change, the user may choose to add a fourth line to the interlinear gloss for that specific utterance, like so:

\txn ʔučaːši
\m   ʔuči-ʔiš-i
\gl  do-IPFV-3SG
\tln he did it

Note that the following format is also valid:

\txn ʔučaːši
\m ʔuči-ʔiš-i
\gl do-IPFV-3SG
\tln he did it

This will only affect the interlinear gloss schema for this specific utterance. All other utterances in the text will be assumed to continue following the same schema as the interlinear gloss schema in the first utterance.

If the first utterance in a text happens to follow a different interlinear gloss schema than the rest of the utterances in the text, users can simply provide a schema with no data, like so:

\txn
\m
\gl
\tln

In this case, parsers should use this utterance only for the purpose of inferring the interlinear gloss schema; they should not treat it as data.

Utterances

Following the header and one or more line breaks is the collection of utterances in the text, each represented as an interlinear glossed utterance. The collection of utterances may be empty.

Each interlinear glossed utterance is a set of lines of text, and each utterance is separated from other utterances by one or more blank lines. The lines within an interlinear glossed utterance must not be separated by blank lines. To indicate that there is no data for a line, include that line's backslash code, and leave the rest of the line blank, like so:

\txn hujambo
\gl
\tln hello
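Splitting the body of a file into utterances follows directly from the blank-line rule. A minimal sketch, using the hypothetical name `split_utterances`, and treating lines that contain only spaces or tabs as blank (an assumption):

```python
def split_utterances(body):
    """Group the lines of a text body into utterances separated by blank lines."""
    utterances, current = [], []
    for line in body.splitlines():
        if line.strip():
            current.append(line)
        elif current:
            # A blank line closes the utterance collected so far.
            utterances.append(current)
            current = []
    if current:
        utterances.append(current)
    return utterances


body = "ninakupenda\nI love you\n\n\nninaenda\nI am going\n"
print(split_utterances(body))
# [['ninakupenda', 'I love you'], ['ninaenda', 'I am going']]
```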

Each utterance may only contain one line of each type and orthography/language, with the exception of the note line (\n). Users may include multiple note lines, but each must be preceded by the \n backslash code.

The first utterance in a text is always used to infer the interlinear gloss schema for the text. Parsers should assume that each line in an interlinear glossed utterance corresponds to the same number line in the interlinear gloss schema. For example, a parser reading a scription file that uses the default three-line schema (see above) should treat the first line of each utterance as the morphemic analysis, the second as the glosses, and the third as the translation.

If an utterance contains an extra line (that is, one more line than specified in the interlinear gloss schema), that line should be treated as a note line (\n). The behavior of parsers for any additional lines is undefined; parsers may choose to attempt to process that data or not.
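The positional matching described above, including the treatment of one extra line as a note, might look like the sketch below. The name `label_lines` is hypothetical. Raising an error when the line count differs from the schema in any other way is an assumption of this sketch, since that behavior is left undefined.

```python
def label_lines(utterance_lines, schema):
    """Pair each line of an uncoded utterance with its line type from the schema."""
    extra = len(utterance_lines) - len(schema)
    if extra not in (0, 1):
        # Assumption: this sketch rejects anything but an exact match or one extra line.
        raise ValueError("utterance does not match the interlinear gloss schema")
    # One extra line is treated as a note line (n).
    labels = list(schema) + ["n"]
    return list(zip(labels, utterance_lines))


schema = ["m", "gl", "tln"]
print(label_lines(["ni-na-end-a", "1SG-PRES-go-IND", "I am going"], schema))
# [('m', 'ni-na-end-a'), ('gl', '1SG-PRES-go-IND'), ('tln', 'I am going')]
```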

Lines

This section provides guidelines on formatting each line of an interlinear glossed utterance. Lines will have different formatting requirements depending on their type.

Words on a line may be grouped together using square brackets ([ ]). Multiple words that are grouped using square brackets must be treated as one word for the purpose of aligning items in an interlinear gloss. In the following example, the first and last name of a person are grouped together into a single unit, and given the gloss NAME.

\trs Qix kapx [John Smith].
\txn qix kapx [John Smith]
\gl  1SG name NAME
\tln My name is John Smith.
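For alignment purposes, a bracketed group has to be counted as a single word. The sketch below shows one way to tokenize a line accordingly; the names `WORD_PATTERN` and `split_words` are hypothetical. Returning the group without its brackets is an assumption, and punctuation attached to a bracket (as in the \trs line above) is not handled.

```python
import re

# Either a whole [bracketed group] or a run of non-whitespace characters.
WORD_PATTERN = re.compile(r"\[[^\]]*\]|\S+")


def split_words(data):
    """Split a line into words, keeping each bracketed group together as one word."""
    words = []
    for token in WORD_PATTERN.findall(data):
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]  # assumption: drop the brackets themselves
        words.append(token)
    return words


print(split_words("qix kapx [John Smith]"))  # ['qix', 'kapx', 'John Smith']
print(split_words("1SG name NAME"))          # ['1SG', 'name', 'NAME']
```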

Utterance Metadata: #

Each utterance may be preceded by a metadata line, beginning with a hash (#). This can be used to indicate the name of the language of that utterance, the language family, or other notes or metadata about the utterance. This is most useful when the scription text contains a collection of unrelated examples in different languages. An example of an utterance with a metadata line is shown below.

# Chitimacha (isolate; Louisiana)
waxdungu qasi
one day a man

The format of the data contained within this line is unspecified. The behavior of parsers with respect to this line is undefined, except that parsers should not throw an error if this line is encountered. It is recommended that parsers ignore this line by default.

Speaker: \sp

This line consists of an abbreviation for the person who spoke the utterance, usually their initials. The speaker line may not be in multiple languages or writing systems. It may contain only the letters a-z, A-Z, and numbers 0-9. No spaces are allowed. If you need to provide more details about a speaker, you can do so in the metadata header at the beginning of the text.

Transcript: \trs

A transcript of this utterance, including things like prosodic markup, overlap, pauses, and various other discourse features. The transcript may be in multiple orthographies or representational systems. For example, you might have a transcript in both Discourse Functional Transcription (DFT) and Conversation Analysis (CA) formats, which might be represented on two different lines as \trs-dft and \trs-ca respectively.

Phonemic Transcription: \txn

A phonemic transcription of the utterance. Punctuation and capitalization should be avoided in this line. This line should not be broken into morphemes, and should not contain extra white space to align words (use the \w line for that instead). (Morpho)phonological sound changes should be represented in this line. In other words, this line serves as a phonemic transcription of the utterance as the speaker actually pronounced it. Do not include phonemic slashes (/ /) in this line.

This line may be used with multiple orthographies. For example, a language which has a practical writing system may have both \txn-practical and \txn-ipa, to represent each utterance in both the practical orthography and in IPA. It is recommended but not required that orthography abbreviations be valid ISO language tags (for example, \txn-x-practical). However, sometimes this is impractical or unreadable.

Phonetic Transcription: \phon

A phonetic transcription of the utterance. This transcription must be in IPA; it may not be used with multiple orthographies. Do not include phonetic brackets ([ ]) in this line.

Word Transcription: \w

A phonemic transcription of each word in the utterance. This line may be in multiple orthographies. Words in this line are often separated by additional white space, to vertically align words. Otherwise, this line typically contains the same data as the utterance's phonemic transcription line (\txn). This line should not contain morpheme breakdowns (use the \m line instead). Do not include phonemic slashes (/ /) in this line.

Morphemic Analysis: \m

This line shows the individual morphemes in an utterance, separated by hyphens, equal signs, or other symbols recognized as valid glossing symbols by the Leipzig Glossing Rules. Words may be separated by one or more white spaces or tabs (useful for aligning words vertically for readability). If this line is present, the glosses line (\gl) must also be present. This line must contain the same number of words as the glosses line and the literal word translation line (if present). Each word within the utterance must also contain the same number of morphemes as the corresponding word in the glosses line.

The morphemes line may be represented in more than one orthography. For example, in a language that has a practical writing system, a user might include both a \m-practical and \m-ipa line, for the practical orthography and IPA respectively. It is recommended but not required that orthography abbreviations be valid ISO language tags (for example, \m-x-practical). However, sometimes this is impractical or unreadable.

Data should be entered in this line using regular hyphens (U+2010) rather than non-breaking hyphens (U+2011), for ease of entry. Tools may replace regular hyphens with non-breaking hyphens for display purposes, but must not alter the original data by replacing the original, regular hyphens. If non-breaking hyphens are included in the data for this line, they must be treated as word characters rather than as morpheme separators or punctuation.

Glosses: \gl

This line shows the glosses for each morpheme in the morphemic analysis (\m) line, separated by hyphens, equal signs, or other symbols recognized as valid glossing symbols by the Leipzig Glossing Rules. Words may be separated by one or more white spaces or tabs (useful for aligning words vertically for readability). If this line is present, the morphemic analysis line must also be present. This line must contain the same number of words as the morphemic analysis line and the literal word translation line (if present). Each word within the utterance must also contain the same number of glosses as the corresponding word in the morphemic analysis line.
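The word and morpheme counts of the \m and \gl lines can be compared with a few lines of code. The sketch below is illustrative only: `check_alignment` and `SEPARATORS` are hypothetical names, the separator set is limited to the hyphen, equals sign, and tilde of the Leipzig Glossing Rules, and accepting the keyboard hyphen-minus (U+002D) alongside U+2010 is an assumption. Bracketed groups and infixes are not handled.

```python
import re

# Hyphen-minus, U+2010 hyphen, equals sign (clitics), tilde (reduplication).
SEPARATORS = re.compile("[-\u2010=~]")


def check_alignment(morphemes_line, glosses_line):
    """Raise ValueError unless words and morphemes line up between \\m and \\gl."""
    morpheme_words = morphemes_line.split()
    gloss_words = glosses_line.split()
    if len(morpheme_words) != len(gloss_words):
        raise ValueError("the two lines have different numbers of words")
    for word, gloss in zip(morpheme_words, gloss_words):
        if len(SEPARATORS.split(word)) != len(SEPARATORS.split(gloss)):
            raise ValueError(f"{word!r} and {gloss!r} have different numbers of morphemes")
    return True


print(check_alignment("waxt-qungu qasi", "day-one    man"))  # True
```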

Grammatical glosses should be written in CAPS. Lexical glosses should avoid capitalization. Personal names should be glossed NAME or with their literal meaning. Affixes whose meaning is unknown or uncertain may be glossed ??, although other glosses are acceptable (for example, aff1, aff2, etc.).

Data should be entered in this line using regular hyphens (U+2010) rather than non-breaking hyphens (U+2011), for ease of entry. Tools may replace regular hyphens with non-breaking hyphens for display purposes, but must not alter the original data by replacing the original, regular hyphens. Non-breaking hyphens are not permitted on this line.

The glosses line may be represented in multiple languages. For example, an utterance with glosses in both English and Spanish might have the lines \gl-en and \gl-es. Language abbreviations must be valid ISO language tags.

If the same gloss appears twice within a word, it should be treated as a discontinuous morpheme (ex: a circumfix or transfix). The following examples illustrate this use:

# Lakota
na-wíčha-wa-xʔu̧
hear-3PL.UND-1SG.ACT-hear
I hear them

# Darfur Arabic
t-u-r-u-g
way-PL-way-PL-way
ways

To avoid this behavior, you can change the gloss of one of the morphemes (ex: PL^1 and PL^2).
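One way a parser could merge repeated glosses into discontinuous morphemes is sketched below. The name `group_morphemes` is hypothetical, and the sketch assumes plain hyphens as the only separator within the word.

```python
def group_morphemes(morpheme_word, gloss_word):
    """Collect the forms that share a gloss within one word, in order of first appearance."""
    forms = morpheme_word.split("-")
    glosses = gloss_word.split("-")
    grouped = {}
    for form, gloss in zip(forms, glosses):
        # Forms with the same gloss are parts of one discontinuous morpheme.
        grouped.setdefault(gloss, []).append(form)
    return list(grouped.items())


print(group_morphemes("t-u-r-u-g", "way-PL-way-PL-way"))
# [('way', ['t', 'r', 'g']), ('PL', ['u', 'u'])]
```

With distinct glosses such as PL^1 and PL^2, each form ends up in its own group instead.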

Infixes are also supported, using the angle brackets convention specified in the Leipzig Glossing Rules:

# Tagalog
b<um>ili
<FOC>buy
buy
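Infixes marked with angle brackets can be pulled apart from their host with a small pattern. The following sketch uses the hypothetical names `INFIX` and `extract_infixes`, and applies the same function to a word from the \m line and to its gloss.

```python
import re

# Anything enclosed in a single pair of angle brackets.
INFIX = re.compile(r"<([^<>]+)>")


def extract_infixes(word):
    """Return (host, infixes): the word with infixes removed, and the infixes found."""
    infixes = INFIX.findall(word)
    host = INFIX.sub("", word)
    return host, infixes


print(extract_infixes("b<um>ili"))  # ('bili', ['um'])
print(extract_infixes("<FOC>buy"))  # ('buy', ['FOC'])
```

The host then pairs with the host gloss (bili, buy), and the infix with the infix gloss (um, FOC).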

Literal Word Translation: \wlt

A word-by-word literal translation of the utterance. Literal translations of each word must not contain white space; words within each translation may be separated by periods, hyphens, underscores, or other characters. This line must have the same number of words as the morphemic analysis and glosses lines. An example utterance with literal word translations is shown below.

\m   naakxte-m-puy-na
\gl  write-PLACT-PAST.IPFV-3PL
\wlt they.usually.write.with.it
\tln a pen/pencil

Literal Translation: \lit

The literal translation for this utterance. Do not include brackets ([ ]) or quotes (‘ ’) around the data for this line, unless using quotes for reported speech. This line may be represented in multiple languages. For example, an utterance with a literal translation in both Spanish and English might have the lines \lit-spa and \lit-eng. Language abbreviations must be valid ISO language tags.

Free Translation: \tln

The free translation for this utterance. Do not include brackets ([ ]) or quotes (‘ ’) around the data for this line, unless using quotes for reported speech. This line may be represented in multiple languages. For example, an utterance with a free translation in both Spanish and English might have the lines \tln-spa and \tln-eng. Language abbreviations must be valid ISO language tags.

Note: \n

A note about this utterance. Note lines may be in multiple languages (ex: \n-en and \n-es). If the language is absent, parsers should assume that the language is English (en). The language tag, if present, must be a valid ISO language tag.

The source of the note may also be indicated at the beginning of the note text, followed by a colon (:). This should be the initials of the person who was the source of the note, and may contain only basic Latin characters (A-Z, a-z).

Some examples of note lines are below.

What would this utterance mean if the verb were perfect?
DWH: Is this utterance past tense or present tense?
\n MM: I think this is plural.
\n-swa Sentensi hii ni kuhusu bwana yule.
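The optional source prefix of a note can be separated from the note text as sketched below. The names `NOTE_SOURCE` and `parse_note` are hypothetical. The function expects the note text with any backslash code already removed. One caveat of this simple pattern: a note that happens to begin with a single word and a colon would be read as having a source.

```python
import re

# Initials (basic Latin letters only), a colon, optional spaces or tabs, then the note.
NOTE_SOURCE = re.compile(r"^([A-Za-z]+):[ \t]*(.*)$")


def parse_note(text):
    """Return (source, note). source is None when the note names no source."""
    match = NOTE_SOURCE.match(text)
    if match:
        return match.group(1), match.group(2)
    return None, text


print(parse_note("DWH: Is this utterance past tense or present tense?"))
# ('DWH', 'Is this utterance past tense or present tense?')
```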

Source: \s

The source line is used to indicate the bibliographic source of the utterance. This is most useful when the scription file consists of a collection of utterances from different texts or publications, as often happens when preparing a set of examples for typological publications. This line would typically be included immediately after an interlinear glossed example in a publication. It may only be in a single language.

Time Duration: \t

The time duration line is used to indicate the start and end times of the utterance in an associated media recording. It must follow the format SS.MMM-SS.MMM, where SS = the start/end time in seconds, and MMM = the start/end time in milliseconds. The number before the hyphen indicates the start time, and the number after the hyphen indicates the end time. The hyphen may optionally be surrounded by spaces (e.g. 10.123 - 20.456). The start and end times must be specified in seconds and milliseconds, not any other units or precision.
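The duration format lends itself to a regular expression. The sketch below uses the hypothetical names `DURATION` and `parse_duration`; it assumes that the seconds part may have any number of digits and that the milliseconds part always has exactly three.

```python
import re

# seconds.milliseconds, a hyphen optionally surrounded by spaces, seconds.milliseconds
DURATION = re.compile(r"^(\d+\.\d{3})[ \t]*-[ \t]*(\d+\.\d{3})$")


def parse_duration(data):
    """Return (start, end) in seconds for the data of a \\t line."""
    match = DURATION.match(data.strip())
    if match is None:
        raise ValueError(f"not a valid time duration: {data!r}")
    return float(match.group(1)), float(match.group(2))


print(parse_duration("10.123 - 20.456"))  # (10.123, 20.456)
```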

Custom Lines

Utterances may include line types that are not defined in this specification. When parsers encounter an undefined backslash code, they should not throw an error by default; parsers should pass the data through unchanged if possible, and otherwise ignore the line.

Emphasis

Emphasis may be added on any lines containing linguistic data by adding asterisks (*) around the emphasized item or portion of the data:

*wax*dungu qasi
*waxt*-qungu qasi
*day*-one man
one *day* a man

Information about emphasis is most useful when preparing specific pieces of data for publication. Because the location of emphasis needs to vary from example to example and publication to publication, authors should not mark up their original data with emphasis. Instead they should create a new scription file containing the examples to be used in the publication, and indicate emphasis there.

For the following lines (utterance-level data), asterisks may occur anywhere in the data:

  • transcript (\trs)
  • phonemic transcription (\txn)
  • phonetic transcription (\phon)
  • literal translation (\lit)
  • free translation (\tln)

For the following lines (word-level data), pairs of asterisks may only appear at word and morpheme boundaries. Asterisks placed elsewhere should be stripped from the data and ignored by parsers.

  • word transcription (\w)
  • morphemic analysis (\m)
  • glosses (\gl)
  • literal word translation (\wlt)

If an odd number of asterisks is found, they should be stripped from the data and ignored.

Asterisks are for presentational purposes only, and parsers must not save asterisks as part of the linguistic data for the utterance. However, parsers may choose to utilize information about emphasis in other ways, or save that information in separate fields.
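For utterance-level lines, removing the asterisks while remembering what they marked could be done as in the sketch below. The name `strip_emphasis` is hypothetical, and returning the emphasized ranges as (start, end) offsets into the cleaned text is just one possible way of keeping that information separately. The boundary restriction for word-level lines is not enforced here.

```python
def strip_emphasis(data):
    """Return (clean_text, spans): the text without asterisks, and emphasized ranges."""
    if data.count("*") % 2 == 1:
        # An odd number of asterisks: strip them and ignore emphasis on this line.
        return data.replace("*", ""), []
    spans, position = [], 0
    parts = data.split("*")
    for index, part in enumerate(parts):
        # Every second part lies between a pair of asterisks.
        if index % 2 == 1:
            spans.append((position, position + len(part)))
        position += len(part)
    return "".join(parts), spans


clean, spans = strip_emphasis("one *day* a man")
print(clean, spans)  # one day a man [(4, 7)]
```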

scription's Issues

move clarification of colon glyph

The colon may be followed by one or more tabs or spaces. The colon (:) may be omitted if both the source and language are absent.

Change to:

The colon (:) may be followed by one or more tabs or spaces. The colon may be omitted if both the source and language are absent.

clarify utterance-level and word-level lines, and the relationships between them

Certain lines represent properties of an utterance (e.g. \txn, \sp), while other lines are properties of words (e.g. \w, \wlt) or morphemes (e.g. \m, \gl). Clarify which lines are of which type.

In addition, certain lines can be converted programmatically between different levels; or, it might even be the case that the exact same data can function as either. For example, the only difference between the utterance transcription (\txn) and the word transcription (\w) is that the word transcription line allows extra white spacing for vertical alignment. Clarify the relationship and conversion algorithm between these types of lines.

add section to docs for custom lines

Handling of custom lines can technically be inferred from the spec, but the information's a little scattered. Add a section specifically for custom lines, how to format them, and how parsers should handle them.

support utterance metadata

Example Header: # Language (Family > Genus; Location)

A comment line at the beginning of the utterance will be treated as a header

# Swahili (Niger-Congo > Bantu; East Africa)
ni-na-ku-pend-a
1SG.SUBJ-PRES-2SG.OBJ-love-IND
I love you

Also support \lang (language of the utterance) and \s (source) tags.

\sp must be a valid DLx abbreviation

This line consists of an abbreviation for the person who spoke the utterance, usually their initials. The speaker line may not be in multiple languages or writing systems.

Change to:

This line consists of an abbreviation for the person who spoke the utterance, usually their initials. The speaker line may not be in multiple languages or writing systems. It may contain only the letters a-z, A-Z, and numbers 0-9. No spaces are allowed. If you need to provide more details about a speaker, you can do so in the metadata header at the beginning of the text.

clarify status of metadata line

The metadata line (#) is not a backslash code. It should be ignored for the purpose of determining the schema of an utterance. The metadata hash (#) may be present even when other backslash codes are not.

Also remove recommendation that parsers ignore the line. Just say that they may ignore the line.

reword introduction to Lines section

Lines within an interlinear gloss must be formatted in different ways, depending on the line type. This section provides guidelines on formatting each line of an interlinear glossed utterance. The sections are in their recommended order for an interlinear gloss.

Change to:

This section provides guidelines on formatting each line of an interlinear glossed utterance. Lines will have different formatting requirements depending on their type.

support emphasis with asterisks

Emphasis can be added on any line except the metadata, speaker, and notes lines by adding asterisks around the emphasized item or portion of the data:

*wax*dungu qasi
*waxt*-qungu qasi
*day*-one man
one *day* a man

For the following lines (utterance-level data), asterisks may occur anywhere in the data:

  • transcript (\trs)
  • phonemic transcription (\txn)
  • phonetic transcription (\phon)
  • literal translation (\lit)
  • free translation (\tln)

For the following lines (word-level data), pairs of asterisks may only appear at word and morpheme boundaries. Asterisks placed elsewhere should be stripped from the data and ignored.

  • word transcription (\w)
  • morphemic analysis (\m)
  • glosses (\gl)
  • literal word translation (\wlt)

Asterisks should be stripped from the data itself, and should not be stored in data fields by parsers. However, parsers may choose to utilize information about emphasis in other ways.

If an odd number of asterisks is found, emphasis on that line should be ignored.

all lines of the same type must have language / orthographies

If a line type appears multiple times in an interlinear gloss schema, each line must have its language / orthography specified.

bad

\txn    hujambo dunia
\tln    Hello world.
\tln-es Hola mundo.

good

\txn    hujambo dunia
\tln-en Hello world.
\tln-es Hola mundo.

support literal glosses for words

\ltg: literal gloss
\ltw: literal word
\w: word - This is being used for the word transcription
\wg: word gloss
\wl: word literal
\wlt: word literal
\wt: word translation | word transcription

These should either not contain white space, or should be surrounded by brackets. Otherwise parsers cannot align the literal glosses to their words accurately. (This is an instance where the scription format is more rigid than the DLx JSON format. The JSON format would allow for spaces in a literal gloss.)

May be in multiple languages. Must have valid ISO language tags.

Multi-language word glossing

Greetings,

In looking at the scription standard, it seems to be very practical. As I am reading, I am wondering how the standard annotates the data for language, orthography version, or script used. My suggestion is to internationalize the strings after the manner of JSON string internationalization (JSON Spec). An alternative would be to use the RDFa/RDF approach (see discussion throughout this doc) and use @ plus a BCP 47 tag.

In this way I could use scription syntax to indicate that the literal word translation from ut-Ma'in (BCP 47 language code [gel]) was translated into English (USA spelling), French as used in France, and Hausa as written with Arabic script in Nigeria, and then finally the tentative orthography of the ut-Ma'in language.

\trs@gel
\txn@gel-x-ipa
\wlt@en-US
\wlt@fr-fr
\wlt@ha-arab-NG
\wlt@gel-x-orth2020

support word transcriptions

Use the \w code for this.

Rarely needed, since in principle this line would be exactly the same data as the \txn line, with only differences in spacing (though this is still useful).

This line should not contain morpheme breakdowns.

This line may be in multiple orthographies.

Clarify that this line often has extra white space for the purpose of vertical alignment, while the utterance transcription line does not.

support discontinuous affixes

If the same gloss appears twice in a word, it will be treated as a single discontinuous morpheme. To avoid this behavior, change the gloss (e.g. PL_1 and PL_2, or PL^1 and PL^2, etc.).

Examples:

# Lakota
na-wíčha-wa-xʔu̧
hear-3PL.UND-1SG.ACT-hear
I hear them
# Darfur Arabic
t-u-r-u-g
way-PL-way-PL-way

allow users to provide the schema along with the first utterance

Indicating the line types on the first utterance will indicate to parsers that any subsequent utterances should be assumed to follow the same line schema, unless otherwise indicated. This allows users to specify the line types once, obviating the need for them on subsequent utterances.

support a canonical reference line (\ref)

Data imported or typed from other sources often have a canonical reference number that needs to stay associated with them. Two examples:

  • In the Nuuchahnulth texts, each utterance is given an identifier. However, sometimes numbers are skipped or duplicated. Examples of identifiers are FoodThief 27a or Qawiqaalth 13.

  • In the Chitimacha texts, each text, paragraph, and utterance is given a unique identifier, so that the combined identifier for an utterance might look like A23f.2.

Add a \ref line to the specification which may contain any type of identifier. The recommendation should be to limit this to ASCII characters and basic punctuation, but in principle this field can be unconstrained. A more complex example of data that might be entered in this line is a set of coordinates for where an item appears in a PDF, or a UUID, etc.

This also helps avoid overloading the utterance metadata field (#).
