wibarab / featuredb

WIBARAB is a project in the field of Arabic dialectology. It consists of various regional sub-projects (four PhD projects) and a large database about bedouin-type dialects of Arabic. The Feature Database will be the main point of integrating the results of the sub-projects. In this repository we collect the primary data of the database in TEI/XML.

License: Other

XSLT 1.45% HTML 93.43% CSS 0.02% Jupyter Notebook 5.11%

acdh-ch arabic-dialects linguistics

featuredb's Issues

socioLinguisticConstraints are not rendered in HTML preview

introduce divGen to indicate location of featureValues list

We want to allow editors to decide where the list of possible feature values should be placed in the description part of a feature value document. We could use divGen for that purpose.

<divGen type="featureValues"/>

add to ODD / Schema
implement in html preview transformation

define curation workflow

Before the curation workflow can be implemented as values of the @status attribute, we need to define it.
Here's what has been proposed so far in our meeting on 2023-01-12:

Draft (default status): data gathering is still ongoing.
Done (WIBARAB marks it): the bulk of data gathering is done (apart from some fieldwork and open doubts). The document is ready to be validated.
Validated (ACDH-CH marks it): the first round of validation is finished and the ACDH-CH team requires no changes.
Needs revision (ACDH-CH marks it): the ACDH-CH team needs some changes from the WIBARAB team before a second, final round of validation.
Revised (WIBARAB marks it): changes have been made after the first validation and the document needs to be validated again.
Completed (ACDH-CH and WIBARAB need to agree): final version of the document, ready to publish.
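The proposed statuses can be read as a small state machine. The following is a minimal sketch under stated assumptions: the spellings of the @status values (apart from "draft" and "done", which appear elsewhere in these issues) and the set of allowed transitions are inferred from the descriptions above and have not been agreed on.

```python
from enum import Enum


class Status(Enum):
    # Value spellings other than "draft" and "done" are assumptions.
    DRAFT = "draft"
    DONE = "done"
    VALIDATED = "validated"
    NEEDS_REVISION = "needsRevision"
    REVISED = "revised"
    COMPLETED = "completed"


# Which team sets each status, as proposed in the meeting.
SET_BY = {
    Status.DRAFT: "default",
    Status.DONE: "WIBARAB",
    Status.VALIDATED: "ACDH-CH",
    Status.NEEDS_REVISION: "ACDH-CH",
    Status.REVISED: "WIBARAB",
    Status.COMPLETED: "ACDH-CH and WIBARAB",
}

# Assumed transitions, inferred from the status descriptions.
ALLOWED = {
    Status.DRAFT: {Status.DONE},
    Status.DONE: {Status.VALIDATED, Status.NEEDS_REVISION},
    Status.NEEDS_REVISION: {Status.REVISED},
    Status.REVISED: {Status.VALIDATED},
    Status.VALIDATED: {Status.COMPLETED},
    Status.COMPLETED: set(),
}


def can_transition(old: Status, new: Status) -> bool:
    """Return True if a document may move from `old` to `new`."""
    return new in ALLOWED[old]
```

A check like `can_transition(Status.DRAFT, Status.COMPLETED)` would then reject a document that skips validation.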

move common lists into dedicated git repository

Several "TEI-encoded lists" are used (and potentially edited) in parallel by different projects. These files should be kept in a central place and therefore be moved into a dedicated git repository (perhaps named vicav-commons?), which can be included as a submodule in the project-specific git repositories.

Candidates are:

fLib.xml
vicav_geodata.xml
vicav_biblio_tei_zotero.xml (i.e. VICAV Zotero dump)

note type should be rendered

It would be helpful to also show the @type of <note> in the HTML rendering.

automatically import tribes from the dialect list in Excel into the TEI tribes file, as discussed in our meeting

personGrp for religions

Changed personGrp/@type religiousGroup to religiousAffiliation.
Not sure what the right term would be. Discuss.

simplify titles of feature documents

Commit number ce0cd788aaddfe7a15c8dc091da96f4e42eb261f - New locations in the Galilee by Ana (August 8th)

Commit ce0cd78 (New locations in the Galilee by Ana, August 8th) was pushed but does not appear in the TEI enricher. The modified files are, however, present in the backup folder. Since many other modifications have been made since then, how should I proceed?

wrong relative paths pointing from feature document to profiles

E.g. in 010_manannot/features/features_djim.xml:

    <wib:featureValueObservation cert="unknown" status="draft" xml:id="fr_0000_new_moon_please" resp="dmp:???">
         ...
               <ptr target="profiles\vicav_profile_LBN-ABI.xml"/>
          ...
    </wib:featureValueObservation>

Since the profiles are located under 010_manannot/profiles, the @target attribute on <ptr> should read ..\profiles\vicav_profile_LBN-ABI.xml. Changing this in the data isn't a problem, but would this break something in the tei_enricher, @charlymo?
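To see where a given @target ends up, the path can be resolved relative to the feature document that contains it. This is a minimal sketch; `resolve_target` is a hypothetical helper and is not part of the tei_enricher. It treats backslashes as path separators so that the Windows-style values in the data resolve as well.

```python
import posixpath


def resolve_target(doc_path: str, target: str) -> str:
    """Resolve a @target value relative to the document that contains it."""
    # Normalise backslashes so Windows-style targets are handled too.
    doc_dir = posixpath.dirname(doc_path.replace("\\", "/"))
    return posixpath.normpath(posixpath.join(doc_dir, target.replace("\\", "/")))


doc = "010_manannot/features/features_djim.xml"

# Current value: resolves into a non-existent folder below features/.
print(resolve_target(doc, r"profiles\vicav_profile_LBN-ABI.xml"))
# 010_manannot/features/profiles/vicav_profile_LBN-ABI.xml

# Corrected value: resolves to the actual profiles folder.
print(resolve_target(doc, r"..\profiles\vicav_profile_LBN-ABI.xml"))
# 010_manannot/profiles/vicav_profile_LBN-ABI.xml
```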

introduce a controlled vocabulary for tribe names

We want to make sure that the tribe names are consistent across our data, so we should add the list both to the ODD / Schema and to the tei_enricher.

There are several ways of implementing that:

Option 1: source from language profiles

Each tribe is represented in a language profile; so we could extract the list out of those profiles describing a tribe (leaving out others).

Pro:

tribes and langProfiles will be consistent.
no duplication of information

Con: Technically probably a bit more complicated:

tei_enricher will need one file with a list, so this will need to be generated programmatically every time a new tribe is added
also, the ODD and the schema will have to be re-generated
it is questionable whether / when we will have language profiles for each tribe

Option 2: dedicated list of tribes

Actually, there is already a stub of a list of tribes at 010_manannot/wibarab_tribes.xml

Pro:

easy to edit / consume
could use schematron rule to

Con:

duplication of sources (some tribes will also have a language profile containing overlapping information)

Introduce new publication subtypes

Currently, the ODD allows several values for @subtype on <bibl> (based on what's in the VICAV Zotero Library):

   <attDef ident="subtype" mode="add">
      <!-- This is extracted by running distinct-values(//biblStruct/@type) on the TEI export of the VICAV bibliography. -->
      <valList type="closed">
         <valItem ident="conferencePaper"/>
         <valItem ident="bookSection"/>
         <valItem ident="journalArticle"/>
         <valItem ident="book"/>
         <valItem ident="encyclopediaArticle"/>
         <valItem ident="thesis"/>
         <valItem ident="magazineArticle"/>
         <valItem ident="manuscript"/>
      </valList>
   </attDef>

It would be great if those showed up in the tei_enricher.

develop expansion XSLT script

The feature documents are made up of references to various external documents. For full validation and for querying the data, these references need to be resolved and the referenced data included in a "full" feature document.

ident on lang; corresp on personGrp

How to group features?

convert featurestructure-elements to FeatureValueObservation-Elements in q-type-file

Regarding the Sociolinguistic constraint again

We briefly discussed the difference between the sociolinguistic constraints and the PersonGroup again and concluded that a simple note element within the sociolinguistic constraints section would suit our purposes just fine: basically as it is now, but shown as 'Sociolinguistic constraint' in the transformation. The PersonGroup would include what we discussed.

replace xml:base="{docPath}" with some other encoding

Two problems:

@xml:base contains a URI, and { is an invalid character there
resolving URIs will no longer work as expected out of the box

Since the purpose of this construct was specific to the Enricher, a processing instruction would probably be the most suitable solution.

validation: make sure that fvo ids are globally unique

In theory, every fvo element should have a globally unique xml:id, obtained by prefixing it with the document id. Since this goes beyond the document-internal validation we've implemented so far, we need to add a check for it.

Currently there are some fvo ids where the "document id" part of the fvo id reads "ft_template": https://github.com/search?q=repo%3Awibarab%2Ffeaturedb+xml%3Aid%3D%22ft_template&type=code
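A cross-document check could look roughly like the sketch below. It is only an illustration: it matches featureValueObservation elements by local name (so it does not depend on the actual wib namespace URI), takes the documents as a hypothetical mapping from file name to XML text, and reports both ids used more than once and ids still carrying the "ft_template" prefix.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def fvo_ids(xml_text: str) -> list:
    """Collect the xml:id of every featureValueObservation in one document."""
    root = ET.fromstring(xml_text)
    return [
        el.get(XML_ID)
        for el in root.iter()
        # Compare local names only, ignoring the namespace part of the tag.
        if el.tag.rsplit("}", 1)[-1] == "featureValueObservation" and el.get(XML_ID)
    ]


def check_fvo_ids(docs: dict):
    """Return (duplicates, template_leftovers) for a {file name: XML text} mapping."""
    seen = defaultdict(list)
    for name, text in docs.items():
        for fvo_id in fvo_ids(text):
            seen[fvo_id].append(name)
    duplicates = {i: names for i, names in seen.items() if len(names) > 1}
    leftovers = sorted(i for i in seen if i.startswith("ft_template"))
    return duplicates, leftovers
```

Reporting the file names alongside each duplicate id makes it easier to decide which document should keep the id.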

Authorship attribution for feature descriptions.

As discussed in our meeting on 2023-12-21, we want to attribute authorship to the descriptive part of a feature document, potentially also for external contributors. For this, we should …

add <byline> to the ODD and make it mandatory within <div type="description">
add a to 010_manannot/wibarab_dmp.xml where the <person> elements for external contributors can be listed
make @resp mandatory on <div type="description">

Further, we should decide whether the author of the feature description should also be mentioned in the <titleStmt> (IMHO s*he should), and how (<author> ? <respStmt> with a dedicated <resp> ?)

implement script to reorder FVO content to conform to the schema

Currently the ODD requires the order to be:

name
bibl
placeName
lang
date

afterwards optional elements in any number or order:

personGrp
cit
note
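A reordering script could rely on a stable sort, as in this minimal sketch. It assumes the fvo has already been parsed with ElementTree and compares local element names only. Optional elements keep their original relative order after the required ones.

```python
import xml.etree.ElementTree as ET

# Required order as stated above; everything else follows in its original order.
REQUIRED_ORDER = ["name", "bibl", "placeName", "lang", "date"]


def local_name(el) -> str:
    """Return the element name without its namespace part."""
    return el.tag.rsplit("}", 1)[-1] if isinstance(el.tag, str) else ""


def reorder_fvo(fvo: ET.Element) -> None:
    """Reorder the children of one fvo element in place."""
    rank = {name: i for i, name in enumerate(REQUIRED_ORDER)}
    children = list(fvo)
    # list.sort() is stable, so elements with equal rank keep their order.
    children.sort(key=lambda el: rank.get(local_name(el), len(REQUIRED_ORDER)))
    fvo[:] = children
```

Whitespace between elements travels with the preceding element, so a pretty-printing pass may be needed afterwards.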

Validation error - Fieldwork

'fieldwork' violates enumeration constraint of 'publication personalCommunication campaign'.
The attribute 'type' with value 'fieldwork' failed to parse.

Open/view @target in editor

To access the profiles directly from the editor, the editor must be able to open/view files from the values of target attributes.

fix xml:ids in Zotero export

Description

Currently, the xml:ids in 010_manannot/vicav_biblio_tei_zotero.xml are generated by the Zotero client and referenced from the individual feature documents. However, these IDs are not reliably stable and can change as entries are added (e.g. adding another publication by the same author from the same year will result in both records' xml:ids being updated to "lastName2023a" and "lastName2023b").

Solution

To avoid this, we have introduced the "biblid" values in Zotero's extra field which we have full control over.
We now just need to add a post-processing step to 080_scripts_generic/vicav_zotero/fetch_generated_tei_and_process.ipynb.

encode author of feature description section

Zotero export: entries without biblid

2024-01-08T10:40:38.4676328Z 2024-01-08 10:40:38,467 - 5U3YWIMG no biblid
2024-01-08T10:40:38.4679487Z 2024-01-08 10:40:38,467 - TYKGGJEB no biblid
2024-01-08T10:40:38.4684605Z 2024-01-08 10:40:38,468 - LG2SHTMB no biblid
2024-01-08T10:40:38.4686009Z 2024-01-08 10:40:38,468 - 6TNYZUA8 no biblid
2024-01-08T10:40:38.4687022Z 2024-01-08 10:40:38,468 - QCPMWAYN no biblid
2024-01-08T10:40:38.4687968Z 2024-01-08 10:40:38,468 - P4WYQADG no biblid
2024-01-08T10:40:38.4689154Z 2024-01-08 10:40:38,468 - TZUT6CRI no biblid
2024-01-08T10:40:38.4690108Z 2024-01-08 10:40:38,468 - VBMVMQE8 no biblid
2024-01-08T10:40:38.4691026Z 2024-01-08 10:40:38,468 - HHA62AUL no biblid
2024-01-08T10:40:38.4692023Z 2024-01-08 10:40:38,468 - DYHVZN2P no biblid
2024-01-08T10:40:38.4692912Z 2024-01-08 10:40:38,468 - JULCPNGK no biblid
2024-01-08T10:40:38.4693925Z 2024-01-08 10:40:38,468 - 8F46VZCI no biblid
2024-01-08T10:40:38.4694859Z 2024-01-08 10:40:38,468 - XP62YEX8 no biblid
2024-01-08T10:40:38.4695691Z 2024-01-08 10:40:38,468 - EEKF92L3 no biblid
2024-01-08T10:40:38.4696728Z 2024-01-08 10:40:38,468 - VZWM5K3W no biblid
2024-01-08T10:40:38.4697584Z 2024-01-08 10:40:38,468 - 2SVX5GW7 no biblid
2024-01-08T10:40:38.4698484Z 2024-01-08 10:40:38,468 - EQQCQX4I no biblid
2024-01-08T10:40:38.4699860Z 2024-01-08 10:40:38,469 - Y4VTSSEN malformed biblid: (biblid:āl_1968_2357)
2024-01-08T10:40:38.4701257Z 2024-01-08 10:40:38,469 - RDCRA9ZI malformed biblid: biblid:ouldbaba_2023_9273)
2024-01-08T10:40:38.4703118Z 2024-01-08 10:40:38,469 - APTEQYR4 malformed biblid: biblid:danna_2023_9272)
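The three outcomes in this log (valid, missing, malformed) could be distinguished as in the sketch below. The accepted pattern is an assumption derived from biblids seen in these issues, such as HerinZammit_2016_3447: ASCII letters, a four-digit year and a four-digit number, alone on a line of Zotero's extra field. The notebook's actual rule may differ.

```python
import re

# Assumed shape of a well-formed entry; not taken from the notebook itself.
BIBLID_LINE = re.compile(r"biblid:([A-Za-z]+_\d{4}_\d{4})")


def classify_extra(extra: str):
    """Return ("ok", id), ("malformed", line) or ("missing", None)."""
    for line in extra.splitlines():
        line = line.strip()
        if "biblid:" not in line:
            continue
        match = BIBLID_LINE.fullmatch(line)
        return ("ok", match.group(1)) if match else ("malformed", line)
    return ("missing", None)


print(classify_extra("biblid:HerinZammit_2016_3447"))
# ('ok', 'HerinZammit_2016_3447')
print(classify_extra("biblid:danna_2023_9272)"))
# ('malformed', 'biblid:danna_2023_9272)')
```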

reference language profile using `<lang>`

Currently, we point to the language profile related to a feature value observation using a simple <ptr> element.
It would probably be more expressive to use <lang corresp="../profiles/profile.xml"/>.

validation errors q-file "chapter"

Two recurring errors come up when validating the "chapter" of the q-file:
The attribute 'type' of the element '{http://www.tei-c.org/ns/1.0}graphic' is not defined in the DTD/schema.
The attribute 'type' of the element '{http://www.tei-c.org/ns/1.0}num' is not defined in the DTD/schema.

zotero: id unrelated to bibliographic data

In 010_manannot/vicav_biblio_tei_zotero.xml the entry http://zotero.org/groups/2165756/items/F4V22ECD has the xml:id "prochazka_2016_3167", which is strange because neither editor nor year is related to the actual bibliographic data. Moreover, in Zotero the entry has a different, more plausible biblid in the extra field (biblid:HerinZammit_2016_3447). We should investigate whether other such cases exist and provide a fix so that the information in the extra field always matches the xml:id in the TEI export.

Validation: relate validation errors to editors

Description

As usual, the various levels of validation only report errors for file names + line numbers.
To make it easier to manage the resolution of errors in the feature documents, each error should be assigned to the editor of the feature value observation element in which it occurs.

translate dialect list into VICAV language profiles

We want to retire the dialect list and transform its information into stubs of language profiles.

remove comment "potentially ambiguous references"

Description / Background

In the past, I've added XML comments to bibliographic references which were potentially ambiguous so curators could systematically check them and set @status on the <bibl> element to OK (cf. ODD). The issues in the data should have been resolved by now; however, in many cases curators only changed the value of @status but did not remove the XML comment.

What's to be done

Remove XML comments reading "potentially ambiguous references" inside of <bibl> elements with @status="OK"
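One way to do this with the Python standard library is sketched below. It is an illustration only: it assumes Python 3.8 or later (for `insert_comments`), handles comments that are direct children of <bibl>, and moves the comment's trailing text to the preceding node, since ElementTree drops that text when an element is removed.

```python
import xml.etree.ElementTree as ET

TEI = "http://www.tei-c.org/ns/1.0"
MARKER = "potentially ambiguous references"


def parse_keeping_comments(xml_text: str) -> ET.Element:
    """Parse XML without discarding comments (ElementTree drops them by default)."""
    parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
    return ET.fromstring(xml_text, parser=parser)


def strip_resolved_markers(root: ET.Element) -> int:
    """Remove marker comments from <bibl status="OK">; return how many were removed."""
    removed = 0
    for bibl in list(root.iter(f"{{{TEI}}}bibl")):
        if bibl.get("status") != "OK":
            continue
        previous = None
        for child in list(bibl):
            if child.tag is ET.Comment and MARKER in (child.text or ""):
                # Keep the text that followed the comment.
                if previous is None:
                    bibl.text = (bibl.text or "") + (child.tail or "")
                else:
                    previous.tail = (previous.tail or "") + (child.tail or "")
                bibl.remove(child)
                removed += 1
            else:
                previous = child
    return removed
```

Comments in <bibl> elements with any other @status are left untouched, so unresolved references stay flagged.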

multiple values in one fvo or in separate fvos?

I think we discussed this before but unfortunately we're not sure anymore what we landed on: if for one feature and one dialect we have several realisations, is it better to create separate fvos or put both/all realisations in one fvo?

Zotero to TEI: represent date of data collection

Cf. #42: Each bibliographic entry used for feature value observations will have a tag for the decade of data collection and the level of certainty in Zotero. This will come out as any other "normal" Zotero tag as <note type="tag"> in the TEI export, however we probably want to make it more expressive.

<biblStruct> itself does not offer many choices. Either

<biblStruct>
     …
      <note type="dataCollection">The data of this publication was collected in the <date type="dataCollection" cert="low" notBefore-iso="1950-01-01" notAfter="1959-12-31">1950s</date> and <date type="dataCollection" cert="low" notBefore-iso="1960-01-01" notAfter="1969-12-31">1960s</date>.
        </note>
</biblStruct>

OR: we can attach this to an @ana attribute on <biblStruct>:

<biblStruct type="conferencePaper" xml:id="Agius_26291991_9868" corresp="http://zotero.org/groups/2165756/items/EJHJT3CB" n="Agius1991a" ana="dataCollected:d1950s dataCollected:d1960s">
      …
</biblStruct>

and at the end of the document:

<interp xml:id="d1950s"><desc>The data of this publication was collected in the <date type="dataCollection" cert="low" notBefore-iso="1950-01-01" notAfter="1959-12-31">1950s</date></desc></interp>

Neither of which I find very convincing, honestly. ... Other ideas, @charlymo @kisram @VeronikaEngler ?
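Whichever encoding is chosen, the date bounds can be derived mechanically from the decade tag. The sketch below assumes the Zotero tag has the form "1950s" and that the certainty is available as a separate value; both are assumptions. It emits the plain notBefore/notAfter pair, which should be swapped for the -iso variants if that is what the ODD allows.

```python
import re
import xml.etree.ElementTree as ET


def decade_bounds(label: str):
    """Turn a decade label such as "1950s" into (first day, last day)."""
    match = re.fullmatch(r"(\d{3})0s", label)
    if match is None:
        raise ValueError(f"not a decade label: {label!r}")
    start = int(match.group(1)) * 10
    return f"{start}-01-01", f"{start + 9}-12-31"


def date_element(label: str, cert: str) -> ET.Element:
    """Build a <date type="dataCollection"> element for one decade."""
    not_before, not_after = decade_bounds(label)
    element = ET.Element("date", {
        "type": "dataCollection",
        "cert": cert,
        "notBefore": not_before,
        "notAfter": not_after,
    })
    element.text = label
    return element


print(decade_bounds("1950s"))
# ('1950-01-01', '1959-12-31')
```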

Zotero: add decade of data collection

To each publication in the VICAV bibliography, we want to add information on when the data was collected.

adapt XSLT for on-the-fly display of profiles

The existing XSLT is VICAV-specific and only creates a div-snippet.

Zotero-TEI export is broken

many invalid xml:ids (spaces, unescaped single quotation marks, brackets etc., e.g. belnap r. kirk_2009_3013)
same xml:id is used for different entries (e.g. behnstedt_1994_0001 is used for @corresp='http://zotero.org/groups/2165756/items/EEI7S3N8' and @corresp='http://zotero.org/groups/2165756/items/7TMLL9NZ')
<extent> must be at the end of the entry (currently it directly follows the <title>)
<title> missing in <monogr> of an analytic publication (e.g. http://zotero.org/groups/2165756/items/5SWT5LJ7)
HTML-Elements inside of TEI (<h2>, <i>)

New personGroup Role

Two particular tribes are not tribes in the traditional sense of the word: they are groups which have come together for various reasons, such as work, have mingled with each other and created their own tribal group and linguistic variety. We would like to call them something along the lines of TribalGroup. The only problem is that they do not have a defined relation with the others, so it would be good if they existed outside of the predefined hierarchy that applies to clan, tribe and confederation.

Introduce controlled vocabulary for names of religions

Feature value observations which are attested for a specific religious group contain a <personGrp> element:

<personGrp type="religiousGroup">
      <name>Christians</name>
</personGrp>

We want to limit the possible values of <name> to one of the following:

Christians
Jews
Muslim
Ibadi
Malikite
Sunni
Shiite
Druze
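Until the list is part of the ODD, existing data could be checked against it with a short script. This is a minimal sketch: it assumes TEI-namespaced <personGrp> elements and takes the @type value as a parameter, since the value used in the data may still change.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Allowed names, copied from the list above.
RELIGIONS = {
    "Christians", "Jews", "Muslim", "Ibadi",
    "Malikite", "Sunni", "Shiite", "Druze",
}


def invalid_religion_names(root: ET.Element, group_type: str = "religiousGroup") -> list:
    """Return every <name> value in matching <personGrp> elements that is not allowed."""
    invalid = []
    for group in root.iterfind(f".//tei:personGrp[@type='{group_type}']", TEI_NS):
        for name in group.findall("tei:name", TEI_NS):
            value = (name.text or "").strip()
            if value not in RELIGIONS:
                invalid.append(value)
    return invalid
```

An empty result means all names in the document already conform to the proposed vocabulary.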

TODO

add to ODD (@dasch124 )
introduce to tei_enricher (@charlymo) - related to https://gitlab.oeaw.ac.at/acdh-ch/object-pascal/tei-enricher/-/issues/4

label (@n) for taxonomy

Added n="sedentismType" to taxonomy. Look for better term under which bedouin (=nomadic?), sedentary, mixed can be grouped.

Add document status "in progress"

meeting 2024-01-11:

Currently, validation is done only on documents indicated as "done". For feature documents which are based on fieldwork, it will take some time until they reach this status, yet we might want at least parts of them to be validated.
We could introduce a third document status, "in progress", for which validation errors of fvos with status != "done" are dropped so they don't bloat the status list.
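The proposed rule can be expressed as a simple filter over the collected errors. In this sketch the `ValidationError` record and its fields are hypothetical; only the status values "done" and "in progress" come from the proposal above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ValidationError:
    # Hypothetical record; the real validation output may look different.
    document: str
    fvo_id: str
    fvo_status: str
    message: str


def errors_to_report(doc_status: str, errors: list) -> list:
    """Select the validation errors that should appear in the status list."""
    if doc_status == "done":
        # Finished documents are validated in full, as is done currently.
        return list(errors)
    if doc_status == "in progress":
        # Only fvos already marked "done" contribute errors.
        return [e for e in errors if e.fvo_status == "done"]
    # Any other document status is not validated.
    return []
```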

investigate ways of converting LAMETA to TEI

https://github.com/onset/lameta/blob/master/sample%20data/Edolo%20sample/Sessions/ETR009/ETR009_Careful.mp3.meta