wibarab / featuredb

WIBARAB is a project in the field of Arabic dialectology. It consists of various regional sub-projects (four PhD projects) and a large database about bedouin-type dialects of Arabic. The Feature Database will be the main point of integrating the results of the sub-projects. In this repository we collect the primary data of the database in TEI/XML.

License: Other

XSLT 1.45% HTML 93.43% CSS 0.02% Jupyter Notebook 5.11%
acdh-ch arabic-dialects linguistics

featuredb's People

Contributors

antonellat73, charlymo, claudialaaber, dasch124, github-actions[bot], gundak95, hessabi3108, iriartedia89, johdop, kisram, likeanga, mariarebecca, prochas8, simar0at, terlan712, veronikaengler

featuredb's Issues

introduce divGen to indicate location of featureValues list

We want to allow editors to decide where the list of possible feature values should be placed in the description part of a feature value document. We could use divGen for that purpose.

<divGen type="featureValues"/>
  • add to ODD / Schema
  • implement in html preview transformation

define curation workflow

Before the curation workflow can be encoded as values of @status attributes, we need to define it.
Here's what was proposed in our meeting on 2023-01-12:

  • Draft (default status): data gathering is still ongoing.
  • Done (marked by the WIBARAB team): the bulk of the data gathering is done (minus some fieldwork and open doubts). The document is ready to be validated.
  • Validated (marked by the ACDH-CH team): the first round of validation is finished and the ACDH-CH team requires no changes.
  • Needs revision (marked by the ACDH-CH team): the ACDH-CH team needs some changes from the WIBARAB team before a second, final round of validation.
  • Revised (marked by the WIBARAB team): changes have been made after the first validation and the document needs to be validated again.
  • Completed (ACDH-CH and WIBARAB need to agree): final version of the document, ready to publish.
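
The proposed statuses can be read as a small state machine. The following is only a sketch of one possible reading: the identifiers (`draft`, `needsRevision`, ...) and the allowed transitions are assumptions derived from the labels above, not values taken from the ODD.

```python
# Hypothetical @status identifiers; the issue only gives human-readable labels.
# The transitions are one reading of the proposal, not an agreed specification.
ALLOWED_TRANSITIONS = {
    "draft": {"done"},
    "done": {"validated", "needsRevision"},
    "needsRevision": {"revised"},
    "revised": {"validated"},
    "validated": {"completed"},
    "completed": set(),
}


def can_transition(current: str, new: str) -> bool:
    """Return True if moving from `current` to `new` follows the sketched workflow."""
    return new in ALLOWED_TRANSITIONS.get(current, set())


print(can_transition("draft", "done"))       # True
print(can_transition("draft", "completed"))  # False
```

A table like this could also document which team is expected to set each status, if that should be checked automatically.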

wrong relative paths pointing from feature document to profiles

E.g. in 010_manannot/features/features_djim.xml:

    <wib:featureValueObservation cert="unknown" status="draft" xml:id="fr_0000_new_moon_please" resp="dmp:???">
         ...
               <ptr target="profiles\vicav_profile_LBN-ABI.xml"/>
          ...
    </wib:featureValueObservation>

Since the profiles are located under 010_manannot/profiles, the @target attribute on <ptr> should read ..\profiles\vicav_profile_LBN-ABI.xml. Changing this in the data isn't a problem, but would it break something in the tei_enricher, @charlymo?
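
To illustrate the difference between the two @target values, here is a minimal sketch that resolves a target relative to the directory of the feature document. The function name `resolve_target` is hypothetical, and normalising backslashes to forward slashes is an assumption about how the paths should be interpreted; this is not how the tei_enricher is known to work.

```python
import posixpath


def resolve_target(feature_doc: str, target: str) -> str:
    """Resolve a @target value relative to the directory of the feature document."""
    # Assumption: backslashes in the data are meant as path separators.
    target_posix = target.replace("\\", "/")
    base_dir = posixpath.dirname(feature_doc)
    return posixpath.normpath(posixpath.join(base_dir, target_posix))


doc = "010_manannot/features/features_djim.xml"
print(resolve_target(doc, r"profiles\vicav_profile_LBN-ABI.xml"))
# 010_manannot/features/profiles/vicav_profile_LBN-ABI.xml (does not exist)
print(resolve_target(doc, r"..\profiles\vicav_profile_LBN-ABI.xml"))
# 010_manannot/profiles/vicav_profile_LBN-ABI.xml
```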

introduce a controlled vocabulary for tribe names

We want to make sure that the tribe names are consistent across our data, so we should add the list both to the ODD / Schema and to the tei_enricher.

There are several ways of implementing that:

Option 1: source from language profiles

Each tribe is represented in a language profile; so we could extract the list out of those profiles describing a tribe (leaving out others).

Pro:

  • tribes and langProfiles will be consistent.
  • no duplication of information

Con: Technically probably a bit more complicated:

  • tei_enricher will need one file with a list, so this will need to be generated programmatically every time a new tribe is added
  • also, the ODD and the schema will have to be re-generated
  • it is questionable whether / when we will have language profiles for each tribe

Option 2: dedicated list of tribes

Actually, there is already a stub of a list of tribes at 010_manannot/wibarab_tribes.xml

Pro:

  • easy to edit / consume
  • could use schematron rule to

Con:

  • duplication of sources (some tribes will also have a language profile containing overlapping information)

Introduce new publication subtypes

Currently, the ODD allows several values for @subtype on <bibl> (based on what's in the VICAV Zotero Library):

   <attDef ident="subtype" mode="add">
      <!-- This is extracted by running distinct-values(//biblStruct/@type) on the TEI export of the VICAV bibliography. -->
      <valList type="closed">
         <valItem ident="conferencePaper"/>
         <valItem ident="bookSection"/>
         <valItem ident="journalArticle"/>
         <valItem ident="book"/>
         <valItem ident="encyclopediaArticle"/>
         <valItem ident="thesis"/>
         <valItem ident="magazineArticle"/>
         <valItem ident="manuscript"/>
      </valList>
   </attDef>

It would be great if those would show up in the tei_enricher

develop expansion XSLT script

The feature documents are made up of references to various external documents. For full validation and for querying the data, these references need to be resolved and the referenced data included in a "full" feature document.

Regarding the Sociolinguistic constraint again

We briefly discussed the difference between the sociolinguistic constraints and the PersonGroup again. We concluded that a simple note element within the sociolinguistic constraints section would suit our purposes just fine: basically as it is now, but in the transformation it would be shown as 'Sociolinguistic constraint'. The PersonGroup would include what we discussed.

replace xml:base="{docPath}" with some other encoding

two problems:

  • @xml:base contains a URI, and { is an invalid character there
  • resolving URIs will no longer work as expected out of the box

Since the purpose of this construct was specific to the Enricher, a processing instruction would probably be the most suitable solution.

Authorship attribution for feature descriptions.

As discussed in our meeting on 2023-12-21, we want to attribute authorship to the descriptive part of a feature document, potentially also for external contributors. For this, we should …

  • add <byline> to the ODD and make it mandatory within <div type="description">
  • add a to 010_manannot/wibarab_dmp.xml where the <person> elements for external contributors can be listed
  • make @resp mandatory on <div type="description">

Further, we should decide whether the author of the feature description should also be mentioned in the <titleStmt> (IMHO s*he should), and how (<author> ? <respStmt> with a dedicated <resp> ?)

Validation error - Fieldwork

'fieldwork' violates enumeration constraint of 'publication personalCommunication campaign'.
The attribute 'type' with value 'fieldwork' failed to parse.

Open/view @target in editor

To access the profiles directly from the editor, the editor must be able to open/view files from the values of target attributes.

fix xml:ids in Zotero export

Description

Currently, the xml:ids in 010_manannot/vicav_biblio_tei_zotero.xml are generated by the Zotero client and referenced from the individual feature documents. However, these IDs are not reliably stable and can change as entries are added (e.g. adding another publication by the same author from the same year will result in both records' xml:ids being updated to "lastName2023a" and "lastName2023b").

Solution

To avoid this, we have introduced the "biblid" values in Zotero's extra field which we have full control over.
We now just need to add a post-processing step to the 080_scripts_generic/vicav_zotero/fetch_generated_tei_and_process.ipynb
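
Such a post-processing step might look roughly like the sketch below. It assumes that each <biblStruct> carries the Zotero item URL in @corresp (as in the examples elsewhere in these issues), that the TEI namespace is used, and that a mapping from Zotero item key to biblid has already been built from the extra field. The function name `apply_biblids` and the mapping are hypothetical; the actual notebook may be organised differently.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def apply_biblids(root, biblids):
    """Overwrite xml:id on each biblStruct with the biblid of its Zotero item.

    `biblids` maps Zotero item keys to biblid values (hypothetical input).
    Returns the item keys for which no biblid was available.
    """
    missing = []
    for bibl in root.iter(f"{{{TEI_NS}}}biblStruct"):
        # The item key is the last path segment of the Zotero URL in @corresp.
        key = bibl.get("corresp", "").rsplit("/", 1)[-1]
        biblid = biblids.get(key)
        if biblid is None:
            missing.append(key)
        else:
            bibl.set(XML_ID, biblid)
    return missing


# Illustrative input only; the xml:id values here are made up.
sample = (
    '<listBibl xmlns="http://www.tei-c.org/ns/1.0">'
    '<biblStruct xml:id="agius1991a" corresp="http://zotero.org/groups/2165756/items/EJHJT3CB"/>'
    '<biblStruct xml:id="x2023a" corresp="http://zotero.org/groups/2165756/items/5U3YWIMG"/>'
    "</listBibl>"
)
root = ET.fromstring(sample)
print(apply_biblids(root, {"EJHJT3CB": "Agius_26291991_9868"}))  # ['5U3YWIMG']
```

Any references in the feature documents that still point at the old, Zotero-generated IDs would need to be updated in the same step.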

Zotero export: entries without biblid

2024-01-08T10:40:38.4676328Z 2024-01-08 10:40:38,467 - 5U3YWIMG no biblid
2024-01-08T10:40:38.4679487Z 2024-01-08 10:40:38,467 - TYKGGJEB no biblid
2024-01-08T10:40:38.4684605Z 2024-01-08 10:40:38,468 - LG2SHTMB no biblid
2024-01-08T10:40:38.4686009Z 2024-01-08 10:40:38,468 - 6TNYZUA8 no biblid
2024-01-08T10:40:38.4687022Z 2024-01-08 10:40:38,468 - QCPMWAYN no biblid
2024-01-08T10:40:38.4687968Z 2024-01-08 10:40:38,468 - P4WYQADG no biblid
2024-01-08T10:40:38.4689154Z 2024-01-08 10:40:38,468 - TZUT6CRI no biblid
2024-01-08T10:40:38.4690108Z 2024-01-08 10:40:38,468 - VBMVMQE8 no biblid
2024-01-08T10:40:38.4691026Z 2024-01-08 10:40:38,468 - HHA62AUL no biblid
2024-01-08T10:40:38.4692023Z 2024-01-08 10:40:38,468 - DYHVZN2P no biblid
2024-01-08T10:40:38.4692912Z 2024-01-08 10:40:38,468 - JULCPNGK no biblid
2024-01-08T10:40:38.4693925Z 2024-01-08 10:40:38,468 - 8F46VZCI no biblid
2024-01-08T10:40:38.4694859Z 2024-01-08 10:40:38,468 - XP62YEX8 no biblid
2024-01-08T10:40:38.4695691Z 2024-01-08 10:40:38,468 - EEKF92L3 no biblid
2024-01-08T10:40:38.4696728Z 2024-01-08 10:40:38,468 - VZWM5K3W no biblid
2024-01-08T10:40:38.4697584Z 2024-01-08 10:40:38,468 - 2SVX5GW7 no biblid
2024-01-08T10:40:38.4698484Z 2024-01-08 10:40:38,468 - EQQCQX4I no biblid
2024-01-08T10:40:38.4699860Z 2024-01-08 10:40:38,469 - Y4VTSSEN malformed biblid: (biblid:āl_1968_2357)
2024-01-08T10:40:38.4701257Z 2024-01-08 10:40:38,469 - RDCRA9ZI malformed biblid: biblid:ouldbaba_2023_9273)
2024-01-08T10:40:38.4703118Z 2024-01-08 10:40:38,469 - APTEQYR4 malformed biblid: biblid:danna_2023_9272)
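
The two kinds of messages in this log ("no biblid" and "malformed biblid") could be produced by a check along the following lines. This is a sketch under assumptions: the expected shape `biblid:<letters>_<digits>_<digits>` is inferred from the examples in these issues, the restriction to ASCII letters is a guess, and the function name `check_extra` is hypothetical.

```python
import re

# Assumed shape of a well-formed entry in Zotero's extra field.
BIBLID_LINE = re.compile(r"biblid:([A-Za-z]+_\d+_\d+)")


def check_extra(extra):
    """Classify the content of an extra field as 'ok', 'malformed' or 'missing'."""
    for line in extra.splitlines():
        line = line.strip()
        if "biblid:" not in line:
            continue
        match = BIBLID_LINE.fullmatch(line)
        if match:
            return ("ok", match.group(1))
        return ("malformed", line)
    return ("missing", None)


print(check_extra("biblid:HerinZammit_2016_3447"))  # ('ok', 'HerinZammit_2016_3447')
print(check_extra("biblid:ouldbaba_2023_9273)"))    # ('malformed', 'biblid:ouldbaba_2023_9273)')
print(check_extra(""))                              # ('missing', None)
```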

reference language profile using `<lang>`

Currently, we point to the language profile related to a feature value observation using a simple <ptr> element.
It would probably be more expressive to use <lang corresp="../profiles/profile.xml"/>.

zotero: id unrelated to bibliographic data

In 010_manannot/vicav_biblio_tei_zotero.xml the entry http://zotero.org/groups/2165756/items/F4V22ECD has xml:id "prochazka_2016_3167", which is strange because neither the name nor the number relates to the actual bibliographic data. Moreover, in Zotero, the entry has a different, more plausible biblid in the extra field (biblid:HerinZammit_2016_3447). We should investigate whether other such cases exist and provide a fix so that the information in the extra field always matches the xml:id in the TEI export.

Validation: relate validation errors to editors

Description

As usual, the various levels of validation only report errors for file names + line numbers.
To make it easier to manage the resolution of errors in the feature documents, each error should be assigned to the editor of the feature value observation element in which the error occurred.
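
One way to relate a reported line number to an editor is to record the line range of every feature value observation together with its @resp value. The sketch below uses Python's built-in expat bindings for this; the function names are hypothetical, and it assumes that @resp on <wib:featureValueObservation> identifies the responsible editor.

```python
import xml.parsers.expat


def fvo_line_ranges(xml_text):
    """Return (start_line, end_line, resp) for each featureValueObservation."""
    ranges, stack = [], []
    parser = xml.parsers.expat.ParserCreate()

    def is_fvo(name):
        # Element names arrive as qualified names, e.g. "wib:featureValueObservation".
        return name.split(":")[-1] == "featureValueObservation"

    def start(name, attrs):
        if is_fvo(name):
            stack.append((parser.CurrentLineNumber, attrs.get("resp")))

    def end(name):
        if is_fvo(name):
            start_line, resp = stack.pop()
            ranges.append((start_line, parser.CurrentLineNumber, resp))

    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.Parse(xml_text, True)
    return ranges


def editor_for_line(ranges, line):
    """Return the @resp of the observation containing `line`, or None."""
    for start_line, end_line, resp in ranges:
        if start_line <= line <= end_line:
            return resp
    return None
```

Given a validation report with file names and line numbers, `editor_for_line` could then be used to group the errors per editor.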

remove comment "potentially ambiguous references"

Description / Background

In the past, I added XML comments to bibliographic references which were potentially ambiguous so that curators could systematically check them and set @status on the <bibl> element to OK (cf. ODD). The issues in the data should have been resolved by now; however, in many cases curators only changed the value of @status but did not remove the XML comment.

What's to be done

Remove XML comments reading "potentially ambiguous references" inside of <bibl> elements with @status="OK"
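
A minimal sketch of such a clean-up with Python's standard library follows. It assumes the documents are in the TEI namespace and relies on `TreeBuilder(insert_comments=True)` (Python 3.8+) so that comments survive parsing; the function names are hypothetical. Note that ElementTree does not preserve every formatting detail of the source file, so the output should be diffed before committing.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
MARKER = "potentially ambiguous references"


def parse_keeping_comments(xml_text):
    """Parse XML so that comment nodes are kept in the tree."""
    parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
    return ET.fromstring(xml_text, parser=parser)


def remove_resolved_comments(root):
    """Remove marker comments inside <bibl status="OK">; return how many were removed."""
    removed = 0
    for bibl in root.iter(f"{{{TEI_NS}}}bibl"):
        if bibl.get("status") != "OK":
            continue
        previous = None
        for child in list(bibl):
            if child.tag is ET.Comment and MARKER in (child.text or ""):
                # Keep the text that follows the comment by moving it
                # to the preceding node (or to the start of <bibl>).
                tail = child.tail or ""
                if previous is None:
                    bibl.text = (bibl.text or "") + tail
                else:
                    previous.tail = (previous.tail or "") + tail
                bibl.remove(child)
                removed += 1
            else:
                previous = child
    return removed


# Illustrative input: only the first <bibl> has status="OK".
sample = (
    '<div xmlns="http://www.tei-c.org/ns/1.0">'
    '<bibl status="OK"><!-- potentially ambiguous references -->A</bibl>'
    '<bibl><!-- potentially ambiguous references -->B</bibl>'
    "</div>"
)
root = parse_keeping_comments(sample)
print(remove_resolved_comments(root))  # 1
```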

multiple values in one fvo or in separate fvos?

I think we discussed this before, but unfortunately we're not sure anymore what we landed on: if we have several realisations for one feature and one dialect, is it better to create separate fvos or to put both/all realisations in one fvo?

Zotero to TEI: represent date of data collection

Cf. #42: Each bibliographic entry used for feature value observations will have a tag for the decade of data collection and the level of certainty in Zotero. This will come out as any other "normal" Zotero tag as <note type="tag"> in the TEI export, however we probably want to make it more expressive.

<biblStruct> itself does not offer many choices. Either

<biblStruct>
     …
      <note type="dataCollection">The data of this publication was collected in the <date type="dataCollection" cert="low" notBefore-iso="1950-01-01" notAfter="1959-12-31">1950s</date> and <date type="dataCollection" cert="low" notBefore-iso="1960-01-01" notAfter="1969-12-31">1960s</date>.
        </note>
</biblStruct>

OR: we can attach this to an @ana attribute on <biblStruct>:

<biblStruct type="conferencePaper" xml:id="Agius_26291991_9868" corresp="http://zotero.org/groups/2165756/items/EJHJT3CB" n="Agius1991a" ana="dataCollected:d1950s dataCollected:d1960s">
      …
</biblStruct>

and at the end of the document:

<interp xml:id="d1950s"><desc>The data of this publication was collected in the <date type="dataCollection" cert="low" notBefore-iso="1950-01-01" notAfter="1959-12-31">1950s</date></desc></interp>

Neither of which I find very convincing, honestly. ... Other ideas, @charlymo @kisram @VeronikaEngler ?
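
Whichever encoding is chosen, the date range for a decade can be derived mechanically from the tag. The sketch below assumes the Zotero tag has the form "1950s"; the function name `decade_to_range` is hypothetical.

```python
import re


def decade_to_range(tag):
    """Turn a decade label such as '1950s' into (notBefore, notAfter) ISO dates."""
    match = re.fullmatch(r"(\d{3})0s", tag)
    if match is None:
        raise ValueError(f"not a decade label: {tag!r}")
    start = int(match.group(1)) * 10
    return f"{start:04d}-01-01", f"{start + 9:04d}-12-31"


print(decade_to_range("1950s"))  # ('1950-01-01', '1959-12-31')
```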

Zotero-TEI export is broken

  • many invalid xml:ids (spaces, unescaped single quotation marks, brackets etc., e.g. belnap r. kirk_2009_3013)
  • same xml:id is used for different entries (e.g. behnstedt_1994_0001 is used for @corresp='http://zotero.org/groups/2165756/items/EEI7S3N8' and @corresp='http://zotero.org/groups/2165756/items/7TMLL9NZ')
  • <extent> must be at the end of the entry (currently it's directly following the <title>)
  • <title> missing in <monogr> of an analytic publication (e.g. http://zotero.org/groups/2165756/items/5SWT5LJ7)
  • HTML-Elements inside of TEI (<h2>, <i>)
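
The first two points can be checked automatically. The following sketch reports xml:ids that contain characters outside a conservative ASCII subset and xml:ids that occur more than once. The pattern is deliberately stricter than the XML NCName rule (which also permits many non-ASCII letters), and the function name `check_ids` is hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
# ASCII-only approximation of an NCName; stricter than the XML specification.
ASCII_NCNAME = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*")


def check_ids(root):
    """Return (invalid_ids, duplicates) for all xml:id attributes below `root`."""
    invalid = []
    by_id = defaultdict(list)
    for element in root.iter():
        xml_id = element.get(XML_ID)
        if xml_id is None:
            continue
        if not ASCII_NCNAME.fullmatch(xml_id):
            invalid.append(xml_id)
        by_id[xml_id].append(element.get("corresp"))
    duplicates = {i: refs for i, refs in by_id.items() if len(refs) > 1}
    return invalid, duplicates


# Illustrative input built from the examples above.
sample = (
    "<listBibl>"
    '<biblStruct xml:id="belnap r. kirk_2009_3013"/>'
    '<biblStruct xml:id="behnstedt_1994_0001" corresp="http://zotero.org/groups/2165756/items/EEI7S3N8"/>'
    '<biblStruct xml:id="behnstedt_1994_0001" corresp="http://zotero.org/groups/2165756/items/7TMLL9NZ"/>'
    "</listBibl>"
)
invalid, duplicates = check_ids(ET.fromstring(sample))
print(invalid)           # ['belnap r. kirk_2009_3013']
print(list(duplicates))  # ['behnstedt_1994_0001']
```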

New personGroup Role

Two particular tribes are not tribes in the traditional sense of the word: they are groups that have come together for various reasons, such as work, have mingled with each other, and have created their own tribal group and linguistic variety. We would like to call them something along the lines of TribalGroup e.g. . The only problem is that they do not have a defined relation with the others, so it would be good if they existed outside of the predefined hierarchy which applies to clan - tribe - confederation.

Introduce controlled vocabulary for names of religions

Feature value observations which are attested for a specific religious group contain a <personGrp> element:

<personGrp type="religousGroup">
      <name>Christians</name>
</personGrp>

We want to limit the possible values of <name> to one of the following:

  • Christians
  • Jews
  • Muslim
  • Ibadi
  • Malikite
  • Sunni
  • Shiite
  • Druze
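
Until the list is part of the ODD / Schema, the restriction could be checked with a small script. This sketch assumes the TEI namespace and uses the @type value exactly as spelled in the example above; the function name `invalid_religion_names` is hypothetical.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
ALLOWED_RELIGIONS = frozenset(
    {"Christians", "Jews", "Muslim", "Ibadi", "Malikite", "Sunni", "Shiite", "Druze"}
)


def invalid_religion_names(root):
    """Return all <name> values in religious-group <personGrp> elements not in the list."""
    invalid = []
    for group in root.iter(f"{{{TEI_NS}}}personGrp"):
        # @type is matched as spelled in the example in this issue.
        if group.get("type") != "religousGroup":
            continue
        for name in group.findall(f"{{{TEI_NS}}}name"):
            value = (name.text or "").strip()
            if value not in ALLOWED_RELIGIONS:
                invalid.append(value)
    return invalid
```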

TODO

label (@n) for taxonomy

Added n="sedentismType" to taxonomy. Look for better term under which bedouin (=nomadic?), sedentary, mixed can be grouped.

Add document status "in progress"

meeting 2024-01-11:

Currently, validation is done only on documents marked as "done". For feature documents which are based on fieldwork, it will take some time until they reach this status, yet we might want at least parts of them to be validated.
We could think to introduce a third document status "in progress" where validation errors of fvos with status != "done" are dropped, so they don't bloat the status list.
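
The proposed filtering could be sketched as follows. The `Finding` record and its fields are hypothetical and only stand in for whatever the validation pipeline actually reports; the behaviour for statuses other than "done" and "in progress" follows the current practice described above.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    """One validation error (hypothetical structure)."""
    file: str
    line: int
    message: str
    fvo_status: str  # @status of the enclosing feature value observation


def findings_to_report(findings, document_status):
    """Select the validation errors that should appear in the status list."""
    if document_status == "done":
        return list(findings)
    if document_status == "in progress":
        # Drop errors from observations that are not yet marked as done.
        return [f for f in findings if f.fvo_status == "done"]
    # Other documents are not validated at all.
    return []
```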
