
Comments (16)

rkjtan avatar rkjtan commented on August 14, 2024

Be sure to reference Issue #3, where I noted (1) two known typos in Ref & SubjRef values; and (2) that the use of morphId obscures the fact that, for participant reference and semantic roles, I was referring to the head of the phrase and not just that one word when the phrase contains more than one word. So, the actual referent is the highest applicable noun-type phrase node in the tree where the word with the identified morphId is the head. I'm not sure at this point whether it is best to pull these attributes at the word level or the lowest Node level.

from macula-hebrew.

klosoter avatar klosoter commented on August 14, 2024

So, if I understand correctly:

  1. the ids in the attributes Frame, SubjRef, and Ref need to be replaced with our own bcvw system.
  2. we still need to decide what to do with referenced phrases consisting of more than one word.

I'll start working on the first.


rkjtan avatar rkjtan commented on August 14, 2024

To clarify, the ids in the attributes Frame, SubjRef, and Ref are already in our bcvw system. In the majority of cases, the ids that applied to the skeleton trees apply equally to the OSHB trees. However, there are some places where the ids have shifted due to differences in the text segmentation. So, you would want to use the mapping to figure out what kind of shifts have happened & make sure the ids for Frame, SubjRef, & Ref still refer to the words they are meant to point to.


klosoter avatar klosoter commented on August 14, 2024

Thank you for the clarification. I had assumed this already. Indeed, the numbering often already corresponds to the OSHB trees. I've written an XQuery that replaces all Frame, SubjRef, and Ref ids with the corresponding OSHB ones. It works when I apply it to a small book, but it takes a really long time.

Also, not all references are in the mapping XML. Overall, there are 54 ids (whether in Frame, SubjRef, or Ref) that do not occur as MorphIds in the mapping XML. Do we need to do something about this?


jonathanrobie avatar jonathanrobie commented on August 14, 2024

Could you please attach a list of those 54 ids to this issue so we can look at them and decide what to do?

How long does the query take? We only have to do this once, if it works properly.


klosoter avatar klosoter commented on August 14, 2024

Here is the list:


['010280020111',
 '010280050021',
 '010280050041',
 '010280060021',
 '010280060041',
 '010280060051',
 '010280060071',
 '010280060082',
 '010280060092',
 '010280060141',
 '010280060153',
 '010280060162',
 '010280060211',
 '010280060221',
 '010280070021',
 '010460010021',
 '010460010032',
 '010460010091',
 '020070110042',
 '030270160022',
 '030270160051',
 '030270160081',
 '060190130081',
 '060190130102',
 '100200150061',
 '100200150122',
 '100200150151',
 '100200150182',
 '100240060071',
 '110070080082',
 '160050070021',
 '160050070062',
 '160050070082',
 '160050070111',
 '160050070121',
 '160050070141',
 '160050070191',
 '170010130032',
 '170040040092',
 '180130080042',
 '180140070032',
 '180210140022',
 '180220170022',
 '180330060042',
 '190210020011',
 '190210020041',
 '190570030042',
 '210090110142',
 '220060120031',
 '220060120042',
 '260370090212',
 '270020390031',
 '270020390072',
 '350020190032']


klosoter avatar klosoter commented on August 14, 2024

It is so slow because it tries to find the @Morphid in the mappings XML for every reference. In Python it is much faster (less than a minute), once you create a dictionary with morphIds as keys.
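The dictionary approach described above can be sketched roughly as follows. This is only an illustration, not the script actually used: the name `OLD_TO_NEW`, the function `remap_ids`, and the assumption that every id is a run of exactly 12 digits (as in the ids quoted in this thread) are mine. The two sample pairs are shifts reported elsewhere in this thread; a real run would load the full mapping from the mapping XML once.

```python
import re

# Hypothetical in-memory mapping from old morphIds to OSHB morphIds.
# In practice this would be built once from the mapping XML.
OLD_TO_NEW = {
    "180130080042": "180130080043",
    "180140070032": "180140070033",
}

# Assumption: every id is a run of exactly 12 digits.
ID_PATTERN = re.compile(r"\d{12}")


def remap_ids(value, mapping):
    """Replace each id in an attribute value; collect ids missing from the mapping."""
    missing = []

    def swap(match):
        old_id = match.group(0)
        if old_id not in mapping:
            missing.append(old_id)
            return old_id  # leave unmapped ids untouched for manual review
        return mapping[old_id]

    return ID_PATTERN.sub(swap, value), missing


new_value, missing = remap_ids("180130080042 010280020111", OLD_TO_NEW)
print(new_value)  # 180130080043 010280020111
print(missing)    # ['010280020111']
```

Each lookup is a constant-time dictionary access, so the cost no longer grows with the size of the mapping for every reference. Because only the digit runs are replaced, separators and slot labels in the attribute value are left as they are.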

Just a general question: is there any preference of XQuery over Python? For me personally, the more complex stuff is much easier to do in Python, and possibly much faster too.


rkjtan avatar rkjtan commented on August 14, 2024

All 54 cases above (except for 170040040092, which is in a verse where we have a known typo mentioned in Issue #3) occur in verses where we've had to adjust the number of words for one of 3 reasons: (1) a 3-part compound analysis in OSHB, usually with a directional h (a pronominal suffix in at least 1 case) in between that is analyzed separately; (2) implied article adjustments (adding or removing an implied article); or (3) an additional Qere that exists in OSHB. They basically correspond to the sentences mentioned in Issue #1 (plus a few others where we previously adjusted the presence or absence of the implied article). I don't have time this week to look at these (occupied the whole week with a conference), but I suspect the problem arises from the numbering adjustments we've made to these specific verses.


klosoter avatar klosoter commented on August 14, 2024

Using the new mapping, only the following morphIds that occur in Frame, Ref, or SubjRef attributes do not appear as a morphId in the Full Trees:

180130080043
180140070033
180210140023
180220170023
180330060043
190570030043
210090110143
260370090213
350020190033

Also, there are two other less consistent cases:

  1. Ref and SubjRef usually separate multiple ids by a whitespace " ", except for a few cases where ";" is used, often alongside " ". Example: {020070110042;020070110043 020070110054}
  2. Sometimes, the attributes are not empty, but do not contain morphIds, such as {A0:}

Do the differences in 1 have a meaning? And what do we do with 2?


rkjtan avatar rkjtan commented on August 14, 2024

All the consistent cases named above are correct & are explained as follows:
180130080043 involves reading an implied article, which shifted the noun "God" from 180130080042 to 180130080043.
180140070033 involves reading an implied article, which shifted the noun "tree" from 180140070032 to 180140070033.
180210140023 involves reading an implied article, which shifted the noun "God" from 180210140022 to 180210140023.
180220170023 involves reading an implied article, which shifted the noun "God" from 180220170022 to 180220170023.
180330060043 involves reading an implied article, which shifted the noun "God" from 180330060042 to 180330060043.
190570030043 involves reading an implied article, which shifted the noun "God" from 190570030042 to 190570030043.
210090110143 involves reading an implied article, which shifted the substantival adjective "wise" from 210090110142 to 210090110143.
260370090213 involves reading an implied article, which shifted the substantival participle "ones slain" from 260370090212 to 260370090213.
350020190033 involves reading an implied article, which shifted the noun "wood" from 350020190032 to 350020190033.


rkjtan avatar rkjtan commented on August 14, 2024

I remember that white space and ; were both considered for separating Ids at one point and that we ended up going one way rather than the other. So, I believe the inconsistency with using ; sometimes is just an inconsistency that slipped through and that there is no difference in meaning with just using whitespace. On 2., can you provide me a list of the places involved where the attributes are not empty, but do not contain morphIds? I'll need to double-check these--my initial thought is that maybe some way that we had used to delete morphIds accidentally didn't delete the non-Id parts of the entries. However, I need to check to be sure.


klosoter avatar klosoter commented on August 14, 2024

Just to clarify, for Frames, the AA, A0, and A1 id slots are separated by white space, while the ids are separated by a semicolon. For SubjRef and Ref, the ids are separated by a white space, except for a few cases where it is separated by a semicolon.

Interestingly, in those cases, the ids that are separated by a semicolon are always consecutive numbers (so perhaps they are meant to form one unit). I've attached a list of these cases.
semicolon_ref.txt.
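
To make the two conventions concrete, here is a minimal parsing sketch. It assumes that a Frame value looks like `A0:id;id A1:id` (inferred from the `{A0:}` example and the description above) and that Ref and SubjRef values are plain id lists. The function names and the Frame ids in the usage lines are made up for illustration.

```python
import re


def parse_ref(value):
    """Split a Ref or SubjRef value on whitespace and on stray semicolons."""
    return [part for part in re.split(r"[\s;]+", value.strip()) if part]


def parse_frame(value):
    """Parse a Frame value into {slot: [ids]}; a bare 'A0:' gives an empty list."""
    slots = {}
    for chunk in value.split():  # slots are separated by whitespace
        label, _, ids = chunk.partition(":")
        slots[label] = [i for i in ids.split(";") if i]  # ids by semicolon
    return slots


print(parse_ref("020070110042;020070110043 020070110054"))
# ['020070110042', '020070110043', '020070110054']
print(parse_frame("A0:"))
# {'A0': []}
print(parse_frame("A0:010010010011;010010010012 A1:010010010031"))
# {'A0': ['010010010011', '010010010012'], 'A1': ['010010010031']}
```

A slot that parses to an empty list corresponds to the "not empty but no morphIds" case, which makes those attributes easy to list.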


klosoter avatar klosoter commented on August 14, 2024

Here is a list of ids (this occurs only in Frame), where the attributes are not empty but do not contain ids.


rkjtan avatar rkjtan commented on August 14, 2024

Thank you for the useful clarification about the distinction between Frames and SubjRef & Ref. I just checked all cases in semicolon_ref.txt and found a good reason why every case where the ids are separated by a semicolon has consecutive numbers. It appears that the consecutive number after the ; is the correct id post-mapping, whereas the id before the ; is the old id. In every case, an added implied article changed the id to the next consecutive number from the old id. So, the course of action to take is probably to delete the number before the ; along with the ; and keep only the consecutive number after the ;.
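
A minimal sketch of that cleanup, assuming 12-digit ids and that it is applied to Ref and SubjRef values only (Frame values use ; as a legitimate separator, so they should not go through this). The function name is hypothetical, and pairs that are not consecutive are deliberately left alone for manual review.

```python
import re

# Assumption: an "old;new" pair is two 12-digit ids joined by a semicolon.
PAIR = re.compile(r"(\d{12});(\d{12})")


def drop_superseded_ids(value):
    """In a Ref/SubjRef value, replace 'old;new' with 'new' when new == old + 1."""

    def keep_new(match):
        old_id, new_id = match.group(1), match.group(2)
        if int(new_id) == int(old_id) + 1:
            return new_id
        return match.group(0)  # not consecutive: leave unchanged

    return PAIR.sub(keep_new, value)


print(drop_superseded_ids("020070110042;020070110043 020070110054"))
# 020070110043 020070110054
```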


rkjtan avatar rkjtan commented on August 14, 2024

On the issue of Frame attributes that are not empty but do not contain ids, there are too many cases to check every single one right now. However, my spot checking indicates that these are all cases where there is no agent that could be referenced in context. My current suspicion is that every active voice verb automatically got "A0:" added to it at some point. However, since not every verb has an agent that can be referenced in context (e.g., generic reference), the ones with no morphIds are the ones with no agent to refer to. You can double-check whether it is true that every active voice verb automatically has at least "A0:" when there is no other value. (For passive voice verbs, "A0:" would not be expected, so I don't think passive voice verbs would have "A0:" automatically added. If my hypothesis is true, for active voice verbs, I would expect "A0:" to still be added, even if there is an A1 with a morphId.)
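
One way to run that double-check could look like the sketch below. Everything about the input shape is an assumption: the records, the keys `id`, `voice`, and `frame`, and the sample values are invented for illustration and would need to be filled from the real trees.

```python
def active_verbs_without_a0(verbs):
    """Return ids of active-voice verbs whose Frame has no 'A0:' slot at all."""
    return [
        v["id"]
        for v in verbs
        if v["voice"] == "active"
        and not any(chunk.startswith("A0:") for chunk in v["frame"].split())
    ]


# Hypothetical records; real data would come from the verb nodes in the trees.
sample = [
    {"id": "v1", "voice": "active", "frame": "A0:"},
    {"id": "v2", "voice": "active", "frame": "A0: A1:010010010031"},
    {"id": "v3", "voice": "passive", "frame": "A1:010010010031"},
    {"id": "v4", "voice": "active", "frame": "A1:010010010031"},
]
print(active_verbs_without_a0(sample))  # ['v4']
```

An empty result over the real data would be consistent with the hypothesis; any ids returned are counterexamples worth looking at individually.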


jonathanrobie avatar jonathanrobie commented on August 14, 2024

