Comments (16)
Be sure to reference Issue #3, where I noted (1) 2 known typos for Ref & SubjRef values; (2) the issue that the use of morphId actually obscures the fact that, for participant reference and semantic roles, I was referring to the head of the phrase & not just that one word when there is more than one word in the phrase. So, the actual referent is the highest applicable noun-type phrase node in the tree where the word with the identified morphId is the head. I'm not sure at this point whether it is best to pull these attributes at the word level or the lowest Node level.
from macula-hebrew.
So, if I understand correctly:
- the ids in the attributes Frame, SubjRef, and Ref need to be replaced with our own bcvw system.
- we still need to decide what to do with referenced phrases consisting of more than one word
I'll start working on the first.
from macula-hebrew.
To clarify, the ids in the attributes Frame, SubjRef, and Ref are already in our bcvw system. In the majority of cases, the ids that applied to the skeleton trees apply equally to the OSHB trees. However, there are some places where the ids have shifted due to differences in the text segmentation. So, you would want to use the mapping to figure out what kind of shifts have happened & make sure the ids for Frame, SubjRef, & Ref still refer to the words they are meant to point to.
from macula-hebrew.
Thank you for the clarification. I had assumed this already. Indeed, the numbering often already corresponds to the OSHB trees. I've written an XQuery that replaces all Frame, SubjRef, and Ref ids with the corresponding OSHB ones. It works when I apply it to a small book, but it takes a really long time.
Also, not all references are in the mapping XML. Overall, there are 54 ids (whether in Frame, SubjRef, or Ref) that do not occur as MorphIds in the mapping XML. Do we need to do something about this?
from macula-hebrew.
Could you please attach a list of those 54 ids to this issue so we can look at them and decide what to do?
How long does the query take? We only have to do this once, if it works properly.
from macula-hebrew.
Here is the list:
['010280020111',
'010280050021',
'010280050041',
'010280060021',
'010280060041',
'010280060051',
'010280060071',
'010280060082',
'010280060092',
'010280060141',
'010280060153',
'010280060162',
'010280060211',
'010280060221',
'010280070021',
'010460010021',
'010460010032',
'010460010091',
'020070110042',
'030270160022',
'030270160051',
'030270160081',
'060190130081',
'060190130102',
'100200150061',
'100200150122',
'100200150151',
'100200150182',
'100240060071',
'110070080082',
'160050070021',
'160050070062',
'160050070082',
'160050070111',
'160050070121',
'160050070141',
'160050070191',
'170010130032',
'170040040092',
'180130080042',
'180140070032',
'180210140022',
'180220170022',
'180330060042',
'190210020011',
'190210020041',
'190570030042',
'210090110142',
'220060120031',
'220060120042',
'260370090212',
'270020390031',
'270020390072',
'350020190032']
from macula-hebrew.
It is so slow because it tries to find the @Morphid in the mappings XML for every reference. In Python it is much faster (less than a minute), once you create a dictionary with morphIds as keys.
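The dictionary approach described above can be sketched roughly as follows. This is a minimal illustration, not the script actually used: the `(old_id, new_id)` pairs, the function names, and the assumption that every id is a run of exactly 12 digits are hypothetical stand-ins for whatever the mapping XML actually provides.

```python
import re

# Assumption: every id is a standalone run of exactly 12 digits,
# as in the list of ids posted earlier in this thread.
ID_PATTERN = re.compile(r"(?<!\d)\d{12}(?!\d)")


def build_lookup(pairs):
    """Build a dict from old morphId to new morphId for constant-time lookups."""
    return {old: new for old, new in pairs}


def remap_ids(value, lookup):
    """Replace every id in an attribute value; collect ids missing from the lookup."""
    missing = []

    def substitute(match):
        old = match.group(0)
        if old not in lookup:
            missing.append(old)
            return old  # leave unmapped ids untouched for manual review
        return lookup[old]

    return ID_PATTERN.sub(substitute, value), missing


# Hypothetical mapping pair, for illustration only.
lookup = build_lookup([("180130080042", "180130080043")])
new_value, missing = remap_ids("180130080042 170040040092", lookup)
```

Because the lookup table is built once, each reference costs a single dictionary access instead of a scan of the mapping file, which is presumably where the speed difference comes from. Returning the unmapped ids alongside the rewritten value also makes it easy to produce a list like the one above.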
Just a general question: is there any preference of XQuery over Python? For me personally, the more complex stuff is much easier to do in Python, and possibly much faster too.
from macula-hebrew.
All 54 cases above (except for 170040040092, which is in a verse where we have a known typo mentioned in Issue #3) occur in verses where we've had to adjust the number of words for one of 3 reasons: (1) a 3-part compound analysis in OSHB, where the element in between, usually directional h (a pronominal suffix in at least 1 case), is analyzed separately; (2) implied article adjustments (adding or removing an implied article); or (3) an additional Qere that exists in OSHB. They basically correspond to the sentences mentioned in Issue #1 (plus a few others where we previously adjusted the presence or absence of the implied article). I don't have time this week to look at these (I'm occupied the whole week with a conference), but I suspect the problem arises from the numbering adjustments we've made to these specific verses.
from macula-hebrew.
Using the new mapping, only the following morphIds that occur in Frame, Ref, or SubjRef attributes do not appear as morphId in the Full Trees:
180130080043
180140070033
180210140023
180220170023
180330060043
190570030043
210090110143
260370090213
350020190033
Also, there are two other less consistent cases:
1. Ref and SubjRef usually separate multiple ids by a whitespace " ", except for a few cases where ";" is used, often alongside " ". Example:
{020070110042;020070110043 020070110054}
2. Sometimes, the attributes are not empty, but do not contain morphIds, such as {A0:}
Do the differences in 1 have a meaning? And what do we do with 2?
from macula-hebrew.
All the consistent cases named above are correct & are explained as follows:
180130080043 involves reading an implied article, which shifted the noun "God" from 180130080042 to 180130080043.
180140070033 involves reading an implied article, which shifted the noun "tree" from 180140070032 to 180140070033.
180210140023 involves reading an implied article, which shifted the noun "God" from 180210140022 to 180210140023.
180220170023 involves reading an implied article, which shifted the noun "God" from 180220170022 to 180220170023.
180330060043 involves reading an implied article, which shifted the noun "God" from 180330060042 to 180330060043.
190570030043 involves reading an implied article, which shifted the noun "God" from 190570030042 to 190570030043.
210090110143 involves reading an implied article, which shifted the substantival adjective "wise" from 210090110142 to 210090110143.
260370090213 involves reading an implied article, which shifted the substantival participle "ones slain" from 260370090212 to 260370090213.
350020190033 involves reading an implied article, which shifted the noun "wood" from 350020190032 to 350020190033.
from macula-hebrew.
I remember that white space and ; were both considered for separating Ids at one point and that we ended up going one way rather than the other. So, I believe the inconsistency with using ; sometimes is just an inconsistency that slipped through and that there is no difference in meaning with just using whitespace. On 2., can you provide me a list of the places involved where the attributes are not empty, but do not contain morphIds? I'll need to double-check these--my initial thought is that maybe some way that we had used to delete morphIds accidentally didn't delete the non-Id parts of the entries. However, I need to check to be sure.
from macula-hebrew.
Just to clarify, for Frames, the AA, A0, and A1 id slots are separated by white space, while the ids within a slot are separated by a semicolon. For SubjRef and Ref, the ids are separated by white space, except for a few cases where they are separated by a semicolon.
Interestingly, in those cases, the ids that are separated by a semicolon, are always consecutive numbers (so, perhaps they are meant to form one unit). I've attached a list of these cases.
semicolon_ref.txt.
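The two separator conventions described above could be parsed along these lines. This is only a sketch under stated assumptions: it assumes a Frame slot is written as `LABEL:id;id;...` (consistent with the `A0:` example earlier in the thread), that the surrounding braces are not part of the stored value, and the function names and the ids in the Frame example are hypothetical.

```python
def parse_frame(value):
    """Split a Frame value into {slot_label: [ids]}.

    Assumed layout: slots separated by whitespace, each written as
    'LABEL:id;id;...' (for example 'A0:' when the slot has no ids).
    """
    slots = {}
    for chunk in value.split():
        label, _, ids = chunk.partition(":")
        slots[label] = [i for i in ids.split(";") if i]
    return slots


def parse_refs(value):
    """Split a Ref or SubjRef value into ids, accepting whitespace or ';'."""
    return value.replace(";", " ").split()


# The Frame ids below are made up for illustration.
frame = parse_frame("A0:010010010011;010010010021 A1:")
refs = parse_refs("020070110042;020070110043 020070110054")
```

Note that `parse_refs` treats ";" as just another separator, so it does not preserve any grouping the semicolon might have been intended to express.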
from macula-hebrew.
Here is a list of ids (this occurs only in Frame), where the attributes are not empty but do not contain ids.
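A check of this kind can be expressed in a few lines. The sketch below is an assumption about how such a list might be produced, not the query actually used: it treats any standalone run of 12 digits as an id, and the function name is hypothetical.

```python
import re

# Assumption: an id is a standalone run of exactly 12 digits.
ID_PATTERN = re.compile(r"(?<!\d)\d{12}(?!\d)")


def is_nonempty_without_ids(value):
    """True when an attribute value has content but contains no id."""
    stripped = value.strip()
    return bool(stripped) and ID_PATTERN.search(stripped) is None
```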
from macula-hebrew.
Thank you for the useful clarification about the distinction between Frames and SubjRef & Ref. I just checked all cases in semicolon_ref.txt. I found a good reason why every case where the ids are separated by a semicolon always has consecutive numbers. It appears that the consecutive number after the ; is the correct id post-mapping, whereas the id before the ; is the old id. In every case, an added implied article changed the id to the next consecutive number from the old id. So, the course of action to take is probably to delete the number before the ; along with the ; and keep only the consecutive number after the ;.
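The cleanup proposed here might look like the following for Ref and SubjRef values. It is a hedged sketch rather than an agreed implementation: the function name is hypothetical, and it only collapses a pair when the second id is exactly one greater than the first, leaving anything else untouched for manual review. It should not be applied to Frame values, where the semicolon is the regular id separator.

```python
def drop_superseded_ids(value):
    """In a Ref/SubjRef value, keep only the id after each ';' pair."""
    kept = []
    for token in value.split():
        parts = token.split(";")
        is_consecutive_pair = (
            len(parts) == 2
            and parts[0].isdigit()
            and parts[1].isdigit()
            and int(parts[1]) == int(parts[0]) + 1
        )
        if is_consecutive_pair:
            kept.append(parts[1])  # the id after ';' is the post-mapping id
        else:
            kept.append(token)  # leave anything unexpected unchanged
    return " ".join(kept)


cleaned = drop_superseded_ids("020070110042;020070110043 020070110054")
```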
from macula-hebrew.
On the issue of Frame attributes that are not empty but do not contain ids, there are too many cases to check every single one right now. However, my spot checking indicates that these are all cases where there is no agent that could be referenced in context. My current suspicion is that every active voice verb automatically got "A0:" added to it at some point. However, since not every verb has an agent that can be referenced in context (e.g., generic reference), the ones with no morphIds are the ones with no agent to refer to. You can double-check whether it is true that every active voice verb automatically has at least "A0:" when there is no other value. (For passive voice verbs, "A0:" would not be expected, so I don't think passive voice verbs would have "A0:" automatically added. If my hypothesis is true, for active voice verbs, I would expect "A0:" to still be added, even if there is an A1 with a morphId.)
from macula-hebrew.