Comments (22)
Hi Sina,
Can you copy-paste or attach your Bazel log here that shows the error?
from nisaba.
The reason I am asking for the log snippet with the error is that we'd prefer to fix the problem if it's a genuine bug or perhaps stale documentation.
from nisaba.
Thanks for your prompt reply.
Here is my stdout:
(base) sina@MacBook-Pro nisaba-main % bazel build -c opt ...
INFO: Analyzed 1807 targets (24 packages loaded, 3129 targets configured).
INFO: Found 1807 targets...
[1 / 99] [Prepa] BazelWorkspaceStatusAction stable-status.txt
ERROR: /Users/sina/Desktop/normal/nisaba-main/nisaba/scripts/brahmic/data/Knda/BUILD.bazel:49:21: Executing genrule //nisaba/scripts/brahmic/data/Knda:generate_nfc failed: (Exit 1): bash failed: error executing command /bin/bash -c ... (remaining 1 argument skipped)
Use --sandbox_debug to see verbose messages from the sandbox and retain the sandbox build root for debugging
Traceback (most recent call last):
File "/private/var/tmp/_bazel_sina/af7a15e6977cda310c80ddcdc4b66a45/sandbox/darwin-sandbox/1435/execroot/com_google_nisaba/bazel-out/darwin_arm64-opt-exec-2B5CBBC6/bin/nisaba/scripts/utils/unicode_strings_to_tsv.runfiles/com_google_nisaba/nisaba/scripts/utils/unicode_strings_to_tsv.py", line 42, in <module>
from nisaba.scripts.utils import unicode_strings_util as lib
File "/private/var/tmp/_bazel_sina/af7a15e6977cda310c80ddcdc4b66a45/sandbox/darwin-sandbox/1435/execroot/com_google_nisaba/bazel-out/darwin_arm64-opt-exec-2B5CBBC6/bin/nisaba/scripts/utils/unicode_strings_to_tsv.runfiles/com_google_nisaba/nisaba/scripts/utils/unicode_strings_util.py", line 22, in <module>
from nisaba.scripts.utils import proto
File "/private/var/tmp/_bazel_sina/af7a15e6977cda310c80ddcdc4b66a45/sandbox/darwin-sandbox/1435/execroot/com_google_nisaba/bazel-out/darwin_arm64-opt-exec-2B5CBBC6/bin/nisaba/scripts/utils/unicode_strings_to_tsv.runfiles/com_google_nisaba/nisaba/scripts/utils/proto.py", line 20, in <module>
import nisaba.scripts.utils.file as uf
File "/private/var/tmp/_bazel_sina/af7a15e6977cda310c80ddcdc4b66a45/sandbox/darwin-sandbox/1435/execroot/com_google_nisaba/bazel-out/darwin_arm64-opt-exec-2B5CBBC6/bin/nisaba/scripts/utils/unicode_strings_to_tsv.runfiles/com_google_nisaba/nisaba/scripts/utils/file.py", line 21, in <module>
import pynini
File "/private/var/tmp/_bazel_sina/af7a15e6977cda310c80ddcdc4b66a45/sandbox/darwin-sandbox/1435/execroot/com_google_nisaba/bazel-out/darwin_arm64-opt-exec-2B5CBBC6/bin/nisaba/scripts/utils/unicode_strings_to_tsv.runfiles/org_opengrm_pynini/pynini/__init__.py", line 1, in <module>
from _pynini import *
ImportError: dlopen(/private/var/tmp/_bazel_sina/af7a15e6977cda310c80ddcdc4b66a45/sandbox/darwin-sandbox/1435/execroot/com_google_nisaba/bazel-out/darwin_arm64-opt-exec-2B5CBBC6/bin/nisaba/scripts/utils/unicode_strings_to_tsv.runfiles/org_opengrm_pynini/extensions/_pynini.so, 0x0002): tried: '/private/var/tmp/_bazel_sina/af7a15e6977cda310c80ddcdc4b66a45/sandbox/darwin-sandbox/1435/execroot/com_google_nisaba/bazel-out/darwin_arm64-opt-exec-2B5CBBC6/bin/nisaba/scripts/utils/unicode_strings_to_tsv.runfiles/org_opengrm_pynini/extensions/_pynini.so' (mach-o file, but is an incompatible architecture (have (arm64), need (x86_64))), '/private/var/tmp/_bazel_sina/af7a15e6977cda310c80ddcdc4b66a45/execroot/com_google_nisaba/bazel-out/darwin_arm64-opt-exec-2B5CBBC6/bin/external/org_opengrm_pynini/extensions/_pynini.so' (mach-o file, but is an incompatible architecture (have (arm64), need (x86_64)))
INFO: Elapsed time: 8.551s, Critical Path: 5.12s
INFO: 263 processes: 20 internal, 243 darwin-sandbox.
FAILED: Build did NOT complete successfully
from nisaba.
Thanks for the log! I'll investigate - it definitely looks like a Pynini issue on newer macOS ARM architectures. Possibly related to this and this.
Unfortunately it will be difficult to reproduce on my side - I have older macOS hardware.
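One way to read the dlopen message in the log above, offered as a sketch rather than a diagnosis: the loader reports the architecture the library has and the architecture the running process needs. The helper below extracts both and prints what the current Python interpreter reports for itself. The name parse_arch_mismatch is hypothetical (it is not part of Bazel, Pynini, or Nisaba), and the pattern assumes the "have (...), need (...)" wording seen in this log.

```python
import platform
import re

# Matches the "have (arm64), need (x86_64)" fragment of the dlopen error above.
# Assumption: the wording follows the log in this thread; other macOS
# versions may phrase it differently.
_MISMATCH_RE = re.compile(r"have \((\w+)\), need \((\w+)\)")


def parse_arch_mismatch(message):
    """Returns (library_arch, process_arch), or None if the fragment is absent."""
    match = _MISMATCH_RE.search(message)
    return match.groups() if match else None


if __name__ == "__main__":
    excerpt = ("mach-o file, but is an incompatible architecture "
               "(have (arm64), need (x86_64))")
    print(parse_arch_mismatch(excerpt))
    # Architecture that the running interpreter reports for itself.
    print(platform.machine())
```

If the interpreter reports x86_64 on Apple silicon hardware, it may be running under Rosetta, which would be consistent with an arm64 _pynini.so being rejected.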
from nisaba.
Thanks.
Do you think you can share the .far files then? That'll solve the problem.
from nisaba.
I think this is possible, but these are binary files produced on Intel chips, so they may not be compatible with your architecture. I'll look into this.
In the meantime, may I ask you to try this simple experiment in your spare time - perhaps Bazelisk will work instead of Bazel:
# I think the following will download the universal binary compatible with ARM as well.
BAZEL=bazelisk-darwin
curl -LO "https://github.com/bazelbuild/bazelisk/releases/latest/download/$BAZEL"
chmod +x ${BAZEL}
# Clean the cache and retry the build with Bazelisk.
./${BAZEL} clean --expunge
./${BAZEL} build -c opt nisaba/scripts/abjad_alphabet/...
from nisaba.
Unfortunately, it leads to the same error! Just to confirm, I have Bazelisk installed too.
I think if the .far files were there, I could simply import them as you show in your example, using pynini.Far().
I appreciate it, @agutkin!
from nisaba.
Please try the abjad_alphabet_fars_x86_64.tar.gz tarball. It contains the FARs for the Perso-Arabic writing systems, which I pre-compiled on the x86_64 platform and released as the v0.3.0-beta pre-release.
from nisaba.
Thanks a million. The tarball file solves the problem.
I am having another issue with ApplyOnText in the following script. I'll let you know if I can solve it and will then ask to close this issue :-)
import pynini
from nisaba.scripts.utils import far
with pynini.Far('/Users/sina/Desktop/normal/x86_64/abjad_alphabet/reversible_roman.far') as far:
    latin_from_arab = far['FROM_ARAB']
print(ApplyOnText('گورنمنٹ'))
from nisaba.
Excellent. Glad it helped. Please let me know how your investigation went.
from nisaba.
Sorry to bring up this issue again, @agutkin, but I am still not able to run the previous piece of code. I get this error: ModuleNotFoundError: No module named 'nisaba.scripts'. I tried running the script from different directories, but I still cannot solve it.
I appreciate your help.
from nisaba.
Since you're trying to run outside of Bazel, I presume you've successfully installed Pynini. Because you have access to the precompiled FARs, you can simply copy-paste the implementation of ApplyOnText from far.py in Nisaba into your example. That way you'd be using Pynini only, with no extra Nisaba code.
from nisaba.
Wonderful. Thanks a million for all your help. I could finally get what I needed using this:
import pynini
from nisaba.scripts.utils import far
far_obj = far.Far('/Users/sina/Desktop/normal/x86_64/abjad_alphabet/reversible_roman.far')
# Use a distinct name for the reader so it does not shadow the imported `far` module.
with pynini.Far('/Users/sina/Desktop/normal/x86_64/abjad_alphabet/reversible_roman.far') as far_reader:
    latin_from_arab = far_reader['FROM_ARAB']
far_obj_wrapper = far_obj.FstWrapper(latin_from_arab)
print(far_obj_wrapper.ApplyOnText("چونی"))
I commented out the lines 'import nisaba.scripts.utils.file as uf' and 'from rules_python.python.runfiles import runfiles' in far.py and file.py.
Now, one last question remains: should I import the different FST files, such as visual_norm, separately to get different normalizations? What is normalizer used for?
I am running some experiments on the normalization of the Perso-Arabic script and am trying to make sure that I evaluate Nisaba on my dataset well (as you mention here).
Sorry for making this issue so lengthy. I really like this project! 🙂
from nisaba.
Cool! Glad to hear we got a working setup outside of Bazel. The Normalizer (the implementation is at the bottom of the file) is just shorthand for running visual normalization. See the following method:
def ApplyOnWord(self, word: str) -> str:
  """Normalize a entire word."""
  return self._visual_norm.ApplyOnText(self._nfc.ApplyOnText(word))
In general, if you look at the actual implementations of the grammars, the input string x should be sequentially composed with a sequence of transducers:
- For the full visual (i.e., visually-invariant) normalization pipeline: The composition is P_v = ((x ∘ NFC) ∘ V[L]), where V[L] is the visual normalization FST for language L and NFC is the language-agnostic transducer. The V[L] transducer includes some common non-NFC transformations and presentation form handling (NFKC) as well as position-dependent and position-independent language-specific normalizations.
- For the reading normalization pipeline (that visually modifies the output): The composition is P_r = P_v ∘ R[L], where R[L] is the reading normalization FST. Please note, the reading normalization transducer is separate from the visual normalization one, to keep the size of the resulting FSTs smaller.
Which one of the two pipelines to use is up to you. For some applications, like ASR, visual invariance may be required, for others the reading normalization pipeline may be more suitable.
Hope this helps. Attaching our software demo short paper to be presented at WANLP 2022, which also has a small Python example -- wanlp2022.pdf
from nisaba.
Also, please take a look at the normalization code we released with the long paper. There's some irrelevant detail, but the following abridged snippet from the _process_corpus function should be important for you when you apply the normalization:
for tok in filter(None, input_line.split()):
  output_tok = tok  # Default: keep the incoming token unchanged.
  try:
    # `tok` is a string; pynini.accep(tok) turns it into a string FST.
    output_tok_fst = (pynini.accep(tok) @ nfc_fst) @ visual_fst  # Or use ApplyOnText that you ported.
    if output_tok_fst.start() != pynini.NO_STATE_ID:
      # Success! The result is non-empty! Convert the FST back to string.
      output_tok = output_tok_fst.string()
      if not output_tok:
        # This most likely should not happen, but guard against unexpected empty composition here as well.
        # Keep the token as is.
        output_tok = tok
  except (pynini.FstStringCompilationError, pynini.FstOpError):
    # Error! Keep the incoming token unchanged.
    output_tok = tok
from nisaba.
Presumably this has been fixed.
from nisaba.
@agutkin I have a similar issue. I am trying to use wellformed.far from the FARs that you had initially linked. My code:
wf_far_obj = Far("nisaba/x86_64/brahmic/wellformed.far")
with pynini.Far(
    "nisaba/x86_64/brahmic/wellformed.far"
) as far:
    wellformed = far["GUJR"]
wf_wrapper = wf_far_obj.FstWrapper(wellformed)
print(wf_wrapper.ApplyOnText("[ગુરી પડવાતી વચણા અમૃતનું વાંચન જે"))
This is the error I get:
ERROR: StringFstToOutputLabels: Invalid start state
Traceback (most recent call last):
File "example.py", line 54, in ApplyOnText
return pynini.shortestpath(pynini.escape(text) @ self._fst).string()
File "extensions/_pynini.pyx", line 462, in _pynini.Fst.string
File "extensions/_pynini.pyx", line 507, in _pynini.Fst.string
_pywrapfst.FstOpError: Operation failed
from nisaba.
The wellformed FST is an acceptor rather than a transducer. Please try using AcceptText rather than ApplyOnText and post the results here. The AcceptText API takes a string and returns a bool: whether the input string is accepted (well-formed) or rejected by the automaton.
from nisaba.
@agutkin I see, it would be nice to have clearer documentation on this. How, then, can I turn text that is not well-formed into well-formed text?
I further wanted to ask you, is the following code right?
far_obj = Far("nisaba/x86_64/brahmic/visual_norm.far")
with pynini.Far(
    "nisaba/x86_64/brahmic/visual_norm.far"
) as far:
    visual_norm = far["GUJR"]
vis_norm_wrapper = far_obj.FstWrapper(visual_norm)
texts = ["બદલવું પડશેસ્વયમનું પરિવર્તન અનિવાર્એ છે મોક્સની"]
for line in texts:
    vn_line = vis_norm_wrapper.ApplyOnText(line)
    print(vn_line)
The reason I ask is that the output text seems identical to the input, although there are errors in the text I entered.
from nisaba.
Interesting - visual normalization is supposed to fix aksara whose visual appearance would otherwise be the same. The reading normalization transducer fixes some additional issues but doesn't guarantee visual invariance. Also, neither normalization method is a substitute for a spell checker -- these transducers solve a different problem.
@cibu, do you have anything to add to this?
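Because these normalizations target strings that look alike on screen, comparing printed output may hide a change. A small standard-library helper such as the one below (the name codepoints is hypothetical) lists the code points of a string so that input and output can be compared exactly. The Latin example uses NFC from unicodedata purely as an illustration of visually identical strings; it is not Nisaba's visual normalization.

```python
import unicodedata


def codepoints(text):
    """Lists each character as 'U+XXXX NAME' so invisible differences show up."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in text]


if __name__ == "__main__":
    decomposed = "e\u0301"  # 'e' followed by a combining acute accent
    composed = unicodedata.normalize("NFC", decomposed)
    # Both strings render alike, but their code point sequences differ.
    print(codepoints(decomposed))
    print(codepoints(composed))
    print(decomposed == composed)
```

Running the same helper on a line before and after ApplyOnText would show whether the transducer changed any code points at all.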
from nisaba.
@agutkin Understandable. If you're aware of any, do you have recommendations for Gujarati grammar and spell checkers?
I tried reading norm as well but that's giving me a segmentation fault for GUJR.
from nisaba.
@RohitMidha23, which Gujarati string results in a segmentation fault in the reading normalization transducer?
On the issue of Gujarati spell checkers - I don't know, to be honest.
from nisaba.