Comments (22)

agutkin commented on June 13, 2024

Hi Sina,

Can you copy-paste or attach your Bazel log here that shows the error?

from nisaba.

agutkin commented on June 13, 2024

I'm asking for the log snippet with the error because we'd prefer to fix the problem if it's a genuine bug or stale documentation.

sinaahmadi commented on June 13, 2024

Thanks for your prompt reply.

Here is my stdout:


(base) sina@MacBook-Pro nisaba-main % bazel build -c opt ...                
INFO: Analyzed 1807 targets (24 packages loaded, 3129 targets configured).
INFO: Found 1807 targets...
[1 / 99] [Prepa] BazelWorkspaceStatusAction stable-status.txt
ERROR: /Users/sina/Desktop/normal/nisaba-main/nisaba/scripts/brahmic/data/Knda/BUILD.bazel:49:21: Executing genrule //nisaba/scripts/brahmic/data/Knda:generate_nfc failed: (Exit 1): bash failed: error executing command /bin/bash -c ... (remaining 1 argument skipped)

Use --sandbox_debug to see verbose messages from the sandbox and retain the sandbox build root for debugging
Traceback (most recent call last):
  File "/private/var/tmp/_bazel_sina/af7a15e6977cda310c80ddcdc4b66a45/sandbox/darwin-sandbox/1435/execroot/com_google_nisaba/bazel-out/darwin_arm64-opt-exec-2B5CBBC6/bin/nisaba/scripts/utils/unicode_strings_to_tsv.runfiles/com_google_nisaba/nisaba/scripts/utils/unicode_strings_to_tsv.py", line 42, in <module>
    from nisaba.scripts.utils import unicode_strings_util as lib
  File "/private/var/tmp/_bazel_sina/af7a15e6977cda310c80ddcdc4b66a45/sandbox/darwin-sandbox/1435/execroot/com_google_nisaba/bazel-out/darwin_arm64-opt-exec-2B5CBBC6/bin/nisaba/scripts/utils/unicode_strings_to_tsv.runfiles/com_google_nisaba/nisaba/scripts/utils/unicode_strings_util.py", line 22, in <module>
    from nisaba.scripts.utils import proto
  File "/private/var/tmp/_bazel_sina/af7a15e6977cda310c80ddcdc4b66a45/sandbox/darwin-sandbox/1435/execroot/com_google_nisaba/bazel-out/darwin_arm64-opt-exec-2B5CBBC6/bin/nisaba/scripts/utils/unicode_strings_to_tsv.runfiles/com_google_nisaba/nisaba/scripts/utils/proto.py", line 20, in <module>
    import nisaba.scripts.utils.file as uf
  File "/private/var/tmp/_bazel_sina/af7a15e6977cda310c80ddcdc4b66a45/sandbox/darwin-sandbox/1435/execroot/com_google_nisaba/bazel-out/darwin_arm64-opt-exec-2B5CBBC6/bin/nisaba/scripts/utils/unicode_strings_to_tsv.runfiles/com_google_nisaba/nisaba/scripts/utils/file.py", line 21, in <module>
    import pynini
  File "/private/var/tmp/_bazel_sina/af7a15e6977cda310c80ddcdc4b66a45/sandbox/darwin-sandbox/1435/execroot/com_google_nisaba/bazel-out/darwin_arm64-opt-exec-2B5CBBC6/bin/nisaba/scripts/utils/unicode_strings_to_tsv.runfiles/org_opengrm_pynini/pynini/__init__.py", line 1, in <module>
    from _pynini import *
ImportError: dlopen(/private/var/tmp/_bazel_sina/af7a15e6977cda310c80ddcdc4b66a45/sandbox/darwin-sandbox/1435/execroot/com_google_nisaba/bazel-out/darwin_arm64-opt-exec-2B5CBBC6/bin/nisaba/scripts/utils/unicode_strings_to_tsv.runfiles/org_opengrm_pynini/extensions/_pynini.so, 0x0002): tried: '/private/var/tmp/_bazel_sina/af7a15e6977cda310c80ddcdc4b66a45/sandbox/darwin-sandbox/1435/execroot/com_google_nisaba/bazel-out/darwin_arm64-opt-exec-2B5CBBC6/bin/nisaba/scripts/utils/unicode_strings_to_tsv.runfiles/org_opengrm_pynini/extensions/_pynini.so' (mach-o file, but is an incompatible architecture (have (arm64), need (x86_64))), '/private/var/tmp/_bazel_sina/af7a15e6977cda310c80ddcdc4b66a45/execroot/com_google_nisaba/bazel-out/darwin_arm64-opt-exec-2B5CBBC6/bin/external/org_opengrm_pynini/extensions/_pynini.so' (mach-o file, but is an incompatible architecture (have (arm64), need (x86_64)))
INFO: Elapsed time: 8.551s, Critical Path: 5.12s
INFO: 263 processes: 20 internal, 243 darwin-sandbox.
FAILED: Build did NOT complete successfully

agutkin commented on June 13, 2024

Thanks for the log! I'll investigate. It definitely looks like a Pynini issue on newer macOS ARM architectures, possibly related to this and this.

Unfortunately, it will be difficult to reproduce on my side because I only have older Mac hardware.
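
One way to narrow this down is to check which architecture the Python interpreter itself runs as. The dlopen message in the log ("have (arm64), need (x86_64)") may mean that _pynini.so was built for arm64 while the Python process loading it runs as x86_64, for example under Rosetta. The following is a minimal sketch using only the standard library; the function name describe_interpreter is illustrative and not part of Nisaba or Pynini:

```python
import platform
import struct
import sys


def describe_interpreter() -> dict:
    """Collect basic facts about the running Python interpreter."""
    return {
        # Path of the Python binary that is actually being executed.
        "executable": sys.executable,
        # Architecture reported for the current process, e.g. "arm64" or "x86_64".
        "machine": platform.machine(),
        # Pointer size of this interpreter build, in bits.
        "pointer_bits": struct.calcsize("P") * 8,
        # Human-readable platform string.
        "platform": platform.platform(),
    }


for key, value in describe_interpreter().items():
    print(f"{key}: {value}")
```

If "machine" does not match the architecture of the compiled extension, the interpreter and the build outputs target different architectures.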

sinaahmadi commented on June 13, 2024

Thanks.
Do you think you can share the .far files then? That'll solve the problem.

agutkin commented on June 13, 2024

I think this is possible, but these are binary files produced on Intel chips, so they may not be compatible with your architecture. I'll look into this.

In the meantime, may I ask you to try this simple experiment in your spare time - perhaps Bazelisk will work instead of Bazel:

# I think the following will download the universal binary compatible with ARM as well.
BAZEL=bazelisk-darwin
curl -LO "https://github.com/bazelbuild/bazelisk/releases/latest/download/$BAZEL"
chmod +x ${BAZEL}

# Clean the cache and retry the build with Bazelisk.
./${BAZEL} clean --expunge
./${BAZEL} build -c opt nisaba/scripts/abjad_alphabet/...

sinaahmadi commented on June 13, 2024

Unfortunately, it leads to the same error! Just to confirm, I have Bazelisk installed too.

I think if the .far files were there, I could simply import them as you show in your example, using pynini.Far().

I appreciate it, @agutkin!

agutkin commented on June 13, 2024

Please try the abjad_alphabet_fars_x86_64.tar.gz tarball. It contains the FARs for the Perso-Arabic writing systems, which I pre-compiled on an x86_64 platform and released as the v0.3.0-beta pre-release.

sinaahmadi commented on June 13, 2024

Thanks a million. The tarball file solves the problem.

I am having another issue with ApplyOnText in the following script. I'll let you know if I can solve it and will then ask to close this issue :-)

import pynini
from nisaba.scripts.utils import far
with pynini.Far('/Users/sina/Desktop/normal/x86_64/abjad_alphabet/reversible_roman.far') as far:
    latin_from_arab = far['FROM_ARAB']
    print(ApplyOnText('گورنمنٹ'))

agutkin commented on June 13, 2024

Excellent. Glad it helped. Please let me know how your investigation went.

sinaahmadi commented on June 13, 2024

Sorry to bring up this issue again, @agutkin, but I am still not able to run the previous piece of code. I get this error: ModuleNotFoundError: No module named 'nisaba.scripts'. I have tried running the script from different directories, but I still cannot solve it.

I appreciate your help.

agutkin commented on June 13, 2024

Since you're trying to run outside of Bazel, I presume you've successfully installed Pynini. Because you have access to the precompiled FARs, you can simply copy-paste the implementation of ApplyOnText from far.py in Nisaba into your example. That way the example uses only Pynini, with no extra Nisaba code.
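
A minimal Pynini-only sketch of that idea follows. It assumes ApplyOnText does what the traceback further down this thread shows, namely a shortest path over the composition of the escaped input with the FST. The names apply_on_text and load_fst are illustrative and not part of Nisaba:

```python
def apply_on_text(fst, text: str) -> str:
    """Rewrite `text` with the transducer `fst` and return the best output."""
    # Imported here so that this module still loads where Pynini is missing.
    import pynini

    # Escape brackets and backslashes, which Pynini treats specially, then
    # compose the input with the transducer and keep the shortest path.
    lattice = pynini.escape(text) @ fst
    return pynini.shortestpath(lattice).string()


def load_fst(far_path: str, fst_name: str):
    """Read one named FST out of a precompiled FAR archive."""
    import pynini

    with pynini.Far(far_path) as far_reader:
        return far_reader[fst_name]
```

With the precompiled archive unpacked, usage would look something like apply_on_text(load_fst(path_to_reversible_roman_far, 'FROM_ARAB'), some_word), where path_to_reversible_roman_far and some_word are placeholders for your own values.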

sinaahmadi commented on June 13, 2024

Wonderful. Thanks a million for all your help. I could finally get what I needed using this:

import pynini
from nisaba.scripts.utils import far
far_obj = far.Far('/Users/sina/Desktop/normal/x86_64/abjad_alphabet/reversible_roman.far')

# Use a name other than `far` here so the imported module is not shadowed.
with pynini.Far('/Users/sina/Desktop/normal/x86_64/abjad_alphabet/reversible_roman.far') as far_reader:
    latin_from_arab = far_reader['FROM_ARAB']

far_obj_wrapper = far_obj.FstWrapper(latin_from_arab)
print(far_obj_wrapper.ApplyOnText("چونی"))

I commented out import nisaba.scripts.utils.file as uf and from rules_python.python.runfiles import runfiles in far.py and file.py.

One last question remains: should I import the different FST files, such as visual_norm, separately to get the different normalizations? And what is normalizer used for?

I am running some experiments on the normalization of the Perso-Arabic script and am trying to make sure that I evaluate Nisaba on my dataset well (as you mention here).

Sorry for making this issue so lengthy. I really like this project! 🙂

agutkin commented on June 13, 2024

Cool! Glad to hear we got a working setup outside of Bazel. The Normalizer (the implementation is at the bottom of the file) is just shorthand for running visual normalization. See the following method:

def ApplyOnWord(self, word: str) -> str:
    """Normalize a entire word."""
    return self._visual_norm.ApplyOnText(self._nfc.ApplyOnText(word))

In general, if you look at the actual implementations of the grammars, the input string x is composed sequentially with a series of transducers:

  • For the full visual (i.e., visually invariant) normalization pipeline, the composition is P_v = ((x ∘ NFC) ∘ V[L]), where V[L] is the visual normalization FST for language L and NFC is the language-agnostic NFC transducer. The V[L] transducer includes some common non-NFC transformations and presentation form handling (NFKC), as well as position-dependent and position-independent language-specific normalizations.
  • For the reading normalization pipeline (which visually modifies the output), the composition is P_r = (x ∘ P_v) ∘ R[L], where R[L] is the reading normalization FST. Please note that the reading normalization transducer is kept separate from the visual normalization one to keep the size of the resulting FSTs smaller.

Which one of the two pipelines to use is up to you. For some applications, like ASR, visual invariance may be required, for others the reading normalization pipeline may be more suitable.
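
The ordering of the two pipelines can be illustrated without any FSTs. The sketch below is a plain-Python stand-in, not Nisaba code: NFC comes from the standard unicodedata module, while VISUAL_RULES and READING_RULES are hypothetical placeholder rewrites chosen only for illustration and are not taken from the Nisaba grammars.

```python
import unicodedata

# Hypothetical placeholder rules, NOT taken from the Nisaba grammars.
# ARABIC LETTER KAF -> ARABIC LETTER KEHEH
VISUAL_RULES = {"\u0643": "\u06A9"}
# ARABIC LETTER ALEF MAKSURA -> ARABIC LETTER FARSI YEH
READING_RULES = {"\u0649": "\u06CC"}


def nfc(text: str) -> str:
    """Language-agnostic step: Unicode NFC normalization."""
    return unicodedata.normalize("NFC", text)


def rewrite(text: str, rules: dict) -> str:
    """Apply simple string replacements, standing in for an FST."""
    for source, target in rules.items():
        text = text.replace(source, target)
    return text


def visual_norm(text: str) -> str:
    """Stand-in for P_v: NFC first, then the visual rules V[L]."""
    return rewrite(nfc(text), VISUAL_RULES)


def reading_norm(text: str) -> str:
    """Stand-in for P_r: the visual pipeline first, then the reading rules R[L]."""
    return rewrite(visual_norm(text), READING_RULES)
```

The point of the sketch is only the order of application: reading normalization always runs on top of the visually normalized string, never instead of it.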

Hope this helps. I'm attaching our short software demo paper, to be presented at WANLP 2022, which also has a small Python example: wanlp2022.pdf

agutkin commented on June 13, 2024

Also, please take a look at the normalization code we released with the long paper. There's some irrelevant detail, but the following abstract snippet from the _process_corpus function should be important for you when you apply the normalization:

for tok in filter(None, input_line.split()):
    output_tok = tok  # By default, keep the incoming token unchanged.
    try:
        # `tok` is a string; pynini.accep(tok) turns it into a string FST.
        # Alternatively, use the ApplyOnText that you ported.
        output_tok_fst = (pynini.accep(tok) @ nfc_fst) @ visual_fst
        if output_tok_fst.start() != pynini.NO_STATE_ID:
            # Success! The result is non-empty! Convert the FST back to a string.
            # An empty string most likely should not happen, but guard against
            # it as well and keep the token as is in that case.
            output_tok = output_tok_fst.string() or tok
    except (pynini.FstStringCompilationError, pynini.FstOpError):
        # Error! Keep the incoming token unchanged.
        pass

agutkin commented on June 13, 2024

Presumably this has been fixed.

RohitMidha23 commented on June 13, 2024

@agutkin I have a similar issue. I am trying to use wellformed.far from the FARs that you linked earlier. My code:

wf_far_obj = Far("nisaba/x86_64/brahmic/wellformed.far")
with pynini.Far(
    "nisaba/x86_64/brahmic/wellformed.far"
) as far:
    wellformed = far["GUJR"]
wf_wrapper = wf_far_obj.FstWrapper(wellformed)
print(wf_wrapper.ApplyOnText("[ગુરી પડવાતી વચણા અમૃતનું વાંચન જે"))

This is the error I get:

ERROR: StringFstToOutputLabels: Invalid start state
Traceback (most recent call last):
  File "example.py", line 54, in ApplyOnText
    return pynini.shortestpath(pynini.escape(text) @ self._fst).string()
  File "extensions/_pynini.pyx", line 462, in _pynini.Fst.string
  File "extensions/_pynini.pyx", line 507, in _pynini.Fst.string
_pywrapfst.FstOpError: Operation failed

agutkin commented on June 13, 2024

The wellformed FST is an acceptor rather than a transducer. Please try using AcceptText rather than ApplyOnText and post the results here. The AcceptText API takes a string and returns a bool: whether the input string is accepted (well-formed) or rejected by the automaton.
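
The difference can be sketched with Pynini alone. This is an assumption about what such a check amounts to, not a copy of Nisaba's AcceptText: the escaped input is composed with the acceptor, and the string counts as accepted when the result is non-empty. The function name accept_text is illustrative.

```python
def accept_text(fst, text: str) -> bool:
    """Return True if the acceptor `fst` accepts `text`, False otherwise."""
    # Imported here so that this module still loads where Pynini is missing.
    import pynini

    # Compose the escaped input string with the acceptor.
    lattice = pynini.escape(text) @ fst
    # An empty composition has no start state, i.e. the string was rejected.
    return lattice.start() != pynini.NO_STATE_ID
```

With a check like this, a rejected string yields False, whereas calling .string() on an empty result presumably produces the "Invalid start state" failure shown in the traceback above.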

RohitMidha23 commented on June 13, 2024

@agutkin I see. It would be nice to have clearer documentation on this. How, then, can I turn text that is not well-formed into well-formed text?
I also wanted to ask you whether the following code is right:

far_obj = Far("nisaba/x86_64/brahmic/visual_norm.far")
with pynini.Far(
    "nisaba/x86_64/brahmic/visual_norm.far"
) as far:
    visual_norm = far["GUJR"]

vis_norm_wrapper = far_obj.FstWrapper(visual_norm)

texts = [ "બદલવું પડશેસ્વયમનું પરિવર્તન અનિવાર્એ છે મોક્સની"]

for line in texts:
    vn_line = vis_norm_wrapper.ApplyOnText(line)
    print(vn_line)

The reason I ask is, there seems to be no change in the input and output text although there are errors in the text I entered.

agutkin commented on June 13, 2024

Interesting. Visual normalization is supposed to canonicalize aksaras that are encoded differently but look the same. The reading normalization transducer fixes some additional issues but doesn't guarantee visual invariance. Also, neither normalization method is a substitute for a spell checker: these transducers solve a different problem.

@cibu, do you have anything to add to this?

RohitMidha23 commented on June 13, 2024

@agutkin Understandable. Are you aware of any good Gujarati grammar and spell checkers that you could recommend?
I tried reading normalization as well, but it gives me a segmentation fault for GUJR.

agutkin commented on June 13, 2024

@RohitMidha23, which Gujarati string results in a segmentation fault in the reading normalization transducer?

On the issue of Gujarati spell checkers - I don't know, to be honest.
