
Comments (10)

dannywillems commented on June 8, 2024

After investigating the C code further, and because I was also getting segfaults on a single core in rare cases, we found that the problem comes from the C code, not OCaml. I am closing this issue since, based on our latest investigation, it is not related to ocaml-multicore.

from ocaml-multicore.

jmid commented on June 8, 2024

I admit that I have not studied the code in detail, so take these comments with a well-intended grain of salt... 😄

I can see you use Domainslib which has had a few recent updates to be included in a forthcoming release.
The second of these updates addresses an issue with similar symptoms, which makes me suspect Domainslib rather than the multicore compiler:

  • Domainslib had an update to ocaml-multicore/domainslib#51 which changes the API slightly:
    computations have to be enclosed in a call to Task.run
  • Another PR, domainslib#50, which just got merged, addresses a problem where multiple pools would wreak havoc (non-deterministic behaviour, segfaults; see #43 and #58)

I would thus recommend trying things out with a fresh Domainslib from the repo.
opam source domainslib --dev --pin (or a suitably labelled pin-depends in the opam file) should do the job.
Of course Domainslib may not be to blame here - but it would be nice to rule it out 😀


dannywillems commented on June 8, 2024

See https://gitlab.com/dannywillems/ocaml-bls12-381/-/commit/d55f962516e3d33b10dec84f86296e0343a2987c.
It gives better results in the sense that it now runs on inputs of size 2^16 and 2^17. However, with 2^18, I get a segfault.

opam pin add domainslib.dev git+https://github.com/ocaml-multicore/domainslib\#df4afa26ebbaee6f0eecb26955fde28dee53a19d
➜  ocaml-bls12-381 git:(pippenger-improvement) ✗ dune exec ./benchmark/bench_g1_pippenger.exe -f -- 16 4 4 4
Number of elements: 65536 (2^16). Num domains = 4 and num task = 4, num chunk = 4
Multi core pippenger, contiguous array: 650.777000ms
Single core pippenger: 656.653000ms
Single core pippenger, contiguous array: 487.927000ms
Single core pippenger, contiguous array splitted in chunks: 578.026000ms
➜  ocaml-bls12-381 git:(pippenger-improvement) ✗ dune exec ./benchmark/bench_g1_pippenger.exe -f -- 17 4 4 4
Number of elements: 131072 (2^17). Num domains = 4 and num task = 4, num chunk = 4
Multi core pippenger, contiguous array: 1155.445000ms
Single core pippenger: 1256.583000ms
Single core pippenger, contiguous array: 923.629000ms
Single core pippenger, contiguous array splitted in chunks: 1094.259000ms
➜  ocaml-bls12-381 git:(pippenger-improvement) ✗ dune exec ./benchmark/bench_g1_pippenger.exe -f -- 18 4 4 4
[1]    639927 segmentation fault  dune exec ./benchmark/bench_g1_pippenger.exe -f -- 18 4 4 4

I also get the same behavior as before, i.e. it works when the multicore version runs alone.


jmid commented on June 8, 2024

Thanks for trying it out - at least a newer domainslib didn't make things worse... 😅

It would probably make sense to try a newer multicore compiler version (5.00).
There have been recent changes (e.g., #771 fixing #770 with similar symptoms) so it could be useful to know whether it also happens with the latest fixes (the backport in #781 to 4.12.0+domains wasn't merged IIUC).

Caveat:

  • Installing dependencies for a recent multicore can involve a bit of opam wrestling (#770 contains some advice). Keeping these to a minimum might make it easier to get it up and running (and reducing the number of moving parts can help identify the underlying cause)


dannywillems commented on June 8, 2024

Getting the same result:

cd /tmp
# Commit 4f89f41b7f597ca200a91002a60cb460dde3f6af for me
git clone https://github.com/ocaml-multicore/ocaml-multicore
cd ocaml-multicore
opam switch create . --empty
opam pin add -k path --inplace-build ocaml-variants.5.0.0+domains .
eval $(opam env)
ocaml --version
# > The OCaml toplevel, version 5.00.0+dev0-2021-11-05
# I suppose it is normal the version is not the commit hash.
opam pin add domainslib.dev git+https://github.com/ocaml-multicore/domainslib.git\#df4afa26ebbaee6f0eecb26955fde28dee53a19d
# Change path
opam pin add -k path ../dannywillems/ocaml-bls12-381
cd ../dannywillems/ocaml-bls12-381
$  dune exec ./benchmark/bench_g1_pippenger.exe -f -- 18 4 4 4
[1]    123595 segmentation fault  dune exec ./benchmark/bench_g1_pippenger.exe -f -- 18 4 4 4


jmid commented on June 8, 2024

Hm. I'm surprised it went so smoothly since I saw zarith, base, ... listed as dependencies in the opam file... (I may have misunderstood something) 🤔

Just to make sure I understand your reproduction steps:

Won't this pick up the "outer" OCaml version (perhaps 4.12.0+domains and the dependencies already installed there) rather than the 5.00.0 installed in the local switch?
Comparing ocamlrun -version and opam list from the two directories should help clarify this.

Edit: OK, I may have read your reproduction steps too literally. 🤷‍♂️
Just to confirm: Building is pretty smooth. Furthermore I can reproduce the segfault on my machine with the same latest commit (4f89f41) when I clone the gitlab repo and pin and build from the pippenger-improvement branch in the local switch.


jmid commented on June 8, 2024

OK. I've now taken a look at blst_bindings_stubs.c. Here is some feedback based on https://ocaml.org/manual/intfc.html
I'm no expert in the GC-interface - so again - take these with a well-intended grain of salt: 😅

In caml_blst_g1_pippenger_contiguous_affine_array_stubs

In allocate_p1_affine_array_stubs

A more general question: could bigarrays potentially achieve what you are trying to do here?


dannywillems commented on June 8, 2024

> it seems there is no check on the returned value from malloc/calloc?

It is something I still had to do. I have now added it.
Quite interesting: I expected this not to change anything, but... values up to 21 (i.e. inputs of size 2^21) worked. I have not tried larger values yet because it would take a long time.
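
For illustration, here is a minimal sketch of the kind of allocation check discussed above. It is not the code from the binding: `point_affine`, `alloc_points` and `free_points` are hypothetical names, and the struct layout is only a stand-in. In an actual stub, the failure branch would typically raise an OCaml exception (for instance with `caml_raise_out_of_memory` from `<caml/fail.h>`) rather than return NULL to the caller.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the C struct stored in the contiguous array. */
typedef struct {
  uint64_t x[6];
  uint64_t y[6];
} point_affine;

/* Allocate zeroed room for n points. Returns NULL (instead of a pointer
   that would crash later) when n is 0, when the byte count would
   overflow, or when the allocator runs out of memory. */
point_affine *alloc_points(size_t n) {
  if (n == 0 || n > SIZE_MAX / sizeof(point_affine)) {
    return NULL;
  }
  return calloc(n, sizeof(point_affine));
}

/* Release an array obtained from alloc_points. */
void free_points(point_affine *points) {
  assert(points != NULL); /* callers must have checked the allocation */
  free(points);
}
```

The caller is expected to test the result of `alloc_points` before touching the array, so that a failed allocation surfaces at the allocation site and not deep inside a later computation.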

> is the custom_block scalars accessed with Field contradicting https://ocaml.org/manual/intfc.html#ss:c-custom-access ?

Scalars are accessed using Field and Blst_fr_val because the value is an OCaml array, not a custom block. There is no OCaml value stored in a custom block in the binding. The C value is stored directly, without an indirection through a pointer + finalizer (this also gives better performance, see https://gitlab.com/dannywillems/ocaml-bls12-381/-/merge_requests/144).

> I don't know how caml_alloc_custom reacts in case allocation fails (probably an OOM exception)

I don't know either.

> I was a bit surprised that the allocated block isn't initialized.

It is the job of caml_blst_p1_affine_array_set_p1_points_stubs. The idea is to populate the contiguous C array using a Caml array.
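
As a rough illustration of this allocate-then-populate pattern, here is a sketch in plain C. It does not reproduce the binding's stubs: `point_affine` and `gather_points` are hypothetical, and the array of pointers stands in for the Caml array whose elements live in separate blocks.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the C struct of one affine point. */
typedef struct {
  uint64_t limbs[12];
} point_affine;

/* Copy n separately stored points into one contiguous buffer.
   The buffer comes from malloc and is therefore uninitialized until the
   loop has written every slot. Returns NULL if n is 0, the size would
   overflow, or the allocation fails; the caller frees the result. */
point_affine *gather_points(const point_affine *const *src, size_t n) {
  assert(src != NULL);
  if (n == 0 || n > SIZE_MAX / sizeof(point_affine)) {
    return NULL;
  }
  point_affine *dst = malloc(n * sizeof(point_affine));
  if (dst == NULL) {
    return NULL;
  }
  for (size_t i = 0; i < n; i++) {
    memcpy(&dst[i], src[i], sizeof(point_affine));
  }
  return dst;
}
```

Leaving the block uninitialized at allocation time is only safe under the assumption that every slot is written before any slot is read, which is the invariant the loop above maintains.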

> CAMLlocal1 is not used at the beginning of the function
> contrasting https://ocaml.org/manual/intfc.html#ss:c-simple-gc-harmony

It is used just after int n_c = Int_val(n), which is simply accessing a value. Not sure it is relevant.


dannywillems commented on June 8, 2024

> A more general question: could bigarrays potentially achieve what you are trying to do here?

bigarrays are only for integers/floats


jmid commented on June 8, 2024

OK, thanks 👍‍

I managed to run it under gdb before I got side-tracked:

$ gdb _build/default/benchmark/bench_g1_pippenger.exe
[...]
(gdb) run 18 4 4 4
Starting program: /tmp/ocaml-multicore/ocaml-bls12-381/_build/default/benchmark/bench_g1_pippenger.exe 18 4 4 4
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff0ded700 (LWP 33202)]
[New Thread 0x7fffeb7fe700 (LWP 33204)]
[New Thread 0x7fffebfff700 (LWP 33203)]
[New Thread 0x7fffeaffd700 (LWP 33205)]
[New Thread 0x7fffea7fc700 (LWP 33206)]
[New Thread 0x7fffe9ffb700 (LWP 33207)]
[New Thread 0x7fffe97fa700 (LWP 33208)]

Thread 1 "Domain0" received signal SIGSEGV, Segmentation fault.
0x000055555567b6a8 in POINTonE1xyzz_dadd_affine ()
(gdb) bt
#0  0x000055555567b6a8 in POINTonE1xyzz_dadd_affine ()
#1  0x000055555567bbd2 in POINTonE1_bucket ()
#2  0x0000555555681936 in POINTonE1s_tile_pippenger ()
#3  0x000055555568bbf3 in blst_p1s_mult_pippenger ()
#4  0x0000555555674308 in caml_blst_g1_pippenger_contiguous_affine_array_stubs (buffer=<optimized out>, affine_list=<optimized out>, scalars=<optimized out>, 
    start=<optimized out>, len=<optimized out>) at blst_bindings_stubs.c:1095
#5  <signal handler called>
#6  0x00005555555fea77 in camlBls12_381__G1__pippenger_with_affine_array_1520 () at src/blst/g1.ml:378
#7  0x00005555555f71cb in camlDomainslib__Task__do_task_486 () at lib/task.ml:41
#8  <signal handler called>
#9  0x00005555555f7911 in camlDomainslib__Task__loop_600 () at lib/task.ml:96
#10 0x00005555555f2e2d in camlDune__exe__Bench_g1_pippenger__with_pool_946 () at benchmark/bench_g1_pippenger.ml:59
#11 0x00005555555f3cb5 in camlDune__exe__Bench_g1_pippenger__entry () at benchmark/bench_g1_pippenger.ml:117
#12 0x00005555555ef72b in caml_program ()
#13 <signal handler called>
#14 0x00005555556cbdb5 in caml_startup_common (argv=0x7fffffffdcc8, pooling=<optimized out>, pooling@entry=0) at startup_nat.c:137
#15 0x00005555556cbdef in caml_startup_exn (argv=<optimized out>) at startup_nat.c:147
#16 caml_startup (argv=<optimized out>) at startup_nat.c:147
#17 0x00005555555ef1f2 in main (argc=<optimized out>, argv=<optimized out>) at main.c:37

Recompiling the (templated C/C++) code with debugging could probably help reveal what went wrong
by printing the value of parameters and local variables in the debugger.

There are also good debugging tricks and advice (e.g., on repeating a test and on rr) here:
https://github.com/ocaml-multicore/ocaml-multicore/wiki/Debugger-hacks
https://github.com/ocaml-multicore/ocaml-multicore/wiki/Debugging-the-OCaml-Multicore-runtime

