
Comments (15)

tatsuya6502 commented on June 12, 2024

Finally, I believe I fixed this issue via #157.

Last week, I got a new x86_64-based Linux PC with 20 logical cores (Intel Core i7-12700F), and it helped me a lot to reproduce and investigate the issue. I found the cause of the issue last night and fixed it. After the fix, I have not been able to reproduce the issue on either the PC (Linux x86_64) or the Mac (macOS arm64).

The cause was race conditions that occur when many threads concurrently rehash (extend or shrink) the internal hash table moka::cht. The creator of the original cht designed it to work in such a situation, but it does not work as expected. So I added a lock to ensure that only one thread can participate in rehashing at a time. This actually increased performance in my load tests, as it prevents heavy retries on the atomic CAS operation compare_exchange_weak.
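
To illustrate the idea of letting only one thread rehash at a time, here is a minimal sketch using only the standard library. It is not the code from #157: the `Table` type, `grow_to`, and `grow_concurrently` are hypothetical names, and the table is modelled by its capacity alone. The point it shows is the re-check under the lock, so that threads arriving late find the work already done instead of racing with CAS.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;

/// Hypothetical stand-in for a growable hash table (not moka::cht).
/// Only the capacity is modelled; there are no buckets.
pub struct Table {
    capacity: AtomicUsize,
    rehash_lock: Mutex<()>,
    rehash_count: AtomicUsize,
}

impl Table {
    pub fn new(capacity: usize) -> Self {
        Table {
            capacity: AtomicUsize::new(capacity),
            rehash_lock: Mutex::new(()),
            rehash_count: AtomicUsize::new(0),
        }
    }

    /// Grow to at least `wanted` slots.
    /// Returns true if this call performed the rehash.
    pub fn grow_to(&self, wanted: usize) -> bool {
        // Serialize rehashing: other threads block here instead of
        // competing with each other through CAS retries.
        let _guard = self.rehash_lock.lock().unwrap();
        // Re-check under the lock: another thread may already have grown
        // the table while we were waiting.
        if self.capacity.load(Ordering::Acquire) >= wanted {
            return false;
        }
        self.capacity
            .store(wanted.next_power_of_two(), Ordering::Release);
        self.rehash_count.fetch_add(1, Ordering::Relaxed);
        true
    }

    pub fn capacity(&self) -> usize {
        self.capacity.load(Ordering::Acquire)
    }

    pub fn rehash_count(&self) -> usize {
        self.rehash_count.load(Ordering::Relaxed)
    }
}

/// Ask `threads` threads to grow one table to `wanted` slots at once.
/// Returns (final capacity, number of rehashes actually performed).
pub fn grow_concurrently(threads: usize, wanted: usize) -> (usize, usize) {
    let table = Table::new(16);
    std::thread::scope(|s| {
        for _ in 0..threads {
            s.spawn(|| {
                table.grow_to(wanted);
            });
        }
    });
    (table.capacity(), table.rehash_count())
}
```

With this sketch, calling `grow_concurrently(8, 100)` performs exactly one rehash no matter how the eight threads interleave, because every later thread sees the new capacity after taking the lock.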

Also, I found that the memory ordering used for compare_exchange_weak was too weak for non-x86 platforms and might cause inconsistency between threads. So I changed it to one that I believe is strong enough.
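
For readers unfamiliar with the memory ordering arguments of compare_exchange_weak, here is a small standard-library sketch. It does not show which ordering Moka used before or after the fix; `fetch_max_cas` is a hypothetical helper, and `AcqRel`/`Acquire` is just one commonly used pairing for a CAS that both reads and publishes state. On x86 a weaker ordering often appears to work because the hardware is strongly ordered, while on Arm it may not.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Atomically raise `cell` to at least `candidate`.
/// Returns the value that was stored before this call took effect.
/// (Hypothetical helper for illustration; not part of Moka.)
pub fn fetch_max_cas(cell: &AtomicUsize, candidate: usize) -> usize {
    let mut current = cell.load(Ordering::Acquire);
    loop {
        if current >= candidate {
            // Nothing to do: the stored value is already large enough.
            return current;
        }
        match cell.compare_exchange_weak(
            current,
            candidate,
            Ordering::AcqRel,  // success: acquire prior writes, release ours
            Ordering::Acquire, // failure: still observe the winner's writes
        ) {
            Ok(previous) => return previous,
            // Lost the race or failed spuriously (allowed for the weak
            // variant): retry with the value we actually observed.
            Err(observed) => current = observed,
        }
    }
}
```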

#157 also upgrades crossbeam-epoch to the latest version (v0.9.9).

from moka.

tatsuya6502 commented on June 12, 2024

Hi @SimonSapin,

crossbeam-epoch 0.8.2 depends on crossbeam-utils 0.7.x, which is affected by GHSA-qc84-gqf4-9926

Thank you for the information.

Is the work around in #129 to upgrade moka’s dependency of crossbeam-epoch?

No. I do not think so, unfortunately.

I have another Moka repository here and it has crossbeam-epoch upgraded to v0.9.9:

and I ran the same test on Moka with both crossbeam-epoch v0.8.2 and v0.9.9. I found that Moka with crossbeam-epoch v0.9.9 still has the same issue.

Moka with crossbeam-epoch v0.9.9

It segfaulted four times in about four hours.

$ rg '(Segmentation fault|Bus error)' epoch09-2022-0618.log 
271:./run-tests-insert-once.sh: line 26: 94446 Segmentation fault: 11  ./target/release/mokabench --invalidate --insert-once
283:./run-tests-insert-once.sh: line 30: 94453 Segmentation fault: 11  ./target/release/mokabench --invalidate-entries-if --insert-once

$ rg '(Segmentation fault|Bus error)' epoch09-2022-0619A.log
243:./run-tests-insert-once.sh: line 18: 99154 Segmentation fault: 11  ./target/release/mokabench --insert-once --size-aware
326:./run-tests-insert-once.sh: line 30: 99301 Segmentation fault: 11  ./target/release/mokabench --invalidate-entries-if --insert-once

$ cat epoch09-2022-0618.log
...
cargo tree --all-features  
...
│   ├── crossbeam-epoch v0.9.9
│   │   ├── cfg-if v1.0.0
│   │   ├── crossbeam-utils v0.8.9 (*)

Moka with crossbeam-epoch v0.8.2

It segfaulted three times in about four hours.

$ rg '(Segmentation fault|Bus error)' epoch08-2022-0619.log 
349:./run-tests-insert-once.sh: line 26: 95369 Segmentation fault: 11  ./target/release/mokabench --invalidate --insert-once

$ rg '(Segmentation fault|Bus error)' epoch08-2022-0619B.log
339:./run-tests-insert-once.sh: line 30:   478 Segmentation fault: 11  ./target/release/mokabench --invalidate-entries-if --insert-once
385:./run-tests-insert-once.sh: line 38:   536 Segmentation fault: 11  ./target/release/mokabench --ttl 3 --tti 1 --invalidate --insert-once --size-aware

$ cat epoch08-2022-0619.log
...
cargo tree --all-features  
...
│   ├── crossbeam-epoch v0.8.2
│   │   ├── cfg-if v0.1.10
│   │   ├── crossbeam-utils v0.7.2

NOTE: To make segfaults occur more often, I used a modified Moka that sets the number of moka::cht::HashMap segments to 1. (The release versions have it set to 64.)

Anyway, I will continue evaluating crossbeam-epoch v0.9.9 in parallel with v0.8.2, and will upgrade Moka's dependency to v0.9.9 once I am confident that it does not increase the chance of segfaults.

I am also watching every release of the crossbeam-* and parking_lot crates, and testing them if they have any fixes for memory safety issues. I am reviewing the source code of Moka and these crates when I have time. I hope I can isolate the code causing the issue.
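
As a rough illustration of why the segment count matters, here is a standard-library sketch of how a segmented map might pick a segment for a key. `segment_index` is a hypothetical function and moka::cht may select segments differently; the sketch only shows that with a single segment every key, and therefore every thread, lands on the same table.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Map a key to one of `num_segments` segments.
/// Assumes `num_segments` is a power of two (hypothetical scheme,
/// not necessarily the one used by moka::cht).
pub fn segment_index<K: Hash>(key: &K, num_segments: usize) -> usize {
    assert!(num_segments.is_power_of_two());
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    // Keep only the low bits of the hash as the segment number.
    (hasher.finish() as usize) & (num_segments - 1)
}
```

With `num_segments` set to 1, the mask is zero and every key maps to segment 0, so all concurrent inserts and rehashes contend on one table. With 64 segments the same workload is spread out, which lowers the chance of two threads rehashing the same table at once.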

tatsuya6502 commented on June 12, 2024

FYI, I created a draft pull request #157 to upgrade crossbeam-epoch from v0.8.2 to v0.9.9. I scheduled it for the next patch release, Moka v0.8.7.

As I wrote in the PR, I will run some mokabench tests before merging it. I will be able to run mokabench for 6 hours a day (during the night), so if everything goes well, the tests will complete in 4 days (24 hours in total).

tatsuya6502 commented on June 12, 2024

Based on my test results, it might be worth downgrading crossbeam-epoch from v0.9.5 to v0.8.2 to work around the issue. I am preparing the Moka v0.5.2 release with moka-cht v0.4.2 and crossbeam-epoch v0.8.2.

tatsuya6502 commented on June 12, 2024

Released Moka v0.5.2 with moka-cht v0.4.2 and crossbeam-epoch v0.8.2.

Unfortunately, the same segmentation fault (pattern 1) occurred when I was running mokabench on Moka v0.5.2. I released v0.5.2 anyway, as earlier versions of Moka may already have the same issue, and I feel segmentation faults are less frequent with crossbeam-epoch v0.8.2.

tatsuya6502 commented on June 12, 2024

Just to be sure, I tried Rust 1.53.0 to compile mokabench + Moka v0.5.3. I did it because I had not tried Rust 1.53.0 since Moka v0.5.1 was released. The result was the same; it got a segmentation fault after running mokabench for ~2 hours. I used an EC2 instance type with 36 vCPUs.

Segfaults?  Rust    Moka    crossbeam-epoch  vCPUs
yes         1.53.0  v0.5.3  v0.8.2           36
yes         1.54.0  v0.5.2  v0.8.2           36
yes         1.55.0  v0.5.3  v0.8.2           36

lpi commented on June 12, 2024

Any progress on this? Does it ever happen on something with 18 vCPUs?

tatsuya6502 commented on June 12, 2024

Any progress on this?

No 😞. I spent a few more days running different tests, doing code review, etc., but could not find any clue.

I am currently constrained by time (I have to run the test for at least a few hours to reproduce the issue) and money (a 36 vCPU instance is expensive; $1.926/hour). I will revisit this issue when I have more time.

Does it ever happen on something with 18 vCPUs?

No. It has never happened on a 16 vCPU instance in my tests. Also, no Moka users have reported this or similar problems.

Are you holding off on using Moka because of this problem? If so, perhaps I will add an optional Cargo feature to use an alternative hash table. It will spoil concurrent performance but will be safer.

tatsuya6502 commented on June 12, 2024

Here are some updates on this issue.

It has been five months since I first saw this issue, but (fortunately) no user of this crate has reported segfaults:

  • Segfaults have been occurring only in my testing environment (Amazon EC2) with 32 or more vCPUs.
  • In my testing environment, segfaults have occurred only when the following methods are used:
    • get_or_insert_with
    • get_or_try_insert_with

On January 5th, 2022, I ran the same load tests (mokabench) against Moka v0.7.0 on the following EC2 instances and had some segfaults only on the instances with 32 vCPUs:

Moka Version  Instance Type  vCPUs    Architecture  OS              Number of segfaults occurred
v0.7.0        c6i.8xlarge    32 vCPU  x86_64        Amazon Linux 2  2 times in 4 hours
v0.7.0        c6g.8xlarge    32 vCPU  AArch64       Amazon Linux 2  3 times in 4 hours
v0.7.0        c6i.4xlarge    16 vCPU  x86_64        Amazon Linux 2  0 times in 4 hours

I ran the same but shorter load tests as a part of pre-release testing for v0.7.1 (January 12th, 2022) and v0.7.2 (February 6th, 2022). There was no segfault for v0.7.2:

Moka Version  Instance Type  vCPUs    Architecture  OS              Number of segfaults occurred
v0.7.1        c6i.8xlarge    32 vCPU  x86_64        Amazon Linux 2  1 time in 2.5 hours
v0.7.2        c6i.8xlarge    32 vCPU  x86_64        Amazon Linux 2  0 times in 4 hours

v0.7.2 has fixes and enhancements for #72. It might have mitigated the issue but I am not 100% sure because I still have not figured out the root cause of those segfaults.

Dessix commented on June 12, 2024

MIRI or Loom may be able to spot the issue, if you use them to test the contracts of the internal HashMap implementation.

tatsuya6502 commented on June 12, 2024

Here are some updates on this issue.

  • Segfaults are occurring only in my testing environments.
    • Nobody else has reported this or similar issues.
  • With unmodified Moka source code, I need an Amazon EC2 instance with 32 or more vCPUs to reproduce this issue.
  • If I modify Moka's source code to reduce the number of internal segments of our HashMap from 16 to 1, I can reproduce this issue on the following machines:
    • Mac mini M1 running macOS arm64. (4 × performance cores + 4 × efficiency cores)
    • QEMU on Mac mini M1 running Ubuntu Server Arm (AArch64). (4 × vCPUs)
  • I generate the workload using the mokabench program, with 36 to 48 client threads concurrently reading from and writing to one cache.
    • mokabench will repeat short (~15 seconds) but very intensive workload.
    • It usually takes 1 to 2 hours to reproduce the issue.

Our internal HashMap is a lock-free container and heavily depends on atomic operations such as compare-and-swap (CAS). It seems parallelism is the key to triggering the issue; e.g. more than one processor core executing CAS on the same memory location at the same time. It also heavily depends on crossbeam-epoch's epoch-based memory reclamation (garbage collection, GC), which also relies on CAS.

I think the most suspicious area is rehashing, which is used to extend HashMap capacity and to run epoch-GC on deleted keys. There should be lots of CAS conflicts and retries, and epoch-GC occurs during rehashing.
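
To give a feel for the CAS conflicts and retries mentioned above, here is a small standard-library sketch that makes several threads update one memory location with compare_exchange_weak and counts how often a CAS has to be retried. `contended_increments` is a hypothetical helper, unrelated to Moka's code, and the retry count will vary from run to run and from machine to machine.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

/// Increment one shared counter from `threads` threads, `per_thread`
/// times each, using a CAS loop.
/// Returns (final counter value, total number of CAS retries).
pub fn contended_increments(threads: usize, per_thread: usize) -> (usize, usize) {
    let value = AtomicUsize::new(0);
    let retries = AtomicUsize::new(0);
    thread::scope(|s| {
        for _ in 0..threads {
            s.spawn(|| {
                for _ in 0..per_thread {
                    let mut current = value.load(Ordering::Relaxed);
                    // Retry until our CAS wins. A failure means another
                    // thread changed the value first, or the weak CAS
                    // failed spuriously.
                    while let Err(observed) = value.compare_exchange_weak(
                        current,
                        current + 1,
                        Ordering::AcqRel,
                        Ordering::Relaxed,
                    ) {
                        retries.fetch_add(1, Ordering::Relaxed);
                        current = observed;
                    }
                }
            });
        }
    });
    (value.into_inner(), retries.into_inner())
}
```

The final value is always `threads * per_thread`, but the number of retries tends to grow with the number of cores hitting the same location at once, which matches the observation that more cores (or fewer segments) make the problem easier to trigger.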

Action Plans

  1. To mitigate the issue, increase the number of the internal segments of our HashMap.
  2. Continue testing with different configurations to isolate the problem area:
    • e.g. Modify the codes to change rehashing behavior.
  3. Enable Loom testing:
    • This may require a non-trivial amount of work.
    • e.g. We will need to upgrade crossbeam-epoch from v0.8.3 to v0.9 to get Loom support (?)
  4. Enable Miri testing on the HashMap etc.
    • This may require a non-trivial amount of work too.
    • I already tried this in January 2022, but I could not get even a single unit test to finish in ~10 hours. (Miri is very slow when testing multi-threaded code.)
    • We will need to reduce the number of threads and number of cache entries in each test until Miri can finish in a reasonable time frame.
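
One possible way to shrink a test only when it runs under Miri is to branch on `cfg!(miri)`, which Miri sets. The sketch below uses only the standard library; `stress_params` and `run_stress` are hypothetical names, the numbers are placeholders, and the workload is a plain atomic counter rather than the cache.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

/// (number of threads, operations per thread).
/// The values are placeholders; the Miri ones are kept tiny because the
/// interpreter is far slower than native execution.
pub fn stress_params() -> (usize, usize) {
    if cfg!(miri) {
        (2, 50)
    } else {
        (8, 10_000)
    }
}

/// Run a trivial multi-threaded workload sized by `stress_params`.
/// Returns the total number of operations performed.
pub fn run_stress() -> usize {
    let (threads, ops) = stress_params();
    let counter = Arc::new(AtomicUsize::new(0));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..ops {
                    counter.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    counter.load(Ordering::Relaxed)
}
```

The same test body then runs natively with the full workload and under `cargo miri test` with the reduced one, so the two configurations do not drift apart.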

tatsuya6502 commented on June 12, 2024

To mitigate the issue, increase the number of the internal segments of our HashMap.

This workaround is added via #129.

SimonSapin commented on June 12, 2024

Cargo.toml points here:

moka/Cargo.toml

Lines 52 to 55 in 8f61b35

# Although v0.8.2 is not the current version (v0.9.x), we will keep using it until
# we perform enough tests to get conformable with memory safety.
# See: https://github.com/moka-rs/moka/issues/34
crossbeam-epoch = "0.8.2"

crossbeam-epoch 0.8.2 depends on crossbeam-utils 0.7.x, which is affected by GHSA-qc84-gqf4-9926

Is the work around in #129 to upgrade moka’s dependency of crossbeam-epoch?

tatsuya6502 commented on June 12, 2024

FYI, I created a draft pull request #157 to upgrade crossbeam-epoch from v0.8.2 to v0.9.9.
...
so if everything goes well, the test will complete in 4 days (total 24 hours).

Unfortunately, I found that upgrading crossbeam-epoch to v0.9.9 would actually make this issue worse on Linux x86_64. It occurred ~15% more often with v0.9.9 than with v0.8.2. So I am hesitant to merge the PR.

Just to be sure, I will run the same test again this weekend.

tatsuya6502 commented on June 12, 2024

I have published v0.9.2 with this fix to crates.io.
