Comments (15)
Finally, I believe I fixed this issue via #157.
Last week, I got a new x86_64 based Linux PC with 20 logical cores (Intel Core i7-12700F), and it helped me a lot to reproduce and investigate the issue. I found the cause of the issue last night and fixed it. After the fix, I have not been able to reproduce the issue on either the PC (Linux x86_64) or the Mac (macOS arm64).
The cause was race conditions when many threads concurrently rehash (extend or shrink) the internal hash table `moka::cht`. The creator of the original `cht` designed it to work in such a situation, but it does not work as expected. So I added a lock to ensure only one thread can participate in rehashing at a time. This actually increased performance in my load tests, as it prevents heavy retries on the atomic CAS operation `compare_exchange_weak`.
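A minimal sketch of the "only one thread rehashes at a time" idea, not the actual code from #157. The `Table` type, its `rehash_lock` field, and `try_rehash` are hypothetical names, and whether losing threads skip or wait in the real fix is an assumption here; the real `moka::cht` bucket array is considerably more involved.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical stand-in for one hash table segment (not moka::cht's real type).
pub struct Table {
    /// Current capacity; readers load it without taking the lock.
    capacity: AtomicUsize,
    /// Guards rehashing so at most one thread resizes at a time.
    rehash_lock: Mutex<()>,
}

impl Table {
    pub fn new(capacity: usize) -> Self {
        Self {
            capacity: AtomicUsize::new(capacity),
            rehash_lock: Mutex::new(()),
        }
    }

    pub fn capacity(&self) -> usize {
        self.capacity.load(Ordering::Acquire)
    }

    /// Doubles the capacity if it still equals `observed`.
    /// Returns `true` only for the thread that performed the rehash.
    pub fn try_rehash(&self, observed: usize) -> bool {
        // Assumption: a thread that cannot get the lock simply gives up.
        let _guard = match self.rehash_lock.try_lock() {
            Ok(guard) => guard,
            Err(_) => return false,
        };
        // Another thread may have finished a rehash before we got the lock.
        if self.capacity.load(Ordering::Acquire) != observed {
            return false;
        }
        // A real table would copy its buckets into a larger array here.
        self.capacity.store(observed * 2, Ordering::Release);
        true
    }
}

/// Spawns `threads` threads that all try to rehash the same table from
/// capacity 8, and returns how many of them actually performed the rehash.
pub fn concurrent_rehash_count(threads: usize) -> usize {
    let table = Arc::new(Table::new(8));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let table = Arc::clone(&table);
            thread::spawn(move || table.try_rehash(8))
        })
        .collect();
    handles
        .into_iter()
        .map(|h| h.join().unwrap())
        .filter(|&won| won)
        .count()
}
```

However many threads race in `concurrent_rehash_count`, exactly one performs the rehash: the first thread to take the lock sees the observed capacity and doubles it, and every later thread either fails `try_lock` or finds the capacity already changed.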
Also, I found that the memory ordering used for `compare_exchange_weak` was too weak for non-x86 platforms and may cause inconsistency between threads. So I changed it to one that I believe is strong enough.
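For illustration, here is a CAS retry loop that passes explicit memory orderings to `compare_exchange_weak`. The orderings chosen (`AcqRel` on success, `Acquire` on failure) are an assumption for this sketch; the comment above does not say which orderings #157 uses. `cas_add` and `contended_total` are hypothetical helpers.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

/// Atomically adds `delta` to `counter` with a CAS retry loop and
/// returns the value that was stored before the addition.
pub fn cas_add(counter: &AtomicUsize, delta: usize) -> usize {
    let mut current = counter.load(Ordering::Acquire);
    loop {
        match counter.compare_exchange_weak(
            current,
            current + delta,
            Ordering::AcqRel,  // success: acquire earlier writes, publish ours
            Ordering::Acquire, // failure: still observe the winner's writes
        ) {
            Ok(previous) => return previous,
            // Lost the race (or failed spuriously): retry with the fresh value.
            Err(actual) => current = actual,
        }
    }
}

/// Runs `threads` threads that each call `cas_add(.., 1)` `per_thread` times
/// on a shared counter, and returns the final counter value.
pub fn contended_total(threads: usize, per_thread: usize) -> usize {
    let counter = Arc::new(AtomicUsize::new(0));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..per_thread {
                    cas_add(&counter, 1);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    counter.load(Ordering::Acquire)
}
```

On x86_64 the hardware gives fairly strong ordering even when `Relaxed` is requested, so an ordering that is too weak can go unnoticed there and only misbehave on architectures such as AArch64.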
#157 also upgrades crossbeam-epoch to the latest version (v0.9.9).
from moka.
Hi @SimonSapin,
crossbeam-epoch 0.8.2 depends on crossbeam-utils 0.7.x, which is affected by GHSA-qc84-gqf4-9926
Thank you for the information.
Is the work around in #129 to upgrade moka’s dependency of crossbeam-epoch?
No. I do not think so, unfortunately.
I have another Moka repository here and it has crossbeam-epoch upgraded to v0.9.9:
and I ran the same test on both Moka with crossbeam-epoch v0.8.2 and v0.9.9. I found Moka with crossbeam-epoch v0.9.9 is still having the same issue.
Moka with crossbeam-epoch v0.9.9
Segfaulted four times in about four hours.
$ rg '(Segmentation fault|Bus error)' epoch09-2022-0618.log
271:./run-tests-insert-once.sh: line 26: 94446 Segmentation fault: 11 ./target/release/mokabench --invalidate --insert-once
283:./run-tests-insert-once.sh: line 30: 94453 Segmentation fault: 11 ./target/release/mokabench --invalidate-entries-if --insert-once
$ rg '(Segmentation fault|Bus error)' epoch09-2022-0619A.log
243:./run-tests-insert-once.sh: line 18: 99154 Segmentation fault: 11 ./target/release/mokabench --insert-once --size-aware
326:./run-tests-insert-once.sh: line 30: 99301 Segmentation fault: 11 ./target/release/mokabench --invalidate-entries-if --insert-once
$ cat epoch09-2022-0618.log
...
cargo tree --all-features
...
│ ├── crossbeam-epoch v0.9.9
│ │ ├── cfg-if v1.0.0
│ │ ├── crossbeam-utils v0.8.9 (*)
Moka with crossbeam-epoch v0.8.2
Segfaulted three times in about four hours.
$ rg '(Segmentation fault|Bus error)' epoch08-2022-0619.log
349:./run-tests-insert-once.sh: line 26: 95369 Segmentation fault: 11 ./target/release/mokabench --invalidate --insert-once
$ rg '(Segmentation fault|Bus error)' epoch08-2022-0619B.log
339:./run-tests-insert-once.sh: line 30: 478 Segmentation fault: 11 ./target/release/mokabench --invalidate-entries-if --insert-once
385:./run-tests-insert-once.sh: line 38: 536 Segmentation fault: 11 ./target/release/mokabench --ttl 3 --tti 1 --invalidate --insert-once --size-aware
$ cat epoch08-2022-0619.log
...
cargo tree --all-features
...
│ ├── crossbeam-epoch v0.8.2
│ │ ├── cfg-if v0.1.10
│ │ ├── crossbeam-utils v0.7.2
NOTE: To make segfaults occur more often, I used a modified Moka that sets the number of `moka::cht::HashMap` segments to 1. (The release versions have it set to 64.)
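To show why the segment count matters, here is a rough sketch of how a segmented hash map might pick a segment from a key's hash. This is an assumption for illustration, not Moka's actual selection logic; `segment_index` and `segments_touched` are hypothetical helpers.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Maps a key to one of `num_segments` segments by hashing it.
/// (Hypothetical scheme; the real table may use different hash bits.)
pub fn segment_index<K: Hash>(key: &K, num_segments: usize) -> usize {
    assert!(num_segments > 0, "at least one segment is required");
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    (hasher.finish() % num_segments as u64) as usize
}

/// Counts how many distinct segments a set of keys lands in.
pub fn segments_touched(keys: &[u64], num_segments: usize) -> usize {
    let mut used = vec![false; num_segments];
    for key in keys {
        used[segment_index(key, num_segments)] = true;
    }
    used.iter().filter(|&&hit| hit).count()
}
```

With a single segment, every key maps to index 0, so all client threads contend on the same underlying table and its rehashing. With more segments the same keys spread out, which lowers the chance that many threads rehash the same table at once.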
Anyway, I will continue evaluating crossbeam-epoch v0.9.9 in parallel with v0.8.2, and will upgrade Moka's dependency to v0.9.9 once I feel v0.9.9 will not increase the chance of segfaults.
I am also watching every release of the crossbeam-* and parking_lot crates, and testing them if they have any fixes for memory safety issues. I am reviewing Moka's and their source code when I have time. I hope I can isolate the code causing the issue.
from moka.
FYI, I created a draft pull request #157 to upgrade crossbeam-epoch from v0.8.2 to v0.9.9. I scheduled it for the next patch release, Moka v0.8.7.
As I wrote in the PR, I will run some mokabench tests before merging it. I will be able to run mokabench for 6 hours a day (overnight), so if everything goes well, the test will complete in 4 days (total 24 hours).
from moka.
Based on my test results, it might be worth downgrading crossbeam-epoch from v0.9.5 to v0.8.2 to work around the issue. I am preparing the Moka v0.5.2 release with moka-cht v0.4.2 and crossbeam-epoch v0.8.2.
from moka.
Released Moka v0.5.2 with moka-cht v0.4.2 and crossbeam-epoch v0.8.2.
Unfortunately, the same segmentation fault (pattern 1) occurred when I was running mokabench on Moka v0.5.2. I released v0.5.2 anyway, as earlier versions of Moka may already have the same issue, and I feel segmentation faults are less frequent with crossbeam-epoch v0.8.2.
from moka.
Just to be sure, I tried Rust 1.53.0 to compile mokabench + Moka v0.5.3. I did it because I had not tried Rust 1.53.0 since Moka v0.5.1 was released. The result was the same; it got a segmentation fault after running mokabench for ~2 hours. I used an EC2 instance type with 36 vCPUs.
Segfaults? | Rust | Moka | crossbeam-epoch | vCPUs |
---|---|---|---|---|
yes | 1.53.0 | v0.5.3 | v0.8.2 | 36 |
yes | 1.54.0 | v0.5.2 | v0.8.2 | 36 |
yes | 1.55.0 | v0.5.3 | v0.8.2 | 36 |
- Rust 1.53.0 — rust-lang/rust#86036
- Rust 1.54.0 — rust-lang/rust#82834
from moka.
Any progress on this? Does it ever happen on something with 18 vCPUs?
from moka.
Any progress on this?
No 😞. I spent a few more days running different tests, doing code review, etc., but could not find any clue.
I am currently constrained by time (I have to run the test for at least a few hours to reproduce) and money (a 36 vCPU instance is expensive; $1.926/hour). I will revisit this issue when I have more time.
Does it ever happen on something with 18 vCPUs?
No. It has never happened on a 16 vCPU instance in my tests. Also, no Moka users have reported this or similar problems.
Are you holding off on using Moka because of this problem? If so, perhaps I will add an optional Cargo feature to use an alternative hash table. It will spoil concurrent performance but will be safer.
from moka.
Here are some updates on this issue.
It has been five months since I first saw this issue, but (fortunately) no user of this crate has reported segfaults:
- Segfaults have been occurring only in my testing environment (Amazon EC2) with 32 or more vCPUs.
- In my testing environment, segfaults have occurred only when the following methods are used:
  - `get_or_insert_with`
  - `get_or_try_insert_with`
On January 5th, 2022, I ran the same load tests (mokabench) against Moka v0.7.0 on the following EC2 instances and had some segfaults only on the instances with 32 vCPUs:
Moka Version | Instance Type | vCPUs | Architecture | OS | Number of segfaults occurred |
---|---|---|---|---|---|
v0.7.0 | c6i.8xlarge | 32 vCPU | x86_64 | Amazon Linux 2 | 2 times in 4 hours |
v0.7.0 | c6g.8xlarge | 32 vCPU | AArch64 | Amazon Linux 2 | 3 times in 4 hours |
v0.7.0 | c6i.4xlarge | 16 vCPU | x86_64 | Amazon Linux 2 | 0 times in 4 hours |
I ran the same but shorter load tests as a part of pre-release testing for v0.7.1 (January 12th, 2022) and v0.7.2 (February 6th, 2022). There was no segfault for v0.7.2:
Moka Version | Instance Type | vCPUs | Architecture | OS | Number of segfaults occurred |
---|---|---|---|---|---|
v0.7.1 | c6i.8xlarge | 32 vCPU | x86_64 | Amazon Linux 2 | 1 time in 2.5 hours |
v0.7.2 | c6i.8xlarge | 32 vCPU | x86_64 | Amazon Linux 2 | 0 times in 4 hours |
v0.7.2 has fixes and enhancements for #72. It might have mitigated the issue but I am not 100% sure because I still have not figured out the root cause of those segfaults.
from moka.
Miri or Loom may be able to spot the issue, if you use them to test the contracts of the internal HashMap implementation.
from moka.
Here are some updates on this issue.
- Segfaults are occurring only in my testing environments.
- Nobody else has reported this or similar issues.
- With unmodified Moka source code, I need an Amazon EC2 instance with 32 or more vCPUs to reproduce this issue.
- If I modify Moka's source code to reduce the number of internal segments of our HashMap from 16 to 1, I can reproduce this issue on the following machines:
  - Mac mini M1 running macOS arm64. (4 × performance cores + 4 × efficiency cores)
  - QEMU on Mac mini M1 running Ubuntu Server Arm (AArch64). (4 × vCPUs)
- I generate the workload using the mokabench program, with 36 to 48 client threads concurrently reading from and writing to one cache.
- mokabench will repeat short (~15 seconds) but very intensive workload.
- It usually takes 1 to 2 hours to reproduce the issue.
Our internal HashMap is a lock-free container and heavily depends on atomic operations such as compare-and-swap (CAS). It seems parallelism is the key to triggering the issue; e.g. more than one processor core executing CAS on the same memory location at the same time. It also heavily depends on crossbeam-epoch's epoch-based memory reclamation (garbage collection, GC), which also relies on CAS.
I think the most suspicious area is rehashing, which is used to extend the HashMap capacity and to run epoch-GC on deleted keys. There should be lots of CAS conflicts and retries, and epoch-GCs occur during rehashing.
Action Plans
- To mitigate the issue, increase the number of the internal segments of our HashMap.
- Continue testing with different configurations to isolate the problem area:
  - e.g. Modify the code to change the rehashing behavior.
- Enable Loom testing:
  - This may require a nontrivial amount of work.
  - e.g. We will need to upgrade crossbeam-epoch from v0.8.3 to v0.9 to get Loom support (?)
- Enable Miri testing on the HashMap etc.
  - This may require a nontrivial amount of work too.
  - I already tried this in January 2022, but I could not get even a single unit test to finish in ~10 hours. (Miri is very slow when testing multi-threaded code.)
  - We will need to reduce the number of threads and the number of cache entries in each test until Miri can finish in a reasonable time frame.
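One way to shrink a test under Miri is to branch on `cfg!(miri)`, which is set when code runs under `cargo miri`. The sketch below is illustrative only: `test_params` and `run_workload` are hypothetical helpers, the specific thread and entry counts are arbitrary, and a `Mutex<HashMap>` stands in for the cache.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread;

/// Returns (number of threads, entries per thread) for a concurrency test.
/// The workload is scaled down under Miri, which runs much more slowly.
pub fn test_params() -> (usize, usize) {
    if cfg!(miri) {
        (2, 16)
    } else {
        (8, 10_000)
    }
}

/// Inserts `entries_per_thread` distinct keys from each of `num_threads`
/// threads into a shared map and returns the final number of entries.
/// The `Mutex<HashMap>` is only a stand-in for the cache under test.
pub fn run_workload(num_threads: usize, entries_per_thread: usize) -> usize {
    let map = Arc::new(Mutex::new(HashMap::new()));
    let handles: Vec<_> = (0..num_threads)
        .map(|t| {
            let map = Arc::clone(&map);
            thread::spawn(move || {
                for i in 0..entries_per_thread {
                    // Each thread writes to its own key range.
                    let key = t * entries_per_thread + i;
                    map.lock().unwrap().insert(key, i);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    let len = map.lock().unwrap().len();
    len
}
```

A test would call `let (threads, entries) = test_params();` and pass both values to the workload, so the same test body runs at full size natively and at a reduced size under Miri.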
from moka.
To mitigate the issue, increase the number of the internal segments of our HashMap.
This workaround is added via #129.
from moka.
`Cargo.toml` points here:
Lines 52 to 55 in 8f61b35
crossbeam-epoch 0.8.2 depends on crossbeam-utils 0.7.x, which is affected by GHSA-qc84-gqf4-9926
Is the work around in #129 to upgrade moka’s dependency of crossbeam-epoch?
from moka.
FYI, I created a draft pull request #157 to upgrade crossbeam-epoch from v0.8.2 to v0.9.9.
...
so if everything goes well, the test will complete in 4 days (total 24 hours).
Unfortunately, I found that upgrading crossbeam-epoch to v0.9.9 would actually make this issue worse on Linux x86_64. It occurred ~15% more often with v0.9.9 than with v0.8.2. So I am hesitant to merge the PR.
Just to be sure, I will run the same test again this weekend.
from moka.
I have published v0.9.2 with this fix to crates.io.
from moka.