
Runtime Error about QP (sherman, CLOSED)

baotonglu commented on June 19, 2024
Runtime Error about QP

from sherman.

Comments (13)

baotonglu commented on June 19, 2024

The following is the output of ibv_devinfo. mlx5_1 uses InfiniBand while mlx5_0 uses RoCE. I am not sure whether this physical setup affects this project.

hca_id: mlx5_1
        transport:                      InfiniBand (0)
        fw_ver:                         16.33.1048
        node_guid:                      9803:9b03:0003:4641
        sys_image_guid:                 9803:9b03:0003:4640
        vendor_id:                      0x02c9
        vendor_part_id:                 4119
        hw_ver:                         0x0
        board_id:                       MT_0000000008
        phys_port_cnt:                  1
        Device ports:
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 6
                        port_lid:               67
                        port_lmc:               0x00
                        link_layer:             InfiniBand

hca_id: mlx5_0
        transport:                      InfiniBand (0)
        fw_ver:                         16.33.1048
        node_guid:                      9803:9b03:0003:4640
        sys_image_guid:                 9803:9b03:0003:4640
        vendor_id:                      0x02c9
        vendor_part_id:                 4119
        hw_ver:                         0x0
        board_id:                       MT_0000000008
        phys_port_cnt:                  1
        Device ports:
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             1024 (3)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet

Transpeptidase commented on June 19, 2024

Hi, you can modify this line according to the RDMA NIC you want to use, where ibv_get_device_name(deviceList[i]) == "mlx5_0" or "mlx5_1"

if (ibv_get_device_name(deviceList[i])[5] == '0') {
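
The check above looks only at the sixth character of the device name, so it cannot tell mlx5_1 from a name such as mlx5_10. Below is a minimal sketch of matching on the full name instead. findDeviceIndex and the plain vector of names are hypothetical stand-ins for the loop over deviceList; in the real code each name would come from ibv_get_device_name(deviceList[i]).

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper, not part of Sherman: returns the index of the device
// whose full name equals `wanted`, or -1 if no device matches.
// In the real code each name would come from
// ibv_get_device_name(deviceList[i]).
int findDeviceIndex(const std::vector<std::string> &names,
                    const std::string &wanted) {
  for (std::size_t i = 0; i < names.size(); ++i) {
    if (names[i] == wanted) {
      return static_cast<int>(i);
    }
  }
  return -1;
}
```

Comparing whole names costs a little more than a single-character test, but it runs once at startup and avoids picking the wrong NIC on hosts with many devices.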

Transpeptidase commented on June 19, 2024

If you use RoCE, you should also set gidIndex according to the output of the shell command show_gids; the right value is usually 3.

bool createContext(RdmaContext *context, uint8_t port = 1, int gidIndex = 1,

Moreover, please check the RDMA connectivity using tools like ibv_read_lat

baotonglu commented on June 19, 2024

Sorry for the messy message. Let me rephrase my questions.

baotonglu commented on June 19, 2024

Hi,

Thanks for your help. After changing the ibv_get_device_name(deviceList[i]) check, it connects well. However, I now hit a segmentation fault at runtime. The output below is from a two-machine run (./benchmark 2 100 10).

The machine 0 has the following output:

kNodeCount 2, kReadRatio 100, kThreadCount 10
shared memory size: 8GB, 0x7f2880000000
cache size: 1GB
Machine NR: 2
Device 0: mlx5_1
NIC Device Memory is 128KB
I am servers 0 [255.255.255.255]
I connect server 1
dir 0 launch!

Header size: 35
Internal Page size: 1024 [1024]
Internal per Page: 61
Leaf Page size: 1018 [1024]
Leaf per Page: 54
LeafEntry size: 18
InternalEntry size: 16
new root level 1 [0, 33556480]
new root level 2 [0, 33619968]
new root level 3 [0, 36002816]
I am 2
I am 1
I am 3
I am 4
I am 0
I am 5
I am 7
I am 6
I am 8
I am 9
new root level 4 [1, 137316352]
Deadlock [0, 31200]
0, 8 locked by 1, 1
benchmark: /home/v-baotonglu/Sherman/src/Tree.cpp:246: bool Tree::try_lock_addr(GlobalAddress, uint64_t, uint64_t*, CoroContext*, int): Assertion `false' failed

The machine 1 (with gdb) has the following output:

kNodeCount 2, kReadRatio 100, kThreadCount 10
shared memory size: 8GB, 0x7ffd80000000
cache size: 1GB
Machine NR: 2
NIC Device Memory is 128KB
I am servers 1 [255.255.255.255]
I connect server 0
[New Thread 0x7ffff4157700 (LWP 47535)]
dir 0 launch!

Tree root pointer value [1, 33554432]
[New Thread 0x7ffff3914700 (LWP 47561)]
[New Thread 0x7ffff3113700 (LWP 47562)]
I am 10
[New Thread 0x7ffff2912700 (LWP 47563)]
I am 11
[New Thread 0x7ffff2111700 (LWP 47564)]
I am 12
I am 13
[New Thread 0x7ffff1910700 (LWP 47565)]
I am 14
[New Thread 0x7ffff110f700 (LWP 47566)]
I am 15
[New Thread 0x7ffff090e700 (LWP 47567)]
I am 16
[New Thread 0x7fffdbfff700 (LWP 47568)]
I am 17
[New Thread 0x7fffdb7fe700 (LWP 47569)]
I am 18
[New Thread 0x7fffdaffd700 (LWP 47570)]
I am 19
Thread 6 "benchmark" received signal SIGSEGV, Segmentation fault.

The debug info about the segmentation fault is as follows:

[Switching to Thread 0x7ffff2111700 (LWP 47564)]
tcache_get (tc_idx=63) at malloc.c:2952
2952    malloc.c: No such file or directory.
(gdb) bt
#0  tcache_get (tc_idx=63) at malloc.c:2952
#1  __GI___libc_malloc (bytes=1024) at malloc.c:3060
#2  0x000055555557b54e in IndexCache::add_to_cache (this=0x55556281f290, page=0x7fff83000c50) at /home/v-baotonglu/Sherman/include/IndexCache.h:104
#3  0x00005555555770d9 in Tree::page_search (this=this@entry=0x555562808850, page_addr=page_addr@entry=..., k=@0x7ffff21107f0: 561285, result=..., cxt=cxt@entry=0x0, coro_id=coro_id@entry=0, from_cache=false)
    at /home/v-baotonglu/Sherman/src/Tree.cpp:674
#4  0x00005555555790b7 in Tree::insert (this=0x555562808850, k=@0x7ffff21107f0: 561285, v=@0x7ffff2110820: 9906706, cxt=0x0, coro_id=0) at /home/v-baotonglu/Sherman/src/Tree.cpp:421
#5  0x0000555555565291 in thread_run (id=3) at /home/v-baotonglu/Sherman/test/benchmark.cpp:109
#6  0x00007ffff6e846df in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#7  0x00007ffff75a16db in start_thread (arg=0x7ffff2111700) at pthread_create.c:463
#8  0x00007ffff654171f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

It seems to be related to the memory allocation. Any suggestions? Thanks

baotonglu commented on June 19, 2024

Update:
The benchmark runs successfully when kThreadCount = 1. If the workload is read-only, then kThreadCount = 2 also has no error.

So it seems that running multiple client threads on one server triggers the problem. It is probably an issue with my machine setup, but I cannot locate the root cause.

I hope that you could give me some suggestions. Thank you very much.

Transpeptidase commented on June 19, 2024

Hi, I was able to successfully run ./benchmark 2 100 10 in our cluster :)

[root@node111 build]# ./benchmark 2 100 10
kNodeCount 2, kReadRatio 100, kThreadCount 10
shared memory size: 8GB, 0x7f8b30d28000
cache size: 1GB
Machine NR: 2
NIC Device Memory is 256KB
I am servers 0 [0.0.0.0]
I connect server 1
dir 0 launch!

Header size: 35
Internal Page size: 1024 [1024]
Internal per Page: 61
Leaf Page size: 1018 [1024]
Leaf per Page: 54
LeafEntry size: 18
InternalEntry size: 16
new root level 1 [0, 33556480]
new root level 2 [0, 33619968]
new root level 3 [0, 36002816]
I am 2
I am 0
I am 4
I am 3
I am 1
I am 5
I am 7
I am 6
I am 8
I am 9
new root level 4 [1, 104123392]
node 0 finish
warmup time 23s
[skiplist node: 45798]  [page cache: 22917]
340ns per loop
0, throughput 2.8571
cluster throughput 5.721
cache hit rate: 0.997283
0, throughput 2.8909
cluster throughput 5.785
cache hit rate: 0.998640
0, throughput 2.8884
cluster throughput 5.782
cache hit rate: 0.999095
0, throughput 2.8907
cluster throughput 5.783
cache hit rate: 0.999324
0, throughput 2.8906
cluster throughput 5.782
cache hit rate: 0.999459
0, throughput 2.8904
cluster throughput 5.781
cache hit rate: 0.999549
0, throughput 2.8911

It is weird that malloc triggers a segmentation fault. Can you provide a diff showing the code you have modified?

baotonglu commented on June 19, 2024

Hi,

This link is my changes.

I am using IB (mlx5_1) but I still changed gidIndex (I think this is fine).

BTW, could you show the ibv_devinfo on your machine? I want to see the setup on your machines, which may be helpful.

Thanks!

Transpeptidase commented on June 19, 2024

Here is the result of ibv_devinfo.

hca_id: mlx5_0
        transport:                      InfiniBand (0)
        fw_ver:                         16.28.1002
        node_guid:                      ec0d:9a03:00ae:148c
        sys_image_guid:                 ec0d:9a03:00ae:148c
        vendor_id:                      0x02c9
        vendor_part_id:                 4119
        hw_ver:                         0x0
        board_id:                       MT_0000000010
        phys_port_cnt:                  1
        Device ports:
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet

Transpeptidase commented on June 19, 2024

You can try the following approaches:

  1. Check whether kLeafHdrOffset == kInternalHdrOffset.
  2. Modify CMakeLists.txt to build with -O0.
  3. Use tcmalloc or jemalloc.
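
The first check can be made at compile time. The sketch below uses hypothetical, simplified page layouts (DemoHeader, DemoInternalPage, DemoLeafPage); the real Sherman page classes have different members, and the assumption here is that both offsets are computed as the position of the header member inside each page type. Equal offsets presumably matter because a page's header may be read before the code knows which kind of page it holds.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical, simplified layouts for illustration only; these are not the
// real Sherman page classes. The packed attribute is a GCC/Clang extension.
struct DemoHeader {
  uint64_t sibling_ptr;
  uint8_t level;
  int16_t last_index;
} __attribute__((packed));

struct DemoInternalPage {
  uint8_t front_version;
  DemoHeader hdr;
  uint8_t records[64];
  uint8_t rear_version;
} __attribute__((packed));

struct DemoLeafPage {
  uint8_t front_version;
  DemoHeader hdr;
  uint8_t records[96];
  uint8_t rear_version;
} __attribute__((packed));

// Byte offset of the header inside each page type.
constexpr std::size_t kDemoInternalHdrOffset = offsetof(DemoInternalPage, hdr);
constexpr std::size_t kDemoLeafHdrOffset = offsetof(DemoLeafPage, hdr);

// The build fails if the two page types place the header differently.
static_assert(kDemoLeafHdrOffset == kDemoInternalHdrOffset,
              "leaf and internal pages must place the header at the same offset");
```

A static_assert of this kind catches a layout mismatch at build time, for example one caused by different compiler padding, instead of at runtime.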

baotonglu commented on June 19, 2024

Thanks for your help.

baotonglu commented on June 19, 2024

I suddenly found that the huge page size in my system is 1GB instead of 2MB. Will this impact the code correctness of Sherman?

Transpeptidase commented on June 19, 2024

This does not affect the correctness of Sherman. You can even turn off huge pages by removing the MAP_HUGETLB flag:

MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
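
As a rough illustration of that flag, here is a minimal Linux-only sketch of an anonymous mapping that requests huge pages and falls back to regular pages if the request fails. mapRegion is a hypothetical helper, not Sherman's allocation function, and the protection flags are an assumption, since the thread shows only the flags argument.

```cpp
#include <sys/mman.h>

#include <cstddef>

// Hypothetical helper (not Sherman's allocator): maps `size` bytes of
// anonymous memory, optionally backed by huge pages. Returns nullptr on
// failure.
void *mapRegion(std::size_t size, bool useHugePages) {
  int flags = MAP_PRIVATE | MAP_ANONYMOUS;
  if (useHugePages) {
    flags |= MAP_HUGETLB;
  }
  void *addr = mmap(nullptr, size, PROT_READ | PROT_WRITE, flags, -1, 0);
  if (addr == MAP_FAILED && useHugePages) {
    // Huge pages may be unavailable or not reserved; retry with regular pages.
    addr = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  }
  return addr == MAP_FAILED ? nullptr : addr;
}
```

Calling mapRegion(size, false) corresponds to removing MAP_HUGETLB from the flags. The caller releases the region with munmap(addr, size).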
