pakmarkthub / dragon

A host-based framework that transparently extends the GPU addressable global memory space beyond the host memory using NVM-backed data pointers

Home Page: https://ft.ornl.gov/research/dragon

License: MIT License

Languages: C 46.79%, Makefile 2.67%, Shell 10.83%, Python 6.89%, Cuda 32.83%

Topics: driver, gpu, nvm

Contributors: msharmavikram, pakmarkthub

dragon's Issues

About page-fault

I am working on the page-fault handler in the GPU driver, but because of limitations in my own environment I cannot run DRAGON end to end.
Could you tell me which file contains the code that intercepts GPU page faults in the GPU driver? And where is DRAGON's handling code for a missing page, or which keywords should I search for?
Sorry for bothering you so many times, and thank you very much.

About the driver that DRAGON uses

Hi Dr. Pak Markthub,

Hope this email finds you well. This is Wei Rang, a CS student from Huazhong University of Science and Technology.

I have recently been trying to reproduce your work, DRAGON: Breaking GPU Memory Capacity Limits with Direct NVM Access.

However, on the official NVIDIA website I can only find the file NVIDIA-Linux-x86_64-384.81.deb, not the corresponding .run file.

Could you please provide a corresponding .run file or a corresponding download link?

Thank you very much and looking forward to hearing from you.

Some issues about dragon

When I try to run the following command:
../scripts/prepare-dragon-driver nvidia-uvm-.patch NVIDIA-Linux-x86_64-

I encountered some problems, as shown in the screenshots, and I cannot find the declaration of the variable in any of the provided files.
Could you give me some suggestions? Thank you very much; I look forward to hearing from you.

Doesn't support kernel 4.x?

Hi,
Interesting work, well done!
One question: the current install guide says DRAGON is "incompatible with kernel 4.x". What prevents it from supporting kernel 4.x? Can this be resolved, or is support planned?

Thanks.

Frank.

Writes using DRAGON crash the system for all provided applications

@pakmarkthub We are trying to bring up DRAGON and its examples. The read-only workloads work as expected. However, when we write to a dragon_map'ed file, the system crashes without any error message (it simply hangs).

It turns out that the crash occurs when the program exits (when memory is freed, or inside the dragon_unmap call). To narrow this down, we created a toy example in which we dragon_map a file with the D_F_WRITE | D_F_CREATE flags and launch a CUDA kernel in which thread 0 writes 1 to the mapped file. The CUDA kernel completes successfully, but the file is not written: it is created, yet its contents are all zeros.

To understand what is going wrong, we added print statements inside the dragon_unmap() call and found that the system crashes when cudaFree(addr) is called. We then disabled the cudaFree() call, but the system still crashed when the program exited, the written data was not synced to the file, and the created file was all zeros. So something goes wrong when memory is freed, and we are unable to figure out what. Any pointers on what might be going wrong would be helpful.

Machine details:

  1. Intel Xeon with V100 GPU
  2. Kernel 5.3.1
  3. NVIDIA Driver: 440.33.01 with dragon patches
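
The "file is created but all zeros" symptom described above can be checked without a hex dump. The following is a minimal sketch and not part of DRAGON; the function name `first_nonzero_offset` and its `chunk_size` parameter are hypothetical, chosen only for illustration.

```python
def first_nonzero_offset(path, chunk_size=1 << 20):
    """Return the byte offset of the first non-zero byte in the file,
    or None if the file is empty or contains only zero bytes."""
    offset = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                # Reached end of file without seeing a non-zero byte.
                return None
            stripped = chunk.lstrip(b"\x00")
            if stripped:
                # Leading zeros removed by lstrip give the position in this chunk.
                return offset + (len(chunk) - len(stripped))
            offset += len(chunk)
```

Run against the file created by the toy example, a result of `None` would match the all-zeros observation, while any integer result would indicate that at least some written data reached the file.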

Question about deploying Dragon

Hi Dr. Pak Markthub,

Hope this email finds you well. This is Wei Rang, a CS Ph.D. student from UNC Charlotte.

I have recently been trying to reproduce your work, DRAGON: Breaking GPU Memory Capacity Limits with Direct NVM Access.

The following are my hardware and software specifications:
Memory: 16GB 2x8GB DDR4 2666MHz RDIMM ECC Memory
Processor: Intel Xeon Bronze 3104 1.7GHz * 6
Graphics: Quadro RTX 4000 (11 GB)

OS: CentOS7 Kernel 3.10.0-1062.4.3.el7.x86_64
GPU Driver: NVIDIA-Linux-x86_64-440.36
CUDA: cuda_9.0.176_384.81_linux

When I tried to replace the original NVIDIA GPU driver by following your tutorial on GitHub, a lot of errors occurred. I then tried to modify the patch, but that did not work either.

Could you please provide any suggestions on how to deploy and use the DRAGON framework? Or should I use exactly the same hardware and software configuration as described in your paper?

Thank you very much and looking forward to hearing from you.
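
The configuration listed above pairs a 440.36 driver package with a CUDA installer whose file name embeds 384.81. A quick way to spot such a mismatch is to compare the version strings in the two file names. This is only a sketch: the helper names are hypothetical, and it assumes (without confirmation from the DRAGON authors) that the driver being patched should match the driver version embedded in the CUDA installer name.

```python
import re

def driver_version_from_name(name):
    """Extract the version from a name like NVIDIA-Linux-x86_64-440.36[.run]."""
    m = re.search(r"NVIDIA-Linux-x86_64-(\d+(?:\.\d+)+)", name)
    return m.group(1) if m else None

def bundled_driver_from_cuda_name(name):
    """Extract the driver version embedded in a name like cuda_9.0.176_384.81_linux."""
    m = re.search(r"cuda_\d+(?:\.\d+)+_(\d+(?:\.\d+)+)_linux", name)
    return m.group(1) if m else None

def versions_match(driver_name, cuda_name):
    """Return True only if both versions can be parsed and are identical."""
    driver = driver_version_from_name(driver_name)
    bundled = bundled_driver_from_cuda_name(cuda_name)
    return driver is not None and driver == bundled
```

With the names from this report, `versions_match("NVIDIA-Linux-x86_64-440.36", "cuda_9.0.176_384.81_linux")` is `False`, which is one possible reason a patch written against a different driver release would fail to apply.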

Kernel panic - after drop caches

Hi @pakmarkthub

When I run the vectorAdd program repeatedly (manually, not using a run script), I end up with a kernel panic. I upgraded the kernel to 5.6.3 and am using NVIDIA driver 440.82 on CentOS 8, and this time I made sure the file system is ext4 :)

I am trying to understand what is causing this issue and have been unable to figure it out. Any thoughts on what might be going wrong?

Here is exactly what I did, step by step.

  1. Generate the data (1000K entries) on the ext4 disk, then load and activate the dragon driver.
  2. Execute the nvmgpu vectorAdd program with the following arguments:
    ./bin/vectorAdd 165536 1024 /mnt/nvme0/vectorAdd
  3. Step 2 completes and generates correct output.
  4. sync
  5. Drop caches.
  6. Execute the nvmgpu vectorAdd program again with the same arguments:
    ./bin/vectorAdd 165536 1024 /mnt/nvme0/vectorAdd
  7. KERNEL PANIC with the error below:
[  +0.513839] BUG: Bad page state in process vectorAdd  pfn:3f5f5d0
[  +0.000035] page:ffffede0fd7d7400 refcount:0 mapcount:0 mapping:ffff908f69b31b80 index:0x1
[  +0.000043] ext4_da_aops [ext4] name:"c.nvmgpu.mem"
[  +0.000013] flags: 0x17ffffc0000000()
[  +0.000011] raw: 0017ffffc0000000 dead000000000100 dead000000000122 ffff908f69b31b80
[  +0.000020] raw: 0000000000000001 ffff908f69aee068 00000000ffffffff ffff909170526000
[  +0.000019] page dumped because: page still charged to cgroup
[  +0.000014] page->mem_cgroup:ffff909170526000
[  +0.000011] Modules linked in: nvidia_uvm(O) nvidia_drm(PO) nvidia_modeset(PO) nvidia(PO) ipmi_devintf vfio_iommu_type1 vfio xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nft_counter nf_nat_tftp nft_objref nf_conntrack_tftp tun bridge stp llc nf_tables_set nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct rfkill nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6_tables ip_tables nft_compat ip_set nf_tables nfnetlink intel_rapl_msr intel_rapl_common skx_edac nfit libnvdimm x86_pkg_temp_thermal coretemp kvm_intel sunrpc snd_hda_codec_realtek snd_hda_codec_generic kvm ledtrig_audio snd_hda_codec_hdmi irqbypass iTCO_wdt iTCO_vendor_support snd_hda_intel snd_intel_dspcfg crct10dif_pclmul ext4 snd_hda_codec crc32_pclmul snd_hda_core mbcache snd_hwdep ghash_clmulni_intel jbd2 snd_seq intel_cstate snd_seq_device snd_pcm ipmi_ssif intel_uncore snd_timer mei_me snd ipmi_si pcspkr soundcore sg i2c_i801 mei joydev
[  +0.000028]  intel_rapl_perf ioatdma lpc_ich ipmi_msghandler acpi_power_meter xfs libcrc32c sd_mod ast drm_vram_helper drm_ttm_helper drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops nvme nvme_core ttm crc32c_intel t10_pi igb ahci drm atlantic dca libahci i2c_algo_bit libata wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: ipmi_devintf]
[  +0.000275] CPU: 25 PID: 24386 Comm: vectorAdd Tainted: P           O      5.6.3.dragon #5
[  +0.000020] Hardware name: ******
[  +0.000018] Call Trace:
[  +0.000015]  dump_stack+0x66/0x90
[  +0.000014]  bad_page.cold.125+0x7f/0xb2
[  +0.000012]  free_pcppages_bulk+0x178/0x660
[  +0.000013]  free_unref_page_list+0x101/0x180
[  +0.000015]  release_pages+0x382/0x400
[  +0.000013]  tlb_flush_mmu+0x44/0x150
[  +0.000012]  unmap_page_range+0x87f/0xde0
[  +0.000838]  unmap_vmas+0x91/0xf0
[  +0.000783]  exit_mmap+0xaa/0x180
[  +0.000779]  mmput+0x52/0x120
[  +0.000778]  do_exit+0x337/0xae0
[  +0.000769]  do_group_exit+0x3a/0xa0
[  +0.000762]  __x64_sys_exit_group+0x14/0x20
[  +0.000751]  do_syscall_64+0x5b/0x1e0
[  +0.000738]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  +0.000736] RIP: 0033:0x7f58bfbec7f6
[  +0.000741] Code: Bad RIP value.
[  +0.000733] RSP: 002b:00007ffc54c70978 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
[  +0.000745] RAX: ffffffffffffffda RBX: 00007f58bfedd740 RCX: 00007f58bfbec7f6
[  +0.000755] RDX: 0000000000000000 RSI: 000000000000003c RDI: 0000000000000000
[  +0.000753] RBP: 0000000000000000 R08: 00000000000000e7 R09: fffffffffffffcc8
[  +0.000747] R10: fffffffffffff9fc R11: 0000000000000246 R12: 00007f58bfedd740
[  +0.000743] R13: 0000000000000013 R14: 00007f58bfee6448 R15: 0000000000000000
[  +0.000753] BUG: Bad page state in process vectorAdd  pfn:3f5f5d1
[  +0.000749] page:ffffede0fd7d7440 refcount:0 mapcount:0 mapping:ffff908f69b31b80 index:0x1
[  +0.000778] ext4_da_aops [ext4] name:"c.nvmgpu.mem"
[  +0.000760] flags: 0x17ffffc0000000()
[  +0.000757] raw: 0017ffffc0000000 dead000000000100 dead000000000122 ffff908f69b31b80
[  +0.000774] raw: 0000000000000001 ffff908f69aeeea0 00000000ffffffff ffff909170526000
[  +0.000784] page dumped because: page still charged to cgroup
[  +0.000792] page->mem_cgroup:ffff909170526000
[  +0.000787] Modules linked in: nvidia_uvm(O) nvidia_drm(PO) nvidia_modeset(PO) nvidia(PO) ipmi_devintf vfio_iommu_type1 vfio xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nft_counter nf_nat_tftp nft_objref nf_conntrack_tftp tun bridge stp llc nf_tables_set nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct rfkill nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6_tables ip_tables nft_compat ip_set nf_tables nfnetlink intel_rapl_msr intel_rapl_common skx_edac nfit libnvdimm x86_pkg_temp_thermal coretemp kvm_intel sunrpc snd_hda_codec_realtek snd_hda_codec_generic kvm ledtrig_audio snd_hda_codec_hdmi irqbypass iTCO_wdt iTCO_vendor_support snd_hda_intel snd_intel_dspcfg crct10dif_pclmul ext4 snd_hda_codec crc32_pclmul snd_hda_core mbcache snd_hwdep ghash_clmulni_intel jbd2 snd_seq intel_cstate snd_seq_device snd_pcm ipmi_ssif intel_uncore snd_timer mei_me snd ipmi_si pcspkr soundcore sg i2c_i801 mei joydev
[  +0.000023]  intel_rapl_perf ioatdma lpc_ich ipmi_msghandler acpi_power_meter xfs libcrc32c sd_mod ast drm_vram_helper drm_ttm_helper drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops nvme nvme_core ttm crc32c_intel t10_pi igb ahci drm atlantic dca libahci i2c_algo_bit libata wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: ipmi_devintf]
[  +0.009132] CPU: 25 PID: 24386 Comm: vectorAdd Tainted: P    B      O      5.6.3.dragon #5
[  +0.001018] Hardware name: ******
[  +0.001019] Call Trace:
[  +0.001012]  dump_stack+0x66/0x90
[  +0.001004]  bad_page.cold.125+0x7f/0xb2
[  +0.001003]  free_pcppages_bulk+0x178/0x660
[  +0.000996]  free_unref_page_list+0x101/0x180
[  +0.000994]  release_pages+0x382/0x400
[  +0.000985]  tlb_flush_mmu+0x44/0x150
[  +0.000980]  unmap_page_range+0x87f/0xde0
[  +0.000962]  unmap_vmas+0x91/0xf0
[  +0.000935]  exit_mmap+0xaa/0x180
[  +0.000913]  mmput+0x52/0x120
[  +0.000887]  do_exit+0x337/0xae0
[  +0.000864]  do_group_exit+0x3a/0xa0
[  +0.000840]  __x64_sys_exit_group+0x14/0x20
[  +0.000820]  do_syscall_64+0x5b/0x1e0
[  +0.000795]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  +0.000777] RIP: 0033:0x7f58bfbec7f6
[  +0.000754] Code: Bad RIP value.
[  +0.000745] RSP: 002b:00007ffc54c70978 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
[  +0.000753] RAX: ffffffffffffffda RBX: 00007f58bfedd740 RCX: 00007f58bfbec7f6
[  +0.000753] RDX: 0000000000000000 RSI: 000000000000003c RDI: 0000000000000000
[  +0.000758] RBP: 0000000000000000 R08: 00000000000000e7 R09: fffffffffffffcc8
[  +0.000758] R10: fffffffffffff9fc R11: 0000000000000246 R12: 00007f58bfedd740
[  +0.000761] R13: 0000000000000013 R14: 00007f58bfee6448 R15: 0000000000000000
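
Because the log above repeats the same report for consecutive page frames, it can help to condense it to one line per bad page. The following is a rough sketch that parses only the fields visible in this log ("BUG: Bad page state", `name:"..."`, and "page dumped because:"); the function name `summarize_bad_pages` is hypothetical, and other kernel versions may format these lines differently.

```python
import re

BUG_RE = re.compile(r"BUG: Bad page state in process (\S+)\s+pfn:([0-9a-fA-F]+)")
NAME_RE = re.compile(r'name:"([^"]+)"')
REASON_RE = re.compile(r"page dumped because: (.+)")

def summarize_bad_pages(log_text):
    """Return one dict per 'Bad page state' report found in the log text."""
    reports = []
    current = None
    for line in log_text.splitlines():
        m = BUG_RE.search(line)
        if m:
            # A new report starts; later lines fill in its file name and reason.
            current = {"process": m.group(1), "pfn": int(m.group(2), 16),
                       "file": None, "reason": None}
            reports.append(current)
            continue
        if current is None:
            continue
        m = NAME_RE.search(line)
        if m and current["file"] is None:
            current["file"] = m.group(1)
        m = REASON_RE.search(line)
        if m and current["reason"] is None:
            current["reason"] = m.group(1).strip()
    return reports
```

Feeding the saved `dmesg` output to `summarize_bad_pages` yields a list whose entries carry the process name, the page frame number, the backing file name, and the reason the kernel gave for dumping the page.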