When I run the vectorAdd program repeatedly (manually, not through a run script), I eventually get a kernel panic error. I upgraded the kernel to 5.6.3 and am using NVIDIA driver 440.82 on CentOS 8, and this time I made sure the file system is ext4 :)
I am trying to understand what is causing this issue but have not been able to figure it out. Any thoughts on what might be going wrong?
Let me describe exactly what I did, step by step.
[ +0.513839] BUG: Bad page state in process vectorAdd pfn:3f5f5d0
[ +0.000035] page:ffffede0fd7d7400 refcount:0 mapcount:0 mapping:ffff908f69b31b80 index:0x1
[ +0.000043] ext4_da_aops [ext4] name:"c.nvmgpu.mem"
[ +0.000013] flags: 0x17ffffc0000000()
[ +0.000011] raw: 0017ffffc0000000 dead000000000100 dead000000000122 ffff908f69b31b80
[ +0.000020] raw: 0000000000000001 ffff908f69aee068 00000000ffffffff ffff909170526000
[ +0.000019] page dumped because: page still charged to cgroup
[ +0.000014] page->mem_cgroup:ffff909170526000
[ +0.000011] Modules linked in: nvidia_uvm(O) nvidia_drm(PO) nvidia_modeset(PO) nvidia(PO) ipmi_devintf vfio_iommu_type1 vfio xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nft_counter nf_nat_tftp nft_objref nf_conntrack_tftp tun bridge stp llc nf_tables_set nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct rfkill nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6_tables ip_tables nft_compat ip_set nf_tables nfnetlink intel_rapl_msr intel_rapl_common skx_edac nfit libnvdimm x86_pkg_temp_thermal coretemp kvm_intel sunrpc snd_hda_codec_realtek snd_hda_codec_generic kvm ledtrig_audio snd_hda_codec_hdmi irqbypass iTCO_wdt iTCO_vendor_support snd_hda_intel snd_intel_dspcfg crct10dif_pclmul ext4 snd_hda_codec crc32_pclmul snd_hda_core mbcache snd_hwdep ghash_clmulni_intel jbd2 snd_seq intel_cstate snd_seq_device snd_pcm ipmi_ssif intel_uncore snd_timer mei_me snd ipmi_si pcspkr soundcore sg i2c_i801 mei joydev
[ +0.000028] intel_rapl_perf ioatdma lpc_ich ipmi_msghandler acpi_power_meter xfs libcrc32c sd_mod ast drm_vram_helper drm_ttm_helper drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops nvme nvme_core ttm crc32c_intel t10_pi igb ahci drm atlantic dca libahci i2c_algo_bit libata wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: ipmi_devintf]
[ +0.000275] CPU: 25 PID: 24386 Comm: vectorAdd Tainted: P O 5.6.3.dragon #5
[ +0.000020] Hardware name: ******
[ +0.000018] Call Trace:
[ +0.000015] dump_stack+0x66/0x90
[ +0.000014] bad_page.cold.125+0x7f/0xb2
[ +0.000012] free_pcppages_bulk+0x178/0x660
[ +0.000013] free_unref_page_list+0x101/0x180
[ +0.000015] release_pages+0x382/0x400
[ +0.000013] tlb_flush_mmu+0x44/0x150
[ +0.000012] unmap_page_range+0x87f/0xde0
[ +0.000838] unmap_vmas+0x91/0xf0
[ +0.000783] exit_mmap+0xaa/0x180
[ +0.000779] mmput+0x52/0x120
[ +0.000778] do_exit+0x337/0xae0
[ +0.000769] do_group_exit+0x3a/0xa0
[ +0.000762] __x64_sys_exit_group+0x14/0x20
[ +0.000751] do_syscall_64+0x5b/0x1e0
[ +0.000738] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ +0.000736] RIP: 0033:0x7f58bfbec7f6
[ +0.000741] Code: Bad RIP value.
[ +0.000733] RSP: 002b:00007ffc54c70978 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
[ +0.000745] RAX: ffffffffffffffda RBX: 00007f58bfedd740 RCX: 00007f58bfbec7f6
[ +0.000755] RDX: 0000000000000000 RSI: 000000000000003c RDI: 0000000000000000
[ +0.000753] RBP: 0000000000000000 R08: 00000000000000e7 R09: fffffffffffffcc8
[ +0.000747] R10: fffffffffffff9fc R11: 0000000000000246 R12: 00007f58bfedd740
[ +0.000743] R13: 0000000000000013 R14: 00007f58bfee6448 R15: 0000000000000000
[ +0.000753] BUG: Bad page state in process vectorAdd pfn:3f5f5d1
[ +0.000749] page:ffffede0fd7d7440 refcount:0 mapcount:0 mapping:ffff908f69b31b80 index:0x1
[ +0.000778] ext4_da_aops [ext4] name:"c.nvmgpu.mem"
[ +0.000760] flags: 0x17ffffc0000000()
[ +0.000757] raw: 0017ffffc0000000 dead000000000100 dead000000000122 ffff908f69b31b80
[ +0.000774] raw: 0000000000000001 ffff908f69aeeea0 00000000ffffffff ffff909170526000
[ +0.000784] page dumped because: page still charged to cgroup
[ +0.000792] page->mem_cgroup:ffff909170526000
[ +0.000787] Modules linked in: nvidia_uvm(O) nvidia_drm(PO) nvidia_modeset(PO) nvidia(PO) ipmi_devintf vfio_iommu_type1 vfio xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nft_counter nf_nat_tftp nft_objref nf_conntrack_tftp tun bridge stp llc nf_tables_set nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct rfkill nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6_tables ip_tables nft_compat ip_set nf_tables nfnetlink intel_rapl_msr intel_rapl_common skx_edac nfit libnvdimm x86_pkg_temp_thermal coretemp kvm_intel sunrpc snd_hda_codec_realtek snd_hda_codec_generic kvm ledtrig_audio snd_hda_codec_hdmi irqbypass iTCO_wdt iTCO_vendor_support snd_hda_intel snd_intel_dspcfg crct10dif_pclmul ext4 snd_hda_codec crc32_pclmul snd_hda_core mbcache snd_hwdep ghash_clmulni_intel jbd2 snd_seq intel_cstate snd_seq_device snd_pcm ipmi_ssif intel_uncore snd_timer mei_me snd ipmi_si pcspkr soundcore sg i2c_i801 mei joydev
[ +0.000023] intel_rapl_perf ioatdma lpc_ich ipmi_msghandler acpi_power_meter xfs libcrc32c sd_mod ast drm_vram_helper drm_ttm_helper drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops nvme nvme_core ttm crc32c_intel t10_pi igb ahci drm atlantic dca libahci i2c_algo_bit libata wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: ipmi_devintf]
[ +0.009132] CPU: 25 PID: 24386 Comm: vectorAdd Tainted: P B O 5.6.3.dragon #5
[ +0.001018] Hardware name: ******
[ +0.001019] Call Trace:
[ +0.001012] dump_stack+0x66/0x90
[ +0.001004] bad_page.cold.125+0x7f/0xb2
[ +0.001003] free_pcppages_bulk+0x178/0x660
[ +0.000996] free_unref_page_list+0x101/0x180
[ +0.000994] release_pages+0x382/0x400
[ +0.000985] tlb_flush_mmu+0x44/0x150
[ +0.000980] unmap_page_range+0x87f/0xde0
[ +0.000962] unmap_vmas+0x91/0xf0
[ +0.000935] exit_mmap+0xaa/0x180
[ +0.000913] mmput+0x52/0x120
[ +0.000887] do_exit+0x337/0xae0
[ +0.000864] do_group_exit+0x3a/0xa0
[ +0.000840] __x64_sys_exit_group+0x14/0x20
[ +0.000820] do_syscall_64+0x5b/0x1e0
[ +0.000795] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ +0.000777] RIP: 0033:0x7f58bfbec7f6
[ +0.000754] Code: Bad RIP value.
[ +0.000745] RSP: 002b:00007ffc54c70978 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
[ +0.000753] RAX: ffffffffffffffda RBX: 00007f58bfedd740 RCX: 00007f58bfbec7f6
[ +0.000753] RDX: 0000000000000000 RSI: 000000000000003c RDI: 0000000000000000
[ +0.000758] RBP: 0000000000000000 R08: 00000000000000e7 R09: fffffffffffffcc8
[ +0.000758] R10: fffffffffffff9fc R11: 0000000000000246 R12: 00007f58bfedd740
[ +0.000761] R13: 0000000000000013 R14: 00007f58bfee6448 R15: 0000000000000000
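
In case it helps with triage: the log above contains more than one "Bad page state" report, and they differ mainly in the pfn. Below is a minimal sketch, not part of the original run, that condenses such reports into one line each. The function name `summarize`, the script name `badpages.py`, and the file name `dmesg.txt` are hypothetical, and the two patterns are based only on the message format visible in the log above.

```python
import re
import sys
from collections import Counter

# Matches e.g. "BUG: Bad page state in process vectorAdd pfn:3f5f5d0"
BAD_PAGE = re.compile(r"BUG: Bad page state in process (\S+)\s+pfn:([0-9a-f]+)")
# Matches e.g. "page dumped because: page still charged to cgroup"
REASON = re.compile(r"page dumped because: (.+?)\s*$")


def summarize(lines):
    """Return one (process, pfn, reason) tuple per bad-page report."""
    reports = []
    current = None
    for line in lines:
        m = BAD_PAGE.search(line)
        if m:
            # Start a new report; the reason is filled in when it appears.
            current = [m.group(1), int(m.group(2), 16), None]
            reports.append(current)
            continue
        m = REASON.search(line)
        if m and current is not None and current[2] is None:
            current[2] = m.group(1)
    return [tuple(r) for r in reports]


# Only runs when a saved log file is passed on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as f:
        reports = summarize(f)
    for proc, pfn, reason in reports:
        print(f"{proc} pfn=0x{pfn:x} reason={reason}")
    # Count how often each reason occurs across all reports.
    print(Counter(reason for _, _, reason in reports))
```

Usage, assuming the kernel log has been saved first: `dmesg > dmesg.txt`, then `python3 badpages.py dmesg.txt`. This only summarizes what the kernel already printed; it does not diagnose the cause.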