Hi there 👋

If you prioritize safeguarding DNS privacy, consider checking out my creation, smartdns-rs. It's definitely worth having.

Smartdns-rs
If you're a user of ZeroTier and looking to set up your own controller, but find the existing UI solutions too heavy with their Node.js and Docker setups, you might want to give my solution a try. It's a single-file binary, under 5MB in size.

ZeroTier Edge
If you also use Obsidian and want to share your notes or write academic papers, you can give my plugin a try.

Obsidian Enhancing Export
If you're also an Obsidian user and wish to create programming notes that can be executed for viewing results, my plugin might be just what you need. It works on mobile devices as well.

Obsidian Code Emitter
Are you currently diving into machine learning studies and find yourself juggling numerous mathematical formula notes? If you're struggling with the notation, give my tool a shot.

Latex Equation Editor
If you use Python but find it hard to remember which built-in global functions are available (max, min, filter, and so on) without consulting the documentation, or you sometimes lose type information and with it your IDE's code completion, give my fully generic dataset manipulation library a try. It draws inspiration from languages such as C#, Kotlin, and Rust, so it should be easy to adapt to and use.

Pyiter

tensor scatter seems much slower than pytorch (cuda)

Reproduce with the code below:

use candle_core::{Device, Tensor, D};

fn scatter_add() -> candle_core::Result<()> {
    let device = Device::new_cuda(0)?;
    let logits_idx_end = 32000_usize;
    // Index tensor of shape (1, 32000) holding 0..32000.
    let logits_idx = Tensor::arange(0_u32, logits_idx_end as u32, &device)?.reshape((1, 32000))?;
    // Destination tensor, same shape and dtype as the index, filled with zeros.
    let logits_idx_inv = Tensor::zeros_like(&logits_idx)?;
    // Source values 0..32000, expanded to the index shape and made contiguous.
    let src = Tensor::arange(0_u32, logits_idx_end as u32, logits_idx.device())?
        .expand(logits_idx.shape())?
        .contiguous()?;
    // Time the scatter call plus the device synchronization that follows it.
    let start = std::time::Instant::now();
    let logits_idx_inv = candle_ext::F::scatter(&logits_idx_inv, &logits_idx, &src, D::Minus1)?;
    // Wait for the GPU to finish so the elapsed time covers the kernel itself.
    if let Device::Cuda(cuda_dev) = &device {
        cuda_dev.synchronize();
    }
    println!("scatter cost {:?}/{}", start.elapsed(), logits_idx_end);
    Ok(())
}

Rust result (run twice in the same process):

scatter cost 3.288861ms/32000
scatter cost 3.271358ms/32000
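
For readers less familiar with scatter, the operation being timed writes `src[i]` into position `index[i]` of the output along the last dimension. The following is a minimal CPU sketch of that semantics on a single row, using only the standard library; `scatter_row` is a hypothetical helper for illustration and is not part of candle or candle-ext. It uses a small permutation rather than the 32000-element arange from the reproduction, where the index is the identity and the output therefore equals `src`.

```rust
/// Scatter along the last dimension of a single row:
/// out[index[i]] = src[i] for every i. Panics if an index is out of bounds.
fn scatter_row(dest: &[u32], index: &[u32], src: &[u32]) -> Vec<u32> {
    assert_eq!(index.len(), src.len(), "index and src must have the same length");
    let mut out = dest.to_vec();
    for (&i, &v) in index.iter().zip(src.iter()) {
        out[i as usize] = v;
    }
    out
}

fn main() {
    // A small permutation stands in for the index tensor.
    let index = [2_u32, 0, 3, 1];
    // src = 0..len, mirroring the arange used as the source in the issue.
    let src: Vec<u32> = (0..index.len() as u32).collect();
    let dest = vec![0_u32; index.len()];
    // With src = arange, the result is the inverse permutation of `index`.
    let inv = scatter_row(&dest, &index, &src);
    println!("{:?}", inv); // prints [1, 3, 0, 2]
}
```

The name `logits_idx_inv` in the reproduction suggests this inverse-permutation use, although that reading is an inference from the variable name.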

import time

import torch

logits_idx = torch.arange(0,32000, dtype=torch.int64, device = 'cuda').reshape(1,32000)
logits_idx_inv = torch.zeros_like(logits_idx)
src = torch.arange(0,32000, device = 'cuda').expand(logits_idx.shape)
torch.cuda.synchronize()
start_time = time.time_ns()
logits_idx_inv = torch.empty_like(logits_idx).scatter_(dim=-1,index=logits_idx,src=src)
torch.cuda.synchronize()
print("first cuda scatter cost ", time.time_ns() - start_time, "ns", logits_idx.shape,logits_idx_inv.shape)


logits_idx = torch.arange(0,32000, dtype=torch.int64, device = 'cuda').reshape(1,32000)
logits_idx_inv = torch.zeros_like(logits_idx)
src = torch.arange(0,32000, device = 'cuda').expand(logits_idx.shape)
torch.cuda.synchronize()
start_time = time.time_ns()
logits_idx_inv = torch.empty_like(logits_idx).scatter_(dim=-1,index=logits_idx,src=src)
torch.cuda.synchronize()
print("cuda scatter cost ", time.time_ns() - start_time, "ns", logits_idx.shape,logits_idx_inv.shape)

Python result (run twice in the same process):

first cuda scatter cost  3191597 ns torch.Size([1, 32000]) torch.Size([1, 32000])
cuda scatter cost  38734 ns torch.Size([1, 32000]) torch.Size([1, 32000])

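It seems PyTorch runs much faster after warm-up: about 39 µs on the second call versus about 3.2 ms on the first, whereas the Rust version takes about 3.3 ms on both runs.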
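
One way to separate one-off start-up cost from steady-state cost is to run the operation untimed first and only then start the timer. Below is a minimal sketch of that pattern using only the standard library. `time_after_warmup` is a hypothetical helper, and the closure is a plain CPU loop standing in for the scatter call; on a GPU the closure would also need to synchronize the device before returning so that the kernel time is included.

```rust
use std::time::{Duration, Instant};

/// Runs `op` `warmup` times untimed, then returns the duration of each
/// of `runs` timed calls.
fn time_after_warmup<F: FnMut()>(mut op: F, warmup: usize, runs: usize) -> Vec<Duration> {
    for _ in 0..warmup {
        op();
    }
    (0..runs)
        .map(|_| {
            let start = Instant::now();
            op();
            start.elapsed()
        })
        .collect()
}

fn main() {
    let n = 32_000_usize;
    let index: Vec<usize> = (0..n).collect();
    let src: Vec<u32> = (0..n as u32).collect();
    let mut out = vec![0_u32; n];
    // CPU stand-in for the scatter call: out[index[i]] = src[i].
    let timings = time_after_warmup(
        || {
            for (&i, &v) in index.iter().zip(src.iter()) {
                out[i] = v;
            }
        },
        1, // one untimed warm-up call
        2, // two timed calls, as in the runs reported above
    );
    for t in &timings {
        println!("scatter cost {:?}/{}", t, n);
    }
}
```

Reporting each timed run separately, rather than an average, keeps any remaining first-call effect visible.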

mokeyish / candle-ext
