Comments (4)
Nice, it turns out you were right: the .unsqueeze(0) was indeed redundant. Love it, it makes the code even simpler and more readable!
from llms-from-scratch.
Also I have a question - could you please explain why we need to call contiguous() in the following line in the MultiHeadAttention class:
context_vec = context_vec.contiguous().view(b, num_tokens, self.d_out)
mask_unsqueezed = mask_bool.unsqueeze(0).unsqueeze(0)
Ah yes, this was unnecessary, so I updated it to just mask_bool.unsqueeze(0) a while back. I will look into whether I can remove it altogether, like you suggest. Thanks!
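Removing it altogether should indeed work, because PyTorch broadcasts from the trailing dimensions: a 2-D boolean mask lines up against 4-D attention scores without any unsqueeze calls. A minimal sketch (the shapes here are made up for illustration, not the book's defaults):

```python
import torch

num_tokens = 4
# Causal mask: True above the diagonal (positions to be masked out)
mask_bool = torch.triu(
    torch.ones(num_tokens, num_tokens, dtype=torch.bool), diagonal=1
)
# Attention scores with shape (batch, num_heads, num_tokens, num_tokens)
scores = torch.randn(2, 3, num_tokens, num_tokens)

# Broadcasting aligns trailing dimensions, so the 2-D mask applies to
# every batch and head; the explicit unsqueeze calls change nothing
a = scores.masked_fill(mask_bool.unsqueeze(0).unsqueeze(0), float("-inf"))
b = scores.masked_fill(mask_bool, float("-inf"))
print(torch.equal(a, b))  # True
```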
Also I have a question - could you please explain why we need to call contiguous() in the following line in the MultiHeadAttention class:
context_vec = context_vec.contiguous().view(b, num_tokens, self.d_out)
Good question. This is because of the way the memory is organized in this tensor; without the .contiguous() call, .view() would raise an error. What you could do instead is
context_vec = context_vec.reshape(b, num_tokens, self.d_out)
This is because (quoting from the documentation):
When possible, the returned tensor will be a view of input. Otherwise, it will be a copy. Contiguous inputs and inputs with compatible strides can be reshaped without copying, but you should not depend on the copying vs. viewing behavior.
However, I haven't used .reshape elsewhere in this book, so I wanted to stick with .view for consistency.
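To make the difference concrete, here's a small sketch (tensor shapes chosen arbitrarily) showing that .view() fails on a transposed, non-contiguous tensor, while .contiguous().view() and .reshape() both succeed, with .reshape() copying behind the scenes when it has to:

```python
import torch

x = torch.arange(24).reshape(2, 3, 4)  # contiguous
y = x.transpose(1, 2)                  # shape (2, 4, 3); same storage, new strides
print(y.is_contiguous())               # False

try:
    y.view(2, 12)                      # strides are incompatible with a flat view
except RuntimeError as e:
    print("view failed:", e)

z1 = y.contiguous().view(2, 12)        # explicit copy first, then a cheap view
z2 = y.reshape(2, 12)                  # reshape copies for us in this case
print(torch.equal(z1, z2))             # True
print(z2.data_ptr() == y.data_ptr())   # False: reshape had to allocate a copy
```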
Sebastian, thanks a lot for your response.
Good question. This is because of the way the memory is organized in this tensor; without the .contiguous() call, .view() would raise an error.
Yes, I asked this question because when I deleted .contiguous():
context_vec = context_vec.view(b, num_tokens, self.d_out)
I didn't get any errors and got the same results.
The only other reason to convert to a contiguous tensor that I found here was the following:
This creates issues with parallel computations.
But I didn't find a more detailed explanation.
Could you please share your thoughts about it?
Thank you.
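I can't reproduce your exact setup, but one shape-dependent case where dropping .contiguous() happens not to fail: when the transposed dimension has size 1 (e.g., a single attention head), the transpose doesn't actually change the memory layout, so .view() still succeeds. A hypothetical illustration (the shapes are assumptions, not the book's defaults):

```python
import torch

b, num_heads, num_tokens, head_dim = 2, 1, 4, 8  # note: a single head
t = torch.randn(b, num_heads, num_tokens, head_dim)
u = t.transpose(1, 2)  # (b, num_tokens, 1, head_dim)

# Swapping in a size-1 dimension leaves the layout effectively unchanged,
# so PyTorch still reports the tensor as contiguous
print(u.is_contiguous())  # True

v = u.view(b, num_tokens, num_heads * head_dim)  # works without .contiguous()
print(v.shape)  # torch.Size([2, 4, 8])
```

With num_heads > 1 the same .view() raises a RuntimeError, so keeping .contiguous() is the safe general choice. As for the "parallel computations" remark, as far as I know non-contiguous tensors mainly cost extra strided memory traffic (and some kernels require contiguous inputs) rather than producing wrong results.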