I am trying this library, and I can see that data is loaded via dataloaders. In my case I am u

custom dataloader for NLP dataset about kd_lib HOT 4 CLOSED

OriAlpha commented on June 3, 2024

custom dataloader for NLP dataset

from kd_lib.

Comments (4)

NeelayS commented on June 3, 2024

The distillation methods KD-Lib provides are designed primarily for classification tasks. Hence, the distiller objects expect dataloaders that supply exactly two things per batch: the input data for the classification task and the corresponding label. In your case, the dataloaders supply three things (input IDs, attention masks, and labels), so unpacking each batch into two values fails.
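One possible workaround, sketched here under assumptions rather than taken from KD-Lib itself: stack the input IDs and attention mask into a single tensor so each batch unpacks as `(data, label)`, then split them again inside a thin model wrapper. `StackedInputs` and `SplitInputs` are hypothetical helper names, and the sketch assumes the distiller passes `data` to the model as a single tensor argument, which has not been verified against KD-Lib's source. The toy tensors stand in for the tokenizer output.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, Dataset, TensorDataset


class StackedInputs(Dataset):
    """Hypothetical wrapper: turn (input_ids, attention_mask, label) into (data, label).

    input_ids and attention_mask are stacked into one tensor of shape
    (2, seq_len), so a batch unpacks as `for (data, label) in loader`.
    """

    def __init__(self, base):
        self.base = base

    def __len__(self):
        return len(self.base)

    def __getitem__(self, idx):
        input_ids, attention_mask, label = self.base[idx]
        return torch.stack([input_ids, attention_mask]), label


class SplitInputs(nn.Module):
    """Hypothetical wrapper: undo the stacking before calling the real model."""

    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, data):
        # data has shape (batch, 2, seq_len)
        input_ids, attention_mask = data[:, 0], data[:, 1]
        return self.model(input_ids=input_ids, attention_mask=attention_mask)


# Toy tensors standing in for the tokenizer output (8 sentences, 100 tokens each).
input_ids = torch.randint(0, 1000, (8, 100))
attention_masks = torch.ones(8, 100, dtype=torch.long)
labels = torch.randint(0, 2, (8,))

dataset = StackedInputs(TensorDataset(input_ids, attention_masks, labels))
loader = DataLoader(dataset, batch_size=4)

for data, label in loader:  # each batch now unpacks into exactly two values
    print(data.shape, label.shape)
```

Note that Hugging Face sequence classification models return an output object rather than a bare logits tensor, so the wrapper's `forward` may also need to return `.logits` if the distiller's loss expects a tensor.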

NeelayS commented on June 3, 2024

Hi @OriAlpha.

Could you tell me what kind of NLP task you are looking to do? Also, could you please post the error stack trace if possible?

OriAlpha commented on June 3, 2024

Sorry, I forgot to mention that I am following the distillation example in the README.
I am working on a SequenceClassification task, and the error was:

    for (data, label) in self.train_loader:
ValueError: too many values to unpack (expected 2)

I am fairly sure my custom dataloader is causing the issue. Here is how I build it:

import torch
from transformers import BertTokenizer

# Alternative: tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

# Tokenize all of the sentences and map the tokens to their word IDs.

input_ids = []
attention_masks = []

# For every sentence...

for sent in sentences:
    # `encode_plus` will:
    #   (1) Tokenize the sentence.
    #   (2) Prepend the `[CLS]` token to the start.
    #   (3) Append the `[SEP]` token to the end.
    #   (4) Map tokens to their IDs.
    #   (5) Pad or truncate the sentence to `max_length`
    #   (6) Create attention masks for [PAD] tokens.
    encoded_dict = tokenizer.encode_plus(
                        sent,                      # Sentence to encode.
                        add_special_tokens = True, # Add '[CLS]' and '[SEP]'
                        max_length = 100,           # Pad & truncate all sentences.
                        padding = 'max_length',     # Replaces the deprecated pad_to_max_length.
                        truncation = True,
                        return_attention_mask = True,   # Construct attn. masks.
                        return_tensors = 'pt',     # Return pytorch tensors.
                   )
    
    # Add the encoded sentence to the list.    
    input_ids.append(encoded_dict['input_ids'])
    
    # And its attention mask (simply differentiates padding from non-padding).
    attention_masks.append(encoded_dict['attention_mask'])


# Convert the lists into tensors.
input_ids = torch.cat(input_ids, dim=0)
attention_masks = torch.cat(attention_masks, dim=0)
labels = torch.tensor(labels)

# Print sentence 0, now as a list of IDs.
print('Original: ', sentences[0])
print('Token IDs:', input_ids[0])

### Now combine the input IDs, attention masks, and labels, and split the dataset

from torch.utils.data import TensorDataset, random_split

# Combine the training inputs into a TensorDataset.
dataset = TensorDataset(input_ids, attention_masks, labels)

# Create a 90-10 train-validation split.

# Calculate the number of samples to include in each set.
train_size = int(0.90 * len(dataset))
val_size = len(dataset) - train_size

# Divide the dataset by randomly selecting samples.
train_dataset, val_dataset = random_split(dataset, [train_size, val_size])

print('{:>5,} training samples'.format(train_size))
print('{:>5,} validation samples'.format(val_size))

### Now create the loaders for these datasets


from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

# The DataLoader needs to know our batch size for training, so we specify it 
# here. For fine-tuning BERT on a specific task, the authors recommend a batch 
# size of 16 or 32.
batch_size = 32

# Create the DataLoaders for our training and validation sets.
# We'll take training samples in random order. 
train_dataloader = DataLoader(
            train_dataset,  # The training samples.
            sampler = RandomSampler(train_dataset), # Select batches randomly
            batch_size = batch_size # Trains with this batch size.
        )

# For validation the order doesn't matter, so we'll just read them sequentially.
validation_dataloader = DataLoader(
            val_dataset, # The validation samples.
            sampler = SequentialSampler(val_dataset), # Pull out batches sequentially.
            batch_size = batch_size # Evaluate with this batch size.
        )
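A quick check of what loaders built this way yield may make the error concrete. The sketch below uses small placeholder tensors instead of real tokenizer output (an assumption for brevity) and shows that a `TensorDataset` built from three tensors produces three-element batches, so a two-name unpacking loop like the one in the traceback raises `ValueError`.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder tensors: 4 samples, 100 tokens each.
dataset = TensorDataset(
    torch.zeros(4, 100, dtype=torch.long),  # input_ids
    torch.ones(4, 100, dtype=torch.long),   # attention_masks
    torch.zeros(4, dtype=torch.long),       # labels
)
loader = DataLoader(dataset, batch_size=2)

# Each batch holds one entry per tensor passed to TensorDataset.
batch = next(iter(loader))
n_fields = len(batch)
print(n_fields)  # 3

# Unpacking a three-element batch into two names fails.
try:
    for (data, label) in loader:
        pass
    unpack_error = None
except ValueError as err:
    unpack_error = str(err)
print(unpack_error)
```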

NeelayS commented on June 3, 2024

Feel free to close this if your issue has been resolved @OriAlpha.
