
arxyzan / data2vec-pytorch


PyTorch implementation of "data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language" from Meta AI

License: MIT License

Python 100.00%
pytorch self-supervised-learning data2vec fairseq roberta wav2vec huggingface beit

data2vec-pytorch's People

Contributors

arxyzan, kabouzeid

data2vec-pytorch's Issues

Mask value overflowed in audio pre-training

Hi @arxyzan,

I came across a quite strange bug: the mask value overflowed in audio pretraining.

Here: https://github.com/arxyzan/data2vec-pytorch/blob/main/audio/encoder.py#L35. The mask is passed as mask_time_indices during the computation of the output hidden states.

The input mask is fine, a binary matrix (B, L):
1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...

However, after the computation, the mask value overflowed, like this:
1421638320, 1421638576, 1421638832, 1421639088, 1421639344, 1421639600, 1421639856, 1421640112,
1421640368, 1421640624, 1421640880, 1421641136, 1421641392, 1421641648, 1421641904, 1421642160...

Have you ever met such an issue? By the way, this only happens when running train.py; debugging audio/encoder.py on its own does not raise this bug.
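(A minimal sanity check, assuming the mask is meant to be a boolean (B, L) tensor: pass a defensive boolean copy to the encoder so nothing can write garbage into the original tensor. The sanitize_mask helper and the commented call site below are assumptions, not part of the repo.)

    import torch

    def sanitize_mask(mask: torch.Tensor) -> torch.Tensor:
        # Boolean copy of the mask; cloning also protects against any in-place
        # writes to the original tensor during the forward pass.
        return mask.detach().clone().to(torch.bool)

    # hypothetical usage at the call site in audio/encoder.py:
    # outputs = self.encoder(inputs, mask_time_indices=sanitize_mask(mask),
    #                        output_hidden_states=True)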

My env is:
torch 1.13.1
torchaudio 0.13.1
transformers 4.26.0
python 3.8.16

thanks,
Junru

Error while training NLP

Traceback (most recent call last):
  File "F:\study\UTA_PhD\Papers\data2vec-pytorch-main\train.py", line 24, in <module>
    trainer = trainers_dict[modality](cfg)
  File "F:\study\UTA_PhD\Papers\data2vec-pytorch-main\text\trainer.py", line 55, in __init__
    self.test_loader = DataLoader(self.test_dataset, batch_size=cfg.train.val_batch_size,
  File "C:\Users\User\anaconda3\lib\site-packages\omegaconf\dictconfig.py", line 355, in __getattr__
    self._format_and_raise(
  File "C:\Users\User\anaconda3\lib\site-packages\omegaconf\base.py", line 231, in _format_and_raise
    format_and_raise(
  File "C:\Users\User\anaconda3\lib\site-packages\omegaconf\_utils.py", line 899, in format_and_raise
    _raise(ex, cause)
  File "C:\Users\User\anaconda3\lib\site-packages\omegaconf\_utils.py", line 797, in _raise
    raise ex.with_traceback(sys.exc_info()[2])  # set env var OC_CAUSE=1 for full trace
  File "C:\Users\User\anaconda3\lib\site-packages\omegaconf\dictconfig.py", line 351, in __getattr__
    return self._get_impl(
  File "C:\Users\User\anaconda3\lib\site-packages\omegaconf\dictconfig.py", line 442, in _get_impl
    node = self._get_child(
  File "C:\Users\User\anaconda3\lib\site-packages\omegaconf\basecontainer.py", line 73, in _get_child
    child = self._get_node(
  File "C:\Users\User\anaconda3\lib\site-packages\omegaconf\dictconfig.py", line 480, in _get_node
    raise ConfigKeyError(f"Missing key {key!s}")
omegaconf.errors.ConfigAttributeError: Missing key val_batch_size
    full_key: train.val_batch_size
    object_type=dict

Can you please help me with omegaconf? Which package versions were used while training on these datasets?
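(For reference, a hedged workaround until the config file is fixed: default the missing key to the training batch size. The config path and the existing train.batch_size key are assumptions.)

    from omegaconf import OmegaConf

    cfg = OmegaConf.load("text/configs/roberta-pretraining.yaml")  # path assumed
    if "val_batch_size" not in cfg.train:                          # key missing per the traceback
        cfg.train.val_batch_size = cfg.train.batch_size            # assumes train.batch_size exists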

doubt about finetuning

First of all, great work @AryanShekarlaban and @kabouzeid!

Quick question: if I want to fine-tune data2vec with a given backbone (e.g. wav2vec2), would freezing the feature extractor be enough, or should I also add an nn.Linear layer?
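(For illustration, a minimal fine-tuning sketch with a frozen wav2vec2 feature extractor plus a single linear head; the checkpoint name, pooling, and class count are assumptions, not the repo's code.)

    import torch.nn as nn
    from transformers import Wav2Vec2Model

    class FinetuneHead(nn.Module):
        def __init__(self, checkpoint="facebook/wav2vec2-base", num_classes=10):  # values assumed
            super().__init__()
            self.backbone = Wav2Vec2Model.from_pretrained(checkpoint)
            # freeze only the convolutional feature extractor
            for p in self.backbone.feature_extractor.parameters():
                p.requires_grad = False
            self.head = nn.Linear(self.backbone.config.hidden_size, num_classes)

        def forward(self, input_values):
            hidden = self.backbone(input_values).last_hidden_state  # (B, T, H)
            return self.head(hidden.mean(dim=1))                    # mean-pool over time, then classify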

I see that by design trainer.py fine-tunes on TIMIT, but I also saw in another issue that we're actually training it from scratch (not sure if I'm missing something here).

thanks!

In data2vec.py

In data2vec.py, on line 90,

y = self.ema.model(trg, ~mask, **kwargs)['encoder_states']

shouldn't it have been,

y = self.ema.model(trg, None, **kwargs)['encoder_states'] (going by the training strategy in the paper)?

Question on personalized data modality

Hi,

Thanks for the repo. It is really helpful! I want to get some additional suggestions from you!

I am trying to use the model, with a similar architecture and training paradigm, on time-series data other than audio or text. Do you suggest I start with your code, or do you think it is better to use fairseq? Fairseq's repo is a bit complicated and can be hard to transfer to my application, so I am a little bit hesitant. I saw you mentioned that you didn't implement some tricks used in fairseq. Do you think these tricks are important? How can these tricks affect pretraining?

Thanks for your knowledge in advance!!

Trouble with audio....?

Aryan, thank you very much for sharing your code with the world. I wonder if you could advise:

I am trying to train by following the instructions for audio, but I haven't been able to get TIMIT or LibriSpeech to work.

TIMIT

For TIMIT, I get the message from HuggingFace that it must be downloaded manually. From the URL provided in the message, I got to UPenn, who apparently want $250 for the dataset?? ...So, OK, I obtained a copy from a friend and also from Kaggle. But in both cases the HF dataloader fails; it is looking for files that don't exist anywhere in the dataset: lower-case names like "*test*" (all the filenames in both my copies are uppercase) and file extensions that exclude the .DOC files which are provided in TIMIT:

Error message

  File "/home/ubuntu/envs/data2vec/lib/python3.9/site-packages/datasets/data_files.py", line 201, in resolve_patterns_locally_or_by_urls
    raise FileNotFoundError(error_msg)
FileNotFoundError: Unable to resolve any data file that matches '['**test*', '**eval*']' at /home/ubuntu/datasets/timit with any supported extension ['csv', 'tsv', 'json', 'jsonl', 'parquet', 'txt', 'blp', 'bmp', 'dib', 'bufr', 'cur', 'pcx', 'dcx', 'dds', 'ps', 'eps', 'fit', 'fits', 'fli', 'flc', 'ftc', 'ftu', 'gbr', 'gif', 'grib', 'h5', 'hdf', 'png', 'apng', 'jp2', 'j2k', 'jpc', 'jpf', 'jpx', 'j2c', 'icns', 'ico', 'im', 'iim', 'tif', 'tiff', 'jfif', 'jpe', 'jpg', 'jpeg', 'mpg', 'mpeg', 'msp', 'pcd', 'pxr', 'pbm', 'pgm', 'ppm', 'pnm', 'psd', 'bw', 'rgb', 'rgba', 'sgi', 'ras', 'tga', 'icb', 'vda', 'vst', 'webp', 'wmf', 'emf', 'xbm', 'xpm', 'zip']

The files look like

│       PHONCODE.DOC
│       PROMPTS.TXT
│       SPKRINFO.TXT
│       SPKRSENT.TXT
│       TESTSET.DOC

If I take away the 'clean' directive in the load_dataset call, then the dataset loads but fails later with a key error:

Epoch: 1/1000     0%|                                                                      | 0/31678 [00:00<?, ?batch/s]
Traceback (most recent call last):
  File "/home/ubuntu/shawley/data2vec-pytorch/train.py", line 25, in <module>
    trainer.train()
  File "/home/ubuntu/shawley/data2vec-pytorch/audio/trainer.py", line 142, in train
    train_loss = self.train_epoch(epoch)
  File "/home/ubuntu/shawley/data2vec-pytorch/audio/trainer.py", line 106, in train_epoch
    for batch in iterator:
  File "/home/ubuntu/envs/data2vec/lib/python3.9/site-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "/home/ubuntu/envs/data2vec/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 530, in __next__
    data = self._next_data()
  File "/home/ubuntu/envs/data2vec/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 570, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/home/ubuntu/envs/data2vec/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/ubuntu/envs/data2vec/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/ubuntu/shawley/data2vec-pytorch/audio/dataset.py", line 21, in __getitem__
    x = self.data[index]['audio']
KeyError: 'audio'

If I print out self.data just after it's loaded in your TIMIT class, there is no 'audio' key in the record:

print("self.data[0] = ",self.data[0])
self.data =  {'index': 1, 'test_or_train': 'TRAIN', 'dialect_region': 'DR4', 'speaker_id': 'MMDM0', 'filename': 'SI681.WAV.wav', 'path_from_data_dir': 'TRAIN/DR4/MMDM0/SI681.WAV.wav', 'path_from_data_dir_windows': 'TRAIN\\\\DR4\\\\MMDM0\\\\SI681.WAV.wav', 'is_converted_audio': True, 'is_audio': True, 'is_word_file': False, 'is_phonetic_file': False, 'is_sentence_file': False}

Are you able to comment or advise about getting TIMIT to work?
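(For reference, with the datasets versions this repo targets, the HF timit_asr loader script expects a manually obtained local copy passed via data_dir, roughly like this; the path is an assumption and the loader may still require the original file naming/layout.)

    from datasets import load_dataset

    # 'timit_asr' requires a manually downloaded copy of TIMIT; point data_dir at its root.
    data = load_dataset("timit_asr", data_dir="/home/ubuntu/datasets/timit")
    print(data["train"][0]["audio"])  # should expose 'array' and 'sampling_rate'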

LibriSpeech

For LibriSpeech, I copied your TIMIT class in dataset.py and just hard-coded the name of the dataset:

class LibriSpeech(Dataset):
    def __init__(self, cfg, split, **kwargs):
        super(LibriSpeech, self).__init__()
        path = cfg.dataset.path
        #self.data = load_dataset(path, 'clean')[split]
        self.data = load_dataset("librispeech_asr", 'clean')
        #print("self.data = ",self.data)
        self.data = self.data[split]
        self.feature_extractor = Wav2Vec2FeatureExtractor(cfg.model.encoder_checkpoint)
        self.__dict__.update(kwargs)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        x = self.data[index]['audio']
        x = self.feature_extractor(x['array'], sampling_rate=x['sampling_rate'], padding=True, return_tensors='pt')['input_values']
        return {'input_values': x[0]}
       

And then in trainer.py I just wrote

        #self.train_dataset = TIMIT(cfg, 'train')
        #self.test_dataset = TIMIT(cfg, 'test')
        self.train_dataset = LibriSpeech(cfg, 'train.100')
        self.test_dataset = LibriSpeech(cfg, 'test')

In that case the data is loaded without errors, and the training begins but aborts with a series of CUDA errors:

Epoch: 1/1000     0%|                                                                      | 0/28539 [00:00<?, ?batch/s]../aten/src/ATen/native/cuda/IndexKernel.cu:91: operator(): block: [5550,0,0], thread: [64,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:91: operator(): block: [5550,0,0], thread: [65,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:91: operator(): block: [5550,0,0], thread: [66,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.

...hundreds more lines like this, and then...

Epoch: 1/1000     0%|                                                                      | 0/28539 [00:04<?, ?batch/s]
Traceback (most recent call last):
  File "/home/ubuntu/shawley/data2vec-pytorch/train.py", line 25, in <module>
    trainer.train()
  File "/home/ubuntu/shawley/data2vec-pytorch/audio/trainer.py", line 142, in train
    train_loss = self.train_epoch(epoch)
  File "/home/ubuntu/shawley/data2vec-pytorch/audio/trainer.py", line 107, in train_epoch
    loss = self.train_step(batch)
  File "/home/ubuntu/shawley/data2vec-pytorch/audio/trainer.py", line 65, in train_step
    x, y = self.model(src, src, mask)
  File "/home/ubuntu/envs/data2vec/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/shawley/data2vec-pytorch/data2vec/data2vec.py", line 83, in forward
    x = self.encoder(src, mask, **kwargs)['encoder_out']  # fetch the last layer outputs
  File "/home/ubuntu/envs/data2vec/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/shawley/data2vec-pytorch/audio/encoder.py", line 35, in forward
    outputs = self.encoder(inputs, mask_time_indices=mask, output_hidden_states=True,
  File "/home/ubuntu/envs/data2vec/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/envs/data2vec/lib/python3.9/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py", line 1357, in forward
    hidden_states = self._mask_hidden_states(
  File "/home/ubuntu/envs/data2vec/lib/python3.9/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py", line 1297, in _mask_hidden_states
    hidden_states[mask_time_indices] = self.masked_spec_embed.to(hidden_states.dtype)
RuntimeError: CUDA error: device-side assert triggered

Do you have a suggestion about getting LibriSpeech working?
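(Not a fix, but a hedged debugging step: device-side asserts are reported asynchronously, so the traceback above may point at the wrong line. Re-running with synchronous CUDA launches, or briefly on CPU, usually surfaces the real out-of-bounds index.)

    import os

    # must be set before the CUDA context is created (i.e. before the first .cuda()/.to('cuda'))
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

    # or, as a quick check, run a single batch on CPU to get a readable indexing error:
    # model = model.to("cpu"); batch = {k: v.to("cpu") for k, v in batch.items()}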

Thanks again,
Scott

Model weights?

Although pretraining these models requires a lot of hardware resources and is almost impossible for an individual like me, it is possible to port the weights from the HuggingFace models that use the same encoders as fairseq (and this repo). Otherwise this repo would be beneficial only for educational purposes.

Obviously, this task must be carried out very carefully, but before that, its feasibility must be verified. As this model only "slightly" outperforms previous SOTA models, messing up even a single layer's weights can ruin the whole thing!

The progress and issues regarding this task will be tracked here.
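(As a first feasibility check, one could compare parameter names and shapes between an official HF data2vec checkpoint and this repo's encoder; the checkpoint name below exists on the Hub to my knowledge, but treat it as an assumption.)

    from transformers import AutoModel

    hf_model = AutoModel.from_pretrained("facebook/data2vec-audio-base")  # checkpoint name assumed
    for name, p in list(hf_model.named_parameters())[:10]:
        print(name, tuple(p.shape))  # compare against the repo encoder's state_dict keys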

Question about reproducibility

Hello, thanks for your effort to make data2vec easier to understand.
Let me ask a quick question: can we reproduce the paper's results with your implementation?
I guess it is out of the scope of this repo, but I thought it would be quite nice if possible.
Thank you anyway!

EMA model forward

        # model forward in online mode (student)
        x = self.encoder(src, mask, **kwargs)['encoder_out']  # fetch the last layer outputs
        if trg is None:
            return x

        # model forward in offline mode (teacher)
        with torch.no_grad():
            self.ema.model.eval()
            y = self.ema.model(trg, ~mask, **kwargs)['encoder_states']  # fetch the last transformer layers outputs

In the teacher forward pass, mask_time_indices is the inverse of the one used for the student. Is this correct?
I think the mask in the teacher forward pass should be None, since the teacher expects the full (unmasked) version of the input data.
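(A sketch of the change this issue suggests, i.e. the teacher sees the unmasked input; this is not what the repo currently does.)

        # model forward in offline mode (teacher), with no mask:
        with torch.no_grad():
            self.ema.model.eval()
            y = self.ema.model(trg, None, **kwargs)['encoder_states']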

file type error omegaconf

Getting the following error:

Traceback (most recent call last):
  File "c:/Users/{}/Downloads/data2vec-pytorch-main/data2vec-pytorch-main/train.py", line 15, in <module>
    cfg = omegaconf.OmegaConf.load(cfg_path)
  File "C:\Program Files\Python38\lib\site-packages\omegaconf\omegaconf.py", line 194, in load
    raise TypeError("Unexpected file type")
TypeError: Unexpected file type

Do we need to change anything here?
cfg = omegaconf.OmegaConf.load(cfg_path)
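(OmegaConf.load raises "Unexpected file type" when it isn't given a path string/Path/IO object, e.g. when the config-path argument never reached the script and cfg_path ended up None. A hedged check with an explicit YAML path, which is assumed here:)

    from omegaconf import OmegaConf

    # In train.py, print(repr(cfg_path)) first: if it shows None, the config-path
    # argument never reached OmegaConf.load, which then raises "Unexpected file type".
    cfg = OmegaConf.load("audio/configs/wav2vec2-pretraining.yaml")  # explicit path, assumed
    print(OmegaConf.to_yaml(cfg))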

Which Hyperparameter tuning method to use?

First of all, thank you for the great work @AryanShekarlaban and @kabouzeid!

Quick question: I don't have much experience with training big models like Transformers, and I see that there are many frameworks and algorithms for hyperparameter tuning on the internet. Could you suggest a hyperparameter tuning framework and algorithm for data2vec?

Thank you!

Question on the visual modality mask method

I am a beginner in deep learning. While reading your code, I noticed the following code snippet used in data processing for the visual modality (vision/dataset.py, line 44):

masked_image = (image * mask).reshape(-1, self.input_size, self.input_size)

I have a question regarding this. Based on my understanding, in the mask matrix, 0 represents the areas that do not need to be masked, and 1 represents the areas to be masked. After examining the code for MaskingGenerator and the subsequent use of mask in the Data2Vec model, it seems like my understanding is correct.

Should the above code be modified to:

masked_image = (image * (1 - mask)).reshape(-1, self.input_size, self.input_size)

Or is my understanding incorrect? Please let me know.
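(A toy illustration of the two expressions being contrasted, assuming 1 marks a patch to be masked; this is just to show the difference, not the repo's code.)

    import torch

    image = torch.ones(3, 4, 4)           # toy 4x4 "image" with 3 channels
    mask = torch.zeros(4, 4)
    mask[:2, :] = 1                        # pretend the top half is selected for masking

    current  = image * mask                # keeps only the masked region (what the code does now)
    proposed = image * (1 - mask)          # zeroes out the masked region (what the issue suggests)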

EMA teacher model should not be deepcopied

The EMA teacher model, according to the paper, is initialized randomly with the same architecture as the student model. So deepcopying the student model to create the teacher should be avoided, as it copies the weight parameters as well.
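(A sketch of the issue's suggestion: keep the student's architecture but re-randomize the copied weights. reset_parameters() covers the standard torch layers; custom modules may need their own initialization.)

    import copy
    import torch.nn as nn

    def make_teacher(student: nn.Module) -> nn.Module:
        teacher = copy.deepcopy(student)          # same architecture...
        for m in teacher.modules():
            if hasattr(m, "reset_parameters"):
                m.reset_parameters()              # ...but freshly initialized weights
        return teacher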

Some Questions

Hi @arxyzan ,

  1. Can you tell me what parts I need to change if my input size is 256 instead of 224?

  2. Is it mandatory to load encoder_checkpoint? Can I train my model from scratch?

  3. Why is the config file named beit-pretraining.yaml for the vision task?

  4. Could you help me solve the problem below:

Epoch: 1/100 0%| | 0/18001 [00:02<?, ?batch/s]
Traceback (most recent call last):
  File "/mnt/c/data2vec-pytorch/train.py", line 25, in <module>
    trainer.train()
  File "/mnt/c/data2vec-pytorch/vision/trainer.py", line 131, in train
    train_loss = self.train_epoch(epoch)
  File "/mnt/c/data2vec-pytorch/vision/trainer.py", line 97, in train_epoch
    loss = self.train_step(batch)
  File "/mnt/c/data2vec-pytorch/vision/trainer.py", line 56, in train_step
    x, y = self.model(src, trg, mask)
  File "/home/bryan/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/c/data2vec-pytorch/data2vec/data2vec.py", line 90, in forward
    y = self.ema.model(trg, ~mask, **kwargs)['encoder_states']  # fetch the last transformer layers outputs
  File "/home/bryan/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/c/data2vec-pytorch/vision/encoder.py", line 38, in forward
    outputs = self.encoder(pixel_values=inputs, output_hidden_states=True, output_attentions=True, **kwargs)
  File "/home/bryan/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/bryan/.local/lib/python3.10/site-packages/transformers/models/beit/modeling_beit.py", line 681, in forward
    embedding_output = self.embeddings(pixel_values, bool_masked_pos)
  File "/home/bryan/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/bryan/.local/lib/python3.10/site-packages/transformers/models/beit/modeling_beit.py", line 154, in forward
    embeddings = self.patch_embeddings(pixel_values)
  File "/home/bryan/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/bryan/.local/lib/python3.10/site-packages/transformers/models/beit/modeling_beit.py", line 206, in forward
    embeddings = self.projection(pixel_values).flatten(2).transpose(1, 2)
  File "/home/bryan/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/bryan/.local/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 463, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/bryan/.local/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 459, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (torch.FloatTensor) should be the same
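(On question 4: the traceback shows a cuda input hitting cpu weights inside the teacher (EMA) model's BEiT patch embedding. A hedged guess at a fix is to make sure the EMA copy is moved to the same device as the rest of the model; the attribute names follow the traceback, and where exactly this belongs in the repo is an assumption.)

    import torch
    import torch.nn as nn

    def move_to_device(model: nn.Module, ema_model: nn.Module) -> torch.device:
        # Put both the student wrapper and the EMA teacher on the same device,
        # so the conv weights in the teacher's patch embedding are no longer on CPU.
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        model.to(device)
        ema_model.to(device)
        return device

    # hypothetical usage inside the trainer: move_to_device(self.model, self.model.ema.model)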
