
[nmt-2.0] about gcm (OPEN)

natalymr commented on June 24, 2024
[nmt-2.0]


Comments (8)

natalymr commented on June 24, 2024
  • add eval
  • add BLEU to eval
  • clean up the code
  • run on the machine
  • if it does not train, send it to Sasha for review
  • 100 epochs
  • 200 epochs
  • a different lr
  • run the baselines
  • read up on how to interpret the wandb plots
  • read the article on the BLEU score
  • build the dataset with 200 tokens
  • add dropout/batch_normalization


natalymr commented on June 24, 2024

h = 2, 1000 epochs, 10 examples, same train/test (screenshot)
h = 2, 1000 epochs, 10 examples, different train/test (screenshot)
h = 256, 1000 epochs, 10 examples, same train/test (screenshot)
h = 256, 10 epochs, full dataset, different train/test (screenshot)
h = 256, 50 epochs, full dataset, different train/test (screenshot)
h = 256, 30 epochs, full dataset, different lr, different train/test (screenshot)
h = 256, 30 epochs, 2000 dataset, lr = 0.001, different train/test (screenshot)
h = 256, 30 epochs, 2000 dataset, lr = 0.0001, different train/test (screenshot)
h = 256, 30 epochs, 100 dataset, lr = 0.0001, different train/test (screenshot)


natalymr commented on June 24, 2024

Runs on the Mac

500 epochs, dataset: 100 train != 100 test, lr = 0.05, step_size=150, gamma=0.1, bs=10*10, test every 10, FIXED grad acc - seemingly no different from the previous run, yet the scores are completely different; BEFORE BIDIRECTIONAL, added clip_grad

https://app.wandb.ai/natalymr/nmt-2.0-test/runs/1s09khnv

500 epochs, dataset: 100 train != 100 test, lr = 0.05, step_size=150, gamma=0.1, bs=10*10, test every 10, FIXED grad acc - seemingly no different from the previous run, yet the scores are completely different; BEFORE BIDIRECTIONAL

https://app.wandb.ai/natalymr/nmt-2.0-test/runs/35uz9xt0

500 epochs, dataset: 100 train != 100 test, lr = 0.1, step_size=100, gamma=0.1, bs=10*10, test every 10, FIXED grad acc

https://app.wandb.ai/natalymr/nmt-2.0-test/runs/cnqca8ml

500 epochs, dataset: 100 train != 100 test, lr = 0.1, step_size=100, gamma=0.1, bs=10*10, test every 10

https://app.wandb.ai/natalymr/nmt-2.0-test/runs/1o31kd14

added grad acc (bs = 1, optimizer step every 5 steps)

https://app.wandb.ai/natalymr/nmt-2.0-test/runs/uaoy1i2r

just the previous run again

https://app.wandb.ai/natalymr/nmt-2.0-test/runs/2iarf6cm

h = 2, init = xavier_uniform_, 10 dataset, train != test, added dropout everywhere with coefficient 0.5, lr starts at 0.001 and is multiplied by 0.2 after 500 epochs, added pack_padded

https://app.wandb.ai/natalymr/nmt-2.0-test/runs/ypzmdhj4

h = 2, init = xavier_uniform_, 10 dataset, train != test, added dropout everywhere with coefficient 0.5, lr starts at 0.001 and is multiplied by 0.2 after 500 epochs, removed dropout, clip_grad

https://app.wandb.ai/natalymr/nmt-2.0-test/runs/2bhi9b3z

h = 2, init = xavier_uniform_, 10 dataset, train != test, added dropout everywhere with coefficient 0.5, lr starts at 0.001 and is multiplied by 0.2 after 500 epochs, clip_grad(0.25)

https://app.wandb.ai/natalymr/nmt-2.0-test/runs/3nkuv908

h = 2, init = xavier_uniform_, 10 dataset, train != test, added dropout everywhere with coefficient 0.5, lr starts at 0.001 and is multiplied by 0.2 after 500 epochs, added dropout on the LSTM in the decoder, clip_grad(0.25)

https://app.wandb.ai/natalymr/nmt-2.0-test/runs/3v4h52mv?workspace=user-natalymr

h = 2, init = xavier_uniform_, 10 dataset, train != test, added dropout everywhere with coefficient 0.5, lr starts at 0.001 and is multiplied by 10 after 500 epochs

https://app.wandb.ai/natalymr/nmt-2.0-test/runs/ba9k4hxu?workspace=user-natalymr

h = 2, init = xavier_uniform_, 10 dataset, train != test, added dropout everywhere with coefficient 0.5, lr is multiplied by 10 every 500 epochs

https://app.wandb.ai/natalymr/nmt-2.0-test/runs/1612jgvi

h = 2, init = xavier_uniform_, 10 dataset, train != test, added dropout everywhere with coefficient 0.5, lr is divided by 10 every 500 epochs

https://app.wandb.ai/natalymr/nmt-2.0-test/runs/2kdz2ntc

h = 2, init = xavier_uniform_, 10 dataset, train != test, added dropout everywhere with coefficient 0.5, lr is divided by 10 every 100 epochs

https://app.wandb.ai/natalymr/nmt-2.0-test/runs/2ar6y7nu

h = 2, init = xavier_uniform_, 10 dataset, train != test, added dropout everywhere with coefficient 0.5

https://app.wandb.ai/natalymr/nmt-2.0-test/runs/cp5vlq38

h = 2, init = xavier_uniform_, 10 dataset, train=test, added dropout everywhere with coefficient 0.5

https://app.wandb.ai/natalymr/nmt-2.0-test/runs/3msbqrk2

h = 2, init = xavier_normal_, 10 dataset, train=test

here you can see that

h = 2, init = normal, 10 dataset, train=test

https://app.wandb.ai/natalymr/nmt-2.0-test/runs/3nxm5dk6?workspace=user-natalymr

h = 2, no init, 10 dataset, train=test

https://app.wandb.ai/natalymr/nmt-2.0-test/runs/2bjp9eqj?workspace=user-natalymr

Runs on the machine

1000 epochs, 100 dataset, train != test, lr = 0.0001

https://app.wandb.ai/natalymr/nmt-2.0-test/runs/xc58t4gg?workspace=user-natalymr

200 epochs, init = xavier_normal_, 100 dataset, train != test, lr = 0.01

https://app.wandb.ai/natalymr/nmt-2.0-test/runs/zx11267c

2000 epochs, init = xavier_normal_, 100 dataset, train != test, lr = 0.01

https://app.wandb.ai/natalymr/nmt-2.0-test/runs/b12hg64j

200 epochs, init = xavier_normal_, 100 dataset, train != test, lr decayed every 50 epochs

https://app.wandb.ai/natalymr/nmt-2.0-test/runs/4rx2i84m

200 epochs, init = xavier_normal_, 100 dataset, train != test, lr decayed every 500 epochs

https://app.wandb.ai/natalymr/nmt-2.0-test/runs/ttyv6fs2

1000 epochs, init = xavier_normal_, 100 dataset, train != test, lr decayed every 500 epochs, batch size = 100 instead of 10; hid_size 400 instead of 300

https://app.wandb.ai/natalymr/nmt-2.0-test/runs/mpr4zj4q

2000 epochs, init = xavier_normal_, 100 dataset, train != test, lr decayed every 500 epochs, batch size = 100 instead of 10; hid_size 400 instead of 300

https://app.wandb.ai/natalymr/nmt-2.0-test/runs/ymhus29z

2000 epochs, init = xavier_normal_, 100 dataset, train != test, lr NOT decayed (0.001), batch size = 100 instead of 10; hid_size 400 instead of 300

https://app.wandb.ai/natalymr/nmt-2.0-test/runs/ej4s0x8t

2000 epochs, init = xavier_normal_, 100 dataset, train != test, lr INCREASED by a factor of 5 every 500 epochs (starting at 0.001), batch size = 100 instead of 10; hid_size 400 instead of 300

https://app.wandb.ai/natalymr/nmt-2.0-test/runs/wy8r2kfu

2000 epochs, init = xavier_normal_, 100 dataset, train != test, lr decayed by a factor of 2 (instead of 10) every 500 epochs, starting at 0.001, batch size = 100 instead of 10; hid_size 400 instead of 300

https://app.wandb.ai/natalymr/nmt-2.0-test/runs/5gnbvy1d

2000 epochs, init = xavier_normal_, 100 dataset, train != test, lr not decayed (0.001), batch size = 100 instead of 10; hid_size 400 instead of 300, added DROPOUT (encoder: embed, lstm; decoder: embed)

https://app.wandb.ai/natalymr/nmt-2.0-test/runs/dk5czbor

200 epochs, init = xavier_normal_, 2000 dataset, train != test, lr not decayed (0.001), batch size = 100 instead of 10; hid_size 400 instead of 300, added DROPOUT (encoder: embed, lstm; decoder: embed)

https://app.wandb.ai/natalymr/nmt-2.0-test/runs/hysoott0

200 epochs, init = xavier_normal_, 2000 dataset, train != test, batch size = 130 instead of 10; hid_size 400 instead of 300, added DROPOUT (encoder: embed, lstm; decoder: embed), scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.1)

https://app.wandb.ai/natalymr/nmt-2.0-test/runs/513ukeok

200 epochs, init = xavier_normal_, 2000 dataset, train != test, lr = 0.1, batch size = 130 instead of 10; hid_size 400 instead of 300, added DROPOUT 0.2 (encoder: embed, lstm; decoder: embed), scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.1)

https://app.wandb.ai/natalymr/nmt-2.0-test/runs/zyvjmits

200 epochs, init = xavier_normal_, 2000 dataset, train != test, lr = 0.1, batch size = 130 instead of 10; hid_size 400 instead of 300, added DROPOUT 0.2 (encoder: embed, lstm; decoder: embed), scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.1), sort dataset = True

https://app.wandb.ai/natalymr/nmt-2.0-test/runs/r1ik0206?workspace=user-natalymr

200 epochs, init = xavier_normal_, 2000 dataset, train != test, lr = 0.1, batch size = 130 instead of 10; hid_size 400 instead of 300, added DROPOUT 0.2 (encoder: embed, lstm; decoder: embed), scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.1), sort dataset = True, SKIP PADDING

https://app.wandb.ai/natalymr/nmt-2.0-test/runs/anwy0t7z

200 epochs, init = xavier_normal_, 2000 dataset, train != test, lr = 0.1, batch size = 130 instead of 10; hid_size 400 instead of 300, added DROPOUT 0.2 (encoder: embed, lstm; decoder: embed), lr NOT decayed, sort dataset = True, skip padding

https://app.wandb.ai/natalymr/nmt-2.0-test/runs/nobbnkk3

59 epochs, init = xavier_normal_, 2000 dataset, train != test, lr = 0.01, batch size = 130 instead of 10; hid_size 400 instead of 300, added DROPOUT 0.2 (encoder: embed, lstm; decoder: embed), lr not decayed, sort dataset = FALSE, shuffle=TRUE, skip padding

it started generating <sos> and <eos> 😡

https://app.wandb.ai/natalymr/nmt-2.0-test/runs/bdtttitj

200 epochs, init = xavier_normal_, 2000 dataset, train != test, lr = 0.1, batch size = 130 instead of 10; hid_size 400 instead of 300, added DROPOUT 0.2 (encoder: embed, lstm; decoder: embed), lr not decayed, sort dataset = TRUE, shuffle=TRUE, skip padding

https://app.wandb.ai/natalymr/nmt-2.0-test/runs/nlz2wr98

200 epochs, init = xavier_normal_, 2000 dataset, train != test, hid_size 256 instead of 400, added dropout 0.2 (encoder: embed, lstm; decoder: embed), lr not decayed, sort dataset = true, shuffle=true, skip padding; GRAD ACC (bs=100*2) & lr = 0.1, step_size=100, gamma=0.1 & test_every 2

https://app.wandb.ai/natalymr/nmt-2.0-test/runs/45tyzit5

30 epochs, init = xavier_normal_, 2000 dataset, train != test, hid_size 256 instead of 400, added dropout 0.2 (encoder: embed, lstm; decoder: embed), lr not decayed, sort dataset = true, shuffle=true, skip padding; GRAD ACC (bs=100*5(!!!)) & lr = 0.1, step_size=100, gamma=0.1 & test_every 5(!) - my mistake: implemented GRAD ACC incorrectly

generates the same output for almost everything + repeats the same word within a single message

https://app.wandb.ai/natalymr/nmt-2.0-test/runs/q5okz6q1

100 epochs, init = xavier_normal_, 2000 dataset, train != test, hid_size 256 instead of 400, added dropout 0.2 (encoder: embed, lstm; decoder: embed), lr not decayed, sort dataset = true, shuffle=true, skip padding; GRAD ACC (bs=50*5(!!!)) & lr = 0.1, step_size=50, gamma=0.1 & test_every 1 - my mistake again: implemented GRAD ACC incorrectly

https://app.wandb.ai/natalymr/nmt-2.0-test/runs/s7hwma0d

2*100 seems to have worked out better

100 epochs, full dataset, Total number of params = 16167398, src vocab = 26774, tgt vocab = 13795, bs=100*2 & lr = 0.1, step_size=50, gamma=0.1 & test_every 1

https://app.wandb.ai/natalymr/nmt-2.0-test/runs/99lrdx51

100 epochs, full dataset, Total number of params = 16133862, src vocab = 26643, tgt vocab = 13795, bs=100*5 & lr = 0.1, step_size=50, gamma=1 & test_every 1

https://app.wandb.ai/natalymr/nmt-2.0-test/runs/fhva9pab

100 epochs, full dataset, Total number of params = 26839398, src vocab = 26643, tgt vocab = 13795, hid_size 400 instead of 256, bs=50*5(!) & lr = 0.1, step_size=50, gamma=1 & test_every 1

https://app.wandb.ai/natalymr/nmt-2.0-test/runs/4nc1j7lj

100 epochs, full dataset, Total number of params = 19172998, src vocab = 26643, tgt vocab = 13795, hid_size 300 instead of 400, bs=50*10(!) & lr = 0.1, step_size=50, gamma=1 & test_every 1

https://app.wandb.ai/natalymr/nmt-2.0-test/runs/d9gjkoer

100 epochs, full dataset, Total number of params = 19172998, src vocab = 26643, tgt vocab = 13795, hid_size 300 instead of 400, bs=50*10(!) & lr = 0.1, step_size=50, gamma=1 & test_every 1, added clip_grad(1)

https://app.wandb.ai/natalymr/nmt-2.0-test/runs/625igudj

MY MISTAKE - implemented BLEU incorrectly. New BLEU: add 1000 examples every epoch once test_bleu > 1.5

https://app.wandb.ai/natalymr/nmt-2.0-test/runs/z2vjlp74

New BLEU, on the full dataset right away - my mistake, not the full one: implemented the gradual dataset growth incorrectly, so this run got ruined

https://app.wandb.ai/natalymr/nmt-2.0-test/runs/3amsiq35?workspace=user-natalymr

New BLEU, on the full dataset right away

https://app.wandb.ai/natalymr/nmt-2.0-test/runs/2hyvjbwi


natalymr commented on June 24, 2024

If the gradients start diverging (we jumped over the local minimum):

decrease lr, add dropout


If the gradients converge but the loss/acc is not what we want (we are stuck in a local minimum):

increase lr, weaken dropout
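
As a concrete illustration of this heuristic (a minimal sketch of my own, not code from the runs above; the model, loader, and criterion names are assumptions), one can log the total gradient norm every step so divergence is visible in wandb, and let a StepLR scheduler handle the lr decrease:

```python
import torch

def train_epoch(model, loader, optimizer, scheduler, criterion, max_norm=1.0):
    model.train()
    for src, tgt in loader:
        optimizer.zero_grad()
        loss = criterion(model(src, tgt), tgt)
        loss.backward()
        # clip_grad_norm_ returns the total gradient norm before clipping:
        # a convenient signal for the "gradients are diverging" check above
        grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
        optimizer.step()
        # wandb.log({"grad_norm": grad_norm.item(), "train_loss": loss.item()})
    scheduler.step()  # e.g. StepLR: lower the lr on a fixed schedule
```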


Article on Batch Size
There might be critical consequences when using different batch sizes that should be taken into consideration when choosing one. Let's cover two of the main potential consequences of using small or large batch sizes:

  • Generalization: Large batch sizes may cause bad generalization (or even get stuck in
    a local minimum). Generalization means that the neural network will perform quite well on samples outside of the training set. So, bad generalization — which is pretty much overfitting — means that the neural network will perform poorly on samples outside of the training set.
  • Convergence speed: Small batch sizes may lead to slow convergence of the learning algorithm. The variable updates applied in every step, that were calculated using a
    batch of samples, will determine the starting point for the next batch of samples.
    Training samples are randomly drawn from the training set every step and therefore the
    resulting gradients are noisy estimates based on partial data. The fewer samples we use in a single batch, the noisier and less accurate the gradient estimates will be. That is, the smaller the batch, the bigger impact a single sample has on the applied variable updates. In other words, smaller batch sizes may make the learning process noisier and
    fluctuating, essentially extending the time it takes the algorithm to converge.

With all that in mind, we have to choose a batch size that will be neither too small nor too large but somewhere in between. The main idea here is that we should play around with
different batch sizes until we find one that would be optimal for the specific neural
network and dataset we are using.


Solution (survey)

One way to overcome the GPU memory limitations and run large batch sizes is to split the
batch of samples into smaller mini-batches, where each mini-batch requires an amount of
GPU memory that can be satisfied. These mini-batches can run independently, and their
gradients should be averaged or summed before calculating the model variable updates.
There are two main ways to implement this:

  • Data-parallelism — use multiple GPUs to train all mini-batches in parallel, each on a
    single GPU. The gradients from all mini-batches are accumulated and the result is used to
    update the model variables at the end of every step.
  • Gradient accumulation — run the mini-batches sequentially while accumulating the gradients. The accumulated results are used to update the model variables at the end of the last mini-batch.
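
For the data-parallel option, a minimal PyTorch sketch (my own illustration for completeness; the runs above were single-GPU, and the model class name is an assumption):

```python
import torch
import torch.nn as nn

model = Seq2SeqModel()  # hypothetical model class
if torch.cuda.device_count() > 1:
    # nn.DataParallel splits each input batch across the visible GPUs,
    # runs the forward passes in parallel, and sums the resulting
    # gradients back onto the base model's parameters during backward
    model = nn.DataParallel(model)
model = model.cuda()
# the training loop itself is unchanged: forward, backward, optimizer.step()
```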

So what is gradient accumulation, technically?

Gradient accumulation means running a configured number of steps without updating the model variables while accumulating the gradients of those steps and then using the accumulated gradients to compute the variable updates.
Yes, it’s really that simple.
Running some steps without updating any of the model variables is the way we —
logically — split the batch of samples into a few mini-batches. The batch of samples that is used in every step is effectively a mini-batch, and all the samples of those steps combined are effectively the global batch.
By not updating the variables at all those steps, we cause all the mini-batches to use the same model variables for calculating the gradients. This is mandatory to ensure the same gradients and updates are calculated as if we were using the global batch size.
Accumulating the gradients in all of these steps results in the same sum of gradients as if we were using the global batch size.

Iterating through an example

So, let’s say we are accumulating gradients over 5 steps. We want to accumulate the gradients of the first 4 steps, without updating any variable. At the fifth step, we want to use the accumulated gradients of the previous 4 steps combined with the gradients of the fifth step to compute and assign the variable updates. Let’s see it in action:

  1. Starting at the first step, all the samples of the first mini-batch propagate through the forward and backward passes, resulting in computed gradients for each trainable model variable. We don’t want to actually update the variables, so there is no need in computing the updates at this point. What we need, though, is a place to store the gradients of the first step, in order for them to be accessible in the following steps, and we will use another variable for each trainable model variable, to hold the accumulated gradients. So, after computing the gradients of the first step, we will store them in the variables we created for the accumulated gradients.
  2. Now the second step starts, and again, all the samples of the second mini-batch
    propagate through all the layers of the model, computing the gradients of the second step. Just like the step before, we don’t want to update the variables yet, so there is no need in computing the variable updates. What’s different than the first step though, is that instead of just storing the gradients of the second step in our variables, we are going to add them to the values stored in the variables, which currently hold the gradients of the first step.
  3. Steps 3 and 4 are pretty much the same as the second step, as we are not yet updating the variables, and we are accumulating the gradients by adding them to our variables.
  4. Steps 3 and 4 are pretty much the same as the second step, as we are not yet updating the variables, and we are accumulating the gradients by adding them to our variables.
  5. Then, in step 5, we do want to update the variables, as we intended to accumulate the gradients over 5 steps. After computing the gradients of the fifth step, we will add them to the accumulated gradients, resulting in the sum of all the gradients of those 5 steps.

We’ll then take this sum and insert it as a parameter to the optimizer, resulting in the updates computed using all the gradients of those 5 steps, computed over all the samples in the global batch.
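
A minimal PyTorch sketch of exactly this 5-step procedure (my own illustration; model, loader, and criterion names are assumptions). In PyTorch the accumulation variables already exist: backward() adds into the .grad buffers, so accumulating simply means not zeroing them between steps:

```python
accumulation_steps = 5

optimizer.zero_grad()                      # start with clean gradient buffers
for step, (src, tgt) in enumerate(train_loader, start=1):
    loss = criterion(model(src, tgt), tgt)
    loss.backward()                        # adds this mini-batch's gradients to .grad
    if step % accumulation_steps == 0:     # every 5th step: update the variables
        optimizer.step()                   # update from the summed gradients
        optimizer.zero_grad()              # reset the accumulators for the next 5 steps
```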


Solution (implementation)

https://discuss.pytorch.org/t/how-to-implement-accumulated-gradient/3822
(two code screenshots from the linked thread)
AND THREE MORE IMPLEMENTATION VARIANTS:
https://discuss.pytorch.org/t/why-do-we-need-to-set-the-gradients-manually-to-zero-in-pytorch/4903/20?u=alband
(code screenshot from the linked thread)
From the same discussion:
(code screenshot from the linked thread)
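
The screenshots themselves are not recoverable here. As a stand-in, a sketch of the other variant I know is commonly shown (my reconstruction, not necessarily the code from the screenshots): scale each mini-batch loss by the number of accumulation steps, so the accumulated gradient matches the gradient of the mean loss over the large batch rather than the sum. Same assumed names as in the sketch above:

```python
accumulation_steps = 5

optimizer.zero_grad()
for i, (src, tgt) in enumerate(train_loader):
    loss = criterion(model(src, tgt), tgt)
    (loss / accumulation_steps).backward()  # accumulate the *averaged* gradient
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```

With this scaling, the update is what a single mean-reduced batch of accumulation_steps * bs samples would produce.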


natalymr commented on June 24, 2024

200 tokens

1000 dataset (500val/test) bs = 250, lr=0.1

https://app.wandb.ai/natalymr/nmt-2.0-test/runs/0yjtnd2s

ALL dataset, Total number of params = 26230872, src vocab = 41607, tgt vocab = 18069, bs = 250, lr=0.1

https://app.wandb.ai/natalymr/nmt-2.0-test/runs/1p2c36bd
https://app.wandb.ai/natalymr/nmt-2.0-test/runs/w3ao2qls

ALL dataset, Total number of params = 10497688, src vocab = 41607, tgt vocab = 18069, bs = 500, hid_size = 128 instead of 300, lr=0.1

https://app.wandb.ai/natalymr/nmt-2.0-test/runs/ljc12vrp

Training on 1000 commits, evaluating on the full test/val, hid_size=300, bs=25*10, lr=0.1

https://app.wandb.ai/natalymr/nmt-2.0-test/runs/ha8nper1

Training on 1000 commits taken FROM THE END, evaluating on the full test/val, hid_size=300, bs=25*10, lr step_size=10, gamma=0.8, lr=0.1

https://app.wandb.ai/natalymr/nmt-2.0-test/runs/ehtguhaa

Training on 5000 commits, evaluating on the full test/val, hid_size=300, bs=25*10, lr step_size=10, gamma=0.8, lr=0.1

https://app.wandb.ai/natalymr/nmt-2.0-test/runs/0ux74572

Sanity-check run: first train on 1000 examples; once test_bleu > 0.5, add 500 examples to the train set every epoch; evaluate on the full test/val, hid_size=300, bs=25*10, lr step_size=10, gamma=0.8, lr=0.1

https://app.wandb.ai/natalymr/nmt-2.0-test/runs/ojqkbrjw
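
A minimal sketch of this gradual dataset growth (my own illustration of the procedure described above, not the actual training script; the helpers and data names are hypothetical):

```python
start_size, chunk, bleu_threshold = 1000, 500, 0.5   # values from the run above

train_size, growing = start_size, False
for epoch in range(num_epochs):
    train_subset = full_train_data[:train_size]       # assumes a list-like dataset
    train_one_epoch(model, train_subset, optimizer)   # hypothetical helper
    test_bleu = evaluate_bleu(model, test_data)       # hypothetical helper
    if test_bleu > bleu_threshold:
        growing = True                                 # from now on, grow every epoch
    if growing:
        train_size = min(train_size + chunk, len(full_train_data))
```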

First train on 1000 examples; once test_bleu > 1.5, add 500 examples to the train set every epoch; evaluate on the full test/val, hid_size=300, bs=25*10, lr step_size=10, gamma=0.7, lr=0.1

https://app.wandb.ai/natalymr/nmt-2.0-test/runs/q6jh7kqd

Changed the way BLEU is computed

https://app.wandb.ai/natalymr/nmt-2.0-test/runs/1k6wlhkq

Adding 1000 commits at a time

https://app.wandb.ai/natalymr/nmt-2.0-test/runs/31y4gm86?workspace=user-natalymr

Adding 500 commits at a time

https://app.wandb.ai/natalymr/nmt-2.0-test/runs/2ekge9aq


natalymr commented on June 24, 2024

NMT

100 tokens

on the full dataset
https://app.wandb.ai/natalymr/nmt-1.0-test/runs/6y01tlo1

200 tokens

adding 500 at a time
https://app.wandb.ai/natalymr/nmt-1.0-test/runs/20rquvgm
after adding new data to the train set, wait 4 epochs, then add again
https://app.wandb.ai/natalymr/nmt-1.0-test/runs/iio91nz5
lowered lr (0.01)
https://app.wandb.ai/natalymr/nmt-1.0-test/runs/i7l6s0dg


natalymr commented on June 24, 2024

New Dataset (match with code2seq)

NMT


100 tokens

the old way: add a small batch of commits, then train for 4 epochs

https://app.wandb.ai/natalymr/nmt-1.0-test/runs/8p3nb59w?workspace=user-natalymr

all the data at once

https://app.wandb.ai/natalymr/nmt-1.0-test/runs/d97basok?workspace=user-natalymr

all the data at once, twice as many epochs

https://app.wandb.ai/natalymr/nmt-1.0-test/runs/9kdczc93?workspace=user-natalymr

changed the lr

interrupted run
https://app.wandb.ai/natalymr/nmt-1.0-test/runs/vylsmbo9?workspace=user-natalymr
full run
https://app.wandb.ai/natalymr/nmt-1.0-test/runs/1yinbj6a?workspace=user-natalymr


200 tokens

lr = 0.01, divided by 10 after 150 epochs, 250 epochs total
https://app.wandb.ai/natalymr/nmt-1.0-test/runs/ucukw8kl


natalymr commented on June 24, 2024

New Dataset (match with code2seq)

NMT-2

100 tokens

Full dataset, lr=0.01
https://app.wandb.ai/natalymr/nmt-2.0-test/runs/og66z2nl
400 epochs:
https://app.wandb.ai/natalymr/nmt-2.0-test/runs/6c7f56mo
250 epochs, LR not decayed
https://app.wandb.ai/natalymr/nmt-2.0-test/runs/9k5usrqq
