Comments (10)

PengNi commented on August 23, 2024

Hi @pterzian ,

Thanks for your interest.

  1. When using deepsignal, we don't blend the fast5 files first. The correct order is to extract features from both samples first, then blend the output tsv files for training.
  2. --methy_label is used to label positive and negative samples for training. Normally we set --methy_label to 1 when extracting features from the native sample, and 0 when extracting features from the control sample.
  3. concat_two_files.py may be helpful to shuffle and concatenate the two files.
  4. Once we have the shuffled, concatenated file, we can split it into a training file and a validating file. 10k samples should be enough for validation. Also, the ratio of positive to negative samples for both training and validating is suggested to be ~1:1. (A minimal sketch of steps 3 and 4 is shown below.)
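
A minimal sketch of steps 3 and 4, as an alternative to concat_two_files.py. It assumes one sample per line in the extracted tsv files, placeholder file names, and two input files of roughly equal size (so the concatenated file stays at ~1:1 positives to negatives):

import random

# Feature files from "deepsignal extract" (names are placeholders):
# the native sample extracted with --methy_label 1, the control sample with --methy_label 0.
with open("native.features.tsv") as f:
    samples = f.readlines()
with open("control.features.tsv") as f:
    samples += f.readlines()

# Shuffle positives and negatives together; for very large files, shuffling on
# disk (e.g. with concat_two_files.py) avoids loading everything into memory.
random.seed(42)
random.shuffle(samples)

# Hold out ~10k samples for validation, keep the rest for training.
n_valid = 10_000
with open("validating_file", "w") as f:
    f.writelines(samples[:n_valid])
with open("training_file", "w") as f:
    f.writelines(samples[n_valid:])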

Best,
Peng

pterzian commented on August 23, 2024

Thank you Peng, it is much clearer to me now.
I'll keep you posted on how it goes! (probably next week)

pterzian commented on August 23, 2024

The training has now been running for a couple of days, and I am following its progress through the train.txt and valid.txt logs, but I am not sure how to interpret them.

This is the command:

deepsignal train --train_file training_file \
      --valid_file validating_file \
      --model_dir models/ \
      --log_dir logs

This is a snippet of the train.txt log:

epoch:0, iterid:100, loss:3.228, accuracy:0.522, recall:0.292, precision:0.466
epoch:0, iterid:200, loss:0.768, accuracy:0.551, recall:0.288, precision:0.531
epoch:0, iterid:300, loss:0.689, accuracy:0.605, recall:0.385, precision:0.627

And this is a snippet of the valid.txt log:

epoch:0, iterid:100, loss:0.825, accuracy:0.548, recall:0.114, precision:0.541
epoch:0, iterid:200, loss:0.708, accuracy:0.612, recall:0.608, precision:0.574
epoch:0, iterid:300, loss:0.631, accuracy:0.679, recall:0.520, precision:0.706

These are only the first 3 lines of each; the training is actually at iterid:700 (the 7th logged line).

  • My first question would be: how many iterations should I expect until the model is ready?

  • Following that, I would love to understand why there are two log files sharing the same metrics but with different values. Is deepsignal swapping the roles of the training and validation datasets in order to find the best one to train on?

  • My last question would be: can I already try the model?

Thank you for your time!
Paul

PengNi commented on August 23, 2024

Hi Paul,

  1. By default, deepsignal trains for at least 5 epochs and at most 10 epochs. Training may stop after any epoch from 5 to 10 finishes, because we use an early-stopping strategy. One epoch goes over the whole training dataset once.

  2. At each iteration, we use the current model to predict (1) the training data of that iteration (51200 samples by default) and (2) the validation dataset, and we log the prediction performance to train.txt and valid.txt respectively. We log both just for comparison, because we use Dropout during training. You can also check the stdout of the training.

    Normally, the performance on the validation dataset is closer to the performance on a test dataset.

  3. In my experience, we may need to train for at least 5 epochs to get a stable model.

Also, can you tell me how many samples you have in the validation dataset? Based on my tests, 10k samples are enough for validation (to get a stable model). Too many samples in the validation dataset will make the training process much slower.

Best,
Peng

pterzian commented on August 23, 2024

Hi Peng,

I understand that my validation dataset was too big. If by samples you mean the number of lines in my validation_file and training_file, I had around 1M samples for each file.

I launched a new run following your last instructions: I reduced the validation_file to 10k lines with a ~1:1 ratio of negative to positive samples, and I also reduced my training_file to 500k lines with the same ratio.

Indeed it seems to run faster.

I have two questions:

  1. I found the path of what I assume is a model in the checkpoint file of the models/ folder:
model_checkpoint_path: "bn_17.sn_360.epoch_0.ckpt"
all_model_checkpoint_paths: "bn_17.sn_360.epoch_0.ckpt"

I used this path to call modifications on a dataset and it seems to work, yet I don't see a file with that exact name in the folder. All I see in the folder is:

bn_17.sn_360.epoch_0.ckpt.data-00000-of-00001  
bn_17.sn_360.epoch_0.ckpt.index  
bn_17.sn_360.epoch_0.ckpt.meta  
checkpoint

I guess I am not sure what I'm doing here.

  2. This question is more about understanding the algorithm. Following what you said about iterations and epochs, and given the number of lines in my training_file, I should expect around 10 iterations per epoch, right (with default parameters)?

Paul.

PengNi commented on August 23, 2024

Hi Paul,

So theoretically, we should use as many samples for training as possible. In my case I used 20M samples for training and 10k samples for validation.

(1) "bn_17.sn_360.epoch_0.ckpt" actually is prefix of those file names. It is a tensorflow feature. The prefix is used to indicate those files in the folder, all together as a model (a bunch of parameters).

(2) Yes. With 500k samples for training, there will be around 10 iterations per epoch. Normally we should expect a stable model after 5-10 epochs, but it is not limited to 5-10; these values can also be tuned to get a better model.
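
To illustrate point (1): the four files in models/ are the standard TensorFlow checkpoint layout, and the prefix is what gets passed around as the model path. A minimal sketch, assuming a TensorFlow 1.x-style checkpoint (which matches the .data/.index/.meta files listed above):

import tensorflow as tf

# "bn_17.sn_360.epoch_0.ckpt" is a prefix, not a file on disk: TensorFlow writes
# the model as <prefix>.data-00000-of-00001, <prefix>.index and <prefix>.meta,
# and records the prefix in the plain-text "checkpoint" file.
ckpt_prefix = tf.train.latest_checkpoint("models/")
print(ckpt_prefix)  # e.g. models/bn_17.sn_360.epoch_0.ckpt

And for point (2), the per-epoch iteration count follows from the default of 51200 training samples per iteration mentioned earlier (numbers below are illustrative):

import math

samples_per_iteration = 51_200   # deepsignal default mentioned above
training_samples = 500_000       # lines in the training_file

iters_per_epoch = math.ceil(training_samples / samples_per_iteration)
print(iters_per_epoch)                            # ~10 iterations per epoch
print(5 * iters_per_epoch, 10 * iters_per_epoch)  # ~50-100 iterations over 5-10 epochs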

Best,
Peng

pterzian commented on August 23, 2024

Hi Peng,

Sorry for the late reply, I was not around. Following your instructions we managed to obtain very satisfying results (5 epochs worked well, as you said). We will definitely continue trying deepsignal with other datasets, so I will surely be back.

Thanks a bunch for the support!

pterzian commented on August 23, 2024

Hi Peng,

I am taking the liberty of reopening this topic because I am back to model training and have new questions. I am currently using 10M samples for training, and I was wondering how long it took your model to train on 20M samples? Did you do it on a GPU?

From what I see on my CPU machine, 10M samples means around 200 iterations per epoch and will definitely take at least 3-4 weeks to complete 5 epochs.

Best,
Paul

PengNi commented on August 23, 2024

Hi Paul,

I used one GPU (TITAN X (Pascal)); training on 20M samples took about 44 hours. We don't suggest training on a CPU. However, converting the training and validation files from txt format to binary format will speed up the training process (see the --is_binary option of deepsignal train).
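
As a rough back-of-the-envelope check, the 44-hour figure translates to tens of seconds per iteration, and a 10M-sample run has about half as many iterations per epoch. This sketch assumes the default 51200 samples per iteration and 5-10 epochs, so treat the numbers as orders of magnitude only:

import math

samples_per_iteration = 51_200   # deepsignal default
gpu_hours = 44                   # TITAN X (Pascal), 20M training samples (from above)

iters_per_epoch_20m = math.ceil(20_000_000 / samples_per_iteration)   # ~391
for epochs in (5, 10):
    sec_per_iter = gpu_hours * 3600 / (epochs * iters_per_epoch_20m)
    print(f"{epochs} epochs: ~{sec_per_iter:.0f} s per iteration")

iters_per_epoch_10m = math.ceil(10_000_000 / samples_per_iteration)
print(iters_per_epoch_10m)       # ~196, matching the ~200 iterations per epoch reported above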

Best,
Peng

pterzian commented on August 23, 2024

Hi Peng,

Thanks for the answer.
Actually, we have a few GPU machines available, so I am willing to try training that way. I will open a new issue about training a model and calling modifications with a GPU; I tried the latter a couple of weeks ago with no success.
