
Why was my model retrained? (tgan, 7 comments, closed)

sdv-dev commented on May 30, 2024
Why was my model retrained?

Comments (7)

ManuelAlvarezC commented on May 30, 2024

Hi @TrinhDinhPhuc,

Regarding your first question:

I checked the code; it was frozen at this line in the launcher.py file. Do you know why?

Looking at your configuration file and the code, there is something that may be causing trouble:

[
    {
        ...
        "sample_rows": 5
    }
]

We just found out that there is a bug which prevents TGAN from working properly when sampling a number of rows that is not an exact multiple of the batch_size.

To work around this problem, and since the only possible batch sizes that TGAN can use right now are 50, 100 and 200, please make sure to always request a number of rows to sample that is an exact multiple of 200.
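
If it helps, here is a minimal sketch for rounding a requested row count up to a safe value (this helper is not part of TGAN; its name and the default of 200 are just taken from the advice above):

    def round_up_to_batch(sample_rows, batch_size=200):
        # Smallest multiple of batch_size that is >= sample_rows, so the
        # requested row count stays an exact multiple of the batch size.
        return ((sample_rows + batch_size - 1) // batch_size) * batch_size

    round_up_to_batch(5)      # -> 200
    round_up_to_batch(5000)   # -> 5000 (already a multiple of 200)
    round_up_to_batch(10001)  # -> 10200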

For the second question:

As you can see here, it seems like it was freezing at this line: sample(args.sample, Model(), args.load + ".index", output_filename=args.output).

Well, I can't see it from your screenshots, but this is one of the potential consequences of the bug explained above.

I think the correct path is model-10.index, not model-10, right?

Yes, indeed.

Also, I'm not sure if you're aware, but any synthesized data will be stored in TGAN-master/exp_dir/KDD2/.
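
Once a run finishes, you can inspect whatever lands in that directory with pandas, for example (a minimal sketch; the filename below is an assumption, so check the contents of exp_dir/KDD2/ for the actual name):

    import pandas as pd

    # Load the synthesized table written by the launcher; adjust the filename
    # to whatever actually appears under exp_dir/KDD2/.
    synthetic = pd.read_csv("TGAN-master/exp_dir/KDD2/KDD2_synthetic.csv")
    print(synthetic.shape)
    print(synthetic.head())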

ManuelAlvarezC commented on May 30, 2024

Hi @TrinhDinhPhuc,

Would you mind sharing with us some more details like:

  • List of steps you used
  • Contents of the JSON config file

Thanks.

harrytrinh2 commented on May 30, 2024

Hi, I am using Ubuntu 18.04 and Python 3.6. This is my configuration file:

[
    {
        "name": "census",
        "num_random_search": 10,
        "train_csv": "data/census-train.csv",
        "continuous_cols": [0, 2, 3, 4, 5],
        "epoch": 5,
        "steps_per_epoch": 10000,
        "output_epoch": 3,
        "sample_rows": 10000
    }
]

I simply ran $ python3.6 src/launcher.py demo_config.json as instructed in the README. After 4 hours, it was training epoch 4, but suddenly it showed these lines:

[0327 14:23:24 @sessinit.py:90] WRN The following variables are in the checkpoint, but not found in the graph: global_step:0, optimize/beta1_power:0, optimize/beta2_power:0
 25%|###############2                                             |2498/10000[27:36<1:08:27, 1.83it/s]2019-03-27 14:23:24.578021: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
[0327 14:23:24 @sessinit.py:117] Restoring checkpoint from train_log/TGAN_synthesizer:KDD-2/model-50000 ...
 25%|###############3                                             |2523/10000[27:50<1:07:24, 1.85it/s][0327 14:23:39 @logger.py:74] Argv: src/TGAN_synthesizer.py --batch_size 50 --z_dim 100 --num_gen_rnn 400 --num_gen_feature 300 --num_dis_layers 4 --num_dis_hidden 400 --learning_rate 0.001 --noise 0.1 --exp_name KDD-3 --max_epoch 5 --steps_per_epoch 10000 --data expdir/KDD/train.npz --gpu 0
 25%|###############3                                             |2524/10000[27:51<1:11:50, 1.73it/s][0327 14:23:39 @develop.py:96] WRN [Deprecated] ModelDescBase._get_inputs() interface will be deprecated after 30 Mar. Use inputs() instead!
[0327 14:23:39 @input_source.py:221] Setting up the queue 'QueueInput/input_queue' for CPU prefetching ...
[0327 14:23:39 @develop.py:96] WRN [Deprecated] ModelDescBase._build_graph() interface will be deprecated after 30 Mar. Use build_graph() instead!
[0327 14:23:39 @registry.py:121] gen/LSTM/00/FC input: [50, 400]
[0327 14:23:39 @registry.py:129] gen/LSTM/00/FC output: [50, 300]
[0327 14:23:39 @registry.py:121] gen/LSTM/00/FC2 input: [50, 300]
[0327 14:23:39 @registry.py:129] gen/LSTM/00/FC2 output: [50, 1]
WARNING:tensorflow:From src/TGAN_synthesizer.py:71: calling softmax (from tensorflow.python.ops.nn_ops) with dim is deprecated and will be removed in a future version.
Instructions for updating:
dim is deprecated, use axis instead

Then it automatically went back to epoch 1:

discrim/dis_fc_top/W:0            [410, 1]          410
discrim/dis_fc_top/b:0            [1]                 1
Total #vars=92, #params=3934650, size=15.01MB
[0327 14:23:46 @base.py:187] Setup callbacks graph ...
[0327 14:23:46 @summary.py:38] Maintain moving average summary of 6 tensors in collection MOVING_SUMMARY_OPS.
[0327 14:23:46 @summary.py:75] Summarizing collection 'summaries' of size 9.
[0327 14:23:46 @graph.py:91] Applying collection UPDATE_OPS of 16 ops.
 25%|###############4                                             |2540/10000[27:59<1:08:34, 1.81it/s][0327 14:23:48 @base.py:205] Creating the session ...

harrytrinh2 commented on May 30, 2024

Here is the error:

2019-03-27 23:03:06.579435: W tensorflow/core/kernels/queue_base.cc:277] _0_QueueInput/input_queue: Skipping cancelled enqueue attempt with queue not closed
 60%|########################################################################################################4                                                                    |3018/5000[46:27<25:53, 1.28it/s]Traceback (most recent call last):
  File "src/TGAN_synthesizer.py", line 313, in <module>
    sample(args.sample, Model(), args.load, output_filename=args.output)
  File "src/TGAN_synthesizer.py", line 234, in sample
    session_init=get_model_loader(model_path),
  File "/home/harry/Documents/GANs-demo/TGAN-master/py36_env/lib/python3.6/site-packages/tensorpack/tfutils/sessinit.py", line 262, in get_model_loader
    return SaverRestore(filename)
  File "/home/harry/Documents/GANs-demo/TGAN-master/py36_env/lib/python3.6/site-packages/tensorpack/tfutils/sessinit.py", line 107, in __init__
    model_path = get_checkpoint_path(model_path)
  File "/home/harry/Documents/GANs-demo/TGAN-master/py36_env/lib/python3.6/site-packages/tensorpack/tfutils/varmanip.py", line 182, in get_checkpoint_path
    assert tf.gfile.Exists(model_path) or tf.gfile.Exists(model_path + '.index'), model_path
AssertionError: train_log/TGAN_synthesizer:KDD2-2/model-0

my config file:
[
    {
        "name": "KDD2",
        "num_random_search": 10,
        "train_csv": "data/KDD2.csv",
        "continuous_cols": [0, 2, 3, 4, 5],
        "epoch": 2,
        "steps_per_epoch": 5000,
        "output_epoch": 3,
        "sample_rows": 5000
    }
]

ManuelAlvarezC commented on May 30, 2024

Hi @TrinhDinhPhuc,

Regarding your first question:

Why did this problem occur? Please explain it to me.

There is no problem, nor was the model retrained. Let's see what happened:

  1. According to your first config JSON:
[
    {
        "name": "census",
        "num_random_search": 10,  # num_random_search: iterations of random hyper parameter search.
        "train_csv": "data/census-train.csv",
        "continuous_cols": [0, 2, 3, 4, 5],
        "epoch": 5,
        "steps_per_epoch": 10000,
        "output_epoch": 3,
        "sample_rows": 10000
    }
]

You are running 10 parallel random searches over the model hyperparameters, as set by the num_random_search parameter. That is, the launcher trains and evaluates different model instances with different sets of hyperparameters to find the best ones for your dataset. And this message:

[0327 14:22:41 @base.py:264] Training has finished!

just means that one of the training cycles of the random search has finished. The output that follows, from the start of the next iteration of the hyperparameter search loop, is what I think may have led you to believe the model was being retrained.
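
Conceptually, the search loop behaves roughly like this (a simplified sketch only, not the actual launcher.py code; the function names and the search space here are illustrative):

    import random

    def random_hyperparams():
        # Illustrative search space; the real launcher samples its own ranges.
        return {
            "batch_size": random.choice([50, 100, 200]),
            "num_gen_rnn": random.choice([100, 200, 400]),
            "learning_rate": random.choice([0.001, 0.0005]),
        }

    def run_random_search(num_random_search, train_fn, evaluate_fn):
        results = []
        for i in range(num_random_search):
            params = random_hyperparams()
            # A fresh model is trained in every iteration, each under its own
            # experiment name (KDD-0, KDD-1, ...), which is why "Training has
            # finished!" shows up repeatedly in the logs.
            model = train_fn(params, exp_name="KDD-{}".format(i))
            results.append((evaluate_fn(model), params))
        # Keep the best-scoring hyperparameter set.
        return max(results, key=lambda r: r[0])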

Regarding your second question:

The issue here is that an experiment can't be run twice with the same name.
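
If you want to repeat a run, either change the "name" field in the config or clear the logs of the previous run first, for example (a minimal sketch; the directory pattern is an assumption based on the paths shown in the logs above):

    import glob
    import shutil

    # Delete the train_log directories left over from earlier iterations of
    # the "KDD2" experiment before launching it again under the same name.
    for path in glob.glob("train_log/TGAN_synthesizer:KDD2-*"):
        shutil.rmtree(path)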

harrytrinh2 commented on May 30, 2024

Hi, based on your explanation, I modified my config file like this to check the code:
[
    {
        "name": "KDD2",
        "num_random_search": 1,
        "train_csv": "data/KDD2.csv",
        "continuous_cols": [0, 2, 3, 4, 5],
        "epoch": 1,
        "steps_per_epoch": 10,
        "output_epoch": 1,
        "sample_rows": 5
    }
]

However, the script got stuck like this and made no progress for a long time. I don't understand why it was restoring a checkpoint and why it took so much time:

[0330 12:43:42 @registry.py:129] discrim/dis_fc_top output: [50, 1]
[0330 12:43:42 @collection.py:145] New collections created in tower : tf.GraphKeys.REGULARIZATION_LOSSES
[0330 12:43:42 @collection.py:164] These collections were modified but restored in : (tf.GraphKeys.SUMMARIES: 0->2)
[0330 12:43:42 @sessinit.py:90] WRN The following variables are in the checkpoint, but not found in the graph: global_step:0, optimize/beta1_power:0, optimize/beta2_power:0
2019-03-30 12:43:42.856889: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
[0330 12:43:43 @sessinit.py:117] Restoring checkpoint from train_log/TGAN_synthesizer:KDD2-0/model-10 ...

I checked the code; it was frozen at this line in the launcher.py file. Do you know why?
pool.map(worker, commands)

harrytrinh2 commented on May 30, 2024

[Screenshot from 2019-03-30 18-57-27]
As you can see in the screenshot, it seems like it was freezing at this line: sample(args.sample, Model(), args.load + ".index", output_filename=args.output). In the picture, it was restoring the checkpoint from train_log/TGAN_synthesizer:KDD2-0/model-10 ..., but in that folder there was no model-10 file. I think the correct path is model-10.index, not model-10, right?
