2023-07-20 02:13:16,273 - modelscope - INFO - Use user-specified model revision: v1.0.6
2023-07-20 02:13:17,519 - modelscope - INFO - Use user-specified model revision: v1.0.6
2023-07-20 02:13:18,124 - modelscope - INFO - Set workdir to ./pretrain_work_dir/
2023-07-20 02:13:18,171 - modelscope - INFO - load ./output_training_data/
2023-07-20 02:13:18,561 - modelscope - INFO - Use user-specified model revision: v1.0.6
2023-07-20 02:13:37,195 - modelscope - INFO - am_config=./pretrain_work_dir/orig_model/basemodel_16k/sambert/config.yaml voc_config=./pretrain_work_dir/orig_model/basemodel_16k/hifigan/config.yaml
2023-07-20 02:13:37,197 - modelscope - INFO - audio_config=./pretrain_work_dir/orig_model/basemodel_16k/audio_config_se_16k.yaml
2023-07-20 02:13:37,198 - modelscope - INFO - am_ckpts=OrderedDict([(2400000, './pretrain_work_dir/orig_model/basemodel_16k/sambert/ckpt/checkpoint_2400000.pth')])
2023-07-20 02:13:37,200 - modelscope - INFO - voc_ckpts=OrderedDict([(2400000, './pretrain_work_dir/orig_model/basemodel_16k/hifigan/ckpt/checkpoint_2400000.pth')])
2023-07-20 02:13:37,203 - modelscope - INFO - se_path=./pretrain_work_dir/orig_model/se.npy se_model_path=./pretrain_work_dir/orig_model/basemodel_16k/speaker_embedding/se.onnx
2023-07-20 02:13:37,204 - modelscope - INFO - mvn_path=./pretrain_work_dir/orig_model/mvn.npy
100%|██████████| 2/2 [00:00<00:00, 2823.50it/s]
TextScriptConvertor.process:
Save script to: ./pretrain_work_dir/data/Script.xml
TextScriptConvertor.process:
Save metafile to: ./pretrain_work_dir/data/raw_metafile.txt
[AudioProcessor] Initialize AudioProcessor.
[AudioProcessor] config params:
[AudioProcessor] wav_normalize: True
[AudioProcessor] trim_silence: True
[AudioProcessor] trim_silence_threshold_db: 60
[AudioProcessor] preemphasize: False
[AudioProcessor] sampling_rate: 16000
[AudioProcessor] hop_length: 200
[AudioProcessor] win_length: 1000
[AudioProcessor] n_fft: 2048
[AudioProcessor] n_mels: 80
[AudioProcessor] fmin: 0.0
[AudioProcessor] fmax: 8000.0
[AudioProcessor] phone_level_feature: True
[AudioProcessor] se_feature: True
[AudioProcessor] norm_type: mean_std
[AudioProcessor] max_norm: 1.0
[AudioProcessor] symmetric: False
[AudioProcessor] min_level_db: -100.0
[AudioProcessor] ref_level_db: 20
[AudioProcessor] num_workers: 16
[AudioProcessor] Amplitude normalization started
Volume statistic proceeding...
100%|██████████| 1/1 [00:00<00:00, 1.70it/s]
Average amplitude RMS : 0.126146
Volume statistic done.
Volume normalization proceeding...
100%|██████████| 1/1 [00:00<00:00, 530.12it/s]
Volume normalization done.
[AudioProcessor] Amplitude normalization finished
[AudioProcessor] Duration generation started
0%| | 0/1 [00:00<?, ?it/s]
[AudioProcessor] Duration align with mel is proceeding...
100%|██████████| 1/1 [00:01<00:00, 1.14s/it]
[AudioProcessor] Duration generate finished
[AudioProcessor] Trim silence with interval started
[AudioProcessor] Start to load pcm from ./pretrain_work_dir/data/wav
100%|██████████| 1/1 [00:01<00:00, 1.08s/it]
0%| | 0/1 [00:01<?, ?it/s]
100%|██████████| 1/1 [00:00<00:00, 815.70it/s]
[AudioProcessor] Trim silence finished
[AudioProcessor] Melspec extraction started
100%|██████████| 1/1 [00:01<00:00, 1.57s/it]
[AudioProcessor] Melspec extraction finished
Melspec statistic proceeding...
100%|██████████| 1/1 [00:00<00:00, 3236.35it/s]
100%|██████████| 1/1 [00:00<00:00, 363.39it/s]
Melspec statistic done
[AudioProcessor] melspec mean and std saved to:
./pretrain_work_dir/data/mel/mel_mean.txt,
./pretrain_work_dir/data/mel/mel_std.txt
[AudioProcessor] Melspec mean std norm is proceeding...
[AudioProcessor] Melspec normalization finished
[AudioProcessor] Normed Melspec saved to ./pretrain_work_dir/data/mel
[AudioProcessor] Pitch extraction started
0%| | 0/1 [00:00<?, ?it/s]
[AudioProcessor] Pitch align with mel is proceeding...
100%|██████████| 1/1 [00:01<00:00, 1.69s/it]
[AudioProcessor] Pitch normalization is proceeding...
100%|██████████| 1/1 [00:00<00:00, 4128.25it/s]
100%|██████████| 1/1 [00:00<00:00, 3721.65it/s]
[AudioProcessor] f0 mean and std saved to:
./pretrain_work_dir/data/f0/f0_mean.txt,
./pretrain_work_dir/data/f0/f0_std.txt
[AudioProcessor] Pitch mean std norm is proceeding...
[AudioProcessor] Pitch turn to phone-level is proceeding...
100%|██████████| 1/1 [00:01<00:00, 1.55s/it]
[AudioProcessor] Pitch normalization finished
[AudioProcessor] Normed f0 saved to ./pretrain_work_dir/data/f0
[AudioProcessor] Pitch extraction finished
[AudioProcessor] Energy extraction started
100%|██████████| 1/1 [00:01<00:00, 1.12s/it]
100%|██████████| 1/1 [00:00<00:00, 252.64it/s]
100%|██████████| 1/1 [00:00<00:00, 3682.44it/s]
[AudioProcessor] energy mean and std saved to:
./pretrain_work_dir/data/energy/energy_mean.txt,
./pretrain_work_dir/data/energy/energy_std.txt
[AudioProcessor] Energy mean std norm is proceeding...
100%|██████████| 1/1 [00:01<00:00, 1.08s/it]
[AudioProcessor] Energy normalization finished
[AudioProcessor] Normed Energy saved to ./pretrain_work_dir/data/energy
[AudioProcessor] Energy extraction finished
[AudioProcessor] All features extracted successfully!
Processing audio done.
[SpeakerEmbeddingProcessor] Speaker embedding extractor started
[SpeakerEmbeddingProcessor] se model loading error!!!
[SpeakerEmbeddingProcessor] please update your se model to ensure that the version is greater than or equal to 1.0.5
[SpeakerEmbeddingProcessor] try load it as se.model
[SpeakerEmbeddingProcessor] Speaker embedding extracted successfully!
Processing speaker embedding done.
Processing done.
Voc metafile generated.
AM metafile generated.
2023-07-20 02:14:06,035 - modelscope - INFO - Start training....
2023-07-20 02:14:06,040 - modelscope - INFO - Start SAMBERT training...
2023-07-20 02:14:06,042 - modelscope - INFO - TRAIN SAMBERT....
2023-07-20 02:14:06,059 - modelscope - INFO - TRAINING steps: 2400202
2023-07-20 02:14:06,069 - modelscope - INFO - audio_config = {'fmax': 8000.0, 'fmin': 0.0, 'hop_length': 200, 'max_norm': 1.0, 'min_level_db': -100.0, 'n_fft': 2048, 'n_mels': 80, 'norm_type': 'mean_std', 'num_workers': 16, 'phone_level_feature': True, 'preemphasize': False, 'ref_level_db': 20, 'sampling_rate': 16000, 'symmetric': False, 'trim_silence': True, 'trim_silence_threshold_db': 60, 'wav_normalize': True, 'win_length': 1000}
2023-07-20 02:14:06,070 - modelscope - INFO - Loss = {'MelReconLoss': {'enable': True, 'params': {'loss_type': 'mae'}}, 'ProsodyReconLoss': {'enable': True, 'params': {'loss_type': 'mae'}}}
2023-07-20 02:14:06,072 - modelscope - INFO - Model = {'KanTtsSAMBERT': {'optimizer': {'params': {'betas': [0.9, 0.98], 'eps': 1e-09, 'lr': 0.001, 'weight_decay': 0.0}, 'type': 'Adam'}, 'params': {'MAS': False, 'NSF': True, 'SE': True, 'decoder_attention_dropout': 0.1, 'decoder_dropout': 0.1, 'decoder_ffn_inner_dim': 1024, 'decoder_num_heads': 8, 'decoder_num_layers': 12, 'decoder_num_units': 128, 'decoder_prenet_units': [256, 256], 'decoder_relu_dropout': 0.1, 'dur_pred_lstm_units': 128, 'dur_pred_prenet_units': [128, 128], 'embedding_dim': 512, 'emotion_units': 32, 'encoder_attention_dropout': 0.1, 'encoder_dropout': 0.1, 'encoder_ffn_inner_dim': 1024, 'encoder_num_heads': 8, 'encoder_num_layers': 8, 'encoder_num_units': 128, 'encoder_projection_units': 32, 'encoder_relu_dropout': 0.1, 'max_len': 800, 'nsf_f0_global_maximum': 730.0, 'nsf_f0_global_minimum': 30.0, 'nsf_norm_type': 'global', 'num_mels': 82, 'outputs_per_step': 3, 'postnet_dropout': 0.1, 'postnet_ffn_inner_dim': 512, 'postnet_filter_size': 41, 'postnet_fsmn_num_layers': 4, 'postnet_lstm_units': 128, 'postnet_num_memory_units': 256, 'postnet_shift': 17, 'predictor_dropout': 0.1, 'predictor_ffn_inner_dim': 256, 'predictor_filter_size': 41, 'predictor_fsmn_num_layers': 3, 'predictor_lstm_units': 128, 'predictor_num_memory_units': 128, 'predictor_shift': 0, 'speaker_units': 192}, 'scheduler': {'params': {'warmup_steps': 4000}, 'type': 'NoamLR'}}}
2023-07-20 02:14:06,074 - modelscope - INFO - allow_cache = False
2023-07-20 02:14:06,084 - modelscope - INFO - batch_size = 32
2023-07-20 02:14:06,085 - modelscope - INFO - create_time = 2023-07-20 02:14:06
2023-07-20 02:14:06,087 - modelscope - INFO - eval_interval_steps = 10000000000000000
2023-07-20 02:14:06,090 - modelscope - INFO - git_revision_hash = d16755444c9baf23348213211a5ed9035458ecf0
2023-07-20 02:14:06,093 - modelscope - INFO - grad_norm = 1.0
2023-07-20 02:14:06,096 - modelscope - INFO - linguistic_unit = {'cleaners': 'english_cleaners', 'lfeat_type_list': 'sy,tone,syllable_flag,word_segment,emo_category,speaker_category', 'speaker_list': 'F7'}
2023-07-20 02:14:06,098 - modelscope - INFO - log_interval_steps = 50
2023-07-20 02:14:06,099 - modelscope - INFO - model_type = sambert
2023-07-20 02:14:06,100 - modelscope - INFO - num_save_intermediate_results = 4
2023-07-20 02:14:06,101 - modelscope - INFO - num_workers = 4
2023-07-20 02:14:06,102 - modelscope - INFO - pin_memory = False
2023-07-20 02:14:06,105 - modelscope - INFO - remove_short_samples = False
2023-07-20 02:14:06,111 - modelscope - INFO - save_interval_steps = 200
2023-07-20 02:14:06,113 - modelscope - INFO - train_max_steps = 2400202
2023-07-20 02:14:06,115 - modelscope - INFO - train_steps = 202
2023-07-20 02:14:06,119 - modelscope - INFO - log_interval = 10
2023-07-20 02:14:06,121 - modelscope - INFO - modelscope_version = 1.7.1
Loading metafile...
0it [00:00, ?it/s]
Loading metafile...
100%|██████████| 1/1 [00:00<00:00, 9198.04it/s]
2023-07-20 02:14:06,139 - modelscope - INFO - The number of training files = 0.
2023-07-20 02:14:06,141 - modelscope - INFO - The number of validation files = 1.
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-15-0089498a7012> in <cell line: 33>()
31 default_args=kwargs)
32
---> 33 trainer.train()