Comments (7)

gb96 commented on August 11, 2024

Today I found that my model, trained on different music, generated what sounded like white noise.

My problem appeared to be due to some of the converted wav files (generated in datasets/YourMusicLibrary/wave/) being mono 32-bit PCM audio at 8 kHz, whereas the GRUV conversion functions assume mono 16-bit PCM audio at 8 kHz.

If you find that some of your wav files have the wrong bit depth, you can convert them with sox, e.g.:
sox oldfile.wav -b 16 newfile.wav
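
If you want to scan the whole directory first, here is a minimal sketch using Python's standard wave module (the glob path is just the GRUV default; adjust it to your setup):

import glob
import wave

# Flag converted files that are not mono 16-bit PCM at 8 kHz,
# which is what the GRUV conversion functions assume.
for path in glob.glob("datasets/YourMusicLibrary/wave/*.wav"):
    with wave.open(path, "rb") as w:
        channels, width, rate = w.getnchannels(), w.getsampwidth(), w.getframerate()
        if (channels, width, rate) != (1, 2, 8000):  # width is in bytes: 2 == 16-bit
            print(path, channels, 8 * width, rate)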

This might be the cause of the second issue you mentioned.

The first issue is probably due to over-fitting: your trained model fits the training data well but does not generalize to the validation data. However, you would still expect the validation loss to decrease at some point during the earlier epochs. Some people have reported that for LSTM networks the validation loss can move up and down unpredictably during training before the optimal minimum is reached.
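
One way to ride out that up-and-down movement is an early-stopping callback with a generous patience. A minimal sketch in Keras (which GRUV builds on), with a toy stand-in for the GRUV model; the layer sizes, data shapes, and patience value are all illustrative:

import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, TimeDistributed, Dense
from keras.callbacks import EarlyStopping

# Toy sequence-to-sequence LSTM standing in for the GRUV model.
model = Sequential()
model.add(LSTM(64, return_sequences=True, input_shape=(40, 128)))
model.add(TimeDistributed(Dense(128)))
model.compile(loss='mean_squared_error', optimizer='rmsprop')

# Random stand-in data: (examples, timesteps, frequency bins).
X = np.random.randn(32, 40, 128)
y = np.random.randn(32, 40, 128)

# patience tolerates epochs where val_loss temporarily rises, stopping
# only after it has failed to improve for 50 epochs in a row.
early_stop = EarlyStopping(monitor='val_loss', patience=50)
model.fit(X, y, epochs=5000, validation_split=0.25, callbacks=[early_stop])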

nixingyang commented on August 11, 2024

@gb96 Thanks for your reply. Have you trained a model which is capable of producing meaningful sound?
I re-implemented the code, and I forgot to normalize the raw audio data. That might be the reason for these two issues.
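
For reference, the forward normalization is just centering and scaling; the essential point is that it must exactly match the inverse transform applied after generation (see the end of generate_from_seed below). A minimal sketch, with variable names mirroring that function's arguments and random stand-in data:

import numpy as np

# Stand-in training tensor: (examples, timesteps, frequency bins).
X = np.random.randn(100, 40, 128).astype(np.float32)

data_mean = X.mean(axis=(0, 1))      # per-bin mean over examples and timesteps
data_variance = X.var(axis=(0, 1))   # per-bin variance (not std; match the inverse!)

# Forward transform; generate_from_seed undoes exactly this at the end.
X_norm = (X - data_mean) / data_variance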

gb96 commented on August 11, 2024

@nixingyang I have trained models that produce sound (e.g., https://soundcloud.com/gb96/stairway-to-gruv-hd512-epoch48000-loss067-seed3x3).

Have you tried running the audio_unit_test or equivalent? (See https://github.com/MattVitelli/GRUV/blob/master/data_utils/parse_files.py#L190)

That test verifies the methods for loading and saving sound files, converting between wave and NumPy formats, and converting between time-domain and frequency-domain representations (via the Fast Fourier Transform and its inverse).
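
The same idea as a standalone check (not the GRUV test itself, just a minimal sketch of the FFT round trip it exercises):

import numpy as np

# Time domain -> frequency domain -> time domain should reproduce
# the original block up to floating-point error.
block = np.random.randn(2048)
spectrum = np.fft.fft(block)
reconstructed = np.real(np.fft.ifft(spectrum))
assert np.allclose(block, reconstructed)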

nixingyang commented on August 11, 2024

I have defined a function which is similar to audio_unit_test, and I can confirm that the transformation process is lossless.
The audio you shared contains informative sound at the beginning; however, the model simply repeats useless sound after that. My prediction does not contain informative sound at all. Did you modify the generate_from_seed function, and did you train your model solely on 65 seconds of audio?

gb96 commented on August 11, 2024

It looks like I have made some significant modifications to the generate_from_seed function.
The main idea of my changes is to keep a fixed seed-sequence length: new predicted values are appended to the end, and the initial values are deleted from the beginning to maintain a constant length.

import numpy as np

# Extrapolates from a given seed sequence
def generate_from_seed(model, seed, sequence_length, data_variance, data_mean):
    seedSeq = seed.copy()
    output = []
    # The generation algorithm is simple:
    # Step 1 - Given A = [X_0, X_1, ..., X_n], generate X_{n+1}
    # Step 2 - Append X_{n+1} to A and drop X_0, keeping the length fixed
    # Step 3 - Repeat sequence_length times
    for it in range(sequence_length):
        seedSeqNew = model.predict(seedSeq)  # Step 1. Generate X_{n+1}
        # Step 2. Take the last predicted frame and record it
        newSeq = seedSeqNew[0][seedSeqNew.shape[1] - 1]
        output.append(newSeq.copy())
        # Construct the new seedSeq: append the prediction and delete the
        # oldest frame so the sequence length stays constant
        newSeq = np.reshape(newSeq, (1, 1, newSeq.shape[0]))
        seedSeq = np.concatenate((seedSeq, newSeq), axis=1)
        seedSeq = np.delete(seedSeq, 0, 1)
    # Finally, post-process the generated sequence so that we have valid frequencies
    # We're essentially just undoing the data centering process
    for i in range(len(output)):
        output[i] *= data_variance
        output[i] += data_mean
    return output
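
For reference, a hypothetical call, assuming model, seed (shape (1, seq_len, freq_dims), already normalized), and the normalization statistics come from the GRUV pipeline:

# Generate 4000 new frequency frames from the seed (the count is illustrative).
output = generate_from_seed(model, seed, sequence_length=4000,
                            data_variance=data_variance, data_mean=data_mean)
# output is a list of de-normalized frequency frames, ready for the
# inverse-FFT / wav conversion step.

Keeping the seed length fixed also keeps the model's input shape, and therefore the per-step prediction cost, constant.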

gb96 commented on August 11, 2024

To answer your question about the training data: I used the first 65 seconds of audio from each channel of a stereo source, for a total of 130 seconds. The reason I did that was that the source music had quite distinct sounds in each channel (e.g. guitar notes in one and vocals in the other), and I figured it would be easier to train an LSTM network on the separate sounds rather than on the combined mono version.
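
A minimal sketch of that channel split with SciPy (file names are hypothetical):

from scipy.io import wavfile

rate, data = wavfile.read('source_stereo.wav')  # data shape: (frames, 2)
n = 65 * rate                                   # first 65 seconds
left, right = data[:n, 0], data[:n, 1]          # split the stereo channels

wavfile.write('channel_left.wav', rate, left)
wavfile.write('channel_right.wav', rate, right)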

nixingyang commented on August 11, 2024

The modification of generate_from_seed is reasonable. From my point of view, the algorithm devised in GRUV is not capable of handling real-world audio signals. Google has unveiled WaveNet, which is probably the state of the art.
