
CorrMM issue about acapellabot (open, 4 comments)

madebyollin commented on September 27, 2024

Comments (4)

jabelman commented on September 27, 2024

Any chance you could post your v2 or at least the implementation of the chop function you put in conversion.py, please?


madebyollin commented on September 27, 2024

It's an out-of-memory error; the code currently tries to process songs all at once (rather than splitting them up and processing segments individually), and that's a problem for long songs!

A quick-and-dirty workaround is to split your input file into separate files, run the script on each of those separately, then join the outputs:
https://unix.stackexchange.com/questions/280767/how-do-i-split-an-audio-file-into-multiple
https://superuser.com/questions/571463/how-do-i-append-a-bunch-of-wav-files-while-retaining-not-zero-padded-numeric
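The linked answers use command-line tools like ffmpeg/sox; as a rough pure-Python equivalent of the same split/join workaround, here is a sketch using only the stdlib `wave` module (filenames and the chunk length are illustrative, not from the repo):

```python
# Sketch of the manual workaround: split a WAV into fixed-length
# chunks, run acapellabot on each chunk separately, then join the
# output files. Assumes all chunks share the same WAV format.
import wave

def split_wav(path, chunk_seconds=30):
    """Write `path` as numbered chunk files; return their names."""
    chunks = []
    with wave.open(path, "rb") as src:
        params = src.getparams()
        frames_per_chunk = src.getframerate() * chunk_seconds
        index = 0
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break
            name = f"{path}.part{index:03d}.wav"
            with wave.open(name, "wb") as dst:
                dst.setparams(params)
                dst.writeframes(frames)
            chunks.append(name)
            index += 1
    return chunks

def join_wavs(paths, out_path):
    """Concatenate WAV files that share the same format."""
    with wave.open(out_path, "wb") as dst:
        for i, p in enumerate(paths):
            with wave.open(p, "rb") as src:
                if i == 0:
                    dst.setparams(src.getparams())
                dst.writeframes(src.readframes(src.getnframes()))
```

Split, process each `partNNN.wav` through the model, then `join_wavs` the results back into one file.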

The better way is to have the code automatically slice up the input until a slice fits in memory, run each slice through the network, and reassemble the outputs for you.

Try adding a predict method on acapellabot.py line 88, something like:

def predict(self, spectrogram):
    # Pad the spectrogram so its dimensions are a multiple of the
    # network's downscale factor.
    expandedSpectrogram = conversion.expandToGrid(spectrogram, self.peakDownscaleFactor)
    sliceSizeTime = 6144
    predictedSpectrogramWithBatchAndChannels = None
    # Halve the slice size until the slices fit in memory.
    while sliceSizeTime >= self.peakDownscaleFactor and predictedSpectrogramWithBatchAndChannels is None:
        try:
            # Chop along the time axis; each slice spans the full frequency range.
            slices = conversion.chop(expandedSpectrogram, sliceSizeTime, expandedSpectrogram.shape[0])
            outputSlices = []
            for s in slices:
                # Add the batch dimension the model expects.
                sWithBatchAndChannels = s[np.newaxis, :, :, :]
                outputSlices.append(self.model.predict(sWithBatchAndChannels))
            # Reassemble the predicted slices along the time axis.
            predictedSpectrogramWithBatchAndChannels = np.concatenate(outputSlices, axis=2)
        except (RuntimeError, MemoryError):
            console.info(sliceSizeTime, "is too large; trying", sliceSizeTime // 2)
            sliceSizeTime = sliceSizeTime // 2
    # Strip the batch dimension and crop back to the original size.
    predictedSpectrogram = predictedSpectrogramWithBatchAndChannels[0, :, :, :]
    newSpectrogram = predictedSpectrogram[:spectrogram.shape[0], :spectrogram.shape[1]]
    return newSpectrogram

and replacing lines 95-103 with:

newSpectrogram = self.predict(spectrogram)
newAudio = conversion.spectrogramToAudioFile(newSpectrogram, fftWindowSize=fftWindowSize, phaseIterations=phaseIterations)

(this is copy-pasted from my v2 code, which uses stereo instead of mono, so there might be some issues actually getting it to run–will try to test later...)
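As for the `chop` function jabelman asked about: it isn't shown in the thread, but judging from the call site above (a slice size along the time axis, with the full spectrogram height along frequency, reassembled with `np.concatenate(..., axis=2)`), a minimal sketch might look like this; the signature and tiling order are inferred, not confirmed:

```python
import numpy as np

def chop(matrix, slice_size_time, slice_size_freq):
    """Hypothetical sketch of conversion.chop, inferred from its call
    site: split a (freq, time, channels) spectrogram into tiles of at
    most slice_size_freq x slice_size_time, in time-major order."""
    slices = []
    for t in range(0, matrix.shape[1], slice_size_time):
        for f in range(0, matrix.shape[0], slice_size_freq):
            slices.append(matrix[f:f + slice_size_freq,
                                 t:t + slice_size_time, :])
    return slices
```

Because the call passes `expandedSpectrogram.shape[0]` as the frequency slice size, the inner loop runs once and the slices are pure time segments, which is what makes the `axis=2` concatenation in `predict` reassemble them correctly.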


wawang250 commented on September 27, 2024

I tried splitting the song into much smaller pieces, and it worked perfectly! Thanks a lot for your quick reply!

Besides, I'm wondering what will happen if I run the full song on a server with large memory.

Actually what I am trying to do is separating two or more people's voices from each other. I think training this model with my own training set might be necessary. Got any tips for that?


madebyollin commented on September 27, 2024

The network is fully convolutional, so there's not much of a difference between running it on segments and running it on the whole thing (the one possible difference is artifacts at the boundaries between sections).

Multi-speaker source separation (from a monophonic source) is something this architecture will probably do poorly at. As designed right now, it only really makes judgements about individual harmonics, which isn't enough to separate speakers. For example, here's a vocal over sin/square wave chords–it's incredibly easy for the network to identify the vocals, since all you need to do is filter out all of the straight lines:
[Screenshot, 2017-12-14: spectrogram of a vocal over sin/square wave chords]

I would suggest using a deeper U-net architecture (to take a larger context into account) if you want to do multi-speaker separation. Even that will only be able to succeed by memorizing facts about specific speakers, though... a better implementation might have two input spectrograms to a large U-net: the multi-speaker spectrogram, and a "reference" spectrogram of one of the speakers, with the target output being that speaker's separated audio. Generating data for that is still pretty easy (you can probably even use my same script, just run it on lots of single-speaker recordings) but getting a good network is the tricky part. It might be worthwhile to start on a simpler case like decomposing saw waves/square waves, where it's more obvious what the network is (and should be) doing.
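The data-generation step described above (mix single-speaker recordings, feed the mixture plus a reference clip of one speaker, and use that speaker's clean audio as the target) could be sketched as follows; the function and all names are illustrative, not from the repo's script:

```python
import numpy as np

def make_training_example(speaker_a, speaker_b, reference_a):
    """Build one (inputs, target) pair for the hypothetical two-input
    network: a mixture waveform plus a separate reference clip of
    speaker A, with speaker A's clean audio as the target."""
    n = min(len(speaker_a), len(speaker_b))
    # Mix the two speakers, scaled down to avoid clipping.
    mixture = 0.5 * (speaker_a[:n] + speaker_b[:n])
    # Both inputs would then be converted to spectrograms before
    # being fed to the network.
    inputs = (mixture, reference_a)
    target = speaker_a[:n]
    return inputs, target
```

In practice you would draw `reference_a` from a different recording of the same speaker than the one used in the mixture, so the network has to generalize rather than just subtract.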

I found visualizing spectrograms (Audacity works well for this, as does Sonic Visualiser) was really helpful in understanding what the network should and shouldn't be able to do–if you can't tell the two speakers apart in spectrogram view (again, on a monophonic file), then it's unlikely that an image-to-image network in spectrogram space will be able to either.

