
CorrMM issue about acapellabot (open, 4 comments)

madebyollin commented on September 27, 2024

Comments (4)

jabelman commented on September 27, 2024

Any chance you could post your v2 or at least the implementation of the chop function you put in conversion.py, please?


madebyollin commented on September 27, 2024

It's an out-of-memory error; the code currently tries to process songs all at once (rather than splitting them up and processing segments individually), and that's a problem for long songs!

A quick-and-dirty workaround is to split your input file into separate files, run the script on each of those separately, then join the outputs:
https://unix.stackexchange.com/questions/280767/how-do-i-split-an-audio-file-into-multiple
https://superuser.com/questions/571463/how-do-i-append-a-bunch-of-wav-files-while-retaining-not-zero-padded-numeric
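The linked answers use command-line tools like ffmpeg/sox; as a rough pure-Python equivalent of the same split/join workaround, here is a sketch using only the stdlib `wave` module (filenames and the chunk length are illustrative, not from the repo):

```python
# Sketch of the manual workaround: split a WAV into fixed-length
# chunks, run acapellabot on each chunk separately, then join the
# output files. Assumes all chunks share the same WAV format.
import wave

def split_wav(path, chunk_seconds=30):
    """Write `path` as numbered chunk files; return their names."""
    chunks = []
    with wave.open(path, "rb") as src:
        params = src.getparams()
        frames_per_chunk = src.getframerate() * chunk_seconds
        index = 0
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break
            name = f"{path}.part{index:03d}.wav"
            with wave.open(name, "wb") as dst:
                dst.setparams(params)
                dst.writeframes(frames)
            chunks.append(name)
            index += 1
    return chunks

def join_wavs(paths, out_path):
    """Concatenate WAV files that share the same format."""
    with wave.open(out_path, "wb") as dst:
        for i, p in enumerate(paths):
            with wave.open(p, "rb") as src:
                if i == 0:
                    dst.setparams(src.getparams())
                dst.writeframes(src.readframes(src.getnframes()))
```

Split, process each `partNNN.wav` through the model, then `join_wavs` the results back into one file.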

The better way is to have the code automatically slice up the input until a slice fits in memory, run each slice through the network, and reassemble the outputs for you.

Try adding a predict method on acapellabot.py line 88, something like:

def predict(self, spectrogram):
    # Pad the spectrogram so its dimensions are a multiple of the
    # network's downscale factor.
    expandedSpectrogram = conversion.expandToGrid(spectrogram, self.peakDownscaleFactor)
    sliceSizeTime = 6144
    predictedSpectrogramWithBatchAndChannels = None
    # Halve the slice size until the slices fit in memory.
    while sliceSizeTime >= self.peakDownscaleFactor and predictedSpectrogramWithBatchAndChannels is None:
        try:
            # Chop along the time axis; each slice spans the full frequency range.
            slices = conversion.chop(expandedSpectrogram, sliceSizeTime, expandedSpectrogram.shape[0])
            outputSlices = []
            for s in slices:
                # Add the batch dimension the model expects.
                sWithBatchAndChannels = s[np.newaxis, :, :, :]
                outputSlices.append(self.model.predict(sWithBatchAndChannels))
            # Reassemble the predicted slices along the time axis.
            predictedSpectrogramWithBatchAndChannels = np.concatenate(outputSlices, axis=2)
        except (RuntimeError, MemoryError):
            console.info(sliceSizeTime, "is too large; trying", sliceSizeTime // 2)
            sliceSizeTime = sliceSizeTime // 2
    # Strip the batch dimension and crop back to the original size.
    predictedSpectrogram = predictedSpectrogramWithBatchAndChannels[0, :, :, :]
    newSpectrogram = predictedSpectrogram[:spectrogram.shape[0], :spectrogram.shape[1]]
    return newSpectrogram

and replacing lines 95-103 with:

newSpectrogram = self.predict(spectrogram)
newAudio = conversion.spectrogramToAudioFile(newSpectrogram, fftWindowSize=fftWindowSize, phaseIterations=phaseIterations)

(this is copy-pasted from my v2 code, which uses stereo instead of mono, so there might be some issues actually getting it to run–will try to test later...)
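As for the `chop` function jabelman asked about: it isn't shown in the thread, but judging from the call site above (a slice size along the time axis, with the full spectrogram height along frequency, reassembled with `np.concatenate(..., axis=2)`), a minimal sketch might look like this; the signature and tiling order are inferred, not confirmed:

```python
import numpy as np

def chop(matrix, slice_size_time, slice_size_freq):
    """Hypothetical sketch of conversion.chop, inferred from its call
    site: split a (freq, time, channels) spectrogram into tiles of at
    most slice_size_freq x slice_size_time, in time-major order."""
    slices = []
    for t in range(0, matrix.shape[1], slice_size_time):
        for f in range(0, matrix.shape[0], slice_size_freq):
            slices.append(matrix[f:f + slice_size_freq,
                                 t:t + slice_size_time, :])
    return slices
```

Because the call passes `expandedSpectrogram.shape[0]` as the frequency slice size, the inner loop runs once and the slices are pure time segments, which is what makes the `axis=2` concatenation in `predict` reassemble them correctly.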


wawang250 commented on September 27, 2024

I tried splitting the song into much smaller pieces, and it worked perfectly! Thanks a lot for your quick reply!

Besides, I'm wondering what will happen if I run the full song on a server with large memory.

Actually what I am trying to do is separating two or more people's voices from each other. I think training this model with my own training set might be necessary. Got any tips for that?


madebyollin commented on September 27, 2024

The network is fully convolutional, so there's not much of a difference between running it on segments and running it on the whole thing (the one possible difference is artifacts at the boundaries between sections).

Multi-speaker source separation (from a monophonic source) is something this architecture will probably do poorly at. As designed right now, it only really makes judgements about individual harmonics, which isn't enough to separate speakers. For example, here's a vocal over sin/square wave chords–it's incredibly easy for the network to identify the vocals, since all you need to do is filter out all of the straight lines:
[Screenshot, 2017-12-14: spectrogram of a vocal over sin/square wave chords]

I would suggest using a deeper U-net architecture (to take a larger context into account) if you want to do multi-speaker separation. Even that will only be able to succeed by memorizing facts about specific speakers, though... a better implementation might have two input spectrograms to a large U-net: the multi-speaker spectrogram, and a "reference" spectrogram of one of the speakers, with the target output being that speaker's separated audio. Generating data for that is still pretty easy (you can probably even use my same script, just run it on lots of single-speaker recordings) but getting a good network is the tricky part. It might be worthwhile to start on a simpler case like decomposing saw waves/square waves, where it's more obvious what the network is (and should be) doing.
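The data-generation step described above (mix single-speaker recordings, feed the mixture plus a reference clip of one speaker, and use that speaker's clean audio as the target) could be sketched as follows; the function and all names are illustrative, not from the repo's script:

```python
import numpy as np

def make_training_example(speaker_a, speaker_b, reference_a):
    """Build one (inputs, target) pair for the hypothetical two-input
    network: a mixture waveform plus a separate reference clip of
    speaker A, with speaker A's clean audio as the target."""
    n = min(len(speaker_a), len(speaker_b))
    # Mix the two speakers, scaled down to avoid clipping.
    mixture = 0.5 * (speaker_a[:n] + speaker_b[:n])
    # Both inputs would then be converted to spectrograms before
    # being fed to the network.
    inputs = (mixture, reference_a)
    target = speaker_a[:n]
    return inputs, target
```

In practice you would draw `reference_a` from a different recording of the same speaker than the one used in the mixture, so the network has to generalize rather than just subtract.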

I found visualizing spectrograms (Audacity works well for this, as does Sonic Visualiser) was really helpful in understanding what the network should and shouldn't be able to do–if you can't tell the two speakers apart in spectrogram view (again, on a monophonic file), then it's unlikely that an image-to-image network in spectrogram space will be able to either.

