dmitryulyanov / neural-style-audio-tf
TensorFlow implementation for audio neural style.
Hi Dmitry,
Thanks for putting this together; this is exactly what I was looking for, for an experiment!
I am definitely a beginner at this, but when I tried to run your example I got a SyntaxError in the Optimize cell and in the Output cell, on the print statements, since Python 3 now requires parentheses.
File "<ipython-input-16-9eb962c6044b>", line 50
print 'Final loss:', loss.eval()
^
SyntaxError: invalid syntax
I also figured out that tf.initialize_all_variables() is now deprecated, so I changed it to tf.global_variables_initializer().
Then it all works well!
Thanks!
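For other readers hitting the same errors, here is a minimal sketch of the two fixes described above (`final_loss` is a toy stand-in for the notebook's own `loss.eval()` value):

```python
# Python 3 fix: print is a function, so the Python 2 form
#   print 'Final loss:', loss.eval()
# needs parentheses.
final_loss = 1.23  # toy stand-in for loss.eval()
print('Final loss:', final_loss)

# TF fix: tf.initialize_all_variables() is deprecated; use
#   sess.run(tf.global_variables_initializer())
# instead when initializing the graph's variables.
```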
Hi Dmitry,
I wanted to try an audio style transfer, but I get this error at the optimize-and-invert-spectrum step.
AttributeError Traceback (most recent call last)
in ()
----> 1 get_ipython().run_cell_magic('time', '', 'from sys import stderr\n\n#@markdown ---\n#@markdown Advanced settings / Расширенные настройки\nALPHA= 0.1 #@param {type:"slider", min:0.01, max:0.2, step:0.01}\nlearning_rate= 0.01 #@param {type:"slider", min:0.001, max:0.02, step:0.001}\niterations = 300 #@param {type:"slider", min:100, max:500, step:10}\n#@markdown ---\nresult = None\nwith tf.Graph().as_default():\n\n # Build graph with variable input\n #x = tf.Variable(np.zeros([1,1,N_SAMPLES,N_CHANNELS], dtype=np.float32), name="x")\n x = tf.Variable(np.random.randn(1,1,N_SAMPLES,N_CHANNELS).astype(np.float32)*1e-3, name="x")\n\n kernel_tf = tf.constant(kernel, name="kernel", dtype='float32')\n conv = tf.nn.conv2d(\n x,\n kernel_tf,\n strides=[1, 1, 1, 1],\n padding="VALID",\n name="conv")\n \n \n net = tf.nn.relu(conv)\n\n content_loss = ALPHA * 2 * tf.nn.l2_loss(\n net - content_features)\n\n style_loss = 0\n\n _, height, width, number = map(lambda i: i.value, net.get_shape())\n\n size = height * width * number\n feats = tf.reshape(net, (-1, number))\n gram = tf.matmul(tf.transpose(feats), feats) / N_SAMPLES\n style_loss = 2 * tf.nn.l2_loss(gram - style_gram)\n\n # Overall loss\n loss = content_loss + style_loss\n\n opt = tf.contrib.opt.S...
2 frames
/usr/local/lib/python3.7/dist-packages/IPython/core/interactiveshell.py in run_cell_magic(self, magic_name, line, cell)
2115 magic_arg_s = self.var_expand(line, stack_depth)
2116 with self.builtin_trap:
-> 2117 result = fn(magic_arg_s, cell)
2118 return result
2119
in time(self, line, cell, local_ns)
/usr/local/lib/python3.7/dist-packages/IPython/core/magic.py in (f, *a, **k)
186 # but it's overkill for just that one bit of state.
187 def magic_deco(arg):
--> 188 call = lambda f, *a, **k: f(*a, **k)
189
190 if callable(arg):
/usr/local/lib/python3.7/dist-packages/IPython/core/magics/execution.py in time(self, line, cell, local_ns)
1191 else:
1192 st = clock2()
-> 1193 exec(code, glob, local_ns)
1194 end = clock2()
1195 out = None
in ()
AttributeError: module 'librosa' has no attribute 'output'
I'm trying to understand whether it would make sense to learn style from a group of examples (in this case, audio files) instead of just one. In the best case this would produce a sort of "mean style" representing the group of audio excerpts. In your experience, would such an approach work (as long as the examples do share some style in common), or would it produce just garbage?
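One way to sketch that "mean style" idea, assuming the repo's Gram-matrix style loss (the names and the normalization by frame count here are illustrative, not taken from the repo):

```python
import numpy as np

def gram(features):
    # features: (time, channels) activation matrix from the random CNN
    f = features.reshape(-1, features.shape[-1])
    return f.T @ f / f.shape[0]

rng = np.random.default_rng(0)
# Toy stand-ins for activations of three style excerpts.
style_feats = [rng.standard_normal((430, 64)).astype(np.float32)
               for _ in range(3)]

# "Mean style": average the per-example Gram matrices, then use the result
# as style_gram in the style loss in place of a single example's Gram.
mean_gram = np.mean([gram(f) for f in style_feats], axis=0)
```

Whether averaging Grams yields a perceptually meaningful "mean" depends on how similar the excerpts' textures are, which seems to be exactly the question being asked.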
You added this 3 years ago, and I am just now finding it. I have been searching for an implementation of neural style that treats music as the images, in this case waveforms. This is amazing; have you built more upon this? Thanks for this repo.
In your blog you wrote that 1D convolutions work better than 2D ones, but this TensorFlow version uses Conv2D rather than Conv1D. Why is that? Any reason, or am I missing something?
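It may help to note that the two are equivalent here: with a height-1 input of shape (batch, 1, time, channels) and a 1×W kernel, a 2D convolution reduces to a 1D convolution along the time axis. A sketch in TF 2 eager mode (shapes chosen arbitrarily for illustration):

```python
import numpy as np
import tensorflow as tf

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 1, 100, 8)).astype(np.float32)   # (batch, 1, time, ch)
k = rng.standard_normal((1, 11, 8, 16)).astype(np.float32)   # (1, width, in, out)

# The repo's "2D" convolution: a height-1 kernel over a height-1 input...
y2d = tf.nn.conv2d(x, k, strides=[1, 1, 1, 1], padding='VALID')

# ...matches a 1D convolution along time with the same weights.
y1d = tf.nn.conv1d(x[:, 0], k[0], stride=1, padding='VALID')

assert np.allclose(y2d.numpy()[:, 0], y1d.numpy(), atol=1e-4)
```

So using Conv2D on a (1, 1, N_SAMPLES, N_CHANNELS) tensor is effectively a 1D convolution; only the tensor layout differs.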
hello Dmitry,
A quick question: how do I produce longer output files with this approach? Do I necessarily have to provide longer inputs, or is there another way?
Thank you very much for sharing your results
Giancarlo
Though I fully trust Dmitry and believe his claim that a random CNN is as good as a pretrained net at detecting and extracting texture features (the "style"), I would really appreciate the possibility of testing a pretrained net for extracting the "content" features.
While experimenting with this lovely software, I found that its ability to discriminate the content structure in "content" sound files does not appear as accurate as in the examples provided elsewhere for the "image style transfer" case. In particular, it seems that too much of the style still remains in the content, and this is perhaps the cause of the high dominance of some audio files when combined with others.
I noted that the best combinations (i.e., where the "content" audio imposes only its structure and the "style" audio enforces its own texture) are produced when the spectra of the two audios share most of their frequencies, but the "style" has less structure, or, in other words, less evident "beats". This would correspond, in images, to the "style" image having mostly the same spectrum as the "content" one, but featuring weaker and shorter edges. The output audio, in this case, resembles an "envelope" taken from the "content" audio modulating the amplitude of the "style" audio.
On the other hand, when the "style" audio lies on a mostly different region of the frequency spectrum (e.g. higher frequencies) with respect to the "content", then the two audios get mixed (their spectra appear to be merged) and both are almost equally present in the output, producing in most cases very confusing output.
I can provide some examples, but I guess anyone can figure out what I'm trying to explain, by testing on the available audio samples.
Looking at the results produced by applying style transfer to images, I would expect a different behavior, where the style (i.e. the texture) of the "style" image almost completely substitutes the texture of the "content" image. I suspect that some more investigation might be needed into the selection of the most suitable net for content feature extraction, and therefore I would love some hints on how to load and use a pretrained network.
Sorry for the long message.
In read_audio_spectum, in the 4th cell:
S = np.log1p(np.abs(S[:,:430]))
What's the purpose of the constant 430? Much thx!
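A guess at the answer, worth checking against the notebook: 430 crops the STFT to a fixed number of frames so the content and style spectrograms share one shape, and with librosa's defaults (sr=22050, hop=512) that happens to be roughly 10 seconds of audio. A sketch, with hypothetical `S_content` / `S_style` stand-ins:

```python
import numpy as np

sr, hop = 22050, 512        # librosa defaults assumed by this guess
n_frames = 430
seconds = n_frames * hop / sr
assert abs(seconds - 9.98) < 0.01   # ~10 s of audio

# A generic alternative: crop both spectrograms to their common length
# instead of hard-coding 430.
S_content = np.random.rand(1025, 500).astype(np.float32)
S_style = np.random.rand(1025, 430).astype(np.float32)
n = min(S_content.shape[1], S_style.shape[1])
S_content, S_style = S_content[:, :n], S_style[:, :n]
```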
@DmitryUlyanov
Hi Dmitry :)
I have encountered some kind of error when trying to transfer style from one song to another.
After running a few cells, the screen goes black and I cannot use the keyboard or mouse, and can't enter tty mode; it looks like a regular system crash.
I'm using Ubuntu 16.04 with tensorflow-gpu (GeForce 760gti, 2 GB VRAM).
Is this problem caused by using the GPU version?
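One possible cause, offered as an assumption: by default TensorFlow claims nearly all GPU memory at startup, and on a 2 GB card that also drives the display this can starve the desktop. A TF 1.x session-config fragment (matching the notebook's API) that limits this:

```python
import tensorflow as tf

# Let TensorFlow grow GPU memory on demand instead of grabbing it all.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
# Alternatively, cap the fraction of VRAM TensorFlow may claim:
# config.gpu_options.per_process_gpu_memory_fraction = 0.6

# Then pass it to every session the notebook opens:
# with tf.Session(config=config) as sess:
#     ...
```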
Hello @DmitryUlyanov,
Your GitHub page says it should be run in Jupyter; can I run it from the terminal on Ubuntu, and how?
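One common route (the notebook filename below is an assumption; substitute the repo's actual .ipynb name if it differs) is to convert the notebook to a plain Python script with nbconvert and run it directly:

```shell
# Convert the notebook to a .py script, then run it from the terminal.
jupyter nbconvert --to script neural-style-audio-tf.ipynb
python neural-style-audio-tf.py
```

Note that cells using IPython magics (e.g. %%time) may need those lines removed or adapted after conversion.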