After further inspection of the original research paper, I believe the LSTM model it describes differs from the implementation in GuitarLSTM. The paper describes two hidden layers, which I initially took to be two Conv1D layers, but I now think they meant two stacked LSTM layers. In that case, the model would look like this:
```python
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.backend import clear_session

input_size = 1
hidden_units = 36   # example value; the paper doesn't pin down an exact size here
learning_rate = 0.0005
max_epochs = 500

# Create Sequential Model ###########################################
clear_session()
model = Sequential()
model.add(LSTM(hidden_units, return_sequences=True, input_shape=(input_size, 1)))
model.add(LSTM(hidden_units))
model.add(Dense(1, activation=None))
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate), loss='mse')
```
That would explain why the model had to be trained for 500 epochs, and why training took 20 hours on a high-end GPU. This model takes in a single audio sample at a time, as opposed to an "input_size"-length window.
Either way, the Conv1D layers seem to offer a significant speed advantage during training: they allow a large input_size while reducing the number of parameters feeding into the LSTM layer. This configuration has yet to be tested in a real-time application.
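For comparison, here is a rough sketch of the Conv1D + LSTM idea (illustrative only, not the exact GuitarLSTM code; the filter counts, kernel size, and strides are placeholder values). The strided Conv1D layers downsample the input window before it reaches the LSTM, which is where the parameter and speed savings come from:

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv1D, LSTM, Dense

input_size = 100   # example window length (placeholder)
hidden_units = 36  # example value (placeholder)

model = Sequential()
# Strided convolutions shrink the 100-sample window before the LSTM,
# so the LSTM processes a much shorter sequence.
model.add(Conv1D(16, 12, strides=4, padding='same', input_shape=(input_size, 1)))
model.add(Conv1D(16, 12, strides=4, padding='same'))
model.add(LSTM(hidden_units))
model.add(Dense(1, activation=None))
```

With `padding='same'` and `strides=4`, the 100-step sequence is reduced to 25 and then 7 steps before the LSTM sees it.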
The paper also describes adding the input to the output, so that the model only has to learn the difference between the dry and processed signal. This technique is not implemented in GuitarLSTM.
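A skip connection like this can't be expressed with the Sequential API, but the functional API handles it easily. Below is a minimal sketch of the idea (my own illustration, not code from the paper or GuitarLSTM; `hidden_units` is a placeholder value):

```python
from tensorflow.keras import Model, Input
from tensorflow.keras.layers import LSTM, Dense, Add, Flatten

input_size = 1
hidden_units = 36  # example value (placeholder)

inp = Input(shape=(input_size, 1))
x = LSTM(hidden_units, return_sequences=True)(inp)
x = LSTM(hidden_units)(x)
x = Dense(1, activation=None)(x)
# Skip connection: add the input sample to the network output, so the
# model only has to learn the difference between input and target.
out = Add()([x, Flatten()(inp)])
model = Model(inputs=inp, outputs=out)
```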
Experimentation with the model stack would be beneficial for finding an optimal approach. Feel free to share any findings here, or in the Facebook group (intended for discussion of model training).
Link to the Facebook Community Group:
https://www.facebook.com/groups/674031436584335/?ref=pages_profile_groups_tab&source_id=102883764967858
Update: Keras uses stateful=False for LSTM layers by default, which means the hidden and cell states aren't carried over to the next step. Based on the statements in the paper, it appears the authors use a stateful LSTM, which is another difference.
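A stateful version of the two-layer stack might look like the sketch below (my own illustration; `hidden_units` is a placeholder). Note that `stateful=True` requires a fixed batch size via `batch_input_shape`, and the states must be reset manually, e.g. between audio files:

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import LSTM, Dense

hidden_units = 36  # example value (placeholder)

model = Sequential()
# stateful=True: hidden and cell states are carried over from one batch
# to the next instead of being reset; requires a fixed batch size.
model.add(LSTM(hidden_units, stateful=True, return_sequences=True,
               batch_input_shape=(1, 1, 1)))
model.add(LSTM(hidden_units, stateful=True))
model.add(Dense(1, activation=None))

# Reset the carried-over states when starting a new, unrelated signal:
model.reset_states()
```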