sleepwalking / rocaloid-old

OBSOLETED! Moved to http://github.com/Rocaloid

License: GNU General Public License v3.0

C 77.58% Objective-C 1.31% IDL 7.53% C++ 13.51% Shell 0.06%

rocaloid-old's Issues

How about providing instructions for compilation & installation?

Many people, like @tuxzz, do not know how to install Rocaloid or QTau.

There are so many components that installation is tough work.

Would you mind providing instructions for compilation & installation?

Designing the Rocaloid Engine 3

As I wrote in README.md before, the next version will be totally rewritten again, like the evolution from Rocaloid1 to Rocaloid1.6.

The RSC, CVS, and CDT formats have already reached version 2.x, so their version numbers no longer match the synthesizer's.
Also considering the significant change in the synthesis algorithm (TDPSM -> FECSOLA), I've decided to name the next generation "Rocaloid Engine 3" instead of 2, along with CVE 3.

Here I have to restate the definition and relations of "Rocaloid Engine":

Rocaloid Engine includes Cybervoice Engine (CVE) and CVS Generator, and provides I/O for RSC (Rocaloid SCript) and CVS (CyberVoice Script).
CVE is the synthesis engine of the Rocaloid Project.
CVS is the file format for storing phonetic information, and can be synthesized directly by CVE. CVS contains much more detailed information (e.g. the duration of each phoneme) than RSC.
RSC is the file format for the note editor. It can only be synthesized after being transformed into CVS by CVS Generator, using CDT.
CVS Generator is a subprogram of RSCCommon (which includes CVS Gen and the I/O of RSC and vsqx). It transforms RSC into CVS using CDT so that the RSC file can be synthesized.
CDT is the dictionary used by CVS Generator. It contains phonetic definitions, which are data derived from many phonetic experiments.

RSC will not be included in Rocaloid Engine anymore, because RSC is strongly related to the note editor, and dealing with editor settings and musical notations is not the business of Rocaloid Engine.
RSC will be replaced by RVS(Rocaloid Vocal Script), which describes the general (but not in detail) information of notes and lyrics (but not phonemes). CVS Generator will be responsible for transforming RVS into CVS. The transformation from RSC (or .vsqx, .vsq, .ust, .nn, etc.) to RVS should be simple (does not require professional phonetics knowledge).

Altogether, the major components and formats in RE3(Rocaloid Engine 3) will be:

CVE 3
CGTOR 3 (Cvs GeneraTOR 3)
CVS 3
RVS 3
CDT 3

Additionally, CVS 3 and RVS 3 will be stored in binary instead of text. Formant data will be included in CVS and RVS, which would greatly increase the file size and slow down I/O if stored as text (a CVS 3 text file containing one song would be approximately 10 MB).
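
As a rough illustration of the RVS -> CGTOR -> CVS flow described above, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the record fields, the toy CDT table, and the rule of splitting a note's duration by fixed fractions are not taken from the actual RE3 formats, which are binary and carry formant data.

```python
from dataclasses import dataclass

@dataclass
class RvsNote:
    """Hypothetical RVS record: note-level info, lyrics but no phonemes."""
    lyric: str
    pitch: str
    start: float      # seconds
    duration: float   # seconds

@dataclass
class CvsPhoneme:
    """Hypothetical CVS record: phoneme-level info with explicit durations."""
    symbol: str
    pitch: str
    start: float
    duration: float

# Hypothetical CDT: lyric -> list of (phoneme symbol, fraction of the note).
TOY_CDT = {
    "ba": [("b", 0.2), ("a", 0.8)],
    "a": [("a", 1.0)],
}

def generate_cvs(notes, cdt):
    """Expand note-level RVS data into phoneme-level CVS data using the CDT."""
    phonemes = []
    for note in notes:
        t = note.start
        for symbol, fraction in cdt[note.lyric]:
            length = note.duration * fraction
            phonemes.append(CvsPhoneme(symbol, note.pitch, t, length))
            t += length
    return phonemes

cvs = generate_cvs([RvsNote("ba", "C3", 0.0, 1.0)], TOY_CDT)
for p in cvs:
    print(p)
```

The point of the split is visible even in this toy: the editor side only needs to produce `RvsNote`-like data, while all phonetic knowledge stays in the dictionary consumed by the generator.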

Specifications about CVDBStudio

CVDBStudio is the tool for making sound databases for Rocaloid 3; it replaces TDPSMStudio.
The reason for building this tool, rather than writing plug-ins for wave editors such as Audacity, is that wave editors are not convenient for batch processing.

Generally what CVDBStudio does is to turn bunches of .wav into .cvdb, like:

a_C3.wav -> a_C3.cvdb
i_D#4.wav -> i_D#4.cvdb
...
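
The naming convention above (symbol, underscore, pitch) is enough to plan a batch run. The following Python sketch only derives the output names; the function name `plan_conversion` is hypothetical, and the actual .cvdb conversion is not shown because its format is not described here.

```python
from pathlib import PurePath

def plan_conversion(wav_names):
    """Map 'symbol_pitch.wav' names to (symbol, pitch, output .cvdb name).

    Files that are not .wav or that lack the 'symbol_pitch' pattern are skipped.
    """
    plan = []
    for name in wav_names:
        path = PurePath(name)
        if path.suffix.lower() != ".wav":
            continue
        symbol, sep, pitch = path.stem.partition("_")
        if not sep:
            continue
        plan.append((symbol, pitch, str(path.with_suffix(".cvdb"))))
    return plan

for entry in plan_conversion(["a_C3.wav", "i_D#4.wav", "notes.txt"]):
    print(entry)
```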

More specifically, the three major jobs(steps) CVDBStudio does are:

1. Refine & adjust the raw .wav data (output is also .wav).
2. Identify pulses and formants in .wav and output .cvdb.
3. Check & adjust the .cvdb produced in step 2.

CVDBStudio will be very similar to TDPSMStudio, which also has three similar functions.
Like TDPSMStudio, CVDBStudio also has three modes for the above steps:

Wave Editor Mode
CVDB Converter Mode
CVDB Quality Control Mode

To offer a direct view of what CVDBStudio should look like, here is a screenshot of TDPSMStudio, and CVDBStudio generally copies its UI.

Wave Display Box

The center is a picture box which displays the waveform. Scroll the mouse wheel to zoom in/out; the zoom is centered on the mouse position.
The red contour shows the magnitude envelope. The black line won't be included in CVDBStudio.
The green wave is the moving average of the input. It will be replaced by the low-passed wave of the input (which is used to identify the pulses).
The pink straight line shows the VOT of the input (delayed by a certain number of samples). It won't be delayed in CVDBStudio.

The red contour, pink line, and green wave only appear in CVDB Converter Mode.

Lower Panel

Amplitude: vertical scale of the wave box.
Completeness: the percentage of completed conversion.
Consonant: if the input is a diphone, consonant should be checked.
Data Length: how many samples will be contained in .cvdb.
Balance Wave: whether to cancel out vertical offset of the wave or not.
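
As a small illustration of what "Balance Wave" could mean in practice, here is a Python sketch that removes the vertical offset by subtracting the mean of the samples. Treating the offset as the plain mean is an assumption; TDPSMStudio may estimate it differently.

```python
def balance_wave(samples):
    """Cancel the vertical (DC) offset by subtracting the mean sample value."""
    if not samples:
        return []
    offset = sum(samples) / len(samples)
    return [s - offset for s in samples]

print(balance_wave([1.0, 3.0]))
```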

Upper Tool Strip

Symbol & Pitch specifies which .wav to open(in Wave Editor & CVDB Converter Mode) or .cvdb to open(in CVDB Quality Control Mode). (Removed in CVDBStudio)
Prev & Next: loads the previous/next pitch. We had a separate file for every pitch, so you can see how time-consuming it used to be to make a sound db...
Analyze Frame: Convert .wav to .cvdb. Will be renamed to "Convert" in CVDBStudio.
Analyze Frame To All: Batch conversion. Click once and all pitches are converted (but some may be converted wrongly, which is why we have CVDB Quality Control Mode). Will be removed in CVDBStudio.
View All: Browse through all pitches of the given symbol, from C2 to C5, at 2 files per second. It does not modify anything.
Stop: abort the above process.
Adjust To All: Batch process of .wav. Only enabled in Wave Editor Mode.

Wave Editor Mode

In Wave Editor Mode you can use the left/right mouse buttons to select a part of the wave and then modify it.
Only one sound track, of course.
The tool strip on the left will be enabled; it holds the effects that can be applied to the wave.

Additional Features in CVDBStudio

Symbol & Pitch text boxes in the upper tool strip will be removed. Instead there will be a floating panel holding a file list. Just drag files in as input. The panel should also contain a clear button. Prev & Next button will act on the file list instead of pitches.
There will be another floating panel for marking formants (F1, F2, F3). The panel is quite like formant tester: a spectrum and F1, F2, F3 text boxes and sliders, but no S1, S2, S3. They will be automatically identified by program.

This issue still doesn't cover all the details, but you can refer to the VB.NET source code of TDPSMStudio and then:

Translate it into Qt & C++ & C Interfaces
Make amendments
Add two panels.

https://github.com/Sleepwalking/Rocaloid/tree/Rocaloid-1.6.0-Core-ver.-%28VB.Net%29/RocaloidDevelopSuit/TDPSMStudio

Rocaloid3 Development Halted

The PitchMixer of CVE3 shows serious problems, causing some of the synthesized vocals to sound vague.
http://bbs.ivocaloid.com/thread-124636-1-1.html

I'm designing an improved algorithm for CVE. Development will resume when the research is done.

I need someone to write some Qt applications to test the new algorithm.

Well, it's an algorithm for formant modulation, which can be used to change the pronunciation of waves in the sound db, and I think it will be extremely useful for CVE2.
I call it FECSOLA (Formant Envelope Coefficient Shift and OverLap Add). Briefly, it works by modifying the spectral envelope with OLA.

For example, you have the wave of "a", and you know its formant frequencies. Just put it into FECSOLA and tell it the new formant frequencies, and the modified wave comes out (which might be transformed into "i" / "o" / "e").

Obviously this algorithm can be used for correcting Miku's poor Chinese pronunciation.

I'm not going to use FECSOLA to build a new db, because that takes much more effort (there is a lot of work to do for a new db) and increases the size of the db. Instead I'm going to embed it into CVE2 and do the modification in real time (driven by some given parameters).

So the problem is that we have to figure out:

Which mispronounced symbols can be corrected.
The best parameters for correcting these symbols.

Theoretical approaches such as observing & analyzing the spectra would not be enough, since we want the best output quality. So the only way is to try those symbols and formant parameters in a large number of real tests...

The tester would be really simple: nothing more than a few sliders (controlling F0, F1, F2, F3), picture boxes (showing the spectrum before and after modification), and several buttons to load and play the .wav files.

I have learned neither Qt nor C++... So I would be glad if someone could help me make this application. The algorithm has a C implementation, which is easy to port to C++.

I'll post the details and the code of FECSOLA below if someone replies to this post.

Consider using MusicXML format for exchange?

I have read that Cadencii uses MusicXML to exchange music data with other applications.

I wonder, can Rocaloid accept MusicXML as an input file? That would make it convenient to extend its features.

As far as I know, many programs support MusicXML, such as MuseScore, LilyPond, etc., so one can easily export a MusicXML file from another app and import it here.

It seems that Rocaloid accepts an ini file, which is obviously hard to extend. We could use a modified version of MusicXML (for example, with additional information such as phonetic symbols). Since it is XML, adding an attribute will not affect the existing file format, and the file remains compatible with other apps.
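
To make the idea concrete, here is a minimal Python sketch that reads one MusicXML-style note carrying an extra attribute. The `note`, `pitch`, `step`, `octave`, `duration`, `lyric`, and `text` elements are standard MusicXML; the `phoneme` attribute and its value are hypothetical, invented here for illustration. Readers that ignore unknown attributes would still get the pitch and lyric, though a strict schema validator may reject the extra attribute.

```python
import xml.etree.ElementTree as ET

# One MusicXML-style note. The "phoneme" attribute is NOT part of MusicXML;
# it is a hypothetical extension carrying a placeholder phonetic transcription.
SNIPPET = """<note>
  <pitch><step>C</step><octave>4</octave></pitch>
  <duration>4</duration>
  <lyric phoneme="l a"><text>la</text></lyric>
</note>"""

def read_note(xml_text):
    """Extract pitch, lyric text, and the optional extension attribute."""
    note = ET.fromstring(xml_text)
    lyric = note.find("lyric")
    return {
        "step": note.findtext("pitch/step"),
        "octave": int(note.findtext("pitch/octave")),
        "lyric": lyric.findtext("text") if lyric is not None else None,
        "phoneme": lyric.get("phoneme") if lyric is not None else None,
    }

print(read_note(SNIPPET))
```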

P.S. I do not expect this feature to be implemented soon. I just hope you keep in mind that it may become a feature in the future, so let us keep this issue open for a long time.

Hint: VOCALOID 3 also uses XML (but not MusicXML) instead of the modified binary MIDI format used in VOCALOID 2 (the MIDI format is so tightly packed that hardly any feature can be appended to it). This shows the advantage of XML.

Phonetic symbol issue

I have had a glance at your dictionary and found that you are using custom phonetic symbols. I suggest you use X-SAMPA phonetic symbols, which are an international standard.
In fact, the phonetic symbols that VOCALOID uses are X-SAMPA.
However, there are exceptions. For example, some bugs in the Luo Tianyi voicebank result in inconsistencies between the phonetic symbols and the actual pronunciation. (For instance, Pinyin bo should be pronounced p uo but is actually p o, and for Pinyin er Luo Tianyi's pronunciation is Ar while the symbol is written as @`.)
Setting aside the various bugs in the Luo Tianyi voicebank caused by rushing to meet the deadline, I recommend that Rocaloid use the standard X-SAMPA phonetic symbols.
If you are willing, I can provide a conversion table from Pinyin to standard X-SAMPA: m13253/pinyin2xsampa. (You may want to use a slightly modified version of X-SAMPA for easier implementation.)
I would also like to participate in porting to Linux. (Of course, not until it is rewritten in C++.)

About future development.

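Um... I know this is not proper practice. The issue tracker is for tracking issues, not for chatting like a forum... But I don't have permission to post on the iVocaloid forum, and I was afraid that an email sent directly to the developer would be caught by a spam filter... so I am posting here.

First, I think I can count as half a programmer. I have been learning programming for about two years. I know C/C++/Python/Go/JavaScript, and I am fairly familiar with Linux and the various open-source software ecosystems... That's my self-introduction.

Next... Sleepwalking-san, I think this project of yours is great!!! I have long had the idea of tuning Chinese songs with Hatsune Miku, but I could not do anything because I know nothing about phonetics, and I have no idea how to do reverse engineering, so I could not deal with Vocaloid either... I am also usually busy and have no large blocks of free time... So now I hope to take part in this project. I saw on the forum that you said you are weak at GUI work and C++... I happen to be a bit stronger in that area and can help with front-end development... Of course, I don't think I can do the back end at all (laughs).

For now I have these suggestions:

1. I suggest still using C++ for development... C++ has good encapsulation, the language is relatively intuitive, and it is convenient for front-end development... Both development efficiency and runtime efficiency are relatively high... Actually, I think using Python for the front end would be the most convenient... but for cross-platform reasons, deploying a Python runtime on Windows is somewhat painful...

2. I suggest not using wxWidgets for the GUI; switch to Qt... Qt is much easier to learn and understand than wxWidgets... You said you were scared off by MFC's Hello World when learning C++... In fact wxWidgets and MFC have the same style... and everyone knows how inhumanely complex the MFC API is... Compared with wxWidgets, Qt is much easier to learn... and Qt is also cross-platform... Qt is also the GUI toolkit I am most familiar with...

3. About the open-source license... I think it is necessary to remind you that GPLv3 permits commercial use... The GPL only forbids companies from taking the code to make closed-source software... If a company modifies the code and keeps it open source, or even sells it, that does not violate the GPL as long as it provides the source code... But the GPL allows redistribution, which means that when a company sells GPL software, the users who bought it can copy it to others or share it online, and that is still completely legal... In other words, the GPL does not disallow commercial use... it just makes commercial use of the software pointless... On the other hand, prohibiting commercial use is not something the open-source spirit encourages... If you really want to prohibit commercial use, please don't use the GPL... consider a CC license instead...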

<one line to give the program's name and a brief idea of what it does.>
Copyright (C) <year>  <name of author>

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program.  If not, see <http://www.gnu.org/licenses/>.

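Well... that's basically it... I look forward to working together (although I will only have time to write code during the summer break, after final exams _(:3L)_)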

What features & bug fixes should be included in the rewritten CVE?

We've decided to write the entire project in C++. The basic algorithm won't change; it is still TDPSM. But CVE currently has a few bugs, and we'd better fix them in the rewritten version. Some functions are incomplete, such as the GEN factor and the factor conversion from vsqx to rsc.

There are two bugs in CVE 1.6:

The bug of period prediction.

When a transition takes place, for example a_C3 -> o_C3, the transition is done by stretching and mixing hundreds of periods of a_C3 & o_C3 at different ratios. Let's suppose the instantaneous transition ratio is 0.5, so ideally the output should be half a_C3 and half o_C3. Then there comes a problem: the transition ratio at the beginning of a period is not the same as the transition ratio at its end. Why? If it were the same, the end of the period could not match the start of the next period.
To show this mathematically, let's suppose the instantaneous transition ratio is given by:

TR(t) = t * 0.5 (When 0 <= t <= 2)

So this is a transition of 2 seconds. Suppose a period begins at 1 sec and lasts 0.01 sec. The TR at its beginning should be exactly 0.5, and at its end the TR should be (1 + 0.01) * 0.5 = 0.505.
Period Prediction is the method used in CVE 1.6 to solve the problem of different TRs at the beginning and end of periods. In fact I didn't realize this problem when I designed CVE 1.6, and Period Prediction was added after I finished the first version of CVE 1.6...
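
To show why the ratio has to change within a period, here is a minimal Python sketch that mixes one period of each source while ramping the ratio from its start value to its end value. The linear ramp, the function name `mix_period`, and the assumption that both periods were already stretched to the same length are illustrative choices, not the CVE 1.6 implementation.

```python
def mix_period(period_a, period_b, tr_start, tr_end):
    """Crossfade two equal-length periods with a ratio ramping tr_start -> tr_end.

    Sample i uses ratio tr_start + (tr_end - tr_start) * i / n, so the next
    period, which starts at tr_end, continues the ramp without a jump.
    """
    if len(period_a) != len(period_b):
        raise ValueError("periods must be stretched to the same length first")
    n = len(period_a)
    mixed = []
    for i in range(n):
        ratio = tr_start + (tr_end - tr_start) * i / n
        mixed.append((1.0 - ratio) * period_a[i] + ratio * period_b[i])
    return mixed

# Exaggerated ramp so the effect is visible on four samples.
print(mix_period([1.0] * 4, [0.0] * 4, 0.0, 1.0))
```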

Here is the code (in PitchPreSynthesizer):

Dim TR1 As Double, TR2 As Double
TR1 = PCalc.TransitionRatio
SetStartMixRatio(TR1)
PCalc.PitchCalc(Time + 1 / PCalc.GetFreqAt(Time))
TR2 = PCalc.TransitionRatio
If TR2 > 1 Then TR2 = 1
If TR2 < TR1 Then TR2 = 1

As you can see, the EndRatio (TR2) comes from PCalc.PitchCalc(Time + 1 / PCalc.GetFreqAt(Time)), which simply adds the length of the current period to the current time. But we don't know the exact length of the current period! So there would be an error of around 5 samples.

(I guess this is the slightest bug... 5 samples... It may result in almost no change in the output wave...)
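
For readers who do not read VB.NET, the same logic can be sketched in Python. The callables `freq_at` and `ratio_at` are hypothetical stand-ins for PCalc.GetFreqAt and for reading PCalc.TransitionRatio after PCalc.PitchCalc; the clamping mirrors the two If statements above.

```python
def predict_end_ratio(time, freq_at, ratio_at):
    """Return (TR1, TR2) for the period starting at `time`.

    The end of the period is estimated as time + 1 / f(time); this is the
    approximation criticized above, since the true period length is unknown.
    """
    tr1 = ratio_at(time)
    predicted_end = time + 1.0 / freq_at(time)
    tr2 = ratio_at(predicted_end)
    if tr2 > 1.0:
        tr2 = 1.0
    if tr2 < tr1:
        tr2 = 1.0
    return tr1, tr2

# The worked example from the text: TR(t) = 0.5 * t, period of 0.01 s at t = 1 s.
tr1, tr2 = predict_end_ratio(1.0, lambda t: 100.0, lambda t: 0.5 * t)
print(tr1, tr2)
```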

The bug of Pitch Calculator.

Here's an example that shows how Pitch Calculator works:
When CVE shifts the pitch from C3 to E3 and back to C3, the Pitch Calculator provides these transition instructions as time increases:

C3 -> C#3
C#3 -> D3
D3 -> D#3
D#3 -> E3
E3 -> D#3
D#3 -> D3
D3 -> C#3
C#3 -> C3

Sounds perfect, but what would happen if we shrink the total time of pitch change to 0.2s? There are 8 transitions in all, so each transition can only take 0.025s, which is the length of 3 periods under C3... Such short transitions would cause a sharp decrease in quality.
So I set up a limit in PCalc:

TimeResolution = 0.03

Then PCalc skips some of the transitions like this:

C3 -> D3
D3 -> E3
E3 -> D3
D3 -> C3
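
The skipping step can be sketched as follows. This is a guess at the behavior, not the PCalc code: it assumes the pitch nodes are evenly spaced in time and simply keeps every k-th node, where k is the smallest stride that makes each transition at least TimeResolution long. The name `limit_transitions` is hypothetical.

```python
import math

def limit_transitions(nodes, total_time, time_resolution=0.03):
    """Drop intermediate pitch nodes so no transition is shorter than the limit.

    `nodes` are assumed to be evenly spaced over `total_time`.
    """
    count = len(nodes) - 1          # number of transitions
    if count <= 0:
        return list(nodes)
    per_transition = total_time / count
    stride = max(1, math.ceil(time_resolution / per_transition))
    kept = list(nodes[::stride])
    if count % stride != 0:         # always keep the final target
        kept.append(nodes[-1])
    return kept

path = ["C3", "C#3", "D3", "D#3", "E3", "D#3", "D3", "C#3", "C3"]
print(limit_transitions(path, 0.2))
```

With the 0.2 s example, each of the 8 transitions would take 0.025 s, so the stride becomes 2 and the remaining nodes give the four transitions listed above.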

Then the bug comes...
What would happen if you suddenly change the pitch from D3 -> a bit lower than E3 -> A2?

D3 -> D#3
D#3 -> E3
E3 -> D3
D3 -> B2
B2 -> A2

Pay attention to the second and third transitions above. At the moment the second transition finishes, the output is in a state between D#3 and E3, and in the next moment it becomes a state between E3 and D3... D#3 and D3 come from different files, so a "boom" (an audible discontinuity) may occur at the junction of the two neighboring periods...

My idea is to rewrite the PCalc. When a segment is loaded, send its FreqList to the PCalc. The PCalc should calculate all pitch transitions and store them in an array before being called by the synthesizers.
The pitch transitions should follow two rules:

The [start of the current transition] should be the same as the [end of the last transition].
The [end of the current transition] should have only a small difference between the lower pitch and the higher pitch, because a large difference in pitch decreases the quality of synthesis.
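
The first rule can be checked mechanically once all transitions are stored in an array. In this Python sketch a transition is a hypothetical tuple (source pitch, target pitch, start ratio, end ratio), and the state at any moment is the weighted mix of the two source files; the representation is an assumption made for illustration.

```python
def mix_at(transition, at_end):
    """Return {pitch file: weight} at the start or end of a transition."""
    source, target, ratio_start, ratio_end = transition
    ratio = ratio_end if at_end else ratio_start
    mix = {}
    for name, weight in ((source, 1.0 - ratio), (target, ratio)):
        if weight > 1e-9:
            mix[name] = mix.get(name, 0.0) + weight
    return mix

def find_discontinuities(transitions, tolerance=1e-6):
    """Indices i where the end of transition i differs from the start of i + 1."""
    bad = []
    for i in range(len(transitions) - 1):
        end_mix = mix_at(transitions[i], True)
        next_mix = mix_at(transitions[i + 1], False)
        names = set(end_mix) | set(next_mix)
        if any(abs(end_mix.get(n, 0.0) - next_mix.get(n, 0.0)) > tolerance
               for n in names):
            bad.append(i)
    return bad

# The buggy plan from the text: the pitch turns around a bit below E3.
plan = [
    ("D3", "D#3", 0.0, 1.0),
    ("D#3", "E3", 0.0, 0.8),   # stops between D#3 and E3
    ("E3", "D3", 0.2, 1.0),    # resumes between E3 and D3
    ("D3", "B2", 0.0, 1.0),
    ("B2", "A2", 0.0, 1.0),
]
print(find_discontinuities(plan))
```

Here the only flagged junction is the one between the second and third transitions, which matches the "boom" described in the bug report.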

Add a build system for building, testing, and packaging

I would suggest using CMake.