lingdong- / rrpl

Describing Chinese Characters with Recursive Radical Packing Language (RRPL)

License: MIT License

JavaScript 72.25% HTML 22.60% Python 5.15%

chinese radicals markup-language typography font cjk-characters

rrpl's Introduction

Recursive Radical Packing Language

Recursive Radical Packing Language (RRPL) is a proposal for a method of describing arbitrary Chinese characters concisely while retaining their structural information. Potential fields for usage include font design and machine learning. In RRPL, each Chinese character is described as a short string of numbers, symbols, and references to other characters. Its syntax is inspired by markup languages such as LaTeX, as well as the traditional "米" grids used for calligraphy practice.

5000+ Traditional Chinese Characters and radicals are currently described using this language. You can download a .json file containing all of them (and unicode mapping) here: dist/min-trad.json

Check out Chinese character & radical visualizations made with RRPL here and here.

Syntax

Each Chinese character is described as a combination of components. These components can be other characters or radicals, as well as building blocks, which define the simplest shapes that make up every component. Combination can be applied recursively to describe ever more complex glyphs.

Below is an overview of this syntax. You can also check out the Interactive Demo to play with it yourself.

Building Blocks

A building block is a string of the alphabet {0, 1, 2, 3, 4, 5, 6, 7, 8}, in which the presence of a number indicates a corresponding stroke to be drawn on a "米" grid:

 1 2 3
  \|/
8 -+- 4
  /|\
 7 6 5

0 indicates that no stroke should be drawn in this block.
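As a rough illustration of the grid above, the sketch below turns a building block into line segments. It assumes that every stroke runs from the centre of the block to the matching edge or corner of a unit square; the helper name `blockToSegments` and the segment layout are hypothetical and not taken from rrpl_parser.js, which may draw strokes differently.

```javascript
// Endpoints of the eight strokes on a unit "米" grid (origin at top-left, y pointing down).
// Assumption: every stroke runs from the centre of the block to its edge or corner.
const STROKE_ENDPOINTS = {
  1: [0.0, 0.0], // upper-left diagonal
  2: [0.5, 0.0], // up
  3: [1.0, 0.0], // upper-right diagonal
  4: [1.0, 0.5], // right
  5: [1.0, 1.0], // lower-right diagonal
  6: [0.5, 1.0], // down
  7: [0.0, 1.0], // lower-left diagonal
  8: [0.0, 0.5], // left
};

// Hypothetical helper: turn a building block such as "48" into line segments.
function blockToSegments(block) {
  const center = [0.5, 0.5];
  const segments = [];
  for (const digit of block) {
    if (!/^[0-8]$/.test(digit)) {
      throw new Error("Not a building-block symbol: " + digit);
    }
    if (digit === "0") continue; // 0 draws nothing
    segments.push([center, STROKE_ENDPOINTS[digit]]);
  }
  return segments;
}

console.log(blockToSegments("48"));
```

Under this assumption, `48` yields one segment to the right and one to the left of the centre, which together form a horizontal bar.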

Example:

Result	Code	Result	Code
	`48`		`24578`

Packing

Building blocks can be packed horizontally or vertically using the - and | symbols respectively to compose more complex glyphs. These symbols can be chained to pack more than two components, each of which receives an equal share of the room.

Example:

Result	Code	Result	Code
	`27-26-26`		`2468|24578`
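The "equal room" rule can be sketched as a function that divides a rectangle into equally sized cells. The helper name `splitRect` and the `{x, y, w, h}` rectangle shape are assumptions made for this illustration and are not necessarily what the bundled parser uses internally.

```javascript
// Hypothetical helper: split a rectangle {x, y, w, h} into `count` equal cells.
// "-" packs left to right, "|" packs top to bottom.
function splitRect(rect, operator, count) {
  const cells = [];
  for (let i = 0; i < count; i++) {
    if (operator === "-") {
      const w = rect.w / count;
      cells.push({ x: rect.x + i * w, y: rect.y, w: w, h: rect.h });
    } else if (operator === "|") {
      const h = rect.h / count;
      cells.push({ x: rect.x, y: rect.y + i * h, w: rect.w, h: h });
    } else {
      throw new Error("Unknown packing operator: " + operator);
    }
  }
  return cells;
}

// Two cells stacked vertically, as in `2468|24578`.
console.log(splitRect({ x: 0, y: 0, w: 1, h: 1 }, "|", 2));
```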

Grouping

( and ) symbols can be used to group components together so mixed horizontal and vertical packing can happen in the correct order.

Example:

Result	Code	Result	Code
	`(48|37)-(25678|27)-(37|15)`		`(46-68)|(246-268)|(24-28)`
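To show how grouping turns a flat string into a nested structure, here is a small recursive-descent sketch for reference-free descriptions. It is not the parser shipped in rrpl_parser.js: the function name `parseExpr` and the tree shape are hypothetical, and the sketch deliberately rejects strings that mix - and | at the same nesting level, since this overview does not state how such a mix would be resolved without parentheses.

```javascript
// Hypothetical parser for reference-free RRPL strings.
// Returns either a building-block string or { op, children }.
function parseExpr(src) {
  let pos = 0;

  function parseOperand() {
    if (src[pos] === "(") {
      pos++; // skip "("
      const inner = parseSequence();
      if (src[pos] !== ")") throw new Error("Expected ) at position " + pos);
      pos++; // skip ")"
      return inner;
    }
    const start = pos;
    while (pos < src.length && /[0-8]/.test(src[pos])) pos++;
    if (pos === start) throw new Error("Expected a building block at position " + pos);
    return src.slice(start, pos);
  }

  function parseSequence() {
    const children = [parseOperand()];
    let op = null;
    while (pos < src.length && (src[pos] === "-" || src[pos] === "|")) {
      if (op !== null && src[pos] !== op) {
        // Assumption of this sketch: mixed operators must be grouped explicitly.
        throw new Error("Mixed - and | need parentheses at position " + pos);
      }
      op = src[pos];
      pos++;
      children.push(parseOperand());
    }
    return op === null ? children[0] : { op: op, children: children };
  }

  const tree = parseSequence();
  if (pos !== src.length) throw new Error("Unexpected symbol at position " + pos);
  return tree;
}

console.log(JSON.stringify(parseExpr("(46-68)|(246-268)|(24-28)")));
```

For the second example above, the result is a vertical node with three children, each of which is a horizontal node holding two building blocks.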

Referencing

Other characters and radicals can be referenced directly to build a new character. The parser dumps the contents of the referenced glyph directly into the string, similar to the C/C++ #include directive. This makes it especially easy to describe more complicated Chinese characters, as most of them consist of radicals.

Example:

Result	Code	Result	Code
	`廿|468|由|(八)`		`((車|(山))-(殳))|(手)`
	`((口)-(口))|(甲)|十`		`(((木)-(缶)-(木))|(冖))|((鬯)-(彡))`
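The #include-style substitution can be sketched as a recursive string replacement. This is an illustration only: `expandRefs` is a hypothetical helper, not the `parser.preprocess` function, and the circular-reference check is an addition of the sketch. The small dictionary reuses three entries from the JSON example in the File Type section.

```javascript
// Symbols that belong to the RRPL alphabet itself; anything else is treated as a reference.
const RRPL_SYMBOLS = /^[0-8()|-]$/;

// Hypothetical helper: substitute each referenced character with its own description.
function expandRefs(description, dictionary, seen = []) {
  let out = "";
  // for...of iterates by code point, so CJK characters stay whole.
  for (const ch of description) {
    if (RRPL_SYMBOLS.test(ch)) {
      out += ch;
    } else if (seen.includes(ch)) {
      throw new Error("Circular reference: " + ch);
    } else if (!(ch in dictionary)) {
      throw new Error("Undefined reference: " + ch);
    } else {
      out += expandRefs(dictionary[ch], dictionary, seen.concat(ch));
    }
  }
  return out;
}

const dictionary = {
  "一": "48",
  "不": "(48-45678-48)|(3-26-1)",
  "丕": "不|一",
};

console.log(expandRefs(dictionary["丕"], dictionary));
// prints "(48-45678-48)|(3-26-1)|48"
```

If substitution is purely textual, as the #include comparison suggests, then wrapping a reference in parentheses, as in `(八)`, keeps the referenced glyph grouped as a single unit after expansion.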

Parser

A baseline parser is included in rrpl_parser.js, which powers this Interactive Demo. It can be used with browser-side JavaScript as well as Node.js:

//require the module: (or in html, <script src="./rrpl_parser.js"></script>)
var parser = require('./rrpl_parser.js');

//obtain an abstract syntax tree
var ast = parser.parse("(48|37)-(25678|27)-(37|15)");

//returns line segments (normalized 0.0-1.0) that can be used to render the character
var lines = parser.toLines(parser.toRects(ast));
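Normalized line segments are straightforward to draw. The sketch below converts segments to an SVG string under a loudly stated assumption: each segment is taken to be `[[x1, y1], [x2, y2]]` with coordinates in 0.0-1.0. The structure actually returned by `parser.toLines` is not documented here and may differ, so treat `segmentsToSvg` as a hypothetical helper to adapt rather than a drop-in renderer.

```javascript
// Hypothetical helper. Assumes each segment is [[x1, y1], [x2, y2]] with
// coordinates in 0.0-1.0; the data returned by parser.toLines may be shaped differently.
function segmentsToSvg(segments, size) {
  const body = segments
    .map(([[x1, y1], [x2, y2]]) =>
      `<line x1="${x1 * size}" y1="${y1 * size}" x2="${x2 * size}" y2="${y2 * size}" stroke="black"/>`)
    .join("");
  return `<svg xmlns="http://www.w3.org/2000/svg" width="${size}" height="${size}">${body}</svg>`;
}

// A horizontal bar across the middle of a 100x100 canvas.
console.log(segmentsToSvg([[[0, 0.5], [1, 0.5]]], 100));
```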

File Type

RRPL data can be stored in a JSON file, with the root object mapping Unicode characters to their respective descriptions, e.g.

{
  "一":"48",
  "丁":"468|26|27",
  "上":"246|248",
  "不":"(48-45678-48)|(3-26-1)",
  "丕":"不|一",
  "中":"(46-2468-68)|(24-2468-28)",
  "串":"中|中"
}

The references in these files are usually first expanded before rendering is attempted. This can be done in two ways. The first is using parser.preprocess(json_object) in rrpl_parser.js, while the second is using compile.js. More documentation can be found in the header comments of these files.

The JSON files can be further compressed into (and uncompressed from) a binary file around half of the size of the original using compress.js, by using a half byte to encode each symbol in the RRPL alphabet.
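The half-byte idea works because a fully expanded description uses only 13 distinct symbols (the digits 0 to 8 plus -, |, ( and )), which fit into the 16 values of four bits. The sketch below packs two symbols per byte. It is an assumption-laden illustration, not the format written by compress.js: the code assignment, the 0xF padding value, and the function names are all made up here, and the sketch ignores how the Unicode keys and entry boundaries would be stored.

```javascript
// The 13 symbols that remain once references are expanded; the index of each is its 4-bit code.
const NIBBLE_ALPHABET = "012345678-|()";
const PAD = 0xf; // otherwise unused code, marks an empty trailing half byte

// Hypothetical encoder: two RRPL symbols per byte.
function packNibbles(description) {
  const codes = Array.from(description, (ch) => {
    const code = NIBBLE_ALPHABET.indexOf(ch);
    if (code < 0) throw new Error("Not an RRPL symbol: " + ch);
    return code;
  });
  if (codes.length % 2 === 1) codes.push(PAD);
  const bytes = new Uint8Array(codes.length / 2);
  for (let i = 0; i < bytes.length; i++) {
    bytes[i] = (codes[2 * i] << 4) | codes[2 * i + 1];
  }
  return bytes;
}

// Hypothetical decoder: the inverse of packNibbles.
function unpackNibbles(bytes) {
  let out = "";
  for (const byte of bytes) {
    for (const code of [byte >> 4, byte & 0xf]) {
      if (code === PAD) continue;
      if (code >= NIBBLE_ALPHABET.length) throw new Error("Invalid code: " + code);
      out += NIBBLE_ALPHABET[code];
    }
  }
  return out;
}

const packed = packNibbles("468|26|27");
console.log(packed.length, unpackNibbles(packed));
```

In this example a nine-symbol description occupies five bytes instead of nine, which is consistent with the "around half" figure.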

Downloads

dist/min-trad.json contains RRPL descriptions of 5000+ traditional Chinese characters stored in JSON format.
dist/RRPL.ttf is a TrueType font (TTF) containing 5000+ traditional Chinese characters with glyphs generated by the default parser. Below is a screenshot of the font in macOS TextEdit.app:

Tools

Rendering

Generate a preview.html web page containing a rendering of all characters in an RRPL JSON file:

$ node render.js preview path/to/input.json

Generate a realtime.html web page where user input can be parsed and rendered interactively (characters defined in the input file are available for referencing):

$ node render.js realtime path/to/input.json

Exporting

Export a folder of SVG (Scalable Vector Graphics) renderings, one for each character in an RRPL JSON file:

$ node export_glyphs.js path/to/input.json path/to/output/folder 0

Unlike the output of render.js, these SVGs contain outlines of the glyphs rather than simple strokes. More settings, such as stroke thickness, can be tweaked in the source code of export_glyphs.js; a command-line API will come later.

To generate a TTF (TrueType) font from these SVGs, FontForge's Python library can be used (pip install fontforge). An example can be found in tools/forge_font.py.

Applications

Since RRPL reduces every Chinese character to a short string of numbers and symbols, character structure can be learned by sequential models such as Markov chains, RNNs and LSTMs without much difficulty. I've applied an RNN (recurrent neural network) to the language to hallucinate non-existent Chinese characters. Below are some characters generated after training overnight on ~1000 RRPL character descriptions, with the visuals rendered using a pix2pix model. A separate repo for that project will be created soon.
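The simplest of the sequential models mentioned above, a first-order Markov chain, can be sketched in a few lines. This is not the RNN used for the results shown; `trainBigrams` and `sample` are hypothetical helpers, and the two training strings are toy data. Note that a model this simple has no notion of balanced parentheses, so its output is not guaranteed to be a valid RRPL description.

```javascript
const START = "^";
const END = "$";

// Count how often each symbol follows another across all descriptions.
function trainBigrams(descriptions) {
  const counts = {};
  for (const description of descriptions) {
    const symbols = [START, ...description, END];
    for (let i = 0; i + 1 < symbols.length; i++) {
      const from = symbols[i];
      const to = symbols[i + 1];
      counts[from] = counts[from] || {};
      counts[from][to] = (counts[from][to] || 0) + 1;
    }
  }
  return counts;
}

// Sample one description; `random` must return a number in [0, 1).
function sample(counts, random = Math.random, maxLength = 64) {
  let current = START;
  let out = "";
  while (out.length < maxLength) {
    const options = Object.entries(counts[current]);
    const total = options.reduce((sum, [, n]) => sum + n, 0);
    let pick = random() * total;
    let next = options[options.length - 1][0];
    for (const [symbol, n] of options) {
      if (pick < n) { next = symbol; break; }
      pick -= n;
    }
    if (next === END) break;
    out += next;
    current = next;
  }
  return out;
}

const model = trainBigrams(["48", "46"]);
console.log(sample(model));
```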

Contributing

rrpl.json contains the latest, work-in-progress version. There are some 5,000 characters in there, but there are over 50,000 Chinese characters in existence! So help is very much appreciated. If you'd like to help with this project, please append new characters to the file and submit a pull request. For more info, contact me by sending an email to lingdonh[at]andrew[dot]cmu[dot]edu.

Below is a rendering of all 5000+ Chinese characters denoted using RRPL so far. Click on the image to enlarge.


rrpl's Issues

Number of Chinese character components

Hi there, this looks like an interesting project. I'm wondering if it involves coming up with a set of primitives/components/radicals/etc. that are building blocks of all chinese characters. I'm interested to know if you found a system to handle every character, or if there are still new characters you encounter which throw the framework for a loop -- that is, the framework wasn't able to account for it, for some new stroke or character component of some sort.

I'd also be interested to know of a list of such components if available. I'm not sure I see them included in the data directory.

Thank you.

Were you inspired by Zhu Bangfu, creator of Cangjie input method?

His ideas here: http://www.cbflabs.com/book/gif_cg/gif_cg/cg2.htm sound exactly like what you are doing!

Question on distance metric

greetings, saw your very interesting 'cjk-morph' sketch on glitch - curious as to what distance metric you are using for characters in rrpl?

If you can parse a glyph visually to generate a code string

I'm wondering if you can go from character -> code string automatically. I'm not sure exactly what the parser does, but I think it's probably parsing the code string rather than visually parsing a Unicode glyph and outputting some sort of tree of strokes, which you would then somehow use to automatically generate the appropriate/approximate codes.

If you're not automatically doing it, I'd be interested to know how you've done 5,000+ glyphs already! It would seem only then that you must have done them by hand! That would be a lot of work but it would make sense :) Interested to know how you did/do that part of the process.

If you do it manually, by looking at the glyph and mentally (or with pen and paper) overlaying a grid on a zoomed-in glyph, then I'm wondering what your technique is for efficiently figuring out the code string that accurately represents it.

I can understand how your AI generation thing might work, which generated potential new characters. This would just iterate through the possible code strings you could make and go from there. But this other problem of figuring out how to find the code string for a given glyph seems pretty hard.

Question: is there a unique "best" encoding for each character?

Very cool project!

I was wondering if you have any thoughts on how unique the description for a character in your system is. There are obviously several ways to describe each character (e.g. by "zooming out"), but maybe there is always a "best" or most condensed way. What is your experience in encoding the characters so far? Have there been cases where two encoding options seemed equally good, or cases where you used different encoding for the same component to make it work in different contexts?

Here is an example of encoding a component in two different ways, side by side:

PS: Let me know if there is a better place to ask questions like this, e.g. some chat server!

Duplicate strokes for characters surrounded by 厂 and 广

All the characters with 厂 and 广 are not reusing the shared parts. I wonder if something like IDS ⿸ or a layering system is possible and would allow for reuse of these strokes?

Stroke weight, direction, and order

Hi there.

I just skimmed the README and my understanding is that the syntax currently doesn't really encode these attributes:

stroke weight (e.g. long vs short 點);
stroke direction (e.g. a diagonal stroke from top left to bottom right could be a 左勾 or a 捺);
stroke order.

Is there any plan in extending the syntax to support encoding those attributes? Maybe make it an optional thing so that the existing syntax remains compatible?

ideas on enhancement

some machine learning to adjust the grid size of components when composed...
though i believe it's trivially the future work...

about decomposition,
https://www.babelstone.co.uk/CJK/index.html -> ids.txt
there's an actively maintained dataset in IDS format, not sure if you're already using it. those could be easily parsed and directly transformed to your format. though also it's expected not a few of them will need to be hand corrected. visualizing them with this project may help to identify errors too.

personally i suggest not breaking the components up too finely/deeply. down at the mid-level, many components are most likely standalone (look them up at zdic.net or hanziyuan.net for example) but were generalized and merged during 隸變楷化 so that they look like combinations of semantically irrelevant sub-shapes. in other words, there's a good chance of some fake orthogonality. for example 它→宀匕, 宁→宀丁, 寅→宀？, but in fact none of them is really composed with 宀. also, too many levels of composition degrade output quality.

How it handles curves

I have looked around some (with my limited knowledge of Chinese) to see if Chinese characters are composed entirely out of straight lines, or if there are curves sometimes, but didn't get answered. So thought I'd ask how this system handles curves. It seems there are at least slight curves in some strokes, as in here. By slight curve I mean like a sword might be slightly curved, not like a spiral or anything. Now that I think of it I'm not sure if I've seen a Chinese character with a circle in it. Here it looks like all straight lines. And while this is tempting to think is Chinese characters from far away, it has lots of circles and curves and is actual scene drawings that sort of look like the same style as Chinese characters. So it seems like Chinese characters are mostly straight lines, but wanted to ask just to clarify to make sure. It seems like it would just flatten them out or ignore them.

How did you split the Chinese characters into radicals?

How do you split a character into radicals, as in this example: "晚":"(日)-(免)"?

The readme doesn't look good in GitHub dark mode

As you may know, GitHub released dark mode a few weeks ago.

However the readme has many SVG files that do not have a consistently colored background, so they are illegible in dark mode.
