Comments (9)

fstirlitz avatar fstirlitz commented on August 20, 2024 1

Implemented in 7172940. Consider it unstable, however. I might still revisit the encoding issue.

from luaparse.

 avatar commented on August 20, 2024 1

@fstirlitz

That at least makes it simple to implement; I worried we might have to import the Unicode character database to check character properties or something.

I was able to convert a UnicodeData.txt into a lightweight 2MB string map of general categories recently. It's something like '²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²¯ªªª¬ªªª¦§ª«ª¥ªªCCCCCCCCCCªª«««ªª ¦ª§­D­!!!!!!!!!!!!!!!!!!!!!!!!!!¦«§«²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²²¯ª¬¬¬¬®®­®!¨«²®­®«¤¤­!®ª­ [more GCs]'.

package hydroper.unicode
{
	/**
	 * The UnicodeType static class. 
	 */
	public class UnicodeType
	{
		public static const LETTER_UPPERCASE: uint = ' '.charCodeAt(0);
		public static const LETTER_LOWERCASE: uint = '!'.charCodeAt(0);
		public static const LETTER_TITLECASE: uint = '#'.charCodeAt(0);
		public static const LETTER_MODIFIER: uint = '$'.charCodeAt(0);
		public static const LETTER_OTHER: uint = '%'.charCodeAt(0);

		public static const MARK_NON_SPACING: uint = 'A'.charCodeAt(0);
		public static const MARK_SPACING_COMBINING: uint = 'B'.charCodeAt(0);
		public static const MARK_ENCLOSING: uint = 0xA3;

		public static const NUMBER_DECIMAL_DIGIT: uint = 'C'.charCodeAt(0);
		public static const NUMBER_LETTER: uint = '&'.charCodeAt(0);
		public static const NUMBER_OTHER: uint = 0xA4;

		public static const PUNCTUATION_CONNECTOR: uint = 'D'.charCodeAt(0);
		public static const PUNCTUATION_DASH: uint = 0xA5;
		public static const PUNCTUATION_OPEN: uint = 0xA6;
		public static const PUNCTUATION_CLOSE: uint = 0xA7;
		public static const PUNCTUATION_INITIAL_QUOTE: uint = 0xA8;
		public static const PUNCTUATION_FINAL_QUOTE: uint = 0xA9;
		public static const PUNCTUATION_OTHER: uint = 0xAA;

		public static const SYMBOL_MATH: uint = 0xAB;
		public static const SYMBOL_CURRENCY: uint = 0xAC;
		public static const SYMBOL_MODIFIER: uint = 0xAD;
		public static const SYMBOL_OTHER: uint = 0xAE;

		public static const SEPARATOR_SPACE: uint = 0xAF;
		public static const SEPARATOR_LINE: uint = 0xB0;
		public static const SEPARATOR_PARAGRAPH: uint = 0xB1;

		public static const OTHER_CONTROL: uint = 0xB2;
		public static const OTHER_FORMAT: uint = 0xB3;
		public static const OTHER_SURROGATE: uint = 0xB4;
		public static const OTHER_PRIVATE_USE: uint = 0xB5;
		public static const OTHER_NOT_ASSIGNED: uint = 0xB6;

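		// One character per code point; each character's code is one of the constants above.
		// private static const data: String = /* ... 2MB ...  */;

		/**
		 * Returns the general category constant for the code point cp.
		 */
		[Inline]
		public static function getType(cp: uint): uint
		{
			return data.charCodeAt(cp);
		}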
	}
}

There's also the UnicodeSet tool for that, but it outputs a pattern-like set instead, with range elements. Range checking is generally slower than indexing into a string literal map.
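
To make the two lookup styles concrete, here is a minimal sketch in JavaScript covering ASCII only. The names RANGES, MAP, categoryByRange, and categoryByIndex, as well as the single-letter category codes, are made up for this illustration and are not taken from the post or from any library. It only shows the shape of each lookup and makes no measurement of which one is faster.

```javascript
// Hypothetical miniature of both approaches, ASCII only.
// Category codes are illustrative: c = control, s = space, d = digit,
// U = uppercase, l = lowercase, o = anything else.
const RANGES = [
  [0x00, 0x1f, 'c'],
  [0x20, 0x20, 's'],
  [0x21, 0x2f, 'o'],
  [0x30, 0x39, 'd'],
  [0x3a, 0x40, 'o'],
  [0x41, 0x5a, 'U'],
  [0x5b, 0x60, 'o'],
  [0x61, 0x7a, 'l'],
  [0x7b, 0x7e, 'o'],
  [0x7f, 0x7f, 'c'],
];

// Range-set style: binary search over sorted, non-overlapping ranges.
function categoryByRange(cp) {
  let lo = 0;
  let hi = RANGES.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    const [start, end, category] = RANGES[mid];
    if (cp < start) hi = mid - 1;
    else if (cp > end) lo = mid + 1;
    else return category;
  }
  return undefined; // code point not covered by the table
}

// String-map style: one character of category data per code point.
const MAP = Array.from({ length: 128 }, (_, cp) => categoryByRange(cp)).join('');

function categoryByIndex(cp) {
  return MAP[cp]; // undefined when cp is outside the map
}

console.log(categoryByIndex(0x41), categoryByRange(0x41)); // U U
```

The string map trades memory (one entry per code point) for a lookup with no branching on range boundaries; the range table stays small but needs a search per query.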

fstirlitz avatar fstirlitz commented on August 20, 2024

PUC Lua doesn't accept this code:

$ lua5.3
Lua 5.3.3  Copyright (C) 1994-2016 Lua.org, PUC-Rio
> function 中文函数名(参数1,参数2) end
stdin:1: <name> expected near '<\228>'
$ lua5.2 
Lua 5.2.4  Copyright (C) 1994-2015 Lua.org, PUC-Rio
> function 中文函数名(参数1,参数2) end
stdin:1: <name> expected near char(228)
$ lua5.1 
Lua 5.1.5  Copyright (C) 1994-2012 Lua.org, PUC-Rio
> function 中文函数名(参数1,参数2) end
stdin:1: '<name>' expected near '�'

Apparently LuaJIT allows any octet ≥ 128 inside identifiers. (The documentation only says 'UTF-8' characters are supported, but nothing actually checks the encoding for validity, and a comment in the source code suggests that other encodings are meant to be supported as well.) That at least makes it simple to implement; I worried we might have to import the Unicode character database to check character properties or something.

Well, sort of simple, because we parse Lua source code at the level of code points, not bytes, which brings back the conundrum I've had with interpreting string literals...
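
The byte-level rule described above can be sketched in a few lines of JavaScript. This is only an illustration, assuming the source is available as UTF-8 bytes (obtained here with the standard TextEncoder); isIdentStart, isIdentPart, and scanIdentifier are hypothetical names and are not part of luaparse or LuaJIT.

```javascript
// Hypothetical helpers, not taken from luaparse or LuaJIT.
// Rule sketched: ASCII letters, digits and '_' are identifier bytes, and so
// is every byte >= 128, with no check that those bytes form valid UTF-8.
function isIdentStart(byte) {
  return (byte >= 0x41 && byte <= 0x5a) || // A-Z
         (byte >= 0x61 && byte <= 0x7a) || // a-z
         byte === 0x5f ||                  // _
         byte >= 0x80;                     // any non-ASCII octet
}

function isIdentPart(byte) {
  return isIdentStart(byte) || (byte >= 0x30 && byte <= 0x39); // 0-9
}

// Returns the length in bytes of the identifier starting at `offset`,
// or 0 if no identifier starts there.
function scanIdentifier(bytes, offset) {
  if (offset >= bytes.length || !isIdentStart(bytes[offset])) return 0;
  let end = offset + 1;
  while (end < bytes.length && isIdentPart(bytes[end])) end++;
  return end - offset;
}

const source = new TextEncoder().encode('中文函数名(参数1,参数2)');
console.log(scanIdentifier(source, 0)); // 15 (five characters, three bytes each)
```

On an already decoded JavaScript string, the closest analogue would be to accept every code point ≥ 0x80. That gives the same identifier boundaries as the byte rule only when the input was valid UTF-8 to begin with, which is where the encoding question comes back in.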

Given that it's an extension to PUC Lua, I'll probably implement this, but it will require being explicitly enabled by the user, like with the luaVersion option. Please note that this is not full LuaJIT support. For example, LuaJIT will accept code like this, unless LUAJIT_ENABLE_LUA52COMPAT is defined when compiling:

goto skip
local goto = print
goto "hello"
::skip::

Supporting code like this might be too tricky to be worth it, so I explicitly don't promise that.
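
To show why this is tricky, here is a sketch of one way a parser could tell the two uses of goto apart, using a single token of lookahead. The rule is an assumption made for this illustration: the thread does not say how LuaJIT decides, and nothing here describes luaparse's behaviour. The token shape and the function name startsGotoStatement are hypothetical.

```javascript
// Hypothetical token shape: { type: 'Name' | 'String' | 'Punctuator', value: string }.
// Assumed rule (not confirmed by the thread): `goto` begins a goto statement
// only when the very next token is a name; otherwise it is an ordinary identifier.
function startsGotoStatement(tokens, index) {
  const token = tokens[index];
  if (token === undefined || token.value !== 'goto') return false;
  const next = tokens[index + 1];
  return next !== undefined && next.type === 'Name';
}

// goto skip
const jump = [
  { type: 'Name', value: 'goto' },
  { type: 'Name', value: 'skip' },
];

// goto "hello"
const call = [
  { type: 'Name', value: 'goto' },
  { type: 'String', value: 'hello' },
];

console.log(startsGotoStatement(jump, 0)); // true
console.log(startsGotoStatement(call, 0)); // false
```

Even under this simple rule, the tokenizer can no longer classify goto on its own; the decision moves into the parser, which is part of what makes the feature awkward to support.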

fstirlitz avatar fstirlitz commented on August 20, 2024

As I already wrote, this is unnecessary. There's no need to check category codes at all; the LuaJIT parser doesn't work with Unicode characters anyway. The real issue is that luaparse does work with Unicode characters (given that JavaScript has no portable and expedient bytestring type), which causes problems here and in a couple of other places.

 avatar commented on August 20, 2024

@fstirlitz Well, you're kinda rude. I read your post and I didn't try to help you, was just showing you I've done it and I'm able to.

 avatar commented on August 20, 2024

@fstirlitz Oops, I meant... You're an 137 old. I remember that.

fstirlitz avatar fstirlitz commented on August 20, 2024

I read your post and I didn't try to help you, was just showing you I've done it and I'm able to.

Well, good for you. But with respect to this issue report, it's off-topic. If you want to demonstrate your programming prowess to your peers, there are plenty of forums meant for that purpose. This is not one of them.

fstirlitz avatar fstirlitz commented on August 20, 2024

I filed the encoding issue as #68. Another question that remains is whether to integrate this feature into the feature flags framework (i.e. the features object) and expose the latter directly in the API. I'm leaning towards 'yes'.

fstirlitz avatar fstirlitz commented on August 20, 2024

Apparently Lua 5.4 will add this feature behind a compile-time flag: lua/lua@e0ab13c. It seems, though, that unlike in LuaJIT, only (modern) UTF-8 names will be supported.
