Comments (11)
This has been tentatively implemented as of Alexhuszagh/rust-lexical-experimental@e062276. It passes all the unit tests I've thrown at it so far, including Miri and Valgrind.
I'm currently waiting on fuzz results, then will apply the experimental branch on top of the current lexical branch, and we should have this release in the next week.
Here's a code sample of it:
```rust
const FORMAT: u128 = NumberFormatBuilder::new()
    .base_prefix(num::NonZeroU8::new(b'x'))
    .base_suffix(num::NonZeroU8::new(b'h'))
    .build();
let options = Options::new();
i32::from_lexical_with_options::<FORMAT>(b"+3h", &options);    // Ok(3)
i32::from_lexical_with_options::<FORMAT>(b"+0x3", &options);   // Ok(3)
i32::from_lexical_with_options::<FORMAT>(b"+0x3h", &options);  // Ok(3)
i32::from_lexical_with_options::<FORMAT>(b"+0x3h ", &options); // Err(InvalidDigit)
i32::from_lexical_with_options::<FORMAT>(b"+0xh", &options);   // Err(InvalidDigit)
i32::from_lexical_with_options::<FORMAT>(b"+h", &options);     // Err(InvalidDigit)
i32::from_lexical_with_options::<FORMAT>(b"+0x", &options);    // Err(Empty)
```
It currently supports both floats and integers, and the checks to skip/ignore base prefixes and suffixes are only done if the features are specified in the format packed struct.
from rust-lexical.
@Alexhuszagh The language I'm parsing does indeed have base prefixes for floats, too. I use hand-rolled code for integer parsing, because it handles these bases more cleanly, especially given that a delayed sign flip is not lossless for integers.
I know of at least C having number base indicators on floats, too, but it uses a nonsensical format that is incompatible with lexical-core. But I don't use C, so… I know more languages using these on floats exist, but I long forgot which.
I'll look into the example script tomorrow. Tonight is sleep time.
@Alexhuszagh Yeah, for parsing C++ floats we need two number bases: one for the digits, one for what the exponent actually does. So far, and I'm glad it does, lexical-core makes exponents shift the number one digit at a time. It's what makes most sense, after all. In C++, as you mention, hex floats are shifted one bit at a time by their exponents, or multiplied by 2^exponent. So you get this table of how a literal is interpreted/parsed:
In Code | Lexical | C++
---|---|---
`0b10.01^1` | `0b100.1` | N/A
`0d10.01^1` | `0d100.1` | `0d100.1`
`0x10.01^1` | `0x100.1` | `0x20.02`
In other words, to support C++'s `0x3.4p7` number format, rust-lexical would need an extra parameter for the format spec. But given that the format spec is gonna be a nice builder struct, at least for sane number formats the exponent base can default to just the given number base. Like `cxx_fmt.base(16).exp_base(2)` or whatever.
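To make the table concrete, here is a tiny plain-Rust sketch (not lexical's API; `eval` and its parameters are made up for illustration) that evaluates a fractional literal under an independent digit radix and exponent base:

```rust
// Evaluate a literal split into integer digits, fractional digits, an
// exponent value, a digit radix, and a *separate* exponent base.
fn eval(int_digits: &str, frac_digits: &str, radix: u32, exp_base: f64, exp: i32) -> f64 {
    let int_part = i64::from_str_radix(int_digits, radix).unwrap() as f64;
    let mut frac_part = 0.0;
    let mut scale = 1.0 / radix as f64;
    for c in frac_digits.chars() {
        frac_part += c.to_digit(radix).unwrap() as f64 * scale;
        scale /= radix as f64;
    }
    (int_part + frac_part) * exp_base.powi(exp)
}

fn main() {
    // 0x10.01^1 with exp_base = 16: the exponent shifts one *digit*, 0x100.1
    let lexical_style = eval("10", "01", 16, 16.0, 1);
    // 0x10.01p1 with exp_base = 2: the exponent shifts one *bit*, 0x20.02
    let cxx_style = eval("10", "01", 16, 2.0, 1);
    println!("{lexical_style} {cxx_style}");
}
```

Running this reproduces the last table row: the same digits yield `0x100.1` (256.0625) under a radix-16 exponent and `0x20.02` (32.0078125) under a base-2 exponent.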
The left column is the syntax of my programming language, btw., and that `^` for exponents made me write the »plz no global state« issue. =)
> (although this would work perfectly fine with the partial parser, it just wouldn't be able to deduce the type)

All a compiler would need is to parse at max precision and just return the unparsed rest.
E: Though really a compiler would've already lexed the number before parsing it and thus knows the type before invoking lexical-core.
> making it clear `exponent_backup` would then be used for all non-decimal cases.
This is a difficult one. C++ has no binary float literals, and it only uses the `p` for exponents because `e` is a valid hex digit. So really the `exponent_backup` is for when the `exponent` character is in the set of valid digit characters.
Technically one could have a different exponent character per number base and even make just this exponent distinguish the number base. Not that I've seen anyone actually do such a crazy thing. In my programming language, all bases use `^` for the exponent.
I shall further clarify that in my use-case, the base of a number literal is already identified by my lexer, including syntax validity, so if there is an easy/low-effort possibility to fast-path lexical-core for when you already know the literal syntax to be correct, adding that to the API would be neat.
Another side note: programming languages like Agda show that there may be interest in being able to parse Unicode numbers. However, that'd be so different from normal literal parsing that even if lexical-core ever did that in the far future, I'd recommend doing so using a different API. What I mean is stuff like: half-width no-break space for digit separation, superscript digits for the exponent, final subscript digits for the number base, the exponent actually written out with a multiplication sign.
> So we currently need ~10 bits for just prefixes and postfixes
So, question… What if the argument to parsing-with-base was actually not just the numerical base, but another bit-packed struct? This is what i propose:
Have a base format builder with these settings:
- Numerical digit base, e.g. 12 for duodecimal.
- Optional exponent base for C++, where it differs from the digit base.
- Exponent character.
- Optional prefix base indicator character.
- Optional postfix base indicator character.
The main format struct then implements just a subset of this for the default base-10 case.
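As a rough illustration of what such a per-base format struct could look like (all names are hypothetical, not lexical's actual API):

```rust
use core::num::NonZeroU8;

// Hypothetical sketch of the proposed base format settings.
#[derive(Clone, Copy, Debug)]
struct BaseFormat {
    radix: u8,                        // numerical digit base, e.g. 12 for duodecimal
    exponent_base: Option<NonZeroU8>, // None => same as `radix`
    exponent: u8,                     // exponent character
    prefix: Option<NonZeroU8>,        // optional base indicator prefix
    postfix: Option<NonZeroU8>,       // optional base indicator postfix
}

impl BaseFormat {
    // The default base-10 subset the main format struct would use.
    const fn decimal() -> Self {
        BaseFormat { radix: 10, exponent_base: None, exponent: b'e', prefix: None, postfix: None }
    }
    // The exponent base defaults to the digit base when unset.
    fn effective_exponent_base(&self) -> u8 {
        self.exponent_base.map(NonZeroU8::get).unwrap_or(self.radix)
    }
}

fn main() {
    // C++ hex floats: hex digits, binary exponent, `p` marker, `x` prefix.
    let cxx_hex = BaseFormat {
        radix: 16,
        exponent_base: NonZeroU8::new(2),
        exponent: b'p',
        prefix: NonZeroU8::new(b'x'),
        postfix: None,
    };
    println!("{} {}", BaseFormat::decimal().effective_exponent_base(),
             cxx_hex.effective_exponent_base());
}
```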
This has a ton of advantages:
- Literals of several different bases can be parsed unambiguously without base indicators, if the lexer or parser has another source of truth for the intended number base. Like parsing a column of a spreadsheet where the column format is configured elsewhere.
- More than just the traditional number bases are supported, allowing experimental programming languages to toy around with e.g. base-12 numbers. (Though really, base-12 has another issue, as traditionally the digits after 9 aren't AB, but XE/TE/some Greek glyphs. But that's a whole other can of worms.) E: Another interesting edge-case number format would be ternary numbers. There's unbalanced ternary, where the digits are `012`, and there's balanced ternary, where the digits are `0±1`, or `01i`, or `0pn`, or Wikipedia's `01T`, or whatever one names them. Put another way: there may be demand for another base-dependent parsing function that takes e.g. a `&[char]` or `&[u8]` to name the digits of a number system, plus an offset where the zero digit is. Or, for speed, a function pointer that turns any `char` into an `Option<i8>` for the digit, `None` indicating an error.
- The »backup exponent« is now no longer needed, because each base format can specify its own exponent character.
- If a parser wants to, it can swap out the literal formats of the very same base depending on context. Imagine for example a scriptlang that has a special code block for a sub-set of C to be able to declare FFI bindings via copy-pasta.
- It leaves enough extra encoding space to add a single bit that decides whether simultaneously having base prefixes and postfixes in the same literal is legal. The fewer assumptions lexical-core hard-codes, the more attractive it is for any parser creator out there. (And again, in my case the legality of this is already decided by the lexer, so I'd just tell lexical-core that I don't care.)
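The function-pointer idea from the ternary bullet can be sketched in plain Rust (all names hypothetical); note how a `char -> Option<i8>` mapping naturally admits the negative digits of balanced ternary:

```rust
// Balanced ternary digits: 0, 1 (or +) = +1, T (or -) = -1.
fn balanced_ternary_digit(c: char) -> Option<i8> {
    match c {
        '0' => Some(0),
        '1' | '+' => Some(1),
        'T' | '-' => Some(-1), // Wikipedia's `T` notation
        _ => None,
    }
}

// Generic base-dependent parser driven by a digit-mapping function pointer;
// `None` from the mapper propagates out as a parse error.
fn parse_with(digits: &str, radix: i64, to_digit: fn(char) -> Option<i8>) -> Option<i64> {
    digits.chars().try_fold(0i64, |acc, c| Some(acc * radix + to_digit(c)? as i64))
}

fn main() {
    // 1T0 in balanced ternary = 9 - 3 + 0 = 6
    println!("{:?}", parse_with("1T0", 3, balanced_ternary_digit));
}
```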
> I'm going to try to find out if there's any exceptions I can find to these rules
Here's what I know:
- Case-insensitive: I know some proglangs have case-sensitive base prefixes (usually lowercase-only), but this can be caught by the individual parser or lexer, and lexical-core can just safely go: »I don't care, I'm just figuring out whether to skip this.«
- Technically, if we already decide to support C++'s strange exponent base, we may also still want to support the single `0` prefix to indicate octal. This could be handled as a special case when setting the base prefix character to `0`. Though on the other hand, if the parser already figured out that we'll be parsing base-8 here, to lexical-core this `0` prefix can just be a leading zero. So no/optional prefix for octal.
- Binary exponent for non-decimal floats: just a quick reminder that I do not parse hex floats like C++ does. So if it's one-or-the-other, I'd keep the current way of doing things, for there is another crate that correctly parses C++'s hex floats one can fall back to.
> Also, C++'s user-defined literals opens a whole new can of worms I'm not eager to get into.
For lexical-core it just means: parse at max precision and return the rest. =) When you apply a user-defined literal, your literal code has to manually handle either the text form of your float, or a pre-parsed `float` or `double` or what have you.
> I do have an external library […] that allows you to provide a custom iterator and then feed that into lexical
That's a really interesting feature to have! I have no use for it, but that'd indeed make stuff like Unicode-conforming scientific number parsing with lexical-core possible, and would thus make lexical-core attractive for more scenarios, like parsing locale-aware number inputs of GUI applications.
> Allowing custom vocabulary would either require passing around a reference to a pre-computed table
It wouldn't be impossible to generate such a table in a `const fn` given an input alphabet. But I do agree that this approach is better:
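A sketch of that `const fn` table generation in plain Rust (the `make_table` helper is made up for illustration; `0xFF` marks an invalid character):

```rust
// Build a 256-entry digit-lookup table at compile time from an alphabet.
const fn make_table(alphabet: &[u8]) -> [u8; 256] {
    let mut table = [0xFFu8; 256];
    let mut i = 0;
    while i < alphabet.len() {
        table[alphabet[i] as usize] = i as u8;
        i += 1;
    }
    table
}

// Duodecimal with `X`/`E` for ten/eleven instead of `A`/`B`.
const DOZENAL: [u8; 256] = make_table(b"0123456789XE");

fn main() {
    println!("{} {} {}",
        DOZENAL[b'X' as usize],  // 10
        DOZENAL[b'E' as usize],  // 11
        DOZENAL[b'Z' as usize]); // 255, i.e. invalid
}
```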
> but using an adaptation of minimal lexical above with a mapping iterator would be quite nice.
This also allows redundant digits, where desired, like allowing all of XE, TE, the Greek glyphs, and AB to mean `10` and `11` respectively. Or allowing Japanese kanji as well as Indo-Arabic digits in i18n situations.
The only question is: can lexical-core potentially handle negative digits for balanced ternary? Not that I need that or that I ever saw anyone asking for it, but you know, if lexical-core could pretty much handle almost every sane number format imaginable in this fashion, then why the heck would anyone want to use anything but lexical-core anymore for parsing and printing numbers.
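The redundant-digit mapping could be sketched as a normalization pass in plain Rust (names hypothetical; a real integration would map lazily inside the iterator instead of allocating a `String`):

```rust
// Normalize redundant dozenal digit vocabularies before parsing:
// `X`/`T` -> `A` (ten), `E` -> `B` (eleven).
fn normalize(b: u8) -> u8 {
    match b {
        b'X' | b'T' => b'A',
        b'E' => b'B',
        _ => b,
    }
}

fn parse_dozenal(s: &str) -> Option<u32> {
    let mapped: String = s.bytes().map(normalize).map(char::from).collect();
    u32::from_str_radix(&mapped, 12).ok()
}

fn main() {
    // `X0` and `T0` both mean ten dozen, i.e. 120.
    println!("{:?} {:?}", parse_dozenal("X0"), parse_dozenal("T0"));
}
```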
> and then another bit-packed struct with the following layout:
This sounds very reasonable and quite feature-complete. Good thinking with the separate exponent radix.
> For each of these, `\x00` (NUL) would be the invalid character, since it is invalid ASCII.
That's actually factually incorrect, though practically »true-ish enough«. Originally, NUL is a NOP for teletypes: you can pad data with it and it's ignored. C made it »invalid«. ASCII has no invalid character, Unicode has a set of non-characters which an application can utilise for any internal purposes it wants to, and UTF-8 has the truly invalid bytes `0xC0` and `0xC1`. That said, using NUL is Good Enough™. If you did ASCII-only, but on 8-bit characters, then anything past `0x7F` would be truly illegal, assuming you interpret that as 7-bit ASCII packed in 8 bits, not as 8-bit ASCII where the top bit is a parity bit. In any case, `0xFE` would do for 8-bit.

That's a long way of saying: »Ehh, not exactly, but NUL makes most sense for the implementation, given `NonZero` types etc.«
> This... looks good. Any suggestions?
Apart from that nitpick? Nah, sounds really good, indeed.
> As for potentially supporting more esoteric duodecimal and other notation […]
I see them as not-strictly-necessary nice-to-haves that would make lexical-core even more attractive to a bigger set of potential users. But not supporting these would make it no less awesome. So adaptations for that can IMHO comfortably sit on the »maybe, if spare time« pile.
Implemented as of lexical v6.0.0 and lexical-core v0.8.0, using the API above.
@Evrey I like the `NumberFormat` flags approach, since it's:
- Backward compatible.
- Not enabled by default (only with `feature = "format"`).
- Pretty trivial to implement.
In this case, I'm assuming we'd silently reset the provided radix (which might be a default value) to the one shown by the prefix. The postfixes sound a little more difficult, but not too tricky, assuming that no digit separators are allowed. Since I'm assuming it's only ISAs that allow postfixes, which do not support digit separators, this could be a trivial check.

Also, this would only be for integers, correct? Or would you like to support floats? I'll have to check all my language scripts to see if any support base prefixes with floats. See all the examples in here for scripts to check.
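The silent-reset idea might look roughly like this in plain Rust (a hypothetical helper, not lexical's API): detect a known prefix, override the caller's default radix, and hand back the remaining digits.

```rust
// Detect `0b`/`0o`/`0x` (case-insensitive) and override the default radix,
// returning the radix to use plus the remaining digit slice.
fn detect_radix(s: &str, default: u32) -> (u32, &str) {
    match s.as_bytes() {
        [b'0', b'b' | b'B', rest @ ..] => (2, core::str::from_utf8(rest).unwrap()),
        [b'0', b'o' | b'O', rest @ ..] => (8, core::str::from_utf8(rest).unwrap()),
        [b'0', b'x' | b'X', rest @ ..] => (16, core::str::from_utf8(rest).unwrap()),
        _ => (default, s),
    }
}

fn main() {
    println!("{:?}", detect_radix("0x3A", 10)); // (16, "3A")
    println!("{:?}", detect_radix("42", 10));   // (10, "42")
}
```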
Actually, for postfixes, we'd have to do it incrementally (which could work easily, due to how we structure our logic), since we first extract the digits of the number, validate the format, then parse. In short, it would be easy to check whether a non-digit character that appears is a valid postfix.
The C (and C++17) format for hexadecimal literals would actually be doable with the new `ParseFloatOptions` and the `NumberFormat` schemes designed.

```cpp
const float f = 0x1ffp10;
const float g = 0x10.1p0;
```
Note that we wouldn't support float formats like the following, since the `f` here is a type signifier (although this would work perfectly fine with the partial parser, it just wouldn't be able to deduce the type):

```cpp
const auto f = .1E4f;
```
I would need to add support for hex prefixes into `NumberFormat`, but this would be entirely doable with:

```rust
let format = NumberFormat::builder()
    // Add flags here for hex prefixes.
    // We also would need a flag for required exponents, since
    // it's **never** optional.
    .exponent_backup(b'p')
    .build()
    .unwrap();
let options = ParseFloatOptions::builder()
    .format(Some(format))
    .build()
    .unwrap();
```
And then we'd have a valid parser for float hexadecimal literals in C++17 and C99. We would need one more addition: a way to specify the base for the exponent: hexadecimal float literals have hex digits for the integral and fractional components, and then decimal digits for the exponent, which signifies an exponent with base 2.
An example demonstrating this is:
```cpp
#include <iostream>

int main() {
    std::cout << 0xa.bp10
              << std::endl
              << 0xa.bp5
              << std::endl;
    return 0;
}
```

Which then outputs:

```
10944 // 0x2ac0, or 10.6875 * 2^10
342   // 0x156, or 10.6875 * 2^5
```
In short, we have in `0xa.bp10`:

- `0xa`, or `10`
- `0xb`, or `11`
- `p10`, or `2^10`

Lexical already correctly parses `0xa.b` as `10.6875`; the only issue is specifying a hex literal format for it.
So, the likely flags required would be:

- `0b` prefix (integer, float)
- `0o` prefix (integer, float)
- `0d` prefix (integer, float)?
- `0x` prefix (integer, float)
- `H` postfix (integer, float)
- Decimal exponent radix
- Binary exponent for non-decimal floats

So we currently need ~10 bits for just prefixes and postfixes (8 if `0d` is invalid), and 2 for exponent specifications. If so, with the release of v0.8, I'd likely change `exponent_default` in the `ParseFloatOptions` and `WriteFloatOptions` to `exponent_decimal`, making it clear `exponent_backup` would then be used for all non-decimal cases.
All of these prefixes and suffixes would have to be case-insensitive. It would also have to deal with nonsensical cases, which should be easy due to us tracking state. It would also assume that a prefix and a postfix are mutually exclusive; that is, even if both are consistent, `0x3H` is not a valid number. For example:

```
0b1010H // invalid
0xAH    // invalid
```

This would also likely make the exponent backup character be used for all floats except those with base 10 (currently it is only used if the radix is too high to represent `E`). That is, `0b1010.1p5` would then be `10.5 * 2^5`, or `336`.
I'm going to try to find out if there are any exceptions to these rules, because I'd try to avoid at all costs adding an exponent base to the `NumberFormat` that takes more than a single bit.
> »plz no global state« issue. =)
Global state is bad to begin with, it was just a hack for performance and simple APIs at the time and it was a bad idea. Thanks for bringing it up again.
> All a compiler would need is parse at max precision and just return the unparsed rest.
Yeah, sounds like a great idea. Also, C++'s user-defined literals opens a whole new can of worms I'm not eager to get into.
> E: Though really a compiler would've already lexed the number before parsing it and thus knows the type before invoking lexical-core
Fair enough, and I'd assume this would be generally true.
> This is a difficult one. C++ has no binary float literals, and it only uses the p for exponents, because e is a valid hex digit. So really the exponent_backup is for when the exponent character is in the set of valid digit characters.
Very true, so then we could just remove `exponent_backup` and only have `exponent`. This would save a lot of packed bits too.
> Programming languages like Agda show that there may be interest in being able to parse Unicode numbers. However, that'd be so way different to normal literal parsing that even if lexical-core would ever do that in the far future, i'd recommend doing so using a different API. What i mean is stuff like: Half-width-no-break-space for digit separation, superscript digits for the exponent, final subscript digits for the number base, exponent digit actually written out with a multiplication sign.
Yeah, I'm not eager to add Unicode classification into the parser, as that would destroy performance. Although needed for correctness in a lot of cases, it's such a marginal case in float parsing that it's not worth the tradeoffs. I do have an external library (that was originally written for integration into serde-json) that allows you to provide a custom iterator and then feed that into lexical, allowing you to provide any parsing logic you want as long as the input is. It currently does not support custom radixes, but that could easily be amended (I'd likely do that in another library, since this one really focuses on a single codebase with no features for blazing-fast compile times).
We could use a similar concept for Unicode numbers, with characters like U+200C and U+200D being valid digit joiners or whatever, since it would be trivial for a custom iterator either to stop when it encounters such a digit or skip over them.
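A minimal sketch of such an iterator front end (here eagerly collected into a `String` for simplicity; a real adapter would stay lazy and feed the parser directly):

```rust
// Strip U+200C/U+200D (zero-width non-joiner/joiner) before the digits
// ever reach the number parser.
fn strip_joiners(s: &str) -> String {
    s.chars().filter(|&c| c != '\u{200C}' && c != '\u{200D}').collect()
}

fn main() {
    let cleaned = strip_joiners("1\u{200D}234");
    println!("{:?}", cleaned.parse::<i32>()); // Ok(1234)
}
```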
> Have a base format builder with these settings:
> - Numerical digit base, e.g. 12 for duodecimal.
> - Optional exponent base for C++, where it differs from the digit base.
> - Exponent character.
> - Optional prefix base indicator character.
> - Optional postfix base indicator character.
This sounds... really clean and doable. I really like this idea.
> Though really, base-12 has another issue, as traditionally the digits after 9 aren't AB, but XE/TE/somegreekstuff. But that's a whole other can of worms
Yeah, this wouldn't be doable currently and I'm not planning on supporting it, for obvious reasons. Internally, we simplify base conversions by using a single digit-to-integer table, which is then used to parse digits. Allowing custom vocabulary would either require passing around a reference to a pre-computed table (OK, I guess, especially if it only works over the `[0x30-0x5A]` range and is case-insensitive) or a function pointer (absolutely not). Greek characters would therefore absolutely be out of the question, but using an adaptation of minimal-lexical above with a mapping iterator would be quite nice.
https://github.com/Alexhuszagh/rust-lexical/blob/master/lexical-core/src/util/digit.rs#L7-L38
Essentially, you could have an iterator that then maps lazily all the digits (as bytes or as characters) to return the expected digits (as bytes). This would allow supporting Greek characters in duodecimal notation, without actually requiring any complex changes internally. It might be possible to do this in lexical proper, since to support digit separators internally we map everything to iterators anyway, but we currently use an unusual way to track the number of bytes parsed to simplify our internal logic, which would require modification to support arbitrary iterators from unknown sources.
> The »backup exponent« is now no longer needed, because each base format can specify its own exponent character.
I really like this part, and this is a great idea.
> It leaves enough extra encoding space to add a single bit that decides whether simultaneously having base prefixes and postfixes in the same literal is legal. The fewer assumptions lexical-core hard-codes, the more attractive it is for any parser creator out there. (And again, in my case the legality of this is already decided by the lexer, so I'd just tell lexical-core that I don't care.)
Removing the exponent backup character gives us a lot more places to add bits, so absolutely, this sounds doable.
> What if the argument to parsing-with-base was actually not just the numerical base, but another bit-packed struct? This is what i propose:
That sounds... like a good idea. It would mean the only actual character left in the format specifier would therefore be the digit separator, which sounds normal, and then another bit-packed struct with the following layout:
- radix: u8 (min 6 bits)
- exponent base: u8 (min 6 bits)
- exponent radix: u8 (min 6 bits)
- exponent character: u8 (min 7 bits)
- decimal point: u8 (min 7 bits)
- prefix character: u8 (min 7 bits)
- postfix character: u8 (min 7 bits)
- bitflag for case-insensitive exponent
- bitflag for case-insensitive prefix
- bitflag for case-insensitive postfix
- bitflag for a literal octal (if the prefix is not found or is missing)
For each of these, `\x00` (NUL) would be the invalid character, since it is invalid ASCII. This would require ~60 bits using 8 bits per character, and ~50 bits if we pack it even closer. This... looks good. Any suggestions?
For the decimal point, see #58 for why.
As for potentially supporting more esoteric duodecimal and other notation, I would have to look at seeing if we can track the number of digits written in a way suitable for iterators while keeping the code internally clean and performant.
As for the exponent base and exponent radix, they would be stored effectively as a non-zero value, where 0 means "unset" (basically, `Option<NonZeroU8>`, just packed), so if the values aren't provided, we default to the radix (making them optional).
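A quick demonstration that `Option<NonZeroU8>` really is "just packed": the 0 niche encodes `None`, so an optional field still costs only a single byte.

```rust
use core::mem::size_of;
use core::num::NonZeroU8;

fn main() {
    // The None case reuses the 0 bit pattern, so no extra tag byte is needed.
    assert_eq!(size_of::<Option<NonZeroU8>>(), 1);

    let unset: Option<NonZeroU8> = NonZeroU8::new(0); // 0 means "unset" => None
    let set = NonZeroU8::new(16);
    let radix = 10u8;
    // Default to the radix when the value isn't provided.
    let effective = set.map_or(radix, NonZeroU8::get);
    println!("{:?} {}", unset, effective);
}
```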