vickenty / lang-c
Lightweight C parser for Rust
License: Apache License 2.0
I've found this kind of code which is accepted by Clang and GCC, but not by lang-c.
int fn() {
int ifn() {
return 5;
}
return ifn();
}
fn is a function containing a nested function named ifn. ifn is only in scope inside the body of fn. This is not standard C, but a GNU extension. I don't think it's in ANSI C or ISO C99, or K&R, or anything like that. So I am not sure if we want or need to accept this kind of code in lang-c.
I'm getting this error message:
SyntaxError(SyntaxError { source: "# 0 \"c/fninfn.c\"\n# 0 \"<built-in>\"\n# 0 \"<command-line>\"\n# 1 \"/usr/include/stdc-predef.h\" 1 3 4\n# 0 \"<command-line>\" 2\n# 1 \"c/fninfn.c\"\nint fn() {\n int ifn() {\n return 5;\n }\n return ifn();\n}\n", line: 8, column: 12, offset: 156, expected: {";", "asm", "(", "[", ",", "="} })
What do you think?
The following gcc-valid sources cannot be parsed:
void f(void) {
};
struct s {
struct t {
int i;
} __attribute((packed)) v;
};
struct s {
union { int i; } __attribute__((aligned(8)));
};
struct s {
int i;;
};
struct s {
int __attribute__((aligned(8))) *i;
};
Case ranges are a GNU extension that allows matching a range of consecutive case values, like so:
#include <stdio.h>
int main() {
int v = 4;
switch (v) {
case 0 ... 4: puts("Between 0 and 4"); break;
case 5: puts("5"); break;
default: puts("something else"); break;
}
return 0;
}
This compiles without errors with gcc -std=gnu11 range.c. However, parsing it with lang-c results in a SyntaxError:
use lang_c::driver::{Config, parse};
fn main() {
let config = Config::with_gcc();
let p = parse(&config, "range.c");
if let Err(e) = p {
println!("{}", e);
}
}
Output:
syntax error: unexpected token at line 733 column 11, expected '[_a-zA-Z]'
Compound literals do not seem to work.
Test preprocessed C file:
# 1 "test.c"
# 1 "<built-in>"
# 1 "<command-line>"
# 1 "test.c"
typedef struct
{
int value;
} test_t;
void test(test_t* myStruct)
{
*myStruct = (test_t) {.value = 1};
}
Error:
SyntaxError {
source: --- see above ---,
line: 12,
column: 26,
offset: 173,
expected: {
"<",
"!=",
"*=",
"%=",
"&",
"==",
">",
"<<",
"&&",
">>",
"u8",
"[",
"~",
">=",
"*",
"/",
"++",
"->",
"--",
"+",
"<<=",
";",
"?",
"|=",
"[_a-zA-Z]",
">>=",
"[uUL]",
"^=",
",",
".",
"!",
"||",
"(",
"-=",
"\"",
"/=",
"<=",
"^",
"%",
"=",
"+=",
"&=",
"|",
"-",
},
}
This works:
struct thing
{
int value;
};
struct thing returnthing()
{
return (struct thing){ 1 };
}
This does not:
typedef struct
{
int value;
} thing;
thing returnthing()
{
return (thing){ 1 };
}
Error:
# 1 "test.c"
# 1 "<built-in>"
# 1 "<command-line>"
# 1 "test.c"
typedef struct
{
int value;
} thing;
thing returnthing()
{
return (thing){ 1 };
}
SyntaxError {
source: "# 1 \"test.c\"\r\n# 1 \"<built-in>\"\r\n# 1 \"<command-line>\"\r\n# 1 \"test.c\"\r\ntypedef struct\r\n{\r\n int value;\r\n} thing;\r\n\r\nthing returnthing()\r\n{\r\n return (thing){ 1 };\r\n}\r\n",
line: 12,
column: 19,
offset: 157,
expected: {
"<=",
"[",
".",
"<<",
">",
"|=",
"/=",
";",
"<<=",
"++",
"(",
"*",
"!=",
"%",
"&",
">=",
"-=",
"&=",
"^=",
"||",
",",
"->",
"%=",
"[uUL]",
"<",
"/",
"^",
"=",
"*=",
"--",
"[_a-zA-Z]",
"+=",
"?",
"!",
"-",
"u8",
"==",
"&&",
">>=",
"\"",
"|",
"~",
">>",
"+",
},
}
See here:
https://gcc.gnu.org/onlinedocs/gcc-8.1.0/gcc/Empty-Structures.html
I will provide a pull request.
I am implementing a so-called transpiler, which processes C source before sending it to the real compiler.
I have found lang-c very useful for parsing the sources, but formatting the AST back to source code does not seem to be implemented yet.
I think a line-precise formatter would be a very useful feature here.
typedef struct mi_heap_area_s {
size_t x;
} mi_heap_area_t;
==> unexpected token at line 9 column 3, expected '<typedef_name>', '}'
Notice also the wrong line number!
Curiously, this works:
typedef struct mi_heap_area_s {
void* x;
} mi_heap_area_t;
When the parser encounters a type that is unknown, it fails with the following error:
expected: {"<typedef_name>"}
It would be extremely helpful if the parser would at the following point in the parser code:
__state.mark_failure(__pos, "<typedef_name>");
make an effort to print the actual type that it can not parse.
It might be out-of-scope for this project, but having support for Clang's non-standard block pointer types would be nice
e.g.
int (^ _Nonnull __compar)(const void *, const void *)
in
typedef unsigned long long size_t;
void *bsearch_b(const void *__key, const void *__base, size_t __nel,
size_t __width, int (^ _Nonnull __compar)(const void *, const void *) __attribute__((__noescape__)))
__attribute__((availability(macosx,introduced=10.6)));
Purpose:
Be able to parse stdlib.h and other header files from macOS SDKs.
It seems a lot of assumptions about the preprocessor are made that don't apply to the preprocessor on Windows.
Enum DerivedDeclarator combines pointer, array, and function declarators. However, pointer declarators in C behave differently from array and function declarators. Pointer declarators are considered in right-to-left order, but array declarators go from left to right. For example:
int * const * volatile a[2][4];
This declares a as (array 2 of (array 4 of (volatile pointer to (const pointer to (int))))), but the DerivedDeclarators will be provided in the following order: Pointer(const), Pointer(volatile), Array(2), Array(4).
When using this list of DerivedDeclarators to build C types, special care should be taken to apply pointers in one direction and when they end, apply the rest in reverse direction. It would be more convenient to have RTL declarators and LTR declarators separated in the AST type system.
This is only my wish and suggestion and not a bug report. Please feel free to reject this issue as "won't do" if it goes against the philosophy of your crate.
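The ordering rule described above can be sketched in a few lines. This is only an illustration: Derived, CType, build_type, and describe are hypothetical stand-ins invented for this example, not the actual lang-c AST types, and the list is assumed to be flat (pointers first, then arrays), as in the declaration above.

```rust
// Hypothetical, simplified stand-ins for the real AST nodes (assumption).
#[derive(Debug, Clone, PartialEq)]
enum Derived {
    Pointer(&'static str), // qualifier of the pointer, e.g. "const"
    Array(usize),          // array length
}

#[derive(Debug, Clone, PartialEq)]
enum CType {
    Int,
    Pointer { qualifier: &'static str, to: Box<CType> },
    Array { len: usize, of: Box<CType> },
}

/// Wraps `base` using the declarators in the order described in the issue:
/// the leading pointers as listed, then the remaining ones in reverse.
fn build_type(base: CType, derived: &[Derived]) -> CType {
    // Index of the first non-pointer declarator.
    let split = derived
        .iter()
        .position(|d| !matches!(d, Derived::Pointer(_)))
        .unwrap_or(derived.len());
    let (pointers, rest) = derived.split_at(split);

    let mut ty = base;
    for d in pointers.iter().chain(rest.iter().rev()) {
        ty = match d {
            Derived::Pointer(q) => CType::Pointer { qualifier: *q, to: Box::new(ty) },
            Derived::Array(n) => CType::Array { len: *n, of: Box::new(ty) },
        };
    }
    ty
}

/// Renders the type in the "array N of (...)" notation used above.
fn describe(ty: &CType) -> String {
    match ty {
        CType::Int => "int".to_string(),
        CType::Pointer { qualifier, to } => {
            format!("{} pointer to ({})", qualifier, describe(to))
        }
        CType::Array { len, of } => format!("array {} of ({})", len, describe(of)),
    }
}
```

Calling build_type with CType::Int and the list Pointer("const"), Pointer("volatile"), Array(2), Array(4) and passing the result to describe reproduces the reading given for the declaration of a. Function declarators and nested parenthesized declarators are left out of this sketch.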
Both GCC and Clang allow omitting the middle operand of the conditional operator (a ?: b).
https://godbolt.org/z/7Erv8rWMd
The semantics are described in the GCC manual: https://gcc.gnu.org/onlinedocs/gcc/Conditionals.html
Hello all,
Currently lang-c accepts the following C code:
struct S {}
int z;
although the code is not valid (the semicolon ; is missing after the struct declaration). I believe this is because the grammar accepts a declaration's specifiers as a list:
Lines 447 to 453 in a76a36c
So it includes both struct S {} and int as type specifiers in the list, and the invalid code is parsed as:
struct S {} int z;
Similarly, the invalid declaration char int z; is also accepted.
Many thanks for any feedback.
With the following program:
use std::path::Path;
use lang_c::driver as c;
fn lint_file<P: AsRef<Path>>(config: &c::Config, file: P) {
let result = c::parse(config, file);
match result {
Ok(parsed) => {
println!("SUCCESS");
println!("{:#?}", parsed.source);
}
Err(c::Error::PreprocessorError(error)) => {
println!("PREPROCESSOR");
println!("{:#?}", error.into_inner().unwrap());
}
Err(c::Error::SyntaxError(error)) => {
println!("SYNTAX");
println!("{}", error.source);
println!("{:#?}", error);
}
}
}
fn main() {
let config = c::Config::with_gcc();
lint_file(&config, "main.c");
}
And a minimal main.c file:
int main()
{
return 0;
}
Prints this output:
SYNTAX
# 1 "main.c"
# 1 "<built-in>"
# 1 "<command-line>"
# 1 "main.c"
int main()
{
return 0;
}
SyntaxError {
source: "# 1 \"main.c\"\r\n# 1 \"<built-in>\"\r\n# 1 \"<command-line>\"\r\n# 1 \"main.c\"\r\nint main()\r\n{\r\n return 0;\r\n}\r\n",
line: 5,
column: 11,
offset: 78,
expected: {
";",
"asm",
"<typedef_name>",
"{",
"=",
"[",
",",
"(",
},
}
I am using:
> gcc --version
gcc.exe (MinGW.org GCC-8.2.0-3) 8.2.0
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS
On Windows 10.
Am I not using this correctly?
I'm not sure if this is in the scope of this project, but it seems <stdlib.h> cannot be parsed on macOS.
syntax error: unexpected token at "/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/mach/arm/_structs.h" line 498 column 2, expected '<typedef_name>', '}'
included from tests/thing.h:6
included from /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/stdlib.h:66
included from /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/sys/wait.h:109
included from /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/sys/signal.h:146
included from /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/machine/_mcontext.h:34
included from /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/arm/_mcontext.h:36
included from /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/mach/machine/_structs.h:35
From looking at the error, it looks like it gets stuck on the __uint128_t and __uint32_t built-in types of the following definition:
struct __darwin_arm_neon_state64
{
__uint128_t __v[32];
__uint32_t __fpsr;
__uint32_t __fpcr;
};
I tried specifying the flavor, but that does not seem to help.
I've started a crummy linter. Not much good yet; it's just got the one checker so far and that is the you're-actually-using-the-variable-before-it's-even-been-initialized checker. But it's finding problems where I put them, so.
The thing is I'd like to report to my end user something like "I found an issue on line 57". But information about line numbers is actually discarded by the time I figure out there's even a problem to report. I do have a reference to a Span object though, which, so far as I can figure out, gives me the character offset into the preprocessed source. I wonder if we can add the line number in there.
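Until line information is available in the AST, a line number can be recovered from an offset by counting newlines in the preprocessed text. The following is a minimal sketch, assuming the offset is a byte offset into the same string that was parsed; line_col is a hypothetical helper, not part of lang-c.

```rust
/// Returns the 1-based (line, column) of byte `offset` within `source`.
/// The column is counted in bytes, not in characters.
fn line_col(source: &str, offset: usize) -> (usize, usize) {
    // Clamp so that an out-of-range offset maps to the end of the text.
    let offset = offset.min(source.len());
    let before = &source.as_bytes()[..offset];

    // Every newline before the offset starts a new line.
    let line = before.iter().filter(|&&b| b == b'\n').count() + 1;

    // The current line starts right after the last newline, if any.
    let line_start = before
        .iter()
        .rposition(|&b| b == b'\n')
        .map_or(0, |i| i + 1);

    (line, offset - line_start + 1)
}
```

Note that the result refers to a line of the preprocessed text, which generally differs from the line in the original file once headers have been included.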
This file does not parse and the parser says that the error is in line 45695. But this line is unrelated. The error is in line 45772.
It's very nice to have a fully compliant C11 parser, but it seems the AST is fairly hard to search or manipulate at present. One possibility would be to match the visit, visit_*_mut, and fold functionality as exposed in the syn crate for manipulating Rust ASTs. Do you have direct plans? Or else a path that you like and would accept if submitted? Or else would you recommend providing such functionality in a separate crate?
In src/ast.rs
#[derive(Debug, PartialEq, Clone)]
pub enum IntegerBase {
Decimal,
Octal,
Hexademical,
}
// ..
#[derive(Debug, PartialEq, Clone)]
pub enum FloatBase {
Decimal,
Hexademical,
}
Hexademical could be changed to Hexadecimal, unless the spelling is intentional to prevent some confusion.
P.S. I am enjoying your library; my colleagues and I are using it to build an educational C compiler written in Rust.
When parsing
int ptsname_r(int fildes, char *buffer, size_t buflen) __attribute__((availability(macos,introduced=10.13.4))) __attribute__((availability(ios,introduced=11.3))) __attribute__((availability(tvos,introduced=11.3))) __attribute__((availability(watchos,introduced=4.3)));
it chokes on the 10.13.4 in the availability attribute.
It might be good to have a "raw" attribute for robustness, so that if an __attribute__ declaration fails to parse, the attribute string is simply stored as-is.
When trying to compile the example in the readme:
println!("{:?}", parse(&config, "example.c"));
I'm getting the following error:
the size for values of type 'str' cannot be known at compilation time
This is resolved when passing a String reference instead of an str:
println!("{:?}", parse(&config, &"example.c".to_string()));
I've really enjoyed using this library so far! However, I am wondering if there is any desire to write a function or suite of functions to generate C code given an AST (i.e. the opposite of parse). Such a function would be nice to have, and wouldn't be too hard to write since most of the heavy lifting is already done via the Visit trait.
The following simple program using K&R function definitions results in an error using lang-c. It compiles fine with gcc however.
extern int puts(char *);
int main(argc, argv)
int argc;
char **argv;
{
puts("Hello!");
return 0;
}
Rust test program:
/* Test program to show bug in K&R parsing.
*
* Input program:
*
* extern int puts(char *);
*
* int main(argc, argv)
* int argc;
* char **argv;
* {
* puts("Hello!");
*
* return 0;
* }
*
* Output:
* Err(SyntaxError(SyntaxError { source: "# 1 \"kr.c\"\n# 1 \"<built-in>\"\n# 1 \"<command-line>\"\n# 31 \"<command-line>\"\n# 1 \"/usr/include/stdc-predef.h\" 1 3 4\n# 32 \"<command-line>\" 2\n# 1 \"kr.c\"\nextern int puts(char *);\n\nint main(argc, argv)\nint argc;\nchar **argv;\n{\n puts(\"Hello!\");\n\n return 0;\n}\n", line: 10, column: 14, offset: 184, expected: {"[_a-zA-Z]", "[_a-zA-Z0-9]", ")"} }))
*
*/
extern crate lang_c;
use lang_c::driver::{Config, parse};
fn main() {
let config = Config::default();
println!("{:?}", parse(&config, "kr.c"));
}
Output:
Err(SyntaxError(SyntaxError { source: "# 1 \"kr.c\"\n# 1 \"<built-in>\"\n# 1 \"<command-line>\"\n# 31 \"<command-line>\"\n# 1 \"/usr/include/stdc-predef.h\" 1 3 4\n# 32 \"<command-line>\" 2\n# 1 \"kr.c\"\nextern int puts(char *);\n\nint main(argc, argv)\nint argc;\nchar **argv;\n{\n puts(\"Hello!\");\n\n return 0;\n}\n", line: 10, column: 14, offset: 184, expected: {"[_a-zA-Z]", "[_a-zA-Z0-9]", ")"} }))
void test(void) {
int a = test2((uint64_t) b);
}
Hi - thanks a lot for this crate, it is really great! I did run into one bug; it looks like chained && operators are not parsed correctly. An example:
extern crate lang_c;
use lang_c::driver::{Config, parse_preprocessed};
fn main() {
let parse = parse_preprocessed(&Config::default(), "
int foo(void) {
return 1 && 2 && 3;
}".to_string()).unwrap();
println!("{:?}", parse);
}
Trimming the output down substantially and to just the return, the resulting parsed ast is:
Return(Some(BinaryOperator(BinaryOperatorExpression {
operator: LogicalAnd,
lhs: Constant(Integer(Integer { base: Decimal, number: "1" })),
rhs: BinaryOperator(BinaryOperatorExpression {
operator: BitwiseAnd,
lhs: Constant(Integer(Integer { base: Decimal, number: "2" })),
rhs: UnaryOperator(UnaryOperatorExpression {
operator: Address,
operand: Constant(Integer(Integer { base: Decimal, number: "3" })),
}),
}),
})))
The second && is being broken up into a bitwise-and followed by an address-of, as though the input were:
return (1 && (2 & (&3)));
but I believe it should be parsed as though it were
return (1 && 2) && 3;
Adding parens around either 1 && 2 or 2 && 3 avoids the problem. I was also not able to recreate it with other infix operators, although I'm sure I didn't try all of them.
test.c:
typedef const char* (*fnPtr) ();
Output:
# 1 "<built-in>"
# 1 "<command-line>"
# 1 "test.c"
typedef const char* (*fnPtr) ();
SyntaxError {
source: "# 1 \"test.c\"\r\n# 1 \"<built-in>\"\r\n# 1 \"<command-line>\"\r\n# 1 \"test.c\"\r\ntypedef const char* (*fnPtr) ();\r\n",
line: 5,
column: 9,
offset: 76,
expected: {
"<typedef_name>",
},
}
I think that the files in ./src/bin/ should be located in ./examples.
By doing this, we can easily run the bin programs with a command like cargo run --example dump.
Example code:
int foo(void) {
return L'a';
}
I get the following:
syntax error: unexpected token at "1.c" line 2 column 13, expected '!=', '"', '%', '%=', '&', '&&', '&=', '(', '*', '*=', '+', '++', '+=', ',', '-', '--', '-=', '->', '.', '/', '/=', ';', '<', '<<', '<<=', '<=', '=', '==', '>', '>=', '>>', '>>=', '?', '[', '[_a-zA-Z0-9]', '^', '^=', '|', '|=', '||'
Is it possible to convert a Span of this library to a (Filename, Span) in the original files on disk? I'm trying to show some info about the C code using the ariadne crate. I'm currently using the preprocessed text from the source field of Parse, but I would like to do that conversion so that the spans show the correct line numbers and file names.
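One possible approach is to walk the line markers (lines such as # 1 "test.c") that the preprocessor leaves in the preprocessed text, as seen in the error dumps above. The sketch below is an assumption-laden illustration, not lang-c API: parse_line_marker and original_location are hypothetical helpers, the offset is taken to be a byte offset into the preprocessed source, and file names containing escaped quotes are not handled.

```rust
/// Parses a preprocessor line marker such as `# 12 "file.c" 1 3 4`.
/// Returns the line number and file name, or None for any other line.
fn parse_line_marker(line: &str) -> Option<(usize, String)> {
    let rest = line.strip_prefix('#')?.trim_start();
    let (num, tail) = rest.split_once(' ')?;
    let number = num.parse::<usize>().ok()?;
    let tail = tail.trim_start().strip_prefix('"')?;
    let end = tail.find('"')?;
    Some((number, tail[..end].to_string()))
}

/// Maps a byte offset in preprocessed text to (original file, line).
/// Returns None if the offset falls on a marker line, before the first
/// marker, or past the end of the text.
fn original_location(source: &str, offset: usize) -> Option<(String, usize)> {
    let mut file: Option<String> = None;
    let mut line_no = 0usize; // line number of the next ordinary line
    let mut pos = 0usize; // byte offset where the current line starts

    for raw in source.split_inclusive('\n') {
        let contains = offset >= pos && offset < pos + raw.len();
        let text = raw.trim_end_matches(|c: char| c == '\n' || c == '\r');
        if let Some((n, f)) = parse_line_marker(text) {
            if contains {
                return None; // markers have no location in the original
            }
            // The marker gives the number of the line that follows it.
            file = Some(f);
            line_no = n;
        } else {
            if contains {
                return file.map(|f| (f, line_no));
            }
            line_no += 1;
        }
        pos += raw.len();
    }
    None
}
```

Both ends of a span can be mapped this way. A span that crosses a marker line may start and end in different files, so a caller would need to decide how to report that case.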