qfpl / hpython
Haskell language tools for Python
License: Other
While the parser now supports type annotations in various places, there is one use it chokes on:
result : List[HistoryLine] = []
BTW, thank you for this great work!
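For reference, CPython accepts this annotated-assignment form (PEP 526). `HistoryLine` is a hypothetical placeholder type for this sketch, since the original definition isn't shown:

```python
# HistoryLine is an assumed placeholder type; the real one isn't in the report.
from typing import List

HistoryLine = str
result: List[HistoryLine] = []
result.append("first entry")
assert result == ["first entry"]
```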
This is a line in hpython.nix:
version = "0.2";
however, Hackage shows that the latest version is 0.3. I am new to Nix, so what is happening here?
Are there any plans to support python 3.6/3.7 in the near term?
""""""
is not a syntax error
"""""""
is a syntax error
""" """
is not a syntax error
""" """"
is a syntax error
""" " """
is not a syntax error
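These cases can be checked against CPython itself with a small helper (`is_valid` is just a throwaway name for this sketch):

```python
import ast

def is_valid(src):
    """Return True iff CPython parses src without a SyntaxError."""
    try:
        ast.parse(src)
        return True
    except SyntaxError:
        return False

assert is_valid('""""""')        # empty triple-quoted string
assert not is_valid('"""""""')   # stray trailing quote
assert is_valid('""" """')       # string containing a space
assert not is_valid('""" """"')  # closes greedily, leaving one quote
assert is_valid('""" " """')     # a lone quote inside is fine
```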
There should be a Language.Python.Validation
module that exports key things from the Validation tree. As things are now, I don't know where to get started with validation, and there's nowhere I can document how the pieces fit together.
See the Travis logs for the latest PR.
In a recent deployment we generated lambda a=None,_=None,_=None:None
with the 'correct expression generator', which gets a 'DuplicateArguments' syntax error.
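CPython rejects the generated lambda for the same reason, so the DuplicateArguments error appears to be correct and the generator is at fault. A quick check (`is_valid` is a throwaway helper):

```python
def is_valid(src):
    """Return True iff CPython compiles src without a SyntaxError."""
    try:
        compile(src, "<test>", "eval")
        return True
    except SyntaxError:
        return False

# duplicate parameter names are rejected, even for _
assert not is_valid("lambda a=None,_=None,_=None:None")
assert is_valid("lambda a=None,_=None:None")
```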
This AST:
AtomExprNoAwait
(AtomInteger
(IntegerDecimal (Left (NonZeroDigit_1, [])) ())
())
(Compose
[ Compose . Before [] $
TrailerAccess
(Compose $ Before [] (Identifier "a" ()))
()
])
()
serializes to 1.a, which causes a SyntaxError. This is probably because the tokenizer tries to read 1. as a float literal.
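CPython shows the same behaviour: the tokenizer greedily reads `1.` as a float, so attribute access on an integer literal needs parentheses or a space (`is_valid` is a throwaway helper):

```python
def is_valid(src):
    """Return True iff CPython compiles src without a SyntaxError."""
    try:
        compile(src, "<test>", "eval")
        return True
    except SyntaxError:
        return False

assert not is_valid("1.a")   # tokenizer reads "1." as a float literal
assert is_valid("(1).a")     # compiles; only fails at runtime with AttributeError
assert is_valid("1 .a")      # the space also breaks the float tokenization
```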
Valid forms of not
expressions in Python:
not(True)
not True
not""
not{}
not 1
Invalid as not expressions (these lex as single identifiers):
notTrue
not1
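CPython's own parser confirms the split: the valid spellings produce a unary `not` node, while the "invalid" spellings tokenize as one identifier and never reach the `not` keyword at all:

```python
import ast

# the valid spellings parse as unary `not` expressions
for src in ["not(True)", "not True", 'not""', "not{}", "not 1"]:
    assert isinstance(ast.parse(src, mode="eval").body, ast.UnaryOp)

# the invalid spellings lex as single identifiers (names), not
# not-expressions at all
for src in ["notTrue", "not1"]:
    assert isinstance(ast.parse(src, mode="eval").body, ast.Name)
```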
Here's a simplified form of the AST:
data NotExpr
= NotExprOne KNot [Whitespace] NotExpr
| NotExprNone Comparison
data Symbol = SGt | SLt | ...
data Comparison
= Gt Expr [Whitespace] SGt [Whitespace] Expr
| Lt Expr [Whitespace] SLt [Whitespace] Expr
| ...
data Expr = Number Int | True | False | None | String [Char] | Parens Expr | ...
This permits all the valid uses of not
, but also some invalid uses. For example, one
could write NotExprOne KNot [] (NotExprOne KNot [] ...
, which would result in
notnot...
.
The true whitespace rule for not
is this:
Given (NotExprOne a b c): If print(c) begins with an identifier character, then 'b' should not be empty.
If we were to encode this with types, it might look like this:
data NotExpr'
= NotExprNone' Comparison'
data NotExpr
= NotExprOne KNot (NonEmpty Whitespace) NotExpr
| NotExprOne' KNot [Whitespace] Comparison'
| NotExprNone Comparison
-- | Comparison' will not begin with an identifier character when printed
data Comparison'
= Gt' Expr' [Whitespace] SGt [Whitespace] Expr
| Lt' Expr' [Whitespace] SLt [Whitespace] Expr
-- | Expr' will not begin with an identifier character when printed
data Expr' = Parens' (Either Expr Expr') | String' [Char]
-- | Comparison will begin with an identifier character when printed
data Comparison
= Gt Expr [Whitespace] SGt [Whitespace] Expr
| Lt Expr [Whitespace] SLt [Whitespace] Expr
-- | Expr will begin with an identifier character when printed
data Expr = Number Int | True | False | None
We split terminals into two types: one that will begin with an identifier character when printed, and one that won't. Then we have to propagate this change to all the non-terminals. It essentially doubles the number of types and data constructors. A lot of this grammar (https://docs.python.org/3.5/reference/grammar.html) has to be duplicated to encode this. I'm skeptical of this approach due to the amount of code it requires.
Ideally I still want to have simple prisms on all these types to provide a good user interface.
Now it would have to look something like this:
class FromNot s ws | s -> ws where
_Not :: Prism' NotExpr (KNot, ws, s)
instance FromNot NotExpr (NonEmpty Whitespace) where
_Not = -- match on NotExprOne
instance FromNot Comparison' [Whitespace] where
_Not = -- match on NotExprOne'
class FromComparison comp expr | comp -> expr where
_Gt :: Prism' comp (expr, [Whitespace], SGt, [Whitespace], Expr)
_Lt :: Prism' comp (expr, [Whitespace], SLt, [Whitespace], Expr)
instance FromComparison Comparison' Expr' where
_Gt = -- match on Gt'
_Lt = -- match on Lt'
instance FromComparison Comparison Expr where
_Gt = -- match on Gt
_Lt = -- match on Lt
instance FromComparison NotExpr Expr where
_Gt = -- match on NotExprNone then Gt
_Lt = -- match on NotExprNone then Lt
instance FromComparison NotExpr' Expr' where
_Gt = -- match on NotExprNone' then Gt'
_Lt = -- match on NotExprNone' then Lt'
Eventually I want to be able to parse and validate imported modules. Not going to do it yet, but I'm going to do some brainstorming here.
https://docs.python.org/3.5/reference/import.html#searching
importlib.import_module
as well as import statements.
It definitely should.
There are a bunch of syntax elements that are created using runtime validation, because a type-based correct-by-construction representation is too complex to be useful. All of these runtime-validated types are created using the smart constructor pattern. The problem is that smart constructors prevent helpful prisms.
Here's an example:
module NoNumbers (NoNumbers, mkNoNumbers, _NoNumbers) where
newtype NoNumbers = NoNumbers { unNoNumbers :: String }
mkNoNumbers :: String -> Maybe NoNumbers
mkNoNumbers s
| any isDigit s = Nothing
| otherwise = Just $ NoNumbers s
_NoNumbers :: Prism' NoNumbers String
_NoNumbers = prism NoNumbers (Right . unNoNumbers)
_NoNumbers
is obviously wrong. review _NoNumbers ~ NoNumbers
, which we were trying to hide in the first place.
How do we fix this?
_NoNumbers :: Prism' (Maybe NoNumbers) String
Now, review _NoNumbers :: String -> Maybe NoNumbers
, which is mkNoNumbers
. The downside is
that preview _NoNumbers :: Maybe NoNumbers -> Maybe String
, which will be a pain when chaining
prisms.
_NoNumbers :: Prism NoNumbers (Maybe NoNumbers) String String
Now review _NoNumbers :: String -> Maybe NoNumbers
and preview _NoNumbers :: NoNumbers -> Maybe String
. We can always get a string out, so let's downgrade to Iso
_NoNumbers :: Iso NoNumbers (Maybe NoNumbers) String String
view _NoNumbers :: NoNumbers -> String
view (from _NoNumbers) :: String -> Maybe NoNumbers
Now that we have an accurate optic, let's see how it fares in nested updates
{-# language TemplateHaskell #-}
module Test where
import Control.Lens
import NoNumbers
testMkThing = Thing "hello" <$> ("goodbye" ^. getting (from _NoNumbers))
testSet =
let
a = testMkThing ^?! _Just
in
a & traverseOf thingNoNumbers (_NoNumbers .~ "goodbye1")
It turns out that in this context, an Iso
is a burden. We could have achieved the same outcome with less code by using mkNoNumbers
and unNoNumbers
on their own:
testMkThing = Thing "hello" <$> mkNoNumbers "goodbye"
testSet =
let
a = testMkThing ^?! _Just
in
a & traverseOf thingNoNumbers (\_ -> mkNoNumbers "goodbye1")
<!>
needs to try parsing the left side before it knows whether or not it needs the right, so we will use way too much memory in cases where the parser takes the left side of many <!>
s.
For example
exprListComp :: Parser ann Whitespace -> Parser ann (Expr '[] ann)
exprListComp ws = do
ex <- exprOrStar ws
(\cf ->
Generator (ex ^. exprAnnotation) .
Comprehension (ex ^. exprAnnotation) ex cf) <$>
compFor <*>
many (Left <$> compFor <!> Right <$> compIf)
<!>
(\case
([], Nothing) -> ex
([], Just ws) ->
Tuple (ex ^. exprAnnotation) ex ws Nothing
((ws, ex') : cs, mws) ->
Tuple (ex ^. exprAnnotation) ex ws . Just $ (ex', cs, mws) ^. _CommaSep1') <$>
commaSepRest (exprOrStar anySpace)
will cause an out of memory error for
(()for a in()for a in()for a in()if{**()}for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a 
in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in()for a in ())
For this to work, <!>
needs to be able to determine which side to take based on the current parse state, ala LL/LR(k).
There are three mutable, globally accessible dicts, returned by locals(), globals(), and vars()
. When these dicts are modified, new variables are brought into scope. Warn about these usages. Warnings should only occur when it's the built-in definitions that are accessed. If a variable called globals
is introduced, shadowing the original, then there's no issue.
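A minimal demonstration of why these warrant a warning — writing through globals() really does introduce a binding that no static scope check can see (`introduce` and `injected_name` are invented names for this sketch):

```python
def introduce():
    # this creates a module-level binding invisibly to static analysis
    globals()["injected_name"] = 42

introduce()
assert injected_name == 42  # now in scope
```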
I am trying to implement parsing for Python 3 decorators and am struggling (see http://python-3-patterns-idioms-test.readthedocs.io/en/latest/PythonDecorators.html for details...).
What I have so far is that I've defined:
data Decorator v a
= Decorator (Indents a) (Ident v a)
-- ^ '@' <ident>
deriving (Eq, Show, Functor, Foldable, Traversable)
The Decorator
is parsed using a TkAt
token that contains the string for the decorator's name (I am only tackling argument-less decorators at the moment):
parseDecoratorIdentifier :: Parser ann (Ident '[] ann)
parseDecoratorIdentifier = do
curTk <- currentToken
case curTk of
TkAt s ann -> do
Parser $ consumed ann
pure $ MkIdent ann s []
_ -> Parser . throwError $ ExpectedIdentifier curTk
maybeDecorator :: Parser ann (Maybe (Decorator '[] ann))
maybeDecorator = optional (Decorator <$>
indents <*> parseDecoratorIdentifier)
Then I've augmented FunDef
with a Maybe (Decorator v a)
field.
compoundStatement :: Parser ann (CompoundStatement '[] ann)
compoundStatement =
fundef <!> ....
where
fundef =
(\dec a (tkDef, defSpaces) -> Fundef dec a (pyTokenAnn tkDef) (NonEmpty.fromList defSpaces)) <$>
maybeDecorator <*>
indents <*>
token space (TkDef ()) <*>
.....
Now when I try to parse "@decorate\ndef fun(a:str) -> int:\n return 1"
I get the following error:
┏━━ test/Helpers.hs ━━━
40 ┃ doParse :: (Show ann, Monad m) => ann -> Parser ann a -> Nested ann -> PropertyT m a
41 ┃ doParse initial pa input = do
42 ┃ let res = runParser initial pa input
43 ┃ case res of
44 ┃ Left err -> do
45 ┃ annotateShow err
┃ │ UnexpectedEndOfLine (Caret (Columns 0 0) "@decorate\n")
46 ┃ failure
┃ ^^^^^^^
47 ┃ Right a -> pure a
So the decorator needs to handle the newline and be at the same indentation level as the def
it is attached to. How do I implement that in hpython
?
The grammar has the names the wrong way around, which is why I made this mistake.
if_
, for_
, and while_
all return fully-formed Statement
s, but else_
can only operate on types that have HasElse
. Additionally, the signature of else_
requires its input to be something with an 'else-shaped' hole in it, but the combinators from above produce hole-less things.
I currently see two options: make else_
operate on more things, so it can modify the fully-formed output of if_
et al, or change the control flow combinators to output their pre-Statement
types (like For
and If
), so that else_
can modify those. I like the second option more because it means the types force us to write code that is reflected by the generated Python. If I allow else_
to modify Statement
s, then we're allowed to write redundant applications of else_
to an old statement.
It would help learners if there were examples of validation in the examples directory.
async
and await
are not considered reserved keywords (in Python 3.5/3.6). async
is an identifier unless followed by def
, introducing a function definition. await
is considered an identifier unless used inside an async def
function, in which case it is a keyword.
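Note that this soft-keyword behaviour ended with Python 3.6: since 3.7, async and await are hard keywords and can no longer be used as identifiers, which a quick check confirms:

```python
import sys

# On Python >= 3.7, `async` is a hard keyword, so using it as a
# name is a SyntaxError; on 3.5/3.6 it was still a legal identifier.
if sys.version_info >= (3, 7):
    try:
        compile("async = 1", "<t>", "exec")
        raised = False
    except SyntaxError:
        raised = True
    assert raised
```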
It would be nice if you could derive this generically rather than writing it out by hand.
https://hackage.haskell.org/package/digit
We will probably want to add binary, octal and hex digits to that package too.
I think it probably should
Currently if you validate python code then you just get a list of errors. It would be good to be able to translate that list into a human readable representation. The output would give a succinct explanation of the error, and use the original source file + the annotations in the error to pull out relevant areas of the code so we can pinpoint the bits that caused the error.
For example, if we validate this code:
def a():
x = 1
y = x + b
return y
we could get an error message like:
Line 3: 'b' is not in scope
2 | x = 1
3 | y = x + b
^
4 | return y
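A minimal, hypothetical sketch of such a renderer — `render_error` and its layout are invented for illustration, not hpython's API; a real version would take the annotations carried by the validation errors:

```python
def render_error(src, line, col, msg):
    """Render a clang-style error: message, surrounding lines, and a caret."""
    lines = src.splitlines()
    out = [f"Line {line}: {msg}"]
    for n in range(max(1, line - 1), min(len(lines), line + 1) + 1):
        prefix = f"{n} | "
        out.append(prefix + lines[n - 1])
        if n == line:
            # caret under the offending column
            out.append(" " * (len(prefix) + col) + "^")
    return "\n".join(out)

src = "def a():\n  x = 1\n  y = x + b\n  return y"
print(render_error(src, 3, 10, "'b' is not in scope"))
```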
As part of fixing the Nix situation, we need to upgrade to the latest version of digit.
Currently we warn about all usages of globals
, nonlocals
, and del
. In the general case they interfere with scope checking, but there is a subset of usages that are safe.
We only really need to warn when we need to do extra computation to see if the scope should be modified. Usages at the top-level and in unconditional control flow (like try
, with
) would be okay, but usages inside function calls and if
statements are bad.
When parsing the following definition signature:
def foo(y : str, x : int = 0) -> bool:
...
hpython generates the following parameters list:
[KeywordParam {_paramAnn = Ann {getAnn = SrcInfo {_srcInfoName = "data/test/contractWithSchema.py", _srcInfoLineStart = 50, _srcInfoLineEnd = 50, _srcInfoColStart = 9, _srcInfoColEnd = 10, _srcInfoOffsetStart = 1170, _srcInfoOffsetEnd = 1171}}, _paramName = MkIdent {_identAnn = Ann {getAnn = SrcInfo {_srcInfoName = "data/test/contractWithSchema.py", _srcInfoLineStart = 50, _srcInfoLineEnd = 50, _srcInfoColStart = 9, _srcInfoColEnd = 10, _srcInfoOffsetStart = 1170, _srcInfoOffsetEnd = 1171}}, _identValue = "y", _identWhitespace = [Space,Space]}, _paramType = Nothing, _unsafeKeywordParamWhitespaceRight = [Space], _unsafeKeywordParamExpr = Int {_exprAnn = Ann {getAnn = SrcInfo {_srcInfoName = "data/test/contractWithSchema.py", _srcInfoLineStart = 50, _srcInfoLineEnd = 50, _srcInfoColStart = 14, _srcInfoColEnd = 15, _srcInfoOffsetStart = 1175, _srcInfoOffsetEnd = 1176}}, _unsafeIntValue = IntLiteralDec {_intLiteralAnn = Ann {getAnn = SrcInfo {_srcInfoName = "data/test/contractWithSchema.py", _srcInfoLineStart = 50, _srcInfoLineEnd = 50, _srcInfoColStart = 14, _srcInfoColEnd = 15, _srcInfoOffsetStart = 1175, _srcInfoOffsetEnd = 1176}}, _unsafeIntLiteralDecValue = DecDigit0 :| []}, _unsafeIntWhitespace = []}}]
E.g. the first parameter is given a type of Nothing
and assigned the literal expression from the second argument.
However, the following works as expected:
def other_foo(x : int = 0, y : str) -> bool:
...
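It's worth noting that CPython itself rejects the second signature (a parameter without a default may not follow one with a default), while the first is fine — so hpython's behaviour is inverted on both counts (`is_valid` is a throwaway helper):

```python
def is_valid(src):
    """Return True iff CPython compiles src without a SyntaxError."""
    try:
        compile(src, "<test>", "exec")
        return True
    except SyntaxError:
        return False

assert is_valid("def foo(y: str, x: int = 0) -> bool: ...")
# non-default parameter after a default one is a SyntaxError in CPython
assert not is_valid("def other_foo(x: int = 0, y: str) -> bool: ...")
```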
Def "a" NoArgs {- colon -} [Space, Continued CR []] (Just LF) block
If you create a value like this, then you won't be able to parse the result of pretty printing it, because the CR and LF are rendered adjacent in the string and will be considered a CRLF. You'll end up with a Continued CRLF []
, the parser will look for a newline token to start the block and it'll choke.
The "simplest" way to fix this would be to detect it during syntax checking.
As far as I can tell hpython
only parses standard Python comments:
hpython/src/Language/Python/Internal/Lexer.hs
Lines 115 to 118 in ede398a
Is there any reason that it doesn't support parsing of docstrings, or is it just a use case that QFPL hasn't had need for yet (i.e. would a PR be welcomed, or is it potentially something to add as a milestone for a future release)?
Subscript slicing desugars to a slice
object, so all subscripts are actually a single expression. The comma-separatedness helps form a tuple. Figure out how to make this all line up.
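The desugaring can be observed directly with a probe object — a single slice becomes one slice value, and comma-separated subscripts become a tuple (`Probe` is an invented helper for this sketch):

```python
class Probe:
    """Echo back whatever key the subscript desugars to."""
    def __getitem__(self, key):
        return key

p = Probe()
assert p[1:2] == slice(1, 2)           # one slice object
assert p[1:2, 3] == (slice(1, 2), 3)   # commas build a tuple of items
```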
We rely on type-level-sets for these functions, but that package isn't actively maintained, so we might as well just re-implement them.
f(a=False,*b)
is valid Python, but is reported as a positional after keyword argument error
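A quick check that CPython really does allow a keyword argument before *-unpacking (`f` and `args` are invented for this sketch):

```python
def f(a, b=None, c=None):
    return (a, b, c)

args = [1, 2]
# keyword before *-unpacking is legal: positionals fill a and b, keyword fills c
assert f(c=3, *args) == (1, 2, 3)

# and the exact form from the report compiles without error
compile("f(a=False,*b)", "<test>", "eval")
```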
Until now, the approach to the AST has been "as correct-by-construction as possible, falling back to smart constructors when necessary". Since we've been getting closer to a complete representation, I have been considering another approach that gets similar levels of safety but permits a more elegant library design, and potentially better user experience.
Here it is, applied to a very small AST:
{-# language DataKinds, PolyKinds, LambdaCase, ViewPatterns #-}
module AST (AST, Val(..), _Int, _Add, _Assign, unvalidate, validate) where
import Control.Lens
import Data.Coerce
data Val = UV | V
data AST (a :: Val)
= Int Int
| Add (AST a) (AST a)
| Assign String (AST a)
deriving (Eq, Show)
_Int :: Prism (AST a) (AST UV) Int Int
_Int =
prism
Int
(\case { (unvalidate -> Int a) -> Right a; (unvalidate -> a) -> Left a })
_Add :: Prism (AST a) (AST UV) (AST UV, AST UV) (AST UV, AST UV)
_Add =
prism
(uncurry Add)
(\case { (unvalidate -> Add a b) -> Right (a, b); (unvalidate -> a) -> Left a })
_Assign :: Prism (AST a) (AST UV) (String, AST UV) (String, AST UV)
_Assign =
prism
(uncurry Assign)
(\case { (unvalidate -> Assign a b) -> Right (a, b); (unvalidate -> a) -> Left a })
unvalidate :: AST a -> AST UV
unvalidate = coerce
validate :: AST UV -> Maybe (AST V)
validate (Int a) = Just $ coerce $ Int a
validate (Add a b) =
fmap coerce $
Add <$> (coerce <$> validate a) <*> (coerce <$> validate b)
validate (Assign a b)
| a == "bad" = Nothing
| otherwise =
fmap coerce $
Assign a <$> (coerce <$> validate b)
In this approach, I use a phantom type to indicate that the AST has been validated. Due to the types of the prisms, only unvalidated terms can be constructed, but terms of any validation status can be matched on. This means we can use the same data structure to represent validated and unvalidated terms. The current codebase has two distinct datatypes (with a lot of duplication). The other consequence is that all syntax-correctness checking is moved to run-time. I don't believe this is a bad thing, considering the amount of checks that are already performed at run-time.
This pattern also fixes the optics problem that is demonstrated in #17.
There is a small "safety" flaw with this approach, that a user can just use coerce
to skip the validation stage. Currently I think that's an okay trade-off.
For example, where we have Maybe (NonEmpty Whitespace)
we can instead have Maybe Async
I think that we can remove the "TypedUnnamedStarredParam" (or whatever it's called) error by using
the correct data structure for the contents of a starred parameter.
Python will try to constant-fold expressions like 100 ** 123456789
, which takes a very long time. Output a warning when this is encountered.
For extra credit, write a refactor rule that will change such occurrences to
a = 123456789
100 ** a
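The refactor works because CPython only folds expressions whose operands are all literals; a quick check of the compiled constants shows folding happening for a literal exponent and not for a variable (small exponents used here so the check itself is cheap):

```python
# literal operands: the compiler folds 2 ** 10 to 1024 at compile time
folded = compile("x = 2 ** 10", "<t>", "exec")
assert 1024 in folded.co_consts

# variable operand: no folding is possible, so 1024 never appears
unfolded = compile("a = 10\nx = 2 ** a", "<t>", "exec")
assert 1024 not in unfolded.co_consts
```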
Could we get Foldable1
on types which can implement it?
For instance for Block
it would be nice to be able to convert it to NonEmpty Statement
rather than [Statement]
. I have very little idea about lenses but I understand that in order to use toNonEmptyOf
the type should implement Foldable1
and/or Traversable1
rather than Foldable
as it does now.
The same I guess applies to CommaSep1
and CommaSep1'
It would be awesome to support Python 3.6 f-strings, introduced by PEP 498. For example:
name = 'paul'
message = 'hello'
str = f'{name} says {message}'
bytestring-trie
is a blocker in getting support for newer GHC versions, and I'm not sure that we care about blazing fast scope analysis quite yet.
Is it possible to write this function?
isPythonic :: PythonAST -> Bool
In the megaparsec-strict
branch I have ported the lexer to megaparsec over strict text. It's 10%+ faster than the trifecta version, but it isn't doing the extra work to be able to produce the clang-style errors. It's possible that implementing this in megaparsec would make it slower than the trifecta version.
Let's find out.
I can't build on Nix.
It would be good to give a commentary through the examples so that someone can pick up a lot about the library by reading the examples.
https://docs.python.org/3.5/reference/simple_stmts.html#the-del-statement
Currently, del
only supports Identifiers as targets.
Scope checking will need to be updated when this is completed, because there are usages of del
that can be statically known not to interfere with scoping.
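A sketch of the distinction (`safe` and `unsafe` are invented names): a del whose target is unconditionally rebound afterwards cannot affect later scope, while a conditional del can leave a local unbound at runtime, which is the case static checking has to worry about:

```python
def safe():
    x = 1
    del x    # statically fine: x is unconditionally rebound below
    x = 2
    return x

def unsafe(flag):
    x = 1
    if flag:
        del x
    return x  # may raise UnboundLocalError, depending on flag

assert safe() == 2
assert unsafe(False) == 1
try:
    unsafe(True)
    raised = False
except UnboundLocalError:
    raised = True
assert raised
```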