galimatias

galimatias is a URL parsing and normalization library written in Java.

Design goals

  • Parse URLs as browsers do, optionally enforcing compliance with old standards (i.e. RFC 3986, RFC 2396).
  • Stay as close as possible to WHATWG's URL Standard.
  • Convenient fluent API with immutable URL objects.
  • Interoperable with java.net.URL and java.net.URI.
  • Minimal dependencies.

Gotchas

galimatias is not a generic URI parser. It can parse any URI, but only schemes defined in the URL Standard (i.e. http, https, ftp, ws, wss, gopher, file) will be parsed as hierarchical URIs. For example, in git://github.com/smola/galimatias.git you'll be able to extract the scheme (i.e. git) and the scheme data (i.e. //github.com/smola/galimatias.git), but not the host (i.e. github.com). This is intended. We cannot guarantee that applying a set of generic rules won't break certain kinds of URIs, so we do not attempt to parse them. I will consider adding further support for other schemes if enough people provide solid use cases and testing. You can check this issue if you are interested.

But, why?

galimatias started out of frustration with java.net.URL and java.net.URI. Both of them are good for basic use cases, but severely broken for others:

  • java.net.URL.equals() is broken.

  • java.net.URI can parse only RFC 2396 URI syntax: it rejects any URI that is not strictly compliant with RFC 2396. Most URLs found in the wild do not comply with any syntax standard, and RFC 2396 is outdated anyway.

  • java.net.URI is not protocol-aware. http://example.com, http://example.com/ and http://example.com:80 are different entities.

  • Manipulation is a pain. I haven't seen any URL manipulation code using java.net.URL or java.net.URI that is simple and concise.

  • Not IDN ready. Java has IDN support with java.net.IDN, but this does not apply to java.net.URL or java.net.URI.
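
The points about strict parsing and protocol awareness can be reproduced with the JDK alone. The following is a minimal sketch (the class and method names are made up for illustration) that checks whether java.net.URI accepts a URL containing a space and whether it treats three spellings of the same HTTP resource as equal.

```java
import java.net.URI;
import java.net.URISyntaxException;

class JavaNetUriLimits {

    // Returns true if java.net.URI accepts the string, false if it throws.
    static boolean parses(String s) {
        try {
            new URI(s);
            return true;
        } catch (URISyntaxException ex) {
            return false;
        }
    }

    // Compares two strings as java.net.URI values; no protocol-aware
    // normalization (default port, empty path) is applied.
    static boolean sameUri(String a, String b) {
        return URI.create(a).equals(URI.create(b));
    }

    public static void main(String[] args) {
        System.out.println(parses("http://example.com/a b"));
        System.out.println(sameUri("http://example.com", "http://example.com/"));
        System.out.println(sameUri("http://example.com", "http://example.com:80"));
    }
}
```

All three calls print false: the space is rejected, and neither the trailing slash nor the explicit default port is normalized away.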

Setup with Maven

galimatias is available at Maven Central. Just add this to the <dependencies> section of your pom.xml:

<dependency>
  <groupId>io.mola.galimatias</groupId>
  <artifactId>galimatias</artifactId>
  <version>0.2.1</version>
</dependency>

Development snapshots are also available in the Sonatype OSS Snapshots repository.

Getting started

Parse a URL

// Parse
String urlString = //...
URL url;
try {
  url = URL.parse(urlString);
} catch (GalimatiasParseException ex) {
  // Do something with non-recoverable parsing error
}

Convert to java.net.URL

URL url = //...
java.net.URL javaURL;
try {
  javaURL = url.toJavaURL();
} catch (MalformedURLException ex) {
  // This can happen if scheme is not http, https, ftp, file or jar.
}

Convert to java.net.URI

URL url = //...
java.net.URI javaURI;
try {
  javaURI = url.toJavaURI();
} catch (URISyntaxException ex) {
  // This will happen in rare cases such as "foo://"
}

Parse a URL with strict error handling

You can use a strict error handler that will throw an exception on any invalid URL, even if it's a recoverable error.

URLParsingSettings settings = URLParsingSettings.create()
  .withErrorHandler(StrictErrorHandler.getInstance());
URL url = URL.parse(settings, urlString);

Documentation

Check out the Javadoc.

Contribute

Did you find a bug? Report it on GitHub.

Did you write a patch? Send a pull request.

Something else? Email me at [email protected].

License

Copyright (c) 2013-2014 Santiago M. Mola [email protected]

galimatias is released under the terms of the MIT License.

galimatias's People

Contributors

blicksky, dependabot[bot], edwelker, fmela, josephw, ocadaruma, sideshowbarker, smola

galimatias's Issues

Provide command line client

A command line client would be a useful tool for URL debugging, as well as a good showcase for galimatias.

Distinguish between different types of parse exceptions

I'm using your library to take URLs supplied by users, which may contain invalid syntax such as spaces, and convert them to valid URIs with URL.parse(it).toJavaURI().

One additional case that I'd like to cover is when the user leaves out the scheme. In this case, I would like to default it to http. Currently, I'm doing this by catching GalimatiasParseException and checking whether the exception's message is "Missing scheme".

This is working well for me, but checking the exception's message is very brittle. I'd like to suggest that there be a few subclasses of GalimatiasParseException, including something like MissingSchemeException, so that it can be captured in a safer way.

I'd be happy to submit a pull request if you think that this is a worthwhile enhancement.
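
To make the proposal concrete, here is a minimal, self-contained sketch. Everything in it is hypothetical: ParseException stands in for GalimatiasParseException, MissingSchemeException is the subclass suggested above, and parse is a toy parser that only checks for empty input and for a scheme. It is not galimatias code; it only shows how a typed exception replaces the message comparison.

```java
class ParseErrorTypes {

    // Hypothetical stand-in for GalimatiasParseException.
    static class ParseException extends Exception {
        ParseException(String message) {
            super(message);
        }
    }

    // Hypothetical subclass proposed in this issue.
    static class MissingSchemeException extends ParseException {
        MissingSchemeException() {
            super("Missing scheme");
        }
    }

    // Toy parser: rejects empty input and input without a leading "scheme:".
    static String parse(String input) throws ParseException {
        if (input.isEmpty()) {
            throw new ParseException("Empty input");
        }
        if (!input.matches("[A-Za-z][A-Za-z0-9+.-]*:.*")) {
            throw new MissingSchemeException();
        }
        return input;
    }

    // Falls back to http when, and only when, the scheme is missing.
    static String parseWithDefaultScheme(String input) throws ParseException {
        try {
            return parse(input);
        } catch (MissingSchemeException ex) {
            return parse("http://" + input);
        }
    }

    public static void main(String[] args) throws ParseException {
        System.out.println(parseWithDefaultScheme("example.com"));
    }
}
```

Because the catch clause names the subclass, any other parse error (here, empty input) still propagates to the caller unchanged.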

Add encoding/decoding utils

Utils that we should provide:

  • encodePathSegment / decodePathSegment
  • encodePath / decodePath
  • encodeQuery / decodeQuery
  • encodeFragment / decodeFragment
  • encodeSchemeData / decodeSchemeData ?

Host encoding/decoding is already provided through the Host class and its subclasses.
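
As a starting point for discussion, one of these pairs could look like the sketch below. It is an assumption-laden illustration, not the planned API: the class name is made up, the encoder leaves only the RFC 3986 "unreserved" characters untouched (the WHATWG encode sets differ), and the decoder passes malformed escapes through unchanged.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

class PathSegmentCodec {

    // Assumption: only RFC 3986 "unreserved" characters are left unencoded.
    private static final String UNRESERVED =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~";

    static String encodePathSegment(String segment) {
        StringBuilder out = new StringBuilder();
        for (byte b : segment.getBytes(StandardCharsets.UTF_8)) {
            int value = b & 0xFF;
            if (UNRESERVED.indexOf(value) >= 0) {
                out.append((char) value);
            } else {
                out.append('%').append(String.format("%02X", value));
            }
        }
        return out.toString();
    }

    static String decodePathSegment(String encoded) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        int i = 0;
        while (i < encoded.length()) {
            boolean hasTwoMore = i + 2 < encoded.length();
            int hi = hasTwoMore ? hexValue(encoded.charAt(i + 1)) : -1;
            int lo = hasTwoMore ? hexValue(encoded.charAt(i + 2)) : -1;
            if (encoded.charAt(i) == '%' && hi >= 0 && lo >= 0) {
                // Well-formed escape: emit the decoded byte.
                bytes.write(hi * 16 + lo);
                i += 3;
            } else {
                // Anything else is copied through as its UTF-8 bytes.
                int codePoint = encoded.codePointAt(i);
                byte[] raw = new String(Character.toChars(codePoint))
                    .getBytes(StandardCharsets.UTF_8);
                bytes.write(raw, 0, raw.length);
                i += Character.charCount(codePoint);
            }
        }
        return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
    }

    // Returns the value of an ASCII hex digit, or -1 if c is not one.
    private static int hexValue(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }

    public static void main(String[] args) {
        String encoded = encodePathSegment("a b/c");
        System.out.println(encoded);
        System.out.println(decodePathSegment(encoded));
    }
}
```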

ICU4J dependency

Hi Santiago,

We have been using galimatias for quite some time now without any issue.

You recently introduced a dependency on ICU4J in this commit 5ce2cb9 and I was wondering if there was a way to fix the issue without adding this dependency.

You might have missed it but ICU4J is a 10 MB jar which is quite huge.

Thanks for your work.

Refine Host API

Checking if a host is a domain or IP shouldn't require instanceof.

URL parsing does not convert characters: [ and ] to correct presentation.

Example: System.out.println(URL.parse("http://test.com/path=[test]").toString());

Output: http://test.com/path=[test]

However, if toJavaURI() is used:
System.out.println(URL.parse("http://test.com/path=[test]").toJavaURI().toString());

Output: http://test.com/path=%5Btest%5D

As far as I understand, the first case should return the same result as the second, since it should be the same converted string.

There is a method "toHumanString()" which in fact should return (and does return) what "toString()" now incorrectly returns.

Add builder API

A builder API for URLs might be really useful.
At the moment, I'm delaying this to a version beyond 0.0.1.

Add URL manipulation methods

Currently, there is only URL.withScheme(String) as a proof-of-concept. This needs to be extended to all URL attributes.

Replace old RFC parsing by normalizers

Special-casing the parser in order to normalize URLs to old RFCs is overkill. Let's move the old RFC parsing to separate normalizers. Together with this, toJavaURI should be changed to not throw URISyntaxException (by changing the constructor).

For error-reporting parser, if URL contains whitespace char, report more specific “… contains space character.” message

Currently, when the error-reporting parser is turned on and a URL contains a space character (or tab or newline), a generic “… contains invalid character” message is emitted. It would be much more helpful if a specific “… contains space character” message (or “contains tab character” or “contains newline”) were emitted instead.

I seem to recall you saying that for the next release you were already planning on having the error messages emit the invalid character in the message. So maybe this is already on your radar. If not I’d be happy to provide a patch.

Use safer java.net.URI constructor

We can minimize URISyntaxException in URL.toJavaURI by using a safer constructor such as:

URI(String scheme, String userInfo, String host, int port, String path, String query, String fragment)
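
A minimal sketch of what that constructor does with characters the single-argument constructor would reject. The wrapper class and method are made up for illustration; only the java.net.URI constructor itself comes from the JDK.

```java
import java.net.URI;
import java.net.URISyntaxException;

class SaferUriConstructor {

    // Builds a URI from components, letting java.net.URI quote illegal characters.
    // User info is omitted (null) and the port is left undefined (-1).
    static String build(String scheme, String host, String path, String query, String fragment) {
        try {
            return new URI(scheme, null, host, -1, path, query, fragment).toString();
        } catch (URISyntaxException ex) {
            throw new IllegalArgumentException(ex);
        }
    }

    public static void main(String[] args) {
        System.out.println(build("http", "example.com", "/a b", "q=1 2", "frag ment"));
        System.out.println(build("http", "test.com", "/path=[test]", null, null));
    }
}
```

One caveat: these constructors always quote the percent character, so a component that is already percent-encoded (e.g. /a%20b) comes out double-encoded (/a%2520b). Components would have to be passed in decoded form.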

Need feedback: Describe your use cases for non-HTTP(S) URIs

Galimatias was born with the goal of parsing URLs that can be opened in a web browser. For my use cases, this included http, https and data. It soon became obvious that support for ftp, gopher, ws, wss and file would be sane and cheap to add since they are supported by most modern browsers and are specified in WHATWG's URL Standard.

Support for any other scheme is currently in place in a limited way. If there is a double slash ("//") after the scheme (e.g. "git://"), the URL is parsed as a hierarchical URI. Otherwise, it is parsed as an opaque URI. This is known to work with any URI except ed2k links (which are far from compliant with RFC 3986).
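
The rule described in the previous paragraph can be sketched in a few lines. This is an illustration of the rule only, with made-up names, and not the actual galimatias parser.

```java
class SchemeClassifier {

    enum Kind { HIERARCHICAL, OPAQUE }

    // Classifies a URI by looking at what follows the first colon:
    // "//" means hierarchical, anything else means opaque.
    static Kind classify(String uri) {
        int colon = uri.indexOf(':');
        if (colon < 0) {
            throw new IllegalArgumentException("Missing scheme");
        }
        return uri.startsWith("//", colon + 1) ? Kind.HIERARCHICAL : Kind.OPAQUE;
    }

    public static void main(String[] args) {
        System.out.println(classify("git://github.com/smola/galimatias.git"));
        System.out.println(classify("mailto:user@example.com"));
    }
}
```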

Before going on and overengineering anything, I would like to hear about your use cases with handling URIs other than http and https in Java. I will use this feedback to better define the scope of galimatias.

Thank you for your time!

Decide on the behaviour of default/empty URL fields

I've changed to match WHATWG and W3C specs here, but that's an important deviation from java.net.URI and java.net.URL (including Android implementation).

java.net.URI, since it performs no normalization operation at all, accepts both empty and null userInfo. In our case, userInfo is never null. The same happens with authority and other fields.

We probably want this to stay this way, but some warnings to javadocs of the corresponding methods would be nice.

Release v0.2.2

I would like to use the new URL.searchParameters method, but it's not in 0.2.1. Could you make a new release?

Regression in error reporting for bad fragments

For illegal characters in fragments, galimatias now unexpectedly reports Illegal character in path segment rather than Illegal character in fragment as expected.

v0.2.1

$ java -cp dependencies/galimatias-0.2.1.jar:dependencies/icu4j-53_1.jar io.mola.galimatias.cli.CLI http://foo/path#f#g
Base: http://example.org/foo/bar
Analyzing URL: http://foo/path#f#g
Parsing...
    Recoverable error found;
        Error: Illegal character in path segment: not a URL code point
        Position: 17

v0.1.0

$ java -cp dependencies/galimatias-0.1.0.jar:dependencies/icu4j-53_1.jar io.mola.galimatias.cli.CLI http://foo/path#f#g
Base: http://example.org/foo/bar
Analyzing URL: http://foo/path#f#g
Parsing...
    Recoverable error found;
        Error: Illegal character in fragment: not a URL code point
        Position: 17

The actual results of parsing for this case are the same with v0.2.1 as with v0.1.0; the only difference is the error message reporting "path segment" instead of "fragment".

Percent-decode domain before parsing it

WHATWG specifies domain parsing as:

  • Let host be the result of running utf-8's decoder on the percent decoding of input.
  • Let domain be the result of splitting host on any domain label separators.
  • Return the result of running domain to ASCII on domain.

This behaviour does not seem to be consistent across browsers, though. At the moment, we'll just follow the spec here.
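
A rough sketch of the three spec steps listed above, using only the JDK. Two assumptions are worth flagging: java.net.IDN.toASCII is used as a stand-in for the spec's "domain to ASCII" (it implements IDNA 2003, which is not identical), and it performs the label splitting internally, so step two is not a separate call here. The class and method names are made up.

```java
import java.io.ByteArrayOutputStream;
import java.net.IDN;
import java.nio.charset.StandardCharsets;

class DomainParsingSketch {

    // Returns the value of an ASCII hex digit, or -1 if b is not one.
    private static int hexValue(int b) {
        if (b >= '0' && b <= '9') return b - '0';
        if (b >= 'a' && b <= 'f') return b - 'a' + 10;
        if (b >= 'A' && b <= 'F') return b - 'A' + 10;
        return -1;
    }

    // Percent-decodes the input into raw bytes; malformed escapes are kept as-is.
    static byte[] percentDecode(String input) {
        byte[] in = input.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < in.length) {
            boolean escape = in[i] == '%' && i + 2 < in.length
                && hexValue(in[i + 1]) >= 0 && hexValue(in[i + 2]) >= 0;
            if (escape) {
                out.write(hexValue(in[i + 1]) * 16 + hexValue(in[i + 2]));
                i += 3;
            } else {
                out.write(in[i]);
                i++;
            }
        }
        return out.toByteArray();
    }

    static String parseDomain(String input) {
        // Step 1: UTF-8 decode the percent-decoded input.
        String host = new String(percentDecode(input), StandardCharsets.UTF_8);
        // Steps 2 and 3: IDN.toASCII splits on label separators and converts each label.
        return IDN.toASCII(host);
    }

    public static void main(String[] args) {
        System.out.println(parseDomain("ex%61mple.com"));
        System.out.println(parseDomain("b%C3%BCcher.example"));
    }
}
```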

Check for DNS length limits

DNS imposes a limit of 63 bytes on each label, a maximum of 127 labels, and a maximum of 253 characters overall. We should check for these limits.

This is where empty labels (except the root label) should throw an error.
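
A possible shape for such a check is sketched below. It assumes the domain is already in its ASCII form (so characters and bytes coincide), that a single trailing dot denotes the root label and does not count towards the 253-character limit, and that a boolean result is enough; the real check would presumably report an error instead. Names are made up.

```java
class DnsLimits {

    static final int MAX_LABEL_LENGTH = 63;
    static final int MAX_LABELS = 127;
    static final int MAX_NAME_LENGTH = 253;

    // Checks an ASCII domain name against the DNS length limits.
    static boolean withinDnsLimits(String asciiDomain) {
        // A single trailing dot is the root label; strip it before checking.
        String name = asciiDomain.endsWith(".")
            ? asciiDomain.substring(0, asciiDomain.length() - 1)
            : asciiDomain;
        if (name.isEmpty() || name.length() > MAX_NAME_LENGTH) {
            return false;
        }
        // The -1 limit keeps empty labels so that they can be rejected below.
        String[] labels = name.split("\\.", -1);
        if (labels.length > MAX_LABELS) {
            return false;
        }
        for (String label : labels) {
            if (label.isEmpty() || label.length() > MAX_LABEL_LENGTH) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(withinDnsLimits("example.com"));
        System.out.println(withinDnsLimits("a..b"));
    }
}
```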

Add a setting to use a default scheme for parsing

As discussed in #35, I'll try to add the possibility to use a default scheme via URLParsingSettings. This is useful for parsing URLs introduced by users where a full absolute URL is expected but the user misses the scheme (e.g. example.com, not http://example.com).

Parsing U+10000 or above in username produces unexpected result

Test case: http://💩:[email protected]
(U+1F4A9 in username)

Results from galimatias:

$ java -cp dependencies/galimatias-0.2.1.jar:dependencies/icu4j-53_1.jar io.mola.galimatias.cli.CLI http://💩:[email protected]
Base: http://example.org/foo/bar
Analyzing URL: http://💩:[email protected]
Parsing...
    Recoverable error found;
        Error: Illegal character in user or password: not a URL code point
        Position: 13
    Result:
        URL: http://%F0%9F%92%A9%3F:[email protected]/
        URL type: hierarchical
        Scheme: http
        Scheme data: 
        Username: %F0%9F%92%A9%3F
        Password: fo
        Host: example.com
        Port: 80
        Path: /
Canonicalizing with RFC 3986 rules...
    Result identical to WHATWG rules
Canonicalizing with RFC 2396 rules...
    Result identical to RFC 3986 rules

Notice that it shows %F0%9F%92%A9%3F as the result for the username part, while it should just be %F0%9F%92%A9, and it shows fo instead of foo as the result for the password part.

I get expected results with code points in username less than U+FFFF; e.g., ○ (U+FFEE).
But with U+10000 and higher, e.g., 𐀠 (U+10020), I get the same unexpected behavior as above.
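
The extra %3F (an encoded "?") and the off-by-one in the password would be consistent with the input being walked one UTF-16 char at a time, so that a surrogate pair is treated as two characters. That is a guess, not a confirmed diagnosis. The sketch below, with made-up names, shows the difference between char-based and code-point-based iteration in Java and a percent-encoder that advances by code point.

```java
import java.nio.charset.StandardCharsets;

class CodePointEncoding {

    // Percent-encodes every code point of the input as UTF-8,
    // advancing by code point rather than by UTF-16 char.
    static String percentEncodeAll(String input) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < input.length()) {
            int codePoint = input.codePointAt(i);
            String unit = new String(Character.toChars(codePoint));
            for (byte b : unit.getBytes(StandardCharsets.UTF_8)) {
                out.append('%').append(String.format("%02X", b & 0xFF));
            }
            // A supplementary code point occupies two chars, so this adds 2.
            i += Character.charCount(codePoint);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String pileOfPoo = "\uD83D\uDCA9"; // U+1F4A9 as a surrogate pair
        System.out.println(pileOfPoo.length());
        System.out.println(pileOfPoo.codePointCount(0, pileOfPoo.length()));
        System.out.println(percentEncodeAll(pileOfPoo));
    }
}
```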

Determine encoding behaviour of #fragment

java.net.URL, Android's URL, Gecko and WebKit accept "#foo bar" as a valid fragment. RFC 3986 does not, and java.net.URI doesn't either. "#foo%20bar" is left as-is for WebKit, while it is decoded to "#foo bar" in Gecko.

Let's give this a properly defined behaviour for the different parsing versions (WHATWG, RFC...).

Optionally normalize empty path segments ("//" and trailing slash in path)

Multiple slashes

Multiple slashes together are ok with standards and have a different meaning than just one slash. That is: "/foo/bar" should be translated to path segments "foo" and "bar", while "/foo//bar" is "foo", "" and "bar".
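
A small sketch of that segment model, plus what an optional, lossy normalization could look like. The class and method names are hypothetical, and dropping empty segments is just one of several possible normalization policies.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PathSegments {

    // Splits an absolute path into segments, keeping empty ones.
    static List<String> segments(String path) {
        if (!path.startsWith("/")) {
            throw new IllegalArgumentException("Expected an absolute path");
        }
        // The -1 limit preserves trailing empty segments.
        return Arrays.asList(path.substring(1).split("/", -1));
    }

    // Optional, lossy normalization: drops every empty segment,
    // including the one produced by a trailing slash.
    static List<String> withoutEmptySegments(String path) {
        List<String> result = new ArrayList<>();
        for (String segment : segments(path)) {
            if (!segment.isEmpty()) {
                result.add(segment);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(segments("/foo/bar"));
        System.out.println(segments("/foo//bar"));
        System.out.println(withoutEmptySegments("/foo//bar"));
    }
}
```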

Some people use significant empty segments in their paths (see this). However, the most common case is that multiple slashes are not significant and are produced as an unintended consequence of bad serialization.

Trailing slash

It's generally accepted that a trailing slash can be added to a URL path if there is no "file extension" (e.g. /foo -> /foo/ but not /foo.html -> /foo.html/). However, that changes semantics according to RFC 3986 and might break well-formed URLs in lots of cases.

Further considerations

Both of these normalizations can break standard-compliant URLs. So they should be optional and the user should be warned. Also, when to perform this normalization (during parsing or after parsing) is important, since it can change the result of /../.

Proper processing of these cases (as Google seems to be doing) is normalizing according to the result of fetching the URL and processing redirects and <link rel="canonical">.

Because of all of this, I still doubt that providing these normalizations in Galimatias is a sane choice.

coveralls integration is broken

[ERROR] Failed to execute goal org.eluder.coveralls:coveralls-maven-plugin:2.1.0:cobertura (default-cli) on project galimatias: IO operation failed: /home/travis/build/smola/galimatias/target/site/cobertura/coverage.xml (No such file or directory) -> [Help 1]

It was broken here:
57f699e

Process hashbangs

A hashbang (#!) should be converted to / by a crawler that needs to fetch the page without using JavaScript.

Expose host parsing error position to the URL parser

Currently, any host error message is exposed to the ErrorHandler of a URLParser as a GalimatiasException with its position set to the beginning of the host. In order to get the actual position of the error, one must get the wrapped GalimatiasErrorException and calculate it. It might be nice to fix this in the future.

Add URL.toHumanString method

Add a method to convert a URL to a human-understandable String. That is, domains converted to Unicode and printable characters percent-decoded.
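
For the domain half of this, the JDK already offers a building block. The sketch below uses java.net.IDN.toUnicode (which implements IDNA 2003) on a Punycode host; the wrapper class is made up, and percent-decoding of printable characters in the rest of the URL is not shown.

```java
import java.net.IDN;

class HumanReadableHost {

    // Converts an ASCII (Punycode) host to its Unicode form.
    static String toUnicodeHost(String asciiHost) {
        return IDN.toUnicode(asciiHost);
    }

    public static void main(String[] args) {
        System.out.println(toUnicodeHost("xn--bcher-kva.example"));
    }
}
```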

IPv4 hosts handling

Currently, a Host can be either a Domain or an IPv6Address. Determine behaviour for IPv4Address.
