Purell

Purell is a tiny Go library to normalize URLs. It returns a pure URL. Pure-ell. Sanitizer and all. Yeah, I know...

Based on the Wikipedia article on URL normalization and the RFC 3986 document.

Install

go get github.com/PuerkitoBio/purell

Changelog

  • v1.1.1 : Fix failing test due to Go1.12 changes (thanks to @ianlancetaylor).
  • 2016-11-14 (v1.1.0) : IDN: Conform to RFC 5895: Fold character width (thanks to @beeker1121).
  • 2016-07-27 (v1.0.0) : Normalize IDN to ASCII (thanks to @zenovich).
  • 2015-02-08 : Add fix for relative paths issue (PR #5) and add fix for unnecessary encoding of reserved characters (see issue #7).
  • v0.2.0 : Add benchmarks, Attempt IDN support.
  • v0.1.0 : Initial release.

Examples

From example_test.go (note that in your code, you would import "github.com/PuerkitoBio/purell", and would prefix references to its methods and constants with "purell."):

package purell

import (
  "fmt"
  "net/url"
)

func ExampleNormalizeURLString() {
  if normalized, err := NormalizeURLString("hTTp://someWEBsite.com:80/Amazing%3f/url/",
    FlagLowercaseScheme|FlagLowercaseHost|FlagUppercaseEscapes); err != nil {
    panic(err)
  } else {
    fmt.Print(normalized)
  }
  // Output: http://somewebsite.com:80/Amazing%3F/url/
}

func ExampleMustNormalizeURLString() {
  normalized := MustNormalizeURLString("hTTpS://someWEBsite.com:443/Amazing%fa/url/",
    FlagsUnsafeGreedy)
  fmt.Print(normalized)

  // Output: http://somewebsite.com/Amazing%FA/url
}

func ExampleNormalizeURL() {
  if u, err := url.Parse("Http://SomeUrl.com:8080/a/b/.././c///g?c=3&a=1&b=9&c=0#target"); err != nil {
    panic(err)
  } else {
    normalized := NormalizeURL(u, FlagsUsuallySafeGreedy|FlagRemoveDuplicateSlashes|FlagRemoveFragment)
    fmt.Print(normalized)
  }

  // Output: http://someurl.com:8080/a/c/g?c=3&a=1&b=9&c=0
}

API

As seen in the examples above, purell offers three functions: NormalizeURLString(string, NormalizationFlags) (string, error), MustNormalizeURLString(string, NormalizationFlags) string and NormalizeURL(*url.URL, NormalizationFlags) string. They all normalize the provided URL based on the specified flags. Here are the available flags:

const (
	// Safe normalizations
	FlagLowercaseScheme           NormalizationFlags = 1 << iota // HTTP://host -> http://host, applied by default in Go1.1
	FlagLowercaseHost                                            // http://HOST -> http://host
	FlagUppercaseEscapes                                         // http://host/t%ef -> http://host/t%EF
	FlagDecodeUnnecessaryEscapes                                 // http://host/t%41 -> http://host/tA
	FlagEncodeNecessaryEscapes                                   // http://host/!"#$ -> http://host/%21%22#$
	FlagRemoveDefaultPort                                        // http://host:80 -> http://host
	FlagRemoveEmptyQuerySeparator                                // http://host/path? -> http://host/path

	// Usually safe normalizations
	FlagRemoveTrailingSlash // http://host/path/ -> http://host/path
	FlagAddTrailingSlash    // http://host/path -> http://host/path/ (should choose only one of these add/remove trailing slash flags)
	FlagRemoveDotSegments   // http://host/path/./a/b/../c -> http://host/path/a/c

	// Unsafe normalizations
	FlagRemoveDirectoryIndex   // http://host/path/index.html -> http://host/path/
	FlagRemoveFragment         // http://host/path#fragment -> http://host/path
	FlagForceHTTP              // https://host -> http://host
	FlagRemoveDuplicateSlashes // http://host/path//a///b -> http://host/path/a/b
	FlagRemoveWWW              // http://www.host/ -> http://host/
	FlagAddWWW                 // http://host/ -> http://www.host/ (should choose only one of these add/remove WWW flags)
	FlagSortQuery              // http://host/path?c=3&b=2&a=1&b=1 -> http://host/path?a=1&b=1&b=2&c=3

	// Normalizations not in the wikipedia article, required to cover tests cases
	// submitted by jehiah
	FlagDecodeDWORDHost           // http://1113982867 -> http://66.102.7.147
	FlagDecodeOctalHost           // http://0102.0146.07.0223 -> http://66.102.7.147
	FlagDecodeHexHost             // http://0x42660793 -> http://66.102.7.147
	FlagRemoveUnnecessaryHostDots // http://.host../path -> http://host/path
	FlagRemoveEmptyPortSeparator  // http://host:/path -> http://host/path

	// Convenience set of safe normalizations
	FlagsSafe NormalizationFlags = FlagLowercaseHost | FlagLowercaseScheme | FlagUppercaseEscapes | FlagDecodeUnnecessaryEscapes | FlagEncodeNecessaryEscapes | FlagRemoveDefaultPort | FlagRemoveEmptyQuerySeparator

	// For convenience sets, "greedy" uses the "remove trailing slash" and "remove www. prefix" flags,
	// while "non-greedy" uses the "add (or keep) the trailing slash" and "add www. prefix".

	// Convenience set of usually safe normalizations (includes FlagsSafe)
	FlagsUsuallySafeGreedy    NormalizationFlags = FlagsSafe | FlagRemoveTrailingSlash | FlagRemoveDotSegments
	FlagsUsuallySafeNonGreedy NormalizationFlags = FlagsSafe | FlagAddTrailingSlash | FlagRemoveDotSegments

	// Convenience set of unsafe normalizations (includes FlagsUsuallySafe)
	FlagsUnsafeGreedy    NormalizationFlags = FlagsUsuallySafeGreedy | FlagRemoveDirectoryIndex | FlagRemoveFragment | FlagForceHTTP | FlagRemoveDuplicateSlashes | FlagRemoveWWW | FlagSortQuery
	FlagsUnsafeNonGreedy NormalizationFlags = FlagsUsuallySafeNonGreedy | FlagRemoveDirectoryIndex | FlagRemoveFragment | FlagForceHTTP | FlagRemoveDuplicateSlashes | FlagAddWWW | FlagSortQuery

	// Convenience set of all available flags
	FlagsAllGreedy    = FlagsUnsafeGreedy | FlagDecodeDWORDHost | FlagDecodeOctalHost | FlagDecodeHexHost | FlagRemoveUnnecessaryHostDots | FlagRemoveEmptyPortSeparator
	FlagsAllNonGreedy = FlagsUnsafeNonGreedy | FlagDecodeDWORDHost | FlagDecodeOctalHost | FlagDecodeHexHost | FlagRemoveUnnecessaryHostDots | FlagRemoveEmptyPortSeparator
)
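
To make the host-decoding flags more concrete, here is a minimal sketch of how a DWORD or hexadecimal host maps to a dotted-quad address. It is not purell's implementation; the helper name dwordToDotted is hypothetical, and the dotted octal form (FlagDecodeOctalHost) is not handled here.

```go
package main

import (
	"fmt"
	"strconv"
)

// dwordToDotted converts a host written as a single 32-bit number
// (decimal such as "1113982867", or hex such as "0x42660793") into
// dotted-quad notation. Hypothetical helper, not part of purell.
func dwordToDotted(host string) (string, error) {
	// Base 0 lets ParseUint pick the base from the prefix ("0x" means hex).
	n, err := strconv.ParseUint(host, 0, 32)
	if err != nil {
		return "", err
	}
	// Each byte of the 32-bit value is one component of the address.
	return fmt.Sprintf("%d.%d.%d.%d", byte(n>>24), byte(n>>16), byte(n>>8), byte(n)), nil
}

func main() {
	for _, h := range []string{"1113982867", "0x42660793"} {
		ip, err := dwordToDotted(h)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s -> %s\n", h, ip) // both map to 66.102.7.147
	}
}
```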

For convenience, the set of flags FlagsSafe, FlagsUsuallySafe[Greedy|NonGreedy], FlagsUnsafe[Greedy|NonGreedy] and FlagsAll[Greedy|NonGreedy] are provided for the similarly grouped normalizations on wikipedia's URL normalization page. You can add (using the bitwise OR | operator) or remove (using the bitwise AND NOT &^ operator) individual flags from the sets if required, to build your own custom set.
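
The following sketch shows the add/remove pattern in isolation. To stay runnable without the library, it declares a small stand-in flag type and a few stand-in constants (normalizationFlags, flagsBase, and so on are assumptions local to this example, not purell's values); only the use of | and &^ is the point.

```go
package main

import "fmt"

// normalizationFlags is a local stand-in for purell.NormalizationFlags,
// declared here only so the sketch runs without the library.
type normalizationFlags uint

const (
	flagLowercaseScheme normalizationFlags = 1 << iota
	flagLowercaseHost
	flagRemoveTrailingSlash
	flagRemoveFragment

	// Stand-in convenience set, analogous to the Flags* sets above.
	flagsBase = flagLowercaseScheme | flagLowercaseHost | flagRemoveTrailingSlash
)

// has reports whether every bit of f is set in set.
func has(set, f normalizationFlags) bool {
	return set&f == f
}

// customFlags starts from the base set, drops the trailing-slash flag
// with AND NOT (&^), and adds the fragment flag with OR (|).
func customFlags() normalizationFlags {
	return (flagsBase &^ flagRemoveTrailingSlash) | flagRemoveFragment
}

func main() {
	custom := customFlags()
	fmt.Println(has(custom, flagLowercaseHost))       // true
	fmt.Println(has(custom, flagRemoveTrailingSlash)) // false
	fmt.Println(has(custom, flagRemoveFragment))      // true
}
```

With the library itself, the same shape would read purell.FlagsUsuallySafeGreedy&^purell.FlagRemoveTrailingSlash | purell.FlagRemoveFragment.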

The full godoc reference is available on gopkgdoc.

Some things to note:

  • FlagDecodeUnnecessaryEscapes, FlagEncodeNecessaryEscapes, FlagUppercaseEscapes and FlagRemoveEmptyQuerySeparator are always implicitly set, because internally the URL string is parsed into a URL object, which automatically decodes unnecessary escapes, uppercases and encodes necessary ones, and removes empty query separators (an unnecessary ? at the end of the URL). These normalizations therefore cannot be turned off. For this reason, FlagRemoveEmptyQuerySeparator (as well as the other three) has been included in the FlagsSafe convenience set, instead of FlagsUnsafe, where Wikipedia puts it.

  • The FlagDecodeUnnecessaryEscapes decodes the following escapes (from -> to):
      - %24 -> $
      - %26 -> &
      - %2B-%3B -> +,-./0123456789:;
      - %3D -> =
      - %40-%5A -> @ABCDEFGHIJKLMNOPQRSTUVWXYZ
      - %5F -> _
      - %61-%7A -> abcdefghijklmnopqrstuvwxyz
      - %7E -> ~

  • When the NormalizeURL function is used (passing a URL object), the source URL object is modified in place: after the call, it reflects the normalization.

  • The replace IP with domain name normalization (http://208.77.188.166/ → http://www.example.com/) is obviously not possible for a library without making some network requests. This is not implemented in purell.

  • The "remove unused query string parameters" and "remove default query parameters" normalizations are also not implemented, since they are very case-specific and quite trivial to do with a URL object.

Safe vs Usually Safe vs Unsafe

Purell allows you to control the level of risk you take while normalizing a URL. You can aggressively normalize, play it totally safe, or anything in between.

Consider the following URL:

HTTPS://www.RooT.com/toto/t%45%1f///a/./b/../c/?z=3&w=2&a=4&w=1#invalid

Normalizing with the FlagsSafe gives:

https://www.root.com/toto/tE%1F///a/./b/../c/?z=3&w=2&a=4&w=1#invalid

With the FlagsUsuallySafeGreedy:

https://www.root.com/toto/tE%1F///a/c?z=3&w=2&a=4&w=1#invalid

And with FlagsUnsafeGreedy:

http://root.com/toto/tE%1F/a/c?a=4&w=1&w=2&z=3

TODOs

  • Add a class/default instance to allow specifying custom directory index names? At the moment, removing directory index removes (^|/)((?:default|index)\.\w{1,4})$.
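
To show what the pattern quoted above does and does not match, here is a small sketch that applies it to a few paths with the standard regexp package. The function name stripDirectoryIndex is hypothetical, and the sketch only demonstrates the regular expression; it is not a copy of the library's code.

```go
package main

import (
	"fmt"
	"regexp"
)

// dirIndex is the pattern quoted in the TODO above.
var dirIndex = regexp.MustCompile(`(^|/)((?:default|index)\.\w{1,4})$`)

// stripDirectoryIndex removes a trailing default.* or index.* file name,
// keeping the slash (or start of string) that precedes it.
// Hypothetical helper, not part of purell.
func stripDirectoryIndex(path string) string {
	return dirIndex.ReplaceAllString(path, "$1")
}

func main() {
	for _, p := range []string{
		"/path/index.html",   // matches: becomes /path/
		"/path/default.aspx", // matches: becomes /path/
		"/path/home.html",    // no match: not default.* or index.*
		"/path/myindex.html", // no match: "index" does not start the segment
	} {
		fmt.Printf("%-20s -> %s\n", p, stripDirectoryIndex(p))
	}
}
```

A custom directory index name, as the TODO suggests, would amount to building this pattern from a caller-supplied list of names instead of the fixed default|index alternation.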

Thanks / Contributions

@rogpeppe @jehiah @opennota @pchristopher1275 @zenovich @beeker1121

License

The BSD 3-Clause license.

purell's Issues

Allow definition of custom Flags and associated normalization

Provide an extensibility framework, for example if I want to add a custom flag RemoveSessionID that removes a session ID query string parameter. This is the type of normalization that requires knowledge of the specific URL, so it can't be generic (although a generic RemoveQueryString() helper could be provided), but can be useful.

Looking for a new maintainer

Hello,

After more than 8 years (according to the git logs), I'd like to hand over maintenance of this library. I don't use it personally, and haven't dedicated much care and attention to it in a long time. It would be best for someone with an interest in it (as in, someone that relies on this library as part of their project(s)) to take over.

I believe the Hugo project uses this? If so, that could be a good fit, but anyone interested, please reach out!

Thanks,
Martin

NormalizeURL does not perform IDNA normalization

IDNA normalization is only performed in NormalizeURLString, but that function returns a string. If you need a url.URL, you must then parse the result of NormalizeURLString, which means you are parsing the URL yet again, which is wasteful.

As NormalizeURLString calls NormalizeURL, the IDNA normalization in the former should be moved to the latter, resulting in the URL passed to NormalizeURL having its host field IDNA normalized.

Add flag to force the default scheme

Hi,
I'd like a flag to force the default scheme (e.g. http) when no scheme was given.

str, _ := purell.NormalizeURLString("example.com/foo.html", purell.FlagsForceDefaultHttpScheme)

if str == "http://example.com/foo.html" {
  // cool, I have "http" default scheme set now
}

Would it make sense? If yes, I'd be glad to submit a patch.
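
Until such a flag exists, one possible workaround is to add the scheme before normalizing. The sketch below uses only the standard library; withDefaultScheme is a hypothetical helper, not purell API, and it assumes hierarchical URLs (it would wrongly prefix something like a mailto: address).

```go
package main

import (
	"fmt"
	"strings"
)

// withDefaultScheme prepends scheme to raw when raw has no scheme of its
// own. Hypothetical helper, not part of purell.
func withDefaultScheme(raw, scheme string) string {
	// A "://" that appears before any '/', '?' or '#' marks an existing scheme.
	if i := strings.Index(raw, "://"); i > 0 && !strings.ContainsAny(raw[:i], "/?#") {
		return raw
	}
	// Scheme-relative input such as "//example.com/foo" only needs "scheme:".
	if strings.HasPrefix(raw, "//") {
		return scheme + ":" + raw
	}
	return scheme + "://" + raw
}

func main() {
	fmt.Println(withDefaultScheme("example.com/foo.html", "http")) // http://example.com/foo.html
	fmt.Println(withDefaultScheme("https://example.com/", "http")) // https://example.com/
	fmt.Println(withDefaultScheme("//example.com/foo", "http"))    // http://example.com/foo
}
```

The returned string can then be passed to purell.NormalizeURLString with whatever flags are needed.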

Reserved characters should not be percent-encoded

purell should not normalize reserved characters, as per RFC3986:

  reserved    = gen-delims / sub-delims

  gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"

  sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
              / "*" / "+" / "," / ";" / "="

package main

import (
    "fmt"

    "github.com/PuerkitoBio/purell"
)

func main() {
    fmt.Println(purell.MustNormalizeURLString("my_(url)", purell.FlagsSafe))
}

The above code outputs my_%28url%29, whereas it should be my_(url). This is due to a bug in Go stdlib (issue 5684).

A problem with the "golang.org/x" packages in purell.go

In file 'purell.go', some packages are imported as:
"golang.org/x/net/idna"
"golang.org/x/text/unicode/norm"
"golang.org/x/text/width"

However, I find their actual download URLs are:
https://github.com/golang/net
https://github.com/golang/text

So after I download them using the 'go get' command, the paths '/github.com/golang/net' and '/github.com/golang/text' are generated by default in my $GOPATH directory. I have to copy them into the path 'golang.org/x', which is inconvenient, I think.

Why not import them directly as:
"github.com/golang/net/idna"
"github.com/golang/text/unicode/norm"
"github.com/golang/text/width"?

Opaque URLs should be normalized too

purell doesn't normalize "opaque" URLs (see documentation for url.URL).

package main

import (
    "fmt"
    "net/url"

    "github.com/PuerkitoBio/purell"
)

func main() {
    u := &url.URL{Scheme: "http", Opaque: "//eXAMPLe.com/%3f"}
    fmt.Println(purell.NormalizeURL(u, purell.FlagLowercaseHost|purell.FlagUppercaseEscapes))
}

Output: http://eXAMPLe.com/%3f

go1.1 test error

purell_test.go:680: running LowerScheme...
purell_test.go:680: running LowerScheme2...
purell_test.go:680: running LowerHost...
purell_test.go:696: LowerHost - FAIL expected 'HTTP://www.src.ca/', got 'http://www.src.ca/'
purell_test.go:680: running UpperEscapes...
purell_test.go:680: running UnnecessaryEscapes...
purell_test.go:680: running RemoveDefaultPort...
purell_test.go:696: RemoveDefaultPort - FAIL expected 'HTTP://www.SRC.ca/', got 'http://www.SRC.ca/'
purell_test.go:680: running RemoveDefaultPort2...
purell_test.go:696: RemoveDefaultPort2 - FAIL expected 'HTTP://www.SRC.ca', got 'http://www.SRC.ca'
purell_test.go:680: running RemoveDefaultPort3...
purell_test.go:696: RemoveDefaultPort3 - FAIL expected 'HTTP://www.SRC.ca:8080', got 'http://www.SRC.ca:8080'
.............
.............

Remove Multiple Leading Slashes

using v1.1.0

result, _ := purell.NormalizeURLString("///foo///bar///", purell.FlagRemoveDuplicateSlashes | purell.FlagRemoveTrailingSlash)
// result is //foo/bar/

To get around this, I just use the following function:

// removeDupLeadingSlashes collapses any run of leading slashes into a single
// one; a path with no leading slash (or an empty path) gets one prepended.
func removeDupLeadingSlashes(path string) string {
  return "/" + strings.TrimLeft(path, "/")
}

Tests fail with Go1.12rc1

When I run go get github.com/PuerkitoBio/purell with Go1.12rc1, I get the following. I think this is due to the fix for https://golang.org/issue/22907.

--- FAIL: TestEncodeNecessaryEscapesAll (0.00s)
    purell_test.go:768: Got error parse http://host/[raw control characters] !": net/url: invalid control character in URL
FAIL
FAIL	github.com/PuerkitoBio/purell	0.088s

FlagForceHTTP -> to also force adding http if schema not present?

Hey guys,

Still new with github, so pardon my ignorance.

I have a use case wherein the input URL doesn't contain a scheme, and was wondering if 'FlagForceHTTP' should also add a default 'http' scheme when none exists, and not just replace 'https'? e.g.:

    &testCase{
        "ForceHTTP",
        "whereareyou.com",
        FlagForceHTTP,
        "http://whereareyou.com",
        false,
    },

I already made the change locally; if this is a valid issue, it should be a simple fix, and I will look up how to commit the change.

func forceHTTP(u *url.URL) {
    if strings.ToLower(u.Scheme) == "https" {
        u.Scheme = "http"
    }

    // +den
    // For URLs that have no scheme, add http.
    if u.Scheme == "" {
        u.Scheme = "http"
    }
    // -den
}
