ethauvin / urlencoder

A simple defensive library to encode/decode URL components.

License: Apache License 2.0

urlencoder's Introduction

URL Encoder for Kotlin Multiplatform

UrlEncoder is a simple defensive library to encode/decode URL components.

This library was adapted from the RIFE2 Web Application Framework.
A pure Java version can also be found at https://github.com/gbevin/urlencoder.

The rules are determined by combining the unreserved character set from RFC 3986 with the percent-encode set from application/x-www-form-urlencoded.

Both specs above support percent decoding of two hexadecimal digits to a binary octet; however, their unreserved character sets differ, and application/x-www-form-urlencoded adds conversion of space to +, which has the potential to be misunderstood.

This class encodes with rules that will be decoded correctly in either case.

Additionally, this library allocates no memory when encoding isn't needed and does the work in a single pass without multiple loops. Both optimizations significantly improve encoding performance compared to other solutions such as the standard URLEncoder in the JDK or UriUtils in Spring.
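The single-pass, allocate-only-when-needed idea can be sketched as follows. This is an illustrative sketch only, not the library's actual source: sketchEncode and isAlwaysSafe are hypothetical names, and the safe set is assumed to be ASCII letters, digits, and the characters -, . and _.

```kotlin
// Illustrative sketch only; `sketchEncode` and `isAlwaysSafe` are hypothetical names.
private const val HEX = "0123456789ABCDEF"

// Assumed safe set: ASCII alphanumerics plus '-', '.' and '_'.
fun isAlwaysSafe(ch: Char): Boolean =
    ch in 'a'..'z' || ch in 'A'..'Z' || ch in '0'..'9' || ch == '-' || ch == '.' || ch == '_'

fun sketchEncode(source: String): String {
    var out: StringBuilder? = null // allocated only once the first unsafe character is seen
    var i = 0
    while (i < source.length) {
        val ch = source[i]
        if (isAlwaysSafe(ch)) {
            out?.append(ch) // no-op while nothing has needed encoding yet
            i++
        } else {
            if (out == null) {
                out = StringBuilder(source.length + 16)
                out.append(source, 0, i) // copy the safe prefix seen so far
            }
            // Take a full surrogate pair when present so it becomes one UTF-8 sequence.
            val width =
                if (ch.isHighSurrogate() && i + 1 < source.length && source[i + 1].isLowSurrogate()) 2 else 1
            for (b in source.substring(i, i + width).encodeToByteArray()) {
                val v = b.toInt() and 0xFF
                out.append('%').append(HEX[v shr 4]).append(HEX[v and 0x0F])
            }
            i += width
        }
    }
    // When nothing needed encoding, the original instance is returned unchanged.
    return out?.toString() ?: source
}

fun main() {
    println(sketchEncode("a test &")) // a%20test%20%26
    val untouched = "already-safe_1.0"
    println(sketchEncode(untouched) === untouched) // true
}
```

Because the builder is created lazily, an input that is already safe comes back as the same String instance, which the identity check in main demonstrates.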

Examples (TL;DR)

UrlEncoderUtil.encode("a test &") // -> a%20test%20%26
UrlEncoderUtil.encode("%#okékÉȢ smile!😁") // -> %25%23ok%C3%A9k%C3%89%C8%A2%20smile%21%F0%9F%98%81
UrlEncoderUtil.encode("?test=a test", allow = "?=") // -> ?test=a%20test
UrlEncoderUtil.encode("foo bar", spaceToPlus = true) // -> foo+bar

UrlEncoderUtil.decode("a%20test%20%26") // -> a test &
UrlEncoderUtil.decode("%25%23ok%C3%A9k%C3%89%C8%A2%20smile%21%F0%9F%98%81") // -> %#okékÉȢ smile!😁
UrlEncoderUtil.decode("foo+bar", plusToSpace = true) // -> foo bar

Gradle, Maven, etc.

To use with Gradle, include the following dependency in your build file:

repositories {
    mavenCentral()
    // only needed for SNAPSHOT
    maven("https://oss.sonatype.org/content/repositories/snapshots") { 
      name = "SonatypeSnapshots"
      mavenContent { snapshotsOnly() }
    }
}

dependencies {
    implementation("net.thauvin.erik.urlencoder:urlencoder-lib:1.5.0")
}
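For a Kotlin Multiplatform build, the same coordinates would typically go into the common source set. The fragment below is a sketch under assumptions: it presumes the Kotlin Multiplatform Gradle plugin is already applied and that the published module resolves for the targets you declare, which this README does not spell out.

```kotlin
// build.gradle.kts (sketch): assumes the Kotlin Multiplatform plugin is already applied
kotlin {
    jvm() // declare whichever targets your project needs

    sourceSets {
        val commonMain by getting {
            dependencies {
                // same coordinates as the plain Gradle example above
                implementation("net.thauvin.erik.urlencoder:urlencoder-lib:1.5.0")
            }
        }
    }
}
```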

Adding a dependency in Maven requires specifying the JVM variant by adding a -jvm suffix to the artifact ID:

<dependency>
    <groupId>net.thauvin.erik.urlencoder</groupId>
    <artifactId>urlencoder-lib-jvm</artifactId>
    <version>1.5.0</version>
</dependency>

Instructions for using with Ivy, etc. can be found on Maven Central.

Standalone usage

UrlEncoder can also be used on the command line, both for encoding and decoding.

You have two options:

  • run it with Gradle
  • build the jar and launch it with Java

The usage is as follows:

Encode and decode URL components defensively.
  -e  encode (default)
  -d  decode

Running with Gradle

./gradlew run --quiet --args="-e 'a test &'"        # -> a%20test%20%26
./gradlew run --quiet --args="%#okékÉȢ"             # -> %25%23ok%C3%A9k%C3%89%C8%A2

./gradlew run --quiet --args="-d 'a%20test%20%26'"  # -> a test &

Running with Java

First build the jar file:

./gradlew fatJar

Then run it:

java -jar urlencoder-app/build/libs/urlencoder-*all.jar -e "a test &"       # -> a%20test%20%26
java -jar urlencoder-app/build/libs/urlencoder-*all.jar "%#okékÉȢ"          # -> %25%23ok%C3%A9k%C3%89%C8%A2

java -jar urlencoder-app/build/libs/urlencoder-*all.jar -d "a%20test%20%26" # -> a test &

Why not simply use java.net.URLEncoder?

Apart from being quite inefficient, URLEncoder.encode can produce URL components that other decoders might not decode properly.

For example, a simple search query such as:

val u = URLEncoder.encode("foo +bar", StandardCharsets.UTF_8)

would be encoded as:

foo+%2Bbar

Trying to decode it with Spring, for example:

UriUtils.decode(u, StandardCharsets.UTF_8)

would return:

foo++bar

Unfortunately, decoding with Uri.decode on Android, decodeURI in JavaScript, etc. would yield the exact same result.
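The mismatch can be reproduced with the JDK alone. In the sketch below, percentDecodeOnly is a hypothetical stand-in for any decoder that follows RFC 3986 and therefore leaves + untouched; it is not Spring's or Android's implementation and assumes well-formed, ASCII-only input.

```kotlin
import java.io.ByteArrayOutputStream
import java.net.URLEncoder
import java.nio.charset.StandardCharsets

// Hypothetical helper: decodes %XX escapes only and never turns '+' into a space.
// Assumes well-formed, ASCII-only input.
fun percentDecodeOnly(s: String): String {
    val bytes = ByteArrayOutputStream()
    var i = 0
    while (i < s.length) {
        val ch = s[i]
        if (ch == '%' && i + 2 < s.length) {
            bytes.write(s.substring(i + 1, i + 3).toInt(16))
            i += 3
        } else {
            bytes.write(ch.code)
            i++
        }
    }
    return bytes.toString(StandardCharsets.UTF_8.name())
}

fun main() {
    val encoded = URLEncoder.encode("foo +bar", StandardCharsets.UTF_8.name())
    println(encoded)                            // foo+%2Bbar
    println(percentDecodeOnly(encoded))         // foo++bar (the space is lost)
    println(percentDecodeOnly("foo%20%2Bbar"))  // foo +bar (a %20 space survives)
}
```

Encoding the space as %20 rather than + is what lets both kinds of decoder recover the original text.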

urlencoder's Issues

Support for Kotlin Multiplatform targets

Hi 👋

I'd like to be able to use this library in a Kotlin Multiplatform project. Would you consider updating the library to support JVM, JS, and Native Kotlin compilation targets?

For now I've migrated it manually. I'm including the code so you are welcome to use it.

If you would like some help with the Gradle config, I can provide a PR.

Summary of changes:

  • converted to a class
  • converted safeChars to class constructor parameter
  • replaced BitSet with BooleanArray
  • updated ByteArray operations to use Kotlin Multiplatform equivalents
  • added some assertions (based on Google PercentEscaper)

/*
 * Copyright 2001-2023 Geert Bevin (gbevin[remove] at uwyn dot com)
 * Copyright 2022-2023 Erik C. Thauvin ([email protected])
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

// Adapted version of UrlEncoder.kt
// - converted to a class
// - converted safeChars to class constructor parameter
// - replaced BitSet with BooleanArray
// - updated ByteArray operations to use Kotlin Multiplatform equivalents
// - added some assertions (based on Google PercentEscaper)
// https://github.com/ethauvin/urlencoder/blob/34b69a7d1f3570aa056285253376ed7a7bde03d8/lib/src/main/kotlin/net/thauvin/erik/urlencoder/UrlEncoder.kt

package net.thauvin.erik.urlencoder

/**
 * Most defensive approach to URL encoding and decoding.
 *
 * - Rules determined by combining the unreserved character set from
 * [RFC 3986](https://www.rfc-editor.org/rfc/rfc3986#page-13) with the percent-encode set from
 * [application/x-www-form-urlencoded](https://url.spec.whatwg.org/#application-x-www-form-urlencoded-percent-encode-set).
 *
 * - Both specs above support percent decoding of two hexadecimal digits to a binary octet, however their unreserved
 * set of characters differs and `application/x-www-form-urlencoded` adds conversion of space to `+`, which has the
 * potential to be misunderstood.
 *
 * - This library encodes with rules that will be decoded correctly in either case.
 *
 * @param safeChars a non-null string specifying additional safe characters for this escaper (the
 * ranges `0..9`, `a..z` and `A..Z` are always safe and should not be specified here)
 * @param plusForSpace `true` if ASCII space should be escaped to `+` rather than `%20`
 *
 * @author Geert Bevin (gbevin(remove) at uwyn dot com)
 * @author Erik C. Thauvin ([email protected])
 **/
internal class UrlEncoder(
    safeChars: String,
    private val plusForSpace: Boolean,
) {

    // see https://www.rfc-editor.org/rfc/rfc3986#page-13
    // and https://url.spec.whatwg.org/#application-x-www-form-urlencoded-percent-encode-set
    private val unreservedChars = createUnreservedChars(safeChars)

    init {
        // Avoid any misunderstandings about the behavior of this escaper
        require(!safeChars.matches(".*[0-9A-Za-z].*".toRegex())) {
            "Alphanumeric characters are always 'safe' and should not be explicitly specified"
        }
        // Avoid ambiguous parameters. Safe characters are never modified so if
        // space is a safe character then setting plusForSpace is meaningless.
        require(!(plusForSpace && ' ' in safeChars)) {
            "plusForSpace cannot be specified when space is a 'safe' character"
        }
        require('%' !in safeChars) {
            "The '%' character cannot be specified as 'safe'"
        }
    }

    /**
     * Transforms a provided [String] into a new string, containing decoded URL characters in the UTF-8
     * encoding.
     */
    fun decode(
        source: String,
        plusToSpace: Boolean = plusForSpace,
    ): String {
        if (source.isEmpty()) return source

        val length = source.length
        val out = StringBuilder(length)
        var ch: Char
        var bytesBuffer: ByteArray? = null
        var bytesPos = 0
        var i = 0
        var started = false
        while (i < length) {
            ch = source[i]
            if (ch == '%') {
                if (!started) {
                    out.append(source, 0, i)
                    started = true
                }
                if (bytesBuffer == null) {
                    // the remaining characters divided by the length of the encoding format %xx, is the maximum number
                    // of bytes that can be extracted
                    bytesBuffer = ByteArray((length - i) / 3)
                }
                i++
                require(length >= i + 2) { "Incomplete trailing escape ($ch) pattern" }
                try {
                    val v = source.substring(i, i + 2).toInt(16)
                    require(v in 0..0xFF) { "Illegal escape value" }
                    bytesBuffer[bytesPos++] = v.toByte()
                    i += 2
                } catch (e: NumberFormatException) {
                    throw IllegalArgumentException("Illegal characters in escape sequence: ${e.message}", e)
                }
            } else {
                if (bytesBuffer != null) {
                    out.append(bytesBuffer.decodeToString(0, bytesPos))
                    started = true
                    bytesBuffer = null
                    bytesPos = 0
                }
                if (plusToSpace && ch == '+') {
                    if (!started) {
                        out.append(source, 0, i)
                        started = true
                    }
                    out.append(" ")
                } else if (started) {
                    out.append(ch)
                }
                i++
            }
        }

        if (bytesBuffer != null) {
            out.append(bytesBuffer.decodeToString(0, bytesPos))
        }

        return if (!started) source else out.toString()
    }

    /**
     * Transforms a provided [String] object into a new string, containing only valid URL
     * characters in the UTF-8 encoding.
     *
     * - Letters, numbers, unreserved (`-._`) and allowed characters are left intact.
     */
    fun encode(
        source: String,
        spaceToPlus: Boolean = plusForSpace,
    ): String {
        if (source.isEmpty()) {
            return source
        }
        var out: StringBuilder? = null
        var ch: Char
        var i = 0
        while (i < source.length) {
            ch = source[i]
            if (ch.isUnreserved()) {
                out?.append(ch)
                i++
            } else {
                if (out == null) {
                    out = StringBuilder(source.length)
                    out.append(source, 0, i)
                }
                val cp = source.codePointAt(i)
                if (cp < 0x80) {
                    if (spaceToPlus && ch == ' ') {
                        out.append('+')
                    } else {
                        out.appendEncodedByte(cp)
                    }
                    i++
                } else if (Character.isBmpCodePoint(cp)) {
                    for (b in ch.toString().encodeToByteArray()) {
                        out.appendEncodedByte(b.toInt())
                    }
                    i++
                } else if (Character.isSupplementaryCodePoint(cp)) {
                    val high = Character.highSurrogateOf(cp)
                    val low = Character.lowSurrogateOf(cp)
                    for (b in charArrayOf(high, low).concatToString().encodeToByteArray()) {
                        out.appendEncodedByte(b.toInt())
                    }
                    i += 2
                }
            }
        }

        return out?.toString() ?: source
    }

    /**
     * see https://www.rfc-editor.org/rfc/rfc3986#page-13
     * and https://url.spec.whatwg.org/#application-x-www-form-urlencoded-percent-encode-set
     */
    private fun Char.isUnreserved(): Boolean = code < unreservedChars.size && unreservedChars[code]

    companion object {

        private val hexDigits: CharArray = "0123456789ABCDEF".toCharArray()

        private fun StringBuilder.appendEncodedDigit(digit: Int) {
            append(hexDigits[digit and 0x0F])
        }

        private fun StringBuilder.appendEncodedByte(ch: Int) {
            append("%")
            appendEncodedDigit(ch shr 4)
            appendEncodedDigit(ch)
        }

        /**
         * Creates a [BooleanArray] with entries corresponding to the character values for
         * `0-9`, `A-Z`, `a-z` and those specified in [safeChars] set to `true`.
         *
         * The array is as small as is required to hold the given character information.
         */
        private fun createUnreservedChars(safeChars: String): BooleanArray {
            val safeCharArray = safeChars.toCharArray()
            // maxOfOrNull avoids a NoSuchElementException when safeChars is empty
            val maxChar = (safeCharArray.maxOfOrNull { it.code } ?: 0).coerceAtLeast('z'.code)

            val unreservedChars = BooleanArray(maxChar + 1)

            unreservedChars['-'.code] = true
            unreservedChars['.'.code] = true
            unreservedChars['_'.code] = true
            for (c in '0'..'9') unreservedChars[c.code] = true
            for (c in 'A'..'Z') unreservedChars[c.code] = true
            for (c in 'a'..'z') unreservedChars[c.code] = true

            for (c in safeCharArray) unreservedChars[c.code] = true

            return unreservedChars
        }
    }
}

Character utils

// Based on https://github.com/cketti/kotlin-codepoints

/*
 * MIT License
 *
 * Copyright (c) 2023 cketti
 *
 * Permission is hereby granted, free of charge, to any person obtaining a copy
 * of this software and associated documentation files (the "Software"), to deal
 * in the Software without restriction, including without limitation the rights
 * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 * copies of the Software, and to permit persons to whom the Software is
 * furnished to do so, subject to the following conditions:
 *
 * The above copyright notice and this permission notice shall be included in all
 * copies or substantial portions of the Software.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
 * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
 * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
 * SOFTWARE.
 */

/**
 * Kotlin Multiplatform equivalent for `java.lang.Character`
 */
internal object Character {

    /**
     * See https://www.tutorialspoint.com/java/lang/character_issupplementarycodepoint.htm
     *
     * Determines whether the specified character (Unicode code point) is in the supplementary character range.
     * The supplementary character range in the Unicode system falls in `U+10000` to `U+10FFFF`.
     *
     * The Unicode code points are divided into two categories:
     * Basic Multilingual Plane (BMP) code points and Supplementary code points.
     * BMP code points are present in the range U+0000 to U+FFFF.
     *
     * Supplementary characters, by contrast, are rare characters that are not represented using the original 16-bit
     * Unicode. For example, these types of characters are used in Chinese or Japanese scripts and are therefore
     * required by applications used in those countries.
     *
     * @return `true` if the specified code point falls in the range of supplementary code points
     * ([MIN_SUPPLEMENTARY_CODE_POINT] to [MAX_CODE_POINT], inclusive), `false` otherwise.
     */
    internal fun isSupplementaryCodePoint(codePoint: Int): Boolean =
        codePoint in MIN_SUPPLEMENTARY_CODE_POINT..MAX_CODE_POINT

    internal fun charCount(codePoint: Int): Int = if (codePoint < MIN_SUPPLEMENTARY_CODE_POINT) 1 else 2

    internal fun toCodePoint(highSurrogate: Char, lowSurrogate: Char): Int =
        (highSurrogate.code shl 10) + lowSurrogate.code + SURROGATE_DECODE_OFFSET

    internal fun toChars(codePoint: Int): CharArray = when {
        isBmpCodePoint(codePoint) -> charArrayOf(codePoint.toChar())
        else                      -> charArrayOf(highSurrogateOf(codePoint), lowSurrogateOf(codePoint))
    }

    /** Basic Multilingual Plane (BMP) */
    internal fun isBmpCodePoint(codePoint: Int): Boolean = codePoint ushr 16 == 0

    internal fun highSurrogateOf(codePoint: Int): Char =
        ((codePoint ushr 10) + HIGH_SURROGATE_ENCODE_OFFSET.code).toChar()

    internal fun lowSurrogateOf(codePoint: Int): Char =
        ((codePoint and 0x3FF) + MIN_LOW_SURROGATE.code).toChar()

    private const val MIN_CODE_POINT: Int = 0x000000
    private const val MAX_CODE_POINT: Int = 0x10FFFF

    private const val MIN_SUPPLEMENTARY_CODE_POINT: Int = 0x10000

    private const val SURROGATE_DECODE_OFFSET: Int =
        MIN_SUPPLEMENTARY_CODE_POINT -
            (MIN_HIGH_SURROGATE.code shl 10) -
            MIN_LOW_SURROGATE.code

    private const val HIGH_SURROGATE_ENCODE_OFFSET: Char = MIN_HIGH_SURROGATE - (MIN_SUPPLEMENTARY_CODE_POINT ushr 10)

}



/**
 * Returns the Unicode code point at the specified index.
 *
 * The `index` parameter is the regular `CharSequence` index, i.e. the number of `Char`s from the start of the character
 * sequence.
 *
 * If the code point at the specified index is part of the Basic Multilingual Plane (BMP), its value can be represented
 * using a single `Char` and this method will behave exactly like [CharSequence.get].
 * Code points outside the BMP are encoded using a surrogate pair – a `Char` containing a value in the high surrogate
 * range followed by a `Char` containing a value in the low surrogate range. Together these two `Char`s encode a single
 * code point in one of the supplementary planes. This method will do the necessary decoding and return the value of
 * that single code point.
 *
 * In situations where surrogate characters are encountered that don't form a valid surrogate pair starting at `index`,
 * this method will return the surrogate code point itself, behaving like [CharSequence.get].
 *
 * If the `index` is out of bounds of this character sequence, this method throws an [IndexOutOfBoundsException].
 *
 * To iterate over all code points in a character sequence the index has to be adjusted depending on the value of the
 * returned code point. Use [Character.charCount] for this.
 *
 * ```kotlin
 * // Text containing code points outside the BMP (encoded as surrogate pairs)
 * val text = "\uD83E\uDD95\uD83E\uDD96"
 *
 * var index = 0
 * while (index < text.length) {
 *     val codePoint = text.codePointAt(index)
 *     // Do something with codePoint
 *
 *     index += Character.charCount(codePoint)
 * }
 * }
 * ```
 */
internal fun CharSequence.codePointAt(index: Int): Int {
    if (index !in indices) throw IndexOutOfBoundsException("index $index was not in range $indices")

    val firstChar = this[index]
    if (firstChar.isHighSurrogate()) {
        val nextChar = getOrNull(index + 1)
        if (nextChar?.isLowSurrogate() == true) {
            return Character.toCodePoint(firstChar, nextChar)
        }
    }

    return firstChar.code
}
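The surrogate arithmetic used by the helpers above can be checked with a small standalone example. This is a sketch only: the constants and the function names splitToSurrogates and joinSurrogates are local to the example and are not part of the posted code.

```kotlin
// Standalone sketch; constants and function names are local to this example.
const val MIN_HIGH = 0xD800
const val MIN_LOW = 0xDC00
const val MIN_SUPPLEMENTARY = 0x10000

// Splits a supplementary code point into its UTF-16 high and low surrogates.
fun splitToSurrogates(codePoint: Int): Pair<Char, Char> {
    require(codePoint in MIN_SUPPLEMENTARY..0x10FFFF) { "not a supplementary code point" }
    val offset = codePoint - MIN_SUPPLEMENTARY
    val high = (MIN_HIGH + (offset ushr 10)).toChar()  // top 10 bits of the offset
    val low = (MIN_LOW + (offset and 0x3FF)).toChar()  // bottom 10 bits of the offset
    return high to low
}

// Recombines a surrogate pair into the code point it encodes.
fun joinSurrogates(high: Char, low: Char): Int =
    ((high.code - MIN_HIGH) shl 10) + (low.code - MIN_LOW) + MIN_SUPPLEMENTARY

fun main() {
    val (high, low) = splitToSurrogates(0x1F601) // U+1F601 is 😁
    println("%04X %04X".format(high.code, low.code)) // D83D DE01
    println(joinSurrogates(high, low).toString(16))  // 1f601
    println("😁" == "$high$low")                     // true
}
```

Note that U+10000, the first supplementary code point, already needs two Chars, so a charCount helper has to return 2 for it.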

Function 'encode' can not be called with Kotlin 2.0.0-Beta

When I try to call UrlEncoderUtil.encode from a macOsArm64 test I get an error.

Kotlin version 2.0.0-Beta4
Library version 1.4.0

Stack Trace

Function 'encode' can not be called: No function found for symbol 'net.thauvin.erik.urlencoder/UrlEncoderUtil.encode|encode#static(kotlin.String;kotlin.String;kotlin.Boolean){}[0]'
i: <missing declarations>: No function found for symbol 'net.thauvin.erik.urlencoder/UrlEncoderUtil.encode|encode#static(kotlin.String;kotlin.String;kotlin.Boolean){}[0]'


kotlin.native.internal.IrLinkageError: Function 'encode' can not be called: No function found for symbol 'net.thauvin.erik.urlencoder/UrlEncoderUtil.encode|encode#static(kotlin.String;kotlin.String;kotlin.Boolean){}[0]'
        at kotlin.Throwable#<init>(/opt/buildAgent/work/4b543f97c4b0507f/kotlin/kotlin-native/runtime/src/main/kotlin/kotlin/Throwable.kt:28)
        at kotlin.Error#<init>(/opt/buildAgent/work/4b543f97c4b0507f/kotlin/kotlin-native/runtime/src/main/kotlin/kotlin/Exceptions.kt:12)
        at kotlin.native.internal.IrLinkageError#<init>(/opt/buildAgent/work/4b543f97c4b0507f/kotlin/kotlin-native/runtime/src/main/kotlin/kotlin/native/internal/RuntimeUtils.kt:137)
        at kotlin.native.internal#ThrowIrLinkageError(/opt/buildAgent/work/4b543f97c4b0507f/kotlin/kotlin-native/runtime/src/main/kotlin/kotlin/native/internal/RuntimeUtils.kt:141)
