zenz34 / convert2utf

This project forked from x1angli/cvt2utf

This lightweight tool converts non-UTF-encoded (such as GB2312, GBK, BIG5 encoded) files to UTF-8 encoding. It batch-converts all GB, GBK, and other encoded text files and source code files under a directory to UTF-8, and removes the Byte Order Mark (BOM).

License: MIT License

Converts text files or source code files into UTF-8 encoding

This lightweight tool converts files in non-UTF encodings (such as GB2312, GBK, or BIG5) to UTF-8. It can either be executed from the command line (CLI) or imported into other Python code.

Installation

Automatic Installation (recommended)

  1. Make sure Python 3, along with pip, is properly installed.
  2. In your CLI, execute pip install convert2utf

Manual Installation (for developers only)

  1. Make sure Python 3 is properly installed.
  2. Clone this project, or download the .zip file from github.com and extract it.
  3. Open a CLI (command line interface) and enter the local project folder.
  4. Set up a Python virtual environment with virtualenv ... or python -m venv ...
  5. Run: pip install -r requirements.txt

Usage

There is only one mandatory argument: filename, where you can specify the directory or file name.

  • Batch mode: Pass in a directory as the input, and all text files underneath it that meet the criteria will be converted to UTF-8.
  • Single file mode: If the input argument is an individual file, that file is converted to UTF-8 directly.
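The two modes can be pictured with a minimal sketch. This is not the tool's actual code: the function name collect_targets, the default extension list of txt, and the choice to accept a single file regardless of its extension are all assumptions made for illustration.

```python
from pathlib import Path


def collect_targets(path, exts=("txt",)):
    """Return the files one run would touch (hypothetical helper, not part of cvt2utf)."""
    root = Path(path)
    if root.is_file():
        # Single file mode: assume the file is taken as-is.
        return [root]
    # Normalize "csv" and ".csv" to the same lowercase suffix.
    wanted = {"." + e.lower().lstrip(".") for e in exts}
    # Batch mode: walk the whole tree and keep files with a matching extension.
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in wanted
    )
```

For example, collect_targets("D:\\mynotebook", ["csv"]) would list every .csv file under that folder, while passing the path of one file returns just that file.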

Examples:

  • Change all .txt files to UTF-8 encoding.

    Byte order marks (BOMs, also called "signatures") in existing UTF-8 files will be removed.

    python cvt2utf.py "D:\mynotebook"

    Afterwards, you can use any text editor (e.g. Notepad++, https://notepad-plus-plus.org/) to verify that the text files underneath the specified folder have been converted to UTF-8.

  • Change all .csv files to UTF-8 encoding, keeping the BOM, since some applications (such as Microsoft Excel) rely on it.

    python cvt2utf.py "D:\mynotebook" --exts csv --keepbom

  • Convert all .php, .js, .java, .py files to UTF-8 encoding.

    This also removes all BOMs, which are a real nuisance in source code files.

    python cvt2utf.py "D:\workspace" --exts php js java py

  • After manually verifying that the new UTF-8 files are correct, you can remove all .bak files.

    python cvt2utf.py "D:\workspace" --cleanbak

  • Alternatively, if you are confident in Python's built-in encoding and decoding, you can convert files without creating backups.

    Do NOT call this, unless you know what you are doing.

    python cvt2utf.py "D:\workspace" --overwrite

  • Convert an individual file.

    python cvt2utf.py "D:\workspace\a.txt"

  • Show help information

    python cvt2utf.py -h

(Linux only) Directly run the program

Sometimes, you may want to run the program without specifying the Python interpreter, such as:

./cvt2utf.py ~/mynotebooks

(Note the leading python command is missing here)

To achieve this, you first need to grant execute permission on the Python file (skip this step if it already has the execute permission):

sudo chmod +x ./cvt2utf.py

Then activate the virtual environment:

. venv/bin/activate

If all dependencies are already installed in your default Python environment, or the virtual environment is already active, you can skip this step.

Then, make sure the dependencies are installed:

pip install -r requirements.txt

Finally, execute the file (you can add command arguments here):

./cvt2utf.py ~/the/base/dir

You might want to use an absolute path to the program if you are running it from an arbitrary working directory.

(For developers) Programmatically use this Python module

Python programmers who want to use this module from their own code can do so as follows:

>>> from cvt2utf import Convert2Utf8
>>> cvt2utf = Convert2Utf8(['php', 'css', 'htm', 'html', 'js'], False, False)
>>> cvt2utf.run('D:\\workspace')
>>> cvt2utf.run('D:\\another\\folder')

Note: the constructor Convert2Utf8() takes 3 arguments: the extension list, the switch to keep the BOM, and the direct-overwrite switch. These arguments behave the same as their command-line counterparts.

Miscellaneous

By default, the converted output text files will NOT contain a BOM (byte order mark). To learn what a BOM is and what its implications are, see: https://en.wikipedia.org/wiki/Byte_order_mark
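As a rough illustration of what "no BOM" means at the byte level, here is a minimal sketch that uses only Python's standard codecs module. The helper name strip_utf8_bom is hypothetical and is not taken from this project's code.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Drop a leading UTF-8 BOM (bytes EF BB BF) if present (hypothetical helper)."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


with_bom = codecs.BOM_UTF8 + "héllo".encode("utf-8")
print(strip_utf8_bom(with_bom) == "héllo".encode("utf-8"))  # True
# The standard "utf-8-sig" codec skips a BOM when decoding ...
print(with_bom.decode("utf-8-sig") == "héllo")  # True
# ... and writes one when encoding, which is one way to keep a BOM on purpose.
print("héllo".encode("utf-8-sig") == with_bom)  # True
```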

FAQ

Why do we choose UTF-8 among all charsets?

A: For i18n, UTF-8 is widespread. It is the de facto standard for non-English text.

Compared with UTF-16, UTF-8 is usually more compact and "with full fidelity". It also doesn't suffer from the endianness issue of UTF-16.
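A small sketch of that comparison, using only built-in str.encode; the sample string is arbitrary. Note that the size advantage depends on the text: ASCII characters take 1 byte in UTF-8 and 2 in UTF-16, while most CJK characters take 3 bytes in UTF-8 and 2 in UTF-16, hence "usually".

```python
text = "Hello, 世界"  # 7 ASCII characters plus 2 CJK characters

utf8 = text.encode("utf-8")
utf16_le = text.encode("utf-16-le")  # little-endian byte order
utf16_be = text.encode("utf-16-be")  # big-endian byte order

print(len(utf8), len(utf16_le))  # 13 18

# UTF-16 has two byte orders, so the same text yields different bytes.
print(utf16_le == utf16_be)  # False

# UTF-8 has a single byte order and round-trips the text exactly.
print(utf8.decode("utf-8") == text)  # True
```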

Why do we need this tool?

A: Indeed, there are many text editors (such as Notepad++) that handle various text encodings very well. For batch conversion, however, we need this Python script. The script is also written for educational purposes: developers can read it to get an idea of how to handle text encodings.

Why should we remove BOMs (byte order mark) rather than add them?

A: Most compilers and interpreters handle UTF-8 source code files very well, provided that those files are encoded without a BOM. Some compilers and interpreters might fail or give unexpected output when a BOM is present. For this reason, I strongly advise removing the BOM whenever UTF-8 encoding is used.

Side note: of course, there are certain situations where BOMs are preferred. (For example, Microsoft Excel cannot correctly parse CSV files with international characters when they are encoded as UTF-8 without a BOM.) Such situations are rare. Overall, the case for removing BOMs outweighs these concerns.

Is the current version reliable?

A: This code is still in its "beta" phase. We strive to deliver a highly reliable solution to our users. However, Python's built-in encoding/decoding combined with chardet's encoding detection may not always be reliable. For that reason, we suggest that users create backups, either by manually duplicating the file or directory, or automatically through this package (remember, the backup feature is bypassed by the --overwrite switch).
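The backup-then-convert idea can be sketched as follows. This is an illustration under stated assumptions, not the package's implementation: the function name convert_with_backup is hypothetical, the backup is assumed to be named by appending .bak to the file name, and the source encoding is passed in explicitly instead of being detected with chardet.

```python
import shutil
from pathlib import Path


def convert_with_backup(path, source_encoding):
    """Copy path to '<name>.bak', then rewrite path as UTF-8 without a BOM.

    Hypothetical helper for illustration; returns the backup path.
    """
    src = Path(path)
    backup = src.with_name(src.name + ".bak")  # assumed naming scheme
    shutil.copy2(src, backup)  # preserve the original bytes first
    # Strict decoding: a wrong source_encoding raises UnicodeDecodeError
    # before the original file is modified.
    text = src.read_bytes().decode(source_encoding)
    src.write_bytes(text.encode("utf-8"))
    return backup
```

If the decoded result looks wrong, the original bytes can be restored from the returned backup path.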

What kind of files should have BOM removed?

A: Here is a list
