pytorch / text Goto Github PK

View Code? Open in Web Editor NEW

3.4K 285.0 810.0 47.05 MB

Models, data loaders and abstractions for language processing, powered by PyTorch

Home Page: https://pytorch.org/text

License: BSD 3-Clause "New" or "Revised" License

Python 81.08% Shell 1.75% C++ 10.69% Batchfile 0.61% CMake 0.68% C 0.12% Jupyter Notebook 5.07%

nlp data-loader deep-learning pytorch dataset models

text's Introduction

PyTorch is a Python package that provides two high-level features:

Tensor computation (like NumPy) with strong GPU acceleration
Deep neural networks built on a tape-based autograd system

You can reuse your favorite Python packages such as NumPy, SciPy, and Cython to extend PyTorch when needed.

Our trunk health (Continuous Integration signals) can be found at hud.pytorch.org.

More About PyTorch
Installation
Getting Started
Resources
Communication
Releases and Contributing
The Team
License

More About PyTorch

Learn the basics of PyTorch

At a granular level, PyTorch is a library that consists of the following components:

Component	Description
torch	A Tensor library like NumPy, with strong GPU support
torch.autograd	A tape-based automatic differentiation library that supports all differentiable Tensor operations in torch
torch.jit	A compilation stack (TorchScript) to create serializable and optimizable models from PyTorch code
torch.nn	A neural networks library deeply integrated with autograd designed for maximum flexibility
torch.multiprocessing	Python multiprocessing, but with magical memory sharing of torch Tensors across processes. Useful for data loading and Hogwild training
torch.utils	DataLoader and other utility functions for convenience

Usually, PyTorch is used either as:

A replacement for NumPy to use the power of GPUs.
A deep learning research platform that provides maximum flexibility and speed.

Elaborating Further:

A GPU-Ready Tensor Library

If you use NumPy, then you have used Tensors (a.k.a. ndarray).

PyTorch provides Tensors that can live either on the CPU or the GPU and accelerates the computation by a huge amount.

We provide a wide variety of tensor routines to accelerate and fit your scientific computation needs such as slicing, indexing, mathematical operations, linear algebra, reductions. And they are fast!

Dynamic Neural Networks: Tape-Based Autograd

PyTorch has a unique way of building neural networks: using and replaying a tape recorder.

Most frameworks such as TensorFlow, Theano, Caffe, and CNTK have a static view of the world. One has to build a neural network and reuse the same structure again and again. Changing the way the network behaves means that one has to start from scratch.

With PyTorch, we use a technique called reverse-mode auto-differentiation, which allows you to change the way your network behaves arbitrarily with zero lag or overhead. Our inspiration comes from several research papers on this topic, as well as current and past work such as torch-autograd, autograd, Chainer, etc.

While this technique is not unique to PyTorch, it's one of the fastest implementations of it to date. You get the best of speed and flexibility for your crazy research.

Python First

PyTorch is not a Python binding into a monolithic C++ framework. It is built to be deeply integrated into Python. You can use it naturally like you would use NumPy / SciPy / scikit-learn etc. You can write your new neural network layers in Python itself, using your favorite libraries and use packages such as Cython and Numba. Our goal is to not reinvent the wheel where appropriate.

Imperative Experiences

PyTorch is designed to be intuitive, linear in thought, and easy to use. When you execute a line of code, it gets executed. There isn't an asynchronous view of the world. When you drop into a debugger or receive error messages and stack traces, understanding them is straightforward. The stack trace points to exactly where your code was defined. We hope you never spend hours debugging your code because of bad stack traces or asynchronous and opaque execution engines.

Fast and Lean

PyTorch has minimal framework overhead. We integrate acceleration libraries such as Intel MKL and NVIDIA (cuDNN, NCCL) to maximize speed. At the core, its CPU and GPU Tensor and neural network backends are mature and have been tested for years.

Hence, PyTorch is quite fast — whether you run small or large neural networks.

The memory usage in PyTorch is extremely efficient compared to Torch or some of the alternatives. We've written custom memory allocators for the GPU to make sure that your deep learning models are maximally memory efficient. This enables you to train bigger deep learning models than before.

Extensions Without Pain

Writing new neural network modules, or interfacing with PyTorch's Tensor API was designed to be straightforward and with minimal abstractions.

You can write new neural network layers in Python using the torch API or your favorite NumPy-based libraries such as SciPy.

If you want to write your layers in C/C++, we provide a convenient extension API that is efficient and with minimal boilerplate. No wrapper code needs to be written. You can see a tutorial here and an example here.

Installation

Binaries

Commands to install binaries via Conda or pip wheels are on our website: https://pytorch.org/get-started/locally/

NVIDIA Jetson Platforms

Python wheels for NVIDIA's Jetson Nano, Jetson TX1/TX2, Jetson Xavier NX/AGX, and Jetson AGX Orin are provided here and the L4T container is published here

They require JetPack 4.2 and above, and @dusty-nv and @ptrblck are maintaining them.

From Source

Prerequisites

If you are installing from source, you will need:

Python 3.8 or later (for Linux, Python 3.8.1+ is needed)
A compiler that fully supports C++17, such as clang or gcc (gcc 9.4.0 or newer is required)

We highly recommend installing an Anaconda environment. You will get a high-quality BLAS library (MKL) and you get controlled dependency versions regardless of your Linux distro.

If you want to compile with CUDA support, select a supported version of CUDA from our support matrix, then install the following:

NVIDIA CUDA
NVIDIA cuDNN v8.5 or above
Compiler compatible with CUDA

Note: You could refer to the cuDNN Support Matrix for cuDNN versions with the various supported CUDA, CUDA driver and NVIDIA hardware

If you want to disable CUDA support, export the environment variable USE_CUDA=0. Other potentially useful environment variables may be found in setup.py.

If you are building for NVIDIA's Jetson platforms (Jetson Nano, TX1, TX2, AGX Xavier), Instructions to install PyTorch for Jetson Nano are available here

If you want to compile with ROCm support, install

AMD ROCm 4.0 and above installation
ROCm is currently supported only for Linux systems.

If you want to disable ROCm support, export the environment variable USE_ROCM=0. Other potentially useful environment variables may be found in setup.py.

Install Dependencies

Common

conda install cmake ninja
# Run this command from the PyTorch directory after cloning the source code using the “Get the PyTorch Source“ section below
pip install -r requirements.txt

On Linux

conda install intel::mkl-static intel::mkl-include
# CUDA only: Add LAPACK support for the GPU if needed
conda install -c pytorch magma-cuda110  # or the magma-cuda* that matches your CUDA version from https://anaconda.org/pytorch/repo

# (optional) If using torch.compile with inductor/triton, install the matching version of triton
# Run from the pytorch directory after cloning
make triton

On MacOS

# Add this package on intel x86 processor machines only
conda install intel::mkl-static intel::mkl-include
# Add these packages if torch.distributed is needed
conda install pkg-config libuv

On Windows

conda install intel::mkl-static intel::mkl-include
# Add these packages if torch.distributed is needed.
# Distributed package support on Windows is a prototype feature and is subject to changes.
conda install -c conda-forge libuv=1.39

Get the PyTorch Source

git clone --recursive https://github.com/pytorch/pytorch
cd pytorch
# if you are updating an existing checkout
git submodule sync
git submodule update --init --recursive

Install PyTorch

On Linux

If you would like to compile PyTorch with new C++ ABI enabled, then first run this command:

export _GLIBCXX_USE_CXX11_ABI=1

If you're compiling for AMD ROCm then first run this command:

# Only run this if you're compiling for ROCm
python tools/amd_build/build_amd.py

Install PyTorch

export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}
python setup.py develop

Aside: If you are using Anaconda, you may experience an error caused by the linker:
build/temp.linux-x86_64-3.7/torch/csrc/stub.o: file not recognized: file format not recognized
collect2: error: ld returned 1 exit status
error: command 'g++' failed with exit status 1
This is caused by ld from the Conda environment shadowing the system ld. You should use a newer version of Python that fixes this issue. The recommended Python version is 3.8.1+.

On macOS

python3 setup.py develop

On Windows

Choose Correct Visual Studio Version.

PyTorch CI uses Visual C++ BuildTools, which come with Visual Studio Enterprise, Professional, or Community Editions. You can also install the build tools from https://visualstudio.microsoft.com/visual-cpp-build-tools/. The build tools do not come with Visual Studio Code by default.

If you want to build legacy python code, please refer to Building on legacy code and CUDA

CPU-only builds

In this mode PyTorch computations will run on your CPU, not your GPU

conda activate
python setup.py develop

Note on OpenMP: The desired OpenMP implementation is Intel OpenMP (iomp). In order to link against iomp, you'll need to manually download the library and set up the building environment by tweaking CMAKE_INCLUDE_PATH and LIB. The instruction here is an example for setting up both MKL and Intel OpenMP. Without these configurations for CMake, Microsoft Visual C OpenMP runtime (vcomp) will be used.

CUDA based build

In this mode PyTorch computations will leverage your GPU via CUDA for faster number crunching

NVTX is needed to build Pytorch with CUDA. NVTX is a part of CUDA distributive, where it is called "Nsight Compute". To install it onto an already installed CUDA run CUDA installation once again and check the corresponding checkbox. Make sure that CUDA with Nsight Compute is installed after Visual Studio.

Currently, VS 2017 / 2019, and Ninja are supported as the generator of CMake. If ninja.exe is detected in PATH, then Ninja will be used as the default generator, otherwise, it will use VS 2017 / 2019.
If Ninja is selected as the generator, the latest MSVC will get selected as the underlying toolchain.

Additional libraries such as Magma, oneDNN, a.k.a. MKLDNN or DNNL, and Sccache are often needed. Please refer to the installation-helper to install them.

You can refer to the build_pytorch.bat script for some other environment variables configurations

cmd

:: Set the environment variables after you have downloaded and unzipped the mkl package,
:: else CMake would throw an error as `Could NOT find OpenMP`.
set CMAKE_INCLUDE_PATH={Your directory}\mkl\include
set LIB={Your directory}\mkl\lib;%LIB%

:: Read the content in the previous section carefully before you proceed.
:: [Optional] If you want to override the underlying toolset used by Ninja and Visual Studio with CUDA, please run the following script block.
:: "Visual Studio 2019 Developer Command Prompt" will be run automatically.
:: Make sure you have CMake >= 3.12 before you do this when you use the Visual Studio generator.
set CMAKE_GENERATOR_TOOLSET_VERSION=14.27
set DISTUTILS_USE_SDK=1
for /f "usebackq tokens=*" %i in (`"%ProgramFiles(x86)%\Microsoft Visual Studio\Installer\vswhere.exe" -version [15^,17^) -products * -latest -property installationPath`) do call "%i\VC\Auxiliary\Build\vcvarsall.bat" x64 -vcvars_ver=%CMAKE_GENERATOR_TOOLSET_VERSION%

:: [Optional] If you want to override the CUDA host compiler
set CUDAHOSTCXX=C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.27.29110\bin\HostX64\x64\cl.exe

python setup.py develop

Adjust Build Options (Optional)

You can adjust the configuration of cmake variables optionally (without building first), by doing the following. For example, adjusting the pre-detected directories for CuDNN or BLAS can be done with such a step.

On Linux

export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}
python setup.py build --cmake-only
ccmake build  # or cmake-gui build

On macOS

export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}
MACOSX_DEPLOYMENT_TARGET=10.9 CC=clang CXX=clang++ python setup.py build --cmake-only
ccmake build  # or cmake-gui build

Docker Image

Using pre-built images

You can also pull a pre-built docker image from Docker Hub and run with docker v19.03+

docker run --gpus all --rm -ti --ipc=host pytorch/pytorch:latest

Please note that PyTorch uses shared memory to share data between processes, so if torch multiprocessing is used (e.g. for multithreaded data loaders) the default shared memory segment size that container runs with is not enough, and you should increase shared memory size either with --ipc=host or --shm-size command line options to nvidia-docker run.

Building the image yourself

NOTE: Must be built with a docker version > 18.06

The Dockerfile is supplied to build images with CUDA 11.1 support and cuDNN v8. You can pass PYTHON_VERSION=x.y make variable to specify which Python version is to be used by Miniconda, or leave it unset to use the default.

make -f docker.Makefile
# images are tagged as docker.io/${your_docker_username}/pytorch

You can also pass the CMAKE_VARS="..." environment variable to specify additional CMake variables to be passed to CMake during the build. See setup.py for the list of available variables.

CMAKE_VARS="BUILD_CAFFE2=ON BUILD_CAFFE2_OPS=ON" make -f docker.Makefile

Building the Documentation

To build documentation in various formats, you will need Sphinx and the readthedocs theme.

cd docs/
pip install -r requirements.txt

You can then build the documentation by running make <format> from the docs/ folder. Run make to get a list of all available output formats.

If you get a katex error run npm install katex. If it persists, try npm install -g katex

Note: if you installed nodejs with a different package manager (e.g., conda) then npm will probably install a version of katex that is not compatible with your version of nodejs and doc builds will fail. A combination of versions that is known to work is [email protected] and [email protected]. To install the latter with npm you can run npm install -g [email protected]

Previous Versions

Installation instructions and binaries for previous PyTorch versions may be found on our website.

Getting Started

Three-pointers to get you started:

Resources

Communication

Forums: Discuss implementations, research, etc. https://discuss.pytorch.org
GitHub Issues: Bug reports, feature requests, install issues, RFCs, thoughts, etc.
Slack: The PyTorch Slack hosts a primary audience of moderate to experienced PyTorch users and developers for general chat, online discussions, collaboration, etc. If you are a beginner looking for help, the primary medium is PyTorch Forums. If you need a slack invite, please fill this form: https://goo.gl/forms/PP1AGvNHpSaJP8to1
Newsletter: No-noise, a one-way email newsletter with important announcements about PyTorch. You can sign-up here: https://eepurl.com/cbG0rv
Facebook Page: Important announcements about PyTorch. https://www.facebook.com/pytorch
For brand guidelines, please visit our website at pytorch.org

Releases and Contributing

Typically, PyTorch has three minor releases a year. Please let us know if you encounter a bug by filing an issue.

We appreciate all contributions. If you are planning to contribute back bug-fixes, please do so without any further discussion.

If you plan to contribute new features, utility functions, or extensions to the core, please first open an issue and discuss the feature with us. Sending a PR without discussion might end up resulting in a rejected PR because we might be taking the core in a different direction than you might be aware of.

To learn more about making a contribution to Pytorch, please see our Contribution page. For more information about PyTorch releases, see Release page.

The Team

PyTorch is a community-driven project with several skillful engineers and researchers contributing to it.

PyTorch is currently maintained by Soumith Chintala, Gregory Chanan, Dmytro Dzhulgakov, Edward Yang, and Nikita Shulga with major contributions coming from hundreds of talented individuals in various forms and means. A non-exhaustive but growing list needs to mention: Trevor Killeen, Sasank Chilamkurthy, Sergey Zagoruyko, Adam Lerer, Francisco Massa, Alykhan Tejani, Luca Antiga, Alban Desmaison, Andreas Koepf, James Bradbury, Zeming Lin, Yuandong Tian, Guillaume Lample, Marat Dukhan, Natalia Gimelshein, Christian Sarofeen, Martin Raison, Edward Yang, Zachary Devito.

Note: This project is unrelated to hughperkins/pytorch with the same name. Hugh is a valuable contributor to the Torch community and has helped with many things Torch and PyTorch.

License

PyTorch has a BSD-style license, as found in the LICENSE file.

text's People

Contributors

Stargazers

Watchers

Forkers

chagge vyraun bmccann leonweber napsternxg smerity ink-pad binbinbian dvsrepo koustuvsinha mupavan dmitriy-serdyuk dogancan mattmacy terratenney fehiepsi little1tow zhanglae ajaytalati miyyer leezqcst gagb burtenshaw kobikun latkins nelson-liu joyce94 argilla-io winnerineast smartai alrojo akornilo deeptechlabs mhossny tobby2002 sojvai stevenlol matt-peters jianyuzhan codeaudit atifs aa1607 mainak24 zhangyunyan77 huangyiran blazezhazha woollysocks aotemandess ankithakur47 xkuang wangzhen-nlp czhang99 psavine42 yanhedewang sivareddyg kylegao91 19ai ignaciocases sonalgupta oroszgy hailingc gregorysenay waynedane yongxiongwei ahhegazy siddsach rachtsingh zhongminjin ml-ai-nlp-ir sampwing chenglongchen hengqujushi joseprzmoreno qiao-zhang elanmart entilzha wabyking lingyongyan zbxzc35 ryanleary galsang sinboyxx kelleyyin xiaoduozhou jkr26 kmkurn woailaosang astonzhang shubhampachori12110095 7125messi catcatrun frankatmech bastings heliwang tkim oracle1983 orionr sebastiangehrmann dreamgonfly anupsavvy

text's Issues

Possible bug in LanguageModelingDataset

In the code for LanguageModelingDataset, the original text seems to be pre-processed twice, viz.:

text += text_field.preprocess(line) at line 22
examples = [data.Example.fromlist([text], fields)] at line 26, which in turn calls
setattr(ex, name, field.preprocess(val)) at line 44 of example.py

In fact, if I try to create a simple LanguageModelingDataset, I am getting an error as follows:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/riddasgu/.local/lib/python2.7/site-packages/torchtext/datasets/language_modeling.py", line 28, in __init__
    examples = [data.Example.fromlist([text], fields)]
  File "/home/riddasgu/.local/lib/python2.7/site-packages/torchtext/data/example.py", line 44, in fromlist
    setattr(ex, name, field.preprocess(val))
  File "/home/riddasgu/.local/lib/python2.7/site-packages/torchtext/data/field.py", line 91, in preprocess
    x = self.tokenize(x)
  File "/home/riddasgu/.local/lib/python2.7/site-packages/torchtext/data/field.py", line 63, in <lambda>
    tokenize=(lambda s: s.split()), include_lengths=False,
AttributeError: 'list' object has no attribute 'split'

Feature Request: Implement sorting within batches

I don't think this is currently implemented -- pardon me if it is. It'd be great to sort within batches, most notably for use with pack_padded_sequence.

Load data from a list of lists and build a vocabulary

I have a dataset in which each sample is essentially a sequence of sentences. I have parsed the data to form a list of lists (which if I convert to a numpy array would give an ndarray of shape (M,N) ). Most of the functionality from what I could see is for loading dataset directly from files. Is there any way I could load the data from such a list of lists or a numpy array/pytorch Tensor and then use the build_vocab method on that?

min_freq=0 bug

Noticed:

>>>some_field.build_vocab(some_dataset, min_freq=0)
>>>padding_idx = some_field.vocab.stoi['<pad'>]
>>>print(padding_idx, '<pad>')
12 <pad>

Looks like is not equal to 1 which is not okay.

Printed stoi and itos as well:

>>>print(some_field.vocab.stoi)
defaultdict(<function Vocab.__init__.<locals>.<lambda> at 0x103f4f0d0>, {'<pad>': 12, '1': 2, '2': 3, '9': 4, '0': 5, '5': 6, '4': 7, '6': 8, '8': 9, '3': 10, '7': 11, '<unk>': 13})
>>>print(some_field.vocab.itos)
['<unk>', '<pad>', '1', '2', '9', '0', '5', '4', '6', '8', '3', '7', '<pad>', '<unk>']

Possible reason:
Counter subtract does remove the specials but puts their count at 0.
counter.subtract({tok: counter[tok] for tok in ['<unk>'] + specials})

Possible solution:
Throw an error if min_freq < 1

Error while loading csv

This line complains when loading data via csv:

train, val, test = data.TabularDataset.splits(path='/home/data/',train='train.csv',
    validation='val.csv', test='test.csv', format='csv',
    fields=[('text', text_field), ('labels', label_field)])

Error:

    272         if data[-1] == '\n':
    273             data = data[:-1]
--> 274         return cls.fromlist(list(csv.reader([data]))[0])
    275 
    276     @classmethod

TypeError: fromlist() takes exactly 3 arguments (2 given)

Using python 2.7

Splitting up torchtext/data.py into files for each class

As it currently stands, torchtext/data.py has a lot of functionality in one file. I think it'd be nice to split the file into pieces separated by individual functionality, e.g. a file for Fields, a file for Datasets (and their subclasses, of course), a file for Iterators, etc. In particular, I think the structure of torch.nn would work well here.

I think doing this would be a lot clearer in terms of organization of the codebase, but the biggest issue I have with it is that it changes the syntax for the import from from torchtext import data to import torchtext.data as data (much as one would do import torch.nn as nn), rendering the code backwards-incompatible. But i'm not sure how much weight is placed on this, considering the repo isn't on pypi yet and there's a big WIP on the readme...what do you all think?

py2 fails with snli example

Just FYI, seems to fail on master.
Consider adding a travis contbuild like vision or tnt packages to catch these early.

~/local/examples/snli] python train.py
downloading
extracting
Traceback (most recent call last):
  File "train.py", line 22, in <module>
    train, dev, test = datasets.SNLI.splits(inputs, answers)
  File "/home/soumith/local/miniconda2/lib/python2.7/site-packages/torchtext/datasets/snli.py", line 47, in splits
    filter_pred=lambda ex: ex.label != '-')
  File "/home/soumith/local/miniconda2/lib/python2.7/site-packages/torchtext/data.py", line 324, in splits
    train_data = None if train is None else cls(path + train, **kwargs)
  File "/home/soumith/local/miniconda2/lib/python2.7/site-packages/torchtext/data.py", line 398, in __init__
    examples = [make_example(line, fields) for line in f]
  File "/home/soumith/local/miniconda2/lib/python2.7/site-packages/torchtext/data.py", line 232, in fromJSON
    return cls.fromdict(json.loads(data), fields)
  File "/home/soumith/local/miniconda2/lib/python2.7/site-packages/torchtext/data.py", line 241, in fromdict
    setattr(ex, name, field.preprocess(val))
  File "/home/soumith/local/miniconda2/lib/python2.7/site-packages/torchtext/data.py", line 136, in preprocess
    x = Pipeline(str.lower)(x)
  File "/home/soumith/local/miniconda2/lib/python2.7/site-packages/torchtext/data.py", line 30, in __call__
    x = pipe.call(x)
  File "/home/soumith/local/miniconda2/lib/python2.7/site-packages/torchtext/data.py", line 36, in call
    return self.convert_token(x, *args)
TypeError: descriptor 'lower' requires a 'str' object but received a 'unicode'

Randomly initialising word vectors

There doesn't seem to be the option to initialise word vectors without using pretrained embeddings. There's an option to fill in vectors for tokens missing from the pretrained embeddings with normally distributed values. It would be cool if there was a built in option to initialise embeddings from a uniform distribution without having to specify a word embedding file.

Datasets used in tests

Hello,

Can you please provide the datasets that are used in the "tests"?
It references to this path: "~/chainer-research/jmt-data/pos_wsj/pos_wsj" but I can't find the pos_wsj dataset.
Since the docs are not that great, I think the tests will be a good place to start learning.

Also, in the "tests/vocab.py" it references the Glove 300d vectors but it is not whether it is a text file or what?

Length of iterator fails in Python 2

The division len(dataset) / batch_size will be cast to int in python2, so that math.ceil doesn't really work when len(dataset) is not a multiple of batch size.

Consistency with sorting: torch.RNN default vs torchtext.Iterator default

Torch RNN Default:
Using torch.RNN with batches it is required that the batch is sorted with decreasing lengths.

Default Behavior:*
The default behavior of BucketIterator does not work well with Torch RNN.

    train_iter, dev_iter, test_iter = data.BucketIterator.splits(
        (train, dev, test),
        batch_sizes=(32, 256, 256),
        sort_key=lambda x: data.interleave_keys(-len(x.input), -len(x.output)),
        device=-1)  # Use CPU

By default train_iter is compatible with Torch RNN. All the batches are sorted by decreasing lengths.

But by default dev_iter and test_iter are not sorted by decreasing lengths. This is because train=False then self.sort=True. Then because of this issue, #69 dev_iter and test_iter are actually shuffled.

Possible Solution:
It's possible that solving this issue #69 will solve this one as well but I think these are two separate bugs.

A typo in README

In README, under Data/Batching, padding, and numericalizing (including building a vocabulary object), there is one missing closing paren for the following part:

mt_dev = data.TranslationDataset(
    path='data/mt/newstest2014', exts=('.en', '.de'),
    fields=(src, trg)

Bug in Last Commit: Missing function len() in type Example

Hi everyone!

It seems like the last commit of data.py by @jekbradbury introduced a bug. Here's the stacktrace of my code that works with the second last revision (8b5c731)

Traceback (most recent call last):
  File "/data/mulga/oana-2/anaconda3/envs/nips-2017/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/data/mulga/oana-2/anaconda3/envs/nips-2017/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/data/mulga/oana-2/experiments/src/main/python/nips2017/__main__.py", line 207, in main
    exp.run()
  File "/data/mulga/oana-2/experiments/src/main/python/nips2017/experiment/base_experiment.py", line 431, in run
    self._train(epoch)
  File "/data/mulga/oana-2/experiments/src/main/python/nips2017/experiment/experiment_2.py", line 375, in _train
    for iteration, batch in enumerate(self._data.training_iter, 1):
  File "/data/mulga/oana-2/anaconda3/envs/nips-2017/lib/python3.6/site-packages/torchtext/data.py", line 589, in __iter__
    for minibatch in self.batches:
  File "/data/mulga/oana-2/anaconda3/envs/nips-2017/lib/python3.6/site-packages/torchtext/data.py", line 463, in pool
    for p in batch(data, batch_size * 100, batch_size_fn):
  File "/data/mulga/oana-2/anaconda3/envs/nips-2017/lib/python3.6/site-packages/torchtext/data.py", line 442, in batch
    size_so_far += batch_size_fn(ex)
TypeError: object of type 'Example' has no len()

I didn't have a detailed look at it yet, and can provide a minimal example that reproduces the bug later, but I suppose the stacktrace might already be enough for you guys to know what's going on.

Best,
Patrick

TypeError: decoding Unicode is not supported

Python Verison:

$ python --version
Python 2.7.13

Error:

Traceback (most recent call last):
  File "examples/sample.py", line 81, in <module>
    fields=[('input', qa_field), ('output', qa_field)])
  File "build/bdist.macosx-10.12-x86_64/egg/torchtext/data/dataset.py", line 56, in splits
  File "build/bdist.macosx-10.12-x86_64/egg/torchtext/data/dataset.py", line 107, in __init__
  File "build/bdist.macosx-10.12-x86_64/egg/torchtext/data/example.py", line 31, in fromTSV
  File "build/bdist.macosx-10.12-x86_64/egg/torchtext/data/example.py", line 44, in fromlist
  File "build/bdist.macosx-10.12-x86_64/egg/torchtext/data/field.py", line 89, in preprocess
  File "build/bdist.macosx-10.12-x86_64/egg/torchtext/data/pipeline.py", line 13, in __call__
  File "build/bdist.macosx-10.12-x86_64/egg/torchtext/data/pipeline.py", line 19, in call
  File "build/bdist.macosx-10.12-x86_64/egg/torchtext/data/field.py", line 89, in <lambda>
TypeError: decoding Unicode is not supported

Possible solution:
Check at line 88 that x is not already a Unicode:
https://github.com/pytorch/text/blob/master/torchtext/data/field.py#L88

Code:
if six.PY2 and isinstance(x, six.string_types) and not isinstance(x, unicode):

Serializing Fields

Is there a canonical way to serialize Fields for later use? (e.g. if you want to load a model, preprocess/numericalize some test data, and then run the model on the test data)?

torch.save won't work on it out of the box since Pipeline and Vocab both have un-pickleable lambda locals. I hackily got around it by just redefining these lambdas as named functions (happy to clean it up and send a PR if you want), but I'm wondering if there's a better way.

Thanks!

Understanding vocabulary of text and labels

I have a dataset containing a text field text and corresponding class field lbl, which consists of only two classes, numeric 0 and 1. I am loading the .tsv file as written in the Readme :

text_field = data.Field()
label_field = data.Field()

train, val, test = data.TabularDataset.splits(path='/data/',train='train.tsv',
    validation='val.tsv', test='test.tsv', format='tsv',
    fields=[('text', text_field), ('lbl', label_field)])

I understand that this line builds the vocabulary :

text_field.build_vocab(train,val)
label_field.build_vocab(train,val)

But when I am testing how many vocabulary items I have for label field using len(label_field.vocab). I am not getting the value 2. While the text field vocabulary seems correct (40k+). What am I doing wrong? Is there any way to view the data within text_field and label_field ?

import error

I have installed this package in my machine with python 2.7, however, when i import it, it gives me error:

Segmentation fault (core dumped)

Consistency with sorting: `sort=True`

Problem:

    train_iter, dev_iter, test_iter = data.BucketIterator.splits(
        (train, dev, test),
        batch_sizes=(32, 256, 256),
        sort_key=lambda x: len(x.input),
        sort=True,
        device=-1)  # Use CPU

If sort=True and train=True, then the train_iter batches are shuffled. This behavior is unexpected.

Cause:
Because by default self.shuffle=True is train=True. Then https://github.com/pytorch/text/blob/master/torchtext/data/iterator.py#L113 shuffle overrides sort.

Possible Solution:
sort=True should override shuffle=None and train=True.

Fuzzy dictionary request for Vocab stoi class member

I would like to make a feature request for making the Vocab.stoi dictionary (make it an option?) such as the below so that approximate words are matched such as "hellpp" -> "help".

from collections import defaultdict

from fuzzywuzzy import process


class FuzzyDict(defaultdict):
    """ FuzzyDict attempts to pair to find a good key match
        before resorting to the passed default_factory
    """

    def __init__(self, default_factory, threshold=85, process_fn=process.extractOne):
        """

        :default_factory: the factory function that outputs the values for
            keys not in the dictionary
        :params threshold: is the score that the process function should
            output to be accepted as a key
        """
        self.default_factory = default_factory
        self.threshold = threshold
        self.process_fn = process_fn

    def __missing__(self, key):
        """ Handle a key that does not exist in the FuzzyDict """

        if len(self) > 0:
            best_choice, score = self.process_fn(key, self.keys())
            if score > self.threshold:
                return self[best_choice]

        return self.default_factory()

Use case is as follows:

>>> fuzz_dict = FuzzyDict(lambda: 0)
>>> fuzz_dict['Obama'] = 1
>>> fuzz_dict['Omaha']
0
>>> fuzz_dict['Oabama']
1
>>> fuzz_dict['Obama ']
1
>>> fuzz_dict[  'Obama']
1
>>> fuzz_dict['  Obama']
1
>>> fuzz_dict['B Obama']
1
>>> fuzz_dict['B. Obama']
1
>>> fuzz_dict['Barack'] = 2
>>> fuzz_dict['Barak']
2
>>> fuzz_dict['help'] = 5
>>> fuzz_dict['hlep']
0
>>> fuzz_dict['helpp']
5

Feature Request: Word+Character-level tokenization

Hi,
Thanks for your awesome work on this, this library looks super useful. I was wondering whether it was possible to tokenize a sequence into both words (list of string) and characters (list of list of 1-len string); from a look through the source code, it doesn't seem supported yet but I may have missed something.

I'd be happy to contribute something to extend torchtext to support this, but I'm not sure what the proper way to handle this would be (ideally it'd be extensible to other tokenization schemes as well, but perhaps that's a stretch). Thoughts?

Thanks!

error on import

File "/home/ehoffer/anaconda2/lib/python2.7/site-packages/torchtext-0.1.1-py2.7.egg/torchtext/data.py", line 87
    def build_vocab(self, *args, lower=False, **kwargs):
                                     ^
SyntaxError: invalid syntax

Maybe a python 2.7 vs. 3 issue

How to use pytorch text in the projects

Hi,

I've very limited python knowledge. I couldn't find how to integrate pytorch-text to my projects. When I type from dataloaders.text.torchtext import data to python reply after import torch I got the following error:

>>> from dataloaders.text.torchtext import data
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named dataloaders.text.torchtext

Then I decided, my system don't have the corrresponding package and I tried to install it like torch-vision package. (pip install torchvision) Unfortunatelly i couldn't make it since there isn't such a package. So how could I use this package ?

[Discussion] Object Design

Background: Looking into including a subword tokenizer. Been having a difficult time figuring out the right abstractions. The problem is a subword tokenizer unlike other tokenizers requires to define its own subword vocabulary.

Thinking about the above, I have a couple discussion questions.

Field object vs Batch object abstraction
The Field object defines instructions for dealing with batches of examples. Should the Batch object handle this responsibility instead? The Field object should only be defined for one example.

Rename the Field object to TextEncoder
Tensor2Tensor defines the functionality of converting text to tensors as a "TextEncoder". "TextEncoder" to me is a clearer object name than "Field". Should we rename the "Field" object?

Reference: https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/data_generators/text_encoder.py

Proper abstraction tokenizers and field objects
The subword tokenizer requires a subword vocabulary. To achieve this, one would need to override the build_vocab method. To implement this, you'd need a separate subword field.

This is not consistent with the moses and spacy tokenizers that do not require separate fields.

Missing check for whether device is None in data.Field.numericalize

I'm getting the following error when calling data.Field.numericalize without a device argument. It seems like there needs to be a check for whether device is None before entering this context manager.

Traceback (most recent call last):
  File "./train.py", line 102, in <module>
    main()
  File "./train.py", line 83, in main
    for step, batch in enumerate(train_iter):
  File "build/bdist.linux-x86_64/egg/torchtext/data.py", line 579, in __iter__
  File "build/bdist.linux-x86_64/egg/torchtext/data.py", line 482, in __init__
  File "build/bdist.linux-x86_64/egg/torchtext/data.py", line 233, in numericalize
  File "/home/dogan/anaconda2/lib/python2.7/site-packages/torch/cuda/__init__.py", line 132, in __enter__
    torch._C._cuda_setDevice(self.idx)
RuntimeError: invalid argument to setDevice

[Feature Request] parameter for detokenization in fields

The ability to specify how to detokenize a tensor using the field.

Particularly this came up when using https://github.com/google/sentencepiece. Would like the ability to add subword units together as part of field. A counter part to tokenize.

By default numericalize should have an opposite counter part to denumericalize.

<unk> token constant

Problem
To reference the '<unk>' token, one needs to rely on the string '<unk>' or vocab.itos[0].

Solution A
The client is able to set the '<unk>' token. The Vocab object throws an error if it is not set.

Solution B
The '<unk>' token needs to be defined as a constant in the Vocab class.

Bug in tokenization during Field preprocessing

Refer to line - https://github.com/pytorch/text/blob/master/torchtext/data.py#L135.

        if self.sequential and isinstance(x, str):

Note that in python2, isinstance(x, str) returns False when x is unicode. Took a while to realize why tokenization was not happening for me. I guess a simple fix would be isinstance(x, six.string_types). Shall I issue a PR (though this is an extremely tiny change)?

Include filter in preprocessing pipeline

Now there is filter_pred in Dataset's constructor, but it happens only before the preprocessing step.
Also the way preprocessing is done now (x = preprocess(x)) does not support filtering operations.
So filtering on tokenized or numerized sequences is not supported or very heavy if done in filter_pred.

Cache build_vocab; Shared vocabulary

src.build_vocab(mt_train, max_size=80000) trg.build_vocab(mt_train, max_size=40000)

In the README example, it looks like build_vocab is used twice on the same dataset. For large datasets this could take awhile.

BucketIterator vs BPTTIterator Tensor vs Not

BPTTIterator numericalizes the data and turns it into a tensor while BucketIterator does not.

Do you think the behavior here should be consistent? Either the data is always converted to tensors or is not?

Using Iterator.splits(...) only on a test Dataset object

I have noticed that for the function body of the Iterator.splits() function in torchtext/data.py. It assumes that the first dataset passed is a training dataset and the following are not.

    @classmethod
    def splits(cls, datasets, batch_sizes=None, **kwargs):
        """Create Iterator objects for multiple splits of a dataset.

        Arguments:
            datasets: Tuple of Dataset objects corresponding to the splits. The
                first such object should be the train set.
            batch_sizes: Tuple of batch sizes to use for the different splits,
                or None to use the same batch_size for all splits.
            Remaining keyword arguments: Passed to the constructor of the
                iterator class being used.
        """
        if batch_sizes is None:
            batch_sizes = [kwargs.pop('batch_size')] * len(datasets)
        ret = []
        for i in range(len(datasets)):
            train = i == 0
            ret.append(cls(
                datasets[i], batch_size=batch_sizes[i], train=train, **kwargs))
        return tuple(ret)

If I were to pass train argument (boolean used to make Variables volatile) inside the kwargs there will be a conflict between the two. In my use case, I had another python script to run a test case and I did not want to load all the train, dev data to use the test dataset for evaluation. It will fail from passing two train arguments inside.

test_iter = data.Iterator.splits(
    (test, ), batch_size=args.batch_size, device=args.gpu, repeat=False, train=False)[0]

I am not sure if this even needs attention, but I thought I might as well post it.

-- Edit
Closed because I should just initialize it with Iterator itself. Did not notice because I followed an example...

[Discussion] Saving the field object

Usage:
The field object is critical to checkpointing as it provides:

tokenization
padding
numericalize

Having the ability to save the field object allows the user, given arbitrary text, to preprocess the text. The preprocessed text is then used with a checkpointed model. Then the output is predicted and interpreted without the output dictionary.

Problem:
torch.save is implemented with pickle. The field object accepts lambdas for tokenization, preprocessing and postprocessing; therefore, cannot be pickled.

Key Points:

The vocab object needs to be pickled because the output of the model is uninterpretable without it.

Discussion:
What is the right abstraction here?
Should the vocab object be saved and the field object discarded?
Is it appropriate to have the field object and the vocab object closely bound?

Example of an embedding loader?

Hi,

is the any chance of an example of a pretrained word embeddings loader?

A single example of how to quickly load say, word2vec or glove, would be really cool. I guess once, people see a common example and use it, it should be straightforward to adapt the loader to other pretrained embeddings.

Thanks a lot 👍

PS - I saw this thread on the opennmt forum, but I couldn't get it to work?

Dimension order of batches

It seems the BucketIterator produces batches in WxN dim order rather than NxW dim order, where N is the batch size. Given that PyTorch tends to want data with the batch dimension first, wouldn't it be better to transpose the output? Is there an advantage to the current output? (I suppose it's nice for RNN inputs, but I'm still trying to figure out why PyTorch wants RNN inputs in that form).

Include Moses Tokenizer

TypeError: BucketIterator object is not an iterator

BucketIterator it be an iterator must also implement the next function.

https://stackoverflow.com/questions/33956034/why-is-an-iterable-object-not-an-iterator

Special token index is verbose.

Context
Special tokens are frequently used for masking or padding or interpreting the model. It's important in a Encoder/Decoder context that the decoder and encoder share the same indexes for EOS, SOS, and PAD.

Problem
Creating two fields, one for French and one for English, there are no class constants for the index of eos_token. The only way to find out the index of eos_token is per instance of the class (etc. self.stoi[eos_token]).

The code by default is not designed to guarantee that the French dictionary has the same EOS index as the English dictionary.

Possible Solution A
With setting the optional parameter 'eos_token' would it be possible to set 'eos_token_index'?

Possible Solution B
Vocab or Field constant for the index of special tokens.

load_vectors does not fail gracefully when the requested #dimensions is not available

While testing out my SQUAD loader I discovered that no matter what dimension is requested for glove.42B load_vectors will download the 300d vectors. And if it's not 300 it will download it over and over ad infinitum:

downloading word vectors from http://nlp.stanford.edu/data/glove.42B.300d.zip
extracting word vectors into .data_cache
downloading word vectors from http://nlp.stanford.edu/data/glove.42B.300d.zip
extracting word vectors into .data_cache
downloading word vectors from http://nlp.stanford.edu/data/glove.42B.300d.zip
extracting word vectors into .data_cache
downloading word vectors from http://nlp.stanford.edu/data/glove.42B.300d.zip

If you're busy I can fix it myself, but it seems pretty silly to have an option to the API that doesn't actually work.

Epoch event in Iterator

The iterator implementation allows it to repeat. That is cool.

For Machine Learning, it is typical to evaluate the model at the end of an Epoch. Allow the user to add a function to the iterator that is called at the start of a new Epoch.

Building docs

It'd be nice to actually get docs built for people to reference, even while this is still WIP. I'm happy to set up a Sphinx project, but i noticed that pytorch/vision doesn't have docs within the repo, but rather in the main pytorch repo.

Thus, is it preferable to:

Set up a Sphinx project in this repo (and move it over to the main repo when this is less WIP)?
Set up a Sphinx project in the main repo?
Don't even bother with building docs yet?

cc @jekbradbury (@soumith @apaszke may have things they want to say about this as well)

Add interfaces to expose itos and stoi

Maybe I didn't find it through the code, but it'd be nice to expose the token-index translation function through vocabulary interfaces.

error when x is a string instead of a unicode

this line, six.text_type.lower requires the input x to be of type unicode, so when x is a string, this will complain with an error. what about changing this line to be:

 x = Pipeline(six.text_type.lower)(unicode(x))

any ideas

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3

Tried opening a text/plain; charset=utf-8 file with torchtext.

# file -i data/simple_questions_wikidata/train.tsv
data/simple_questions_wikidata/train.tsv: text/plain; charset=utf-8

Got this stack trace:

Traceback (most recent call last):
  File "src/jobs/seq2seq/train.py", line 234, in <module>
    fields=[('input', input_field), ('output', output_field)])
  File "/usr/local/lib/python3.5/dist-packages/torchtext/data/dataset.py", line 56, in splits
    train_data = None if train is None else cls(path + train, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/torchtext/data/dataset.py", line 107, in __init__
    for line in f]
  File "/usr/local/lib/python3.5/dist-packages/torchtext/data/dataset.py", line 106, in <listcomp>
    make_example(line.decode('utf-8') if six.PY2 else line, fields)
  File "/usr/lib/python3.5/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0

Fixed with:
with open(os.path.expanduser(path), encoding='utf-8') as f:

Here: https://github.com/pytorch/text/blob/master/torchtext/data/dataset.py#L104

Does Field support numerical features?

Hi,

I am processing a text dataset. Beside the text feature which could be processed by Field in a subclass of Dataset, there is some float numbers which I would like to use with the text. So how could I create a subclass of data.Dataset with appropriate Field?

Thanks,

Batch does not carry index

Use Case:
replace_unk most strategies of replacing tokens rely on aligning with the source sequence before numericialize

Problem:
Using the Batch object, you are unable to retrieve the original text before padding and numericialize.
There are no indexes stored with the batch to retrieve the original text in the dataset.

Quick work around:
Define a field in dataset that is an 'index' field. While building your dataset, pass in indexes for each item.

Batch will then allow you to look up an index attribute.

Segmentation Fault

I did a git clone of this repo, ran python setup.py install, and then when I try to do a simple import torchtext, I keep getting a segmentation fault.
Based on the comments from #11 , I tried re-installing all relevant package dependencies, such as numpy, scipy, nltk, but the error persists. As observed in the referenced issue, I have also tried the following which works fine:

import nltk (even numpy or scipy or matplotlib here works just fine)
import torchtext

Further, as suggested by @soumith in the other thread, I tried using gdb, which gives me the following point where the segfault is actually happening:

Program received signal SIGSEGV, Segmentation fault.
0x00007fffdfeb7fc0 in PyArray_API () from /home/riddhiman/.local/lib/python2.7/site-packages/numpy/core/multiarray.so

Any pointers as to why this is happening, and how to resolve this issue? For now, I am using the workaround of importing numpy and then importing torchtext, but it would be nice to know the real cause.

Restart Iterator from a particular Epoch and Batch

For Machine Learning tasks, it's useful to be able to restart the task at a checkpoint. For the iterator implementation, it looks it's not possible to start the iterator at an arbitrary epoch and batch.

max_size vocab is not consistent.

Context:
Num field includes the numbers 0 - 9. I set max_size=10. Then I print the vocab that was built:

    num_field.build_vocab(train, max_size=10)
    print(num_field.vocab.itos)
    # ['<unk>', '<pad>', '<s>', '</s>', u'1', u'2']
    print(len(num_field.vocab.itos))
    # 6

Then I checked the words created from tokenization:

print(words)
# [(u'1', 11308), (u'2', 11270), (u'9', 11058), (u'0', 11020), (u'5', 10952), (u'4', 10942), (u'6', 10914), (u'8', 10820), (u'3', 10766), (u'7', 10706), ('</s>', 0), ('<pad>', 0), ('<s>', 0), ('<unk>', 0)]

Looks like the vocab built includes only 6 tokens yet the max_size is 10 while there are 14 possible tokens.

Problem:
If the number of tokens is larger than max_size, build_vocab does not fill up the vocabulary up till max_size.

Possible Solution:
Update https://github.com/pytorch/text/blob/master/torchtext/vocab.py#L129 to not subtract len(self.itos) from max_size.

Error on loading json data

The following code is giving me an AttributeError: 'list' object has no attribute 'items' error:

sent_feats = data.Field()
sequences = data.TabularDataset(path=captions_path, format="json", \
                    fields=[{'caption' : ("sentences", sent_feats)}])

My dict has the following structure:

{ seq_id : { caption_id : { caption: "" } } }

Is there any syntactical error I am making?

Using a field representing real numbers with the iterator

I am trying to learn a regressor on text data and I use torchtext in all my other tasks but I see a problem in using it for this use case.

I define the field for targets as follows:

TARGETS = data.Field(
            sequential=False, tensor_type=torch.DoubleTensor, batch_first=True)
self.fields = [('targets', TARGETS), ('text', TEXT)]
self.train, self.val, self.test = data.TabularDataset.splits(
            path=self.path,
            train=self.train_suffix,
            validation=self.val_suffix,
            test=self.test_suffix,
            format=formatting,
            fields=self.fields)
TEXT.build_vocab(self.train)

I have a file that contains tab separate \t

When I make iterators out of it,

train_iter, val_iter, test_iter = data.Iterator.splits(
                (self.train, self.val, self.test),
                batch_sizes=(self.batch_size, self.test_batch_size,
                             self.test_batch_size),
                sort_key=lambda x: len(x.text),
                shuffle=True)
print(next(iter(train_iter)))

it gives me an error when getting the next batch:

AttributeError: 'Field' object has no attribute 'vocab'

I know this is because I didn't run .build_vocab for the TARGETS field. But why do I really need to do this? What if I just want to get real numbers and compute losses on them?

Any workaround is appreciated. If I am doing something wrong, please let me know too.

pytorch / text Goto Github PK

text's Introduction

More About PyTorch

A GPU-Ready Tensor Library

Dynamic Neural Networks: Tape-Based Autograd

Python First

Imperative Experiences

Fast and Lean

Extensions Without Pain

Installation

Binaries

NVIDIA Jetson Platforms

From Source

Prerequisites

Install Dependencies

Get the PyTorch Source

Install PyTorch

Adjust Build Options (Optional)

Docker Image

Using pre-built images

Building the image yourself

Building the Documentation

Previous Versions

Getting Started

Resources

Communication

Releases and Contributing

The Team

License

text's People

Contributors

Stargazers

Watchers

Forkers

text's Issues

Recommend Projects

Recommend Topics

Recommend Org