asyml / forte


Forte is a flexible and powerful ML workflow builder. This is part of the CASL project: http://casl-project.ai/

License: Apache License 2.0

Python 99.23% Perl 0.65% Shell 0.13%
machine-learning natural-language-processing deep-learning python text-data data-processing information-retrieval natural-language pipeline

forte's People

Contributors

atif93, avinashbukkittu, bhaskar2443053, cz9779, dependabot[bot], digo, gpengzhi, haoyulucas, hepengfe, hunterhector, j007x, jasonyanwenl, jennyzhang-petuum, jiaqiang-ruan, jieralice13, jrxk, jzpang, mgupta1410, mingkaid, mylibrar, piyush13y, pushkar-bhuse, qinzzz, seanrosario, swapnull7, vincentyaombzuai, wanglec, weiwei718, xuezhi-liang, zhanyuanucb


forte's Issues

Interleaving batch processing and pack processing

The current pipeline has an issue when interleaving batch processing and pack processing.

Current behavior: The batch processor does not produce any packs while a pack is not yet fully processed, and the next pack processor breaks when it receives no packs.

Expected behavior: The next processor should always wait for the previous processor to yield, and should not break.
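A minimal sketch of the expected yielding behavior, using plain Python generators (the function names are hypothetical, not Forte APIs): the downstream processor simply iterates and thus waits for the upstream batch processor to yield, rather than breaking when nothing has been produced yet.

```python
def batch_processor(packs, batch_size=2):
    # Buffers packs and only yields once a full batch is ready;
    # in between it yields nothing, which downstream code must tolerate.
    buffer = []
    for pack in packs:
        buffer.append(pack)
        if len(buffer) >= batch_size:
            yield from buffer
            buffer = []
    # Flush the remaining partial batch at the end of the stream.
    yield from buffer

def pack_processor(packs):
    # Simply iterates: it waits for the upstream generator to yield
    # and never breaks when the upstream is temporarily empty.
    for pack in packs:
        yield pack.upper()

result = list(pack_processor(batch_processor(iter(["a", "b", "c"]))))
```

Chaining the two generators this way means back-pressure is handled by the iteration protocol itself.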

Model or data download.

We need to provide a stable download path for the models and data required for training.

Refine the training pipeline with executor and caching

We planned to use the Texar executor for Forte, so we kept the BaseTrainer in the pipeline simple. Now that the executor is mostly done, we can start integrating it into our pipeline.

On the other hand, the trainer currently takes training data from DataPack. This is fine, but slow when the training instances recur. It is more efficient to cache (pickle) the training instances during this flow; we only need to provide an interface for users to specify what to store.

This includes the following steps:

  1. Extend BaseTrainer to TexarTrainer, which will use Executor to implement the methods. If you find that the interface of BaseTrainer needs to change, feel free to redesign it.
  2. Refactor ner_trainer to use TexarTrainer.
  3. Serialize the training instances in ner_trainer to speed it up.
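A sketch of the caching idea in step 3, using pickle with one cache file per pack (cached_instances and its arguments are illustrative, not the proposed interface):

```python
import os
import pickle
import tempfile

def cached_instances(pack_id, extract_fn, cache_dir):
    """Return training instances for a pack, extracting them only once.

    extract_fn is the user-supplied hook that decides what to store.
    """
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, "%s.pkl" % pack_id)
    if os.path.exists(path):
        # Recurring instance: load the pickled copy instead of re-extracting.
        with open(path, "rb") as f:
            return pickle.load(f)
    instances = extract_fn()
    with open(path, "wb") as f:
        pickle.dump(instances, f)
    return instances

calls = []
def extract():
    calls.append(1)
    return [("word", "TAG")]

with tempfile.TemporaryDirectory() as d:
    first = cached_instances("p1", extract, d)
    second = cached_instances("p1", extract, d)  # served from cache
```

The second call hits the cache, so the (possibly expensive) extraction runs only once per pack.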

Docstring should be improved

The docstrings of Forte should be greatly improved in order to build valid documentation. We could take texar-pytorch as a reference.

gpt2 example is not executable

Run

python multipack_pipeline_gpt2.py

Have

Traceback (most recent call last):
  File "multipack_pipeline_gpt2.py", line 8, in <module>
    pl.init_from_config_path("sample_multipack_pipeline_gpt.yml")
  File "/Users/pengzhi.gao/Desktop/my_git/forte/forte/base_pipeline.py", line 121, in init_from_config_path
    self.init_from_config(configs)
  File "/Users/pengzhi.gao/Desktop/my_git/forte/forte/multipack_pipeline.py", line 47, in init_from_config
    class_args=reader_config.get("kwargs", {}),
  File "/Users/pengzhi.gao/Desktop/my_git/forte/forte/multipack_pipeline.py", line 85, in create_class_with_kwargs
    cls = get_class(class_name)
  File "/Users/pengzhi.gao/Desktop/my_git/forte/forte/utils.py", line 70, in get_class
    "Class not found in {}: {}".format(module_paths, class_name))
ValueError: Class not found in None: PlainSentenceTxtgenReader

Move MicrosoftBingTranslator into examples/

Currently we provide MicrosoftBingTranslator as a processor in Forte core. It could instead be moved into examples/, which would demonstrate how to write processors using 3rd-party APIs. The machine translation processor can be provided through a Texar model.

process_dataset_example.py is not executable

Run

python process_dataset_example.py

Have

Traceback (most recent call last):
  File "process_dataset_example.py", line 73, in <module>
    srl_dir) = sys.argv[1:]
ValueError: not enough values to unpack (expected 3, got 0)

Fix the usage of replace_operation

The previous understanding of the replace_operation function is wrong, as shown here:
https://github.com/hunterhector/forte/blob/master/forte/data/readers/plaintext_reader_test.py#L53

Expected Behavior

  1. The annotations correspond to the cleaned text in a data pack.
  2. The user can get the original text from the cleaned text with a function: get_original_text()
  3. The user can get the span ([begin, end]) in the original text by submitting a span in the cleaned text, with a function: get_original_span()

Please implement these two functions in the DataPack class.

Polish NER example

Is it possible to have a script that automatically downloads and prepares the dataset for the user? If not, can we have a more detailed description of the data preparation step? It is not clear how to prepare the CoNLL-2003 data for this example.

Stable serialization format

The current serialization is done through jsonpickle. The main problem with this approach is that every variable is serialized, even control variables and processing metadata. This makes the serialization unstable when we modify logic in the classes.

We should investigate a new serialization approach that only serializes and deserializes relevant information.

It is also worthwhile to keep track of the serialization format version, in case the format changes.
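One common Python approach that fits this description is to whitelist the serialized fields via __getstate__/__setstate__ and record a format version. The sketch below is illustrative, not Forte's actual implementation; the field names are invented:

```python
import pickle

class Entry:
    """Sketch of an entry that serializes only its declared fields."""
    SERIALIZED_FIELDS = ("tid", "begin", "end")
    FORMAT_VERSION = 1

    def __init__(self, tid, begin, end):
        self.tid, self.begin, self.end = tid, begin, end
        self._processing_state = {}  # control data; must not be serialized

    def __getstate__(self):
        # Only the whitelisted fields reach the serialized form.
        state = {k: getattr(self, k) for k in self.SERIALIZED_FIELDS}
        state["_format_version"] = self.FORMAT_VERSION  # track format changes
        return state

    def __setstate__(self, state):
        state.pop("_format_version", None)  # could branch on old versions here
        self.__dict__.update(state)
        self._processing_state = {}  # rebuilt fresh, never deserialized

entry = Entry(1, 0, 5)
entry._processing_state["tmp"] = "not serialized"
restored = pickle.loads(pickle.dumps(entry))
```

Because control state is excluded, changing the processing logic no longer changes the serialized format.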

Record query return information

Now the system has a pre-defined Query entry for IR systems. However, it does not yet properly define entries to hold the Query return values.

Data structure traversal support

NLP data structures are often interesting in themselves. In NLP it is often necessary to traverse a tree or follow a path in a graph: for example, one can follow the dependency tree to find the headword, or follow a network path (along specific relation links). Forte can support this in a natural way, and we can at the same time attach relevant features to the path (e.g. embeddings along the path, or relation labels along the path).
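For instance, head-finding over a dependency tree can be sketched in a few lines (the heads table is toy data, and path_to_root is a hypothetical helper, not a Forte API):

```python
# heads maps each token to its head word; None marks the root.
heads = {"quick": "fox", "brown": "fox", "fox": "jumps", "jumps": None}

def path_to_root(word, heads):
    """Follow dependency arcs from a word up to the root of the tree."""
    path = [word]
    while heads.get(word) is not None:
        word = heads[word]
        path.append(word)
    return path

chain = path_to_root("quick", heads)
```

Features (embeddings, relation labels) could then be attached to each hop of the returned path.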

Sharable serialization and reference for multi-pack

Multipacks are collections of single packs. This is a potential waste of memory and disk space, because the data packs could be shared among the multipacks. An advanced approach is to let the multipacks keep only references to the single packs, but this would require some redesign:

  1. How do we keep the references during the serialization and deserialization steps?
  2. Does this affect training and batch data creation?

Develop a reader for multiple-choice QA dataset

This issue is to add readers (https://github.com/asyml/forte/tree/master/forte/data/readers) for multiple-choice QA datasets, such as RACE. The tasks involve:

  1. Define an ontology for Multiple-choice QA
  2. Create a reader that reads the text from a dataset
  3. Create annotations for each question and choice, including the labels

Extra requirements:

  1. There could be a general version that contains the main logic, which can be extended to cope with different datasets.

Associate embeddings to entries

We need a consistent method to attach embeddings to entries. This can be done in one of the following ways, which apply in different cases:

  1. Store the embeddings directly in the entries; this is useful when the embedding is expensive to compute and cannot be looked up easily (say, a passage vector).
  2. Store the vocabulary index in each entry and require a resource (embedding dictionary) to be available at interpretation time.
  3. Store the function that computes the embedding together with the entry; this requires storing both the function and the resource at the same time, which seems tricky to implement.

set_fields should be replaced by explicit setters

Currently, all fields in the ontology are set via the set_fields method. This should not be used externally, since it allows the user to add arbitrary fields to the ontology. We should instead have the code-gen add explicit setters for each attribute.
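The generated code could look roughly like this (Token and its pos attribute are illustrative, not the actual generated ontology):

```python
class Token:
    """Sketch of generated code with an explicit, validated setter."""

    def __init__(self):
        self._pos = None

    @property
    def pos(self):
        return self._pos

    @pos.setter
    def pos(self, value):
        # An explicit setter can validate types, unlike a generic set_fields.
        if not isinstance(value, str):
            raise TypeError("pos must be a str")
        self._pos = value
```

Unlike set_fields, this makes the set of attributes fixed and type-checked at the point of assignment.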

Defects in `OntonotesReader`

OntonotesReader has the following issues:

  1. It breaks when encountering V spans with length greater than 1. V spans correspond to "verb" and are ignored, but the current logic for doing so is as follows:

    for label_index, label in enumerate(labels):
        arg_type = label.strip("()*")
        if arg_type == "V":
            continue

    If a V span with length greater than 1 is encountered, only the opening bracket is ignored, and an exception is raised when the closing bracket is encountered.

  2. The actual time complexity is O(n^2), due to appending each word to the text string:

    text += word + " "

    Since Python strings are immutable, a new string is created for each word, yielding O(n^2) complexity.
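A linear-time alternative is to accumulate the words in a list and join once at the end; the snippet below contrasts the two approaches on toy data:

```python
words = ["The", "quick", "brown", "fox"]

# Quadratic: each += copies the whole string built so far.
text = ""
for word in words:
    text += word + " "

# Linear: collect the parts once and join at the end.
text_fast = " ".join(words) + " "

assert text == text_fast
```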

Implement a SOTA dependency parser for UD

Let's implement a dependency parser for Universal Dependencies. This project involves reading data, training, evaluation, and writing modules, so after this exercise one should understand the pipeline pretty well.

Try to implement two dependency parsers on Universal dependency:

  1. Biaffine attention: https://openreview.net/pdf?id=Hk95PK9le
  2. Stack pointer network: https://aclweb.org/anthology/P18-1130

I would suggest using the following repository as references:

  1. https://github.com/XuezheMax/NeuroNLP2/blob/master/neuronlp2/models/parsing.py
  2. https://github.com/tdozat/Parser-v1

Please first try running the training in English and one other language of your choice. In the first iteration, let's aim at getting English to SOTA performance, and hopefully the other languages too.

Changing how we store Options in Question for easier serialization

Right now the _options field in a Question is stored directly as a List[Option].
This results in a nested Entry structure, which is difficult to serialize and generate code for. Thus, we will instead store _options as a List[int], a list of the tids of the Options.
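A minimal sketch of the proposed layout (Option, Question, and entry_index are simplified stand-ins for the real ontology classes and the pack's tid index):

```python
class Option:
    def __init__(self, tid, text):
        self.tid, self.text = tid, text

class Question:
    def __init__(self, option_tids):
        # Store only the tids; the Option entries themselves live in the
        # pack, so the serialized Question is flat, not nested.
        self._options = option_tids

# Stand-in for the pack's entry index, mapping tid -> entry.
entry_index = {1: Option(1, "Paris"), 2: Option(2, "London")}

q = Question([1, 2])
texts = [entry_index[tid].text for tid in q._options]
```

Resolution through the index happens at access time, keeping serialization and code generation simple.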

Design improvements for Resource APIs

Currently, the Resource APIs provide save and load methods to serialize and de-serialize their objects. However, we could provide additional features to Resources, such as the following:

  • The ability to register and (possibly) de-register keys to/from Resources in a pipeline. We can also accept serialize/deserialize functions during registration. One benefit of de-registering is that if a resource is no longer needed, we can de-register it and remove it from the Resources.
    Note: Currently we have a remove() method in Resources, which removes the key from the Resources. We need to further improve the logic of this method, e.g. to invalidate the serialize and deserialize functions associated with the key.
    We need to invalidate all state associated with a key during de-registration, to avoid inconsistencies when a new resource is later created with the same name.

  • Provide setter APIs to set serialize and deserialize functions on already existing keys of the Resources.

  • Provide default serialize and de-serialize functions so that when users pass a dict to save and load, they are not required to supply the serialize/deserialize functions as values of that dict.
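A rough sketch of what such a registry could look like; the class and method names mirror the description above but are illustrative, not the actual Forte API. pickle serves as the default serializer:

```python
import pickle

class Resources:
    """Sketch: per-key serializers and clean de-registration."""

    def __init__(self):
        self._data = {}
        self._serializers = {}

    def register(self, key, value,
                 serialize=pickle.dumps, deserialize=pickle.loads):
        # Defaults mean callers need not always pass serializer functions.
        self._data[key] = value
        self._serializers[key] = (serialize, deserialize)

    def deregister(self, key):
        # Drop the value AND its serializer functions, so a new resource
        # registered under the same name starts from a clean state.
        self._data.pop(key, None)
        self._serializers.pop(key, None)

    def save(self, key):
        serialize, _ = self._serializers[key]
        return serialize(self._data[key])

    def load(self, key, blob):
        _, deserialize = self._serializers[key]
        self._data[key] = deserialize(blob)
        return self._data[key]

r = Resources()
r.register("embeddings", {"hello": 1})
blob = r.save("embeddings")
r.deregister("embeddings")
r.register("embeddings", None)  # fresh registration under the same name
restored = r.load("embeddings", blob)
```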

process_string_example.py is not executable

Run

python process_string_example.py

Have

Traceback (most recent call last):
  File "process_string_example.py", line 152, in <module>
    main()
  File "process_string_example.py", line 128, in main
    1:3]
ValueError: not enough values to unpack (expected 2, got 0)

`DataPack.get()` has bad performance on large packs

The current logic for .get() (which internally calls .get_entries()) can be summarized as:

  1. Create sets of valid entry IDs for the given constraints:
    • Entry must be of type entry_type;
    • Entry must be generated by component;
    • Entry must be within span of range_annotation (if coverage index exists).
  2. Take the intersection of these sets, denote it as the "valid set".
  3. If entry_type is Annotation, then a faster code path exists:
    1. Find the lower and upper bounds of entry IDs given range_annotation. This is done using binary search on the Annotation index.
    2. Iterate over all entries within range, further check its span, and yield those that satisfy the check.
  4. Otherwise, simply consider every entry in the valid set, and yield those that satisfy the span check.

Corresponding code is as follows:

def get_entries(self,
                entry_type: Type[EntryType],
                range_annotation: Optional[Annotation] = None,
                components: Optional[Union[str, List[str]]] = None
                ) -> Iterable[EntryType]:
    """
    Get ``entry_type`` entries from the span of ``range_annotation`` in a
    DataPack.

    Args:
        entry_type (type): The type of entries requested.
        range_annotation (Annotation, optional): The range of entries
            requested. If `None`, will return valid entries in the range of
            the whole data_pack.
        components (str or list, optional): The component generating the
            entries requested. If `None`, will return valid entries
            generated by any component.
    """
    range_begin = range_annotation.span.begin if range_annotation else 0
    range_end = (range_annotation.span.end if range_annotation else
                 self.annotations[-1].span.end)
    # valid type
    valid_id = self.get_ids_by_type(entry_type)
    # valid component
    if components is not None:
        if isinstance(components, str):
            components = [components]
        valid_component_id: Set[int] = set()
        for component in components:
            valid_component_id |= self.get_ids_by_component(component)
        valid_id &= valid_component_id
    # valid span
    if range_annotation is not None:
        coverage_index = self.index.coverage_index(type(range_annotation),
                                                   entry_type)
        if coverage_index is not None:
            valid_id &= coverage_index[range_annotation.tid]

    if issubclass(entry_type, Annotation):
        begin_index = self.annotations.bisect(
            Annotation(self, range_begin, range_begin))
        end_index = self.annotations.bisect(
            Annotation(self, range_end, range_end))
        for annotation in self.annotations[begin_index:end_index]:
            if annotation.tid not in valid_id:
                continue
            if (range_annotation is None or
                    self.index.in_span(annotation, range_annotation.span)):
                yield annotation
    elif issubclass(entry_type, (Link, Group)):
        for entry_id in valid_id:
            entry: EntryType = self.get_entry(entry_id)  # type: ignore
            if (range_annotation is None or
                    self.index.in_span(entry, range_annotation.span)):
                yield entry

A couple of performance optimizations:

  1. It is not necessary to create new sets for entry_type each time. The sets for the entry_type and component constraints will usually be very large (containing 30% of all IDs, or even more), and creating such sets may not be faster than iterating over everything.
  2. When there is a range_annotation (the common case) and a coverage index exists, the further span checks are redundant.
  3. The optimization for Annotations can similarly be applied to Links and Groups, since they can also be represented as a single interval for the span check.

Staged process mode

The resource requirements to run a whole project are often huge, such as loading two or three PyTorch models on the GPU. When resources are unavailable, one approach is to process one engine at a time (and serialize the data to disk). We can support this staged processing mode later.

Add copyright header to our code

something like

# Copyright 2019 The Forte Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

Standardize classification with a designated Feature entry.

We have a designated Query entry used in Information Retrieval models. A similar design can unify classification problems: we can create a special Feature entry (like Query, this would most likely be a Singleton). The Feature entry would be referenced by standard classifiers for training or inference.

One particular problem to consider is whether Feature should only be attached to DataPack, or whether it can also be attached to MultiPack.

Add preprocessing tracking in readers.

Use case:

In a production environment, users may want to preprocess the text before doing any NLP operations. For example, one might delete all HTML tags before feeding the text in, but once that is done, they might still want to recover the original offset information. We would like to design functionality in the pipeline to record these changes, so users do not need to keep track of all the changes themselves.

Expected Input:

  1. Some pieces of noisy text (you could download a random website for this)
  2. A change operation, which can take one of the following forms:
    • A span to be replaced (e.g. [30, 35]) and a replacement text (which can be empty).
    • A function that can compute a span (e.g. a regular expression), and a replacement text.

Expected Implementation:

  1. Implement an html_reader, you can take the plain_text reader as a starting example.
  2. It should have the following interface:
    • Add two functions, replace(regex, replacement) and replace(span, replacement), to the reader, which take the above-mentioned change operation. The span version simply replaces the designated span with the replacement string; the regex version computes the span first. There are detailed design decisions to be aware of for this interface, for example what happens if the span is not valid, or if the regex matches more than one span.
      Let's first write this function as a utility, but it should be integrated into the base_reader.
  3. Once the replace function is called, it applies the change to the text (an attribute of the reader class). It also records the change in some form, which enables the pipeline to recover the original format. Write a function that returns the original span given the new, modified span; its inputs are the new span (e.g. [50, 55]) and the list of changes recorded in step 2. Let's first write this function as a utility function; it should later be integrated into the Annotation* object in the pipeline.

Additional requirements:

  1. The reader can take some config parameters.
  2. If we let it take a list of regex-replacement pairs as a parameter, we get a general reader, "regex_preprocessing_reader", that automatically applies those replacements. This reader can be extended by other readers too.
  3. We should expose these functionalities in the base_reader so that the user can call them whenever they want.

[*] Annotation is an NLP result entry that covers a span (as opposed to other NLP result entries such as Link and Group).
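A sketch of the span-mapping utility described in step 3, assuming changes are recorded as non-overlapping (original_span, replacement_text) pairs sorted by offset, and that the query span does not fall inside a replacement (the function name and representation are illustrative):

```python
def get_original_span(new_begin, new_end, changes):
    """Map a span in the cleaned text back to the original text.

    changes: list of ((orig_begin, orig_end), replacement_text), sorted by
    offset, spans non-overlapping. Simplification: the query span is
    assumed to lie between replacements, not inside one.
    """
    delta = 0  # cleaned-text offset minus original-text offset so far
    for (orig_begin, orig_end), replacement in changes:
        change_pos_in_new = orig_begin + delta
        if change_pos_in_new + len(replacement) <= new_begin:
            # This change lies entirely before the query span: each such
            # change shifts later offsets by (len(replacement) - span length).
            delta += len(replacement) - (orig_end - orig_begin)
        else:
            break
    return new_begin - delta, new_end - delta

# Example: "ab<b>cd" with the tag at [2, 5] deleted becomes "abcd";
# the cleaned span [2, 4] ("cd") maps back to [5, 7] in the original.
mapped = get_original_span(2, 4, [((2, 5), "")])
```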

Moving models to a separate directory and standardizing the use of config object

The current NER and SRL processors (including the NER trainer) are not efficiently designed. In particular, in the initialize method we create a model and use the config object in a non-standard way; that is, we do not follow any pattern for accessing the config object. Also, creating the model inside initialize tightly couples the processor with that model, so using a different model requires tweaking initialize. This defeats the purpose of a reusable processor. This issue is created to address the following:

  1. Create a directory of models, e.g., BiRecurrentConvCRF, LabeledSpanGraphNetwork, etc. The processors should import the models defined there based on a value in the configs passed to the initialize method.

  2. Standardize the way the config object is used. Each processor currently uses the config object in its own way, which leads to a lot of repetitiveness and unnecessary complication. To address this, we need a set of standard keys for the most common configurations, such as model_path, resource_dir, etc. This does not restrict the user to this pattern, but introduces a principled way to use the config object.
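To illustrate both points, here is a sketch of config-driven model loading with standard keys (STANDARD_KEYS and build_model_class are hypothetical names, not existing Forte APIs):

```python
import importlib

# Standard keys every processor's config would be expected to honor.
STANDARD_KEYS = {"model_class": None, "model_path": None, "resource_dir": None}

def build_model_class(config):
    """Resolve the model class named in the config, instead of hard-coding
    a model inside the processor's initialize method."""
    merged = {**STANDARD_KEYS, **config}
    module_name, class_name = merged["model_class"].rsplit(".", 1)
    return getattr(importlib.import_module(module_name), class_name)

# Swapping models then only requires changing the config value, e.g.
# {"model_class": "forte.models.BiRecurrentConvCRF", ...}; here we use a
# stdlib class purely so the sketch is runnable.
cls = build_model_class({"model_class": "collections.OrderedDict"})
```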

DataPack.get_data usage issue

The get_data() method can get links from the pack; however, the behavior is not very intuitive:
if the request does not contain the ParentType and ChildType of the requested Link, no links are returned. Links are only returned when both the ParentType and ChildType are also requested. Possible enhancements:

  1. Clear documentation of such behavior.
  2. Warn when the ParentType or ChildType of the link is not specified.
  3. Automatically expand the Annotation request set with the ParentType and ChildType.

Helper function for building text and track offset

One common task for almost all readers is keeping track of the text content and offsets. This is tedious and error-prone, so it would be helpful if Forte could assist with it. The helper can be designed to work well with the text_replacement_func.
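Such a helper might look roughly like this (TextBuilder is a hypothetical name; a real design would also hook into text_replacement_func):

```python
class TextBuilder:
    """Sketch: accumulates pack text and reports each piece's span."""

    def __init__(self):
        self._parts = []
        self._offset = 0

    def add(self, piece, sep=" "):
        """Append a piece of text plus a separator; return the piece's span."""
        begin = self._offset
        self._parts.append(piece)
        self._offset += len(piece)
        end = self._offset
        self._parts.append(sep)
        self._offset += len(sep)
        return begin, end

    def text(self):
        return "".join(self._parts)

builder = TextBuilder()
span1 = builder.add("Hello")
span2 = builder.add("world")
```

A reader can create annotations directly from the returned spans instead of doing offset arithmetic by hand.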

wiki_parser example is not executable

Run

python wiki_dump_parse.py

Have

Traceback (most recent call last):
  File "wiki_dump_parse.py", line 16, in <module>
    from forte.data.datasets.wikipedia.dump_reader import WikiDumpReader
  File "/Users/pengzhi.gao/Desktop/my_git/forte/forte/data/datasets/wikipedia/dump_reader.py", line 17, in <module>
    from mwlinks.libs.common import Span
ModuleNotFoundError: No module named 'mwlinks'

Is mwlinks a Python package? I cannot find it on PyPI.

Create tutorial for the system.

There are a few important tutorials, which should be written as notebooks. Let's do them in this order:

  1. Construct pipeline (with code and with config, use existing processor as examples)
  2. Write a reader
  3. Write a pack processor
  4. Understand ontology (using ontology generator)
  5. Write a batch pack processor (batch processor and batcher)
  6. Write a trainer
  7. Build a training pipeline (preprocess, resources and executor)
  8. Model reuse
  9. Construct pipelines with Multipacks (multipack processor, selector)
  10. Understand Information Retrieval pipeline

Two design enhancement for parse_pack

It should take one more parameter: DataPack

The current parse_pack (https://github.com/hunterhector/forte/blob/master/forte/data/readers/base_reader.py#L119) has the following signature:

def parse_pack(self, collection: Any) -> PackType:

It is supposed to create a DataPack or MultiPack. The implementation will always construct a DataPack() first. The pipeline could, in fact, create the DataPack first and let this function populate the information.

This fix is easy, but it changes the interface, so we should hold off for now.

It should hide the replace_fun functional parameter

We can do this by adding a set_text function in the reader, which takes a parameter (text) and automatically passes the replace_func to the set_text function of the DataPack. This won't affect the interface, so it can be done now.

Enrich reader test cases

The reader test cases currently only compare the text string being read. This is too weak to prove the correctness of a reader; the tests should also cover the annotations that are read.

Unified and customizable CoNLL-like readers

The current ConllU and Ontonotes readers both handle the same general format, the CoNLL-like format. There are many variations of the CoNLL format, such as different fields in different columns. The current readers hard-code the fields and the file extensions, but a single reader should be able to handle these variations, given configurable file extensions and field names.
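As a sketch of the configurable-columns idea (parse_conll_line is a hypothetical helper, not an existing Forte function), the same parsing code can serve any column layout passed in as configuration:

```python
def parse_conll_line(line, columns):
    """Parse one CoNLL-style line using a configurable column layout."""
    fields = line.split()  # CoNLL columns are whitespace/tab separated
    return dict(zip(columns, fields))

# The same function handles CoNLL-U or OntoNotes-style rows by swapping
# the column configuration, not the reader code.
row = parse_conll_line("1\tThe\tDT", ["id", "word", "pos"])
```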

Wrap StanfordNLP into the pipeline.

This is a practice to help you get familiar with the pipeline, especially how the processors work.

Try to wrap StanfordNLP into the pipeline. Note: not Stanford CoreNLP, but StanfordNLP.

https://stanfordnlp.github.io/stanfordnlp/
https://github.com/stanfordnlp/stanfordnlp

The end product of this exercise should be a running pipeline with tokenization, POS tagging, and parsing provided by them. The pipeline should take a piece of input and produce an output file in JSON format.

You might encounter a set of tasks:

  1. Design an ontology matching their data types: https://stanfordnlp.github.io/stanfordnlp/data_objects.html
  2. Be able to write config files in our style and parse them into their config:
    https://stanfordnlp.github.io/stanfordnlp/pipeline.html
  3. Be able to get the actual annotation spans from their data.
  4. Learn to correctly configure the language (en, de, fr).

Additionally, try to find out whether their pipeline can run in batches to improve processing speed; this is optional.

Clean up the doc config

Currently, the Travis docs job, .readthedocs.yml and docs/requirements.txt are out of sync. This needs to be fixed.
