asyml / forte
Forte is a flexible and powerful ML workflow builder. This is part of the CASL project: http://casl-project.ai/
License: Apache License 2.0
The current pipeline has an issue when interleaving batch processing and pack processing.
Current behavior: The batch processor does not produce any packs while a pack is not fully processed, and the next pack processor breaks when it receives no pack.
Expected behavior: The next processor should always wait for the previous processor to yield and should not break.
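A minimal sketch of the expected behavior (the function and names here are hypothetical, not Forte's API): the downstream step skips iterations where the upstream batch processor has not released a pack yet, instead of breaking.

```python
def pack_step(upstream, process):
    """Consume packs from an upstream batch processor that may yield None
    while it is still accumulating a batch; skip and wait instead of breaking."""
    for pack in upstream:
        if pack is None:
            continue  # no pack released yet; keep waiting for the next yield
        yield process(pack)

# The hypothetical batch processor yields None while buffering.
results = list(pack_step(iter([None, [1, 2], None, [3]]),
                         lambda p: [x * 10 for x in p]))
```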
Use dependency-related best practices - add versioning, keep requirements.txt and setup.py in sync, etc. This should be solved before the project becomes larger and more complex.
Good resource - https://blog.miguelgrinberg.com/post/the-package-dependency-blues
We would need to provide a stable download path for the models or data needed for training.
We have planned to use the Texar executor for Forte, so we have kept the BaseTrainer in the pipeline simple. Now that the executor is basically done, we can start integrating it into our pipeline.
On the other hand, currently, the trainer takes training data from DataPack. This is OK but slow if the training instances are recurring. It is more efficient to cache the training instances (pickle them) during this flow. We only need to provide an interface for the users to implement what to store.
This includes the following steps:
The docstrings of forte should be greatly improved in order to build valid documentation. We could take texar-pytorch as a reference.
Run
python multipack_pipeline_gpt2.py
and get:
Traceback (most recent call last):
File "multipack_pipeline_gpt2.py", line 8, in <module>
pl.init_from_config_path("sample_multipack_pipeline_gpt.yml")
File "/Users/pengzhi.gao/Desktop/my_git/forte/forte/base_pipeline.py", line 121, in init_from_config_path
self.init_from_config(configs)
File "/Users/pengzhi.gao/Desktop/my_git/forte/forte/multipack_pipeline.py", line 47, in init_from_config
class_args=reader_config.get("kwargs", {}),
File "/Users/pengzhi.gao/Desktop/my_git/forte/forte/multipack_pipeline.py", line 85, in create_class_with_kwargs
cls = get_class(class_name)
File "/Users/pengzhi.gao/Desktop/my_git/forte/forte/utils.py", line 70, in get_class
"Class not found in {}: {}".format(module_paths, class_name))
ValueError: Class not found in None: PlainSentenceTxtgenReader
Currently we provide MicrosoftBingTranslator as a processor in Forte core. It can instead be moved into examples/, which would demonstrate how to write processors using 3rd-party APIs. The machine translation processor can then be provided through a Texar model.
Run
python process_dataset_example.py
and get:
Traceback (most recent call last):
File "process_dataset_example.py", line 73, in <module>
srl_dir) = sys.argv[1:]
ValueError: not enough values to unpack (expected 3, got 0)
The previous understanding of the replace_operation function is wrong, as shown here:
https://github.com/hunterhector/forte/blob/master/forte/data/readers/plaintext_reader_test.py#L53
Expected behavior:
Please implement these two functions in the DataPack class.
Is it possible to have a script that can automatically download and prepare the dataset for the user? If the answer is no, can we have a more detailed description of the data preparation part? I am not very clear about how to prepare the CONLL03 data for this example.
The current serialization is done through jsonpickle. The main problem with this approach is that every variable is serialized, even control variables and processing metadata. This makes the serialization unstable whenever we modify some logic in the classes.
We should investigate a new serialization approach that only serializes and deserializes the relevant information.
It is also worthwhile to keep track of the serialization system version, in case the format changes.
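One possible direction, sketched with plain pickle state hooks (the Entry fields and version tag here are made up for illustration): serialize only the relevant data plus a format version, and rebuild control variables on load.

```python
import pickle

SERIALIZATION_VERSION = "1"  # hypothetical format version tag

class Entry:
    def __init__(self, text):
        self.text = text               # relevant data worth serializing
        self._cache = {}               # control variable: skip on serialize
        self._processing_flag = False  # processing metadata: skip as well

    def __getstate__(self):
        # Serialize only the relevant fields, plus the format version.
        return {"version": SERIALIZATION_VERSION, "text": self.text}

    def __setstate__(self, state):
        # Rebuild control variables instead of loading stale ones.
        self.text = state["text"]
        self._cache = {}
        self._processing_flag = False

entry = pickle.loads(pickle.dumps(Entry("hello")))
```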
Now the system has a pre-defined Query entry for IR systems. However, the system does not yet have properly defined entries to hold Query return values.
NLP data structures themselves are often interesting. It is often required to traverse a tree or follow a path in a graph in NLP. For example, one can follow the dependency tree to find the headword, or follow a network path (following specific relation links). This can be supported by Forte in a natural way, and we can at the same time attach relevant features to the path (e.g. embeddings along the path, or relation labels along the path).
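For example, following the dependency tree to the headword while collecting relation labels along the path could look like this sketch (the toy arrays stand in for real parsed data):

```python
# Toy arrays standing in for a parsed sentence: each token stores the
# index of its head (-1 for the root) and the relation label on that link.
tokens = ["I", "saw", "the", "cat"]
heads = [1, -1, 3, 1]                     # "saw" is the root
labels = ["nsubj", "root", "det", "obj"]

def path_to_root(i):
    """Follow the dependency tree upward from token i, collecting the
    relation labels along the path (embeddings could be attached the
    same way)."""
    path = []
    while i != -1:
        path.append(labels[i])
        i = heads[i]
    return path
```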
Follow -
#96 (review)
Multipacks are collections of single packs. This is a potential waste of memory and disk space because the data packs can be shared among the multipacks. An advanced approach is to allow the multipacks to keep only references to the single packs. But this would require several re-designs:
This issue is to add readers (https://github.com/asyml/forte/tree/master/forte/data/readers) for multiple-choice QA datasets, such as RACE. The tasks involve:
Extra requirements:
Refactor the reader interface to accept data sources in all formats in addition to file paths.
If we have associated each word with an embedding, then we could support nice search functionality in the dense space. This can be done by indexing the entries with a dense search library, for example:
We need a consistent method to hook embeddings to entries. This can be done by one of the following, which can be applied in different cases:
The current trainer pickles resources along with the class structure into the model file, which will affect the transferability of the trained model file. Please try to avoid pickling the class structure there:
https://github.com/hunterhector/forte/blob/master/forte/trainer/ner_trainer.py#L13
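A sketch of the fix (the class and method names are illustrative, not the actual trainer API): dump only a plain data dictionary rather than the trainer object, so the saved file does not embed the class structure.

```python
import pickle

class NERTrainer:
    """Stand-in for the real trainer; the name and fields are made up."""
    def __init__(self):
        self.weights = {"w": [0.1, 0.2]}
        self.resources = {"vocab": ["a", "b"]}  # large resource objects

    def state_dict(self):
        # Only plain data goes into the model file, never the class
        # itself, so loading does not depend on this module's layout.
        return {"weights": self.weights}

trainer = NERTrainer()
blob = pickle.dumps(trainer.state_dict())  # portable: just a dict
restored = pickle.loads(blob)
```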
Currently, all fields in the ontology are set via the set_fields method. This should not be used externally since it allows the user to add any field to the ontology. We should instead allow the code-gen to add explicit setters for each attribute.
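What the code-gen could emit is roughly the following sketch (the attribute is hypothetical): one explicit, typed setter per declared attribute, with __slots__ blocking undeclared fields.

```python
class Token:
    """Sketch of what the code-gen could emit: one explicit, typed
    setter per declared attribute. __slots__ additionally prevents
    adding undeclared fields to the instance."""

    __slots__ = ("_pos",)

    def __init__(self):
        self._pos = None

    @property
    def pos(self):
        return self._pos

    @pos.setter
    def pos(self, value):
        if not isinstance(value, str):
            raise TypeError("pos must be a str")
        self._pos = value

token = Token()
token.pos = "NOUN"
```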
OntonotesReader has the following issues:
Breaks when encountering V spans with length greater than 1. V spans correspond to "verb" and are ignored, but the current logic for doing so is as follows:
forte/forte/data/readers/ontonotes_reader.py
Lines 236 to 241 in beae4e9
When a V span with length greater than 1 is encountered, only the opening bracket is ignored, and an exception is raised when encountering the closing bracket.
Actual time complexity is O(n^2) due to appending each word to the string text:
forte/forte/data/readers/ontonotes_reader.py
Line 147 in beae4e9
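The quadratic append can be replaced by collecting the parts in a list and joining once at the end, which is linear; a self-contained comparison:

```python
def build_quadratic(words):
    text = ""
    for word in words:
        text += word + " "   # each += copies the whole string: O(n^2)
    return text

def build_linear(words):
    parts = []
    for word in words:
        parts.append(word)   # O(1) amortized per append
        parts.append(" ")
    return "".join(parts)    # one O(n) join at the end

words = ["Lorem"] * 1000
assert build_quadratic(words) == build_linear(words)
```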
Let's implement a dependency parser for Universal Dependencies. This is a project that requires reading the data, doing training and evaluation, and writing modules, so after this exercise one should understand the pipeline pretty well.
Try to implement two dependency parsers on Universal dependency:
I would suggest using the following repository as references:
Please first try running the training in English and another language of your choice. In the first iteration, let's aim to get English to SOTA performance, and hopefully the other languages too.
Right now the _options field in a Question is stored directly as a List[Option]. This results in a nested Entry structure which can be difficult for serialization and code generation. Thus, we'll only store _options as List[int], which is a list of the tids of the Options.
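A minimal sketch of the proposed change (the tid counter and classes are simplified stand-ins for Forte's entries):

```python
import json

class Option:
    _next_tid = 0
    def __init__(self, text):
        self.tid = Option._next_tid  # unique entry id, as in Forte entries
        Option._next_tid += 1
        self.text = text

class Question:
    def __init__(self, option_entries):
        # Keep only the tids; the Option entries live elsewhere in the pack.
        self._options = [opt.tid for opt in option_entries]

options = [Option("cat"), Option("dog")]
question = Question(options)

# A flat list of ints serializes trivially, unlike nested Entry objects.
serialized = json.dumps({"_options": question._options})
```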
Currently, the Resources API provides save and load methods to serialize and de-serialize its objects. However, we can provide additional features to Resources, like the following:
Ability to register and (possibly) de-register keys to/from Resources in a pipeline. We can also accept serialize/deserialize functions during registration. One benefit of de-registration is that if a resource is no longer needed, we can de-register it and remove it from the Resources.
Note: currently we have a remove() method in Resources which will remove the key from the Resources. We need to further improve the logic of this method, e.g. invalidate the serialize and deserialize functions associated with this key. We need to invalidate all the states associated with a key during de-registration, to avoid inconsistencies arising from creating a new resource with the same name.
Provide setter APIs to set serialize and deserialize functions on already existing keys of the Resources.
Provide default serialize and de-serialize functions so that when users pass a dict to save and load, it is not mandatory for them to pass serialize/deserialize functions as the values of this dict.
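The registration idea above might be sketched like this (the method names and internal dictionaries are assumptions, not the current Resources implementation):

```python
class Resources:
    """Sketch of the proposed API: keys are registered with optional
    serialize/deserialize functions, and de-registration invalidates
    everything tied to the key."""

    def __init__(self):
        self._resources = {}
        self._serializers = {}
        self._deserializers = {}

    def register(self, key, value, serialize_fn=None, deserialize_fn=None):
        self._resources[key] = value
        if serialize_fn is not None:
            self._serializers[key] = serialize_fn
        if deserialize_fn is not None:
            self._deserializers[key] = deserialize_fn

    def deregister(self, key):
        # Drop the value AND the functions bound to the key, so a new
        # resource with the same name starts from a clean state.
        self._resources.pop(key, None)
        self._serializers.pop(key, None)
        self._deserializers.pop(key, None)

resources = Resources()
resources.register("vocab", ["a", "b"], serialize_fn=str)
resources.deregister("vocab")
```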
Run
python process_string_example.py
and get:
Traceback (most recent call last):
File "process_string_example.py", line 152, in <module>
main()
File "process_string_example.py", line 128, in main
1:3]
ValueError: not enough values to unpack (expected 2, got 0)
The current logic for .get() (which internally calls .get_entries()) can be summarized as: filter entries by entry_type, by component, and by range_annotation (if a coverage index exists). If entry_type is Annotation, then a faster code path exists: iterate only over the entries within range_annotation. This is done using binary search on the Annotation index. The corresponding code is as follows:
Lines 676 to 735 in beae4e9
A couple of performance optimizations:
Avoid creating the ID set for entry_type each time. In most cases, the type sets for the entry_type and component constraints will be very large (containing 30% of all IDs, or even more), and creating such indices may not be faster than iterating over everything.
When iterating over entries within range_annotation (the common case), if a coverage index exists, the further span checks are redundant.
With the ontology refactoring (https://github.com/hunterhector/forte/pull/27), set_ontology might not be needed as there would be dedicated ontologies. Remove it once the ontology refactoring is pushed.
Remove other useless code as well related to reading/setting ontologies - https://github.com/hunterhector/forte/blob/master/forte/pipeline.py#L72
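The Annotation fast path mentioned in the .get() notes above can be sketched with the standard bisect module (the tuples are toy spans, not Forte's index structures):

```python
import bisect

# Toy spans sorted by begin offset, standing in for the Annotation index.
annotations = [(0, 5), (6, 10), (11, 15), (16, 20), (21, 30)]
begins = [span[0] for span in annotations]

def covered(range_begin, range_end):
    """Return the annotations inside [range_begin, range_end], using
    binary search to locate the first candidate instead of scanning all."""
    start = bisect.bisect_left(begins, range_begin)
    result = []
    for begin, end in annotations[start:]:
        if begin > range_end:
            break  # spans are sorted by begin, so nothing further qualifies
        if end <= range_end:
            result.append((begin, end))
    return result
```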
The resource requirement to run a whole project is often huge, such as loading two or three PyTorch models on the GPU. When resources are unavailable, one approach is to process one engine at a time (serializing the intermediate data to disk). We can support this staged processing mode later.
PlainSentenceTxtgenReader is currently not executable. For example, here, the Sentence class requires three arguments, but only two are given.
Create a base class that constructs the pipeline so new processors/readers can be easily tested. We should also prepare common test data to be used by these tests.
something like
# Copyright 2019 The Forte Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
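A minimal sketch of such a base class (every component name here is hypothetical):

```python
import unittest

class PipelineTestBase(unittest.TestCase):
    """Subclasses declare a reader and processors plus shared test data,
    and get a constructed pipeline in setUp for free."""
    reader = None
    processors = ()

    def setUp(self):
        # Stand-in for real pipeline assembly.
        self.pipeline = {"reader": self.reader,
                         "processors": list(self.processors)}

class PlainTextReaderTest(PipelineTestBase):
    reader = "PlainTextReader"  # hypothetical component name

    def test_pipeline_built(self):
        self.assertEqual(self.pipeline["reader"], "PlainTextReader")
```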
We have a designated Query entry to be used in Information Retrieval models. Similarly, such a design can be used to unify classification problems. We can create a special Feature entry (similar to Query, this would most likely be a Singleton). The Feature entry will be referenced by standard classifiers for training or inference.
One particular problem that we need to consider is whether a Feature is only attached to a DataPack or whether it can also be attached to a MultiPack.
Use case:
In a production environment, users may want to preprocess the text before doing any NLP operations. For example, one might delete all HTML tags before feeding the text in, but once that is done, they might still want to get the original offset information. We would like to design some functionality in the pipeline to help record these changes so that users do not need to keep track of all the changes themselves.
Additional requirements:
[*] Annotation is an NLP result entry that covers a span (as opposed to other NLP result entries such as Link and Group).
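A sketch of the offset-recording idea for the HTML example (plain regex, not any Forte API): for every character that survives cleaning, keep its offset in the original text so spans can be mapped back.

```python
import re

def strip_tags(text):
    """Remove HTML tags, recording for each kept character its offset in
    the original text so annotation spans can be mapped back later."""
    kept, offsets = [], []
    pos = 0
    for match in re.finditer(r"<[^>]+>", text):
        for i in range(pos, match.start()):
            kept.append(text[i])
            offsets.append(i)
        pos = match.end()
    for i in range(pos, len(text)):
        kept.append(text[i])
        offsets.append(i)
    return "".join(kept), offsets

clean, offsets = strip_tags("<b>Hi</b> there")
# "Hi" sits at clean[0:2]; its original span is offsets[0]..offsets[1].
```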
The current NER and SRL processors (including the NER trainer) are not efficiently designed. In particular, in the initialize method, we create a model and use the config object in a non-standard way; that is, we do not follow any pattern for how to access the config object. Also, creating the model inside initialize tightly couples the processor with that model, so to use a different model we need to tweak the initialize method. This defeats the purpose of a reusable processor. This issue is created to address the following:
Create a directory of models, e.g. BiRecurrentConvCRF, LabeledSpanGraphNetwork, etc. The processors should import the models defined here based on a value in the configs passed to the initialize method.
Standardize a way to use the config object. Each processor currently uses the config object in its own way, which will lead to a lot of repetitiveness and unnecessary complications. To address this, we need to ensure a set of standard keys for the most common configurations, such as model_path, resource_dir, etc. This, however, does not restrict the user to this pattern, but introduces a principled way to use the config object.
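The standard-keys idea might look like the following sketch (the key names follow the issue text; the class itself is made up):

```python
# Hypothetical set of standard keys shared by all processors.
STANDARD_KEYS = {"model_path", "resource_dir"}

class ProcessorConfig:
    """Sketch: common settings are read through one agreed set of keys,
    while processor-specific keys remain allowed alongside them."""
    def __init__(self, configs):
        self.model_path = configs.get("model_path")
        self.resource_dir = configs.get("resource_dir")
        self.extra = {key: value for key, value in configs.items()
                      if key not in STANDARD_KEYS}

config = ProcessorConfig({"model_path": "/models/ner", "beam_size": 4})
```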
The get_data() method can retrieve links from the pack; however, the behavior is not very intuitive:
If the request does not contain the ParentType and ChildType of the requested Link, then no links are returned. Links are only returned when both the ParentType and ChildType are also requested. Possible enhancements are:
One common task for almost all readers is to keep track of the text content and offsets. This is a tedious and error-prone task, so it would be helpful if Forte could handle it. This can be designed to work well with the text_replacement_func.
Run
python wiki_dump_parse.py
and get:
Traceback (most recent call last):
File "wiki_dump_parse.py", line 16, in <module>
from forte.data.datasets.wikipedia.dump_reader import WikiDumpReader
File "/Users/pengzhi.gao/Desktop/my_git/forte/forte/data/datasets/wikipedia/dump_reader.py", line 17, in <module>
from mwlinks.libs.common import Span
ModuleNotFoundError: No module named 'mwlinks'
Is mwlinks a Python package? I cannot find it on PyPI.
Add a test coverage report of the system and we can later include it in the main README.
There are a few important tutorials which should be written as notebooks; let's do them in this order:
The current parse_pack (https://github.com/hunterhector/forte/blob/master/forte/data/readers/base_reader.py#L119) has the following signature:
def parse_pack(self, collection: Any) -> PackType:
It is supposed to create a DataPack or MultiPack. The implementation will always construct a DataPack() first. The pipeline can, in fact, create the DataPack first and let this function populate the information.
This fix is easy but it would change the interface so we should hold for now.
We can do this by adding a set_text function to the reader, which takes a parameter (text) and automatically passes the replace_func to the set_text function of the DataPack. This won't affect the interface, so it can be done now.
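A sketch of this approach with stripped-down stand-ins for the real Reader and DataPack classes (the replace_func shown is a made-up example):

```python
def collapse_spaces(text):
    # Hypothetical replace_func: collapse double spaces.
    return text.replace("  ", " ")

class DataPack:
    def __init__(self):
        self.text = None

    def set_text(self, text, replace_func=None):
        self.text = replace_func(text) if replace_func else text

class Reader:
    """The reader's set_text forwards its replace_func to the pack,
    so the parse_pack interface itself stays untouched."""
    def __init__(self, replace_func=None):
        self.replace_func = replace_func

    def set_text(self, pack, text):
        pack.set_text(text, replace_func=self.replace_func)

pack = DataPack()
Reader(collapse_spaces).set_text(pack, "Hello  world")
```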
Following the Texar convention, we need to convert the docstrings to Google format.
The reader test cases now only compare the text string being read. This is too weak to prove the correctness of a reader. The test should also cover the annotations that are read.
The current ConllU and Ontonotes readers both handle the same family of formats, the CoNLL-like formats. There are many variations of the CoNLL format, such as different fields in different columns. The current readers hard-code the fields and the file extensions. However, a single reader should be able to handle all of these, with configurable file extensions and field names.
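Such a configurable CoNLL-like parser could be sketched as follows (the column names come from configuration instead of being hard-coded; the rows are toy data):

```python
def parse_conll(lines, columns):
    """Parse CoNLL-like rows where the column layout is configurable."""
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        if not line:          # a blank line ends the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        current.append(dict(zip(columns, line.split("\t"))))
    if current:
        sentences.append(current)
    return sentences

rows = ["1\tThe\tDET", "2\tcat\tNOUN", "", "1\tHi\tINTJ"]
parsed = parse_conll(rows, columns=["id", "word", "pos"])
```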
This is a practice to help you get familiar with the pipeline, especially how the processors work.
Try to wrap StanfordNLP into the pipeline (note: not Stanford CoreNLP, but StanfordNLP):
https://stanfordnlp.github.io/stanfordnlp/
https://github.com/stanfordnlp/stanfordnlp
The end product of this exercise should be a running pipeline with tokenization, pos-tagging and parsing provided by them. The pipeline should take a piece of input and produce an output file in JSON format.
You might encounter a set of tasks:
Additionally, try to find out if their pipeline can be done in batches to improve the processing speed. But this is optional.
Currently, the Travis docs job, .readthedocs.yml, and docs/requirements.txt are out of sync. This needs to be fixed.