julianfrattini / cira
Python package for functions around the causality in requirements artifacts (CiRA) initiative.
Home Page: http://www.cira.bth.se/
License: Apache License 2.0
There are many cases in the codebase where values are compared against hard-coded strings, e.g., `if junctor == 'OR'`.
Hard-coded string values are prone to maintenance issues: it is easy to make typos or capitalization mistakes that are hard to identify. Furthermore, refactoring becomes more challenging, and it is unclear how many different types of something exist, e.g., how many different types of events need to be handled.
Therefore, these hard-coded strings should be replaced with enums.
Example implementation:

```python
from enum import Enum

class Event(str, Enum):
    CAUSE = "Cause"
    EFFECT = "Effect"

# usage
if event == Event.CAUSE:
    ...  # do something
```
To replace:
- `Cause` or `Effect`
- `OR`, `POR`, or `AND`
- `<s>` or `<pad>`
In natural language, the use of a negation may propagate from one event to a closely related second one. Consider the following sentence:
If the user does not close the popup window or press escape, the window closes automatically.
The sentence is labeled correctly by CiRA. However, CiRA does not propagate the negation in `Cause1` to `Cause2` - even though it logically should - which results in an incorrect cause-effect graph (the edge from "the user press escape" to the disjunction should be negated).
A negation seems to propagate between two adjacent events if they share the same variable.
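The observed behavior suggests a simple propagation rule. Below is a minimal sketch of that rule; the dict-based event representation is purely illustrative and not CiRA's actual data model:

```python
# Hypothetical sketch of the observed rule: a negation on one event carries
# over to the adjacent event if both share the same variable.
def propagate_negation(events):
    for prev, curr in zip(events, events[1:]):
        if prev["negated"] and prev["variable"] == curr["variable"]:
            curr["negated"] = True
    return events

# "If the user does not close the popup window or press escape, ..."
events = [
    {"variable": "the user", "condition": "close the popup window", "negated": True},
    {"variable": "the user", "condition": "press escape", "negated": False},
]
propagate_negation(events)
print(events[1]["negated"])  # the negation propagates to the second cause
```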
In #1, we added support for spread event labels, i.e., if one event (e.g., Cause1/c1) is split into two separate instances, it is still recognized as belonging together. The example below visualizes how CiRA now correctly resolves the split event label `Cause1` (short: `c1`) and creates one node, not two, in the cause-effect graph.
However, one edge case was missed: if the event label is spread but one instance still contains both the variable and the condition (i.e., not only the event label but also the sub-labels (e.g., `Variable` or `Condition`) are spread), then CiRA resolves the event incorrectly and creates two nodes. In the following example, the event label `Effect1` is spread, but the first instance contains both the `Variable` and a portion of the `Condition`, wrongfully resolving the event into two effect nodes.
The expected behavior would be that the event resolver creates only one effect node with the variable "Data transmission" and the condition "is possible."
The Pytest CI workflow uses the old `andib/cira-dev:latest` Docker image as the base for the CI container but should use the repository's version of the `cira-dev` image.
Currently, the pipeline does not properly resolve exceptive clauses. Consider the following sentence:
Unless the red button is pushed and the power fails the system continues to operate.
The negation keyword "Unless" is located outside of any event node and will therefore not be associated with them. The correct resolution of this case would associate the negation with all subsequent events of the same type (cause/effect) and the same junctor (i.e., as soon as a chain of conjunctions is followed by a disjunction, the association of the exceptive negation stops).
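The described association rule can be sketched as follows; the event and junctor representations are illustrative assumptions, not CiRA's actual data structures:

```python
# Hypothetical sketch: associate an exceptive negation ("unless") with every
# subsequent event of the same type until the conjunction chain is broken by
# a disjunction.
def apply_exceptive_negation(events, junctors):
    """junctors[i] is the junctor connecting events[i] and events[i + 1]."""
    for i, event in enumerate(events):
        event["negated"] = True
        if i < len(junctors) and junctors[i] == "OR":
            break  # the association stops at the first disjunction
    return events

# "Unless the red button is pushed and the power fails ..."
events = [
    {"condition": "the red button is pushed"},
    {"condition": "the power fails"},
]
apply_exceptive_negation(events, ["AND"])
print([e["negated"] for e in events])  # both conjoined causes are negated
```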
Under the same-origin policy, browsers block requests from different origins by default. This implies that a web application that is supposed to serve as a UI on top of the CiRA pipeline will not be allowed to send requests to the running API, since they run on different ports.
Once the API is implemented (#16), create a Docker container that wraps and exposes the API. The generated image shall be published on Dockerhub such that CiRA can be integrated into a docker-compose structure by simply referencing that image and using the CiRA functionality through the API.
Steps:
The CiRA library can be made more usable if the functionality is accessible through an API. Utilizing the FastAPI Python library, build a simple service that handles REST requests and exposes the CiRA functionality. Additionally, package the service together with its backend in a Docker image and publish it for simple reuse in another container cluster.
Currently, both the manual installation of CiRA and the pre-built image rely on two separate Zenodo uploads (one containing the classification model and one the labeling model), where the second upload also contains several alternative models. While these models support the proper replication of the associated application, they inflate the download to around 11 GB.
In order to minimize the size of the CiRA work package, conduct the following steps:
Currently, events are generated from labels purely using the granularity of those labels, i.e., whether a label is associated with (a property of) an event or not. Consider the following sentence fragment:
Where it is impractical to substantially eliminate or reduce the hazard [...]
The current resolution algorithm considers full labels and hence constructs the two events [it].(is impractical to substantially eliminate) and [it].(reduce the hazard), while they should actually look as follows:
| Event | Resolution in v1.0.0 | NLP-improved resolution |
|---|---|---|
| 1 | [it].(is impractical to substantially eliminate) | [it].(is impractical to substantially eliminate the hazard) |
| 2 | [it].(reduce the hazard) | [it].(is impractical to substantially reduce the hazard) |
This can be achieved by analyzing the syntax tree of the sentence and resolving events with consideration of their grammatical dependencies. This however also introduces the need for resolving potential ambiguities (like coordination ambiguities in the aforementioned fragment: does "substantially" refer only to "eliminate" or also to "reduce"?).
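The core idea - distributing material shared by coordinated verbs over each conjunct - can be sketched without a real parser. The hand-coded "dependency" inputs below stand in for what a syntax tree analysis would provide:

```python
# Hypothetical sketch: build one full predicate per coordinated verb by
# distributing the material shared to the left and right of the coordination.
def expand_conjuncts(head, conjuncts, shared_left, shared_right):
    predicates = []
    for verb in [head] + conjuncts:
        predicates.append(" ".join(shared_left + [verb] + shared_right))
    return predicates

# "is impractical to substantially eliminate or reduce the hazard"
preds = expand_conjuncts(
    head="eliminate",
    conjuncts=["reduce"],
    shared_left=["is", "impractical", "to", "substantially"],
    shared_right=["the", "hazard"],
)
print(preds)
# → ['is impractical to substantially eliminate the hazard',
#    'is impractical to substantially reduce the hazard']
```

Note that whether "substantially" belongs to `shared_left` at all is exactly the coordination ambiguity mentioned above; a parser provides the attachment, but the ambiguity still has to be resolved.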
To make contributions to the CiRA system easier, the requirements, which - so far - are only implicitly known by the core contributors, should be made explicit.
This project should have version information deployed and communicate different states of CiRA.
My proposal for versioning is as follows:
- cira application: `setup.py` or something similar could be used to declare the version. The version could be exposed via `/api/health` or `/api/version`.
- cira application Docker image: tagged with `latest` and the semantic version of the application, e.g., `1.3.2`. The image is built `FROM cira-dev:latest` when created, e.g., `cira:1.3.2`.
- cira-dev Docker image: `latest` tag without any further versioning information. Previous states can be pulled by using the git hash, e.g., `cira-dev@sha256:0276f14...`

The current middleware of the container only allows requests coming from the same localhost. This has been shown to be too restrictive.
According to the fetch standard, an HTTP-compliant GET request should not contain a body. However, the API in `app.py` contains GET endpoints that require a body.
While some clients (e.g., Postman) allow deviating from the fetch standard, other clients (e.g., the fetch API in JavaScript) will produce an error like: `TypeError: Window.fetch: HEAD or GET Request cannot have a body.`
This makes the CiRA functionality unusable as one container within a composition, e.g., as the backend to a web interface.
Consequently, it is suggested to change the HTTP method of all API endpoints that require a body from GET to POST.
When running Pytest, it throws the following error:

```
ERROR: usage: pytest [options] [file_or_dir] [file_or_dir] [...]
pytest: error: unrecognized arguments: --cov-report --cov=src
  inifile: /Users/.../cira/pytest.ini
  rootdir: /Users/.../cira
```

The arguments are no longer supported.
@JulianFrattini What output regarding the code coverage report is desired?
Pytest version: 7.1.0
Starting the CiRA app via `python app.py` produces the following error message:

```
File "C:\Users\juf\Workspace\BTH\NLP_RE\cira\app.py", line 9, in <module>
  from src.api.service import CiRAService, CiRAServiceImpl
File "C:\Users\juf\Workspace\BTH\NLP_RE\cira\src\api\service.py", line 3, in <module>
  from src.cira import CiRAConverter
File "C:\Users\juf\Workspace\BTH\NLP_RE\cira\src\cira.py", line 9, in <module>
  from src.converters.labelstograph.graphconverter import GraphConverter
File "C:\Users\juf\Workspace\BTH\NLP_RE\cira\src\converters\labelstograph\graphconverter.py", line 2, in <module>
  from src.converters.labelstograph.eventconnector import connect_events
File "C:\Users\juf\Workspace\BTH\NLP_RE\cira\src\converters\labelstograph\eventconnector.py", line 4, in <module>
  import util.constants as consts
ModuleNotFoundError: No module named 'util'
```

The file eventconnector.py attempts to import the constants from `util.constants` instead of `src.util.constants`, so the import cannot be resolved.
The development of the API (#16) included methods for serialization and deserialization of complex `dataclass` objects with often cyclic dependencies to simple dictionaries, which are better suited for transmission. These methods shall be used in the `loader.py` module, which also requires restructuring the static sentence files.
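The tricky part of such serialization is the cyclic dependencies, since a naive recursive conversion would loop forever. A minimal sketch of one way to break cycles (the `Node` class, the `$ref` marker, and the use of `name` as a reference key are illustrative assumptions, not CiRA's actual scheme):

```python
from dataclasses import dataclass, field, fields, is_dataclass

@dataclass
class Node:
    name: str
    neighbors: list = field(default_factory=list)

def to_dict(obj, seen=None):
    """Serialize a dataclass to a plain dict, replacing already-visited
    objects with a reference marker to break cycles."""
    if seen is None:
        seen = {}
    if is_dataclass(obj):
        if id(obj) in seen:
            return {"$ref": seen[id(obj)]}
        seen[id(obj)] = obj.name  # assumption: 'name' is a unique reference key
        return {f.name: to_dict(getattr(obj, f.name), seen) for f in fields(obj)}
    if isinstance(obj, list):
        return [to_dict(item, seen) for item in obj]
    return obj

a = Node("a")
b = Node("b")
a.neighbors.append(b)
b.neighbors.append(a)  # cyclic dependency
print(to_dict(a))
# → {'name': 'a', 'neighbors': [{'name': 'b', 'neighbors': [{'$ref': 'a'}]}]}
```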
Installing the requirements as they are currently specified results in the following runtime error (traceback abbreviated):

```
ValueError Traceback (most recent call last)
Input In [2], in <cell line: 4>()
      1 import os, dotenv
      2 dotenv.load_dotenv()
----> 4 from src.cira import CiRAConverter

File D:\gitlab_repos\CiRA\cira-main\src\cira.py:8, in <module>
----> 8 from src.converters.sentencetolabels.labeler import Labeler

File D:\gitlab_repos\CiRA\cira-main\src\converters\sentencetolabels\labeler.py:1, in <module>
----> 1 from src.converters.sentencetolabels.model import MultiLabelRoBERTaCustomModel

File D:\gitlab_repos\CiRA\cira-main\src\converters\sentencetolabels\model.py:6, in <module>
----> 6 import pytorch_lightning as pl

File D:\gitlab_repos\CiRA\cira-main\venv_python39\lib\site-packages\pytorch_lightning\__init__.py:20, in <module>
---> 20 from pytorch_lightning import metrics  # noqa: E402

(... import chain through pytorch_lightning.metrics, torchmetrics, and torchmetrics.utilities ...)

File D:\gitlab_repos\CiRA\cira-main\venv_python39\lib\site-packages\torchmetrics\utilities\imports.py:119, in <module>
--> 119 _TRANSFORMERS_AVAILABLE: bool = _package_available("transformers")

File D:\gitlab_repos\CiRA\cira-main\venv_python39\lib\site-packages\torchmetrics\utilities\imports.py:36, in _package_available(package_name)
---> 36     return find_spec(package_name) is not None

File ~\AppData\Local\Programs\Python\Python39\lib\importlib\util.py:114, in find_spec(name, package)
--> 114     raise ValueError('{}.__spec__ is None'.format(name))

ValueError: transformers.__spec__ is None
```
Currently, if a `.env` file exists to define the location of the models and the code is run locally once after the Docker container has been started, the environment variables within the container are seemingly overwritten. Within the container, the variables are non-recoverable. To ensure that the code can be used in parallel on a local machine as well as in the container, the environment variables need to be handled properly.
Potential solutions include loading the `.env` file in a way that does not overwrite the local environment variables.

The usage of sentences containing brackets causes the code to break. For example, the sentence "Before access permissions are deactivated, I receive information as to which app functions will no longer work (in full)." from the Corona-Warn-App causes an internal server error at the following position:
```
...
File ".\cira\src\converters\sentencetolabels\labelingconverter.py", line 197, in position_of_next_token
  position_of_token = re.search(token, sentence[cursor_pos:]).start()
...
```
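The crash occurs because `re.search` interprets the token as a regular expression, so an unbalanced `(` raises `re.error`. A sketch of a possible fix (the function signature is simplified from the original):

```python
import re

def position_of_next_token(token, sentence, cursor_pos):
    # re.escape prevents characters like "(" in the token from being
    # interpreted as regex metacharacters, which is what breaks the original.
    match = re.search(re.escape(token), sentence[cursor_pos:])
    return cursor_pos + match.start()

sentence = "... will no longer work (in full)."
print(position_of_next_token("(in", sentence, 0))
```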
`Dockerfile.dev` still uses `pip3 install -r requirements.txt` instead of `pip3 install -e .`
Starting up the API container for the first time always triggers a download that is most likely caused by one of the involved libraries (possibly parts of the BERT model). Because this download only happens once, it does not slow down the continuous usage of that container significantly. But when frequently rebuilding the container (e.g., when running GitHub Actions for testing), this becomes an issue. Hence, identify how to move this download from the API container into the dev container.
Currently, a simplifying assumption of the pipeline is that every event (cause1, cause2, etc.) is associated with exactly one event label. This is, however, not always the case. Consider the following sentence:
Users which are older than 18 years, are allowed to drive.
The cause-effect graph is simply `[users].(are older than 18 years) --> [users].(are allowed to drive)`, but the "which" will not be labeled (as it does not belong semantically to the causal relation). Hence, the event label Cause1 will be split (one part spanning characters 0-5 and one 12-35). There needs to be a level of abstraction above event labels, conveniently called "events", which can be associated with multiple event labels.
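The proposed abstraction can be sketched as a simple grouping step; the dict-based label representation is illustrative, not CiRA's actual model:

```python
# Hypothetical sketch: group several label instances (character spans) that
# carry the same event name into one abstract "event".
def group_labels(labels):
    events = {}
    for label in labels:
        events.setdefault(label["name"], []).append((label["begin"], label["end"]))
    return events

# "Users which are older than 18 years, ..." - Cause1 is split by "which"
labels = [
    {"name": "Cause1", "begin": 0, "end": 5},
    {"name": "Cause1", "begin": 12, "end": 35},
    {"name": "Effect1", "begin": 37, "end": 57},
]
print(group_labels(labels))
# → {'Cause1': [(0, 5), (12, 35)], 'Effect1': [(37, 57)]}
```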
Please add a screenshot or a figure that represents the CiRA pipeline in the "Summary of Artifact" section. This would improve the first impression for people who discover this repository.
The labelrecovery.py script creates sub-labels with the prefix `AEX`, which represents exceptive clauses (e.g., "Unless a button is pressed ...").
There is only one more use of the `AEX` label in a corresponding test case, but apart from that, it remains unused. The necessity of the label has to be investigated: if it is useful, add it to the list of constants; if not, remove it entirely.
The `app.py` creates a `CiRAServiceImp` in the `setup_cira()` method, but the `use_GPU` attribute is never set and hence defaults to `False`. Consequently, when using the CiRA Docker container, it is impossible to configure the usage of the GPU "from the outside." This could be made configurable through an environment variable of the container.
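A possible sketch of such a configuration hook follows; the variable name `CIRA_USE_GPU` is an assumption for illustration, not an existing CiRA setting:

```python
import os

# Hypothetical sketch: derive the use_GPU flag from a container environment
# variable so it can be set "from the outside", e.g., via docker run -e.
def gpu_enabled() -> bool:
    return os.environ.get("CIRA_USE_GPU", "false").lower() in ("1", "true", "yes")

os.environ["CIRA_USE_GPU"] = "true"
print(gpu_enabled())
# service = CiRAServiceImpl(use_GPU=gpu_enabled())  # hypothetical wiring
```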
The following test cases do not work in my local environment:
cira/test/test_model_locator.py
Lines 17 to 30 in 6b53fa2
They seem to work in the GitHub environment, but it would be great if they did not break locally. Where does the `mocker` parameter come from? In my environment, pytest tries to resolve it as a fixture it cannot find. Are additional libraries necessary?
Additionally, I am not entirely sure about the implications of that test: if `os.path.isfile` is mocked to return `False`, then I would actually expect the `__load_model_env` method in the model_locator.py file to raise a `NameError`, as both conditions in which the existence of a file is checked should evaluate to `False`.
In the `get_junctors` method of eventconnector.py, the algorithm tries to determine the first cause event node by looking for the node associated with the event label that has no predecessor and, hence, is the first.
cira/src/converters/labelstograph/eventconnector.py
Lines 48 to 49 in 6b53fa2
This usually works, but if the first event label of a sentence is an effect and not a cause, the algorithm will fail, as the first cause event node will have a predecessor (the effect event node). The list comprehension will yield an empty list, and trying to access its first element (`[0]`) will break the system. The following sentence illustrates the example:
This is not an uncommon case. Therefore, determining the first cause event node in a sentence should not rely on the absence of a predecessor but should rather utilize the order of the labels through the `begin` attribute, which denotes the starting index of a label: the first cause event node will always be associated with the cause event label that has the lowest `begin` index, regardless of whether it has an effect predecessor or not.
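The proposed selection rule can be sketched in a few lines; the dict-based event representation is illustrative, not CiRA's actual data model:

```python
# Hypothetical sketch: select the first cause event by the lowest label
# 'begin' index instead of by the absence of a predecessor.
def first_cause(events):
    causes = [e for e in events if e["type"] == "cause"]
    return min(causes, key=lambda e: e["begin"])

# an effect label precedes the cause labels in the sentence
events = [
    {"name": "E1", "type": "effect", "begin": 0},
    {"name": "C1", "type": "cause", "begin": 20},
    {"name": "C2", "type": "cause", "begin": 45},
]
print(first_cause(events)["name"])  # → C1, despite the preceding effect
```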
Currently, a simplifying assumption of the pipeline is that junctors ("and" & "or") follow the same precedence rules as in formal logic, i.e., conjunctions bind more strongly than disjunctions. This is, however, not always the case. Consider the following sentence:
When the applicant can present an identity card and either is not underage or can present a parent's permission, then an account can be opened with the bank.
The "either ... or" signals that in this case, the disjunction (or) binds more strongly than the conjunction (and). This feature is currently not implemented and needs to be added. As an underlying rule, the occurrence of both a conjunction and a disjunction between two events, followed by a disjunction between the second and a third event, needs to be resolved with overruled junctor precedence.
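The effect of the overruled precedence can be illustrated with three boolean events; this toy evaluation is only meant to show how the two readings diverge, not how CiRA builds its graph:

```python
# Default precedence: "A and B or C" reads as "(A and B) or C"; a detected
# "either ... or" overrules it to "A and (B or C)".
def evaluate(a, b, c, either=False):
    if either:
        return a and (b or c)
    return (a and b) or c

# applicant presents an ID card (a=False), is not underage (b=False),
# can present a parent's permission (c=True)
print(evaluate(False, False, True, either=False))  # → True (wrong reading)
print(evaluate(False, False, True, either=True))   # → False (intended reading)
```

Without an ID card, the intended reading denies the account, while the default precedence wrongly grants it.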