anovos / anovos Goto Github PK

Anovos - An Open Source Library for Scalable feature engineering Using Apache-Spark

License: Other

Makefile 0.01% Shell 0.02% Python 4.66% CSS 0.65% Dockerfile 0.01% Jupyter Notebook 78.39% HTML 16.27%

feature-engineering machine-learning data-science transformation visualization bigdata scale pyspark python feature-recommendation

anovos's Introduction

Anovos

Anovos is an open source library for feature engineering at scale. Built by data scientists & ML Engineers for the data science community, it provides all the capabilities required for data ingestion, data analysis, data drift & data stability analysis, feature recommendation and feature composition. In addition, it automatically produces easily interpretable professional data reports that help users understand the nature of data at first sight and further enable data scientists to identify and engineer features.

Leveraging the power of Apache Spark behind the scenes, Anovos improves data scientists' productivity and helps them build more resilient and better performing models.

Quick Start

The easiest way to try out Anovos and explore its capabilities is through the provided examples that you can run via Docker without the need to install anything on your local machine.

# Launch an anovos-examples Docker container
sudo docker run -p 8888:8888 anovos/anovos-examples-3.2.2:latest

To reach the Jupyter environment, open the link to http://127.0.0.1:8888/?token... generated by the Jupyter NotebookApp.

If you're not familiar with Anovos or feature engineering, the Getting Started with Anovos guide is a good place to begin your journey. You can find it in the /guides folder within the Jupyter environment.

For more detailed instructions on how to install Docker and how to troubleshoot potential issues, see the examples README.

Using Anovos

Requirements

To use Anovos, you need compatible versions of Apache Spark, Java and Python.

Currently, we officially support the following combinations:

Apache Spark 2.4.x on Java 8 with Python 3.7.x
Apache Spark 3.1.x on Java 11 with Python 3.9.x
Apache Spark 3.2.x on Java 11 with Python 3.10.x

To see what we're currently testing, see this configuration.

Installation

You can install the latest release of Anovos directly through PyPI:

pip install anovos

Documentation

We provide a comprehensive documentation at docs.anovos.ai that includes user guides as well as a detailed API documentation.

For usage examples, see the provided interactive guides and Jupyter notebooks as well as the Spark demo.

Overview

Roadmap

Anovos has designed for to support any feature engineering tasks in a scalable form. To see what's planned for the upcoming releases, see our roadmap.

Development Version

To try out the latest additions to Anovos, you can install it directly from GitHub:

pip install git+https://github.com/anovos/anovos.git

Please note that this version is frequently updated and might not be fully compatible with the documentation available at docs.anovos.ai.

Contribute

We're always happy to discuss and accept improvements to Anovos. To get started, please refer to our Contributing to Anovos page in the documentation.

To start coding, clone this repository, install both the regular and development requirements, and set up the pre-commit hooks:

git clone https://github.com/anovos/anovos.git
cd anovos/
pip install -r requirements.txt
pip install -r dev_requirements.txt
pre-commit install

anovos's People

Contributors

Stargazers

Watchers

anovos's Issues

main script should be included in Anovos package

Currently, main.py is not included in the wheel file.

Hence, users have to navigate to the GitHub repository and retrieve the correct version of main.py for the version of Anovos they have installed if they want to run Anovos workloads from a configuration file. (It's not sufficient to get the latest version of main.py, as the version the user has installed might be incompatible with it.)

Instead, users should be able to just install Anovos from a wheel (e.g., pip install anovos) and run workloads, without having to download a file from GitHub or clone the GitHub repository.

Drop support for Python 3.7

More and more data science packages start to drop support for Python 3.7.

For example, scikit-learn and mlflow no longer support it (PRs #335 and #355)

Provide the Getting-Started-Container through Docker Hub and/or GHCR

I suggest to use GitHub actions push a new version every time it changes.

Then, we can tell users to download the image instead of having to build it themselves.

Interest to contribut

This an interesting initiative and i am looking to contribute, but few questions i have in mind and i would like to know if these are open interesting in the context of anovos:

What do you think of cross engine execution that we can consider spark, ray and dask as main players?
For data quality, Apache Griiffin have done interesting work on spark by providing facilities to implement data quality metrics such as timeliness, uniqueness, etc... This could be achieved through the usage of deequ but that can be tricker if cross engine execution is envisaged
for features engineering, what do you of tools like featuretools from alteryx?

In both cases, the project looks very promising and it will be interested to see how to contribute to anovos.

Outlier detector issues warning that skewed columns are dropped even if there are no skewed columns

The warning message in quality_checker.py#L483 should only be issued if skewed_cols is not empty.

Importing `feature_exploration` downloads almost 500MB from HuggingFace's model hub

Relevant line: https://github.com/anovos/anovos/blob/main/src/main/anovos/feature_recommender/feature_exploration.py#L6

I suggest to lazy load the model. Otherwise, simply importing the module immediately triggers a download of about 0.5 GB. That's rather surprising for users and not fair to HuggingFace.

Issues in handling 'double' datatype

Issues while running end to end with dataset having some columns having 'double' data type.

Got error in invalidEntries_detection function

Also, some of the functions like biasedness_detection showed empty table as output

`test_imputation_matrixFactorization` is flaky and regularly fails

This error appears (predominantly?) for Python 3.7 workflows but not always

outlier_detection Performance Issue

Expected Behavior

outlier_detection functions runs normally with default Anovos spark session setup, for any reasonable-size data

Current Behavior

outlier_detection function either not running at all, or run with very poorly optimization result (more than 2 hours for the test dataset, which has 394 columns)

Steps to reproduce

Link to the dataset:
https://www.kaggle.com/competitions/ieee-fraud-detection/data?select=train_transaction.csv
Sample out around 17000 records of the data, and run outlier_detection function with the sampled data

Specifications

Latest Anovos Release 0.2.2

Version: Anovos 0.2.2
Platform: MacOS, Intel Chip
Subsystem: Python 3.8.8, Spark 3.2.1, Java 8

Possible Solution

Short term solution will be making Anovos default spark session (from anovos.shared.utils import spark) customizable, with input config. Long term solution will be to optimize the runtime and space for outlier_detection function, for it to work properly with big dataset

Anovos_use_case_demo.ipynb shows error when generating report

The Jupyter notebook anovos_use_case_demo.ipynb stored here is now getting many error messages when generating Full Report.

Expected Behavior

In the notebook published on Github, the Full Report section should look like this

Current Behavior

When I run it using Anovos 1.0.0, it could successfully generate the report while charts_to_object function shows below error messages.
2022-10-04 16:15:06.762 | ERROR | anovos.data_report.report_preprocessing:edit_binRange:138 - processing failed during edit_binRange, error 'NoneType' object has no attribute 'split'

(The correlation matrix error is because anovos_basic_report function has set skip_corr_matrix=True by default. This would prevent the Full Report from generating correlation matrix plot, but the sample report here still has the plot. This might confuse the users when running the demo).

When I run it using Anovos 1.0.1, it failed to complete the anovos_basic_report function due to errors in variable clustering function.

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [9], in <cell line: 5>()
      2 report_path = output_path+"data_report/report_stats/"
      4 #Option 1 (with ID & Label Column)
----> 5 anovos_basic_report(spark, df_sample, id_col='SK_ID_CURR', label_col="TARGET", 
      6                     event_label=1, output_path=report_path, print_impact=False)

File ~/opt/anaconda3/lib/python3.9/site-packages/anovos/data_report/basic_report_generation.py:186, in anovos_basic_report(spark, idf, id_col, label_col, event_label, skip_corr_matrix, output_path, run_type, auth_key, print_impact)
    184 elif func in AA_funcs:
    185     extra_args = stats_args(output_path, func.__name__)
--> 186     stats = func(spark, idf, drop_cols=id_col, **extra_args)
    187 elif label_col:
    188     if func in AT_funcs:

File ~/opt/anaconda3/lib/python3.9/site-packages/anovos/data_analyzer/association_evaluator.py:258, in variable_clustering(spark, idf, list_of_cols, drop_cols, sample_size, stats_unique, stats_mode, print_impact)
    256 idf_pd = idf_imputed.toPandas()
    257 vc = VarClusHi(idf_pd, maxeigval2=1, maxclus=None)
--> 258 vc.varclus()
    259 odf_pd = vc.rsquare
    260 odf = spark.createDataFrame(odf_pd).select(
    261     "Cluster",
    262     F.col("Variable").alias("Attribute"),
    263     F.round(F.col("RS_Ratio"), 4).alias("RS_Ratio"),
    264 )

File ~/opt/anaconda3/lib/python3.9/site-packages/varclushi/varclushi.py:215, in VarClusHi.varclus(self, speedup)
    212 self.speedup = speedup
    214 if self.speedup is True:
--> 215     return self._varclusspu()
    217 ClusInfo = collections.namedtuple('ClusInfo', ['clus','eigval1','eigval2','pc1','varprop'])
    218 c_eigvals, _, c_princomps, c_varprops = VarClusHi.pca(self.df[self.feat_list])

File ~/opt/anaconda3/lib/python3.9/site-packages/varclushi/varclushi.py:143, in VarClusHi._varclusspu(self)
    140 def _varclusspu(self):
    142     ClusInfo = collections.namedtuple('ClusInfo', ['clus', 'eigval1', 'eigval2', 'eigvecs','varprop'])
--> 143     c_eigvals, c_eigvecs, c_corrs, c_varprops = VarClusHi.correig(self.df[self.feat_list])
    145     self.corrs = c_corrs
    147     clus0 = ClusInfo(clus=self.feat_list,
    148                      eigval1=c_eigvals[0],
    149                      eigval2=c_eigvals[1],
    150                      eigvecs=c_eigvecs,
    151                      varprop=c_varprops[0]
    152                      )

File ~/opt/anaconda3/lib/python3.9/site-packages/varclushi/varclushi.py:38, in VarClusHi.correig(df, feat_list, n_pcs)
     36     varprops = [sum(eigvals)]
     37 else:
---> 38     corr = np.corrcoef(df.values.T)
     39     raw_eigvals, raw_eigvecs = np.linalg.eigh(corr)
     40     idx = np.argsort(raw_eigvals)[::-1]

File <__array_function__ internals>:5, in corrcoef(*args, **kwargs)

File ~/opt/anaconda3/lib/python3.9/site-packages/numpy/lib/function_base.py:2683, in corrcoef(x, y, rowvar, bias, ddof, dtype)
   2679 if bias is not np._NoValue or ddof is not np._NoValue:
   2680     # 2015-03-15, 1.10
   2681     warnings.warn('bias and ddof have no effect and are deprecated',
   2682                   DeprecationWarning, stacklevel=3)
-> 2683 c = cov(x, y, rowvar, dtype=dtype)
   2684 try:
   2685     d = diag(c)

File <__array_function__ internals>:5, in cov(*args, **kwargs)

File ~/opt/anaconda3/lib/python3.9/site-packages/numpy/lib/function_base.py:2518, in cov(m, y, rowvar, bias, ddof, fweights, aweights, dtype)
   2515     else:
   2516         w *= aweights
-> 2518 avg, w_sum = average(X, axis=1, weights=w, returned=True)
   2519 w_sum = w_sum[0]
   2521 # Determine the normalization

File <__array_function__ internals>:5, in average(*args, **kwargs)

File ~/opt/anaconda3/lib/python3.9/site-packages/numpy/lib/function_base.py:380, in average(a, axis, weights, returned)
    377 a = np.asanyarray(a)
    379 if weights is None:
--> 380     avg = a.mean(axis)
    381     scl = avg.dtype.type(a.size/avg.size)
    382 else:

File ~/opt/anaconda3/lib/python3.9/site-packages/numpy/core/_methods.py:181, in _mean(a, axis, dtype, out, keepdims, where)
    179 ret = umr_sum(arr, axis, dtype, out, keepdims, where=where)
    180 if isinstance(ret, mu.ndarray):
--> 181     ret = um.true_divide(
    182             ret, rcount, out=ret, casting='unsafe', subok=False)
    183     if is_float16_result and out is None:
    184         ret = arr.dtype.type(ret)
TypeError: unsupported operand type(s) for /: 'str' and 'int'

Steps to reproduce

Download the use case demo jupyter notebook here
Run the commands until Full Report section and errors will show up depending on which Anovos version we are using

Specifications

Version: Anovos 1.0.0 / Anovos 1.0.1
Platform: MacOS, M1 Chip
Subsystem: Python 3.9, Spark 3.3.0

Possible Solution

For errors using Anovos 1.0.0, the errors related to splitting NoneType objects was shown up during charts_to_object function call. This might be due to passing NoneType to edit_binRange function when generating some plots.

GitHub workflow unit test/demo not triggered for external PR

Expected Behavior:

Test suites (Unit Tests, Code quality) would be triggered for external contributor's PR (OSS style). For example:
PR #220
PR #219
PR #217

Current Behavior:

None of these test suites are triggered for those PRs

Potential Solution:

Look into .github and modify/make change to workflow trigger conditions

Dockerfile_spark_demo creates unnecessarily many layers

Suggestion: Following Docker best practices, the number of layers should be kept minimal.

This can be achieved by removing duplicate calls (e.g., to apt-get update) and by grouping related commands, e.g., as follows:

FROM ubuntu:18.04

RUN add-apt-repository ppa:deadsnakes/ppa && apt-get update \
    && apt-get install -y openjdk-8-jdk git wget python3-pip python3-dev python3.7 python3-distutils python3-setuptools software-properties-common

RUN update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.6 1 \
    && update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.7 2

RUN wget "https://downloads.apache.org/spark/spark-2.4.8/spark-2.4.8-bin-hadoop2.7.tgz" \
    && tar -xzvf spark-2.4.8-bin-hadoop2.7.tgz \
    && rm spark-2.4.8-bin-hadoop2.7.tgz

RUN ln -s /usr/bin/python3 /usr/bin/python \
    && python3.7 -m pip install pip --upgrade pip \
    && bin/datapane_install ./datapane_install
    && pip3 install "./datapane_install/datapane-0.12.0.tar.gz" \
    && cp ./datapane_install/local-report-base* /usr/local/lib/python3.7/dist-packages/datapane/resources/local_report

ADD requirements.txt .
RUN python3 -m pip install -r requirements.txt

ADD config/log4j.properties .
ADD jars/histogrammar_2.11-1.0.20.jar .
ADD jars/histogrammar-sparksql_2.11-1.0.20.jar .
ADD dist/anovos.zip .
ADD dist/anovos.tar.gz .
ADD dist/main.py .
ADD config/configs.yaml .
ADD data/income_dataset ./data/income_dataset
ADD data/metric_dictionary.csv ./data/metric_dictionary.csv
ADD bin/spark-submit_docker.sh .

CMD ["./spark-submit_docker.sh"]

`anovos.feature_recommender.feature_recommender.init_input_fer()` requires internet access

Expected Behavior

I can import and use Anovos locally. A specific version of Anovos always behaves the same.

Current Behavior

init_input_fer() downloads a CSV file from GitHub every time it is called:

def init_input_fer():
    """

    Returns
    -------
    Loading the Feature Explorer and Recommender (FER) Input DataFrame (FER corpus)
    """
    input_path_fer = "https://raw.githubusercontent.com/anovos/anovos/main/data/feature_recommender/flatten_fr_db.csv"
    df_input_fer = pd.read_csv(input_path_fer)
    return df_input_fer

Also, if this file is ever renamed on or removed from the current main branch, old versions of Anovos no longer work. In addition, if the file gets updated, the behavior of Anovos changes in a way that is really difficult for users to track down.

Steps to reproduce

N/A

Specifications

Version: current main state
Platform: Ubuntu Linux (doesn't matter)
Subsystem:

Possible Solution

Package the feature explorer input data with the library.

Anovos example notebooks issues

In data_transformer_transformers notebook:

Normalization last block failed to run with this error log: https://pastebin.com/vhQQKuAd. Unknown reason
imputation_sklearn failed to run with Example 1 and Example 7 with this error log: https://pastebin.com/s9T8WfF7. Reason: worker crashed (memory issue potentially)

In data_report notebook:

Plot outlier for capital-gain failed to run with this error log: https://pastebin.com/QX9KVbaM.
Reason for this is because the input df does not have capital-gain as a column (it was dropped right above)

Specifications

Latest Anovos Release 0.3.0
Version: Anovos 0.3.0
Platform: MacOS, Intel Chip
Subsystem: Python 3.8.8, Spark 3.2.1, Java 8

Version is defined in two places and doesn't match

https://github.com/anovos/anovos/blob/main/setup.py says 0.3.0

https://github.com/anovos/anovos/blob/main/src/main/anovos/version.py says 0.2.1

Phik Correlation Matrix Error then running for dataset

Expected Behavior

Correlation Matrix is expected to run successfully when running basic report

Current Behavior

Error in correlation matrix generation due to dataset being null

Steps to reproduce

Download Kaggle dataset
https://www.kaggle.com/competitions/ieee-fraud-detection/data

Run the usecase notebook
ieee-fraud-detection-anovos.html.zip

Specifications

Latest Anovos Release 0.2.2

Version: Anovos 0.2.2
Platform: MacOS
Subsystem: Python 3.9, Spark 3.2.1, Java 8

Possible Solution

Correlation matrix can be made as enable/disable by using argument and not kept as mandatory as it is now for basic report.

Doing `read_dataset` in `getting_started_with_anovos.ipynb` gives err

Expected Behavior

I'm not sure what to be expected, but at least it shouldn't returning Exception.

Current Behavior

from anovos.data_ingest.data_ingest import read_dataset

df = read_dataset(
    spark,  # Remember: The first argument of Anovos functions is always an instantiated SparkSession
    file_path='../data/income_dataset/csv',
    file_type='csv',
    file_configs={'header': 'True', 'delimiter': ',', 'inferSchema': 'True'}
)

Returns

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
/tmp/ipykernel_235/916348546.py in <module>
      5     file_path='../data/income_dataset/csv',
      6     file_type='csv',
----> 7     file_configs={'header': 'True', 'delimiter': ',', 'inferSchema': 'True'}
      8 )

/opt/conda/lib/python3.7/site-packages/anovos/data_ingest/data_ingest.py in read_dataset(spark, file_path, file_type, file_configs)
     19     :return: Dataframe
     20     """
---> 21     odf = spark.read.format(file_type).options(**file_configs).load(file_path)
     22     return odf
     23 

/usr/local/spark/python/pyspark/sql/readwriter.py in load(self, path, format, schema, **options)
    156         self.options(**options)
    157         if isinstance(path, str):
--> 158             return self._df(self._jreader.load(path))
    159         elif path is not None:
    160             if type(path) != list:

/usr/local/spark/python/lib/py4j-0.10.9.2-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1308         answer = self.gateway_client.send_command(command)
   1309         return_value = get_return_value(
-> 1310             answer, self.gateway_client, self.target_id, self.name)
   1311 
   1312         for temp_arg in temp_args:

/usr/local/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
    109     def deco(*a, **kw):
    110         try:
--> 111             return f(*a, **kw)
    112         except py4j.protocol.Py4JJavaError as e:
    113             converted = convert_exception(e.java_exception)

/usr/local/spark/python/lib/py4j-0.10.9.2-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
--> 328                     format(target_id, ".", name), value)
    329             else:
    330                 raise Py4JError(

Py4JJavaError: An error occurred while calling o65.load.
: java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: Provider org.apache.spark.sql.avro.AvroFileFormat could not be instantiated
	at java.base/java.util.ServiceLoader.fail(ServiceLoader.java:582)
	at java.base/java.util.ServiceLoader$ProviderImpl.newInstance(ServiceLoader.java:804)
	at java.base/java.util.ServiceLoader$ProviderImpl.get(ServiceLoader.java:722)
	at java.base/java.util.ServiceLoader$3.next(ServiceLoader.java:1395)
	at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:46)
	at scala.collection.Iterator.foreach(Iterator.scala:943)
	at scala.collection.Iterator.foreach$(Iterator.scala:943)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
	at scala.collection.IterableLike.foreach(IterableLike.scala:74)
	at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
	at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
	at scala.collection.TraversableLike.filterImpl(TraversableLike.scala:303)
	at scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:297)
	at scala.collection.AbstractTraversable.filterImpl(Traversable.scala:108)
	at scala.collection.TraversableLike.filter(TraversableLike.scala:395)
	at scala.collection.TraversableLike.filter$(TraversableLike.scala:395)
	at scala.collection.AbstractTraversable.filter(Traversable.scala:108)
	at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:652)
	at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSourceV2(DataSource.scala:720)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:210)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:188)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.NoClassDefFoundError: org/apache/spark/sql/execution/datasources/FileFormat$class
	at org.apache.spark.sql.avro.AvroFileFormat.<init>(AvroFileFormat.scala:44)
	at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
	at java.base/java.util.ServiceLoader$ProviderImpl.newInstance(ServiceLoader.java:780)
	... 31 more
Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.execution.datasources.FileFormat$class
	at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:476)
	at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:589)
	at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
	... 37 more

I'll put the link on shell output from installation to running (and having err) in this pastebin link: https://pastebin.com/WKwcWLnR

Steps to reproduce

Run the anovos-examples:latest docker, go to getting_started_with_anovos.ipynb file. 13th line should gives this err.

Specifications

(Everything is based on ./create_anovos_examples_docker_image.sh's docker)

Version:
- anovos 0.1
Platform:
- Python 3.7.12
- Spark 3.2.0
- Scala 2.12.15
Subsystem:
- Ubuntu Focal

Possible Solution

Seems like something to do with either how Java or Spark is being set up.

confused with the following approach data_ingest read_dataset function

Doing some reformating to resolve flake8 warning and for some reasons i feel like some code are very ambiguous, could someone explain the rational behind the following line of codes:

anovos/src/main/main.py

Lines 19 to 20 in 6020758

    
           f = getattr(data_ingest, "read_dataset") 
        
           read_args = args.get("read_dataset", None)

Slack link doesn't seem to work

Hi, the "community" link in https://anovos.ai/ takes me to a generic Slack page saying "Sign in to your workspace"

Categorical/Numerical columns not being identified correctly

Expected Behavior

Categorical columns that should be detected:
ProductCD
card1 - card6
addr1, addr2
P_emaildomain
R_emaildomain
M1 - M7
isFraud

Current Behavior

Anovos is only able to detect these as categorical columns:
card4, card6
P_emaildomain
R_emaildomain
ProductCD
M1 - M7
As we can see, there are a lot of columns being misidentified.

Steps to reproduce

Dataset link:
https://www.kaggle.com/competitions/ieee-fraud-detection/data
Take out the first 53 columns (last column is M7), and run aggregateType_segregation function on those

Specifications

Latest Anovos Release 0.2.2

Version: Anovos 0.2.2
Platform: MacOS, Intel Chip
Subsystem: Python 3.8.8, Spark 3.2.1, Java 8

Possible Solution

When detecting numerical columns, add another layer to check the cardinality of the columns. If their cardinality is low enough, they should be identified as categorical columns.

Getting started guide basic report generation is giving an exception

Expected Behavior

Should have generated the basic report inside the docker image

Current Behavior

Getting started guide - basic report generation is giving an exception inside the docker image

Steps to reproduce

https://paste-bin.xyz/38686

Specifications

Version: Docker 17.03.1-ce-rc1-mac3 (15924)
Platform: Mac OS BigSur (V11.5.2)
Subsystem:

Possible Solution

Need to investigate

Importing workflow from anovos in Databricks workspace results in ImportError

Expected Behavior

Import Statement completes w/o errors.

Current Behavior

An ImportError is thrown with the following exception:

Steps to reproduce

Set up a Databricks Workspace,
Open a Notebook in Python
execute "%pip install anovos"
execute from anovos import workflow

Specifications

Version: anovos v 1.0.1
Platform: Databricks Workspace w/ Sparak 3.2.1
Subsystem:

Possible Solution

Downgrading/Fixing markupsafe library

Unit test case changes were missed out for #23

As mentioned in #28, unit test case changes corresponding to #23 was missed out

Local installation of anovos throws error while building reverse-geocoder on MacOS

Expected Behavior

Installing anovos in a fresh virtual environment completes w/o errors.

Current Behavior

Currently the install output contains errors when building the module reverse-geocoder (see attached file).
dump.txt

Steps to reproduce

On MacOs:

Create a new virtual environment
run pip install anovos

Specifications

Version: Anovos v1.0.1
Platform: MacOS Ventura 13.0.1, Python 3.10
Subsystem:

Possible Solution

Feature Recommender Lazy Load Model not working properly

Expected Behavior

feature_recommendation function to produce proper result

Current Behavior

feature_recommendation returns an error, TypeError: 'Tensor' object is not callable

Steps to reproduce

from anovos.feature_recommender.feature_recommendation import *
import pandas as pd
df_attr_1 = pd.read_csv( 'https://raw.githubusercontent.com/anovos/anovos/main/data/feature_recommender/test_input_fr.csv' )
feature_recommendation(df_attr_1, name_column='Attribute Name', desc_column='Attribute Description')

Specifications

Version: Python 3.8
Platform:
Subsystem:

Possible Solution

Review the lazy load model class

It is not possible to import `anovos.feature_recommender.feature_recommendation` without first downloading the model

Expected Behavior

I can import anovos.feature_recommender.feature_recommendation without an Exception being raised.

(For example, when generating the API docs, see e.g. https://github.com/anovos/anovos-docs/runs/5379880799?check_suite_focus=true)

Current Behavior

>>> import anovos.feature_recommender.feature_recommendation
2022-03-01 19:16:07.916613: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2022-03-01 19:16:07.916641: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/kilian/Documents/Work/anovos/GitHub/anovos/src/main/anovos/feature_recommender/feature_recommendation.py", line 23, in <module>
    list_train_fer, df_rec_fer, list_embedding_train_fer = feature_recommendation_prep()
  File "/home/kilian/Documents/Work/anovos/GitHub/anovos/src/main/anovos/feature_recommender/featrec_init.py", line 217, in feature_recommendation_prep
    list_embedding_train_fer = model_fer.model.encode(
  File "/home/kilian/Documents/Work/anovos/GitHub/anovos/src/main/anovos/feature_recommender/featrec_init.py", line 52, in model
    raise FileNotFoundError(
FileNotFoundError: Model has not been downloaded. Please use model_download() function to download the model first

Steps to reproduce

$ cd src/main
$ python

>>> import anovos.feature_recommender.feature_recommendation

Specifications

Version: current main state
Platform: Ubuntu Linux (both locally and on GitHub Actions)
Subsystem:

Possible Solution

Completely solve #94 by not requiring the model at import time.

Unit tests cannot access test data

As became evident during work on #26, the unit tests currently do not point to the correct data path.

I suggest to move the access to test data into a fixture. In the long run, it's probably best to not include any test data in the repo but to store it on e.g. S3 and retrieve it at test time.

Data transformer unit tests sometimes fail with StackOverflowError

see e.g. https://github.com/anovos/anovos/runs/6288850865?check_suite_focus=true#step:15:2284 and https://github.com/anovos/anovos/runs/6273324296?check_suite_focus=true#step:14:3850

It seems to work fine sometimes. Not sure whether there's a pattern.

Wheel is not covered by any test

The new feature_store sub-package was not added to setup.py which leads to an import error when trying to run a workflow. This went completely unnoticed by the automated tests. There should be at least some basic tests that the wheel contains a working version of Anovos.

(The particular issue is fixed with #265.)

Spark demo takes about an hour to run

As mentioned in #26, the Spark demo takes a really long time to run, at least when executed on a single machine.

This is undesirable from two points of view:

New users that want to try out the library and get a first impression of the workflows it offers have to wait a very long time
We cannot use the Spark demo as a smoke test because it blocks all test runners (and, more generally, uses a lot of compute resources for relatively little gain)

I suggest to change the default settings for the demo such that it takes between 5 and 10 minutes (max) to run it. To showcase the benefits of distributed processing, we can add a simple flag to switch back to the current configuration.

One Hot Encoding Bug

Expected Behavior

One hot encoding to create index columns equal to the same number of distinct values.

Current Behavior

For columns without null values (like "income") One Hot Encoding adds an additional column.

Steps to reproduce

from anovos.shared.spark import *
from anovos.data_ingest.data_ingest import read_dataset, write_dataset
from pyspark.ml.feature import StringIndexer, OneHotEncoderEstimator
from pyspark.ml import Pipeline, PipelineModel

df = read_dataset(spark, ["examples/data/income_dataset/csv"],"csv", {"header" : "True", "inferSchema" : "True"})
list_of_cols = ["income"]
stages = []
index_order='frequencyDesc'
for i in list_of_cols:
stringIndexer = StringIndexer(inputCol=i, outputCol=i + '_index', stringOrderType=index_order, handleInvalid='keep')
stages += [stringIndexer]
pipeline = Pipeline(stages=stages)
pipelineModel = pipeline.fit(df)
odf_indexed = pipelineModel.transform(df)
list_of_cols_vec = []
list_of_cols_idx = []
for i in list_of_cols:
list_of_cols_vec.append(i + "_vec")
list_of_cols_idx.append(i + "_index")

print(list_of_cols_idx)
print(list_of_cols_vec)
encoder = OneHotEncoderEstimator(inputCols=list_of_cols_idx, outputCols=list_of_cols_vec,
handleInvalid='keep')
odf_encoded = encoder.fit(odf_indexed).transform(odf_indexed)
odf_encoded.show(5,False)

Specifications

Version: Anovos - Latest; Python 3.7; Spark 2.4.8
Platform: Any
Subsystem:

Possible Solution

"handleInvalid" flag can be changed as "error" maybe (but to be tested for other scenarios)
or
"length of array" to be used when calculating distinct values, so even on new columns being created by OneHotEncoding from pyspark ml, anovos is able to handle

	f = getattr(data_ingest, "read_dataset")
	read_args = args.get("read_dataset", None)

anovos / anovos Goto Github PK

anovos's Introduction

Anovos

Quick Start

Using Anovos

Requirements

Installation

Documentation

Overview

Roadmap

Development Version

Contribute

anovos's People

Contributors

Stargazers

Watchers

Forkers

anovos's Issues

Expected Behavior

Current Behavior

Steps to reproduce

Specifications

Possible Solution

Expected Behavior

Current Behavior

Steps to reproduce

Specifications

Possible Solution

Expected Behavior:

Current Behavior:

Potential Solution:

Expected Behavior

Current Behavior

Steps to reproduce

Specifications

Possible Solution

In data_transformer_transformers notebook:

In data_report notebook:

Specifications

Expected Behavior

Current Behavior

Steps to reproduce

Specifications

Possible Solution

Expected Behavior

Current Behavior

Steps to reproduce

Specifications

Possible Solution

Expected Behavior

Current Behavior

Steps to reproduce

Specifications

Possible Solution

Expected Behavior

Current Behavior

Steps to reproduce

Specifications

Possible Solution

Expected Behavior

Current Behavior

Steps to reproduce

Specifications

Possible Solution

Expected Behavior

Current Behavior

Steps to reproduce

Specifications

Possible Solution

Expected Behavior

Current Behavior

Steps to reproduce

Specifications

Possible Solution

Expected Behavior

Current Behavior

Steps to reproduce

Specifications

Possible Solution

Expected Behavior