datacleaner's Introduction

DataCleaner

The premier Open Source Data Quality solution.

DataCleaner is a Data Quality toolkit that allows you to profile, correct and enrich your data. People use it for ad-hoc analysis, recurring cleansing, and as a Swiss Army knife in matching and Master Data Management solutions.

Where to go for end-user information?

Please visit the DataCleaner community website, https://datacleaner.github.io, for downloads, news, documentation and more.

Visit our Gitter chat channel, https://gitter.im/datacleaner/community, to ask questions or start discussions.

GitHub markdown pages and issues are used for developer-oriented and technical topics only.

Module structure

The main application modules are:

  • api - The public API of DataCleaner. Mostly interfaces and annotations that you should use to build your own extensions.
  • resources - Static resources in DataCleaner.
  • oss-branding - Icons and colors.
  • testware - Useful classes for unit testing of DataCleaner and extension code.
  • engine
    • core - The core engine piece, which allows execution of jobs and components as per the API.
    • xml-config - Utilities for reading and writing DataCleaner's job files and configuration files.
    • env - Different/alternative environments that DataCleaner can run in, for instance Apache Spark or a webapp cluster.
  • components
    • ... - many submodules containing built-in as well as additional components/extensions to use with DataCleaner.
    • standard-components - a container project that depends on all components normally bundled in DataCleaner community edition.
  • desktop
    • api - The public API for the DataCleaner desktop application.
    • ui - The Swing-based user interface for desktop users.
  • monitor
    • api - The API classes and interfaces of DataCleaner monitor.

Code style and formatting

In the root of the project you can find 'Formatter-[IDE].xml' files, which let you import the project's code formatting rules into your IDE.

Continuous Integration

There's a public build of DataCleaner that can be found on Travis CI:

https://travis-ci.org/datacleaner/DataCleaner

License

Licensed under the GNU Lesser General Public License; see http://www.gnu.org/licenses/lgpl.txt

datacleaner's People

Contributors

anandswarupv, ankit2711, arjansh, balendra, claudiaphi, davkrause, dependabot[bot], gmlewis, hdrexler, hettyk, jakubneubauer, jhorcicka, joosjeboon, kaspersorensen, khouzvicka, leeth, losd, mennob, mhorner, michaelaerni, nancy-sharma, nsrivastava, rposkocil, sameerarora, saurabharorax, stefanjanssen, tomaszguzialek, vaibhavsehgal, vaibhavsehgal1, vinsjee


datacleaner's Issues

Visual/graph oriented perspective on job editing

Right now we have the "visualization" option which provides a read-only view of the edited job in a graph-oriented manner.

We need a way to make this view interactive.

  • Add a switch at the bottom of the window to toggle between the new "Visual" editor and the "Classic" editor.
  • Replace everything on the right-hand side of the schematree with the graph view, when the above switch is set to "Visual".
  • Let the visual editor panel implement the job builder listeners so that it can automatically add/remove components when events occur.
  • Add double-click behaviour on component in visual editor - double clicking should present the component editor panel.
  • Add a way to draw lines between components (shift+click and right-click menu based); upon dropping a line on a component, bring up column selection (mapped columns effectuate the "line", since they are defined by the dependencies between components).
  • Add x,y coordinates to component information in the analysis XML file format (a sketch follows below).
  • Remove the "Visualize" button.

More TODO points in the pull request - #130
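
For illustration, the coordinates could be stored as metadata properties on each component in the job file. A possible sketch (the property names are a suggestion, not a finalized format):

<transformer>
  <descriptor ref="Concatenator" />
  <metadata-properties>
    <!-- hypothetical properties; the final format may differ -->
    <property name="CoordinatesX">340</property>
    <property name="CoordinatesY">120</property>
  </metadata-properties>
  <input ref="col_firstname" />
  <output name="concatenated" />
</transformer>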

Unify component interfaces

We currently have three component interfaces in DataCleaner/AnalyzerBeans (sketched below):

  • Filter (splits a data processing stream into more streams)
  • Transformer (appends columns and/or records to a stream)
  • Analyzer (consumes records and produces an analysis result)
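
For reference, a simplified sketch of the three interfaces as they look today (trimmed to their essential methods; annotations, package names and some inherited generics of the actual API are left out):

// simplified from the DataCleaner API; not the literal source
public interface Filter<C extends Enum<C>> {
    // routes each record into exactly one category, i.e. output stream
    C categorize(InputRow row);
}

public interface Transformer {
    // declares the columns this component appends to the stream
    OutputColumns getOutputColumns();

    // produces the appended values for a single record
    Object[] transform(InputRow row);
}

public interface Analyzer<R extends AnalyzerResult> {
    // consumes records during execution...
    void run(InputRow row, int distinctCount);

    // ...and produces an analysis result when the job finishes
    R getResult();
}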

We would like these component types to converge. Sometimes traits of one of the components are also useful in other situations. Also, a few scenarios are currently not possible, especially the case where a transformer consumes a large set of records and only after a while emits them back into the stream (e.g. when sorting records, or when waiting for a long-running batch service).

Requirements for the component API:

  • Should be able to specify multiple output "streams", like a filter.
  • Should be able to specify multiple output fields, like a transformer.
  • Should be able to specify multiple output records, like a transformer with an injected OutputRowCollector (see the sketch after this list).
  • Should be able to produce arbitrary filter outcomes, like a general-purpose "grouper".
  • Should be able to consume arbitrary filter outcomes, potentially spawning one component for each grouped value. Example of a use case: #377
  • Should be able to store records in a temporary collection-like store, and retrieve them again later and emit them in a way similar to using the OutputRowCollector.
  • Should be able to produce an AnalyzerResult (consider renaming AnalyzerResult though, if Analyzer itself is deprecated). Separately described: #225
  • Should be able to produce one or more completely new data streams. This will enable current analyzers such as the Duplicate Detection analyzer to produce datasets with pairs, groups etc. for further processing in the same job. Separately described: #224
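
To make the OutputRowCollector pattern concrete, here is a minimal transformer sketch. It is a sketch only: the class, column and package details are illustrative, though @Configured, @Provided and OutputRowCollector.putValues(...) follow the existing API:

import javax.inject.Inject;
import javax.inject.Named;

import org.datacleaner.api.*; // package name as of recent versions

// emits multiple output records per input record via the injected collector
@Named("Split words")
public class SplitWordsTransformer implements Transformer {

    @Configured
    InputColumn<String> textColumn;

    @Inject
    @Provided
    OutputRowCollector collector;

    @Override
    public OutputColumns getOutputColumns() {
        return new OutputColumns(String.class, "word");
    }

    @Override
    public Object[] transform(InputRow row) {
        String text = row.getValue(textColumn);
        if (text == null) {
            return null;
        }
        for (String word : text.split("\\s+")) {
            collector.putValues(word); // one extra output record per word
        }
        return null; // everything was emitted through the collector
    }
}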

Create table via DC's UI

Make it possible to create tables on (at least JDBC) datastores, since this is a common wish when inserting data using the "Insert into table" writer. Currently you can only use existing tables or the "Create staging table" component, which is not always as convenient.

Read data from multiple files as a single source

See original issue: http://eobjects.org/trac/ticket/1183

Sometimes, datastores in XML or CSV format (or other file-based formats) are so large that they are provided as a collection of files.

It would be helpful if DataCleaner could access such a collection of files as a single datasource. It could do either of:

  • concatenating the contents of the files into a single large table, or
  • mapping each file to a separate table (see the sketch below), or
  • a combination of the two above (e.g. a set of XML files, each containing multiple tables).

This is an important enhancement, as there is no convenient way to deal with this type of input today. Both of the following workarounds are suboptimal:

  • Catting all the files together into a single file, then processing that: since this typically concerns large datasources, catting the files together consumes both time and disk space;
  • Invoking DataCleaner separately on each file: with a large number of files (e.g. Dutch Kadaster's BAG compact file comes as 600 XML files), the overhead of firing up multiple DC instances becomes costly.
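
A possible building block for the "separate table per file" variant, assuming Apache MetaModel (the data access library underneath DataCleaner): its CompositeDataContext already presents several datastores through one schema view. A minimal sketch:

import java.io.File;

import org.apache.metamodel.CompositeDataContext;
import org.apache.metamodel.DataContext;
import org.apache.metamodel.csv.CsvDataContext;
import org.apache.metamodel.schema.Table;

public class MultiFileDatastoreSketch {
    public static void main(String[] args) {
        // each CSV file becomes a DataContext with a single table
        DataContext part1 = new CsvDataContext(new File("part-001.csv"));
        DataContext part2 = new CsvDataContext(new File("part-002.csv"));

        // the composite exposes both tables as if they were one datastore
        DataContext all = new CompositeDataContext(part1, part2);
        for (Table table : all.getDefaultSchema().getTables()) {
            System.out.println(table.getName());
        }
    }
}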

Automatic discovery of nodes in cluster module

Right now our cluster setup in DataCleaner requires a configured list of node URLs in the cluster. This is fine if the cluster is constant, but it would be even better if nodes could be added automatically, so we should instead maintain a mutable list of nodes in the master(s) of the cluster.

Consider taking inspiration from the discovery modules of e.g. Elasticsearch, described here: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-discovery-zen.html

Expose Wizard module as JavaScript API

Make the Wizard system more flexible and consumable in third-party applications that inherit from DC's base.

  • Make popup optional by having targetDiv as external configuration
  • Expose JS method startMethod(type, wizardName, targetDiv)

Make it possible to save a job without analyzers

Although the job will be incomplete, it should be possible to save a job without any analyzers in it. That will make such jobs useful as child jobs in the new "Invoke child Analysis job" transformation.

More convenient reordering of columns

The current option for reordering columns is quite heavyweight if you want to move many columns around. Some ways to improve:

  • Make drag-n-drop possible in the list of columns.
  • Have buttons for moving a column all the way to the top or all the way to the bottom.

Let JavaScriptCallbacks optionally return boolean to influence behaviour

Right now the new Wizard JS API based on JavaScriptCallbacks is "either/or", in the sense that if a callback is implemented, it ALWAYS rules the behaviour of the script. Instead, we could use the "typeof" JavaScript operator to find out whether an invoked callback returns a boolean, and if so, propagate that boolean's value back to the central script, making it easier to filter out only the callback events of interest in the host page.

Enable the execute button always, suggest to send outcome to 'file'

Currently the 'execute' button is disabled unless you add an analyzer, or add a write-results-to-file option in a transformer.
Sometimes we simply want to add or standardize data and have no need for an analyzer.
We could simplify this: if there is no analyzer, always suggest writing the outcome to a file.


Support for NTLM based proxies

We have a client with an NTLM security based HTTP proxy. This is the error he gets:

WARN 14:43:59 RequestProxyAuthentication - NTLM authentication error: Credentials cannot be used for NTLM authentication: org.apache.http.auth.UsernamePasswordCredentials

Simplify message regarding server time and time format

On the scheduling dialog we have this phrasing:

Date with respect to Server Time :2014-06-19 13:08:45
Date is in following format: yyyy-MM-dd HH:MM:SS

I would just say "Server time:" to begin with and not mention the format at all; it is evident from the input box above, which features a sample value if you click it.

Web service to provide overview of running jobs

Jobs running on a DC monitor instance should themselves be observable. We will then expose the status of running jobs, like this (a hypothetical sketch follows the list):

  • ROLE_ADMIN should be allowed to see running jobs for their own tenant.
  • ROLE_GOD should be allowed to see running jobs for all tenants.
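
A hypothetical shape for such a web service, assuming the monitor's existing Spring MVC stack (the path, class and helper names below are made up for illustration):

import java.util.Collections;
import java.util.List;

import org.springframework.security.access.annotation.Secured;
import org.springframework.stereotype.Controller;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.ResponseBody;

// hypothetical controller, for illustration only
@Controller
public class RunningJobsController {

    // ROLE_ADMIN: may list running jobs within the caller's own tenant;
    // a ROLE_GOD variant would omit the tenant restriction and span all tenants
    @Secured("ROLE_ADMIN")
    @RequestMapping("/repository/{tenant}/jobs/running")
    @ResponseBody
    public List<String> runningJobs(@PathVariable("tenant") String tenant) {
        return lookupRunningJobs(tenant);
    }

    private List<String> lookupRunningJobs(String tenant) {
        // would query the execution engine for active jobs of this tenant
        return Collections.emptyList();
    }
}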

Allow transformer without direct 'requirement' to consume records from multiple indirect requirements

Right now we have a very strict mode of evaluating whether a transformer (or other component) can consume a row: only if all of its dependencies are satisfied for execution.

But if a transformer consumes records from multiple other components which have opposite requirements, the evaluation always concludes that NO records are consumable. In such cases we should allow execution, not disallow it.

This feature is especially important since it can (finally) get us rid of the "merged outcome" construct, which was never a first-class citizen of AnalyzerBeans or DataCleaner.

Bucketing: A way to "flag" multiple filter outcomes with a label and treat all flags with a common action

Basically an extension of CompoundComponentRequirement and #159 ...

A use-case scenario we hear more and more is that a data quality job has a main flow which validates data, and that there is then also a need for "bucketing", i.e. marking all records that weren't validated and putting them in one or more "buckets".

A practical way to do this within DataCleaner could be that all filter outcomes can be flagged with certain labels such as "invalid", "doubtful" etc. (up to the user).

Each label would then be mapped to some action. We can maybe even pre-define that action to be inserting the records into some staging table. Exactly how this should happen needs to be elaborated.

Hybrid repository implementation

We need a repository implementation which can use a hybrid of different backing implementations, depending on file characteristics.

The typical scenario will be that configuration/job files are kept in a central database, while very large (data) files are kept directly on disk. Virtually, they will function as if co-located.
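
One possible shape, purely as a sketch: a router that picks a backing repository per file based on simple characteristics. Repository stands in for DataCleaner's repository abstraction; the routing rule and class name are hypothetical:

// hypothetical: route small configuration/job files to a database-backed
// repository and everything else (typically large data files) to disk
public class HybridRepositoryRouter {

    private final Repository databaseRepository; // e.g. central RDBMS
    private final Repository diskRepository;     // e.g. local file system

    public HybridRepositoryRouter(Repository databaseRepository, Repository diskRepository) {
        this.databaseRepository = databaseRepository;
        this.diskRepository = diskRepository;
    }

    // routing rule is illustrative: job/configuration files live centrally
    public Repository repositoryFor(String filename) {
        if (filename.endsWith(".analysis.xml") || filename.equals("conf.xml")) {
            return databaseRepository;
        }
        return diskRepository;
    }
}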

Realtime progress information at component level

Improve the "progress information" panel with more information about the individual running components.

The progress information view should keep the current log panel at the bottom, while the top/middle of the screen should show a diagram of the processing flow. At every component, the number of records processed should be updated regularly.

New triggering mode for "One-time" scheduled triggers

We want to provide a fourth mode for the user to schedule jobs: one-time scheduled.

This should basically work in the same way as the periodic scheduling option, except that it should only run once (at a particular time, expressed via a cron expression, like the periodic scheduling) and then never again.
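
A sketch of how the trigger could be expressed, assuming a Quartz-style scheduler (whether the monitor uses Quartz internally is an assumption here): a cron expression with an explicit year field matches exactly one point in time, so the trigger fires once and is never rescheduled.

import org.quartz.CronScheduleBuilder;
import org.quartz.CronTrigger;
import org.quartz.TriggerBuilder;

public class OneTimeTriggerSketch {
    public static void main(String[] args) {
        // "0 30 14 25 12 ? 2020" = 14:30:00 on 25 December 2020, exactly once
        CronTrigger trigger = TriggerBuilder.newTrigger()
                .withIdentity("one-time-example")
                .withSchedule(CronScheduleBuilder.cronSchedule("0 30 14 25 12 ? 2020"))
                .build();
        System.out.println(trigger.getCronExpression());
    }
}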

Add audit of job uploads and save old versions

Currently, when a new job is uploaded using the DC monitor HTTP upload, the old job file is replaced entirely. We should keep a copy of the old version and also make it visible in the "History" panel that the job was changed.

Load AnalysisResult even when original Analyzer is not deserializable

If we deprecate or remove an old analyzer, serialized AnalysisResults should still be deserializable.

One case for this is the duplicate detection results that are shipped with DC as an example monitoring case. Since we no longer want to ship the actual analyzer with DC, we need to be able to deserialize the result without it.
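
The standard Java serialization technique for this, as a minimal sketch (not DataCleaner's actual implementation; the class names in the mapping are hypothetical): override ObjectInputStream.resolveClass and map legacy or removed class names to current substitutes, so the rest of the result graph still loads.

import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.io.ObjectStreamClass;

public class LegacyAwareObjectInputStream extends ObjectInputStream {

    public LegacyAwareObjectInputStream(InputStream in) throws IOException {
        super(in);
    }

    @Override
    protected Class<?> resolveClass(ObjectStreamClass desc)
            throws IOException, ClassNotFoundException {
        String name = desc.getName();
        // hypothetical mapping: an analyzer class that moved to an extension jar
        if ("org.eobjects.analyzer.beans.DuplicateDetectionAnalyzer".equals(name)) {
            name = "org.datacleaner.extension.duplicates.DuplicateDetectionAnalyzer";
        }
        return Class.forName(name, false, getClass().getClassLoader());
    }
}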

Group column checkboxes by their origin

In the property widget for input column arrays, we show a (sometimes long) list of input columns. To make this list more comprehensible, we should group the columns by their origin. Each group should be collapsible, to hide the cruft.

A bit like this:

Source records
[x] Col A
[ ] Col B
Transformer 1
[ ] Col C
[x] Col D
Transformer 2
[x] Col E
[x] Col F
