datacleaner's Introduction

DataCleaner

The premier Open Source Data Quality solution.

DataCleaner is a Data Quality toolkit that allows you to profile, correct and enrich your data. People use it for ad-hoc analysis, recurring cleansing, and as a Swiss Army knife in matching and Master Data Management solutions.

Where to go for end-user information?

Please visit the DataCleaner community website, https://datacleaner.github.io, for downloads, news, documentation and more.

Visit our Gitter chat channel, https://gitter.im/datacleaner/community, to ask questions or start discussions.

GitHub markdown pages and issues are used for developer-oriented and technical topics only.

Module structure

The main application modules are:

  • api - The public API of DataCleaner. Mostly interfaces and annotations that you should use to build your own extensions.
  • resources - Static resources in DataCleaner.
  • oss-branding - Icons and colors.
  • testware - Useful classes for unit testing of DataCleaner and extension code.
  • engine
    • core - The core engine piece, which allows execution of jobs and components as per the API.
    • xml-config - Utilities for reading and writing DataCleaner's job files and configuration files.
    • env - Different/alternative environments that DataCleaner can run in, for instance Apache Spark or a webapp cluster.
  • components
    • ... - many submodules containing built-in as well as additional components/extensions to use with DataCleaner.
    • standard-components - a container project that depends on all components normally bundled in DataCleaner community edition.
  • desktop
    • api - The public API for the DataCleaner desktop application.
    • ui - The Swing-based user interface for desktop users.
  • monitor
    • api - The API classes and interfaces of DataCleaner monitor.

Code style and formatting

In the root of the project you can find 'Formatter-[IDE].xml' files, which let you import the project's code formatting rules into your IDE.

Continuous Integration

There's a public build of DataCleaner that can be found on Travis CI:

https://travis-ci.org/datacleaner/DataCleaner

License

Licensed under the GNU Lesser General Public License; see http://www.gnu.org/licenses/lgpl.txt

datacleaner's People

Contributors

anandswarupv, ankit2711, arjansh, balendra, claudiaphi, davkrause, dependabot[bot], gmlewis, hdrexler, hettyk, jakubneubauer, jhorcicka, joosjeboon, kaspersorensen, khouzvicka, leeth, losd, mennob, mhorner, michaelaerni, nancy-sharma, nsrivastava, rposkocil, sameerarora, saurabharorax, stefanjanssen, tomaszguzialek, vaibhavsehgal, vaibhavsehgal1, vinsjee


datacleaner's Issues

Visual/graph oriented perspective on job editing

Right now we have the "visualization" option which provides a read-only view of the edited job in a graph-oriented manner.

We need a way to make this view interactive.

  • Add a switch at the bottom of the window to toggle between the new "Visual" editor and the "Classic" editor.
  • Replace everything on the right-hand side of the schematree with the graph view, when the above switch is set to "Visual".
  • Let the visual editor panel implement the job builder listeners so that it can automatically add/remove components when events occur.
  • Add double-click behaviour on component in visual editor - double clicking should present the component editor panel.
  • Add a way to draw lines between components (shift+click and right-click menu based); upon dropping a line on a component, bring up column selection (mapped columns effectuate the "line", since they are defined by the dependencies between components).
  • Add x,y coordinates to component information in the analysis XML file format (a sketch follows below).
  • Remove the "Visualize" button.

More TODO points in the pull request - #130
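
For illustration, the coordinates could be stored as metadata properties on each component in the job file. A possible sketch (the property names are a suggestion, not a finalized format):

<transformer>
  <descriptor ref="Concatenator" />
  <metadata-properties>
    <!-- hypothetical properties; the final format may differ -->
    <property name="CoordinatesX">340</property>
    <property name="CoordinatesY">120</property>
  </metadata-properties>
  <input ref="col_firstname" />
  <output name="concatenated" />
</transformer>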

Unify component interfaces

We currently have three component interfaces in DataCleaner/AnalyzerBeans (sketched below):

  • Filter (splits a data processing stream into more streams)
  • Transformer (appends columns and/or records to a stream)
  • Analyzer (consumes records and produces an analysis result)
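
For reference, a simplified sketch of the three interfaces as they look today (trimmed to their essential methods; annotations, package names and some inherited generics of the actual API are left out):

// simplified from the DataCleaner API; not the literal source
public interface Filter<C extends Enum<C>> {
    // routes each record into exactly one category, i.e. output stream
    C categorize(InputRow row);
}

public interface Transformer {
    // declares the columns this component appends to the stream
    OutputColumns getOutputColumns();

    // produces the appended values for a single record
    Object[] transform(InputRow row);
}

public interface Analyzer<R extends AnalyzerResult> {
    // consumes records during execution...
    void run(InputRow row, int distinctCount);

    // ...and produces an analysis result when the job finishes
    R getResult();
}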

We would like these component types to converge. Sometimes traits of one of the components are also useful in other situations. Also, a few scenarios are currently not possible, especially the case where a transformer consumes a large set of records and only after a while emits them back into the stream (e.g. when sorting records, or when waiting for a long-running batch service).

Requirements for the component API:

  • Should be able to specify multiple output "streams", like a filter.
  • Should be able to specify multiple output fields, like a transformer.
  • Should be able to specify multiple output records, like a transformer with an injected OutputRowCollector (see the sketch after this list).
  • Should be able to produce arbitrary filter outcomes, like a general-purpose "grouper".
  • Should be able to consume arbitrary filter outcomes, potentially spawning one component for each grouped value. Example of a use case: #377
  • Should be able to store records in a temporary collection-like store, and retrieve them again later and emit them in a way similar to using the OutputRowCollector.
  • Should be able to produce an AnalyzerResult (consider renaming AnalyzerResult though, if Analyzer itself is deprecated). Separately described: #225
  • Should be able to produce one or more completely new data streams. This will enable current analyzers such as the Duplicate Detection analyzer to produce datasets with pairs, groups etc. for further processing in the same job. Separately described: #224
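
To make the OutputRowCollector pattern concrete, here is a minimal transformer sketch. It is a sketch only: the class, column and package details are illustrative, though @Configured, @Provided and OutputRowCollector.putValues(...) follow the existing API:

import javax.inject.Inject;
import javax.inject.Named;

import org.datacleaner.api.*; // package name as of recent versions

// emits multiple output records per input record via the injected collector
@Named("Split words")
public class SplitWordsTransformer implements Transformer {

    @Configured
    InputColumn<String> textColumn;

    @Inject
    @Provided
    OutputRowCollector collector;

    @Override
    public OutputColumns getOutputColumns() {
        return new OutputColumns(String.class, "word");
    }

    @Override
    public Object[] transform(InputRow row) {
        String text = row.getValue(textColumn);
        if (text == null) {
            return null;
        }
        for (String word : text.split("\\s+")) {
            collector.putValues(word); // one extra output record per word
        }
        return null; // everything was emitted through the collector
    }
}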

Create table via DC's UI

Make it possible to create tables on (at least JDBC) datastores, since this is a common wish when inserting data using the "Insert into table" writer. Currently you can only use existing tables or the "Create staging table" component, which is not always as convenient.

Read data from multiple files as a single source

See original issue: http://eobjects.org/trac/ticket/1183

Sometimes, datastores in XML or CSV format (or other file-based formats) are so large that they are provided as a collection of files.

It would be helpful if DataCleaner could access such a collection of files as a single datasource. It could do either of:

  • concatenating the contents of the files into a single large table, or
  • mapping each file to a separate table (see the sketch below), or
  • a combination of the two above (e.g. a set of XML files, each containing multiple tables).

This is an important enhancement, as there is no convenient way to deal with this type of input today. Both of the following workarounds are suboptimal:

  • Catting all the files together into a single file, then processing that: since this typically concerns large datasources, catting the files together consumes both time and disk space;
  • Invoking DataCleaner separately on each file: with a large number of files (e.g. Dutch Kadaster's BAG compact file comes as 600 XML files), the overhead of firing up multiple DC instances becomes costly.
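
A possible building block for the "separate table per file" variant, assuming Apache MetaModel (the data access library underneath DataCleaner): its CompositeDataContext already presents several datastores through one schema view. A minimal sketch:

import java.io.File;

import org.apache.metamodel.CompositeDataContext;
import org.apache.metamodel.DataContext;
import org.apache.metamodel.csv.CsvDataContext;
import org.apache.metamodel.schema.Table;

public class MultiFileDatastoreSketch {
    public static void main(String[] args) {
        // each CSV file becomes a DataContext with a single table
        DataContext part1 = new CsvDataContext(new File("part-001.csv"));
        DataContext part2 = new CsvDataContext(new File("part-002.csv"));

        // the composite exposes both tables as if they were one datastore
        DataContext all = new CompositeDataContext(part1, part2);
        for (Table table : all.getDefaultSchema().getTables()) {
            System.out.println(table.getName());
        }
    }
}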

Automatic discovery of nodes in cluster module

Right now our cluster setup in DataCleaner requires a configured list of node URLs in the cluster. This is fine if the cluster is constant, but it would be even better if nodes could be added automatically, so we should instead maintain a mutable list of nodes in the master(s) of the cluster.

Consider taking inspiration from the discovery modules of e.g. Elasticsearch, described here: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-discovery-zen.html

Expose Wizard module as JavaScript API

Make the Wizard system more flexible and consumable in third-party applications that inherit from DC's base.

  • Make popup optional by having targetDiv as external configuration
  • Expose JS method startMethod(type, wizardName, targetDiv)

Make it possible to save a job without analyzers

Although the job will be incomplete, it should be possible to save a job without any analyzers in it. That will make such jobs useful as child jobs in the new "Invoke child Analysis job" transformation.

More convenient reordering of columns

The current option for reordering columns is quite heavyweight if you want to move many columns around. Some ways to improve:

  • Make drag-n-drop possible in the list of columns.
  • Have buttons for moving a column all the way to the top or all the way to the bottom.

Let JavaScriptCallbacks optionally return boolean to influence behaviour

Right now the new Wizard JS API based on JavaScriptCallbacks is "either/or", in the sense that if a callback is implemented, it ALWAYS rules the behaviour of the script. Instead, we could use the "typeof" JavaScript operator to find out whether an invoked callback returns a boolean, and if so, propagate that boolean's value back to the central script, making it easier to filter out only the callback events of interest in the host page.

Enable the execute button always, suggest to send outcome to 'file'

Currently the 'execute' button is disabled unless you add an analyzer, or add a write-results-to-file option in a transformer.
Sometimes we simply want to add or standardize data and have no need for an analyzer.
We could simplify this: if there is no analyzer, always suggest writing the outcome to a file.


Support for NTLM based proxies

We have a client with an NTLM security based HTTP proxy. This is the error he gets:

WARN 14:43:59 RequestProxyAuthentication - NTLM authentication error: Credentials cannot be used for NTLM authentication: org.apache.http.auth.UsernamePasswordCredentials

Simplify message regarding server time and time format

On the scheduling dialog we have this phrasing:

Date with respect to Server Time :2014-06-19 13:08:45
Date is in following format: yyyy-MM-dd HH:MM:SS

I would just say "Server time:" to begin with and not mention the format at all; it is evident from the input box above, which features a sample value if you click it.

Web service to provide overview of running jobs

Jobs running on a DC monitor instance should themselves be observable. We will then expose the status of running jobs, like this (a hypothetical sketch follows the list):

  • ROLE_ADMIN should be allowed to see running jobs for their own tenant.
  • ROLE_GOD should be allowed to see running jobs for all tenants.
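
A hypothetical shape for such a web service, assuming the monitor's existing Spring MVC stack (the path, class and helper names below are made up for illustration):

import java.util.Collections;
import java.util.List;

import org.springframework.security.access.annotation.Secured;
import org.springframework.stereotype.Controller;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.ResponseBody;

// hypothetical controller, for illustration only
@Controller
public class RunningJobsController {

    // ROLE_ADMIN: may list running jobs within the caller's own tenant;
    // a ROLE_GOD variant would omit the tenant restriction and span all tenants
    @Secured("ROLE_ADMIN")
    @RequestMapping("/repository/{tenant}/jobs/running")
    @ResponseBody
    public List<String> runningJobs(@PathVariable("tenant") String tenant) {
        return lookupRunningJobs(tenant);
    }

    private List<String> lookupRunningJobs(String tenant) {
        // would query the execution engine for active jobs of this tenant
        return Collections.emptyList();
    }
}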

Allow transformer without direct 'requirement' to consume records from multiple indirect requirements

Right now we have a very strict mode of evaluating whether a transformer (or other component) can consume a row: only if all of its dependencies are satisfied for execution.

But if a transformer consumes records from multiple other components which have opposite requirements, the evaluation always concludes that NO records are consumable. In such cases we should allow execution, not disallow it.

This feature is especially important since it can (finally) get us rid of the "merged outcome" construct, which was never a first-class citizen of AnalyzerBeans or DataCleaner.

Bucketing: A way to "flag" multiple filter outcomes with a label and treat all flags with a common action

Basically an extension of CompoundComponentRequirement and #159 ...

A use-case scenario we hear more and more is that a data quality job has a main flow which validates data, and that there is then also a need for "bucketing", i.e. marking all records that weren't validated and putting them in one or more "buckets".

A practical way to do this within DataCleaner could be that all filter outcomes can be flagged with certain labels such as "invalid", "doubtful" etc. (up to the user).

Each label would then be mapped to some action. We can maybe even pre-define that action to be inserting the records into some staging table. Exactly how this should happen needs to be elaborated.

Hybrid repository implementation

We need a repository implementation which can use a hybrid of different backing implementations, depending on file characteristics.

The typical scenario will be that configuration/job files are kept in a central database, while very large (data) files are kept directly on disk. Virtually, they will function as if co-located.
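
One possible shape, purely as a sketch: a router that picks a backing repository per file based on simple characteristics. Repository stands in for DataCleaner's repository abstraction; the routing rule and class name are hypothetical:

// hypothetical: route small configuration/job files to a database-backed
// repository and everything else (typically large data files) to disk
public class HybridRepositoryRouter {

    private final Repository databaseRepository; // e.g. central RDBMS
    private final Repository diskRepository;     // e.g. local file system

    public HybridRepositoryRouter(Repository databaseRepository, Repository diskRepository) {
        this.databaseRepository = databaseRepository;
        this.diskRepository = diskRepository;
    }

    // routing rule is illustrative: job/configuration files live centrally
    public Repository repositoryFor(String filename) {
        if (filename.endsWith(".analysis.xml") || filename.equals("conf.xml")) {
            return databaseRepository;
        }
        return diskRepository;
    }
}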

Realtime progress information at component level

Improve the "progress information" panel with more information about the individual running components.

The progress information view should keep the current log panel at the bottom, while the top/middle of the screen should show a diagram of the processing flow. At every component, the number of records processed should be updated regularly.

New triggering mode for "One-time" scheduled triggers

We want to provide a fourth mode for the user to schedule jobs: one-time scheduled.

This should basically work in the same way as the periodic scheduling option, except that it should only run once (at a particular time, expressed via a cron expression, like the periodic scheduling) and then never again.
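
A sketch of how the trigger could be expressed, assuming a Quartz-style scheduler (whether the monitor uses Quartz internally is an assumption here): a cron expression with an explicit year field matches exactly one point in time, so the trigger fires once and is never rescheduled.

import org.quartz.CronScheduleBuilder;
import org.quartz.CronTrigger;
import org.quartz.TriggerBuilder;

public class OneTimeTriggerSketch {
    public static void main(String[] args) {
        // "0 30 14 25 12 ? 2020" = 14:30:00 on 25 December 2020, exactly once
        CronTrigger trigger = TriggerBuilder.newTrigger()
                .withIdentity("one-time-example")
                .withSchedule(CronScheduleBuilder.cronSchedule("0 30 14 25 12 ? 2020"))
                .build();
        System.out.println(trigger.getCronExpression());
    }
}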

Add audit of job uploads and save old versions

Currently, when a new job is uploaded using the DC monitor HTTP upload, the old job file is replaced entirely. We should keep a copy of the old version and also make it visible in the "History" panel that the job was changed.

Load AnalysisResult even when original Analyzer is not deserializable

If we deprecate or remove an old analyzer, serialized AnalysisResults should still be deserializable.

One case for this is the duplicate detection results that are shipped with DC as an example monitoring case. Since we no longer want to ship the actual analyzer with DC, we need to be able to deserialize the result without it.
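
The standard Java serialization technique for this, as a minimal sketch (not DataCleaner's actual implementation; the class names in the mapping are hypothetical): override ObjectInputStream.resolveClass and map legacy or removed class names to current substitutes, so the rest of the result graph still loads.

import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.io.ObjectStreamClass;

public class LegacyAwareObjectInputStream extends ObjectInputStream {

    public LegacyAwareObjectInputStream(InputStream in) throws IOException {
        super(in);
    }

    @Override
    protected Class<?> resolveClass(ObjectStreamClass desc)
            throws IOException, ClassNotFoundException {
        String name = desc.getName();
        // hypothetical mapping: an analyzer class that moved to an extension jar
        if ("org.eobjects.analyzer.beans.DuplicateDetectionAnalyzer".equals(name)) {
            name = "org.datacleaner.extension.duplicates.DuplicateDetectionAnalyzer";
        }
        return Class.forName(name, false, getClass().getClassLoader());
    }
}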

Group column checkboxes by their origin

In the property widget for input column arrays, we show a (sometimes long) list of input columns. To make this list more comprehensible, we should group the columns by their origin. Each group should be collapsible, to hide the cruft.

A bit like this:

Source records
[x] Col A
[ ] Col B
Transformer 1
[ ] Col C
[x] Col D
Transformer 2
[x] Col E
[x] Col F
