metafacture-documentation's People

Contributors

acka47, cboehme, dr0i, fsteeg, tobiasnx

metafacture-documentation's Issues

metamorph: use and combination of reset, flushWith and sameEntity attributes

  • For some time I have been struggling with the appropriate use of the reset, flushWith and sameEntity attributes on the various collectors and on the entity tag of Metamorph.
  • Mostly, after some trial and error, I got the result I expected, but I had difficulty explaining the differences between the examples.
  • I hope I can now provide a distinct example that helps to discuss the mechanisms behind the implementation.
  • This directory contains the examples.
  • flushWith and reset=true are used in MorphScript_case_1, whereas sameEntity=true and reset=false deliver the correct result in this Morph script_case_2.
  • The output files are available for case 1 and case 2.
  • It seems that combine uses the sameEntity=true attribute to collect the various values of an entity in combination with an if statement (and no inner collectors), while concat needs the flushWith attribute in combination with if, plus a combine with reset=true as the outer collector.
  • During our get-together meeting in HH I already mentioned my troubles with this, and a helpful voice in the audience (whom I couldn't see as a Skype participant... (edit: me, @dr0i)) was optimistic about being able to give more background information.
  • By the way, these examples are part of my attempt to turn our XSLT-based transformations for Solr SearchDocuments into a Metafacture/Metamorph implementation. My main motivation is the question: where are the limits of Metamorph, and how does it differ from traditional XSLT procedures? If you are interested in how it looks in XSLT, you can find it here. Hints are welcome if you know a more elegant way to express match ranges such as matches(@code, '[b-su-z8]') in XSLT than the way I have done it.

Thanks for any replies!!

Günter

Examination of documentation status

We realized that the documentation of Metafacture is scattered, often not cross-linked and sometimes outdated.
To improve the documentation, we first have to get an overview of what exists. To examine all parts of the documentation, we want to create a table in this repository and add the following information to it:

  • URL to documentation
  • short description
  • target group: User vs. Developer
  • up to date? (last update)
  • written in which language
  • what does the documentation belong to? (core, fix, morph, extension, ...)
  • does the page need a link to metafacture.org?

The goal of this ticket is to create the table described above as a basis on which we can discuss the next steps.

New morph function "UrlEscaper" implementing RFC 3986

In lobid we use the gdata PercentEscaper, which implements RFC 3986, instead of the Java URLEncoder. With the former it is easy to escape URIs that have, e.g., slashes in their path; see lobid/lodmill#517.
It is deliberately a new morph function (and not just a bug fix of the existing URLEncode) because it uses a third-party library, gdata, and it is agreed to keep these to a minimum. For ease of use it would be nice, though, if it were part of metafacture-core.
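
To illustrate the difference, here is a minimal sketch using only the JDK. It is not the gdata PercentEscaper and not the proposed morph function; the class `UrlEscapeSketch` and the method `percentEscape` are hypothetical names made up for this example. It escapes everything except the RFC 3986 unreserved characters plus a caller-supplied set of "safe" ASCII characters, so a slash in a path can be left alone.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; not the gdata PercentEscaper API.
public class UrlEscapeSketch {

    // Unreserved characters as listed in RFC 3986, section 2.3.
    private static final String UNRESERVED =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~";

    /**
     * Percent-encodes the UTF-8 bytes of the input, except unreserved
     * characters and the ASCII characters listed in 'safe'.
     */
    public static String percentEscape(String input, String safe) {
        StringBuilder out = new StringBuilder();
        for (byte b : input.getBytes(StandardCharsets.UTF_8)) {
            int value = b & 0xFF;
            char c = (char) value;
            boolean ascii = value < 128;
            if (ascii && (UNRESERVED.indexOf(c) >= 0 || safe.indexOf(c) >= 0)) {
                out.append(c);
            } else {
                out.append(String.format("%%%02X", value));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String path = "resource/K\u00f6ln Hbf";
        // URLEncoder targets HTML form encoding: '/' becomes %2F, ' ' becomes '+'.
        System.out.println(URLEncoder.encode(path, "UTF-8"));
        // prints "resource%2FK%C3%B6ln+Hbf"
        // Percent-escaping with '/' declared safe keeps the path structure.
        System.out.println(percentEscape(path, "/"));
        // prints "resource/K%C3%B6ln%20Hbf"
    }
}
```

Whether a morph function should expose the set of safe characters as an option is a design question this sketch leaves open.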

use of collect / count and sort triple commands

Problem

We run into trouble because of insufficient memory and have to split the data into smaller sets for processing. This is not only cumbersome but also gives wrong results, because the analysis has to be done on the complete data set.

Question

I have seen that there is a mechanism triggered by a flag called 'memorylow'
https://github.com/culturegraph/metafacture-core/blob/master/src/main/java/org/culturegraph/mf/stream/pipe/sort/AbstractTripleSort.java#L99
which makes it possible to swap triples out to the file system as temporary storage.

  • Is this mechanism used by anyone else? Is there any experience with it?
  • As far as I can see, the memorylow flag is only accessible via Java (not Flux) because there is no setter method. Did I miss something?

Own steps so far

  • I created a test client app in which I used the count and sort commands together with the memoryflag mechanism.
  • The memorylow signature requires long parameters which are never used.
  • The current process function evaluates the memoryflag, but IMHO this mechanism can't work: when 'process' is called for the first time the buffer is always empty and memorylow is set to false, which immediately deactivates the swapping to external storage.
  • I changed this slightly, just so that the swapping mechanism is not switched off, to see how it works. Of course there should be a more sophisticated implementation.
  • But: the sort and count mechanisms then work as expected. The tempFiles collection isn't empty and the temporarily stored triples are read using a SortedTripleFileFacade.
  • To me it seems the mechanism was designed in principle but not finished. But I guess something like this becomes necessary at the latest when analysing large data sets, e.g. to cluster the whole content of all the German library networks, as was done by @mgeipel and @cboehme.
    Christoph: could you provide more background information? Thanks a lot.
  • Other possibilities could be the following commands:
    • write-triples org.culturegraph.mf.stream.sink.TripleWriter
    • write-triple-objects org.culturegraph.mf.stream.sink.TripleObjectWriter
    • reorder-triple org.culturegraph.mf.stream.pipe.TripleReorder
      which I haven't tried myself so far. Are there any practical examples of how to use them? That would be really nice. It would be in the spirit of our get-together meeting to share more examples.
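
For discussion, the general idea of swapping sorted data out to temporary files and merging it back can be sketched as below. This is not the AbstractTripleSort code and makes several assumptions: the class `SpillSortSketch` is a hypothetical name, plain text lines stand in for triples, and a fixed buffer size stands in for the memorylow signal.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch of an external merge sort; not the Metafacture implementation.
public class SpillSortSketch {

    // One sorted run on disk together with its current first line.
    private static final class Run {
        final BufferedReader reader;
        String head;

        Run(Path file) throws IOException {
            reader = Files.newBufferedReader(file, StandardCharsets.UTF_8);
            head = reader.readLine();
        }

        // Moves to the next line; closes the reader at the end of the run.
        boolean advance() throws IOException {
            head = reader.readLine();
            if (head == null) reader.close();
            return head != null;
        }
    }

    private final int maxBuffered; // assumption: stands in for a memory-low signal
    private final List<String> buffer = new ArrayList<>();
    private final List<Path> tempFiles = new ArrayList<>();

    public SpillSortSketch(int maxBuffered) {
        this.maxBuffered = maxBuffered;
    }

    // Buffers a line (must not contain line breaks) and spills when the buffer is full.
    public void add(String line) {
        buffer.add(line);
        if (buffer.size() >= maxBuffered) spill();
    }

    // Sorts the buffer, writes it to a temporary file and empties it.
    private void spill() {
        try {
            Collections.sort(buffer);
            Path file = Files.createTempFile("spill-sort", ".tmp");
            file.toFile().deleteOnExit();
            Files.write(file, buffer, StandardCharsets.UTF_8);
            tempFiles.add(file);
            buffer.clear();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Merges all runs. A real module would stream to a receiver instead of building a list.
    public List<String> sorted() {
        try {
            if (!buffer.isEmpty()) spill();
            PriorityQueue<Run> queue =
                    new PriorityQueue<>(Comparator.comparing((Run r) -> r.head));
            for (Path file : tempFiles) {
                Run run = new Run(file);
                if (run.head != null) queue.add(run);
                else run.reader.close();
            }
            List<String> result = new ArrayList<>();
            while (!queue.isEmpty()) {
                Run run = queue.poll(); // run with the smallest head
                result.add(run.head);
                if (run.advance()) queue.add(run);
            }
            return result;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        SpillSortSketch sorter = new SpillSortSketch(2);
        for (String s : new String[] {"c", "a", "d", "b", "e"}) sorter.add(s);
        System.out.println(sorter.sorted()); // prints "[a, b, c, d, e]"
    }
}
```

In this sketch the decision to spill is made on every add, so it cannot be switched off by an empty buffer on the first call.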

Thanks for any hints - Günter!

Metamorph: remove duplicates from array

Is there already a way to remove duplicates from metamorph arrays?

Preferably, a "unique" attribute or option would exist on the "entity" tag.
Otherwise, implementing an ArrayDuplicateRemover would be the way to go, I guess. That solution would probably be more expensive in terms of computation time.
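
As a rough illustration of what such a remover would have to do, here is a minimal sketch in plain Java. It is not an existing Metamorph feature; the class `DedupSketch` and the method `removeDuplicates` are hypothetical names, and the array is modelled as a simple list of strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

// Hypothetical sketch; not part of Metamorph.
public class DedupSketch {

    // Drops repeated values and keeps the first occurrence of each, in input order.
    public static List<String> removeDuplicates(List<String> values) {
        return new ArrayList<>(new LinkedHashSet<>(values));
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("a", "b", "a");
        System.out.println(removeDuplicates(names)); // prints "[a, b]"
    }
}
```

With a hash-based set each value is checked once, at the price of holding all values of one array in memory until it is complete.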

Corresponding issue: hbz/lobid-organisations#18
Morph code used so far: https://github.com/hbz/lobid-organisations/blob/c72429235ddd3fe713449e39b8456df921f32b6b/src/main/resources/morph-enriched.xml#L304
Morph output: http://beta.lobid.org/organisations/DE-9
The occurrence of this duplicate is caused by input data. Nevertheless, it is desirable to have a removal option.

Highlight signature of flux-commands

The signature is the central piece of information when constructing a Flux workflow: it shows which input a module needs and which output it produces. This indicates which modules can be combined.
It should be highlighted so that one can combine Flux modules more easily.

change-id
---------
- description:	By default changes the record ID to the value of the '_id' literal (if present). Use the constructor to choose another literal as ID source.
- options:	keepidliteral (boolean), idliteral (String), keeprecordswithoutidliteral (boolean)
- signature:	StreamReceiver -> StreamReceiver
- java class:	org.metafacture.mangling.RecordIdChanger
