The gaffer-doc from gchq

Update User Guide

The user guide contains some typos, and it could be made easier to follow for users who would like to start from a blank project rather than cloning the examples.

Fix broken links

There are several links that still point to files in github/gchq/Gaffer/doc. These need to be updated to point to the files in this repository.

Add note to top of Spark operations doc about reading from RFiles

When run on Accumulo, GetDataFrameOfElements and GetJavaRDDOfAllElements support reading directly from RFiles.

Add examples for the new InRange predicates

Depends on gchq/koryphe#64.

Create Transform operation examples

Similar to the Aggregate and Filter operations, a few examples for usage of the Transform operation should be written.

Spark operation examples don't show how to specify Hadoop conf

The examples for operations such as GetJavaRDDOfAllElements don't show how the Hadoop configuration can be passed in as an option. Passing this option isn't essential if you're running your code via the Hadoop command or within a Spark job that has been configured with Hadoop, but without it you can't override properties in the configuration.

Copied from gchq/Gaffer#1328

Add SingletonList doc

Update FederatedStore Doc with new Features

Add documentation to explain that Get operations return a lazy iterable

For example, if a user executes a GetAllElements on Accumulo:

final Iterable<? extends Element> elements = graph.execute(new GetAllElements(), getUser());

The 'elements' iterable is lazy, and the query is only executed on Accumulo when you start iterating over the results. So if you add another element 'X' to the graph before you consume the 'elements' iterable, the results will also contain 'X'.

For this reason, be very careful when passing a lazy iterable returned from a Get query into an AddElements on the same Graph. The AddElements consumes the iterable lazily, so the query is still running while elements are being added, which can cause duplicates to be added.

To do a Get followed by an Add on the same Graph, we recommend consuming and caching the Get results first. For a small number of results, this can be done simply by using the ToList operation in your chain. For example:

new OperationChain.Builder()
                .first(new GetAllElements())
                .then(new ToList<>())
                .then(new AddElements())
                .build();

For a large number of results, you could add them to the Gaffer cache temporarily:

new OperationChain.Builder()
                .first(new GetAllElements())
                .then(new ExportToGafferResultCache<>())
                .then(new DiscardOutput())
                .then((Operation) new GetGafferResultCacheExport())
                .then(new AddElements())
                .build();
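
The lazy behaviour described above can be illustrated without Gaffer. The following is a minimal plain-Java sketch, not Gaffer code: LazyIterableSketch, lazyGet and toList are hypothetical names, and an in-memory list stands in for the store. It only shows why a result that is consumed late sees element 'X', while a result that was copied into a list first does not.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Plain-Java illustration of a lazy iterable; none of these names are Gaffer classes. */
final class LazyIterableSketch {

    /** Returns a lazy view: the backing list is only read when iteration starts. */
    static Iterable<String> lazyGet(final List<String> store) {
        return store::iterator;
    }

    /** Consumes an iterable into a new list, i.e. the "consume and cache" step. */
    static List<String> toList(final Iterable<String> elements) {
        final List<String> copy = new ArrayList<>();
        for (final String element : elements) {
            copy.add(element);
        }
        return copy;
    }

    public static void main(final String[] args) {
        // Assumption: a simple list plays the role of the store.
        final List<String> store = new ArrayList<>(Arrays.asList("A", "B"));

        final Iterable<String> lazy = lazyGet(store);        // nothing has been read yet
        final List<String> cached = toList(lazyGet(store));  // read immediately

        store.add("X"); // added before the lazy result is consumed

        System.out.println("lazy:   " + toList(lazy)); // prints "lazy:   [A, B, X]"
        System.out.println("cached: " + cached);       // prints "cached: [A, B]"
    }
}
```

The sketch adds 'X' before iteration starts. Adding to a plain ArrayList while iterating over it would throw a ConcurrentModificationException, so it does not reproduce the duplicate problem itself, only the timing that causes it.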

Update Development Guide for new Operations

With the addition of new Operations, the development guide should be updated to list all of the tasks that may be required when adding a new Operation.

Investigate splitting up the large getting started pages

The getting started pages are getting quite big. It would be good to try and break them down into smaller pages for each section.

Add documentation for MATCHED_VERTEX and ADJACENT_MATCHED_VERTEX

This should go in the Filtering and/or Views section within the User Guide.

The documentation should explain that when applying filtering, aggregation and transformation in Views instead of selecting a property name you can select one of these fields: VERTEX, SOURCE, DESTINATION, DIRECTED, MATCHED_VERTEX, ADJACENT_MATCHED_VERTEX.

Add While example

Mark functions/predicates/operations as deprecated if they have the Deprecated annotation

When producing the examples for the different functions, predicates and operations we should check if the class is annotated with Deprecated. If it is, we should add a note to tell users that it is deprecated in the description.
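
The check described above could be done with standard Java reflection. This is a minimal sketch only: DeprecationNoteSketch, describe and the two nested classes are hypothetical stand-ins, not part of the documentation generator, and the wording of the note is an assumption.

```java
/** Sketch of a deprecation check for generated descriptions; all names are hypothetical. */
final class DeprecationNoteSketch {

    /** Stand-in for a deprecated function, predicate or operation class. */
    @Deprecated
    static final class OldOperation { }

    /** Stand-in for a class that is still supported. */
    static final class CurrentOperation { }

    /** Appends a note to the description when the class itself carries @Deprecated. */
    static String describe(final Class<?> clazz, final String description) {
        if (clazz.isAnnotationPresent(Deprecated.class)) {
            return description + " Note: " + clazz.getSimpleName() + " is deprecated.";
        }
        return description;
    }

    public static void main(final String[] args) {
        // prints "An old operation. Note: OldOperation is deprecated."
        System.out.println(describe(OldOperation.class, "An old operation."));
        // prints "A current operation."
        System.out.println(describe(CurrentOperation.class, "A current operation."));
    }
}
```

Deprecated is retained at runtime, so isAnnotationPresent can see it, but it is not an inherited annotation: a subclass of a deprecated class would not be flagged by this check.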

Add ForEach doc

Document GetGraphFrameOfElements

Examples should be written for the GetGraphFrameOfElements operation.

Document timestamp property

The Dev guide contains a note about setting the timestamp property in the schema, but doesn't say anything about what it's for. I think it just lets you specify which property is used to set the timestamp in an Accumulo key, but that doesn't achieve anything as far as the user is concerned.

Add Reduce doc

Add documentation for the addElementsFromHdfs logic updates

Setting numReduceTasks is now deprecated; set the minimum and/or maximum number of reducers instead. This means that, in most cases, Accumulo can choose the right number of reducers for the user, based on the number of tablet servers. If the minimum is greater than the number Accumulo chooses, the number of reducers is raised to the minimum; equally, if the maximum is less than the number Accumulo chooses, it is reduced to the maximum.
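
The min/max rule above amounts to clamping the number Accumulo chooses. The following plain-Java sketch shows that arithmetic only; ReducerCountSketch and chooseReducers are hypothetical names, not the handler's actual code, and a null bound is assumed to mean "not set".

```java
/** Sketch of the min/max reducer rule; hypothetical names, not the Gaffer implementation. */
final class ReducerCountSketch {

    /**
     * Clamps the number of reducers chosen by Accumulo to the optional bounds.
     * Assumption: a null bound means the user did not set it.
     */
    static int chooseReducers(final int accumuloChoice, final Integer min, final Integer max) {
        int reducers = accumuloChoice;
        if (min != null && reducers < min) {
            reducers = min; // raise to the minimum
        }
        if (max != null && reducers > max) {
            reducers = max; // lower to the maximum
        }
        return reducers;
    }

    public static void main(final String[] args) {
        System.out.println(chooseReducers(10, null, null)); // prints 10
        System.out.println(chooseReducers(10, 20, null));   // prints 20
        System.out.println(chooseReducers(10, null, 5));    // prints 5
    }
}
```

The source does not say what happens when the minimum is greater than the maximum; in this sketch the maximum wins, which is an arbitrary choice.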

Create GetSchema examples

The new GetSchema operation should have a few examples, to demonstrate usage, and mention the implications of the 'compact' boolean flag.

Enable travis ci builds

Update Gaffer version to 1.0.0-RC4

Also update koryphe version to 1.0.0 and gaffer-tools version to 1.0.0-RC4

Add documentation for Global View filters

We don't really have any documentation explaining what global filters are and how they work in a View.

We should also document using a global groupBy and global properties/excludeProperties.

Update NamedOperation example to include Score field

We should either add a new example or modify the existing examples to include the new optional "Score" field.

Map and FlatMap examples

Examples for the new Map and FlatMap operations (see gchq/Gaffer#1345)

Document GetWalks

Show operation example results in both Java and JSON

Use the same code tabs we use for showing how to create an operation to display the results of the operation. This should show what the results look like in Java (or a simple toString) and JSON.

Add REST API documentation

This could involve just including the static Swagger documentation for the REST API.

Add Functions section

This should be similar to the Predicates section.

ScoreOperationChain example

An example should be written for the ScoreOperationChain to demonstrate how it is used.

Update the way the configuration is encoded in getJavaRddOfElementsWithHadoopConf

Currently, as a workaround until 1.0.0-RC3 is released, the configuration is encoded without using:

final String encodedConf = AbstractGetRDDHandler.convertConfigurationToString(configuration);

This needs to be updated to use the above line when 1.0.0-RC3 is released.

Add an explanation of the deprecation of seedMatching to Filter docs

This just involves copying the comment from gchq/Gaffer#1798 to the Filtering page in the documentation.

Add a Testing page that documents our testing coverage

This page should detail the level of our testing, such as what combinations of different schemas have been tested with each of the different ingest mechanisms on the different Stores.

It should also describe our Integration Test Suite and the areas of the framework that are included/missed.

Add NamedView documentation

Add an explanation of undirected edges to the Schema walkthrough

This should explain that undirected edges are bidirectional and how they are aggregated. It should also mention that Gaffer will flip the edge for consistency so the source is always ordered 'less' than the destination (based on natural ordering).
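
The flipping described above can be sketched in plain Java. This is an illustration only, not Gaffer's Edge class: UndirectedEdgeSketch and key are hypothetical names, vertices are assumed to be Strings, and counting edges per key stands in for whatever aggregation the schema defines.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of undirected edge normalisation; hypothetical, not Gaffer's Edge class. */
final class UndirectedEdgeSketch {
    final String source;
    final String destination;

    /** Orders the two vertices so the source is never 'greater' than the destination. */
    UndirectedEdgeSketch(final String vertexA, final String vertexB) {
        if (vertexA.compareTo(vertexB) <= 0) {
            this.source = vertexA;
            this.destination = vertexB;
        } else {
            // Flip the edge so that natural ordering is respected.
            this.source = vertexB;
            this.destination = vertexA;
        }
    }

    /** Key used to group edges; both directions of the same edge share one key. */
    String key() {
        return source + "--" + destination;
    }

    public static void main(final String[] args) {
        // Assumption: aggregation is a simple count of edges per key.
        final Map<String, Integer> counts = new HashMap<>();
        counts.merge(new UndirectedEdgeSketch("B", "A").key(), 1, Integer::sum);
        counts.merge(new UndirectedEdgeSketch("A", "B").key(), 1, Integer::sum);
        System.out.println(counts); // prints {A--B=2}
    }
}
```

Because both edges normalise to the same source and destination, they fall into the same group and are aggregated together.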

Create Aggregate Operation Example

Examples to be written, demonstrating usage of the new Aggregate operation.

Update sketches documentation with link to performance comparison of HLLs

We should update the sketches documentation to include a link to this performance comparison of Clearspring's HLL and the DataSketches HLL (https://datasketches.apache.org/docs/HLL/Hll_vs_CS_Hllpp.html), and use it to justify recommending DataSketches over Clearspring.

Create StringContains example

This is dependent on Gaffer 1.0.0-RC3

Add Cache docs

Extract the duplicated instructions for setting up a cache (in the NamedOperations, Jobs and Federated Store sections of the Dev Guide) into a single cache section in the Dev Guide. The other sections should reference it.