MarkLogic tool for bulk loading, processing, and reporting on content.

License: Other


corb2's Introduction

CoRB is a Java tool designed for bulk content-reprocessing of documents stored in MarkLogic. CoRB stands for Content Reprocessing in Bulk and is a multi-threaded workhorse tool at your disposal. In a nutshell, CoRB works off a list of documents in a database and performs operations against those documents. CoRB operations can include generating a report across all documents, manipulating the individual documents, or a combination thereof.

User Guide

This document and the wiki provide a comprehensive overview of CoRB and the options available to customize the execution of a CoRB job, as well as the ModuleExecutor Tool, which can be used to execute a single (XQuery or JavaScript) module in MarkLogic.

For additional information, refer to the CoRB Wiki.

Downloads

Download the latest release directly from https://github.com/marklogic-community/corb2/releases or resolve dependencies through Maven Central.

Compatibility

Note: marklogic-xcc 8 is backwards compatible to MarkLogic 5 and runs on Java 1.6 or later.

Getting Help

To get help with CoRB

The entry point is the main method in the com.marklogic.developer.corb.Manager class. CoRB requires the MarkLogic XCC JAR in the classpath, preferably the version that corresponds to the MarkLogic server version, which can be downloaded from https://developer.marklogic.com/products/xcc. Use Java 1.8 or later.

CoRB needs options specified through one or more of the following mechanisms:

  1. command-line parameters
  2. Java system properties, e.g. -DXCC-CONNECTION-URI=xcc://user:password@localhost:8202
  3. a properties file on the classpath, specified using -DOPTIONS-FILE=myjob.properties. Relative and full file system paths are also supported.

If an option is specified in more than one place, a command-line parameter takes precedence over a Java system property, which takes precedence over a property from the OPTIONS-FILE properties file.
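
The precedence rule above can be sketched in a few lines of Java. This is only an illustration of the lookup order, not CoRB's own code: the class name, the resolve method, and the representation of command-line parameters as a map are assumptions made for the example.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Properties;

class OptionPrecedenceSketch {

    /**
     * Returns the value of an option, checking the command line first,
     * then Java system properties, then the OPTIONS-FILE properties.
     */
    static String resolve(String name,
                          Map<String, String> commandLine,
                          Properties systemProperties,
                          Properties optionsFile) {
        String value = commandLine.get(name);
        if (value == null) {
            value = systemProperties.getProperty(name);
        }
        if (value == null) {
            value = optionsFile.getProperty(name);
        }
        return value; // null when the option is not set anywhere
    }

    public static void main(String[] args) {
        Properties optionsFile = new Properties();
        optionsFile.setProperty("THREAD-COUNT", "4");
        optionsFile.setProperty("BATCH-SIZE", "10");

        Properties systemProperties = new Properties();
        systemProperties.setProperty("THREAD-COUNT", "8");

        // Hypothetical: command-line parameters modeled as name/value pairs.
        Map<String, String> commandLine = Collections.singletonMap("THREAD-COUNT", "16");

        System.out.println(resolve("THREAD-COUNT", commandLine, systemProperties, optionsFile)); // 16
        System.out.println(resolve("BATCH-SIZE", commandLine, systemProperties, optionsFile));   // 10
    }
}
```

In this sketch THREAD-COUNT is set in all three places and the command-line value wins, while BATCH-SIZE is only present in the properties file and is read from there.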

Note: Any or all of the properties can be specified as Java system properties or as key-value pairs in a properties file.

Note: CoRB exit codes are 0 - successful, 0 - nothing to process (configurable, ref: EXIT-CODE-NO-URIS), 1 - initialization or connection error, and 2 - execution error.

Note: CoRB now supports logging job metrics back to the MarkLogic database log and/or as a document in the database.

Options

Option Description
INIT-MODULE An XQuery or JavaScript module which, if specified, will be invoked prior to URIS-MODULE. XQuery and JavaScript modules need to have .xqy and .sjs extensions respectively.
INIT-TASK Java Task which, if specified, will be called prior to URIS-MODULE. This can be used in addition to INIT-MODULE for custom implementations.
OPTIONS-FILE A properties file containing any of the CoRB options. Relative and full file system paths are supported.
PROCESS-MODULE XQuery or JavaScript to be executed in a batch for each URI from the URIS-MODULE or URIS-FILE. Module is expected to have at least one external or global variable with name URI. XQuery and JavaScript modules need to have .xqy and .sjs extensions respectively. If returning multiple values from a JavaScript module, values must be returned as Sequence.
PROCESS-TASK Java Class that implements com.marklogic.developer.corb.Task or extends com.marklogic.developer.corb.AbstractTask. Typically, it can talk to PROCESS-MODULE and then do additional processing locally, such as saving a returned value.
  • com.marklogic.developer.corb.ExportBatchToFileTask Generates a single file, typically used for reports. Writes the data returned by the PROCESS-MODULE to a single file specified by EXPORT-FILE-NAME. All returned values from entire CoRB will be streamed into the single file. If EXPORT-FILE-NAME is not specified, CoRB uses URIS_BATCH_REF returned by URIS-MODULE as the file name.
  • com.marklogic.developer.corb.ExportToFileTask Generates multiple files. Saves the documents returned by each invocation of PROCESS-MODULE to a separate local file within EXPORT-FILE-DIR, where the file name for each document will be based on the URI.
PRE-BATCH-MODULE An XQuery or JavaScript module which, if specified, will be run before batch processing starts. XQuery and JavaScript modules need to have .xqy and .sjs extensions respectively.
PRE-BATCH-TASK Java Class that implements com.marklogic.developer.corb.Task or extends com.marklogic.developer.corb.AbstractTask. If PRE-BATCH-MODULE is also specified, the implementation is expected to invoke the XQuery and process the result if any. It can also be specified without PRE-BATCH-MODULE and an example of this is to add a static header to a report.
  • com.marklogic.developer.corb.PreBatchUpdateFileTask included - Writes the data returned by the PRE-BATCH-MODULE to EXPORT-FILE-NAME, which can particularly be used to write dynamic headers for CSV output. Also, if EXPORT-FILE-TOP-CONTENT is specified, this task will write this value to the EXPORT-FILE-NAME - this option is especially useful for writing fixed headers to reports. If EXPORT-FILE-NAME is not specified, CoRB uses URIS_BATCH_REF returned by URIS-MODULE as the file name.
    POST-BATCH-MODULE An XQuery or JavaScript module which, if specified, will be run after batch processing is completed. XQuery and JavaScript modules need to have .xqy and .sjs extensions respectively.
    POST-BATCH-TASK Java Class that implements com.marklogic.developer.corb.Task or extends com.marklogic.developer.corb.AbstractTask. If POST-BATCH-MODULE is also specified, the implementation is expected to invoke the XQuery and process the result if any. It can also be specified without POST-BATCH-MODULE and an example of this is to add static content to the bottom of the report.
    • com.marklogic.developer.corb.PostBatchUpdateFileTask included - Writes the data returned by the POST-BATCH-MODULE to EXPORT-FILE-NAME. Also, if EXPORT-FILE-BOTTOM-CONTENT is specified, this task will write this value to the EXPORT-FILE-NAME. If EXPORT-FILE-NAME is not specified, CoRB uses URIS_BATCH_REF returned by URIS-MODULE as the file name.
    THREAD-COUNT The number of worker threads. Default is 1.
    URIS-MODULE URI selector module written in XQuery or JavaScript. Expected to return a sequence containing the uris count, followed by all the uris. Optionally, it can also return an arbitrary string as a first item in this sequence - refer to URIS_BATCH_REF section below. XQuery and JavaScript modules need to have .xqy and .sjs extensions respectively. JavaScript modules must return a Sequence.
    URIS-FILE If defined instead of URIS-MODULE, URIs will be loaded from the file located on the client. There should only be one URI per line. This path may be relative or absolute. For example, a file containing a list of document identifiers can be used as a URIS-FILE and the PROCESS-MODULE can query for the document based on this document identifier.
    XCC-CONNECTION-URI Connection string to MarkLogic XDBC Server. Multiple connection strings can be specified with comma as a separator.

    Additional options

    Option Description
    BATCH-SIZE The number of URIs to be executed in a single transform. Default is 1. If more than 1, PROCESS-MODULE will receive a delimited string as the $URI variable, which needs to be tokenized to get individual URIs. The default delimiter is ;, which can be overridden with the option BATCH-URI-DELIM described below.
    Sample code for transform:
    declare variable $URI as xs:string external;
    let $all-uris := fn:tokenize($URI, ";")
    return $all-uris
    BATCH-URI-DELIM Use if the default delimiter ';' cannot be used to join multiple URIS when BATCH-SIZE is greater than 1. Default is ;.
    DECRYPTER The class name of the options value decrypter, which must implement com.marklogic.developer.corb.Decrypter. Encryptable options include XCC-CONNECTION-URI, XCC-USERNAME, XCC-PASSWORD, XCC-HOSTNAME, XCC-PORT, and XCC-DBNAME.
    COLLECTION-NAME Value of this parameter will be passed into the URIS-MODULE via external or global variable with the name URIS.
    COMMAND Pause, resume, and stop the execution of CoRB. Possible commands include: PAUSE, RESUME, and STOP. If the COMMAND-FILE is modified and either there is no COMMAND or an invalid value is specified, then execution will RESUME.
    COMMAND-FILE A properties file used to configure COMMAND and THREAD-COUNT while CoRB is running. For instance, to temporarily pause execution, or to lower the number of threads in order to throttle execution.
    COMMAND-FILE-POLL-INTERVAL The interval (in seconds) at which CoRB checks for the existence of the COMMAND-FILE. Default is 1.
    CONNECTION-POLICY Algorithm for balancing load across multiple hosts used by com.marklogic.developer.corb.DefaultContentSourcePool. Options include ROUND-ROBIN, RANDOM and LOAD. Default option is ROUND-ROBIN. LOAD option returns the ContentSource or Connection with least number of active sessions.
    CONTENT-SOURCE-POOL Class that implements com.marklogic.developer.corb.ContentSourcePool and used to manage ContentSource instances or connections. Default is com.marklogic.developer.corb.DefaultContentSourcePool.
    CONTENT-SOURCE-RENEW Boolean value indicating whether to periodically check to see if a ContentSource resolves to a different IP address and create a new ContentSource to add to the resource pool. This can help transparently deal with proxies that have dynamic pools of IP addresses. Default is true
    CONTENT-SOURCE-RENEW-INTERVAL The regular interval (seconds) in which to resolve ContentSource IP address and add to the pool. This can help when a DNS entry may return multiple IP addresses and help spread traffic among multiple endpoints. Default is 60
    DISK-QUEUE Boolean value indicating whether the CoRB job should spill to disk when a maximum number of URIs have been loaded in memory, in order to control memory consumption and avoid Out of Memory exceptions for extremely large sets of URIs.
    DISK-QUEUE-MAX-IN-MEMORY-SIZE The maximum number of URIs to hold in memory before spilling over to disk. Default is 1000.
    DISK-QUEUE-TEMP-DIR The directory where the URIs queue can write to disk when the maximum in-memory items has been exceeded. If not specified then TEMP-DIR value will be used. If neither are specified, then the default behavior is to use java.io.tmpdir.
    ERROR-FILE-NAME Used when FAIL-ON-ERROR is false. If specified, the errored URIs, along with their error messages, will be written to this file. Uses BATCH-URI-DELIM or the default ';' to separate URI and error message.
    EXIT-CODE-IGNORED-ERRORS Returns this exit code when there were errors and FAIL-ON-ERROR=false. Default is 0.
    EXIT-CODE-NO-URIS Returns this exit code when there is nothing to process. Default is 0.
    EXPORT_FILE_AS_ZIP If true, PostBatchUpdateFileTask compresses the output file as a zip file.
    EXPORT-FILE-BOTTOM-CONTENT Used by com.marklogic.developer.corb.PostBatchUpdateFileTask to append content to EXPORT-FILE-NAME after batch process is complete.
    EXPORT-FILE-DIR Export directory parameter is used by com.marklogic.developer.corb.ExportBatchToFileTask or similar custom task implementations.
    Optional: Alternatively, EXPORT-FILE-NAME can be specified with a full path.
    EXPORT-FILE-NAME Shared file to write output of com.marklogic.developer.corb.ExportBatchToFileTask - should be a file name with or without a full path.
    • EXPORT-FILE-DIR is not required if a full path is used.
    • If EXPORT-FILE-NAME is not specified, CoRB attempts to use URIS_BATCH_REF as the file name and this is especially useful in case of automated jobs where file name can only be determined by the URIS-MODULE - refer to URIS_BATCH_REF section below.
    EXPORT-FILE-PART-EXT The file extension for export files being processed. ex: .tmp - if specified, com.marklogic.developer.corb.PreBatchUpdateFileTask adds this temporary extension to the export file name to indicate EXPORT-FILE-NAME is being actively modified. To remove this temporary extension after EXPORT-FILE-NAME is complete, com.marklogic.developer.corb.PostBatchUpdateFileTask must be specified as POST-BATCH-TASK.
    EXPORT-FILE-REQUIRE-PROCESS-MODULE Boolean value indicating whether or not to require a PROCESS-MODULE when an Export*ToFile PROCESS-TASK is specified. This can help avoid confusion when the PROCESS-MODULE was accidentally not configured and no files are generated. Default is true
    EXPORT-FILE-SORT If ascending or descending, lines will be sorted. If |distinct is specified after the sort direction, duplicate lines from EXPORT-FILE-NAME will be removed. i.e. ascending|distinct or descending|distinct
    EXPORT-FILE-SORT-COMPARATOR A java class that must implement java.util.Comparator. If specified, CoRB will use this class for sorting in place of ascending or descending string comparator even if a value was specified for EXPORT-FILE-SORT.
    EXPORT-FILE-TOP-CONTENT Used by com.marklogic.developer.corb.PreBatchUpdateFileTask to insert content at the top of EXPORT-FILE-NAME before batch process starts. If it includes the string @URIS_BATCH_REF, it is replaced by the batch reference returned by URIS-MODULE.
    EXPORT-FILE-URI-TO-PATH Boolean value indicating whether to convert doc URI to a filepath. Default is true
    FAIL-ON-ERROR Boolean value indicating whether the CoRB job should fail and exit if a process module throws an error. Default is true. This option will not handle repeated connection failures.
    INSTALL Whether to install the Modules in the Modules database. Specify true or 1 for installation. Default is false.
    LOADER-BASE64-ENCODE Boolean option specifying whether the content loaded by FileUrisStreamingXMLLoader or FileUrisXMLLoader (with the option LOADER-USE-ENVELOPE=true) should be base64 encoded, or appended as the child of the /corb-loader/content element. Default is false.
    LOADER-PATH The path to the resource (file or folder) that will be the input source for a loader class that extends AbstractFileUrisLoader, such as FileUrisDirectoryLoader, FileUrisLoader, FileUrisStreamingXMLLoader, FileUrisXMLLoader, and FileUrisZipLoader.
    LOADER-SET-URIS-BATCH-REF Boolean option indicating whether a file loader should set the URIS_BATCH_REF. Default is false.
    LOADER-USE-ENVELOPE Boolean value indicating whether FileUris loaders should use an XML envelope, in order to send file metadata in addition to the file content.
    JOB-NAME Name of the current Job.
    JOB-SERVER-PORT Optional port number to start a lightweight HTTP server which can be used to monitor, change the number of threads, and pause/resume the CoRB job. The value must be a valid port number, a range of ports, or a comma-separated list of ports and ranges.
    • Ex: 9080
    • Ex: 9080,9083,9087
    • Ex: 9080-9090
    • Ex: 9080-9083,9085-9090
    The job server will bind to a port from the configured port number(s). By default, if the JOB-SERVER-PORT option is not specified, a job server is not started.

    When a port is specified and available, the job server URL will be logged to the console with both the UI http://<host>:<port> and metrics URL http://<host>:<port>/metrics. (grep for string com.marklogic.developer.corb.JobServer logUsage)

    The metrics URL supports the following parameters:

    • COMMAND=pause (or resume).
    • CONCISE=true limits the amount of data returned
    • FORMAT=json (or xml) returns job stats in the requested format
    • THREAD-COUNT=<#> will adjust the number of threads for the executing job
    MAX-OPTS-FROM-MODULE Maximum number of custom inputs from the URIS-MODULE to other modules. Default is 10.
    METADATA The variable name that needs to be defined in the server side query to use the metadata set by the URIS-LOADER.
    METADATA-TO-PROCESS-MODULE If this option is set to true, XML-METADATA is set as an external variable with name METADATA to PROCESS-MODULE as well. Default is false.
    METRICS-COLLECTIONS Adds the metrics document to the specified collection.
    METRICS-DATABASE Uses the value provided to save the metrics document to the specified Database. The XCC connection specified should have the following privilege http://marklogic.com/xdmp/privileges/xdmp-invoke
    METRICS-LOG-LEVEL String value indicating the log level that the CoRB job should use to log metrics to ML Server Error log. Possible values are none, emergency, alert, critical, error, warning, notice, info, config, debug, fine, finer, finest. Default is none, which means metrics are not logged.
    METRICS-MODULE XQuery or JavaScript to be executed at the end of the CoRB Job to save the metrics document to the database. There is an XQuery module (save-metrics.xqy) and a JavaScript module (saveMetrics.sjs) provided. You can use these modules as a template to customize the metrics document saved to the database. XQuery and JavaScript modules need to have .xqy and .sjs extensions respectively.
    METRICS-NUM-FAILED-TRANSACTIONS Maximum number of failed transactions to be logged in the metrics. Default is 0.
    METRICS-NUM-SLOW-TRANSACTIONS Maximum number of slow transactions to be logged in the metrics. Default is 0.
    METRICS-ROOT Uses the value provided as the URI root for saving the metrics document.
    METRICS-SYNC-FREQUENCY Frequency (in seconds) at which the metrics document needs to be updated in the database. By default the metrics document is not periodically updated and is only written once at the end of the job.
    MODULE-ROOT Default is /.
    MODULES-DATABASE Uses the XCC-CONNECTION-URI if not provided; use 0 for file system.
    NUM-TPS-FOR-ETC Number of recent transactions per second (tps) values used to calculate estimated completion time (ETC). Default is 10.
    POST-BATCH-MINIMUM-COUNT The minimum number of results that must be returned for the POST-BATCH-MODULE or POST-BATCH-TASK to be executed. Default is 1.
    PRE-POST-BATCH-ALWAYS-EXECUTE Boolean value indicating whether the PRE-BATCH and POST-BATCH modules or tasks should be executed without evaluating how many URIs were returned by the URI selector.
    PRE-BATCH-MINIMUM-COUNT The minimum number of results that must be returned for the PRE-BATCH-MODULE or PRE-BATCH-TASK to be executed. Default is 1.
    QUERY-RETRY-LIMIT Number of re-query attempts before giving up. Default is 2.
    QUERY-RETRY-INTERVAL Time interval, in seconds, between re-query attempts. Default is 20.
    QUERY-RETRY-ERROR-CODES A comma separated list of MarkLogic error codes for which a QueryException should be retried.
    QUERY-RETRY-ERROR-MESSAGE A comma separated list of values that if contained in an exception message a QueryException should be retried.
    SSL-CONFIG-CLASS A java class that must implement com.marklogic.developer.corb.SSLConfig. If not specified, CoRB defaults to com.marklogic.developer.corb.TrustAnyoneSSLConfig for xccs connections.
    URIS-LOADER Java class that implements com.marklogic.developer.corb.UrisLoader. A custom class to load URIs instead of built-in loaders for URIS-MODULE or URIS-FILE options. Example: com.marklogic.developer.corb.FileUrisXMLLoader
    URIS-REDACTED Optional boolean flag indicating whether URIs should be excluded from logging, console, and JobStats metrics. Default is false.
    URIS-REPLACE-PATTERN One or more replace patterns for URIs - Used by java to truncate the length of URIs on the client side, typically to reduce java heap size in very large batch jobs, as the CoRB java client holds all the URIS in memory while processing is in progress. If truncated, PROCESS-MODULE needs to reconstruct the URI before trying to do fn:doc() to fetch the document.
    Usage: URIS-REPLACE-PATTERN=pattern1,replace1,pattern2,replace2,...
    Example:
    URIS-REPLACE-PATTERN=/com/marklogic/sample/,,.xml, - Replace /com/marklogic/sample/ and .xml with empty strings. So, CoRB client only needs to cache the id '1234' instead of the entire URI /com/marklogic/sample/1234.xml. In the transform PROCESS-MODULE, we need to do let $URI := fn:concat("/com/marklogic/sample/",$URI,".xml")
    XCC-CONNECTION-RETRY-LIMIT Number of attempts to connect to ML before giving up. Default is 3.
    XCC-CONNECTION-RETRY-INTERVAL Time interval, in seconds, between retry attempts. Default is 60.
    XCC-CONNECTION-HOST-RETRY-LIMIT Number of attempts to connect to ML before giving up on a host. If not specified, it defaults to XCC-CONNECTION-RETRY-LIMIT.
    XCC-DBNAME (Optional) Name of the content database to execute against
    XCC-HOSTNAME Required if XCC-CONNECTION-URI is not specified. Multiple hosts can be specified with comma as a separator.
    XCC-HTTPCOMPLIANT Optional boolean flag to indicate whether to enable HTTP 1.1 compliance in XCC. If this option is set, the xcc.httpcompliant System property will be set. Default is true.
    XCC-PASSWORD Required if XCC-CONNECTION-URI is not specified.
    XCC-PORT Required if XCC-CONNECTION-URI is not specified.
    XCC-PROTOCOL (Optional) Used if XCC-CONNECTION-URI is not specified. The XCC scheme to use; either xcc or xccs. Default is xcc.
    XCC-TIME-ZONE The ID for the TimeZone that should be set on XCC RequestOption. When a value is specified, it is parsed using TimeZone.getTimeZone() and set on XCC RequestOption for each Task. Invalid ID values will produce the GMT TimeZone. If not specified, XCC uses the JVM default TimeZone.
    XCC-URL-ENCODE-COMPONENTS Indicate whether or not the XCC connection string components should be URL encoded. Possible values are always, never, and auto. Default is auto.
    XCC-USERNAME Required if XCC-CONNECTION-URI is not specified.
    XML-FILE In order to use this option a class com.marklogic.developer.corb.FileUrisXMLLoader has to be specified in the URIS-LOADER option. If defined instead of URIS-MODULE, XML nodes will be used as URIs from the file located on the client. The file path may be relative or absolute. Default processing will select all of the child elements of the document element (i.e. /*/*). The XML-NODE option can be specified with an XPath to address a different set of nodes.
    XML-METADATA An XPath to address the node that contains metadata portion of the XML. This must be different from the XML-NODE. The metadata is set as an external variable with name METADATA to PRE-BATCH-MODULE and POST-BATCH-MODULE and also PROCESS-MODULE if enabled by METADATA-TO-PROCESS-MODULE.
    XML-NODE An XPath to address the nodes to be returned in an XML-FILE by the com.marklogic.developer.corb.FileUrisXMLLoader. For example, a file containing a list of nodes wrapped by a parent element can be used as a XML-FILE and the PROCESS-MODULE can unquote the URI string as node to do further processing with the node. If not specified, the default behavior is to select the child elements of the document element (i.e. /*/*)
    XML-SCHEMA Path to a W3C XML Schema to be used by com.marklogic.developer.corb.FileUrisStreamingXMLLoader or com.marklogic.developer.corb.FileUrisXMLLoader to validate an XML-FILE, and used by com.marklogic.developer.corb.SchemaValidateBatchToFileTask and com.marklogic.developer.corb.SchemaValidateToFileTask post-process tasks to validate documents returned from a process module.
    XML-SCHEMA-HONOUR-ALL-SCHEMALOCATIONS Boolean value indicating whether to set the feature http://apache.org/xml/features/honour-all-schemaLocations. Default is true
    XML-TEMP-DIR Temporary directory used by com.marklogic.developer.corb.FileUrisStreamingXMLLoader to store files extracted from the XML-FILE. If not specified, TEMP-DIR value will be used. If neither are specified, then the default Java temp directory will be used.
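
To make the URIS-REPLACE-PATTERN option above more concrete, here is a minimal Java sketch of applying pattern/replacement pairs to a URI on the client side. It is not CoRB's implementation: the class and method names are hypothetical, and the sketch assumes the patterns are treated as literal text, which the option description does not state either way.

```java
class UrisReplacePatternSketch {

    /** Applies pattern,replacement pairs from a URIS-REPLACE-PATTERN style value to a URI. */
    static String shorten(String uri, String replacePattern) {
        // A limit of -1 keeps trailing empty strings, so "a,,b," yields four parts.
        String[] parts = replacePattern.split(",", -1);
        if (parts.length % 2 != 0) {
            throw new IllegalArgumentException("Expected pattern,replacement pairs: " + replacePattern);
        }
        String result = uri;
        for (int i = 0; i < parts.length; i += 2) {
            if (parts[i].isEmpty()) {
                throw new IllegalArgumentException("Empty pattern in: " + replacePattern);
            }
            // Assumption: patterns are literal text, so no regular expressions are used here.
            result = result.replace(parts[i], parts[i + 1]);
        }
        return result;
    }

    public static void main(String[] args) {
        String option = "/com/marklogic/sample/,,.xml,";
        System.out.println(shorten("/com/marklogic/sample/1234.xml", option)); // 1234
    }
}
```

With the example value from the option description, only the id 1234 has to be kept in memory, and the PROCESS-MODULE rebuilds the full URI before fetching the document.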

    URIS_BATCH_REF

    If a module, including those specified by PRE-BATCH-MODULE, PROCESS-MODULE or POST-BATCH-MODULE, has an external or global variable named URIS_BATCH_REF, the variable will be set to the first non-numeric item in the sequence returned by URIS-MODULE. This means that, when used, the URIS-MODULE must return a sequence with the special string value first, then the URI count, then the sequence of URIs to process.

    As an example, a batch ref can be a link/id of a document that manages the status of the batch job, where pre-batch module updates the status to start and post-batch module can set it to complete. This example can be used to manage status and errors in automated batch jobs.

    ExportBatchToFileTask, PreBatchUpdateFileTask and PostBatchUpdateFileTask use URIS_BATCH_REF as the file name if EXPORT-FILE-NAME is not specified. This is useful for automated jobs where the name of the output file can be determined only by the URIS-MODULE.
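
The file-name fallback described above can be sketched as follows. This is a minimal illustration rather than the logic inside the CoRB tasks: the class and method names are hypothetical, and the rule that EXPORT-FILE-DIR is skipped when the name is already a full path is taken from the EXPORT-FILE-NAME option description.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

class ExportFileNameSketch {

    /** Picks EXPORT-FILE-NAME when set, otherwise URIS_BATCH_REF, and applies EXPORT-FILE-DIR. */
    static Path resolve(String exportFileDir, String exportFileName, String urisBatchRef) {
        String name = (exportFileName != null && !exportFileName.isEmpty())
                ? exportFileName
                : urisBatchRef;
        if (name == null || name.isEmpty()) {
            throw new IllegalStateException("Neither EXPORT-FILE-NAME nor URIS_BATCH_REF is available");
        }
        Path file = Paths.get(name);
        // A full path makes EXPORT-FILE-DIR unnecessary.
        if (file.isAbsolute() || exportFileDir == null || exportFileDir.isEmpty()) {
            return file;
        }
        return Paths.get(exportFileDir).resolve(file);
    }

    public static void main(String[] args) {
        // No EXPORT-FILE-NAME, so the batch reference becomes the file name.
        System.out.println(resolve("reports", null, "batch-42.csv"));
        // EXPORT-FILE-NAME wins over the batch reference when both are present.
        System.out.println(resolve("reports", "out.csv", "batch-42.csv"));
    }
}
```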

    URIS_TOTAL_COUNT

    Total count of uris is set as an external variable to PRE-BATCH-MODULE and POST-BATCH-MODULE (since 2.4.5)

    Any property specified with one of the prefixes INIT-MODULE, URIS-MODULE, PRE-BATCH-MODULE, PROCESS-MODULE, or POST-BATCH-MODULE (followed by a '.') will be set as an external variable in the corresponding XQuery module, provided that variable is declared as an external string variable in the module. For JavaScript modules, the variables need to be defined as global variables.

    Custom Input Examples:

    • URIS-MODULE.maxLimit=1000 Expects an external string variable maxLimit in URIS-MODULE XQuery or global variable for JavaScript.
    • PROCESS-MODULE.startDate=2015-01-01 Expects an external string variable startDate in PROCESS-MODULE XQuery or global variable for JavaScript.
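
As an illustration of the prefix convention above, the following Java sketch collects the custom inputs for one module from a set of options. The class and method names are made up for the example, and the sketch only shows the naming rule, not how CoRB passes the values to MarkLogic.

```java
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

class CustomInputsSketch {

    /** Returns variable name to value for every option named "<moduleOption>.<variable>". */
    static Map<String, String> externalVariables(String moduleOption, Properties options) {
        String prefix = moduleOption + ".";
        Map<String, String> variables = new TreeMap<>();
        for (String name : options.stringPropertyNames()) {
            if (name.startsWith(prefix) && name.length() > prefix.length()) {
                variables.put(name.substring(prefix.length()), options.getProperty(name));
            }
        }
        return variables;
    }

    public static void main(String[] args) {
        Properties options = new Properties();
        options.setProperty("PROCESS-MODULE", "transform.xqy"); // hypothetical module name
        options.setProperty("PROCESS-MODULE.startDate", "2015-01-01");
        options.setProperty("URIS-MODULE.maxLimit", "1000");

        System.out.println(externalVariables("PROCESS-MODULE", options)); // {startDate=2015-01-01}
        System.out.println(externalVariables("URIS-MODULE", options));    // {maxLimit=1000}
    }
}
```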

    Alternatively, URIS-MODULE can pass custom inputs to PRE-BATCH-MODULE, PROCESS-MODULE, POST-BATCH-MODULE by returning one or more of the property values in the above format before the count of URIs. If the URIS-MODULE needs URIS_BATCH_REF (above) as well, it needs to be just before the URIs count.

    Custom Input From URIS-MODULE Example

    let $uris := cts:uris()
    return ("PROCESS-MODULE.foo=bar", "POST-BATCH-MODULE.alpha=10", fn:count($uris), $uris)
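
The layout of the sequence returned by URIS-MODULE (optional custom inputs, optional batch reference, URI count, then the URIs) can be sketched as a small parser. This is a reading of the rules described in this document, not CoRB's own parsing code; the class name, field names, and the regular expression used to recognise custom inputs are assumptions.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

class UrisModuleResultSketch {

    // Assumption: a custom input looks like "<MODULE-OPTION>.<name>=<value>".
    private static final Pattern CUSTOM_INPUT =
            Pattern.compile("(PRE-BATCH-MODULE|PROCESS-MODULE|POST-BATCH-MODULE)\\.[^=]+=.*");

    final Map<String, String> customInputs = new LinkedHashMap<>();
    final List<String> uris = new ArrayList<>();
    String batchRef; // stays null when the module returns no batch reference
    long count;

    static UrisModuleResultSketch parse(List<String> items) {
        UrisModuleResultSketch result = new UrisModuleResultSketch();
        int i = 0;
        // Everything before the first purely numeric item is a custom input or the batch reference.
        while (i < items.size() && !items.get(i).matches("\\d+")) {
            String item = items.get(i);
            if (CUSTOM_INPUT.matcher(item).matches()) {
                int eq = item.indexOf('=');
                result.customInputs.put(item.substring(0, eq), item.substring(eq + 1));
            } else {
                result.batchRef = item;
            }
            i++;
        }
        if (i == items.size()) {
            throw new IllegalArgumentException("No URI count found");
        }
        result.count = Long.parseLong(items.get(i));
        result.uris.addAll(items.subList(i + 1, items.size()));
        return result;
    }
}
```

For a returned sequence such as ("PROCESS-MODULE.foo=bar", "batch-ref", 2, "/a.xml", "/b.xml"), this sketch yields one custom input, the batch reference "batch-ref", a count of 2, and two URIs.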

    Appending |ADHOC to the name or path of an XQuery module (with .xqy extension) or JavaScript module (with .sjs or .js extension) will cause the module to be read from the file system and executed in MarkLogic without being uploaded to the Modules database. This simplifies running CoRB jobs by not requiring deployment of any code to MarkLogic, and makes the set of CoRB files and configuration more self-contained.

    INIT-MODULE, URIS-MODULE, PROCESS-MODULE, PRE-BATCH-MODULE and POST-BATCH-MODULE can be specified adhoc by adding the suffix |ADHOC for XQuery or JavaScript (with .sjs or .js extension) at the end. Adhoc XQuery or JavaScript remains local to the CoRB and is not deployed to MarkLogic. The XQuery or JavaScript module should be in its named file and that file should be available on the file system, including being on the java classpath for CoRB.

    Adhoc Examples:
    • PRE-BATCH-MODULE=adhoc-pre-batch.xqy|ADHOC adhoc-pre-batch.xqy must be on the classpath or in the current directory.
    • PROCESS-MODULE=/path/to/file/adhoc-transform-module.xqy|ADHOC XQuery module file with full path in the file system.
    • URIS-MODULE=adhoc-uris.sjs|ADHOC Adhoc JavaScript module in the classpath or current directory.

    Inline Adhoc Modules

    It is also possible to set a module option with inline code blocks, rather than a file path. This can be done by prepending either INLINE-XQUERY| or INLINE-JAVASCRIPT| to the option value, followed by the XQuery or JavaScript code to execute. Inline code blocks are executed as "adhoc" modules and are not uploaded to the Modules database. The |ADHOC suffix is optional for inline code blocks.

    Inline Adhoc Example
    URIS-MODULE=INLINE-XQUERY|xquery version '1.0-ml'; let $uris := cts:uris('', 'document', cts:collection-query('foo')) return (count($uris), $uris)
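
The three ways of specifying a module (deployed module, |ADHOC file, inline code block) can be told apart from the option value alone. The following Java sketch shows one way to classify a value using the prefixes and suffix described above; the class, enum, and method names are hypothetical and this is not CoRB's own parsing code.

```java
class ModuleOptionSketch {

    enum Kind { INLINE_XQUERY, INLINE_JAVASCRIPT, ADHOC_FILE, DEPLOYED_MODULE }

    private static final String ADHOC_SUFFIX = "|ADHOC";
    private static final String INLINE_XQUERY_PREFIX = "INLINE-XQUERY|";
    private static final String INLINE_JAVASCRIPT_PREFIX = "INLINE-JAVASCRIPT|";

    final Kind kind;
    final String value; // inline code for INLINE_* kinds, otherwise the module name or path

    private ModuleOptionSketch(Kind kind, String value) {
        this.kind = kind;
        this.value = value;
    }

    static ModuleOptionSketch parse(String optionValue) {
        String v = optionValue.trim();
        if (v.startsWith(INLINE_XQUERY_PREFIX)) {
            return new ModuleOptionSketch(Kind.INLINE_XQUERY,
                    stripAdhoc(v.substring(INLINE_XQUERY_PREFIX.length())));
        }
        if (v.startsWith(INLINE_JAVASCRIPT_PREFIX)) {
            return new ModuleOptionSketch(Kind.INLINE_JAVASCRIPT,
                    stripAdhoc(v.substring(INLINE_JAVASCRIPT_PREFIX.length())));
        }
        if (v.endsWith(ADHOC_SUFFIX)) {
            return new ModuleOptionSketch(Kind.ADHOC_FILE, stripAdhoc(v));
        }
        return new ModuleOptionSketch(Kind.DEPLOYED_MODULE, v);
    }

    // The |ADHOC suffix is optional for inline code, so it is removed only when present.
    private static String stripAdhoc(String s) {
        return s.endsWith(ADHOC_SUFFIX) ? s.substring(0, s.length() - ADHOC_SUFFIX.length()) : s;
    }
}
```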

    JavaScript Modules

    JavaScript modules are supported and can be used in place of an XQuery module. However, if returning multiple values (ex: URIS-MODULE), values must be returned as a Sequence. The MarkLogic JavaScript API has helper functions to convert Arrays into a Sequence (Sequence.from()) and to insert values into another Sequence (fn.insertBefore()).

    A JavaScript module must have an .sjs file extension when deployed to the Modules database. However, adhoc JavaScript modules support both .sjs and .js file extensions.

    For example, a simple URIS-MODULE may look like this:

    let uris = cts.uris();
    fn.insertBefore(uris, 0, fn.count(uris));

    To return URIS_BATCH_REF, we can do the following:

    fn.insertBefore(fn.insertBefore(uris, 0, fn.count(uris)), 0, "batch-ref")

    Note: Do not use single quotes within (adhoc) JavaScript modules. If you must use a single quote, escape it by doubling it (ex: ''text'').

    It is often required to protect the database connection string or password from unauthorized access. So, CoRB optionally supports encryption of the entire XCC URL or any parts of the XCC URL (if individually specified), such as XCC-PASSWORD.

    Option Description
    DECRYPTER Must implement com.marklogic.developer.corb.Decrypter. Encryptable options include XCC-CONNECTION-URI, XCC-USERNAME, XCC-PASSWORD, XCC-HOSTNAME, XCC-PORT, and XCC-DBNAME
    • com.marklogic.developer.corb.PrivateKeyDecrypter (Included) Requires private key file
    • com.marklogic.developer.corb.JasyptDecrypter (Included) Requires jasypt-*.jar in classpath
    • com.marklogic.developer.corb.HostKeyDecrypter (Included) Requires Java Cryptography Extension (JCE) Unlimited Strength Jurisdiction Policy Files
    PRIVATE-KEY-FILE Required property for PrivateKeyDecrypter. This file should be accessible in the classpath or on the file system
    PRIVATE-KEY-ALGORITHM (Optional)
    • Default algorithm for PrivateKeyDecrypter is RSA.
    • Default algorithm for JasyptDecrypter is PBEWithMD5AndTripleDES
      JASYPT-PROPERTIES-FILE (Optional) Property file for the JasyptDecrypter. If not specified, it uses the default jasypt.properties file, which should be accessible in the classpath or file system.

      com.marklogic.developer.corb.PrivateKeyDecrypter

      PrivateKeyDecrypter automatically detects if the text is encrypted. Unencrypted text or clear text is returned as-is. Although not required, encrypted text can be optionally enclosed with "ENC" ex: ENC(xxxxxx) to clearly indicate that it is encrypted.

      Generate keys and encrypt XCC URL or password using one of the options below.

      Java Crypt

      • Use the PrivateKeyDecrypter class inside the CoRB JAR with the gen-keys option to generate a key.
        java -cp "/path/to/lib/*" com.marklogic.developer.corb.PrivateKeyDecrypter gen-keys /path/to/private.key /path/to/public.key RSA 1024

      Note: if not specified, default algorithm: RSA, default key-length: 1024

      • Use the PrivateKeyDecrypter class inside the CoRB JAR with the encrypt option to encrypt the clear text such as an xcc URL or password.
        java -cp "/path/to/lib/*" com.marklogic.developer.corb.PrivateKeyDecrypter encrypt /path/to/public.key clearText RSA

      Note: if not specified, default algorithm: RSA

      RSA keys

      • openssl genrsa -out private.pem 1024 Generate a private key in PEM format
      • openssl pkcs8 -topk8 -nocrypt -in private.pem -out private.pkcs8.key Create a PRIVATE-KEY-FILE in PKCS8 standard for java
      • openssl rsa -in private.pem -pubout > public.key Extract public key
      • echo "uri or password" | openssl rsautl -encrypt -pubin -inkey public.key | base64 Encrypt URI or password. Optionally, the encrypted text can be enclosed with "ENC" ex: ENC(xxxxxx)
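
For readers who want to see the public-key round trip in code, here is a small self-contained Java sketch: encrypt with the public key, Base64-encode, wrap in ENC(...), then decrypt with the private key. It uses only the standard JDK crypto APIs and is not the PrivateKeyDecrypter source. The class and method names are hypothetical, the 2048-bit key size is a choice made for the example, and treating only ENC(...) values as encrypted is a simplification of the automatic detection described above.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.PrivateKey;
import java.security.PublicKey;
import java.util.Base64;
import javax.crypto.Cipher;

class RsaOptionEncryptionSketch {

    // PKCS#1 v1.5 padding, the default padding of "openssl rsautl -encrypt".
    private static final String TRANSFORMATION = "RSA/ECB/PKCS1Padding";

    static KeyPair generateKeyPair(int bits) {
        try {
            KeyPairGenerator generator = KeyPairGenerator.getInstance("RSA");
            generator.initialize(bits);
            return generator.generateKeyPair();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Encrypts clear text with the public key and returns it as ENC(base64). */
    static String encrypt(String clearText, PublicKey publicKey) {
        try {
            Cipher cipher = Cipher.getInstance(TRANSFORMATION);
            cipher.init(Cipher.ENCRYPT_MODE, publicKey);
            byte[] encrypted = cipher.doFinal(clearText.getBytes(StandardCharsets.UTF_8));
            return "ENC(" + Base64.getEncoder().encodeToString(encrypted) + ")";
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Decrypts ENC(base64) values; anything else is returned as-is (simplified detection). */
    static String decrypt(String value, PrivateKey privateKey) {
        if (!(value.startsWith("ENC(") && value.endsWith(")"))) {
            return value;
        }
        try {
            byte[] encrypted = Base64.getDecoder().decode(value.substring(4, value.length() - 1));
            Cipher cipher = Cipher.getInstance(TRANSFORMATION);
            cipher.init(Cipher.DECRYPT_MODE, privateKey);
            return new String(cipher.doFinal(encrypted), StandardCharsets.UTF_8);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        KeyPair keys = generateKeyPair(2048);
        String encrypted = encrypt("xcc://user:password@localhost:8202", keys.getPublic());
        System.out.println(encrypted);
        System.out.println(decrypt(encrypted, keys.getPrivate()));
    }
}
```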

      ssh-keygen

      • ssh-keygen (for example, save the key as id_rsa after selecting a passphrase)
      • openssl pkcs8 -topk8 -nocrypt -in id_rsa -out id_rsa.pkcs8.key (asks for passphrase)
      • openssl rsa -in id_rsa -pubout > public.key (asks for passphrase)
      • echo "password or uri" | openssl rsautl -encrypt -pubin -inkey public.key | base64

      com.marklogic.developer.corb.JasyptDecrypter

      JasyptDecrypter automatically detects if the text is encrypted. Unencrypted text or clear text is returned as-is. Although not required, encrypted text can be optionally enclosed with "ENC" ex: ENC(xxxxxx) to clearly indicate that it is encrypted.

      Encrypt the URI or password as below. It is assumed that the jasypt distribution is available on your machine.

      jasypt-1.9.2/bin/encrypt.sh input="uri or password" password="passphrase" algorithm="algorithm" (ex: PBEWithMD5AndTripleDES or PBEWithMD5AndDES)

      jasypt.properties file

      jasypt.algorithm=PBEWithMD5AndTripleDES #(If not specified, default is PBEWithMD5AndTripleDES)
      jasypt.password=passphrase

      com.marklogic.developer.corb.HostKeyDecrypter

      HostKeyDecrypter uses internal server identifiers to generate a private key unique to the host server. It then uses that private key as input to the AES-256 encryption algorithm. Due to the use of AES-256, it requires the JCE Unlimited Strength Jurisdiction Policy Files.

      Note: certain server identifiers may change when drivers are installed or the underlying hardware changes. In such cases, passwords will need to be regenerated. Encrypted passwords will always be unique to the server they are generated on.

      Encrypt the password as follows:
      java -cp /path/to/lib/* com.marklogic.developer.corb.HostKeyDecrypter encrypt clearText

      To test if server is properly configured to use the HostKeyDecrypter:
      java -cp /path/to/lib/* com.marklogic.developer.corb.HostKeyDecrypter test
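To illustrate why a value encrypted this way is tied to one server, here is a minimal sketch that derives a 256-bit AES key from a host identifier and uses it to encrypt and decrypt a password. Everything in it is an assumption made for illustration: the class HostKeySketch, the SHA-256 key derivation, the AES/GCM mode, and the plain string standing in for the server identifiers. None of it describes how HostKeyDecrypter actually works internally.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.security.SecureRandom;
import java.util.Arrays;
import java.util.Base64;
import javax.crypto.Cipher;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical helper for illustration only; not part of CoRB.
class HostKeySketch {
    private static final int IV_LENGTH = 12;   // bytes, usual size for GCM
    private static final int TAG_BITS = 128;   // authentication tag length

    // Derive a 256-bit AES key by hashing the host identifier (assumed derivation).
    static SecretKeySpec deriveKey(String hostIdentifier) throws GeneralSecurityException {
        byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest(hostIdentifier.getBytes(StandardCharsets.UTF_8));
        return new SecretKeySpec(digest, "AES");
    }

    // Encrypt and return Base64(iv + cipherText).
    static String encrypt(String hostIdentifier, String clearText) {
        try {
            byte[] iv = new byte[IV_LENGTH];
            new SecureRandom().nextBytes(iv);
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.ENCRYPT_MODE, deriveKey(hostIdentifier), new GCMParameterSpec(TAG_BITS, iv));
            byte[] cipherText = cipher.doFinal(clearText.getBytes(StandardCharsets.UTF_8));
            byte[] out = new byte[iv.length + cipherText.length];
            System.arraycopy(iv, 0, out, 0, iv.length);
            System.arraycopy(cipherText, 0, out, iv.length, cipherText.length);
            return Base64.getEncoder().encodeToString(out);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("Encryption failed", e);
        }
    }

    // Decrypt a value produced by encrypt(); fails if the host identifier differs.
    static String decrypt(String hostIdentifier, String encoded) {
        try {
            byte[] in = Base64.getDecoder().decode(encoded);
            byte[] iv = Arrays.copyOfRange(in, 0, IV_LENGTH);
            byte[] cipherText = Arrays.copyOfRange(in, IV_LENGTH, in.length);
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.DECRYPT_MODE, deriveKey(hostIdentifier), new GCMParameterSpec(TAG_BITS, iv));
            return new String(cipher.doFinal(cipherText), StandardCharsets.UTF_8);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("Decryption failed", e);
        }
    }

    public static void main(String[] args) {
        String encrypted = encrypt("host-a-identifier", "xcc-password");
        System.out.println(decrypt("host-a-identifier", encrypted));
    }
}
```

Decrypting with a different identifier fails, which mirrors the note above: if the identifiers a key is derived from change, previously encrypted passwords have to be regenerated.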

      SSL Support

      CoRB provides support for SSL over XCC. As a prerequisite to enabling CoRB SSL support, the XDBC server must be configured to use SSL. The XCC-CONNECTION-URI property must be specified with a protocol of 'xccs'. To select a particular SSL configuration, use the following property:

      Option Description
      SSL-CONFIG-CLASS Must implement com.marklogic.developer.corb.SSLConfig
      • com.marklogic.developer.corb.TrustAnyoneSSLConfig (Included)
      • com.marklogic.developer.corb.TwoWaySSLConfig (Included) supports 2-way SSL

      com.marklogic.developer.corb.TrustAnyoneSSLConfig

      TrustAnyoneSSLConfig is the default implementation of the SSLContext. It will accept any certificate presented by the MarkLogic server.

      com.marklogic.developer.corb.TwoWaySSLConfig

      TwoWaySSLConfig is a more complete and configurable implementation of the SSLContext. It supports SSL with mutual authentication. It is configurable via the following properties:

      Option Description
      SSL-PROPERTIES-FILE (Optional) A properties file that can be used to load a common SSL configuration.
      SSL-KEYSTORE Location of the keystore certificate.
      SSL-KEYSTORE-PASSWORD (Encryptable) Password of the keystore file.
      SSL-KEY-PASSWORD (Encryptable) Password of the private key.
      SSL-KEYSTORE-TYPE Type of the keystore such as 'JKS' or 'PKCS12'.
      SSL-ENABLED-PROTOCOLS (Optional) A comma or colon separated list of acceptable SSL protocols, in priority order. Default is TLSv1.2.
      SSL-CIPHER-SUITES A comma or colon separated list of acceptable cipher suites used.
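As a rough illustration of how options like these map onto the standard JSSE APIs, the sketch below loads a keystore, builds an SSLContext from it, and applies a comma or colon separated protocol list. It is a sketch under stated assumptions, not the TwoWaySSLConfig source: the class SslOptionsSketch and its method names are hypothetical, and passing a null stream to get an empty keystore is only a convenience for trying it out.

```java
import java.io.IOException;
import java.io.InputStream;
import java.security.GeneralSecurityException;
import java.security.KeyStore;
import java.util.ArrayList;
import java.util.List;
import javax.net.ssl.KeyManagerFactory;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLEngine;

// Hypothetical helper for illustration only; not part of CoRB.
class SslOptionsSketch {

    // Split a comma or colon separated option value into trimmed, non-empty tokens.
    static String[] splitList(String value) {
        List<String> tokens = new ArrayList<>();
        if (value != null) {
            for (String token : value.split("[,:]")) {
                if (!token.trim().isEmpty()) {
                    tokens.add(token.trim());
                }
            }
        }
        return tokens.toArray(new String[0]);
    }

    // Load a keystore of the given type (e.g. JKS or PKCS12); a null stream yields an empty keystore.
    static KeyStore loadKeyStore(String type, InputStream in, char[] password) {
        try {
            KeyStore keyStore = KeyStore.getInstance(type);
            keyStore.load(in, password);
            return keyStore;
        } catch (GeneralSecurityException | IOException e) {
            throw new IllegalStateException("Unable to load keystore", e);
        }
    }

    // Build an SSLEngine that presents keys from the keystore and enables the listed protocols.
    static SSLEngine createEngine(KeyStore keyStore, char[] keyPassword, String enabledProtocols) {
        try {
            KeyManagerFactory kmf = KeyManagerFactory.getInstance(KeyManagerFactory.getDefaultAlgorithm());
            kmf.init(keyStore, keyPassword);
            SSLContext context = SSLContext.getInstance("TLSv1.2");
            context.init(kmf.getKeyManagers(), null, null); // default trust managers and randomness
            SSLEngine engine = context.createSSLEngine();
            String[] protocols = splitList(enabledProtocols);
            // Fall back to TLSv1.2 when no protocol list is given, matching the documented default.
            engine.setEnabledProtocols(protocols.length > 0 ? protocols : new String[] {"TLSv1.2"});
            return engine;
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("Unable to create SSL engine", e);
        }
    }

    public static void main(String[] args) {
        KeyStore keyStore = loadKeyStore("PKCS12", null, new char[0]);
        SSLEngine engine = createEngine(keyStore, new char[0], "TLSv1.2");
        System.out.println(String.join(",", engine.getEnabledProtocols()));
    }
}
```

In real use the keystore would be read from the file named by SSL-KEYSTORE, for example via a FileInputStream, and a cipher suite list could be applied in the same way with SSLEngine.setEnabledCipherSuites.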

      Load Balancing and Failover with Multiple Hosts

      CoRB 2.4+ supports load balancing and failover using com.marklogic.developer.corb.ContentSourcePool. This is automatically enabled when multiple comma-separated values (supports encryption) are specified for XCC-CONNECTION-URI or XCC-HOSTNAME.

      XCC-CONNECTION-URI=xcc://hostname1:8000/dbname,xcc://hostname2:8000/dbname,..

      OR

      XCC-HOSTNAME=hostname1,hostname2,..

      The default implementation of com.marklogic.developer.corb.ContentSourcePool is com.marklogic.developer.corb.DefaultContentSourcePool. It supports the following CONNECTION-POLICY options for allocating connections to callers:

      • ROUND-ROBIN - (Default) Connections are allocated using a round-robin algorithm.
      • RANDOM - Connections are randomly allocated.
      • LOAD - The host with the fewest active connections is allocated to the caller.
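The three policies can be sketched as a small host selector. This is not the DefaultContentSourcePool code; the class ConnectionPolicySketch, its Policy enum (with an underscore in ROUND_ROBIN because hyphens are not valid in Java identifiers), and the acquire/release methods are hypothetical names used only to show the allocation idea.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper for illustration only; not part of CoRB.
class ConnectionPolicySketch {
    enum Policy { ROUND_ROBIN, RANDOM, LOAD }

    private final List<String> hosts;
    private final AtomicInteger[] active;              // active connection count per host
    private final AtomicInteger next = new AtomicInteger();
    private final Random random = new Random();

    ConnectionPolicySketch(List<String> hosts) {
        this.hosts = new ArrayList<>(hosts);
        this.active = new AtomicInteger[hosts.size()];
        for (int i = 0; i < active.length; i++) {
            active[i] = new AtomicInteger();
        }
    }

    // Choose the index of the host to use under the given policy.
    private int pick(Policy policy) {
        switch (policy) {
            case RANDOM:
                return random.nextInt(hosts.size());
            case LOAD:
                int best = 0;
                for (int i = 1; i < hosts.size(); i++) {
                    if (active[i].get() < active[best].get()) {
                        best = i;
                    }
                }
                return best;
            default: // ROUND_ROBIN
                return Math.floorMod(next.getAndIncrement(), hosts.size());
        }
    }

    // Allocate a connection on a host and record it as active.
    String acquire(Policy policy) {
        int index = pick(policy);
        active[index].incrementAndGet();
        return hosts.get(index);
    }

    // Mark a connection on the given host as returned.
    void release(String host) {
        active[hosts.indexOf(host)].decrementAndGet();
    }

    public static void main(String[] args) {
        ConnectionPolicySketch pool = new ConnectionPolicySketch(Arrays.asList("hostname1", "hostname2"));
        System.out.println(pool.acquire(Policy.ROUND_ROBIN));
        System.out.println(pool.acquire(Policy.LOAD));
    }
}
```

LOAD only helps when connections are released promptly, since the counts are the only signal it has about how busy each host is.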

      Query and Connection Retries

      CoRB automatically retries requests for a given URI when it encounters com.marklogic.xcc.exceptions.ServerConnectionException from MarkLogic. If necessary, the number of retry attempts can be configured using XCC-CONNECTION-RETRY-LIMIT. If multiple hosts are specified, retries per host can optionally be configured using XCC-CONNECTION-HOST-RETRY-LIMIT. CoRB waits at least XCC-CONNECTION-RETRY-INTERVAL seconds before retrying a connection on a failed host.

      CoRB also supports retrying requests that failed due to query errors. This feature is intended only for sporadic query errors that are not specific to a particular URI, such as occasional timeout exceptions when MarkLogic is too busy and the request time limit is low. The queries that can be retried are configured using QUERY-RETRY-ERROR-CODES or QUERY-RETRY-ERROR-MESSAGE (when error codes are not available). If necessary, the number of query retry attempts can be configured using QUERY-RETRY-LIMIT. CoRB waits at least QUERY-RETRY-INTERVAL seconds before retrying a query.

      QUERY-RETRY-ERROR-CODES=XDMP-EXTIME,SVC-EXTIME
      QUERY-RETRY-ERROR-MESSAGE=ErrorMsg1,ErrorMsg2
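The retry behaviour can be pictured with a small sketch: parse the configured lists, decide whether a failure matches, and re-run the query up to a limit with a pause in between. The class QueryRetrySketch, the QueryFailure exception, and substring matching on messages are all assumptions made for illustration; the interval is taken in milliseconds here only to keep the example quick to run, whereas the options above are expressed in seconds.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical helper for illustration only; not part of CoRB.
class QueryRetrySketch {

    // Stand-in for a query error that carries an error code and a message.
    static class QueryFailure extends RuntimeException {
        final String code;

        QueryFailure(String code, String message) {
            super(message);
            this.code = code;
        }
    }

    // Parse a comma-separated option value such as "XDMP-EXTIME,SVC-EXTIME".
    static List<String> parseList(String value) {
        List<String> items = new ArrayList<>();
        if (value != null) {
            for (String item : value.split(",")) {
                if (!item.trim().isEmpty()) {
                    items.add(item.trim());
                }
            }
        }
        return items;
    }

    // A failure is retryable if its code is listed or its message contains a listed fragment.
    static boolean isRetryable(String code, String message, List<String> codes, List<String> messages) {
        if (code != null && codes.contains(code)) {
            return true;
        }
        if (message != null) {
            for (String fragment : messages) {
                if (message.contains(fragment)) {
                    return true;
                }
            }
        }
        return false;
    }

    // Run the query, retrying matching failures up to retryLimit times.
    static <T> T runWithRetries(Supplier<T> query, List<String> codes, List<String> messages,
                                int retryLimit, long retryIntervalMillis) {
        int attempt = 0;
        while (true) {
            try {
                return query.get();
            } catch (QueryFailure e) {
                if (attempt >= retryLimit || !isRetryable(e.code, e.getMessage(), codes, messages)) {
                    throw e;
                }
                attempt++;
                try {
                    Thread.sleep(retryIntervalMillis);
                } catch (InterruptedException interrupted) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("Interrupted while waiting to retry", interrupted);
                }
            }
        }
    }

    public static void main(String[] args) {
        List<String> codes = parseList("XDMP-EXTIME,SVC-EXTIME");
        List<String> messages = parseList("ErrorMsg1,ErrorMsg2");
        System.out.println(isRetryable("SVC-EXTIME", null, codes, messages));
    }
}
```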

      Refer to the wiki for examples of how to execute a CoRB job and various ways of configuring the job options.

      A CoRB job with two or more stages (a selector and a transform) isn't always necessary to get the job done. Sometimes only a single query needs to be executed and its output captured to a file, or executed with no output captured at all. In these cases, the ModuleExecutor Tool can be used to quickly and efficiently execute your XQuery or JavaScript files.

      corb2's Issues

      Change Monitor threadPool task message logging level from SEVERE to WARNING

      Users have encountered this condition and the SEVERE level seemed to indicate that there was an immediate issue. The ThreadPool methods employed are not guaranteed to produce accurate results, and may be a false positive. Therefore, the level should be reduced to WARNING. Additionally, include the numbers in the message to indicate what the discrepancy is, and consider a deferred check following this.

      https://github.com/marklogic/corb2/blob/master/src/main/java/com/marklogic/developer/corb/Monitor.java#L118

      if (completed >= taskCount) {
          if (pool.getActiveCount() > 0 || (pool.getTaskCount() - pool.getCompletedTaskCount()) > 0) {
              LOG.log(SEVERE, "Thread pool is still active with all the tasks completed and received. We shouldn't see this message.");
          }
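One possible shape for the requested change is sketched below: the same check, logged at WARNING and with the counts included in the message. The class MonitorLogSketch, its method names, and the message wording are hypothetical and do not come from Monitor.java; the deferred re-check mentioned above is not shown.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical sketch of the proposed logging change; not the actual Monitor code.
class MonitorLogSketch {
    private static final Logger LOG = Logger.getLogger(MonitorLogSketch.class.getName());

    // Build a message that exposes the numbers behind the discrepancy.
    static String describe(int active, long submitted, long completed) {
        return String.format(
                "Thread pool may still be active after all tasks were received: "
                        + "active=%d, submitted=%d, completed=%d, outstanding=%d",
                active, submitted, completed, submitted - completed);
    }

    // Log at WARNING rather than SEVERE, since the pool counters are only approximate.
    static boolean warnIfStillActive(ThreadPoolExecutor pool) {
        int active = pool.getActiveCount();
        long submitted = pool.getTaskCount();
        long completed = pool.getCompletedTaskCount();
        if (active > 0 || submitted - completed > 0) {
            LOG.log(Level.WARNING, describe(active, submitted, completed));
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.SECONDS, new LinkedBlockingQueue<Runnable>());
        System.out.println(warnIfStillActive(pool));
        pool.shutdown();
    }
}
```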
      

      Publish corb2 to bintray

      A few ML libraries are in bintray - https://bintray.com/search?query=marklogic - and I've followed suit - https://bintray.com/rjrudin/maven .

      The benefit of storing corb2 is there's a standard way to pull down corb2 in Maven or Gradle via the jcenter repository. In Gradle, that just means including the following

      repositories { jcenter() }
      

      Here's an example of using bintray - https://github.com/rjrudin/ml-app-deployer/blob/master/build.gradle#L51

      Note that you need an account. I am happy to use mine to publish corb2 if so desired, but you can setup an account easily by logging in with your Github account.

      Corb2 needs to print underlying exception when disk queue write fails.

      We had a production job failure today with the exception "SEVERE: Error writing to DiskQueue backing store". The DiskQueue.offer() method isn't printing the underlying exception, so we had no idea what went wrong.

      We need to examine similar instances in the code and try to print underlying exception or at least provide some information about what went wrong.

      Support ingestion of large file into multiple files with custom transform along with an option to create a metadata file.

      Similar to how we do the processing, but we may benefit from having a new manager class. We can start with xml files at first, but we need to keep the implementations separate from core code. Also, provide an option to specify collections and permissions (?).

      1. Use stream source with xpath eval to calculate number of nodes. (uris)
      2. Use stream source with eval (?) to read the metadata and pass to query for transformation (pre-batch)
      3. Use stax read nodes and submit to concurrent executor as we read them i.e., do not hold them in memory. (transform)
        Note: See if we can improve CoRB uri loading (?)
      4. Add an option to call a query with optional parameters (post batch)
      5. We need to add an ability to pass variables from pre-batch to other modules (even for regular corb manager, developers are requesting).

      We can discuss this further and hash out the details. This originated as a request from the client, and we probably need to have it ready in a month or so. A number of people have also requested adding ingestion to the corb workflow. We still need to keep corb lightweight and avoid compile-time dependencies on external jars.

      Corb 2.3.1 is throwing IllegalStateException when ArrayQueue capacity is full.

      If we receive more URIs than the count says, we should probably log a warning or error and continue instead of exiting with an error. I think we used to do that before the ArrayQueue. Not a critical issue (for now) since it is a user error, but we need to think about it. Maybe we can override the add method as well.

      at java.util.AbstractQueue.add(AbstractQueue.java:98)
      at com.marklogic.developer.corb.QueryUrisLoader.open(QueryUrisLoader.java:173)
      at com.marklogic.developer.corb.Manager.populateQueue(Manager.java:711)
      at com.marklogic.developer.corb.Manager.run(Manager.java:495)
      at com.marklogic.developer.corb.Manager.main(Manager.java:169)

      Note: This was reported by a developer at the client site.
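The stack trace points at AbstractQueue.add, which throws IllegalStateException when a bounded queue is full, whereas offer simply returns false. A minimal sketch of the "warn and continue" idea follows, using java.util.concurrent.ArrayBlockingQueue as a stand-in for CoRB's own queue; the class QueueOverflowSketch and the method enqueueAll are hypothetical.

```java
import java.util.Arrays;
import java.util.Queue;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.logging.Logger;

// Hypothetical sketch; ArrayBlockingQueue stands in for the queue used by CoRB.
class QueueOverflowSketch {
    private static final Logger LOG = Logger.getLogger(QueueOverflowSketch.class.getName());

    // Enqueue URIs with offer() so a full queue is reported instead of throwing.
    static int enqueueAll(Queue<String> queue, Iterable<String> uris) {
        int rejected = 0;
        for (String uri : uris) {
            if (!queue.offer(uri)) {
                rejected++;
            }
        }
        if (rejected > 0) {
            LOG.warning("Received " + rejected + " more URIs than the queue capacity allows");
        }
        return rejected;
    }

    public static void main(String[] args) {
        Queue<String> queue = new ArrayBlockingQueue<>(2);
        int rejected = enqueueAll(queue, Arrays.asList("/a.xml", "/b.xml", "/c.xml"));
        System.out.println(queue.size() + " queued, " + rejected + " rejected");
    }
}
```

Whether silently skipping the extra URIs is acceptable is a separate decision; the sketch only shows how to avoid the exception.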

      Update to Java 8, drop support for Java 7 and 6

      Java 7 has been end of life since April, 2015 and Java 6 since February, 2013.

      The following updates can/should be made:

      • Update the build.gradle and pom.xml to compile to Java 1.8
      • Update code to leverage newer Java features, such as try-with-resources
      • Update externalsortinginjava to latest 2.x (requires Java 8)
      • Enable sonarqube sonar-scanner (requires Java 8)

      Provide an easier way to extend ModuleExecutor and customize handling of ResultSequence

      The current implementation has a private writeToFile() method invoked from the public run() method. If someone wanted to extend the ModuleExecutor and customize the handling of the ResultSequence, they would need to re-implement the entire run() method.

      It would be easier/better to expose a method, such as processResultSequence(ResultSequence seq) that an extending class could override, rather than duplicating the implementation of the run() method.
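The suggestion is essentially the template method pattern. In the sketch below, a List<String> stands in for the XCC ResultSequence so the example stays self-contained, and all class and method names (ExecutorSketch, CollectingExecutor, execute, processResult) are hypothetical rather than taken from ModuleExecutor.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical base class: run() stays fixed, result handling is overridable.
class ExecutorSketch {

    // Stand-in for invoking the module and obtaining its results.
    protected List<String> execute() {
        return Arrays.asList("first result", "second result");
    }

    // The fixed workflow: execute, then hand the results to the overridable hook.
    public final void run() {
        processResult(execute());
    }

    // Default handling; a subclass can override this without re-implementing run().
    protected void processResult(List<String> items) {
        for (String item : items) {
            System.out.println(item);
        }
    }
}

// Example subclass that collects results in memory instead of printing them.
class CollectingExecutor extends ExecutorSketch {
    final List<String> collected = new ArrayList<>();

    @Override
    protected void processResult(List<String> items) {
        collected.addAll(items);
    }
}
```

A caller would create a CollectingExecutor, call run(), and then read its collected list.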

      CORB execution does not work with xcc.httpcompliant=true

      I have an Amazon ELB setup in front of a cluster of ML nodes. MLCP and mlGradle tasks work fine deploying to this load balancer as long as I set the xcc.httpcompliant flag to true. When I attempt the same when I execute com.marklogic.developer.corb.Manager, I get a BAD_REQUEST error.

      SEVERE: Error while running CORB
      com.marklogic.developer.corb.CorbException: While invoking Uris Module
      	at com.marklogic.developer.corb.QueryUrisLoader.open(QueryUrisLoader.java:180)
      	at com.marklogic.developer.corb.Manager.populateQueue(Manager.java:710)
      	at com.marklogic.developer.corb.Manager.run(Manager.java:494)
      	at com.marklogic.developer.corb.Manager.main(Manager.java:143)
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:601)
      	at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
      Caused by: com.marklogic.xcc.exceptions.ServerResponseException: Module spawn request rejected (400, BAD_REQUEST). Is this an XDBC server?
       [Session: user=admin, cb={default} [ContentSource: user=admin, cb={none} [provider: address=internal-ml-cluster-irx-dev-lb-1031654958.us-east-1.elb.amazonaws.com/10.239.12.224:8050, pool=0/64]]]
       [Client: XCC/8.0-5]
      	at com.marklogic.xcc.impl.handlers.NotFoundCodeHandler.handleResponse(NotFoundCodeHandler.java:39)
      	at com.marklogic.xcc.impl.handlers.EvalRequestController.serverDialog(EvalRequestController.java:96)
      	at com.marklogic.xcc.impl.handlers.AbstractRequestController.runRequest(AbstractRequestController.java:88)
      	at com.marklogic.xcc.impl.SessionImpl.submitRequestInternal(SessionImpl.java:437)
      	at com.marklogic.xcc.impl.SessionImpl.submitRequest(SessionImpl.java:432)
      	at com.marklogic.developer.corb.QueryUrisLoader.open(QueryUrisLoader.java:130)
      	... 8 more
      

      Is this compliance mode flag being ignored? I would see similar errors using MLCP if the flag was not set.

      Add printing of (better) help/usage to console

      If you run Corb2 via Roxy (with the latest PR) but don't provide any arguments, you get:

      $ ./ml local corb
      Jul 15, 2016 9:17:50 AM com.marklogic.developer.corb.Manager main
      SEVERE: Error initializing CORB
      java.lang.NullPointerException: PROCESS-TASK or PROCESS-MODULE must be specified
          at com.marklogic.developer.corb.Manager.initOptions(Manager.java:336)
          at com.marklogic.developer.corb.Manager.init(Manager.java:204)
          at com.marklogic.developer.corb.AbstractManager.init(AbstractManager.java:173)
          at com.marklogic.developer.corb.Manager.main(Manager.java:162)
      

      Showing java.lang.NullPointerException looks particularly odd. Maybe take MLCP as an example?

      $ ./ml local mlcp
      
      usage: mlcp COMMAND [ARGS]
      
      Available commands:
        IMPORT  import data to a MarkLogic database
        EXPORT  export data from a MarkLogic database
        COPY    copy data from one MarkLogic database to another
        EXTRACT extract data from MarkLogic forests
        HELP    list available commands
        VERSION print the version
      
      $ ./ml local mlcp import
      
      16/07/15 09:23:29 ERROR contentpump.ContentPump: Error parsing command arguments: 
      16/07/15 09:23:29 ERROR contentpump.ContentPump: Missing required option: input_file_path
      usage: IMPORT [-aggregate_record_element <QName>]
             [-aggregate_record_namespace <namespace>] [-aggregate_uri_id
             <QName>] [-archive_metadata_optional <true,false>] [-batch_size
             <number>] [-collection_filter <String>] [-content_encoding
             <encoding>] [-copy_collections <true,false>] [-copy_permissions
             <true,false>] [-copy_properties <true,false>] [-copy_quality
             <true,false>] [-data_type <data type>] [-database <database>]
             [-delimited_root_name <root name>] [-delimited_uri_id <column
             name>] [-delimiter <delimiter>] [-directory_filter <String>]
             [-document_type <type>] [-fastload <true,false>]
             [-filename_as_collection <true,false>] [-generate_uri <true,
             false>] [-hadoop_conf_dir <directory>] -host <host>
             [-input_compressed <true,false>] [-input_compression_codec <codec>]
             -input_file_path <path> [-input_file_pattern <regex pattern>]
             [-input_file_type <type>] [-max_split_size <number>]
             [-min_split_size <number>] [-mode <mode>] [-namespace <namespace>]
             [-output_cleandir <true,false>] [-output_collections <collections>]
             [-output_directory <directory>] [-output_graph <graph>]
             [-output_language <language>] [-output_override_graph <graph>]
             [-output_partition <partition name>] [-output_permissions
             <permissions>] [-output_quality <quality>] [-output_uri_prefix
             <prefix>] [-output_uri_replace <list>] [-output_uri_suffix
             <suffix>] [-password <password>] [-port <port>]
             [-sequencefile_key_class <class name>] [-sequencefile_value_class
             <class name>] [-sequencefile_value_type <value type>] [-split_input
             <true,false>] [-streaming <true,false>] [-temporal_collection
             <String>] [-thread_count <count>] [-thread_count_per_split <count>]
             [-tolerate_errors <true,false>] [-transaction_size <number>]
             [-transform_function <String>] [-transform_module <String>]
             [-transform_namespace <String>] [-transform_param <String>]
             [-type_filter <String>] [-uri_id <uri name>] [-username <username>]
             [-xml_repair_level <level>]
       -aggregate_record_element <QName>         Element name in which each
                                                 document is found
       -aggregate_record_namespace <namespace>   Element namespace in which each
                                                 document is found
       -aggregate_uri_id <QName>                 Deprecated. Name of the first
                                                 element or attribute within a
                                                 record element to be used as
                                                 document URI. If omitted, a
                                                 sequence id will be generated
                                                 to  form the document URI.
       -archive_metadata_optional <true,false>   Whether to allow empty metadata
                                                 when importing archive
      ...
      

      Estimate time providing unrealistic estimates

      In certain cases, because of changes in the URIs being processed or because of changes in server performance, a CoRB job reports an inaccurate estimated time of completion (ETC).
      For example:
      INFO: completed 1050801/1500000, 263 tps(avg), 19 tps(cur), ETC 00:28:25, 8 active threads.
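The numbers in this log line show the problem. With 449199 URIs remaining, the reported ETC is close to what the average rate of 263 tps alone would predict, while the current rate of 19 tps points to several hours. The sketch below does that arithmetic; the class EtcSketch and its methods are hypothetical, and it makes no claim about how CoRB computes its estimate.

```java
// Hypothetical helper for illustration only; not part of CoRB.
class EtcSketch {

    // Whole seconds needed to finish the remaining work at the given rate, or -1 if the rate is unusable.
    static long secondsRemaining(long completed, long total, double tps) {
        if (tps <= 0) {
            return -1;
        }
        return (long) ((total - completed) / tps);
    }

    // Format a duration in seconds as HH:MM:SS.
    static String format(long seconds) {
        return String.format("%02d:%02d:%02d", seconds / 3600, (seconds % 3600) / 60, seconds % 60);
    }

    public static void main(String[] args) {
        long completed = 1050801;
        long total = 1500000;
        System.out.println("ETC at average rate: " + format(secondsRemaining(completed, total, 263)));
        System.out.println("ETC at current rate: " + format(secondsRemaining(completed, total, 19)));
    }
}
```

At 263 tps the remaining work takes about 28 minutes; at 19 tps it takes more than six and a half hours, so an estimate driven by the long-run average can be far too optimistic once throughput drops.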

      Connect to multiple hosts

      I don't think this is possible yet - I'd like to be able to specify multiple hosts to connect to (same port/username/password/database for all of them) and have corb2 round-robin requests to those hosts.

      In the past, I've had the benefit of a load-balancer in front of the hosts, but I've run into environments where that's not an option.

      I'm thinking of something like:

      XCC-HOSTS=host1,host2,host3
      XCC-USERNAME=
      XCC-PASSWORD=
      XCC-PORT=
      XCC-DATABASE=

      PROCESS-MODULE ignored

      my options file:

      MODULE-ROOT=/corb/
      PROCESS-MODULE=getContractorUris.sjs
      THREAD-COUNT=4
      URIS-MODULE=addProp.sjs
      XCC-USERNAME=admin
      XCC-PASSWORD=admin
      XCC-HOSTNAME=ea4-ml1
      XCC-PORT=8030

      my corb invoke (using latest roxy):

      ./ml local corb -DOPTIONS-FILE=./corb-config/test.properties
      /System/Library/Frameworks/Ruby.framework/Versions/2.0/usr/lib/ruby/2.0.0/universal-darwin16/rbconfig.rb:213: warning: Insecure world writable dir /Users/smitrovi/apps/mlcp-8.0-5/bin in PATH, mode 040777
      logging to CONSOLE
      Mar 13, 2017 10:53:36 AM com.marklogic.developer.SimpleLogger configureLogger
      INFO: setting up logging for: com.marklogic.ps
      Mar 13, 2017 10:53:36 AM com.marklogic.developer.corb.Manager loadPropertiesFile
      INFO: Loading ./corb-config/test.properties from filesystem
      Exception in thread "main" java.lang.NullPointerException: PROCESS-TASK or XQUERY-MODULE must be specified
      	at com.marklogic.developer.corb.Manager.createManager(Manager.java:371)
      	at com.marklogic.developer.corb.Manager.main(Manager.java:182)
      

      If I change PROCESS-MODULE to XQUERY-MODULE, then the error no longer appears.

      Provide ability to log metrics associated with query executions to ErrorLog

      Metrics (e.g. total execution time) for the selector and transform could be written to the ErrorLog. These could be implemented as trace events. Having this capability built into CoRB is desirable primarily from an operations standpoint:

      1. ability to monitor which jobs are running when
      2. visibility into how metrics associated with those jobs have changed over time (e.g. are the jobs speeding up after a change/upgrade to the infrastructure)

      Provide scripts to automate the creation of test database and XDBC server for integration and performance tests

      The current set of integration tests expect an XCC server to be available. In order to be able to run the integration tests (executed by the standard gradle build), a user would need to first create an XDBC server and database.

      Leverage ml-gradle to automatically create and tear down the XDBC server, database, and users needed for the execution of integration and performance tests.

      Invalid TLS padding data

      I was asked to file this as a Corb issue by Support. We are using:
      MarkLogic version – 8.0-5.2
      XCC Version – 8.0-4.2
      CORB2 Version – 2.2.0

      com.marklogic.developer.corb.CorbException: While invoking Uris Module
      at com.marklogic.developer.corb.QueryUrisLoader.open(QueryUrisLoader.java:152)
      at com.marklogic.developer.corb.Manager.populateQueue(Manager.java:560)
      at com.marklogic.developer.corb.Manager.run(Manager.java:346)
      at com.marklogic.developer.corb.extension.RestartManager.main(RestartManager.java:65)
      Caused by: com.marklogic.xcc.exceptions.ServerConnectionException: Error parsing HTTP headers: Invalid TLS padding data
      [Session: user=s020499, cb={default} [ContentSource: user=xxxxxxx, cb={none} [provider: SSLconn address=xxxx.xxxx.xxxxx.xxxx.com/xx.xx.xx.xx:443, pool=2/64]]]
      [Client: XCC/8.0-1]
      at com.marklogic.xcc.impl.handlers.AbstractRequestController.runRequest(AbstractRequestController.java:125)
      at com.marklogic.xcc.impl.SessionImpl.submitRequestInternal(SessionImpl.java:437)
      at com.marklogic.xcc.impl.SessionImpl.submitRequest(SessionImpl.java:432)
      at com.marklogic.developer.corb.QueryUrisLoader.open(QueryUrisLoader.java:131)
      ... 3 more
      Caused by: java.io.IOException: Error parsing HTTP headers: Invalid TLS padding data
      at sun.security.ssl.Alerts.getSSLException(Alerts.java:192)
      at sun.security.ssl.SSLEngineImpl.fatal(SSLEngineImpl.java:1714)
      at sun.security.ssl.SSLEngineImpl.readRecord(SSLEngineImpl.java:968)
      at sun.security.ssl.SSLEngineImpl.readNetRecord(SSLEngineImpl.java:893)
      at sun.security.ssl.SSLEngineImpl.unwrap(SSLEngineImpl.java:767)
      at javax.net.ssl.SSLEngine.unwrap(SSLEngine.java:624)
      at com.marklogic.io.SslByteChannel.unwrapNetData(SslByteChannel.java:449)
      at com.marklogic.io.SslByteChannel.fillBufferFromEngineInsideHandshake(SslByteChannel.java:229)
      at com.marklogic.io.SslByteChannel.readInsideHandshake(SslByteChannel.java:213)
      at com.marklogic.io.SslByteChannel.handleHandshake(SslByteChannel.java:408)
      at com.marklogic.io.SslByteChannel.fillBufferFromEngine(SslByteChannel.java:198)
      at com.marklogic.io.SslByteChannel.read(SslByteChannel.java:143)
      at com.marklogic.http.HttpChannel$ChannelInputStream.timedRead(HttpChannel.java:501)
      at com.marklogic.http.HttpChannel$ChannelInputStream.fillBuffer(HttpChannel.java:489)
      at com.marklogic.http.HttpChannel$ChannelInputStream.read(HttpChannel.java:448)
      at com.marklogic.http.HttpChannel$ChannelInputStream.read(HttpChannel.java:468)
      at com.marklogic.http.HttpHeaders.nextHeaderLine(HttpHeaders.java:313)
      at com.marklogic.http.HttpHeaders.parseResponseHeaders(HttpHeaders.java:272)
      at com.marklogic.http.HttpChannel.parseHeaders(HttpChannel.java:321)
      at com.marklogic.http.HttpChannel.receiveMode(HttpChannel.java:294)
      at com.marklogic.http.HttpChannel.getResponseCode(HttpChannel.java:193)
      at com.marklogic.xcc.impl.handlers.EvalRequestController.serverDialog(EvalRequestController.java:76)
      at com.marklogic.xcc.impl.handlers.AbstractRequestController.runRequest(AbstractRequestController.java:87)
      ... 6 more
      Caused by: javax.net.ssl.SSLHandshakeException: Invalid TLS padding data
      ... 29 more
      Caused by: javax.crypto.BadPaddingException: Invalid TLS padding data
      at sun.security.ssl.CipherBox.removePadding(CipherBox.java:692)
      at sun.security.ssl.CipherBox.decrypt(CipherBox.java:423)
      at sun.security.ssl.InputRecord.decrypt(InputRecord.java:154)
      at sun.security.ssl.EngineInputRecord.decrypt(EngineInputRecord.java:192)
      at sun.security.ssl.SSLEngineImpl.readRecord(SSLEngineImpl.java:962)
      ... 26 more

      Update AbstractTask class to handle retries for all QueryExceptions

      Some of the ML exceptions, especially SVC-EXTIME, are not thrown as retryable exceptions even though they can be retried. So, we might need to modify AbstractTask to retry, at least once, the URIs that ended up with a QueryException. We might want to set the default query retries to '1', unless overridden with a config option. It would be great if we could get this implemented sooner rather than later.

      Remove compile warnings

      When I imported corb2 into Eclipse (Luna), I got 132 compiler warnings. Most are easy fixes: unused imports and unused variables. But there are also a number of resource leaks, though most are in the test cases.

      I've pasted them below. I can put together a PR for this if others feel these are worth removing.

      Description Resource Path Location Type
      BlockingQueue is a raw type. References to generic type BlockingQueue should be parameterized ManagerTest.java /corb2/src/test/java/com/marklogic/developer/corb line 484 Java Problem
      BlockingQueue is a raw type. References to generic type BlockingQueue should be parameterized ManagerTest.java /corb2/src/test/java/com/marklogic/developer/corb line 507 Java Problem
      Comparator is a raw type. References to generic type Comparator should be parameterized PostBatchUpdateFileTask.java /corb2/src/main/java/com/marklogic/developer/corb line 78 Java Problem
      Comparator is a raw type. References to generic type Comparator should be parameterized PostBatchUpdateFileTask.java /corb2/src/main/java/com/marklogic/developer/corb line 105 Java Problem
      List is a raw type. References to generic type List should be parameterized StringUtilsTest.java /corb2/src/test/java/com/marklogic/developer/corb/util line 99 Java Problem
      List is a raw type. References to generic type List should be parameterized StringUtilsTest.java /corb2/src/test/java/com/marklogic/developer/corb/util line 107 Java Problem
      Map.Entry is a raw type. References to generic type Map<K,V>.Entry<K,V> should be parameterized Manager.java /corb2/src/main/java/com/marklogic/developer/corb line 292 Java Problem
      Resource leak: '' is never closed TestUtils.java /corb2/src/test/java/com/marklogic/developer/corb line 79 Java Problem
      Resource leak: 'br' is not closed at this location HostKeyDecrypter.java /corb2/src/main/java/com/marklogic/developer/corb line 113 Java Problem
      Resource leak: 'br' is not closed at this location HostKeyDecrypter.java /corb2/src/main/java/com/marklogic/developer/corb line 113 Java Problem
      Resource leak: 'br' is not closed at this location HostKeyDecrypter.java /corb2/src/main/java/com/marklogic/developer/corb line 142 Java Problem
      Resource leak: 'br' is not closed at this location HostKeyDecrypter.java /corb2/src/main/java/com/marklogic/developer/corb line 142 Java Problem
      Resource leak: 'instance' is never closed FileUrisLoaderTest.java /corb2/src/test/java/com/marklogic/developer/corb line 69 Java Problem
      Resource leak: 'instance' is never closed FileUrisLoaderTest.java /corb2/src/test/java/com/marklogic/developer/corb line 78 Java Problem
      Resource leak: 'instance' is never closed FileUrisLoaderTest.java /corb2/src/test/java/com/marklogic/developer/corb line 90 Java Problem
      Resource leak: 'instance' is never closed FileUrisLoaderTest.java /corb2/src/test/java/com/marklogic/developer/corb line 102 Java Problem
      Resource leak: 'instance' is never closed FileUrisLoaderTest.java /corb2/src/test/java/com/marklogic/developer/corb line 111 Java Problem
      Resource leak: 'instance' is never closed FileUrisLoaderTest.java /corb2/src/test/java/com/marklogic/developer/corb line 123 Java Problem
      Resource leak: 'instance' is never closed FileUrisLoaderTest.java /corb2/src/test/java/com/marklogic/developer/corb line 132 Java Problem
      Resource leak: 'instance' is never closed FileUrisLoaderTest.java /corb2/src/test/java/com/marklogic/developer/corb line 177 Java Problem
      Resource leak: 'instance' is never closed FileUrisLoaderTest.java /corb2/src/test/java/com/marklogic/developer/corb line 191 Java Problem
      Resource leak: 'instance' is never closed FileUrisLoaderTest.java /corb2/src/test/java/com/marklogic/developer/corb line 201 Java Problem
      Resource leak: 'instance' is never closed FileUrisLoaderTest.java /corb2/src/test/java/com/marklogic/developer/corb line 208 Java Problem
      Resource leak: 'instance' is never closed FileUrisLoaderTest.java /corb2/src/test/java/com/marklogic/developer/corb line 222 Java Problem
      Resource leak: 'instance' is never closed FileUrisLoaderTest.java /corb2/src/test/java/com/marklogic/developer/corb line 229 Java Problem
      Resource leak: 'instance' is never closed FileUrisLoaderTest.java /corb2/src/test/java/com/marklogic/developer/corb line 248 Java Problem
      Resource leak: 'instance' is never closed FileUrisLoaderTest.java /corb2/src/test/java/com/marklogic/developer/corb line 264 Java Problem
      Resource leak: 'instance' is never closed FileUrisLoaderTest.java /corb2/src/test/java/com/marklogic/developer/corb line 299 Java Problem
      Resource leak: 'instance' is never closed QueryUrisLoaderTest.java /corb2/src/test/java/com/marklogic/developer/corb line 73 Java Problem
      Resource leak: 'instance' is never closed QueryUrisLoaderTest.java /corb2/src/test/java/com/marklogic/developer/corb line 85 Java Problem
      Resource leak: 'instance' is never closed QueryUrisLoaderTest.java /corb2/src/test/java/com/marklogic/developer/corb line 99 Java Problem
      Resource leak: 'instance' is never closed QueryUrisLoaderTest.java /corb2/src/test/java/com/marklogic/developer/corb line 130 Java Problem
      Resource leak: 'instance' is never closed QueryUrisLoaderTest.java /corb2/src/test/java/com/marklogic/developer/corb line 144 Java Problem
      Resource leak: 'instance' is never closed QueryUrisLoaderTest.java /corb2/src/test/java/com/marklogic/developer/corb line 185 Java Problem
      Resource leak: 'instance' is never closed QueryUrisLoaderTest.java /corb2/src/test/java/com/marklogic/developer/corb line 198 Java Problem
      Resource leak: 'instance' is never closed QueryUrisLoaderTest.java /corb2/src/test/java/com/marklogic/developer/corb line 212 Java Problem
      Resource leak: 'instance' is never closed QueryUrisLoaderTest.java /corb2/src/test/java/com/marklogic/developer/corb line 230 Java Problem
      Resource leak: 'instance' is never closed QueryUrisLoaderTest.java /corb2/src/test/java/com/marklogic/developer/corb line 251 Java Problem
      Resource leak: 'instance' is never closed QueryUrisLoaderTest.java /corb2/src/test/java/com/marklogic/developer/corb line 262 Java Problem
      Resource leak: 'instance' is never closed QueryUrisLoaderTest.java /corb2/src/test/java/com/marklogic/developer/corb line 273 Java Problem
      Resource leak: 'instance' is never closed QueryUrisLoaderTest.java /corb2/src/test/java/com/marklogic/developer/corb line 283 Java Problem
      Resource leak: 'instance' is never closed QueryUrisLoaderTest.java /corb2/src/test/java/com/marklogic/developer/corb line 294 Java Problem
      Resource leak: 'instance' is never closed QueryUrisLoaderTest.java /corb2/src/test/java/com/marklogic/developer/corb line 310 Java Problem
      Resource leak: 'instance' is never closed QueryUrisLoaderTest.java /corb2/src/test/java/com/marklogic/developer/corb line 344 Java Problem
      Resource leak: 'instance' is never closed QueryUrisLoaderTest.java /corb2/src/test/java/com/marklogic/developer/corb line 367 Java Problem
      Resource leak: 'instance' is never closed QueryUrisLoaderTest.java /corb2/src/test/java/com/marklogic/developer/corb line 376 Java Problem
      Resource leak: 'instance' is never closed QueryUrisLoaderTest.java /corb2/src/test/java/com/marklogic/developer/corb line 387 Java Problem
      Resource leak: 'sc' is never closed HostKeyDecrypter.java /corb2/src/main/java/com/marklogic/developer/corb line 76 Java Problem
      Resource leak: 'writer' is never closed QueryUrisLoaderTest.java /corb2/src/test/java/com/marklogic/developer/corb line 179 Java Problem
      The import java.io.BufferedReader is never used StringUtilsTest.java /corb2/src/test/java/com/marklogic/developer/corb/util line 21 Java Problem
      The import java.io.File is never used StringUtilsTest.java /corb2/src/test/java/com/marklogic/developer/corb/util line 22 Java Problem
      The import java.io.FileReader is never used StringUtilsTest.java /corb2/src/test/java/com/marklogic/developer/corb/util line 23 Java Problem
      The import java.security.NoSuchAlgorithmException is never used PostBatchUpdateFileTaskTest.java /corb2/src/test/java/com/marklogic/developer/corb line 34 Java Problem
      The import java.util.Collection is never used StringUtilsTest.java /corb2/src/test/java/com/marklogic/developer/corb/util line 27 Java Problem
      The import java.util.LinkedHashSet is never used PostBatchUpdateFileTaskTest.java /corb2/src/test/java/com/marklogic/developer/corb line 37 Java Problem
      The import java.util.logging.Level is never used ModuleExecutorTest.java /corb2/src/test/java/com/marklogic/developer/corb line 47 Java Problem
      The import java.util.logging.LogRecord is never used ModuleExecutorTest.java /corb2/src/test/java/com/marklogic/developer/corb line 48 Java Problem
      The import java.util.Set is never used PostBatchUpdateFileTaskTest.java /corb2/src/test/java/com/marklogic/developer/corb line 39 Java Problem
      The import javax.net.ssl.X509TrustManager is never used TrustAnyoneSSLConfigTest.java /corb2/src/test/java/com/marklogic/developer/corb line 22 Java Problem
      The import org.mockito.exceptions.base.MockitoException is never used FileUtilsTest.java /corb2/src/test/java/com/marklogic/developer/corb/util line 35 Java Problem
      The import org.mockito.Mockito.anyString is never used ExportBatchToFileTaskTest.java /corb2/src/test/java/com/marklogic/developer/corb line 29 Java Problem
      The import org.mockito.Mockito.anyString is never used FileUrisLoaderTest.java /corb2/src/test/java/com/marklogic/developer/corb line 33 Java Problem
      The import org.mockito.Mockito.anyString is never used JasyptDecrypterTest.java /corb2/src/test/java/com/marklogic/developer/corb line 36 Java Problem
      The import org.mockito.Mockito.anyString is never used ModuleExecutorTest.java /corb2/src/test/java/com/marklogic/developer/corb line 51 Java Problem
      The import org.mockito.Mockito.mock is never used AbstractTaskTest.java /corb2/src/test/java/com/marklogic/developer/corb line 57 Java Problem
      The import org.mockito.Mockito.mock is never used IOUtilsTest.java /corb2/src/test/java/com/marklogic/developer/corb/util line 39 Java Problem
      The import org.mockito.Mockito.mock is never used JasyptDecrypterTest.java /corb2/src/test/java/com/marklogic/developer/corb line 37 Java Problem
      The import org.mockito.Mockito.when is never used FileUrisLoaderTest.java /corb2/src/test/java/com/marklogic/developer/corb line 35 Java Problem
      The import org.mockito.Mockito.when is never used IOUtilsTest.java /corb2/src/test/java/com/marklogic/developer/corb/util line 40 Java Problem
      The import org.mockito.Mockito.when is never used JasyptDecrypterTest.java /corb2/src/test/java/com/marklogic/developer/corb line 38 Java Problem
      The method decrypt(String) from the type JasyptDecrypterTest.TestDecrypt is never used locally JasyptDecrypterTest.java /corb2/src/test/java/com/marklogic/developer/corb line 198 Java Problem
      The method toURL() from the type File is deprecated ModuleExecutorTest.java /corb2/src/test/java/com/marklogic/developer/corb line 531 Java Problem
      The static field AbstractTask.MODULE_PROPS should be accessed in a static way AbstractTaskTest.java /corb2/src/test/java/com/marklogic/developer/corb line 268 Java Problem
      The static field AbstractTask.MODULE_PROPS should be accessed in a static way AbstractTaskTest.java /corb2/src/test/java/com/marklogic/developer/corb line 269 Java Problem
      The value of the local variable connectionUri is not used ModuleExecutorTest.java /corb2/src/test/java/com/marklogic/developer/corb line 85 Java Problem
      The value of the local variable context is not used TwoWaySSLConfigTest.java /corb2/src/test/java/com/marklogic/developer/corb line 172 Java Problem
      The value of the local variable expResult is not used AbstractManagerTest.java /corb2/src/test/java/com/marklogic/developer/corb line 424 Java Problem
      The value of the local variable expResult is not used TransformTest.java /corb2/src/test/java/com/marklogic/developer/corb line 73 Java Problem
      The value of the local variable instance is not used PostBatchUpdateFileTaskTest.java /corb2/src/test/java/com/marklogic/developer/corb line 411 Java Problem
      The value of the local variable props is not used ManagerTest.java /corb2/src/test/java/com/marklogic/developer/corb line 1075 Java Problem
      The value of the local variable records is not used JasyptDecrypterTest.java /corb2/src/test/java/com/marklogic/developer/corb line 91 Java Problem
      The value of the local variable records is not used JasyptDecrypterTest.java /corb2/src/test/java/com/marklogic/developer/corb line 106 Java Problem
      The value of the local variable records is not used JasyptDecrypterTest.java /corb2/src/test/java/com/marklogic/developer/corb line 125 Java Problem
      The value of the local variable records is not used JasyptDecrypterTest.java /corb2/src/test/java/com/marklogic/developer/corb line 142 Java Problem
      The value of the local variable records is not used JasyptDecrypterTest.java /corb2/src/test/java/com/marklogic/developer/corb line 154 Java Problem
      The value of the local variable result is not used AbstractManagerTest.java /corb2/src/test/java/com/marklogic/developer/corb line 111 Java Problem
      The value of the local variable result is not used AbstractManagerTest.java /corb2/src/test/java/com/marklogic/developer/corb line 153 Java Problem
      The value of the local variable result is not used AbstractManagerTest.java /corb2/src/test/java/com/marklogic/developer/corb line 172 Java Problem
      The value of the local variable result is not used AbstractManagerTest.java /corb2/src/test/java/com/marklogic/developer/corb line 181 Java Problem
      The value of the local variable result is not used AbstractManagerTest.java /corb2/src/test/java/com/marklogic/developer/corb line 187 Java Problem
      The value of the local variable result is not used AbstractManagerTest.java /corb2/src/test/java/com/marklogic/developer/corb line 193 Java Problem
      The value of the local variable result is not used AbstractManagerTest.java /corb2/src/test/java/com/marklogic/developer/corb line 199 Java Problem
      The value of the local variable result is not used AbstractManagerTest.java /corb2/src/test/java/com/marklogic/developer/corb line 219 Java Problem
      The value of the local variable result is not used AbstractManagerTest.java /corb2/src/test/java/com/marklogic/developer/corb line 425 Java Problem
      The value of the local variable result is not used AbstractTaskTest.java /corb2/src/test/java/com/marklogic/developer/corb line 267 Java Problem
      The value of the local variable result is not used ExportBatchToFileTaskTest.java /corb2/src/test/java/com/marklogic/developer/corb line 90 Java Problem
      The value of the local variable result is not used ExportBatchToFileTaskTest.java /corb2/src/test/java/com/marklogic/developer/corb line 100 Java Problem
      The value of the local variable result is not used ExportToFileTaskTest.java /corb2/src/test/java/com/marklogic/developer/corb line 185 Java Problem
      The value of the local variable result is not used FileUrisLoaderTest.java /corb2/src/test/java/com/marklogic/developer/corb line 223 Java Problem
      The value of the local variable result is not used IOUtilsTest.java /corb2/src/test/java/com/marklogic/developer/corb/util line 137 Java Problem
      The value of the local variable result is not used IOUtilsTest.java /corb2/src/test/java/com/marklogic/developer/corb/util line 144 Java Problem
      The value of the local variable result is not used JasyptDecrypterTest.java /corb2/src/test/java/com/marklogic/developer/corb line 179 Java Problem
      The value of the local variable result is not used ManagerTest.java /corb2/src/test/java/com/marklogic/developer/corb line 1037 Java Problem
      The value of the local variable result is not used ManagerTest.java /corb2/src/test/java/com/marklogic/developer/corb line 1060 Java Problem
      The value of the local variable result is not used ManagerTest.java /corb2/src/test/java/com/marklogic/developer/corb line 1068 Java Problem
      The value of the local variable result is not used ManagerTest.java /corb2/src/test/java/com/marklogic/developer/corb line 1078 Java Problem
      The value of the local variable result is not used ManagerTest.java /corb2/src/test/java/com/marklogic/developer/corb line 1105 Java Problem
      The value of the local variable result is not used PostBatchUpdateFileTaskTest.java /corb2/src/test/java/com/marklogic/developer/corb line 387 Java Problem
      The value of the local variable result is not used PostBatchUpdateFileTaskTest.java /corb2/src/test/java/com/marklogic/developer/corb line 402 Java Problem
      The value of the local variable result is not used PostBatchUpdateFileTaskTest.java /corb2/src/test/java/com/marklogic/developer/corb line 437 Java Problem
      The value of the local variable result is not used PostBatchUpdateFileTaskTest.java /corb2/src/test/java/com/marklogic/developer/corb line 448 Java Problem
      The value of the local variable result is not used PreBatchUpdateFileTaskTest.java /corb2/src/test/java/com/marklogic/developer/corb line 123 Java Problem
      The value of the local variable result is not used StringUtilsTest.java /corb2/src/test/java/com/marklogic/developer/corb/util line 159 Java Problem
      The value of the local variable result is not used StringUtilsTest.java /corb2/src/test/java/com/marklogic/developer/corb/util line 260 Java Problem
      The value of the local variable result is not used StringUtilsTest.java /corb2/src/test/java/com/marklogic/developer/corb/util line 266 Java Problem
      The value of the local variable result is not used TaskFactoryTest.java /corb2/src/test/java/com/marklogic/developer/corb line 63 Java Problem
      The value of the local variable result is not used TaskFactoryTest.java /corb2/src/test/java/com/marklogic/developer/corb line 72 Java Problem
      The value of the local variable result is not used TaskFactoryTest.java /corb2/src/test/java/com/marklogic/developer/corb line 82 Java Problem
      The value of the local variable result is not used TaskFactoryTest.java /corb2/src/test/java/com/marklogic/developer/corb line 93 Java Problem
      The value of the local variable result is not used TaskFactoryTest.java /corb2/src/test/java/com/marklogic/developer/corb line 106 Java Problem
      The value of the local variable result is not used TaskFactoryTest.java /corb2/src/test/java/com/marklogic/developer/corb line 118 Java Problem
      The value of the local variable result is not used TaskFactoryTest.java /corb2/src/test/java/com/marklogic/developer/corb line 234 Java Problem
      The value of the local variable result is not used TaskFactoryTest.java /corb2/src/test/java/com/marklogic/developer/corb line 255 Java Problem
      The value of the local variable result is not used TaskFactoryTest.java /corb2/src/test/java/com/marklogic/developer/corb line 308 Java Problem
      The value of the local variable result is not used TaskFactoryTest.java /corb2/src/test/java/com/marklogic/developer/corb line 317 Java Problem
      The value of the local variable result is not used TaskFactoryTest.java /corb2/src/test/java/com/marklogic/developer/corb line 327 Java Problem
      The value of the local variable result is not used TaskFactoryTest.java /corb2/src/test/java/com/marklogic/developer/corb line 336 Java Problem
      The value of the local variable value is not used AbstractTaskTest.java /corb2/src/test/java/com/marklogic/developer/corb line 627 Java Problem
      Type safety: The expression of type BlockingQueue needs unchecked conversion to conform to BlockingQueue ManagerTest.java /corb2/src/test/java/com/marklogic/developer/corb line 485 Java Problem
      Type safety: The expression of type BlockingQueue needs unchecked conversion to conform to BlockingQueue ManagerTest.java /corb2/src/test/java/com/marklogic/developer/corb line 508 Java Problem
      Type safety: The expression of type Comparator needs unchecked conversion to conform to Comparator PostBatchUpdateFileTask.java /corb2/src/main/java/com/marklogic/developer/corb line 92 Java Problem
      Type safety: The expression of type Comparator needs unchecked conversion to conform to Comparator PostBatchUpdateFileTask.java /corb2/src/main/java/com/marklogic/developer/corb line 97 Java Problem

      Option to De-duplicate output

      Could you throw in a flag to sort and/or deduplicate the output from the CORB job? This seems to be a pretty common use case for batch extracts, and we currently do this with bash scripts on the command line.
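As a rough illustration of what such a flag might do, here is a minimal Java sketch that sorts and de-duplicates the lines of an export file. The class `OutputDeduper` and its methods are hypothetical and not part of CoRB. The sketch reads the whole file into memory, so it would only suit modest exports.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical helper (not part of CoRB): sorts and de-duplicates report lines. */
final class OutputDeduper {

    private OutputDeduper() {}

    /** Returns the distinct lines in natural (lexicographic) order. */
    static List<String> sortAndDedupe(List<String> lines) {
        // A TreeSet drops duplicates and keeps its elements sorted.
        return new ArrayList<>(new TreeSet<>(lines));
    }

    /** Rewrites the file in place; everything is held in memory (assumption: the export is small). */
    static void rewrite(Path exportFile) throws IOException {
        List<String> lines = Files.readAllLines(exportFile, StandardCharsets.UTF_8);
        Files.write(exportFile, sortAndDedupe(lines), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: OutputDeduper <export-file>");
            return;
        }
        rewrite(Paths.get(args[0]));
    }
}
```

A very large export would need an external merge sort instead, which is roughly what the `sort -u` shell approach does today.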

      Best way to distribute a job?

      In some testing, it doesn't seem like CORB supports load balancers? CORB gets EOF errors when sent through a load balancer but works fine when sent directly to a host. Further, it seems that the only way to distribute CORB's load is to run multiple instances of it and point each instance to a separate host. Is there a better way?

      MODULES-DATABASE parameter not used?

      We have been observing the behavior that the MODULES-DATABASE properties parameter has no effect; it doesn't appear to override the associated modules database of the database specified in the connection string.

      Provide real-time metrics for the Job as it progresses.

      It would be useful to have a mechanism for clients to fetch metrics on demand for a running job. For example, clients could connect to an HTTP port that is opened as part of the CoRB job and that continuously provides updates.
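One way this could look, as a sketch only: the JDK's built-in `com.sun.net.httpserver.HttpServer` serves a small JSON snapshot of job progress. The class `JobMetricsEndpoint`, the `/metrics` path, and the JSON field names are all assumptions made for illustration and do not exist in CoRB.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch (not part of CoRB): exposes job progress over HTTP. */
final class JobMetricsEndpoint {

    private final long total;
    private final AtomicLong completed = new AtomicLong();

    JobMetricsEndpoint(long total) {
        this.total = total;
    }

    /** Called by worker threads each time a task finishes. */
    void taskCompleted() {
        completed.incrementAndGet();
    }

    /** Snapshot of the counters as a small JSON document. */
    String toJson() {
        return "{\"total\":" + total + ",\"completed\":" + completed.get() + "}";
    }

    /** Starts the server; pass 0 to let the OS pick a free port. */
    HttpServer start(int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/metrics", exchange -> {
            byte[] body = toJson().getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().set("Content-Type", "application/json");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
        return server;
    }
}
```

A client would then poll `http://host:port/metrics` while the job runs, and the job would call `stop(0)` on the returned server when it finishes.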

      Sample URIs Module wiki needs a simpler example

      There is only one example and it's a complex one. It makes sense to have a complex one to show what you can do (although a description of what it's doing might be helpful for some), but having a simple example to show how easy it is would be useful as well.

      DiskQueue throughput/performance

      The throughput/performance of the DiskQueue is less than that of the default ArrayQueue and does not appear to be leveraging all available processing threads. Need to investigate to see if the default parameters need adjustment (max in memory size, refill ratio, etc), or if the implementation should be adjusted to maximize throughput/performance.

      No ETC calculated when TPS is less than 1

      When CoRB runs long-running tasks and the Threads Per Second (TPS) rate is less than one, the TPS is reported as 0 and the Estimated Time of Completion (ETC) is displayed as "00:00:-1", when it should be calculated and displayed as an ETC of several hours.

      We should change CoRB to calculate TPS with more precision, report a decimal number for TPS if it is less than 1 (e.g. 0.02), and use the decimal number to calculate the ETC.
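The proposed calculation could be sketched as follows. This is not CoRB's actual implementation; the class `EtcEstimator`, its method names, and the "unknown" fallback text are hypothetical.

```java
import java.util.Locale;

/** Hypothetical sketch (not CoRB's code): decimal TPS and an ETC derived from it. */
final class EtcEstimator {

    private EtcEstimator() {}

    /** Tasks per second as a decimal, so slow jobs do not round down to 0. */
    static double tps(long completed, long elapsedMillis) {
        if (completed <= 0 || elapsedMillis <= 0) {
            return 0d;
        }
        return completed * 1000d / elapsedMillis;
    }

    /** Seconds remaining, or -1 when no rate is known yet. */
    static long etcSeconds(long remaining, double tps) {
        if (tps <= 0d) {
            return -1;
        }
        return (long) Math.ceil(remaining / tps);
    }

    /** Formats seconds as HH:MM:SS; a negative value is shown as "unknown". */
    static String format(long seconds) {
        if (seconds < 0) {
            return "unknown";
        }
        return String.format(Locale.ROOT, "%02d:%02d:%02d",
                seconds / 3600, (seconds % 3600) / 60, seconds % 60);
    }
}
```

For example, 3 tasks completed in 150 seconds gives a TPS of 0.02, and with 997 tasks remaining the estimate comes to roughly 13 hours and 50 minutes rather than "00:00:-1".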

      The latest corb removed WARNING: Slow receive! Consider increasing max heap...

      A user reported an issue about this warning, and I noticed that we no longer print it. The warning applies in two cases: 1) memory is too low, or 2) the fetch rate from ML is slow. The case I saw today appears to be related to the fetch rate, as memory is fine. We should continue to print this warning and improve the message based on which case applies.

      Also, the QueryUrisLoader logQueueStatus may need to check the memory a bit more frequently - need to check performance implications here.

      Set URIs via an input parameter that is a query

      getUriLoader() in Manager.java requires that the impl of UrisLoader be QueryUrisLoader, FileUrisLoader, or a custom implementation of UrisLoader. That means the query must be in the modules database, on the filesystem, or in a custom implementation of UrisLoader. This makes it very difficult to define custom queries programmatically, as they must be in one of those three places.

      Instead, I'd like to be able to set an input param such as URIS-QUERY="cts:uris(...)". getUrisLoader() could then check for that input, and if it exists, create an instance of QueryUrisLoader with the value of URIS-QUERY as the query to be executed.

      QueryUrisLoader can then be modified to get adhocQuery (currently line 110) from an instance variable as opposed to reading it from the filesystem or a modules database.
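The check described above might look something like this sketch. `URIS-QUERY` is the option proposed in this issue and does not exist yet; `UrisSourceResolver`, `UrisSource`, and the precedence order (inline query first, then `URIS-MODULE`, then `URIS-FILE`) are assumptions made for illustration.

```java
import java.util.Properties;

/** Hypothetical sketch (not part of CoRB): decides where the URIs should come from. */
final class UrisSourceResolver {

    enum Kind { INLINE_QUERY, MODULE, FILE }

    /** Simple value holder for the chosen source. */
    static final class UrisSource {
        final Kind kind;
        final String value;

        UrisSource(Kind kind, String value) {
            this.kind = kind;
            this.value = value;
        }
    }

    private UrisSourceResolver() {}

    /** Assumed precedence: URIS-QUERY (proposed), then URIS-MODULE, then URIS-FILE. */
    static UrisSource resolve(Properties options) {
        String query = trimToNull(options.getProperty("URIS-QUERY"));
        if (query != null) {
            return new UrisSource(Kind.INLINE_QUERY, query);
        }
        String module = trimToNull(options.getProperty("URIS-MODULE"));
        if (module != null) {
            return new UrisSource(Kind.MODULE, module);
        }
        String file = trimToNull(options.getProperty("URIS-FILE"));
        if (file != null) {
            return new UrisSource(Kind.FILE, file);
        }
        throw new IllegalArgumentException(
                "One of URIS-QUERY, URIS-MODULE or URIS-FILE is required");
    }

    private static String trimToNull(String s) {
        if (s == null) {
            return null;
        }
        String trimmed = s.trim();
        return trimmed.isEmpty() ? null : trimmed;
    }
}
```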

      This then opens the door for being able to invoke corb without any dependencies on modules on the filesystem or in the modules database - they can instead be passed in as strings (I'll have another ticket for doing this for the process module as well). What I have in mind is that, in something like Groovy shell, a developer ought to be able to do something like this:

      ml.corb().transformCollection("collection-name", "xdmp:node-insert-child(doc($URI)/element(), <test/>)")

      That transformCollection call would construct a URIs query on the given collection, and the second argument would be the process module code. More complicated corb jobs would still most likely use files in the modules database, but for quick/simple corb jobs, this would be very flexible and easy.

      I'm happy to put together a PR for this if this seems useful to others.

      Consolidate options for corb

      Corb options initialization and loading are scattered across several classes, making it difficult to extend corb with new features such as a multi-host load balancer, options logging, etc.

      The plan is to merge the options and logic from TransformOptions (which will become obsolete), Manager, AbstractManager, AbstractTask (and other tasks), etc. into the Options.java class. The updated Options.java class should be flexible enough to hold the default options as well as custom properties.

      Execute pre batch module with external variables

      As a corb2 user
      I want to execute the pre batch module with external variables
      So that my pre batch module can be parameterized to behave in different ways, as I often need a user-provided parameter to determine what column headers I need to display

      • I recommend "PRE_BATCH.(variable name)", similar to "URIS.(variable name)".
      • I can submit a PR, and would appreciate any pointers about where to look in the codebase and how to set up a test for it.
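As a starting point for discussion, a prefix convention like this could be handled by collecting every property that begins with the prefix and passing the remainder of the name as the external variable name. The sketch below is an assumption about how that might work and is not CoRB's code; `PrefixedVariables` and `extract` are hypothetical names.

```java
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

/** Hypothetical sketch (not part of CoRB): gathers "PREFIX.name=value" properties. */
final class PrefixedVariables {

    private PrefixedVariables() {}

    /**
     * Returns the variables declared as "prefix.name", keyed by name only.
     * A TreeMap keeps the result in a predictable (sorted) order.
     */
    static Map<String, String> extract(Properties props, String prefix) {
        String dotted = prefix + ".";
        Map<String, String> variables = new TreeMap<>();
        for (String key : props.stringPropertyNames()) {
            // Skip unrelated keys and a bare "prefix." with no variable name.
            if (key.startsWith(dotted) && key.length() > dotted.length()) {
                variables.put(key.substring(dotted.length()), props.getProperty(key));
            }
        }
        return variables;
    }
}
```

With properties `PRE_BATCH.headers=id,title` and `URIS.collection=reports`, calling `extract(props, "PRE_BATCH")` would yield a single entry, `headers` mapped to `id,title`, which could then be bound as an external variable on the pre batch module request.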

      Support JavaScript modules

      Add support for calling JavaScript modules.

      FYI, I ran a test with Corb 1 and found I could use a JavaScript transformation module, but not for getting URIs. For the transformation module, "URI" just works as a global (rather than external) variable.

      Custom Variables From Selector Not Being Passed

      I was trying to pass dynamically created custom vars from the selector but received an error on the transform side. I was able to get them working, but only after I had changed the names of the variables. Originally, the variables had hyphens in them, as in URIS-MODULE.my-custom-var="blah". It was only after removing the hyphens that it worked: URIS-MODULE.myCustomVar="blah". I'm not sure whether underscores work either.

      Please either allow hyphens in these variable names or update the README to alert users to the limitation. They are very handy, thanks!

      Sample Processing Module wiki only shows reporting

      The wiki page with the Sample Processing Module only shows a reporting example; there isn't one for actually reprocessing (modifying documents), which is CORB's core use case. Add a wiki page that gives a reprocessing example.

      There is a wiki page for Updating a Node in a Document, but it doesn't include the code.

      separate unit, integration, and performance tests

      Ensure that unit tests run quickly, without external dependencies (e.g. a MarkLogic instance with specific ports, app servers, and databases available). This will make the builds faster, make it easier for others to contribute and execute unit tests, and facilitate the use of https://travis-ci.org/

      • For unit tests, use Mock objects for MarkLogic connections, or move the test methods out into integration tests
      • Create integration tests (i.e. /src/test/java/com/marklogic/developer/corb/ManagerIT)
      • Similarly, create an area for performance tests (i.e. /src/test/java/com/marklogic/developer/corb/ManagerPT) in order to facilitate longer running load/stress tests
      • Modify the build and create tasks in order to execute gradle integrationTest and gradle performanceTest
