
marklogic / marklogic-contentpump

MarkLogic Contentpump (mlcp)

Home Page: http://developer.marklogic.com/products/mlcp

License: Apache License 2.0

Java 98.83% HTML 0.38% CSS 0.14% Batchfile 0.05% Shell 0.03% JavaScript 0.26% XQuery 0.31%

marklogic-contentpump's Introduction

MarkLogic Content Pump

MarkLogic Content Pump (mlcp) is a command-line tool that provides the fastest way to import, export, and copy data to or from MarkLogic databases. Core features of mlcp include:

  • Bulk load billions of local files
  • Split and load large, aggregate XML files or delimited text
  • Bulk load billions of triples or quads from RDF files
  • Archive and restore database contents across environments
  • Export data from a database to a file system
  • Copy subsets of data between databases

You can run mlcp across many threads on a single machine or across many nodes in a cluster. mlcp can now run against MarkLogic clusters hosted on AWS or Azure.
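
For illustration, a minimal local import might look like the following (host, port, credentials, and path are placeholders):

$ mlcp.sh import -host localhost -port 8000 \
    -username user -password password \
    -mode local -input_file_path /space/data/docs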

The MarkLogic Connector for Hadoop is an extension to Hadoop's MapReduce framework that allows you to easily and efficiently communicate with a MarkLogic database from within a Hadoop job. As of 10.0-5, the Hadoop Connector is no longer distributed as a separate release, but mlcp still uses it as an internal dependency.

Release Notes

What's New in mlcp 11.2.0

  • Excluded all transitive dependencies to improve security and maintenance.
  • Upgraded the Avro, Commons-cli, Commons-compress, Commons-io, Xml-apis, Jakarta.xml.soap-api, Jcl-over-slf4j, and Commons-logging-api libraries to mitigate security vulnerabilities.
  • Removed the Zookeeper, Bliki-core, Logback-classic, and Htrace-core libraries to mitigate security vulnerabilities.
  • Upgraded the Maven Javadoc plugin from 2.10.3 to 3.6.3 to mitigate a security vulnerability.
  • Removed all dependencies and repository references related to the deprecated MapR platform.

What's New in mlcp 11.1.0

  • Now requires JRE 11 or later.
  • Added support for reverse proxy and connection to MarkLogic Cloud.
  • Upgraded Jena libraries from 2.13.0 to 4.9.0 to mitigate a security vulnerability.
  • Upgraded Jackson, Hadoop, Xstream, and Guava libraries to mitigate security vulnerabilities.

What's New in mlcp 11.0.3

  • Removed an unused json dependency to mitigate a security vulnerability.

What's New in mlcp 11.0.2

  • Upgraded libthrift from 0.14.0 to 0.17.0 to mitigate a security vulnerability.

What's New in mlcp 11.0.0

  • Upgraded the Hadoop library to 3.3.4.
  • Upgraded jackson-annotations, jackson-core, jackson-databind, Xerces, and woodstox-core to mitigate security vulnerabilities.
  • Bug fixes.

Getting Started

Documentation

For official product documentation, please refer to:

The wiki pages of this project contain useful information for development work:

Required Software

Build

Steps to build mlcp:

$ git clone https://github.com/marklogic/marklogic-contentpump.git
$ cd marklogic-contentpump
$ mvn clean package -DskipTests=true

The build writes its deliverables to the corresponding directories under the marklogic-contentpump/ root directory.

For information on contributing to this project, see CONTRIBUTING.md. For information on development, see the project wiki.

Tests

The unit tests included in this repository are designed to provide illustrative examples of the APIs and to sanity check external contributions. MarkLogic Engineering runs a more comprehensive set of unit, integration, and performance tests internally. To run the unit tests, execute the following command from the marklogic-contentpump/ root directory:

$ mvn test

For detailed information about running unit tests, see Guideline to Run Tests.

Have a question? Need help?

If you have questions about mlcp or the Hadoop Connector, ask on StackOverflow. Tag your question with mlcp and marklogic. If you find a bug or would like to propose a new capability, file a GitHub issue.

Support

mlcp and the Hadoop Connector are maintained by MarkLogic Engineering and distributed under the Apache 2.0 license. They are designed for use in production applications with MarkLogic Server. Everyone is encouraged to file bug reports, feature requests, and pull requests through GitHub. This input is critical and will be carefully considered. However, we can’t promise a specific resolution or timeframe for any request. In addition, MarkLogic provides technical support for release tags of mlcp and the Hadoop Connector to licensed customers under the terms outlined in the Support Handbook. For more information or to sign up for support, visit help.marklogic.com.

marklogic-contentpump's People

Contributors

abika5, darrenjan, itsshivaverma, jmakeig, jxchen-us, karshuntsoi, kcoleman-marklogic, mattsunsjf, rjrudin, yunzvanessa, zzzwan


marklogic-contentpump's Issues

Failures caused by deadlocks on import

I am loading a full archive export from a ML 8 database into ML 9. I am getting the following errors for vast numbers of documents:

com.marklogic.xcc.exceptions.RetryableXQueryException: XDMP-DEADLOCK: Deadlock detected locking app-content-001-3 /search/http://my.doc.urlPrefix/4869064/

After this error I see tons of errors like this:

17/06/07 08:20:43 WARN mapreduce.ContentWriter: Failed document: /search/http://my.doc.urlPrefix/4869064/

There are no properties on any of the documents in the archive. There are about 13.5M docs in collection A, about 1M docs in collection B, and about 500K docs in collection C. All documents in all collections have a trailing "/" in their URIs. There are no triggers, no transforms, nothing I can see that would cause any deadlocks. Yet I am seeing tons of deadlock warning messages and subsequent failures when those deadlocks can't be resolved.

The archive was created by pointing the latest version of mlcp at a ML 8.0-5.2 database and creating an archive export. I am attempting to import that archive into ML 9.0-1.1.

The mlcp import command is this:

~/mlcp-9.0.1/bin/mlcp.sh import -host localhost -port [port] -username admin -password [admin password] -mode local -input_file_type archive -input_file_path /path/to/my.demo &

The error on the mlcp side is this:

2017-06-07 10:26:58.525 INFO [15] (AbstractRequestController.runRequest): automatic query retries (1) exhausted, throwing: com.marklogic.xcc.exceptions.RetryableXQueryException: XDMP-DEADLOCK: Deadlock detected locking app-content-001-4 /search/http://my.app.uriPrefix/4162061/
[Session: user=admin, cb={default} [ContentSource: user=admin, cb={none} [provider: address=localhost/127.0.0.1:8041, pool=1/64]]]
[Client: XCC/9.0-1, Server: XDBC/9.0-1.1]
com.marklogic.xcc.exceptions.RetryableXQueryException: XDMP-DEADLOCK: Deadlock detected locking app-content-001-4 /search/http://my.app.uriPrefix/4162061/
[Session: user=admin, cb={default} [ContentSource: user=admin, cb={none} [provider: address=localhost/127.0.0.1:8041, pool=1/64]]]
[Client: XCC/9.0-1, Server: XDBC/9.0-1.1]
com.marklogic.xcc.impl.handlers.ServerExceptionHandler.handleResponse(ServerExceptionHandler.java:34)
com.marklogic.xcc.impl.handlers.ContentInsertController.serverDialog(ContentInsertController.java:149)
com.marklogic.xcc.impl.handlers.AbstractRequestController.runRequest(AbstractRequestController.java:87)
com.marklogic.xcc.impl.SessionImpl.insertContent(SessionImpl.java:325)
com.marklogic.xcc.impl.SessionImpl.insertContentCollectErrors(SessionImpl.java:297)
com.marklogic.mapreduce.ContentWriter.insertBatch(ContentWriter.java:476)
com.marklogic.contentpump.DatabaseContentWriter.write(DatabaseContentWriter.java:231)
com.marklogic.contentpump.DatabaseContentWriter.write(DatabaseContentWriter.java:56)
org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
com.marklogic.contentpump.DocumentMapper.map(DocumentMapper.java:55)
com.marklogic.contentpump.DocumentMapper.map(DocumentMapper.java:34)
com.marklogic.contentpump.BaseMapper.run(BaseMapper.java:79)
com.marklogic.contentpump.LocalJobRunner$LocalMapTask.call(LocalJobRunner.java:409)
java.util.concurrent.FutureTask.run(FutureTask.java:262)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
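
One hedged mitigation to try (the option names are from the mlcp option reference; whether they resolve these particular deadlocks is an assumption): shrink the number of documents per transaction so that fewer URI and directory locks are held at once.

~/mlcp-9.0.1/bin/mlcp.sh import -host localhost -port [port] -username admin -password [admin password] \
    -mode local -input_file_type archive -input_file_path /path/to/my.demo \
    -batch_size 1 -transaction_size 1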

option to run mlcp through a load balancer

It would be good if mlcp could run through a load balancer.

There have been a number of customers trying to do this, and support issues raised, where only the load balancer is directly accessible from outside the cluster. mlcp's requirement for direct access to each participating node means this setup can't work if mlcp is run outside the cluster.

Perhaps if there was an option to pin mlcp to the hostname provided on the command line, then with the correct session stickiness it would be able to work through the load balancer. That is, if I provide '-hostname x.y.com', mlcp should send all communications to x.y.com.

This has come up a number of times with AWS and the ELB.
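
For reference, later mlcp releases document a -restrict_hosts option aimed at exactly this topology (treating the command-line host as the only reachable endpoint); whether it covers a particular load balancer setup should be verified against the current mlcp guide:

$ mlcp.sh import -host elb.example.com -port 8000 -restrict_hosts true ...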

Text data is not properly written to the MarkLogic database using the Hadoop connector

I am reading sample CSV data and using the Hadoop connector to write it into a MarkLogic database as text. The problem is that a few records are written to the database a random number of times. Say I am storing 10 records, so there should be 10 insertions into the MarkLogic database, but a few records are written multiple times, seemingly at random, and I am not sure why.

I have shared my code as well. Can anybody tell me why writing to MarkLogic with the Hadoop connector API behaves so randomly?

Here is the mapper code

public static class CSVMapper extends Mapper<LongWritable, Text, DocumentURI, Text> {
    // NOTE: this counter is static per task JVM and not thread-safe; separate
    // map tasks each start at 1, so different tasks can emit colliding URIs.
    static int i = 1;

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        ObjectMapper mapper = new ObjectMapper();
        String line = value.toString(); // one line of the CSV file per map() call

        // TextInputFormat already delivers a single line per call, so this
        // split on "\n" normally yields exactly one element.
        String[] singleData = line.split("\n");
        for (String lineData : singleData) {
            String[] fields = lineData.split(",");
            Sample sd = new Sample(fields[0], fields[1], fields[2].trim(), fields[3]);

            String jsonInString = mapper.writeValueAsString(sd);
            Text txt = new Text();
            txt.set(jsonInString);
            System.out.println("line Data is    - " + line);
            System.out.println("jsonInString is -  " + jsonInString);
            final DocumentURI outputURI1 = new DocumentURI("HadoopMarklogicNPPES-" + i + ".json");
            i++;

            context.write(outputURI1, txt);
        }
    }
}
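
A likely cause of the duplicates is that static counter: each map task runs in its own JVM (or thread) and starts counting at 1, so two tasks can emit the same URI, and a retried or speculatively executed task re-emits the same URIs again. A minimal sketch of a safer scheme, deriving the URI from the byte-offset key that TextInputFormat supplies, which is unique within an input file (toSample() is a hypothetical helper doing the split/trim from the snippet above):

public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    String jsonInString = new ObjectMapper().writeValueAsString(toSample(value.toString()));
    Text txt = new Text(jsonInString);
    // key.get() is the byte offset of this line in the input file: stable and
    // unique per record, so a re-executed task overwrites the same URI instead
    // of creating extra documents under new numbers.
    DocumentURI uri = new DocumentURI("HadoopMarklogicNPPES-" + key.get() + ".json");
    context.write(uri, txt);
}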

Here is the main method -

Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    Job job = Job.getInstance(conf, "Hadoop Marklogic MarklogicHadoopCSVDataDump");
    job.setJarByClass(MarklogicHadoopCSVDataDump.class);

    // Map related configuration
    job.setMapperClass(CSVMapper.class);

    job.setMapOutputKeyClass(DocumentURI.class);
    job.setMapOutputValueClass(Text.class);
    job.setOutputFormatClass(ContentOutputFormat.class); 
    ContentInputFormatTest.setInputPaths(job, new Path("/marklogic/sampleData.csv"));
    conf = job.getConfiguration();
    conf.addResource("hadoopMarklogic.xml");        

    try {
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    } catch (ClassNotFoundException | InterruptedException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }

Here is the sample csv data -

"Complaint ID "," Product "," Sub-product "," Issue 
"1350210 "," Bank account or service "," Other bank product/service "," Account opening  closing  or management "
"1348006 "," Debt collection "," Other (phone  health club  etc.) "," Improper contact or sharing of info "
"1351347 "," Bank account or service "," Checking account "," Problems caused by my funds being low"
"1347916 "," Debt collection "," Payday loan "," Communication tactics"
"1348296 "," Credit card ","  "," Identity theft / Fraud / Embezzlement"
"1348136 "," Money transfers "," International money transfer "," Money was not available when promised"

ArrayIndexOutOfBoundsException when importing to a database with no forest attached.

[jchen@jchen-z620 mlcp]$ mlcp-9.0/bin/mlcp.sh import -username admin -password admin -host jchen -copy_collections -port 5275 -input_file_type archive -input_file_path /tmp/dump
16/09/01 17:28:19 INFO contentpump.LocalJobRunner: Content type: XML
16/09/01 17:28:19 INFO contentpump.ContentPump: Job name: local_925113862_1
16/09/01 17:28:19 INFO contentpump.FileAndDirectoryInputFormat: Total input paths to process : 1
16/09/01 17:28:21 ERROR contentpump.MultithreadedMapper: 0
java.lang.ArrayIndexOutOfBoundsException: 0
at com.marklogic.contentpump.DatabaseContentWriter.write(DatabaseContentWriter.java:140)
at com.marklogic.mapreduce.ContentWriter.write(ContentWriter.java:1)
at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
at com.marklogic.contentpump.ImportDocumentMapper.map(ImportDocumentMapper.java:54)
at com.marklogic.contentpump.ImportDocumentMapper.map(ImportDocumentMapper.java:1)
at com.marklogic.contentpump.BaseMapper.runThreadSafe(BaseMapper.java:51)
at com.marklogic.contentpump.MultithreadedMapper$MapRunner.run(MultithreadedMapper.java:379)
at com.marklogic.contentpump.MultithreadedMapper.run(MultithreadedMapper.java:215)
at com.marklogic.contentpump.LocalJobRunner$LocalMapTask.call(LocalJobRunner.java:378)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
16/09/01 17:28:21 INFO contentpump.LocalJobRunner: completed 0%
16/09/01 17:28:21 INFO contentpump.LocalJobRunner: com.marklogic.mapreduce.MarkLogicCounter:
16/09/01 17:28:21 INFO contentpump.LocalJobRunner: INPUT_RECORDS: 4
16/09/01 17:28:21 INFO contentpump.LocalJobRunner: OUTPUT_RECORDS: 4
16/09/01 17:28:21 INFO contentpump.LocalJobRunner: OUTPUT_RECORDS_COMMITTED: 0
16/09/01 17:28:21 INFO contentpump.LocalJobRunner: OUTPUT_RECORDS_FAILED: 0
16/09/01 17:28:21 INFO contentpump.LocalJobRunner: Total execution time: 1 sec

Add mlcp to top level description

GitHub search only looks at the repo title and the top-level description. Many people look for this tool by the term 'mlcp', so it should be easily findable with that search term.

MLCP throws an invalid skipped message and does not import a text document when it has collection info

I have back-ported the test mlcp-da-test15.xml from 9.0, and I see inconsistent behavior between 9.0 and 8.0.

I understand that we don't copy collections, quality, or permissions in 8.0 but do in 9.0; this issue is more about not being able to import text documents when collections are defined on the text document.

Steps to reproduce:

  • mlcp import with input type forest
  • Attach F1 to the db
  • Load content into the database
  • Detach F1 from the db
  • Attach F2 to the db
  • Import the data from F1 using mlcp import with -input_file_type forest
  • Validate the results

Documents I loaded:

xquery version "1.0-ml";

xdmp:document-load("QA_HOME/mlcp/data/text/text1.txt", <options xmlns="xdmp:document-load"><uri>/Test/mlcp-export-xmlquery-filter/text/1.txt</uri><format>text</format><collections><collection>mlcp-export-xmlquery-filter</collection></collections></options>),

xdmp:document-load("QA_HOME/mlcp/data/text/text2.txt", <options xmlns="xdmp:document-load"><uri>/Test/mlcp-export-xmlquery-filter/text/2.txt</uri><format>text</format><collections><collection>mlcp-export-xmlquery-filter</collection></collections></options>),

xdmp:document-load("QA_HOME/mlcp/data/binary/996.jpg", <options xmlns="xdmp:document-load"><uri>/Test/mlcp-export-xmlquery-filter/binary/1.jpg</uri><format>binary</format><collections><collection>mlcp-export-xmlquery-filter</collection></collections></options>),

xdmp:document-load("QA_HOME/mlcp/data/binary/lemon.gif", <options xmlns="xdmp:document-load"><uri>/Test/mlcp-export-xmlquery-filter/binary/2.gif</uri><format>binary</format><collections><collection>mlcp-export-xmlquery-filter</collection></collections></options>),

xdmp:document-set-property("/Test/mlcp-export-xmlquery-filter/naked/example.xml",<fruit><name>apple</name><color>red</color></fruit>);

xdmp:document-set-collections("/Test/mlcp-export-xmlquery-filter/naked/example.xml", "xmldocs"),
xdmp:document-set-property("/Test/mlcp-export-xmlquery-filter/fruit3.xml",<ex:fruit xmlns:ex="http://marklogic.com/example"><name>apple</name><color>red</color></ex:fruit>);


Command that I used in the test

QA_HOME/mlcp/runmlcp.sh QA_HOME IMPORT  -input_file_path MLD_HOME/Forests/mlcp-f15a -input_file_type forest -host localhost -port 5275 -username admin -password admin -output_collections A -output_uri_prefix /Test1 -mode local

Result I see

Skipped record: () from /Test/mlcp-export-xmlquery-filter/naked/example.xml in file:/var/opt/MarkLogic/Forests/mlcp-f15a/00000000/TreeData, reason: fragment or link
Skipped record: () from /Test/mlcp-export-xmlquery-filter/text/2.txt in file:/var/opt/MarkLogic/Forests/mlcp-f15a/00000000/TreeData, reason: fragment or link

REGR: test for bug 20059 (bug20059.xml) is failing in regression from 17 Sept 2016

I see that test bug20059.xml has been failing in regression since September 17, for the following reason:

/space/b8_0/qa/mlcp/runmlcp.sh /space/b8_0/qa import -username admin -password admin -host localhost -port 5275 -copy_collections true -copy_permissions true -copy_properties true -copy_quality true -input_file_path file:////space/b8_0/qa/mlcp/data/export/bug18908/1 -input_file_type archive

stderr/out from shell cmd: 16/09/20 14:32:56 INFO contentpump.LocalJobRunner: Content type: XML
16/09/20 14:32:57 INFO contentpump.FileAndDirectoryInputFormat: Total input paths to process : 4
16/09/20 14:32:57 ERROR contentpump.LocalJobRunner: Error running task:
com.thoughtworks.xstream.mapper.CannotResolveClassException: com.marklogic.ps.xqsync.XQSyncDocumentMetadata
at com.thoughtworks.xstream.mapper.DefaultMapper.realClass(DefaultMapper.java:56)
at com.thoughtworks.xstream.mapper.MapperWrapper.realClass(MapperWrapper.java:30)
at com.thoughtworks.xstream.mapper.DynamicProxyMapper.realClass(DynamicProxyMapper.java:55)
at com.thoughtworks.xstream.mapper.MapperWrapper.realClass(MapperWrapper.java:30)
at com.thoughtworks.xstream.mapper.PackageAliasingMapper.realClass(PackageAliasingMapper.java:88)
at com.thoughtworks.xstream.mapper.MapperWrapper.realClass(MapperWrapper.java:30)
at com.thoughtworks.xstream.mapper.ClassAliasingMapper.realClass(ClassAliasingMapper.java:79)
at com.thoughtworks.xstream.mapper.MapperWrapper.realClass(MapperWrapper.java:30)
at com.thoughtworks.xstream.mapper.MapperWrapper.realClass(MapperWrapper.java:30)
at com.thoughtworks.xstream.mapper.MapperWrapper.realClass(MapperWrapper.java:30)
at com.thoughtworks.xstream.mapper.MapperWrapper.realClass(MapperWrapper.java:30)
at com.thoughtworks.xstream.mapper.MapperWrapper.realClass(MapperWrapper.java:30)
at com.thoughtworks.xstream.mapper.MapperWrapper.realClass(MapperWrapper.java:30)
at com.thoughtworks.xstream.mapper.ArrayMapper.realClass(ArrayMapper.java:74)
at com.thoughtworks.xstream.mapper.MapperWrapper.realClass(MapperWrapper.java:30)
at com.thoughtworks.xstream.mapper.MapperWrapper.realClass(MapperWrapper.java:30)
at com.thoughtworks.xstream.mapper.MapperWrapper.realClass(MapperWrapper.java:30)
at com.thoughtworks.xstream.mapper.MapperWrapper.realClass(MapperWrapper.java:30)
at com.thoughtworks.xstream.mapper.MapperWrapper.realClass(MapperWrapper.java:30)
at com.thoughtworks.xstream.mapper.MapperWrapper.realClass(MapperWrapper.java:30)
at com.thoughtworks.xstream.mapper.MapperWrapper.realClass(MapperWrapper.java:30)
at com.thoughtworks.xstream.mapper.CachingMapper.realClass(CachingMapper.java:45)
at com.thoughtworks.xstream.core.util.HierarchicalStreams.readClassType(HierarchicalStreams.java:29)
at com.thoughtworks.xstream.core.TreeUnmarshaller.start(TreeUnmarshaller.java:133)
at com.thoughtworks.xstream.core.AbstractTreeMarshallingStrategy.unmarshal(AbstractTreeMarshallingStrategy.java:32)
at com.thoughtworks.xstream.XStream.unmarshal(XStream.java:1052)
at com.thoughtworks.xstream.XStream.unmarshal(XStream.java:1036)
at com.thoughtworks.xstream.XStream.fromXML(XStream.java:912)
at com.marklogic.contentpump.DocumentMetadata.fromXML(DocumentMetadata.java:70)
at com.marklogic.contentpump.ArchiveRecordReader.getMetadataFromStream(ArchiveRecordReader.java:185)
at com.marklogic.contentpump.ArchiveRecordReader.nextKeyValue(ArchiveRecordReader.java:128)
at com.marklogic.contentpump.LocalJobRunner$TrackingRecordReader.nextKeyValue(LocalJobRunner.java:444)
at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:80)
at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:91)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at com.marklogic.contentpump.LocalJobRunner$LocalMapTask.call(LocalJobRunner.java:378)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
16/09/20 14:32:58 INFO contentpump.LocalJobRunner: completed 25%
16/09/20 14:33:10 INFO contentpump.LocalJobRunner: completed 50%
16/09/20 14:33:10 INFO contentpump.LocalJobRunner: completed 75%
16/09/20 14:33:10 INFO contentpump.LocalJobRunner: com.marklogic.mapreduce.MarkLogicCounter:
16/09/20 14:33:10 INFO contentpump.LocalJobRunner: INPUT_RECORDS: 11231
16/09/20 14:33:10 INFO contentpump.LocalJobRunner: OUTPUT_RECORDS: 11231
16/09/20 14:33:10 INFO contentpump.LocalJobRunner: OUTPUT_RECORDS_COMMITTED: 11231
16/09/20 14:33:10 INFO contentpump.LocalJobRunner: OUTPUT_RECORDS_FAILED: 0
16/09/20 14:33:10 INFO contentpump.LocalJobRunner: Total execution time: 13 sec
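
The stack trace shows XStream failing to resolve com.marklogic.ps.xqsync.XQSyncDocumentMetadata, the legacy class name recorded inside the archive's metadata XML. The usual XStream remedy for a renamed or relocated class is an alias; the following is only a sketch, since the actual fix in mlcp may differ:

XStream xstream = new XStream();
// Map the legacy class name stored in old archive metadata onto the current class.
xstream.alias("com.marklogic.ps.xqsync.XQSyncDocumentMetadata",
        com.marklogic.contentpump.DocumentMetadata.class);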

MLCP fails to log a skipped message for naked properties when importing from a forest

Regression test mlcp-da-test15.xml has a scenario for reproducing this issue.

  1. Load the following documents into the database:
 xquery version "1.0-ml";
        xdmp:document-insert("/Test/mlcp-export-xmlquery-filter/fruit1.xml",<ex:fruits xmlns:ex="http://marklogic.com/example"><ex:fruit><ex:name>apple</ex:name><ex:color>red</ex:color></ex:fruit>    <ex:fruit ><ex:name>orange</ex:name><ex:color>orange</ex:color></ex:fruit></ex:fruits>,(xdmp:permission("xa", "read")), "xmldocs");
        xdmp:document-insert("/Test/mlcp-export-xmlquery-filter/fruit2.xml",<ex:fruits xmlns:ex="http://marklogic.com/example"><ex:fruit><ex:name>apple</ex:name><ex:color>red</ex:color></ex:fruit> <ex:fruit><ex:name>grape</ex:name><ex:color>purple</ex:color></ex:fruit> </ex:fruits>,(xdmp:permission("xa", "read")), "xmldocs");
        xdmp:document-insert("/Test/mlcp-export-xmlquery-filter/fruit3.xml",<ex:fruits xmlns:ex="http://marklogic.com/example"><ex:fruit><ex:name>watermellon</ex:name><ex:color>green</ex:color></ex:fruit> <ex:fruit><ex:name>orange</ex:name><ex:color>orange</ex:color></ex:fruit> </ex:fruits>,(xdmp:permission("xa", "read")), "xmldocs");
 declareUpdate();
        xdmp.documentInsert("/Test/mlcp-export-query-filter/JSONfruit1.json", {"fruits" :[ {"fruit": {"name":"apple","color": "red"}},{"fruit": {"name":"orange","color": "orange"}}]},[xdmp.permission("xa", "read")],"jsondocs"),
        xdmp.documentInsert("/Test/mlcp-export-query-filter/JSONfruit2.json", {"fruits" :[ {"fruit": {"name":"apple","color": "red"}},{"fruit": {"name":"kiwi","color": "green"}}]},[xdmp.permission("xa", "read")],"jsondocs"),
        xdmp.documentInsert("/Test/mlcp-export-query-filter/JSONfruit3.json", {"fruits" :[ {"fruit": {"name":"strawberry","color": "red"}},{"fruit": {"name":"orange","color": "orange"}}]},[xdmp.permission("xa", "read")],"jsondocs");
 xquery version "1.0-ml";

xdmp:document-load("QA_HOME/mlcp/data/text/text1.txt", <options xmlns="xdmp:document-load"><uri>/Test/mlcp-export-xmlquery-filter/text/1.txt</uri><format>text</format><collections><collection>mlcp-export-xmlquery-filter</collection></collections></options>),

xdmp:document-load("QA_HOME/mlcp/data/text/text2.txt", <options xmlns="xdmp:document-load"><uri>/Test/mlcp-export-xmlquery-filter/text/2.txt</uri><format>text</format><collections><collection>mlcp-export-xmlquery-filter</collection></collections></options>),

xdmp:document-load("QA_HOME/mlcp/data/binary/996.jpg", <options xmlns="xdmp:document-load"><uri>/Test/mlcp-export-xmlquery-filter/binary/1.jpg</uri><format>binary</format><collections><collection>mlcp-export-xmlquery-filter</collection></collections></options>),

xdmp:document-load("QA_HOME/mlcp/data/binary/lemon.gif", <options xmlns="xdmp:document-load"><uri>/Test/mlcp-export-xmlquery-filter/binary/2.gif</uri><format>binary</format><collections><collection>mlcp-export-xmlquery-filter</collection></collections></options>),

xdmp:document-set-property("/Test/mlcp-export-xmlquery-filter/naked/example.xml",<fruit><name>apple</name><color>red</color></fruit>);

xdmp:document-set-collections("/Test/mlcp-export-xmlquery-filter/naked/example.xml", "xmldocs"),
xdmp:document-set-property("/Test/mlcp-export-xmlquery-filter/fruit3.xml",<ex:fruit xmlns:ex="http://marklogic.com/example"><name>apple</name><color>red</color></ex:fruit>);

Detach the database forest and attach a new forest.

Now run the command below, pointing the input file path at the detached forest.

Command and standard out

running following shell command:
rm /space/b8_0/qa/mlcp/log/mlcpconsolelog.xml;
/space/b8_0/qa/mlcp/runmlcp.sh /space/b8_0/qa IMPORT -input_file_path /var/opt/MarkLogic/Forests/mlcp-f15a -input_file_type forest -host localhost -port 5275 -username admin -password admin -output_collections A -output_uri_prefix /Test1 -mode local
(echo '<root xmlns:log4j="'"test"'" >';cat /space/b8_0/qa/mlcp/log/mlcpconsolelog.xml ;echo '')>/space/b8_0/qa/mlcp/log/temp.xml;
rm /space/b8_0/qa/mlcp/log/mlcpconsolelog.xml

stderr/out from shell cmd: 16/09/27 14:35:07 INFO contentpump.LocalJobRunner: Content type is set to MIXED. The format of the inserted documents will be determined by the MIME type specification configured on MarkLogic Server.
16/09/27 14:35:07 INFO input.FileInputFormat: Total input paths to process : 5
16/09/27 14:35:08 WARN contentpump.ImportDocumentMapper: Skipped record: () from /Test/mlcp-export-xmlquery-filter/fruit3.xml in file:/var/opt/MarkLogic/Forests/mlcp-f15a/00000000/TreeData, reason: fragment or link
16/09/27 14:35:08 INFO contentpump.LocalJobRunner: completed 100%
16/09/27 14:35:08 INFO contentpump.LocalJobRunner: com.marklogic.mapreduce.MarkLogicCounter:
16/09/27 14:35:08 INFO contentpump.LocalJobRunner: INPUT_RECORDS: 11
16/09/27 14:35:08 INFO contentpump.LocalJobRunner: OUTPUT_RECORDS: 10
16/09/27 14:35:08 INFO contentpump.LocalJobRunner: OUTPUT_RECORDS_COMMITTED: 10
16/09/27 14:35:08 INFO contentpump.LocalJobRunner: OUTPUT_RECORDS_FAILED: 0
16/09/27 14:35:08 INFO contentpump.LocalJobRunner: Total execution time: 0 sec

Actual:

I do not see the skipped message for the naked properties document, and the document is not in the database after import. The expected message looks like this:

16/09/27 14:31:08 WARN contentpump.ImportDocumentMapper: Skipped record: () from /Test/mlcp-export-xmlquery-filter/naked/example.xml in file:/project/qa/skottam/Forests/mlcp-f15a/00000002/TreeData, reason: fragment or link

incorrect links to docs in README

The "Required Software" section of README.md contains links to pubs. This is probably true across all the branches. I only looked at 8.0-develop and master. Pubs is an internal server, so these must be changed.

You should be able to replace the "pubs.marklogic.com:8011" part of the URL with "docs.marklogic.com", anywhere it occurs in README.md.
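
For example, a one-liner along these lines should do it (GNU sed assumed):

sed -i 's/pubs\.marklogic\.com:8011/docs.marklogic.com/g' README.md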

You might also consider making the links version-specific for 8.0, though this is not strictly necessary and might be an ongoing maintenance headache you don't want.

For example, a link like http://docs.marklogic.com/guide/mlcp/install takes you to the default doc version. That is 8.0 right now, but will be 9.0 after ML9 ships. If you instead use a link like http://docs.marklogic.com/8.0/guide/mlcp/install on 8.0-develop and 8.0-master, then it will always go to the appropriate version for 8.0.

The problem is that you'd have to remember to update the links in README.md for each major release (9, 10, etc.). There's plenty of other places where we chose the lighter maintenance burden over accuracy. :)

copyright year not filled in when building only javadoc

If you build the javadoc standalone using the javadoc:javadoc target (as the nightly doc build would like to do), the copyright notice at the bottom of the generated javadoc pages contains the variable ${thisyear} instead of the actual year. For example, it looks like this:

Copyright © ${thisyear} MarkLogic Corporation. All Rights Reserved.

It should be possible to build just the javadoc, without giving the lawyers hysterics. :)

You can find the generated javadoc in mapreduce/target/site/javadoc, and you can examine the copyright notice at the very bottom of any page, such as index.html.

I tried the following two build commands, both with the same result:

cd marklogic-contentpump
mvn --projects mapreduce javadoc:javadoc

cd mapreduce
mvn javadoc:javadoc

(javadoc:javadoc is the standard maven javadoc target, not some strange thing I invented.)
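
One plausible fix, assuming thisyear is currently defined only by the full packaging build: define it in the shared pom.xml so it also resolves for a standalone javadoc:javadoc run, for example via Maven's built-in build timestamp:

<properties>
  <!-- maven.build.timestamp honors this format; yyyy yields e.g. 2016 -->
  <maven.build.timestamp.format>yyyy</maven.build.timestamp.format>
  <thisyear>${maven.build.timestamp}</thisyear>
</properties>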

Misleading command line option description

Type mlcp.bat import (assuming mlcp is in the path).

This will list all the possible options for the import command.

Search for -transform_param and check the description. The description says "Name of the transform function".

It should be "Optional extra data to pass through to a custom transformation function".

Stream ZipEntry reading of CompressedRDFReader

CompressedRDFReader currently buffers ZipEntrys in the Java heap to read a zipped file. Instead, it should stream the reading process, which is desired and preferable. CompressedAggregateReader and CompressedDelimitedTextReader already do this.

While attempting this, I found that some class in CompressedRDFReader closes the ZipInputStream in the middle of a job. It is related to a Java bug, JDK-6539065. For CompressedAggregateReader, because I know it is the XMLStreamReader that closes the ZipInputStream mid-job, we can bypass the issue by overriding ZipInputStream to disable that behavior. However, for CompressedRDFReader, I have not figured out what closes the ZipInputStream.
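
A sketch of that bypass, assuming we control construction of the stream (reallyClose() is a hypothetical helper for the end-of-job close):

import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipInputStream;

class NonClosingZipInputStream extends ZipInputStream {
    NonClosingZipInputStream(InputStream in) {
        super(in);
    }

    // Swallow the mid-job close() calls made by readers (cf. JDK-6539065);
    // the underlying stream stays open until the job is done with it.
    @Override
    public void close() throws IOException {
        // intentionally a no-op
    }

    void reallyClose() throws IOException {
        super.close();
    }
}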

You can refer to bug:39688 and bug:25918 for more info.

Please investigate. Thanks.

MLCP documentation typo

Open https://docs.marklogic.com/guide/mlcp.pdf

Go to page number 72

Look at the example command given -

mlcp.sh copy -mode local -host srchost -port 8000 \ -username user -password password \ -input_file_path /space/mlcp/txform/data \ -transform_module /example/mlcp-transform.sjs \ -transform_function transform \ -transform_param "my-value"

Should this be 'import' instead of 'copy'?

Unable to import XML containing a Unicode supplementary character as aggregates

Input file:

<p><q>b Justus-Liebig-Universität , Gie&#x1d4b7;en , Germany.</q></p>

Mlcp command and logging:

[jsun@msun-z620 mlcp-8.0]$ bin/mlcp.sh import -username admin -password admin -host msun -port 8000 -mode local -input_file_path /space/tmp/special-char.xml -input_file_type aggregates -aggregate_record_element q -generate_uri
16/10/25 10:15:41 INFO contentpump.LocalJobRunner: Content type: XML
16/10/25 10:15:42 INFO contentpump.FileAndDirectoryInputFormat: Total input paths to process : 1
16/10/25 10:15:44 ERROR mapreduce.ContentWriter: XDMP-DOCCHARREF: Invalid character reference "55349" at /space/tmp/special-char.xml-0-1 line 1
16/10/25 10:15:44 WARN mapreduce.ContentWriter: Failed document /space/tmp/special-char.xml-0-1 in file:/space/tmp/special-char.xml at line 1:7
16/10/25 10:15:44 INFO contentpump.LocalJobRunner:  completed 100%
16/10/25 10:15:44 INFO contentpump.LocalJobRunner: com.marklogic.mapreduce.MarkLogicCounter:
16/10/25 10:15:44 INFO contentpump.LocalJobRunner: INPUT_RECORDS: 1
16/10/25 10:15:44 INFO contentpump.LocalJobRunner: OUTPUT_RECORDS: 1
16/10/25 10:15:44 INFO contentpump.LocalJobRunner: OUTPUT_RECORDS_COMMITTED: 1
16/10/25 10:15:44 INFO contentpump.LocalJobRunner: OUTPUT_RECORDS_FAILED: 0
16/10/25 10:15:44 INFO contentpump.LocalJobRunner: Total execution time: 2 sec
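
The reference value 55349 is telling: it is 0xD835, a UTF-16 high surrogate, which suggests the aggregate splitter re-serialized the supplementary character as two separate surrogate character references, which is not well-formed XML. A quick check in Java:

// U+1D4B7 is encoded in UTF-16 as the surrogate pair D835 DCB7.
String s = new String(Character.toChars(0x1D4B7));
System.out.println((int) s.charAt(0)); // 55349 (0xD835) -- matches the error
System.out.println((int) s.charAt(1)); // 56503 (0xDCB7)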

However, if this document is ingested as plain XML, it goes through:

[jsun@msun-z620 mlcp-8.0]$ bin/mlcp.sh import -username admin -password admin -host msun -port 8000 -mode local -input_file_path /space/tmp/special-char.xml
16/10/25 10:11:20 INFO contentpump.LocalJobRunner: Content type is set to MIXED.  The format of the  inserted documents will be determined by the MIME  type specification configured on MarkLogic Server.
16/10/25 10:11:21 INFO contentpump.FileAndDirectoryInputFormat: Total input paths to process : 1
16/10/25 10:11:21 INFO contentpump.LocalJobRunner:  completed 100%
16/10/25 10:11:21 INFO contentpump.LocalJobRunner: com.marklogic.mapreduce.MarkLogicCounter:
16/10/25 10:11:21 INFO contentpump.LocalJobRunner: INPUT_RECORDS: 1
16/10/25 10:11:21 INFO contentpump.LocalJobRunner: OUTPUT_RECORDS: 1
16/10/25 10:11:21 INFO contentpump.LocalJobRunner: OUTPUT_RECORDS_COMMITTED: 1
16/10/25 10:11:21 INFO contentpump.LocalJobRunner: OUTPUT_RECORDS_FAILED: 0
16/10/25 10:11:21 INFO contentpump.LocalJobRunner: Total execution time: 0 sec

mlcp: Missing metadata warning message in mlcp-8.0 branch

I have the test mlcp-set-copy-col-perm-qlty-false.xml checked in to 8.0-nightly.

We see:

Skipped record: () from /Test/mlcp-copy-binquery-filter/binary/1.jpg in file:/tmp/mlcp_export_archive/20160920113220-0700-000004-BINARY.zip, reason: Missing metadata
Skipped record: () from /Test/mlcp-copy-binquery-filter/binary/2.gif in file:/tmp/mlcp_export_archive/20160920113220-0700-000009-BINARY.zip, reason: Missing metadata

My understanding is that we don't have metadata in 8.0-nightly, so I am expecting no skipped records here. The test started failing on 09/18/2016.

 /space/b8_0/qa/mlcp/runmlcp.sh /space/b8_0/qa import -copy_collections false -copy_permissions false -copy_properties false -copy_quality false -input_file_path /tmp/mlcp_export_archive -input_file_type archive -output_uri_prefix /Test1/  -options_file /space/b8_0/qa/mlcp/data/804/input_option_files/import_options.txt;

stdout:

stderr/out from shell cmd: rm: cannot remove '/space/b8_0/qa/mlcp/log/mlcpconsolelog.xml': No such file or directory
16/09/20 11:32:40 INFO contentpump.LocalJobRunner: Content type: XML
16/09/20 11:32:41 INFO contentpump.FileAndDirectoryInputFormat: Total input paths to process : 10
16/09/20 11:32:41 WARN contentpump.ImportDocumentMapper: Skipped record: () from /Test/mlcp-copy-binquery-filter/binary/1.jpg in file:/tmp/mlcp_export_archive/20160920113220-0700-000004-BINARY.zip, reason: Missing metadata
16/09/20 11:32:41 WARN contentpump.ImportDocumentMapper: Skipped record: () from /Test/mlcp-copy-binquery-filter/binary/2.gif in file:/tmp/mlcp_export_archive/20160920113220-0700-000009-BINARY.zip, reason: Missing metadata
16/09/20 11:32:41 INFO contentpump.LocalJobRunner:  completed 100%
16/09/20 11:32:41 INFO contentpump.LocalJobRunner: com.marklogic.mapreduce.MarkLogicCounter: 
16/09/20 11:32:41 INFO contentpump.LocalJobRunner: INPUT_RECORDS: 11
16/09/20 11:32:41 INFO contentpump.LocalJobRunner: OUTPUT_RECORDS: 9
16/09/20 11:32:41 INFO contentpump.LocalJobRunner: OUTPUT_RECORDS_COMMITTED: 9
16/09/20 11:32:41 INFO contentpump.LocalJobRunner: OUTPUT_RECORDS_FAILED: 0
16/09/20 11:32:41 INFO contentpump.LocalJobRunner: Total execution time: 0 sec

MLCP does not set non-zero exit status upon error

MLCP does not return a failure status by way of its exit code. That makes it rather hard to integrate into automated processes in a robust fashion.

Example:

imogas 1005_% mlcp import -host localhost -port 8000 -input_file_path foo
17/06/20 00:25:54 INFO contentpump.LocalJobRunner: Content type is set to MIXED.  The format of the  inserted documents will be determined by the MIME  type specification configured on MarkLogic Server.
17/06/20 00:25:54 INFO contentpump.ContentPump: Job name: local_1494554887_1
17/06/20 00:25:54 ERROR contentpump.LocalJobRunner: Error checking output specification:
17/06/20 00:25:54 ERROR contentpump.LocalJobRunner: No input files found with the specified input path file:/home/hans/Development/configuration-service/foo and input file pattern .*
imogas 1006_% echo $?
0

The expected exit status would have been non-zero. At present, a non-zero exit status is generated by MLCP only for usage errors, not for execution errors.
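
Until that is fixed, a hedged shell workaround is to scrape the console output for ERROR lines, since the log is currently the only failure signal mlcp exposes:

mlcp import -host localhost -port 8000 -input_file_path foo 2>&1 | tee mlcp.log
grep -q ERROR mlcp.log && exit 1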

NullPointerException on SSL run

I am not able to reproduce this again, but I want to see whether we can find a reason why this happened, based on the data we have.

Here is the command I am using

 running following shell command: /space/Head/qa/mlcp/runmlcpHadoop.sh /space/Head/qa import -username admin -password admin -host engrlab-130-159.engrlab.marklogic.com -port 5275 -input_file_path file:///project/engineering/qa/mlcp/data/foo.bin -streaming true -document_type binary -output_uri_replace "foo,'foobug17062distributed'"
stderr/out from shell cmd: Today is SSL and Client_CERT enabled: true

IMPORT -conf /space/Head/qa/mlcp/mlcp-9.0/conf/MlcpSSLConfig.xml -username admin -password admin -host engrlab-130-159.engrlab.marklogic.com -port 5275 -input_file_path file:///project/engineering/qa/mlcp/data/foo.bin -streaming true -document_type binary -output_uri_replace foo,'foobug17062distributed'
###############SSL certificate location:- /project/qa/nightly-regression/ssl/user.p12

stderr/out from shell cmd: 16/11/08 00:43:29 INFO contentpump.LocalJobRunner: Content type: BINARY
16/11/08 00:43:29 INFO contentpump.ContentPump: Job name: distributed_791921316_1
16/11/08 00:43:29 INFO client.RMProxy: Connecting to ResourceManager at rh7v-intel64-cdh-1.marklogic.com/172.18.133.113:8032
16/11/08 00:43:31 INFO contentpump.FileAndDirectoryInputFormat: Total input paths to process : 1
16/11/08 00:43:31 INFO mapreduce.JobSubmitter: number of splits:1
16/11/08 00:43:31 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1475872281186_18660
16/11/08 00:43:31 INFO impl.YarnClientImpl: Submitted application application_1475872281186_18660
16/11/08 00:43:32 INFO mapreduce.Job: The url to track the job: http://rh7v-intel64-cdh-1.marklogic.com:8088/proxy/application_1475872281186_18660/
16/11/08 00:43:32 INFO mapreduce.Job: Running job: job_1475872281186_18660
16/11/08 00:43:37 INFO mapreduce.Job: Job job_1475872281186_18660 running in uber mode : false
16/11/08 00:43:37 INFO mapreduce.Job:  map 0% reduce 0%
16/11/08 00:43:42 INFO mapreduce.Job: Task Id : attempt_1475872281186_18660_m_000000_0, Status : FAILED
Error: java.lang.NullPointerException
        at com.marklogic.contentpump.StreamingDocumentReader.nextKeyValue(StreamingDocumentReader.java:42)
        at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:556)
        at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:80)
        at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:91)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)

16/11/08 00:43:47 INFO mapreduce.Job:  map 100% reduce 0%
16/11/08 00:43:48 INFO mapreduce.Job: Job job_1475872281186_18660 completed successfully
16/11/08 00:43:48 INFO mapreduce.Job: Counters: 36
        File System Counters
                FILE: Number of bytes read=33554432
                FILE: Number of bytes written=140864
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=122
                HDFS: Number of bytes written=0
                HDFS: Number of read operations=1
                HDFS: Number of large read operations=0
                 HDFS: Number of write operations=0
        Job Counters
                Failed map tasks=1
                Launched map tasks=2
                Other local map tasks=1
                Rack-local map tasks=1
                Total time spent by all maps in occupied slots (ms)=5778
                Total time spent by all reduces in occupied slots (ms)=0
                Total time spent by all map tasks (ms)=5778
                Total vcore-seconds taken by all map tasks=5778
                Total megabyte-seconds taken by all map tasks=5916672
        Map-Reduce Framework
                Map input records=1
                Map output records=1
                Input split bytes=122
                Spilled Records=0
                Failed Shuffles=0
                Merged Map outputs=0
                GC time elapsed (ms)=52
                CPU time spent (ms)=1920
                Physical memory (bytes) snapshot=258510848
                Virtual memory (bytes) snapshot=2791763968
                Total committed heap usage (bytes)=446693376
        com.marklogic.mapreduce.MarkLogicCounter
                INPUT_RECORDS=1
                OUTPUT_RECORDS=1
                OUTPUT_RECORDS_COMMITTED=1
                OUTPUT_RECORDS_FAILED=0
        File Input Format Counters
                Bytes Read=0
        File Output Format Counters
                Bytes Written=0

MLCP User Guide Typo/Error

Look at the MLCP user guide page number 45

cat delim.opt -input_file_type delimited_text -delimiter "tab"

It says a delimiter can be "tab"

now look at the page number 78

it says -delimiter data type is a character.

MLCP COPY command with SSL throws IOException

Here is the command I am using

/space/Head/qa/mlcp/runmlcpHadoop.sh /space/Head/qa copy -input_username admin -input_password admin -input_host engrlab-130-159.engrlab.marklogic.com -input_port 5275 -copy_collections true -copy_permissions true -copy_properties true -copy_quality true -output_username admin -output_password admin -output_host engrlab-130-159.engrlab.marklogic.com -output_port 9112  

runmlcpHadoop.sh handles the SSL setup and modifies the options to include -conf, as below:

stderr/out from shell cmd: Today is SSL and Client_CERT enabled: true

COPY -conf /space/Head/qa/mlcp/mlcp-9.0/conf/MlcpSSLConfig.xml -input_username admin -input_password admin -input_host engrlab-130-159.engrlab.marklogic.com -input_port 5275 -copy_collections true -copy_permissions true -copy_properties true -copy_quality true -output_username admin -output_password admin -output_host engrlab-130-159.engrlab.marklogic.com -output_port 9112

and here is the configuration file:

cat /space/Head/qa/mlcp/mlcp-9.0/conf/MlcpSSLConfig.xml

<configuration>
	<property>
		<name>mapreduce.marklogic.input.usessl</name>
		<value>true</value>
	</property>
	<property>
		<name>mapreduce.marklogic.input.ssloptionsclass</name>
		<value>test.hadoop.SslOptions</value>
	</property>
	<property>
		<name>mapreduce.marklogic.output.usessl</name>
		<value>true</value>
	</property>
	<property>
		<name>mapreduce.marklogic.output.ssloptionsclass</name>
		<value>test.hadoop.SslOptions</value>
	</property>
</configuration>

I see the following on the console

16/11/08 10:31:14 INFO client.RMProxy: Connecting to ResourceManager at rh7v-intel64-cdh-1.marklogic.com/172.18.133.113:8032
16/11/08 10:31:45 ERROR contentpump.ContentPump: Error running a ContentPump job
java.io.IOException: java.io.IOException: com.marklogic.xcc.exceptions.ServerConnectionException: SSL wrapped byte channel
 [Session: user=admin, cb={default} [ContentSource: user=admin, cb={none} [provider: SSLconn address=engrlab-130-159.engrlab.marklogic.com/172.18.130.159:9112, pool=0/64]]]
 [Client: XCC/9.0-1]
	at com.marklogic.mapreduce.MarkLogicOutputFormat.checkOutputSpecs(MarkLogicOutputFormat.java:95)
	at org.apache.hadoop.mapreduce.JobSubmitter.checkSpecs(JobSubmitter.java:562)
	at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:432)
	at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1296)
	at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1293)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
	at org.apache.hadoop.mapreduce.Job.submit(Job.java:1293)
	at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1314)
	at com.marklogic.contentpump.ContentPump.submitJob(ContentPump.java:304)
	at com.marklogic.contentpump.ContentPump.runCommand(ContentPump.java:205)
	at com.marklogic.contentpump.ContentPump.main(ContentPump.java:63)
Caused by: java.io.IOException: com.marklogic.xcc.exceptions.ServerConnectionException: SSL wrapped byte channel
 [Session: user=admin, cb={default} [ContentSource: user=admin, cb={none} [provider: SSLconn address=engrlab-130-159.engrlab.marklogic.com/172.18.130.159:9112, pool=0/64]]]
 [Client: XCC/9.0-1]
	at com.marklogic.mapreduce.ContentOutputFormat.checkOutputSpecs(ContentOutputFormat.java:233)
	at com.marklogic.mapreduce.MarkLogicOutputFormat.checkOutputSpecs(MarkLogicOutputFormat.java:92)
	... 12 more
Caused by: com.marklogic.xcc.exceptions.ServerConnectionException: SSL wrapped byte channel
 [Session: user=admin, cb={default} [ContentSource: user=admin, cb={none} [provider: SSLconn address=engrlab-130-159.engrlab.marklogic.com/172.18.130.159:9112, pool=0/64]]]
 [Client: XCC/9.0-1]
	at com.marklogic.xcc.impl.handlers.AbstractRequestController.runRequest(AbstractRequestController.java:149)
	at com.marklogic.xcc.impl.SessionImpl.submitRequestInternal(SessionImpl.java:437)
	at com.marklogic.xcc.impl.SessionImpl.submitRequest(SessionImpl.java:432)
	at com.marklogic.mapreduce.ContentOutputFormat.initialize(ContentOutputFormat.java:374)
	at com.marklogic.mapreduce.ContentOutputFormat.checkOutputSpecs(ContentOutputFormat.java:165)
	... 13 more
Caused by: java.io.EOFException: SSL wrapped byte channel
	at com.marklogic.io.SslByteChannel.handleHandshake(SslByteChannel.java:442)
	at com.marklogic.io.SslByteChannel.pushToEngine(SslByteChannel.java:401)
	at com.marklogic.io.SslByteChannel.write(SslByteChannel.java:361)
	at com.marklogic.http.HttpChannel.writeBuffer(HttpChannel.java:388)
	at com.marklogic.http.HttpChannel.writeHeaders(HttpChannel.java:380)
	at com.marklogic.http.HttpChannel.flushRequest(HttpChannel.java:361)
	at com.marklogic.http.HttpChannel.receiveMode(HttpChannel.java:286)
	at com.marklogic.http.HttpChannel.getResponseCode(HttpChannel.java:193)
	at com.marklogic.xcc.impl.handlers.EvalRequestController.serverDialog(EvalRequestController.java:76)
	at com.marklogic.xcc.impl.handlers.AbstractRequestController.runRequest(AbstractRequestController.java:88)
	... 17 more
java.io.IOException: java.io.IOException: com.marklogic.xcc.exceptions.ServerConnectionException: SSL wrapped byte channel
 [Session: user=admin, cb={default} [ContentSource: user=admin, cb={none} [provider: SSLconn address=engrlab-130-159.engrlab.marklogic.com/172.18.130.159:9112, pool=0/64]]]
 [Client: XCC/9.0-1]
	at com.marklogic.mapreduce.MarkLogicOutputFormat.checkOutputSpecs(MarkLogicOutputFormat.java:95)
	at org.apache.hadoop.mapreduce.JobSubmitter.checkSpecs(JobSubmitter.java:562)
	at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:432)
	at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1296)
	at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1293)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
	at org.apache.hadoop.mapreduce.Job.submit(Job.java:1293)
	at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1314)
	at com.marklogic.contentpump.ContentPump.submitJob(ContentPump.java:304)
	at com.marklogic.contentpump.ContentPump.runCommand(ContentPump.java:205)
	at com.marklogic.contentpump.ContentPump.main(ContentPump.java:63)
Caused by: java.io.IOException: com.marklogic.xcc.exceptions.ServerConnectionException: SSL wrapped byte channel
 [Session: user=admin, cb={default} [ContentSource: user=admin, cb={none} [provider: SSLconn address=engrlab-130-159.engrlab.marklogic.com/172.18.130.159:9112, pool=0/64]]]
 [Client: XCC/9.0-1]
	at com.marklogic.mapreduce.ContentOutputFormat.checkOutputSpecs(ContentOutputFormat.java:233)
	at com.marklogic.mapreduce.MarkLogicOutputFormat.checkOutputSpecs(MarkLogicOutputFormat.java:92)
	... 12 more
Caused by: com.marklogic.xcc.exceptions.ServerConnectionException: SSL wrapped byte channel
 [Session: user=admin, cb={default} [ContentSource: user=admin, cb={none} [provider: SSLconn address=engrlab-130-159.engrlab.marklogic.com/172.18.130.159:9112, pool=0/64]]]
 [Client: XCC/9.0-1]
	at com.marklogic.xcc.impl.handlers.AbstractRequestController.runRequest(AbstractRequestController.java:149)
	at com.marklogic.xcc.impl.SessionImpl.submitRequestInternal(SessionImpl.java:437)
	at com.marklogic.xcc.impl.SessionImpl.submitRequest(SessionImpl.java:432)
	at com.marklogic.mapreduce.ContentOutputFormat.initialize(ContentOutputFormat.java:374)
	at com.marklogic.mapreduce.ContentOutputFormat.checkOutputSpecs(ContentOutputFormat.java:165)
	... 13 more
Caused by: java.io.EOFException: SSL wrapped byte channel
	at com.marklogic.io.SslByteChannel.handleHandshake(SslByteChannel.java:442)
	at com.marklogic.io.SslByteChannel.pushToEngine(SslByteChannel.java:401)
	at com.marklogic.io.SslByteChannel.write(SslByteChannel.java:361)
	at com.marklogic.http.HttpChannel.writeBuffer(HttpChannel.java:388)
	at com.marklogic.http.HttpChannel.writeHeaders(HttpChannel.java:380)
	at com.marklogic.http.HttpChannel.flushRequest(HttpChannel.java:361)
	at com.marklogic.http.HttpChannel.receiveMode(HttpChannel.java:286)
	at com.marklogic.http.HttpChannel.getResponseCode(HttpChannel.java:193)
	at com.marklogic.xcc.impl.handlers.EvalRequestController.serverDialog(EvalRequestController.java:76)
	at com.marklogic.xcc.impl.handlers.AbstractRequestController.runRequest(AbstractRequestController.java:88)
	... 17 more


mlcp unittest reusing AssignmentManager

The following unit tests failed with my recent checkin:

testImportTransformMedlineZipFast(com.marklogic.contentpump.TestImportAggregate): expected:<[2]> but was:<[0]>
testImportDelimitedText_(com.marklogic.contentpump.TestImportDelimitedText): expected:<[1]> but was:<[0]>

The problem is that the AssignmentManager did not get re-initialized after the test changed databases.

test.hadoop.WriteTextToHadoop not writing files to MapR HDFS

The following program is supposed to create a file called foo on the MapR HDFS at the following location: /user/builder/{hostname}/foo

The program runs but does not write any output to HDFS.

Here is the invocation:
/space/hadoop/bin/hadoop jar /space/jsolis/b8_0/qa/lib/utilities.jar test.hadoop.WriteTextToHadoop

If I manually create foo on the HDFS filesystem and then run the program, I get:
/space/hadoop/bin/hadoop jar /space/jsolis/b8_0/qa/lib/utilities.jar test.hadoop.WriteTextToHadoop
Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory maprfs:///user/builder/jsolis-z600.marklogic.com/foo already exists

So it seems like it's trying to create the file.

Here is my sample conf file, WriteTextToHadoop.xml; I can also provide the sample program WriteTextToHadoop.java if needed.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Used with test.hadoop.WriteTextToHadoop -->

<configuration>
<property>  <name>tmpjars</name>    <value>file:///space/hadoop/lib/maprdb-5.1.0-mapr-tests.jar,file:///space/hadoop/lib/flexjson-2.1.jar,file:///space/hadoop/lib/commons-collections-3.2.2.jar,file:///space/hadoop/lib/guava-14.0.1.jar,file:///space/hadoop/lib/libprotodefs-5.1.0-mapr.jar,file:///space/hadoop/lib/commons-configuration-1.6.jar,file:///space/hadoop/lib/central-logging-5.1.0-mapr.jar,file:///space/hadoop/lib/commons-logging-1.1.3-api.jar,file:///space/hadoop/lib/maprdb-mapreduce-5.1.0-mapr-tests.jar,file:///space/hadoop/lib/maprdb-shell-5.1.0-mapr.jar,file:///space/hadoop/lib/jackson-core-2.7.1.jar,file:///space/hadoop/lib/spring-asm-3.0.3.RELEASE.jar,file:///space/hadoop/lib/jackson-annotations-2.7.1.jar,file:///space/hadoop/lib/spring-beans-3.0.3.RELEASE.jar,file:///space/hadoop/lib/commons-io-2.4.jar,file:///space/hadoop/lib/ojai-1.0.jar,file:///space/hadoop/lib/spring-context-3.0.3.RELEASE.jar,file:///space/hadoop/lib/maprbuildversion-5.1.0-mapr.jar,file:///space/hadoop/lib/mapr-tools-5.1.0-mapr-tests.jar,file:///space/hadoop/lib/commons-el-1.0.jar,file:///space/hadoop/lib/commons-codec-1.5.jar,file:///space/hadoop/lib/xcc.jar,file:///space/hadoop/lib/json-20080701.jar,file:///space/hadoop/lib/maprdb-5.1.0-mapr.jar,file:///space/hadoop/lib/mapr-hbase-5.1.0-mapr-tests.jar,file:///space/hadoop/lib/spring-shell-1.1.0-mapr-1602.jar,file:///space/hadoop/lib/protobuf-java-2.5.0.jar,file:///space/hadoop/lib/mapr-hbase-5.1.0-mapr.jar,file:///space/hadoop/lib/slf4j-api-1.7.12.jar,file:///space/hadoop/lib/jackson-databind-2.7.1.jar,file:///space/hadoop/lib/mysql-connector-java-5.1.25.jar,file:///space/hadoop/lib/utilities.jar,file:///space/hadoop/lib/commons-email-1.2.jar,file:///space/hadoop/lib/spring-core-3.0.3.RELEASE.jar,file:///space/hadoop/lib/maprfs-5.1.0-mapr.jar,file:///space/hadoop/lib/mapr-streams-5.1.0-mapr.jar,file:///space/hadoop/lib/commons-lang-2.5.jar,file:///space/hadoop/lib/log4j-1.2.17.jar,file:///space/hadoop/lib/maprdb-mapreduce-5.1.0-mapr.jar,file:///space/hadoop/lib/spring-expression-3.0.3.RELEASE.jar,file:///space/hadoop/lib/antlr4-runtime-4.5.jar,file:///space/hadoop/lib/jline-2.11.jar,file:///space/hadoop/lib/marklogic-mapreduce2.jar,file:///space/hadoop/lib/json-smart-1.2.jar,file:///space/hadoop/lib/maprfs-diagnostic-tools-5.1.0-mapr.jar,file:///space/hadoop/lib/commons-logging-1.1.3.jar,file:///space/hadoop/lib/mapr-tools-5.1.0-mapr.jar,file:///space/hadoop/lib/mapr-streams-5.1.0-mapr-tests.jar,file:///space/hadoop/lib/ojai-mapreduce-1.0.jar,</value></property>
    <property>
        <name>mapreduce.marklogic.input.username</name>
        <value>hadoopRead</value>
    </property>
    <property>
        <name>mapreduce.marklogic.input.password</name>
        <value>hadoopRead</value>
    </property>
    <property>
        <name>mapreduce.marklogic.input.host</name>
        <value>jsolis-z600.marklogic.com</value>
    </property>
    <property>
        <name>mapreduce.marklogic.input.port</name>
        <value>5275</value>
    </property>
    <property>
        <name>mapreduce.marklogic.input.mode</name>
        <value>basic</value>
    </property>
    <property>
    <name>mapreduce.marklogic.input.subdocumentexpr</name>
    <value>//*:bar</value>
    </property>
    <property>
        <name>mapreduce.marklogic.output.content.type</name>
        <value>TEXT</value>
    </property>
    <property>
      <name>mapreduce.map.output.compress</name>
      <value>true</value>
    </property>
    <property>
      <name>mapred.map.output.compress.codec</name>
      <value>org.apache.hadoop.io.compress.SnappyCodec</value>
    </property>
</configuration>
```
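For context, this is a standard Hadoop configuration resource. A minimal sketch of how a job could load it before submission follows; the file path and class name here are hypothetical, not the actual test harness:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

public class WriteTextToHadoopConfig {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Load the property file shown above; the path is hypothetical.
        conf.addResource(new Path("file:///space/hadoop/conf/marklogic-test.xml"));

        // The MarkLogic input properties are now visible to the job.
        System.out.println(conf.get("mapreduce.marklogic.input.host"));

        Job job = Job.getInstance(conf, "WriteTextToHadoop");
        // ... configure input/output formats and submit ...
    }
}
```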

NullPointerException thrown from MultithreadedMapper doing a forest import.

mlcp-9.0/bin/mlcp.sh import -username admin -password admin -host jchen -port 5275 -input_file_type forest -input_file_path /space/projects/head/xdmp/src/Data2/Forests/Documents

16/07/20 15:23:01 ERROR contentpump.MultithreadedMapper:
java.lang.NullPointerException
at com.marklogic.contentpump.MultithreadedMapper$SubMapRecordReader.nextKeyValue(MultithreadedMapper.java:285)
at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:80)
at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:91)
at com.marklogic.contentpump.BaseMapper.runThreadSafe(BaseMapper.java:45)
at com.marklogic.contentpump.MultithreadedMapper$MapRunner.run(MultithreadedMapper.java:376)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

The forest in question contains properties fragments. Not sure why this doesn't surface as a problem elsewhere; it seems like something we are not covering in regression tests.
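A minimal sketch of the kind of defensive guard that would avoid this NPE, assuming the shared outer reader can hand back a null key or value for properties fragments; the wrapper class below is illustrative, not the actual mlcp source:

```java
import java.io.IOException;
import org.apache.hadoop.mapreduce.RecordReader;

// Illustrative wrapper: skip records whose key or value comes back null,
// as may happen for properties fragments in a forest input split.
class NullSafeReader<K, V> {
    private final RecordReader<K, V> outer;
    private K key;
    private V value;

    NullSafeReader(RecordReader<K, V> outer) {
        this.outer = outer;
    }

    boolean nextKeyValue() throws IOException, InterruptedException {
        synchronized (outer) {  // sub-mappers share one underlying reader
            while (outer.nextKeyValue()) {
                key = outer.getCurrentKey();
                value = outer.getCurrentValue();
                if (key != null && value != null) {
                    return true;  // usable record
                }
                // null key/value: skip it instead of letting callers NPE
            }
            return false;  // underlying reader exhausted
        }
    }
}
```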

copy_collections option not working in mlcp IMPORT


Steps to reproduce:

  1. Exported the data along with its collections using the mlcp EXPORT command:

mlcp EXPORT -host $ML_INPUT_HOST -username $ML_INPUT_USER -password $ML_INPUT_PASSWORD -port $ML_INPUT_PORT -database $ML_SOURCE_CONTENT_DB -output_file_path ./data -copy_collections true

  2. Imported the data into MarkLogic from the same ./data directory to which the documents were exported:

mlcp IMPORT -host $ML_OUTPUT_HOST -username $ML_OUTPUT_USER -password $ML_OUTPUT_PASSWORD -port $ML_OUTPUT_PORT -database $ML_DESTINY_CONTENT_DB -input_file_path ./data -copy_collections true -document_type xml -output_uri_replace "/C:/mlcp-8.0-4/mlcp-8.0-4/bin/data,''"

  3. On verifying, the documents are imported but the collections are not propagated.

Is there a known issue with -copy_collections true during mlcp IMPORT?
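A hedged reading of the mlcp documentation suggests the flag combination is the issue rather than -copy_collections itself: a plain document export does not record collections at all, and -copy_collections only takes effect when reading an archive (or during COPY). Exporting as an archive and importing it as one, along these lines, should carry the collections across:

mlcp EXPORT -host $ML_INPUT_HOST -username $ML_INPUT_USER -password $ML_INPUT_PASSWORD -port $ML_INPUT_PORT -database $ML_SOURCE_CONTENT_DB -output_file_path ./data -output_type archive

mlcp IMPORT -host $ML_OUTPUT_HOST -username $ML_OUTPUT_USER -password $ML_OUTPUT_PASSWORD -port $ML_OUTPUT_PORT -database $ML_DESTINY_CONTENT_DB -input_file_path ./data -input_file_type archive -copy_collections true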

MLCP Export - Document Selector Issue

While exporting, when I used -document_selector with date-range conditions (>= and <=), I was not able to export the data.

Sample XML

<root><loadTime>2017-05-03T02:08:00.796029-05:00</loadTime><name>John</name></root>
<root><loadTime>2017-05-03T07:08:00.796029-05:00</loadTime><name>Ram</name></root>
<root><loadTime>2017-05-05T07:08:00.796029-05:00</loadTime><name>Sam</name></root>

Assume the above XML documents are in MarkLogic. When I tried to export using -document_selector with the condition

/root[loadTime >= xs:dateTime("2017-05-03T02:08:00.796029-05:00") and loadTime <= xs:dateTime("2017-05-04T07:08:00.796029-05:00")]

the output contained zero records, even though matching data exists.

But when I used > and <, it worked:

/root[loadTime > xs:dateTime("2017-05-03T03:08:00.796029-05:00") and loadTime < xs:dateTime("2017-05-04T08:08:00.796029-05:00")]

So can you please suggest what the issue is when we use >= and <=?

MLCP export is not exporting all the documents in the database if we run export immediately after a document insert

We have a regression test, bug23994.xml, which fails inconsistently in regression. On investigation I found that if you perform the following actions in sequence, a document goes missing from the export.

 xquery version "1.0-ml";
xdmp:document-insert("/5gye4kd5gkwxluzk/attach10/Ementa Traduzida.doc.converted.thumb_05.png", text {"hello"}),
xdmp:document-insert("/2phej4a2uz3dye4z/attach11/Engenharia de soft.doc.converted.large_05.png", text {"world"});

and immediately run the following export command:

QA_HOME/mlcp/runmlcp.sh QA_HOME export -host localhost -port 5275 -username admin -password admin -mode local -output_file_path TMP_DIR/bug23994 -compress true

It's not consistent, but you will see output like the following:

drwxrwxr-x 5 skottam qa 4096 2016-09-21 09:20:07.705599933 -0700 marklogic-contentpump
directory exists
however, it was created today
data/xml/bigdata directory exists.
-rwxrwxrwx 1 skottam qa 752 Sep 21 09:27 /space/b8_0/qa/mlcp/marklogic-contentpump/conf/log4j.properties
export -host localhost -port 5275 -username admin -password admin -mode local -output_file_path /tmp/bug23994 -compress true

stderr/out from shell cmd: 16/09/21 09:27:21 DEBUG contentpump.ContentPump: Command: EXPORT
16/09/21 09:27:21 DEBUG contentpump.ContentPump: Arguments: -host localhost -port 5275 -username admin -password admin -mode local -output_file_path /tmp/bug23994 -compress true 
16/09/21 09:27:22 DEBUG contentpump.ContentPump: Running in: localmode
16/09/21 09:27:22 DEBUG contentpump.LocalJobRunner: Thread pool size: 4
16/09/21 09:27:22 DEBUG mapreduce.MarkLogicInputFormat: Split query: xquery version "1.0-ml";
import module namespace hadoop = "http://marklogic.com/xdmp/hadoop" at "/MarkLogic/hadoop.xqy";
xdmp:host-name(xdmp:host()),
hadoop:get-splits('', 'fn:collection()','()')
16/09/21 09:27:22 INFO mapreduce.MarkLogicInputFormat: Fetched 1 forest splits.
16/09/21 09:27:22 DEBUG mapreduce.MarkLogicInputFormat: Added split start: 0, length: 1, forestId: 10590950184291082726, hostName: localhost
16/09/21 09:27:22 INFO mapreduce.MarkLogicInputFormat: Made 1 splits.
16/09/21 09:27:22 DEBUG mapreduce.MarkLogicInputFormat: start: 0, length: 1, forestId: 10590950184291082726, hostName: localhost
16/09/21 09:27:22 DEBUG mapreduce.MarkLogicRecordReader: split location: localhost
16/09/21 09:27:22 DEBUG mapreduce.MarkLogicRecordReader: xquery version "1.0-ml"; 
declare namespace mlmr="http://marklogic.com/hadoop";
declare variable $mlmr:splitstart as xs:integer external;
declare variable $mlmr:splitend as xs:integer external;
declare option xdmp:output "indent=no";declare option xdmp:output "indent-untyped=no";xdmp:with-namespaces((),fn:unordered(fn:unordered(fn:collection())[$mlmr:splitstart to $mlmr:splitend]))
16/09/21 09:27:22 DEBUG mapreduce.MarkLogicRecordReader: Connect to forest 10590950184291082726 on localhost
16/09/21 09:27:22 DEBUG mapreduce.MarkLogicRecordReader: Input query: com.marklogic.xcc.impl.AdhocImpl@3e606ae5

16/09/21 09:27:22 WARN utilities.URIUtil: Error parsing URI /5gye4kd5gkwxluzk/attach10/Ementa Traduzida.doc.converted.thumb_05.png.

16/09/21 09:27:22 DEBUG contentpump.OutputArchive: Creating output archive: /tmp/bug23994/20160921092722-0700-000000-TEXT.zip
16/09/21 09:27:22 DEBUG contentpump.OutputArchive: Default charset: UTF-8
16/09/21 09:27:22 DEBUG contentpump.OutputArchive: closing output archive: /tmp/bug23994/20160921092722-0700-000000-TEXT.zip
16/09/21 09:27:22 INFO contentpump.LocalJobRunner:  completed 100%
16/09/21 09:27:22 INFO contentpump.LocalJobRunner: com.marklogic.mapreduce.MarkLogicCounter: 
16/09/21 09:27:22 INFO contentpump.LocalJobRunner: **INPUT_RECORDS: 1**
16/09/21 09:27:22 INFO contentpump.LocalJobRunner: **OUTPUT_RECORDS: 1**
16/09/21 09:27:22 INFO contentpump.LocalJobRunner: Total execution time: 0 sec
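Note the WARN line in the output: the URI that fails to parse contains a space, which likely accounts for the missing document. A small standalone illustration of that failure mode, assuming mlcp's URIUtil follows java.net.URI parsing rules (which reject an unencoded space):

```java
import java.net.URI;
import java.net.URISyntaxException;

public class UriSpaceDemo {
    public static void main(String[] args) {
        String raw = "/5gye4kd5gkwxluzk/attach10/Ementa Traduzida.doc.converted.thumb_05.png";
        try {
            new URI(raw);  // throws: an unencoded space is illegal in a URI
        } catch (URISyntaxException e) {
            System.out.println("parse failed: " + e.getMessage());
        }
        // Percent-encoding the space lets the same path parse cleanly.
        System.out.println(URI.create(raw.replace(" ", "%20")));
    }
}
```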

NPE thrown when specifying an incorrect codec value

  1. Run mlcp with -input_compression_codec and an invalid value, such as "gz" (which a user may reasonably do if they don't see that the docs spell out "gzip")
  2. You'll get an NPE error message instead of the intended one, which is "Unsupported codec"

The error comes from this line: https://github.com/marklogic/marklogic-contentpump/blob/8.0-master/mlcp/src/main/java/com/marklogic/contentpump/CompressedAggXMLReader.java#L130 — it should use codecString in the error message instead of codec, since codec is null at that point.
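A self-contained illustration of the bug pattern and the suggested fix; the enum and lookup method here are hypothetical stand-ins for the mlcp code at that line:

```java
// Hypothetical sketch, not the actual mlcp source.
enum Codec { ZIP, GZIP }

public class CodecLookupDemo {
    static Codec parse(String codecString) {
        for (Codec c : Codec.values()) {
            if (c.name().equalsIgnoreCase(codecString)) return c;
        }
        return null;  // unknown value, e.g. "gz"
    }

    public static void main(String[] args) {
        String codecString = "gz";
        Codec codec = parse(codecString);
        if (codec == null) {
            // Correct: report the raw input, which is never null here.
            // Using the null enum instead would read "Unsupported codec: null"
            // (or NPE, depending on how the message is built).
            throw new UnsupportedOperationException("Unsupported codec: " + codecString);
        }
    }
}
```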

Bad results with -split_input

I checked out mlcp and built via the instructions at https://github.com/marklogic/marklogic-contentpump

Then I unzipped the mlcp bin deliverable and ran

$ marklogic-contentpump/mlcp/deliverable/mlcp-9.0-EA3/bin/mlcp.sh version

which gave

chamlin@MacPro-3445:bug$ /Users/chamlin/tickets/18184-lds-mlcp/marklogic-contentpump/mlcp/deliverable/mlcp-9.0-EA3/bin/mlcp.sh version
ContentPump version: 9.0-EA3
Java version: 1.8.0_51
Hadoop version: 2.6.0
Supported MarkLogic versions: 6.0 - 9.0

Next I ran

$ marklogic-contentpump/mlcp/deliverable/mlcp-9.0-EA3/bin/mlcp.sh -options_file test-quotes.txt
17/02/02 21:29:11 INFO contentpump.LocalJobRunner: Content type: XML
17/02/02 21:29:12 INFO contentpump.ContentPump: Job name: local_755095747_1
17/02/02 21:29:12 INFO contentpump.FileAndDirectoryInputFormat: Total input paths to process : 1
17/02/02 21:29:13 INFO contentpump.LocalJobRunner:  completed 100%
17/02/02 21:29:13 INFO contentpump.LocalJobRunner: com.marklogic.mapreduce.MarkLogicCounter: 
17/02/02 21:29:13 INFO contentpump.LocalJobRunner: INPUT_RECORDS: 900
17/02/02 21:29:13 INFO contentpump.LocalJobRunner: OUTPUT_RECORDS: 900
17/02/02 21:29:13 INFO contentpump.LocalJobRunner: OUTPUT_RECORDS_COMMITTED: 900
17/02/02 21:29:13 INFO contentpump.LocalJobRunner: OUTPUT_RECORDS_FAILED: 0
17/02/02 21:29:13 INFO contentpump.LocalJobRunner: Total execution time: 0 sec

This reports 900 records and I see 900 in the database. All is well.

But, if I uncomment the lines

#-split_input 
#true
#--max_split_size
#1000

and rerun, I get

marklogic-contentpump/mlcp/deliverable/mlcp-9.0-EA3/bin/mlcp.sh -options_file test-quotes.txt
17/02/02 21:45:30 INFO contentpump.LocalJobRunner: Content type: XML
17/02/02 21:45:30 INFO contentpump.ContentPump: Job name: local_2038694039_1
17/02/02 21:45:30 INFO contentpump.FileAndDirectoryInputFormat: Total input paths to process : 1
17/02/02 21:45:30 INFO contentpump.DelimitedTextInputFormat: 72 DelimitedSplits generated
17/02/02 21:45:31 ERROR contentpump.LocalJobRunner: Error running task: 
java.lang.RuntimeException: java.io.IOException: (line 2) invalid char between encapsulated token and delimiter
	at org.apache.commons.csv.CSVParser$1.getNextRecord(CSVParser.java:442)
	at org.apache.commons.csv.CSVParser$1.hasNext(CSVParser.java:452)
	at com.marklogic.contentpump.SplitDelimitedTextReader.initParser(SplitDelimitedTextReader.java:205)
	at com.marklogic.contentpump.SplitDelimitedTextReader.initialize(SplitDelimitedTextReader.java:62)
	at com.marklogic.contentpump.LocalJobRunner$TrackingRecordReader.initialize(LocalJobRunner.java:439)
	at com.marklogic.contentpump.LocalJobRunner$LocalMapTask.call(LocalJobRunner.java:373)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: (line 2) invalid char between encapsulated token and delimiter
	at org.apache.commons.csv.Lexer.parseEncapsulatedToken(Lexer.java:275)
	at org.apache.commons.csv.Lexer.nextToken(Lexer.java:152)
	at org.apache.commons.csv.CSVParser.nextRecord(CSVParser.java:498)
	at org.apache.commons.csv.CSVParser$1.getNextRecord(CSVParser.java:439)
	... 9 more
17/02/02 21:45:31 INFO contentpump.LocalJobRunner:  completed 1%
17/02/02 21:45:31 INFO contentpump.LocalJobRunner: com.marklogic.mapreduce.MarkLogicCounter: 
17/02/02 21:45:31 INFO contentpump.LocalJobRunner: INPUT_RECORDS: 910
17/02/02 21:45:31 INFO contentpump.LocalJobRunner: OUTPUT_RECORDS: 910
17/02/02 21:45:31 INFO contentpump.LocalJobRunner: OUTPUT_RECORDS_COMMITTED: 910
17/02/02 21:45:31 INFO contentpump.LocalJobRunner: OUTPUT_RECORDS_FAILED: 0
17/02/02 21:45:31 INFO contentpump.LocalJobRunner: Total execution time: 1 sec

and MarkLogic shows 888 documents in the database after the run. So the counters report more records, while the database holds fewer.

I did a little debugging in SplitDelimitedTextReader.java to show the start/end of the splits. The file appears to be fine, but when the split boundary falls (after moving back one) on the second quote of a line, you get this parse error. It turns out that parserIterator.hasNext() actually reads a record, and that parse fails. The 888 is explained by the records thrown away when a split hits the error. I have no idea why the input-record counter is even higher when a split is tossed; certainly there weren't 910 records committed. No errors showed in the log file.
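To make the hasNext() behavior concrete, here is a small standalone reproduction using Apache Commons CSV (the library in the stack trace above); the input is fabricated so that line 2 has a stray character after a closing quote, mimicking a split landing just past the second quote:

```java
import java.io.StringReader;
import java.util.Iterator;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVRecord;

public class EagerHasNextDemo {
    public static void main(String[] args) throws Exception {
        // Line 2 has a stray 'x' between the closing quote and the delimiter.
        String csv = "\"a\",\"b\"\n\"c\"x,\"d\"\n";
        CSVParser parser = CSVFormat.DEFAULT.parse(new StringReader(csv));
        Iterator<CSVRecord> it = parser.iterator();
        System.out.println(it.next());  // first record parses fine
        try {
            it.hasNext();  // eagerly parses line 2 and throws here, not in next()
        } catch (RuntimeException e) {
            System.out.println("hasNext() threw: " + e.getMessage());
        }
    }
}
```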

Running the other file as

marklogic-contentpump/mlcp/deliverable/mlcp-9.0-EA3/bin/mlcp.sh -options_file test-jp.txt

gives

17/02/02 21:53:03 INFO contentpump.LocalJobRunner: Content type: XML
17/02/02 21:53:03 INFO contentpump.ContentPump: Job name: local_367057373_1
17/02/02 21:53:03 INFO contentpump.FileAndDirectoryInputFormat: Total input paths to process : 1
17/02/02 21:53:03 INFO contentpump.DelimitedTextInputFormat: 138 DelimitedSplits generated
17/02/02 21:53:04 INFO contentpump.LocalJobRunner:  completed 0%
17/02/02 21:53:04 INFO contentpump.LocalJobRunner: com.marklogic.mapreduce.MarkLogicCounter: 
17/02/02 21:53:04 INFO contentpump.LocalJobRunner: INPUT_RECORDS: 899
17/02/02 21:53:04 INFO contentpump.LocalJobRunner: OUTPUT_RECORDS: 899
17/02/02 21:53:04 INFO contentpump.LocalJobRunner: OUTPUT_RECORDS_COMMITTED: 899
17/02/02 21:53:04 INFO contentpump.LocalJobRunner: OUTPUT_RECORDS_FAILED: 0
17/02/02 21:53:04 INFO contentpump.LocalJobRunner: Total execution time: 1 sec

From debugging, it looks like the problem comes up when the split is at the beginning of a record. Changing the split max size up or down changes the number of records ingested.

The reason I was testing with Japanese text is that the file seek in SplitDelimitedTextReader looks byte-oriented, and I wondered whether it could land in the middle of a multi-byte UTF-8 encoding of a character, or whether there might also be a problem with \r\n line endings on Windows systems.
bug.zip
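On the multi-byte question: a byte-oriented seek can indeed land inside a UTF-8 sequence. A quick standalone illustration, independent of the mlcp code, of what decoding from such an offset produces:

```java
import java.nio.charset.StandardCharsets;

public class Utf8SplitDemo {
    public static void main(String[] args) {
        byte[] utf8 = "日本語,テスト\n".getBytes(StandardCharsets.UTF_8);
        // Decode starting one byte into the first character, as a byte-oriented
        // split might: the torn sequence decodes to U+FFFD replacement
        // characters instead of a clean record boundary.
        String fromOffset = new String(utf8, 1, utf8.length - 1, StandardCharsets.UTF_8);
        System.out.println(fromOffset);
    }
}
```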

Document how to use mlcp-mapr bundle

We should let users know that the mlcp-mapr bundle has a dependency on the maprfs Java library, which we don't include in our deliverable, and give instructions on how to resolve the dependencies.

From the MapR documentation, their Maven repository is:
http://repository.mapr.com/maven/

The artifact mlcp depends on is:

<dependency>
    <groupId>com.mapr</groupId>
    <artifactId>mapr-release</artifactId>
    <version>5.1.0-mapr</version>
</dependency>

REGR: mlcp direct access tests are missing a document while extracting from old data

The following regression tests have been failing since Sept 27 (no mlcp run happened on the 26th; Sept 25 was the last passing day):

mlcp-da-test-json.xml
mlcp-da-test2.xml
mlcp-da-test3.xml
mlcp-da-test4.xml
mlcp-da-test5.xml

On investigating the test mlcp-da-test-json.xml, I found that the following command is not extracting all the documents (the document json_xml_basic_test28.xml is missing):

/space/b8_0/qa/mlcp/runmlcp.sh /space/b8_0/qa EXTRACT -input_file_path /space/b8_0/qa/mlcp/data/extract/forests/mlcp-ext-src-f11,/space/b8_0/qa/mlcp/data/extract/forests/mlcp-ext-src-f12 -output_file_path /space/b8_0/qa/mlcp/data/output/mlcp-ext-src-jsonmix -mode local

which produced:
stderr/out from shell cmd: 2016-09-28
drwxrwxr-x 5 skottam qa 4096 2016-09-28 11:32:38.939173861 -0700 marklogic-contentpump
directory exists
however, it was created today
data/xml/bigdata directory exists.
-rwxrwxrwx 1 skottam qa 754 Sep 28 15:26 /space/b8_0/qa/mlcp/marklogic-contentpump/conf/log4j.properties
EXTRACT -input_file_path /space/b8_0/qa/mlcp/data/extract/forests/mlcp-ext-src-f11,/space/b8_0/qa/mlcp/data/extract/forests/mlcp-ext-src-f12 -output_file_path /space/b8_0/qa/mlcp/data/output/mlcp-ext-src-jsonmix -mode local

stderr/out from shell cmd: 16/09/28 15:26:22 INFO input.FileInputFormat: Total input paths to process : 10
16/09/28 15:26:22 INFO contentpump.LocalJobRunner:  completed 100%
16/09/28 15:26:22 INFO contentpump.LocalJobRunner: com.marklogic.mapreduce.MarkLogicCounter: 
16/09/28 15:26:22 INFO contentpump.LocalJobRunner: INPUT_RECORDS: 32
16/09/28 15:26:22 INFO contentpump.LocalJobRunner: OUTPUT_RECORDS: 32
16/09/28 15:26:22 INFO contentpump.LocalJobRunner: Total execution time: 0 sec


The same command in the Sunday regression run:

running following shell command: 
     /space/builder/builds/linux64/b8_0/qa/mlcp/runmlcp.sh /space/builder/builds/linux64/b8_0/qa EXTRACT -input_file_path /space/builder/builds/linux64/b8_0/qa/mlcp/data/extract/forests/mlcp-ext-src-f11,/space/builder/builds/linux64/b8_0/qa/mlcp/data/extract/forests/mlcp-ext-src-f12 -output_file_path /space/builder/builds/linux64/b8_0/qa/mlcp/data/output/mlcp-ext-src-jsonmix -mode local 

stderr/out from shell cmd: 16/09/25 11:07:48 INFO input.FileInputFormat: Total input paths to process : 10
16/09/25 11:07:48 INFO contentpump.LocalJobRunner:  completed 100%
16/09/25 11:07:48 INFO contentpump.LocalJobRunner: com.marklogic.mapreduce.MarkLogicCounter: 
16/09/25 11:07:48 INFO contentpump.LocalJobRunner: INPUT_RECORDS: 33
16/09/25 11:07:48 INFO contentpump.LocalJobRunner: OUTPUT_RECORDS: 33
16/09/25 11:07:48 INFO contentpump.LocalJobRunner: Total execution time: 0 sec

Fix backward compatibility of mlcp

As part of ongoing mlcp work, some xdmp builtins have been used to replace the corresponding admin functions. mlcp should remain backward compatible when run against previous MarkLogic server versions.
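A minimal sketch of the usual shape of such a fix; the query strings and the version lookup here are placeholders, not actual mlcp code:

```java
// Hypothetical sketch of a version gate: detect the server version once and
// only send the newer builtins to servers that support them.
public class ServerVersionGate {
    static final String NEW_BUILTIN_QUERY = "(: query using newer xdmp builtins :)";
    static final String LEGACY_ADMIN_QUERY = "(: query using admin library functions :)";

    // Pick the query text appropriate to the connected server.
    static String pickQuery(int serverMajorVersion) {
        return serverMajorVersion >= 9 ? NEW_BUILTIN_QUERY : LEGACY_ADMIN_QUERY;
    }

    public static void main(String[] args) {
        System.out.println(pickQuery(8));  // falls back to the admin-API path
    }
}
```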

Separate unit and integration tests

The current set of tests functions mostly as integration tests: they require a live instance of MarkLogic and take roughly 15 minutes to run.

  • Unit tests should run quickly and not rely on external systems.
  • Mock objects should be used in place of a live database for unit tests, or such tests should be moved into an integration test suite.

It would be helpful, and would make contributing easier, if the tests and build were organized to allow running unit tests and integration tests separately. This would also simplify integration with CI services such as Travis CI and/or Circle CI, without having to exclude the test phase.
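One common way to get there, sketched under the assumption that the tests use JUnit 4: tag database-dependent tests with a marker category so the build can run unit and integration tests separately. The class and test names below are illustrative:

```java
import org.junit.Test;
import org.junit.experimental.categories.Category;

// Marker interface for tests that need a live MarkLogic instance.
interface RequiresMarkLogic {}

public class ExampleTest {
    @Test
    public void parsesCommandLineOptions() {
        // Pure unit test: fast, no external systems.
    }

    @Category(RequiresMarkLogic.class)
    @Test
    public void importsDocumentsIntoLiveDatabase() {
        // Integration test: excluded from the default (unit) test run.
    }
}
```

Maven Surefire/Failsafe can then include or exclude the RequiresMarkLogic category per build phase, keeping mvn test fast while integration tests run on their own.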
