
batch-import's Introduction

Hi, my name is Michael 👋

I've been passionate about software development for more than 25 years.

For the last 13 years, I've been working on the open source Neo4j graph database filling many roles, most recently leading the Developer Relations team, working on DX initiatives and organizing the Neo4j Labs efforts.

As caretaker of the Neo4j community and ecosystem, I love working with graph-related projects, users, and contributors.

In general I enjoy many aspects of programming languages, learning new things every day, participating in exciting and ambitious open source projects, and contributing to and writing software-related books and articles. I have spoken at numerous conferences and helped organize several of them.

My efforts in the Java community got me accepted into the Java Champions program.

I'm running a weekly girls-only coding club at our local school.


Past Projects

I've initiated these projects: neo4j-graph-algorithms (graph-data-science), neo4j-graphql, neo4j-spark-connector, neo4j-streams (kafka), neo4j-apoc-procedures, csv-bulk-importer, spring-data-neo4j, cypher-dsl, neo4j-jdbc.

batch-import's People

Contributors

dav009, jexp, jimwebber, maxdemarzi, rsaporta, sirocchj, yak1ex


batch-import's Issues

Importing less than half of nodes in CSV file

Hi

I'm using the 2.0 branch since I need support for labels.

I have a CSV file with about 23m nodes:

wc -l all_nodes_22nov.csv
23048550 all_nodes_22nov.csv

When I import them using batch-import it only seems to import half of them:

~/apache-maven-2.2.1/bin/mvn exec:java -Dexec.mainClass="org.neo4j.batchimport.Importer" -Dexec.args="neo4j/data/graph.db all_nodes_22nov.csv all_rels_22nov.csv"

Some stuff here...

Usage: Importer data/dir nodes.csv relationships.csv [node_index node-index-name fulltext|exact nodes_index.csv rel_index rel-index-name fulltext|exact rels_index.csv ....]
Using: Importer neo4j/data/graph.db all_nodes_22nov.csv all_rels_22nov.csv

Using Existing Configuration File
................................................................................................ 2885198 ms for 10000000
..............
Importing 11448052 Nodes took 3302 seconds
Total import time: 3305 seconds 

As you can see, it's only importing about 11m nodes.

I exported this data from a MySQL table that has an ID column as the primary key. I mapped the ID column to i:id in my CSV file, so (based on my understanding) each record should get a unique ID, and I don't think the problem is caused by IDs being overwritten.

Any ideas on what the problem might be and how I can fix it?

Thanks
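For reference, a sketch of what such a nodes file header looks like, modeled on the i:id header shown in the "Node id feature" issue further down (the column names here are illustrative only, not taken from the actual export):

    i:id    title
    1       first node
    2       second node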

Indexing problem

I am totally new to neo4j and batch-importer; therefore, it's most probably some stupid mistake.

I have these files:
Nodes: a
Edges: b

mac505213:socnet linas$ cat a
Node Rels Propery
1 1 USER
90 1 USER

mac505213:socnet linas$ cat b
Start Ende Type Property
1 90 FRIEND Property
90 1 FRIEND Property

And I have this error:

java -server -Xmx4G -jar target/batch-import-jar-with-dependencies.jar target/db a b
Physical mem: 4096MB, Heap size: 3640MB
use_memory_mapped_buffers=false
neostore.propertystore.db.index.keys.mapped_memory=5M
neostore.propertystore.db.strings.mapped_memory=100M
neostore.propertystore.db.arrays.mapped_memory=215M
neo_store=/Users/linas/devel/batch-import/target/db/neostore
neostore.relationshipstore.db.mapped_memory=1000M
neostore.propertystore.db.index.mapped_memory=5M
neostore.propertystore.db.mapped_memory=1000M
dump_configuration=true
cache_type=none
neostore.nodestore.db.mapped_memory=200M

Importing 2 Nodes took 0 seconds
0 seconds
Exception in thread "main" org.neo4j.graphdb.NotFoundException: id=90
at org.neo4j.kernel.impl.batchinsert.BatchInserterImpl.getNodeRecord(BatchInserterImpl.java:846)
at org.neo4j.kernel.impl.batchinsert.BatchInserterImpl.createRelationship(BatchInserterImpl.java:439)
at org.neo4j.batchimport.Importer.importRelationships(Importer.java:146)
at org.neo4j.batchimport.Importer.main(Importer.java:52)

Interestingly, if I change 90 to 10 in both files, everything works fine.
Most probably, I don't understand the input format.

Please help me.
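A sketch of the rels file rewritten against row positions instead of the values in the first column, on the assumption (based on the CSV format example in the next issue, where start/end refer to the row number in the node file) that the importer assigns node IDs by row order rather than reading them from the file:

    Start Ende Type Property
    1 2 FRIEND Property
    2 1 FRIEND Property

Here node 1 is the first data row of file a and node 2 is the second, which would explain why the literal value 90 is not found.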

example of format of csv files

Documenting the format of each CSV file would be very helpful. Example follows:

$ cat nodes.csv
name age works_on
Michael 37 neo4j
Selina 14
Rana 6
Selma 4

$ cat nodes_index.csv
0 name age works_on
1 Michael 37 neo4j
2 Selina 14
3 Rana 6
4 Selma 4

$ cat rels.csv
start end type since counter:int
1 2 FATHER_OF 1998-07-10 1
1 3 FATHER_OF 2007-09-15 2
1 4 FATHER_OF 2008-05-03 3
3 4 SISTER_OF 2008-05-03 5
2 3 SISTER_OF 2007-09-15 7

$ cat rels_index.csv
-1 since counter:int
0 1998-07-10 1
1 2007-09-15 2
2 2008-05-03 3
3 2008-05-03 5
4 2007-09-15 7

Easier way for multi node types?

Leaving this here because I'm wondering if I'm missing something obvious, plus it might help somebody in the same boat.

I used the importer for pulling in MovieLens data. It's way faster than what I was doing before, pumping it in over the wire!

I wanted to have differing properties for users and movies, so the header line needs to be different. So I parsed the user and movie data into separate node files. On the first pass I imported the user nodes with a blank rels file, and then I imported the movie nodes with the real rels data (I'd calculated the node IDs with a counter shared across users and movies).

This worked OK but left me wondering: is there a better way than this two-pass approach with the current implementation?
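One alternative sketch, borrowed from the index-lookup headers used in the automatic-indexing issues further down (whether lookups against an index built in an earlier pass work in this version is not verified here): index a unique key per node and reference that key in the rels file, so the shared row-number counter is no longer needed. For example:

    users.csv
    userId:int:users    age

    movies.csv
    movieId:int:movies    title

    rels.csv
    start:int:users    end:int:movies    type    rating:int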

Start/End Rel sort and creating relationship properties

In the rels CSV, what is the significance of the start/end fields? I'm having a hard time understanding what they mean and how to use them for my current data.

Here is an example of my two csvs:

user.tsv:

id  username    email     full_name 
1   wolff   [email protected]  Wolff 
2   test    [email protected] Test

user_rels.tsv:

id  type          lang        date_joined           
1   staff       en-us      2010-09-22 11:58:10  
2           en-us      2010-09-29 12:00:09  

I think my core misunderstanding is how the relationship is created from the row in user.tsv to the properties in user_rels.tsv.

When I run a batch-import I get the following error:

java -server -Dfile.encoding=UTF-8 -Xmx4G -jar db/batch-import-jar-with-dependencies.jar db/neo4j-development db/csv/Users.tsv db/csv/Users_rels.tsv
Using Existing Configuration File
Importing 116416 Nodes took 1 seconds
Total import time: 1 seconds
Exception in thread "main" java.lang.NumberFormatException: For input string: "staff"
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
    at java.lang.Long.parseLong(Long.java:410)
    at java.lang.Long.parseLong(Long.java:468)
    at org.neo4j.batchimport.Importer.id(Importer.java:144)
    at org.neo4j.batchimport.Importer.importRelationships(Importer.java:108)
    at org.neo4j.batchimport.Importer.main(Importer.java:63)

Obviously I'm not setting up my rels file correctly. Could someone please point out my error?

Thanks.
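For comparison, in the "example of format of csv files" issue above the first three columns of the rels file are the start node, the end node and the relationship type, with any further columns becoming relationship properties; that is why "staff" in the second column is being parsed as an end-node ID here. Per-user attributes such as lang and date_joined would normally go into the node file instead. A sketch (the KNOWS type and the merged columns are illustrative only):

    user.tsv:
    id  username  email  full_name  lang   date_joined
    1   wolff     ...    Wolff      en-us  2010-09-22 11:58:10
    2   test      ...    Test       en-us  2010-09-29 12:00:09

    user_rels.tsv:
    start  end  type
    1      2    KNOWS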

Issue With Automatic Indexing

Hi,

Thanks for developing this code; I am looking forward to using it as I learn more about Neo4j and graphs.

I have an issue with automatic indexing I am hoping you can assist with.

For the fundamental example using nodes.csv and rels.csv in the GitHub documentation it all works really well; no problem.

However when I try to do the automatic indexing I get an error.

Here are the steps I took:

  1. Edited batch.properties and added the line "batch_import.node_index.users=exact" (no quotes).
  2. Used the following nodes.csv.

name:string:users age works_on
Michael 37 neo4j
Selina 14
Rana 6
Selma 4

  3. Used the following rels.csv

start:string:users end:string:users type since counter:int
Michael Selina FATHER_OF 1998-07-10 1
Michael Rana FATHER_OF 2007-09-15 2
Michael Selma FATHER_OF 2008-05-03 3
Rana Selma SISTER_OF 2008-05-03 5
Selina Rana SISTER_OF 2007-09-15 7

  4. Ran the following command.

mvn clean compile exec:java -Dexec.mainClass="org.neo4j.batchimport.Importer" -Dexec.args="neo4j/data/graph.db nodes.csv rels.csv"

  5. Here is the output:

[INFO] Scanning for projects...
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building Neo4j Batch Importer 1.9-SNAPSHOT
[INFO] ------------------------------------------------------------------------
[INFO]
[INFO] --- maven-clean-plugin:2.4.1:clean (default-clean) @ batch-import ---
[INFO] Deleting /home/user/batch-import/target
[INFO]
[INFO] --- maven-resources-plugin:2.5:resources (default-resources) @ batch-import ---
[debug] execute contextualize
[WARNING] Using platform encoding (UTF-8 actually) to copy filtered resources, i.e. build is platform dependent!
[INFO] Copying 1 resource
[INFO]
[INFO] --- maven-compiler-plugin:2.1:compile (default-compile) @ batch-import ---
[WARNING] File encoding has not been set, using platform encoding UTF-8, i.e. build is platform dependent!
[INFO] Compiling 51 source files to /home/user/batch-import/target/classes
[INFO]
[INFO] >>> exec-maven-plugin:1.2.1:java (default-cli) @ batch-import >>>
[INFO]
[INFO] <<< exec-maven-plugin:1.2.1:java (default-cli) @ batch-import <<<
[INFO]
[INFO] --- exec-maven-plugin:1.2.1:java (default-cli) @ batch-import ---
Usage java -jar batchimport.jar data/dir nodes.csv relationships.csv [node_index node-index-name fulltext|exact nodes_index.csv rel_index rel-index-name fulltext|exact rels_index.csv ....]
Using Existing Configuration File

Importing 5 Nodes took 0 seconds

Total import time: 0 seconds
[WARNING]
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.codehaus.mojo.exec.ExecJavaMojo$1.run(ExecJavaMojo.java:297)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.NullPointerException
at org.neo4j.batchimport.Importer.lookup(Importer.java:100)
at org.neo4j.batchimport.Importer.id(Importer.java:139)
at org.neo4j.batchimport.Importer.importRelationships(Importer.java:115)
at org.neo4j.batchimport.Importer.doImport(Importer.java:188)
at org.neo4j.batchimport.Importer.main(Importer.java:74)
... 6 more
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 11.420s
[INFO] Finished at: Fri Jul 05 08:42:27 PHT 2013
[INFO] Final Memory: 14M/51M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.codehaus.mojo:exec-maven-plugin:1.2.1:java (default-cli) on project batch-import: An exception occured while executing the Java class. null: InvocationTargetException: NullPointerException -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException

  6. Other information
    [Maven]
    Apache Maven 3.0.5 (r01de14724cdef164cd33c7c8c2fe155faf9602da; 2013-02-19 21:51:28+0800)
    Maven home: /opt/apache-maven
    Java version: 1.6.0_45, vendor: Sun Microsystems Inc.
    Java home: /usr/java/jdk1.6.0_45/jre
    Default locale: en_US, platform encoding: UTF-8
    OS name: "linux", version: "2.6.32-358.11.1.el6.x86_64", arch: "amd64", family: "unix"
    [Java]
    java version "1.6.0_45"
    Java(TM) SE Runtime Environment (build 1.6.0_45-b06)
    Java HotSpot(TM) 64-Bit Server VM (build 20.45-b01, mixed mode)

I should point out this works fine for the fundamental example but not when I use automatic indexing.

I'd appreciate any guidance on where I have gone wrong here. Let me know if you need more information to diagnose.

Label support

Any idea if/when Neo4j 2.0 labels will be supported by the batch importer?

long string problem of node index

I get the following error when running batch-import with long strings (e.g. 100,000 characters) in a node index file. I got no error with the old batch-import-jar-with-dependencies.jar file.

E:\work\batch-import>mvn clean compile exec:java -Dexec.mainClass="org.neo4j.batchimport.Importer" -Dexec.args="neo4j/data/graph.db nodes.csv rels.csv node_index users fulltext nodes_index.csv"
[INFO] Scanning for projects...
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building Neo4j Batch Importer 1.9-SNAPSHOT
[INFO] ------------------------------------------------------------------------
[INFO]
[INFO] --- maven-clean-plugin:2.5:clean (default-clean) @ batch-import ---
[INFO] Deleting E:\work\batch-import\target
[INFO]
[INFO] --- maven-resources-plugin:2.6:resources (default-resources) @ batch-import ---
[WARNING] Using platform encoding (MS949 actually) to copy filtered resources, i.e. build is platform dependent!
[INFO] Copying 1 resource
[INFO]
[INFO] --- maven-compiler-plugin:2.1:compile (default-compile) @ batch-import ---
[WARNING] File encoding has not been set, using platform encoding MS949, i.e. build is platform dependent!
[INFO] Compiling 51 source files to E:\work\batch-import\target\classes
[INFO]
[INFO] >>> exec-maven-plugin:1.2.1:java (default-cli) @ batch-import >>>
[INFO]
[INFO] <<< exec-maven-plugin:1.2.1:java (default-cli) @ batch-import <<<
[INFO]
[INFO] --- exec-maven-plugin:1.2.1:java (default-cli) @ batch-import ---
Usage java -jar batchimport.jar data/dir nodes.csv relationships.csv [node_index node-index-name fulltext|exact nodes_index.csv rel_index rel-index-name fulltext|exact rels_index.csv ....]
Using Existing Configuration File

Importing 97044 Nodes took 0 seconds
................
Importing 1628244 Relationships took 5 seconds

Total import time: 10 seconds
[WARNING]
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at org.codehaus.mojo.exec.ExecJavaMojo$1.run(ExecJavaMojo.java:297)
at java.lang.Thread.run(Thread.java:722)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 32768
at org.neo4j.batchimport.utils.Chunker.nextWord(Chunker.java:53)
at org.neo4j.batchimport.importer.ChunkerLineData.nextWord(ChunkerLineData.java:37)
at org.neo4j.batchimport.importer.ChunkerLineData.readLine(ChunkerLineData.java:47)
at org.neo4j.batchimport.importer.AbstractLineData.parse(AbstractLineData.java:118)
at org.neo4j.batchimport.importer.AbstractLineData.processLine(AbstractLineData.java:67)
at org.neo4j.batchimport.Importer.importIndex(Importer.java:145)
at org.neo4j.batchimport.Importer.importIndex(Importer.java:178)
at org.neo4j.batchimport.Importer.doImport(Importer.java:192)
at org.neo4j.batchimport.Importer.main(Importer.java:74)
... 6 more
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 17.570s
[INFO] Finished at: Fri Jun 28 10:41:53 KST 2013
[INFO] Final Memory: 19M/361M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.codehaus.mojo:exec-maven-plugin:1.2.1:java (default-cli) on project batch-import: An exception occured while executing the Java class. null: InvocationTargetException: 32768 -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
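The ArrayIndexOutOfBoundsException: 32768 in Chunker.nextWord suggests a fixed-size read buffer being overrun by a single very long field. A minimal sketch of that failure mode (an illustration under that assumption, not the actual Chunker source):

    import java.io.Reader;
    import java.io.StringReader;

    public class LongFieldSketch {
        public static void main(String[] args) throws Exception {
            // simulate one 100,000-character field followed by a tab delimiter
            Reader reader = new StringReader(new String(new char[100000]).replace('\0', 'x') + "\t");
            char[] buffer = new char[32 * 1024];   // fixed 32 KB word buffer
            int len = 0;
            int c;
            while ((c = reader.read()) != -1 && c != '\t' && c != '\n') {
                buffer[len++] = (char) c;          // throws ArrayIndexOutOfBoundsException: 32768 once the field outgrows the buffer
            }
            System.out.println("word length: " + len);
        }
    }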

Duplication of nodes

I'm importing around 350k nodes and have a nodes.csv file with 354k lines. The importer runs just fine, and nodes are created. However, it imports (and reports importing) over 700k nodes. Upon investigation, every node I looked at had an exact duplicate, differing only by 1 in the ID. All the relationships are also completely wrong, but that is clearly because they refer to the node IDs as they would have been without the duplication.
I've tried with and without automatic node indexing.

Issue when importing UTF-8 csv file

When a UTF-8 CSV file is imported and a property type is specified, the importer throws a datatype-not-found error.
E.g., for the header "name:string:users" the importer throws:

Exception in thread "main" java.lang.IllegalArgumentException: Unknown Type string
at org.neo4j.batchimport.importer.Type.fromString(Type.java:164)
at org.neo4j.batchimport.importer.AbstractLineData.createHeaders(AbstractLineData.java:44)
at org.neo4j.batchimport.importer.CsvLineData.<init>(CsvLineData.java:16)
at org.neo4j.batchimport.Importer.createLineData(Importer.java:144)
at org.neo4j.batchimport.Importer.importNodes(Importer.java:84)
at org.neo4j.batchimport.Importer.doImport(Importer.java:199)
at org.neo4j.batchimport.Importer.main(Importer.java:74)
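The root cause here isn't established, but one thing worth ruling out with UTF-8 files is an invisible character (for example a byte-order mark or a non-ASCII colon) hiding in the header line. A small diagnostic sketch that prints every code point of the header so such characters become visible (illustration only, not part of the importer):

    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;

    public class HeaderInspector {
        public static void main(String[] args) throws IOException {
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(new FileInputStream(args[0]), StandardCharsets.UTF_8))) {
                String header = in.readLine();
                // a UTF-8 BOM shows up as U+FEFF, a fullwidth colon as U+FF1A, etc.
                header.codePoints().forEach(cp ->
                        System.out.printf("U+%04X %s%n", cp, Character.getName(cp)));
            }
        }
    }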

NegativeArraySizeException when exporting relations

Hi, I'm having trouble importing my rels.csv file, even with the example given in the readme.md.

The code was compiled with
mvn clean compile assembly:single

and executed with
java -server -Xmx4G -jar ../batch-import/target/batch-import-jar-with-dependencies.jar neo4j/data/data.db nodes.csv rels.csv

The nodes and rels files are those from the readme.md examples.

I get this error

Using Existing Configuration File

Importing 5 Nodes took 0 seconds 

Total import time: 0 seconds 
Exception in thread "main" java.lang.NegativeArraySizeException
at org.neo4j.batchimport.importer.RowData.createMapData(RowData.java:37)
at org.neo4j.batchimport.importer.RowData.<init>(RowData.java:27)
at org.neo4j.batchimport.Importer.importRelationships(Importer.java:101)
at org.neo4j.batchimport.Importer.main(Importer.java:63)

"no suitable method found for loggerDirectory" error during 'mvn clean compile'

Hi,
A fresh clone on Ubuntu 12.04 with Maven 3 and OpenJDK 1.6.0_24 produces the error below at the compile stage:

[ERROR] .../batch-import/src/main/java/org/neo4j/unsafe/batchinsert/BatchInserterImpl.java:[69,29] error: no suitable method found for loggerDirectory(File)

By the way, the project builds without any problems with the same configuration for an older version of batch-import from August 2012.

Support indexing of array properties?

I can explicitly index arbitrary array properties in neography, and used to be able to use @maxdemarzi's batch-import fork circa February 2013 to load and index string array properties.

It seems now that Max blew away his old, slightly hacky code for loading and indexing string array properties. At least, with master branch here I can load arbitrary array properties easily but I haven't yet figured out how to index them. (I've tried the auto-index option as well as explicit indexing; with the explicit index I tried both a comma-separated column and also with a new row for each element).

Is indexing array properties supported in any way by the current master tree?

Edge property type is string instead of type defined in header

Might there be a bug in reading the types of edge properties? I have an edge TSV file that contains lines like:

bash-4.1$ head -n 5 edges.tsv 
start       end type    correlation:double  importance:double   pvalue:double   distance:int
37626   49171   ASSOCIATION -0.344578   0   1.40818904813735        
14256   3992    ASSOCIATION -0.3184399  0.001083176 2.36807765466284        
39138   36494   ASSOCIATION -0.01435775 0   2.44707505427999        
36421   17548   ASSOCIATION 0.3961878   0.0001725019    1.56627511748036        

The distance property is not defined for every edge. The importer seems to create a database in which correlation, importance and pvalue are all strings instead of doubles. The distance values, oddly enough, are cast correctly to int. Typecasting works without problems for nodes.

The edge file is created in a Python script, where the code creating one line is

edge_tsv_file.write( start + "\t" + stop + "\t" + edgeName + "\t"  + "\t".join( [ lineDict[i] for i in edgeAttributes ] ) + "\t" + "" + "\t" + "\n" )

(notice "" for indicating the absence of distance value)

Id capacity exceeded

For an import of 3,993,323 nodes and 24,440,732 edges I have this exception:

Exception in thread "main" org.neo4j.kernel.impl.nioneo.store.UnderlyingStorageException: Id capacity exceeded

Any ideas?

Custom Separator for Arrays

Arrays are by default separated by ','. I'm dealing with many values containing ',', so it would be nice if there were a config option allowing the separator to be changed.

last column has nulls error

Hi,
I'm basically having the same issue as #7.

Taking the example csv files from the wiki page, I get the following:

$ rm -rf target/db/*
(./neo4j/neo4j-community-1.8/batch-import)$ java -server -Xmx4G -jar target/batch-import-jar-with-dependencies.jar target/db nodes.csv rels.csv
Using Existing Configuration File

Total import time: 1 seconds
Exception in thread "main" java.util.NoSuchElementException
at java.util.StringTokenizer.nextToken(StringTokenizer.java:332)
at org.neo4j.batchimport.importer.RowData.parse(RowData.java:66)
at org.neo4j.batchimport.importer.RowData.split(RowData.java:81)
at org.neo4j.batchimport.importer.RowData.updateMap(RowData.java:92)
at org.neo4j.batchimport.Importer.importNodes(Importer.java:93)
at org.neo4j.batchimport.Importer.main(Importer.java:57)

I tried copy+paste; I even rewrote the CSV by hand on the host server (with tab delimiters), and it fails on the node import of the nodes.csv file. For rows with missing last-column data, if I add a tab it still fails, but if I add a tab and some data in the last column then it works.

On PLATFORM=linux-rhel5-x86_64
Cloned and built without error. Noticed it pulled the 1.9 SNAPSHOT and I'm running 1.8 server.
java version 1.6.0_37

Seems like it should be a file format issue, but I rewrote them several times with the same error.

Reproducible from your side?

Thx,
Larry
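The stack trace points at StringTokenizer.nextToken, which matches the symptom: StringTokenizer merges consecutive delimiters and never returns a trailing empty token, so a row whose last column is empty yields one token too few. A minimal sketch of the difference (an illustration of the Java behaviour, not the importer's own code):

    import java.util.StringTokenizer;

    public class EmptyLastColumn {
        public static void main(String[] args) {
            String row = "Selina\t14\t";                 // age present, works_on empty
            StringTokenizer st = new StringTokenizer(row, "\t");
            System.out.println(st.countTokens());        // prints 2 -- the empty trailing field is dropped
            String[] cols = row.split("\t", -1);
            System.out.println(cols.length);             // prints 3 -- split with limit -1 keeps trailing empties
        }
    }

That would also explain why adding a tab alone doesn't help, but adding a tab plus some data in the last column does.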

Importing failure with rels.tsv

Ran
mvn clean compile exec:java -Dexec.mainClass="org.neo4j.batchimport.Importer" -Dexec.args="neo4j/data/graph.db nodes.tsv rels.tsv" and received the following errors

java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at org.codehaus.mojo.exec.ExecJavaMojo$1.run(ExecJavaMojo.java:297)
at java.lang.Thread.run(Thread.java:722)
Caused by: java.lang.NullPointerException
at org.neo4j.batchimport.Importer.lookup(Importer.java:100)
at org.neo4j.batchimport.Importer.id(Importer.java:139)
at org.neo4j.batchimport.Importer.importRelationships(Importer.java:115)
at org.neo4j.batchimport.Importer.doImport(Importer.java:188)
at org.neo4j.batchimport.Importer.main(Importer.java:74)

Importing nodes was successful so I'm guessing the problem lies with the rels.tsv?

Here's the link to a smaller sample of my rels file.
https://www.dropbox.com/s/77jhbmyskks9jxi/rels_sample.tsv

check for field names at beginning of import

If field names are missing from the relationships or nodes files, an error is thrown at the beginning of the index creation process. This can be after hours of importing, so it'd be helpful if the check for field names happened before the import, not after.

org.neo4j.kernel.impl.nioneo.store.UnderlyingStorageException during import on Windows

I'm trying to import a 3 GB node file and a 1.5 GB edge file on Windows. Using neo4j version community-1.9.M01. After crunching away on the nodes file for about 20 sec, I get the exception shown in the title. Here is the full stack trace. Note, the batch-importer works fine for me on much smaller files.

C:\batch-import-master>java -server -Xmx12G -jar target/batch-import-jar-with-dependencies.jar target/db nodes2.csv edges1.csv
Using Existing Configuration File
.........................Exception in thread "main" org.neo4j.kernel.impl.nioneo.store.UnderlyingStorageException: Unable to close store C:\batch-import-master\target\db\neostore.propertystore.db.strings
at org.neo4j.kernel.impl.nioneo.store.CommonAbstractStore.close(CommonAbstractStore.java:636)
at org.neo4j.kernel.impl.nioneo.store.PropertyStore.closeStorage(PropertyStore.java:118)
at org.neo4j.kernel.impl.nioneo.store.CommonAbstractStore.close(CommonAbstractStore.java:571)
at org.neo4j.kernel.impl.nioneo.store.NeoStore.closeStorage(NeoStore.java:231)
at org.neo4j.kernel.impl.nioneo.store.CommonAbstractStore.close(CommonAbstractStore.java:571)
at org.neo4j.unsafe.batchinsert.BatchInserterImpl.shutdown(BatchInserterImpl.java:703)
at org.neo4j.batchimport.Importer.finish(Importer.java:83)
at org.neo4j.batchimport.Importer.main(Importer.java:77)
Caused by: java.io.IOException: The requested operation cannot be performed on a file with a user-mapped section open
at sun.nio.ch.FileDispatcherImpl.truncate0(Native Method)
at sun.nio.ch.FileDispatcherImpl.truncate(Unknown Source)
at sun.nio.ch.FileChannelImpl.truncate(Unknown Source)
at org.neo4j.kernel.impl.nioneo.store.CommonAbstractStore.close(CommonAbstractStore.java:606)
... 7 more
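The nested cause ("a file with a user-mapped section open") is a Windows-specific limitation around truncating files that are still memory-mapped. A hedged workaround sketch, reusing the setting that already shows up in the batch.properties dumps elsewhere in these issues (whether it resolves this particular failure on 1.9.M01 is not verified):

    # batch.properties
    use_memory_mapped_buffers=false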

Batch Import on the existing nodes and relationships

Hi,
I am trying to use the batch-import utility for my use case and want to do a bulk load on top of existing nodes and relationships. Is there anything I can do like that using this utility?

I want it to append new relationships between existing nodes. Is that possible?
Thanks
Prashant

possible typo, please check -- FIXED

-- This is fixed!

Under the current limitations of the parallel batch inserter, the documentation says:

only up to 2bn relationships (due to an int based multi-map)

Is that correct, or do you mean only up to 2bn nodes?

Feature Request: Allow for Indexing

Love using the batch import, would love it more if it would create indexes as well as nodes and edges.

The current workaround is to use an external index (obvious, since we have the data and node_ids), but we don't have relationship_ids.

Skip relationships with missing nodes instead of failing

When either the "start" or "end" node of a relationship does not exist, the import fails with:

[WARNING]
java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:601)
        at org.codehaus.mojo.exec.ExecJavaMojo$1.run(ExecJavaMojo.java:297)
        at java.lang.Thread.run(Thread.java:722)
Caused by: org.neo4j.kernel.impl.nioneo.store.InvalidRecordException: NodeRecord[12972393] not in use
        at org.neo4j.kernel.impl.nioneo.store.NodeStore.getRecord(NodeStore.java:252)
        at org.neo4j.kernel.impl.nioneo.store.NodeStore.getRecord(NodeStore.java:125)
        at org.neo4j.unsafe.batchinsert.BatchInserterImpl.getNodeRecord(BatchInserterImpl.java:1190)
        at org.neo4j.unsafe.batchinsert.BatchInserterImpl.createRelationship(BatchInserterImpl.java:750)
        at org.neo4j.batchimport.Importer.importRelationships(Importer.java:158)
        at org.neo4j.batchimport.Importer.doImport(Importer.java:236)
        at org.neo4j.batchimport.Importer.main(Importer.java:83)
        ... 6 more
[INFO] ------------------------------------------------------------------------
[ERROR] BUILD ERROR
[INFO] ------------------------------------------------------------------------
[INFO] An exception occured while executing the Java class. null

NodeRecord[12972393] not in use

Where 12972393 was the missing node ID.

Is it possible for a warning message to be printed and the relationship skipped instead of having the whole import fail? Even if this was not the default behavior, I think it would be a useful feature as a configuration option.

Direct buffer memory error

Hi,

I'm trying to build in NetBeans, but I get this error:

Tests in error:
testFillBuffer(org.neo4j.batchimport.handlers.RelationshipUpdateCacheTest): Direct buffer memory

java.lang.OutOfMemoryError: Direct buffer memory
at java.nio.Bits.reserveMemory(Bits.java:633)
at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:98)
at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:288)
at org.neo4j.batchimport.handlers.RelationshipUpdateCache.createBuffers(RelationshipUpdateCache.java:72)
at org.neo4j.batchimport.handlers.RelationshipUpdateCache.<init>(RelationshipUpdateCache.java:56)
at org.neo4j.batchimport.handlers.RelationshipUpdateCacheTest.testFillBuffer(RelationshipUpdateCacheTest.java:37)

my netbeans.conf file has "-J-Xms512m -J-XX:PermSize=512m -J-Xmx1400m -J-XX:MaxPermSize=1400m"

Any idea of what's going on?
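Direct byte buffers are allocated outside the Java heap, so the -Xmx and PermGen values in netbeans.conf don't govern them (and netbeans.conf configures the IDE process, not the Maven test JVM); the direct pool is capped by the JVM flag -XX:MaxDirectMemorySize. A sketch of passing it to the forked test JVM (the value is illustrative, and whether this is the actual culprit here is not confirmed):

    mvn test -Dtest=RelationshipUpdateCacheTest -DargLine="-XX:MaxDirectMemorySize=512m"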

Problem during importing relations with auto indexing

Hi,
I'm trying to import nodes and I get the following error:

    Importing 20000 Nodes took 3 seconds

    Importing 118 Nodes took 0 seconds

    Total import time: 9 seconds
    Exception in thread "main" java.util.NoSuchElementException: More than one element in org.mapdb.Bind$5$1@1a0225b. First element is '1' and the second element is '2'
    at org.neo4j.helpers.collection.IteratorUtil.singleOrNull(IteratorUtil.java:122)
    at org.neo4j.helpers.collection.IteratorUtil.singleOrNull(IteratorUtil.java:289)
    at org.neo4j.batchimport.index.LongIterableIndexHits.getSingle(LongIterableIndexHits.java:33)
    at org.neo4j.batchimport.index.LongIterableIndexHits.getSingle(LongIterableIndexHits.java:12)
    at org.neo4j.batchimport.Importer.lookup(Importer.java:100)
    at org.neo4j.batchimport.Importer.id(Importer.java:139)
    at org.neo4j.batchimport.Importer.importRelationships(Importer.java:115)
    at org.neo4j.batchimport.Importer.doImport(Importer.java:188)
    at org.neo4j.batchimport.Importer.main(Importer.java:74)

I've got in batch.properties:
batch_import.node_index.nodetype=exact
batch_import.node_index.activity_id=exact
batch_import.node_index.student_id=exact

I'm importing nodes from two files. students.csv:

    nodetype:string:nodetype    id_student:int:student_id   student_name
    student 62296   cc65ccc2d7a1a634d53
    student 62297   32731e2b3f905201751
    student 62298   f467e881fb927293bfd
    student 62299   f58a8028ca573feb8d5
    ....

and main.csv:

    nodetype:string:nodetype    [...]   id:int:activity_id
    act [...]   1
    act [...]   2
    act [...]   3
    act [...]   4
    act [...]   5
    act [...]   6
    act [...]   7

my relations file:

    id:int:activity_id  id_student:int:student_id   type
    1   62296   SOLVED_BY
    2   62296   SOLVED_BY
    3   62296   SOLVED_BY
    4   62296   SOLVED_BY
    5   62296   SOLVED_BY
    6   62296   SOLVED_BY
    7   62296   SOLVED_BY
    8   62296   SOLVED_BY
    9   62296   SOLVED_BY
    10  62296   SOLVED_BY

Any advice on what I'm doing wrong? ;)

edge list support?

Hi,

The current importer needs numerical node IDs to work, which correspond to the line numbers in the node list. But what to do when the data is just an edge list? E.g.:

Source, Target
Michael, Selina
Rana, Selma
Michael, Selma
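One way a name-only edge list maps onto the current importer, sketched from the index-lookup headers used in the automatic-indexing issues above (batch_import.node_index.users=exact plus start:string:users / end:string:users columns): derive a node file of the distinct names and reference the names, rather than line numbers, in the rels file. For example:

    nodes.csv
    name:string:users
    Michael
    Selina
    Rana
    Selma

    rels.csv
    start:string:users  end:string:users  type
    Michael  Selina  CONNECTED_TO
    Rana     Selma   CONNECTED_TO
    Michael  Selma   CONNECTED_TO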

Exception thrown unless EOL are dos style

It seems that files with line endings using just 'newline' throws an exception (see below).

When I process the .csv to have carriage returns in addition to the newlines, the script works.

The CSV files were created with MySQL Workbench on OS X. I added the line endings with:
$ perl -pe 's/\r\n|\n|\r/\r\n/g' nodes.csv > newNodesFile.csv
and then the script worked.

FROM TERMINAL:
$java -server -Xmx4G -jar target/batch-import-jar-with-dependencies.jar target/graph.db bnodes.csv rels.csv

Using Existing Configuration File
remote_logging_host=127.0.0.1
forced_kernel_id=
read_only=false
neo4j.ext.udc.host=udc.neo4j.org
logical_log=nioneo_logical.log
online_backup_enabled=false
remote_logging_port=4560
gc_monitor_threshold=200ms
array_block_size=120
load_kernel_extensions=true
neostore.relationshipstore.db.mapped_memory=1000M
node_auto_indexing=false
intercept_committing_transactions=false
keep_logical_logs=true
dump_configuration=true
gc_monitor_wait_time=100ms
cache_type=none
intercept_deserialized_transactions=false
neostore.nodestore.db.mapped_memory=200M
neo4j.ext.udc.first_delay=600000
neo4j.ext.udc.reg=unreg
lucene_searcher_cache_size=2147483647
neo4j.ext.udc.interval=86400000
use_memory_mapped_buffers=true
rebuild_idgenerators_fast=true
neostore.propertystore.db.index.keys.mapped_memory=5M
neostore.propertystore.db.strings.mapped_memory=200M
neostore.propertystore.db.arrays.mapped_memory=130M
neo_store=neostore
logging.threshold_for_rotation=104857600
neostore.propertystore.db.index.mapped_memory=5M
backup_slave=false
neostore.propertystore.db.mapped_memory=1000M
gcr_cache_min_log_interval=60s
relationship_grab_size=100
relationship_auto_indexing=false
string_block_size=120
lucene_writer_cache_size=2147483647
node_cache_array_fraction=1.0
grab_file_lock=true
remote_logging_enabled=false
allow_store_upgrade=false
neo4j.ext.udc.enabled=true
execution_guard_enabled=false
relationship_cache_array_fraction=1.0
online_backup_port=6362

Total import time: 0 seconds
Exception in thread "main" org.neo4j.graphdb.NotFoundException: id=29
at org.neo4j.unsafe.batchinsert.BatchInserterImpl.getNodeRecord(BatchInserterImpl.java:902)
at org.neo4j.unsafe.batchinsert.BatchInserterImpl.createRelationship(BatchInserterImpl.java:455)
at org.neo4j.batchimport.Importer.importRelationships(Importer.java:243)
at org.neo4j.batchimport.Importer.main(Importer.java:86)

option for indexing only

Let's say someone inserted nodes and relationships into a Neo4j database without indexing them. A way to rerun batch-import, or an extension, just to index nodes only, relationships only, or both would be very helpful. Currently, if you try to index only nodes or relationships, it overwrites the database.

Execution of parallel batch inserter

  1. It would be helpful if execution of the parallel batch inserter were clearly documented, with an example showing the use of nodes.csv and rels.csv.

  2. It runs with main class org.neo4j.batchimport.ParallelImporter while using nodes.csv and rels.csv.

  3. A statement saying that the parallel batch inserter currently doesn't support indexing, so if you require indexing please use the regular batch-import, would make things clear and remove any chance of confusion.

FORMAT:

data/dir nodes.csv relationships.csv #nodes #max-props-per-node #usual-rels-pernode #max-rels-per-node #max-props-per-rel rel,types

Example:

MAVEN_OPTS="-Xmx50G -Xms50G -server -d64 -XX:NewRatio=5" mvn clean test-compile exec:java -Dexec.mainClass=org.neo4j.batchimport.ParallelImporter -Dexec.args="./neo4j-community-1.9.M04/data/graph.db ./nodes.csv ./relationships.csv 4000000 2 100 200 2 KNOWS, FRIEND"

Configurable relationship discovery

In the case of importing several node CSVs (with e.g. 1 CSV <-> 1 table "dump"), it would be extra nice to have batch-import figure out what the relationships are.

Manually specifying ALL relationships between nodes created from different tables is quite error-prone.

What about a new parameter like the last one in this configuration example :

batch_import.nodes_files=file1.csv,file2.csv,file3.csv batch_import.discoverable_links=file1.csv ref_column file2.csv id,file2.csv ref_column file3.csv id

That means: automatically create relationships between nodes whose:

  • (from file1.csv) ref_column equals the id of file2.csv nodes
  • (from file2.csv) ref_column equals the id of file3.csv nodes

I guess these attributes could be removed from the nodes; the ref_column would become the name of the relationship.

If one of the files or columns does not exist, an error would be thrown.
If batch_import.rels_files is set, then it takes precedence over this new property.

What do you think about it?

This will be backwards compatible and I can develop it soon (as I actually need it).

Node id feature: NotFoundException

Hey,
Instead of referring to row numbers in my CSV file, I want to refer to node IDs. So my node file looks like:

i:id    identifier:string:TokenNode
....        ....
99          %$% 
....        ....

and the rel file:

start   end type
99       98 OCCURENCE

I run the importer with:

mvn test-compile exec:java -Dexec.mainClass="org.neo4j.batchimport.Importer" -Dexec.args="batch.properties target/graph.db nodes.csv rels.csv"

And the following exception is returned:

    Caused by: org.neo4j.graphdb.NotFoundException: id=99

But the ID is in my file, and the Importer just imports the first node with id 1 and crashes on the first relationship above.

Regards Toa

Difficulty with automatic indexing

I'm completely new to this, and I'm basically just trying to get the examples working. I manage to make a database just fine until I try to add auto indexing.

I have added

batch_import.node_index.users=exact

to the batch.properties file.

I am using the example files exactly as they are (except without the int declaration in rels; it didn't seem to like that):

nodes.csv:
name:string:users age works_on
Michael 37 neo4j
Selina 14
Rana 6
Selma 4

rels.csv:
start:string:users end:string:users type since counter
Michael Selina FATHER_OF 1998-07-10 1
Michael Rana FATHER_OF 2007-09-15 2
Michael Selma FATHER_OF 2008-05-03 3
Rana Selma SISTER_OF 2008-05-03 5
Selina Rana SISTER_OF 2007-09-15 7

This is what I get when I try:

Importing 4 Nodes took 0 seconds

Total import time: 1 seconds
[WARNING]
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.codehaus.mojo.exec.ExecJavaMojo$1.run(ExecJavaMojo.java:297)
at java.lang.Thread.run(Thread.java:724)
Caused by: java.lang.NullPointerException
at org.neo4j.batchimport.Importer.lookup(Importer.java:100)
at org.neo4j.batchimport.Importer.id(Importer.java:139)
at org.neo4j.batchimport.Importer.importRelationships(Importer.java:115)
at org.neo4j.batchimport.Importer.doImport(Importer.java:188)
at org.neo4j.batchimport.Importer.main(Importer.java:74)
... 6 more

[INFO] ------------------------------------------------------------------------
[ERROR] BUILD ERROR
[INFO] ------------------------------------------------------------------------
[INFO] An exception occured while executing the Java class. null

[INFO] ------------------------------------------------------------------------
[INFO] For more information, run Maven with the -e switch
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 10 seconds
[INFO] Finished at: Thu Jul 04 11:02:19 CEST 2013
[INFO] Final Memory: 28M/183M
[INFO] ------------------------------------------------------------------------

What am I doing wrong?

format checker

example: java -server -Xmx5G -jar ./batch-import-jar-with-dependencies.jar ./neo4j-enterprise-1.8.1/data/graph.db ./nodes.csv ./rels.csv node_index NodesIndex exact ./nodes_index.csv rel_index RelsIndex exact ./rels_index.csv

When someone runs batch-import with the nodes, rels, nodes_index and rels_index CSV files, it would be great if a format checker could validate all the files up front. Currently, if the format of rels_index.csv is not correct, one might have to re-run the whole batch process.

Behaviour for index nodes

Hi,

I had some trouble getting an index up and running with the batch-importer. I got no error and no warning.

More or less by accident, I realized that I had misspelled the file names of the indexes.

But only by looking into the source code was I able to understand why I got no feedback: if the index CSV file does not exist, the Importer does not report an error, it just returns silently:

if (!indexFile.exists()) return;

In importIndex(...); maybe calling the Report here would help a bit.

Cheers,
Stephan Froede

Importing Foreign Language Characters

I am having a hard time importing a non-English database. A couple of examples to illustrate the problem: "Karl-Heinz Böckstiegel" imports as "Karl-Heinz B��ckstiegel", and "Laurent Lévy" imports as "Laurent L��vy". The same applies to text in Chinese, etc. I'm not sure if I am doing anything wrong (I am using UTF-8 .csv files) or missing an encoding parameter of batch-import, but if not, it would be really useful to get this issue fixed.
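For what it's worth, the invocation in the "Start/End Rel sort" issue above passes the JVM's file encoding explicitly; a sketch of the same flag applied here (whether it fixes this particular case is not confirmed, and the CSV files must genuinely be UTF-8):

    java -server -Dfile.encoding=UTF-8 -Xmx4G -jar target/batch-import-jar-with-dependencies.jar target/db nodes.csv rels.csv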

MVN package problem

Is there a specific version of mvn to use in order to get the .jar file generated correctly?

I'm currently trying with mvn 3.1.0; when I do mvn package I get failed tests [2].

If I skip the tests I get a .jar file without a manifest and without dependencies. [1]

[1] "no main manifest attribute, in batch-import-1.9-SNAPSHOT.jar".
[2]
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.12.4:test (default-test) on project batch-import: There are test failures.
[ERROR]
[ERROR] Please refer to /Users/dav009/Downloads/batch-import-master 2/target/surefire-reports for the individual test results.
[ERROR] -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException

Node Types?

How are nodes of different types handled? For example, I have order nodes, product nodes, and address nodes. One could envision the relationship between an order node and a product node as ORDERED (with maybe the date and quantity as properties of the relationship). One could also envision a relationship between the address node and the order node as SHIPPED-TO. There are obviously different properties for each of these node types. If I try to do this with several different batch imports, then the node IDs invariably get duplicated. Ideas?

Backslash isn't escaped

If you use a unique index, the following nodes end up being treated as equal: "\o" and "o" (the backslash appears to be dropped). I assume that they should not be equal. Maybe escape special characters with backticks (`). My quick-fix solution was to escape the backslashes in my CSV file.

Exception is:

java.util.NoSuchElementException: More than one element in     org.mapdb.Bind$5$1@239f768f
