Neo4j creating relationships using csv

11,297

Be sure to have schema indexes in place to speed up looking up the start nodes. Before running the import, run:

CREATE INDEX ON :Movie(title);
CREATE INDEX ON :Keyword(word);

Make sure the indexes are populated and online (check with the :schema command).

Refactor your Cypher command into two queries to make use of the indexes. For now, an index consists of only a label and one property:

USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:/home/gondil/temp.csv" AS csv
FIELDTERMINATOR '|'
MERGE (m:Movie {title:csv.title })
ON CREATE SET m.year = toInt(csv.year)
MERGE (k:Keyword {word:csv.word});

Then make a second pass over the file:

USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:/home/gondil/temp.csv" AS csv
FIELDTERMINATOR '|'
MATCH (m:Movie {title:csv.title })
MATCH (k:Keyword {word:csv.word})
MERGE (m)-[:Has {weight:1}]->(k);
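
Because MATCH silently drops any row whose movie or keyword is not found, it can help to dry-run the lookups on a few rows before the full second pass. The following is only a minimal sketch, assuming the same file path and delimiter as above. The LIMIT of 10 and the column aliases movieFound and keywordFound are arbitrary choices for illustration. The labels and property names must match the existing nodes exactly.

```cypher
// Dry run: read only the first 10 rows and report whether each lookup finds a node.
// Nothing is written to the database.
LOAD CSV WITH HEADERS FROM "file:/home/gondil/temp.csv" AS csv
FIELDTERMINATOR '|'
WITH csv LIMIT 10
// OPTIONAL MATCH keeps the row even when no node matches (m or k is then null)
OPTIONAL MATCH (m:Movie {title: csv.title})
OPTIONAL MATCH (k:Keyword {word: csv.word})
RETURN csv.title, csv.word,
       m IS NOT NULL AS movieFound,
       k IS NOT NULL AS keywordFound;
```

If this returns quickly and both flags are true for every row, the second pass should be able to create the relationships. If a flag is false, the label or property used in the lookup probably does not match what is stored.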

Author: Gondil

Updated on June 28, 2022

Comments

  • Gondil
    Gondil almost 2 years

    I am trying to create relationships between two types of nodes using a loaded CSV file. I have already created all the Movie and Keyword nodes. I also created indexes on :Movie(title) and :Keyword(word).

    My csv file looks like:

    "title"|year|"word" //header

    "Into the Wild"|2007|"1990s" //line with title, year and keyword

    "Into the Wild"|2007|"abandoned-bus"

    My query:

    LOAD CSV WITH HEADERS FROM "file:/home/gondil/temp.csv" AS csv
    FIELDTERMINATOR '|'
    MATCH (m:Movie {title:csv.title,year: toInt(csv.year)}), (k:Keyword {word:csv.word})
    MERGE (m)-[:Has {weight:1}]->(k);
    

    The query runs for about an hour and then shows the error "Unknown error". What a redundant error description.

    I thought it was due to the 160K keywords, over 1M movies, and over 4M lines in the CSV. So I shortened the CSV to just one line, and it still runs for about 15 minutes without finishing.

    Where is the problem? How do I write a query that creates relationships between two already-created nodes?

    I could also delete all the nodes and build my database another way, but I would prefer not to delete all the nodes I have already created.

    Note: I shouldn't have hardware problems, because I use the Super PC from our faculty.

    • ADTC
      ADTC almost 8 years
      "redundant" I don't think it means what you think it means.
  • Gondil
    Gondil over 9 years
    I executed :schema. It returned Indexes ON :Keyword(word) ONLINE and ON :Movies(title) ONLINE. Then I executed your suggested query on a CSV with just 2 lines, and it has been running for 15+ minutes. I can't figure out what is wrong. I tested just returning the nodes matched from the CSV file, and that takes about 118 ms.
  • Gondil
    Gondil over 9 years
    Now I'm confused. The Neo4j browser interface says I have :Movies and :Keywords nodes, but strangely it also shows that I have :Has rels, so some must have been created. I tried some very primitive queries, such as returning a single node. It took about 30 seconds, but it should take a few ms. Some of those queries run for a really long time and then an Unknown error occurs. For example, I tried to create a single relationship, not from the CSV, just merging 2 nodes, and it did not complete. I'm really desperate. Do you know what could be wrong? I would like to send you a message rather than an off-topic comment.
  • Gondil
    Gondil about 9 years
    Hello, I returned to this after a long time. I deleted everything in my database and then started with the first periodic commit query you posted. I let it run and walked away from my laptop, but forgot to disable sleep. When I came back there was an Unknown error, but when I looked at the webadmin interface there were some nodes and, strangely, the count was increasing. It is still increasing: when I run a query to count the nodes and run it again a few minutes later, the count has grown. My database is running on a remote PC. Why did it show me an error? How can I find out what's going on?
  • Gondil
    Gondil about 9 years
    It has now been running for about 21 hours and only 74K nodes have been created. There is no information about the running query, but in the webadmin interface the counts of properties and nodes are still increasing very slowly. I did a rough calculation, and it seems I'll need to wait 19 days for it to finish. And it's only 1.2M nodes. How can I stop this madness? How can I get it to finish successfully and relatively fast?