
Comments (4)

hag0r commented on May 20, 2024

Hi,
No, you're not configuring it wrong. It just seems that DBSCAN takes a very long time on this large dataset: it has to compute many distances and perform lots of comparisons.

If it works for smaller data sets I would guess that:

  1. The operation just takes a long time (maybe also with some interruptions from other processes?)
  2. Maybe the garbage collector has to run too often to free up space, which causes your program to stall frequently.

Can you describe the setup you are running your program on? Number of nodes, CPUs, etc.

from stark.

cristinaluengoagullo commented on May 20, 2024

Hi,
Thanks! I'm running it on a 6-node cluster with YARN, using Zeppelin to run the Spark jobs. Each node has 50GB and 16 cores available, and I used executors with 12g and 4 cores each. When I use a ppd of 10, the execution finishes with YARN errors on different containers (out of memory) and huge garbage collection times for each task. But when I use a value of 20, for example, the garbage collection time is small and the program just stalls at line 149 of DBScan.scala. There seems to be progress in the map function (line 149), but it is very, very slow (for example, I left the program running for 2 days straight and it had not reached 10% of the map function's progress when I checked).
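
To make the effect of ppd more concrete, here is a minimal sketch in plain Scala. It assumes that ppd means "partitions per dimension" of a regular grid laid over the bounding box of the data, and that each grid cell is clustered on its own with a naive all-pairs distance check. `GridSketch`, `Point`, `Bounds`, `cellId`, `pairwise` and `totalComparisons` are hypothetical names made up for this illustration; this is not STARK's GridPartitioner, and it ignores the points near cell borders that a distributed DBSCAN would also have to handle.

```scala
// Hypothetical sketch of a "partitions per dimension" grid, not STARK's code.
object GridSketch {
  final case class Point(x: Double, y: Double)
  final case class Bounds(minX: Double, maxX: Double, minY: Double, maxY: Double)

  // Bounding box of a non-empty collection of points.
  def bounds(points: Seq[Point]): Bounds = {
    val xs = points.map(_.x)
    val ys = points.map(_.y)
    Bounds(xs.min, xs.max, ys.min, ys.max)
  }

  // Map a point to a cell id in a ppd x ppd grid over the bounding box.
  // Points on the upper border are clamped into the last cell.
  def cellId(p: Point, b: Bounds, ppd: Int): Int = {
    val cx = math.min(((p.x - b.minX) / (b.maxX - b.minX) * ppd).toInt, ppd - 1)
    val cy = math.min(((p.y - b.minY) / (b.maxY - b.minY) * ppd).toInt, ppd - 1)
    cy * ppd + cx
  }

  // Number of distinct point pairs among n points (naive all-pairs check).
  def pairwise(n: Long): Long = n * (n - 1) / 2

  // Sum of all-pairs comparisons when every cell is processed independently.
  def totalComparisons(points: Seq[Point], ppd: Int): Long = {
    val b = bounds(points)
    points
      .groupBy(p => cellId(p, b, ppd))
      .values
      .map(cell => pairwise(cell.size.toLong))
      .sum
  }
}
```

Under these assumptions, 10,000 points in a single cell need 49,995,000 pairwise comparisons, and a finer grid lowers both the per-cell point count (memory) and the total number of comparisons. It would not explain a stall by itself, but it shows why the choice of ppd changes the behaviour so much.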

I also tried using the BSPartitioner, but it crashed earlier than with the GridPartitioner. And I also tried repartitioning the data before calling the dbscan run function, but still, it gets stalled :(

I noticed there's a comment in DBScan.scala (line 148) that suggests repartitioning the data after the groupBy. Maybe I'll try to modify that part and see if there's any improvement. Do you think it could work?
Thanks again!


hag0r commented on May 20, 2024

It may help. However, I also found a place where we collected the cluster content in a List object instead of reusing the Iterator. This will lead to performance issues for large partitions.

I will try to solve this
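
For readers unfamiliar with the List versus Iterator issue mentioned above, here is a minimal sketch in plain Scala (no Spark needed). It is not the actual code from DBScan.scala; `IteratorSketch`, `label`, `eager` and `streaming` are hypothetical names, and `label` is only a stand-in for whatever per-element work happens inside a partition.

```scala
// Hypothetical sketch of eager vs. lazy processing of a partition's elements.
object IteratorSketch {
  // Stand-in for per-element work (made up for this illustration).
  def label(id: Int): (Int, Int) = (id, id % 3)

  // Eager variant: toList pulls every element of the partition into memory
  // before any result is produced.
  def eager(partition: Iterator[Int]): Iterator[(Int, Int)] = {
    val all: List[Int] = partition.toList
    all.map(label).iterator
  }

  // Lazy variant: each element is transformed only when the consumer asks
  // for the next result, so the whole partition is never held at once.
  def streaming(partition: Iterator[Int]): Iterator[(Int, Int)] =
    partition.map(label)
}
```

Spark's `mapPartitions` hands each partition to the function as an `Iterator`, so the lazy variant fits that API directly. Whether the clustering step can stream like this depends on whether it needs to revisit elements, which this sketch does not model.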


cristinaluengoagullo commented on May 20, 2024

Ok thank you very much!!

