Comments (14)

lgalarra commented on June 22, 2024

Hi Rémi,

any update on this, do you still need this feature ?

I have no news so far as I have not had the time to work on it, but the application I mentioned to you via email still holds. I am going to re-implement HIPAR using scikit-mine. HIPAR is a method to mine rules of the form

pattern => y = some linear function

like for example:

wine-variety = "Grenache" => wine-price = -320 + 4 * score-by-critics + 6 * age

Mining the left-hand side of rules can be done in an LCM fashion; however, our search space can be large and I would like to exploit parallel techniques to speed up computation. Furthermore, there is another particularity: HIPAR can also discretize numerical variables. For example, before descending deeper in the search tree, HIPAR may discretize the variable score-by-critics into two bins, high-score (>=4) and low-score (<4), and those conditions become new items for the exploration rooted at the pattern wine-variety="Grenache". That is why we need as much flexibility as possible.
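
To make the rule format concrete, here is a minimal sketch of how such a rule could be represented and applied. The class `HybridRule` and the helper `discretize` are hypothetical names chosen for illustration; they are not part of HIPAR or scikit-mine, and the wine record is made-up data.

```python
from dataclasses import dataclass


@dataclass
class HybridRule:
    """Hypothetical container for a rule `pattern => y = linear function`."""
    pattern: dict        # categorical conditions, e.g. {"wine-variety": "Grenache"}
    intercept: float
    coefficients: dict   # numeric attribute name -> slope

    def matches(self, row: dict) -> bool:
        # The rule applies only if every condition of the pattern holds.
        return all(row.get(k) == v for k, v in self.pattern.items())

    def predict(self, row: dict) -> float:
        # Evaluate the linear function on the numeric attributes.
        return self.intercept + sum(c * row[a] for a, c in self.coefficients.items())


def discretize(value: float, threshold: float, name: str) -> str:
    """Map a numeric value to one of two items, e.g. 'high-score' or 'low-score'."""
    return f"high-{name}" if value >= threshold else f"low-{name}"


rule = HybridRule(
    pattern={"wine-variety": "Grenache"},
    intercept=-320,
    coefficients={"score-by-critics": 4, "age": 6},
)
wine = {"wine-variety": "Grenache", "score-by-critics": 5, "age": 60}  # made-up record

if rule.matches(wine):
    price = rule.predict(wine)

# The discretized condition can then act as a new item below the current pattern.
score_item = discretize(wine["score-by-critics"], threshold=4, name="score")
```

The point of the sketch is only that the left-hand side is an ordinary pattern, which is what an LCM-style miner would enumerate, while the right-hand side is fitted separately on the matching rows.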

Given my current constraints, I will not have time to work on this before the beginning of October. Suggestions are of course welcome.

Best,
Luis

from scikit-mine.

lgalarra commented on June 22, 2024

Go for it! If I have a particular problem in this regard, I will then open a new issue (or perhaps reopen this one). So far you have provided me with good hints.

remiadon commented on June 22, 2024

Here is the code for this

    def generate(self):
        """Online discovery of itemsets.
        Note that this is purely sequential, hence no parallelisation is done.
        """
        if self.item_to_tids is None:  # .fit has not been called
            return

        supp_sorted_items = filter(lambda e: len(e[1]) >= self._min_supp, self.item_to_tids.items())
        for item, tids in supp_sorted_items:
            yield self._explore_item(item, tids)

_explore_item returns a pd.DataFrame

lgalarra commented on June 22, 2024

Wow, that is great news! Thanks Rémi! I see the limitation with multi-threading. I am not familiar with your implementation, but I wonder if you could use the same synchronization mechanisms for the online version. How is the output data structure synchronized in the original implementation?

remiadon commented on June 22, 2024

@lgalarra thanks a lot.

If I understand your question, synchronization is done at the very end: we consider every original item as a root node and explore it. This process results in a "list of DataFrames". Finally, we concatenate these DataFrames to obtain one single big DataFrame.
Am I clear enough?

If you have additional use cases in mind where LCM could be used, it would probably give us more details to provide a generic implementation.
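
The "explore every root, merge at the end" scheme can be sketched as follows. This is not the scikit-mine code: `explore_root` is a toy stand-in for `_explore_item`, plain lists replace the DataFrames, the data is made up, and a thread pool is used only to keep the example self-contained.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import chain

# Toy vertical database: item -> set of transaction ids (made-up data).
item_to_tids = {
    "a": {0, 1, 2, 3},
    "b": {0, 1, 2},
    "c": {1, 2, 3},
    "d": {3},
}
MIN_SUPP = 2


def explore_root(root):
    """Toy exploration: the root alone, plus frequent pairs with later items."""
    tids = item_to_tids[root]
    found = [((root,), len(tids))]
    for other in sorted(item_to_tids):
        if other > root:
            common = tids & item_to_tids[other]
            if len(common) >= MIN_SUPP:
                found.append(((root, other), len(common)))
    return found


# Keep only frequent items as root nodes.
roots = [item for item, tids in sorted(item_to_tids.items()) if len(tids) >= MIN_SUPP]

with ThreadPoolExecutor(max_workers=2) as pool:
    # One independent result list per root; workers never share an output structure.
    per_root = list(pool.map(explore_root, roots))

# The only synchronization point: merge once every root has been explored.
patterns = list(chain.from_iterable(per_root))
```

Because each worker owns its own result list, no locking is needed during the exploration; the price is that nothing is available to the caller until the final merge.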

lgalarra commented on June 22, 2024

Yes, it is very clear, thanks. Indeed, parallelizing the online version will require synchronizing the threads when they yield a result. It is not a trivial problem. If more than one thread has a result to yield, the others will have to stall, which is inefficient. The threads could also buffer their results, but then we would need an additional consumer thread that continuously polls their buffers to see if there are patterns to report. In any case, it is probably a challenge not worth the pain.
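
For reference, the buffered variant described above could look roughly like the sketch below, using only the standard library. It assumes a single shared `queue.Queue` instead of one buffer per thread, and the `worker` function merely pretends to discover patterns; both are simplifications made for this example.

```python
import queue
import threading

results = queue.Queue()  # thread-safe buffer shared by all workers
DONE = object()          # sentinel a worker pushes when it has finished


def worker(root, n_patterns):
    """Pretend to mine `n_patterns` patterns under `root` and report each one."""
    for k in range(n_patterns):
        results.put((root, k))
    results.put(DONE)


def consume(n_workers):
    """Drain the buffer until every worker has signalled completion."""
    reported = []
    finished = 0
    while finished < n_workers:
        item = results.get()  # blocks until something is available
        if item is DONE:
            finished += 1
        else:
            reported.append(item)
    return reported


jobs = {"a": 3, "b": 2}  # root item -> number of fake patterns
threads = [threading.Thread(target=worker, args=(r, n)) for r, n in jobs.items()]
for t in threads:
    t.start()

reported = consume(len(threads))  # here the main thread plays the consumer role

for t in threads:
    t.join()
```

With a blocking queue the consumer does not have to poll, but the order in which patterns from different roots arrive is not deterministic.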

remiadon commented on June 22, 2024

I was thinking more about using coroutines in this case (well, perhaps your coroutines are handled using threads), but I think async programming is well suited for this: you pull a batch of patterns from LCM and launch some other code. While doing this you don't block, but ask LCM to look for the next batch of data.

This makes LCM work as a background task while you are using the patterns for your own purpose.

lgalarra commented on June 22, 2024

Hi Rémi,

What is a batch in your definition? If I understand correctly you propose to delegate the synchronization process to the user. Is it right? If so, I think it is a wise idea.

I see it this way: the user may define a set of listeners (where the number of listeners equals the number of threads) that execute some code whenever a thread pushes a pattern. Then it is the user's duty to craft their exploration strategy based on the threads. Does it make sense?

remiadon commented on June 22, 2024

What is a batch in your definition?

To me a batch is a subset of the patternset you want to discover


If I understand correctly you propose to delegate the synchronization process to the user.

Exactly, my idea is

  • skmine exposes an async function, that yields a set of patterns (or batch)
  • user can choose how to call this function

I see it this way: The user may define a set of listeners

I think the most appealing example we can mimic is a DB/HTTP request.
LCM can be seen as a general-purpose itemset miner. Users may want to subscribe to it, just like they would subscribe to a database.


So now let's put this into code
This example is close to my idea

import asyncio

from skmine.itemsets import LCM

async def main_function():  # user side
    lcm = LCM(min_supp=.2)  # min_supp of 20%
    async for batch in lcm:  # proposed interface, not an existing one
        ...  # do whatever you want with patterns

Hope this makes sense
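
As a self-contained illustration of the consumer side, the sketch below replaces LCM with a toy async generator. `mine_batches`, its `batch_size` parameter and the data are all hypothetical; the "mining" is just a support filter over single items and still runs in the event loop thread, so this only shows how a user would pull batches with `async for`.

```python
import asyncio


async def mine_batches(item_to_tids, min_supp, batch_size=2):
    """Hypothetical async generator yielding batches (lists) of (item, support)."""
    batch = []
    for item, tids in item_to_tids.items():
        if len(tids) >= min_supp:
            batch.append((item, len(tids)))
        if len(batch) == batch_size:
            yield batch
            batch = []
            await asyncio.sleep(0)  # give other coroutines a chance to run
    if batch:  # flush the last, possibly smaller, batch
        yield batch


async def main_function():  # user side
    data = {"a": {0, 1, 2}, "b": {0, 1}, "c": {2}, "d": {0, 2}}  # made-up data
    seen = []
    async for batch in mine_batches(data, min_supp=2):
        seen.append(batch)  # do whatever you want with patterns
    return seen


batches = asyncio.run(main_function())
```

How large a batch should be, and whether the generator hands control back between batches, are design choices left to whoever implements the real interface.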

lgalarra commented on June 22, 2024

skmine exposes an async function, that yields a set of patterns (or batch)

Would this function run in a single thread?

I am asking this because my ambition would be to implement a parallel miner based on a set of asynchronous subscriptions that give me different patterns (e.g., they explore disjoint sub-spaces). Would that be possible? And even first, does it make sense?

Thanks Rémi!

remiadon commented on June 22, 2024

@lgalarra I think yes
Once in asynchronous mode, you are free to choose an Executor for the code

The best resource I've found is here

But again, I don't think threads are the best option, because the current LCM implementation is mostly pure Python, and holds the GIL. Processes might be a better option
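
A minimal sketch of the "choose your own Executor" idea, assuming each sub-space can be explored by an independent blocking function. `explore` and `mine_all` are hypothetical names and the data is made up. A `ThreadPoolExecutor` is used so the snippet runs anywhere; for CPU-bound pure-Python work a `ProcessPoolExecutor` could be passed instead, provided the function and its arguments are picklable and the entry point is guarded by `if __name__ == "__main__":`.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


def explore(root, tids):
    """Blocking toy exploration of one sub-space (stand-in for real mining)."""
    return [(root, len(tids))]


async def mine_all(item_to_tids, executor):
    loop = asyncio.get_running_loop()
    # One future per disjoint sub-space; the executor decides where each one runs.
    futures = [
        loop.run_in_executor(executor, explore, item, tids)
        for item, tids in item_to_tids.items()
    ]
    found = []
    for finished in asyncio.as_completed(futures):
        found.extend(await finished)  # results arrive in completion order
    return found


data = {"a": {0, 1, 2}, "b": {0, 1}}  # made-up data
with ThreadPoolExecutor(max_workers=2) as executor:
    found = asyncio.run(mine_all(data, executor))
```

The calling code stays the same whichever executor is used, which is what makes it reasonable to leave that decision to the user.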

If you have a mature example, it would be nice to include it in our documentation :)

lgalarra commented on June 22, 2024

I have planned to devote some time to this, so I will come back to you soon. Thanks!

remiadon commented on June 22, 2024

@lgalarra any update on this, do you still need this feature ?

Do we have a motivating example ?

remiadon commented on June 22, 2024

@lgalarra I may close this, as I think your use case works fine with the current implementation

I may rework the current implementation of LCM to get better performance
