Describe the workflow you want to enable LCM mines a lot of patter

online discovery in LCM about scikit-mine HOT 14 CLOSED

scikit-mine commented on June 22, 2024

online discovery in LCM

Comments (14)

lgalarra commented on June 22, 2024 1

Hi Rémi,

> any update on this, do you still need this feature?

I have no news so far as I have not had the time to work on it, but the application I mentioned to you via email still holds. I am going to re-implement HIPAR using scikit-mine. HIPAR is a method to mine rules of the form

pattern => y = some linear function

like for example:

wine-variety = "Grenache" => wine-price = -320 + 4 * score-by-critics + 6 * age

Mining the left-hand side of rules can be done in an LCM fashion; however, our search space can be large and I would like to exploit parallel techniques to speed up computation. Furthermore, there is another particularity: HIPAR can discretize numerical variables. For example, before descending deeper in the search tree, HIPAR may discretize the variable score-by-critics into two bins, high-score (>=4) and low-score (<4), and those conditions become new itemsets for the exploration rooted at the pattern wine-variety="Grenache". That is why we need as much flexibility as possible.
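To make the rule shape concrete, here is a minimal sketch of how such a rule and the two-bin discretization could be represented. The names `LinearRule` and `discretize` are hypothetical and are not taken from HIPAR or scikit-mine; the threshold of 4 and the wine attributes come from the example above, while the row values are made up.

```python
from dataclasses import dataclass


@dataclass
class LinearRule:
    """Hypothetical container for a rule `pattern => y = intercept + sum(w * x)`."""
    pattern: dict        # categorical conditions, e.g. {"wine-variety": "Grenache"}
    intercept: float
    coefficients: dict   # numerical attribute -> weight

    def matches(self, row):
        # the rule applies only if every condition of the pattern holds
        return all(row.get(attr) == value for attr, value in self.pattern.items())

    def predict(self, row):
        # evaluate the linear function on the right-hand side of the rule
        return self.intercept + sum(w * row[attr] for attr, w in self.coefficients.items())


def discretize(row, attribute, threshold):
    """Map a numerical attribute to one of two bins (>= threshold or < threshold)."""
    label = "high" if row[attribute] >= threshold else "low"
    return f"{attribute}={label}"


rule = LinearRule(
    pattern={"wine-variety": "Grenache"},
    intercept=-320,
    coefficients={"score-by-critics": 4, "age": 6},
)
row = {"wine-variety": "Grenache", "score-by-critics": 5, "age": 55}  # made-up row
predicted_price = rule.predict(row)              # -320 + 4 * 5 + 6 * 55
new_item = discretize(row, "score-by-critics", 4)
```

The discretized condition (`new_item`) is what would be added as an extra item for the exploration rooted at the matching pattern.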

Given my current constraints, I will not have time to work on this before the beginning of October. Suggestions are of course welcome.

Best,
Luis

lgalarra commented on June 22, 2024 1

Go for it! If I have a particular problem in this regard, I will then open a new issue (or perhaps reopen this one). So far you have provided me with good hints.

remiadon commented on June 22, 2024

Here is the code for this

    def generate(self):
        """Online discovery of itemsets.
        Note that this is purely sequential, hence no parallelisation is done.
        """
        if self.item_to_tids is None:  # .fit has not been called
            return

        for item, tids in self.item_to_tids.items():
            if len(tids) >= self._min_supp:  # skip items below the support threshold
                yield self._explore_item(item, tids)

`_explore_item` returns a pd.DataFrame, so the generator yields one DataFrame per frequent item.
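For readers without the library at hand, the same idea can be sketched as a standalone generator. This is a toy under stated assumptions: `explore_item` is a hypothetical stand-in that only reports the root item and its frequent pairs, it returns plain lists instead of DataFrames, and `min_supp` is an absolute count here.

```python
def explore_item(item, tids, item_to_tids, min_supp):
    """Toy stand-in for `_explore_item`: the root item plus its frequent pairs."""
    found = [((item,), len(tids))]
    for other, other_tids in item_to_tids.items():
        if other > item:  # only extend with larger items, so each pair is seen once
            common = tids & other_tids
            if len(common) >= min_supp:
                found.append(((item, other), len(common)))
    return found


def generate(item_to_tids, min_supp):
    """Lazily yield one batch of patterns per frequent root item (sequential)."""
    for item, tids in item_to_tids.items():
        if len(tids) >= min_supp:
            yield explore_item(item, tids, item_to_tids, min_supp)


# item -> set of transaction ids containing it (made-up data)
item_to_tids = {"a": {0, 1, 2}, "b": {0, 1}, "c": {2}}
batches = generate(item_to_tids, min_supp=2)
first = next(batches)   # nothing beyond the first root has been explored yet
rest = list(batches)    # exploring the remaining roots happens only now
```

Because the generator is lazy, the caller can stop after any batch without paying for the rest of the search.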

lgalarra commented on June 22, 2024

Wow, that is great news! Thanks Rémi! I see the limitation with multi-threading. I am not familiar with your implementation, but I wonder if you could use the same synchronization mechanisms for the online version. How is the output data structure synchronized in the original implementation?

remiadon commented on June 22, 2024

@lgalarra thanks a lot.

If I understand your question, synchronization is done at the very end: we consider every original item as a root node and explore it. This process results in a list of DataFrames. Finally, we concatenate these DataFrames to obtain one single big DataFrame.
Am I clear enough?
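A minimal sketch of this "explore every root independently, merge once at the end" pattern is shown below. It is not the scikit-mine implementation: `explore_root` is a hypothetical placeholder, a thread pool is used only for illustration, and plain lists stand in for DataFrames (with pandas, the final step would be a `pd.concat` over the per-root frames).

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import chain


def explore_root(root):
    """Hypothetical exploration of one root item; returns its own result list."""
    item, support = root
    return [((item,), support)]


roots = [("a", 3), ("b", 2)]  # made-up (item, support) pairs

with ThreadPoolExecutor(max_workers=2) as pool:
    # workers never share an output structure: each returns its own list
    per_root = list(pool.map(explore_root, roots))  # results keep the input order

# the only synchronization point: one merge after all workers are done
merged = list(chain.from_iterable(per_root))
```

Since each worker owns its output until the merge, no locking is needed during the exploration itself.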

If you have additional use cases in mind where LCM could be used, sharing them would help us design a more generic implementation.

lgalarra commented on June 22, 2024

Yes, it is very clear, thanks. Indeed, parallelizing the online version will require synchronizing the threads when they yield a result, which is not a trivial problem. If more than one thread has a result to yield, the others will have to stall, which is inefficient. The threads could also buffer their results, but then we would need an additional consumer thread that continuously polls their buffers to see if there are patterns to report. In any case, it is probably a challenge not worth the pain.
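One common variant of the buffering idea replaces the per-thread buffers with a single thread-safe `queue.Queue`, so the consumer blocks on `get()` instead of polling. The sketch below only illustrates that mechanism; `worker` is a hypothetical stand-in for a mining thread and the "patterns" are made-up tuples.

```python
import queue
import threading

_DONE = object()  # sentinel a worker pushes when it has nothing more to report


def worker(root, out):
    """Hypothetical mining thread: pushes each pattern as soon as it is found."""
    for size in range(1, 3):
        out.put((root, size))  # stand-in for a discovered pattern
    out.put(_DONE)


def consume(roots):
    """Yield patterns from all workers as they arrive, without busy polling."""
    out = queue.Queue()
    threads = [threading.Thread(target=worker, args=(root, out)) for root in roots]
    for t in threads:
        t.start()
    remaining = len(threads)
    while remaining:
        message = out.get()  # blocks until some worker has produced something
        if message is _DONE:
            remaining -= 1
        else:
            yield message
    for t in threads:
        t.join()


patterns = sorted(consume(["a", "b"]))  # arrival order is not deterministic, so sort
```

The order in which patterns arrive depends on thread scheduling, which is one reason the online parallel version is harder to reason about than the merge-at-the-end design.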

remiadon commented on June 22, 2024

I was thinking more about using coroutines in this case (well, perhaps your coroutines are handled using threads), but I think async programming is well suited for this: you pull a batch of patterns from LCM and launch some other code execution. While doing this you don't block, but ask LCM to look for the next batch of data.

This makes LCM work as a background task while you are using the patterns for your purpose.

lgalarra commented on June 22, 2024

Hi Rémi,

What is a batch in your definition? If I understand correctly you propose to delegate the synchronization process to the user. Is it right? If so, I think it is a wise idea.

I see it this way: the user may define a set of listeners (where the number of listeners equals the number of threads) that execute some code whenever a thread pushes a pattern. Then it is the user's duty to craft their exploration strategy based on the threads. Does it make sense?

remiadon commented on June 22, 2024

> What is a batch in your definition?

To me, a batch is a subset of the pattern set you want to discover.

> If I understand correctly you propose to delegate the synchronization process to the user.

Exactly, my idea is:

- skmine exposes an async function that yields a set of patterns (a batch)
- the user can choose how to call this function

> I see it this way: The user may define a set of listeners

I think the most appealing kind of example we can mimic is a DB/HTTP request.
LCM can be seen as a general-purpose itemset miner. Users may want to subscribe to it, just like they would subscribe to some database.

So now let's put this into code.
This example is close to my idea:

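    import asyncio

    from skmine.itemsets import LCM

    async def main_function():  # user side
        lcm = LCM(min_supp=.2)  # min_supp of 20%
        # proposed interface: `async for` over LCM is the idea, not an existing feature
        async for batch in lcm:
            ...  # do whatever you want with patterns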

Hope this makes sense
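Since the `async for` interface is only a proposal in this thread, here is a self-contained sketch of what the user-facing side could look like. `AsyncMiner` is a hypothetical class, not part of scikit-mine, and the batches are precomputed lists rather than the output of a real mining step.

```python
import asyncio


class AsyncMiner:
    """Hypothetical miner that hands out batches of patterns asynchronously."""

    def __init__(self, batches):
        self._batches = batches

    async def __aiter__(self):
        # written as an async generator, so `async for` can iterate over the instance
        for batch in self._batches:
            await asyncio.sleep(0)  # give control back to the event loop; real mining would go here
            yield batch


async def main_function():  # user side
    seen = []
    async for batch in AsyncMiner([["a"], ["a", "b"]]):
        seen.append(batch)  # do whatever you want with the patterns
    return seen


result = asyncio.run(main_function())
```

Each `await` inside the miner is a point where other coroutines of the user can run, which is what lets the consumer keep working while the next batch is being prepared.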

lgalarra commented on June 22, 2024

> skmine exposes an async function, that yields a set of patterns (or batch)

Would this function run in a single thread?

I am asking this because my ambition would be to implement a parallel miner based on a set of asynchronous subscriptions that give me different patterns (e.g., they explore disjoint sub-spaces). Would that be possible? And even first, does it make sense?

Thanks Rémi!

remiadon commented on June 22, 2024

@lgalarra I think yes
Once in asynchronous mode, you are free to choose an Executor for the code.

The best resource I've found is here

But again, I don't think threads are the best option, because the current LCM implementation is mostly pure Python, and holds the GIL. Processes might be a better option
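As a rough illustration of combining asynchronous code with an Executor, the sketch below dispatches disjoint sub-spaces through `loop.run_in_executor` and collects results as they complete. `explore_subspace` is a hypothetical placeholder for a blocking exploration. A `ThreadPoolExecutor` is used here only to keep the demo portable; given the GIL remark above, a `ProcessPoolExecutor` could be passed instead, assuming the function is a picklable top-level function and the script is guarded by `if __name__ == "__main__":` where the platform requires it.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


def explore_subspace(root):
    """Hypothetical blocking exploration of the sub-space rooted at `root`."""
    return [f"{root}{i}" for i in range(2)]  # made-up pattern names


async def mine(roots, executor):
    loop = asyncio.get_running_loop()
    # one future per disjoint sub-space; the event loop itself is never blocked
    futures = [loop.run_in_executor(executor, explore_subspace, root) for root in roots]
    results = []
    for future in asyncio.as_completed(futures):
        results.extend(await future)  # handle each sub-space as soon as it finishes
    return results


with ThreadPoolExecutor(max_workers=2) as executor:
    found = sorted(asyncio.run(mine(["a", "b"], executor)))  # completion order varies
```

The choice of executor is left to the caller, which matches the idea of delegating the parallelization strategy to the user.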

If you have a mature example, it would be nice to include it in our documentation :)

lgalarra commented on June 22, 2024

I have planned to devote some time to this, so I will come back to you soon. Thanks!

remiadon commented on June 22, 2024

@lgalarra any update on this? Do you still need this feature?

Do we have a motivating example?

remiadon commented on June 22, 2024

@lgalarra I may close this, as I think your use case works fine with the current implementation.

I may rework the current implementation of LCM to get better performance.
