Comments (14)
Hi Rémi,
Any update on this? Do you still need this feature?
I have no news so far, as I have not had time to work on it, but the application I mentioned to you via email still holds. I am going to re-implement HIPAR using scikit-mine. HIPAR is a method to mine rules of the form
pattern => y = some linear function
for example:
wine-variety = "Grenache" => wine-price = -320 + 4 * score-by-critics + 6 * age
The left-hand side of the rules can be mined in an LCM fashion. However, our search space can be large, and I would like to exploit parallel techniques to speed up computation. There is another particularity: HIPAR can discretize numerical variables. For example, before descending deeper in the search tree, HIPAR may discretize the variable score-by-critics into two bins, high-score (>= 4) and low-score (< 4), and those conditions become new itemsets for the exploration rooted at the pattern wine-variety = "Grenache". That is why we need as much flexibility as possible.
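To make the rule format concrete, here is a minimal sketch of how such a rule could be represented and applied to one record. The names `LinearRule` and `discretize_score` are hypothetical and only illustrate the description above; they are not HIPAR's or scikit-mine's actual API.

```python
from dataclasses import dataclass
from typing import Dict, Mapping, Optional


@dataclass
class LinearRule:
    """Hypothetical container for a rule `pattern => y = linear function`."""

    pattern: Dict[str, object]      # left-hand side: attribute -> required value
    intercept: float
    coefficients: Dict[str, float]  # numerical attribute -> weight

    def matches(self, record: Mapping[str, object]) -> bool:
        # The rule applies only if every condition of the pattern holds.
        return all(record.get(attr) == value for attr, value in self.pattern.items())

    def predict(self, record: Mapping[str, object]) -> Optional[float]:
        if not self.matches(record):
            return None
        return self.intercept + sum(
            weight * record[attr] for attr, weight in self.coefficients.items()
        )


def discretize_score(score: float, threshold: float = 4) -> str:
    """Two-bin split of score-by-critics, mirroring the example in the text."""
    return "high-score" if score >= threshold else "low-score"


rule = LinearRule(
    pattern={"wine-variety": "Grenache"},
    intercept=-320,
    coefficients={"score-by-critics": 4, "age": 6},
)
wine = {"wine-variety": "Grenache", "score-by-critics": 5, "age": 60}
price = rule.predict(wine)  # -320 + 4 * 5 + 6 * 60
```

The record and its values are made up for illustration; a record that does not match the pattern yields `None`.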
Given my current constraints, I will not have time to work on this before the beginning of October. Suggestions are of course welcome.
Best,
Luis
from scikit-mine.
Go for it! If I have a particular problem in this regard, I will then open a new issue (or perhaps reopen this one). So far you have provided me with good hints.
Here is the code for this
    def generate(self):
        """Online discovery of itemsets.

        Note that this is purely sequential, hence no parallelisation is done.
        """
        if self.item_to_tids is None:  # .fit has not been called
            return
        # keep only the items whose support reaches the minimum support
        supp_sorted_items = filter(
            lambda e: len(e[1]) >= self._min_supp, self.item_to_tids.items()
        )
        for item, tids in supp_sorted_items:
            yield self._explore_item(item, tids)

_explore_item returns a pd.DataFrame.
Wow, that is great news! Thanks Rémi! I see the limitation with multi-threading. I am not familiar with your implementation, but I wonder if you could use the same synchronization mechanisms for the online version. How is the output data structure synchronized in the original implementation?
@lgalarra thanks a lot.
If I understand your question, synchronization is done at the very end: we consider every original item as a root node and explore it. This process results in a list of DataFrames. Finally, we concatenate these DataFrames to obtain one single big DataFrame.
Am I clear enough?
If you have additional use cases in mind where LCM could be used, they would probably give us the details we need to provide a generic implementation.
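The "explore each root, merge at the very end" scheme described above can be sketched with the standard library alone. This is only an illustration under stated assumptions: the exploration is a plain Eclat-style tid-list intersection rather than LCM's closed-itemset enumeration, plain lists stand in for the DataFrames, and all names (`explore_root`, `mine`, `TRANSACTIONS`) are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import chain

TRANSACTIONS = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"b", "c"},
]
MIN_SUPP = 2  # absolute support threshold


def build_item_to_tids(transactions):
    """Map each item to the set of transaction ids that contain it."""
    item_to_tids = {}
    for tid, transaction in enumerate(transactions):
        for item in transaction:
            item_to_tids.setdefault(item, set()).add(tid)
    return item_to_tids


def explore_root(root, item_to_tids, min_supp):
    """Return every frequent itemset whose smallest item is `root`."""
    results = []

    def descend(itemset, tids, candidates):
        results.append((itemset, len(tids)))
        for i, item in enumerate(candidates):
            new_tids = tids & item_to_tids[item]
            if len(new_tids) >= min_supp:
                descend(itemset + (item,), new_tids, candidates[i + 1:])

    larger_items = sorted(i for i in item_to_tids if i > root)
    descend((root,), item_to_tids[root], larger_items)
    return results


def mine(transactions, min_supp, max_workers=2):
    item_to_tids = build_item_to_tids(transactions)
    roots = sorted(i for i, tids in item_to_tids.items() if len(tids) >= min_supp)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # One independent task per root; results come back in root order.
        per_root = list(
            pool.map(lambda r: explore_root(r, item_to_tids, min_supp), roots)
        )
    # The only synchronization point: merge the per-root results at the end.
    return list(chain.from_iterable(per_root))


patterns = mine(TRANSACTIONS, MIN_SUPP)
```

Because each root writes to its own result list, no lock is needed during exploration; the merge is the single place where the partial results meet.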
Yes, it is very clear, thanks. Indeed, parallelizing the online version will require synchronizing the threads when they yield a result. It is not a trivial problem. If more than one thread has a result to yield, the others will have to stall, which is inefficient. The threads could also buffer their results, but then we would need an additional consumer thread that continuously polls their buffers to see if there are patterns to report. In any case, it is probably a challenge not worth the pain.
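For reference, one common way to approximate the buffering option is a single shared, bounded queue that worker threads push into while a generator drains it. The sketch below is only an assumption about how this could look: `worker` is a hypothetical stand-in for exploring one root, and the emitted tuples are placeholders rather than real patterns.

```python
import queue
import threading

_DONE = object()  # sentinel pushed by each worker when it has finished


def worker(root, out):
    """Hypothetical stand-in for exploring one root of the search tree."""
    for size in range(1, 4):
        out.put((root, size))  # blocks while the queue is full
    out.put(_DONE)


def generate(roots, maxsize=8):
    """Yield results as soon as any worker thread produces one."""
    out = queue.Queue(maxsize=maxsize)
    threads = [
        threading.Thread(target=worker, args=(root, out), daemon=True)
        for root in roots
    ]
    for thread in threads:
        thread.start()
    remaining = len(threads)
    while remaining:
        item = out.get()  # blocking get: no busy polling of buffers
        if item is _DONE:
            remaining -= 1
        else:
            yield item
    for thread in threads:
        thread.join()


results = sorted(generate(["a", "b", "c"]))
```

The blocking `get` removes the need for a polling consumer, but workers still stall whenever the bounded queue is full, so the inefficiency mentioned above is reduced rather than eliminated.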
I was thinking more about using coroutines in this case (well, perhaps your coroutines are handled using threads), but I think async programming is well suited for this: you pull a batch of patterns from LCM and launch some other code execution. While doing this you don't block, but ask LCM to look for the next batch of data.
This makes LCM work as a background task while you are using patterns for your purpose.
Hi Rémi,
What is a batch in your definition? If I understand correctly, you propose to delegate the synchronization process to the user. Is that right? If so, I think it is a wise idea.
I see it this way: the user may define a set of listeners (where the number of listeners equals the number of threads) that execute some code whenever a thread pushes a pattern. Then it is the user's duty to craft their exploration strategy based on the threads. Does that make sense?
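A minimal sketch of the listener idea, assuming one listener per worker thread. Everything here is hypothetical (`explore`, `CollectingListener`, `mine_with_listeners`), and the emitted patterns are placeholders, not the output of a real miner.

```python
import threading


def explore(root, listener):
    """Hypothetical worker: pushes each pattern it finds to its own listener."""
    for k in (1, 2):
        listener(root, (root, f"x{k}"))  # placeholder pattern


class CollectingListener:
    """User-defined callback; here it simply stores the patterns it receives."""

    def __init__(self):
        self.patterns = []

    def __call__(self, root, pattern):
        self.patterns.append(pattern)


def mine_with_listeners(roots, make_listener):
    # One listener per thread, so a listener is never called concurrently.
    listeners = {root: make_listener() for root in roots}
    threads = [
        threading.Thread(target=explore, args=(root, listeners[root]))
        for root in roots
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return listeners


listeners = mine_with_listeners(["a", "b"], CollectingListener)
```

Since each listener belongs to exactly one thread, the library needs no lock; any coordination between listeners is left to the user, as suggested above.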
What is a batch in your definition?
To me a batch is a subset of the patternset you want to discover
If I understand correctly you propose to delegate the synchronization process to the user.
Exactly. My idea is:
- skmine exposes an async function that yields a set of patterns (a batch)
- the user can choose how to call this function
I see it this way: The user may define a set of listeners
I think the most appealing example we can mimic is a DB/HTTP request.
LCM can be seen as a general-purpose itemset miner. Users may want to subscribe
to it, just like they would subscribe to a database.
So now let's put this into code.
This example is close to my idea
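    import asyncio

    from skmine.itemsets import LCM


    async def main_function():  # user side
        lcm = LCM(min_supp=.2)  # min_supp of 20%
        # proposed API, not implemented yet: LCM as an async iterable of batches
        async for batch in lcm:
            ...  # do whatever you want with patterns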
Hope this makes sense
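Since the async interface above is only a proposal, here is a self-contained sketch of the same subscription pattern using a toy class. `ToyMiner` is hypothetical and does no real mining; it only shows how an object can expose batches through `async for`.

```python
import asyncio


class ToyMiner:
    """Hypothetical async-iterable miner; a stand-in, not skmine's LCM."""

    def __init__(self, patterns, batch_size=2):
        self.patterns = list(patterns)
        self.batch_size = batch_size

    async def __aiter__(self):
        # Written as an async generator: each iteration yields one batch.
        for start in range(0, len(self.patterns), self.batch_size):
            await asyncio.sleep(0)  # hand control back to the event loop
            yield self.patterns[start:start + self.batch_size]


async def main_function():  # user side
    seen = []
    async for batch in ToyMiner(["a", "ab", "abc", "b", "bc"]):
        seen.append(batch)  # do whatever you want with the patterns
    return seen


batches = asyncio.run(main_function())
```

In this toy version the "mining" and the consumer share one thread, so it illustrates the calling convention only, not any speed-up.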
skmine exposes an async function, that yields a set of patterns (or batch)
Would this function run in a single thread?
I am asking because my ambition would be to implement a parallel miner based on a set of asynchronous subscriptions that give me different patterns (e.g., they explore disjoint sub-spaces). Would that be possible? And, first of all, does it make sense?
Thanks Rémi!
@lgalarra I think so.
Once in asynchronous mode, you are free to choose an Executor for the code.
The best resource I've found is here
But again, I don't think threads are the best option, because the current LCM implementation is mostly pure Python and holds the GIL. Processes might be a better option.
If you have a mature example, it would be nice to include it in our documentation :)
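As a sketch of the executor choice mentioned above: `loop.run_in_executor` lets async code hand each disjoint sub-space to a pool and collect results as they complete. `explore_subspace` is a hypothetical placeholder, not a scikit-mine function. The sketch uses `ThreadPoolExecutor` so it runs anywhere; for GIL-bound pure-Python work one would likely pass a `ProcessPoolExecutor` instead, in which case the task function and its arguments must be picklable.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


def explore_subspace(root):
    """Hypothetical placeholder for mining one disjoint sub-space."""
    return [f"{root}{suffix}" for suffix in ("", "x", "xy")]


async def mine_all(roots, executor):
    loop = asyncio.get_running_loop()
    futures = [
        loop.run_in_executor(executor, explore_subspace, root) for root in roots
    ]
    found = []
    # Consume each sub-space as soon as it is done, in completion order.
    for future in asyncio.as_completed(futures):
        found.extend(await future)
    return found


def main():
    # Swap in concurrent.futures.ProcessPoolExecutor to sidestep the GIL.
    with ThreadPoolExecutor(max_workers=2) as executor:
        return asyncio.run(mine_all(["a", "b"], executor))


patterns = sorted(main())
```

Results arrive in completion order, which is why the example sorts them before use.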
I have planned to devote some time to this, so I will come back to you soon. Thanks!
@lgalarra Any update on this? Do you still need this feature?
Do we have a motivating example?
@lgalarra I may close this, as I think your use case is going OK with the current implementation.
I may rework the current implementation of LCM to get better performance.