intentmining's Introduction

Overview (Intent Mining from past conversations for Conversational Agent)

An ITER-DBSCAN implementation for clustering unbalanced data. The algorithm was tested on short-text datasets (conversational intent mining from utterances) and achieved state-of-the-art results. The work was accepted at COLING 2020. All datasets and results are shared for future evaluation.

Paper link: https://www.aclweb.org/anthology/2020.coling-main.366/

Please note, we have only shared the base ITER-DBSCAN implementation. The parallelized implementation of ITER-DBSCAN is not shared.

All raw and processed datasets are shared for future research in the Data and ProcessedData folders.

The results of evaluating ITER-DBSCAN and parallelized ITER-DBSCAN on the datasets are shared in the NewResults and publishedResults folders.

Code (API Reference)

API Reference: ITER-DBSCAN implementation. It iteratively adapts the DBSCAN parameters for clustering unbalanced data (text). The two core DBSCAN parameters, distance and minimum samples, are changed gradually so that clusters are found from high density to low density. At each iteration the distance parameter is increased by 0.01 and the minimum samples parameter is decreased by 1. The algorithm uses cosine distance for cluster creation.
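
As a rough illustration of the update rule described above, the sketch below lists the distance and minimum samples values that each iteration would use. The helper name `parameter_schedule` is hypothetical and not part of the package. It also assumes that the initial values are used unchanged in the first iteration, which the description does not state explicitly.

```python
def parameter_schedule(initial_distance=0.10, initial_minimum_samples=20,
                       delta_distance=0.01, delta_minimum_samples=1,
                       max_iteration=5):
    """Return the (distance, minimum_samples) pair for each iteration.

    Assumption: iteration 0 uses the initial values as given; every later
    iteration widens the distance and lowers the minimum sample count.
    """
    schedule = []
    for step in range(max_iteration):
        # Round only to hide floating-point noise in the printed values.
        distance = round(initial_distance + step * delta_distance, 2)
        minimum_samples = initial_minimum_samples - step * delta_minimum_samples
        schedule.append((distance, minimum_samples))
    return schedule


print(parameter_schedule())
# [(0.1, 20), (0.11, 19), (0.12, 18), (0.13, 17), (0.14, 16)]
```

Early iterations demand many close neighbours and so pick out only dense groups; later iterations relax both requirements, which is what allows smaller, sparser groups to form clusters.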

ITER-DBSCAN(initial_distance, initial_minimum_samples, delta_distance, delta_minimum_samples, max_iteration, threshold, features)

Parameters:

  • initial_distance: initial distance for initial cluster creation (default: 0.10)
  • initial_minimum_samples: initial minimum sample count for initial cluster creation (default: 20)
  • delta_distance: change in the distance parameter at each iteration (default: 0.01)
  • delta_minimum_samples: change in the minimum sample parameter (of DBSCAN) at each iteration (default: 1)
  • max_iteration: maximum number of iterations the DBSCAN algorithm will run for cluster creation (default: 5)
  • threshold: controls the maximum cluster size; any cluster containing more than threshold points is discarded (default: 300)
  • features: by default (None) the algorithm expects a list of short texts. If the representation of the text or data source is pre-computed, pass features="precomputed" (default: None)

In our experiments, delta_distance and delta_minimum_samples were kept constant at 0.01 and 1 respectively.
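
The parameters above can be read as driving a loop like the one below. This is a minimal pure-Python sketch under stated assumptions, not the package's code. It assumes that points clustered in one iteration are removed before the next, that clusters larger than `threshold` are left unlabelled, and that a point counts as its own neighbour. The names `cosine_distance`, `dbscan`, `iter_dbscan_sketch`, and `unit` are illustrative only.

```python
import math


def cosine_distance(a, b):
    """Cosine distance between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def dbscan(points, eps, min_samples):
    """Textbook DBSCAN; returns one label per point, -1 for noise."""
    n = len(points)
    # Assumption: a point counts as its own neighbour.
    neighbours = [
        [j for j in range(n) if cosine_distance(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = -1  # provisional noise, may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            elif labels[j] is None:
                labels[j] = cluster
                if len(neighbours[j]) >= min_samples:
                    queue.extend(neighbours[j])  # j is a core point: expand
    return labels


def iter_dbscan_sketch(points, initial_distance=0.10, initial_minimum_samples=20,
                       delta_distance=0.01, delta_minimum_samples=1,
                       max_iteration=5, threshold=300):
    """Re-run DBSCAN on the still-unlabelled points with relaxed parameters."""
    final = [-1] * len(points)
    remaining = list(range(len(points)))
    eps, min_samples, next_id = initial_distance, initial_minimum_samples, 0
    for _ in range(max_iteration):
        if not remaining or min_samples < 1:
            break
        local = dbscan([points[i] for i in remaining], eps, min_samples)
        groups = {}
        for index, label in zip(remaining, local):
            if label != -1:
                groups.setdefault(label, []).append(index)
        for members in groups.values():
            if len(members) > threshold:
                continue  # assumption: oversized clusters stay unlabelled
            for index in members:
                final[index] = next_id
            next_id += 1
        remaining = [i for i in remaining if final[i] == -1]
        eps += delta_distance                 # look further for neighbours
        min_samples -= delta_minimum_samples  # accept smaller clusters
    return final


def unit(angle_degrees):
    """Unit vector at the given angle, used as toy data."""
    r = math.radians(angle_degrees)
    return (math.cos(r), math.sin(r))


# A dense group of six, a sparse group of three, and one outlier.
points = [unit(a) for a in (0, 1, 2, 3, 4, 5, 90, 91, 92, 45)]
labels = iter_dbscan_sketch(points, initial_distance=0.01,
                            initial_minimum_samples=5, max_iteration=3)
print(labels)  # [0, 0, 0, 0, 0, 0, 1, 1, 1, -1]
```

In the toy run, the dense group is found in the first iteration, while the group of three only becomes a cluster once the minimum sample count has dropped to 3 in the third iteration. A single DBSCAN pass with the initial parameters would have marked it as noise.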

API Usage

Install the ITER-DBSCAN package from PyPI:

pip install ShortTextClustering

Sample Code

Load Packages

import pandas as pd
from ShortTextClustering.ITER_DBSCAN import ITER_DBSCAN
from ShortTextClustering.evaluation import EvaluateDataset

Load Dataset

df = pd.read_excel("WebApplicationsCorpus.xlsx")
df.head(5)
   data                                               intent
0  Alternative to Facebook                            Find Alternative
1  How do I delete my Facebook account?               Delete Account
2  Are there any good Pandora alternatives with g...  Find Alternative
3  Is it possible to export my data from Trello t...  Export Data
4  Is there an online alternative to iGoogle          Find Alternative

Distribution of intents

df.intent.value_counts()
Find Alternative    23
Filter Spam         20
Delete Account      17
Sync Accounts        9
Change Password      8
None                 6
Export Data          5
Name: intent, dtype: int64

Remove Intent type "None"

print('Before: ', len(df))
df = df.loc[df.intent != 'None']
print('After: ', len(df))
df = df.reset_index(drop=True)  # renumber rows without keeping the old index
Before:  88
After:  82
df.intent.value_counts()
Find Alternative    23
Filter Spam         20
Delete Account      17
Sync Accounts        9
Change Password      8
Export Data          5
Name: intent, dtype: int64

Generate cluster labels for short text dataset

dataset = df.data.values.tolist()
%%time
model = ITER_DBSCAN(initial_distance=0.3, initial_minimum_samples=16, delta_distance=0.01, delta_minimum_samples=1, max_iteration=15)
Wall time: 0 ns
%%time
labels = model.fit_predict(dataset)
Wall time: 48 ms
df['cluster_ids'] = labels

Cluster distribution

Noisy points are marked as -1

df.cluster_ids.value_counts()
-1    33
 0    13
 1    12
 3     5
 2     5
 6     4
 4     4
 7     3
 5     3
Name: cluster_ids, dtype: int64
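
To put the noise count in context, the short sketch below computes the share of utterances that received a cluster id, using the cluster sizes printed above. The helper `labelled_share` is hypothetical; it is not claimed to be how the package computes its own `percentage_labelled` figure.

```python
from collections import Counter

# Cluster sizes copied from the value_counts() output above (-1 is noise).
cluster_sizes = {-1: 33, 0: 13, 1: 12, 3: 5, 2: 5, 6: 4, 4: 4, 7: 3, 5: 3}


def labelled_share(labels):
    """Percentage of points whose label is not the noise label -1."""
    counts = Counter(labels)
    total = sum(counts.values())
    return 100.0 * (total - counts[-1]) / total


# Expand the size table back into one label per utterance.
labels = [cid for cid, size in cluster_sizes.items() for _ in range(size)]
print(f"{labelled_share(labels):.2f}% of {len(labels)} utterances were clustered")
# 59.76% of 82 utterances were clustered
```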

Clustered Data result

df.loc[df.cluster_ids == 0]
    data                                               intent          cluster_ids
 1  How do I delete my Facebook account?               Delete Account  0
 9  How can I delete my 160by2 account?                Delete Account  0
10  How can I permanently delete my Yahoo mail acc...  Delete Account  0
12  How to delete my imgur account?                    Delete Account  0
14  How to delete a Sify Mail account                  Delete Account  0
15  How to permanently delete a 37signals ID           Delete Account  0
16  How can I delete my Hunch account?                 Delete Account  0
75  How can I delete my Twitter account?               Delete Account  0
76  How do I delete my LinkedIn account?               Delete Account  0
77  How do I delete my Gmail account?                  Delete Account  0
78  How do I delete my Experts Exchange account?       Delete Account  0
79  How do I delete my Ohloh profile?                  Delete Account  0
80  How can I permanently delete my MySpace account?   Delete Account  0

Evaluate ITER-DBSCAN performance on a dataset with different parameters

evaluate_dataset = EvaluateDataset(filename='WebApplicationsCorpus.xlsx', filetype='xlsx', text_column='data', 
                                   target_column='intent')
parameters = [
             {
               "distance":0.3, 
               "minimum_samples":16, 
               "delta_distance":0.01, 
               "delta_minimum_samples":1, 
               "max_iteration":15
             },
             {
               "distance":0.25, 
               "minimum_samples":14, 
               "delta_distance":0.01, 
               "delta_minimum_samples":1, 
               "max_iteration":12
             }, 
             {
               "distance":0.28, 
               "minimum_samples":12, 
               "delta_distance":0.01, 
               "delta_minimum_samples":1, 
               "max_iteration":12
             }
             ]
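
When more combinations are needed, a list in the same shape can be generated instead of written by hand. The sketch below is one way to do that; the helper `build_parameter_grid` is hypothetical and only reuses the dictionary keys shown above.

```python
from itertools import product


def build_parameter_grid(distances, minimum_samples, max_iterations,
                         delta_distance=0.01, delta_minimum_samples=1):
    """Build one parameter dictionary per combination of the given values."""
    return [
        {
            "distance": distance,
            "minimum_samples": samples,
            "delta_distance": delta_distance,
            "delta_minimum_samples": delta_minimum_samples,
            "max_iteration": iterations,
        }
        for distance, samples, iterations
        in product(distances, minimum_samples, max_iterations)
    ]


grid = build_parameter_grid([0.25, 0.30], [12, 16], [12, 15])
print(len(grid))  # 8
print(grid[0])
```

The resulting list has the same structure as the hand-written `parameters` list, so it could be passed to the evaluation call in the same way.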

Generate different metrics of parameter evaluation with ITER-DBSCAN

%%time
results = evaluate_dataset.evaulate_iter_dbscan(parameters)
100%|██████████| 3/3 [00:00<00:00, 14.10it/s]

Wall time: 229 ms
result_df = pd.DataFrame.from_dict(results)
result_df
   distance  minimum_samples  delta_distance  delta_minimum_samples  max_iteration  time  percentage_labelled
0      0.30               16            0.01                      1             15  0.06                56.82
1      0.25               14            0.01                      1             12  0.03                42.05
2      0.28               12            0.01                      1             12  0.04                46.59

   clusters  noisy_clusters  homogeneity_score  completeness_score  normalized_mutual_info_score  adjusted_mutual_info_score
0         8               0               0.76                0.88                          0.81                        0.79
1         6               0               0.70                0.82                          0.76                        0.73
2         7               0               0.73                0.85                          0.79                        0.77

   adjusted_rand_score  accuracy  precision  recall    f1  intents
0                 0.81  0.852273       75.0    85.2  79.7        5
1                 0.74  0.818182       72.4    81.8  76.6        5
2                 0.78  0.840909       74.1    84.1  78.7        5
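
Columns such as homogeneity_score and completeness_score share their names with standard clustering metrics. Assuming they follow the usual entropy-based definitions (an assumption; how the package computes them, and how it treats noise points, is not shown here), the sketch below works through both on a tiny example. All function names are illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy H(labels) in nats."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def conditional_entropy(targets, given):
    """H(targets | given) in nats."""
    n = len(targets)
    joint = Counter(zip(given, targets))
    given_counts = Counter(given)
    return -sum((c / n) * math.log(c / given_counts[g])
                for (g, _), c in joint.items())


def homogeneity(true_labels, cluster_ids):
    """1 when every cluster contains members of a single class."""
    h = entropy(true_labels)
    return 1.0 if h == 0 else 1.0 - conditional_entropy(true_labels, cluster_ids) / h


def completeness(true_labels, cluster_ids):
    """1 when every class is contained in a single cluster."""
    return homogeneity(cluster_ids, true_labels)


# Two intents; the second one is split across two clusters.
intents = ["delete", "delete", "export", "export"]
clusters = [0, 0, 1, 2]
print(homogeneity(intents, clusters))   # each cluster is pure
print(completeness(intents, clusters))  # lower, because "export" is split
```

A clustering that splits one intent into several pure clusters keeps homogeneity high while completeness drops. The evaluation table above shows the opposite pattern, with completeness above homogeneity in all three rows.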

Citation

If you are using this code in your work, please cite this paper:

@inproceedings{chatterjee-sengupta-2020-intent,
    title = "Intent Mining from past conversations for Conversational Agent",
    author = "Chatterjee, Ajay and Sengupta, Shubhashis",
    booktitle = "Proceedings of the 28th International Conference on Computational Linguistics",
    month = dec,
    year = "2020",
    address = "Barcelona, Spain (Online)",
    publisher = "International Committee on Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.coling-main.366",
    pages = "4140--4152",
    abstract = "Conversational systems are of primary interest in the AI community. Organizations are increasingly using chatbot to provide round-the-clock support and to increase customer engagement. Many commercial bot building frameworks follow a standard approach that requires one to build and train an intent model to recognize user input. These frameworks require a collection of user utterances and corresponding intent to train an intent model. Collecting a substantial coverage of training data is a bottleneck in the bot building process. In cases where past conversation data is available, the cost of labeling hundreds of utterances with intent labels is time-consuming and laborious. In this paper, we present an intent discovery framework that can mine a vast amount of conversational logs and to generate labeled data sets for training intent models. We have introduced an extension to the DBSCAN algorithm and presented a density-based clustering algorithm ITER-DBSCAN for unbalanced data clustering. Empirical evaluation on one conversation dataset, six different intent dataset, and one short text clustering dataset show the effectiveness of our hypothesis.",
}
