Overview (Intent Mining from past conversations for Conversational Agent)

ITER-DBSCAN is an iterative DBSCAN implementation for clustering unbalanced data. The algorithm was tested on short-text datasets (conversational intent mining from utterances) and achieved state-of-the-art results. The work was accepted at COLING 2020. All datasets and results are shared for future evaluation.

Paper link: https://www.aclweb.org/anthology/2020.coling-main.366/

Please note, we have only shared the base ITER-DBSCAN implementation. The parallelized implementation of ITER-DBSCAN is not shared.

All raw and processed datasets are shared for future research in the Data and ProcessedData folders.

The results of evaluating ITER-DBSCAN and parallelized ITER-DBSCAN on the datasets are shared in the NewResults and publishedResults folders.

Code (API Reference)

API Reference: ITER-DBSCAN implementation. ITER-DBSCAN iteratively adapts the DBSCAN parameters to cluster unbalanced (text) data. The two core DBSCAN parameters, distance and minimum samples, are changed gradually so that the algorithm finds high-density clusters first and lower-density clusters in later iterations. At each iteration the distance parameter is increased by 0.01 and the minimum samples parameter is decreased by 1. The algorithm uses cosine distance for cluster creation.
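The iteration described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the packaged implementation: the helper names (`cosine_distance`, `dbscan`, `iterative_dbscan`), the toy 2-D vectors, and the stopping guard on `minimum_samples` are all made up for this example, and a small hand-written DBSCAN stands in for whatever clustering backend the package uses.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine_distance(a: Vector, b: Vector) -> float:
    """Return 1 - cosine similarity (assumes neither vector is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms


def dbscan(points: List[Vector], eps: float, min_samples: int) -> List[int]:
    """Plain DBSCAN with cosine distance; noise points get the label -1."""
    n = len(points)
    # Each point counts as its own neighbour.
    neighbours = [
        [j for j in range(n) if cosine_distance(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [-1] * n
    cluster_id = 0
    for i in range(n):
        if labels[i] != -1 or len(neighbours[i]) < min_samples:
            continue  # already clustered, or not a core point
        labels[i] = cluster_id
        stack = list(neighbours[i])
        while stack:
            j = stack.pop()
            if labels[j] != -1:
                continue
            labels[j] = cluster_id
            if len(neighbours[j]) >= min_samples:  # expand only through core points
                stack.extend(neighbours[j])
        cluster_id += 1
    return labels


def iterative_dbscan(points, initial_distance=0.10, initial_minimum_samples=20,
                     delta_distance=0.01, delta_minimum_samples=1,
                     max_iteration=5, threshold=300) -> List[int]:
    """Re-run DBSCAN on the still-unclustered points with relaxed parameters."""
    labels = [-1] * len(points)
    distance, minimum_samples = initial_distance, initial_minimum_samples
    next_cluster_id = 0
    for _ in range(max_iteration):
        remaining = [i for i, label in enumerate(labels) if label == -1]
        if not remaining or minimum_samples < 2:  # stopping guard: an assumption
            break
        local = dbscan([points[i] for i in remaining], distance, minimum_samples)
        for cluster in sorted(set(local) - {-1}):
            members = [remaining[k] for k, label in enumerate(local) if label == cluster]
            if len(members) > threshold:
                continue  # oversized cluster is discarded; its points stay unlabelled
            for i in members:
                labels[i] = next_cluster_id
            next_cluster_id += 1
        distance += delta_distance                # accept looser neighbourhoods next round
        minimum_samples -= delta_minimum_samples  # accept smaller clusters next round
    return labels


# Toy 2-D vectors: a dense group of four, a smaller pair, and one outlier.
points = [[1.0, 0.0], [1.0, 0.01], [1.0, 0.02], [1.0, 0.03],
          [0.0, 1.0], [0.1, 1.0], [-1.0, -1.0]]
labels = iterative_dbscan(points, initial_distance=0.01, initial_minimum_samples=3,
                          max_iteration=2)
print(labels)  # [0, 0, 0, 0, 1, 1, -1]
```

In this toy run the group of four is found in the first pass. The pair is too small for minimum samples = 3 and is only picked up in the second pass, once the requirement has dropped to 2. The outlier stays at -1.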

ITER_DBSCAN(initial_distance, initial_minimum_samples, delta_distance, delta_minimum_samples, max_iteration, threshold, features)

Parameters:

initial_distance: initial distance for initial cluster creation (default: 0.10)
initial_minimum_samples: initial minimum sample count for initial cluster creation (default: 20)
delta_distance: change in the distance parameter at each iteration (default: 0.01)
delta_minimum_samples: change in the minimum samples parameter (of DBSCAN) at each iteration (default: 1)
max_iteration: maximum number of iterations the DBSCAN algorithm will run for cluster creation (default: 5)
threshold: controls the maximum cluster size; any cluster containing more points than the threshold is discarded (default: 300)
features: by default (None) the algorithm expects a list of short texts. If the representation of the texts or other data sources is computed in advance, pass features="precomputed" (default: None)

In our experiments, delta_distance and delta_minimum_samples were held constant at 0.01 and 1, respectively.
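To make the arithmetic concrete, the sketch below lists the (distance, minimum samples) pair that each iteration would use. It assumes the first iteration runs with the initial values unchanged and that the deltas are applied once per iteration. `parameter_schedule` is a hypothetical helper written for this illustration, not a function of the package. The example values match the sample call used later in this README (initial_distance=0.3, initial_minimum_samples=16, max_iteration=15).

```python
from typing import Iterator, Tuple


def parameter_schedule(initial_distance: float = 0.10,
                       initial_minimum_samples: int = 20,
                       delta_distance: float = 0.01,
                       delta_minimum_samples: int = 1,
                       max_iteration: int = 5) -> Iterator[Tuple[float, int]]:
    """Yield the (distance, minimum_samples) pair for each iteration."""
    for step in range(max_iteration):
        # Rounding hides floating-point noise such as 0.30000000000000004.
        distance = round(initial_distance + step * delta_distance, 10)
        minimum_samples = initial_minimum_samples - step * delta_minimum_samples
        yield distance, minimum_samples


schedule = list(parameter_schedule(0.3, 16, 0.01, 1, 15))
for iteration, (distance, minimum_samples) in enumerate(schedule, start=1):
    print(f"iteration {iteration:2d}: distance={distance:.2f}, minimum_samples={minimum_samples}")
```

With these values the schedule runs from (0.30, 16) in the first iteration to (0.44, 2) in the fifteenth, so max_iteration also bounds how small the minimum samples parameter can get.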

API Usage

Install the ITER-DBSCAN package from the PyPI repository:

pip install ShortTextClustering

Sample Code

Load Packages

import pandas as pd
from ShortTextClustering.ITER_DBSCAN import ITER_DBSCAN
from ShortTextClustering.evaluation import EvaluateDataset

Load Dataset

df = pd.read_excel("WebApplicationsCorpus.xlsx")

df.head(5)

	data	intent
0	Alternative to Facebook	Find Alternative
1	How do I delete my Facebook account?	Delete Account
2	Are there any good Pandora alternatives with g...	Find Alternative
3	Is it possible to export my data from Trello t...	Export Data
4	Is there an online alternative to iGoogle	Find Alternative

Distribution of intents

df.intent.value_counts()

Find Alternative    23
Filter Spam         20
Delete Account      17
Sync Accounts        9
Change Password      8
None                 6
Export Data          5
Name: intent, dtype: int64

Remove Intent type "None"

print('Before: ', len(df))
df = df.loc[df.intent != 'None']
print('After: ', len(df))
# Renumber the rows from 0 and drop the old index in one step
df = df.reset_index(drop=True)

Before:  88
After:  82

df.intent.value_counts()

Find Alternative    23
Filter Spam         20
Delete Account      17
Sync Accounts        9
Change Password      8
Export Data          5
Name: intent, dtype: int64

Generate cluster labels for short text dataset

dataset = df.data.values.tolist()

%%time
model = ITER_DBSCAN(initial_distance=0.3, initial_minimum_samples=16, delta_distance=0.01, delta_minimum_samples=1, max_iteration=15)

Wall time: 0 ns

%%time
labels = model.fit_predict(dataset)

Wall time: 48 ms

df['cluster_ids'] = labels

Cluster distribution

Noisy points are marked as -1

df.cluster_ids.value_counts()

-1    33
 0    13
 1    12
 3     5
 2     5
 6     4
 4     4
 7     3
 5     3
Name: cluster_ids, dtype: int64
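A quick way to read the distribution above is to compute how many utterances were clustered at all. The sketch below re-enters the printed counts by hand so that it runs on its own. With the DataFrame at hand, the same numbers could be taken from df.cluster_ids.value_counts() instead.

```python
from collections import Counter

# Cluster sizes copied from the value_counts() output above (-1 marks noise).
cluster_counts = Counter({-1: 33, 0: 13, 1: 12, 3: 5, 2: 5, 6: 4, 4: 4, 7: 3, 5: 3})

total = sum(cluster_counts.values())
noise = cluster_counts[-1]
labelled = total - noise
n_clusters = sum(1 for cluster_id in cluster_counts if cluster_id != -1)

print(f"{labelled} of {total} utterances clustered "
      f"({100 * labelled / total:.2f}%), {n_clusters} clusters, {noise} noise points")
```

For this run, 49 of the 82 utterances (59.76%) fall into one of 8 clusters and 33 remain noise.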

Clustered Data result

df.loc[df.cluster_ids == 0]

	data	intent
1	How do I delete my Facebook account?	Delete Account
9	How can I delete my 160by2 account?	Delete Account
10	How can I permanently delete my Yahoo mail acc...	Delete Account
12	How to delete my imgur account?	Delete Account
14	How to delete a Sify Mail account	Delete Account
15	How to permanently delete a 37signals ID	Delete Account
16	How can I delete my Hunch account?	Delete Account
75	How can I delete my Twitter account?	Delete Account
76	How do I delete my LinkedIn account?	Delete Account
77	How do I delete my Gmail account?	Delete Account
78	How do I delete my Experts Exchange account?	Delete Account
79	How do I delete my Ohloh profile?	Delete Account
80	How can I permanently delete my MySpace account?	Delete Account

Evaluate ITER-DBSCAN performance on a dataset with different parameters

evaluate_dataset = EvaluateDataset(filename='WebApplicationsCorpus.xlsx', filetype='xlsx', text_column='data', 
                                   target_column='intent')

parameters = [
             {
               "distance":0.3, 
               "minimum_samples":16, 
               "delta_distance":0.01, 
               "delta_minimum_samples":1, 
               "max_iteration":15
             },
             {
               "distance":0.25, 
               "minimum_samples":14, 
               "delta_distance":0.01, 
               "delta_minimum_samples":1, 
               "max_iteration":12
             }, 
             {
               "distance":0.28, 
               "minimum_samples":12, 
               "delta_distance":0.01, 
               "delta_minimum_samples":1, 
               "max_iteration":12
             }
             ]
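Writing each parameter dictionary by hand becomes tedious when more combinations are compared. As a sketch, the list can be generated with itertools.product. `build_parameter_grid` is a hypothetical helper, not part of the package. It only produces dictionaries with the same keys as the hand-written list above.

```python
from itertools import product
from typing import Dict, Iterable, List, Union

Number = Union[int, float]


def build_parameter_grid(distances: Iterable[float],
                         minimum_samples: Iterable[int],
                         max_iterations: Iterable[int],
                         delta_distance: float = 0.01,
                         delta_minimum_samples: int = 1) -> List[Dict[str, Number]]:
    """Return one parameter dictionary per combination of the given values."""
    return [
        {
            "distance": distance,
            "minimum_samples": samples,
            "delta_distance": delta_distance,
            "delta_minimum_samples": delta_minimum_samples,
            "max_iteration": iterations,
        }
        for distance, samples, iterations in product(distances, minimum_samples, max_iterations)
    ]


# Two distances x two minimum-sample values x one iteration limit = 4 settings.
grid = build_parameter_grid([0.25, 0.3], [12, 16], [12])
for setting in grid:
    print(setting)
```

The resulting list has the same shape as the hand-written parameters list, so it could be passed to the evaluation call in the same way.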

Generate different metrics of parameter evaluation with ITER-DBSCAN

%%time
results = evaluate_dataset.evaulate_iter_dbscan(parameters)

100%|██████████| 3/3 [00:00<00:00, 14.10it/s]

Wall time: 229 ms

result_df = pd.DataFrame.from_dict(results)

result_df

	distance	minimum_samples	delta_distance	delta_minimum_samples	max_iteration	time	percentage_labelled	clusters	homogeneity_score	completeness_score	normalized_mutual_info_score	adjusted_mutual_info_score	adjusted_rand_score	accuracy	precision	recall	f1	intents
0	0.30	16	0.01	1	15	0.06	56.82	8	0.76	0.88	0.81	0.79	0.81	0.852273	75.0	85.2	79.7	5
1	0.25	14	0.01	1	12	0.03	42.05	6	0.70	0.82	0.76	0.73	0.74	0.818182	72.4	81.8	76.6	5
2	0.28	12	0.01	1	12	0.04	46.59	7	0.73	0.85	0.79	0.77	0.78	0.840909	74.1	84.1	78.7	5
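Once the metrics are available, one simple way to choose a setting is to rank the rows by a single score. The sketch below uses a few columns copied by hand from the table above, stored as plain dictionaries. Ranking by f1 with percentage_labelled as a tie-breaker is only one possible criterion, chosen here for illustration, and the exact structure of the results object returned by the package is not assumed.

```python
# A subset of the columns from the result table above, re-entered by hand.
rows = [
    {"distance": 0.30, "minimum_samples": 16, "max_iteration": 15,
     "percentage_labelled": 56.82, "clusters": 8, "f1": 79.7},
    {"distance": 0.25, "minimum_samples": 14, "max_iteration": 12,
     "percentage_labelled": 42.05, "clusters": 6, "f1": 76.6},
    {"distance": 0.28, "minimum_samples": 12, "max_iteration": 12,
     "percentage_labelled": 46.59, "clusters": 7, "f1": 78.7},
]

# Highest f1 wins; percentage_labelled breaks ties.
ranked = sorted(rows, key=lambda row: (row["f1"], row["percentage_labelled"]), reverse=True)
best = ranked[0]

print(f"best setting: distance={best['distance']}, "
      f"minimum_samples={best['minimum_samples']}, "
      f"max_iteration={best['max_iteration']} (f1={best['f1']})")
```

On these three runs the first setting (distance 0.30, minimum samples 16, 15 iterations) ranks highest on f1 and also labels the largest share of utterances.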

Citation

If you are using this code in your work, please cite this paper:

@inproceedings{chatterjee-sengupta-2020-intent,
    title = "Intent Mining from past conversations for Conversational Agent",
    author = "Chatterjee, Ajay and Sengupta, Shubhashis",
    booktitle = "Proceedings of the 28th International Conference on Computational Linguistics",
    month = dec,
    year = "2020",
    address = "Barcelona, Spain (Online)",
    publisher = "International Committee on Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.coling-main.366",
    pages = "4140--4152",
    abstract = "Conversational systems are of primary interest in the AI community. Organizations are increasingly using chatbot to provide round-the-clock support and to increase customer engagement. Many commercial bot building frameworks follow a standard approach that requires one to build and train an intent model to recognize user input. These frameworks require a collection of user utterances and corresponding intent to train an intent model. Collecting a substantial coverage of training data is a bottleneck in the bot building process. In cases where past conversation data is available, the cost of labeling hundreds of utterances with intent labels is time-consuming and laborious. In this paper, we present an intent discovery framework that can mine a vast amount of conversational logs and to generate labeled data sets for training intent models. We have introduced an extension to the DBSCAN algorithm and presented a density-based clustering algorithm ITER-DBSCAN for unbalanced data clustering. Empirical evaluation on one conversation dataset, six different intent dataset, and one short text clustering dataset show the effectiveness of our hypothesis.",
}
