ggozad / collective.classification

Content classification/clustering through language processing

Introduction
============

*collective.classification* aims to provide a set of tools for automatic
document classification. Currently it makes use of the
`Natural Language Toolkit`_ and features a trainable document classifier based
on Part Of Speech (POS) tagging, heavily influenced by `topia.termextract`_.
This product is mostly intended for experimentation and development.
English and Dutch are currently supported.

  .. _`Natural Language Toolkit`: http://www.nltk.org
  .. _`topia.termextract`: http://pypi.python.org/pypi/topia.termextract/

What is this all about?
=======================

It's mostly about having fun! The package is at a very early, experimental
stage and eagerly awaits contributions. The tests give a good picture of what
works and what does not. You might also be able to do some useful things
with it:

    1) Term extraction can be performed to provide quick insight into what a
       document is about.
    2) On a large site with a lot of content and tags (or subjects in
       Plone lingo) it might be difficult to assign tags to new content. In
       this case, a trained classifier could provide useful suggestions to an
       editor responsible for tagging content.
    3) Similar documents can be found based on term similarity.
    4) Clustering can help you organize unclassified content into groups.

How does it work?
=================

At the moment the following types of utilities exist:

  * *POS taggers*, utilities for classifying words in a document
    as `Parts Of Speech`_. Two are provided at the moment, a Penn TreeBank
    tagger and a trigram tagger. Both can be trained for a language other
    than English, which is what we do here.
  * *Term extractors*, utilities responsible for extracting the important
    terms from a document. The extractor we use here assumes that only
    nouns matter in a document and uses a POS tagger to find the ones used
    most often. For details please look at the code and the tests.
  * *Content classifiers*, utilities that can tag content in predefined
    categories. Here, a `naive Bayes`_ classifier is used. Basically, the
    classifier looks at already tagged content, performs term extraction and
    trains itself using the terms and tags as an input. Then, for new content,
    the classifier will provide suggestions for tags according to the
    extracted terms of the content.
  * Utilities that find *similar content* based on the extracted terms.
  * *Clusterers*, utilities that, without prior knowledge of how content is
    classified, can partition content into groups according to feature
    similarity. At the moment NLTK's `k-means`_ clusterer is used.


  .. _`Parts Of Speech`: http://en.wikipedia.org/wiki/Part-of-speech_tagging
  .. _`naive Bayes`: http://en.wikipedia.org/wiki/Naive_Bayes_classifier
  .. _`k-means`: http://en.wikipedia.org/wiki/K-means_clustering
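The flow described above (extract terms, train on tagged content, suggest
tags, compare documents by their terms) can be sketched in plain Python. This
is only an illustration under loud assumptions: it does not use the package's
own utilities or NLTK, the ``TOY_NOUNS`` lexicon stands in for a trained POS
tagger, ``TinyNaiveBayes`` is a hypothetical minimal multinomial naive Bayes
with add-one smoothing, and Jaccard overlap is just one possible similarity
measure, not necessarily the one the package uses.

```python
import math
from collections import Counter, defaultdict

# Hypothetical noun lexicon standing in for a trained POS tagger.
TOY_NOUNS = {"cat", "dog", "vet", "server", "database", "query"}


def extract_terms(text):
    """Count the nouns in `text`; a stand-in for POS-based term extraction."""
    words = (w.strip(".,;:!?").lower() for w in text.split())
    return dict(Counter(w for w in words if w in TOY_NOUNS))


class TinyNaiveBayes:
    """Multinomial naive Bayes over extracted terms, add-one smoothing."""

    def __init__(self):
        self.tag_docs = Counter()                # documents seen per tag
        self.term_counts = defaultdict(Counter)  # term frequencies per tag
        self.vocabulary = set()

    def train(self, terms, tags):
        """Learn from one already tagged document."""
        for tag in tags:
            self.tag_docs[tag] += 1
            self.term_counts[tag].update(terms)
        self.vocabulary.update(terms)

    def suggest(self, terms):
        """Return the known tags, most likely first."""
        total_docs = sum(self.tag_docs.values())
        scores = {}
        for tag, n_docs in self.tag_docs.items():
            total_terms = sum(self.term_counts[tag].values())
            denominator = total_terms + len(self.vocabulary)
            score = math.log(n_docs / total_docs)  # log prior
            for term, count in terms.items():
                likelihood = (self.term_counts[tag][term] + 1) / denominator
                score += count * math.log(likelihood)
            scores[tag] = score
        return sorted(scores, key=scores.get, reverse=True)


def jaccard(terms_a, terms_b):
    """Similarity as the share of terms two documents have in common."""
    a, b = set(terms_a), set(terms_b)
    return len(a & b) / len(a | b) if a | b else 0.0


pets = extract_terms("The cat and the dog visited the vet.")
it = extract_terms("The server runs a database query.")
new = extract_terms("My dog chased a cat.")

classifier = TinyNaiveBayes()
classifier.train(pets, ["pets"])
classifier.train(it, ["it"])

suggestions = classifier.suggest(new)  # most likely tag first
similarity = jaccard(new, pets)        # overlap between two term sets
print(suggestions, round(similarity, 2))
```

In the real package the term counts come from a POS tagger rather than a
fixed word list, so the quality of suggestions depends on how well the tagger
was trained for the language of the content.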

Installation & Setup
====================

Before running buildout, make sure you have YAML and its Python bindings
installed (use MacPorts on OS X, or your package manager on Linux). If NLTK
is packaged for your OS you might as well install that; otherwise it will be
fetched when you run buildout.

To get started, add the package to the "eggs" section of your buildout
configuration and run buildout. Then restart your Plone instance and install
the "collective.classification" package using the quick-installer or via the
"Add-on Products" section in "Site Setup".

**WARNING: Upon first-time installation, linguistic data will be fetched from
NLTK's repository and stored locally on your filesystem. It's not big (about
400 kB), but the Plone user needs access to its "home" directory.
Running the tests will also fetch more data from NLTK, bringing the total to
about 225 MB, so this is not for those short on disk space.**

How to use it?
==============

  * For a parsed document you can call the terms view to display the
    identified terms (just append *@@terms* to the URL of the content to call
    the view).
  * In order to use the classifier and get suggested tags for some content,
    you can call *@@suggest-categories* on the content, that is, append
    *@@suggest-categories* to its URL in your browser. A form will come up
    with suggestions; choose the ones that seem appropriate and apply them.
    You need the right to edit the document in order to call the view.
  * You can find content similar to a given item, based on its terms, by
    calling the *@@similar-items* view.
  * For clustering you can call the *@@clusterize* view from anywhere.
    The result is not deterministic but hopefully helpful ;). You need
    Manager rights for this, so that ordinary users cannot DoS your site!
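
Since every view is reached by appending its name to a content URL, the URLs
can also be built programmatically. The helper below is a small sketch, not
part of the package: ``view_url`` is a hypothetical name, and the site address
``http://localhost:8080/Plone/front-page`` is only an assumed example.

```python
from urllib.parse import urlsplit, urlunsplit

# View names as listed in this README.
VIEWS = ("terms", "suggest-categories", "similar-items", "clusterize")


def view_url(content_url, view):
    """Append ``@@<view>`` to the path of a content URL."""
    if view not in VIEWS:
        raise ValueError("unknown view: %s" % view)
    parts = urlsplit(content_url)
    # Drop any trailing slash so the result has exactly one separator.
    path = parts.path.rstrip("/") + "/@@" + view
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


# Hypothetical content URL on a local Plone instance.
terms_url = view_url("http://localhost:8080/Plone/front-page/", "terms")
print(terms_url)
```

Keep in mind that *@@suggest-categories* requires edit rights on the content
and *@@clusterize* requires Manager rights, so the request must be made as a
suitably authorized user.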

collective.classification's Issues

Improvement: for a folder, opt to classify a (non-folder) default item therein

= Use case =

Text at http://www.brighton.ac.uk/centrim/research/projects/mental-models/folder_contents is minimal.

That folder defaults to view a page http://www.brighton.ac.uk/centrim/research/projects/mental-models/overview in which text is substantial.

= Feature (improvement) =

http://www.brighton.ac.uk/centrim/research/projects/mental-models/@@subjectsuggest should:

  * detect that there is a default view for the folder
  * offer an option to make suggestions based on that item (not on the folder).

A more exotic additional option would be to make suggestions based upon /both/ the folder /and/ its default item.

Feature request: Limit the number of documents to parse and classify

On really huge sites it is not feasible to parse all documents.
It would be nice to have an Int field for limiting the query.

Confusing status message after reparsing

After you have pressed "Reparse all documents", the status message shows "Term extractor trained and NP storage updated. You will need to re-train the classifier as well."

The first sentence is left over from a previous release and isn't very user-friendly. Maybe it should just say: "Documents reparsed. You will need to re-train the classifier as well."

prefer @@suggest-categories to @@subjectsuggest

= Motivations =

Plone speak for developers may be 'subjects', but to everyday users of Plone 3.x, the expression is:

Error when selecting N-gram but no category

By default no category is selected, so when you choose N-gram and press save, you get the error:

Time 2010/03/02 11:17:05.890 GMT+1
User Name (User Id) soren (soren)
Request URL http://localhost:8080/nordic/@@classifier-settings-controlpanel
Exception Type ValueError
Exception Value concat() expects at least one object!
Traceback (innermost last):

Module plone.postpublicationhook.hook, line 74, in publish
Module ZPublisher.mapply, line 88, in mapply
Module ZPublisher.Publish, line 42, in call_object
Module zope.formlib.form, line 769, in call
Module Products.Five.formlib.formbase, line 55, in update
Module zope.formlib.form, line 750, in update
Module zope.formlib.form, line 594, in success
Module collective.classification.browser.controlpanel, line 153, in save_action
Module nltk.corpus.reader.tagged, line 211, in tagged_sents
Module nltk.corpus.reader.tagged, line 148, in tagged_sents
Module nltk.corpus.reader.util, line 412, in concat
ValueError: concat() expects at least one object!
