
rust-tfidf's People

Contributors

btc, ferristseng, martinfrances107, zackpierce

rust-tfidf's Issues

Document implementation for HashMap<String, usize>

It would probably be useful to have a default implementation of Document and ProcessedDocument for HashMap<String, usize>.

I made a first attempt, but I can't get it to compile:

impl<T> Document for HashMap<T, usize> {
    type Term = T;
}

impl<T> ProcessedDocument for HashMap<String, usize> where T : PartialEq {
    fn term_frequency<K>(&self, term: K) -> usize where K : Borrow<T> {
        &self.get(&term).unwrap_or(0)
    }

    fn max(&self) -> Option<&T> {
        match self.iter().max_by_key(|&(_, v)| v) {
            Some(&(ref k, _)) => Some(k),
            None => None
        }
    }
}

The error message I'm getting:

src/lib.rs:75:6: 75:7 error: the type parameter `T` is not constrained by the impl trait, self type, or predicates [E0207]
src/lib.rs:75 impl<T> ProcessedDocument for HashMap<String, usize> where T : PartialEq {
                   ^
src/lib.rs:75:6: 75:7 help: run `rustc --explain E0207` to see a detailed explanation
error: aborting due to previous error

Any ideas?
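
Judging by the error text, E0207 is raised because the second impl declares `T` but its self type is `HashMap<String, usize>`, so `T` appears only in the `where` clause. A minimal sketch of one possible fix follows. It assumes the two traits have the shape quoted in the attempt above and defines local stand-ins for them so the example compiles on its own; the crate's real definitions may differ. It also swaps the `PartialEq` bound for `Eq + Hash`, which `HashMap::get` requires.

```rust
use std::borrow::Borrow;
use std::collections::HashMap;
use std::hash::Hash;

// Local stand-ins for the crate's traits (assumed shape, not copied from the crate).
pub trait Document {
    type Term;
}

pub trait ProcessedDocument: Document {
    fn term_frequency<K>(&self, term: K) -> usize
    where
        K: Borrow<Self::Term>;
    fn max(&self) -> Option<&Self::Term>;
}

impl<T> Document for HashMap<T, usize> {
    type Term = T;
}

// `T` now appears in the self type, so the impl constrains it.
impl<T> ProcessedDocument for HashMap<T, usize>
where
    T: Eq + Hash,
{
    fn term_frequency<K>(&self, term: K) -> usize
    where
        K: Borrow<T>,
    {
        // Borrow the key as `&T`, look it up, and fall back to 0 when absent.
        let key: &T = term.borrow();
        self.get(key).copied().unwrap_or(0)
    }

    fn max(&self) -> Option<&T> {
        // `iter()` yields `(&T, &usize)` pairs; keep the term with the largest count.
        self.iter()
            .max_by_key(|&(_, count)| *count)
            .map(|(term, _)| term)
    }
}
```

With these impls, `counts.term_frequency("rust".to_string())` returns the stored count, or 0 for a term that is not in the map.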

Relicense under dual MIT/Apache-2.0

This issue was automatically generated. Feel free to close without ceremony if
you do not agree with re-licensing or if it is not possible for other reasons.
Respond to @cmr with any questions or concerns, or pop over to
#rust-offtopic on IRC to discuss.

You're receiving this because someone (perhaps the project maintainer)
published a crates.io package with the license as "MIT" xor "Apache-2.0" and
the repository field pointing here.

TL;DR the Rust ecosystem is largely Apache-2.0. Being available under that
license is good for interoperation. The MIT license as an add-on can be nice
for GPLv2 projects to use your code.

Why?

The MIT license requires reproducing countless copies of the same copyright
header with different names in the copyright field, for every MIT library in
use. The Apache license does not have this drawback. However, this is not the
primary motivation for me creating these issues. The Apache license also has
protections from patent trolls and an explicit contribution licensing clause.
However, the Apache license is incompatible with GPLv2. This is why Rust is
dual-licensed as MIT/Apache (the "primary" license being Apache, MIT only for
GPLv2 compat), and doing so would be wise for this project. This also makes
this crate suitable for inclusion and unrestricted sharing in the Rust
standard distribution and other projects using dual MIT/Apache, such as my
personal ulterior motive, the Robigalia project.

Some ask, "Does this really apply to binary redistributions? Does MIT really
require reproducing the whole thing?" I'm not a lawyer, and I can't give legal
advice, but some Google Android apps include open source attributions using
this interpretation. Others also agree with it.
But, again, the copyright notice redistribution is not the primary motivation
for the dual-licensing. It's stronger protections to licensees and better
interoperation with the wider Rust ecosystem.

How?

To do this, get explicit approval from each contributor of copyrightable work
(as not all contributions qualify for copyright, due to not being a "creative
work", e.g. a typo fix) and then add the following to your README:

## License

Licensed under either of

 * Apache License, Version 2.0, ([LICENSE-APACHE](LICENSE-APACHE) or http://www.apache.org/licenses/LICENSE-2.0)
 * MIT license ([LICENSE-MIT](LICENSE-MIT) or http://opensource.org/licenses/MIT)

at your option.

### Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted
for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any
additional terms or conditions.

and in your license headers, if you have them, use the following boilerplate
(based on that used in Rust):

// Copyright 2016 rust-tfidf Developers
//
// Licensed under the Apache License, Version 2.0, <LICENSE-APACHE or
// http://apache.org/licenses/LICENSE-2.0> or the MIT license <LICENSE-MIT or
// http://opensource.org/licenses/MIT>, at your option. This file may not be
// copied, modified, or distributed except according to those terms.

It's commonly asked whether license headers are required. I'm not comfortable
making an official recommendation either way, but the Apache license
recommends it in their appendix on how to use the license.

Be sure to add the relevant LICENSE-{MIT,APACHE} files. You can copy these
from the Rust repo for a plain-text version.

And don't forget to update the license metadata in your Cargo.toml to:

license = "MIT OR Apache-2.0"

I'll be going through projects which agree to be relicensed and have approval
by the necessary contributors and making these changes, so feel free to leave
the heavy lifting to me!

Contributor checkoff

To agree to relicensing, comment with:

I license past and future contributions under the dual MIT/Apache-2.0 license, allowing licensees to chose either at their option.

Or, if you're a contributor, you can check the box in this repo next to your
name. My scripts will pick this exact phrase up and check your checkbox, but
I'll come through and manually review this issue later as well.

Get TfIdf Vector

I'm still trying to wrap my head around TF-IDF, therefore this might be a stupid question :)

I want to compare the similarity between two documents. I already have code in place to extract the words from the documents and to count the words. The result is a HashMap<String, usize>.

What I want to get now is a vector that contains TF-IDF values for every word that occurs in the documents, so that I can determine the cosine similarity between them.

Is this possible with the current API? If I understand it correctly, the tfidf function simply calculates the TF-IDF value for a single word, right? Does IDF even make much sense if there are only 2 documents?
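
For illustration, here is a minimal sketch of building such vectors directly from the `HashMap<String, usize>` counts and comparing them, without going through the crate's API. The function names, the raw-count TF, and the smoothed IDF `ln(1 + N / n_t)` are all assumptions chosen for this example. The smoothing matters with only two documents: with the plain `ln(N / n_t)`, every term that occurs in both documents gets a weight of zero, so the cosine similarity of two non-empty documents is always zero.

```rust
use std::collections::{BTreeSet, HashMap};

/// Word counts for one document, as described in the question.
type Counts = HashMap<String, usize>;

/// Builds one TF-IDF vector per document over the shared, sorted vocabulary.
/// TF is the raw count and IDF is ln(1 + N / n_t); both are illustrative choices.
fn tfidf_vectors(docs: &[Counts]) -> Vec<Vec<f64>> {
    // Sorted vocabulary so every vector uses the same term order.
    let vocab: BTreeSet<&str> = docs
        .iter()
        .flat_map(|d| d.keys().map(String::as_str))
        .collect();
    let n = docs.len() as f64;
    docs.iter()
        .map(|doc| {
            vocab
                .iter()
                .map(|term| {
                    let tf = doc.get(*term).copied().unwrap_or(0) as f64;
                    // Number of documents containing the term (at least 1 here).
                    let df = docs.iter().filter(|d| d.contains_key(*term)).count() as f64;
                    tf * (1.0 + n / df).ln()
                })
                .collect()
        })
        .collect()
}

/// Cosine similarity of two equal-length vectors; 0.0 if either has zero norm.
fn cosine_similarity(a: &[f64], b: &[f64]) -> f64 {
    let dot: f64 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm = |v: &[f64]| v.iter().map(|x| x * x).sum::<f64>().sqrt();
    let denom = norm(a) * norm(b);
    if denom == 0.0 {
        0.0
    } else {
        dot / denom
    }
}
```

Calling `tfidf_vectors(&[doc_a, doc_b])` and passing the two resulting vectors to `cosine_similarity` gives a score between 0 and 1 for non-negative counts.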

a performance question and proposal to expose the `Idf` trait

Hi! Thanks for publishing this software. It's quite helpful to potentially be able to use your library instead of re-implementing TFIDF myself. I am grateful for the time and attention you've given to this.

For my use-case, I am working with a large corpus of documents and trying to understand if I can use this library in a way which will have suitable performance.

The examples in this repo have the form:

for term in terms:
  for doc in docs:
    score = compute_tfidf(term, doc) 

where compute_tfidf is either TfIdfDefault::tfidf or MyTfIdfStrategy::tfidf


  1. Is it true that the idf implementations exposed by this crate all require an O(n) linear iteration over the documents/corpus?
  2. Is it possible to use the idf functions on their own, without going through tfidf?

Presented in pseudocode here, I would like to do the following:

for term in terms:
  idf = compute_idf(term, docs)
  for doc in docs:
    score = compute_tf(term, doc) * idf
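
As a self-contained illustration of the pseudocode above, here is a small Rust sketch that does not use the crate at all. `compute_idf`, `compute_tf`, and `score_all` are hypothetical helpers, and the raw-count TF and smoothed IDF are assumptions made for the example. If an IDF pass over the corpus is O(n), recomputing it inside the inner loop costs O(n^2) per term, while hoisting it out keeps the cost at O(n) per term.

```rust
use std::collections::HashMap;

/// Word counts for one document (an assumed representation).
type Counts = HashMap<String, usize>;

/// Hypothetical helper: smoothed IDF, ln(1 + N / n_t), one pass over the corpus.
fn compute_idf(term: &str, docs: &[Counts]) -> f64 {
    let containing = docs.iter().filter(|d| d.contains_key(term)).count();
    if containing == 0 {
        return 0.0; // term absent from the corpus
    }
    (1.0 + docs.len() as f64 / containing as f64).ln()
}

/// Hypothetical helper: raw term count as the TF.
fn compute_tf(term: &str, doc: &Counts) -> f64 {
    doc.get(term).copied().unwrap_or(0) as f64
}

/// Returns scores[i][j], the TF-IDF of terms[i] in docs[j].
fn score_all(terms: &[&str], docs: &[Counts]) -> Vec<Vec<f64>> {
    terms
        .iter()
        .map(|term| {
            // The IDF is computed once per term, outside the per-document loop.
            let idf = compute_idf(term, docs);
            docs.iter().map(|doc| compute_tf(term, doc) * idf).collect()
        })
        .collect()
}
```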

Concretely, I have tried to use the library in the following way, but ran into an error that I don't quite understand yet:

use tfidf::idf::InverseFrequencySmoothIdf;
use tfidf::tf::DoubleHalfNormalizationTf;
use tfidf::{Tf, Idf};

for term in terms {
  // Compute the IDF once per term, outside the per-document loop.
  let idf = InverseFrequencySmoothIdf::idf(term, docs);
  for doc in docs {
    let tf = DoubleHalfNormalizationTf::tf(term, doc);
    let tfidf = tf * idf;
  }
}

Error 1:

Unresolved import `tfidf::Idf`

Error 2:

No function or associated item named `idf` found for struct `InverseFrequencySmoothIdf` in the current scope

Further exploration has led me to discover that this may be occurring simply because Idf isn't exposed. Would it be okay for me to submit a patch which modifies lib.rs to expose Idf?

by changing:

pub use prelude::{
  Document, ExpandableDocument, NaiveDocument, NormalizationFactor, ProcessedDocument,
  SmoothingFactor, Tf, TfIdf,
};

proposed:

pub use prelude::{
  Document, ExpandableDocument, NaiveDocument, NormalizationFactor, ProcessedDocument,
  SmoothingFactor, Tf, TfIdf, Idf,
};
