finalfusion / finalfusion-rust
finalfusion embeddings in Rust
Home Page: https://finalfusion.github.io/
License: Other
We now turn every error into a black box using `failure`. However, this makes it very hard for downstream users to give appropriate error messages. We should switch to one or more crate-specific error types.
This is a trivial change, but it should go in a separate PR.
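For illustration, a crate-specific error type could be a plain enum implementing `std::error::Error`; a minimal sketch, with variant names that are assumptions rather than the actual design:

```rust
use std::fmt;
use std::io;

// A minimal sketch of a crate-specific error type; variants are illustrative.
#[derive(Debug)]
pub enum Error {
    // Underlying I/O failure while reading a chunk.
    Io(io::Error),
    // The file did not match the expected format.
    Format(String),
}

impl fmt::Display for Error {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        match self {
            Error::Io(err) => write!(f, "I/O error: {}", err),
            Error::Format(desc) => write!(f, "format error: {}", desc),
        }
    }
}

impl std::error::Error for Error {}

impl From<io::Error> for Error {
    fn from(err: io::Error) -> Self {
        Error::Io(err)
    }
}
```

Unlike an opaque `failure` error, downstream users can match on such variants to produce their own messages.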
I'm implementing stand-alone reading and writing for the different chunks in my Python fork, and it seems we never added the `NdNorms` chunk identifier to the list in the header.
I think adding the identifier to the header is not a breaking change. Inside `finalfusion`, we don't use the chunk identifiers for storage, vocab, and norms in any way, IIRC.
Edit: copied the list from the other comment to get an indicator for the list ticks.
Going through the API:

`prelude.rs`: #105
- `prelude` re-exports `SimpleVocab` and `SubwordVocab`, but not the aliases such as `FinalfusionSubwordVocab`. I think it makes more sense to re-export the aliases and leave `SubwordVocab` in `chunks::vocab`.
- `MmapQuantizedArray` is the only storage not re-exported.
- `NdNorms` is the only public chunk not being re-exported.

`chunks::storage::mod.rs`:

`chunks::storage::array.rs`: #93
- `NdArray` derives `Debug`, but `MmapArray` does not.
- `NdArray` should derive `Clone`.

`chunks::storage::wrappers.rs`:
- `StorageViewWrap` implies that it wraps a view, while it actually wraps a viewable storage. Think about renaming?

`chunks::mod.rs`: finalfusion chunks

`chunks::io.rs`: #92
- `typeid_impl` is lacking docs; the choices `1` for `u8` and `10` for `f32` seem arbitrary. IIRC, this was in order to leave room for other int and float types. It could still use some docs to make that clear once this is forgotten.

`chunks::metadata.rs`: #91
- `Value`: do we keep `toml` as our choice? The upside of `toml` is that we get easy serialization and heterogeneous collections. The downside is that we always need `Value`s to construct.

`chunks::norms.rs`: #90
- `NdNorms`' inner array: impl `Index` for `NdNorms`, since `Norms` seems to be just that but without the `[]`-indexing.
- `Index::index` returns references. I'd still like to make `norm` a method directly on `NdNorms`, since `NdNorms` is entirely useless without importing `Norms` otherwise.

`chunks::vocab.rs`: #89
- Split into `vocab::mod.rs`, `vocab::subword.rs`, `vocab::simple.rs`.
- Remove `Clone` from `Vocab`'s requirements and from the other places where it pops up because of this requirement (e.g. `Indexer` bounds).

`compat::fasttext`
`compat::{text.rs, word2vec.rs}`
`embeddings.rs`
`io.rs`
`lib.rs`
`similarity.rs`
`similarity::analogy.rs` and `similarity::similarity.rs`
`subword.rs`: `Indexer`s could live in their own `indexer.rs`.
`util.rs`
What needs to be done until then?
- `chunks::Vocab::{NgramIndices, SubwordIndices}`: consolidate into one trait? Related to that, bracketing words by default forces us to collect the indices in both methods and return them as `Vec`.

Add support for pruning embeddings, where N embeddings are retained. Words for which embeddings are removed are mapped to their nearest neighbor.
This should provide more or less the same functionality as pruning in spaCy:
https://spacy.io/api/vocab#prune_vectors
I encourage some investigation here. Some ideas:
The most basic version could simply retain the embeddings of the N most frequent words and map all the remaining words to the nearest neighbor in the N embeddings that are retained.
Select vectors such that the similarities to the pruned vectors are maximized. The challenge here is making it tractable.
An approach similar to quantization, where k-means clustering is performed with N clusters. The embedding matrix is then replaced by the cluster centroid matrix. Each word maps to the cluster it is in. (This could reuse the KMeans stuff from reductive, which is already a dependency of finalfusion).
I would focus on (1) and (3) first.
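For (1), a minimal sketch, assuming the embedding matrix rows are sorted by word frequency and l2-normalized (all names here are illustrative, not crate API):

```rust
use ndarray::{s, ArrayView2, Axis};

/// For each pruned row (offset by n_keep), return the index of the most
/// similar retained row. With l2-normalized rows, the dot product is the
/// cosine similarity.
fn prune_mapping(embeds: ArrayView2<f32>, n_keep: usize) -> Vec<usize> {
    let retained = embeds.slice(s![..n_keep, ..]);
    embeds
        .slice(s![n_keep.., ..])
        .axis_iter(Axis(0))
        .map(|pruned| {
            retained
                .axis_iter(Axis(0))
                .enumerate()
                .map(|(idx, row)| (idx, row.dot(&pruned)))
                .max_by(|a, b| a.1.partial_cmp(&b.1).unwrap())
                .map(|(idx, _)| idx)
                .unwrap()
        })
        .collect()
}
```

Approach (3) would replace the brute-force nearest-neighbor search here with k-means cluster assignment.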
Benefits:
While reimplementing finalfusion-inspector in Rust, I bumped into a small annoyance. The `analogy` method takes an array of `&str`:

query: [&str; 3]

However, oftentimes you have a `[String; 3]`. We should relax this type to:

query: [impl AsRef<str>; 3]

This should not break the API, since it allows a superset of the types of the original signature.
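For illustration, the relaxed signature can equivalently be written with an explicit type parameter; a quick stand-alone sketch showing it accepts both argument types:

```rust
// A generic bound accepts both [&str; 3] and [String; 3].
fn analogy_query<S: AsRef<str>>(query: [S; 3]) {
    for word in &query {
        println!("{}", word.as_ref());
    }
}

fn main() {
    analogy_query(["berlin", "germany", "paris"]);
    analogy_query(["berlin".to_string(), "germany".to_string(), "paris".to_string()]);
}
```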
This constructor is not documented (and non-trivial).
Hi! This may sound stupid, but I have been wondering about this... I tried to reconstruct the whole `QuantizedArray` storage into `NdArray` storage and wrap it in `StorageViewWrap` so that I can compute the accuracy of the analogy test, but why don't we have a function for this so we can compute accuracy directly? (Or do we have one that I failed to find?)
Trying to read the GoogleNews-vectors-negative300.bin word2vec model triggers this assert:
https://github.com/finalfusion/finalfusion-rust/blob/main/src/chunks/vocab/simple.rs#L28
thread 'main' panicked at 'assertion failed: `(left == right)`
left: `3000000`,
right: `2999997`: words contained duplicate entries.'
(When constructing a new simple vocabulary, the number of indices (3,000,000) ends up different from the number of words (2,999,997).)
After some investigation, I removed this word trimming and it worked fine afterwards:
https://github.com/finalfusion/finalfusion-rust/blob/main/src/compat/word2vec.rs#L98
I assume the model contains tokens that get trimmed into the same words.
Should I create a pull request to remove this line? Or is there something I'm doing wrong?
The model I used is from: https://code.google.com/archive/p/word2vec
Code:
use std::fs::File;
use std::io::BufReader;
use finalfusion::prelude::*;

let mut reader = BufReader::new(File::open("GoogleNews-vectors-negative300.bin").unwrap());
let model = Embeddings::read_word2vec_binary(&mut reader).unwrap();
When running the following code sample on Windows, it crashes:
use std::io::{BufReader, Read};
use std::fs::File;
use finalfusion::prelude::*;
use finalfusion::similarity::WordSimilarity;
fn main() {
let mut reader = BufReader::new(File::open("resources/english-skipgram-mincount-50-ctx-10-ns-5-dims-300.fifu").unwrap());
// Read the embeddings.
let embeddings: Embeddings<VocabWrap, StorageViewWrap> =
Embeddings::read_embeddings(&mut reader)
.unwrap();
}
with the following error:
thread 'main' panicked at 'capacity overflow', src\liballoc\raw_vec.rs:750:5
stack backtrace:
0: core::fmt::write
at /rustc/8d69840ab92ea7f4d323420088dd8c9775f180cd\/src\libcore\fmt\mod.rs:1063
1: std::io::Write::write_fmt<std::sys::windows::stdio::Stderr>
at /rustc/8d69840ab92ea7f4d323420088dd8c9775f180cd\/src\libstd\io\mod.rs:1426
2: std::sys_common::backtrace::_print
at /rustc/8d69840ab92ea7f4d323420088dd8c9775f180cd\/src\libstd\sys_common\backtrace.rs:62
3: std::sys_common::backtrace::print
at /rustc/8d69840ab92ea7f4d323420088dd8c9775f180cd\/src\libstd\sys_common\backtrace.rs:49
4: std::panicking::default_hook::{{closure}}
at /rustc/8d69840ab92ea7f4d323420088dd8c9775f180cd\/src\libstd\panicking.rs:204
5: std::panicking::default_hook
at /rustc/8d69840ab92ea7f4d323420088dd8c9775f180cd\/src\libstd\panicking.rs:224
6: std::panicking::rust_panic_with_hook
at /rustc/8d69840ab92ea7f4d323420088dd8c9775f180cd\/src\libstd\panicking.rs:470
7: std::panicking::begin_panic_handler
at /rustc/8d69840ab92ea7f4d323420088dd8c9775f180cd\/src\libstd\panicking.rs:378
8: core::panicking::panic_fmt
at /rustc/8d69840ab92ea7f4d323420088dd8c9775f180cd\/src\libcore\panicking.rs:85
9: core::panicking::panic
at /rustc/8d69840ab92ea7f4d323420088dd8c9775f180cd\/src\libcore\panicking.rs:52
10: alloc::raw_vec::capacity_overflow
at /rustc/8d69840ab92ea7f4d323420088dd8c9775f180cd\/src\liballoc\raw_vec.rs:750
11: alloc::raw_vec::{{impl}}::allocate_in::{{closure}}<f32,alloc::alloc::Global>
at /rustc/8d69840ab92ea7f4d323420088dd8c9775f180cd\src\liballoc\raw_vec.rs:79
12: core::result::Result<(), alloc::collections::TryReserveError>::unwrap_or_else<(),alloc::collections::TryReserveError,closure-1>
at /rustc/8d69840ab92ea7f4d323420088dd8c9775f180cd\src\libcore\result.rs:851
13: alloc::raw_vec::RawVec<f32, alloc::alloc::Global>::allocate_in<f32,alloc::alloc::Global>
at /rustc/8d69840ab92ea7f4d323420088dd8c9775f180cd\src\liballoc\raw_vec.rs:79
14: alloc::raw_vec::RawVec<f32, alloc::alloc::Global>::with_capacity_zeroed<f32>
at /rustc/8d69840ab92ea7f4d323420088dd8c9775f180cd\src\liballoc\raw_vec.rs:147
15: finalfusion::util::padding<f32>
at C:\Users\Roland\.cargo\registry\src\github.com-1ecc6299db9ec823\finalfusion-0.12.1\src\util.rs:66
16: alloc::vec::from_elem<f32>
at /rustc/8d69840ab92ea7f4d323420088dd8c9775f180cd\src\liballoc\vec.rs:1730
17: std::sys::windows::alloc::{{impl}}::dealloc
at /rustc/8d69840ab92ea7f4d323420088dd8c9775f180cd\/src\libstd\sys\windows\alloc.rs:48
18: std::alloc::__default_lib_allocator::__rdl_dealloc
at /rustc/8d69840ab92ea7f4d323420088dd8c9775f180cd\/src\libstd\alloc.rs:270
19: core::slice::{{impl}}::index<u8>
at /rustc/8d69840ab92ea7f4d323420088dd8c9775f180cd\src\libcore\slice\mod.rs:2891
20: core::ops::function::FnOnce::call_once<fn(finalfusion::chunks::vocab::subword::SubwordVocab<finalfusion::subword::HashIndexer<fnv::FnvHasher>>) -> finalfusion::chunks::vocab::wrappers::VocabWrap,(finalfusion::chunks::vocab::subword::SubwordVocab<finalfusi
at /rustc/8d69840ab92ea7f4d323420088dd8c9775f180cd\src\libcore\ops\function.rs:232
21: finalfusion::chunks::storage::wrappers::{{impl}}::read_chunk<std::io::buffered::BufReader<std::fs::File>>
at C:\Users\Roland\.cargo\registry\src\github.com-1ecc6299db9ec823\finalfusion-0.12.1\src\chunks\storage\wrappers.rs:250
22: finalfusion::embeddings::{{impl}}::read_embeddings<finalfusion::chunks::vocab::wrappers::VocabWrap,finalfusion::chunks::storage::wrappers::StorageViewWrap,std::io::buffered::BufReader<std::fs::File>>
at C:\Users\Roland\.cargo\registry\src\github.com-1ecc6299db9ec823\finalfusion-0.12.1\src\embeddings.rs:404
23: stuff::main
at .\src\bin\stuff.rs:12
24: std::rt::lang_start::{{closure}}<()>
at /rustc/8d69840ab92ea7f4d323420088dd8c9775f180cd\src\libstd\rt.rs:67
25: std::rt::lang_start_internal::{{closure}}::{{closure}}
at /rustc/8d69840ab92ea7f4d323420088dd8c9775f180cd\/src\libstd\rt.rs:52
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
error: process didn't exit successfully: `target\debug\stuff.exe` (exit code: 101)
On Linux, the same code runs fine. Is there a problem with finalfusion on Windows?
Compare the disk/memory use and speed of the different storage formats, and add this to the README to give people an idea of the trade-offs:
I have already done parts of this, so I can take this one.
I think that for downstream users it would be great if we had a first release. Due to semver, once we release 1.0.0, we cannot break APIs anymore in 1.x.y. Basically, once someone puts
finalfusion = "1"
in their Cargo.toml
, it should always work for 1.x.y. Of course, we can extend the API.
This issue is for discussing what still needs to be done before 1.0.0, and which sore thumbs are sticking out. We should also check whether there are any public APIs that need to be made private or crate-private.
I was wondering if there are reasons against exposing a method to get tuples of `(ngram, idx)` for both in-vocabulary and out-of-vocabulary words.
It could be nice to inspect the similarity of the subword vectors through the Python module. Although we probably can't get all the components that make up in-vocabulary words, because we precompute their representations.
Not sure whether this should go in a dot-release or in a new version...
Add a method `Embeddings::embedding_into` that constructs the embedding in-place in an `ArrayViewMut1`. Obviously, this would be a copy for a known word, but for unknown words this avoids an allocation in case one already has memory allocated (e.g. in downstream software where you allocate memory for the embeddings of all words in a sentence).
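A rough sketch of the idea as a free function (names assumed; note that, unlike the proposed method, this sketch still allocates inside `embedding` for unknown words, whereas the real method would write the subword average directly into the destination):

```rust
use finalfusion::prelude::*;
use ndarray::ArrayViewMut1;

/// Copy the embedding of `word` into `dest`, returning whether the word
/// (or any of its subwords) was found.
fn embedding_into(
    embeds: &Embeddings<VocabWrap, StorageWrap>,
    word: &str,
    mut dest: ArrayViewMut1<f32>,
) -> bool {
    match embeds.embedding(word) {
        Some(embedding) => {
            dest.assign(&embedding);
            true
        }
        None => false,
    }
}
```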
This, if I understand it correctly, is outdated, right? Since we already have `Embeddings<_, MmapQuantizedArray>`.
If some of the quantizer parameters are wrong, the finalfusion file could be malformed. We should return `Err` in such cases, rather than relying on assertions in `reductive`.
This issue is for tracking the addition of reading functionality for fastText embeddings.
- `FastTextIndexer`. PR: #29
- Make `SubwordVocab` generic over the indexer. PR: #30
- `SubwordVocab`. PR: #31
- `FastTextReader` trait. PR: #32
- Extend `finalfusion-utils` to support the fastText file format.
- Extend `finalfusion-python` to read fastText embeddings.

@sebpuetz I added you as an assignee for the reviewing. I have nearly all functionality implemented already.
I think it would be nice to have a small utility data structure to fetch pretrained embeddings. I don't think this needs to be part of the `finalfusion` crate, since it is not really core functionality. The basic idea is:
We'd have a repository `finalfusion-fetcher` with some metadata file (probably JSON), mapping embedding file identifiers to URLs. E.g. `fasttext.wiki.nl.fifu` could map to http://www.sfs.uni-tuebingen.de/a3-public-data/finalfusion-fasttext/wiki/wiki.nl.fifu
A small crate (possibly in the same repo) would provide a data structure `Fetcher` with a constructor that retrieves the metadata and gives a fetcher:
let fetcher = Fetcher::fetch_metadata().unwrap();
A user could then open embeddings:
let dutch_embeddings = fetcher.open("fasttext.wiki.nl.fifu").unwrap();
This method would check whether the embeddings are already available. If not, it would fetch them and store them in a standard XDG location. Then it would open the embeddings stored in this location.
Similarly, `Fetcher::mmap` could be used to memory-map an embedding after downloading.
After this is implemented, the functionality could also be exposed in `finalfusion-python`.
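A rough sketch of the surface such a crate could have (all names here are hypothetical):

```rust
use std::collections::HashMap;
use std::error::Error;
use std::path::PathBuf;

use finalfusion::prelude::*;

/// Hypothetical fetcher: maps embedding identifiers to URLs and caches
/// downloads in a standard XDG data directory.
pub struct Fetcher {
    metadata: HashMap<String, String>, // identifier -> URL
    cache_dir: PathBuf,
}

impl Fetcher {
    /// Download and parse the JSON metadata file.
    pub fn fetch_metadata() -> Result<Self, Box<dyn Error>> {
        unimplemented!()
    }

    /// Open embeddings by identifier, downloading them into the cache
    /// directory first if they are not present yet.
    pub fn open(&self, identifier: &str) -> Result<Embeddings<VocabWrap, StorageWrap>, Box<dyn Error>> {
        unimplemented!()
    }
}
```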
This time hopefully without soundness errors ;).
We should be very conservative with adding new chunks. But I think there is a place for a new vocabulary chunk that solves three problems of the current vocab and (bucket) subword vocab chunks:
The proposed vocabulary format would be structured similarly to a fully loaded hash table.
The data would be structured as follows (after the chunk identifier and length). Given a vocabulary of size `vocab_len`:

1. `vocab_len` pairs (see notes 1 and 2 below for refinements) of the word's hash (`u64`) and the word itself;
2. `vocab_len` indexes (`u64`), mapping a storage index to a pair from (1).

Lookup of a storage index `i`: get the `i`-th element of the storage->string link table. This gives the corresponding hash/string pair, and thus the word.
Lookup of a word `w`: hash `w` and map the hash `h` to the vocabulary space, `h_vocab`. Start linear probing at `h_vocab` until a matching `h` is found. Then verify that the strings match (otherwise continue probing). A sketch of this lookup follows after the notes.

Notes:
1. In practice we'd want to use another number than the vocabulary length, e.g. one of the next powers of two: first, to make the outcomes of hashing uniformly distributed; second, to avoid degenerate cases in linear probing. However, without these blank slots, the storage->string link table would not be necessary (since we could sort the storage by hash table order).
2. The birthday paradox square estimation puts the hash collision probability of 0.5 at 2^32 items, so in practice the first actual string match would be a hit.
3. If the table is constructed in word frequency order, the amount of linear probing is a function of the word rank/frequency, since when the most frequent words are inserted, most pairs will still be empty.
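A minimal sketch of the word lookup described above (hash function, table layout, and names are all assumptions for illustration; the blank slots from note 1 are what guarantee probing terminates):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// (hash, word) pairs in hash-table order; blank slots hold None.
struct HashVocab {
    slots: Vec<Option<(u64, String)>>,
}

impl HashVocab {
    fn idx(&self, word: &str) -> Option<usize> {
        let mut hasher = DefaultHasher::new();
        word.hash(&mut hasher);
        let h = hasher.finish();

        // Map the hash into the vocabulary space and probe linearly.
        let mut slot = (h as usize) % self.slots.len();
        loop {
            match &self.slots[slot] {
                // A blank slot means the word is not in the vocabulary.
                None => return None,
                // Hashes match: verify the strings, otherwise keep probing.
                Some((slot_hash, slot_word))
                    if *slot_hash == h && slot_word.as_str() == word =>
                {
                    return Some(slot);
                }
                _ => slot = (slot + 1) % self.slots.len(),
            }
        }
    }
}
```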
I guess that adding this chunk entails adding three new chunks, since we have combined the vocabs with subword vocabs. Also, it probably requires a redesign of the explicit n-gram chunk, since otherwise it would not be memory mappable.
We discussed earlier that it may have been a design mistake not to store storage indices with words, relying on the order of the words instead. However, while thinking about this new chunk, I realized that doing this is actually unsound in our current setup: e.g. if one does a similarity or analogy query, the indices of the top-n results do not map to individual words anymore. We would need to change the APIs to return multiple words for a given index.
...using real-world embeddings. This would allow us to detect performance regressions.
For analogies, we have `Analogy` and `AnalogyBy`. However, `Similarity` has both `similarity` and `similarity_by` methods.
I was converting some files from the text(dims) format to our format and ran into trouble with tokens containing whitespace. There were a number of tokens with non-space whitespace characters inside, which threw off our splitting through `.split_whitespace()`.
Also, there were a bunch of tokens that consisted of just a whitespace character (e.g. NEL `U+0085` and NBSP `U+00A0`), which got reduced to empty strings by the `.trim()` call we do on the words. The second part was particularly frustrating to debug, since `words` ended up with the correct length and the right number of embeddings was created, but an incorrect number of vocabulary entries was created. This happened because `trim()`ming reduced a couple hundred tokens to `""`.
finalfusion-rust/src/compat/text.rs, lines 161 to 170 in c9bd67c
This took an awfully long time to debug, mostly due to the stupid format and the unnormalized corpus that the embeddings were trained on, but still, I wonder if we can do better here. E.g.:
let word = parts
    .next()
    .ok_or_else(|| ErrorKind::Format(String::from("Spurious empty line")))?
    .trim_matches(|c| c == ' ' || c == '\t');
What do you think about targeted trimming and splitting on `'\t'` and `' '` rather than on any whitespace? It might reduce the headaches of potential users.
While debugging this, I also found that the following assertion leads to killing the Python process when loading a malformed file through our Python library.
finalfusion-rust/src/embeddings.rs, lines 50 to 54 in c9bd67c
It's kind of ugly to get this kind of output for a malformed file. There's an issue at PyO3/pyo3#492 about stopping panics at the FFI boundary, but for now we would have to deal with that ourselves. Maybe it's possible to return a `Result` here?
thread '<unnamed>' panicked at 'assertion failed: `(left == right)`
left: `199872`,
right: `200000`: Vocab and norms do not have the same length', /data/rust_projects/finalfusion-rust/src/embeddings.rs:50:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace.
fatal runtime error: failed to initiate panic, error 5
[1] 8362 abort (core dumped) ipython
Every `SubwordVocab` is also a `Vocab`. This has the benefit that when you have a `dyn SubwordVocab`, the `Vocab` methods are also available.
This should be a safe API change (since it only adds methods accessible through `SubwordVocab`).
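A minimal sketch of the supertrait pattern this describes (simplified trait bodies, not the actual finalfusion definitions):

```rust
trait Vocab {
    fn idx(&self, word: &str) -> Option<usize>;
}

// With Vocab as a supertrait, a dyn SubwordVocab also exposes Vocab's methods.
trait SubwordVocab: Vocab {
    fn subword_indices(&self, word: &str) -> Vec<usize>;
}

fn lookup(vocab: &dyn SubwordVocab, word: &str) -> Option<usize> {
    // Vocab::idx is callable through the SubwordVocab trait object.
    vocab.idx(word)
}
```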
Hey again. So the API is confusing for me here: when I try to get an embedding and it is not found, what is the proper way of using the ngram module to get the embedding? Is there a pre-built fn to do it, or an example somewhere in the test code?

#[inline]
pub fn get_embedding(&self, kmer: &str) -> Embedding {
    match self.embeddings.embedding_with_norm(&kmer) {
        Some(x) => x.embedding.into_owned(),
        // Probably have to use the ngrams module to get indices, and then average...
        None => panic!("Unable to get embeddings for {}", kmer),
    }
}

I'm using an alternative vocab, but it is working from the NGram bones (with very few changes):

impl From<KmerVocab<NGramConfig, ExplicitIndexer>> for VocabWrap {
    fn from(v: KmerVocab<NGramConfig, ExplicitIndexer>) -> Self {

Thanks
finalfusion-rust is currently licensed under the following licenses (user's choice):
We are trying to relicense the finalfusion ecosystem to something that is more canonical for the Rust ecosystem, namely (user's choice):
I, @sebpuetz, and SfS ISCL have agreed with relicensing all finalfusion projects in this way. However, finalfusion-rust has two additional contributors:
If you are ok with this relicensing, could you please reply to this issue confirming this? Thanks a lot!
Some code snippets could be nice. Also, a synopsis of what you can do with the crate.
The README also states that the crate is new and the API will change. While the API might still change, I think we're approaching a somewhat stable public API.
Proposed solution: change `analogy` to return a `Result<..., [bool; 3]>`.
See also: finalfusion/finalfusion-utils#8
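Under the proposed signature, a call site might look like this (a sketch; the exact `Err` semantics, presumably flagging which query words are present in the vocabulary, are an assumption):

```rust
use finalfusion::prelude::*;
use finalfusion::similarity::Analogy;

fn print_analogy(embeddings: &Embeddings<VocabWrap, StorageViewWrap>) {
    match embeddings.analogy(["berlin", "germany", "france"], 10) {
        Ok(results) => {
            // Use the returned nearest neighbors.
            let _ = results;
        }
        // The [bool; 3] would indicate, per query word, whether it was found.
        Err(present) => eprintln!("query words present: {:?}", present),
    }
}
```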
Either they are both subwords, or they are both n-grams; the current naming suggests that there is a difference.
We could get a lot of additional pretrained embeddings for 'free' if we convert fastText embeddings from:
https://fasttext.cc/docs/en/crawl-vectors.html
Any opinions?
This is an awesome project. Sadly, I cannot compile it to wasm for native browser use because of the `memmap` dependency.
It seems to be used in:
src/chunks/storage/array.rs:8:5
src/chunks/storage/quantized.rs:6:5
Is there a way to remove `memmap`, or replace it with something that can target wasm?
Thanks
With range 3-6 these fail:
fn empty_string_idx(voc: BucketSubwordVocab) {
assert!(voc.idx("").is_none())
}
fn empty_string_embed(e: Embeddings<BucketSubwordVocab, NdArray>) {
assert!(e.embedding("").is_none())
}
I'd expect the same for other setups, e.g. range 4-6 and length 1 inputs will probably behave the same.
Hi @danieldk, thanks for this great library. Can you guide me through the steps to include support for sentence embeddings using sentence-transformers? Which traits should I use? I am especially interested in storing embeddings for similarity queries on phrases.
These do not have to go into the default PR tests (to avoid long waits). But it would be nice to at least test the master branch and release branches against the BSDs that are supported by Rust.
I tried to use this library with pretrained models from https://github.com/sdadas/polish-nlp-resources?tab=readme-ov-file#word2vec and found out that these are in the keyed vectors format, which is currently not supported.
Looking at the losses of OPQ, most of the vectors are reconstructed nearly perfectly in terms of cosine similarity, but the euclidean distances are larger. Consequently, not all vectors of known words are necessarily unit vectors.
To do:
The `reductive` dependency should be updated sometime soon:
- So that the `attempts` option of `ff-quantize` actually has an effect.
- Update `ndarray-linalg` to 0.11. The version that we currently use (0.10) forces static linking of OpenBLAS, which fails on systems that do not provide a static library (when the `system` feature is used). Without the `system` feature, the compiled OpenBLAS library sometimes misses `LAPACKE_` symbols. (Not sure yet what the circumstances are; possibly it does not fail if there is a static system library, because it links against that first.)

This update is blocked by yet another issue. `reductive` switched to `rand` 0.7. However, `rand` 0.7 makes use of crate renaming, which seems to be used seldom enough that it is not supported by Nix's `buildRustCrate`. Since I use Nix to build Docker images for sticker (which uses finalfusion), I would like to see that fixed before making the leap. I have already submitted a PR to nixpkgs to add support for crate renames.
@sebpuetz @NianhengWu
We have accumulated quite some changes since June:
Maybe it's time to do a new release branch of `finalfusion-rust` and `finalfusion-utils`? Are there any small changes we need to get in before that? The primary thing off the top of my head is a unit test using toy fastText embeddings.
I have the norms layer mostly done, but I was thinking about what the API should look like:

1. Provide a separate method `fn norm(&self, word: &str) -> Option<f32>`. Benefit: no cluttering of `embedding`'s signature; most people will just want an embedding. Downside: more expensive if someone wants both the embedding and the original norm, especially with subword units.
2. Extend the return type of `embedding` to `Option<(CowArray1<f32>, f32)>`, where the embedding and norm are returned as a tuple. Advantage: faster than (1). Disadvantage: tuple.
3. Extend the return type of `embedding` to `Option<Embedding>`, where `Embedding` is a type that holds the `CowArray1` and the norm, with two getters: `embedding`/`norm`. Advantages: faster than (1), clearer than (2). Disadvantages: double wrapping of the actual vector (`Embedding` -> `CowArray`); getting the embedding is ugly (`embeds.embedding("blah").embedding()`); the `embedding` getter returns a reference, and `&CowArray1` is not a very useful type in case the array is owned.
4. Like (2), but as a separate method `fn embedding_with_norm(&self, word: &str) -> Option<(CowArray1<f32>, f32)>`.
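For concreteness, option (3) could look roughly like this (a sketch with assumed names and a simplified stand-in for the owned/borrowed array type, not a final design):

```rust
use ndarray::{Array1, ArrayView1};

// Simplified stand-in for the crate's owned-or-borrowed array type.
pub enum CowArray1<'a> {
    Borrowed(ArrayView1<'a, f32>),
    Owned(Array1<f32>),
}

// Option (3): one type holding both the vector and its original norm.
pub struct Embedding<'a> {
    inner: CowArray1<'a>,
    norm: f32,
}

impl<'a> Embedding<'a> {
    pub fn embedding(&self) -> &CowArray1<'a> {
        &self.inner
    }

    pub fn norm(&self) -> f32 {
        self.norm
    }
}
```

The `&CowArray1` return type of the `embedding` getter is exactly the disadvantage mentioned under (3).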
Any opinions @sebpuetz ?