jina-ai / finetuner
:dart: Task-oriented embedding tuning for BERT, CLIP, etc.
Home Page: https://finetuner.jina.ai
License: Apache License 2.0
Have an evaluation function with the following interface
def get_evaluation(queries: DocumentSequence, catalog: Union[DocumentArray, DocumentArrayMemmap], model: AnyDNN, metrics: List[str] = None):
where
- queries are the to-be-scored Documents, with either positive results as matches or class information
- catalog are the potential results
- model is any model
- metrics are the names of the metrics to compute, where None means all

The evaluation function should compute the embeddings and compute/output the requested metrics.
Furthermore, we should have a demo evaluation for the FMNIST dataset as an integration test (potentially sub-sampled for faster execution).
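A minimal sketch of such an evaluation function, using plain numpy stand-ins for DocumentSequence/DocumentArray (the pair-based data layout and the precision@k metric are illustrative assumptions, not the Finetuner API):

```python
from typing import Callable, Dict, List, Optional
import numpy as np

def get_evaluation(
    queries,            # sequence of (vector, relevant_ids) pairs (stand-in for DocumentSequence)
    catalog,            # sequence of (doc_id, vector) pairs (stand-in for DocumentArray)
    model: Callable[[np.ndarray], np.ndarray],   # embedding model: ndarray -> ndarray
    metrics: Optional[List[str]] = None,
) -> Dict[str, float]:
    """Embed queries and catalog, rank by cosine similarity, report metrics."""
    if metrics is None:
        metrics = ['precision@1']  # None means "all"; one metric here for brevity
    cat_ids = [d[0] for d in catalog]
    cat_emb = model(np.stack([d[1] for d in catalog]))
    q_emb = model(np.stack([q[0] for q in queries]))
    # cosine similarity between every query and every catalog item
    q_norm = q_emb / np.linalg.norm(q_emb, axis=1, keepdims=True)
    c_norm = cat_emb / np.linalg.norm(cat_emb, axis=1, keepdims=True)
    sims = q_norm @ c_norm.T
    results = {}
    for m in metrics:
        k = int(m.split('@')[1])
        hits = 0.0
        for (_, relevant), row in zip(queries, sims):
            top_k = [cat_ids[i] for i in np.argsort(-row)[:k]]
            hits += len(set(top_k) & set(relevant)) / k
        results[m] = hits / len(queries)
    return results
```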
Refactor the pipeline as suggested in Tadej's slides.
Below, "new" refers to a dataset without tuples/triplets; "old" refers to the user manually creating tuples/triplets.
- Dataset: support new- and old-style datasets. - @tadejsv
- Dataset. - @tadejsv
- We start from PyTorch, then move to Paddle and Keras, and finish by making abstraction classes.
Tailor unit testing is in good shape; the same is not true for tuner and labeler.
Hi there!
I am trying to tailor a vision transformer (DINO) by adding a projection layer on top of the whole transformer architecture.
When I run finetuner.tailor.to_embedding_model(model, output_dim=100, input_size=(3, 224, 224)) I get the following error.
I get the model from vits8 = torch.hub.load('facebookresearch/dino:main', 'dino_vits8').
Code snippet to reproduce the error:
import torch
import finetuner

vits8 = torch.hub.load('facebookresearch/dino:main', 'dino_vits8')
finetuner.tailor.to_embedding_model(vits8, output_dim=100, input_size=(3, 224, 224))
Following the Keras example in the docs raises a shape incompatibility using Jina 2.1.11.
It would be useful to implement checkpointing and, related to this, to allow users to resume training from the latest checkpoint.
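A minimal sketch of what save/resume could look like in plain PyTorch (the function names and checkpoint layout are hypothetical, not Finetuner API):

```python
import torch

def save_checkpoint(model, optimizer, epoch, path='ckpt.pt'):
    # Persist everything needed to resume: model weights, optimizer state, epoch.
    torch.save({'epoch': epoch,
                'model_state': model.state_dict(),
                'optimizer_state': optimizer.state_dict()}, path)

def resume_from_checkpoint(model, optimizer, path='ckpt.pt'):
    # Restore weights and optimizer state in place; return the epoch to continue from.
    state = torch.load(path)
    model.load_state_dict(state['model_state'])
    optimizer.load_state_dict(state['optimizer_state'])
    return state['epoch'] + 1
```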
Pokedex front-to-back search should work out of the box.
The next_batch function checks this condition but doesn't take into account that there might already be requests in progress, because of its async nature.
There is no reason we should bind Euclidean distance to TripletLayer and cosine to CosineLayer - there should be a parameter in the fit() function to allow setting the distance function to use.
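A sketch of what a pluggable distance could look like, assuming fit() grows a distance parameter (the names and registry here are hypothetical, not the current API):

```python
import numpy as np

def euclidean(x, y):
    return np.linalg.norm(x - y, axis=-1)

def cosine(x, y):
    x = x / np.linalg.norm(x, axis=-1, keepdims=True)
    y = y / np.linalg.norm(y, axis=-1, keepdims=True)
    return 1.0 - np.sum(x * y, axis=-1)

DISTANCES = {'euclidean': euclidean, 'cosine': cosine}

def triplet_loss(anchor, pos, neg, distance='euclidean', margin=1.0):
    # The loss layer looks up the distance by name instead of hard-coding it.
    d = DISTANCES[distance]
    return np.maximum(0.0, d(anchor, pos) - d(anchor, neg) + margin)
```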
Currently, we have image and text matches tables.
If we want to extend this to more mimetypes, we should make sure that it doesn't get hard to maintain the HTML file as a whole. Refactoring out components for them could be a good way to simplify it.
Currently the implementation for Paddle and PyTorch is error-prone: we get names from the model, we apply hooks to modules, and we zip(names, summary) based on the assumption that they should be identical. A better choice is to rename the module in the hook directly.
This example (adapted from the documentation) will crash:
import torch
import finetuner
from finetuner.toydata import generate_fashion_match
embed_model = torch.nn.Sequential(
torch.nn.Flatten(),
torch.nn.Linear(
in_features=28 * 28,
out_features=128,
),
torch.nn.ReLU(),
torch.nn.Linear(in_features=128, out_features=32),
)
finetuner.fit(embed_model, train_data=generate_fashion_match, interactive=False)
Traceback (most recent call last):
File "/home/tadej/projects/finetuner/test.py", line 16, in <module>
finetuner.fit(embed_model, train_data=generate_fashion_match, interactive=False)
File "/home/tadej/projects/finetuner/finetuner/__init__.py", line 66, in fit
return fit(*args, **kwargs)
TypeError: fit() got an unexpected keyword argument 'interactive'
This is because of this:
finetuner/finetuner/__init__.py
Lines 62 to 65 in 91587d8
interactive should be removed. Will create a PR.
Write a unit test that checks whether batching vs. non-batching produces the same results for a model that has a BatchNormalization or Dropout layer, for: https://github.com/jina-ai/finetuner/blob/ae8e3990080681a760f465b29c381ffe0e4b0245/finetuner/embedding.py
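Such a test could be sketched like this, assuming the embedding call puts the model into eval() mode (the model and function names are illustrative, not the existing test suite):

```python
import torch

def test_batching_consistency():
    # In eval() mode, BatchNorm uses its running statistics and Dropout is a
    # no-op, so per-sample outputs must not depend on how samples are batched.
    torch.manual_seed(0)
    model = torch.nn.Sequential(
        torch.nn.Linear(8, 16),
        torch.nn.BatchNorm1d(16),
        torch.nn.Dropout(p=0.5),
        torch.nn.Linear(16, 4),
    )
    model.eval()
    data = torch.randn(10, 8)
    with torch.no_grad():
        full = model(data)                                              # one big batch
        chunks = torch.cat([model(data[i:i + 2]) for i in range(0, 10, 2)])  # small batches
    assert torch.allclose(full, chunks, atol=1e-6)
```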
At the very least, learning rate should be customizable.
We should also consider supporting multiple optimizers, e.g. Adam (which should be the default, as it is the most commonly used one), AdamW (used for transformers) and SGD, and setting their parameters.
The best way to do this is not clear - either we have some selection of optimizers that users can choose from and implement it in all frameworks, or we allow passing the optimizer object.
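The first option could be sketched as a name-to-class registry in PyTorch (a hypothetical interface, not the current Finetuner API):

```python
import torch

OPTIMIZERS = {
    'adam': torch.optim.Adam,    # sensible default, most commonly used
    'adamw': torch.optim.AdamW,  # common for transformers
    'sgd': torch.optim.SGD,
}

def get_optimizer(model, name='adam', learning_rate=1e-3, **kwargs):
    # Look up the optimizer class by name and forward any extra parameters.
    try:
        cls = OPTIMIZERS[name.lower()]
    except KeyError:
        raise ValueError(f'unknown optimizer {name!r}, choose from {sorted(OPTIMIZERS)}')
    return cls(model.parameters(), lr=learning_rate, **kwargs)
```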
It should cancel the entire training instead.
We have dense, simple_cnn, vgg and lstm; add bert to the test models.
Add a save button to the frontend and enable saving of the model via the Flow.
We want an evaluation pipeline that measures the quality
The task is to figure out how to use the Labeler in Google Colab.
allow CLIP model fine-tuning
Pre-processing should be done asynchronously, so that it does not become a bottleneck.
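A generic asyncio pattern for this (not Finetuner internals): run the blocking transform in an executor so items are processed concurrently instead of serializing the pipeline.

```python
import asyncio

async def preprocess_all(items, transform):
    # Offload each (blocking) transform call to the default thread executor;
    # gather preserves input order in the results.
    loop = asyncio.get_running_loop()
    tasks = [loop.run_in_executor(None, transform, it) for it in items]
    return await asyncio.gather(*tasks)
```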
Say I want to quickly label many examples of a single class. Is there a way I can use a seed example of that class, and use a combination of the nearest-neighbour technique and my input, to quickly label several hundred?
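One possible workflow (a sketch, not an existing Finetuner feature): embed everything, rank unlabeled items by cosine similarity to the seed, and propose the top-k for one-click confirmation.

```python
import numpy as np

def propose_labels(seed_emb, embeddings, k=100):
    # Normalize so the dot product equals cosine similarity, then return the
    # indices of the k items most similar to the seed example.
    seed = seed_emb / np.linalg.norm(seed_emb)
    embs = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = embs @ seed
    return np.argsort(-sims)[:k]
```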
Use labeler UI to mimic user interactions, make sure that it's working end-to-end.
This will give us more confidence that the changes we make don't break existing functionality.
I'm thinking about using https://www.cypress.io/ for this.
Instead of name/kwargs, keep the default; also add an argument for the learning rate.
We have very few docstrings right now - it would be good to have more.
Currently the interactive labeler only supports Image and Text; is there some kind of middle-layer API or similar that I can modify to allow interactive Audio labeling?
I want to label similar speakers based on audio clips.
Thanks
The fix is to check with something like isinstance(model, torch.nn.Module) - however, this requires care to first check which frameworks are even installed.
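A sketch of doing that safely: probe which frameworks are installed before touching their classes, so an uninstalled framework never gets imported (the framework list here is illustrative).

```python
import importlib
import importlib.util

def get_framework(model):
    # find_spec only checks availability; import happens only when installed.
    for module_name, base_class_path in [
        ('torch', 'torch.nn.Module'),
        ('paddle', 'paddle.nn.Layer'),
        ('tensorflow', 'tensorflow.keras.layers.Layer'),
    ]:
        if importlib.util.find_spec(module_name) is None:
            continue  # framework not installed; skip its isinstance check
        mod = importlib.import_module(module_name)
        cls = mod
        for attr in base_class_path.split('.')[1:]:
            cls = getattr(cls, attr)  # walk e.g. torch -> nn -> Module
        if isinstance(model, cls):
            return module_name
    raise ValueError('model does not belong to any installed framework')
```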
Takes input embeddings and labels, and outputs a list of tuples/triplets. The implementation could look like:
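One possible sketch of such a function (a naive all-pairs miner over indices; real implementations would sample or use hard mining, and the name is hypothetical):

```python
import numpy as np
from itertools import combinations

def build_triplets(embeddings, labels):
    # For every same-class (anchor, positive) pair, pair it with every
    # item of a different class as the negative.
    labels = np.asarray(labels)
    triplets = []
    for a, p in combinations(range(len(labels)), 2):
        if labels[a] != labels[p]:
            continue
        for n in range(len(labels)):
            if labels[n] != labels[a]:
                triplets.append((a, p, n))
    return triplets
```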
"working" should log as "loading data" at the very first stage.
epochs is not passed into the front-end.
For image paths like img/, I expect finetuner to automatically fill the datauri / do the conversion.
Currently the labeler does the following:
let end_idx = app.labeler_config.start_idx + (app.labeler_config.example_per_view - app.cur_batch.length)
if (end_idx === app.labeler_config.start_idx) {
return
}
let start_idx = app.labeler_config.start_idx
app.labeler_config.start_idx = end_idx
Thus, when I only have 2 queries but a bigger catalog, I only get labelable data once.
In my opinion, choosing new data should be done in the backend, not in the frontend.
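A backend-side sketch (a hypothetical helper, not the current labeler code): keep the cursor server-side and return an empty batch when the data is exhausted, so the frontend never has to compute indices.

```python
class BatchCursor:
    """Server-side batch cursor: the frontend just asks for the next batch."""

    def __init__(self, data, batch_size):
        self.data = data
        self.batch_size = batch_size
        self.pos = 0

    def next_batch(self):
        # Returns up to batch_size items; an empty list signals exhaustion.
        batch = self.data[self.pos:self.pos + self.batch_size]
        self.pos += len(batch)
        return batch
```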
We have shortcuts for selecting matches (0 - 9), inverting selection (i), submitting changes (space)
It'd be good to make this configurable.
It would be something like the configuration screens we see in games.
The difference I want to make in the UI is to use the UI itself rather than text. I'll paste a mockup later.
and make it trainable
Provide a .plot() function that leverages mermaid.js and keras.summary, paddle.summary, and torchinfo.summary (PyTorch has no native summary function; this is a 3rd-party package, and inspired by it we have already reimplemented our own similar thing in Tailor). This will provide the same look & feel as in Jina Core.
pip install . does not install the requirements stated in requirements.txt.
i.e.
from finetuner import fit
fit(..., to_embedding_model=True)
This will call tailor + tuner.