jina-ai / finetuner
:dart: Task-oriented embedding tuning for BERT, CLIP, etc.
Home Page: https://finetuner.jina.ai
License: Apache License 2.0
allow CLIP model fine-tuning
Currently the implementation for Paddle and PyTorch is error-prone: we get names from the model, apply hooks to its modules, and then zip(names, summary) based on the assumption that the two sequences are identical. A better choice is to rename the module in the hook directly.
pip install . does not install the requirements stated in requirements.txt.
The next_batch function checks this condition, but doesn't take into account that there might already be requests in progress, because of its async nature.
There is no reason we should bind Euclidean distance to TripletLayer and cosine to CosineLayer - there should be a parameter in the fit() function to allow setting the distance function to use.
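A minimal sketch of what such a parameter could look like. The `distance` argument, the `DISTANCES` registry and the `fit` body below are assumptions for illustration, not the actual finetuner API:

```python
import math

def euclidean(u, v):
    # straight-line distance between two vectors
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine(u, v):
    # 1 - cosine similarity, so 0.0 means identical direction
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

DISTANCES = {'euclidean': euclidean, 'cosine': cosine}

def fit(embed_model, train_data, distance='euclidean', **kwargs):
    # resolve the distance once; the head layer would receive dist_fn
    # instead of hard-coding Euclidean or cosine internally
    dist_fn = DISTANCES[distance]
    return dist_fn  # placeholder: real code would build the head layer here
```

With this shape, TripletLayer and CosineLayer would no longer each own a fixed distance; both would consume whatever `dist_fn` the user selected.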
Add a save button to the frontend and enable the saving of the model via the Flow.
Instead of name/kwargs, keep the default; also add an argument for the learning rate.
This example (adapted from the documentation) will crash:

```python
import torch
import finetuner
from finetuner.toydata import generate_fashion_match

embed_model = torch.nn.Sequential(
    torch.nn.Flatten(),
    torch.nn.Linear(in_features=28 * 28, out_features=128),
    torch.nn.ReLU(),
    torch.nn.Linear(in_features=128, out_features=32),
)

finetuner.fit(embed_model, train_data=generate_fashion_match, interactive=False)
```
```
Traceback (most recent call last):
  File "/home/tadej/projects/finetuner/test.py", line 16, in <module>
    finetuner.fit(embed_model, train_data=generate_fashion_match, interactive=False)
  File "/home/tadej/projects/finetuner/finetuner/__init__.py", line 66, in fit
    return fit(*args, **kwargs)
TypeError: fit() got an unexpected keyword argument 'interactive'
```
This is because of finetuner/finetuner/__init__.py, lines 62 to 65 in 91587d8: the interactive argument should be removed. Will create a PR.

Currently, we have image and text matches tables.
If we want to extend this to more mimetypes, we should make sure that it doesn't get hard to maintain the HTML file as a whole. Refactoring out components for them could be a good way to simplify it.
Write a unit test that checks whether batching vs. non-batching produces the same results for a model that has a BatchNormalization or Dropout layer, for: https://github.com/jina-ai/finetuner/blob/ae8e3990080681a760f465b29c381ffe0e4b0245/finetuner/embedding.py
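A framework-free illustration of why this test matters. The two functions below are toy re-implementations of batch norm, not finetuner code: in inference mode batch norm uses stored running statistics, so batching cannot change the result, while in training mode the statistics come from the current batch, so batched and non-batched runs diverge.

```python
def batchnorm_eval(xs, running_mean, running_var, eps=1e-5):
    # inference mode: each sample's result is independent of batch composition
    return [(x - running_mean) / (running_var + eps) ** 0.5 for x in xs]

def batchnorm_train(xs, eps=1e-5):
    # training mode: mean and variance are computed from the batch itself
    m = sum(xs) / len(xs)
    var = sum((x - m) ** 2 for x in xs) / len(xs)
    return [(x - m) / (var + eps) ** 0.5 for x in xs]

data = [1.0, 2.0, 3.0, 4.0]

# eval mode: one big batch and two small batches agree exactly
full = batchnorm_eval(data, 2.5, 1.25)
split = batchnorm_eval(data[:2], 2.5, 1.25) + batchnorm_eval(data[2:], 2.5, 1.25)
assert full == split

# train mode: batching changes the output, which the unit test must account for
assert batchnorm_train(data) != batchnorm_train(data[:2]) + batchnorm_train(data[2:])
```

The real unit test would therefore put the model in eval/inference mode before comparing batched against non-batched embeddings.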
Following the Keras example in the docs raises a shape incompatibility using Jina 2.1.11, i.e.

```python
from finetuner import fit
fit(..., to_embedding_model=True)
```

This will call tailor + tuner.
Currently the interactive labeler only supports Image and Text. Is there some kind of middle-layer API or similar that I could modify to allow interactive Audio labeling?
I want to label similar speakers based on audio clips.
Thanks
Add bert to the test models (dense, simple_cnn, vgg and lstm).

models.lstm: the fix is to check with something like isinstance(model, torch.nn.Module) - however, this requires care, because we first have to check which frameworks are even installed.
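A sketch of that careful ordering, probing with importlib which frameworks are installed before attempting any isinstance check. The function name and the Keras/Paddle branches are illustrative assumptions:

```python
import importlib.util

def get_framework(model):
    """Return 'torch', 'keras' or 'paddle' for a given model instance.

    Only frameworks that are actually installed are checked, so a
    missing optional dependency never raises ImportError here.
    """
    if importlib.util.find_spec('torch') is not None:
        import torch
        if isinstance(model, torch.nn.Module):
            return 'torch'
    if importlib.util.find_spec('tensorflow') is not None:
        import tensorflow as tf
        if isinstance(model, tf.keras.Model):
            return 'keras'
    if importlib.util.find_spec('paddle') is not None:
        import paddle
        if isinstance(model, paddle.nn.Layer):
            return 'paddle'
    raise ValueError(f'unknown model type: {type(model)!r}')
```

Note the import happens inside each branch, so a user with only one framework installed never pays for the others.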
The working status should log as loading data at the very first stage.

epochs is not passed into the front-end.

Given local image paths under img/, I expect finetuner to automatically fill the datauri / do the conversion.

Provide a .plot() function that leverages mermaid.js, as well as keras.summary, paddle.summary and torchinfo.summary (PyTorch has no native summary function, this is a 3rd-party package; inspired by it, we have already reimplemented our own similar thing in Tailor). This will provide the same look & feel as in Jina Core.
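A minimal sketch of how such a summary could be rendered as a mermaid.js graph definition. The (layer_name, output_shape) pair format is an assumption about what Tailor's summary provides:

```python
def to_mermaid(summary):
    """Turn a list of (layer_name, output_shape) pairs into mermaid.js markup."""
    lines = ['graph TD']
    for i, (name, shape) in enumerate(summary):
        # one node per layer, labeled with its name and output shape
        lines.append(f'    n{i}["{name}<br/>{shape}"]')
        if i > 0:
            # chain the layers in order
            lines.append(f'    n{i - 1} --> n{i}')
    return '\n'.join(lines)

chart = to_mermaid([('Flatten', (784,)), ('Linear', (128,)), ('ReLU', (128,))])
```

The resulting string can be dropped into any mermaid-enabled renderer (docs pages, the labeler frontend) without extra dependencies on the Python side.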
and make it trainable
Use labeler UI to mimic user interactions, make sure that it's working end-to-end.
This will give us more confidence that the changes we make don't break existing functionality.
I'm thinking about using https://www.cypress.io/ for this.
It would be useful to implement checkpointing, and related with this, to allow users to resume training from the latest checkpoint.
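A minimal sketch of such checkpointing, assuming framework-agnostic state dicts; all names here are hypothetical, and a real implementation would serialize with the framework's own utilities rather than pickle:

```python
import os
import pickle

def save_checkpoint(path, epoch, model_state, optimizer_state):
    # write to a temp file first, then rename: a crash mid-save can
    # never leave a half-written "latest" checkpoint behind
    tmp = path + '.tmp'
    with open(tmp, 'wb') as f:
        pickle.dump(
            {'epoch': epoch, 'model': model_state, 'optimizer': optimizer_state}, f
        )
    os.replace(tmp, path)

def load_checkpoint(path):
    # returning None signals "no checkpoint yet, start from epoch 0"
    if not os.path.exists(path):
        return None
    with open(path, 'rb') as f:
        return pickle.load(f)
```

Resuming then becomes: load the checkpoint if present, restore model and optimizer state, and continue the epoch loop from `epoch + 1`.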
The task is to figure out how to use the Labeler in Google Colab.
We have shortcuts for selecting matches (0 - 9), inverting the selection (i) and submitting changes (space).
It'd be good to make these configurable - something like the key-binding configuration screens you see in games.
The difference I want to make in the UI is to use the UI itself rather than text. I'll paste a mockup later.
Have an evaluation function with the following interface:

```python
def get_evaluation(
    queries: DocumentSequence,
    catalog: Union[DocumentArray, DocumentArrayMemmap],
    model: AnyDNN,
    metrics: List[str] = None,
):
```

where queries are the to-be-scored Documents with either positive results as matches or class information, catalog holds the potential results, model is any model, and metrics gives the names of the metrics to compute, where None means all. The evaluation function should compute the embeddings and compute/output the requested metrics.
Furthermore, we should have a demo evaluation on the FMNIST dataset as an integration test (potentially sub-sampled for faster execution).
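The ranking metrics themselves are framework-independent. A sketch of two candidates, working on plain id lists; the metric names and the registry are assumptions about what the `metrics` parameter might accept:

```python
def precision_at_k(retrieved, relevant, k):
    # fraction of the top-k retrieved ids that are in the relevant set
    return sum(1 for doc_id in retrieved[:k] if doc_id in relevant) / k

def average_precision(retrieved, relevant):
    # mean of precision@rank over the ranks where a relevant id appears
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

# a registry like this would let metrics=None mean "all of them"
METRICS = {'precision@k': precision_at_k, 'average_precision': average_precision}
```

`get_evaluation` would then embed queries and catalog with the model, rank the catalog per query by distance, and feed the ranked ids into these functions.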
Refactor the pipeline as suggested in Tadej's slides.
Below, "new" refers to a dataset without tuples/triplets; "old" refers to the user manually creating tuples/triplets.

- Dataset: support new- and old-style datasets. - @tadejsv
- Dataset. - @tadejsv

We start from PyTorch, then go to Paddle and Keras, and finish by making the abstraction classes.
Say I want to quickly label many examples of a single class. Is there a way I can use a seed example of that class and a combination of the nearest-neighbour technique and my input to quickly label several hundred?
At the very least, learning rate should be customizable.
We should also consider supporting multiple optimizers, e.g. Adam (which should be the default, as it is the most commonly used one), AdamW (used for transformers) and SGD, and setting their parameters.
The best way to do this is not clear: either we offer a fixed selection of optimizers that users can choose from, and implement it in all frameworks, or we allow passing an optimizer object.
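A sketch covering both options at once: accept either a name from a small registry or a ready-made optimizer object. The function name, defaults and PyTorch-only registry are assumptions; a real implementation would need one registry per framework:

```python
def get_optimizer(optimizer, params, learning_rate=1e-3, **kwargs):
    # an already-constructed optimizer object passes through untouched
    if not isinstance(optimizer, str):
        return optimizer
    # otherwise resolve the name from a fixed selection (PyTorch shown)
    import torch
    registry = {
        'adam': torch.optim.Adam,    # sensible default, most commonly used
        'adamw': torch.optim.AdamW,  # common for transformers
        'sgd': torch.optim.SGD,
    }
    return registry[optimizer.lower()](params, lr=learning_rate, **kwargs)
```

This keeps the simple case simple (`optimizer='adam'`, tweak `learning_rate`) while still letting power users bring their own configured optimizer.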
Pokedex front-to-back search should work out of the box.
We want an evaluation pipeline that measures the quality of the embeddings.
Hi there!
I am trying to tailor a vision transformer (DINO), adding a projection layer on top of the whole transformer architecture.
When I run finetuner.tailor.to_embedding_model(model, output_dim=100, input_size=(3, 224, 224)) I get the following error.
I get the model from vits8 = torch.hub.load('facebookresearch/dino:main', 'dino_vits8').

Code snippet to reproduce the error:

```python
import torch
import finetuner

vits8 = torch.hub.load('facebookresearch/dino:main', 'dino_vits8')
finetuner.tailor.to_embedding_model(vits8, output_dim=100, input_size=(3, 224, 224))
```
It should cancel the entire training instead.
Takes input embeddings and labels, and outputs a list of tuples/triplets.
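A brute-force sketch of such a component, working purely from labels over sample indices. The function name is made up, and a real miner would also use the embeddings themselves, e.g. for hard-negative selection:

```python
def build_triplets(labels):
    # every (anchor, positive, negative) index combination where the
    # positive shares the anchor's label and the negative does not
    triplets = []
    for a, la in enumerate(labels):
        for p, lp in enumerate(labels):
            if p != a and lp == la:
                for n, ln in enumerate(labels):
                    if ln != la:
                        triplets.append((a, p, n))
    return triplets
```

The cubic enumeration is fine for illustration; in practice one would sample triplets per batch rather than materialize all of them.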
Tailor unit testing is in good shape; the same is not true for tuner and labeler.
We have very few docstrings right now - it would be good to have more.
Currently the labeler does the following:

```javascript
let end_idx = app.labeler_config.start_idx + (app.labeler_config.example_per_view - app.cur_batch.length)
if (end_idx === app.labeler_config.start_idx) {
    return
}
let start_idx = app.labeler_config.start_idx
app.labeler_config.start_idx = end_idx
```

Thus, when I only have 2 queries but a bigger catalog, I only get labelable data once.
In my opinion, choosing new data should be done in the backend and not in the frontend. An implementation could look like:
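A hypothetical backend-side version of that logic (all names made up for illustration): the server owns the cursor and hands out the next unlabeled slice, so the frontend keeps receiving data as long as any remains.

```python
class BatchServer:
    def __init__(self, total, per_view):
        self.total = total        # size of the labelable pool
        self.per_view = per_view  # examples shown per labeling round
        self.cursor = 0           # progress lives in the backend, not the UI

    def next_batch(self):
        # next slice of indices, or [] once the pool is exhausted
        if self.cursor >= self.total:
            return []
        end = min(self.cursor + self.per_view, self.total)
        batch = list(range(self.cursor, end))
        self.cursor = end
        return batch
```

The frontend then only renders whatever the endpoint returns, with no start_idx bookkeeping of its own.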
Pre-processing should be done asynchronously; this can prevent it from becoming a bottleneck.