
Overview of Reproduced Projects

Hi, I am Muyi Bao


English | 简体中文


This repository gives an overview of the projects I have reproduced, along with my thoughts on these projects and papers. My code usually has very detailed comments (partly because I am also a newcomer and need detailed comments to understand the code myself). For some papers that I have read but did not reproduce, I may also write something in this repository.

1.Capsule Network 2023/11

The idea of the Capsule Network is very novel and interesting.

1. It changes the commonly used scalars into vectors (the paper regards the activations normally used in CNNs as scalars, whereas here each entity is represented by a vector) and accordingly proposes an algorithm, Dynamic Routing. In my opinion, Dynamic Routing is powerful for feature extraction; at the very least it gives a new idea for extracting features. A minimal sketch is given a few lines below.

2. It builds everything around the idea of capsules.

However, training CapsNet is costly. Additionally, compared with today's models, CapsNet struggles to generalize to more complex datasets; it is very hard for it to deal with them.
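
To make the routing idea concrete, here is a minimal sketch of the squash non-linearity and the routing iterations, written from my understanding of the paper rather than taken from the reproduced code:

```python
import torch

def squash(s, dim=-1, eps=1e-8):
    # Keep the direction of the capsule vector and map its length into (0, 1)
    norm_sq = (s ** 2).sum(dim=dim, keepdim=True)
    return (norm_sq / (1.0 + norm_sq)) * s / torch.sqrt(norm_sq + eps)

def dynamic_routing(u_hat, num_iters=3):
    # u_hat: prediction vectors, shape (batch, in_caps, out_caps, out_dim)
    b = torch.zeros(u_hat.shape[:3], device=u_hat.device)       # routing logits
    for _ in range(num_iters):
        c = torch.softmax(b, dim=2)                              # coupling coefficients over output capsules
        s = (c.unsqueeze(-1) * u_hat).sum(dim=1)                 # weighted sum over input capsules
        v = squash(s)                                            # output capsules: (batch, out_caps, out_dim)
        b = b + (u_hat * v.unsqueeze(1)).sum(dim=-1)             # agreement updates the logits
    return v
```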

I referred to this repository when writing the code.

Paper: Dynamic Routing Between Capsules

Architecture:

Model

Dynamic Routing Algorithm:

Model
2.U-Net 2024/4/4

U-Net is used for segmentation tasks. The architecture is relatively simple, which makes it suitable for beginners who want to start learning how to handle segmentation.

It was first used in the medical field. I have seen an explanation that, because the structure of medical images is relatively constrained, a relatively shallow model may work better.
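
As a toy illustration of the encoder-decoder pattern with skip connections (my own simplified sketch, not the reproduced model):

```python
import torch
import torch.nn as nn

def double_conv(in_ch, out_ch):
    # Two 3x3 convolutions with ReLU, the basic U-Net building block
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
    )

class TinyUNet(nn.Module):
    # One downsampling level only, just to show the skip-connection pattern
    def __init__(self, in_ch=1, num_classes=2):
        super().__init__()
        self.enc = double_conv(in_ch, 64)
        self.down = nn.MaxPool2d(2)
        self.bottom = double_conv(64, 128)
        self.up = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.dec = double_conv(128, 64)             # 128 = 64 (upsampled) + 64 (skip)
        self.head = nn.Conv2d(64, num_classes, 1)   # per-pixel class logits

    def forward(self, x):
        e = self.enc(x)
        b = self.bottom(self.down(e))
        d = self.dec(torch.cat([self.up(b), e], dim=1))  # concatenate the skip connection
        return self.head(d)
```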

Paper: U-Net-Based medical image segmentation

Architecture:

Model
3.Learning without forgetting 2024/4/18

Learning without Forgetting (LwF) deals with continual learning in classification tasks. Some papers regard it as the first paper to systematically define continual learning (CL). In my opinion, it indeed gives a lot of insight into CL.

As for its methodology, it can be regarded as the simplest way to apply Knowledge Distillation (KD) to CL. This project is very suitable for beginners who want to learn continual learning with KD.

Additionally, its CL setting learns one class of a dataset at a time. Taking the CUB-200 dataset as an example, it learns one category at a time, whereas we might normally expect it to learn all categories of a dataset at once.
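
To make the KD-into-CL idea above concrete, here is a minimal sketch of the LwF-style loss as I understand it (my own simplification, not the author's code): the new-task head is trained with cross-entropy, while the old-task head is kept close to the outputs recorded from the frozen old model via temperature-scaled distillation.

```python
import torch
import torch.nn.functional as F

def lwf_loss(new_logits, new_labels, old_logits, old_logits_recorded, T=2.0, lambda_old=1.0):
    # new_logits:           outputs of the new-task head for the current batch
    # old_logits:           outputs of the old-task head of the model being trained
    # old_logits_recorded:  outputs of the frozen old model on the same batch (recorded beforehand)
    ce = F.cross_entropy(new_logits, new_labels)
    # Knowledge distillation: soften both distributions with temperature T
    log_p_new = F.log_softmax(old_logits / T, dim=1)
    p_old = F.softmax(old_logits_recorded / T, dim=1)
    kd = F.kl_div(log_p_new, p_old, reduction="batchmean") * (T * T)
    return ce + lambda_old * kd
```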

I give very detailed comments in this project. I referred to this project, but my implementation is different. I am not sure which one is better, but I think my code is very clear.

Paper: Learning without Forgetting

Original Repository: here

Architecture:

Model

Algorithm:

Model
4.Transformer 2024/4/25

There are a lot of papers and repositories explaining it, and I also needed to learn from these insights.

The reason I learned this is that in 2021 the Transformer was applied to computer vision (Vision Transformer, ViT). Therefore, I first learned the Transformer, which was originally designed for NLP.

I learned the Transformer from this blog, which offers a very detailed explanation.

I referred to this repository's code when writing mine. I added many detailed explanations and restructured the code skeleton so that it is easier for newcomers (and for myself) to learn, and then to understand what the source code is doing.
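
Since the resources above are mainly about the overall code structure, here is a minimal sketch of the core computation, scaled dot-product attention, written from the paper's formula rather than taken from any of the referenced code:

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: (batch, heads, seq_len, d_k); mask: broadcastable to the score shape
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))    # (batch, heads, seq_q, seq_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))   # block illegal positions
    attn = torch.softmax(scores, dim=-1)
    return attn @ v                                             # weighted sum of the values
```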

Paper: Attention is all you need

The architecture:

Model
5.Vision Transformer 2024/5/5

In 2021, a team applied an almost unchanged Transformer to image classification, which showed that the Transformer, originally used in NLP, can also be used in computer vision. This was a huge step forward for the vision field; many records have since been broken by Transformer-based models. It proved that the Transformer can be used in CV, and that at scale it can even perform better. A lot of follow-up work has been built on this.

If you can write the code of the Transformer, the Vision Transformer (ViT) is easy for you too, because there is no decoder; the main new part is turning the image into a sequence of patch embeddings.
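
A minimal sketch of that patch embedding step (my own illustration, using the common trick of implementing it as a strided convolution):

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    # Split a (B, C, H, W) image into non-overlapping patches and project each to embed_dim
    def __init__(self, img_size=224, patch_size=16, in_ch=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        self.proj = nn.Conv2d(in_ch, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        x = self.proj(x)                     # (B, embed_dim, H/ps, W/ps)
        return x.flatten(2).transpose(1, 2)  # (B, num_patches, embed_dim), ready for the encoder

x = torch.randn(2, 3, 224, 224)
print(PatchEmbedding()(x).shape)             # torch.Size([2, 196, 768])
```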

I learned ViT through this Bilibili video, this one, and this blog.

For the code, I referred to this Bilibili video, this repository, and the official repository.

Paper: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

The architecture:

Model
6.Swin Transformer 2024/5/11

Swin Transformer is a work built on the Vision Transformer (ViT) that addresses the problems of large image resolution and high computational complexity by computing self-attention within local windows that are shifted between layers. It is almost a landmark work, breaking records in countless computer vision tasks, and it proves that the Swin Transformer can be used as a general backbone in CV.
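
The window partition and its inverse are at the heart of that trick; here is a short sketch following the official implementation's window_partition / window_reverse idea (shapes are my own illustration):

```python
import torch

def window_partition(x, window_size):
    # x: (B, H, W, C) -> (num_windows * B, window_size, window_size, C)
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size, window_size, C)

def window_reverse(windows, window_size, H, W):
    # Inverse of window_partition: back to (B, H, W, C)
    B = windows.shape[0] // ((H // window_size) * (W // window_size))
    x = windows.reshape(B, H // window_size, W // window_size, window_size, window_size, -1)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, -1)

x = torch.randn(1, 56, 56, 96)
w = window_partition(x, 7)      # (64, 7, 7, 96): attention is computed within each 7x7 window
print(torch.allclose(window_reverse(w, 7, 56, 56), x))  # True
```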

Its code is very good, and I learned a lot from it. I encourage everyone to reproduce this code; it will give you a lot of insight and improve your coding ability.

The paper and many other resources say that it is better to pre-train. I simply trained the Swin Transformer on Food-101 (just as a simple experiment) and found three issues: 1) it is very hard to train, requiring a huge computational cost (before this, I had only trained CNNs rather than Transformer-based models); 2) training the network from scratch gives poor initial results; 3) hyperparameters such as the learning rate are very important. These are all my own findings and may be wrong.

The sources I referred to: a Bilibili video explaining the paper, a Bilibili video explaining the code, a CSDN blog explaining the Swin Transformer, and a CSDN blog introducing DropPath (it was my first time seeing this).

Original paper: Swin transformer: Hierarchical vision transformer using shifted windows

Official repository: here

Model
7.Unet-Transformer (UNETR) 2024/5/12

Based on the Vision Transformer (ViT), this paper proposes UNEt-TRansformer (UNETR), which is used for 3D medical images. The whole architecture is like U-Net, with the encoder replaced by a ViT.

This was my first time seeing how to deal with 3D images. Dealing with 3D is quite different; you normally use torch.nn.Conv3d. The biggest difference is the tensor shape: a 3D image tensor looks like (batch_size, channels, depth (frames), height, width). Taking video as an example, 10 videos, each consisting of 20 frames of RGB images (3 channels) at 224*224 pixels, would be (10, 3, 20, 224, 224).
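
A tiny example to confirm the 5D shape convention (my own illustration):

```python
import torch
import torch.nn as nn

# 10 videos, 3 channels (RGB), 20 frames, 224 x 224 pixels
x = torch.randn(10, 3, 20, 224, 224)          # (batch, channels, depth/frames, height, width)
conv = nn.Conv3d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
print(conv(x).shape)                           # torch.Size([10, 16, 20, 224, 224])
```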

There is also a follow-up work based on this one and the Swin Transformer, named Swin UNETR, which should be very similar.

The code in the official repository uses the MONAI library, which provides ready-made building blocks for medical imaging research. In my code, however, I used the ViT code I reproduced myself to build UNETR.

I think that if you have implemented ViT, or if you want to use the MONAI library, implementing UNETR is not hard.

Training such a Transformer-based network is computationally expensive. I used my computer (CPU only) to run a forward pass with input size (2, 1, 128, 128, 128), which took about one minute. Without a good GPU, it is very hard to get results. This was also my first time getting an intuitive sense of how much computing resource a Transformer consumes.

Original paper: Unetr: Transformers for 3d medical image segmentation

Official repository: here

Referred repository: here

Model
8.Mamba 2024/6/22

From the perspective of results and performance, Mamba seems able to shake the Transformer's position. Mamba can slightly outperform the Transformer while computing much faster, and it looks like a possible substitute. One disadvantage of the Transformer is that the time complexity of self-attention is O(n^2) in sequence length; as models get bigger, the problem gets worse. Mamba, however, is O(n), which addresses this problem well.

Another point is that the Transformer's self-attention mechanism is not really derived from any theory; it feels like a patchwork of modules (although it does make intuitive sense). Mamba, on the other hand, is grounded in state space model (SSM) theory, which I learned in Year 3 of my undergraduate studies. This gives Mamba higher interpretability. To some extent, Mamba is very similar in spirit to RNN/LSTM: it is a kind of forward flow, from the previous hidden state and the current input to the next state.
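
To show what I mean by this forward flow, here is a toy sketch of a discretized SSM recurrence (a conceptual simplification; real Mamba makes the discretized matrices input-dependent and uses a parallel selective scan instead of a Python loop):

```python
import torch

def ssm_scan(x, A_bar, B_bar, C):
    # x:     (batch, seq_len, d_in)   input sequence
    # A_bar: (d_in, d_state)          discretized state transition (diagonal, per channel)
    # B_bar: (d_in, d_state)          discretized input matrix
    # C:     (d_in, d_state)          output read-out matrix
    B, L, D = x.shape
    h = torch.zeros(B, D, A_bar.shape[-1], device=x.device)     # hidden state h_0 = 0
    ys = []
    for t in range(L):                                          # sequential, RNN-like recurrence
        h = A_bar * h + B_bar * x[:, t].unsqueeze(-1)           # h_t = A_bar * h_{t-1} + B_bar * x_t
        ys.append((h * C).sum(-1))                              # y_t = C . h_t  -> (batch, d_in)
    return torch.stack(ys, dim=1)                               # (batch, seq_len, d_in)
```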

In short, I think Mamba has a lot of advantages; it already does better than the Transformer right at its birth, and its emergence is expected to greatly promote the development of the field. At the very least, using the idea of SSMs is great.

The Mamba paper is very abstract. Fortunately, many blogs and videos try to explain it, which gave me a lot of insight.

Paper: Mamba: Linear-Time Sequence Modeling with Selective State Spaces

Official Repository: here

I recommend this CSDN blog.

I recommend these Bilibili videos: 1, 2, 3, 4 and 5. After watching them, I gained a lot of insight and understood what Mamba is.

Although there are a lot of materials explaining what Mamba is, I think the code and architecture of Mamba are still not very clear, and these materials do not focus on the code. However, I found this repository, which provides a minimal implementation. After reading this code, I basically understood what the Mamba code does. In my reproduced code, I give detailed comments to explain each part.

mamba_minimal.py is from the repository mentioned above.

mamba_minimal_muyi.py is my reproduction, with detailed comments added.

mamba_main is the official full implementation, to which I added some comments.

I put some important pictures here:

The whole architecture demo:

Model

The formulas for Δ, A, B, C, D:

Model

The algorithm for SSM:

Model

The Mamba block architecture:

Model
9.Vision Mamba(Vim) 2024/6/25

Very much like the relationship between the Transformer and the Vision Transformer, Vision Mamba (Vim) applies the same idea to Mamba. Vim has the potential to become a new universal backbone for the CV field; its performance and speed are higher than the Transformer's.

In addition, I have an idea: since Mamba can process very long sequences (on the order of millions of tokens), an image is unlikely to reach millions of patches no matter how many pixels it has. Therefore, Vim should not forget too much of the earlier patch content when processing images (note that Vim is a sequential model), so treating images as sequence data should not reduce performance. The Vision Transformer does not have this issue because it is parallelized and every patch is computed at the same time.

Vision Mamba has two major innovations:

1. Using Mamba in the computer vision field.

2. Using a bidirectional SSM, which has led to a lot of similar works.

I only watched this Bilibili video. I came across Vim while learning Mamba. It is not too hard, because the relationship is very similar to that between the Transformer and the Vision Transformer.

The official repository is here.

The paper: Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model

As for the code, I did not find any code that really helps people understand it. In my reproduced code, I made a toy version (a very simple one, similar to mamba_minimal). I also added very detailed comments to the source code of Vision Mamba. In the source code, I found something that seems wrong: when conducting the bidirectional SSM, it uses two Vim blocks, one for the forward direction and one for the backward direction. This is not consistent with the architecture described in the paper's figure. I show the real architecture from the code below.
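
As a conceptual sketch of how a bidirectional scan can be written (my own toy, not the authors' implementation, which as noted above uses two separate Vim blocks): the backward branch flips the patch sequence, runs its own left-to-right scan, and flips the result back before the two directions are merged.

```python
def bidirectional_scan(x, ssm_fwd, ssm_bwd):
    # x: (batch, num_patches, dim)
    # ssm_fwd / ssm_bwd: any left-to-right sequence scan with its own parameters,
    # e.g. the toy ssm_scan sketched in the Mamba section above
    out_fwd = ssm_fwd(x)                  # forward scan over the patch sequence
    out_bwd = ssm_bwd(x.flip(1))          # backward scan: flip the patches, then scan
    return out_fwd + out_bwd.flip(1)      # flip back so positions align, then merge
```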

The Vision Mamba architecture:

Model

The real Vim architecture in code:

Model

The Vision Mamba algorithm:

Model
10.SegMamba 2024/7/3

Contributions:

  • It uses the U-Net architecture.
  • The first layer is a stem convolution with a kernel size of 7 * 7 * 7, padding of 3 * 3 * 3 and stride of 2 * 2 * 2. In the first paragraph, the paper mentions that some works find it useful to enlarge the receptive field with large kernels to extract long-range information from high-resolution 3D images.
    • Actually, this stem convolutional layer is similar to patch embedding, but it seemingly is not very suitable.
  • The Mamba block is replaced by the TSMamba block, as shown in Fig. 2.
  • The decoder is CNN-based.

As for code:

  • It rewrites Mamba, but I think there is an error in how nslices handles the inter-slice direction (see the small numeric check after this list).
    • xz: [B, L, D], and nslices is set to [64, 32, 16, 8].
    • For example, if xz has 35 tokens indexed [0, 1, 2, ..., 34] and nslice = 5, then after the reordering xz becomes [0, 7, 14, 21, 28, 1, 8, ...].
    • This means interval = total token num / nslice, so nslice = total token num / interval = (H * W * D) / (H * W) = D.
    • Therefore, we should set nslices to D rather than to fixed numbers.
  • Compared with the code of U-Mamba, VM-UNet and nnMamba, this code is relatively simple.
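
A small numeric check of the reordering described above (my own illustration of the indexing, not the SegMamba code itself): with 35 tokens and nslice = 5 the interval is 7, and the reordering only steps through whole slices when the interval equals H * W, i.e. when nslice = D.

```python
import torch

L, nslice = 35, 5
interval = L // nslice                       # 7 tokens between neighbours after reordering
idx = torch.arange(L).view(nslice, interval).t().reshape(-1)
print(idx[:8].tolist())                      # [0, 7, 14, 21, 28, 1, 8, 15]
# For a flattened 3D volume with L = H * W * D tokens, neighbours in the reordered
# sequence are interval = L / nslice tokens apart; they only correspond to the same
# (h, w) position in adjacent slices if interval = H * W, i.e. nslice = D.
```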

I have seen several papers about vision Mamba and medical image segmentation. While reading SegMamba, I also looked at U-Mamba, nnMamba and VM-UNet. Except for VM-UNet, these papers do not use patch embedding and instead use a stem convolution. I guess it may be better for Mamba to use patches of small size.

The Paper: SegMamba: Long-range Sequential Modeling Mamba For 3D Medical Image Segmentation

The official repository: Here

Model
11.UltraLight VM-UNet 2024/7/9

Contributions:

  • The biggest contribution of this work is the lightweight model. Compared with LightM-UNet (or see this repository), it has 87% fewer parameters, with only 0.049M parameters and 0.06 GFLOPs. The proposed PVM Layer is a plug-and-play module, which is very good.
  • The overall architecture is a U-Net, with max pooling as the downsampling layer. The encoder uses 3 Conv Blocks followed by 3 PVM Layers. The decoder is symmetric, also with 3 convolution layers and 3 PVM Layers. The skip connections in the middle use the same SAB and CAB (spatial attention bridge and channel attention bridge) as H-VMamba (see my repository).
    • Encoder part: 6 layers in total; the first three are Conv Layers and the last three are PVM Layers.
    • The connection part consists of SAB and CAB with shared parameters.
      • SAB(x) = x + x * Conv2d(k=7)([MaxPool(x); AvgPool(x)])
      • CAB(x) = x + x * Sigmoid(FC(GAP(x)))
    • Decoder part: symmetric with the encoder, consisting of 3 Conv layers and 3 PVM Layers.
  • PVM Layer:
    • The core idea is shown in Fig. 3: the channels are divided into four parts and a Mamba operation is performed on each part (from the code, it is the same Mamba for every channel group), which saves a lot of parameters; the group outputs are finally concatenated back together. A minimal sketch is given after this list.
    • In Fig. 4, which I did not include here, if Mamba is applied directly to all C channels and requires x parameters, then applying Mamba twice to C/2 channels requires only 2 * 0.251x (the two C/2 groups use separate Mambas), and applying it to four groups of C/4 channels requires only 4 * 0.063x parameters.
    • Overall it is very simple, with very few parameters, and the results are not bad, although not the best everywhere: on ISIC2017 DSC and SE are the best, on PH^2 all metrics are the best, and on ISIC2018 DSC and ACC are the best.
  • Implementation details from the code:
    • First, regarding the implementation of CAB: in Fig. 2, CAB actually spans the stages, which I had not seen before. The outputs of the 6 stages are concatenated together and then mapped back to their respective dimensions through corresponding linear layers, so the information of every stage is actually combined here.
    • As for the skip connections, Fig. 2 suggests every stage goes through SAB and CAB, but in fact it does not. According to the code, stage 6 does not go through SAB and CAB, and does not even have a skip connection; it acts a bit like a bottleneck. This is definitely not a code error, because CAB, as mentioned above, combines all stages together, yet the code actually only combines the first 5 stages.
    • Max pooling with stride=2 and kernel size=2 is used for downsampling.
    • All encoder convolutions use kernel size=3, stride=1, padding=1.
    • The last convolution of the decoder is actually a segmentation head that outputs num_class channels with kernel size=1. The other two decoder convolutions use kernel size=3, stride=1, padding=1.
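
A minimal sketch of the PVM Layer idea mentioned above (my own simplification: split the channels into groups, run a Mamba over each group, optionally sharing the same Mamba, and concatenate the results; the real layer also includes things like normalization and residual scaling, which I omit, and an nn.Linear stands in for the Mamba block so the sketch runs without the mamba package):

```python
import torch
import torch.nn as nn

class ToyPVMLayer(nn.Module):
    # num_groups and share_mamba mirror the two extra hyperparameters I added in my code
    def __init__(self, dim, num_groups=4, share_mamba=True):
        super().__init__()
        assert dim % num_groups == 0
        group_dim = dim // num_groups
        # nn.Linear is a placeholder for a Mamba block operating on group_dim channels
        if share_mamba:
            block = nn.Linear(group_dim, group_dim)
            self.blocks = nn.ModuleList([block] * num_groups)    # the same block reused for every group
        else:
            self.blocks = nn.ModuleList([nn.Linear(group_dim, group_dim) for _ in range(num_groups)])

    def forward(self, x):                              # x: (B, L, C), L = H * W tokens
        chunks = x.chunk(len(self.blocks), dim=-1)     # split the channels into groups of C/4
        outs = [blk(c) for blk, c in zip(self.blocks, chunks)]
        return torch.cat(outs, dim=-1)                 # concatenate back to (B, L, C)
```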

As for my code, I added two more hyperparameters to control the number of channel groups and whether the same Mamba is shared across groups. I also let stage 6 pass through the CAB and SAB. This code is very clear and easy to read.

Datasets:

- ISIC2017
- ISIC2018
- PH^2, used for external validation

The paper, published on 2024/3/29: UltraLight VM-UNet: Parallel Vision Mamba Significantly Reduces Parameters for Skin Lesion Segmentation

The official repository: Here

Model
