
adaptive_teacher's People

Contributors

hnanacc, xiaoliangdai, yujheli

adaptive_teacher's Issues

How to open the output file

I don't know the format of the output file event.out......... Can anyone help me open the file and see the values in it?

Why does your DA-FRCNN implementation use the multi-scale training trick?

Thanks for your work. I recently ran into another question about the input image scale.

As far as I know, the input min scale should be 600 for FRCNN-based DAOD frameworks, as shown in https://github.com/krumo/Domain-Adaptive-Faster-RCNN-PyTorch/blob/df0488405a7679552bc2504b973e29178c141b26/configs/da_faster_rcnn/e2e_da_faster_rcnn_R_50_C4_cityscapes_to_foggy_cityscapes.yaml#L24

But it seems that AT uses multi-scale training in all configs:

MIN_SIZE_TRAIN: (640, 672, 704, 736, 768, 800)

Is it possible to wrap the teacher model in DistributedDataParallel?

Hello, I'm trying to use your idea in my thesis work. Thanks for the great idea and code!
I set requires_grad=False for all the parameters in the teacher model and wrapped it in DistributedDataParallel.
But with my own code the training gets stuck at loss.backward(); the losses are not NaN.
If I lower the batch size and run with just 1 GPU, the code works fine, but with DistributedDataParallel the training gets stuck immediately.

Would you have an idea about it? Is it because the exponential moving average somehow affects the computation graph?
Thanks
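
For context, a minimal sketch (my own, not the repo's exact code) of how an EMA teacher is usually kept outside DistributedDataParallel: only the student is wrapped, and the teacher is updated under torch.no_grad(), so it never takes part in a backward pass and should not need DDP at all.

import torch
import torch.nn as nn

@torch.no_grad()
def ema_update(teacher: nn.Module, student: nn.Module, keep_rate: float = 0.9996) -> None:
    # Copy an exponential moving average of the student weights into the teacher.
    # The student may be DDP-wrapped; the teacher is deliberately left unwrapped,
    # since it only produces pseudo-labels and never calls backward().
    student_state = (student.module if hasattr(student, "module") else student).state_dict()
    for key, value in teacher.state_dict().items():
        value.copy_(value * keep_rate + student_state[key] * (1.0 - keep_rate))

If the teacher is only ever run for pseudo-labeling inside torch.no_grad(), leaving it out of DistributedDataParallel may sidestep the hang described above.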

Error while loading pretrained model weights from `detectron2://ImageNetPretrained/MSRA/R-101.pkl` for training with custom dataset

Hello authors, thanks for the code.

I was trying to adapt the model to our custom dataset and faced the following issue.

Traceback (most recent call last):
  File "models/adaptive_teacher/train_net.py", line 84, in <module>
    args=(args,),
  File "/no_backups/s1437/.pyenv/versions/adaptiveteacher/lib/python3.7/site-packages/detectron2/engine/launch.py", line 59, in launch
    daemon=False,
  File "/no_backups/s1437/.pyenv/versions/adaptiveteacher/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/no_backups/s1437/.pyenv/versions/adaptiveteacher/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
    while not context.join():
  File "/no_backups/s1437/.pyenv/versions/adaptiveteacher/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/no_backups/s1437/.pyenv/versions/adaptiveteacher/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/no_backups/s1437/.pyenv/versions/adaptiveteacher/lib/python3.7/site-packages/detectron2/engine/launch.py", line 94, in _distributed_worker
    main_func(*args)
  File "/misc/no_backups/s1437/DA-Multimodal-OD/models/adaptive_teacher/train_net.py", line 69, in main
    trainer.resume_or_load(resume=args.resume)
  File "/misc/no_backups/s1437/DA-Multimodal-OD/models/adaptive_teacher/adapteacher/engine/trainer.py", line 377, in resume_or_load
    self.cfg.MODEL.WEIGHTS, resume=resume
  File "/no_backups/s1437/.pyenv/versions/adaptiveteacher/lib/python3.7/site-packages/fvcore/common/checkpoint.py", line 229, in resume_or_load
    return self.load(path, checkpointables=[])
  File "/no_backups/s1437/.pyenv/versions/adaptiveteacher/lib/python3.7/site-packages/fvcore/common/checkpoint.py", line 158, in load
    incompatible = self._load_model(checkpoint)
  File "/misc/no_backups/s1437/DA-Multimodal-OD/models/adaptive_teacher/adapteacher/checkpoint/detection_checkpoint.py", line 28, in _load_model
    incompatible = self._load_student_model(checkpoint)
  File "/misc/no_backups/s1437/DA-Multimodal-OD/models/adaptive_teacher/adapteacher/checkpoint/detection_checkpoint.py", line 70, in _load_student_model
    self._convert_ndarray_to_tensor(checkpoint_state_dict)
  File "/no_backups/s1437/.pyenv/versions/adaptiveteacher/lib/python3.7/site-packages/fvcore/common/checkpoint.py", line 368, in _convert_ndarray_to_tensor
    for k in list(state_dict.keys()):
AttributeError: 'NoneType' object has no attribute 'keys'

Configuration used:

_BASE_: "./Base-RCNN-C4.yaml"
MODEL:
  META_ARCHITECTURE: "DAobjTwoStagePseudoLabGeneralizedRCNN"
  WEIGHTS: "detectron2://ImageNetPretrained/MSRA/R-101.pkl"
  MASK_ON: False
  RESNETS:
    DEPTH: 101
  PROPOSAL_GENERATOR:
    NAME: "PseudoLabRPN"
  PIXEL_MEAN: [87.0, 91.0, 95.0]
  # RPN:
  #   POSITIVE_FRACTION: 0.25
  ROI_HEADS:
    NAME: "StandardROIHeadsPseudoLab"
    LOSS: "CrossEntropy" # variant: "CrossEntropy"
    NUM_CLASSES: 4 # this doesn't include background.
  ROI_BOX_HEAD:
    NAME: "FastRCNNConvFCHead"
    NUM_FC: 2
    POOLER_RESOLUTION: 7
SOLVER:
  LR_SCHEDULER_NAME: "WarmupTwoStageMultiStepLR"
  STEPS: (60000, 80000, 90000, 360000)
  FACTOR_LIST: (1, 1, 1, 1, 1)
  MAX_ITER: 100000
  IMG_PER_BATCH_LABEL: 4
  IMG_PER_BATCH_UNLABEL: 4
  BASE_LR: 0.04
  CHECKPOINT_PERIOD: 1000
DATALOADER:
  SUP_PERCENT: 100.0
DATASETS:
  CROSS_DATASET: True
  TRAIN_LABEL: ("train_clear_day",) #voc_2012_train
  TRAIN_UNLABEL: ("train_dense_fog_day",) #Clipart1k_train
  TEST: ("test_clear_day",) #Clipart1k_test
SEMISUPNET:
  Trainer: "ateacher"
  BBOX_THRESHOLD: 0.8
  TEACHER_UPDATE_ITER: 1
  BURN_UP_STEP: 20000
  EMA_KEEP_RATE: 0.9996
  UNSUP_LOSS_WEIGHT: 1.0
  SUP_LOSS_WEIGHT: 1.0
  DIS_TYPE: "res4" #["concate","p2","multi"]
TEST:
  EVAL_PERIOD: 1000

Please have a look into it.
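
For anyone hitting this, a quick sanity check (my own sketch, not part of the repo) that the downloaded weights file itself is readable; the traceback above ends with checkpoint['model'] being None, so it helps to first rule out a corrupted or missing file. The local path is a placeholder.

import pickle

# Placeholder path to a local copy of detectron2://ImageNetPretrained/MSRA/R-101.pkl.
with open("R-101.pkl", "rb") as f:
    ckpt = pickle.load(f, encoding="latin1")

# Expect either a flat dict of numpy weight arrays or a dict with a "model" key.
print(type(ckpt))
print(list(ckpt.keys())[:5])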

What is the crop ratio?

There are weak augmentations for the teacher model. I would like to know what the cropping ratio for the weak augmentation is. It makes model inference about 3x faster compared to the base Faster R-CNN model.

About discriminator in the paper

Hi,

Where is the discriminator code that is mentioned in the paper?
Is it just a fully connected classifier layer in the network?

Regarding the Watercolor Dataset Config

Hello, I am wondering whether there is a specific split file we have to run to obtain the VOC split for the watercolor dataset. The burn-in training for watercolor only trains on the 7 overlapping classes that span both VOC and watercolor. Is that part of the code missing, or is it assumed to be done by ourselves? Many thanks.

Question about Figure 4

Thank you for your work. But I have a question about Figure 4 in the main paper.

It seems that with only 10k iterations of source-only pre-training, the model has already achieved around 33.0 mAP, which significantly outperforms the well-trained source-only result (28.8). Does this mean that the detectron2-implemented FRCNN works better?

About VGG16 pre-trained on ImageNet

We found in Section 4.2 of the paper: "ResNet101 [13] or VGG16 [36] pre-trained on ImageNet [7]". However, in adaptive_teacher/configs/faster_rcnn_VGG_cross_city.yaml, VGG16 does not use ImageNet pre-trained parameters the way adaptive_teacher/configs/faster_rcnn_R101_cross_water.yaml does.

We would like to know whether VGG16 is pretrained on ImageNet or not. Thank you very much.

Cannot reproduce the results of "cityscapes" to "foggy cityscapes"

Hi, I ran into some problems while reproducing the "cityscapes" to "foggy cityscapes" results.

During my reproduction, I caught an error, shown in the screenshot below:
(error screenshot)

My environment follows your prerequisites: detectron2=0.3, pytorch=1.7, cuda=11.0, python=3.7. I changed your code as below to solve the error:

From:

evaluator_list.append(COCOEvaluator(dataset_name, output_dir=output_folder))

to:

evaluator_list.append(COCOEvaluator(dataset_name, cfg, True, output_dir=output_folder))

This error was solved by my modification. Is this modification OK? However, besides this error, another error arises when I reach 32979 iterations, shown below:
(error screenshot)

I have changed IMG_PER_BATCH_LABEL/IMG_PER_BATCH_UNLABEL to 8/8, but the error is still there. My GPUs are 16G V100s and the machine has 377G of memory. Due to this error, I cannot reproduce the "cityscapes" to "foggy cityscapes" results. Do you know how to fix it?

Why 2 separate annotation files?

In readme.md, it is instructed to put the gtFine folder into both the clear and foggy data directories. But the foggy images are created from the clear ones, so their annotations should be exactly the same. Indeed, there is only one annotation file in the Cityscapes download link. Are we supposed to put exactly the same gtFine folder in 2 different places?
Or is it structured like this to support training on custom datasets, since unsupervised images don't have to be created synthetically from clean images?
Thanks

Experiments with VGG16 without BN layers?

Hi, thanks for your awesome project!

When I dove into the details of adaptive_teacher, I found that the vgg16 backbone has BN layers by default.

self.vgg = make_layers(cfgs['vgg16'],batch_norm=True)

As far as I know, the standard vgg16 backbone does not contain BN layers; adding them improves the cross-domain detection baseline and makes the comparison unfair, as previous works use a plain vgg16 backbone without BN layers. I also observed that the proposed method outperforms previous methods by a large margin. Did the authors conduct experiments with vgg16 without BN layers?

Thanks a lot!
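
For reference, a BN-free, ImageNet-initialised VGG16 trunk can be built directly from torchvision; this is only a sketch of the ablation being asked about, not code from this repository.

import torch.nn as nn
from torchvision.models import vgg16  # plain VGG16, i.e. the variant without BN layers

def build_vgg16_trunk_no_bn() -> nn.Module:
    # ImageNet-pretrained VGG16 without batch norm, keeping only the convolutional
    # trunk (the final max-pool is dropped, as is common for Faster R-CNN backbones).
    features = vgg16(pretrained=True).features
    return nn.Sequential(*list(features.children())[:-1])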

Loss becomes NaN

I conducted a reproduction experiment of domain adaptation from PASCAL VOC to Clipart1k,
but the loss became NaN during training.

I am using 4 RTX 2080 Ti GPUs and changed the parameters as follows.

IMG_PER_BATCH_LABEL: 16 -> 4
IMG_PER_BATCH_UNLABEL: 16 -> 4
BASE_LR: 0.04 -> 0.01
MAX_ITER: 100000 -> 400000
BURN_UP_STEP: 20000 -> 80000

Is there a good solution?
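
For what it's worth, the reduced settings above keep the learning rate proportional to the total batch size (the usual linear scaling heuristic); a quick check of the arithmetic, which is mine and not guidance from the authors:

# Reference config: batch size 16 with BASE_LR 0.04; reduced run: batch size 4.
ref_batch, ref_lr = 16, 0.04
new_batch = 4
print(ref_lr * new_batch / ref_batch)  # 0.01, matching the BASE_LR listed above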

Question about selecting the best model (validation or post-last-step?)

Hi, I have been reading your paper and code,
and I am confused about how the best model over the entire training process is selected.

This is how I understood the training code:

  1. Model training (both the burn-in and mutual learning stages) is performed on the training data.
  2. Model weights are saved every 5000 steps by hooks.PeriodicCheckpointer.
  3. After the last training step is finished (MAX_ITER reached), the resulting weights are used for evaluation.

Please correct me if I am wrong.

My questions are:
a. Should I take the model weights after the last training step as the final weights for future inference?
b. It seems that no validation loss/metric is calculated in the code, but the paper shows a plot of validation mAP (Figure 4).
Are the metrics reported in the paper computed with the weights from the last training step, or with weights selected on a validation set?
c. Is there a model selection step based on a validation loss/metric that I missed in this repo?

Thank you for the great paper and code
I found the contents really interesting.
Thanks in advance!

Memory Leak

Hi, thanks for your work.

pool = mp.Pool(processes=max(mp.cpu_count() // get_world_size() // 2, 4))
ret = pool.map(
    functools.partial(_cityscapes_files_to_dict, from_json=from_json, to_polygons=to_polygons),
    files,
)
logger.info("Loaded {} images from {}".format(len(ret), image_dir))

Here, if we don't close the pool, it may cause a memory leak:

pool.close()
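
A minimal, self-contained sketch of the context-manager form, which releases the workers even if map() raises; the worker function and data below are illustrative, not the repo's.

import multiprocessing as mp

def _square(x: int) -> int:
    return x * x

if __name__ == "__main__":
    # The with-block calls pool.terminate() on exit, so the worker processes are
    # always released; an explicit pool.close() followed by pool.join() also works.
    with mp.Pool(processes=4) as pool:
        ret = pool.map(_square, range(10))
    print(ret)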

Loss NaN

When I set Dis_loss_weight=0.1, the model collapses. I see the same problem in facebookresearch/detectron2#1128.
According to your solution, setting a smaller dis_loss_weight alleviates this issue, but it gives a poor mAP.
How did you train your model with Dis_loss_weight=0.1?

[05/30 11:21:11] d2.utils.events INFO: eta: 8:40:21 iter: 9999 total_loss: nan loss_cls: nan loss_box_reg: nan loss_rpn_cls: 0.4926 loss_rpn_loc: 0.2313 loss_D_img_s: nan loss_D_img_t: nan time: 0.6949 data_time: 0.0415 lr: 0.01 max_mem: 5007M
[05/30 11:21:25] d2.utils.events INFO: eta: 8:39:59 iter: 10019 total_loss: nan loss_cls: nan loss_box_reg: nan loss_rpn_cls: 0.4825 loss_rpn_loc: 0.2396 loss_D_img_s: nan loss_D_img_t: nan time: 0.6948 data_time: 0.0418 lr: 0.01 max_mem: 5007M
[05/30 11:21:38] d2.utils.events INFO: eta: 8:39:40 iter: 10039 total_loss: nan loss_cls: nan loss_box_reg: nan loss_rpn_cls: 0.4763 loss_rpn_loc: 0.2427 loss_D_img_s: nan loss_D_img_t: nan time: 0.6948 data_time: 0.0349 lr: 0.01 max_mem: 5007M
[05/30 11:21:52] d2.utils.events INFO: eta: 8:38:53 iter: 10059 total_loss: nan loss_cls: nan loss_box_reg: nan loss_rpn_cls: 0.4791 loss_rpn_loc: 0.232 loss_D_img_s: nan loss_D_img_t: nan time: 0.6947 data_time: 0.0333 lr: 0.01 max_mem: 5007M
[05/30 11:22:05] d2.utils.events INFO: eta: 8:38:33 iter: 10079 total_loss: nan loss_cls: nan loss_box_reg: nan loss_rpn_cls: 0.493 loss_rpn_loc: 0.2346 loss_D_img_s: nan loss_D_img_t: nan time: 0.6947 data_time: 0.0344 lr: 0.01 max_mem: 5007M

Some puzzles about "branch.startswith("supervised")" in adapteacher/modeling/meta_arch/rcnn.py

Hi, I see that you write "if branch.startswith("supervised")" on line 217 of adapteacher/modeling/meta_arch/rcnn.py, and I am confused by it.
I think there might be a problem when we compute the loss on unlabeled data with pseudo-labels (line 605 of adapteacher/engine/trainer.py), which should run in the "supervised_target" branch. I think this results in the wrong label for loss_D_img_s_pesudo. Please check it.

Optimal number of learning iterations

I have a question about the paper.
Looking at Figure 4, the score rises while the iteration count is between 10k and 20k,
but the score hasn't increased much after that.
In this case, are about 20k iterations enough for training?

(Figure 4 from the paper)

How to train on custom datasets?

Hi,

Thanks for sharing the code.

I was wondering if you could also share instructions on how to train on custom datasets, since I've noticed that you modified low-level code such as builtin.py in data.datasets rather than registering the datasets somewhere else.

Thanks in advance.
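
In case it helps while official instructions are pending, a hedged sketch of registering custom source/target splits through detectron2's standard COCO registration API instead of editing builtin.py; the names and paths below are placeholders, and the config's TRAIN_LABEL/TRAIN_UNLABEL/TEST would then point at these registered names.

from detectron2.data.datasets import register_coco_instances

# Placeholder dataset names and paths; annotations are assumed to be COCO-format json.
register_coco_instances("my_source_train", {}, "datasets/source/annotations/train.json", "datasets/source/images")
register_coco_instances("my_target_train", {}, "datasets/target/annotations/train.json", "datasets/target/images")
register_coco_instances("my_target_test", {}, "datasets/target/annotations/test.json", "datasets/target/images")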

Distributed training failure

Hi,

When running the training code, I encountered the following issue.

Exception during training:
Traceback (most recent call last):
  File "/research/cbim/vast/tl601/projects/adaptive_teacher/adapteacher/engine/trainer.py", line 402, in train_loop
    self.run_step_full_semisup()
  File "/research/cbim/vast/tl601/projects/adaptive_teacher/adapteacher/engine/trainer.py", line 597, in run_step_full_semisup
    all_label_data, branch="supervised"
  File "/research/cbim/vast/tl601/anaconda3/envs/adapteacher/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/research/cbim/vast/tl601/anaconda3/envs/adapteacher/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 787, in forward
    if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel, and by making sure all forward function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 1: 66 67 68 69 70 71 72 73
In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error

Then I added find_unused_parameters=True to the DistributedDataParallel() call, and the problem was solved.

But now I have another issue.

Exception during training:
Traceback (most recent call last):
  File "/research/cbim/vast/tl601/projects/adaptive_teacher/adapteacher/engine/trainer.py", line 403, in train_loop
    self.run_step_full_semisup()
  File "/research/cbim/vast/tl601/projects/adaptive_teacher/adapteacher/engine/trainer.py", line 657, in run_step_full_semisup
    losses.backward()
  File "/research/cbim/vast/tl601/anaconda3/envs/adapteacher/lib/python3.7/site-packages/torch/_tensor.py", line 255, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/research/cbim/vast/tl601/anaconda3/envs/adapteacher/lib/python3.7/site-packages/torch/autograd/__init__.py", line 149, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the forward function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes. or try to use _set_static_graph() as a workaround if this module graph does not change during training loop. 2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple checkpoint functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases in default. You can try to use _set_static_graph() as a workaround if your module graph does not change over iterations.
Parameter at index 65 with name roi_heads.box_predictor.bbox_pred.bias has been marked as ready twice. This means that multiple autograd engine hooks have fired for this particular parameter during this iteration.

Answers online suggest setting find_unused_parameters=False, but that brings back the previous error.

I was wondering if you have a better solution.

My environment:
detectron2 v0.5
pytorch 1.9.0
cuda 11.1

Thanks
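
For reference, the workaround described above amounts to passing find_unused_parameters=True when wrapping the student; a toy sketch (not the repo's trainer code):

import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def wrap_student(model: nn.Module, device_id: int) -> DDP:
    # find_unused_parameters=True lets DDP tolerate parameters that receive no
    # gradient in a given step (e.g. an inactive branch), at the cost of an extra
    # traversal of the autograd graph each iteration.
    return DDP(model.to(device_id), device_ids=[device_id], find_unused_parameters=True)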

Evaluation for each category

Hi authors,

Great work! I tried to reproduce the adaptive teacher, but I found that the provided evaluation script only reports COCO-style metrics. Do you have a script to output per-category AP so that we can compare with the results in the paper?

Thanks!
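
Until such a script is provided, per-class AP@0.5 can be computed from pycocotools directly; a hedged sketch (the ground-truth path is a placeholder, and coco_instances_results.json is the detection dump that detectron2's COCOEvaluator writes to the output directory):

import numpy as np
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("ground_truth_coco.json")                        # placeholder path
coco_dt = coco_gt.loadRes("output/coco_instances_results.json")
ev = COCOeval(coco_gt, coco_dt, iouType="bbox")
ev.evaluate()
ev.accumulate()

# precision has shape [iou_thresholds, recall, classes, area_ranges, max_dets];
# take the IoU=0.5 slice, the "all" area range, and the largest max_dets setting.
iou_idx = int(np.argmin(np.abs(ev.params.iouThrs - 0.5)))
for k, cat_id in enumerate(ev.params.catIds):
    p = ev.eval["precision"][iou_idx, :, k, 0, -1]
    ap50 = float(np.mean(p[p > -1])) if (p > -1).any() else float("nan")
    name = coco_gt.loadCats([cat_id])[0]["name"]
    print(f"{name}: AP50 = {100 * ap50:.1f}")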

AP NaN

Hello,

I formed a new target dataset in Pascal VOC format, and as I understand it, the target dataset should be unlabeled, so I did not add .xml files to the Annotations folder of the target dataset. But how does the evaluation of the unlabeled images work in the teacher model if there are no ground-truth boxes?

Specifically, at every EVAL_PERIOD iteration this line returns NaN:

What should be done instead? Thanks!

AttributeError: 'NoneType' object has no attribute 'keys'

I was trying to reproduce the training from PASCAL VOC (source) to Clipart1k (target) using

python train_net.py \
      --num-gpus 8 \
      --config configs/faster_rcnn_R101_cross_clipart.yaml\
      OUTPUT_DIR output/exp_clipart

However, I got the following error message:

Traceback (most recent call last):
  File ".../adaptive_teacher/train_net.py", line 73, in <module>
    launch(
  File ".../detectron2-0.3/detectron2/engine/launch.py", line 55, in launch
    mp.spawn(
  File ".../lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File ".../lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File ".../lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 4 terminated with the following error:
Traceback (most recent call last):
  File ".../lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File ".../detectron2-0.3/detectron2/engine/launch.py", line 94, in _distributed_worker
    main_func(*args)
  File ".../adaptive_teacher/train_net.py", line 64, in main
    trainer.resume_or_load(resume=args.resume)
  File ".../adaptive_teacher/adapteacher/engine/trainer.py", line 337, in resume_or_load
    checkpoint = self.checkpointer.resume_or_load(
  File ".../lib/python3.10/site-packages/fvcore/common/checkpoint.py", line 229, in resume_or_load
    return self.load(path, checkpointables=[])
  File ".../lib/python3.10/site-packages/fvcore/common/checkpoint.py", line 156, in load
    incompatible = self._load_model(checkpoint)
  File ".../adaptive_teacher/adapteacher/checkpoint/detection_checkpoint.py", line 24, in _load_model
    incompatible = self._load_student_model(checkpoint)
  File ".../adaptive_teacher/adapteacher/checkpoint/detection_checkpoint.py", line 64, in _load_student_model
    self._convert_ndarray_to_tensor(checkpoint_state_dict)
  File ".../lib/python3.10/site-packages/fvcore/common/checkpoint.py", line 368, in _convert_ndarray_to_tensor
    for k in list(state_dict.keys()):
AttributeError: 'NoneType' object has no attribute 'keys'

I have pinpointed that the issue comes from detectron2.checkpoint.c2_model_loading.align_and_update_state_dicts removing all information in checkpoint["model"]: it is completely normal before entering this function, but becomes None after the function call on line 17 of .../adaptive_teacher/adapteacher/checkpoint/detection_checkpoint.py.

Could you please confirm why this function returns None? I really appreciate your help!

As you might have noticed, I am using the following environment:

  • Python 3.10.4
  • torch==1.12.0+cu102 (& torchvision of the same version)
  • Detectron2==0.3
  • Latest adaptive_teacher (I just confirmed right before submitting this issue)
  • V100

ResNet backbone doesn't converge

I changed the backbone from VGG to ResNet-50. First, I have to use a very low learning rate, otherwise training diverges. Second, the mAP is usually lower than with the VGG-backbone adaptive teacher, even if I train for a long time. How can I achieve good performance with a ResNet-50 backbone?
I am using gradient clipping and a learning rate of 0.001 to stop the divergence.

'model' in the checkpoint is None at line 17 of adapteacher/checkpoint/detection_checkpoint.py

When I run line 17 of adapteacher/checkpoint/detection_checkpoint.py, the 'model' entry in the checkpoint is None. Could you please tell me the reason? It also tells me that some keys in the weight file are not used.
(screenshot of the warning)

What's more, line 334 of adapteacher/modeling/meta_arch/rcnn.py uses the convert_image_to_rgb function, but the IDE tells me that this function is not defined. Where is it defined?

Cannot reproduce the results on "foggy cityscapes" due to an out-of-memory issue

I am getting a "Cannot allocate memory" error after around 13-15k iterations while trying to reproduce the results on the "foggy cityscapes" dataset. I am running this code on 4 GPUs with 360G of memory.

I can reproduce the VOC results on the same machine! The error only occurs on the cityscapes dataset. I suspect the memory usage keeps increasing with the iterations.

(error screenshot)

*** Environment ***
Python 3.7.10, torch=1.7.0, torchvision=0.8.1, detectron2=0.5

cfg parameters used for my trial:
MAX_ITER: 100000
IMG_PER_BATCH_LABEL: 8
IMG_PER_BATCH_UNLABEL: 8
BASE_LR: 0.04
BURN_UP_STEP: 20000
EVAL_PERIOD: 1000
NUM_WORKERS: 4

**** Error ****
ImportError: /scratch/1/ace14705nl/adaptive_teacher/.venv/lib/python3.7/site-packages/PIL/_imaging.cpython-37m-x86_64-linux-gnu.so: failed to map segment from shared object: Cannot allocate memory


UPDATE: when I tried this experiment on another GPU cluster (4 V100 NVLINK GPUs, 256G memory), I could run the code for 28K iterations and got AP@50 around 46, but again the process was terminated due to a memory issue.

"iterations =>> PBS: job killed: mem 269239236kb exceeded limit 268435456kb"

I can't figure out why so much memory (269G) is required when running this code on the cityscapes dataset. I would highly appreciate any help. Thanks.

Why is the discriminator trained in the supervised and target branches?

I noticed that in rcnn.py, loss_D_img_s and loss_D_img_t are trained with a small weight. What is the meaning of these two lines of code?

Is this a way to initialize the discriminator? Will it prevent the model from suffering from the model collapse caused by the discriminator?

losses["loss_D_img_s"] = loss_D_img_s*0.001
losses["loss_D_img_t"] = loss_D_img_t*0.001

Will the performance of the model be affected if the two lines of code above are removed and the model is trained only with the following two lines of code in the domain branch?

losses["loss_D_img_s"] = loss_D_img_s
losses["loss_D_img_t"] = loss_D_img_t

How to change the backbone to ResNet

The current code uses Faster R-CNN with a VGG16 backbone, right? The paper mentions that Mask R-CNN was used, but that would require segmentation annotations, and I only have bounding boxes for my dataset.

Also, as the title says, how can I change the VGG backbone to ResNet-101 or ResNet-50?

Thanks
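
Not an official answer, but in plain detectron2 terms switching the backbone is mostly a config change, and the repo already ships ResNet-101 configs (e.g. configs/faster_rcnn_R101_cross_clipart.yaml) that can serve as a template. A sketch using the standard detectron2 config keys (the repo's SEMISUPNET options are omitted, and these keys are not verified against this codebase):

from detectron2.config import get_cfg

cfg = get_cfg()
cfg.MODEL.BACKBONE.NAME = "build_resnet_backbone"   # detectron2's standard ResNet backbone builder
cfg.MODEL.RESNETS.DEPTH = 50                        # or 101
cfg.MODEL.WEIGHTS = "detectron2://ImageNetPretrained/MSRA/R-50.pkl"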

The checkpoint state_dict contains keys that are not used by the model

Hi,
when I evaluate or load the model, the warning message below appears.
Is this message expected?

WARNING [07/29 10:39:53 fvcore.common.checkpoint]: The checkpoint state_dict contains keys that are not used by the model:
modelTeacher.D_img.conv1.{bias, weight}
modelTeacher.D_img.conv2.{bias, weight}
modelTeacher.D_img.conv3.{bias, weight}
modelTeacher.D_img.classifier.{bias, weight}
modelStudent.D_img.conv1.{bias, weight}
modelStudent.D_img.conv2.{bias, weight}
modelStudent.D_img.conv3.{bias, weight}
modelStudent.D_img.classifier.{bias, weight}
