
adaptive_teacher's People

Contributors

hnanacc, xiaoliangdai, yujheli

adaptive_teacher's Issues

How to open the output file

I don't know the format of the output file event.out......... Can anyone help me open the file and see the values in it?

Why does your DA-FRCNN implementation use the multi-scale training trick?

Thanks for your work. I recently ran into another question about the input image scale.

As far as I know, the input min scale should be 600 for FRCNN-based DAOD frameworks, as shown in https://github.com/krumo/Domain-Adaptive-Faster-RCNN-PyTorch/blob/df0488405a7679552bc2504b973e29178c141b26/configs/da_faster_rcnn/e2e_da_faster_rcnn_R_50_C4_cityscapes_to_foggy_cityscapes.yaml#L24

But it seems that AT uses multi-scale training in all configs:

MIN_SIZE_TRAIN: (640, 672, 704, 736, 768, 800)

Is it possible to wrap the teacher model in DistributedDataParallel?

Hello, I'm trying to use your idea in my thesis work. Thanks for the great idea and code!
I set requires_grad=False for all the parameters in the teacher model and wrapped it in DistributedDataParallel.
But with my own code the training gets stuck at loss.backward(); the losses are not NaN.
If I lower the batch size and run with just 1 GPU, the code works fine, but with DistributedDataParallel the training gets stuck immediately.

Would you have an idea about it? Is it because the exponential moving average somehow affects the computation graph?
Thanks
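
For context, a minimal sketch (my own, not the repo's exact code) of how an EMA teacher is usually kept outside DistributedDataParallel: only the student is wrapped, and the teacher is updated under torch.no_grad(), so it never takes part in a backward pass and should not need DDP at all.

import torch
import torch.nn as nn

@torch.no_grad()
def ema_update(teacher: nn.Module, student: nn.Module, keep_rate: float = 0.9996) -> None:
    # Copy an exponential moving average of the student weights into the teacher.
    # The student may be DDP-wrapped; the teacher is deliberately left unwrapped,
    # since it only produces pseudo-labels and never calls backward().
    student_state = (student.module if hasattr(student, "module") else student).state_dict()
    for key, value in teacher.state_dict().items():
        value.copy_(value * keep_rate + student_state[key] * (1.0 - keep_rate))

If the teacher is only ever run for pseudo-labeling inside torch.no_grad(), leaving it out of DistributedDataParallel may sidestep the hang described above.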

Error while loading pretrained model weights from `detectron2://ImageNetPretrained/MSRA/R-101.pkl` for training with custom dataset

Hello authors, thanks for the code.

I was trying to adapt the model to our custom dataset and faced the following issue.

Traceback (most recent call last):
  File "models/adaptive_teacher/train_net.py", line 84, in <module>
    args=(args,),
  File "/no_backups/s1437/.pyenv/versions/adaptiveteacher/lib/python3.7/site-packages/detectron2/engine/launch.py", line 59, in launch
    daemon=False,
  File "/no_backups/s1437/.pyenv/versions/adaptiveteacher/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/no_backups/s1437/.pyenv/versions/adaptiveteacher/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
    while not context.join():
  File "/no_backups/s1437/.pyenv/versions/adaptiveteacher/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/no_backups/s1437/.pyenv/versions/adaptiveteacher/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/no_backups/s1437/.pyenv/versions/adaptiveteacher/lib/python3.7/site-packages/detectron2/engine/launch.py", line 94, in _distributed_worker
    main_func(*args)
  File "/misc/no_backups/s1437/DA-Multimodal-OD/models/adaptive_teacher/train_net.py", line 69, in main
    trainer.resume_or_load(resume=args.resume)
  File "/misc/no_backups/s1437/DA-Multimodal-OD/models/adaptive_teacher/adapteacher/engine/trainer.py", line 377, in resume_or_load
    self.cfg.MODEL.WEIGHTS, resume=resume
  File "/no_backups/s1437/.pyenv/versions/adaptiveteacher/lib/python3.7/site-packages/fvcore/common/checkpoint.py", line 229, in resume_or_load
    return self.load(path, checkpointables=[])
  File "/no_backups/s1437/.pyenv/versions/adaptiveteacher/lib/python3.7/site-packages/fvcore/common/checkpoint.py", line 158, in load
    incompatible = self._load_model(checkpoint)
  File "/misc/no_backups/s1437/DA-Multimodal-OD/models/adaptive_teacher/adapteacher/checkpoint/detection_checkpoint.py", line 28, in _load_model
    incompatible = self._load_student_model(checkpoint)
  File "/misc/no_backups/s1437/DA-Multimodal-OD/models/adaptive_teacher/adapteacher/checkpoint/detection_checkpoint.py", line 70, in _load_student_model
    self._convert_ndarray_to_tensor(checkpoint_state_dict)
  File "/no_backups/s1437/.pyenv/versions/adaptiveteacher/lib/python3.7/site-packages/fvcore/common/checkpoint.py", line 368, in _convert_ndarray_to_tensor
    for k in list(state_dict.keys()):
AttributeError: 'NoneType' object has no attribute 'keys'

Configuration used:

_BASE_: "./Base-RCNN-C4.yaml"
MODEL:
  META_ARCHITECTURE: "DAobjTwoStagePseudoLabGeneralizedRCNN"
  WEIGHTS: "detectron2://ImageNetPretrained/MSRA/R-101.pkl"
  MASK_ON: False
  RESNETS:
    DEPTH: 101
  PROPOSAL_GENERATOR:
    NAME: "PseudoLabRPN"
  PIXEL_MEAN: [87.0, 91.0, 95.0]
  # RPN:
  #   POSITIVE_FRACTION: 0.25
  ROI_HEADS:
    NAME: "StandardROIHeadsPseudoLab"
    LOSS: "CrossEntropy" # variant: "CrossEntropy"
    NUM_CLASSES: 4 # this doesn't include background.
  ROI_BOX_HEAD:
    NAME: "FastRCNNConvFCHead"
    NUM_FC: 2
    POOLER_RESOLUTION: 7
SOLVER:
  LR_SCHEDULER_NAME: "WarmupTwoStageMultiStepLR"
  STEPS: (60000, 80000, 90000, 360000)
  FACTOR_LIST: (1, 1, 1, 1, 1)
  MAX_ITER: 100000
  IMG_PER_BATCH_LABEL: 4
  IMG_PER_BATCH_UNLABEL: 4
  BASE_LR: 0.04
  CHECKPOINT_PERIOD: 1000
DATALOADER:
  SUP_PERCENT: 100.0
DATASETS:
  CROSS_DATASET: True
  TRAIN_LABEL: ("train_clear_day",) #voc_2012_train
  TRAIN_UNLABEL: ("train_dense_fog_day",) #Clipart1k_train
  TEST: ("test_clear_day",) #Clipart1k_test
SEMISUPNET:
  Trainer: "ateacher"
  BBOX_THRESHOLD: 0.8
  TEACHER_UPDATE_ITER: 1
  BURN_UP_STEP: 20000
  EMA_KEEP_RATE: 0.9996
  UNSUP_LOSS_WEIGHT: 1.0
  SUP_LOSS_WEIGHT: 1.0
  DIS_TYPE: "res4" #["concate","p2","multi"]
TEST:
  EVAL_PERIOD: 1000

Please have a look into it.
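
For anyone hitting this, a quick sanity check (my own sketch, not part of the repo) that the downloaded weights file itself is readable; the traceback above ends with checkpoint['model'] being None, so it helps to first rule out a corrupted or missing file. The local path is a placeholder.

import pickle

# Placeholder path to a local copy of detectron2://ImageNetPretrained/MSRA/R-101.pkl.
with open("R-101.pkl", "rb") as f:
    ckpt = pickle.load(f, encoding="latin1")

# Expect either a flat dict of numpy weight arrays or a dict with a "model" key.
print(type(ckpt))
print(list(ckpt.keys())[:5])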

What is the crop ratio?

There are weak augmentations for the teacher model. I would like to know what the cropping ratio for the weak augmentation is. It makes model inference about 3x faster compared to the base Faster R-CNN model.

About discriminator in the paper

Hi,

Where is the discriminator code that is mentioned in the paper?
Is it just a fully connected classifier layer in the network?

Regarding the Watercolor Dataset Config

Hello, I am wondering whether there is a specific split file we have to run to obtain the VOC split for the watercolor dataset. The burn-in training for watercolor only trains on the 7 overlapping classes that span both VOC and watercolor. Is that part of the code missing, or is it assumed to be done by ourselves? Many thanks.

Question about Figure 4

Thank you for your work. But I have a question about Figure 4 in the main paper.

It seems that with only 10k iterations of source-only pre-training, the model has already achieved around 33.0 mAP, which significantly outperforms the well-trained source-only result (28.8). Does this mean that the detectron2-implemented FRCNN works better?

About VGG16 pre-trained on ImageNet

We found in Section 4.2 of the paper: "ResNet101 [13] or VGG16 [36] pre-trained on ImageNet [7]". However, in adaptive_teacher/configs/faster_rcnn_VGG_cross_city.yaml, VGG16 does not use ImageNet pre-trained parameters the way adaptive_teacher/configs/faster_rcnn_R101_cross_water.yaml does.

We would like to know whether VGG16 is pretrained on ImageNet or not. Thank you very much.

Cannot reproduce the results of "cityscapes" to "foggy cityscapes"

Hi, I ran into some problems while reproducing the "cityscapes" to "foggy cityscapes" results.

During my reproduction, I caught an error, shown in the screenshot below:
(error screenshot)

My environment follows your prerequisites: detectron2=0.3, pytorch=1.7, cuda=11.0, python=3.7. I changed your code as below to solve the error:

From:

evaluator_list.append(COCOEvaluator(dataset_name, output_dir=output_folder))

to:

evaluator_list.append(COCOEvaluator(dataset_name, cfg, True, output_dir=output_folder))

This error was solved by my modification. Is this modification OK? However, besides this error, another error arises when I reach 32979 iterations, shown below:
(error screenshot)

I have changed IMG_PER_BATCH_LABEL/IMG_PER_BATCH_UNLABEL to 8/8, but the error is still there. My GPUs are 16G V100s and the machine has 377G of memory. Due to this error, I cannot reproduce the "cityscapes" to "foggy cityscapes" results. Do you know how to fix it?

Why 2 separate annotation files?

In readme.md, it is instructed to put the gtFine folder into both the clear and foggy data directories. But the foggy images are created from the clear ones, so their annotations should be exactly the same. Indeed, there is only one annotation file in the Cityscapes download link. Are we supposed to put exactly the same gtFine folder in 2 different places?
Or is it structured like this to support training on custom datasets, since unsupervised images don't have to be created synthetically from clean images?
Thanks

Experiments with VGG16 without BN layers?

Hi, thanks for your awesome project!

When I dove into the details of adaptive_teacher, I found that the vgg16 backbone has BN layers by default.

self.vgg = make_layers(cfgs['vgg16'],batch_norm=True)

As far as I know, the standard vgg16 backbone does not contain BN layers; adding them improves the cross-domain detection baseline and makes the comparison unfair, as previous works use a plain vgg16 backbone without BN layers. I also observed that the proposed method outperforms previous methods by a large margin. Did the authors conduct experiments with vgg16 without BN layers?

Thanks a lot!
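
For reference, a BN-free, ImageNet-initialised VGG16 trunk can be built directly from torchvision; this is only a sketch of the ablation being asked about, not code from this repository.

import torch.nn as nn
from torchvision.models import vgg16  # plain VGG16, i.e. the variant without BN layers

def build_vgg16_trunk_no_bn() -> nn.Module:
    # ImageNet-pretrained VGG16 without batch norm, keeping only the convolutional
    # trunk (the final max-pool is dropped, as is common for Faster R-CNN backbones).
    features = vgg16(pretrained=True).features
    return nn.Sequential(*list(features.children())[:-1])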

Loss becomes NaN

I conducted a reproduction experiment of domain adaptation from PASCAL VOC to Clipart1k,
but the loss became NaN during training.

I am using 4 RTX 2080 Ti GPUs and changed the parameters as follows.

IMG_PER_BATCH_LABEL: 16 -> 4
IMG_PER_BATCH_UNLABEL: 16 -> 4
BASE_LR: 0.04 -> 0.01
MAX_ITER: 100000 -> 400000
BURN_UP_STEP: 20000 -> 80000

Is there a good solution?
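
For what it's worth, the reduced settings above keep the learning rate proportional to the total batch size (the usual linear scaling heuristic); a quick check of the arithmetic, which is mine and not guidance from the authors:

# Reference config: batch size 16 with BASE_LR 0.04; reduced run: batch size 4.
ref_batch, ref_lr = 16, 0.04
new_batch = 4
print(ref_lr * new_batch / ref_batch)  # 0.01, matching the BASE_LR listed above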

Question about selecting the best model (validation or post-last-step?)

Hi, I have been reading your paper and code,
and I am confused about how the best model over the entire training process is selected.

This is how I understood the training code:

  1. Model training (both the burn-in and mutual learning stages) is performed on the training data.
  2. Model weights are saved every 5000 steps by hooks.PeriodicCheckpointer.
  3. After the last training step is finished (MAX_ITER reached), the resulting weights are used for evaluation.

Please correct me if I am wrong.

My questions are:
a. Should I take the model weights after the last training step as the final weights for future inference?
b. It seems that no validation loss/metric is calculated in the code, but the paper shows a plot of validation mAP (Figure 4).
Are the metrics reported in the paper computed with the weights from the last training step, or with weights selected on a validation set?
c. Is there a model selection step based on a validation loss/metric that I missed in this repo?

Thank you for the great paper and code
I found the contents really interesting.
Thanks in advance!

Memory Leak

Hi, thanks for your work.

pool = mp.Pool(processes=max(mp.cpu_count() // get_world_size() // 2, 4))
ret = pool.map(
    functools.partial(_cityscapes_files_to_dict, from_json=from_json, to_polygons=to_polygons),
    files,
)
logger.info("Loaded {} images from {}".format(len(ret), image_dir))

Here, if we don't close the pool, it may cause a memory leak:

pool.close()
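
A minimal, self-contained sketch of the context-manager form, which releases the workers even if map() raises; the worker function and data below are illustrative, not the repo's.

import multiprocessing as mp

def _square(x: int) -> int:
    return x * x

if __name__ == "__main__":
    # The with-block calls pool.terminate() on exit, so the worker processes are
    # always released; an explicit pool.close() followed by pool.join() also works.
    with mp.Pool(processes=4) as pool:
        ret = pool.map(_square, range(10))
    print(ret)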

Loss NaN

When I set Dis_loss_weight=0.1, the model collapses. I see the same problem in facebookresearch/detectron2#1128.
According to your solution, setting a smaller dis_loss_weight alleviates this issue, but it gives a poor mAP.
How did you train your model with Dis_loss_weight=0.1?

[05/30 11:21:11] d2.utils.events INFO: eta: 8:40:21 iter: 9999 total_loss: nan loss_cls: nan loss_box_reg: nan loss_rpn_cls: 0.4926 loss_rpn_loc: 0.2313 loss_D_img_s: nan loss_D_img_t: nan time: 0.6949 data_time: 0.0415 lr: 0.01 max_mem: 5007M
[05/30 11:21:25] d2.utils.events INFO: eta: 8:39:59 iter: 10019 total_loss: nan loss_cls: nan loss_box_reg: nan loss_rpn_cls: 0.4825 loss_rpn_loc: 0.2396 loss_D_img_s: nan loss_D_img_t: nan time: 0.6948 data_time: 0.0418 lr: 0.01 max_mem: 5007M
[05/30 11:21:38] d2.utils.events INFO: eta: 8:39:40 iter: 10039 total_loss: nan loss_cls: nan loss_box_reg: nan loss_rpn_cls: 0.4763 loss_rpn_loc: 0.2427 loss_D_img_s: nan loss_D_img_t: nan time: 0.6948 data_time: 0.0349 lr: 0.01 max_mem: 5007M
[05/30 11:21:52] d2.utils.events INFO: eta: 8:38:53 iter: 10059 total_loss: nan loss_cls: nan loss_box_reg: nan loss_rpn_cls: 0.4791 loss_rpn_loc: 0.232 loss_D_img_s: nan loss_D_img_t: nan time: 0.6947 data_time: 0.0333 lr: 0.01 max_mem: 5007M
[05/30 11:22:05] d2.utils.events INFO: eta: 8:38:33 iter: 10079 total_loss: nan loss_cls: nan loss_box_reg: nan loss_rpn_cls: 0.493 loss_rpn_loc: 0.2346 loss_D_img_s: nan loss_D_img_t: nan time: 0.6947 data_time: 0.0344 lr: 0.01 max_mem: 5007M

Some puzzles about "branch.startswith("supervised")" in adapteacher/modeling/meta_arch/rcnn.py

Hi, I see that you write "if branch.startswith("supervised")" on line 217 of adapteacher/modeling/meta_arch/rcnn.py, and I am confused by it.
I think there might be a problem when we compute the loss on unlabeled data with pseudo-labels (line 605 of adapteacher/engine/trainer.py), which should run in the "supervised_target" branch. I think this results in the wrong label for loss_D_img_s_pesudo. Please check it.

Optimal number of learning iterations

I have a question about the paper.
Looking at Figure 4, the score rises while the iteration count is between 10k and 20k,
but the score hasn't increased much after that.
In this case, are about 20k iterations enough for training?

(Figure 4 from the paper)

How to train on custom datasets?

Hi,

Thanks for sharing the code.

I was wondering if you could also share instructions on how to train on custom datasets, since I've noticed that you modified low-level code such as builtin.py in data.datasets rather than registering the datasets somewhere else.

Thanks in advance.
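
In case it helps while official instructions are pending, a hedged sketch of registering custom source/target splits through detectron2's standard COCO registration API instead of editing builtin.py; the names and paths below are placeholders, and the config's TRAIN_LABEL/TRAIN_UNLABEL/TEST would then point at these registered names.

from detectron2.data.datasets import register_coco_instances

# Placeholder dataset names and paths; annotations are assumed to be COCO-format json.
register_coco_instances("my_source_train", {}, "datasets/source/annotations/train.json", "datasets/source/images")
register_coco_instances("my_target_train", {}, "datasets/target/annotations/train.json", "datasets/target/images")
register_coco_instances("my_target_test", {}, "datasets/target/annotations/test.json", "datasets/target/images")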

Distributed training failure

Hi,

When running the training code, I encountered the following issue.

Exception during training:
Traceback (most recent call last):
  File "/research/cbim/vast/tl601/projects/adaptive_teacher/adapteacher/engine/trainer.py", line 402, in train_loop
    self.run_step_full_semisup()
  File "/research/cbim/vast/tl601/projects/adaptive_teacher/adapteacher/engine/trainer.py", line 597, in run_step_full_semisup
    all_label_data, branch="supervised"
  File "/research/cbim/vast/tl601/anaconda3/envs/adapteacher/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/research/cbim/vast/tl601/anaconda3/envs/adapteacher/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 787, in forward
    if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel, and by making sure all forward function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 1: 66 67 68 69 70 71 72 73
In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error

Then I added find_unused_parameters=True to the DistributedDataParallel() call, and the problem was solved.

But now I have another issue.

Exception during training:
Traceback (most recent call last):
  File "/research/cbim/vast/tl601/projects/adaptive_teacher/adapteacher/engine/trainer.py", line 403, in train_loop
    self.run_step_full_semisup()
  File "/research/cbim/vast/tl601/projects/adaptive_teacher/adapteacher/engine/trainer.py", line 657, in run_step_full_semisup
    losses.backward()
  File "/research/cbim/vast/tl601/anaconda3/envs/adapteacher/lib/python3.7/site-packages/torch/_tensor.py", line 255, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/research/cbim/vast/tl601/anaconda3/envs/adapteacher/lib/python3.7/site-packages/torch/autograd/__init__.py", line 149, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the forward function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes. or try to use _set_static_graph() as a workaround if this module graph does not change during training loop. 2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple checkpoint functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases in default. You can try to use _set_static_graph() as a workaround if your module graph does not change over iterations.
Parameter at index 65 with name roi_heads.box_predictor.bbox_pred.bias has been marked as ready twice. This means that multiple autograd engine hooks have fired for this particular parameter during this iteration.

Answers online suggest setting find_unused_parameters=False, but that brings back the previous error.

I was wondering if you have a better solution.

My environment:
detectron2 v0.5
pytorch 1.9.0
cuda 11.1

Thanks
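
For reference, the workaround described above amounts to passing find_unused_parameters=True when wrapping the student; a toy sketch (not the repo's trainer code):

import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def wrap_student(model: nn.Module, device_id: int) -> DDP:
    # find_unused_parameters=True lets DDP tolerate parameters that receive no
    # gradient in a given step (e.g. an inactive branch), at the cost of an extra
    # traversal of the autograd graph each iteration.
    return DDP(model.to(device_id), device_ids=[device_id], find_unused_parameters=True)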

Evaluation for each category

Hi authors,

Great work! I tried to reproduce the adaptive teacher, but I found that the provided evaluation script only reports COCO-style metrics. Do you have a script to output per-category AP so that we can compare with the results in the paper?

Thanks!
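
Until such a script is provided, per-class AP@0.5 can be computed from pycocotools directly; a hedged sketch (the ground-truth path is a placeholder, and coco_instances_results.json is the detection dump that detectron2's COCOEvaluator writes to the output directory):

import numpy as np
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("ground_truth_coco.json")                        # placeholder path
coco_dt = coco_gt.loadRes("output/coco_instances_results.json")
ev = COCOeval(coco_gt, coco_dt, iouType="bbox")
ev.evaluate()
ev.accumulate()

# precision has shape [iou_thresholds, recall, classes, area_ranges, max_dets];
# take the IoU=0.5 slice, the "all" area range, and the largest max_dets setting.
iou_idx = int(np.argmin(np.abs(ev.params.iouThrs - 0.5)))
for k, cat_id in enumerate(ev.params.catIds):
    p = ev.eval["precision"][iou_idx, :, k, 0, -1]
    ap50 = float(np.mean(p[p > -1])) if (p > -1).any() else float("nan")
    name = coco_gt.loadCats([cat_id])[0]["name"]
    print(f"{name}: AP50 = {100 * ap50:.1f}")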

AP NaN

Hello,

I formed a new target dataset in Pascal VOC format, and as I understand it, the target dataset should be unlabeled, so I did not add .xml files to the Annotations folder of the target dataset. But how does the evaluation of the unlabeled images work in the teacher model if there are no ground-truth boxes?

Specifically, at every EVAL_PERIOD iteration this line returns NaN:

What should be done instead? Thanks!

AttributeError: 'NoneType' object has no attribute 'keys'

I was trying to reproduce the training from PASCAL VOC (source) to Clipart1k (target) using

python train_net.py \
      --num-gpus 8 \
      --config configs/faster_rcnn_R101_cross_clipart.yaml\
      OUTPUT_DIR output/exp_clipart

However, I got the following error message:

Traceback (most recent call last):
  File ".../adaptive_teacher/train_net.py", line 73, in <module>
    launch(
  File ".../detectron2-0.3/detectron2/engine/launch.py", line 55, in launch
    mp.spawn(
  File ".../lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File ".../lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File ".../lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 4 terminated with the following error:
Traceback (most recent call last):
  File ".../lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File ".../detectron2-0.3/detectron2/engine/launch.py", line 94, in _distributed_worker
    main_func(*args)
  File ".../adaptive_teacher/train_net.py", line 64, in main
    trainer.resume_or_load(resume=args.resume)
  File ".../adaptive_teacher/adapteacher/engine/trainer.py", line 337, in resume_or_load
    checkpoint = self.checkpointer.resume_or_load(
  File ".../lib/python3.10/site-packages/fvcore/common/checkpoint.py", line 229, in resume_or_load
    return self.load(path, checkpointables=[])
  File ".../lib/python3.10/site-packages/fvcore/common/checkpoint.py", line 156, in load
    incompatible = self._load_model(checkpoint)
  File ".../adaptive_teacher/adapteacher/checkpoint/detection_checkpoint.py", line 24, in _load_model
    incompatible = self._load_student_model(checkpoint)
  File ".../adaptive_teacher/adapteacher/checkpoint/detection_checkpoint.py", line 64, in _load_student_model
    self._convert_ndarray_to_tensor(checkpoint_state_dict)
  File ".../lib/python3.10/site-packages/fvcore/common/checkpoint.py", line 368, in _convert_ndarray_to_tensor
    for k in list(state_dict.keys()):
AttributeError: 'NoneType' object has no attribute 'keys'

I have pinpointed that the issue comes from detectron2.checkpoint.c2_model_loading.align_and_update_state_dicts removing all information in checkpoint["model"]: it is completely normal before entering this function, but becomes None after the function call on line 17 of .../adaptive_teacher/adapteacher/checkpoint/detection_checkpoint.py.

Could you please confirm why this function returns None? I really appreciate your help!

As you might have noticed, I am using the following environment:

  • Python 3.10.4
  • torch==1.12.0+cu102 (& torchvision of the same version)
  • Detectron2==0.3
  • Latest adaptive_teacher (I just confirmed right before submitting this issue)
  • V100

ResNet backbone doesn't converge

I changed the backbone from VGG to ResNet-50. First, I have to use a very low learning rate, otherwise training diverges. Second, the mAP is usually lower than with the VGG-backbone adaptive teacher, even if I train for a long time. How can I achieve good performance with a ResNet-50 backbone?
I am using gradient clipping and a learning rate of 0.001 to stop the divergence.

'model' in the checkpoint is None at line 17 of adapteacher/checkpoint/detection_checkpoint.py

When I run line 17 of adapteacher/checkpoint/detection_checkpoint.py, the 'model' entry in the checkpoint is None. Could you please tell me the reason? It also tells me that some keys in the weight file are not used.
(screenshot of the warning)

What's more, line 334 of adapteacher/modeling/meta_arch/rcnn.py uses the convert_image_to_rgb function, but the IDE tells me that this function is not defined. Where is it defined?

Cannot reproduce the results on "foggy cityscapes" due to an out-of-memory issue

I am getting a "Cannot allocate memory" error after around 13-15k iterations while trying to reproduce the results on the "foggy cityscapes" dataset. I am running this code on 4 GPUs with 360G of memory.

I can reproduce the VOC results on the same machine! The error only occurs on the cityscapes dataset. I suspect the memory usage keeps increasing with the iterations.

(error screenshot)

*** Environment ***
Python 3.7.10, torch=1.7.0, torchvision=0.8.1, detectron2=0.5

cfg parameters used for my trial:
MAX_ITER: 100000
IMG_PER_BATCH_LABEL: 8
IMG_PER_BATCH_UNLABEL: 8
BASE_LR: 0.04
BURN_UP_STEP: 20000
EVAL_PERIOD: 1000
NUM_WORKERS: 4

**** Error ****
ImportError: /scratch/1/ace14705nl/adaptive_teacher/.venv/lib/python3.7/site-packages/PIL/_imaging.cpython-37m-x86_64-linux-gnu.so: failed to map segment from shared object: Cannot allocate memory


UPDATE: when I tried this experiment on another GPU cluster (4 V100 NVLINK GPUs, 256G memory), I could run the code for 28K iterations and got AP@50 around 46, but again the process was terminated due to a memory issue.

"iterations =>> PBS: job killed: mem 269239236kb exceeded limit 268435456kb"

I can't figure out why so much memory (269G) is required when running this code on the cityscapes dataset. I would highly appreciate any help. Thanks.

Why is the discriminator trained in the supervised and target branches?

I noticed that in rcnn.py, loss_D_img_s and loss_D_img_t are trained with a small weight. What is the meaning of these two lines of code?

Is this a way to initialize the discriminator? Will it prevent the model from suffering from the model collapse caused by the discriminator?

losses["loss_D_img_s"] = loss_D_img_s*0.001
losses["loss_D_img_t"] = loss_D_img_t*0.001

Will the performance of the model be affected if the two lines of code above are removed and the model is trained only with the following two lines of code in the domain branch?

losses["loss_D_img_s"] = loss_D_img_s
losses["loss_D_img_t"] = loss_D_img_t

How to change the backbone to ResNet

The current code uses Faster R-CNN with a VGG16 backbone, right? The paper mentions that Mask R-CNN was used, but that would require segmentation annotations, and I only have bounding boxes for my dataset.

Also, as the title says, how can I change the VGG backbone to ResNet-101 or ResNet-50?

Thanks
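
Not an official answer, but in plain detectron2 terms switching the backbone is mostly a config change, and the repo already ships ResNet-101 configs (e.g. configs/faster_rcnn_R101_cross_clipart.yaml) that can serve as a template. A sketch using the standard detectron2 config keys (the repo's SEMISUPNET options are omitted, and these keys are not verified against this codebase):

from detectron2.config import get_cfg

cfg = get_cfg()
cfg.MODEL.BACKBONE.NAME = "build_resnet_backbone"   # detectron2's standard ResNet backbone builder
cfg.MODEL.RESNETS.DEPTH = 50                        # or 101
cfg.MODEL.WEIGHTS = "detectron2://ImageNetPretrained/MSRA/R-50.pkl"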

The checkpoint state_dict contains keys that are not used by the model

Hi,
when I evaluate or load the model, the warning message below appears.
Is this message expected?

WARNING [07/29 10:39:53 fvcore.common.checkpoint]: The checkpoint state_dict contains keys that are not used by the model:
modelTeacher.D_img.conv1.{bias, weight}
modelTeacher.D_img.conv2.{bias, weight}
modelTeacher.D_img.conv3.{bias, weight}
modelTeacher.D_img.classifier.{bias, weight}
modelStudent.D_img.conv1.{bias, weight}
modelStudent.D_img.conv2.{bias, weight}
modelStudent.D_img.conv3.{bias, weight}
modelStudent.D_img.classifier.{bias, weight}
