first commit

2026-06-03 12:42:47 +08:00
commit ec23799148
339 changed files with 57120 additions and 0 deletions
--- a/rtdetrv2_pytorch/Dockerfile
+++ b/rtdetrv2_pytorch/Dockerfile
@@ -0,0 +1,10 @@
+FROM nvcr.io/nvidia/pytorch:25.06-py3
+
+WORKDIR /workspace
+
+COPY requirements.txt .
+
+RUN pip install --upgrade pip && \
+    pip install -r requirements.txt
+
+CMD ["/bin/bash"]
--- a/rtdetrv2_pytorch/README.md
+++ b/rtdetrv2_pytorch/README.md
@@ -0,0 +1,168 @@
+
+## Quick start
+
+<details >
+<summary>Setup</summary>
+
+```shell
+
+pip install -r requirements.txt
+```
+
+The following table lists matching `torch` and `torchvision` versions.
+
+| `rtdetr` | `torch` | `torchvision` |
+|---|---|---|
+| `-` | `2.4` | `0.19` |
+| `-` | `2.2` | `0.17` |
+| `-` | `2.1` | `0.16` |
+| `-` | `2.0` | `0.15` |
+
+</details>
+
+<details open>
+<summary>Fig</summary>
+
+<div align="center">
+<img width="500" alt="image" src="https://github.com/user-attachments/assets/437877e9-1d4f-4d30-85e8-aafacfa0ec56">
+</div>
+
+</details>
+
+
+## Model Zoo
+
+### Base models
+
+| Model | Dataset | Input Size | AP<sup>val</sup> | AP<sub>50</sub><sup>val</sup> | #Params(M) | FPS | config| checkpoint | 
+| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |:---: |
+**RT-DETRv2-S** | COCO | 640 | **48.1** <font color=green>(+1.6)</font> | **65.1** | 20 | 217 | [config](./configs/rtdetrv2/rtdetrv2_r18vd_120e_coco.yml) | [url](https://github.com/lyuwenyu/storage/releases/download/v0.2/rtdetrv2_r18vd_120e_coco_rerun_48.1.pth) |
+**RT-DETRv2-M**<sup>*</sup> | COCO | 640 | **49.9** <font color=green>(+1.0)</font> | **67.5** | 31 | 161 | [config](./configs/rtdetrv2/rtdetrv2_r34vd_120e_coco.yml) | [url](https://github.com/lyuwenyu/storage/releases/download/v0.1/rtdetrv2_r34vd_120e_coco_ema.pth)
+**RT-DETRv2-M** | COCO | 640 | **51.9** <font color=green>(+0.6)</font> | **69.9** | 36 | 145 | [config](./configs/rtdetrv2/rtdetrv2_r50vd_m_7x_coco.yml) | [url](https://github.com/lyuwenyu/storage/releases/download/v0.1/rtdetrv2_r50vd_m_7x_coco_ema.pth)
+**RT-DETRv2-L** | COCO | 640 | **53.4** <font color=green>(+0.3)</font> | **71.6** | 42 | 108 | [config](./configs/rtdetrv2/rtdetrv2_r50vd_6x_coco.yml) | [url](https://github.com/lyuwenyu/storage/releases/download/v0.1/rtdetrv2_r50vd_6x_coco_ema.pth)
+**RT-DETRv2-X** | COCO | 640 | 54.3 | **72.8** <font color=green>(+0.1)</font> | 76 | 74 | [config](./configs/rtdetrv2/rtdetrv2_r101vd_6x_coco.yml) | [url](https://github.com/lyuwenyu/storage/releases/download/v0.1/rtdetrv2_r101vd_6x_coco_from_paddle.pth)
+<!-- rtdetrv2_hgnetv2_l | COCO | 640 | 52.9 | 71.5 | 32 | 114 | [url<sup>*</sup>](https://github.com/lyuwenyu/storage/releases/download/v0.1/rtdetrv2_hgnetv2_l_6x_coco_from_paddle.pth) 
+rtdetrv2_hgnetv2_x | COCO | 640 | 54.7 | 72.9 | 67 | 74 | [url<sup>*</sup>](https://github.com/lyuwenyu/storage/releases/download/v0.1/rtdetrv2_hgnetv2_x_6x_coco_from_paddle.pth) 
+rtdetrv2_hgnetv2_h | COCO | 640 | 56.3 | 74.8 | 123 | 40 | [url<sup>*</sup>](https://github.com/lyuwenyu/storage/releases/download/v0.1/rtdetrv2_hgnetv2_h_6x_coco_from_paddle.pth) 
+rtdetrv2_18vd | COCO+Objects365 | 640 | 49.0 | 66.5 | 20 | 217 | [url<sup>*</sup>](https://github.com/lyuwenyu/storage/releases/download/v0.1/rtdetrv2_r18vd_5x_coco_objects365_from_paddle.pth)
+rtdetrv2_r50vd | COCO+Objects365 | 640 | 55.2 | 73.4 | 42 | 108 | [url<sup>*</sup>](https://github.com/lyuwenyu/storage/releases/download/v0.1/rtdetrv2_r50vd_2x_coco_objects365_from_paddle.pth)
+rtdetrv2_r101vd | COCO+Objects365 | 640 | 56.2 | 74.5 | 76 | 74 | [url<sup>*</sup>](https://github.com/lyuwenyu/storage/releases/download/v0.1/rtdetrv2_r101vd_2x_coco_objects365_from_paddle.pth)
+ -->
+
+**Notes:**
+- `AP` is evaluated on the *MSCOCO val2017* dataset.
+- `FPS` is evaluated on a single T4 GPU with $batch\\_size = 1$, $fp16$, and $TensorRT>=8.5.1$.
+- `COCO + Objects365` in the table means the model was fine-tuned on `COCO` starting from weights pretrained on `Objects365`.
+
+
+
+### Models of discrete sampling
+
+| Model | Sampling Method | AP<sup>val</sup> | AP<sub>50</sub><sup>val</sup> | config| checkpoint 
+| :---: | :---: | :---: | :---: | :---: | :---: |
+**RT-DETRv2-S_dsp** | discrete_sampling | 47.4 | 64.8 <font color=red>(-0.1)</font> | [config](./configs/rtdetrv2/rtdetrv2_r18vd_dsp_3x_coco.yml) | [url](https://github.com/lyuwenyu/storage/releases/download/v0.1/rtdetrv2_r18vd_dsp_3x_coco.pth)
+**RT-DETRv2-M**<sup>*</sup>**_dsp** | discrete_sampling | 49.2 | 67.1 <font color=red>(-0.4)</font> | [config](./configs/rtdetrv2/rtdetrv2_r34vd_dsp_1x_coco.yml) | [url](https://github.com/lyuwenyu/storage/releases/download/v0.1/rrtdetrv2_r34vd_dsp_1x_coco.pth)
+**RT-DETRv2-M_dsp** | discrete_sampling | 51.4 | 69.7 <font color=red>(-0.2)</font> | [config](./configs/rtdetrv2/rtdetrv2_r50vd_m_dsp_3x_coco.yml) | [url](https://github.com/lyuwenyu/storage/releases/download/v0.1/rtdetrv2_r50vd_m_dsp_3x_coco.pth)
+**RT-DETRv2-L_dsp** | discrete_sampling | 52.9 | 71.3 <font color=red>(-0.3)</font> |[config](./configs/rtdetrv2/rtdetrv2_r50vd_dsp_1x_coco.yml)| [url](https://github.com/lyuwenyu/storage/releases/download/v0.1/rtdetrv2_r50vd_dsp_1x_coco.pth)
+
+
+<!-- **rtdetrv2_r18vd_dsp1** | discrete_sampling | 21600 | 46.3 | 63.9 | [url](https://github.com/lyuwenyu/storage/releases/download/v0.1/rtdetrv2_r18vd_dsp1_1x_coco.pth) -->
+
+<!-- rtdetrv2_r18vd_dsp1 | discrete_sampling | 21600 | 45.5 | 63.0 | 4.34 | [url](https://github.com/lyuwenyu/storage/releases/download/v0.1/rtdetrv2_r18vd_dsp1_120e_coco.pth) -->
+<!-- 4.3 -->
+
+**Notes:**
+- The impact on inference speed depends on the specific device and software.
+- `*_dsp*` models inherit the knowledge of the corresponding `*_sp*` models and adapt it to the `discrete_sampling` strategy. **You can use TensorRT 8.4 (or even older versions) for inference with these models.**
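+
+A minimal sketch of how a discrete-sampling variant might be configured, assuming the decoder's `cross_attn_method` option (listed as `default, discrete` in `include/rtdetrv2_r50vd.yml`) is what selects the sampling method. The file name and `output_dir` are hypothetical, and the values are illustrative rather than a copy of the shipped `*_dsp_*` configs:
+
+```yaml
+# configs/rtdetrv2/my_r18vd_dsp_coco.yml  (hypothetical file name)
+__include__: [
+  '../dataset/coco_detection.yml',
+  '../runtime.yml',
+  './include/dataloader.yml',
+  './include/optimizer.yml',
+  './include/rtdetrv2_r50vd.yml',
+]
+
+output_dir: ./output/my_r18vd_dsp_coco
+
+# Start from the grid-sampling r18vd checkpoint so its weights are reused.
+tuning: https://github.com/lyuwenyu/storage/releases/download/v0.1/rtdetrv2_r18vd_120e_coco.pth
+
+# Same backbone/encoder overrides as rtdetrv2_r18vd_120e_coco.yml.
+PResNet:
+  depth: 18
+  freeze_at: -1
+  freeze_norm: False
+  pretrained: True
+
+HybridEncoder:
+  in_channels: [128, 256, 512]
+  hidden_dim: 256
+  expansion: 0.5
+
+RTDETRTransformerv2:
+  num_layers: 3
+  cross_attn_method: discrete  # assumed to select discrete sampling instead of the default
+```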
+<!-- - `grid_sampling` use `grid_sample` to sample attention map, `discrete_sampling` use `index_select` method to sample attention map.  -->
+
+
+### Ablation on sampling points
+
+<!-- Flexible samping strategy in cross attenstion layer for devices that do **not** optimize (or not support) `grid_sampling` well. You can choose models based on specific scenarios and the trade-off between speed and accuracy. -->
+
+| Model | Sampling Method | #Points | AP<sup>val</sup> | AP<sub>50</sub><sup>val</sup> | checkpoint 
+| :---: | :---: | :---: | :---: | :---: | :---: |
+**rtdetrv2_r18vd_sp1** | grid_sampling | 21,600 | 47.3 | 64.3 <font color=red>(-0.6)</font> | [url](https://github.com/lyuwenyu/storage/releases/download/v0.1/rtdetrv2_r18vd_sp1_120e_coco.pth)
+**rtdetrv2_r18vd_sp2** | grid_sampling | 43,200 | 47.7 | 64.7 <font color=red>(-0.2)</font> | [url](https://github.com/lyuwenyu/storage/releases/download/v0.1/rtdetrv2_r18vd_sp2_120e_coco.pth)
+**rtdetrv2_r18vd_sp3** | grid_sampling | 64,800 | 47.8 | 64.8 <font color=red>(-0.1)</font> | [url](https://github.com/lyuwenyu/storage/releases/download/v0.1/rtdetrv2_r18vd_sp3_120e_coco.pth)
+rtdetrv2_r18vd(_sp4)| grid_sampling | 86,400 | 47.9 | 64.9 | [url](https://github.com/lyuwenyu/storage/releases/download/v0.1/rtdetrv2_r18vd_120e_coco.pth) 
+
+**Notes:**
+- The impact on inference speed depends on the specific device and software.
+- `#Points` is the total number of sampling points in the decoder per image at inference.
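+
+As a rough check on the `#Points` column, the counts are consistent with multiplying the number of queries, attention heads, feature levels, points per level, and decoder layers. This sketch takes `num_queries: 300`, `num_levels: 3`, and `num_layers: 3` from the r18vd configs and assumes 8 attention heads in the decoder, a value these configs do not state:
+
+$$
+\#\text{points} = N_{\text{queries}} \times N_{\text{heads}} \times N_{\text{levels}} \times N_{\text{points}} \times N_{\text{layers}} = 300 \times 8 \times 3 \times N_{\text{points}} \times 3 = 21{,}600 \, N_{\text{points}}
+$$
+
+With $N_{\text{points}} = 1, 2, 3, 4$ per level this gives 21,600, 43,200, 64,800, and 86,400, which matches the four rows of the table.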
+
+
+## Usage
+<details>
+<summary> details </summary>
+
+<!-- <summary>1. Training </summary> -->
+1. Training
+```shell
+CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --master_port=9909 --nproc_per_node=4 tools/train.py -c path/to/config --use-amp --seed=0 > log.txt 2>&1 &
+```
+
+<!-- <summary>2. Testing </summary> -->
+2. Testing
+```shell
+CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --master_port=9909 --nproc_per_node=4 tools/train.py -c path/to/config -r path/to/checkpoint --test-only
+```
+
+<!-- <summary>3. Tuning </summary> -->
+3. Tuning
+```shell
+CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --master_port=9909 --nproc_per_node=4 tools/train.py -c path/to/config -t path/to/checkpoint --use-amp --seed=0 > log.txt 2>&1 &
+```
+
+<!-- <summary>4. Export onnx </summary> -->
+4. Export onnx
+```shell
+python tools/export_onnx.py -c path/to/config -r path/to/checkpoint --check
+```
+
+<!-- <summary>5. Export tensorrt </summary> -->
+5. Export tensorrt
+```shell
+python tools/export_trt.py -i path/to/onnxfile
+```
+
+<!-- <summary>6. Inference </summary> -->
+6. Inference
+
+Supports torch, onnxruntime, tensorrt, and openvino; see *references/deploy* for details.
+```shell
+python references/deploy/rtdetrv2_onnxruntime.py --onnx-file=model.onnx --im-file=xxxx
+python references/deploy/rtdetrv2_tensorrt.py --trt-file=model.trt --im-file=xxxx
+python references/deploy/rtdetrv2_torch.py -c path/to/config -r path/to/checkpoint --im-file=xxx --device=cuda:0
+```
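+
+The `path/to/config` used in the commands above can point at a small file that pulls in the shared configs through `__include__` and overrides only what differs. The following is a sketch for a custom COCO-format dataset, assuming that keys in the including file are merged over the included ones, as the shipped per-model configs appear to rely on. The file name, paths, and class count are hypothetical:
+
+```yaml
+# configs/rtdetrv2/my_dataset.yml  (hypothetical file name)
+__include__: [
+  '../dataset/coco_detection.yml',
+  '../runtime.yml',
+  './include/dataloader.yml',
+  './include/optimizer.yml',
+  './include/rtdetrv2_r50vd.yml',
+]
+
+output_dir: ./output/my_dataset
+
+num_classes: 3                 # illustrative value for a custom dataset
+remap_mscoco_category: False   # category ids are not the MSCOCO ones
+
+train_dataloader:
+  dataset:
+    img_folder: path/to/train/images/
+    ann_file: path/to/train/annotations.json
+
+val_dataloader:
+  dataset:
+    img_folder: path/to/val/images/
+    ann_file: path/to/val/annotations.json
+```
+
+Everything not listed here, such as the augmentation `ops`, the optimizer, and the model definition, would then come from the included files.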
+</details>
+
+
+
+## Citation
+If you use `RTDETR` or `RTDETRv2` in your work, please use the following BibTeX entries:
+
+<details>
+<summary> bibtex </summary>
+
+```latex
+@misc{lv2023detrs,
+      title={DETRs Beat YOLOs on Real-time Object Detection},
+      author={Wenyu Lv and Shangliang Xu and Yian Zhao and Guanzhong Wang and Jinman Wei and Cheng Cui and Yuning Du and Qingqing Dang and Yi Liu},
+      year={2023},
+      eprint={2304.08069},
+      archivePrefix={arXiv},
+      primaryClass={cs.CV}
+}
+
+@misc{lv2024rtdetrv2improvedbaselinebagoffreebies,
+      title={RT-DETRv2: Improved Baseline with Bag-of-Freebies for Real-Time Detection Transformer}, 
+      author={Wenyu Lv and Yian Zhao and Qinyao Chang and Kui Huang and Guanzhong Wang and Yi Liu},
+      year={2024},
+      eprint={2407.17140},
+      archivePrefix={arXiv},
+      primaryClass={cs.CV},
+      url={https://arxiv.org/abs/2407.17140}, 
+}
+```
+</details>
--- a/rtdetrv2_pytorch/configs/dataset/coco_detection.yml
+++ b/rtdetrv2_pytorch/configs/dataset/coco_detection.yml
@@ -0,0 +1,48 @@
+task: detection
+
+evaluator:
+  type: CocoEvaluator
+  iou_types: ['bbox', ]
+
+# num_classes: 365
+# remap_mscoco_category: False
+
+# num_classes: 91
+# remap_mscoco_category: False
+
+num_classes: 80
+remap_mscoco_category: True
+
+
+train_dataloader: 
+  type: DataLoader
+  dataset: 
+    type: CocoDetection
+    img_folder: ./dataset/coco/train2017/
+    ann_file: ./dataset/coco/annotations/instances_train2017.json
+    return_masks: False
+    transforms:
+      type: Compose
+      ops: ~
+  shuffle: True
+  num_workers: 4
+  drop_last: True 
+  collate_fn:
+    type: BatchImageCollateFunction
+
+
+val_dataloader:
+  type: DataLoader
+  dataset: 
+    type: CocoDetection
+    img_folder: ./dataset/coco/val2017/
+    ann_file: ./dataset/coco/annotations/instances_val2017.json
+    return_masks: False
+    transforms:
+      type: Compose
+      ops: ~ 
+  shuffle: False
+  num_workers: 4
+  drop_last: False
+  collate_fn:
+    type: BatchImageCollateFunction
--- a/rtdetrv2_pytorch/configs/dataset/voc_detection.yml
+++ b/rtdetrv2_pytorch/configs/dataset/voc_detection.yml
@@ -0,0 +1,40 @@
+task: detection
+
+evaluator:
+  type: CocoEvaluator
+  iou_types: ['bbox', ]
+
+num_classes: 20
+
+train_dataloader: 
+  type: DataLoader
+  dataset: 
+    type: VOCDetection
+    root: ./dataset/voc/
+    ann_file: trainval.txt
+    label_file: label_list.txt
+    transforms:
+      type: Compose
+      ops: ~
+  shuffle: True
+  num_workers: 4
+  drop_last: True 
+  collate_fn:
+    type: BatchImageCollateFunction
+
+
+val_dataloader:
+  type: DataLoader
+  dataset: 
+    type: VOCDetection
+    root: ./dataset/voc/
+    ann_file: test.txt
+    label_file: label_list.txt
+    transforms:
+      type: Compose
+      ops: ~
+  shuffle: False
+  num_workers: 4
+  drop_last: False
+  collate_fn:
+    type: BatchImageCollateFunction
--- a/rtdetrv2_pytorch/configs/rtdetr/include/dataloader.yml
+++ b/rtdetrv2_pytorch/configs/rtdetr/include/dataloader.yml
@@ -0,0 +1,31 @@
+
+train_dataloader: 
+  dataset: 
+    return_masks: False
+    transforms:
+      ops:
+        - {type: RandomPhotometricDistort, p: 0.5}
+        - {type: RandomZoomOut, fill: 0}
+        - {type: RandomIoUCrop, p: 0.8}
+        - {type: SanitizeBoundingBoxes, min_size: 1}
+        - {type: RandomHorizontalFlip}
+        - {type: Resize, size: [640, 640], }
+        - {type: SanitizeBoundingBoxes, min_size: 1}
+        - {type: ConvertPILImage, dtype: 'float32', scale: True}   
+        - {type: ConvertBoxes, fmt: 'cxcywh', normalize: True}  
+  collate_fn:
+    type: BatchImageCollateFunction
+    scales: [480, 512, 544, 576, 608, 640, 640, 640, 672, 704, 736, 768, 800]
+  shuffle: True
+  num_workers: 4
+  total_batch_size: 16
+
+val_dataloader:
+  dataset: 
+    transforms:
+      ops: 
+        - {type: Resize, size: [640, 640]}
+        - {type: ConvertPILImage, dtype: 'float32', scale: True}   
+  shuffle: False
+  total_batch_size: 16
+  num_workers: 8
--- a/rtdetrv2_pytorch/configs/rtdetr/include/optimizer.yml
+++ b/rtdetrv2_pytorch/configs/rtdetr/include/optimizer.yml
@@ -0,0 +1,40 @@
+
+use_ema: True 
+ema:
+  type: ModelEMA
+  decay: 0.9999
+  warmups: 2000
+
+
+epoches: 72
+clip_max_norm: 0.1
+
+
+optimizer:
+  type: AdamW
+  params: 
+    - 
+      params: '^(?=.*backbone)(?!.*(?:norm|bn)).*$'
+      lr: 0.00001
+    -
+      params: '^(?=.*backbone)(?=.*(?:norm|bn)).*$'
+      weight_decay: 0.
+      lr: 0.00001
+    - 
+      params: '^(?=.*(?:encoder|decoder))(?=.*(?:norm|bn|bias)).*$'
+      weight_decay: 0.
+
+  lr: 0.0001
+  betas: [0.9, 0.999]
+  weight_decay: 0.0001
+
+
+lr_scheduler:
+  type: MultiStepLR
+  milestones: [1000]
+  gamma: 0.1
+
+
+lr_warmup_scheduler:
+  type: LinearWarmup
+  warmup_duration: 2000
--- a/rtdetrv2_pytorch/configs/rtdetr/include/rtdetr_r50vd.yml
+++ b/rtdetrv2_pytorch/configs/rtdetr/include/rtdetr_r50vd.yml
@@ -0,0 +1,79 @@
+task: detection
+
+model: RTDETR
+criterion: RTDETRCriterion
+postprocessor: RTDETRPostProcessor
+
+
+use_focal_loss: True
+eval_spatial_size: [640, 640] # h w
+
+
+RTDETR: 
+  backbone: PResNet
+  encoder: HybridEncoder
+  decoder: RTDETRTransformer
+  
+
+PResNet:
+  depth: 50
+  variant: d
+  freeze_at: 0
+  return_idx: [1, 2, 3]
+  num_stages: 4
+  freeze_norm: True
+  pretrained: True 
+
+
+HybridEncoder:
+  in_channels: [512, 1024, 2048]
+  feat_strides: [8, 16, 32]
+
+  # intra
+  hidden_dim: 256
+  use_encoder_idx: [2]
+  num_encoder_layers: 1
+  nhead: 8
+  dim_feedforward: 1024
+  dropout: 0.
+  enc_act: 'gelu'
+  
+  # cross
+  expansion: 1.0
+  depth_mult: 1
+  act: 'silu'
+
+  version: v1
+
+RTDETRTransformer:
+  feat_channels: [256, 256, 256]
+  feat_strides: [8, 16, 32]
+  hidden_dim: 256
+  num_levels: 3
+
+  num_layers: 6
+  num_queries: 300
+
+  num_denoising: 100
+  label_noise_ratio: 0.5
+  box_noise_scale: 1.0 # 1.0 0.4
+
+  eval_idx: -1
+
+
+RTDETRPostProcessor:
+  num_top_queries: 300
+
+
+RTDETRCriterion:
+  weight_dict: {loss_vfl: 1, loss_bbox: 5, loss_giou: 2,}
+  losses: ['vfl', 'boxes', ]
+  alpha: 0.75
+  gamma: 2.0
+
+  matcher:
+    type: HungarianMatcher
+    weight_dict: {cost_class: 2, cost_bbox: 5, cost_giou: 2}
+    alpha: 0.25
+    gamma: 2.0
+
--- a/rtdetrv2_pytorch/configs/rtdetr/readme.md
+++ b/rtdetrv2_pytorch/configs/rtdetr/readme.md
@@ -0,0 +1,111 @@
+# DETRs Beat YOLOs on Real-time Object Detection
+
+## Introduction
+This repository is the official PyTorch implementation of [*RTDETR*](https://arxiv.org/abs/2304.08069v1) and is compatible with [RT-DETR/rtdetr_pytorch](https://github.com/lyuwenyu/RT-DETR/tree/main). For the Paddle implementation, please refer to [RT-DETR/rtdetr_paddle](https://github.com/lyuwenyu/RT-DETR/tree/main). **If you are using rtdetr for the first time, it is highly recommended to use [rtdetrv2](../rtdetrv2/)**.
+
+<details open>
+<summary> Fig </summary>
+<div align="center">
+  <img src="https://github.com/lyuwenyu/RT-DETR/assets/17582080/42636690-1ecf-4647-b075-842ecb9bc562" width=500>
+</div>
+</details>
+
+<!-- 
+<div align="center">
+  <img src="https://github.com/lyuwenyu/RT-DETR/assets/17582080/42636690-1ecf-4647-b075-842ecb9bc562" width=500>
+</div> -->
+
+
+## Model Zoo
+| Model | Dataset | Input Size | AP<sup>val</sup> | AP<sub>50</sub><sup>val</sup> | #Params(M) | FPS |  checkpoint |
+| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
+rtdetr_r18vd | COCO | 640 | 46.4 | 63.7 | 20 | 217 | [url<sup>*</sup>](https://github.com/lyuwenyu/storage/releases/download/v0.1/rtdetr_r18vd_dec3_6x_coco_from_paddle.pth)
+rtdetr_r34vd | COCO | 640 | 48.9 | 66.8 | 31 | 161 | [url<sup>*</sup>](https://github.com/lyuwenyu/storage/releases/download/v0.1/rtdetr_r34vd_dec4_6x_coco_from_paddle.pth)
+rtdetr_r50vd_m | COCO | 640 | 51.3 | 69.5 | 36 | 145 | [url<sup>*</sup>](https://github.com/lyuwenyu/storage/releases/download/v0.1/rtdetr_r50vd_m_6x_coco_from_paddle.pth)
+rtdetr_r50vd | COCO | 640 | 53.1 | 71.2| 42 | 108 | [url<sup>*</sup>](https://github.com/lyuwenyu/storage/releases/download/v0.1/rtdetr_r50vd_6x_coco_from_paddle.pth)
+rtdetr_r101vd | COCO | 640 | 54.3 | 72.8 | 76 | 74 | [url<sup>*</sup>](https://github.com/lyuwenyu/storage/releases/download/v0.1/rtdetr_r101vd_6x_coco_from_paddle.pth)
+rtdetr_18vd | COCO+Objects365 | 640 | 49.0 | 66.5 | 20 | 217 | [url<sup>*</sup>](https://github.com/lyuwenyu/storage/releases/download/v0.1/rtdetr_r18vd_5x_coco_objects365_from_paddle.pth)
+rtdetr_r50vd | COCO+Objects365 | 640 | 55.2 | 73.4 | 42 | 108 | [url<sup>*</sup>](https://github.com/lyuwenyu/storage/releases/download/v0.1/rtdetr_r50vd_2x_coco_objects365_from_paddle.pth)
+rtdetr_r101vd | COCO+Objects365 | 640 | 56.2 | 74.5 | 76 | 74 | [url<sup>*</sup>](https://github.com/lyuwenyu/storage/releases/download/v0.1/rtdetr_r101vd_2x_coco_objects365_from_paddle.pth)
+
+<!-- rtdetr_r18vd | COCO | 640 | 46.5 | 63.6 | 20 | 217 | [url](https://github.com/lyuwenyu/storage/releases/download/v0.1/rtdetr_r18vd_6x_coco.pth) -->
+
+<!-- rtdetr_r18vd | Objects365 | 640 | 22.9 |  31.2| - | [url<sup>*</sup>](https://github.com/lyuwenyu/storage/releases/download/v0.1/rtdetr_r18vd_5x_coco_objects365_from_paddle.pth)
+rtdetr_r50vd | Objects365 | 640 | 35.1 | 46.2 | - | [url<sup>*</sup>](https://github.com/lyuwenyu/storage/releases/download/v0.1/rtdetr_r50vd_2x_coco_objects365_from_paddle.pth)
+rtdetr_r101vd | Objects365 | 640 | 36.8 | 48.3 | - | [url<sup>*</sup>](https://github.com/lyuwenyu/storage/releases/download/v0.1/rtdetr_r101vd_2x_coco_objects365_from_paddle.pth) -->
+
+Notes
+<!-- - AP is evaluated on coco 2017 val dataset -->
+<!-- RT-DETR was trained on COCO train2017 and evaluated on val2017. -->
+- `COCO + Objects365` in the table means the model was fine-tuned on `COCO` starting from weights pretrained on `Objects365`.
+- `FPS` is evaluated on a single T4 GPU with $batch\\_size = 1$ and $tensorrt\\_fp16$ mode.
+- `url`<sup>`*`</sup> is the URL of the pretrained weights, converted from the Paddle model to save energy. *There may be slight differences between this table and the paper.*
+
+
+## Usage
+<details>
+<summary> details </summary>
+
+<!-- <summary>1. Training </summary> -->
+1. Training
+```shell
+CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --master_port=9909 --nproc_per_node=4 tools/train.py -c path/to/config > log.txt 2>&1 &
+```
+
+<!-- <summary>2. Testing </summary> -->
+2. Testing
+```shell
+CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --master_port=9909 --nproc_per_node=4 tools/train.py -c path/to/config -r path/to/checkpoint --test-only
+```
+
+<!-- <summary>3. Tuning </summary> -->
+3. Tuning
+```shell
+CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --master_port=9909 --nproc_per_node=4 tools/train.py -c path/to/config -t path/to/checkpoint > log.txt 2>&1 &
+```
+
+<!-- <summary>4. Export onnx </summary> -->
+4. Export onnx
+```shell
+python tools/export_onnx.py -c path/to/config -r path/to/checkpoint --check
+```
+
+<!-- <summary>5. Inference </summary> -->
+5. Inference
+
+Supports torch, onnxruntime, tensorrt, and openvino; see *references/deploy* for details.
+```shell
+python references/deploy/rtdetrv2_onnx.py --onnx-file=model.onnx --im-file=xxxx
+python references/deploy/rtdetrv2_tensorrt.py --trt-file=model.trt --im-file=xxxx
+python references/deploy/rtdetrv2_torch.py -c path/to/config -r path/to/checkpoint --im-file=xxx --device=cuda:0
+```
+</details>
+
+
+## Citation
+If you use `RTDETR` in your work, please use the following BibTeX entries:
+
+<details>
+<summary> bibtex </summary>
+
+```latex
+@misc{lv2023detrs,
+      title={DETRs Beat YOLOs on Real-time Object Detection},
+      author={Wenyu Lv and Shangliang Xu and Yian Zhao and Guanzhong Wang and Jinman Wei and Cheng Cui and Yuning Du and Qingqing Dang and Yi Liu},
+      year={2023},
+      eprint={2304.08069},
+      archivePrefix={arXiv},
+      primaryClass={cs.CV}
+}
+
+@software{Lv_rtdetr_by_cvperception_2023,
+author = {Lv, Wenyu},
+license = {Apache-2.0},
+month = oct,
+title = {{rtdetr by cvperception}},
+url = {https://github.com/lyuwenyu/cvperception/},
+version = {0.0.1dev},
+year = {2023}
+}
+```
+</details>
--- a/rtdetrv2_pytorch/configs/rtdetr/rtdetr_r101vd_6x_coco.yml
+++ b/rtdetrv2_pytorch/configs/rtdetr/rtdetr_r101vd_6x_coco.yml
@@ -0,0 +1,41 @@
+
+__include__: [
+  '../dataset/coco_detection.yml',
+  '../runtime.yml',
+  './include/dataloader.yml',
+  './include/optimizer.yml',
+  './include/rtdetr_r50vd.yml',
+]
+
+
+output_dir: ./output/rtdetr_r101vd_6x_coco
+
+
+PResNet:
+  depth: 101
+
+
+HybridEncoder:
+  # intra
+  hidden_dim: 384
+  dim_feedforward: 2048
+
+
+RTDETRTransformer:
+  feat_channels: [384, 384, 384]
+
+
+optimizer:
+  type: AdamW
+  params: 
+    - 
+      params: '^(?=.*backbone)(?!.*norm|bn).*$'
+      lr: 0.000001
+    - 
+      params: '^(?=.*(?:encoder|decoder))(?=.*(?:norm|bn)).*$'
+      weight_decay: 0.
+
+  lr: 0.0001
+  betas: [0.9, 0.999]
+  weight_decay: 0.0001
+
--- a/rtdetrv2_pytorch/configs/rtdetr/rtdetr_r18vd_6x_coco.yml
+++ b/rtdetrv2_pytorch/configs/rtdetr/rtdetr_r18vd_6x_coco.yml
@@ -0,0 +1,48 @@
+
+__include__: [
+  '../dataset/coco_detection.yml',
+  '../runtime.yml',
+  './include/dataloader.yml',
+  './include/optimizer.yml',
+  './include/rtdetr_r50vd.yml',
+]
+
+
+output_dir: ./output/rtdetr_r18vd_6x_coco
+
+
+PResNet:
+  depth: 18
+  freeze_at: -1
+  freeze_norm: False
+  pretrained: True
+
+
+HybridEncoder:
+  in_channels: [128, 256, 512]
+  hidden_dim: 256
+  expansion: 0.5
+
+
+RTDETRTransformer:
+  num_layers: 3
+
+
+
+optimizer:
+  type: AdamW
+  params: 
+    - 
+      params: '^(?=.*backbone)(?=.*norm|bn).*$'
+      weight_decay: 0.
+      lr: 0.00001
+    - 
+      params: '^(?=.*backbone)(?!.*norm|bn).*$'
+      lr: 0.00001
+    - 
+      params: '^(?=.*(?:encoder|decoder))(?=.*(?:norm|bn|bias)).*$'
+      weight_decay: 0.
+
+  lr: 0.0001
+  betas: [0.9, 0.999]
+  weight_decay: 0.0001
--- a/rtdetrv2_pytorch/configs/rtdetr/rtdetr_r34vd_6x_coco.yml
+++ b/rtdetrv2_pytorch/configs/rtdetr/rtdetr_r34vd_6x_coco.yml
@@ -0,0 +1,48 @@
+
+__include__: [
+  '../dataset/coco_detection.yml',
+  '../runtime.yml',
+  './include/dataloader.yml',
+  './include/optimizer.yml',
+  './include/rtdetr_r50vd.yml',
+]
+
+
+output_dir: ./output/rtdetr_r34vd_6x_coco
+
+
+PResNet:
+  depth: 34
+  freeze_at: -1
+  freeze_norm: False
+  pretrained: True
+
+
+HybridEncoder:
+  in_channels: [128, 256, 512]
+  hidden_dim: 256
+  expansion: 0.5
+
+
+RTDETRTransformer:
+  num_layers: 4
+
+
+
+optimizer:
+  type: AdamW
+  params: 
+    - 
+      params: '^(?=.*backbone)(?=.*norm|bn).*$'
+      weight_decay: 0.
+      lr: 0.00001
+    - 
+      params: '^(?=.*backbone)(?!.*norm|bn).*$'
+      lr: 0.00001
+    - 
+      params: '^(?=.*(?:encoder|decoder))(?=.*(?:norm|bn|bias)).*$'
+      weight_decay: 0.
+
+  lr: 0.0001
+  betas: [0.9, 0.999]
+  weight_decay: 0.0001
--- a/rtdetrv2_pytorch/configs/rtdetr/rtdetr_r50vd_6x_coco.yml
+++ b/rtdetrv2_pytorch/configs/rtdetr/rtdetr_r50vd_6x_coco.yml
@@ -0,0 +1,14 @@
+
+__include__: [
+  '../dataset/coco_detection.yml',
+  '../runtime.yml',
+  './include/dataloader.yml',
+  './include/optimizer.yml',
+  './include/rtdetr_r50vd.yml',
+]
+
+
+output_dir: ./output/rtdetr_r50vd_6x_coco
+
+
+
--- a/rtdetrv2_pytorch/configs/rtdetr/rtdetr_r50vd_m_6x_coco.yml
+++ b/rtdetrv2_pytorch/configs/rtdetr/rtdetr_r50vd_m_6x_coco.yml
@@ -0,0 +1,34 @@
+__include__: [
+  '../dataset/coco_detection.yml',
+  '../runtime.yml',
+  './include/dataloader.yml',
+  './include/optimizer.yml',
+  './include/rtdetr_r50vd.yml',
+]
+
+output_dir: ./output/rtdetr_r50vd_m_6x_coco
+
+
+HybridEncoder:
+  expansion: 0.5
+
+
+RTDETRTransformer:
+  eval_idx: 2 # use the 3rd decoder layer (0-based index 2) for evaluation
+
+
+
+optimizer:
+  type: AdamW
+  params: 
+    - 
+      params: '^(?=.*backbone)(?!.*norm|bn).*$'
+      lr: 0.00001
+    - 
+      params: '^(?=.*(?:encoder|decoder))(?=.*(?:norm|bn|bias)).*$'
+      weight_decay: 0.
+
+  lr: 0.0001
+  betas: [0.9, 0.999]
+  weight_decay: 0.0001
+
--- a/rtdetrv2_pytorch/configs/rtdetrv2/include/dataloader.yml
+++ b/rtdetrv2_pytorch/configs/rtdetrv2/include/dataloader.yml
@@ -0,0 +1,38 @@
+
+train_dataloader: 
+  dataset: 
+    transforms:
+      ops:
+        - {type: RandomPhotometricDistort, p: 0.5}
+        - {type: RandomZoomOut, fill: 0}
+        - {type: RandomIoUCrop, p: 0.8}
+        - {type: SanitizeBoundingBoxes, min_size: 1}
+        - {type: RandomHorizontalFlip}
+        - {type: Resize, size: [640, 640], }
+        - {type: SanitizeBoundingBoxes, min_size: 1}
+        - {type: ConvertPILImage, dtype: 'float32', scale: True}   
+        - {type: ConvertBoxes, fmt: 'cxcywh', normalize: True}
+      policy:
+        name: stop_epoch
+        epoch: 71 # epoch in [71, ~) stop `ops`
+        ops: ['RandomPhotometricDistort', 'RandomZoomOut', 'RandomIoUCrop']
+  
+  collate_fn:
+    type: BatchImageCollateFunction
+    scales: [480, 512, 544, 576, 608, 640, 640, 640, 672, 704, 736, 768, 800]
+    stop_epoch: 71 # epoch in [71, ~) stop `multiscales`
+
+  shuffle: True
+  total_batch_size: 16 # total batch size equals 16 (4 * 4)
+  num_workers: 4
+
+
+val_dataloader:
+  dataset: 
+    transforms:
+      ops: 
+        - {type: Resize, size: [640, 640]}
+        - {type: ConvertPILImage, dtype: 'float32', scale: True}   
+  shuffle: False
+  total_batch_size: 32
+  num_workers: 4
--- a/rtdetrv2_pytorch/configs/rtdetrv2/include/optimizer.yml
+++ b/rtdetrv2_pytorch/configs/rtdetrv2/include/optimizer.yml
@@ -0,0 +1,37 @@
+
+use_amp: True
+use_ema: True 
+ema:
+  type: ModelEMA
+  decay: 0.9999
+  warmups: 2000
+
+
+epoches: 72
+clip_max_norm: 0.1
+
+
+optimizer:
+  type: AdamW
+  params: 
+    - 
+      params: '^(?=.*backbone)(?!.*norm).*$'
+      lr: 0.00001
+    - 
+      params: '^(?=.*(?:encoder|decoder))(?=.*(?:norm|bn)).*$'
+      weight_decay: 0.
+
+  lr: 0.0001
+  betas: [0.9, 0.999]
+  weight_decay: 0.0001
+
+
+lr_scheduler:
+  type: MultiStepLR
+  milestones: [1000]
+  gamma: 0.1
+
+
+lr_warmup_scheduler:
+  type: LinearWarmup
+  warmup_duration: 2000
--- a/rtdetrv2_pytorch/configs/rtdetrv2/include/rtdetrv2_r50vd.yml
+++ b/rtdetrv2_pytorch/configs/rtdetrv2/include/rtdetrv2_r50vd.yml
@@ -0,0 +1,83 @@
+task: detection
+
+model: RTDETR
+criterion: RTDETRCriterionv2
+postprocessor: RTDETRPostProcessor
+
+
+use_focal_loss: True
+eval_spatial_size: [640, 640] # h w
+
+
+RTDETR: 
+  backbone: PResNet
+  encoder: HybridEncoder
+  decoder: RTDETRTransformerv2
+  
+
+PResNet:
+  depth: 50
+  variant: d
+  freeze_at: 0
+  return_idx: [1, 2, 3]
+  num_stages: 4
+  freeze_norm: True
+  pretrained: True 
+
+
+HybridEncoder:
+  in_channels: [512, 1024, 2048]
+  feat_strides: [8, 16, 32]
+
+  # intra
+  hidden_dim: 256
+  use_encoder_idx: [2]
+  num_encoder_layers: 1
+  nhead: 8
+  dim_feedforward: 1024
+  dropout: 0.
+  enc_act: 'gelu'
+  
+  # cross
+  expansion: 1.0
+  depth_mult: 1
+  act: 'silu'
+
+
+RTDETRTransformerv2:
+  feat_channels: [256, 256, 256]
+  feat_strides: [8, 16, 32]
+  hidden_dim: 256
+  num_levels: 3
+
+  num_layers: 6
+  num_queries: 300
+
+  num_denoising: 100
+  label_noise_ratio: 0.5
+  box_noise_scale: 1.0 # 1.0 0.4
+
+  eval_idx: -1
+
+  # NEW
+  num_points: [4, 4, 4] # [3,3,3] [2,2,2]
+  cross_attn_method: default # default, discrete
+  query_select_method: default # default, agnostic 
+
+
+RTDETRPostProcessor:
+  num_top_queries: 300
+
+
+RTDETRCriterionv2:
+  weight_dict: {loss_vfl: 1, loss_bbox: 5, loss_giou: 2,}
+  losses: ['vfl', 'boxes', ]
+  alpha: 0.75
+  gamma: 2.0
+
+  matcher:
+    type: HungarianMatcher
+    weight_dict: {cost_class: 2, cost_bbox: 5, cost_giou: 2}
+    alpha: 0.25
+    gamma: 2.0
+
--- a/rtdetrv2_pytorch/configs/rtdetrv2/rtdetrv2_hgnetv2_h_6x_coco.yml
+++ b/rtdetrv2_pytorch/configs/rtdetrv2/rtdetrv2_hgnetv2_h_6x_coco.yml
@@ -0,0 +1,50 @@
+__include__: [
+  '../dataset/coco_detection.yml',
+  '../runtime.yml',
+  './include/dataloader.yml',
+  './include/optimizer.yml',
+  './include/rtdetrv2_r50vd.yml',
+]
+
+
+output_dir: ./output/rtdetrv2_hgnetv2_h_6x_coco
+
+
+RTDETR:
+  backbone: HGNetv2
+
+
+HGNetv2:
+  name: 'H'
+  return_idx: [1, 2, 3]
+  freeze_at: 0
+  freeze_norm: True
+  pretrained: True
+
+
+HybridEncoder:
+  # intra
+  hidden_dim: 512
+  dim_feedforward: 2048
+  num_encoder_layers: 2
+
+
+RTDETRTransformerv2:
+  feat_channels: [512, 512, 512]
+
+
+
+optimizer:
+  type: AdamW
+  params: 
+    - 
+      params: '^(?=.*backbone)(?!.*norm|bn).*$'
+      lr: 0.000005
+    - 
+      params: '^(?=.*(?:encoder|decoder))(?=.*(?:norm|bn)).*$'
+      weight_decay: 0.
+
+  lr: 0.0001
+  betas: [0.9, 0.999]
+  weight_decay: 0.0001
+
--- a/rtdetrv2_pytorch/configs/rtdetrv2/rtdetrv2_hgnetv2_l_6x_coco.yml
+++ b/rtdetrv2_pytorch/configs/rtdetrv2/rtdetrv2_hgnetv2_l_6x_coco.yml
@@ -0,0 +1,38 @@
+__include__: [
+  '../dataset/coco_detection.yml',
+  '../runtime.yml',
+  './include/dataloader.yml',
+  './include/optimizer.yml',
+  './include/rtdetrv2_r50vd.yml',
+]
+
+
+output_dir: ./output/rtdetrv2_hgnetv2_l_6x_coco
+
+
+RTDETR:
+  backbone: HGNetv2
+
+
+HGNetv2:
+  name: 'L'
+  return_idx: [1, 2, 3]
+  freeze_at: 0
+  freeze_norm: True
+  pretrained: True
+
+
+optimizer:
+  type: AdamW
+  params: 
+    - 
+      params: '^(?=.*backbone)(?!.*norm|bn).*$'
+      lr: 0.000005
+    - 
+      params: '^(?=.*(?:encoder|decoder))(?=.*(?:norm|bn)).*$'
+      weight_decay: 0.
+
+  lr: 0.0001
+  betas: [0.9, 0.999]
+  weight_decay: 0.0001
+
--- a/rtdetrv2_pytorch/configs/rtdetrv2/rtdetrv2_hgnetv2_x_6x_coco.yml
+++ b/rtdetrv2_pytorch/configs/rtdetrv2/rtdetrv2_hgnetv2_x_6x_coco.yml
@@ -0,0 +1,50 @@
+__include__: [
+  '../dataset/coco_detection.yml',
+  '../runtime.yml',
+  './include/dataloader.yml',
+  './include/optimizer.yml',
+  './include/rtdetrv2_r50vd.yml',
+]
+
+
+output_dir: ./output/rtdetrv2_hgnetv2_x_6x_coco
+
+
+RTDETR:
+  backbone: HGNetv2
+
+
+HGNetv2:
+  name: 'X'
+  return_idx: [1, 2, 3]
+  freeze_at: 0
+  freeze_norm: True
+  pretrained: True
+
+
+
+HybridEncoder:
+  # intra
+  hidden_dim: 384
+  dim_feedforward: 2048
+
+
+RTDETRTransformerv2:
+  feat_channels: [384, 384, 384]
+
+
+
+optimizer:
+  type: AdamW
+  params: 
+    - 
+      params: '^(?=.*backbone)(?!.*norm|bn).*$'
+      lr: 0.000001
+    - 
+      params: '^(?=.*(?:encoder|decoder))(?=.*(?:norm|bn)).*$'
+      weight_decay: 0.
+
+  lr: 0.0001
+  betas: [0.9, 0.999]
+  weight_decay: 0.0001
+
--- a/rtdetrv2_pytorch/configs/rtdetrv2/rtdetrv2_r101vd_6x_coco.yml
+++ b/rtdetrv2_pytorch/configs/rtdetrv2/rtdetrv2_r101vd_6x_coco.yml
@@ -0,0 +1,40 @@
+__include__: [
+  '../dataset/coco_detection.yml',
+  '../runtime.yml',
+  './include/dataloader.yml',
+  './include/optimizer.yml',
+  './include/rtdetrv2_r50vd.yml',
+]
+
+
+output_dir: ./output/rtdetrv2_r101vd_6x_coco
+
+
+PResNet:
+  depth: 101
+
+
+HybridEncoder:
+  # intra
+  hidden_dim: 384
+  dim_feedforward: 2048
+
+
+RTDETRTransformerv2:
+  feat_channels: [384, 384, 384]
+
+
+optimizer:
+  type: AdamW
+  params: 
+    - 
+      params: '^(?=.*backbone)(?!.*norm|bn).*$'
+      lr: 0.000001
+    - 
+      params: '^(?=.*(?:encoder|decoder))(?=.*(?:norm|bn)).*$'
+      weight_decay: 0.
+
+  lr: 0.0001
+  betas: [0.9, 0.999]
+  weight_decay: 0.0001
+
--- a/rtdetrv2_pytorch/configs/rtdetrv2/rtdetrv2_r18vd_120e_coco.yml
+++ b/rtdetrv2_pytorch/configs/rtdetrv2/rtdetrv2_r18vd_120e_coco.yml
@@ -0,0 +1,46 @@
+__include__: [
+  '../dataset/coco_detection.yml',
+  '../runtime.yml',
+  './include/dataloader.yml',
+  './include/optimizer.yml',
+  './include/rtdetrv2_r50vd.yml',
+]
+
+
+output_dir: ./output/rtdetrv2_r18vd_120e_coco
+
+
+PResNet:
+  depth: 18
+  freeze_at: -1
+  freeze_norm: False
+  pretrained: True
+
+
+HybridEncoder:
+  in_channels: [128, 256, 512]
+  hidden_dim: 256
+  expansion: 0.5
+
+
+RTDETRTransformerv2:
+  num_layers: 3
+
+
+epoches: 120 
+
+optimizer:
+  type: AdamW
+  params:
+    - 
+      params: '^(?=.*(?:norm|bn)).*$'
+      weight_decay: 0.
+
+
+train_dataloader: 
+  dataset: 
+    transforms:
+      policy:
+        epoch: 117
+  collate_fn:
+    scales: ~
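Note on `__include__`: each model config lists shared base files and then redefines only the keys that differ (for example `PResNet.depth` or `epoches`). The sketch below shows one way such overrides could be resolved, assuming a recursive dictionary merge in which later values win. The `merge_dict` helper is hypothetical, and the base values are invented for the example rather than read from `include/rtdetrv2_r50vd.yml`.

```python
import copy


def merge_dict(base, override):
    """Recursively merge `override` into a copy of `base`; override wins."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_dict(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Invented base values standing in for an included file.
base = {
    "PResNet": {"depth": 50, "freeze_at": 0, "pretrained": True},
    "RTDETRTransformerv2": {"num_layers": 6},
}
# Keys redefined by a model-specific config.
override = {
    "PResNet": {"depth": 18, "freeze_at": -1},
    "RTDETRTransformerv2": {"num_layers": 3},
    "epoches": 120,
}

cfg = merge_dict(base, override)
print(cfg)
```

Nested keys that the override does not mention (here `pretrained`) keep their base value, which is why the model configs can stay short.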
--- a/rtdetrv2_pytorch/configs/rtdetrv2/rtdetrv2_r18vd_120e_voc.yml
+++ b/rtdetrv2_pytorch/configs/rtdetrv2/rtdetrv2_r18vd_120e_voc.yml
@@ -0,0 +1,46 @@
+__include__: [
+  '../dataset/voc_detection.yml',
+  '../runtime.yml',
+  './include/dataloader.yml',
+  './include/optimizer.yml',
+  './include/rtdetrv2_r50vd.yml',
+]
+
+
+output_dir: ./output/rtdetrv2_r18vd_120e_voc
+
+
+PResNet:
+  depth: 18
+  freeze_at: -1
+  freeze_norm: False
+  pretrained: True
+
+
+HybridEncoder:
+  in_channels: [128, 256, 512]
+  hidden_dim: 256
+  expansion: 0.5
+
+
+RTDETRTransformerv2:
+  num_layers: 3
+
+
+epoches: 120 
+
+optimizer:
+  type: AdamW
+  params:
+    - 
+      params: '^(?=.*(?:norm|bn)).*$'
+      weight_decay: 0.
+
+train_dataloader: 
+  dataset: 
+    transforms:
+      policy:
+        epoch: 117
+  collate_fn:
+    scales: ~
+  total_batch_size: 32
--- a/rtdetrv2_pytorch/configs/rtdetrv2/rtdetrv2_r18vd_dsp_3x_coco.yml
+++ b/rtdetrv2_pytorch/configs/rtdetrv2/rtdetrv2_r18vd_dsp_3x_coco.yml
@@ -0,0 +1,49 @@
+__include__: [
+  '../dataset/coco_detection.yml',
+  '../runtime.yml',
+  './include/dataloader.yml',
+  './include/optimizer.yml',
+  './include/rtdetrv2_r50vd.yml',
+]
+
+
+tuning: https://github.com/lyuwenyu/storage/releases/download/v0.1/rtdetrv2_r18vd_120e_coco.pth
+
+output_dir: ./output/rtdetrv2_r18vd_dsp_3x_coco
+
+PResNet:
+  depth: 18
+  freeze_at: -1
+  freeze_norm: False
+  pretrained: True
+
+
+HybridEncoder:
+  in_channels: [128, 256, 512]
+  hidden_dim: 256
+  expansion: 0.5
+
+
+RTDETRTransformerv2:
+  num_layers: 3
+  num_points: [4, 4, 4]
+  cross_attn_method: discrete
+
+
+epoches: 36
+
+optimizer:
+  type: AdamW
+  params:
+    - 
+      params: '^(?=.*(?:norm|bn)).*$'
+      weight_decay: 0.
+
+
+train_dataloader: 
+  dataset: 
+    transforms:
+      policy:
+        epoch: 33
+  collate_fn:
+    scales: ~
--- a/rtdetrv2_pytorch/configs/rtdetrv2/rtdetrv2_r18vd_sp1_120e_coco.yml
+++ b/rtdetrv2_pytorch/configs/rtdetrv2/rtdetrv2_r18vd_sp1_120e_coco.yml
@@ -0,0 +1,47 @@
+__include__: [
+  '../dataset/coco_detection.yml',
+  '../runtime.yml',
+  './include/dataloader.yml',
+  './include/optimizer.yml',
+  './include/rtdetrv2_r50vd.yml',
+]
+
+
+output_dir: ./output/rtdetrv2_r18vd_sp1_120e_coco
+
+
+PResNet:
+  depth: 18
+  freeze_at: -1
+  freeze_norm: False
+  pretrained: True
+
+
+HybridEncoder:
+  in_channels: [128, 256, 512]
+  hidden_dim: 256
+  expansion: 0.5
+
+
+RTDETRTransformerv2:
+  num_layers: 3
+  num_points: [1, 1, 1]
+
+
+epoches: 120 
+
+optimizer:
+  type: AdamW
+  params:
+    - 
+      params: '^(?=.*(?:norm|bn)).*$'
+      weight_decay: 0.
+
+
+train_dataloader: 
+  dataset: 
+    transforms:
+      policy:
+        epoch: 117
+  collate_fn:
+    scales: ~
--- a/rtdetrv2_pytorch/configs/rtdetrv2/rtdetrv2_r18vd_sp2_120e_coco.yml
+++ b/rtdetrv2_pytorch/configs/rtdetrv2/rtdetrv2_r18vd_sp2_120e_coco.yml
@@ -0,0 +1,47 @@
+__include__: [
+  '../dataset/coco_detection.yml',
+  '../runtime.yml',
+  './include/dataloader.yml',
+  './include/optimizer.yml',
+  './include/rtdetrv2_r50vd.yml',
+]
+
+
+output_dir: ./output/rtdetrv2_r18vd_sp2_120e_coco
+
+
+PResNet:
+  depth: 18
+  freeze_at: -1
+  freeze_norm: False
+  pretrained: True
+
+
+HybridEncoder:
+  in_channels: [128, 256, 512]
+  hidden_dim: 256
+  expansion: 0.5
+
+
+RTDETRTransformerv2:
+  num_layers: 3
+  num_points: [2, 2, 2]
+
+
+epoches: 120 
+
+optimizer:
+  type: AdamW
+  params:
+    - 
+      params: '^(?=.*(?:norm|bn)).*$'
+      weight_decay: 0.
+
+
+train_dataloader: 
+  dataset: 
+    transforms:
+      policy:
+        epoch: 117
+  collate_fn:
+    scales: ~
--- a/rtdetrv2_pytorch/configs/rtdetrv2/rtdetrv2_r18vd_sp3_120e_coco.yml
+++ b/rtdetrv2_pytorch/configs/rtdetrv2/rtdetrv2_r18vd_sp3_120e_coco.yml
@@ -0,0 +1,47 @@
+__include__: [
+  '../dataset/coco_detection.yml',
+  '../runtime.yml',
+  './include/dataloader.yml',
+  './include/optimizer.yml',
+  './include/rtdetrv2_r50vd.yml',
+]
+
+
+output_dir: ./output/rtdetrv2_r18vd_sp3_120e_coco
+
+
+PResNet:
+  depth: 18
+  freeze_at: -1
+  freeze_norm: False
+  pretrained: True
+
+
+HybridEncoder:
+  in_channels: [128, 256, 512]
+  hidden_dim: 256
+  expansion: 0.5
+
+
+RTDETRTransformerv2:
+  num_layers: 3
+  num_points: [3, 3, 3]
+
+
+epoches: 120 
+
+optimizer:
+  type: AdamW
+  params:
+    - 
+      params: '^(?=.*(?:norm|bn)).*$'
+      weight_decay: 0.
+
+
+train_dataloader: 
+  dataset: 
+    transforms:
+      policy:
+        epoch: 117
+  collate_fn:
+    scales: ~
--- a/rtdetrv2_pytorch/configs/rtdetrv2/rtdetrv2_r34vd_120e_coco.yml
+++ b/rtdetrv2_pytorch/configs/rtdetrv2/rtdetrv2_r34vd_120e_coco.yml
@@ -0,0 +1,57 @@
+__include__: [
+  '../dataset/coco_detection.yml',
+  '../runtime.yml',
+  './include/dataloader.yml',
+  './include/optimizer.yml',
+  './include/rtdetrv2_r50vd.yml',
+]
+
+
+output_dir: ./output/rtdetrv2_r34vd_120e_coco
+
+
+PResNet:
+  depth: 34
+  freeze_at: -1
+  freeze_norm: False
+  pretrained: True
+
+
+HybridEncoder:
+  in_channels: [128, 256, 512]
+  hidden_dim: 256
+  expansion: 0.5
+
+
+RTDETRTransformerv2:
+  num_layers: 4
+
+
+epoches: 120
+
+optimizer:
+  type: AdamW
+  params: 
+    - 
+      params: '^(?=.*backbone)(?!.*(?:norm|bn)).*$'
+      lr: 0.00005
+    - 
+      params: '^(?=.*backbone)(?=.*(?:norm|bn)).*$'
+      lr: 0.00005
+      weight_decay: 0.
+    - 
+      params: '^(?=.*(?:encoder|decoder))(?=.*(?:norm|bn|bias)).*$'
+      weight_decay: 0.
+
+  lr: 0.0001
+  betas: [0.9, 0.999]
+  weight_decay: 0.0001
+
+
+train_dataloader: 
+  dataset: 
+    transforms:
+      policy:
+        epoch: 117
+  collate_fn:
+    stop_epoch: 117
--- a/rtdetrv2_pytorch/configs/rtdetrv2/rtdetrv2_r34vd_dsp_1x_coco.yml
+++ b/rtdetrv2_pytorch/configs/rtdetrv2/rtdetrv2_r34vd_dsp_1x_coco.yml
@@ -0,0 +1,59 @@
+__include__: [
+  '../dataset/coco_detection.yml',
+  '../runtime.yml',
+  './include/dataloader.yml',
+  './include/optimizer.yml',
+  './include/rtdetrv2_r50vd.yml',
+]
+
+tuning: https://github.com/lyuwenyu/storage/releases/download/v0.1/rtdetrv2_r34vd_120e_coco_ema.pth
+
+output_dir: ./output/rtdetrv2_r34vd_dsp_1x_coco
+
+
+PResNet:
+  depth: 34
+  freeze_at: -1
+  freeze_norm: False
+  pretrained: True
+
+
+HybridEncoder:
+  in_channels: [128, 256, 512]
+  hidden_dim: 256
+  expansion: 0.5
+
+
+RTDETRTransformerv2:
+  num_layers: 4
+  cross_attn_method: discrete
+
+
+epoches: 12
+
+optimizer:
+  type: AdamW
+  params: 
+    - 
+      params: '^(?=.*backbone)(?!.*(?:norm|bn)).*$'
+      lr: 0.00005
+    - 
+      params: '^(?=.*backbone)(?=.*(?:norm|bn)).*$'
+      lr: 0.00005
+      weight_decay: 0.
+    - 
+      params: '^(?=.*(?:encoder|decoder))(?=.*(?:norm|bn|bias)).*$'
+      weight_decay: 0.
+
+  lr: 0.0001
+  betas: [0.9, 0.999]
+  weight_decay: 0.0001
+
+
+train_dataloader: 
+  dataset: 
+    transforms:
+      policy:
+        epoch: 10
+  collate_fn:
+    stop_epoch: 10
--- a/rtdetrv2_pytorch/configs/rtdetrv2/rtdetrv2_r50vd_6x_coco.yml
+++ b/rtdetrv2_pytorch/configs/rtdetrv2/rtdetrv2_r50vd_6x_coco.yml
@@ -0,0 +1,27 @@
+__include__: [
+  '../dataset/coco_detection.yml',
+  '../runtime.yml',
+  './include/dataloader.yml',
+  './include/optimizer.yml',
+  './include/rtdetrv2_r50vd.yml',
+]
+
+
+output_dir: ./output/rtdetrv2_r50vd_6x_coco
+
+
+
+optimizer:
+  type: AdamW
+  params: 
+    - 
+      params: '^(?=.*backbone)(?!.*norm).*$'
+      lr: 0.00001
+    - 
+      params: '^(?=.*(?:encoder|decoder))(?=.*(?:norm|bn)).*$'
+      weight_decay: 0.
+
+  lr: 0.0001
+  betas: [0.9, 0.999]
+  weight_decay: 0.0001
+
--- a/rtdetrv2_pytorch/configs/rtdetrv2/rtdetrv2_r50vd_dsp_1x_coco.yml
+++ b/rtdetrv2_pytorch/configs/rtdetrv2/rtdetrv2_r50vd_dsp_1x_coco.yml
@@ -0,0 +1,27 @@
+__include__: [
+  '../dataset/coco_detection.yml',
+  '../runtime.yml',
+  './include/dataloader.yml',
+  './include/optimizer.yml',
+  './include/rtdetrv2_r50vd.yml',
+]
+
+
+tuning: https://github.com/lyuwenyu/storage/releases/download/v0.1/rtdetrv2_r50vd_6x_coco_ema.pth
+
+output_dir: ./output/rtdetrv2_r50vd_dsp_1x_coco
+
+
+RTDETRTransformerv2:
+  cross_attn_method: discrete
+
+
+epoches: 12
+
+train_dataloader: 
+  dataset: 
+    transforms:
+      policy:
+        epoch: 10
+  collate_fn:
+    stop_epoch: 10
--- a/rtdetrv2_pytorch/configs/rtdetrv2/rtdetrv2_r50vd_m_7x_coco.yml
+++ b/rtdetrv2_pytorch/configs/rtdetrv2/rtdetrv2_r50vd_m_7x_coco.yml
@@ -0,0 +1,43 @@
+__include__: [
+  '../dataset/coco_detection.yml',
+  '../runtime.yml',
+  './include/dataloader.yml',
+  './include/optimizer.yml',
+  './include/rtdetrv2_r50vd.yml',
+]
+
+output_dir: ./output/rtdetrv2_r50vd_m_7x_coco
+
+
+HybridEncoder:
+  expansion: 0.5
+
+
+RTDETRTransformerv2:
+  eval_idx: 2 # use the 3rd decoder layer for eval
+
+
+epoches: 84
+
+optimizer:
+  type: AdamW
+  params: 
+    - 
+      params: '^(?=.*backbone)(?!.*norm).*$'
+      lr: 0.00001
+    - 
+      params: '^(?=.*(?:encoder|decoder))(?=.*(?:norm|bn)).*$'
+      weight_decay: 0.
+
+  lr: 0.0001
+  betas: [0.9, 0.999]
+  weight_decay: 0.0001
+
+
+train_dataloader: 
+  dataset: 
+    transforms:
+      policy:
+        epoch: 81
+  collate_fn:
+    stop_epoch: 81
--- a/rtdetrv2_pytorch/configs/rtdetrv2/rtdetrv2_r50vd_m_dsp_3x_coco.yml
+++ b/rtdetrv2_pytorch/configs/rtdetrv2/rtdetrv2_r50vd_m_dsp_3x_coco.yml
@@ -0,0 +1,44 @@
+__include__: [
+  '../dataset/coco_detection.yml',
+  '../runtime.yml',
+  './include/dataloader.yml',
+  './include/optimizer.yml',
+  './include/rtdetrv2_r50vd.yml',
+]
+
+output_dir: ./output/rtdetrv2_r50vd_m_dsp_3x_coco
+tuning: https://github.com/lyuwenyu/storage/releases/download/v0.1/rtdetrv2_r50vd_m_7x_coco_ema.pth
+
+HybridEncoder:
+  expansion: 0.5
+
+
+RTDETRTransformerv2:
+  eval_idx: 2 # use the 3rd decoder layer for eval
+  cross_attn_method: discrete
+
+
+epoches: 36
+
+optimizer:
+  type: AdamW
+  params: 
+    - 
+      params: '^(?=.*backbone)(?!.*norm).*$'
+      lr: 0.00001
+    - 
+      params: '^(?=.*(?:encoder|decoder))(?=.*(?:norm|bn)).*$'
+      weight_decay: 0.
+
+  lr: 0.0001
+  betas: [0.9, 0.999]
+  weight_decay: 0.0001
+
+
+train_dataloader: 
+  dataset: 
+    transforms:
+      policy:
+        epoch: 33
+  collate_fn:
+    stop_epoch: 33
--- a/rtdetrv2_pytorch/configs/runtime.yml
+++ b/rtdetrv2_pytorch/configs/runtime.yml
@@ -0,0 +1,21 @@
+
+print_freq: 100
+output_dir: './logs'
+checkpoint_freq: 1
+
+
+sync_bn: True
+find_unused_parameters: False
+
+
+use_amp: False
+scaler:
+  type: GradScaler
+  enabled: True
+
+
+use_ema: False
+ema:
+  type: ModelEMA
+  decay: 0.9999
+  warmups: 2000
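Note on the `ema` block: `decay: 0.9999` and `warmups: 2000` suggest that the effective decay is ramped up during early training, so the averaged weights track the live weights closely at first. The sketch below assumes an exponential ramp, `decay * (1 - exp(-step / warmups))`. Both the formula and the function names are assumptions for illustration and may differ from what `ModelEMA` actually does.

```python
import math


def ema_decay(step, decay=0.9999, warmups=2000):
    """Effective decay at a given update step (assumed exponential ramp)."""
    return decay * (1.0 - math.exp(-step / warmups))


def ema_update(ema_value, value, step, decay=0.9999, warmups=2000):
    """One EMA step for a single scalar weight."""
    d = ema_decay(step, decay, warmups)
    return d * ema_value + (1.0 - d) * value


for step in (1, 2000, 20000):
    print(step, round(ema_decay(step), 4))
```

Under this assumption the decay starts near 0 and approaches the configured value only after several multiples of `warmups` updates.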
--- a/rtdetrv2_pytorch/docker-compose.yml
+++ b/rtdetrv2_pytorch/docker-compose.yml
@@ -0,0 +1,23 @@
+services:
+  tensorrt-container:
+    build:
+      context: .
+      dockerfile: Dockerfile
+    image: rtdetr-v2:25.06
+    container_name: rtdetr-v2-trt
+    ports:
+      - "6006:6006" # tensorboard
+    volumes:
+      - ./:/workspace
+    deploy:
+      resources:
+        reservations:
+          devices:
+            - driver: nvidia
+              count: all
+              capabilities: [gpu]
+    working_dir: /workspace
+    restart: unless-stopped
+    stdin_open: true
+    tty: true
+    command: bash
--- a/rtdetrv2_pytorch/references/deploy/readme.md
+++ b/rtdetrv2_pytorch/references/deploy/readme.md
@@ -0,0 +1,2 @@
+# Deployment
+
--- a/rtdetrv2_pytorch/references/deploy/rtdetrv2_onnxruntime.py
+++ b/rtdetrv2_pytorch/references/deploy/rtdetrv2_onnxruntime.py
@@ -0,0 +1,61 @@
+"""Copyright(c) 2023 lyuwenyu. All Rights Reserved.
+"""
+
+import torch
+import torchvision.transforms as T
+
+import numpy as np 
+import onnxruntime as ort 
+from PIL import Image, ImageDraw
+
+
+def draw(images, labels, boxes, scores, thrh = 0.6):
+    for i, im in enumerate(images):
+        draw = ImageDraw.Draw(im)
+
+        scr = scores[i]
+        lab = labels[i][scr > thrh]
+        box = boxes[i][scr > thrh]
+
+        for j, b in enumerate(box):
+            draw.rectangle(list(b), outline='red',)
+            draw.text((b[0], b[1]), text=str(lab[j].item()), fill='blue', )
+
+        im.save(f'results_{i}.jpg')
+
+
+def main(args, ):
+    """main
+    """
+    sess = ort.InferenceSession(args.onnx_file)
+    print(ort.get_device())
+
+    im_pil = Image.open(args.im_file).convert('RGB')
+    w, h = im_pil.size
+    orig_size = torch.tensor([w, h])[None]
+
+    transforms = T.Compose([
+        T.Resize((640, 640)),
+        T.ToTensor(),
+    ])
+    im_data = transforms(im_pil)[None]
+
+    output = sess.run(
+        # output_names=['labels', 'boxes', 'scores'],
+        output_names=None,
+        input_feed={'images': im_data.data.numpy(), "orig_target_sizes": orig_size.data.numpy()}
+    )
+
+    labels, boxes, scores = output
+
+    draw([im_pil], labels, boxes, scores)
+
+
+if __name__ == '__main__':
+    import argparse
+    parser = argparse.ArgumentParser()
+    parser.add_argument('--onnx-file', type=str, )
+    parser.add_argument('--im-file', type=str, )
+    # parser.add_argument('-d', '--device', type=str, default='cpu')
+    args = parser.parse_args()
+    main(args)
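Note on `draw`: the helper keeps only detections whose score exceeds a threshold and labels each remaining box with its own class id. The following is a pure-Python sketch of that filtering step, assuming the per-image outputs are sequences of equal length. `filter_detections` is a hypothetical helper, not part of the script, and the sample values are made up.

```python
def filter_detections(labels, boxes, scores, threshold=0.6):
    """Keep (label, box, score) triples whose score exceeds the threshold."""
    kept = []
    for label, box, score in zip(labels, boxes, scores):
        if score > threshold:
            kept.append((label, box, score))
    return kept


# Made-up outputs for a single image.
labels = [0, 16, 2]
boxes = [[10, 20, 110, 220], [5, 5, 50, 50], [30, 40, 90, 120]]
scores = [0.91, 0.42, 0.77]

kept = filter_detections(labels, boxes, scores)
for label, box, score in kept:
    print(label, box, f"{score:.2f}")
```

Iterating over the zipped triples keeps each label paired with its box, which avoids indexing the labels with the image index by mistake.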
--- a/rtdetrv2_pytorch/references/deploy/rtdetrv2_openvino.py
+++ b/rtdetrv2_pytorch/references/deploy/rtdetrv2_openvino.py
@@ -0,0 +1,5 @@
+"""Copyright(c) 2023 lyuwenyu. All Rights Reserved.
+"""
+
+
+# Please refer to: https://github.com/guojin-yan/RT-DETR-OpenVINO
--- a/rtdetrv2_pytorch/references/deploy/rtdetrv2_tensorrt.py
+++ b/rtdetrv2_pytorch/references/deploy/rtdetrv2_tensorrt.py
@@ -0,0 +1,258 @@
+# Copyright 2023 lyuwenyu. All Rights Reserved.
+# Copyright (c) 2025 Hitbee-dev. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+# ==============================================================================
+# NOTICE: This file has been heavily modified by [Hitbee-dev] from the original source.
+# Modifications include restructuring for broader GPU architecture compatibility
+# (including NVIDIA Blackwell), improved modularity, and enhanced testability.
+# ==============================================================================
+
+import time
+import numpy as np
+import torch
+import tensorrt as trt
+from collections import OrderedDict
+from PIL import Image, ImageDraw, ImageFont
+
+class TRTInference(object):
+    """
+    A high-level wrapper for TensorRT inference, designed for ease of use and flexibility.
+    This class handles engine loading, context creation, and dynamic buffer allocation.
+    """
+    def __init__(self, engine_path, device='cuda:0', verbose=False):
+        """
+        Initializes the TRTInference instance.
+
+        Args:
+            engine_path (str): Path to the serialized TensorRT engine file.
+            device (str): The device to run inference on (e.g., 'cuda:0').
+            verbose (bool): If True, enables verbose logging from the TensorRT logger.
+        """
+        self.engine_path = engine_path
+        self.device = torch.device(device)
+        self.logger = trt.Logger(trt.Logger.VERBOSE) if verbose else trt.Logger(trt.Logger.INFO)
+        
+        trt.init_libnvinfer_plugins(self.logger, '')
+        self.runtime = trt.Runtime(self.logger)
+        self.engine = self._load_engine(engine_path)
+        self.context = self.engine.create_execution_context()
+
+        self.input_names, self.output_names = self._get_io_names()
+
+        self.buffers_allocated = False
+        self.gpu_buffers = OrderedDict()
+        self.binding_addrs = OrderedDict()
+
+        print(f"[TRTInference] Initialized successfully. Engine: '{engine_path}'.")
+
+    def _load_engine(self, path):
+        """Loads a TensorRT engine from a file."""
+        with open(path, 'rb') as f:
+            engine = self.runtime.deserialize_cuda_engine(f.read())
+        if engine is None:
+            raise RuntimeError(f"Failed to load TensorRT engine from '{path}'.")
+        return engine
+
+    def _get_io_names(self):
+        """Parses input and output tensor names from the engine."""
+        input_names, output_names = [], []
+        for i in range(self.engine.num_io_tensors):
+            name = self.engine.get_tensor_name(i)
+            if self.engine.get_tensor_mode(name) == trt.TensorIOMode.INPUT:
+                input_names.append(name)
+            else:
+                output_names.append(name)
+        return input_names, output_names
+
+    def _allocate_buffers(self, blob: dict):
+        """
+        Allocates GPU buffers for inputs and outputs based on the first inference request.
+        This "lazy allocation" strategy handles dynamic input shapes gracefully.
+        """
+        print("[TRTInference] First inference call detected. Allocating GPU buffers...")
+        for name in self.input_names:
+            tensor = blob[name]
+            shape = tuple(tensor.shape)
+            dtype = tensor.dtype
+            self.context.set_input_shape(name, shape)
+            self.gpu_buffers[name] = torch.empty(shape, dtype=dtype, device=self.device)
+            self.binding_addrs[name] = self.gpu_buffers[name].data_ptr()
+            print(f"  - Input '{name}': allocated buffer with shape {shape}.")
+
+        for name in self.output_names:
+            shape = tuple(self.context.get_tensor_shape(name))
+            dtype = trt.nptype(self.engine.get_tensor_dtype(name))
+            torch_dtype = torch.from_numpy(np.array(0, dtype=dtype)).dtype
+            self.gpu_buffers[name] = torch.empty(shape, dtype=torch_dtype, device=self.device)
+            self.binding_addrs[name] = self.gpu_buffers[name].data_ptr()
+            print(f"  - Output '{name}': allocated buffer with shape {shape}.")
+
+        self.buffers_allocated = True
+        print("[TRTInference] GPU buffers allocated successfully.")
+
+    def __call__(self, blob: dict):
+        """
+        Executes inference on the loaded TensorRT engine.
+
+        Args:
+            blob (dict): A dictionary mapping input tensor names to their corresponding
+                         torch.Tensor data on the GPU.
+
+        Returns:
+            dict: A dictionary mapping output tensor names to their corresponding
+                  torch.Tensor results on the GPU.
+        """
+        if not self.buffers_allocated:
+            self._allocate_buffers(blob)
+            
+        for name in self.input_names:
+            self.gpu_buffers[name].copy_(blob[name])
+
+        self.context.execute_v2(bindings=list(self.binding_addrs.values()))
+        
+        return {name: self.gpu_buffers[name] for name in self.output_names}
+
+# --- Visualization Utility Function ---
+COCO_CLASSES = [
+    'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus', 'train', 'truck', 'boat', 'traffic light',
+    'fire hydrant', 'stop sign', 'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow',
+    'elephant', 'bear', 'zebra', 'giraffe', 'backpack', 'umbrella', 'handbag', 'tie', 'suitcase', 'frisbee',
+    'skis', 'snowboard', 'sports ball', 'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard',
+    'tennis racket', 'bottle', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl', 'banana', 'apple',
+    'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza', 'donut', 'cake', 'chair', 'couch',
+    'potted plant', 'bed', 'dining table', 'toilet', 'tv', 'laptop', 'mouse', 'remote', 'keyboard',
+    'cell phone', 'microwave', 'oven', 'toaster', 'sink', 'refrigerator', 'book', 'clock', 'vase',
+    'scissors', 'teddy bear', 'hair drier', 'toothbrush'
+]
+
+def visualize_detections(image_pil, boxes, scores, labels, class_names=COCO_CLASSES, threshold=0.5):
+    """
+    Draws bounding boxes on a PIL image. This function is a general-purpose utility.
+
+    Args:
+        image_pil (PIL.Image.Image): The image to draw on.
+        boxes (torch.Tensor): A tensor of bounding boxes (shape: [N, 4]).
+        scores (torch.Tensor): A tensor of confidence scores (shape: [N]).
+        labels (torch.Tensor): A tensor of class labels (shape: [N]).
+        class_names (list): A list of strings corresponding to class labels.
+        threshold (float): The confidence threshold for displaying detections.
+
+    Returns:
+        PIL.Image.Image: The image with detections drawn on it.
+    """
+    img_draw = image_pil.copy()
+    draw = ImageDraw.Draw(img_draw)
+    
+    # Ensure tensors are on CPU and converted to NumPy for processing
+    boxes = boxes.cpu().numpy()
+    scores = scores.cpu().numpy()
+    labels = labels.cpu().numpy()
+    
+    count = 0
+    for i in range(len(scores)):
+        score = scores[i]
+        if score < threshold:
+            continue
+        
+        count += 1
+        box = boxes[i]
+        label_idx = int(labels[i])
+        
+        xmin, ymin, xmax, ymax = box
+        class_name = class_names[label_idx] if label_idx < len(class_names) else f'CLS-{label_idx}'
+        color = 'red' # Keep it simple or use a color map
+        
+        draw.rectangle(((xmin, ymin), (xmax, ymax)), outline=color, width=3)
+        
+        text = f"{class_name}: {score:.2f}"
+        
+        try:
+            font = ImageFont.truetype("arial.ttf", 20)
+        except IOError:
+            font = ImageFont.load_default()
+
+        text_bbox = draw.textbbox((xmin, ymin), text, font=font)
+        draw.rectangle(text_bbox, fill=color)
+        draw.text((xmin, ymin), text, fill="white", font=font)
+        
+    print(f"   - Found {count} objects above threshold {threshold}.")
+    return img_draw
+
+if __name__ == '__main__':
+    import argparse
+    import torchvision.transforms as T
+    import os
+
+    parser = argparse.ArgumentParser(description="Test script for the TRTInference wrapper.")
+    parser.add_argument('--engine', type=str, required=True, help="Path to the TensorRT engine file.")
+    parser.add_argument('--image', type=str, required=True, help="Path to the input image file.")
+    parser.add_argument('--output', type=str, default='output.jpg', help="Path to save the output image with detections.")
+    parser.add_argument('--device', type=str, default='cuda:0', help="Device to run inference on.")
+    parser.add_argument('--threshold', type=float, default=0.5, help="Confidence threshold for displaying detections.")
+    args = parser.parse_args()
+    
+    if not torch.cuda.is_available():
+        raise SystemExit("CUDA is not available. This script requires a GPU.")
+    
+    print("--- TRTInference Wrapper Test ---")
+    
+    print("\n1. Initializing TRTInference...")
+    trt_model = TRTInference(args.engine, device=args.device)
+    
+    print("\n2. Preprocessing input image...")
+    image_pil = Image.open(args.image).convert('RGB')
+    w, h = image_pil.size
+    
+    transforms = T.Compose([
+        T.Resize((640, 640)),
+        T.ToTensor(),
+    ])
+    
+    image_tensor = transforms(image_pil).unsqueeze(0).to(args.device)
+    orig_size_tensor = torch.tensor([[w, h]], dtype=torch.int64, device=args.device)
+
+    blob = {
+        'images': image_tensor,
+        'orig_target_sizes': orig_size_tensor
+    }
+    print(f"   - Original image size: {w}x{h}")
+    print(f"   - Input tensor shape: {image_tensor.shape}")
+
+    print("\n3. Running inference...")
+    start_time = time.time()
+    output_gpu = trt_model(blob)
+    torch.cuda.synchronize()
+    end_time = time.time()
+    
+    print(f"\n4. Inference complete in { (end_time - start_time) * 1000:.2f} ms.")
+    
+    print("\n5. Post-processing and saving output image...")
+    output_labels = output_gpu['labels'][0]
+    output_boxes = output_gpu['boxes'][0]
+    output_scores = output_gpu['scores'][0]
+    
+    # Use the new, separate visualization function
+    result_image = visualize_detections(
+        image_pil, 
+        output_boxes, 
+        output_scores, 
+        output_labels, 
+        threshold=args.threshold
+    )
+    
+    result_image.save(args.output)
+    print(f"   - Output image with detections saved to: {os.path.abspath(args.output)}")
+
+    print("\n--- Test finished successfully ---")
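Note on timing: the test above measures a single call, and `TRTInference` allocates its GPU buffers lazily on the first call, so that number includes one-time setup. The sketch below is a small timing helper that runs a few warm-up calls first and then averages over several iterations. `benchmark` and its `sync` argument are hypothetical names introduced here. On a GPU, passing `torch.cuda.synchronize` as `sync` would be the natural choice.

```python
import time


def benchmark(fn, warmup=3, iters=10, sync=None):
    """Return the mean latency of fn() in milliseconds after warm-up calls."""
    for _ in range(warmup):
        fn()
    if sync is not None:
        sync()  # wait for pending device work before starting the clock
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if sync is not None:
        sync()  # make sure all timed work has finished
    elapsed = time.perf_counter() - start
    return elapsed * 1000.0 / iters


calls = []
mean_ms = benchmark(lambda: calls.append(1), warmup=2, iters=5)
print(f"{mean_ms:.4f} ms per call over {len(calls) - 2} timed calls")
```

With the wrapper from this file, a call could look like `benchmark(lambda: trt_model(blob), sync=torch.cuda.synchronize)`.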
--- a/rtdetrv2_pytorch/references/deploy/rtdetrv2_torch.py
+++ b/rtdetrv2_pytorch/references/deploy/rtdetrv2_torch.py
@@ -0,0 +1,84 @@
+"""Copyright(c) 2023 lyuwenyu. All Rights Reserved.
+"""
+
+import torch
+import torch.nn as nn 
+import torchvision.transforms as T
+
+import numpy as np 
+from PIL import Image, ImageDraw
+
+from src.core import YAMLConfig
+
+
+def draw(images, labels, boxes, scores, thrh = 0.6):
+    for i, im in enumerate(images):
+        draw = ImageDraw.Draw(im)
+
+        scr = scores[i]
+        lab = labels[i][scr > thrh]
+        box = boxes[i][scr > thrh]
+        scrs = scores[i][scr > thrh]
+
+        for j,b in enumerate(box):
+            draw.rectangle(list(b), outline='red',)
+            draw.text((b[0], b[1]), text=f"{lab[j].item()} {round(scrs[j].item(),2)}", fill='blue', )
+
+        im.save(f'results_{i}.jpg')
+
+
+def main(args, ):
+    """main
+    """
+    cfg = YAMLConfig(args.config, resume=args.resume)
+
+    if args.resume:
+        checkpoint = torch.load(args.resume, map_location='cpu') 
+        if 'ema' in checkpoint:
+            state = checkpoint['ema']['module']
+        else:
+            state = checkpoint['model']
+    else:
+        raise AttributeError('Only loading model.state_dict via --resume is supported for now.')
+
+    # NOTE load train mode state -> convert to deploy mode
+    cfg.model.load_state_dict(state)
+
+    class Model(nn.Module):
+        def __init__(self, ) -> None:
+            super().__init__()
+            self.model = cfg.model.deploy()
+            self.postprocessor = cfg.postprocessor.deploy()
+            
+        def forward(self, images, orig_target_sizes):
+            outputs = self.model(images)
+            outputs = self.postprocessor(outputs, orig_target_sizes)
+            return outputs
+
+    model = Model().to(args.device)
+
+    im_pil = Image.open(args.im_file).convert('RGB')
+    w, h = im_pil.size
+    orig_size = torch.tensor([w, h])[None].to(args.device)
+
+    transforms = T.Compose([
+        T.Resize((640, 640)),
+        T.ToTensor(),
+    ])
+    im_data = transforms(im_pil)[None].to(args.device)
+
+    output = model(im_data, orig_size)
+    labels, boxes, scores = output
+
+    draw([im_pil], labels, boxes, scores)
+
+
+if __name__ == '__main__':
+    import argparse
+    parser = argparse.ArgumentParser()
+    parser.add_argument('-c', '--config', type=str, )
+    parser.add_argument('-r', '--resume', type=str, )
+    parser.add_argument('-f', '--im-file', type=str, )
+    parser.add_argument('-d', '--device', type=str, default='cpu')
+    args = parser.parse_args()
+    main(args)
--- a/rtdetrv2_pytorch/requirements.txt
+++ b/rtdetrv2_pytorch/requirements.txt
@@ -0,0 +1,9 @@
+torch>=2.0.1
+torchvision>=0.15.2
+faster-coco-eval>=1.6.6
+PyYAML
+tensorboard
+scipy
+pycocotools
+onnx
+onnxruntime-gpu
--- a/rtdetrv2_pytorch/src/__init__.py
+++ b/rtdetrv2_pytorch/src/__init__.py
@@ -0,0 +1,8 @@
+"""Copyright(c) 2023 lyuwenyu. All Rights Reserved.
+"""
+
+# for register purpose
+from . import optim
+from . import data 
+from . import nn
+from . import zoo
--- a/rtdetrv2_pytorch/src/core/__init__.py
+++ b/rtdetrv2_pytorch/src/core/__init__.py
@@ -0,0 +1,7 @@
+"""Copyright(c) 2023 lyuwenyu. All Rights Reserved.
+"""
+
+from .workspace import GLOBAL_CONFIG, register, create
+from .yaml_utils import *
+from ._config import BaseConfig
+from .yaml_config import YAMLConfig
--- a/rtdetrv2_pytorch/src/core/_config.py
+++ b/rtdetrv2_pytorch/src/core/_config.py
@@ -0,0 +1,290 @@
+"""Copyright(c) 2023 lyuwenyu. All Rights Reserved.
+"""
+
+import torch 
+import torch.nn as nn 
+from torch.utils.data import Dataset, DataLoader
+from torch.optim import Optimizer
+from torch.optim.lr_scheduler import LRScheduler
+from torch.cuda.amp.grad_scaler import GradScaler
+from torch.utils.tensorboard import SummaryWriter
+
+from pathlib import Path 
+from typing import Callable, List, Dict
+
+
+__all__ = ['BaseConfig', ]
+
+
+class BaseConfig(object):
+    # TODO property
+
+    def __init__(self) -> None:
+        super().__init__()
+
+        self.task :str = None 
+
+        # instance / function 
+        self._model :nn.Module = None 
+        self._postprocessor :nn.Module = None 
+        self._criterion :nn.Module = None 
+        self._optimizer :Optimizer = None 
+        self._lr_scheduler :LRScheduler = None 
+        self._lr_warmup_scheduler: LRScheduler = None 
+        self._train_dataloader :DataLoader = None 
+        self._val_dataloader :DataLoader = None 
+        self._ema :nn.Module = None 
+        self._scaler :GradScaler = None 
+        self._train_dataset :Dataset = None 
+        self._val_dataset :Dataset = None
+        self._collate_fn :Callable = None
+        self._evaluator :Callable[[nn.Module, DataLoader, str], ] = None
+        self._writer: SummaryWriter = None
+        
+        # dataset 
+        self.num_workers :int = 0
+        self.batch_size :int = None
+        self._train_batch_size :int = None
+        self._val_batch_size :int = None
+        self._train_shuffle: bool = None  
+        self._val_shuffle: bool = None 
+
+        # runtime
+        self.resume :str = None
+        self.tuning :str = None 
+
+        self.epoches :int = None
+        self.last_epoch :int = -1
+
+        self.use_amp :bool = False 
+        self.use_ema :bool = False 
+        self.ema_decay :float = 0.9999
+        self.ema_warmups: int = 2000
+        self.sync_bn :bool = False 
+        self.clip_max_norm : float = 0.
+        self.find_unused_parameters :bool = None
+
+        self.seed :int = None
+        self.print_freq :int = None 
+        self.checkpoint_freq :int = 1
+        self.output_dir :str = None
+        self.summary_dir :str = None
+        self.device : str = ''
+
+    @property
+    def model(self, ) -> nn.Module:
+        return self._model 
+    
+    @model.setter
+    def model(self, m):
+        assert isinstance(m, nn.Module), f'{type(m)} != nn.Module, please check your model class'
+        self._model = m 
+
+    @property
+    def postprocessor(self, ) -> nn.Module:
+        return self._postprocessor
+    
+    @postprocessor.setter
+    def postprocessor(self, m):
+        assert isinstance(m, nn.Module), f'{type(m)} != nn.Module, please check your model class'
+        self._postprocessor = m 
+
+    @property
+    def criterion(self, ) -> nn.Module:
+        return self._criterion
+    
+    @criterion.setter
+    def criterion(self, m):
+        assert isinstance(m, nn.Module), f'{type(m)} != nn.Module, please check your model class'
+        self._criterion = m 
+
+    @property
+    def optimizer(self, ) -> Optimizer:
+        return self._optimizer
+    
+    @optimizer.setter
+    def optimizer(self, m):
+        assert isinstance(m, Optimizer), f'{type(m)} != optim.Optimizer, please check your optimizer class'
+        self._optimizer = m 
+
+    @property
+    def lr_scheduler(self, ) -> LRScheduler:
+        return self._lr_scheduler
+    
+    @lr_scheduler.setter
+    def lr_scheduler(self, m):
+        assert isinstance(m, LRScheduler), f'{type(m)} != LRScheduler, please check your lr_scheduler class'
+        self._lr_scheduler = m 
+
+    @property
+    def lr_warmup_scheduler(self, ) -> LRScheduler:
+        return self._lr_warmup_scheduler
+
+    @lr_warmup_scheduler.setter
+    def lr_warmup_scheduler(self, m):
+        self._lr_warmup_scheduler = m 
+
+    @property
+    def train_dataloader(self) -> DataLoader:
+        if self._train_dataloader is None and self.train_dataset is not None:
+            loader = DataLoader(self.train_dataset, 
+                                batch_size=self.train_batch_size, 
+                                num_workers=self.num_workers, 
+                                collate_fn=self.collate_fn,
+                                shuffle=self.train_shuffle, )
+            loader.shuffle = self.train_shuffle
+            self._train_dataloader = loader
+
+        return self._train_dataloader
+
+    @train_dataloader.setter
+    def train_dataloader(self, loader):
+        self._train_dataloader = loader 
+
+    @property
+    def val_dataloader(self) -> DataLoader:
+        if self._val_dataloader is None and self.val_dataset is not None:
+            loader = DataLoader(self.val_dataset, 
+                                batch_size=self.val_batch_size, 
+                                num_workers=self.num_workers, 
+                                drop_last=False,
+                                collate_fn=self.collate_fn, 
+                                shuffle=self.val_shuffle)
+            loader.shuffle = self.val_shuffle
+            self._val_dataloader = loader
+
+        return self._val_dataloader
+    
+    @val_dataloader.setter
+    def val_dataloader(self, loader):
+        self._val_dataloader = loader 
+
+    @property
+    def ema(self, ) -> nn.Module:
+        if self._ema is None and self.use_ema and self.model is not None:
+            from ..optim import ModelEMA
+            self._ema = ModelEMA(self.model, self.ema_decay, self.ema_warmups)
+        return self._ema
+
+    @ema.setter
+    def ema(self, obj):
+        self._ema = obj
+
+    @property
+    def scaler(self) -> GradScaler: 
+        if self._scaler is None and self.use_amp and torch.cuda.is_available():
+            self._scaler = GradScaler()
+        return self._scaler
+    
+    @scaler.setter
+    def scaler(self, obj: GradScaler):
+        self._scaler = obj
+
+    @property
+    def val_shuffle(self) -> bool:
+        if self._val_shuffle is None:
+            print('warning: set default val_shuffle=False')
+            return False
+        return self._val_shuffle
+
+    @val_shuffle.setter
+    def val_shuffle(self, shuffle):
+        assert isinstance(shuffle, bool), 'shuffle must be bool'
+        self._val_shuffle = shuffle
+
+    @property
+    def train_shuffle(self) -> bool:
+        if self._train_shuffle is None:
+            print('warning: set default train_shuffle=True')
+            return True
+        return self._train_shuffle
+
+    @train_shuffle.setter
+    def train_shuffle(self, shuffle):
+        assert isinstance(shuffle, bool), 'shuffle must be bool'
+        self._train_shuffle = shuffle
+
+
+    @property
+    def train_batch_size(self) -> int:
+        if self._train_batch_size is None and isinstance(self.batch_size, int):
+            print(f'warning: set train_batch_size=batch_size={self.batch_size}')
+            return self.batch_size
+        return self._train_batch_size
+
+    @train_batch_size.setter
+    def train_batch_size(self, batch_size):
+        assert isinstance(batch_size, int), 'batch_size must be int'
+        self._train_batch_size = batch_size
+
+    @property
+    def val_batch_size(self) -> int:
+        if self._val_batch_size is None:
+            print(f'warning: set val_batch_size=batch_size={self.batch_size}')
+            return self.batch_size
+        return self._val_batch_size
+
+    @val_batch_size.setter
+    def val_batch_size(self, batch_size):
+        assert isinstance(batch_size, int), 'batch_size must be int'
+        self._val_batch_size = batch_size
+
+
+    @property
+    def train_dataset(self) -> Dataset:
+        return self._train_dataset
+
+    @train_dataset.setter
+    def train_dataset(self, dataset):
+        assert isinstance(dataset, Dataset), f'{type(dataset)} must be Dataset'
+        self._train_dataset = dataset
+
+
+    @property
+    def val_dataset(self) -> Dataset:
+        return self._val_dataset
+
+    @val_dataset.setter
+    def val_dataset(self, dataset):
+        assert isinstance(dataset, Dataset), f'{type(dataset)} must be Dataset'
+        self._val_dataset = dataset
+
+    @property
+    def collate_fn(self) -> Callable:
+        return self._collate_fn
+
+    @collate_fn.setter
+    def collate_fn(self, fn):
+        assert isinstance(fn, Callable), f'{type(fn)} must be Callable'
+        self._collate_fn = fn
+
+    @property
+    def evaluator(self) -> Callable:
+        return self._evaluator
+
+    @evaluator.setter
+    def evaluator(self, fn):
+        assert isinstance(fn, Callable), f'{type(fn)} must be Callable'
+        self._evaluator = fn
+
+    @property
+    def writer(self) -> SummaryWriter:
+        if self._writer is None: 
+            if self.summary_dir:
+                self._writer = SummaryWriter(self.summary_dir)
+            elif self.output_dir:
+                self._writer = SummaryWriter(Path(self.output_dir) / 'summary')
+        return self._writer
+    
+    @writer.setter
+    def writer(self, m):
+        assert isinstance(m, SummaryWriter), f'{type(m)} must be SummaryWriter'
+        self._writer = m
+
+    def __repr__(self, ):
+        s = ''
+        for k, v in self.__dict__.items():
+            if not k.startswith('_'):
+                s +=  f'{k}: {v}\n'
+        return s 
+
--- a/rtdetrv2_pytorch/src/core/workspace.py
+++ b/rtdetrv2_pytorch/src/core/workspace.py
@@ -0,0 +1,179 @@
+"""Copyright(c) 2023 lyuwenyu. All Rights Reserved.
+"""
+
+import inspect
+import importlib
+import functools
+import inspect
+from collections import defaultdict
+from typing import Any, Dict, Optional, List
+
+
+GLOBAL_CONFIG = defaultdict(dict)
+
+
+def register(dct :Any=GLOBAL_CONFIG, name=None, force=False):
+    """
+        dct:
+            if dct is a dict, register foo into dct as a key-value pair
+            if dct is a class, register foo as an attribute of that class
+        force:
+            whether to skip the already-registered check and overwrite.
+    """
+    def decorator(foo):
+        register_name = foo.__name__ if name is None else name
+        if not force:
+            if inspect.isclass(dct):
+                assert not hasattr(dct, register_name), \
+                    f'module {dct.__name__} has {register_name}'
+            else:
+                assert register_name not in dct, \
+                    f'{register_name} has been already registered'
+
+        if inspect.isfunction(foo):
+            @functools.wraps(foo)
+            def wrap_func(*args, **kwargs):
+                return foo(*args, **kwargs)
+            if isinstance(dct, dict):
+                dct[register_name] = wrap_func
+            elif inspect.isclass(dct):
+                setattr(dct, register_name, wrap_func)
+            else:
+                raise AttributeError(f'cannot register into {type(dct)}; expected a dict or a class')
+            return wrap_func
+
+        elif inspect.isclass(foo):
+            dct[register_name] = extract_schema(foo) 
+
+        else:
+            raise ValueError(f'Do not support {type(foo)} register')
+
+        return foo
+
+    return decorator
+
+
+
+def extract_schema(module: type):
+    """
+    Args:
+        module (type): class whose `__init__` signature is inspected.
+    Return:
+        Dict: schema holding `_name`, `_pymodule`, `_inject`, `_share`, `_kwargs` and one entry per argument.
+    """
+    argspec = inspect.getfullargspec(module.__init__)
+    arg_names = [arg for arg in argspec.args if arg != 'self']
+    num_defaults = len(argspec.defaults) if argspec.defaults is not None else 0
+    num_requires = len(arg_names) - num_defaults
+
+    schema = dict()
+    schema['_name'] = module.__name__
+    schema['_pymodule'] = importlib.import_module(module.__module__)
+    schema['_inject'] = getattr(module, '__inject__', [])
+    schema['_share'] = getattr(module, '__share__', [])
+    schema['_kwargs'] = {}
+    for i, name in enumerate(arg_names):
+        if name in schema['_share']:
+            assert i >= num_requires, 'share config must have default value.'
+            value = argspec.defaults[i - num_requires]
+        
+        elif i >= num_requires:
+            value = argspec.defaults[i - num_requires]
+
+        else:
+            value = None 
+
+        schema[name] = value
+        schema['_kwargs'][name] = value 
+        
+    return schema
+
+
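+def create(type_or_name, global_cfg=GLOBAL_CONFIG, **kwargs):
+    """Instantiate a registered module given its class or registered name,
+    resolving `__share__` values and `__inject__` dependencies from `global_cfg`."""
+    assert isinstance(type_or_name, (type, str)), 'type_or_name should be a class or a registered name.'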
+
+    name = type_or_name if isinstance(type_or_name, str) else type_or_name.__name__
+
+    if name in global_cfg:
+        if hasattr(global_cfg[name], '__dict__'):
+            return global_cfg[name]
+    else:
+        raise ValueError('The module {} is not registered'.format(name))
+
+    cfg = global_cfg[name]
+
+    if isinstance(cfg, dict) and 'type' in cfg:
+        _cfg: dict = global_cfg[cfg['type']]
+        # clean args
+        _keys = [k for k in _cfg.keys() if not k.startswith('_')]
+        for _arg in _keys:
+            del _cfg[_arg]
+        _cfg.update(_cfg['_kwargs']) # restore default args
+        _cfg.update(cfg) # load config args 
+        _cfg.update(kwargs) # TODO receive extra kwargs
+        name = _cfg.pop('type') # pop extra key `type` (from cfg)
+        
+        return create(name, global_cfg)
+    
+    module = getattr(cfg['_pymodule'], name)    
+    module_kwargs = {}
+    module_kwargs.update(cfg)
+    
+    # shared var
+    for k in cfg['_share']:
+        if k in global_cfg:
+            module_kwargs[k] = global_cfg[k]
+        else:
+            module_kwargs[k] = cfg[k]
+
+    # inject
+    for k in cfg['_inject']:
+        _k = cfg[k]
+
+        if _k is None:
+            continue
+
+        if isinstance(_k, str):            
+            if _k not in global_cfg:
+                raise ValueError(f'Missing inject config of {_k}.')
+
+            _cfg = global_cfg[_k]
+            
+            if isinstance(_cfg, dict):
+                module_kwargs[k] = create(_cfg['_name'], global_cfg)
+            else:
+                module_kwargs[k] = _cfg 
+
+        elif isinstance(_k, dict):
+            if 'type' not in _k.keys():
+                raise ValueError('Missing `type` key in dict-style inject config.')
+
+            _type = str(_k['type'])
+            if _type not in global_cfg:
+                raise ValueError(f'Missing {_type} in inspect stage.')
+
+            # TODO 
+            _cfg: dict = global_cfg[_type]
+            # clean args
+            _keys = [k for k in _cfg.keys() if not k.startswith('_')]
+            for _arg in _keys:
+                del _cfg[_arg]
+            _cfg.update(_cfg['_kwargs']) # restore default values
+            _cfg.update(_k) # load config args
+            name = _cfg.pop('type') # pop extra key (`type` from _k)
+            module_kwargs[k] = create(name, global_cfg)
+
+        else:
+            raise ValueError(f'Inject does not support {_k}')
+    
+    # TODO hard code
+    module_kwargs = {k: v for k, v in module_kwargs.items() if not k.startswith('_')}
+
+    # TODO for **kwargs
+    # extra_args = set(module_kwargs.keys()) - set(arg_names)
+    # if len(extra_args) > 0:
+    #     raise RuntimeError(f'Error: unknown args {extra_args} for {module}')
+
+    return module(**module_kwargs)
--- a/rtdetrv2_pytorch/src/core/yaml_config.py
+++ b/rtdetrv2_pytorch/src/core/yaml_config.py
@@ -0,0 +1,172 @@
+"""Copyright(c) 2023 lyuwenyu. All Rights Reserved.
+"""
+
+import torch 
+import torch.nn as nn
+import torch.optim as optim
+from torch.utils.data import DataLoader
+
+import re 
+import copy
+
+from ._config import BaseConfig
+from .workspace import create
+from .yaml_utils import load_config, merge_config, merge_dict
+
+class YAMLConfig(BaseConfig):
+    def __init__(self, cfg_path: str, **kwargs) -> None:
+        super().__init__()
+
+        cfg = load_config(cfg_path)
+        cfg = merge_dict(cfg, kwargs)
+
+        self.yaml_cfg = copy.deepcopy(cfg) 
+        
+        for k in super().__dict__:
+            if not k.startswith('_') and k in cfg:
+                self.__dict__[k] = cfg[k]
+
+    @property
+    def global_cfg(self, ):
+        return merge_config(self.yaml_cfg, inplace=False, overwrite=False)
+    
+    @property
+    def model(self, ) -> torch.nn.Module:
+        if self._model is None and 'model' in self.yaml_cfg:
+            self._model = create(self.yaml_cfg['model'], self.global_cfg)
+        return super().model 
+
+    @property
+    def postprocessor(self, ) -> torch.nn.Module:
+        if self._postprocessor is None and 'postprocessor' in self.yaml_cfg:
+            self._postprocessor = create(self.yaml_cfg['postprocessor'], self.global_cfg)
+        return super().postprocessor
+
+    @property
+    def criterion(self, ) -> torch.nn.Module:
+        if self._criterion is None and 'criterion' in self.yaml_cfg:
+            self._criterion = create(self.yaml_cfg['criterion'], self.global_cfg)
+        return super().criterion
+    
+    @property
+    def optimizer(self, ) -> optim.Optimizer:
+        if self._optimizer is None and 'optimizer' in self.yaml_cfg:
+            params = self.get_optim_params(self.yaml_cfg['optimizer'], self.model)
+            self._optimizer = create('optimizer', self.global_cfg, params=params)
+        return super().optimizer
+    
+    @property
+    def lr_scheduler(self, ) -> optim.lr_scheduler.LRScheduler:
+        if self._lr_scheduler is None and 'lr_scheduler' in self.yaml_cfg:
+            self._lr_scheduler = create('lr_scheduler', self.global_cfg, optimizer=self.optimizer)
+            print(f'Initial lr: {self._lr_scheduler.get_last_lr()}')
+        return super().lr_scheduler
+    
+    @property
+    def lr_warmup_scheduler(self, ) -> optim.lr_scheduler.LRScheduler:
+        if self._lr_warmup_scheduler is None and 'lr_warmup_scheduler' in self.yaml_cfg :
+            self._lr_warmup_scheduler = create('lr_warmup_scheduler', self.global_cfg, lr_scheduler=self.lr_scheduler)
+        return super().lr_warmup_scheduler
+
+    @property
+    def train_dataloader(self, ) -> DataLoader:
+        if self._train_dataloader is None and 'train_dataloader' in self.yaml_cfg:
+            self._train_dataloader = self.build_dataloader('train_dataloader')
+        return super().train_dataloader
+
+    @property
+    def val_dataloader(self, ) -> DataLoader:
+        if self._val_dataloader is None and 'val_dataloader' in self.yaml_cfg:
+            self._val_dataloader = self.build_dataloader('val_dataloader')
+        return super().val_dataloader
+    
+    @property
+    def ema(self, ) -> torch.nn.Module:
+        if self._ema is None and self.yaml_cfg.get('use_ema', False):
+            self._ema = create('ema', self.global_cfg, model=self.model)
+        return super().ema
+    
+    @property
+    def scaler(self, ):
+        if self._scaler is None and self.yaml_cfg.get('use_amp', False):
+            self._scaler = create('scaler', self.global_cfg)
+        return super().scaler
+
+    @property
+    def evaluator(self, ):
+        if self._evaluator is None and 'evaluator' in self.yaml_cfg:
+            if self.yaml_cfg['evaluator']['type'] == 'CocoEvaluator':
+                from ..data import get_coco_api_from_dataset
+                base_ds = get_coco_api_from_dataset(self.val_dataloader.dataset)                
+                self._evaluator = create('evaluator', self.global_cfg, coco_gt=base_ds)
+            else:
+                raise NotImplementedError(f"{self.yaml_cfg['evaluator']['type']}")
+        return super().evaluator
+
+    @staticmethod
+    def get_optim_params(cfg: dict, model: nn.Module):
+        """
+        E.g.:
+            ^(?=.*a)(?=.*b).*$  means including a and b
+            ^(?=.*(?:a|b)).*$   means including a or b
+            ^(?=.*a)(?!.*b).*$  means including a, but not b
+        """
+        assert 'type' in cfg, 'optimizer config must define `type`'
+        cfg = copy.deepcopy(cfg)
+
+        if 'params' not in cfg:
+            return model.parameters() 
+
+        assert isinstance(cfg['params'], list), 'optimizer `params` must be a list of param-group dicts'
+
+        param_groups = []
+        visited = []
+        for pg in cfg['params']:
+            pattern = pg['params']
+            params = {k: v for k, v in model.named_parameters() if v.requires_grad and len(re.findall(pattern, k)) > 0}
+            pg['params'] = params.values()
+            param_groups.append(pg)
+            visited.extend(list(params.keys()))
+            # print(params.keys())
+
+        names = [k for k, v in model.named_parameters() if v.requires_grad]
+
+        if len(visited) < len(names):
+            unseen = set(names) - set(visited)
+            params = {k: v for k, v in model.named_parameters() if v.requires_grad and k in unseen}
+            param_groups.append({'params': params.values()})
+            visited.extend(list(params.keys()))
+            # print(params.keys())
+
+        assert len(visited) == len(names), 'a trainable parameter matched more than one param group'
+
+        return param_groups
+
+    @staticmethod
+    def get_rank_batch_size(cfg):
+        """Compute the per-rank batch size when total_batch_size is provided.
+        """
+        assert ('total_batch_size' in cfg or 'batch_size' in cfg) \
+            and not ('total_batch_size' in cfg and 'batch_size' in cfg), \
+                'exactly one of `batch_size` or `total_batch_size` must be set'
+
+        total_batch_size = cfg.get('total_batch_size', None)
+        if total_batch_size is None:
+            bs = cfg.get('batch_size')
+        else:
+            from ..misc import dist_utils
+            assert total_batch_size % dist_utils.get_world_size() == 0, \
+                'total_batch_size should be divisible by world size'
+            bs = total_batch_size // dist_utils.get_world_size()
+        return bs 
+
+    def build_dataloader(self, name: str):
+        bs = self.get_rank_batch_size(self.yaml_cfg[name])
+        global_cfg = self.global_cfg
+        if 'total_batch_size' in global_cfg[name]:
+            # pop unexpected key for dataloader init
+            _ = global_cfg[name].pop('total_batch_size')
+        print(f'building {name} with batch_size={bs}...')
+        loader = create(name, global_cfg, batch_size=bs)
+        loader.shuffle = self.yaml_cfg[name].get('shuffle', False)      
+        return loader
--- a/rtdetrv2_pytorch/src/core/yaml_utils.py
+++ b/rtdetrv2_pytorch/src/core/yaml_utils.py
@@ -0,0 +1,124 @@
+"""Copyright(c) 2023 lyuwenyu. All Rights Reserved.
+"""
+
+import os
+import copy
+import yaml 
+from typing import Any, Dict, Optional, List
+
+from .workspace import GLOBAL_CONFIG
+
+__all__ = [
+    'load_config', 
+    'merge_config', 
+    'merge_dict', 
+    'parse_cli',
+]
+
+
+INCLUDE_KEY = '__include__'
+
+
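+def load_config(file_path, cfg=None):
+    """Load a YAML config file, merging any `__include__` base files first."""
+    cfg = {} if cfg is None else cfg  # avoid sharing one mutable default dict across calls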
+    _, ext = os.path.splitext(file_path)
+    assert ext in ['.yml', '.yaml'], "only support yaml files"
+
+    with open(file_path) as f:
+        file_cfg = yaml.load(f, Loader=yaml.Loader)
+        if file_cfg is None:
+            return {}
+
+    if INCLUDE_KEY in file_cfg:
+        base_yamls = list(file_cfg[INCLUDE_KEY])
+        for base_yaml in base_yamls:
+            if base_yaml.startswith('~'):
+                base_yaml = os.path.expanduser(base_yaml)
+
+            if not base_yaml.startswith('/'):
+                base_yaml = os.path.join(os.path.dirname(file_path), base_yaml)
+
+            with open(base_yaml) as f:
+                base_cfg = load_config(base_yaml, cfg)
+                merge_dict(cfg, base_cfg)
+
+    return merge_dict(cfg, file_cfg)
+
+
+def merge_dict(dct, another_dct, inplace=True) -> Dict:
+    """merge another_dct into dct
+    """
+    def _merge(dct, another) -> Dict:
+        for k in another:
+            if (k in dct and isinstance(dct[k], dict) and isinstance(another[k], dict)):
+                _merge(dct[k], another[k])
+            else:
+                dct[k] = another[k]
+
+        return dct
+    
+    if not inplace:
+        dct = copy.deepcopy(dct)
+    
+    return _merge(dct, another_dct)
+
+
+def dictify(s: str, v: Any) -> Dict:
+    if '.' not in s:
+        return {s: v}
+    key, rest = s.split('.', 1)
+    return {key: dictify(rest, v)}
+
+
+def parse_cli(nargs: List[str]) -> Dict:
+    """
+    parse command-line arguments
+        convert `a.c=3 b=10` to `{'a': {'c': 3}, 'b': 10}`
+    """
+    cfg = {}
+    if nargs is None or len(nargs) == 0:
+        return cfg
+
+    for s in nargs:
+        s = s.strip()
+        k, v = s.split('=', 1)
+        d = dictify(k, yaml.load(v, Loader=yaml.Loader))
+        cfg = merge_dict(cfg, d)
+
+    return cfg
+
+
+
+def merge_config(cfg, another_cfg=GLOBAL_CONFIG, inplace: bool=False, overwrite: bool=False):
+    """
+    Merge another_cfg into cfg, return the merged config
+
+    Example:
+
+        cfg1 = load_config('./rtdetrv2_r18vd_6x_coco.yml')
+        cfg1 = merge_config(cfg1, inplace=True)
+
+        cfg2 = load_config('./rtdetr_r50vd_6x_coco.yml')
+        cfg2 = merge_config(cfg2, inplace=True)
+
+        model1 = create(cfg1['model'], cfg1)
+        model2 = create(cfg2['model'], cfg2)
+    """
+    def _merge(dct, another):
+        for k in another:
+            if k not in dct:
+                dct[k] = another[k]
+            
+            elif isinstance(dct[k], dict) and isinstance(another[k], dict):
+                _merge(dct[k], another[k])   
+            
+            elif overwrite:
+                dct[k] = another[k]
+
+        return dct
+    
+    if not inplace:
+        cfg = copy.deepcopy(cfg)
+
+    return _merge(cfg, another_cfg)
--- a/rtdetrv2_pytorch/src/data/__init__.py
+++ b/rtdetrv2_pytorch/src/data/__init__.py
@@ -0,0 +1,21 @@
+"""Copyright(c) 2023 lyuwenyu. All Rights Reserved.
+"""
+
+from .dataset import *
+from .transforms import *
+from .dataloader import *
+
+from ._misc import convert_to_tv_tensor
+
+
+
+
+# def set_epoch(self, epoch) -> None:
+#     self.epoch = epoch 
+# def _set_epoch_func(datasets):
+#     """Add `set_epoch` for datasets
+#     """
+#     from ..core import register
+#     for ds in datasets:
+#         register(ds)(set_epoch)
+# _set_epoch_func([CIFAR10, VOCDetection, CocoDetection])
--- a/rtdetrv2_pytorch/src/data/_misc.py
+++ b/rtdetrv2_pytorch/src/data/_misc.py
@@ -0,0 +1,55 @@
+"""Copyright(c) 2023 lyuwenyu. All Rights Reserved.
+"""
+
+import importlib.metadata
+from torch import Tensor 
+
+_tv_version = importlib.metadata.version('torchvision').split('+')[0]  # drop a local tag such as '+cu118'
+_tv_major_minor = tuple(int(p) for p in _tv_version.split('.')[:2])  # compare numerically, not as strings
+
+if _tv_version == '0.15.2':
+    import torchvision
+    torchvision.disable_beta_transforms_warning()
+
+    from torchvision.datapoints import BoundingBox as BoundingBoxes
+    from torchvision.datapoints import BoundingBoxFormat, Mask, Image, Video
+    from torchvision.transforms.v2 import SanitizeBoundingBox as SanitizeBoundingBoxes
+    _boxes_keys = ['format', 'spatial_size']
+
+elif _tv_major_minor == (0, 16):
+    import torchvision
+    torchvision.disable_beta_transforms_warning()
+
+    from torchvision.transforms.v2 import SanitizeBoundingBoxes
+    from torchvision.tv_tensors import (
+        BoundingBoxes, BoundingBoxFormat, Mask, Image, Video)
+    _boxes_keys = ['format', 'canvas_size']
+
+elif _tv_major_minor >= (0, 17):
+    import torchvision
+    from torchvision.transforms.v2 import SanitizeBoundingBoxes
+    from torchvision.tv_tensors import (
+        BoundingBoxes, BoundingBoxFormat, Mask, Image, Video)
+    _boxes_keys = ['format', 'canvas_size']
+
+else:
+    raise RuntimeError('Please make sure torchvision version >= 0.15.2')
+
+
+
+def convert_to_tv_tensor(tensor: Tensor, key: str, box_format='xyxy', spatial_size=None) -> Tensor:
+    """
+    Args:
+        tensor (Tensor): input tensor
+        key (str): target type, either 'boxes' or 'masks'
+
+    Return:
+        BoundingBoxes (for 'boxes') or Mask (for 'masks') wrapping `tensor`
+    """
+    assert key in ('boxes', 'masks', ), "Only support 'boxes' and 'masks'"
+    
+    if key == 'boxes':
+        box_format = getattr(BoundingBoxFormat, box_format.upper())
+        _kwargs = dict(zip(_boxes_keys, [box_format, spatial_size]))
+        return BoundingBoxes(tensor, **_kwargs)
+
+    if key == 'masks':
+        return Mask(tensor)
+
--- a/rtdetrv2_pytorch/src/data/dataloader.py
+++ b/rtdetrv2_pytorch/src/data/dataloader.py
@@ -0,0 +1,107 @@
+"""Copyright(c) 2023 lyuwenyu. All Rights Reserved.
+"""
+
+import torch 
+import torch.utils.data as data
+import torch.nn.functional as F
+from torch.utils.data import default_collate
+
+import torchvision
+torchvision.disable_beta_transforms_warning()
+import torchvision.transforms.v2 as VT
+from torchvision.transforms.v2 import functional as VF, InterpolationMode
+
+import random
+from functools import partial
+
+from ..core import register
+
+
+__all__ = [
+    'DataLoader',
+    'BaseCollateFunction', 
+    'BatchImageCollateFunction',
+    'batch_image_collate_fn'
+]
+
+
+@register()
+class DataLoader(data.DataLoader):
+    __inject__ = ['dataset', 'collate_fn']
+
+    def __repr__(self) -> str:
+        format_string = self.__class__.__name__ + "("
+        for n in ['dataset', 'batch_size', 'num_workers', 'drop_last', 'collate_fn']:
+            format_string += "\n"
+            format_string += "    {0}: {1}".format(n, getattr(self, n))
+        format_string += "\n)"
+        return format_string
+
+    def set_epoch(self, epoch):
+        self._epoch = epoch 
+        self.dataset.set_epoch(epoch)
+        self.collate_fn.set_epoch(epoch)
+    
+    @property
+    def epoch(self):
+        return self._epoch if hasattr(self, '_epoch') else -1
+
+    @property
+    def shuffle(self):
+        return self._shuffle
+
+    @shuffle.setter
+    def shuffle(self, shuffle):
+        assert isinstance(shuffle, bool), 'shuffle must be a boolean'
+        self._shuffle = shuffle
+
+
+@register()
+def batch_image_collate_fn(items):
+    """Stack the images into one batch tensor; targets are returned as a list.
+    """
+    return torch.cat([x[0][None] for x in items], dim=0), [x[1] for x in items]
+
+
+class BaseCollateFunction(object):
+    def set_epoch(self, epoch):
+        self._epoch = epoch 
+
+    @property
+    def epoch(self):
+        return self._epoch if hasattr(self, '_epoch') else -1
+
+    def __call__(self, items):
+        raise NotImplementedError('')
+
+
+@register()
+class BatchImageCollateFunction(BaseCollateFunction):
+    def __init__(
+        self, 
+        scales=None, 
+        stop_epoch=None, 
+    ) -> None:
+        super().__init__()
+        self.scales = scales
+        self.stop_epoch = stop_epoch if stop_epoch is not None else 100000000
+        # self.interpolation = interpolation
+
+    def __call__(self, items):
+        images = torch.cat([x[0][None] for x in items], dim=0)
+        targets = [x[1] for x in items]
+
+        if self.scales is not None and self.epoch < self.stop_epoch:
+            # sz = random.choice(self.scales)
+            # sz = [sz] if isinstance(sz, int) else list(sz)
+            # VF.resize(inpt, sz, interpolation=self.interpolation)
+
+            sz = random.choice(self.scales)
+            images = F.interpolate(images, size=sz)
+            if 'masks' in targets[0]:
+                for tg in targets:
+                    tg['masks'] = F.interpolate(tg['masks'], size=sz, mode='nearest')
+                raise NotImplementedError('')
+
+        return images, targets
+
--- a/rtdetrv2_pytorch/src/data/dataset/__init__.py
+++ b/rtdetrv2_pytorch/src/data/dataset/__init__.py
@@ -0,0 +1,16 @@
+"""Copyright(c) 2023 lyuwenyu. All Rights Reserved.
+"""
+
+# from ._dataset import DetDataset
+from .cifar_dataset import CIFAR10
+from .coco_dataset import CocoDetection
+from .coco_dataset import (
+    CocoDetection, 
+    mscoco_category2name, 
+    mscoco_category2label,
+    mscoco_label2category,
+)
+from .coco_eval import CocoEvaluator
+from .coco_utils import get_coco_api_from_dataset
+from .voc_detection import VOCDetection
+from .voc_eval import VOCEvaluator
--- a/rtdetrv2_pytorch/src/data/dataset/_dataset.py
+++ b/rtdetrv2_pytorch/src/data/dataset/_dataset.py
@@ -0,0 +1,22 @@
+"""Copyright(c) 2023 lyuwenyu. All Rights Reserved.
+"""
+
+import torch 
+import torch.utils.data as data
+
+class DetDataset(data.Dataset):
+    def __getitem__(self, index):
+        img, target = self.load_item(index)
+        if self.transforms is not None:
+            img, target, _ = self.transforms(img, target, self)
+        return img, target
+
+    def load_item(self, index):
+        raise NotImplementedError("Please implement this function to return item before `transforms`.")
+
+    def set_epoch(self, epoch) -> None:
+        self._epoch = epoch 
+
+    @property
+    def epoch(self):
+        return self._epoch if hasattr(self, '_epoch') else -1
--- a/rtdetrv2_pytorch/src/data/dataset/cifar_dataset.py
+++ b/rtdetrv2_pytorch/src/data/dataset/cifar_dataset.py
@@ -0,0 +1,16 @@
+"""Copyright(c) 2023 lyuwenyu. All Rights Reserved.
+"""
+
+
+import torchvision
+from typing import Optional, Callable
+
+from ...core import register
+
+@register()
+class CIFAR10(torchvision.datasets.CIFAR10):
+    __inject__ = ['transform', 'target_transform']
+    
+    def __init__(self, root: str, train: bool = True, transform: Optional[Callable] = None, target_transform: Optional[Callable] = None, download: bool = False) -> None:
+        super().__init__(root, train, transform, target_transform, download)
+
--- a/rtdetrv2_pytorch/src/data/dataset/coco_dataset.py
+++ b/rtdetrv2_pytorch/src/data/dataset/coco_dataset.py
@@ -0,0 +1,261 @@
+"""
+Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved
+Mostly copy-paste from https://github.com/pytorch/vision/blob/13b35ff/references/detection/coco_utils.py
+
+Copyright(c) 2023 lyuwenyu. All Rights Reserved.
+"""
+
+import torch
+from faster_coco_eval.utils.pytorch import FasterCocoDetection
+import torchvision
+
+from PIL import Image 
+from faster_coco_eval.core import mask as coco_mask
+
+from ._dataset import DetDataset
+from .._misc import convert_to_tv_tensor
+from ...core import register
+
+__all__ = ['CocoDetection']
+
+torchvision.disable_beta_transforms_warning()
+
+@register()
+class CocoDetection(FasterCocoDetection, DetDataset):
+    __inject__ = ['transforms', ]
+    __share__ = ['remap_mscoco_category']
+    
+    def __init__(self, img_folder, ann_file, transforms, return_masks=False, remap_mscoco_category=False):
+        super(FasterCocoDetection, self).__init__(img_folder, ann_file)
+        self._transforms = transforms
+        self.prepare = ConvertCocoPolysToMask(return_masks)
+        self.img_folder = img_folder
+        self.ann_file = ann_file
+        self.return_masks = return_masks
+        self.remap_mscoco_category = remap_mscoco_category
+
+    def __getitem__(self, idx):
+        img, target = self.load_item(idx)
+        if self._transforms is not None:
+            img, target, _ = self._transforms(img, target, self)
+        return img, target
+
+    def load_item(self, idx):
+        image, target = super(FasterCocoDetection, self).__getitem__(idx)
+        image_id = self.ids[idx]
+        target = {'image_id': image_id, 'annotations': target}
+
+        if self.remap_mscoco_category:
+            image, target = self.prepare(image, target, category2label=mscoco_category2label)
+            # image, target = self.prepare(image, target, category2label=self.category2label)
+        else:
+            image, target = self.prepare(image, target)
+
+        target['idx'] = torch.tensor([idx])
+
+        if 'boxes' in target:
+            target['boxes'] = convert_to_tv_tensor(target['boxes'], key='boxes', spatial_size=image.size[::-1])
+
+        if 'masks' in target:
+            target['masks'] = convert_to_tv_tensor(target['masks'], key='masks')
+        
+        return image, target
+
+    def extra_repr(self) -> str:
+        s = f' img_folder: {self.img_folder}\n ann_file: {self.ann_file}\n'
+        s += f' return_masks: {self.return_masks}\n'
+        if hasattr(self, '_transforms') and self._transforms is not None:
+            s += f' transforms:\n   {repr(self._transforms)}'
+        if hasattr(self, '_preset') and self._preset is not None:
+            s += f' preset:\n   {repr(self._preset)}'
+        return s 
+
+    @property
+    def categories(self, ):
+        return self.coco.dataset['categories']
+
+    @property
+    def category2name(self, ):
+        return {cat['id']: cat['name'] for cat in self.categories}
+
+    @property
+    def category2label(self, ):
+        return {cat['id']: i for i, cat in enumerate(self.categories)}
+
+    @property
+    def label2category(self, ):
+        return {i: cat['id'] for i, cat in enumerate(self.categories)}
+
+
+def convert_coco_poly_to_mask(segmentations, height, width):
+    masks = []
+    for polygons in segmentations:
+        rles = coco_mask.frPyObjects(polygons, height, width)
+        mask = coco_mask.decode(rles)
+        if len(mask.shape) < 3:
+            mask = mask[..., None]
+        mask = torch.as_tensor(mask, dtype=torch.uint8)
+        mask = mask.any(dim=2)
+        masks.append(mask)
+    if masks:
+        masks = torch.stack(masks, dim=0)
+    else:
+        masks = torch.zeros((0, height, width), dtype=torch.uint8)
+    return masks
+
+
+class ConvertCocoPolysToMask(object):
+    def __init__(self, return_masks=False):
+        self.return_masks = return_masks
+
+    def __call__(self, image: Image.Image, target, **kwargs):
+        w, h = image.size
+
+        image_id = target["image_id"]
+        image_id = torch.tensor([image_id])
+
+        anno = target["annotations"]
+
+        anno = [obj for obj in anno if 'iscrowd' not in obj or obj['iscrowd'] == 0]
+
+        boxes = [obj["bbox"] for obj in anno]
+        # guard against no boxes via resizing
+        boxes = torch.as_tensor(boxes, dtype=torch.float32).reshape(-1, 4)
+        boxes[:, 2:] += boxes[:, :2]
+        boxes[:, 0::2].clamp_(min=0, max=w)
+        boxes[:, 1::2].clamp_(min=0, max=h)
+
+        category2label = kwargs.get('category2label', None)
+        if category2label is not None:
+            labels = [category2label[obj["category_id"]] for obj in anno]
+        else:
+            labels = [obj["category_id"] for obj in anno]
+            
+        labels = torch.tensor(labels, dtype=torch.int64)
+
+        if self.return_masks:
+            segmentations = [obj["segmentation"] for obj in anno]
+            masks = convert_coco_poly_to_mask(segmentations, h, w)
+
+        keypoints = None
+        if anno and "keypoints" in anno[0]:
+            keypoints = [obj["keypoints"] for obj in anno]
+            keypoints = torch.as_tensor(keypoints, dtype=torch.float32)
+            num_keypoints = keypoints.shape[0]
+            if num_keypoints:
+                keypoints = keypoints.view(num_keypoints, -1, 3)
+
+        keep = (boxes[:, 3] > boxes[:, 1]) & (boxes[:, 2] > boxes[:, 0])
+        boxes = boxes[keep]
+        labels = labels[keep]
+        if self.return_masks:
+            masks = masks[keep]
+        if keypoints is not None:
+            keypoints = keypoints[keep]
+
+        target = {}
+        target["boxes"] = boxes
+        target["labels"] = labels
+        if self.return_masks:
+            target["masks"] = masks
+        target["image_id"] = image_id
+        if keypoints is not None:
+            target["keypoints"] = keypoints
+
+        # for conversion to coco api
+        area = torch.tensor([obj["area"] for obj in anno])
+        iscrowd = torch.tensor([obj["iscrowd"] if "iscrowd" in obj else 0 for obj in anno])
+        target["area"] = area[keep]
+        target["iscrowd"] = iscrowd[keep]
+
+        target["orig_size"] = torch.as_tensor([int(w), int(h)])
+        # target["size"] = torch.as_tensor([int(w), int(h)])
+    
+        return image, target
+
+
+mscoco_category2name = {
+    1: 'person',
+    2: 'bicycle',
+    3: 'car',
+    4: 'motorcycle',
+    5: 'airplane',
+    6: 'bus',
+    7: 'train',
+    8: 'truck',
+    9: 'boat',
+    10: 'traffic light',
+    11: 'fire hydrant',
+    13: 'stop sign',
+    14: 'parking meter',
+    15: 'bench',
+    16: 'bird',
+    17: 'cat',
+    18: 'dog',
+    19: 'horse',
+    20: 'sheep',
+    21: 'cow',
+    22: 'elephant',
+    23: 'bear',
+    24: 'zebra',
+    25: 'giraffe',
+    27: 'backpack',
+    28: 'umbrella',
+    31: 'handbag',
+    32: 'tie',
+    33: 'suitcase',
+    34: 'frisbee',
+    35: 'skis',
+    36: 'snowboard',
+    37: 'sports ball',
+    38: 'kite',
+    39: 'baseball bat',
+    40: 'baseball glove',
+    41: 'skateboard',
+    42: 'surfboard',
+    43: 'tennis racket',
+    44: 'bottle',
+    46: 'wine glass',
+    47: 'cup',
+    48: 'fork',
+    49: 'knife',
+    50: 'spoon',
+    51: 'bowl',
+    52: 'banana',
+    53: 'apple',
+    54: 'sandwich',
+    55: 'orange',
+    56: 'broccoli',
+    57: 'carrot',
+    58: 'hot dog',
+    59: 'pizza',
+    60: 'donut',
+    61: 'cake',
+    62: 'chair',
+    63: 'couch',
+    64: 'potted plant',
+    65: 'bed',
+    67: 'dining table',
+    70: 'toilet',
+    72: 'tv',
+    73: 'laptop',
+    74: 'mouse',
+    75: 'remote',
+    76: 'keyboard',
+    77: 'cell phone',
+    78: 'microwave',
+    79: 'oven',
+    80: 'toaster',
+    81: 'sink',
+    82: 'refrigerator',
+    84: 'book',
+    85: 'clock',
+    86: 'vase',
+    87: 'scissors',
+    88: 'teddy bear',
+    89: 'hair drier',
+    90: 'toothbrush'
+}
+
+mscoco_category2label = {k: i for i, k in enumerate(mscoco_category2name.keys())}
+mscoco_label2category = {v: k for k, v in mscoco_category2label.items()}
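Note on the category remap above: COCO category ids run from 1 to 90 with gaps, so `mscoco_category2label` compresses them into contiguous labels 0..79 by enumerating the dictionary keys in insertion order. A minimal sketch of the same construction on a four-entry subset (the subset and the variable names `category2name`, `category2label`, `label2category` are illustrative only):

```python
# Illustrative subset of COCO category ids; note the gap at 12.
category2name = {
    10: 'traffic light',
    11: 'fire hydrant',
    13: 'stop sign',
    14: 'parking meter',
}

# Same construction as mscoco_category2label: enumerate the keys in
# insertion order so sparse ids become contiguous training labels.
category2label = {k: i for i, k in enumerate(category2name.keys())}

# Inverse map, used to turn predicted labels back into COCO category ids.
label2category = {v: k for k, v in category2label.items()}

print(category2label)
print(label2category)
```

Because the mapping depends on key order, the inverse map is needed at evaluation time to report predictions under the original category ids.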
--- a/rtdetrv2_pytorch/src/data/dataset/coco_eval.py
+++ b/rtdetrv2_pytorch/src/data/dataset/coco_eval.py
@@ -0,0 +1,16 @@
+"""
+# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved
+COCO evaluator that works in distributed mode.
+Mostly copy-paste from https://github.com/pytorch/vision/blob/edfd5a7/references/detection/coco_eval.py
+The difference is that there is less copy-pasting from pycocotools
+at the end of the file, as Python 3 can suppress prints with contextlib
+
+# MiXaiLL76 replacing pycocotools with faster-coco-eval for better performance and support.
+"""
+
+from ...core import register
+from faster_coco_eval.utils.pytorch import FasterCocoEvaluator
+
+@register()
+class CocoEvaluator(FasterCocoEvaluator):
+    pass
--- a/rtdetrv2_pytorch/src/data/dataset/coco_utils.py
+++ b/rtdetrv2_pytorch/src/data/dataset/coco_utils.py
@@ -0,0 +1,194 @@
+"""
+Copied and modified from https://github.com/pytorch/vision/blob/main/references/detection/coco_utils.py
+
+Copyright(c) 2023 lyuwenyu. All Rights Reserved.
+"""
+
+
+import torch
+import torch.utils.data
+import torchvision
+import torchvision.transforms.functional as TVF
+from faster_coco_eval import COCO
+import faster_coco_eval.core.mask as mask_util
+
+def convert_coco_poly_to_mask(segmentations, height, width):
+    masks = []
+    for polygons in segmentations:
+        rles = mask_util.frPyObjects(polygons, height, width)
+        mask = mask_util.decode(rles)
+        if len(mask.shape) < 3:
+            mask = mask[..., None]
+        mask = torch.as_tensor(mask, dtype=torch.uint8)
+        mask = mask.any(dim=2)
+        masks.append(mask)
+    if masks:
+        masks = torch.stack(masks, dim=0)
+    else:
+        masks = torch.zeros((0, height, width), dtype=torch.uint8)
+    return masks
+
+
+class ConvertCocoPolysToMask:
+    def __call__(self, image, target):
+        w, h = image.size
+
+        image_id = target["image_id"]
+
+        anno = target["annotations"]
+
+        anno = [obj for obj in anno if obj["iscrowd"] == 0]
+
+        boxes = [obj["bbox"] for obj in anno]
+        # guard against no boxes via resizing
+        boxes = torch.as_tensor(boxes, dtype=torch.float32).reshape(-1, 4)
+        boxes[:, 2:] += boxes[:, :2]
+        boxes[:, 0::2].clamp_(min=0, max=w)
+        boxes[:, 1::2].clamp_(min=0, max=h)
+
+        classes = [obj["category_id"] for obj in anno]
+        classes = torch.tensor(classes, dtype=torch.int64)
+
+        segmentations = [obj["segmentation"] for obj in anno]
+        masks = convert_coco_poly_to_mask(segmentations, h, w)
+
+        keypoints = None
+        if anno and "keypoints" in anno[0]:
+            keypoints = [obj["keypoints"] for obj in anno]
+            keypoints = torch.as_tensor(keypoints, dtype=torch.float32)
+            num_keypoints = keypoints.shape[0]
+            if num_keypoints:
+                keypoints = keypoints.view(num_keypoints, -1, 3)
+
+        keep = (boxes[:, 3] > boxes[:, 1]) & (boxes[:, 2] > boxes[:, 0])
+        boxes = boxes[keep]
+        classes = classes[keep]
+        masks = masks[keep]
+        if keypoints is not None:
+            keypoints = keypoints[keep]
+
+        target = {}
+        target["boxes"] = boxes
+        target["labels"] = classes
+        target["masks"] = masks
+        target["image_id"] = image_id
+        if keypoints is not None:
+            target["keypoints"] = keypoints
+
+        # for conversion to coco api
+        area = torch.tensor([obj["area"] for obj in anno])
+        iscrowd = torch.tensor([obj["iscrowd"] for obj in anno])
+        target["area"] = area
+        target["iscrowd"] = iscrowd
+
+        return image, target
+
+
+def _coco_remove_images_without_annotations(dataset, cat_list=None):
+    def _has_only_empty_bbox(anno):
+        return all(any(o <= 1 for o in obj["bbox"][2:]) for obj in anno)
+
+    def _count_visible_keypoints(anno):
+        return sum(sum(1 for v in ann["keypoints"][2::3] if v > 0) for ann in anno)
+
+    min_keypoints_per_image = 10
+
+    def _has_valid_annotation(anno):
+        # if it's empty, there is no annotation
+        if len(anno) == 0:
+            return False
+        # if all boxes have close to zero area, there is no annotation
+        if _has_only_empty_bbox(anno):
+            return False
+        # keypoint tasks use slightly different criteria for deciding
+        # whether an annotation is valid
+        if "keypoints" not in anno[0]:
+            return True
+        # for keypoint detection tasks, an image is valid only if it
+        # contains at least min_keypoints_per_image visible keypoints
+        if _count_visible_keypoints(anno) >= min_keypoints_per_image:
+            return True
+        return False
+
+    ids = []
+    for ds_idx, img_id in enumerate(dataset.ids):
+        ann_ids = dataset.coco.getAnnIds(imgIds=img_id, iscrowd=None)
+        anno = dataset.coco.loadAnns(ann_ids)
+        if cat_list:
+            anno = [obj for obj in anno if obj["category_id"] in cat_list]
+        if _has_valid_annotation(anno):
+            ids.append(ds_idx)
+
+    dataset = torch.utils.data.Subset(dataset, ids)
+    return dataset
+
+
+def convert_to_coco_api(ds):
+    coco_ds = COCO()
+    # annotation IDs need to start at 1, not 0, see torchvision issue #1530
+    ann_id = 1
+    dataset = {"images": [], "categories": [], "annotations": []}
+    categories = set()
+    for img_idx in range(len(ds)):
+        # find better way to get target
+        # targets = ds.get_annotations(img_idx)
+        # img, targets = ds[img_idx]
+
+        # TODO (by lyuwenyu), load image and targets before `transforms`
+        img, targets = ds.load_item(img_idx)
+        width, height = img.size
+        
+        image_id = targets["image_id"].item()
+        img_dict = {}
+        img_dict["id"] = image_id
+        img_dict["width"] = width
+        img_dict["height"] = height
+        dataset["images"].append(img_dict)
+        bboxes = targets["boxes"].clone()
+        bboxes[:, 2:] -= bboxes[:, :2] # xyxy -> xywh
+        bboxes = bboxes.tolist()
+        labels = targets["labels"].tolist()
+        areas = targets["area"].tolist()
+        iscrowd = targets["iscrowd"].tolist()
+        if "masks" in targets:
+            masks = targets["masks"]
+            # make masks Fortran contiguous for coco_mask
+            masks = masks.permute(0, 2, 1).contiguous().permute(0, 2, 1)
+        if "keypoints" in targets:
+            keypoints = targets["keypoints"]
+            keypoints = keypoints.reshape(keypoints.shape[0], -1).tolist()
+        num_objs = len(bboxes)
+        for i in range(num_objs):
+            ann = {}
+            ann["image_id"] = image_id
+            ann["bbox"] = bboxes[i]
+            ann["category_id"] = labels[i]
+            categories.add(labels[i])
+            ann["area"] = areas[i]
+            ann["iscrowd"] = iscrowd[i]
+            ann["id"] = ann_id
+            if "masks" in targets:
+                ann["segmentation"] = mask_util.encode(masks[i].numpy())
+            if "keypoints" in targets:
+                ann["keypoints"] = keypoints[i]
+                ann["num_keypoints"] = sum(k != 0 for k in keypoints[i][2::3])
+            dataset["annotations"].append(ann)
+            ann_id += 1
+    dataset["categories"] = [{"id": i} for i in sorted(categories)]
+    coco_ds.dataset = dataset
+    coco_ds.createIndex()
+    return coco_ds
+
+
+def get_coco_api_from_dataset(dataset):
+    # FIXME: This is... awful?
+    for _ in range(10):
+        if isinstance(dataset, torchvision.datasets.CocoDetection):
+            break
+        if isinstance(dataset, torch.utils.data.Subset):
+            dataset = dataset.dataset
+    if isinstance(dataset, torchvision.datasets.CocoDetection):
+        return dataset.coco
+    return convert_to_coco_api(dataset)
+
+
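Note on the box handling in `ConvertCocoPolysToMask` and `convert_to_coco_api`: COCO stores boxes as `[x, y, w, h]`, the dataset converts them to clamped `[x1, y1, x2, y2]` and drops degenerate boxes, and the export path converts back. A dependency-free sketch of that round trip, assuming plain Python lists instead of tensors (the helper names `xywh_to_clamped_xyxy`, `is_valid`, and `xyxy_to_xywh` are hypothetical and do not exist in the repository):

```python
def xywh_to_clamped_xyxy(box, img_w, img_h):
    """Convert a COCO [x, y, w, h] box to [x1, y1, x2, y2], clamped to the image."""
    x, y, w, h = box
    x1 = min(max(x, 0.0), img_w)
    y1 = min(max(y, 0.0), img_h)
    x2 = min(max(x + w, 0.0), img_w)
    y2 = min(max(y + h, 0.0), img_h)
    return [x1, y1, x2, y2]


def is_valid(box):
    """Mirror of the `keep` mask: the box must have positive width and height."""
    x1, y1, x2, y2 = box
    return x2 > x1 and y2 > y1


def xyxy_to_xywh(box):
    """Inverse conversion, as done before exporting to the COCO API."""
    x1, y1, x2, y2 = box
    return [x1, y1, x2 - x1, y2 - y1]


# Three boxes on a 640x480 image: inside, partly outside, fully outside.
boxes = [
    [10.0, 20.0, 30.0, 40.0],
    [630.0, 470.0, 50.0, 50.0],
    [700.0, 10.0, 20.0, 20.0],
]
converted = [xywh_to_clamped_xyxy(b, img_w=640, img_h=480) for b in boxes]
kept = [b for b in converted if is_valid(b)]
print(kept)
```

The box that lies fully outside the image collapses to zero width after clamping, which is why it is filtered out.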
--- a/rtdetrv2_pytorch/src/data/dataset/voc_detection.py
+++ b/rtdetrv2_pytorch/src/data/dataset/voc_detection.py
@@ -0,0 +1,74 @@
+"""Copyright(c) 2023 lyuwenyu. All Rights Reserved.
+"""
+
+import torch
+import torchvision
+import torchvision.transforms.functional as TVF 
+
+import os
+from PIL import Image
+from typing import Optional, Callable
+
+try:
+    from defusedxml.ElementTree import parse as ET_parse
+except ImportError:
+    from xml.etree.ElementTree import parse as ET_parse
+
+from ._dataset import DetDataset
+from .._misc import convert_to_tv_tensor
+from ...core import register
+
+@register()
+class VOCDetection(torchvision.datasets.VOCDetection, DetDataset):
+    __inject__ = ['transforms', ]
+
+    def __init__(self, root: str, ann_file: str = "trainval.txt", label_file: str = "label_list.txt", transforms: Optional[Callable] = None):
+
+        with open(os.path.join(root, ann_file), 'r') as f:
+            lines = [x.strip() for x in f.readlines()]
+            lines = [x.split(' ') for x in lines]
+
+        self.images = [os.path.join(root, lin[0]) for lin in lines]
+        self.targets = [os.path.join(root, lin[1]) for lin in lines]
+        assert len(self.images) == len(self.targets)
+
+        with open(os.path.join(root, label_file), 'r') as f:
+            labels = f.readlines()
+            labels = [lab.strip() for lab in labels]
+
+        self.transforms = transforms
+        self.labels_map = {lab: i for i, lab in enumerate(labels)}
+        
+    def __getitem__(self, index: int):
+        image, target = self.load_item(index)
+        if self.transforms is not None:
+            image, target, _ = self.transforms(image, target, self)        
+        # target["orig_size"] = torch.tensor(TVF.get_image_size(image))
+        return image, target
+
+    def load_item(self, index: int):
+        image = Image.open(self.images[index]).convert("RGB")
+        target = self.parse_voc_xml(ET_parse(self.annotations[index]).getroot())
+        
+        output = {}
+        output["image_id"] = torch.tensor([index])
+        for k in ['area', 'boxes', 'labels', 'iscrowd']:
+            output[k] = []
+            
+        for blob in target['annotation']['object']:
+            box = [float(v) for v in blob['bndbox'].values()]
+            output["boxes"].append(box)
+            output["labels"].append(blob['name'])
+            output["area"].append((box[2] - box[0]) * (box[3] - box[1]))
+            output["iscrowd"].append(0)
+
+        w, h = image.size
+        boxes = torch.tensor(output["boxes"]) if len(output["boxes"]) > 0 else torch.zeros(0, 4)
+        output['boxes'] = convert_to_tv_tensor(boxes, 'boxes', box_format='xyxy', spatial_size=[h, w])
+        output['labels'] = torch.tensor([self.labels_map[lab] for lab in output["labels"]])
+        output['area'] = torch.tensor(output['area'])
+        output["iscrowd"] = torch.tensor(output["iscrowd"])
+        output["orig_size"] = torch.tensor([w, h])
+        
+        return image, output
+    
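Note on `VOCDetection.load_item`: each `<object>` element in a VOC annotation contributes one box, one label, and one area. The sketch below shows the same extraction with only the standard library; it is an assumption-laden illustration, not the repository's code path (which goes through torchvision's `parse_voc_xml` and prefers `defusedxml`). The XML snippet, the `labels_map` contents, and the `parse_objects` helper are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation with two objects.
VOC_XML = """
<annotation>
  <size><width>500</width><height>375</height></size>
  <object>
    <name>dog</name>
    <bndbox><xmin>48</xmin><ymin>240</ymin><xmax>195</xmax><ymax>371</ymax></bndbox>
  </object>
  <object>
    <name>person</name>
    <bndbox><xmin>8</xmin><ymin>12</ymin><xmax>352</xmax><ymax>300</ymax></bndbox>
  </object>
</annotation>
"""

# Hypothetical label file contents, in the role of `self.labels_map`.
labels_map = {'person': 0, 'dog': 1}


def parse_objects(xml_text, labels_map):
    """Return (boxes, labels, areas) for every <object> in a VOC annotation."""
    root = ET.fromstring(xml_text.strip())
    boxes, labels, areas = [], [], []
    for obj in root.iter('object'):
        bnd = obj.find('bndbox')
        # Read the corners by name so the result does not depend on child order.
        box = [float(bnd.find(k).text) for k in ('xmin', 'ymin', 'xmax', 'ymax')]
        boxes.append(box)
        labels.append(labels_map[obj.find('name').text])
        areas.append((box[2] - box[0]) * (box[3] - box[1]))
    return boxes, labels, areas


boxes, labels, areas = parse_objects(VOC_XML, labels_map)
print(boxes, labels, areas)
```

Reading the corners by tag name is a deliberate difference from the diff, which relies on `bndbox` children appearing in `xmin, ymin, xmax, ymax` order.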
--- a/rtdetrv2_pytorch/src/data/dataset/voc_eval.py
+++ b/rtdetrv2_pytorch/src/data/dataset/voc_eval.py
@@ -0,0 +1,10 @@
+"""Copyright(c) 2023 lyuwenyu. All Rights Reserved.
+"""
+
+import torch
+import torchvision
+
+
+class VOCEvaluator(object):
+    def __init__(self) -> None:
+        pass
--- a/rtdetrv2_pytorch/src/data/transforms/__init__.py
+++ b/rtdetrv2_pytorch/src/data/transforms/__init__.py
@@ -0,0 +1,20 @@
+"""Copyright(c) 2023 lyuwenyu. All Rights Reserved.
+"""
+
+
+from ._transforms import (
+    EmptyTransform,
+    RandomPhotometricDistort,
+    RandomZoomOut,
+    RandomIoUCrop,
+    RandomHorizontalFlip,
+    Resize,
+    PadToSize,
+    SanitizeBoundingBoxes,
+    RandomCrop,
+    Normalize,
+    ConvertBoxes,
+    ConvertPILImage,
+)
+from .container import Compose
+from .mosaic import Mosaic
--- a/rtdetrv2_pytorch/src/data/transforms/_transforms.py
+++ b/rtdetrv2_pytorch/src/data/transforms/_transforms.py
@@ -0,0 +1,148 @@
+"""Copyright(c) 2023 lyuwenyu. All Rights Reserved.
+"""
+
+import torch 
+import torch.nn as nn 
+
+import torchvision
+torchvision.disable_beta_transforms_warning()
+
+import torchvision.transforms.v2 as T
+import torchvision.transforms.v2.functional as F
+
+import PIL
+import PIL.Image
+
+from typing import Any, Dict, List, Optional
+
+from .._misc import convert_to_tv_tensor, _boxes_keys
+from .._misc import Image, Video, Mask, BoundingBoxes
+from .._misc import SanitizeBoundingBoxes
+
+from ...core import register
+
+
+RandomPhotometricDistort = register()(T.RandomPhotometricDistort)
+RandomZoomOut = register()(T.RandomZoomOut)
+RandomHorizontalFlip = register()(T.RandomHorizontalFlip)
+Resize = register()(T.Resize)
+# ToImageTensor = register()(T.ToImageTensor)
+# ConvertDtype = register()(T.ConvertDtype)
+# PILToTensor = register()(T.PILToTensor)
+SanitizeBoundingBoxes = register(name='SanitizeBoundingBoxes')(SanitizeBoundingBoxes)
+RandomCrop = register()(T.RandomCrop)
+Normalize = register()(T.Normalize)
+
+
+@register()
+class EmptyTransform(T.Transform):
+    def __init__(self, ) -> None:
+        super().__init__()
+
+    def forward(self, *inputs):
+        inputs = inputs if len(inputs) > 1 else inputs[0]
+        return inputs
+
+
+@register()
+class PadToSize(T.Pad):
+    _transformed_types = (
+        PIL.Image.Image,
+        Image,
+        Video,
+        Mask,
+        BoundingBoxes,
+    )
+    def _get_params(self, flat_inputs: List[Any]) -> Dict[str, Any]:
+        sp = F.get_spatial_size(flat_inputs[0])
+        h, w = self.size[1] - sp[0], self.size[0] - sp[1]
+        self.padding = [0, 0, w, h]
+        return dict(padding=self.padding)
+
+    def make_params(self, flat_inputs: List[Any]) -> Dict[str, Any]:
+        return self._get_params(flat_inputs)
+
+    def __init__(self, size, fill=0, padding_mode='constant') -> None:
+        if isinstance(size, int):
+            size = (size, size)
+        self.size = size
+        super().__init__(0, fill, padding_mode)
+
+    def _transform(self, inpt: Any, params: Dict[str, Any]) -> Any:        
+        fill = self._fill[type(inpt)]
+        padding = params['padding']
+        return F.pad(inpt, padding=padding, fill=fill, padding_mode=self.padding_mode)  # type: ignore[arg-type]
+
+    def transform(self, inpt: Any, params: Dict[str, Any]) -> Any:
+        return self._transform(inpt, params)
+
+    def __call__(self, *inputs: Any) -> Any:
+        outputs = super().forward(*inputs)
+        if len(outputs) > 1 and isinstance(outputs[1], dict):
+            outputs[1]['padding'] = torch.tensor(self.padding)
+        return outputs
+
+
+@register()
+class RandomIoUCrop(T.RandomIoUCrop):
+    def __init__(self, min_scale: float = 0.3, max_scale: float = 1, min_aspect_ratio: float = 0.5, max_aspect_ratio: float = 2, sampler_options: Optional[List[float]] = None, trials: int = 40, p: float = 1.0):
+        super().__init__(min_scale, max_scale, min_aspect_ratio, max_aspect_ratio, sampler_options, trials)
+        self.p = p 
+
+    def __call__(self, *inputs: Any) -> Any:
+        if torch.rand(1) >= self.p:
+            return inputs if len(inputs) > 1 else inputs[0]
+
+        return super().forward(*inputs)
+
+
+@register()
+class ConvertBoxes(T.Transform):
+    _transformed_types = (
+        BoundingBoxes,
+    )
+    def __init__(self, fmt='', normalize=False) -> None:
+        super().__init__()
+        self.fmt = fmt
+        self.normalize = normalize
+
+    def _transform(self, inpt: Any, params: Dict[str, Any]) -> Any:  
+        spatial_size = getattr(inpt, _boxes_keys[1])
+        if self.fmt:
+            in_fmt = inpt.format.value.lower()
+            inpt = torchvision.ops.box_convert(inpt, in_fmt=in_fmt, out_fmt=self.fmt.lower())
+            inpt = convert_to_tv_tensor(inpt, key='boxes', box_format=self.fmt.upper(), spatial_size=spatial_size)
+            
+        if self.normalize:
+            inpt = inpt / torch.tensor(spatial_size[::-1]).tile(2)[None]
+
+        return inpt
+
+    def transform(self, inpt: Any, params: Dict[str, Any]) -> Any:
+        return self._transform(inpt, params)
+
+
+@register()
+class ConvertPILImage(T.Transform):
+    _transformed_types = (
+        PIL.Image.Image,
+    )
+    def __init__(self, dtype='float32', scale=True) -> None:
+        super().__init__()
+        self.dtype = dtype
+        self.scale = scale
+
+    def _transform(self, inpt: Any, params: Dict[str, Any]) -> Any:  
+        inpt = F.pil_to_tensor(inpt)
+        if self.dtype == 'float32':
+            inpt = inpt.float()
+
+        if self.scale:
+            inpt = inpt / 255.
+
+        inpt = Image(inpt)
+
+        return inpt
+
+    def transform(self, inpt: Any, params: Dict[str, Any]) -> Any:
+        return self._transform(inpt, params)
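Note on `PadToSize` and `ConvertBoxes`: the first pads only on the right and bottom so the image reaches a fixed `(width, height)`, and the second can convert boxes to another format and divide by `[W, H, W, H]`. A dependency-free sketch of both calculations, assuming plain numbers instead of tensors (the helper names `pad_to_size` and `xyxy_to_normalized_cxcywh` are hypothetical):

```python
def pad_to_size(img_w, img_h, target_w, target_h):
    """Padding as [left, top, right, bottom], growing only right and bottom."""
    return [0, 0, target_w - img_w, target_h - img_h]


def xyxy_to_normalized_cxcywh(box, img_w, img_h):
    """Convert [x1, y1, x2, y2] to [cx, cy, w, h], each divided by image size."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    w, h = x2 - x1, y2 - y1
    return [cx / img_w, cy / img_h, w / img_w, h / img_h]


# A 600x400 image padded to 640x640 needs 40 px on the right and 240 px below.
padding = pad_to_size(img_w=600, img_h=400, target_w=640, target_h=640)

# A box on a 640x480 image, expressed relative to the image size.
normalized = xyxy_to_normalized_cxcywh([100, 50, 300, 250], img_w=640, img_h=480)
print(padding, normalized)
```

Since the padding is added only after the existing content, box coordinates stay valid without any offset.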
--- a/rtdetrv2_pytorch/src/data/transforms/container.py
+++ b/rtdetrv2_pytorch/src/data/transforms/container.py
@@ -0,0 +1,95 @@
+"""Copyright(c) 2023 lyuwenyu. All Rights Reserved.
+"""
+
+import torch 
+import torch.nn as nn 
+
+import torchvision
+torchvision.disable_beta_transforms_warning()
+import torchvision.transforms.v2 as T
+
+from typing import Any, Dict, List, Optional
+
+from ._transforms import EmptyTransform
+from ...core import register, GLOBAL_CONFIG
+
+
+@register()
+class Compose(T.Compose):
+    def __init__(self, ops, policy=None) -> None:
+        transforms = []
+        if ops is not None:
+            for op in ops:
+                if isinstance(op, dict):
+                    name = op.pop('type')
+                    transform = getattr(GLOBAL_CONFIG[name]['_pymodule'], GLOBAL_CONFIG[name]['_name'])(**op)
+                    transforms.append(transform)
+                    op['type'] = name
+
+                elif isinstance(op, nn.Module):
+                    transforms.append(op)
+
+                else:
+                    raise ValueError(f'Unsupported transform spec {op!r}; expected a dict or an nn.Module')
+        else:
+            transforms = [EmptyTransform(), ]
+ 
+        super().__init__(transforms=transforms)
+
+        if policy is None:
+            policy = {'name': 'default'}
+
+        self.policy = policy
+        self.global_samples = 0
+
+    def forward(self, *inputs: Any) -> Any:
+        return self.get_forward(self.policy['name'])(*inputs)
+
+    def get_forward(self, name):
+        forwards = {
+            'default': self.default_forward,
+            'stop_epoch': self.stop_epoch_forward,
+            'stop_sample': self.stop_sample_forward,
+        }
+        return forwards[name]
+
+    def default_forward(self, *inputs: Any) -> Any:
+        sample = inputs if len(inputs) > 1 else inputs[0]
+        for transform in self.transforms:
+            sample = transform(sample)
+        return sample
+
+    def stop_epoch_forward(self, *inputs: Any):
+        sample = inputs if len(inputs) > 1 else inputs[0]
+        dataset = sample[-1]
+        
+        cur_epoch = dataset.epoch
+        policy_ops = self.policy['ops']
+        policy_epoch = self.policy['epoch']
+
+        for transform in self.transforms:
+            if type(transform).__name__ in policy_ops and cur_epoch >= policy_epoch:
+                pass
+            else:
+                sample = transform(sample)
+
+        return sample
+
+
+    def stop_sample_forward(self, *inputs: Any):
+        sample = inputs if len(inputs) > 1 else inputs[0]
+        dataset = sample[-1]
+        
+        cur_epoch = dataset.epoch
+        policy_ops = self.policy['ops']
+        policy_sample = self.policy['sample']
+
+        for transform in self.transforms:
+            if type(transform).__name__ in policy_ops and self.global_samples >= policy_sample:
+                pass
+            else:
+                sample = transform(sample)
+
+        self.global_samples += 1
+
+        return sample
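Note on the `stop_epoch` policy in `Compose`: transforms whose class name appears in `policy['ops']` are skipped once the dataset's current epoch reaches `policy['epoch']`, while all other transforms keep running. A toy sketch of that rule, assuming integer samples and two made-up transforms (`Scale`, `AddOne`, and `stop_epoch_apply` are hypothetical stand-ins, not repository code):

```python
class Scale:
    """Toy transform that doubles the sample."""
    def __call__(self, x):
        return x * 2


class AddOne:
    """Toy transform standing in for a heavy augmentation."""
    def __call__(self, x):
        return x + 1


def stop_epoch_apply(transforms, sample, cur_epoch, policy):
    """Apply transforms in order, skipping policy ops once the epoch threshold is hit."""
    for transform in transforms:
        if type(transform).__name__ in policy['ops'] and cur_epoch >= policy['epoch']:
            continue
        sample = transform(sample)
    return sample


policy = {'name': 'stop_epoch', 'epoch': 2, 'ops': ['AddOne']}
pipeline = [Scale(), AddOne()]

early = stop_epoch_apply(pipeline, 3, cur_epoch=0, policy=policy)  # both transforms run
late = stop_epoch_apply(pipeline, 3, cur_epoch=2, policy=policy)   # AddOne is skipped
print(early, late)
```

The `stop_sample` policy follows the same pattern but compares a running sample counter against `policy['sample']` instead of the epoch.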
--- a/rtdetrv2_pytorch/src/data/transforms/functional.py
+++ b/rtdetrv2_pytorch/src/data/transforms/functional.py
@@ -0,0 +1,169 @@
+import torch
+import torchvision.transforms.functional as F
+
+from packaging import version
+from typing import Optional, List
+from torch import Tensor
+
+# needed due to empty tensor bug in pytorch and torchvision 0.5
+import torchvision
+if version.parse(torchvision.__version__) < version.parse('0.7'):
+    from torchvision.ops import _new_empty_tensor
+    from torchvision.ops.misc import _output_size
+
+
+def interpolate(input, size=None, scale_factor=None, mode="nearest", align_corners=None):
+    # type: (Tensor, Optional[List[int]], Optional[float], str, Optional[bool]) -> Tensor
+    """
+    Equivalent to nn.functional.interpolate, but with support for empty batch sizes.
+    This will eventually be supported natively by PyTorch, and this
+    function can go away.
+    """
+    if version.parse(torchvision.__version__) < version.parse('0.7'):
+        if input.numel() > 0:
+            return torch.nn.functional.interpolate(
+                input, size, scale_factor, mode, align_corners
+            )
+
+        output_shape = _output_size(2, input, size, scale_factor)
+        output_shape = list(input.shape[:-2]) + list(output_shape)
+        return _new_empty_tensor(input, output_shape)
+    else:
+        return torchvision.ops.misc.interpolate(input, size, scale_factor, mode, align_corners)
+
+
+
+def crop(image, target, region):
+    cropped_image = F.crop(image, *region)
+
+    target = target.copy()
+    i, j, h, w = region
+
+    # should we do something wrt the original size?
+    target["size"] = torch.tensor([h, w])
+
+    fields = ["labels", "area", "iscrowd"]
+
+    if "boxes" in target:
+        boxes = target["boxes"]
+        max_size = torch.as_tensor([w, h], dtype=torch.float32)
+        cropped_boxes = boxes - torch.as_tensor([j, i, j, i])
+        cropped_boxes = torch.min(cropped_boxes.reshape(-1, 2, 2), max_size)
+        cropped_boxes = cropped_boxes.clamp(min=0)
+        area = (cropped_boxes[:, 1, :] - cropped_boxes[:, 0, :]).prod(dim=1)
+        target["boxes"] = cropped_boxes.reshape(-1, 4)
+        target["area"] = area
+        fields.append("boxes")
+
+    if "masks" in target:
+        # FIXME should we update the area here if there are no boxes?
+        target['masks'] = target['masks'][:, i:i + h, j:j + w]
+        fields.append("masks")
+
+    # remove elements whose boxes or masks have zero area
+    if "boxes" in target or "masks" in target:
+        # favor boxes selection when defining which elements to keep
+        # this is compatible with previous implementation
+        if "boxes" in target:
+            cropped_boxes = target['boxes'].reshape(-1, 2, 2)
+            keep = torch.all(cropped_boxes[:, 1, :] > cropped_boxes[:, 0, :], dim=1)
+        else:
+            keep = target['masks'].flatten(1).any(1)
+
+        for field in fields:
+            target[field] = target[field][keep]
+
+    return cropped_image, target
+
+
+def hflip(image, target):
+    flipped_image = F.hflip(image)
+
+    w, h = image.size
+
+    target = target.copy()
+    if "boxes" in target:
+        boxes = target["boxes"]
+        boxes = boxes[:, [2, 1, 0, 3]] * torch.as_tensor([-1, 1, -1, 1]) + torch.as_tensor([w, 0, w, 0])
+        target["boxes"] = boxes
+
+    if "masks" in target:
+        target['masks'] = target['masks'].flip(-1)
+
+    return flipped_image, target
+
+
+def resize(image, target, size, max_size=None):
+    # size can be min_size (scalar) or (w, h) tuple
+
+    def get_size_with_aspect_ratio(image_size, size, max_size=None):
+        w, h = image_size
+        if max_size is not None:
+            min_original_size = float(min((w, h)))
+            max_original_size = float(max((w, h)))
+            if max_original_size / min_original_size * size > max_size:
+                size = int(round(max_size * min_original_size / max_original_size))
+
+        if (w <= h and w == size) or (h <= w and h == size):
+            return (h, w)
+
+        if w < h:
+            ow = size
+            oh = int(size * h / w)
+        else:
+            oh = size
+            ow = int(size * w / h)
+            
+        # r = min(size / min(h, w), max_size / max(h, w))
+        # ow = int(w * r)
+        # oh = int(h * r)
+
+        return (oh, ow)
+
+    def get_size(image_size, size, max_size=None):
+        if isinstance(size, (list, tuple)):
+            return size[::-1]
+        else:
+            return get_size_with_aspect_ratio(image_size, size, max_size)
+
+    size = get_size(image.size, size, max_size)
+    rescaled_image = F.resize(image, size)
+
+    if target is None:
+        return rescaled_image, None
+
+    ratios = tuple(float(s) / float(s_orig) for s, s_orig in zip(rescaled_image.size, image.size))
+    ratio_width, ratio_height = ratios
+
+    target = target.copy()
+    if "boxes" in target:
+        boxes = target["boxes"]
+        scaled_boxes = boxes * torch.as_tensor([ratio_width, ratio_height, ratio_width, ratio_height])
+        target["boxes"] = scaled_boxes
+
+    if "area" in target:
+        area = target["area"]
+        scaled_area = area * (ratio_width * ratio_height)
+        target["area"] = scaled_area
+
+    h, w = size
+    target["size"] = torch.tensor([h, w])
+
+    if "masks" in target:
+        target['masks'] = interpolate(
+            target['masks'][:, None].float(), size, mode="nearest")[:, 0] > 0.5
+
+    return rescaled_image, target
+
+
+def pad(image, target, padding):
+    # assumes that we only pad on the bottom right corners
+    padded_image = F.pad(image, (0, 0, padding[0], padding[1]))
+    if target is None:
+        return padded_image, None
+    target = target.copy()
+    # should we do something wrt the original size?
+    target["size"] = torch.tensor(padded_image.size[::-1])
+    if "masks" in target:
+        target['masks'] = torch.nn.functional.pad(target['masks'], (0, padding[0], 0, padding[1]))
+    return padded_image, target
--- a/rtdetrv2_pytorch/src/data/transforms/mosaic.py
+++ b/rtdetrv2_pytorch/src/data/transforms/mosaic.py
@@ -0,0 +1,72 @@
+"""Copyright(c) 2023 lyuwenyu. All Rights Reserved.
+"""
+
+import torch 
+import torchvision
+torchvision.disable_beta_transforms_warning()
+import torchvision.transforms.v2 as T
+import torchvision.transforms.v2.functional as F
+
+import random
+from PIL import Image 
+
+from .._misc import convert_to_tv_tensor
+from ...core import register
+
+
+@register()
+class Mosaic(T.Transform):
+    def __init__(self, size, max_size=None, ) -> None:
+        super().__init__()
+        self.resize = T.Resize(size=size, max_size=max_size)
+        self.crop = T.RandomCrop(size=max_size if max_size else size)
+        
+        # TODO: add arg `output_size` for affine
+        # self.random_perspective = T.RandomPerspective(distortion_scale=0.5, p=1., )
+        self.random_affine = T.RandomAffine(degrees=0, translate=(0.1, 0.1), scale=(0.5, 1.5), fill=114)
+
+    def forward(self, *inputs):
+        inputs = inputs if len(inputs) > 1 else inputs[0]
+        image, target, dataset = inputs
+
+        image, target = self.resize(image, target)
+        images, targets = [image], [target]
+        indices = random.choices(range(len(dataset)), k=3)
+        for i in indices:
+            image, target = dataset.load_item(i)
+            image, target = self.resize(image, target)
+            images.append(image)
+            targets.append(target)
+
+        h, w = F.get_spatial_size(images[0])
+        offset = [[0, 0], [w, 0], [0, h], [w, h]]
+        image = Image.new(mode=images[0].mode, size=(w * 2, h * 2), color=0)
+        for i, im in enumerate(images):
+            image.paste(im, offset[i])
+
+        offset = torch.tensor([[0, 0], [w, 0], [0, h], [w, h]]).repeat(1, 2)
+        target = {}
+        for k in targets[0]:
+            if k == 'boxes':
+                v = [t[k] + offset[i] for i, t in enumerate(targets)]
+            else: 
+                v = [t[k] for t in targets]
+            
+            if isinstance(v[0], torch.Tensor):
+                v = torch.cat(v, dim=0)
+
+            target[k] = v
+
+        if 'boxes' in target:
+            # target['boxes'] = target['boxes'].clamp(0, 640 * 2 - 1)
+            w, h = image.size
+            target['boxes'] = convert_to_tv_tensor(target['boxes'], 'boxes', box_format='xyxy', spatial_size=[h, w])
+        
+        if 'masks' in target:
+            target['masks'] = convert_to_tv_tensor(target['masks'], 'masks')
+
+        image, target = self.random_affine(image, target)
+        # image, target = self.resize(image, target)
+        image, target = self.crop(image, target)
+
+        return image, target, dataset
--- a/rtdetrv2_pytorch/src/data/transforms/presets.py
+++ b/rtdetrv2_pytorch/src/data/transforms/presets.py
@@ -0,0 +1,2 @@
+"""Copyright(c) 2023 lyuwenyu. All Rights Reserved.
+"""
--- a/rtdetrv2_pytorch/src/misc/__init__.py
+++ b/rtdetrv2_pytorch/src/misc/__init__.py
@@ -0,0 +1,7 @@
+"""Copyright(c) 2023 lyuwenyu. All Rights Reserved.
+"""
+
+from .logger import *
+from .visualizer import *
+from .dist_utils import setup_seed, setup_print
+from .profiler_utils import stats
--- a/rtdetrv2_pytorch/src/misc/box_ops.py
+++ b/rtdetrv2_pytorch/src/misc/box_ops.py
@@ -0,0 +1,103 @@
+"""Copyright(c) 2023 lyuwenyu. All Rights Reserved.
+"""
+
+import torch
+import torchvision
+from torch import Tensor 
+from typing import List, Tuple
+
+
+def generalized_box_iou(boxes1: Tensor, boxes2: Tensor) -> Tensor:
+    assert (boxes1[:, 2:] >= boxes1[:, :2]).all()
+    assert (boxes2[:, 2:] >= boxes2[:, :2]).all()
+    return torchvision.ops.generalized_box_iou(boxes1, boxes2)
+
+
+# elementwise
+def elementwise_box_iou(boxes1: Tensor, boxes2: Tensor) -> Tuple[Tensor, Tensor]:
+    """
+    Args:
+        boxes1, [N, 4]
+        boxes2, [N, 4]
+    Returns:
+        iou, [N, ]
+        union, [N, ]
+    """
+    area1 = torchvision.ops.box_area(boxes1) # [N, ]
+    area2 = torchvision.ops.box_area(boxes2) # [N, ]
+    lt = torch.max(boxes1[:, :2], boxes2[:, :2])  # [N, 2]
+    rb = torch.min(boxes1[:, 2:], boxes2[:, 2:])  # [N, 2]
+    wh = (rb - lt).clamp(min=0)  # [N, 2]
+    inter = wh[:, 0] * wh[:, 1]  # [N, ]
+    union = area1 + area2 - inter
+    iou = inter / union
+    return iou, union
+
+
+def elementwise_generalized_box_iou(boxes1: Tensor, boxes2: Tensor) -> Tensor:
+    """
+    Args:
+        boxes1, [N, 4] with [x1, y1, x2, y2]
+        boxes2, [N, 4] with [x1, y1, x2, y2]
+    Returns:
+        giou, [N, ]
+    """
+    assert (boxes1[:, 2:] >= boxes1[:, :2]).all()
+    assert (boxes2[:, 2:] >= boxes2[:, :2]).all()
+    iou, union = elementwise_box_iou(boxes1, boxes2)
+    lt = torch.min(boxes1[:, :2], boxes2[:, :2]) # [N, 2]
+    rb = torch.max(boxes1[:, 2:], boxes2[:, 2:]) # [N, 2]
+    wh = (rb - lt).clamp(min=0)  # [N, 2]
+    area = wh[:, 0] * wh[:, 1]
+    return iou - (area - union) / area
+
+
+def check_point_inside_box(points: Tensor, boxes: Tensor, eps=1e-9) -> Tensor:
+    """
+    Args:
+        points, [K, 2], (x, y)
+        boxes, [N, 4], (x1, y1, x2, y2)
+    Returns:
+        Tensor (bool), [K, N]
+    """
+    x, y = [p.unsqueeze(-1) for p in points.unbind(-1)]
+    x1, y1, x2, y2 = [v.unsqueeze(0) for v in boxes.unbind(-1)]
+
+    l = x - x1
+    t = y - y1 
+    r = x2 - x
+    b = y2 - y
+    
+    ltrb = torch.stack([l, t, r, b], dim=-1)
+    mask = ltrb.min(dim=-1).values > eps
+
+    return mask
+
+
+def point_box_distance(points: Tensor, boxes: Tensor) -> Tensor:
+    """
+    Args:
+        boxes, [N, 4], (x1, y1, x2, y2)
+        points, [N, 2], (x, y)
+    Returns:
+        Tensor (N, 4), (l, t, r, b)
+    """
+    x1y1, x2y2 = torch.split(boxes, 2, dim=-1)
+    lt = points - x1y1
+    rb = x2y2 - points
+    return torch.concat([lt, rb], dim=-1)
+
+
+def point_distance_box(points: Tensor, distances: Tensor) -> Tensor:
+    """
+    Args:
+        points (Tensor), [N, 2], (x, y)
+        distances (Tensor), [N, 4], (l, t, r, b)
+    Returns:
+        boxes (Tensor),  (N, 4), (x1, y1, x2, y2)
+    """
+    lt, rb = torch.split(distances, 2, dim=-1)
+    x1y1 = -lt + points
+    x2y2 = rb + points
+    boxes = torch.concat([x1y1, x2y2], dim=-1)
+    return boxes
--- a/rtdetrv2_pytorch/src/misc/dist_utils.py
+++ b/rtdetrv2_pytorch/src/misc/dist_utils.py
@@ -0,0 +1,267 @@
+"""
+reference
+- https://github.com/pytorch/vision/blob/main/references/detection/utils.py
+- https://github.com/facebookresearch/detr/blob/master/util/misc.py#L406
+
+Copyright(c) 2023 lyuwenyu. All Rights Reserved.
+"""
+
+import os
+import random
+import numpy as np 
+import atexit
+
+import torch
+import torch.nn as nn 
+import torch.distributed
+import torch.backends.cudnn
+
+from torch.nn.parallel import DataParallel as DP
+from torch.nn.parallel import DistributedDataParallel as DDP
+from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
+
+from torch.utils.data import DistributedSampler
+# from torch.utils.data.dataloader import DataLoader
+from ..data import DataLoader 
+
+
+def setup_distributed(print_rank: int=0, print_method: str='builtin', seed: int=None, ):
+    """
+    env setup
+    args:
+        print_rank, 
+        print_method, (builtin, rich)
+        seed, 
+    """
+    try:
+        # https://pytorch.org/docs/stable/elastic/run.html
+        RANK = int(os.getenv('RANK', -1))
+        LOCAL_RANK = int(os.getenv('LOCAL_RANK', -1))  
+        WORLD_SIZE = int(os.getenv('WORLD_SIZE', 1))
+        
+        # torch.distributed.init_process_group(backend=backend, init_method='env://')
+        torch.distributed.init_process_group(init_method='env://')
+        torch.distributed.barrier()
+
+        rank = torch.distributed.get_rank()
+        torch.cuda.set_device(rank)
+        torch.cuda.empty_cache()
+        enabled_dist = True
+        print('Initialized distributed mode...')
+
+    except Exception:
+        enabled_dist = False
+        print('Distributed mode not initialized.')
+
+    setup_print(get_rank() == print_rank, method=print_method)
+    if seed is not None:
+        setup_seed(seed)
+
+    return enabled_dist
+
+
+def setup_print(is_main, method='builtin'):
+    """This function disables printing when not in master process
+    """
+    import builtins as __builtin__
+
+    if method == 'builtin':
+        builtin_print = __builtin__.print
+
+    elif method == 'rich':
+        import rich 
+        builtin_print = rich.print
+
+    else:
+        raise AttributeError(f'unsupported print method: {method!r}')
+
+    def print(*args, **kwargs):
+        force = kwargs.pop('force', False)
+        if is_main or force:
+            builtin_print(*args, **kwargs)
+
+    __builtin__.print = print
+
+
+def is_dist_available_and_initialized():
+    if not torch.distributed.is_available():
+        return False
+    if not torch.distributed.is_initialized():
+        return False
+    return True
+
+
+@atexit.register
+def cleanup():
+    """cleanup distributed environment
+    """
+    if is_dist_available_and_initialized():
+        torch.distributed.barrier()
+        torch.distributed.destroy_process_group()
+
+
+def get_rank():
+    if not is_dist_available_and_initialized():
+        return 0
+    return torch.distributed.get_rank()
+
+
+def get_world_size():
+    if not is_dist_available_and_initialized():
+        return 1
+    return torch.distributed.get_world_size()
+
+    
+def is_main_process():
+    return get_rank() == 0
+
+
+def save_on_master(*args, **kwargs):
+    if is_main_process():
+        torch.save(*args, **kwargs)
+
+
+
+def warp_model(
+    model: torch.nn.Module, 
+    sync_bn: bool=False, 
+    dist_mode: str='ddp', 
+    find_unused_parameters: bool=False, 
+    compile: bool=False, 
+    compile_mode: str='reduce-overhead', 
+    **kwargs
+):
+    if is_dist_available_and_initialized():
+        rank = get_rank()
+        model = nn.SyncBatchNorm.convert_sync_batchnorm(model) if sync_bn else model 
+        if dist_mode == 'dp':
+            model = DP(model, device_ids=[rank], output_device=rank)
+        elif dist_mode == 'ddp':
+            model = DDP(model, device_ids=[rank], output_device=rank, find_unused_parameters=find_unused_parameters)
+        else:
+            raise AttributeError(f'unsupported dist_mode: {dist_mode!r}')
+
+    if compile:
+        model = torch.compile(model, mode=compile_mode)
+
+    return model
+
+def de_model(model):
+    return de_parallel(de_complie(model))
+
+
+def warp_loader(loader, shuffle=False):        
+    if is_dist_available_and_initialized():
+        sampler = DistributedSampler(loader.dataset, shuffle=shuffle)
+        loader = DataLoader(loader.dataset, 
+                            loader.batch_size, 
+                            sampler=sampler, 
+                            drop_last=loader.drop_last, 
+                            collate_fn=loader.collate_fn, 
+                            pin_memory=loader.pin_memory,
+                            num_workers=loader.num_workers, )
+    return loader
+
+
+
+def is_parallel(model) -> bool:
+    # Returns True if model is of type DP or DDP
+    return type(model) in (torch.nn.parallel.DataParallel, torch.nn.parallel.DistributedDataParallel)
+
+
+def de_parallel(model) -> nn.Module:
+    # De-parallelize a model: returns single-GPU model if model is of type DP or DDP
+    return model.module if is_parallel(model) else model
+
+
+def reduce_dict(data, avg=True):
+    """
+    Args:
+        data (dict): tensors to reduce, {k: v, ...}
+        avg (bool): average across processes if True, otherwise sum
+    """
+    world_size = get_world_size()
+    if world_size < 2:
+        return data
+    
+    with torch.no_grad():
+        keys, values = [], []
+        for k in sorted(data.keys()):
+            keys.append(k)
+            values.append(data[k])
+
+        values = torch.stack(values, dim=0)
+        torch.distributed.all_reduce(values)
+
+        if avg is True:
+            values /= world_size
+        
+        return {k: v for k, v in zip(keys, values)}
+        
+
+def all_gather(data):
+    """
+    Run all_gather on arbitrary picklable data (not necessarily tensors)
+    Args:
+        data: any picklable object
+    Returns:
+        list[data]: list of data gathered from each rank
+    """
+    world_size = get_world_size()
+    if world_size == 1:
+        return [data]
+    data_list = [None] * world_size
+    torch.distributed.all_gather_object(data_list, data)
+    return data_list
+
+    
+import time 
+def sync_time():
+    """sync_time
+    """
+    if torch.cuda.is_available():
+        torch.cuda.synchronize()
+
+    return time.time()
+
+
+
+def setup_seed(seed: int, deterministic=False):
+    """setup_seed for reproducibility
+    torch.manual_seed(3407) is all you need. https://arxiv.org/abs/2109.08203
+    """
+    seed = seed + get_rank()
+    random.seed(seed)
+    np.random.seed(seed)
+    torch.manual_seed(seed)
+    
+    if torch.cuda.is_available():
+        torch.cuda.manual_seed_all(seed)
+
+    # memory will be large when setting deterministic to True
+    if torch.backends.cudnn.is_available() and deterministic:
+        torch.backends.cudnn.deterministic = True
+
+
+# for torch.compile
+def check_compile():
+    import torch
+    import warnings
+    gpu_ok = False
+    if torch.cuda.is_available():
+        device_cap = torch.cuda.get_device_capability()
+        if device_cap in ((7, 0), (8, 0), (9, 0)):
+            gpu_ok = True
+    if not gpu_ok:
+        warnings.warn(
+            "GPU is not NVIDIA V100, A100, or H100. Speedup numbers may be lower "
+            "than expected."
+        )
+    return gpu_ok
+
+def is_compile(model):
+    import torch._dynamo
+    return type(model) in (torch._dynamo.OptimizedModule, )
+
+def de_complie(model):
+    return model._orig_mod if is_compile(model) else model
--- a/rtdetrv2_pytorch/src/misc/lazy_loader.py
+++ b/rtdetrv2_pytorch/src/misc/lazy_loader.py
@@ -0,0 +1,70 @@
+"""
+https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/util/lazy_loader.py
+"""
+
+
+import types
+import importlib
+
+class LazyLoader(types.ModuleType):
+  """Lazily import a module, mainly to avoid pulling in large dependencies.
+
+  `paddle` and `ffmpeg` are examples of modules that are large and not always
+  needed, and this allows them to only be loaded when they are used.
+  """
+
+  # The lint error here is incorrect.
+  def __init__(self, local_name, parent_module_globals, name, warning=None):
+    self._local_name = local_name
+    self._parent_module_globals = parent_module_globals
+    self._warning = warning
+
+    # These members allow doctest to correctly process this module member without
+    # triggering self._load(). self._load() mutates parent_module_globals and
+    # triggers a dict mutated during iteration error from doctest.py.
+    # - for from_module()
+    self.__module__ = name.rsplit(".", 1)[0]
+    # - for is_routine()
+    self.__wrapped__ = None
+
+    super(LazyLoader, self).__init__(name)
+
+  def _load(self):
+    """Load the module and insert it into the parent's globals."""
+    # Import the target module and insert it into the parent's namespace
+    module = importlib.import_module(self.__name__)
+    self._parent_module_globals[self._local_name] = module
+
+    # Emit a warning if one was specified
+    if self._warning:
+      # logging.warning(self._warning)
+      # Make sure to only warn once.
+      self._warning = None
+
+    # Update this object's dict so that if someone keeps a reference to the
+    #   LazyLoader, lookups are efficient (__getattr__ is only called on lookups
+    #   that fail).
+    self.__dict__.update(module.__dict__)
+
+    return module
+
+  def __getattr__(self, item):
+    module = self._load()
+    return getattr(module, item)
+
+  def __repr__(self):
+    # Be careful not to trigger _load, since repr may be called in very
+    # sensitive places.
+    return f"<LazyLoader {self.__name__} as {self._local_name}>"
+
+  def __dir__(self):
+    module = self._load()
+    return dir(module)
+
+
+# import paddle.nn as nn
+# nn = LazyLoader("nn", globals(), "paddle.nn")
+
+# class M(nn.Layer):
+#     def __init__(self) -> None:
+#       super().__init__()
--- a/rtdetrv2_pytorch/src/misc/logger.py
+++ b/rtdetrv2_pytorch/src/misc/logger.py
@@ -0,0 +1,239 @@
+"""
+# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved
+https://github.com/facebookresearch/detr/blob/main/util/misc.py
+Mostly copy-paste from torchvision references.
+"""
+
+import time
+import pickle
+import datetime
+from collections import defaultdict, deque
+from typing import Dict
+
+import torch
+import torch.distributed as tdist
+
+from .dist_utils import is_dist_available_and_initialized, get_world_size
+
+
+class SmoothedValue(object):
+    """Track a series of values and provide access to smoothed values over a
+    window or the global series average.
+    """
+
+    def __init__(self, window_size=20, fmt=None):
+        if fmt is None:
+            fmt = "{median:.4f} ({global_avg:.4f})"
+        self.deque = deque(maxlen=window_size)
+        self.total = 0.0
+        self.count = 0
+        self.fmt = fmt
+
+    def update(self, value, n=1):
+        self.deque.append(value)
+        self.count += n
+        self.total += value * n
+
+    def synchronize_between_processes(self):
+        """
+        Warning: does not synchronize the deque!
+        """
+        if not is_dist_available_and_initialized():
+            return
+        t = torch.tensor([self.count, self.total], dtype=torch.float64, device='cuda')
+        tdist.barrier()
+        tdist.all_reduce(t)
+        t = t.tolist()
+        self.count = int(t[0])
+        self.total = t[1]
+
+    @property
+    def median(self):
+        d = torch.tensor(list(self.deque))
+        return d.median().item()
+
+    @property
+    def avg(self):
+        d = torch.tensor(list(self.deque), dtype=torch.float32)
+        return d.mean().item()
+
+    @property
+    def global_avg(self):
+        return self.total / self.count
+
+    @property
+    def max(self):
+        return max(self.deque)
+
+    @property
+    def value(self):
+        return self.deque[-1]
+
+    def __str__(self):
+        return self.fmt.format(
+            median=self.median,
+            avg=self.avg,
+            global_avg=self.global_avg,
+            max=self.max,
+            value=self.value)
+
+
+def all_gather(data):
+    """
+    Run all_gather on arbitrary picklable data (not necessarily tensors)
+    Args:
+        data: any picklable object
+    Returns:
+        list[data]: list of data gathered from each rank
+    """
+    world_size = get_world_size()
+    if world_size == 1:
+        return [data]
+
+    # serialized to a Tensor
+    buffer = pickle.dumps(data)
+    storage = torch.ByteStorage.from_buffer(buffer)
+    tensor = torch.ByteTensor(storage).to("cuda")
+
+    # obtain Tensor size of each rank
+    local_size = torch.tensor([tensor.numel()], device="cuda")
+    size_list = [torch.tensor([0], device="cuda") for _ in range(world_size)]
+    tdist.all_gather(size_list, local_size)
+    size_list = [int(size.item()) for size in size_list]
+    max_size = max(size_list)
+
+    # receiving Tensor from all ranks
+    # we pad the tensor because torch all_gather does not support
+    # gathering tensors of different shapes
+    tensor_list = []
+    for _ in size_list:
+        tensor_list.append(torch.empty((max_size,), dtype=torch.uint8, device="cuda"))
+    if local_size != max_size:
+        padding = torch.empty(size=(max_size - local_size,), dtype=torch.uint8, device="cuda")
+        tensor = torch.cat((tensor, padding), dim=0)
+    tdist.all_gather(tensor_list, tensor)
+
+    data_list = []
+    for size, tensor in zip(size_list, tensor_list):
+        buffer = tensor.cpu().numpy().tobytes()[:size]
+        data_list.append(pickle.loads(buffer))
+
+    return data_list
+
+
+def reduce_dict(input_dict, average=True) -> Dict[str, torch.Tensor]:
+    """
+    Args:
+        input_dict (dict): all the values will be reduced
+        average (bool): whether to do average or sum
+    Reduce the values in the dictionary from all processes so that all processes
+    have the averaged results. Returns a dict with the same fields as
+    input_dict, after reduction.
+    """
+    world_size = get_world_size()
+    if world_size < 2:
+        return input_dict
+    with torch.no_grad():
+        names = []
+        values = []
+        # sort the keys so that they are consistent across processes
+        for k in sorted(input_dict.keys()):
+            names.append(k)
+            values.append(input_dict[k])
+        values = torch.stack(values, dim=0)
+        tdist.all_reduce(values)
+        if average:
+            values /= world_size
+        reduced_dict = {k: v for k, v in zip(names, values)}
+    return reduced_dict
+
+
+class MetricLogger(object):
+    def __init__(self, delimiter="\t"):
+        self.meters = defaultdict(SmoothedValue)
+        self.delimiter = delimiter
+
+    def update(self, **kwargs):
+        for k, v in kwargs.items():
+            if isinstance(v, torch.Tensor):
+                v = v.item()
+            assert isinstance(v, (float, int))
+            self.meters[k].update(v)
+
+    def __getattr__(self, attr):
+        if attr in self.meters:
+            return self.meters[attr]
+        if attr in self.__dict__:
+            return self.__dict__[attr]
+        raise AttributeError("'{}' object has no attribute '{}'".format(
+            type(self).__name__, attr))
+
+    def __str__(self):
+        loss_str = []
+        for name, meter in self.meters.items():
+            loss_str.append(
+                "{}: {}".format(name, str(meter))
+            )
+        return self.delimiter.join(loss_str)
+
+    def synchronize_between_processes(self):
+        for meter in self.meters.values():
+            meter.synchronize_between_processes()
+
+    def add_meter(self, name, meter):
+        self.meters[name] = meter
+
+    def log_every(self, iterable, print_freq, header=None):
+        i = 0
+        if not header:
+            header = ''
+        start_time = time.time()
+        end = time.time()
+        iter_time = SmoothedValue(fmt='{avg:.4f}')
+        data_time = SmoothedValue(fmt='{avg:.4f}')
+        space_fmt = ':' + str(len(str(len(iterable)))) + 'd'
+        if torch.cuda.is_available():
+            log_msg = self.delimiter.join([
+                header,
+                '[{0' + space_fmt + '}/{1}]',
+                'eta: {eta}',
+                '{meters}',
+                'time: {time}',
+                'data: {data}',
+                'max mem: {memory:.0f}'
+            ])
+        else:
+            log_msg = self.delimiter.join([
+                header,
+                '[{0' + space_fmt + '}/{1}]',
+                'eta: {eta}',
+                '{meters}',
+                'time: {time}',
+                'data: {data}'
+            ])
+        MB = 1024.0 * 1024.0
+        for obj in iterable:
+            data_time.update(time.time() - end)
+            yield obj
+            iter_time.update(time.time() - end)
+            if i % print_freq == 0 or i == len(iterable) - 1:
+                eta_seconds = iter_time.global_avg * (len(iterable) - i)
+                eta_string = str(datetime.timedelta(seconds=int(eta_seconds)))
+                if torch.cuda.is_available():
+                    print(log_msg.format(
+                        i, len(iterable), eta=eta_string,
+                        meters=str(self),
+                        time=str(iter_time), data=str(data_time),
+                        memory=torch.cuda.max_memory_allocated() / MB))
+                else:
+                    print(log_msg.format(
+                        i, len(iterable), eta=eta_string,
+                        meters=str(self),
+                        time=str(iter_time), data=str(data_time)))
+            i += 1
+            end = time.time()
+        total_time = time.time() - start_time
+        total_time_str = str(datetime.timedelta(seconds=int(total_time)))
+        print('{} Total time: {} ({:.4f} s / it)'.format(
+            header, total_time_str, total_time / len(iterable)))
+
--- a/rtdetrv2_pytorch/src/misc/profiler_utils.py
+++ b/rtdetrv2_pytorch/src/misc/profiler_utils.py
@@ -0,0 +1,65 @@
+"""Copyright(c) 2023 lyuwenyu. All Rights Reserved.
+"""
+
+import re
+import torch
+import torch.nn as nn
+from torch import Tensor 
+
+from typing import List
+
+def stats(
+    model: nn.Module, 
+    data: Tensor=None, 
+    input_shape: List=[1, 3, 640, 640], 
+    device: str='cpu', 
+    verbose=False) -> dict:
+    
+    is_training = model.training
+
+    model.train()
+    num_params = sum([p.numel() for p in model.parameters() if p.requires_grad])
+
+    model.eval()
+    model = model.to(device)
+
+    if data is None:
+        data = torch.rand(*input_shape, device=device)
+        
+    def trace_handler(prof):
+        print(prof.key_averages().table(
+            sort_by="self_cuda_time_total", row_limit=-1))
+
+    num_active = 2
+    with torch.profiler.profile(
+        activities=[
+            torch.profiler.ProfilerActivity.CPU,
+            torch.profiler.ProfilerActivity.CUDA,
+        ],
+        schedule=torch.profiler.schedule(
+            wait=1,
+            warmup=1,
+            active=num_active,
+            repeat=1
+        ),
+        # on_trace_ready=trace_handler,
+        # on_trace_ready=torch.profiler.tensorboard_trace_handler('./log')
+        # with_modules=True,
+        with_flops=True,
+    ) as p:
+        for _ in range(5):
+            _ = model(data)
+            p.step()
+
+    if is_training:
+        model.train()
+    
+    info = p.key_averages().table(sort_by="self_cuda_time_total", row_limit=-1)
+    num_flops = sum([float(v.strip()) for v in re.findall(r'(\d+.?\d+ *\n)', info)]) / num_active
+
+    if verbose:
+        # print(info)
+        print(f'Total number of trainable parameters: {num_params}')
+        print(f'Total number of flops: {int(num_flops)}M with {input_shape}')
+
+    return {'n_parameters': num_params, 'n_flops': num_flops, 'info': info}
--- a/rtdetrv2_pytorch/src/misc/visualizer.py
+++ b/rtdetrv2_pytorch/src/misc/visualizer.py
@@ -0,0 +1,34 @@
+"""Copyright(c) 2023 lyuwenyu. All Rights Reserved.
+"""
+
+import torch
+import torch.utils.data
+
+import torchvision
+torchvision.disable_beta_transforms_warning()
+
+import PIL 
+
+__all__ = ['show_sample']
+
+def show_sample(sample):
+    """for coco dataset/dataloader
+    """
+    import matplotlib.pyplot as plt
+    from torchvision.transforms.v2 import functional as F
+    from torchvision.utils import draw_bounding_boxes
+
+    image, target = sample
+    if isinstance(image, PIL.Image.Image):
+        image = F.to_image_tensor(image)
+
+    image = F.convert_dtype(image, torch.uint8)
+    annotated_image = draw_bounding_boxes(image, target["boxes"], colors="yellow", width=3)
+
+    fig, ax = plt.subplots()
+    ax.imshow(annotated_image.permute(1, 2, 0).numpy())
+    ax.set(xticklabels=[], yticklabels=[], xticks=[], yticks=[])
+    fig.tight_layout()
+    fig.show()
+    plt.show()
+
--- a/rtdetrv2_pytorch/src/nn/__init__.py
+++ b/rtdetrv2_pytorch/src/nn/__init__.py
@@ -0,0 +1,17 @@
+"""Copyright(c) 2023 lyuwenyu. All Rights Reserved.
+"""
+
+
+from .arch import *
+from .criterion import *
+from .postprocessor import *
+
+# 
+from .backbone import *
+
+
+from .backbone import (
+    get_activation, 
+    FrozenBatchNorm2d,
+    freeze_batch_norm2d,
+)
--- a/rtdetrv2_pytorch/src/nn/arch/__init__.py
+++ b/rtdetrv2_pytorch/src/nn/arch/__init__.py
@@ -0,0 +1,6 @@
+"""Copyright(c) 2023 lyuwenyu. All Rights Reserved.
+"""
+
+
+from .classification import Classification, ClassHead
+from .yolo import YOLO
--- a/rtdetrv2_pytorch/src/nn/arch/classification.py
+++ b/rtdetrv2_pytorch/src/nn/arch/classification.py
@@ -0,0 +1,45 @@
+"""Copyright(c) 2023 lyuwenyu. All Rights Reserved.
+"""
+
+
+import torch 
+import torch.nn as nn
+
+from ...core import register
+
+
+__all__ = ['Classification', 'ClassHead']
+
+
+@register()
+class Classification(torch.nn.Module):
+    __inject__ = ['backbone', 'head']
+
+    def __init__(self, backbone: nn.Module, head: nn.Module=None):
+        super().__init__()
+        
+        self.backbone = backbone
+        self.head = head
+
+    def forward(self, x):
+        x = self.backbone(x)
+
+        if self.head is not None:
+            x = self.head(x)
+
+        return x 
+
+
+@register()
+class ClassHead(nn.Module):
+    def __init__(self, hidden_dim, num_classes):
+        super().__init__()
+        self.pool = nn.AdaptiveAvgPool2d(1)
+        self.proj = nn.Linear(hidden_dim, num_classes)  
+
+    def forward(self, x):
+        x = x[0] if isinstance(x, (list, tuple)) else x 
+        x = self.pool(x)
+        x = x.reshape(x.shape[0], -1)
+        x = self.proj(x)
+        return x 
--- a/rtdetrv2_pytorch/src/nn/arch/yolo.py
+++ b/rtdetrv2_pytorch/src/nn/arch/yolo.py
@@ -0,0 +1,33 @@
+"""Copyright(c) 2023 lyuwenyu. All Rights Reserved.
+"""
+
+import torch
+
+from ...core import register
+
+
+__all__ = ['YOLO', ]
+
+
+@register()
+class YOLO(torch.nn.Module):
+    __inject__ = ['backbone', 'neck', 'head', ]
+
+    def __init__(self, backbone: torch.nn.Module, neck, head):
+        super().__init__()
+        self.backbone = backbone
+        self.neck = neck
+        self.head = head
+
+    def forward(self, x, **kwargs):           
+        x = self.backbone(x)
+        x = self.neck(x)        
+        x = self.head(x)
+        return x
+    
+    def deploy(self, ):
+        self.eval()
+        for m in self.modules():
+            if m is not self and hasattr(m, 'deploy'):
+                m.deploy()
+        return self 
--- a/rtdetrv2_pytorch/src/nn/backbone/__init__.py
+++ b/rtdetrv2_pytorch/src/nn/backbone/__init__.py
@@ -0,0 +1,18 @@
+"""Copyright(c) 2023 lyuwenyu. All Rights Reserved.
+"""
+
+from .common import (
+    get_activation, 
+    FrozenBatchNorm2d,
+    freeze_batch_norm2d,
+)
+from .presnet import PResNet
+from .test_resnet import MResNet
+
+from .timm_model import TimmModel
+from .torchvision_model import TorchVisionModel
+
+from .csp_resnet import CSPResNet
+from .csp_darknet import CSPDarkNet, CSPPAN
+
+from .hgnetv2 import HGNetv2
--- a/rtdetrv2_pytorch/src/nn/backbone/common.py
+++ b/rtdetrv2_pytorch/src/nn/backbone/common.py
@@ -0,0 +1,97 @@
+"""Copyright(c) 2023 lyuwenyu. All Rights Reserved.
+"""
+
+import torch 
+import torch.nn as nn
+
+
+class FrozenBatchNorm2d(nn.Module):
+    """Copied and modified from https://github.com/facebookresearch/detr/blob/master/models/backbone.py
+    BatchNorm2d where the batch statistics and the affine parameters are fixed.
+    Copy-paste from torchvision.misc.ops with added eps before rsqrt,
+    without which models other than torchvision.models.resnet[18,34,50,101]
+    produce NaNs.
+    """
+    def __init__(self, num_features, eps=1e-5):
+        super(FrozenBatchNorm2d, self).__init__()
+        n = num_features
+        self.register_buffer("weight", torch.ones(n))
+        self.register_buffer("bias", torch.zeros(n))
+        self.register_buffer("running_mean", torch.zeros(n))
+        self.register_buffer("running_var", torch.ones(n))
+        self.eps = eps
+        self.num_features = n 
+
+    def _load_from_state_dict(self, state_dict, prefix, local_metadata, strict,
+                              missing_keys, unexpected_keys, error_msgs):
+        num_batches_tracked_key = prefix + 'num_batches_tracked'
+        if num_batches_tracked_key in state_dict:
+            del state_dict[num_batches_tracked_key]
+
+        super(FrozenBatchNorm2d, self)._load_from_state_dict(
+            state_dict, prefix, local_metadata, strict,
+            missing_keys, unexpected_keys, error_msgs)
+
+    def forward(self, x):
+        # move reshapes to the beginning
+        # to make it fuser-friendly
+        w = self.weight.reshape(1, -1, 1, 1)
+        b = self.bias.reshape(1, -1, 1, 1)
+        rv = self.running_var.reshape(1, -1, 1, 1)
+        rm = self.running_mean.reshape(1, -1, 1, 1)
+        scale = w * (rv + self.eps).rsqrt()
+        bias = b - rm * scale
+        return x * scale + bias
+
+    def extra_repr(self):
+        return (
+            "{num_features}, eps={eps}".format(**self.__dict__)
+        )
+
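+def freeze_batch_norm2d(module: nn.Module) -> nn.Module:
+    """Recursively replace every nn.BatchNorm2d in `module` with FrozenBatchNorm2d.
+
+    The replacement layers start from default buffers (weight=1, bias=0,
+    running_mean=0, running_var=1); the original statistics are not copied,
+    so the real values must come from a state dict loaded afterwards.
+    """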
+    if isinstance(module, nn.BatchNorm2d):
+        module = FrozenBatchNorm2d(module.num_features)
+    else:
+        for name, child in module.named_children():
+            _child = freeze_batch_norm2d(child)
+            if _child is not child:
+                setattr(module, name, _child)
+    return module
+
+
+def get_activation(act: str, inplace: bool=True):
+    """get activation
+    """
+    if act is None:
+        return nn.Identity()
+
+    elif isinstance(act, nn.Module):
+        return act 
+
+    act = act.lower()
+    
+    if act == 'silu' or act == 'swish':
+        m = nn.SiLU()
+
+    elif act == 'relu':
+        m = nn.ReLU()
+
+    elif act == 'leaky_relu':
+        m = nn.LeakyReLU()
+
+    elif act == 'gelu':
+        m = nn.GELU()
+
+    elif act == 'hardsigmoid':
+        m = nn.Hardsigmoid()
+
+    else:
+        raise RuntimeError(f'Unsupported activation: {act!r}')
+
+    if hasattr(m, 'inplace'):
+        m.inplace = inplace
+    
+    return m 
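Review note on `common.py`: the sketch below illustrates what `FrozenBatchNorm2d` computes and why the order of freezing and weight loading matters. `FrozenBN` is a trimmed stand-in for the class above, and `freeze_with_stats` is a hypothetical helper that is not part of this commit. It assumes the goal is to match an eval-mode `nn.BatchNorm2d`, which `freeze_batch_norm2d` alone does not guarantee because it creates layers with default buffers.

```python
import torch
import torch.nn as nn


class FrozenBN(nn.Module):
    """Trimmed stand-in mirroring FrozenBatchNorm2d's buffers and forward pass."""

    def __init__(self, num_features, eps=1e-5):
        super().__init__()
        self.register_buffer("weight", torch.ones(num_features))
        self.register_buffer("bias", torch.zeros(num_features))
        self.register_buffer("running_mean", torch.zeros(num_features))
        self.register_buffer("running_var", torch.ones(num_features))
        self.eps = eps

    def forward(self, x):
        w = self.weight.reshape(1, -1, 1, 1)
        b = self.bias.reshape(1, -1, 1, 1)
        rv = self.running_var.reshape(1, -1, 1, 1)
        rm = self.running_mean.reshape(1, -1, 1, 1)
        scale = w * (rv + self.eps).rsqrt()
        return x * scale + (b - rm * scale)


def freeze_with_stats(bn: nn.BatchNorm2d) -> FrozenBN:
    """Hypothetical helper: build a frozen layer and copy the trained state into it."""
    frozen = FrozenBN(bn.num_features, eps=bn.eps)
    with torch.no_grad():
        frozen.weight.copy_(bn.weight)
        frozen.bias.copy_(bn.bias)
        frozen.running_mean.copy_(bn.running_mean)
        frozen.running_var.copy_(bn.running_var)
    return frozen


torch.manual_seed(0)
bn = nn.BatchNorm2d(4)
with torch.no_grad():
    # Pretend the layer has been trained: give it non-default state.
    bn.weight.uniform_(0.5, 1.5)
    bn.bias.uniform_(-1.0, 1.0)
    bn.running_mean.uniform_(-1.0, 1.0)
    bn.running_var.uniform_(0.5, 2.0)
bn.eval()

x = torch.randn(2, 4, 8, 8)
copied_diff = (freeze_with_stats(bn)(x) - bn(x)).abs().max().item()
fresh_diff = (FrozenBN(4)(x) - bn(x)).abs().max().item()
print(f"copied stats: {copied_diff:.2e}, default buffers: {fresh_diff:.2e}")
```

With copied statistics the frozen layer should agree with the eval-mode layer up to floating-point error, while a freshly constructed one does not. This is consistent with the backbones in this commit, which freeze first and load the pretrained state dict afterwards.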
--- a/rtdetrv2_pytorch/src/nn/backbone/csp_darknet.py
+++ b/rtdetrv2_pytorch/src/nn/backbone/csp_darknet.py
@@ -0,0 +1,177 @@
+"""Copyright(c) 2023 lyuwenyu. All Rights Reserved.
+"""
+
+import torch 
+import torch.nn as nn 
+import torch.nn.functional as F 
+
+import math
+import warnings
+
+from .common import get_activation
+from ...core import register
+
+
+def autopad(k, p=None): 
+    if p is None:
+        p = k // 2 if isinstance(k, int) else [x // 2 for x in k] 
+    return p
+
+def make_divisible(c, d):
+    return math.ceil(c / d) * d
+    
+
+class Conv(nn.Module):
+    def __init__(self, cin, cout, k=1, s=1, p=None, g=1, act='silu') -> None:
+        super().__init__()
+        self.conv = nn.Conv2d(cin, cout, k, s, autopad(k, p), groups=g, bias=False)
+        self.bn = nn.BatchNorm2d(cout)
+        self.act = get_activation(act, inplace=True)
+
+    def forward(self, x):
+        return self.act(self.bn(self.conv(x)))
+
+
+class Bottleneck(nn.Module):
+    # Standard bottleneck
+    def __init__(self, c1, c2, shortcut=True, g=1, e=0.5, act='silu'):
+        super().__init__()
+        c_ = int(c2 * e)  # hidden channels
+        self.cv1 = Conv(c1, c_, 1, 1, act=act)
+        self.cv2 = Conv(c_, c2, 3, 1, g=g, act=act)
+        self.add = shortcut and c1 == c2
+
+    def forward(self, x):
+        return x + self.cv2(self.cv1(x)) if self.add else self.cv2(self.cv1(x))
+
+
+class C3(nn.Module):
+    # CSP Bottleneck with 3 convolutions
+    def __init__(self, c1, c2, n=1, shortcut=True, g=1, e=0.5, act='silu'):  # ch_in, ch_out, number, shortcut, groups, expansion
+        super().__init__()
+        c_ = int(c2 * e)  # hidden channels
+        self.cv1 = Conv(c1, c_, 1, 1, act=act)
+        self.cv2 = Conv(c1, c_, 1, 1, act=act)
+        self.m = nn.Sequential(*(Bottleneck(c_, c_, shortcut, g, e=1.0, act=act) for _ in range(n)))
+        self.cv3 = Conv(2 * c_, c2, 1, act=act)
+
+    def forward(self, x):
+        return self.cv3(torch.cat((self.m(self.cv1(x)), self.cv2(x)), dim=1))
+
+
+class SPPF(nn.Module):
+    # Spatial Pyramid Pooling - Fast (SPPF) layer for YOLOv5 by Glenn Jocher
+    def __init__(self, c1, c2, k=5, act='silu'):  # equivalent to SPP(k=(5, 9, 13))
+        super().__init__()
+        c_ = c1 // 2  # hidden channels
+        self.cv1 = Conv(c1, c_, 1, 1, act=act)
+        self.cv2 = Conv(c_ * 4, c2, 1, 1, act=act)
+        self.m = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)
+
+    def forward(self, x):
+        x = self.cv1(x)
+        with warnings.catch_warnings():
+            warnings.simplefilter('ignore')  # suppress torch 1.9.0 max_pool2d() warning
+            y1 = self.m(x)
+            y2 = self.m(y1)
+            return self.cv2(torch.cat([x, y1, y2, self.m(y2)], 1))
+
+
+@register()
+class CSPDarkNet(nn.Module):
+    __share__ = ['depth_multi', 'width_multi']
+
+    def __init__(self, in_channels=3, width_multi=1.0, depth_multi=1.0, return_idx=[2, 3, -1], act='silu', ) -> None:
+        super().__init__()
+
+        channels = [64, 128, 256, 512, 1024]
+        channels = [make_divisible(c * width_multi, 8) for c in channels]
+
+        depths = [3, 6, 9, 3]
+        depths = [max(round(d * depth_multi), 1) for d in depths]
+
+        self.layers = nn.ModuleList([Conv(in_channels, channels[0], 6, 2, 2, act=act)])
+        for i, (c, d) in enumerate(zip(channels, depths), 1):
+            layer = nn.Sequential(*[Conv(c, channels[i], 3, 2, act=act), C3(channels[i], channels[i], n=d, act=act)])
+            self.layers.append(layer)
+
+        self.layers.append(SPPF(channels[-1], channels[-1], k=5, act=act))
+
+        self.return_idx = return_idx
+        self.out_channels = [channels[i] for i in self.return_idx]
+        self.strides = [[2, 4, 8, 16, 32][i] for i in self.return_idx]
+        self.depths = depths
+        self.act = act
+
+    def forward(self, x):
+        outputs = []
+        for _, m in enumerate(self.layers):
+            x = m(x)
+            outputs.append(x)
+
+        return [outputs[i] for i in self.return_idx]
+
+
+@register()
+class CSPPAN(nn.Module):
+    """
+    P5 ---> 1x1  ---------------------------------> concat --> c3 --> det
+             | up                                     | conv /2 
+    P4 ---> concat ---> c3 ---> 1x1  -->  concat ---> c3 -----------> det
+                                 | up       | conv /2
+    P3 -----------------------> concat ---> c3 ---------------------> det
+    """
+    __share__ = ['depth_multi', ]
+
+    def __init__(self, in_channels=[256, 512, 1024], depth_multi=1., act='silu') -> None:
+        super().__init__()
+        depth = max(round(3 * depth_multi), 1)
+
+        self.out_channels = in_channels
+        self.fpn_stems = nn.ModuleList([Conv(cin, cout, 1, 1, act=act) for cin, cout in zip(in_channels[::-1], in_channels[::-1][1:])])
+        self.fpn_csps = nn.ModuleList([C3(cin, cout, depth, False, act=act) for cin, cout in zip(in_channels[::-1], in_channels[::-1][1:])])
+
+        self.pan_stems = nn.ModuleList([Conv(c, c, 3, 2, act=act) for c in in_channels[:-1]])
+        self.pan_csps = nn.ModuleList([C3(c, c, depth, False, act=act) for c in in_channels[1:]])
+
+    def forward(self, feats):
+        fpn_feats = []
+        for i, feat in enumerate(feats[::-1]):
+            if i == 0:
+                feat = self.fpn_stems[i](feat)
+                fpn_feats.append(feat)
+            else:
+                _feat = F.interpolate(fpn_feats[-1], scale_factor=2, mode='nearest')
+                feat = torch.concat([_feat, feat], dim=1)
+                feat = self.fpn_csps[i-1](feat)
+                if i < len(self.fpn_stems):
+                    feat = self.fpn_stems[i](feat)
+                fpn_feats.append(feat)
+
+        pan_feats = []
+        for i, feat in enumerate(fpn_feats[::-1]):
+            if i == 0:
+                pan_feats.append(feat)
+            else:
+                _feat = self.pan_stems[i-1](pan_feats[-1])
+                feat = torch.concat([_feat, feat], dim=1)
+                feat = self.pan_csps[i-1](feat)
+                pan_feats.append(feat)
+
+        return pan_feats
+
+
+if __name__ == '__main__':
+
+    data = torch.rand(1, 3, 320, 640)
+
+    width_multi = 0.75
+    depth_multi = 0.33
+
+    m = CSPDarkNet(3, width_multi=width_multi, depth_multi=depth_multi, act='silu')
+    outputs = m(data)
+    print([o.shape for o in outputs])
+
+    m = CSPPAN(in_channels=m.out_channels, depth_multi=depth_multi, act='silu')
+    outputs = m(outputs)
+    print([o.shape for o in outputs])
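Review note on `csp_darknet.py`: the `SPPF` comment states that chaining the same 5x5 pool is equivalent to `SPP(k=(5, 9, 13))`. A minimal check of that claim, assuming stride-1 max pooling with `k // 2` padding as in the module above (the `pool` helper name is illustrative only):

```python
import torch
import torch.nn.functional as F


def pool(x, k):
    # Stride-1 max pool with k // 2 padding keeps the spatial size, as in SPPF.
    return F.max_pool2d(x, kernel_size=k, stride=1, padding=k // 2)


torch.manual_seed(0)
x = torch.randn(1, 8, 20, 20)

y1 = pool(x, 5)   # one 5x5 pool
y2 = pool(y1, 5)  # two chained 5x5 pools cover a 9x9 window
y3 = pool(y2, 5)  # three chained 5x5 pools cover a 13x13 window

same_9 = torch.equal(y2, pool(x, 9))
same_13 = torch.equal(y3, pool(x, 13))
print(same_9, same_13)
```

Chaining reuses the intermediate results, so SPPF produces the same three feature maps as parallel 5, 9 and 13 pools while running only three small pooling passes.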
--- a/rtdetrv2_pytorch/src/nn/backbone/csp_resnet.py
+++ b/rtdetrv2_pytorch/src/nn/backbone/csp_resnet.py
@@ -0,0 +1,277 @@
+"""
+https://github.com/PaddlePaddle/PaddleDetection/blob/release/2.6/ppdet/modeling/backbones/cspresnet.py
+
+Copyright(c) 2023 lyuwenyu. All Rights Reserved.
+"""
+
+import torch 
+import torch.nn as nn 
+import torch.nn.functional as F 
+from collections import OrderedDict
+
+from .common import get_activation
+
+from ...core import register
+
+__all__ = ['CSPResNet']
+
+
+donwload_url = {
+    's': 'https://github.com/lyuwenyu/storage/releases/download/v0.1/CSPResNetb_s_pretrained_from_paddle.pth',
+    'm': 'https://github.com/lyuwenyu/storage/releases/download/v0.1/CSPResNetb_m_pretrained_from_paddle.pth',
+    'l': 'https://github.com/lyuwenyu/storage/releases/download/v0.1/CSPResNetb_l_pretrained_from_paddle.pth',
+    'x': 'https://github.com/lyuwenyu/storage/releases/download/v0.1/CSPResNetb_x_pretrained_from_paddle.pth',
+}
+
+
+class ConvBNLayer(nn.Module):
+    def __init__(self, ch_in, ch_out, filter_size=3, stride=1, groups=1, padding=0, act=None):
+        super().__init__()
+        self.conv = nn.Conv2d(ch_in, ch_out, filter_size, stride, padding, groups=groups, bias=False)
+        self.bn = nn.BatchNorm2d(ch_out)
+        self.act = get_activation(act) 
+       
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        x = self.conv(x)
+        x = self.bn(x)
+        x = self.act(x)
+        return x
+
+class RepVggBlock(nn.Module):
+    def __init__(self, ch_in, ch_out, act='relu', alpha: bool=False):
+        super().__init__()
+        self.ch_in = ch_in
+        self.ch_out = ch_out
+        self.conv1 = ConvBNLayer(
+            ch_in, ch_out, 3, stride=1, padding=1, act=None)
+        self.conv2 = ConvBNLayer(
+            ch_in, ch_out, 1, stride=1, padding=0, act=None)
+        self.act = get_activation(act) 
+
+        if alpha:
+            self.alpha = nn.Parameter(torch.ones(1, ))
+        else:
+            self.alpha = None
+
+    def forward(self, x):
+        if hasattr(self, 'conv'):
+            y = self.conv(x)
+        else:
+            if self.alpha is not None:
+                y = self.conv1(x) + self.alpha * self.conv2(x)
+            else:
+                y = self.conv1(x) + self.conv2(x)
+        y = self.act(y)
+        return y
+
+    def convert_to_deploy(self):
+        if not hasattr(self, 'conv'):
+            self.conv = nn.Conv2d(self.ch_in, self.ch_out, 3, 1, padding=1)
+
+        kernel, bias = self.get_equivalent_kernel_bias()
+        self.conv.weight.data = kernel
+        self.conv.bias.data = bias 
+
+    def get_equivalent_kernel_bias(self):
+        kernel3x3, bias3x3 = self._fuse_bn_tensor(self.conv1)
+        kernel1x1, bias1x1 = self._fuse_bn_tensor(self.conv2)
+
+        if self.alpha is not None:
+            return kernel3x3 + self.alpha * self._pad_1x1_to_3x3_tensor(
+                kernel1x1), bias3x3 + self.alpha * bias1x1
+        else:
+            return kernel3x3 + self._pad_1x1_to_3x3_tensor(
+                kernel1x1), bias3x3 + bias1x1
+
+    def _pad_1x1_to_3x3_tensor(self, kernel1x1):
+        if kernel1x1 is None:
+            return 0
+        else:
+            return F.pad(kernel1x1, [1, 1, 1, 1])
+
+    def _fuse_bn_tensor(self, branch: ConvBNLayer):
+        if branch is None:
+            return 0, 0
+        kernel = branch.conv.weight
+        running_mean = branch.bn.running_mean
+        running_var = branch.bn.running_var
+        gamma = branch.bn.weight
+        beta = branch.bn.bias
+        eps = branch.bn.eps
+        std = (running_var + eps).sqrt()
+        t = (gamma / std).reshape(-1, 1, 1, 1)
+        return kernel * t, beta - running_mean * gamma / std
+
+
+class BasicBlock(nn.Module):
+    def __init__(self,
+                 ch_in,
+                 ch_out,
+                 act='relu',
+                 shortcut=True,
+                 use_alpha=False):
+        super().__init__()
+        assert ch_in == ch_out
+        self.conv1 = ConvBNLayer(ch_in, ch_out, 3, stride=1, padding=1, act=act)
+        self.conv2 = RepVggBlock(ch_out, ch_out, act=act, alpha=use_alpha)
+        self.shortcut = shortcut
+
+    def forward(self, x):
+        y = self.conv1(x)
+        y = self.conv2(y)
+        if self.shortcut:
+            return x + y
+        else:
+            return y
+
+
+class EffectiveSELayer(nn.Module):
+    """ Effective Squeeze-Excitation
+    From `CenterMask : Real-Time Anchor-Free Instance Segmentation` - https://arxiv.org/abs/1911.06667
+    """
+
+    def __init__(self, channels, act='hardsigmoid'):
+        super(EffectiveSELayer, self).__init__()
+        self.fc = nn.Conv2d(channels, channels, kernel_size=1, padding=0)
+        self.act = get_activation(act)
+
+    def forward(self, x: torch.Tensor):
+        x_se = x.mean((2, 3), keepdim=True)
+        x_se = self.fc(x_se)
+        x_se = self.act(x_se)
+        return x * x_se
+
+
+class CSPResStage(nn.Module):
+    def __init__(self,
+                 block_fn,
+                 ch_in,
+                 ch_out,
+                 n,
+                 stride,
+                 act='relu',
+                 attn='eca',
+                 use_alpha=False):
+        super().__init__()
+        ch_mid = (ch_in + ch_out) // 2
+        if stride == 2:
+            self.conv_down = ConvBNLayer(
+                ch_in, ch_mid, 3, stride=2, padding=1, act=act)
+        else:
+            self.conv_down = None
+        self.conv1 = ConvBNLayer(ch_mid, ch_mid // 2, 1, act=act)
+        self.conv2 = ConvBNLayer(ch_mid, ch_mid // 2, 1, act=act)
+        self.blocks = nn.Sequential(*[
+            block_fn(
+                ch_mid // 2,
+                ch_mid // 2,
+                act=act,
+                shortcut=True,
+                use_alpha=use_alpha) for i in range(n)
+        ])
+        if attn:
+            self.attn = EffectiveSELayer(ch_mid, act='hardsigmoid')
+        else:
+            self.attn = None
+
+        self.conv3 = ConvBNLayer(ch_mid, ch_out, 1, act=act)
+
+    def forward(self, x):
+        if self.conv_down is not None:
+            x = self.conv_down(x)
+        y1 = self.conv1(x)
+        y2 = self.blocks(self.conv2(x))
+        y = torch.concat([y1, y2], dim=1)
+        if self.attn is not None:
+            y = self.attn(y)
+        y = self.conv3(y)
+        return y
+
+
+@register()
+class CSPResNet(nn.Module):
+    layers = [3, 6, 6, 3]
+    channels = [64, 128, 256, 512, 1024]
+    model_cfg = {
+        's': {'depth_mult': 0.33, 'width_mult': 0.50, },
+        'm': {'depth_mult': 0.67, 'width_mult': 0.75, },
+        'l': {'depth_mult': 1.00, 'width_mult': 1.00, },
+        'x': {'depth_mult': 1.33, 'width_mult': 1.25, },
+    }
+
+    def __init__(self,
+                 name: str,
+                 act='silu',
+                 return_idx=[1, 2, 3],
+                 use_large_stem=True,
+                 use_alpha=False,
+                 pretrained=False):
+
+        super().__init__()        
+        depth_mult = self.model_cfg[name]['depth_mult']
+        width_mult = self.model_cfg[name]['width_mult']
+
+        channels = [max(round(c * width_mult), 1) for c in self.channels]
+        layers = [max(round(l * depth_mult), 1) for l in self.layers]
+        act = get_activation(act)
+
+        if use_large_stem:
+            self.stem = nn.Sequential(OrderedDict([
+                ('conv1', ConvBNLayer(
+                    3, channels[0] // 2, 3, stride=2, padding=1, act=act)),
+                ('conv2', ConvBNLayer(
+                    channels[0] // 2,
+                    channels[0] // 2,
+                    3,
+                    stride=1,
+                    padding=1,
+                    act=act)), ('conv3', ConvBNLayer(
+                        channels[0] // 2,
+                        channels[0],
+                        3,
+                        stride=1,
+                        padding=1,
+                        act=act))]))
+        else:
+            self.stem = nn.Sequential(OrderedDict([
+                ('conv1', ConvBNLayer(
+                    3, channels[0] // 2, 3, stride=2, padding=1, act=act)),
+                ('conv2', ConvBNLayer(
+                    channels[0] // 2,
+                    channels[0],
+                    3,
+                    stride=1,
+                    padding=1,
+                    act=act))]))
+
+        n = len(channels) - 1
+        self.stages = nn.Sequential(OrderedDict([(str(i), CSPResStage(
+            BasicBlock,
+            channels[i],
+            channels[i + 1],
+            layers[i],
+            2,
+            act=act,
+            use_alpha=use_alpha)) for i in range(n)]))
+
+        self._out_channels = channels[1:]
+        self._out_strides = [4 * 2**i for i in range(n)]
+        self.return_idx = return_idx
+
+        if pretrained:
+            if isinstance(pretrained, bool) or 'http' in pretrained:
+                state = torch.hub.load_state_dict_from_url(donwload_url[name], map_location='cpu')
+            else:
+                state = torch.load(pretrained, map_location='cpu')
+            self.load_state_dict(state)
+            print(f'Load CSPResNet_{name} state_dict')
+
+    def forward(self, x):
+        x = self.stem(x)
+        outs = []
+        for idx, stage in enumerate(self.stages):
+            x = stage(x)
+            if idx in self.return_idx:
+                outs.append(x)
+        
+        return outs
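Review note on `csp_resnet.py`: `RepVggBlock.convert_to_deploy` folds the 3x3 and 1x1 conv+BN branches into one 3x3 convolution. The sketch below reproduces that arithmetic on plain `nn.Conv2d` / `nn.BatchNorm2d` layers and checks it numerically. `fuse_conv_bn` and `randomize_bn` are illustrative helpers, not functions from this commit, and the check assumes eval-mode batch norm and no `alpha` scaling.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d):
    """Fold an eval-mode BatchNorm into the preceding bias-free convolution."""
    std = (bn.running_var + bn.eps).sqrt()
    t = (bn.weight / std).reshape(-1, 1, 1, 1)
    return conv.weight * t, bn.bias - bn.running_mean * bn.weight / std


def randomize_bn(bn: nn.BatchNorm2d):
    """Give the layer non-default state so the check is not trivial."""
    with torch.no_grad():
        bn.weight.uniform_(0.5, 1.5)
        bn.bias.uniform_(-0.5, 0.5)
        bn.running_mean.uniform_(-0.5, 0.5)
        bn.running_var.uniform_(0.5, 1.5)


torch.manual_seed(0)
c = 4
conv3, bn3 = nn.Conv2d(c, c, 3, padding=1, bias=False), nn.BatchNorm2d(c)
conv1, bn1 = nn.Conv2d(c, c, 1, bias=False), nn.BatchNorm2d(c)
for bn in (bn3, bn1):
    randomize_bn(bn)
    bn.eval()

with torch.no_grad():
    k3, b3 = fuse_conv_bn(conv3, bn3)
    k1, b1 = fuse_conv_bn(conv1, bn1)
    # Zero-pad the 1x1 kernel to 3x3 so both kernels can be summed.
    kernel = k3 + F.pad(k1, [1, 1, 1, 1])
    bias = b3 + b1

    x = torch.randn(2, c, 16, 16)
    two_branch = bn3(conv3(x)) + bn1(conv1(x))
    fused = F.conv2d(x, kernel, bias, padding=1)
    max_diff = (two_branch - fused).abs().max().item()

print(f"max abs difference: {max_diff:.2e}")
```

The fusion is only exact once the batch-norm statistics are fixed, so a conversion like this is normally applied after training, on a model in eval mode.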
--- a/rtdetrv2_pytorch/src/nn/backbone/hgnetv2.py
+++ b/rtdetrv2_pytorch/src/nn/backbone/hgnetv2.py
@@ -0,0 +1,428 @@
+"""Copyright(c) 2023 lyuwenyu. All Rights Reserved.
+
+https://github.com/PaddlePaddle/PaddleDetection/blob/develop/ppdet/modeling/backbones/hgnet_v2.py
+"""
+
+import torch
+import torch.nn as nn
+import torch.nn.init as init
+import torch.nn.functional as F
+
+from torch import Tensor
+from typing import List, Tuple
+
+from .common import FrozenBatchNorm2d
+from ...core import register
+
+
+__all__ = ['HGNetv2']
+
+
+class LearnableAffineBlock(nn.Module):
+    def __init__(self, scale_value=1.0, bias_value=0.0):
+        super().__init__()
+        self.scale = nn.Parameter(torch.tensor([scale_value]))
+        self.bias = nn.Parameter(torch.tensor([bias_value]))
+
+    def forward(self, x: Tensor) -> Tensor:
+        return self.scale * x + self.bias
+
+
+class ConvBNAct(nn.Module):
+    def __init__(self,
+                 in_channels,
+                 out_channels,
+                 kernel_size=3,
+                 stride=1,
+                 padding=0,
+                 groups=1,
+                 use_act=True,
+                 use_lab=False):
+        super().__init__()
+        self.use_act = use_act
+        self.use_lab = use_lab
+        if padding == 'same':
+            self.conv = nn.Sequential(
+                nn.ZeroPad2d([0, 1, 0, 1]),
+                nn.Conv2d(
+                    in_channels,
+                    out_channels,
+                    kernel_size,
+                    stride,
+                    groups=groups,
+                    bias=False
+                )
+            )
+        else:
+            self.conv = nn.Conv2d(
+                in_channels,
+                out_channels,
+                kernel_size,
+                stride,
+                padding=(kernel_size - 1) // 2,
+                groups=groups,
+                bias=False
+            )
+        self.bn = nn.BatchNorm2d(out_channels)
+        if self.use_act:
+            self.act = nn.ReLU()
+            if self.use_lab:
+                self.lab = LearnableAffineBlock()
+
+    def forward(self, x: Tensor) -> Tensor:
+        x = self.conv(x)
+        x = self.bn(x)
+        if self.use_act:
+            x = self.act(x)
+            if self.use_lab:
+                x = self.lab(x)
+        return x
+
+
+class LightConvBNAct(nn.Module):
+    def __init__(self,
+                 in_channels,
+                 out_channels,
+                 kernel_size,
+                 stride,
+                 groups=1,
+                 use_lab=False):
+        super().__init__()
+        self.conv1 = ConvBNAct(
+            in_channels=in_channels,
+            out_channels=out_channels,
+            kernel_size=1,
+            use_act=False,
+            use_lab=use_lab
+        )
+        self.conv2 = ConvBNAct(
+            in_channels=out_channels,
+            out_channels=out_channels,
+            kernel_size=kernel_size,
+            groups=out_channels,
+            use_act=True,
+            use_lab=use_lab
+        )
+
+    def forward(self, x: Tensor) -> Tensor:
+        x = self.conv1(x)
+        x = self.conv2(x)
+        return x
+
+
+class StemBlock(nn.Module):
+    def __init__(self,
+                 in_channels,
+                 mid_channels,
+                 out_channels,
+                 use_lab=False):
+        super().__init__()
+        self.stem1 = ConvBNAct(
+            in_channels=in_channels,
+            out_channels=mid_channels,
+            kernel_size=3,
+            stride=2,
+            use_lab=use_lab
+        )
+        self.stem2a = ConvBNAct(
+            in_channels=mid_channels,
+            out_channels=mid_channels // 2,
+            kernel_size=2,
+            stride=1,
+            padding='same',
+            use_lab=use_lab
+        )
+        self.stem2b = ConvBNAct(
+            in_channels=mid_channels // 2,
+            out_channels=mid_channels,
+            kernel_size=2,
+            stride=1,
+            padding='same',
+            use_lab=use_lab
+        )
+        self.stem3 = ConvBNAct(
+            in_channels=mid_channels * 2,
+            out_channels=mid_channels,
+            kernel_size=3,
+            stride=2,
+            use_lab=use_lab
+        )
+        self.stem4 = ConvBNAct(
+            in_channels=mid_channels,
+            out_channels=out_channels,
+            kernel_size=1,
+            stride=1,
+            use_lab=use_lab
+        )
+
+        self.pool = nn.Sequential(
+            nn.ZeroPad2d([0, 1, 0, 1]),
+            nn.MaxPool2d(2, 1, ceil_mode=True)
+        )
+
+    def forward(self, x: Tensor) -> Tensor:
+        x = self.stem1(x)
+        x2 = self.stem2a(x)
+        x2 = self.stem2b(x2)
+        x1 = self.pool(x)
+        x = torch.concat([x1, x2], dim=1)
+        x = self.stem3(x)
+        x = self.stem4(x)
+
+        return x
+
+
+class HG_Block(nn.Module):
+    def __init__(self,
+                 in_channels,
+                 mid_channels,
+                 out_channels,
+                 kernel_size=3,
+                 layer_num=6,
+                 identity=False,
+                 light_block=True,
+                 use_lab=False):
+        super().__init__()
+        self.identity = identity
+
+        self.layers = nn.ModuleList()
+        block_cls = LightConvBNAct if light_block else ConvBNAct
+        for i in range(layer_num):
+            self.layers.append(
+                block_cls(
+                    in_channels=in_channels if i == 0 else mid_channels,
+                    out_channels=mid_channels,
+                    stride=1,
+                    kernel_size=kernel_size,
+                    use_lab=use_lab))
+        # feature aggregation
+        total_channels = in_channels + layer_num * mid_channels
+        self.aggregation_squeeze_conv = ConvBNAct(
+            in_channels=total_channels,
+            out_channels=out_channels // 2,
+            kernel_size=1,
+            stride=1,
+            use_lab=use_lab)
+        self.aggregation_excitation_conv = ConvBNAct(
+            in_channels=out_channels // 2,
+            out_channels=out_channels,
+            kernel_size=1,
+            stride=1,
+            use_lab=use_lab)
+
+    def forward(self, x):
+        identity = x
+        output = []
+        output.append(x)
+        for layer in self.layers:
+            x = layer(x)
+            output.append(x)
+        x = torch.concat(output, dim=1)
+        x = self.aggregation_squeeze_conv(x)
+        x = self.aggregation_excitation_conv(x)
+        if self.identity:
+            x = x + identity
+        return x
+
+
+class HG_Stage(nn.Module):
+    def __init__(self,
+                 in_channels,
+                 mid_channels,
+                 out_channels,
+                 block_num,
+                 layer_num=6,
+                 downsample=True,
+                 light_block=True,
+                 kernel_size=3,
+                 use_lab=False):
+        super().__init__()
+        self.downsample = downsample
+        if downsample:
+            self.downsample = ConvBNAct(
+                in_channels=in_channels,
+                out_channels=in_channels,
+                kernel_size=3,
+                stride=2,
+                groups=in_channels,
+                use_act=False,
+                use_lab=use_lab)
+
+        blocks_list = []
+        for i in range(block_num):
+            blocks_list.append(
+                HG_Block(
+                    in_channels=in_channels if i == 0 else out_channels,
+                    mid_channels=mid_channels,
+                    out_channels=out_channels,
+                    kernel_size=kernel_size,
+                    layer_num=layer_num,
+                    identity=False if i == 0 else True,
+                    light_block=light_block,
+                    use_lab=use_lab))
+        self.blocks = nn.Sequential(*blocks_list)
+
+    def forward(self, x):
+        if self.downsample:
+            x = self.downsample(x)
+        x = self.blocks(x)
+        return x
+
+
+@register()
+class HGNetv2(nn.Module):
+    """
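+    Args:
+        name: str. Key into `arch_configs` ('L', 'X' or 'H') selecting the stem and stage settings.
+        use_lab: bool. Whether to use LearnableAffineBlock in the network.
+        return_idx: list. Indices of the stages whose outputs are returned.
+        freeze_at: int. If >= 0, freeze the stem and the first `freeze_at` stages.
+        freeze_norm: bool. Whether to replace BatchNorm2d with FrozenBatchNorm2d.
+        pretrained: bool or str. True or a URL loads the weights from `arch_configs[name]['url']`;
+            any other string is treated as a local checkpoint path.
+    Returns:
+        model: nn.Module.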
+    """
+
+    arch_configs = {
+        'L': {
+            'stem_channels': [3, 32, 48],
+            'stage_config': {
+                # in_channels, mid_channels, out_channels, num_blocks, downsample, light_block, kernel_size, layer_num
+                "stage1": [48, 48, 128, 1, False, False, 3, 6],
+                "stage2": [128, 96, 512, 1, True, False, 3, 6],
+                "stage3": [512, 192, 1024, 3, True, True, 5, 6],
+                "stage4": [1024, 384, 2048, 1, True, True, 5, 6],
+            },
+            'url': 'https://github.com/lyuwenyu/storage/releases/download/v0.1/PPHGNetV2_L_ssld_pretrained_from_paddle.pth',
+
+        },
+        'X': {
+            'stem_channels': [3, 32, 64],
+            'stage_config': {
+                # in_channels, mid_channels, out_channels, num_blocks, downsample, light_block, kernel_size, layer_num
+                "stage1": [64, 64, 128, 1, False, False, 3, 6],
+                "stage2": [128, 128, 512, 2, True, False, 3, 6],
+                "stage3": [512, 256, 1024, 5, True, True, 5, 6],
+                "stage4": [1024, 512, 2048, 2, True, True, 5, 6],
+            },
+            'url': 'https://github.com/lyuwenyu/storage/releases/download/v0.1/PPHGNetV2_X_ssld_pretrained_from_paddle.pth',
+
+        },
+        'H': {
+            'stem_channels': [3, 48, 96],
+            'stage_config': {
+                # in_channels, mid_channels, out_channels, num_blocks, downsample, light_block, kernel_size, layer_num
+                "stage1": [96, 96, 192, 2, False, False, 3, 6],
+                "stage2": [192, 192, 512, 3, True, False, 3, 6],
+                "stage3": [512, 384, 1024, 6, True, True, 5, 6],
+                "stage4": [1024, 768, 2048, 3, True, True, 5, 6],
+            },
+            'url': 'https://github.com/lyuwenyu/storage/releases/download/v0.1/PPHGNetV2_H_ssld_pretrained_from_paddle.pth',
+        }
+    }
+
+    def __init__(self,
+                 name,
+                 use_lab=False,
+                 return_idx=[1, 2, 3],
+                 freeze_at=-1,
+                 freeze_norm=False,
+                 pretrained=False):
+        super().__init__()
+        self.use_lab = use_lab
+        self.return_idx = return_idx
+
+        stem_channels = self.arch_configs[name]['stem_channels']
+        stage_config = self.arch_configs[name]['stage_config']
+        download_url = self.arch_configs[name]['url']
+
+        self._out_strides = [4, 8, 16, 32]
+        self._out_channels = [stage_config[k][2] for k in stage_config]
+
+        # stem
+        self.stem = StemBlock(
+            in_channels=stem_channels[0],
+            mid_channels=stem_channels[1],
+            out_channels=stem_channels[2],
+            use_lab=use_lab
+        )
+
+        # stages
+        self.stages = nn.ModuleList()
+        for i, k in enumerate(stage_config):
+            in_channels, mid_channels, out_channels, block_num, downsample, light_block, kernel_size, layer_num = stage_config[
+                k]
+            self.stages.append(
+                HG_Stage(
+                    in_channels,
+                    mid_channels,
+                    out_channels,
+                    block_num,
+                    layer_num,
+                    downsample,
+                    light_block,
+                    kernel_size,
+                    use_lab))
+
+        self._init_weights()
+
+        if freeze_at >= 0:
+            self._freeze_parameters(self.stem)
+            for i in range(min(freeze_at, 4)):
+                self._freeze_parameters(self.stages[i])
+
+        if freeze_norm:
+            self._freeze_norm(self)
+
+        if pretrained:
+            if isinstance(pretrained, bool) or 'http' in pretrained:
+                state = torch.hub.load_state_dict_from_url(download_url, map_location='cpu')
+            else:
+                state = torch.load(pretrained, map_location='cpu')
+            self.load_state_dict(state)
+            print(f'Load HGNetv2_{name} state_dict')
+        
+
+    def _init_weights(self):
+        for m in self.modules():
+            if isinstance(m, nn.Conv2d):
+                init.kaiming_normal_(m.weight)
+            elif isinstance(m, (nn.BatchNorm2d)):
+                init.constant_(m.weight, 1)
+                init.constant_(m.bias, 0)
+            elif isinstance(m, nn.Linear):
+                init.constant_(m.bias, 0)
+
+    def _freeze_parameters(self, m: nn.Module):
+        for p in m.parameters():
+            p.requires_grad = False
+
+    def _freeze_norm(self, m: nn.Module):
+        if isinstance(m, nn.BatchNorm2d):
+            m = FrozenBatchNorm2d(m.num_features)
+        else:
+            for name, child in m.named_children():
+                _child = self._freeze_norm(child)
+                if _child is not child:
+                    setattr(m, name, _child)
+        return m
+
+
+    def forward(self, x: Tensor) -> List[Tensor]:
+        x = self.stem(x)
+        outs = []
+        for idx, stage in enumerate(self.stages):
+            x = stage(x)
+            if idx in self.return_idx:
+                outs.append(x)
+        return outs
+
+
+
+if __name__ == '__main__':
+
+    m = HGNetv2(name='X', pretrained=False, freeze_at=-1, freeze_norm=False)
+    data = torch.randn(1, 3, 640, 640)
+
+    output = m(data)
+    print([o.shape for o in output])
+
+    output[0].mean().backward()
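Review note on `hgnetv2.py`: `ConvBNAct` handles `padding='same'` by padding one pixel on the right and bottom before an unpadded convolution. The sketch below shows why this matters for the 2x2 kernels in `StemBlock`: the default `(kernel_size - 1) // 2` padding is 0 for `k=2` and would shrink the feature map. The channel counts and input size are arbitrary values chosen for illustration.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 7, 10)

# Default branch of ConvBNAct: symmetric padding of (k - 1) // 2, which is 0 for k=2.
symmetric = nn.Conv2d(3, 5, kernel_size=2, stride=1, padding=(2 - 1) // 2, bias=False)

# 'same' branch: pad right and bottom by one pixel, then convolve without padding.
same = nn.Sequential(
    nn.ZeroPad2d([0, 1, 0, 1]),
    nn.Conv2d(3, 5, kernel_size=2, stride=1, bias=False),
)

# The stem's pooling branch uses the same trick so it can be concatenated with stem2b.
pool = nn.Sequential(
    nn.ZeroPad2d([0, 1, 0, 1]),
    nn.MaxPool2d(2, 1, ceil_mode=True),
)

shape_symmetric = tuple(symmetric(x).shape)
shape_same = tuple(same(x).shape)
shape_pool = tuple(pool(x).shape)
print(shape_symmetric, shape_same, shape_pool)
```

Because both `stem2b` and the pooling branch preserve the spatial size, their outputs can be concatenated along the channel dimension before `stem3`.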
--- a/rtdetrv2_pytorch/src/nn/backbone/presnet.py
+++ b/rtdetrv2_pytorch/src/nn/backbone/presnet.py
@@ -0,0 +1,245 @@
+"""Copyright(c) 2023 lyuwenyu. All Rights Reserved.
+"""
+import torch
+import torch.nn as nn 
+import torch.nn.functional as F 
+
+from collections import OrderedDict
+
+from .common import get_activation, FrozenBatchNorm2d
+
+from ...core import register
+
+
+__all__ = ['PResNet']
+
+
+ResNet_cfg = {
+    18: [2, 2, 2, 2],
+    34: [3, 4, 6, 3],
+    50: [3, 4, 6, 3],
+    101: [3, 4, 23, 3],
+    # 152: [3, 8, 36, 3],
+}
+
+
+donwload_url = {
+    18: 'https://github.com/lyuwenyu/storage/releases/download/v0.1/ResNet18_vd_pretrained_from_paddle.pth',
+    34: 'https://github.com/lyuwenyu/storage/releases/download/v0.1/ResNet34_vd_pretrained_from_paddle.pth',
+    50: 'https://github.com/lyuwenyu/storage/releases/download/v0.1/ResNet50_vd_ssld_v2_pretrained_from_paddle.pth',
+    101: 'https://github.com/lyuwenyu/storage/releases/download/v0.1/ResNet101_vd_ssld_pretrained_from_paddle.pth',
+}
+
+
+class ConvNormLayer(nn.Module):
+    def __init__(self, ch_in, ch_out, kernel_size, stride, padding=None, bias=False, act=None):
+        super().__init__()
+        self.conv = nn.Conv2d(
+            ch_in, 
+            ch_out, 
+            kernel_size, 
+            stride, 
+            padding=(kernel_size-1)//2 if padding is None else padding, 
+            bias=bias)
+        self.norm = nn.BatchNorm2d(ch_out)
+        self.act = get_activation(act) 
+
+    def forward(self, x):
+        return self.act(self.norm(self.conv(x)))
+
+
+class BasicBlock(nn.Module):
+    expansion = 1
+
+    def __init__(self, ch_in, ch_out, stride, shortcut, act='relu', variant='b'):
+        super().__init__()
+
+        self.shortcut = shortcut
+
+        if not shortcut:
+            if variant == 'd' and stride == 2:
+                self.short = nn.Sequential(OrderedDict([
+                    ('pool', nn.AvgPool2d(2, 2, 0, ceil_mode=True)),
+                    ('conv', ConvNormLayer(ch_in, ch_out, 1, 1))
+                ]))
+            else:
+                self.short = ConvNormLayer(ch_in, ch_out, 1, stride)
+
+        self.branch2a = ConvNormLayer(ch_in, ch_out, 3, stride, act=act)
+        self.branch2b = ConvNormLayer(ch_out, ch_out, 3, 1, act=None)
+        self.act = nn.Identity() if act is None else get_activation(act) 
+
+
+    def forward(self, x):
+        out = self.branch2a(x)
+        out = self.branch2b(out)
+        if self.shortcut:
+            short = x
+        else:
+            short = self.short(x)
+        
+        out = out + short
+        out = self.act(out)
+
+        return out
+
+
+class BottleNeck(nn.Module):
+    expansion = 4
+
+    def __init__(self, ch_in, ch_out, stride, shortcut, act='relu', variant='b'):
+        super().__init__()
+
+        if variant == 'a':
+            stride1, stride2 = stride, 1
+        else:
+            stride1, stride2 = 1, stride
+
+        width = ch_out 
+
+        self.branch2a = ConvNormLayer(ch_in, width, 1, stride1, act=act)
+        self.branch2b = ConvNormLayer(width, width, 3, stride2, act=act)
+        self.branch2c = ConvNormLayer(width, ch_out * self.expansion, 1, 1)
+
+        self.shortcut = shortcut
+        if not shortcut:
+            if variant == 'd' and stride == 2:
+                self.short = nn.Sequential(OrderedDict([
+                    ('pool', nn.AvgPool2d(2, 2, 0, ceil_mode=True)),
+                    ('conv', ConvNormLayer(ch_in, ch_out * self.expansion, 1, 1))
+                ]))
+            else:
+                self.short = ConvNormLayer(ch_in, ch_out * self.expansion, 1, stride)
+
+        self.act = nn.Identity() if act is None else get_activation(act) 
+
+    def forward(self, x):
+        out = self.branch2a(x)
+        out = self.branch2b(out)
+        out = self.branch2c(out)
+
+        if self.shortcut:
+            short = x
+        else:
+            short = self.short(x)
+
+        out = out + short
+        out = self.act(out)
+
+        return out
+
+
+class Blocks(nn.Module):
+    def __init__(self, block, ch_in, ch_out, count, stage_num, act='relu', variant='b'):
+        super().__init__()
+
+        self.blocks = nn.ModuleList()
+        for i in range(count):
+            self.blocks.append(
+                block(
+                    ch_in, 
+                    ch_out,
+                    stride=2 if i == 0 and stage_num != 2 else 1, 
+                    shortcut=False if i == 0 else True,
+                    variant=variant,
+                    act=act)
+            )
+
+            if i == 0:
+                ch_in = ch_out * block.expansion
+
+    def forward(self, x):
+        out = x
+        for block in self.blocks:
+            out = block(out)
+        return out
+
+
+@register()
+class PResNet(nn.Module):
+    def __init__(
+        self, 
+        depth, 
+        variant='d', 
+        num_stages=4, 
+        return_idx=[0, 1, 2, 3], 
+        act='relu',
+        freeze_at=-1, 
+        freeze_norm=True, 
+        pretrained=False):
+        super().__init__()
+
+        block_nums = ResNet_cfg[depth]
+        ch_in = 64
+        if variant in ['c', 'd']:
+            conv_def = [
+                [3, ch_in // 2, 3, 2, "conv1_1"],
+                [ch_in // 2, ch_in // 2, 3, 1, "conv1_2"],
+                [ch_in // 2, ch_in, 3, 1, "conv1_3"],
+            ]
+        else:
+            conv_def = [[3, ch_in, 7, 2, "conv1_1"]]
+
+        self.conv1 = nn.Sequential(OrderedDict([
+            (name, ConvNormLayer(cin, cout, k, s, act=act)) for cin, cout, k, s, name in conv_def
+        ]))
+
+        ch_out_list = [64, 128, 256, 512]
+        block = BottleNeck if depth >= 50 else BasicBlock
+
+        _out_channels = [block.expansion * v for v in ch_out_list]
+        _out_strides = [4, 8, 16, 32]
+
+        self.res_layers = nn.ModuleList()
+        for i in range(num_stages):
+            stage_num = i + 2
+            self.res_layers.append(
+                Blocks(block, ch_in, ch_out_list[i], block_nums[i], stage_num, act=act, variant=variant)
+            )
+            ch_in = _out_channels[i]
+
+        self.return_idx = return_idx
+        self.out_channels = [_out_channels[_i] for _i in return_idx]
+        self.out_strides = [_out_strides[_i] for _i in return_idx]
+
+        if freeze_at >= 0:
+            self._freeze_parameters(self.conv1)
+            for i in range(min(freeze_at, num_stages)):
+                self._freeze_parameters(self.res_layers[i])
+
+        if freeze_norm:
+            self._freeze_norm(self)
+
+        if pretrained:
+            if isinstance(pretrained, bool) or 'http' in pretrained:
+                state = torch.hub.load_state_dict_from_url(donwload_url[depth], map_location='cpu')
+            else:
+                state = torch.load(pretrained, map_location='cpu')
+            self.load_state_dict(state)
+            print(f'Load PResNet{depth} state_dict')
+
+    def _freeze_parameters(self, m: nn.Module):
+        for p in m.parameters():
+            p.requires_grad = False
+
+    def _freeze_norm(self, m: nn.Module):
+        if isinstance(m, nn.BatchNorm2d):
+            m = FrozenBatchNorm2d(m.num_features)
+        else:
+            for name, child in m.named_children():
+                _child = self._freeze_norm(child)
+                if _child is not child:
+                    setattr(m, name, _child)
+        return m
+
+    def forward(self, x):
+        conv1 = self.conv1(x)
+        x = F.max_pool2d(conv1, kernel_size=3, stride=2, padding=1)
+        outs = []
+        for idx, stage in enumerate(self.res_layers):
+            x = stage(x)
+            if idx in self.return_idx:
+                outs.append(x)
+        return outs
+
+
--- a/rtdetrv2_pytorch/src/nn/backbone/test_resnet.py
+++ b/rtdetrv2_pytorch/src/nn/backbone/test_resnet.py
@@ -0,0 +1,81 @@
+import torch
+import torch.nn as nn 
+import torch.nn.functional as F 
+
+from collections import OrderedDict
+
+
+from ...core import register
+
+
+class BasicBlock(nn.Module):
+    expansion = 1
+
+    def __init__(self, in_planes, planes, stride=1):
+        super(BasicBlock, self).__init__()
+
+        self.conv1 = nn.Conv2d(in_planes, planes, kernel_size=3, stride=stride, padding=1, bias=False)
+        self.bn1 = nn.BatchNorm2d(planes)
+
+        self.conv2 = nn.Conv2d(planes, planes, kernel_size=3,stride=1, padding=1, bias=False)
+        self.bn2 = nn.BatchNorm2d(planes)
+
+        self.shortcut = nn.Sequential()         
+        if stride != 1 or in_planes != self.expansion*planes:
+            self.shortcut = nn.Sequential(
+                nn.Conv2d(in_planes, self.expansion*planes,kernel_size=1, stride=stride, bias=False),
+                nn.BatchNorm2d(self.expansion*planes)
+            )
+    def forward(self, x):
+        out = F.relu(self.bn1(self.conv1(x)))
+        out = self.bn2(self.conv2(out))       
+        out += self.shortcut(x)          
+        out = F.relu(out)
+        return out
+
+
+
+class _ResNet(nn.Module):
+    def __init__(self, block, num_blocks, num_classes=10):
+        super().__init__()
+        self.in_planes = 64
+
+        self.conv1 = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1, bias=False)
+        self.bn1 = nn.BatchNorm2d(64)
+        
+        self.layer1 = self._make_layer(block, 64, num_blocks[0], stride=1)
+        self.layer2 = self._make_layer(block, 128, num_blocks[1], stride=2)
+        self.layer3 = self._make_layer(block, 256, num_blocks[2], stride=2)
+        self.layer4 = self._make_layer(block, 512, num_blocks[3], stride=2)
+        
+        self.linear = nn.Linear(512 * block.expansion, num_classes)
+
+    def _make_layer(self, block, planes, num_blocks, stride):
+        strides = [stride] + [1]*(num_blocks-1)
+        layers = []
+        for stride in strides:
+            layers.append(block(self.in_planes, planes, stride))
+            self.in_planes = planes * block.expansion 
+        return nn.Sequential(*layers)
+        
+    def forward(self, x):
+        out = F.relu(self.bn1(self.conv1(x)))
+        out = self.layer1(out)
+        out = self.layer2(out)
+        out = self.layer3(out)
+        out = self.layer4(out)
+        out = F.avg_pool2d(out, 4)
+        out = out.view(out.size(0), -1)
+        out = self.linear(out)              
+        return out
+        
+
+@register()
+class MResNet(nn.Module):
+    def __init__(self, num_classes=10, num_blocks=[2, 2, 2, 2]) -> None:
+        super().__init__()
+        self.model = _ResNet(BasicBlock, num_blocks, num_classes)
+        
+    def forward(self, x):
+        return self.model(x)
+
--- a/rtdetrv2_pytorch/src/nn/backbone/timm_model.py
+++ b/rtdetrv2_pytorch/src/nn/backbone/timm_model.py
@@ -0,0 +1,70 @@
+"""Copyright(c) 2023 lyuwenyu. All Rights Reserved.
+
+https://towardsdatascience.com/getting-started-with-pytorch-image-models-timm-a-practitioners-guide-4e77b4bf9055#0583
+"""
+
+import torch
+from torchvision.models.feature_extraction import get_graph_node_names, create_feature_extractor
+
+from .utils import IntermediateLayerGetter
+from ...core import register
+
+
+@register()
+class TimmModel(torch.nn.Module):
+    def __init__(self,
+        name, 
+        return_layers, 
+        pretrained=False, 
+        exportable=True, 
+        features_only=True,
+        **kwargs) -> None:
+
+        super().__init__()
+
+        import timm
+        model = timm.create_model(
+            name,
+            pretrained=pretrained, 
+            exportable=exportable, 
+            features_only=features_only,
+            **kwargs
+        )
+        # nodes, _ = get_graph_node_names(model)
+        # print(nodes)
+        # features = {'': ''}
+        # model = create_feature_extractor(model, return_nodes=features)
+
+        assert set(return_layers).issubset(model.feature_info.module_name()), \
+            f'return_layers should be a subset of {model.feature_info.module_name()}'
+        
+        # self.model = model
+        self.model = IntermediateLayerGetter(model, return_layers)
+
+        return_idx = [model.feature_info.module_name().index(name) for name in return_layers]
+        self.strides = [model.feature_info.reduction()[i] for i in return_idx]
+        self.channels = [model.feature_info.channels()[i] for i in return_idx]
+        self.return_idx = return_idx
+        self.return_layers = return_layers
+
+    def forward(self, x: torch.Tensor): 
+        outputs = self.model(x)
+        # outputs = [outputs[i] for i in self.return_idx]
+        return outputs
+
+
+if __name__ == '__main__':
+    
+    model = TimmModel(name='resnet34', return_layers=['layer2', 'layer3'])
+    data = torch.rand(1, 3, 640, 640)
+    outputs = model(data)
+    
+    for output in outputs:
+        print(output.shape)
+
+    """
+    model:
+        type: TimmModel
+        name: resnet34
+        return_layers: ['layer2', 'layer4']
+    """
--- a/rtdetrv2_pytorch/src/nn/backbone/torchvision_model.py
+++ b/rtdetrv2_pytorch/src/nn/backbone/torchvision_model.py
@@ -0,0 +1,49 @@
+"""Copyright(c) 2023 lyuwenyu. All Rights Reserved.
+"""
+
+import torch
+import torchvision 
+
+from ...core import register
+from .utils import IntermediateLayerGetter
+
+__all__ = ['TorchVisionModel']
+
+@register()
+class TorchVisionModel(torch.nn.Module):
+    def __init__(self, name, return_layers, weights=None, **kwargs) -> None:
+        super().__init__()
+        
+        if weights is not None:
+            weights = getattr(torchvision.models.get_model_weights(name), weights)
+
+        model = torchvision.models.get_model(name, weights=weights, **kwargs)
+
+        # TODO hard code.
+        if hasattr(model, 'features'):
+            model = IntermediateLayerGetter(model.features, return_layers)
+        else:
+            model = IntermediateLayerGetter(model, return_layers)
+
+        self.model = model 
+
+    def forward(self, x):
+        return self.model(x)
+
+
+# TorchVisionModel('swin_t', return_layers=['5', '7'])
+# TorchVisionModel('resnet34', return_layers=['layer2','layer3', 'layer4'])
+
+"""
+TorchVisionModel:
+    name: swin_t
+    return_layers: ['5', '7']
+    weights: DEFAULT
+
+
+model:
+    type: TorchVisionModel
+    name: resnet34
+    return_layers: ['layer2','layer3', 'layer4']
+    weights: DEFAULT
+"""
--- a/rtdetrv2_pytorch/src/nn/backbone/utils.py
+++ b/rtdetrv2_pytorch/src/nn/backbone/utils.py
@@ -0,0 +1,55 @@
+"""
+https://github.com/pytorch/vision/blob/main/torchvision/models/_utils.py
+
+Copyright(c) 2023 lyuwenyu. All Rights Reserved.
+"""
+
+from collections import OrderedDict
+from typing import Dict, List
+
+
+import torch.nn as nn 
+
+
+class IntermediateLayerGetter(nn.ModuleDict):
+    """
+    Module wrapper that returns intermediate layers from a model
+
+    It has a strong assumption that the modules have been registered
+    into the model in the same order as they are used.
+    This means that one should **not** reuse the same nn.Module
+    twice in the forward if you want this to work.
+
+    Additionally, it is only able to query submodules that are directly
+    assigned to the model. So if `model` is passed, `model.feature1` can
+    be returned, but not `model.feature1.layer2`.
+    """
+
+    _version = 3
+
+    def __init__(self, model: nn.Module, return_layers: List[str]) -> None:
+        if not set(return_layers).issubset([name for name, _ in model.named_children()]):
+            raise ValueError("return_layers are not present in model. {}"\
+                .format([name for name, _ in model.named_children()]))
+        orig_return_layers = return_layers
+        return_layers = {str(k): str(k)  for k in return_layers}
+        layers = OrderedDict()
+        for name, module in model.named_children():
+            layers[name] = module
+            if name in return_layers:
+                del return_layers[name]
+            if not return_layers:
+                break
+
+        super().__init__(layers)
+        self.return_layers = orig_return_layers
+
+    def forward(self, x):
+        outputs = []
+        for name, module in self.items():
+            x = module(x)
+            if name in self.return_layers:
+                outputs.append(x)
+        
+        return outputs
+
--- a/rtdetrv2_pytorch/src/nn/criterion/__init__.py
+++ b/rtdetrv2_pytorch/src/nn/criterion/__init__.py
@@ -0,0 +1,10 @@
+"""Copyright(c) 2023 lyuwenyu. All Rights Reserved.
+"""
+
+
+import torch.nn as nn 
+from ...core import register
+
+from .det_criterion import DetCriterion
+
+CrossEntropyLoss = register()(nn.CrossEntropyLoss)
--- a/rtdetrv2_pytorch/src/nn/criterion/det_criterion.py
+++ b/rtdetrv2_pytorch/src/nn/criterion/det_criterion.py
@@ -0,0 +1,171 @@
+"""Copyright(c) 2023 lyuwenyu. All Rights Reserved.
+"""
+
+import torch
+import torch.nn.functional as F 
+import torch.distributed
+import torchvision
+
+from ...misc import box_ops
+from ...misc import dist_utils
+from ...core import register
+
+
+@register()
+class DetCriterion(torch.nn.Module):
+    """Default Detection Criterion
+    """
+    __share__ = ['num_classes']
+    __inject__ = ['matcher']
+
+    def __init__(self, 
+                losses, 
+                weight_dict, 
+                num_classes=80, 
+                alpha=0.75, 
+                gamma=2.0, 
+                box_fmt='cxcywh',
+                matcher=None):
+        """
+        Args:
+            losses (list[str]): requested losses, support ['boxes', 'giou', 'vfl', 'focal']
+            weight_dict (dict[str, float]): corresponding loss weights, including
+                ['loss_bbox', 'loss_giou', 'loss_vfl', 'loss_focal']
+            box_fmt (str): in box format, 'cxcywh' or 'xyxy'
+            matcher (Matcher): matcher used to match source to target
+        """
+        super().__init__()
+        self.losses = losses
+        self.weight_dict = weight_dict
+        self.alpha = alpha
+        self.gamma = gamma
+        self.num_classes = num_classes
+        self.box_fmt = box_fmt
+        assert matcher is not None, 'matcher must be provided'
+        self.matcher = matcher
+
+    def forward(self, outputs, targets, **kwargs):
+        """
+        Args:
+            outputs (Dict[str, Tensor]): includes 'pred_boxes', 'pred_logits', 'meta'.
+            targets (List[Dict[str, Tensor]]): len(targets) == batch_size.
+            kwargs: other information, such as the current epoch id.
+        Return:
+            losses, Dict[str, Tensor]
+        """
+        matched = self.matcher(outputs, targets)
+        values = matched['values']
+        indices = matched['indices']
+        num_boxes = self._get_positive_nums(indices)
+        
+        # Compute all the requested losses
+        losses = {}
+        for loss in self.losses:
+            l_dict = self.get_loss(loss, outputs, targets, indices, num_boxes)
+            l_dict = {k: l_dict[k] * self.weight_dict[k] for k in l_dict if k in self.weight_dict}
+            losses.update(l_dict)
+        return losses 
+
+    def _get_src_permutation_idx(self, indices):
+        # permute predictions following indices
+        batch_idx = torch.cat([torch.full_like(src, i) for i, (src, _) in enumerate(indices)])
+        src_idx = torch.cat([src for (src, _) in indices])        
+        return batch_idx, src_idx
+
+    def _get_tgt_permutation_idx(self, indices):
+        # permute targets following indices
+        batch_idx = torch.cat([torch.full_like(tgt, i) for i, (_, tgt) in enumerate(indices)])
+        tgt_idx = torch.cat([tgt for (_, tgt) in indices])
+        return batch_idx, tgt_idx
+
+    def _get_positive_nums(self, indices):
+        # number of positive samples
+        num_pos = sum(len(i) for (i, _) in indices)
+        num_pos = torch.as_tensor([num_pos], dtype=torch.float32, device=indices[0][0].device)
+        if dist_utils.is_dist_available_and_initialized():
+            torch.distributed.all_reduce(num_pos)
+        num_pos = torch.clamp(num_pos / dist_utils.get_world_size(), min=1).item()
+        return num_pos
+
+    def loss_labels_focal(self, outputs, targets, indices, num_boxes):
+        assert 'pred_logits' in outputs
+        src_logits = outputs['pred_logits']
+
+        idx = self._get_src_permutation_idx(indices)
+        target_classes_o = torch.cat([t["labels"][j] for t, (_, j) in zip(targets, indices)])
+        target_classes = torch.full(src_logits.shape[:2], self.num_classes,
+                                    dtype=torch.int64, device=src_logits.device)
+        target_classes[idx] = target_classes_o
+
+        target = F.one_hot(target_classes, num_classes=self.num_classes + 1)[..., :-1].to(src_logits.dtype)
+        loss = torchvision.ops.sigmoid_focal_loss(src_logits, target, self.alpha, self.gamma, reduction='none')
+        loss = loss.sum() / num_boxes
+        return {'loss_focal': loss}
+
+    def loss_labels_vfl(self, outputs, targets, indices, num_boxes):
+        assert 'pred_boxes' in outputs
+        idx = self._get_src_permutation_idx(indices)
+        
+        src_boxes = outputs['pred_boxes'][idx]
+        target_boxes = torch.cat([t['boxes'][j] for t, (_, j) in zip(targets, indices)], dim=0)
+
+        src_boxes = torchvision.ops.box_convert(src_boxes, in_fmt=self.box_fmt, out_fmt='xyxy')
+        target_boxes = torchvision.ops.box_convert(target_boxes, in_fmt=self.box_fmt, out_fmt='xyxy')
+        iou, _ = box_ops.elementwise_box_iou(src_boxes.detach(), target_boxes)
+        
+        src_logits: torch.Tensor = outputs['pred_logits']
+        target_classes_o = torch.cat([t["labels"][j] for t, (_, j) in zip(targets, indices)])
+        target_classes = torch.full(src_logits.shape[:2], self.num_classes,
+                                    dtype=torch.int64, device=src_logits.device)
+        target_classes[idx] = target_classes_o
+        target = F.one_hot(target_classes, num_classes=self.num_classes + 1)[..., :-1]
+
+        target_score_o = torch.zeros_like(target_classes, dtype=src_logits.dtype)
+        target_score_o[idx] = iou.to(src_logits.dtype)
+        target_score = target_score_o.unsqueeze(-1) * target
+
+        src_score = F.sigmoid(src_logits.detach())
+        weight = self.alpha * src_score.pow(self.gamma) * (1 - target) + target_score
+        
+        loss = F.binary_cross_entropy_with_logits(src_logits, target_score, weight=weight, reduction='none')        
+        loss = loss.sum() / num_boxes
+        return {'loss_vfl': loss}
+
+    def loss_boxes(self, outputs, targets, indices, num_boxes):
+        assert 'pred_boxes' in outputs
+        idx = self._get_src_permutation_idx(indices)        
+        src_boxes = outputs['pred_boxes'][idx]
+        target_boxes = torch.cat([t['boxes'][i] for t, (_, i) in zip(targets, indices)], dim=0)
+
+        losses = {}
+        loss_bbox = F.l1_loss(src_boxes, target_boxes, reduction='none')
+        losses['loss_bbox'] = loss_bbox.sum() / num_boxes
+        
+        src_boxes = torchvision.ops.box_convert(src_boxes, in_fmt=self.box_fmt, out_fmt='xyxy')
+        target_boxes = torchvision.ops.box_convert(target_boxes, in_fmt=self.box_fmt, out_fmt='xyxy')
+        loss_giou = 1 - box_ops.elementwise_generalized_box_iou(src_boxes, target_boxes)
+        losses['loss_giou'] = loss_giou.sum() / num_boxes
+        return losses
+
+    def loss_boxes_giou(self, outputs, targets, indices, num_boxes):
+        assert 'pred_boxes' in outputs
+        idx = self._get_src_permutation_idx(indices)        
+        src_boxes = outputs['pred_boxes'][idx]
+        target_boxes = torch.cat([t['boxes'][i] for t, (_, i) in zip(targets, indices)], dim=0)
+
+        losses = {}
+        src_boxes = torchvision.ops.box_convert(src_boxes, in_fmt=self.box_fmt, out_fmt='xyxy')
+        target_boxes = torchvision.ops.box_convert(target_boxes, in_fmt=self.box_fmt, out_fmt='xyxy')
+        loss_giou = 1 - box_ops.elementwise_generalized_box_iou(src_boxes, target_boxes)
+        losses['loss_giou'] = loss_giou.sum() / num_boxes
+        return losses
+
+    def get_loss(self, loss, outputs, targets, indices, num_boxes, **kwargs):
+        loss_map = {
+            'boxes': self.loss_boxes,
+            'giou': self.loss_boxes_giou,
+            'vfl': self.loss_labels_vfl,
+            'focal': self.loss_labels_focal,
+        }
+        assert loss in loss_map, f'unsupported loss {loss!r}, expected one of {list(loss_map)}'
+        return loss_map[loss](outputs, targets, indices, num_boxes, **kwargs)
--- a/rtdetrv2_pytorch/src/nn/postprocessor/__init__.py
+++ b/rtdetrv2_pytorch/src/nn/postprocessor/__init__.py
@@ -0,0 +1,5 @@
+"""Copyright(c) 2023 lyuwenyu. All Rights Reserved.
+"""
+
+
+from .nms_postprocessor import DetNMSPostProcessor
--- a/rtdetrv2_pytorch/src/nn/postprocessor/box_revert.py
+++ b/rtdetrv2_pytorch/src/nn/postprocessor/box_revert.py
@@ -0,0 +1,62 @@
+"""Copyright(c) 2023 lyuwenyu. All Rights Reserved.
+"""
+
+import torch
+import torchvision
+from torch import Tensor
+from enum import Enum
+
+
+class BoxProcessFormat(Enum):
+    """Box process format 
+
+    Available formats are
+    * ``RESIZE``
+    * ``RESIZE_KEEP_RATIO``
+    * ``RESIZE_KEEP_RATIO_PADDING``
+    """
+    RESIZE = 1
+    RESIZE_KEEP_RATIO = 2
+    RESIZE_KEEP_RATIO_PADDING = 3
+
+
+def box_revert(
+    boxes: Tensor, 
+    orig_sizes: Tensor=None, 
+    eval_sizes: Tensor=None,
+    inpt_sizes: Tensor=None,
+    inpt_padding: Tensor=None,
+    normalized: bool=True,
+    in_fmt: str='cxcywh', 
+    out_fmt: str='xyxy',
+    process_fmt=BoxProcessFormat.RESIZE,
+) -> Tensor:
+    """
+    Args:
+        boxes(Tensor), [N, :, 4], pred boxes given in `in_fmt` ('cxcywh' or 'xyxy').
+        inpt_sizes(Tensor), [N, 2], (w, h). input sizes.
+        orig_sizes(Tensor), [N, 2], (w, h). origin sizes.
+        inpt_padding (Tensor), [N, 2], (w_pad, h_pad, ...).
+        (inpt_sizes + inpt_padding) == eval_sizes
+    """
+    assert in_fmt in ('cxcywh', 'xyxy'), f'unsupported in_fmt: {in_fmt}'
+
+    if normalized and eval_sizes is not None:
+        boxes = boxes * eval_sizes.repeat(1, 2).unsqueeze(1)
+    
+    if inpt_padding is not None:
+        if in_fmt == 'xyxy':
+            boxes -= inpt_padding[:, :2].repeat(1, 2).unsqueeze(1)
+        elif in_fmt == 'cxcywh':
+            boxes[..., :2] -= inpt_padding[:, :2].unsqueeze(1)  # only the center (cx, cy) shifts
+
+    if orig_sizes is not None:
+        orig_sizes = orig_sizes.repeat(1, 2).unsqueeze(1)
+        if inpt_sizes is not None:
+            inpt_sizes = inpt_sizes.repeat(1, 2).unsqueeze(1)
+            boxes = boxes * (orig_sizes / inpt_sizes)
+        else:
+            boxes = boxes * orig_sizes
+
+    boxes = torchvision.ops.box_convert(boxes, in_fmt=in_fmt, out_fmt=out_fmt)
+    return boxes
--- a/rtdetrv2_pytorch/src/nn/postprocessor/detr_postprocessor.py
+++ b/rtdetrv2_pytorch/src/nn/postprocessor/detr_postprocessor.py
@@ -0,0 +1,81 @@
+"""Copyright(c) 2023 lyuwenyu. All Rights Reserved.
+"""
+
+import torch 
+import torch.nn as nn 
+import torch.nn.functional as F 
+
+import torchvision
+
+
+__all__ = ['DetDETRPostProcessor']
+
+from .box_revert import box_revert
+from .box_revert import BoxProcessFormat
+
+def mod(a, b):
+    out = a - a // b * b
+    return out
+
+class DetDETRPostProcessor(nn.Module):
+    def __init__(
+        self, 
+        num_classes=80, 
+        use_focal_loss=True, 
+        num_top_queries=300, 
+        box_process_format=BoxProcessFormat.RESIZE,
+    ) -> None:
+        super().__init__()
+        self.use_focal_loss = use_focal_loss
+        self.num_top_queries = num_top_queries
+        self.num_classes = int(num_classes)
+        self.box_process_format = box_process_format
+        self.deploy_mode = False 
+
+    def extra_repr(self) -> str:
+        return f'use_focal_loss={self.use_focal_loss}, num_classes={self.num_classes}, num_top_queries={self.num_top_queries}'
+    
+    def forward(self, outputs, **kwargs):
+        logits, boxes = outputs['pred_logits'], outputs['pred_boxes']
+
+        if self.use_focal_loss:
+            scores = F.sigmoid(logits)
+            scores, index = torch.topk(scores.flatten(1), self.num_top_queries, dim=-1)
+            labels = index % self.num_classes
+            # labels = mod(index, self.num_classes) # for tensorrt
+            index = index // self.num_classes
+            boxes = boxes.gather(dim=1, index=index.unsqueeze(-1).repeat(1, 1, boxes.shape[-1]))
+            
+        else:
+            scores = F.softmax(logits, dim=-1)[:, :, :-1]
+            scores, labels = scores.max(dim=-1)
+            if scores.shape[1] > self.num_top_queries:
+                scores, index = torch.topk(scores, self.num_top_queries, dim=-1)
+                labels = torch.gather(labels, dim=1, index=index)
+                boxes = torch.gather(boxes, dim=1, index=index.unsqueeze(-1).tile(1, 1, boxes.shape[-1]))
+
+        if kwargs is not None:
+            boxes = box_revert(
+                boxes, 
+                in_fmt='cxcywh',
+                out_fmt='xyxy',
+                process_fmt=self.box_process_format,
+                normalized=True,
+                **kwargs,
+            )
+
+        # TODO for onnx export
+        if self.deploy_mode:
+            return labels, boxes, scores
+
+        results = []
+        for lab, box, sco in zip(labels, boxes, scores):
+            result = dict(labels=lab, boxes=box, scores=sco)
+            results.append(result)
+        
+        return results
+        
+    def deploy(self, ):
+        self.eval()
+        self.deploy_mode = True
+        return self 
--- a/rtdetrv2_pytorch/src/nn/postprocessor/nms_postprocessor.py
+++ b/rtdetrv2_pytorch/src/nn/postprocessor/nms_postprocessor.py
@@ -0,0 +1,79 @@
+"""Copyright(c) 2023 lyuwenyu. All Rights Reserved.
+"""
+
+import torch
+import torch.nn.functional as F 
+import torch.distributed
+import torchvision
+from torch import Tensor 
+
+from ...core import register
+
+from typing import Dict 
+
+
+__all__ = ['DetNMSPostProcessor', ]
+
+
+@register()
+class DetNMSPostProcessor(torch.nn.Module):
+    def __init__(self,
+                iou_threshold=0.7, 
+                score_threshold=0.01, 
+                keep_topk=300, 
+                box_fmt='cxcywh',
+                logit_fmt='sigmoid') -> None:
+        super().__init__()
+        self.iou_threshold = iou_threshold
+        self.score_threshold = score_threshold
+        self.keep_topk = keep_topk
+        self.box_fmt = box_fmt.lower()
+        self.logit_fmt = logit_fmt.lower()
+        self.logit_func = getattr(F, self.logit_fmt, None)
+        self.deploy_mode = False 
+    
+    def forward(self, outputs: Dict[str, Tensor], orig_target_sizes: Tensor):
+        logits, boxes = outputs['pred_logits'], outputs['pred_boxes']
+        pred_boxes = torchvision.ops.box_convert(boxes, in_fmt=self.box_fmt, out_fmt='xyxy')
+        pred_boxes *= orig_target_sizes.repeat(1, 2).unsqueeze(1)
+
+        values, pred_labels = torch.max(logits, dim=-1)
+        
+        if self.logit_func:
+            pred_scores = self.logit_func(values)
+        else:
+            pred_scores = values
+
+        # TODO for onnx export
+        if self.deploy_mode:
+            blobs = {
+                'pred_labels': pred_labels, 
+                'pred_boxes': pred_boxes,
+                'pred_scores': pred_scores
+            }
+            return blobs
+
+        results = []
+        for i in range(logits.shape[0]):
+            score_keep = pred_scores[i] > self.score_threshold
+            pred_box = pred_boxes[i][score_keep]
+            pred_label = pred_labels[i][score_keep]
+            pred_score = pred_scores[i][score_keep]
+
+            keep = torchvision.ops.batched_nms(pred_box, pred_score, pred_label, self.iou_threshold)            
+            keep = keep[:self.keep_topk]
+
+            blob = {
+                'labels': pred_label[keep],
+                'boxes': pred_box[keep],
+                'scores': pred_score[keep],
+            }
+
+            results.append(blob)
+            
+        return results
+
+    def deploy(self, ):
+        self.eval()
+        self.deploy_mode = True
+        return self 
--- a/rtdetrv2_pytorch/src/optim/__init__.py
+++ b/rtdetrv2_pytorch/src/optim/__init__.py
@@ -0,0 +1,7 @@
+"""Copyright(c) 2023 lyuwenyu. All Rights Reserved.
+"""
+
+from .ema import *
+from .optim import *
+from .amp import *
+from .warmup import *
--- a/rtdetrv2_pytorch/src/optim/amp.py
+++ b/rtdetrv2_pytorch/src/optim/amp.py
@@ -0,0 +1,12 @@
+"""Copyright(c) 2023 lyuwenyu. All Rights Reserved.
+"""
+
+
+import torch.cuda.amp as amp
+
+from ..core import register
+
+
+__all__ = ['GradScaler']
+
+GradScaler = register()(amp.grad_scaler.GradScaler)
--- a/rtdetrv2_pytorch/src/optim/ema.py
+++ b/rtdetrv2_pytorch/src/optim/ema.py
@@ -0,0 +1,92 @@
+"""Copyright(c) 2023 lyuwenyu. All Rights Reserved.
+"""
+
+
+import torch
+import torch.nn as nn 
+
+import math
+from copy import deepcopy
+
+from ..core import register
+from ..misc import dist_utils
+
+__all__ = ['ModelEMA']
+
+
+@register()
+class ModelEMA(object):
+    """
+    Model Exponential Moving Average from https://github.com/rwightman/pytorch-image-models
+    Keep a moving average of everything in the model state_dict (parameters and buffers).
+    This is intended to allow functionality like
+    https://www.tensorflow.org/api_docs/python/tf/train/ExponentialMovingAverage
+    A smoothed version of the weights is necessary for some training schemes to perform well.
+    This class is sensitive to where it is initialized in the sequence of model init,
+    GPU assignment and distributed training wrappers.
+    """
+    def __init__(self, model: nn.Module, decay: float=0.9999, warmups: int=2000, ):
+        super().__init__()
+
+        self.module = deepcopy(dist_utils.de_parallel(model)).eval() 
+        # if next(model.parameters()).device.type != 'cpu':
+        #     self.module.half()  # FP16 EMA
+        
+        self.decay = decay 
+        self.warmups = warmups
+        self.updates = 0  # number of EMA updates
+        self.decay_fn = lambda x: decay * (1 - math.exp(-x / warmups))  # decay exponential ramp (to help early epochs)
+        
+        for p in self.module.parameters():
+            p.requires_grad_(False)
+
+
+    def update(self, model: nn.Module):
+        # Update EMA parameters
+        with torch.no_grad():
+            self.updates += 1
+            d = self.decay_fn(self.updates)
+            msd = dist_utils.de_parallel(model).state_dict()
+            for k, v in self.module.state_dict().items():
+                if v.dtype.is_floating_point:
+                    v *= d
+                    v += (1 - d) * msd[k].detach()
+            
+    def to(self, *args, **kwargs):
+        self.module = self.module.to(*args, **kwargs)
+        return self
+
+    def state_dict(self, ):
+        return dict(module=self.module.state_dict(), updates=self.updates)
+    
+    def load_state_dict(self, state, strict=True):
+        self.module.load_state_dict(state['module'], strict=strict) 
+        if 'updates' in state:
+            self.updates = state['updates']
+
+    def forward(self, ):
+        raise RuntimeError('ema...')
+
+    def extra_repr(self) -> str:
+        return f'decay={self.decay}, warmups={self.warmups}'
+
+
+
+class ExponentialMovingAverage(torch.optim.swa_utils.AveragedModel):
+    """Maintains moving averages of model parameters using an exponential decay.
+    ``ema_avg = decay * avg_model_param + (1 - decay) * model_param``
+    `torch.optim.swa_utils.AveragedModel <https://pytorch.org/docs/stable/optim.html#custom-averaging-strategies>`_
+    is used to compute the EMA.
+    """
+    def __init__(self, model, decay, device="cpu", use_buffers=True):
+
+        self.decay_fn = lambda x: decay * (1 - math.exp(-x / 2000))  
+        
+        def ema_avg(avg_model_param, model_param, num_averaged):
+            decay = self.decay_fn(num_averaged)
+            return decay * avg_model_param + (1 - decay) * model_param
+
+        super().__init__(model, device, ema_avg, use_buffers=use_buffers)
+
+
+
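Both EMA classes above ramp the decay with `decay * (1 - exp(-x / warmups))`, so early updates track the live weights closely and later updates average over a long horizon. A minimal sketch of that schedule on scalar values, assuming plain floats in place of tensors; `ema_decay` and `ema_update` are hypothetical names used only for illustration.

```python
import math


def ema_decay(step: int, decay: float = 0.9999, warmups: int = 2000) -> float:
    """Effective decay at a given update count (same formula as decay_fn above)."""
    return decay * (1 - math.exp(-step / warmups))


def ema_update(avg: float, value: float, d: float) -> float:
    """One EMA step: avg <- d * avg + (1 - d) * value."""
    return d * avg + (1 - d) * value


# Track a constant target of 1.0 starting from 0.0.
avg = 0.0
for step in range(1, 6):  # ModelEMA increments `updates` before computing the decay
    d = ema_decay(step)
    avg = ema_update(avg, 1.0, d)
    print(step, round(d, 6), round(avg, 6))
```

Because the effective decay is near zero for the first few updates, the average jumps almost directly to the current value instead of being dominated by the initial copy of the weights.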
--- a/rtdetrv2_pytorch/src/optim/optim.py
+++ b/rtdetrv2_pytorch/src/optim/optim.py
@@ -0,0 +1,23 @@
+"""Copyright(c) 2023 lyuwenyu. All Rights Reserved.
+"""
+
+
+import torch.optim as optim
+import torch.optim.lr_scheduler as lr_scheduler
+
+from ..core import register
+
+
+__all__ = ['AdamW', 'SGD', 'Adam', 'MultiStepLR', 'CosineAnnealingLR', 'OneCycleLR', 'LambdaLR']
+
+
+
+SGD = register()(optim.SGD)
+Adam = register()(optim.Adam)
+AdamW = register()(optim.AdamW)
+
+
+MultiStepLR = register()(lr_scheduler.MultiStepLR)
+CosineAnnealingLR = register()(lr_scheduler.CosineAnnealingLR)
+OneCycleLR = register()(lr_scheduler.OneCycleLR)
+LambdaLR = register()(lr_scheduler.LambdaLR)
--- a/rtdetrv2_pytorch/src/optim/warmup.py
+++ b/rtdetrv2_pytorch/src/optim/warmup.py
@@ -0,0 +1,47 @@
+"""Copyright(c) 2023 lyuwenyu. All Rights Reserved.
+"""
+
+from torch.optim.lr_scheduler import LRScheduler
+
+from ..core import register
+
+
+class Warmup(object):
+    def __init__(self, lr_scheduler: LRScheduler, warmup_duration: int, last_step: int=-1) -> None:
+        self.lr_scheduler = lr_scheduler
+        self.warmup_end_values = [pg['lr'] for pg in lr_scheduler.optimizer.param_groups]
+        self.last_step = last_step
+        self.warmup_duration = warmup_duration
+        self.step()
+
+    def state_dict(self):
+        return {k: v for k, v in self.__dict__.items() if k != 'lr_scheduler'}
+
+    def load_state_dict(self, state_dict):
+        self.__dict__.update(state_dict)
+
+    def get_warmup_factor(self, step, **kwargs):
+        raise NotImplementedError
+
+    def step(self, ):
+        self.last_step += 1
+        if self.last_step >= self.warmup_duration:
+            return
+        factor = self.get_warmup_factor(self.last_step)
+        for i, pg in enumerate(self.lr_scheduler.optimizer.param_groups):
+            pg['lr'] = factor * self.warmup_end_values[i]
+    
+    def finished(self, ):
+        if self.last_step >= self.warmup_duration:
+            return True 
+        return False
+
+
+@register()
+class LinearWarmup(Warmup):
+    def __init__(self, lr_scheduler: LRScheduler, warmup_duration: int, last_step: int = -1) -> None:
+        super().__init__(lr_scheduler, warmup_duration, last_step)
+
+    def get_warmup_factor(self, step):
+        return min(1.0, (step + 1) / self.warmup_duration)
+
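`LinearWarmup` scales each parameter group's learning rate by `(step + 1) / warmup_duration` until `warmup_duration` steps have elapsed, then stops touching it. A minimal sketch of the resulting schedule for a single learning rate, assuming no other scheduler changes the rate in the meantime; `linear_warmup_lrs` is a hypothetical helper, not part of the repository.

```python
from typing import List


def linear_warmup_lrs(base_lr: float, warmup_duration: int, num_steps: int) -> List[float]:
    """Learning rate seen after each call to step(), mirroring Warmup.step above."""
    lrs = []
    last_step = -1
    lr = base_lr
    for _ in range(num_steps):
        last_step += 1
        if last_step < warmup_duration:
            # still warming up: scale the target LR by the linear factor
            factor = min(1.0, (last_step + 1) / warmup_duration)
            lr = factor * base_lr
        # after warmup the LR is left untouched
        lrs.append(lr)
    return lrs


print(linear_warmup_lrs(1.0, 4, 6))  # [0.25, 0.5, 0.75, 1.0, 1.0, 1.0]
```

Note that the constructor in the source calls `step()` once, so the first factor is already applied before the first training iteration.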
--- a/rtdetrv2_pytorch/src/solver/__init__.py
+++ b/rtdetrv2_pytorch/src/solver/__init__.py
@@ -0,0 +1,15 @@
+"""Copyright(c) 2023 lyuwenyu. All Rights Reserved.
+"""
+
+from ._solver import BaseSolver
+from .clas_solver import ClasSolver
+from .det_solver import DetSolver
+
+
+
+from typing import Dict, Type
+
+TASKS: Dict[str, Type[BaseSolver]] = {
+    'classification': ClasSolver,
+    'detection': DetSolver,
+}
--- a/rtdetrv2_pytorch/src/solver/_solver.py
+++ b/rtdetrv2_pytorch/src/solver/_solver.py
@@ -0,0 +1,191 @@
+"""Copyright(c) 2023 lyuwenyu. All Rights Reserved.
+"""
+
+import torch 
+import torch.nn as nn 
+
+from datetime import datetime
+from pathlib import Path 
+from typing import Dict
+import atexit
+
+from ..misc import dist_utils
+from ..core import BaseConfig
+
+
+def to(m: nn.Module, device: str):
+    if m is None:
+        return None 
+    return m.to(device) 
+
+
+class BaseSolver(object):
+    def __init__(self, cfg: BaseConfig) -> None:
+        self.cfg = cfg 
+
+    def _setup(self, ):
+        """Avoid instantiating unnecessary classes 
+        """
+        cfg = self.cfg
+        if cfg.device:
+            device = torch.device(cfg.device)
+        else:
+            device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
+
+        self.model = cfg.model
+        
+        # NOTE (lyuwenyu): must load_tuning_state before ema instance building
+        if self.cfg.tuning:
+            print(f'tuning checkpoint from {self.cfg.tuning}')
+            self.load_tuning_state(self.cfg.tuning)
+
+        self.model = dist_utils.warp_model(self.model.to(device), sync_bn=cfg.sync_bn, \
+            find_unused_parameters=cfg.find_unused_parameters)
+
+        self.criterion = to(cfg.criterion, device)
+        self.postprocessor = to(cfg.postprocessor, device)
+
+        self.ema = to(cfg.ema, device)
+        self.scaler = cfg.scaler
+
+        self.device = device
+        self.last_epoch = self.cfg.last_epoch
+        
+        self.output_dir = Path(cfg.output_dir)
+        self.output_dir.mkdir(parents=True, exist_ok=True)
+        self.writer = cfg.writer
+
+        if self.writer:
+            atexit.register(self.writer.close)
+            if dist_utils.is_main_process():
+                self.writer.add_text('config', '{:s}'.format(cfg.__repr__()), 0)
+
+    def cleanup(self, ):
+        if self.writer:
+            atexit.register(self.writer.close)
+
+    def train(self, ):
+        self._setup()
+        self.optimizer = self.cfg.optimizer
+        self.lr_scheduler = self.cfg.lr_scheduler
+        self.lr_warmup_scheduler = self.cfg.lr_warmup_scheduler
+
+        self.train_dataloader = dist_utils.warp_loader(self.cfg.train_dataloader, \
+            shuffle=self.cfg.train_dataloader.shuffle)
+        self.val_dataloader = dist_utils.warp_loader(self.cfg.val_dataloader, \
+            shuffle=self.cfg.val_dataloader.shuffle)
+
+        self.evaluator = self.cfg.evaluator
+
+        # NOTE instantiating order
+        if self.cfg.resume:
+            print(f'Resume checkpoint from {self.cfg.resume}')
+            self.load_resume_state(self.cfg.resume)
+
+    def eval(self, ):
+        self._setup()
+
+        self.val_dataloader = dist_utils.warp_loader(self.cfg.val_dataloader, \
+            shuffle=self.cfg.val_dataloader.shuffle)
+
+        self.evaluator = self.cfg.evaluator
+        
+        if self.cfg.resume:
+            print(f'Resume checkpoint from {self.cfg.resume}')
+            self.load_resume_state(self.cfg.resume)
+
+    def to(self, device):
+        for k, v in self.__dict__.items():
+            if hasattr(v, 'to'):
+                v.to(device)
+
+    def state_dict(self):
+        """state dict, train/eval
+        """
+        state = {}
+        state['date'] = datetime.now().isoformat()
+        
+        # TODO for resume
+        state['last_epoch'] = self.last_epoch
+
+        for k, v in self.__dict__.items():
+            if hasattr(v, 'state_dict'):
+                v = dist_utils.de_parallel(v)
+                state[k] = v.state_dict() 
+
+        return state
+
+
+    def load_state_dict(self, state):
+        """load state dict, train/eval
+        """
+        # TODO
+        if 'last_epoch' in state:
+            self.last_epoch = state['last_epoch']
+            print('Load last_epoch')
+
+        for k, v in self.__dict__.items():
+            if hasattr(v, 'load_state_dict') and k in state:
+                v = dist_utils.de_parallel(v)
+                v.load_state_dict(state[k])
+                print(f'Load {k}.state_dict')
+
+            if hasattr(v, 'load_state_dict') and k not in state:
+                print(f'Not load {k}.state_dict')
+
+
+    def load_resume_state(self, path: str):
+        """load resume
+        """
+        # load onto CPU first so checkpoint tensors do not occupy cuda:0 memory
+        if path.startswith('http'):
+            state = torch.hub.load_state_dict_from_url(path, map_location='cpu')
+        else:
+            state = torch.load(path, map_location='cpu')
+
+        self.load_state_dict(state)
+
+    
+    def load_tuning_state(self, path: str,):
+        """Load only the model weights for tuning, skipping missing and shape-mismatched keys
+        """
+        if path.startswith('http'):
+            state = torch.hub.load_state_dict_from_url(path, map_location='cpu')
+        else:
+            state = torch.load(path, map_location='cpu')
+
+        module = dist_utils.de_parallel(self.model)
+        
+        # TODO hard code
+        if 'ema' in state:
+            stat, infos = self._matched_state(module.state_dict(), state['ema']['module'])
+        else:
+            stat, infos = self._matched_state(module.state_dict(), state['model'])
+
+        module.load_state_dict(stat, strict=False)
+        print(f'Load model.state_dict, {infos}')
+
+
+    @staticmethod
+    def _matched_state(state: Dict[str, torch.Tensor], params: Dict[str, torch.Tensor]):
+        missed_list = []
+        unmatched_list = []
+        matched_state = {}
+        for k, v in state.items():
+            if k in params:
+                if v.shape == params[k].shape:
+                    matched_state[k] = params[k]
+                else:
+                    unmatched_list.append(k)
+            else:
+                missed_list.append(k)
+
+        return matched_state, {'missed': missed_list, 'unmatched': unmatched_list}
+
+
+    def fit(self, ):
+        raise NotImplementedError('')
+
+
+    def val(self, ):
+        raise NotImplementedError('')
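`_matched_state` above keeps only checkpoint entries whose key exists in the model and whose shape agrees, and reports everything else. A minimal sketch of that filtering, assuming shape tuples stand in for tensors; `matched_state` and the example key names are hypothetical.

```python
from typing import Dict, List, Tuple

Shape = Tuple[int, ...]


def matched_state(model_shapes: Dict[str, Shape], ckpt_shapes: Dict[str, Shape]):
    """Split model keys into loadable, missing, and shape-mismatched groups."""
    matched: Dict[str, Shape] = {}
    missed: List[str] = []
    unmatched: List[str] = []
    for key, shape in model_shapes.items():
        if key not in ckpt_shapes:
            missed.append(key)            # checkpoint has no such parameter
        elif ckpt_shapes[key] != shape:
            unmatched.append(key)         # present, but the shape differs
        else:
            matched[key] = ckpt_shapes[key]
    return matched, {'missed': missed, 'unmatched': unmatched}


# Example: a detection head resized from 91 to 80 classes, plus a new bias.
model = {'backbone.w': (64, 3), 'head.w': (80, 256), 'head.b': (80,)}
ckpt = {'backbone.w': (64, 3), 'head.w': (91, 256)}
print(matched_state(model, ckpt))
```

Loading the matched subset with `strict=False`, as the source does, leaves the missing and mismatched parameters at their freshly initialized values.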
--- a/rtdetrv2_pytorch/src/solver/clas_engine.py
+++ b/rtdetrv2_pytorch/src/solver/clas_engine.py
@@ -0,0 +1,74 @@
+"""Copyright(c) 2023 lyuwenyu. All Rights Reserved.
+"""
+
+import torch
+import torch.nn as nn 
+
+from ..misc import (MetricLogger, SmoothedValue, reduce_dict)
+
+
+def train_one_epoch(model: nn.Module, criterion: nn.Module, dataloader, optimizer, ema, epoch, device):
+    """Train a classification model for one epoch and return the averaged metrics.
+    """
+    model.train()
+
+    metric_logger = MetricLogger(delimiter="  ")
+    metric_logger.add_meter('lr', SmoothedValue(window_size=1, fmt='{value:.6f}'))
+    print_freq = 100
+    header = 'Epoch: [{}]'.format(epoch)
+
+    for imgs, labels in metric_logger.log_every(dataloader, print_freq, header):
+        imgs = imgs.to(device)
+        labels = labels.to(device)
+
+        preds = model(imgs)
+        loss: torch.Tensor = criterion(preds, labels)
+        
+        optimizer.zero_grad()
+        loss.backward()
+        optimizer.step()
+        
+        if ema is not None:
+            ema.update(model)
+
+        loss_reduced_values = {k: v.item() for k, v in reduce_dict({'loss': loss}).items()}
+        metric_logger.update(**loss_reduced_values)
+        metric_logger.update(lr=optimizer.param_groups[0]["lr"])
+    
+    metric_logger.synchronize_between_processes()
+    print("Averaged stats:", metric_logger)
+
+    stats = {k: meter.global_avg for k, meter in metric_logger.meters.items()}
+    return stats
+
+
+
+@torch.no_grad()
+def evaluate(model, criterion, dataloader, device):
+    model.eval()
+
+    metric_logger = MetricLogger(delimiter="  ")
+    # metric_logger.add_meter('acc', SmoothedValue(window_size=1, fmt='{global_avg:.4f}'))
+    # metric_logger.add_meter('loss', SmoothedValue(window_size=1, fmt='{value:.2f}'))
+    metric_logger.add_meter('acc', SmoothedValue(window_size=1))
+    metric_logger.add_meter('loss', SmoothedValue(window_size=1))
+
+    header = 'Test:'
+    for imgs, labels in metric_logger.log_every(dataloader, 10, header):
+        imgs, labels = imgs.to(device), labels.to(device)
+        preds = model(imgs)
+
+        acc = (preds.argmax(dim=-1) == labels).sum() / preds.shape[0]
+        loss = criterion(preds, labels)
+
+        dict_reduced = reduce_dict({'acc': acc, 'loss': loss})
+        reduced_values = {k: v.item() for k, v in dict_reduced.items()}
+        metric_logger.update(**reduced_values)
+
+    metric_logger.synchronize_between_processes()
+    print("Averaged stats:", metric_logger)
+
+    stats = {k: meter.global_avg for k, meter in metric_logger.meters.items()}
+    return stats
+
+
--- a/rtdetrv2_pytorch/src/solver/clas_solver.py
+++ b/rtdetrv2_pytorch/src/solver/clas_solver.py
@@ -0,0 +1,71 @@
+"""Copyright(c) 2023 lyuwenyu. All Rights Reserved.
+"""
+
+import time 
+import json
+import datetime
+from pathlib import Path
+
+import torch 
+import torch.nn as nn 
+
+from ..misc import dist_utils
+from ._solver import BaseSolver
+from .clas_engine import train_one_epoch, evaluate
+
+
+class ClasSolver(BaseSolver):
+
+    def fit(self, ):
+        print("Start training")
+        self.train()
+        args = self.cfg 
+
+        n_parameters = sum(p.numel() for p in self.model.parameters() if p.requires_grad)
+        print('Number of params:', n_parameters)
+
+        output_dir = Path(args.output_dir)
+        output_dir.mkdir(exist_ok=True)
+
+        start_time = time.time()
+        start_epoch = self.last_epoch + 1
+        for epoch in range(start_epoch, args.epoches):
+
+            if dist_utils.is_dist_available_and_initialized():
+                self.train_dataloader.sampler.set_epoch(epoch)
+            
+            train_stats = train_one_epoch(self.model, 
+                                        self.criterion, 
+                                        self.train_dataloader, 
+                                        self.optimizer, 
+                                        self.ema, 
+                                        epoch=epoch, 
+                                        device=self.device)
+            self.lr_scheduler.step()
+            self.last_epoch += 1
+
+            if output_dir:
+                checkpoint_paths = [output_dir / 'checkpoint.pth']
+                # extra checkpoint every `checkpoint_freq` epochs
+                if (epoch + 1) % args.checkpoint_freq == 0:
+                    checkpoint_paths.append(output_dir / f'checkpoint{epoch:04}.pth')
+                for checkpoint_path in checkpoint_paths:
+                    dist_utils.save_on_master(self.state_dict(), checkpoint_path)
+
+            module = self.ema.module if self.ema else self.model
+            test_stats = evaluate(module, self.criterion, self.val_dataloader, self.device)
+
+            log_stats = {**{f'train_{k}': v for k, v in train_stats.items()},
+                         **{f'test_{k}': v for k, v in test_stats.items()},
+                         'epoch': epoch,
+                         'n_parameters': n_parameters}
+            
+            if output_dir and dist_utils.is_main_process():
+                with (output_dir / "log.txt").open("a") as f:
+                    f.write(json.dumps(log_stats) + "\n")
+
+        total_time = time.time() - start_time
+        total_time_str = str(datetime.timedelta(seconds=int(total_time)))
+        print('Training time {}'.format(total_time_str))
+
+
--- a/rtdetrv2_pytorch/src/solver/det_engine.py
+++ b/rtdetrv2_pytorch/src/solver/det_engine.py
@@ -0,0 +1,157 @@
+"""
+Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved
+https://github.com/facebookresearch/detr/blob/main/engine.py
+
+Copyright(c) 2023 lyuwenyu. All Rights Reserved.
+"""
+
+import sys
+import math
+from typing import Iterable
+
+import torch
+import torch.amp 
+from torch.utils.tensorboard import SummaryWriter
+from torch.cuda.amp.grad_scaler import GradScaler
+
+from ..optim import ModelEMA, Warmup
+from ..data import CocoEvaluator
+from ..misc import MetricLogger, SmoothedValue, dist_utils
+
+
+def train_one_epoch(model: torch.nn.Module, criterion: torch.nn.Module,
+                    data_loader: Iterable, optimizer: torch.optim.Optimizer,
+                    device: torch.device, epoch: int, max_norm: float = 0, **kwargs):
+    model.train()
+    criterion.train()
+    metric_logger = MetricLogger(delimiter="  ")
+    metric_logger.add_meter('lr', SmoothedValue(window_size=1, fmt='{value:.6f}'))
+    header = 'Epoch: [{}]'.format(epoch)
+    
+    print_freq = kwargs.get('print_freq', 10)
+    writer :SummaryWriter = kwargs.get('writer', None)
+
+    ema :ModelEMA = kwargs.get('ema', None)
+    scaler :GradScaler = kwargs.get('scaler', None)
+    lr_warmup_scheduler :Warmup = kwargs.get('lr_warmup_scheduler', None)
+
+    for i, (samples, targets) in enumerate(metric_logger.log_every(data_loader, print_freq, header)):
+        samples = samples.to(device)
+        targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
+        global_step = epoch * len(data_loader) + i
+        metas = dict(epoch=epoch, step=i, global_step=global_step)
+
+        if scaler is not None:
+            with torch.autocast(device_type=device.type, cache_enabled=True):
+                outputs = model(samples, targets=targets)
+            
+            with torch.autocast(device_type=device.type, enabled=False):
+                loss_dict = criterion(outputs, targets, **metas)
+
+            loss = sum(loss_dict.values())
+            scaler.scale(loss).backward()
+            
+            if max_norm > 0:
+                scaler.unscale_(optimizer)
+                torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
+
+            scaler.step(optimizer)
+            scaler.update()
+            optimizer.zero_grad()
+
+        else:
+            outputs = model(samples, targets=targets)
+            loss_dict = criterion(outputs, targets, **metas)
+            
+            loss : torch.Tensor = sum(loss_dict.values())
+            optimizer.zero_grad()
+            loss.backward()
+            
+            if max_norm > 0:
+                torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
+
+            optimizer.step()
+        
+        # ema 
+        if ema is not None:
+            ema.update(model)
+
+        if lr_warmup_scheduler is not None:
+            lr_warmup_scheduler.step()
+
+        loss_dict_reduced = dist_utils.reduce_dict(loss_dict)
+        loss_value = sum(loss_dict_reduced.values())
+
+        if not math.isfinite(loss_value):
+            print("Loss is {}, stopping training".format(loss_value))
+            print(loss_dict_reduced)
+            sys.exit(1)
+
+        metric_logger.update(loss=loss_value, **loss_dict_reduced)
+        metric_logger.update(lr=optimizer.param_groups[0]["lr"])
+
+        if writer and dist_utils.is_main_process():
+            writer.add_scalar('Loss/total', loss_value.item(), global_step)
+            for j, pg in enumerate(optimizer.param_groups):
+                writer.add_scalar(f'Lr/pg_{j}', pg['lr'], global_step)
+            for k, v in loss_dict_reduced.items():
+                writer.add_scalar(f'Loss/{k}', v.item(), global_step)
+                
+    # gather the stats from all processes
+    metric_logger.synchronize_between_processes()
+    print("Averaged stats:", metric_logger)
+    return {k: meter.global_avg for k, meter in metric_logger.meters.items()}
+
+
+@torch.no_grad()
+def evaluate(model: torch.nn.Module, criterion: torch.nn.Module, postprocessor, data_loader, coco_evaluator: CocoEvaluator, device):
+    model.eval()
+    criterion.eval()
+    coco_evaluator.cleanup()
+    iou_types = coco_evaluator.iou_types
+
+    metric_logger = MetricLogger(delimiter="  ")
+    header = 'Test:'
+    
+    for samples, targets in metric_logger.log_every(data_loader, 10, header):
+        samples = samples.to(device)
+        targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
+
+        outputs = model(samples)
+
+        # TODO (lyuwenyu), fix dataset converted using `convert_to_coco_api`?
+        orig_target_sizes = torch.stack([t["orig_size"] for t in targets], dim=0)
+        
+        results = postprocessor(outputs, orig_target_sizes)
+
+        # if 'segm' in postprocessor.keys():
+        #     target_sizes = torch.stack([t["size"] for t in targets], dim=0)
+        #     results = postprocessor['segm'](results, outputs, orig_target_sizes, target_sizes)
+
+        res = {target['image_id'].item(): output for target, output in zip(targets, results)}
+        if coco_evaluator is not None:
+            coco_evaluator.update(res)
+
+    # gather the stats from all processes
+    metric_logger.synchronize_between_processes()
+    print("Averaged stats:", metric_logger)
+    if coco_evaluator is not None:
+        coco_evaluator.synchronize_between_processes()
+
+    # accumulate predictions from all images
+    if coco_evaluator is not None:
+        coco_evaluator.accumulate()
+        coco_evaluator.summarize()
+
+    stats = {}
+    # stats = {k: meter.global_avg for k, meter in metric_logger.meters.items()}
+    if coco_evaluator is not None:
+        if 'bbox' in iou_types:
+            stats['coco_eval_bbox'] = coco_evaluator.coco_eval['bbox'].stats.tolist()
+        if 'segm' in iou_types:
+            stats['coco_eval_masks'] = coco_evaluator.coco_eval['segm'].stats.tolist()
+            
+    return stats, coco_evaluator
+
+
+
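When `max_norm > 0`, `train_one_epoch` above clips gradients by their global L2 norm before the optimizer step (after `scaler.unscale_` in the AMP branch, so the norm is measured on true gradients). A minimal pure-Python sketch of global-norm clipping, assuming nested lists in place of parameter gradients. `clip_by_global_norm` is a hypothetical helper, and the `eps` default is an assumption chosen for numerical safety rather than a value taken from the source.

```python
import math
from typing import List, Tuple


def clip_by_global_norm(grads: List[List[float]], max_norm: float,
                        eps: float = 1e-6) -> Tuple[List[List[float]], float]:
    """Rescale all gradients together so their joint L2 norm is at most max_norm."""
    # norm over every element of every gradient, not per parameter
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    # never scale gradients up: the coefficient is capped at 1
    coef = min(1.0, max_norm / (total_norm + eps))
    clipped = [[g * coef for g in vec] for vec in grads]
    return clipped, total_norm


grads = [[3.0], [4.0]]                    # joint norm is 5
clipped, norm = clip_by_global_norm(grads, max_norm=1.0)
print(norm, clipped)
```

Scaling every gradient by the same coefficient preserves the update direction, which is why the norm is computed jointly rather than per tensor.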