Problem

The inverse projection problem can be viewed as recovering 3D point clouds, together with their semantic understanding, from a sequence of perspective images. That is, for each 3D point the model must predict its spatial location, its semantic class, and a temporally consistent instance label.
This can be modeled as Depth-aware Video Panoptic Segmentation (DVPS), which splits into two sub-tasks: monocular depth estimation and video panoptic segmentation.

Essence

The first sub-task, video panoptic segmentation, unifies semantic segmentation and instance segmentation. Extended to the video domain, it additionally requires each instance to keep the same instance ID throughout the video sequence.
The earlier VPSNet adds a new tracking head that learns the similarity between the regional features of instances.
The method proposed in the paper extends Panoptic-DeepLab instead: two consecutive frames are concatenated as input, and the pixels of both frames are assigned to the instances of the first frame.
Instances that do not appear in the first frame are ignored and handled when the next frame becomes the first frame of its own pair. The per-pair results over the video sequence are then stitched so that the temporal instance IDs are merged into unique IDs.
Monocular depth estimation follows the fully supervised approach used by state-of-the-art methods, and is trained by adding a depth prediction head to the heads of Panoptic-DeepLab.

Detail

Video Panoptic Segmentation

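In video panoptic segmentation, each instance can be represented as a tube formed by stacking image planes along the time axis. Video Panoptic Quality (VPQ) is computed within a time window k: a predicted tube is a TP when the IoU of its pixels with the matching ground-truth tube is greater than 0.5, and FP and FN are defined analogously. It can then be written as follows.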

VPQ^k = \frac{1}{N_{\text{classes}}} \sum_c \frac{\sum_{(u, \hat{u}) \in \text{TP}_c}\text{IoU}(u, \hat{u})}{|\text{TP}_c| + \frac{1}{2} |\text{FP}_c| + \frac{1}{2}|\text{FN}_c|}
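As a rough illustration of the formula, the sketch below averages the per-class quality over classes. The function name `vpq_k`, the input layout, and the choice to skip classes that appear in neither prediction nor ground truth are assumptions made for this example, not details taken from the paper.

```python
def vpq_k(stats_per_class):
    """Average per-class quality for one window size k.

    stats_per_class maps a class name to a tuple
    (list of IoUs of matched TP tubes, number of FP tubes, number of FN tubes).
    This layout is hypothetical and chosen only for illustration.
    """
    scores = []
    for tp_ious, num_fp, num_fn in stats_per_class.values():
        denom = len(tp_ious) + 0.5 * num_fp + 0.5 * num_fn
        if denom == 0:
            # Assumption: a class with no TP, FP, or FN is left out of the average.
            continue
        scores.append(sum(tp_ious) / denom)
    return sum(scores) / len(scores) if scores else 0.0


# Two classes: "car" has two matched tubes plus one FP and one FN,
# "road" has a single matched tube.
stats = {"car": ([0.8, 0.6], 1, 1), "road": ([0.9], 0, 0)}
score = vpq_k(stats)
```

Here the car class contributes 1.4 / 3 and the road class contributes 0.9, and the result is the mean of the two.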

Given an image sequence, VPQ can be thought of as accumulating the PQ-related statistics of each image, and this observation inspired the idea of processing concatenated images.
Panoptic-DeepLab is an image panoptic segmentation algorithm that performs three sub-tasks: (1) semantic prediction, which labels every pixel with a thing or stuff class, (2) center prediction, which predicts the center of each instance of the thing classes, and (3) center regression, which regresses, for each pixel, the offset to the center it belongs to.
During inference, each thing pixel is linked to the predicted center nearest to its regressed center location, which groups the pixels into instances. The stuff predictions are then merged in to produce the final panoptic prediction.
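The grouping step can be sketched in a few lines. This is a minimal illustration, not the Panoptic-DeepLab implementation: `group_pixels`, the dictionary-based offset map, and the convention that instance IDs start at 1 are all assumptions made for the example.

```python
def group_pixels(thing_pixels, offsets, centers):
    """Assign each thing pixel to the nearest predicted instance center.

    thing_pixels: list of (y, x) coordinates predicted as a thing class.
    offsets: dict mapping (y, x) -> (dy, dx), the regressed offset to the center.
    centers: list of (y, x) predicted instance centers.
    Returns a dict mapping (y, x) -> instance ID (1-based, hypothetical convention).
    """
    ids = {}
    for (y, x) in thing_pixels:
        dy, dx = offsets[(y, x)]
        py, px = y + dy, x + dx  # where this pixel believes its center is

        def sq_dist(k):
            cy, cx = centers[k]
            return (cy - py) ** 2 + (cx - px) ** 2

        ids[(y, x)] = min(range(len(centers)), key=sq_dist) + 1
    return ids


centers = [(1, 1), (1, 3)]
offsets = {(0, 0): (1, 1), (0, 3): (1, 0)}
instance_ids = group_pixels([(0, 0), (0, 3)], offsets, centers)
```

The same routine would apply to the two-frame case if the offsets of both frames point at centers of the first frame, which is why no separate tracking step is needed inside a pair.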

The proposed ViP-DeepLab extends this to video panoptic segmentation. It performs the same three sub-tasks, but takes image t concatenated with the following frame, image t+1, as input.
During inference, the pixels of both t and t+1 regress object center locations with respect to frame t, which makes it possible to find which object in t each pixel of t+1 belongs to. Objects that appear only in t+1 are ignored at this step and are found only later, when t+1 is processed as the first frame of the next pair.

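Looking at the overall architecture, the gray region is identical to the original Panoptic-DeepLab: it uses image t to perform semantic segmentation, object center prediction, and object center regression. A next-frame instance branch is added to it, which takes images t and t+1 as input and additionally computes the center regression for t+1 with respect to t.
Because this branch needs a large receptive field, four ASPP modules are applied and their outputs are densely connected, which greatly enlarges the receptive field. The authors call this Cascade-ASPP.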

Stitching Video Panoptic Predictions

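Instances must keep the same ID across the whole image sequence. Let P_t be the panoptic prediction for frame t and R_t the prediction for frame t+1 expressed in the instance IDs of frame t. P_t and R_t come from the same input pair and therefore already share IDs, but P_{t+1}, produced from the next pair, must also be made to carry the same IDs as P_t.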

The paper propagates IDs from R_t to P_{t+1} using a method based on mask IoU.
Among the region pairs of R_t and P_{t+1} that have the same class, the pair with the largest mask IoU is selected and assigned the same ID. A region that receives no ID is treated as a new instance.
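The ID propagation can be sketched as follows. This is one possible reading of the rule above, assuming a greedy one-to-one matching in order of decreasing mask IoU; the function names, the set-of-pixels mask representation, and the `next_new_id` counter are hypothetical choices for this example.

```python
def mask_iou(a, b):
    """IoU of two masks given as sets of (y, x) pixels."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def propagate_ids(r_t, p_next, next_new_id):
    """Map the local region IDs of P_{t+1} to IDs propagated from R_t.

    r_t:    dict propagated_id -> (class_name, set of pixels)
    p_next: dict local_id -> (class_name, set of pixels)
    Assumption: pairs are matched greedily, largest IoU first, one-to-one.
    """
    pairs = []
    for rid, (rcls, rmask) in r_t.items():
        for pid, (pcls, pmask) in p_next.items():
            if rcls == pcls:  # only regions of the same class may be paired
                iou = mask_iou(rmask, pmask)
                if iou > 0:
                    pairs.append((iou, rid, pid))
    pairs.sort(reverse=True)

    mapping, used = {}, set()
    for _, rid, pid in pairs:
        if pid not in mapping and rid not in used:
            mapping[pid] = rid
            used.add(rid)
    for pid in p_next:  # unmatched regions become new instances
        if pid not in mapping:
            mapping[pid] = next_new_id
            next_new_id += 1
    return mapping


r_t = {7: ("car", {(0, 0), (0, 1), (0, 2)}), 9: ("person", {(2, 0), (2, 1)})}
p_next = {
    1: ("car", {(0, 1), (0, 2), (0, 3)}),
    2: ("person", {(2, 0), (2, 1)}),
    3: ("car", {(5, 5)}),
}
id_map = propagate_ids(r_t, p_next, next_new_id=10)
```

In this toy input, regions 1 and 2 inherit IDs 7 and 9, while the isolated car region 3 overlaps nothing and is given the fresh ID 10.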

Monocular Depth Estimation

Following [22], the paper models depth as a dense regression problem and estimates a depth value for every pixel.
A depth prediction head is added on top of the semantic branch. It upsamples the features by a factor of 2 and then regresses f_d, from which the final depth is computed as \text{Depth} = \text{MaxDepth} \times \text{Sigmoid}(f_d). MaxDepth is set to 88.
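The mapping from the regressed value f_d to a depth can be written directly. The sketch below is only an illustration of the formula; `depth_from_logit` is a hypothetical name, and the numerically stable sigmoid is an implementation choice of this example rather than something stated in the paper.

```python
import math

MAX_DEPTH = 88.0  # value quoted in the text above


def sigmoid(x):
    """Numerically stable logistic function for a single float."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def depth_from_logit(f_d, max_depth=MAX_DEPTH):
    """Depth = MaxDepth * Sigmoid(f_d), so the output is bounded by max_depth."""
    return max_depth * sigmoid(f_d)


mid_depth = depth_from_logit(0.0)
```

Because the sigmoid saturates, the prediction can never exceed MaxDepth, and f_d = 0 maps to half of it.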
The loss combines two commonly used terms, the scale-invariant logarithmic error and the relative squared error, as follows.

\mathcal{L}_\text{depth} (d, \hat{d}) = \frac{1}{n} \sum_i (\log d_i - \log \hat{d}_i)^2 - \frac{1}{n^2} \left( \sum_i (\log d_i - \log \hat{d}_i) \right)^2 + \left( \frac{1}{n} \sum_i \left( \frac{d_i - \hat{d}_i}{d_i} \right)^2 \right)^{0.5}
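A plain-Python sketch of this loss may help when reading the formula. It assumes that d holds ground-truth depths and \hat{d} the predictions, that both are flat lists of strictly positive values, and that invalid pixels have already been filtered out; `depth_loss` is a hypothetical name.

```python
import math


def depth_loss(d, d_hat):
    """Scale-invariant log error plus root of the mean relative squared error.

    d, d_hat: equal-length lists of positive depths
    (assumed here: d is ground truth, d_hat is the prediction).
    """
    n = len(d)
    log_diff = [math.log(a) - math.log(b) for a, b in zip(d, d_hat)]
    # First two terms: scale-invariant logarithmic error.
    silog = sum(x * x for x in log_diff) / n - (sum(log_diff) ** 2) / (n * n)
    # Last term: square root of the mean relative squared error.
    rel = math.sqrt(sum(((a - b) / a) ** 2 for a, b in zip(d, d_hat)) / n)
    return silog + rel


perfect = depth_loss([1.0, 2.0, 4.0], [1.0, 2.0, 4.0])
doubled = depth_loss([1.0, 2.0, 4.0], [2.0, 4.0, 8.0])
```

Doubling every prediction leaves the scale-invariant part at zero and is penalized only through the relative term, which is one way to see why the two terms are combined.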

Depth-aware Video Panoptic Segmentation

The paper extends Video Panoptic Segmentation to define Depth-aware Video Panoptic Segmentation (DVPS), and proposes Depth-aware Video Panoptic Quality (DVPQ) as the metric for evaluating it.
DVPQ extends VPQ to account for depth error: a predicted pixel keeps its label only if its absolute relative depth error against the ground truth is within \lambda.
\lambda is evaluated at 0.1, 0.25, and 0.5.
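The depth filtering that precedes the VPQ computation can be sketched as below. The `VOID` value, the flat-list layout, and the choice to leave pixels without ground-truth depth untouched are assumptions for this example, not details confirmed by the text.

```python
VOID = -1  # hypothetical label meaning "no prediction" for a pixel


def apply_depth_threshold(pred_labels, pred_depth, gt_depth, lam):
    """Keep a predicted label only where the absolute relative depth error <= lam.

    All inputs are flat lists of equal length, one entry per pixel.
    Assumption: pixels with gt_depth <= 0 have no depth annotation and are kept.
    """
    filtered = []
    for label, d_hat, d in zip(pred_labels, pred_depth, gt_depth):
        if d > 0 and abs(d_hat - d) / d > lam:
            filtered.append(VOID)
        else:
            filtered.append(label)
    return filtered


labels = [1, 1, 2]
pred_depth = [10.0, 13.0, 20.0]
gt_depth = [10.0, 10.0, 40.0]
strict = apply_depth_threshold(labels, pred_depth, gt_depth, lam=0.25)
loose = apply_depth_threshold(labels, pred_depth, gt_depth, lam=0.5)
```

The relative errors in this toy input are 0, 0.3, and 0.5, so the stricter threshold discards two of the three labels while the looser one keeps all of them.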

Paper

ViP-DeepLab: Learning Visual Perception with Depth-aware Video Panoptic Segmentation
