Problem

Segmentation techniques for single images are advancing rapidly thanks to CNNs, but comparatively little work extends them to video data.
Applying a single-image CNN to video frame by frame ignores temporal information, which causes jittering between frames.
Applying a CRF across frames has been proposed to address this, but such methods operate at the label level rather than directly on the CNN's internal representations, and they are also very slow.

Essence

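To exploit both the power of single-image CNNs and the temporal coherence between video frames, the paper proposes a simple way to convert an image CNN into a video CNN.
The proposed NetWarp is a module that warps the intermediate CNN representation of the previous frame so that it aligns with the current frame.
It learns how to transform intermediate CNN representations using the optical flow between the two frames.
The module can be inserted at several layers of the CNN, so that representations at multiple depths are warped.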

Detail

NetWarp

The NetWarp module works in several steps. It computes the optical flow field from the inputs and then performs flow transformation, representation warping, and combination of representations, in that order.

Flow Computation

DIS-Flow is used to compute the dense optical flow F_t between frames.
Given an image pair I_t, I_{(t-1)}, the pixel displacement (u, v) into I_{(t-1)} can be computed for every pixel location (x, y) of I_t. As an equation, (x', y') = (x + u, y + v).
FlowFields can also be used and gives more accurate results, but it is comparatively slow.
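As a small illustration of the correspondence (x', y') = (x + u, y + v), the sketch below looks up the displacement for one pixel in a made-up flow field. The function name `corresponding_position` and the flow values are hypothetical; a real flow field would come from DIS-Flow or FlowFields.

```python
def corresponding_position(x, y, flow):
    """Return (x', y') in the previous frame for pixel (x, y) of the current frame.

    `flow[y][x]` holds the displacement (u, v) for that pixel.
    """
    u, v = flow[y][x]
    return x + u, y + v


# Hypothetical 2x3 flow field (rows x columns) with a constant displacement.
flow = [[(-1.5, 0.5) for _ in range(3)] for _ in range(2)]

# Pixel (x=2, y=1) of I_t corresponds to a sub-pixel location in I_{(t-1)}.
x_prev, y_prev = corresponding_position(2, 1, flow)
print(x_prev, y_prev)
```

The resulting position is generally not an integer, which is why the warping step needs interpolation.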

Flow Transformation

Raw optical flow may not be well suited to propagating representations, so a small CNN is built to transform the computed optical flow.
This network is called FlowCNN and is denoted \Lambda (F_t).
The input to FlowCNN is an 11-channel tensor formed by concatenating the 2-channel flow, the previous and current frame images, and the difference between the two frames (2 + 3 + 3 + 3 = 11 channels for RGB frames).
The network uses four 3x3 convolution layers with ReLU, with 16, 32, 2, and 2 output channels respectively. The output of the third layer is concatenated with the input optical flow through a skip connection and fed to the last layer, which outputs the transformed flow.
The accompanying figure shows the result.
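To make the channel bookkeeping of FlowCNN concrete, here is a minimal sketch that lists the input and output channels of each layer, including the skip connection before the last layer. It assumes 3-channel (RGB) frames and one bias per output channel; the function name `flowcnn_layer_shapes` and the weight counts are illustrative and not taken from the paper.

```python
def flowcnn_layer_shapes(flow_ch=2, image_ch=3, kernel=3):
    """List (in_channels, out_channels, weight_count) for each FlowCNN layer."""
    # Input: flow + previous frame + current frame + frame difference.
    in_ch = flow_ch + 3 * image_ch
    outputs = [16, 32, 2, 2]
    layers = []
    for i, out_ch in enumerate(outputs):
        if i == len(outputs) - 1:
            # Skip connection: the input flow is concatenated to the
            # third layer's output before the last layer.
            in_ch += flow_ch
        # Kernel weights plus biases (the bias term is an assumption).
        weights = kernel * kernel * in_ch * out_ch + out_ch
        layers.append((in_ch, out_ch, weights))
        in_ch = out_ch
    return layers


for in_ch, out_ch, weights in flowcnn_layer_shapes():
    print(f"{in_ch:>2} -> {out_ch:>2} channels, {weights} weights")
```

Under these assumptions the whole network has only a few thousand weights, which fits the description of FlowCNN as a small CNN.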

Warping Representations

Using the transformed flow, the previous frame's representation is warped so that it aligns with the current frame's representation, giving the warped representation \hat{\mathbf{z}}_{(t-1)}.

\hat{\mathbf{z}}_{(t-1)} = Warp(\mathbf{z}_{(t-1)}, \Lambda (F_t))

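Here the Warp operation is computed with bilinear interpolation, which is differentiable except where the flow values are integers. Therefore \epsilon = 0.0001 is added to the transformed flow to remove the non-differentiable cases.
Because of strided pooling, the transformed optical flow at image resolution does not match the resolution of the CNN representation, so the flow is strided to bring the two to the same resolution.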
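The Warp operation can be sketched in plain Python for a single channel. This is a minimal illustration, not the authors' implementation: the border handling (clamping to the grid), the helper names `bilinear_sample` and `warp`, and the toy values are all assumptions, and the flow is taken to be already at the resolution of the representation.

```python
import math

EPS = 1e-4  # small offset added to the transformed flow, as described above


def bilinear_sample(channel, x, y):
    """Bilinearly interpolate a 2D grid (list of rows) at real-valued (x, y)."""
    h, w = len(channel), len(channel[0])
    # Clamp so out-of-range positions reuse border values (an assumption).
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    top = (1 - ax) * channel[y0][x0] + ax * channel[y0][x1]
    bottom = (1 - ax) * channel[y1][x0] + ax * channel[y1][x1]
    return (1 - ay) * top + ay * bottom


def warp(prev_channel, flow):
    """Warp one channel of z_{t-1}: output(x, y) = prev(x + u, y + v)."""
    h, w = len(prev_channel), len(prev_channel[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            u, v = flow[y][x]
            row.append(bilinear_sample(prev_channel, x + u + EPS, y + v + EPS))
        out.append(row)
    return out


# Toy 2x3 channel and a constant half-pixel horizontal flow.
prev = [[0.0, 1.0, 2.0],
        [3.0, 4.0, 5.0]]
flow = [[(0.5, 0.0)] * 3 for _ in range(2)]
warped = warp(prev, flow)
print(warped)
```

With a half-pixel flow, each output value lies roughly halfway between two horizontal neighbours of the previous representation, apart from the border column, which is clamped.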

Combination of Representations

The warped representation \hat{\mathbf{z}}_{(t-1)} computed from the previous frame and the current frame's representation \mathbf{z}_t are linearly combined to give the final representation \tilde{\mathbf{z}}_t.

\tilde{\mathbf{z}}_t = \mathbf{w}_1 \odot \mathbf{z}_t + \mathbf{w}_2 \odot \hat{\mathbf{z}}_{(t-1)}

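Here \mathbf{w}_1 and \mathbf{w}_2 are weight vectors whose length equals the number of channels of the representation. Each entry multiplies the corresponding channel, and the values are learned during training.
The resulting representation is passed on to the next CNN layer.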
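The per-channel combination can be sketched as follows. Representations are stored here as nested lists (channels, then rows), and the function name `combine` and all numeric values are made up for illustration.

```python
def combine(z_t, z_prev_warped, w1, w2):
    """Per-channel linear combination: w1[c] * z_t[c] + w2[c] * z_prev_warped[c].

    Each representation is a list of channels; each channel is a list of rows.
    """
    combined = []
    for c, (cur, prev) in enumerate(zip(z_t, z_prev_warped)):
        combined.append([
            [w1[c] * a + w2[c] * b for a, b in zip(cur_row, prev_row)]
            for cur_row, prev_row in zip(cur, prev)
        ])
    return combined


# Two channels of size 1x2 with made-up values.
z_t = [[[1.0, 2.0]], [[3.0, 4.0]]]
z_hat_prev = [[[10.0, 20.0]], [[30.0, 40.0]]]

# Channel 0 averages both frames; channel 1 ignores the previous frame.
z_tilde = combine(z_t, z_hat_prev, w1=[0.5, 1.0], w2=[0.5, 0.0])
print(z_tilde)
```

Because each channel has its own pair of weights, training can decide per channel how much of the previous frame's information to carry over.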

Usage and Training

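NetWarp is trained end-to-end and can be applied to DNNs with different architectures.
The CNN shares the same weights across frames, and because of memory limits only one previous frame is used during training.
The combination weights are initialized to \mathbf{w}_1 = 1, \mathbf{w}_2 = 0 before training.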
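One consequence of this initialization is that the video CNN initially reproduces the per-frame output of the image CNN. The toy loop below illustrates this under several simplifying assumptions: `image_layer` is a hypothetical stand-in for the shared CNN layers, the weights are scalars rather than per-channel vectors, the warp defaults to the identity, and the representation carried to the next frame is the one computed before combination.

```python
def image_layer(frame, weight=2.0):
    """Stand-in for the shared image-CNN layers; the same weight serves every frame."""
    return [weight * p for p in frame]


def run_video(frames, w1=1.0, w2=0.0, warp=lambda z: z):
    """Process frames in order, mixing each representation with the previous one."""
    outputs, z_prev = [], None
    for frame in frames:
        z_t = image_layer(frame)
        if z_prev is None:
            z_tilde = z_t  # the first frame has no predecessor
        else:
            z_hat = warp(z_prev)
            z_tilde = [w1 * a + w2 * b for a, b in zip(z_t, z_hat)]
        outputs.append(z_tilde)
        z_prev = z_t  # assumption: carry the uncombined representation
    return outputs


frames = [[1.0, 2.0], [3.0, 4.0]]
per_frame = [image_layer(f) for f in frames]

# With w1 = 1 and w2 = 0 the video pipeline matches the image CNN exactly.
print(run_video(frames) == per_frame)
```

Starting from the image CNN's behaviour means training only has to learn how much temporal information to mix in, rather than relearning the segmentation task.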

Paper

Semantic Video CNNs through Representation Warping
