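Computer Vision - Super Resolution - Paper Review

Keywords: Computer Vision, Super Resolution, SRCNN

Super resolution (SR) is one of the popular topics in computer vision. In the figure above, the image in the middle has a relatively low resolution; after it passes through a super-resolution model, we get a higher-resolution image like the one on the right. Pretty amazing, right?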

A. Promise

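  1. Grasp the key points of the SRCNN paper.

B. Introduction of paper

To make it easy to cross-reference the paper, the structure of this Medium post follows the structure of the paper. We recommend spending a moment on C. Outline to get a quick overview of how the article is organized.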

C. Outline

0 Abstract
1 Introduction
2 Related Work
→ 2.1 Image Super-Resolution
→ 2.2 Convolutional Neural Networks
→ 2.3 Deep Learning for Image Restoration
3 Convolutional Neural Networks For Super-Resolution
→ 3.1 Formulation
→ 3.2 Relationship to Sparse-Coding-Based Methods
→ 3.3 Training
4 Experiments
→ 4.1 Training Data
→ 4.2 Learned Filters for Super-Resolution
→ 4.3 Model and Performance Trade-offs
→ 4.4 Comparisons to State-of-the-Arts
→ 4.5 Experiments on Color Channels

5 Conclusions

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

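We put the key points, together with our own understanding, into several of the sections below. If you have any thoughts, please let us know in the comments. Now, let's explore the SRCNN paper!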

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

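0. Abstract

  • The team proposes a deep learning method for super resolution. → This is SRCNN.
  • The method is a CNN-based model: the input is a low-resolution image and the output is a high-resolution image.
  • Traditional sparse-coding-based SR methods can also be viewed as a deep CNN.
  • The model is quite lightweight while still producing high-quality output.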

1. Introduction

The paper restates the mission of super resolution in CV as follows:

Single image super-resolution (SR), which aims at recovering a high-resolution image from a single low-resolution image, is a classical problem in computer vision.

The sparse-coding-based method has four steps:

  1. Overlapping patches are densely cropped from the input image and pre-processed (e.g., subtracting mean and normalization)
  2. These patches are then encoded by a low-resolution dictionary.
  3. The sparse coefficients are passed into a high-resolution dictionary for reconstructing high-resolution patches.
  4. The overlapping reconstructed patches are aggregated (e.g., by weighted averaging) to produce the final output.

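Our understanding:
The authors note that the sparse-coding-based method was the strategy commonly used by traditional approaches at the time. If we look closely at these four steps, however, they resemble what a CNN computes. In a CNN, we process one window at a time (corresponding to a patch above) and convolve that window with a kernel/filter (corresponding to the encoding step above). We then move on to the next window of the input image until every window has been processed, and the results are accumulated (aggregated) in the output feature map.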
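To make this correspondence concrete, here is a minimal NumPy sketch (our own illustration, not code from the paper). It densely crops overlapping patches, projects each flattened patch onto a small bank of filters, and checks that the result matches a plain sliding-window convolution. The image size, the four 3 x 3 filters, and all function names are assumptions made up for this example.

```python
import numpy as np

def extract_patches(image, f):
    """Densely crop all overlapping f x f patches (stride 1) and flatten each one."""
    h, w = image.shape
    rows, cols = h - f + 1, w - f + 1
    patches = np.empty((rows, cols, f * f))
    for i in range(rows):
        for j in range(cols):
            patches[i, j] = image[i:i + f, j:j + f].ravel()
    return patches

def encode_patches(patches, filters):
    """Project every flattened patch onto n filters: one n-dim vector per patch."""
    n = filters.shape[0]
    return patches @ filters.reshape(n, -1).T  # shape: (rows, cols, n)

def sliding_window_conv(image, kernel):
    """Sliding-window convolution as CNNs compute it (cross-correlation, no padding)."""
    f = kernel.shape[0]
    rows, cols = image.shape[0] - f + 1, image.shape[1] - f + 1
    out = np.zeros((rows, cols))
    for di in range(f):
        for dj in range(f):
            out += kernel[di, dj] * image[di:di + rows, dj:dj + cols]
    return out

rng = np.random.default_rng(0)
image = rng.random((12, 12))
filters = rng.normal(size=(4, 3, 3))  # 4 hypothetical 3 x 3 filters

codes = encode_patches(extract_patches(image, 3), filters)
feature_map_0 = sliding_window_conv(image, filters[0])
print(codes.shape)                                # (10, 10, 4)
print(np.allclose(codes[..., 0], feature_map_0))  # True
```

Each patch ends up represented by a 4-dimensional vector, and slicing out one component of those vectors gives exactly the feature map produced by the corresponding filter.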

Inspired by this, the authors use a CNN as the computational basis. In their words:

We consider a convolutional neural network that directly learns an end-to-end mapping between low- and high-resolution images

They then name the model, and SRCNN is born.

We name the proposed model Super-Resolution Convolutional Neural Network (SRCNN)

The advantages of SRCNN listed by the authors are as follows:

  • Its structure is intentionally designed with simplicity in mind, and yet provides superior accuracy compared with state-of-the-art example-based methods.
  • With moderate number of filters and layers, our method achieves fast speed for practical on-line usage even on a CPU.
  • Experiments show that the restoration quality of the network can be further improved when (i) larger and more diverse datasets are available and/ or (ii) a larger and deeper model is used.

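Our understanding:
More data with a richer distribution usually yields a better-performing model, and this generally holds. The second point, however, does not necessarily hold when we consider the well-known deep learning models as a whole. The claim that a deeper and larger model performs better is made specifically for SRCNN in this paper, and it may not carry over to other well-known neural networks such as ResNet.

The Introduction closes by restating the contributions of the paper: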

1. We present a fully convolutional neural network for image super-resolution.

2. We establish a relationship between our deep learning-based SR method and the traditional sparse-coding-based SR methods.

3. We demonstrate that deep learning is useful in the classical computer vision problem of super resolution, and can achieve good quality and speed.

2. Related Work

Following the paper's description, we have organized the cited papers by category at the bottom of this article.

3. Convolutional Neural Networks For Super-Resolution (CNN for SR)

3.1 Formulation

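Before feeding the low-resolution image into the network, the authors first use bicubic interpolation to resize it to the target size (like the image on the left of the figure above). Note that this is the only pre-processing applied to the image in the paper.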
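As a rough sketch of this pre-processing step, the snippet below upscales a small grayscale array with bicubic interpolation. It assumes NumPy and Pillow are installed; the helper name `bicubic_upscale`, the random 16 x 16 input, and the scale factor 3 are our own choices for illustration. Pillow's bicubic filter is not guaranteed to match the implementation the authors used pixel for pixel.

```python
import numpy as np
from PIL import Image

def bicubic_upscale(low_res, scale):
    """Resize an 8-bit grayscale array to `scale` times its size with bicubic interpolation."""
    img = Image.fromarray(low_res)  # 2-D uint8 array becomes a mode "L" image
    w, h = img.size                 # Pillow reports (width, height)
    up = img.resize((w * scale, h * scale), resample=Image.BICUBIC)
    return np.asarray(up)

# Hypothetical low-resolution input: random 16 x 16 grayscale pixels.
low_res = np.random.default_rng(0).integers(0, 256, size=(16, 16), dtype=np.uint8)
y = bicubic_upscale(low_res, scale=3)
print(low_res.shape, "->", y.shape)  # (16, 16) -> (48, 48)
```

The output `y` already has the target size but is still blurry; it plays the role of the network input Y in the sections that follow.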

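Conceptually, three operations are then applied to the upscaled (but still low-resolution) input image: (1) Patch extraction and representation, (2) Non-linear mapping, and (3) Reconstruction. Each one is covered in its own subsection below.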

3.1.1 Patch extraction and representation

  • This operation extracts (overlapping) patches from the low-resolution image Y and represents each patch as a high-dimensional vector.
  • These vectors comprise a set of feature maps, of which the number equals the dimensionality of the vectors.
  • W1 and B1 represent the filters and biases respectively, and * denotes the convolution operation.
  • W1 corresponds to n1 filters of support c x f1 x f1, where c is the number of channels in the image, f1 is the spatial size of a filter.
  • Intuitively, W1 applies n1 convolutions on the image, and each convolution has a kernel size c x f1 x f1.
  • B1 is an n1-dimensional vector.
  • Apply ReLU.
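For reference, the first operation as we read it from the paper can be written as follows, using the symbols defined in the bullets above, with max(0, ·) being the ReLU:

```latex
F_1(\mathbf{Y}) = \max\left(0,\; W_1 * \mathbf{Y} + B_1\right) \tag{1}
```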

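Our understanding:
Looking at Eq. (1), the outermost operation is an activation function, ReLU. Inside the ReLU is an ordinary linear operation: the input image Y is convolved with W1, which contains multiple filters, and a bias term is added. This is what the patch extraction and representation step does.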

3.1.2 Non-linear mapping

  • This operation nonlinearly maps each high-dimensional vector onto another high-dimensional vector.
  • Each mapped vector is conceptually the representation of a high-resolution patch.
  • These vectors comprise another set of feature maps.
  • W2 contains n2 filters of size n1 x f2 x f2, and B2 is n2-dimensional.
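For reference, the second operation as we read it from the paper has the same shape as the first one, now applied to the output of the first layer:

```latex
F_2(\mathbf{Y}) = \max\left(0,\; W_2 * F_1(\mathbf{Y}) + B_2\right) \tag{2}
```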

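Our understanding:
Looking at Eq. (2), its form is very similar to Eq. (1). In Eq. (2), the authors pass the signal through the ReLU non-linearity once more to extract more abstract information. This high-level abstract information becomes the input of the next step, Reconstruction. (Each input patch corresponds to n2 pieces of abstract information.)

It is easy to see that this operation is consistent with the well-known CNNs: after an image is fed into the network, the signal passes through layer after layer of convolution and pooling. The number of layers is generally greater than 3, which is why these networks are called deep.

SRCNN likewise uses multiple convolutional layers. (No pooling layers are used, presumably because the task is super resolution and pooling would reduce the resolution.) The authors mention that more convolutional layers could be added to increase the non-linearity, at the cost of a longer training time, as quoted below.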

It is possible to add more convolutional layers to increase the non-linearity. But this can increase the complexity of the model, and thus demands more training time. We will explore deeper structures by introducing additional non-linear mapping layers in Section 4.3.3.

3.1.3 Reconstruction

  • This operation aggregates the above high-resolution patch-wise representations to generate the final high-resolution image.
  • This image is expected to be similar to the ground truth X.
  • W3 corresponds to c filters of a size n2 x f3 x f3, and B3 is a c-dimensional vector.
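For reference, the reconstruction step as we read it from the paper is a plain convolution with no ReLU around it:

```latex
F(\mathbf{Y}) = W_3 * F_2(\mathbf{Y}) + B_3 \tag{3}
```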

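Our understanding:
Traditionally, after obtaining the n2 pieces of high-level abstract information, the value of the final high-resolution patch is obtained by averaging (the goal being to get as close to the ground truth X as possible). Averaging can be regarded as a pre-defined filter whose values are all identical, so every contribution is weighted equally. Since it is a filter anyway, the authors were inspired to use a convolution with trainable filters instead, letting the filters compute a weighted combination of the high-level abstract information. Written mathematically, this is Eq. (3).

Here the authors formally connect the three equations above to a CNN: they are essentially the same concept (see below).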

Interestingly, although the above three operations are motivated by different intuitions, they all lead to the same form as a convolutional layer. We put all three operations together and form a convolutional neural network.
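Putting the three operations together, the whole network is just three convolutions with ReLU after the first two. The NumPy sketch below is our own minimal illustration, not the authors' implementation. It assumes a single-channel input (c = 1), the 9-1-5 filter sizes with n1 = 64 and n2 = 32 that appear later in this article, no padding, and small random weights; the function names are made up for this example.

```python
import numpy as np

def conv2d_valid(x, w, b):
    """Unpadded convolution (cross-correlation).
    x: (c_in, H, W), w: (c_out, c_in, f, f), b: (c_out,)."""
    c_out, c_in, f, _ = w.shape
    h, wd = x.shape[1] - f + 1, x.shape[2] - f + 1
    out = np.zeros((c_out, h, wd))
    for i in range(f):
        for j in range(f):
            # Contribution of kernel tap (i, j) to every output position.
            out += np.einsum("oc,chw->ohw", w[:, :, i, j], x[:, i:i + h, j:j + wd])
    return out + b[:, None, None]

def init_params(c=1, n1=64, n2=32, f1=9, f2=1, f3=5, std=1e-3, seed=0):
    """Small random weights and zero biases for the three layers (assumed settings)."""
    rng = np.random.default_rng(seed)
    shapes = [(n1, c, f1, f1), (n2, n1, f2, f2), (c, n2, f3, f3)]
    return [(rng.normal(0.0, std, size=s), np.zeros(s[0])) for s in shapes]

def srcnn_forward(y, params):
    (w1, b1), (w2, b2), (w3, b3) = params
    f1 = np.maximum(0.0, conv2d_valid(y, w1, b1))   # patch extraction and representation
    f2 = np.maximum(0.0, conv2d_valid(f1, w2, b2))  # non-linear mapping
    return conv2d_valid(f2, w3, b3)                 # reconstruction (no ReLU)

y = np.random.default_rng(1).random((1, 33, 33))    # hypothetical upscaled input patch
out = srcnn_forward(y, init_params())
print(out.shape)  # (1, 21, 21)
```

Because no padding is used in this sketch, the output is smaller than the input: each layer trims f - 1 pixels, so a 33 x 33 input becomes 21 x 21.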

3.2 Relationship to Sparse-Coding-Based Methods

Background:
Sparse-coding-based methods were the traditional approach to super resolution at the time (see references [49] and [50]). In this subsection the authors try to connect these methods to CNNs.

  • Viewed as a non-linear mapping operator, the sparse-coding solver has a spatial support of 1 x 1.
  • However, the sparse-coding solver is an iterative algorithm, not a feed-forward one.
    (In other words, it is like using two loops to run the computation on every pixel.)
  • On the contrary, our non-linear operator is fully feed-forward and can be computed efficiently.

Our understanding:
Although sparse-coding-based methods can be viewed as a CNN in a narrow sense, they still differ from SRCNN. The difference pointed out by the authors is as follows.

Not all operations have been considered in the optimization in the sparse-coding-based SR methods. On the contrary, in our convolutional neural network, the low-resolution dictionary, high-resolution dictionary, non-linear mapping, together with mean subtraction and averaging, are all involved in the filters to be optimized.

3.3 Training

  • Network Parameters Θ ={W1, W2, W3, B1, B2, B3}
  • The filter weights of each layer are initialized by drawing randomly from a Gaussian distribution with zero mean and standard deviation 0.001 (and 0 for biases).
  • Use the loss between the reconstructed images F(Y;Θ) and the corresponding ground truth high-resolution images X.
  • Use Mean Square Error (MSE) as the loss function.
  • n is the number of the training examples.
  • The loss is minimized using stochastic gradient descent with standard backpropagation.
  • The learning rate is 1e-4 for the first two layers, and 1e-5 for the last layer. They empirically find that a smaller learning rate in the last layer is important for the network to converge.
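The training setup above can be sketched in a few lines of NumPy. The loss follows the form L(Θ) = (1/n) Σ ||F(Y_i; Θ) − X_i||². This is only an illustration under our own assumptions: the parameter shapes use c = 1, n1 = 64, n2 = 32 and 9-1-5 filters, the biases share the learning rate of their layer, and the gradients are placeholders rather than real backpropagated values.

```python
import numpy as np

def init_params(c=1, n1=64, n2=32, f1=9, f2=1, f3=5, std=1e-3, seed=0):
    """Gaussian(0, std) weights and zero biases, keyed as in Theta = {W1, W2, W3, B1, B2, B3}."""
    rng = np.random.default_rng(seed)
    shapes = {"W1": (n1, c, f1, f1), "W2": (n2, n1, f2, f2), "W3": (c, n2, f3, f3)}
    params = {k: rng.normal(0.0, std, size=s) for k, s in shapes.items()}
    params.update({"B1": np.zeros(n1), "B2": np.zeros(n2), "B3": np.zeros(c)})
    return params

# 1e-4 for the first two layers, 1e-5 for the last layer.
# Assumption: each bias uses the same learning rate as its layer's weights.
LEARNING_RATES = {"W1": 1e-4, "B1": 1e-4, "W2": 1e-4, "B2": 1e-4, "W3": 1e-5, "B3": 1e-5}

def mse_loss(pred, target):
    """(1/n) * sum_i ||F(Y_i) - X_i||^2 for batches of shape (n, c, h, w)."""
    n = pred.shape[0]
    return float(np.sum((pred - target) ** 2) / n)

def sgd_step(params, grads, lrs=LEARNING_RATES):
    """One plain SGD update with a separate learning rate per layer."""
    return {k: params[k] - lrs[k] * grads[k] for k in params}

params = init_params()
# Placeholder gradients of all ones, only to show the per-layer step sizes.
grads = {k: np.ones_like(v) for k, v in params.items()}
updated = sgd_step(params, grads)
print(np.allclose(params["W1"] - updated["W1"], 1e-4))  # True
print(np.allclose(params["W3"] - updated["W3"], 1e-5))  # True
print(mse_loss(np.ones((2, 1, 3, 3)), np.zeros((2, 1, 3, 3))))  # 9.0
```

In a real training loop the gradients would come from backpropagation through the three layers; the point here is only the initialization, the loss, and the smaller step size on the last layer.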

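Our understanding:
The authors mention that using MSE as the loss function favors a high PSNR. PSNR is a commonly used evaluation metric (see below); other commonly used metrics include SSIM and MSSIM.

Evaluation metric: PSNR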
The PSNR is a widely-used metric for quantitatively evaluating image restoration quality, and is at least partially related to the perceptual quality.
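To see why minimizing MSE favors PSNR, recall the standard definition PSNR = 10 · log10(MAX² / MSE), where MAX is the largest possible pixel value. A minimal NumPy sketch follows; it assumes 8-bit images (MAX = 255), and the function name and toy inputs are our own.

```python
import numpy as np

def psnr(reference, restored, max_val=255.0):
    """Peak signal-to-noise ratio in dB; higher means closer to the reference."""
    ref = np.asarray(reference, dtype=np.float64)
    res = np.asarray(restored, dtype=np.float64)
    mse = np.mean((ref - res) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_val ** 2 / mse))

x = np.full((8, 8), 100, dtype=np.uint8)
print(round(psnr(x, x + 1), 2))   # 48.13
print(round(psnr(x, x + 10), 2))  # 28.13
```

Since PSNR decreases monotonically as MSE grows, a lower MSE always means a higher PSNR.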

4. Experiments

This section covers experiments along three directions.

  • We first investigate the impact of using different datasets on the model performance.
  • Next, we examine the filters learned by our approach.
  • Then, we explore different architecture designs of the network, and study the relationship between super-resolution performance and factors like depth, number of filters, and filter sizes.

The results are also compared with the state of the art at the time (using evaluation metrics such as PSNR and SSIM).

4.1 Training Data

Training SRCNN on the larger ImageNet dataset yields a higher PSNR, which is as expected. However, the authors note that using ImageNet does not lift the performance to a whole new level. A likely reason is that the original 91 images are already enough for SRCNN to learn how to do super resolution well; after all, SRCNN is a lightweight network (see below).

The results positively indicate that SRCNN performance may be further boosted using a larger training set, but the effect of big data is not as impressive as that shown in high-level vision problems. This is mainly because that the 91 images have already captured sufficient variability of natural images. On the other hand, our SRCNN is a relatively small network (8,032 parameters), which could not overfit the 91 images.
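The figure of 8,032 parameters can be checked with simple arithmetic. One configuration that reproduces it is a single input channel (c = 1), n1 = 64, n2 = 32 and filter sizes 9-1-5, counting filter weights only and leaving out the biases. These settings are our assumption for the sketch below.

```python
def srcnn_weight_count(c=1, n1=64, n2=32, f1=9, f2=1, f3=5):
    """Number of filter weights in the three layers (biases not counted)."""
    layer1 = n1 * c * f1 * f1    # 64 * 1 * 81  = 5184
    layer2 = n2 * n1 * f2 * f2   # 32 * 64 * 1  = 2048
    layer3 = c * n2 * f3 * f3    # 1 * 32 * 25  = 800
    return layer1 + layer2 + layer3

print(srcnn_weight_count())  # 8032
```

For comparison, well-known image classification networks have parameter counts in the millions, which helps explain why 91 images are already enough here.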

4.2 Learned Filters for Super-Resolution

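Our understanding:
The authors point out an interesting fact: although these filters are obtained purely through repeated iterative updates by backpropagation (they are not hand-crafted or human-defined), some of the learned filters closely resemble filters defined by humans (see below).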

Interestingly, each learned filter has its specific functionality. For instance, the filters g and h are like Laplacian/Gaussian filters, the filters a — e are like edge detectors at different directions, and the filter f is like a texture extractor.
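For readers who have not met these hand-crafted filters before, here is a small NumPy sketch of textbook versions of them: a 3 x 3 Laplacian, a horizontal-gradient Sobel edge detector, and a normalized Gaussian. These are common definitions chosen by us for illustration; they are not the filters learned by SRCNN.

```python
import numpy as np

# Textbook hand-crafted kernels (not taken from the paper).
laplacian = np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]], dtype=float)
sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)

def gaussian_kernel(size=5, sigma=1.0):
    """Normalized 2-D Gaussian built as the outer product of a 1-D Gaussian."""
    ax = np.arange(size) - (size - 1) / 2.0
    g = np.exp(-(ax ** 2) / (2.0 * sigma ** 2))
    k = np.outer(g, g)
    return k / k.sum()

gaussian = gaussian_kernel()
flat = np.ones((3, 3))                       # region with no structure
step = np.array([[0, 0, 1]] * 3, dtype=float)  # vertical edge

print(np.sum(laplacian * flat), np.sum(sobel_x * flat))  # 0.0 0.0
print(np.sum(sobel_x * step))                            # 4.0
print(round(float(gaussian.sum()), 6))                   # 1.0
```

The edge and Laplacian kernels give no response on a flat region and a clear response on an edge, while the Gaussian is a smoothing filter whose weights sum to one.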

4.3 Model and Performance Trade-offs

4.3.1 Filter number

  • In general, the performance would improve if we increase the network width, i.e., adding more filters, at the cost of running time.
  • We conduct two experiments: (i) one is with a larger network with n1 = 128 and n2 = 64. (ii) The other is with a smaller network with n1 = 32 and n2 = 16.

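Our understanding:
A higher PSNR is better, so more parameters do give better overall performance. The cost is more computation time (both training time and feed-forward time). It depends on the application; time and performance are often two factors that pull against each other.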

4.3.2 Filter size

  • In this section, we examine the network sensitivity to different filter sizes.
  • We fix the filter sizes f1 = 9 and f3 = 5, and enlarge the filter size of the second layer to be (i) f2 = 1 (9-1-5), (ii) f2 = 3 (9-3-5), (iii) f2 = 5 (9-5-5).
  • Convergence curves in the above figure show that using a larger filter size could significantly improve the performance.
  • The results suggest that utilizing neighborhood information in the mapping stage is beneficial.

Our understanding:
The figure above shows a trend: larger filter sizes perform better. Again, the time cost has to be considered, since a larger filter size takes more time in both training and feed-forward.
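One way to picture "neighborhood information" is the receptive field, i.e., how many input pixels can influence a single output pixel. For stacked stride-1 convolutions it grows by f - 1 per layer. The helper below is our own small sketch of that arithmetic for the three configurations compared in this section.

```python
def receptive_field(filter_sizes):
    """Side length of the input region seen by one output pixel (stride 1, no dilation)."""
    return 1 + sum(f - 1 for f in filter_sizes)

for sizes in [(9, 1, 5), (9, 3, 5), (9, 5, 5)]:
    print(sizes, receptive_field(sizes))
# (9, 1, 5) 13
# (9, 3, 5) 15
# (9, 5, 5) 17
```

With f2 = 1 the mapping layer only looks at one spatial position of the first-layer features, while f2 = 3 or 5 lets it combine neighboring positions, at the cost of more weights in that layer.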

4.3.3 Number of layers

  • We try deeper structures by adding another non-linear mapping layer.
  • We conduct 3 controlled experiments, i.e., (i) 9-1-1-5, (ii) 9-3-1-5, (iii) 9-5-1-5.

Food for thought:
The four-layer structures turn out to be no better; they roughly just match the three-layer SRCNN. The authors' discussion follows.

We can observe that the four-layer networks converge slower than the three-layer network. Nevertheless, given enough training time, the deeper networks will finally catch up and converge to the three-layer ones.

So the conclusion we can draw from the experiments in 4.3.3 Number of layers is:

It’s NOT the deeper the better in this deep model for super-resolution.

4.4 Comparisons to State-of-the-Arts

SRCNN settings used:

A three-layer network with f1 = 9, f2 = 5, f3 = 5, n1 = 64, and n2 = 32 trained on the ImageNet.

Models/methods compared with SRCNN (the reference numbers [Num] can be found in the reference list at the bottom, for readers who want to learn more):

  • SC: sparse coding-based method of Yang et al. [50]
  • NE+LLE: neighbour embedding + locally linear embedding method [4]
  • ANR: Anchored Neighbourhood Regression method [41]
  • A+: Adjusted Anchored Neighbourhood Regression method [42]
  • KK: the method described in [25]

Here we can see that SRCNN outperforms the other methods. Next, let's look at the results that each of these methods produces.

4.5 Experiments on Color Channels

This part mainly discusses how using different color channels leads to different results. For details, please refer to Section 4.5 of the paper.

5. Conclusions

A recap of the key points from the previous sections, restating the value of SRCNN:

  • We have presented a novel deep learning approach for single image super-resolution (SR)
  • We show that conventional sparse-coding-based SR methods can be reformulated into a deep convolutional neural network. → Section 3.2
  • With a lightweight structure, the SRCNN has achieved superior performance than the state-of-the-art methods. → Section 4.4
  • We conjecture that additional performance can be further gained by exploring more filters and different training strategies.
  • The proposed structure, with its advantages of simplicity and robustness, could be applied to other low-level vision problems, such as image deblurring or simultaneous SR+denoising.
  • One could also investigate a network to cope with different upscaling factors.

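That's it! Those are the key points of the SRCNN model and paper.

We hope this article was helpful. If you have any thoughts, feel free to leave a comment and we will reply as soon as we can. Thank you in advance for your feedback! We are AI.FREE, and we will see you in the next article!

Online Resources: web resources used in this article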

  1. [Link] Image Super-Resolution Using Deep Convolutional Networks
  2. [Link] Github Repo: kunal-visoulia/Image-Restoration-using-SRCNN
  3. [Link] SRCNN Cover

2. Related Work (reference list)

2.1 Image Super-Resolution

Example-based methods
[46] Yang, C.Y., Ma, C., Yang, M.H.: Single-image super-resolution: A benchmark.
[16] Glasner, D., Bagon, S., Irani, M.: Super-resolution from a single image.
[25] Kim, K.I., Kwon, Y.: Single-image super-resolution using sparse regression and natural image prior.

External example-based methods
[2] Bevilacqua, M., Roumy, A., Guillemot, C., Morel, M.L.A.: Low complexity single-image super-resolution based on nonnegative neighbor embedding.
[4] Chang, H., Yeung, D.Y., Xiong, Y.: Super-resolution through neighbor embedding.
[6] Dai, D., Timofte, R., Van Gool, L.: Jointly optimized regressors for image super-resolution.
[15] Freeman, W.T., Pasztor, E.C., Carmichael, O.T.: Learning low level vision.
[37] Schulter, S., Leistner, C., Bischof, H.: Fast and accurate image upscaling with super-resolution forests.
[41] Timofte, R., De Smet, V., Van Gool, L.: Anchored neighborhood regression for fast example-based super-resolution.
[42] Timofte, R., De Smet, V., Van Gool, L.: A+: Adjusted anchored neighborhood regression for fast super-resolution.
[48] Yang, J., Wang, Z., Lin, Z., Cohen, S., Huang, T.: Coupled dictionary training for image super-resolution.
[49] Yang, J., Wright, J., Huang, T., Ma, Y.: Image super-resolution as sparse representation of raw image patches.
[50] Yang, J., Wright, J., Huang, T.S., Ma, Y.: Image super-resolution via sparse representation.
[51] Zeyde, R., Elad, M., Protter, M.: On single image scale-up using sparse-representations.

2.2 Convolutional Neural Networks

[27] LeCun, Y., Boser, B., Denker, J.S., Henderson, D., Howard, R.E., Hubbard, W., Jackel, L.D.: Backpropagation applied to handwritten zip code recognition.

Image Classification
[18] He, K., Zhang, X., Ren, S., Sun, J.: Spatial pyramid pooling in deep convolutional networks for visual recognition.
[26] Krizhevsky, A., Sutskever, I., Hinton, G.: ImageNet classification with deep convolutional neural networks.

Object Detection
[34] Ouyang, W., Luo, P., Zeng, X., Qiu, S., Tian, Y., Li, H., Yang, S., Wang, Z., Xiong, Y., Qian, C., et al.: Deepid-net: multi-stage and deformable deep convolutional neural networks for object detection.
[40] Szegedy, C., Reed, S., Erhan, D., Anguelov, D.: Scalable, high quality object detection.
[52] Zhang, N., Donahue, J., Girshick, R., Darrell, T.: Part-based R-CNNs for fine-grained category detection.

Face Recognition
[39] Sun, Y., Chen, Y., Wang, X., Tang, X.: Deep learning face representation by joint identification-verification.

Pedestrian Detection
[35] Ouyang, W., Wang, X.: Joint deep learning for pedestrian detection.

2.3 Deep Learning for Image Restoration

Multi-layer Perceptron (MLP)
[3] Burger, H.C., Schuler, C.J., Harmeling, S.: Image denoising: Can plain neural networks compete with BM3D?
[36] Schuler, C.J., Burger, H.C., Harmeling, S., Scholkopf, B.: A machine learning approach for non-blind image deconvolution.

Convolutional Neural Network (CNN)
[22] Jain, V., Seung, S.: Natural image denoising with convolutional networks.
[12] Eigen, D., Krishnan, D., Fergus, R.: Restoring an image taken through a window covered with dirt or rain.
[5] Cui, Z., Chang, H., Shan, S., Zhong, B., Chen, X.: Deep network cascade for image super-resolution.
[16] Glasner, D., Bagon, S., Irani, M.: Super-resolution from a single image.