
GAN-based Super-Resolution for Image Enhancement using Multiple Loss Function

DOI : https://doi.org/10.5281/zenodo.19594560

Ishan Lodwal, Sherry Verma, Dev Singh Ahluwalia, Ananya Khurana

School of Engineering and Technology, Sushant University

Gurgaon, India

Abstract – Single-image super-resolution (SISR) aims to reconstruct high-resolution (HR) images from low-resolution (LR) inputs and is critical for applications such as medical imaging, surveillance, and remote sensing, where loss of fine details can significantly impact downstream tasks. However, conventional interpolation-based methods fail to recover high-frequency textures, resulting in overly smooth and visually degraded outputs. To address this limitation, this paper proposes a Generative Adversarial Network (GAN)-based 4× super-resolution framework inspired by ESRGAN, incorporating a Residual-in-Residual Dense Block (RRDB)-based generator and a PatchGAN discriminator to enhance texture reconstruction and perceptual realism. A hybrid loss function combining Charbonnier, perceptual, edge-aware, SSIM, and adversarial losses is designed to jointly optimize pixel-level accuracy, structural consistency, and visual fidelity, thereby overcoming the limitations of single-objective optimization. Experimental evaluation on the benchmark datasets Set5, Set14, BSD100, Urban100, General100, and Manga109 demonstrates that the proposed method achieves competitive perceptual performance, particularly in reconstructing high-frequency and structured patterns, validating the effectiveness of the proposed multi-loss GAN framework.

Keywords – Super-resolution, GAN, RRDB, perceptual loss, image enhancement.

  1. INTRODUCTION

    Images are often captured or stored at low resolutions due to limitations in imaging devices, bandwidth constraints, and storage requirements. Enhancing such images is critical for numerous applications including medical diagnostics, satellite imaging, forensic analysis, and video streaming.

    Traditional interpolation techniques such as nearest-neighbor, bilinear, and bicubic interpolation are computationally efficient but inherently incapable of reconstructing fine textures and structural details, leading to blurred outputs [1].

    Deep learning approaches, particularly convolutional neural networks (CNNs), have significantly improved super-resolution performance by learning complex mappings from LR to HR images [1]. Furthermore, Generative Adversarial Networks (GANs) enhance perceptual quality by generating realistic textures and high-frequency details [2], [3].

    In this work, a GAN-based 4× super-resolution framework is proposed to improve image quality while preserving texture consistency and structural integrity.

    The main contributions of this paper are as follows:

    1. Design of an RRDB-based generator integrated with a PatchGAN discriminator.

    2. Formulation of a hybrid loss function combining multiple perceptual and structural objectives.

    3. Extensive evaluation on multiple benchmark datasets.

    4. Comparative analysis with state-of-the-art super-resolution models.

  2. LITERATURE REVIEW

    Recent advancements in single-image super-resolution (SISR) have been largely driven by deep learning-based approaches, particularly convolutional neural networks (CNNs) and generative adversarial networks (GANs). Early methods primarily focused on minimizing pixel-wise reconstruction error, while later approaches emphasized perceptual quality and texture realism.

    Dong et al. [1] introduced SRCNN, the first end-to-end CNN-based super-resolution model, which significantly improved reconstruction accuracy in terms of PSNR. However, due to its reliance on pixel-wise loss functions, the generated images often lacked high-frequency details and appeared overly smooth.

    To address the limitations of pixel-based optimization, Ledig et al. [2] proposed SRGAN, which incorporated adversarial learning and perceptual loss functions. This approach enabled the generation of sharper and more visually realistic images, although it introduced challenges such as training instability and the presence of artifacts.

    Building upon this, Wang et al. [3] developed ESRGAN, which improved training stability and perceptual quality through architectural enhancements such as Residual-in-Residual Dense Blocks (RRDB) and refined adversarial learning strategies. ESRGAN achieved state-of-the-art perceptual performance and serves as a strong baseline for modern super-resolution methods.

    In parallel, Lim et al. [4] proposed EDSR, which focused on maximizing PSNR using deep residual networks. While EDSR demonstrated strong quantitative performance, it lacked perceptual realism due to the absence of adversarial training.

    Additionally, Johnson et al. [5] introduced perceptual loss functions based on deep feature representations, enabling better texture reconstruction, while Zhang et al. [6] proposed LPIPS as a perceptual similarity metric aligned with human visual perception.

    Despite these advancements, existing methods often struggle to simultaneously optimize pixel-level accuracy, structural consistency, and perceptual realism. Motivated by this limitation, the proposed approach builds upon ESRGAN [3] by incorporating additional structural constraints, including edge-aware and SSIM losses, to enhance texture preservation and overall image quality.

  3. METHODOLOGY

    1. System Overview

      The proposed framework follows a GAN-based architecture in which the generator produces super-resolved images from LR inputs, and the discriminator distinguishes between real HR images and generated images. This adversarial process encourages the generation of perceptually realistic outputs.
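This alternating adversarial process can be sketched as a minimal PyTorch training step. The two networks below are tiny hypothetical stand-ins (not the RRDB generator or PatchGAN discriminator defined later), used only to show how the generator and discriminator updates interleave:

```python
import torch

# Hypothetical stand-ins for the generator and discriminator described later.
G = torch.nn.Sequential(torch.nn.Upsample(scale_factor=4, mode="bicubic"),
                        torch.nn.Conv2d(3, 3, 3, padding=1))
D = torch.nn.Sequential(torch.nn.Conv2d(3, 1, 4, stride=2, padding=1))
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
bce = torch.nn.BCEWithLogitsLoss()

lr_img = torch.rand(2, 3, 24, 24)    # LR input batch
hr_img = torch.rand(2, 3, 96, 96)    # matching HR targets

# Discriminator step: real patches -> 1, generated patches -> 0.
sr = G(lr_img).detach()
d_loss = bce(D(hr_img), torch.ones_like(D(hr_img))) + bce(D(sr), torch.zeros_like(D(sr)))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator step: try to fool the discriminator (content losses are added in the full model).
sr = G(lr_img)
g_loss = bce(D(sr), torch.ones_like(D(sr)))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```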

    2. Dataset Preparation

      The model is trained on the DIV2K dataset (800 images) and Flickr2K dataset (2650 images), which were obtained via publicly available repositories on the Kaggle platform. Evaluation is conducted on standard benchmark datasets including Set5, Set14, BSD100, Urban100, General100, and Manga109.

      Low-resolution images are synthetically generated using bicubic downsampling along with additional degradations such as Gaussian blur, noise injection, and JPEG compression to simulate real-world scenarios.
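A simplified version of this degradation pipeline might look as follows; the noise level is an illustrative placeholder, and the Gaussian-blur and JPEG-compression stages of the full pipeline are omitted for brevity:

```python
import torch
import torch.nn.functional as F

def degrade(hr, scale=4, noise_sigma=0.01):
    """Synthesize an LR image from an HR tensor of shape (N, C, H, W) in [0, 1].

    Bicubic downsampling plus additive Gaussian noise; blur and JPEG stages
    of the full degradation model are omitted in this sketch.
    """
    lr = F.interpolate(hr, scale_factor=1 / scale, mode="bicubic", align_corners=False)
    lr = lr + noise_sigma * torch.randn_like(lr)   # noise injection
    return lr.clamp(0.0, 1.0)

hr = torch.rand(1, 3, 128, 128)
lr = degrade(hr)   # 4x smaller: (1, 3, 32, 32)
```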

    3. Network Architecture

      1. Generator

        The generator is based on RRDB blocks, consisting of:

        • Initial convolution layer

        • 16 RRDB blocks

        • PixelShuffle upsampling layers

        • Final reconstruction layer

        RRDB combines dense connections and residual learning to stabilize training and improve feature representation [3].
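The generator structure above can be sketched in PyTorch as follows. The feature width (64), dense-block growth channels (32), and residual scaling factor (0.2) are assumptions in the spirit of ESRGAN [3], not values reported in this paper:

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Five convolutions with dense connections, ESRGAN-style."""
    def __init__(self, nf=64, gc=32):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv2d(nf + i * gc, gc if i < 4 else nf, 3, 1, 1) for i in range(5)
        )
        self.lrelu = nn.LeakyReLU(0.2, inplace=True)

    def forward(self, x):
        feats = [x]
        for i, conv in enumerate(self.convs):
            out = conv(torch.cat(feats, dim=1))
            if i < 4:
                feats.append(self.lrelu(out))
        return x + 0.2 * out                      # residual scaling

class RRDB(nn.Module):
    """Residual-in-Residual Dense Block: three dense blocks plus a long skip."""
    def __init__(self, nf=64):
        super().__init__()
        self.blocks = nn.Sequential(DenseBlock(nf), DenseBlock(nf), DenseBlock(nf))

    def forward(self, x):
        return x + 0.2 * self.blocks(x)

class Generator(nn.Module):
    def __init__(self, nf=64, n_blocks=16, scale=4):
        super().__init__()
        self.head = nn.Conv2d(3, nf, 3, 1, 1)     # initial convolution layer
        self.body = nn.Sequential(*[RRDB(nf) for _ in range(n_blocks)])
        up = []
        for _ in range(scale // 2):               # two 2x PixelShuffle stages for 4x
            up += [nn.Conv2d(nf, nf * 4, 3, 1, 1), nn.PixelShuffle(2), nn.LeakyReLU(0.2)]
        self.up = nn.Sequential(*up)
        self.tail = nn.Conv2d(nf, 3, 3, 1, 1)     # final reconstruction layer

    def forward(self, x):
        f = self.head(x)
        return self.tail(self.up(self.body(f) + f))

# Two blocks only, to keep the sketch cheap to run; the paper uses 16.
sr = Generator(n_blocks=2)(torch.rand(1, 3, 24, 24))   # -> (1, 3, 96, 96)
```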

      2. Discriminator

        A PatchGAN-based discriminator is employed, consisting of convolutional layers with spectral normalization. It outputs a probability map that evaluates local image patches, thereby enforcing high-frequency realism.
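A minimal sketch of such a discriminator is shown below; the depth, channel widths, and kernel sizes are assumptions, since the paper only specifies the PatchGAN structure and spectral normalization:

```python
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

class PatchDiscriminator(nn.Module):
    """Strided convolutions with spectral normalization, ending in a
    1-channel map of per-patch real/fake logits."""
    def __init__(self, nf=64):
        super().__init__()
        def block(c_in, c_out, stride):
            return [spectral_norm(nn.Conv2d(c_in, c_out, 4, stride, 1)),
                    nn.LeakyReLU(0.2, inplace=True)]
        layers = block(3, nf, 2) + block(nf, nf * 2, 2) + block(nf * 2, nf * 4, 2)
        layers += [spectral_norm(nn.Conv2d(nf * 4, 1, 4, 1, 1))]   # per-patch logits
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

logits = PatchDiscriminator()(torch.rand(2, 3, 96, 96))   # a map of logits, one per patch
```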

    4. Loss Function Formulation

      To achieve both perceptual quality and structural accuracy, a hybrid loss function is used.
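Before the individual terms are detailed, the overall shape of such a hybrid objective is simply a weighted sum. The weights below are illustrative placeholders only; the paper's actual dynamic weighting schedule is described under Training Details:

```python
def total_loss(losses, weights):
    """Weighted sum of the individual loss terms."""
    return sum(weights[name] * value for name, value in losses.items())

# Illustrative weights only -- not the paper's schedule.
weights = {"charbonnier": 0.15, "perceptual": 0.6, "adversarial": 5e-3,
           "edge": 0.05, "ssim": 0.05}
loss = total_loss({"charbonnier": 0.02, "perceptual": 0.5, "adversarial": 0.7,
                   "edge": 0.1, "ssim": 0.05}, weights)
```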

      1. Charbonnier Loss

        The Charbonnier loss is a robust variant of the L1 loss that reduces sensitivity to outliers and ensures stable pixel-level reconstruction.
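In its standard form, the Charbonnier loss is sqrt((x - y)² + ε²) averaged over pixels; a minimal implementation:

```python
import torch

def charbonnier_loss(sr, hr, eps=1e-3):
    """Smooth, outlier-robust L1 variant: mean(sqrt((sr - hr)^2 + eps^2))."""
    return torch.sqrt((sr - hr) ** 2 + eps ** 2).mean()

x = torch.zeros(1, 3, 8, 8)
y = torch.ones(1, 3, 8, 8)
loss = charbonnier_loss(x, y)   # ~= 1.0 for a uniform error of 1
```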

      2. Perceptual Loss

        The perceptual loss computes the difference between high- level feature representations extracted from a pretrained VGG network, enabling better texture and detail reconstruction.

        Fig. 1. Overall system pipeline of the GAN-based super-resolution model

      3. Edge-Aware Loss

        The edge-aware loss enforces consistency in image gradients, helping preserve sharp edges and structural boundaries in the reconstructed image.
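One common realization compares Sobel gradient maps of the two images; the specific gradient operator is an assumption, since the paper does not name one:

```python
import torch
import torch.nn.functional as F

# 3x3 Sobel kernels (an illustrative choice of gradient operator).
_SOBEL_X = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
_SOBEL_Y = _SOBEL_X.transpose(2, 3)

def gradients(img):
    """Per-channel horizontal and vertical gradients of an (N, C, H, W) image."""
    c = img.shape[1]
    gx = F.conv2d(img, _SOBEL_X.repeat(c, 1, 1, 1), padding=1, groups=c)
    gy = F.conv2d(img, _SOBEL_Y.repeat(c, 1, 1, 1), padding=1, groups=c)
    return gx, gy

def edge_loss(sr, hr):
    """L1 distance between the gradient maps of the SR output and HR target."""
    sx, sy = gradients(sr)
    hx, hy = gradients(hr)
    return (sx - hx).abs().mean() + (sy - hy).abs().mean()

a = torch.rand(1, 3, 32, 32)
b = torch.rand(1, 3, 32, 32)
loss = edge_loss(a, b)   # zero when the two images match exactly
```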

      4. SSIM Loss

        The SSIM loss measures structural similarity between images, ensuring preservation of luminance, contrast, and structural information.
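A compact SSIM sketch is shown below. Note the simplification: the standard SSIM uses an 11×11 Gaussian window, whereas this version uses a uniform box window to stay short; the constants c1 and c2 are the usual stabilizers:

```python
import torch
import torch.nn.functional as F

def ssim(x, y, window=7, c1=0.01 ** 2, c2=0.03 ** 2):
    """Mean SSIM with a uniform averaging window (a simplification of the
    Gaussian-windowed original)."""
    mu_x = F.avg_pool2d(x, window, stride=1)
    mu_y = F.avg_pool2d(y, window, stride=1)
    var_x = F.avg_pool2d(x * x, window, stride=1) - mu_x ** 2
    var_y = F.avg_pool2d(y * y, window, stride=1) - mu_y ** 2
    cov = F.avg_pool2d(x * y, window, stride=1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return (num / den).mean()

def ssim_loss(sr, hr):
    return 1.0 - ssim(sr, hr)   # minimize 1 - SSIM

a = torch.rand(1, 3, 32, 32)
```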

      5. Adversarial Loss

        The adversarial loss encourages the generator to produce realistic images by minimizing the difference between generated and real image distributions.
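A sketch of the generator- and discriminator-side terms in the standard binary cross-entropy form is given below; whether the model uses this vanilla form or a relativistic variant is not specified in the paper, so treat this as an assumption. Label smoothing on real patches follows the ranges given under Training Details:

```python
import torch
import torch.nn.functional as F

def adversarial_g_loss(fake_logits):
    """Generator side: push D's per-patch logits on SR images toward 'real'."""
    return F.binary_cross_entropy_with_logits(fake_logits,
                                              torch.ones_like(fake_logits))

def adversarial_d_loss(real_logits, fake_logits, real_label=0.9):
    """Discriminator side, with one-sided label smoothing on real patches."""
    real = F.binary_cross_entropy_with_logits(
        real_logits, torch.full_like(real_logits, real_label))
    fake = F.binary_cross_entropy_with_logits(
        fake_logits, torch.zeros_like(fake_logits))
    return 0.5 * (real + fake)

g = adversarial_g_loss(torch.zeros(2, 1, 11, 11))   # logits of 0 give BCE = ln 2
```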

      6. Total Loss Function

        The total loss is a weighted combination of all individual loss components, balancing pixel accuracy, perceptual quality, and structural fidelity.

    5. Training Details

      1. Epochs: 600

      2. Batch size: 16

      3. Optimizer: Adam for both the generator and the discriminator (β1 = 0.9, β2 = 0.99)

      4. Learning rate: Cosine annealing schedule with linear warm-up for the first 5 epochs, reaching a maximum learning rate of 1 × 10

      5. Adversarial training: Introduced after 10,000 training steps and gradually increased over 80,000 steps using step-based scheduling

      6. Loss weighting: Dynamic weighting strategy in which the pixel loss weight decreases to 0.15, the perceptual loss weight increases up to 0.6, and the adversarial loss weight increases up to 3 × 10³, while the edge-aware and SSIM loss weights are fixed at 0.05

      7. Discriminator training: Updated after the adversarial phase begins, at alternating mini-batch intervals, with label smoothing (real: 0.8–1.0, fake: 0.0–0.2)

      8. Evaluation metrics: PSNR, SSIM, and LPIPS [6]

      9. Model selection: Best model selected based on the lowest validation LPIPS

      10. Checkpointing: Model checkpoints saved every 10 epochs, with additional saving of the best-performing model

  4. RESULTS AND DISCUSSION

      1. Quantitative Results

        Table I. PSNR, SSIM, and LPIPS results of our model on various benchmarks

        Dataset      PSNR (dB)   SSIM     LPIPS
        Set5         29.07       0.8547   0.1058
        Set14        26.35       0.8422   0.1854
        BSD100       25.54       0.6774   0.2477
        Urban100     23.72       0.9452   0.1797
        Manga109     27.37       0.9681   0.0789
        General100   28.65       0.8631   0.1235
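For reference, the PSNR values reported in Table I follow the standard definition, 10·log10(MAX² / MSE), which can be computed as:

```python
import torch

def psnr(sr, hr, max_val=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(max_val^2 / MSE)."""
    mse = torch.mean((sr - hr) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)

hr = torch.zeros(1, 3, 8, 8)
sr = torch.full_like(hr, 0.1)   # uniform error of 0.1 -> MSE = 0.01
value = psnr(sr, hr)            # 10 * log10(1 / 0.01) = 20 dB
```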

      2. Qualitative Results

        Fig. 2. Qualitative comparison on the Set5 dataset: (a) Baby, (b) Bird, (c) Butterfly, (d) Woman, and (e) Head. Each row shows (from left to right) the low-resolution (LR) input, bicubic upsampling, the proposed super-resolved (SR) output, and the ground-truth (HR) image.

        The proposed model generates visually sharper images with improved texture fidelity, particularly for structured and high-frequency patterns such as cartoons and illustrations. However, performance decreases for complex natural scenes due to variations in texture distribution.

      3. Comparison with State-of-the-Art Models

    Tables II and III compare the performance of the proposed model with ESRGAN [3] and MOBOSR [8], where ESRGAN serves as a widely adopted GAN-based baseline and MOBOSR represents a recent optimization-based approach.

    Table II. Comparison of LPIPS (lower is better) with ESRGAN and MOBOSR

    Dataset      Our Model   ESRGAN   MOBOSR
    Set5         0.1058      0.0750   0.0745
    Set14        0.1854      0.1341   0.1359
    Urban100     0.1797      0.1228   0.1324
    BSD100       0.2477      0.1617   0.1719
    Manga109     0.0789      0.0647   0.0675
    General100   0.1235      0.0876   0.0894

    Table III. Comparison of PSNR (dB, ×4) with ESRGAN and MOBOSR

    Dataset      Our Model   ESRGAN   MOBOSR
    Set5         29.07       29.80    30.69
    Set14        26.35       25.51    26.81
    Urban100     23.72       23.74    24.49
    BSD100       25.54       25.21    26.04
    Manga109     27.37       27.36    28.21
    General100   28.65       28.86    29.66

    The proposed method achieves LPIPS scores that are competitive with ESRGAN and MOBOSR [8], with the smallest gaps on the Manga109 and Set5 datasets. While somewhat behind on some natural-image datasets, the performance remains within acceptable range of these baselines.

    The quantitative PSNR (×4) comparison in Table III covers the Set5, Set14, Urban100, BSD100, Manga109, and General100 datasets. The proposed model achieves competitive performance, outperforming ESRGAN on Set14, BSD100, and Manga109 while remaining comparable on Set5 and Urban100. MOBOSR consistently achieves the highest PSNR across all datasets, indicating superior reconstruction capability. Nevertheless, the proposed model demonstrates balanced performance across diverse datasets, highlighting its effectiveness and generalization ability for image super-resolution tasks.

  5. CONCLUSION AND FUTURE WORK

This paper presented a GAN-based 4× super-resolution framework utilizing an RRDB generator and PatchGAN discriminator with a hybrid loss function. Experimental results demonstrate strong perceptual performance, particularly for high-frequency datasets.

Future work includes:

  • Exploring transformer-based architectures such as SwinIR

  • Integrating diffusion-based models

  • Deploying the system for real-time applications

REFERENCES

  1. C. Dong et al., "Learning a Deep Convolutional Network for Image Super-Resolution," ECCV, 2014.

  2. C. Ledig et al., "Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network," CVPR, 2017.

  3. X. Wang et al., "ESRGAN: Enhanced Super-Resolution Generative Adversarial Networks," ECCVW, 2018.

  4. B. Lim et al., "Enhanced Deep Residual Networks for Single Image Super-Resolution," CVPRW, 2017.

  5. J. Johnson et al., "Perceptual Losses for Real-Time Style Transfer and Super-Resolution," ECCV, 2016.

  6. R. Zhang et al., "The Unreasonable Effectiveness of Deep Features as a Perceptual Metric," CVPR, 2018.

  7. E. Agustsson and R. Timofte, "NTIRE 2017 Challenge on Single Image Super-Resolution: Dataset and Study," CVPRW, 2017.

  8. X. Zhang et al., "Perceptual-Distortion Balanced Image Super-Resolution," arXiv:2409.03179, 2024.