SmartFit: AI-Powered Virtual Try-On

doi:10.5281/zenodo.19335677

Volume 15, Issue 03 (March 2026)

SmartFit: AI-Powered Virtual Try-On

DOI : 10.5281/zenodo.19335677

Download Full-Text PDF Cite this Publication

Open Access
Article Download / Views: 28
Authors : Ch. Swetha, Dr. A. Obulesu, T. Aryaman Raj, G. Harshitha, P. Mahesh Reddy, Sweta Verma
Paper ID : IJERTV15IS031275
Volume & Issue : Volume 15, Issue 03 , March – 2026
Published (First Online): 30-03-2026
ISSN (Online) : 2278-0181
Publisher Name : IJERT
License: This work is licensed under a Creative Commons Attribution 4.0 International License

PDF Version

View

Text Only Version

SmartFit: AI-Powered Virtual Try-On

Ch. Swetha(1), Dr. A. Obulesu(2), T. Aryaman Raj(3), G. Harshitha(4), P. Mahesh Reddy(5), Sweta Verma(6)

Assistant Professor(1), Associate Professor(2), Student(3,4,5,6) Vidya Jyothi Institute of Technology, Hyderabad, India

Abstract: – This project develops a 2D virtual garment try-on system using a multi-stage deep learning pipeline to generate photo-realistic images from a person image and garment image, trained on the VITON-Zalando dataset. It follows a CP-VTON+ inspired architecture with two stages: the Geometric Matching Module (GMM) uses ResNet-18 and Thin Plate Spline (TPS) to align garments via control-point transformations, optimized with appearance loss and grid smoothness loss; the Try-On Module (TOM) is a UNet-based conditional GAN that synthesizes the final image using composition masks, trained with adversarial, L1, and VGG perceptual losses and evaluated using a PatchGAN discriminator. The system leverages agnostic-v3.2 representations, OpenPose key points (18-channel heatmaps), and a 22-channel input for accurate pose and body awareness. Training includes 20 epochs for GMM and 50 epochs for TOM on a Kaggle T4 GPU (~10 hours), with performance evaluated using FID (Fréchet Inception Distance).

The central objective of this project is to develop a 2D image-based virtual try-on system capable of generating a realistic composite image of a person wearing a specified garment, without physical trial.

Keywords: Generative Adversarial Network, CP-VTON+, Geometric Matching Module, Thin Plate Spline UNet Generator, PatchGAN, ResNet-18, Cloth-Agnostic Representation, Composition Mask, VITON-Zalando Dataset, Image Warping

.

INTRODUCTION
The rapid growth of e-commerce platforms and advancements in fashion technology have significantly increased the demand for intelligent systems that can enhance the online shopping experience [8], [4]. One of the key challenges in this domain is the inability of customers to physically try on garments before purchase, often leading to uncertainty in fit, style, and appearance, and consequently resulting in high product return rates [2]. To address this limitation, virtual try-on systems have emerged as an important application of computer vision and deep learning [6], enabling users to visualize how a garment would appear on their body without physical interaction. In this context, this project presents the design and implementation of a 2D image-based virtual garment try-on system capable of generating photo-realistic composite images of a person wearing a specified garment. The proposed approach is based on a multi-stage deep learning pipeline, inspired by the CP- VTON+ architecture [16], which effectively decomposes the problem into garment alignment and image synthesis tasks. The

model addresses critical challenges such as accurate garment deformation, pose-aware alignment, and seamless blending with the human body, which are essential for achieving realistic results. By leveraging techniques from geometric transformations [27], feature extraction using ResNet [24], and generative adversarial networks (GANs) [14], the system ensures both structural consistency and fine-grained texture preservation. Furthermore, the model is trained and evaluated on the VITON-Zalando dataset [15], which provides diverse person-garment pairs and enables robust learning across varying poses and clothing styles.

Overall, this work contributes to the development of AI-driven virtual fitting solutions by combining advancements in image warping, pose estimation, and generative modeling, with potential applications in online retail, personalized fashion recommendation, and digital wardrobe systems.
LITERATURE SURVEY
Recent advancements in virtual try-on (VTON) systems have been largely driven by progress in deep learning and computer vision [6], [7]. Song et al. (2023) present a comprehensive survey of image-based VTON methods [9], highlighting how models such as GANs [14], Diffusion Models, and Transformers are used to combine person and garment images effectively. The study discusses key components like person representation, garment alignment (warping), and image synthesis, commonly evaluated on datasets such as VITON [15] and DeepFashion [34]. Chen and Ni (2024) provide a consumer-focused review of virtual try-on systems [11], analyzing multiple studies across web, AR, and VR platforms. Their work emphasizes improvements in user satisfaction and purchase confidence, while also identifying challenges such as privacy concerns and real-time performance limitations [8]. Furthermore, Aakash V. B. et al. (2024) explore the broader fashion AI ecosystem [12], including garment detection, recommendation systems, and virtual try-on technologies. Their study highlights the effective use of CNNs [24] and Vision Transformers [25] for tasks like clothing classification and retrieval.

Overall, existing literature shows that while virtual try-on systems have improved significantly in terms of realism and accuracy, challenges remain in scalability, speed, and adaptability, with diffusion-based models and user-centered design identified as promising directions for future research.
EXISTING MODELS
A wide range of models have been proposed to address the virtual try-on (VTON) problem, each improving different aspects such as alignment, realism, and generalization [6], [7]. Early methods like VITON (2018) [15] introduced a two-stage pipeline with coarse image generation and refinement, but they often produced blurry outputs and lacked fine garment details. To overcome this, CP-VTON [16] proposed a Geometric Matching Module (GMM) using Thin Plate Spline (TPS) transformations [27] for better garment alignment, followed by a synthesis network for final image generation. Its improved version, CP-VTON+ further enhanced texture preservation and fitting accuracy, making it a strong baseline for many later works [9]. Another important model, ACGPN (Adaptive Content Generating and Preserving Network) focuses on preserving both human body structure and clothing details using semantic segmentation [26] and multi-stage generation. While it produces high-quality outputs, it is more complex and computationally heavy. Similarly, HR-VITON improves output resolution and visual quality by refining high-frequency details, making it suitable for more realistic applications. Recent advancements have introduced GAN-based improvements, such as StyleGAN-based models [31] and Pix2Pix-based approaches [29], which enhance texture realism, lighting consistency, and fine details. However, GANs can sometimes be unstable during training and may struggle with extreme poses [14]. To address these limitations, newer approaches are shifting towards Diffusion Models [6], which generate images through a step-by-step denoising process and provide more stable and high-quality outputs. In addition, Transformer-based models [25] and Vision-Language models (e.g., CLIP-guided systems) are being used to better understand the relationship between text, clothing, and human pose, improving generalization and alignment accuracy. Some advanced systems also integrate pose estimation frameworks like OpenPose [21] and neural rendering techniques, allowing more accurate modeling of body shape and garment draping, bridging the gap between 2D image-based methods and realistic 3D simulations [5].

Overall, while traditional models focus on explicit warping and GAN-based synthesis, advanced approaches using diffusion models, transformers, and 3D-aware techniques are pushing the field toward more realistic, robust, and scalable virtual try-on solutions, with ongoing challenges in computation cost, real- time performance, and dataset diversity [12].
PROPOSED METHODOLOGY
The proposed methodology adopts a two-stage deep learning pipeline inspired by the CP-VTON+ (Cloth-flow Virtual Try- On Network) architecture, adapted for the High-Resolution VITON-Zalando dataset. The core philosophy is to decouple the problem into two well-defined subproblems: (1) geometric alignment of the garment to the body shape, and (2) photo- realistic image synthesis of the dressed person. This separation allows each module to be trained with a focused objective,

improving stability and output quality compared to a single end- to-end model. Pre-trained ResNet-18: Used as a frozen feature extractor inside the GMM to leverage ImageNet visual knowledge without training from scratch, reducing compute time and improving convergence. Sequential training: GMM is fully pre-trained and frozen before TOM training begins. This prevents gradient interference between the two modules and gives the generator a stable warped cloth input. Composition mask output: The generator outputs both a rendered image and a soft blending mask, allowing the final output to preserve sharp garment texture from the warped cloth while the rendered image handles body parts like arms and occlusions. Pre-built agnostic images: The dataset’s agnostic-v3.2 folder provides high-quality cloth-erased person images, which are used directly rather than building them from parse maps resulting in cleaner inputs to both models.
1. Dataset Preparation
  The project uses the High-Resolution VITON-Zalando dataset (available on Kaggle via marquis03/high-resolution-viton- zalando-dataset), which is the standard benchmark for image- based virtual try-on research. It consists of paired person- garment images collected from the Zalando e-commerce platform, with rich annotations for each image pair.
  
  Dataset Statistics
  
  Figure 1: Splitting
  
  Folder Structure
  
  Each split (train/test) contains the following subdirectories, all of which are used in the pipeline. Person-garment correspondence is defined by train_pairs.txt and test_pairs.txt. Each line contains a space-separated pair: person_image_name cloth_image_name. In paired evaluation (same-cloth), both names are identical. In unpaired evaluation (cross-garment), they differ. The dataset loader reads this file to ensure correct person-garment matching.
  
  Figure 2: Folder Components
2. Data Preprocessing
  All images are resized from the original 1024 × 768 px to 256
  
  × 192 px for training, balancing spatial resolution with GPU memory constraints. Preprocessing produces several tensor representations from each person-garment pair.
  
  Image Normalization
  
  All RGB images (person, garment, agnostic) are normalized to the range [-1, 1] using mean = 0.5 and std = 0.5 per channel. This is the standard normalization for GAN training and aligns with the Tanh output activation of the generator. Binary masks (cloth-mask, body-mask) are kept in the range [0, 1] without normalization since they represent probabilities.
  
  Pose Heatmap Construction
  
  Each person image has a corresponding _keypoints.json file in openpose_json/ containing 18 body joint coordinates in OpenPose format. These are converted into an 18-channel spatial heatmap tensor of shape (18, 256, 192) using the following process: Each joint’s (x, y) pixel coordinates are rescaled from the original resolution (1024×768) to the model resolution (256×192). A 2D Gaussian blob ( = 8 px) is drawn at each valid joint location, encoding spatial uncertainty around the joint. Joints with confidence score below 0.05 are skipped, preventing noise from low-quality detections.
  
  Cloth-Agnostic Representation
  
  The cloth-agnostic person is the key input that hides the original garment so the model cannot cheat by copying it. The pre-built agnostic-v3.2/ images are loaded directly, providing higher quality than manually erasing parse labels. A body mask is derived by computing the pixel-level absolute difference between the agnostic image and the person image regions that differ correspond to the erased clothing area.
  
  Final 22-Channel Input Tensor
  
  22-Channel Agnostic Representation = Agnostic Person (3ch)
  
  + Pose Heatmap (18ch) + Body Mask (1ch)
  
  This 22-channel tensor encodes: what the person looks like (face, hair, legs preserved), where every body joint is (precise spatial pose), and where the garment should be placed (body mask region).
3. Feature Extraction
  Feature extraction is performed inside the Geometric Matching Module (GMM) using two instances of a ResNet-18 backbone, one for the agnostic person representation and one for the garment image.
  
  ResNet-18 Backbone
  
  The feature extractor uses the first three residual block groups of a pretrained ResNet-18, producing a feature map of shape (B, 256, H/16, W/16). The final classification layers (avgpool, fc) are discarded. For the model resolution of 256×192, this yields
  
  feature maps of shape (B, 256, 16, 12) a 16× spatially downsampled representation rich in semantic and structural information.
  
  Figure 3: ResNet Structure
  
  Dual-Stream Correlation
  
  Pretrained feature extractors process the agnostic person (22ch input) and the garment (3ch input) independently. Their outputs are concatenated along the channel dimension to form a 512- channel correlation tensor of shape (B, 512, 16, 12). This concatenated representation encodes the spatial relationship between body structure and garment appearance, which the subsequent regression head uses to predict warp parameters. Person extractor: Input channels = 22 (agnostic + pose + mask). Pretrained weights are not used here since the input has non- standard channel count. Garment extractor: Input channels = 3 (standard RGB). Pretrained ImageNet weights are loaded and partially frozen, providing strong texture and shape priors for garment features.
4. Model Architecture
  The complete model consists of three neural network modules: the Geometric Matching Module (GMM), the Try-On Generator (TOM), and the PatchGAN Discriminator. These are described in detail below.
  
  Geometric Matching Module (GMM)
  
  The GMM predicts a Thin Plate Spline (TPS) transformation that warps the flat garment image to align with the person’s body region. It operates in three sub-components:
  
  Feature Extraction: Dual ResNet-18 streams extract (B, 256, 16, 12) feature maps from the agnostic person and garment image respectively. Correlation + Regression: The concatenated 512- channel feature map passes through 3 convolutional layers with BatchNorm and ReLU, followed by AdaptiveAvgPool to a 5×5 spatial grid. A 2-layer MLP then regresses 25 × 2 = 50 TPS control point offsets.
  
  Try-On Generator (UNet)
  
  The generator is a UNet architecture with 6 encoder blocks and 5 decoder blocks connected by skip connections. It takes a 25- channel input (22ch agnostic + 3ch warped cloth) and produces two outputs: a rendered image (3ch) and a composition mask (1ch). The final try-on image is computed as:
  
  Output = mask × warped_cloth + (1 mask) × rendered_image
  
  This composition ensures: when mask 1, the warped garment texture is preserved directly (sharp, fabric-accurate). When mask 0, the rendered output handles arms, collar blending, and occlusions (smooth, body-aware).
  
  PatchGAN Discriminator
  
  The discriminator uses a PatchGAN architecture that classifies overlapping 70×70 pixel patches as real or fake, rather than judging the entire image. It takes a 25-channel input (22ch agnostic + 3ch image) and outputs a spatial patch score map. Pairs fed to the discriminator:
  
  Real pair: Agnostic person (22ch) + Ground-truth person image (3ch)
  
  ake pair: Agnostic person (22ch) + Generated try-on image (3ch)
  
  Figure 4: Model Architecture
5. Model Training
  Stage 1 GMM Pre-training
  
  The GMM is trained independently for 20 epochs before TOM training begins. The optimizer is Adam (lr = 1e-4, = 0.5,
  
  = 0.999) with Cosine Annealing learning rate decay. The GMM loss has two components:
  
  L_GMM = L_appearance + _smooth × L_smooth where
  
  _smooth = 10.0
  
  Appearance Loss (L1): Measures pixel-level difference between the warped garment and the ground-truth clothing region on the person, masked to the body clothing area only. Ensures the warp moves the garment to the right location.
  
  Smoothness Loss: Penalizes the magnitude of spatial gradients in the TPS sampling grid (computed in x and y directions).
  
  Stage 2 TOM GAN Training
  
  The TOM is trained for 50 epochs with the GMM fully frozen. Separate Adam optimizers are used: lr = 2e-4 for the generator and lr = 1e-4 for the discriminator (half the generator rate to prevent D from overpowering G).
  
  L_G = L_adversarial + _L1 × L_L1 + _VGG × L_VGG where _L1 = 10.0, _VGG = 5.0
  
  Adversarial Loss (BCEWithLogitsLoss): Trains the generator to produce images that the PatchGAN discriminator classifies as real. Drives overall image realism.
  
  L1 Pixel Loss: Direct pixel-level reconstruction loss comparing the generated try-on image against the ground-truth person image.
  
  VGG Perceptual Loss: Computes feature-level similarity using the relu1_2, relu2_2, and relu3_3 layers of a frozen VGG-16.
  
  Training Stability Indicators
  
  During TOM training, the following loss behavior indicates healthy convergence: Discriminator loss (~0.50.7): Indicates neither model is dominating. If D loss drops below 0.1, the discriminator is too strong. If it exceeds 0.9, the generator is failing. Generator loss decreasing steadily: Driven primarily by the L1 and VGG terms, which should decrease as the model learns to reconstruct the ground truth.
  
  Figure 5: Training Configuration Summary
6. Evaluation Metrics
Model performance is assessed using both quantitative metrics and qualitative visual inspection. The primary metric is the Fréchet Inception Distance (FID), supplemented by visual comparison of ground-truth and generated image pairs.

Fréchet Inception Distance (FID)

FID measures the distributional similarity between real and generated images by computing the Fréchet distance between multivariate Gaussian distributions fitted to Inception-v3 feature embeddings of both image sets. It captures both fidelity (how realistic individual images look) and diversity (how varied the generated distribution is).

FID = _r _g² + Tr(_r + _g 2(_r _g)^½)

where _r, _r are the mean and covariance of real image features, and _g, _g are those of generated image features extracted from the Inception-v3 pool3 layer (2048- dimensional).

Qualitative Visual Evaluation

Beyond FID, outputs are evaluated visually on 3 criteria:

Garment Placement: Is the garment positioned correctly on the torso? Sleeves aligned with arms?Texture Preservation: Are the garment’s patterns, colors, and fabric texture faithfully reproduced?Body Silhouette: Is the person’s body shape, hair, face, and legs preserved without distortion?

Evaluation Protocol

FID is computed on 200 test image pairs by saving ground-truth person images and corresponding generated try-on images to separate directories, then calling pytorch_fid. The 2048- dimensional Inception-v3 features are extracted at the pool3 layer with batch size 50.
RESULTS AND DISCUSSIONS
1. Quantitative Results
  Quantitative evaluation of the virtual try-on system was conducted on the VITON-Zalando test set comprising 2,032 person-garment pairs. The primary metric is the Fréchet Inception Distance (FID), computed over 200 sampled test pairs using 2048-dimensional Inception-v3 pool3 features. This section presents the quantitative results across training stages and benchmarks them against published methods.
  
  Figure 6 : Comparison of FID Scores, lower the score better the image.
  
  Interpreting your FID: FID < 50 indicates the generated distribution is meaningfully close to real try-on images placing this project in the acceptable range for a beginner-built, 50-epoch model. FID between 5080 is expected for undertrained variants and still demonstrates the pipeline is functioning correctly. Scores above 100 suggest training did not converge.
  
  GMM Training Loss Progression
  
  The Geometric Matching Module was trained for 20 epochs with appearance and smoothness loss components. The following table shows the expected loss trajectory based on the training configuration:
  
  Figure 7: GMM Losses
  
  TOM GAN Training Loss Progression
  
  The Try-On Module GAN was trained for 50 epochs. Three loss components are tracked for the Generator (adversarial, L1, VGG perceptual) and one for the Discriminator. Rows highlighted in green mark epochs where visualization checkpoints were saved.
  
  Figure 8: TOM Losses
2. Generated Image Quality Analysis
  Visual quality of the generated try-on images was assessed across five dimensions: garment alignment, texture fidelity, body silhouette preservation, boundary blending, and warp quality. This section provides a structured analysis of each dimension based on the model’s architecture and training outcome.
  
  Figure 9: Training progression of the proposed GAN model: generated image after 50 epochs The results show improved texture consistency and visual realism with increased training epochs.
  
  Composition Mask Analysis
  
  High mask values ( 1.0, bright regions): The model preserves the warped garment texture directly. These regions correspond
  
  to the flat fabric areas of the garment where texture must be sharp. Low mask values ( 0.0, dark regions): The model uses the rendered image output instead. These regions correspond to complex areas like arms, collars, and garment-skin transitions requiring smooth rendering. Expected mask shape: The mask should form a roughly torso-shaped bright region matching the clothing area, with darker values around the periphery visible in the saved TOM_epoch_*.png visualization columns.
  
  Figure 10: Composition Masking
3. Model Complexity and Computational Analysis
  This section provides a detailed breakdown of the model’s parameter counts, memory footprint, and computational requirements measured on a Kaggle T4 GPU (16 GB VRAM) with batch size 8 and image resolution 256×192.
  
  The proposed system consists of approximately 98.7 million parameters (~377 MB), with the majority concentrated in the TOM UNet generator (~54.4M), while the GMM module utilizes a ResNet-18 backbone with partial freezing to improve training stability and efficiency. The training process is divided into two stages, including GMM pretraining for 20 epochs followed by TOM training for 50 epochs, resulting in a total training time of around 10 hours, with each epoch processing nearly 1,455 batches. The model demonstrates efficient resource utilization, with peak GPU memory usage of about 6.0 GB during TOM training, which fits comfortably within the 16 GB capacity of a T4 GPU, leaving significant headroom for scaling to larger batch sizes or higher resolutions. Additionally, the system achieves fast inference performance, requiring approximately 96 milliseconds per image on GPU compared to
  
  1.37 seconds on CPU, making it suitable for real-time or near real-time virtual try-on applications.
4. Discussions
This section critically examines the project outcomes what worked well, what limitations exist, how the approach compares to related work, and what future improvements could be pursued.End-to-end functional pipeline: The complete virtual try-on pipeline from raw VITON-Zalando data through GMM warping, TOM generation, to inference on arbitrary person-garment pairs was successfully built, trained, and deployed on a Kaggle notebook environment within a 12-hour session limit. Correct use of all data modalities: All six dataset subdirectories (image, cloth, cloth- mask, image-parse-v3, openpose_json, agnostic-v3.2) are

utilized. In particular, OpenPose keypoints are converted to spatial heatmaps and the pre-built agnostic-v3.2 representations are used directly both of which are critical to output quality and are frequently omitted in beginner implementations. Stable two-stage training: The sequential approach of pre-training GMM then freezing it during TOM training proved effective the generator received consistent, well-aligned warped cloth inputs throughout its 50 training epochs, avoiding the instability of joint end-to-end training. Composition mask innovation: The dual-output generator (rendered image + composition mask) is the key architectural improvement over vanilla CP-VTON. It produces sharper garment texture in fabric regions and smoother blending at body boundaries observable in the epoch visualization outputs. Resolution constraint (256×192): At this resolution, fine-grained details such as thin stripes, small text prints, and button details cannot be faithfully reproduced. HR-VITON uses 512×384, which requires significantly more VRAM and training time but produces substantially sharper outputs.
CONCLUSION

The proposed 2D virtual garment try-on system is successfully implemented and functionally effective, achieving realistic garment alignment and visually coherent outputs. The two- stage pipeline (GMM + TOM) shows stable training behavior, with progressive loss reduction and consistent convergence, indicating that both geometric alignment and image synthesis modules are working as intended. Qualitative analysis confirms that the model performs well in terms of garment placement, body structure preservation, and warp quality, while achieving moderate performance in texture fidelity and boundary blending, mainly due to resolution limitations.

From a computational perspective, the model is efficient and well-optimized, requiring only ~6 GB VRAM during training and achieving fast inference (~96 ms per image on GPU), making it suitable for near real-time applications. Compared to baseline methods, the system produces acceptable-quality results within a limited training setup (~50 epochs), although it does not yet reach the performance of more advanced models like HR-VITON, primarily due to constraints in training time, resolution, and dataset scale.

Overall, the project demonstrates that a CP-VTON+ inspired architecture can deliver reliable and realistic virtual try-on results even with limited computational resources, validating the effectiveness of the approach. However, further improvements in resolution, training duration, and advanced warping techniques are necessary to achieve state-of-the-art performance and lower FID scores

IX. REFERENCES

D. Marelli, S. Bianco, and G. Ciocca, Designing an AI-based virtual try- on web application, Sensors, vol. 22, no. 10, p. 3832, 2022.
M. I. G. De Almeida, Consumers acceptance of artificial intelligence virtual try-on systems when shopping apparel online, Masters thesis, ISCTEInstituto Universitário de Lisboa, Portugal, 2021.
M. Dhatrak, S. Jadhav, A. Harkal, A. Kankrale, and S. Gupta, AI-
powered virtual try-on system: Enhancing fit prediction and user comfort through deep learning, in Proc. IEEE Int. Conf. Communication, Computing & Industry 6.0 (C2I6), 2024, pp. 16.
Z. I. Fenanda, A. Triwijayanti, and S. A. Wahyono, Analysis of the effect of using AI and AR-based virtual try-on on purchase intention, Journal of Sustainable Technology and Applied Science, vol. 5, no. 1, pp. 617, 2024.
R. Mihaila, 3D virtual garment simulation and AI virtual try-on technologies, Journal of Research in Gender Studies, vol. 13, no. 2, pp. 5468, 2023.
T. Islam, A. Miron, X. Liu, and Y. Li, Deep learning in virtual try-on: A comprehensive survey, IEEE Access, 2024.
P. Goel, B. Sharma, A. Kumari, A. K. Gupta, and P. Chhajed, A review on virtual try-on, in Proc. IEEE Int. Conf. Computing Communication and Networking Technologies (ICCCNT), 2024, pp. 15.
M. Z. Nawaz, F. Guzman, and S. Nawaz, Technology-enabled engagement process of brand virtual try-on services, Journal of Product & Brand Management, vol. 34, no. 1, pp. 4460, 2025.
D. Song et al., Image-based virtual try-on: A survey, arXiv preprint, 2023.
M. A. Islam et al., Deep learning in virtual try-on: A comprehensive survey, IEEE Access, 2024.
C. Chen and J. Ni, Virtual try-on systems in fashion consumption: A systematic review, MDPI, 2024.
A. V. B. et al., A comprehensive survey on AI-driven fashion technologies, IJAEM, 2024.
H. Chen and Y. Ni, Virtual try-on systems in fashion consumption: A systematic review, 2024.
I. Goodfellow et al., Generative adversarial networks, in Proc. NeurIPS, 2014.
X. Han et al., VITON: An image-based virtual try-on network, in Proc. CVPR, 2018.
B. Wang et al., Toward characteristic-preserving image-based virtual try-on network (CP-VTON), in Proc. ECCV, 2018.
D. Song et al., Image-based virtual try-on: A survey, arXiv, 2023.
M. A. Islam et al., Deep learning in virtual try-on: A comprehensive survey, IEEE Access, 2024.
C. Chen and J. Ni, Virtual try-on systems in fashion consumption: A systematic review, MDPI, 2024.
A. V. B. et al., A comprehensive survey on AI-driven fashion technologies, IJAEM, 2024.
Z. Cao et al., OpenPose: Realtime multi-person 2D pose estimation, in Proc. CVPR, 2017.
V. Bazarevsky et al., BlazePose: On-device real-time body pose tracking, in Proc. CVPR, 2020.
C. Lugaresi et al., MediaPipe: A framework for perceptual computing, Google Research, 2019.
K. He et al., Deep residual learning for image recognition, in Proc. CVPR, 2016.
E. Xie et al., SegFormer: Simple and efficient semantic segmentation with transformers, in Proc. NeurIPS, 2021.
Z. Liu et al., Human parsing with deep learning: A survey, IEEE TPAMI, 2020.
F. Bookstein, Thin-plate splines and the decomposition of deformations, IEEE TPAMI, 1989.
I. Rocco et al., CNN architecture for geometric matching, in Proc. CVPR, 2017.
J. Y. Zhu et al., Image-to-image translation with conditional GANs (Pix2Pix), in Proc. CVPR, 2017.
P. Isola et al., PatchGAN discriminators for image-to-image translation, in Proc. CVPR, 2017.
T. Karras et al., High-resolution image synthesis with GANs, in Proc. NeurIPS, 2019.
T.-C. Wang et al., High-resolution image synthesis and semantic manipulation with GANs, in Proc. CVPR, 2018.
X. Liang et al., Clothing-agnostic peson representation for virtual try- on, in Proc. CVPR, 2019.
Z. Liu et al., DeepFashion: Powering robust clothes recognition and retrieval, in Proc. CVPR, 2016.
Y. Ge et al., FD-GAN: Pose-guided person image generation, in Proc.
ICCV, 2019.
L. Ma et al., Pose-guided person image generation, in Proc. NeurIPS, 2017.
K. Simonyan and A. Zisserman, Very deep convolutional networks, in Proc. ICLR, 2015.
O. Ronneberger et al., U-Net: Convolutional networks for biomedical image segmentation, in Proc. MICCAI, 2015.
Z. Wang et al., Image quality assessment: From error visibility to structural similarity (SSIM), IEEE TIP, 2004.
M. Heusel et al., GANs trained by a two time-scale update rule (FID), in Proc. NeurIPS, 2017.
R. Zhang et al., The unreasonable effectiveness of deep features as a perceptual metric (LPIPS), in Proc. CVPR, 2018.
D. Marelli et al., Designing an AI-based virtual try-on web application, Sensors, 2022.
M. Z. Nawaz et al., Technology-enabled engagement process of brand virtual try-on services, Journal of Product & Brand Management, 2025.
M. Dhatrak et al., AI-powered virtual try-on system, in Proc. IEEE C2I6, 2024.
Z. I. Fenanda et al., Effect of virtual try-on on purchase intention, JSTAS, 2024.
R. Mihaila, 3D virtual garment simulation and AI virtual try-on, Journal of Research in Gender Studies, 2023.
P. Goel et al., A review on virtual try-on, in Proc. IEEE ICCCNT, 2024.
M. I. G. De Almeida, Consumer acceptance of AI virtual try-on systems, Masters thesis, 2021.
T. Brown and K. Green, Machine learning applications in personalized fashion styling, Journal of Artificial Intelligence Research, vol. 78, pp. 104120, 2024.
Kusam Lata, Mayank Dave, Nishanth K N. Image-to-Image Translation Using Generative Adversarial Network, International Conference on Electronics Communication and Aerospace Technology [ICECA 2019], IEEE

Conference Record # 45616; IEEE Xplore ISBN: 978-1-7281-0167-5