🔒
Global Research Platform
Serving Researchers Since 2012

Ad Genie: A Multimodal Generative AI Framework for Automated Marketing Campaign Creation Using Product Images, Textual Prompts, and Web Intelligence

DOI : https://doi.org/10.5281/zenodo.20084728
Download Full-Text PDF Cite this Publication

Text Only Version

Ad Genie: A Multimodal Generative AI Framework for Automated Marketing Campaign Creation Using Product Images, Textual Prompts, and Web Intelligence

Ms. Gouthami

Project Guide, Assistant Professor

K. Saanvi

Author, Student (UG Scholar)

Dept. of Computer Science and Engineering Dept. of Computer Science and Engineering

Keshav Memorial Institute of Technology Hyderabad, Telangana, India

M. Divya Bharathi

Co-Author, Student (UG Scholar)

Keshav Memorial Institute of Technology Hyderabad, Telangana, India

Arekanti Mercy

Co-Author, Student (UG Scholar)

Dept. of Computer Science and Engineering Dept. of Computer Science and Engineering

Keshav Memorial Institute of Technology Hyderabad, Telangana, India

Chavan Supriya

Co-Author, Student (UG Scholar)

Keshav Memorial Institute of Technology Hyderabad, Telangana, India

Nenavath Sreelatha

Co-Author, Student (UG Scholar)

Dept. of Computer Science and Engineering Dept. of Computer Science and Engineering

Keshav Memorial Institute of Technology Hyderabad, Telangana, India

Abstract

Digital marketing requires product understanding, cus-tomer insight, competitive awareness, creative writing, and platform-specic communication. For small businesses, in-dependent sellers, student entrepreneurs, freelancers, and inuencers, producing effective campaigns is difcult be-cause it demands both creativity and continuous market re-search. Existing AI copywriting tools can generate pro-motional text, but many depend mainly on text prompts, produce generic outputs, and do not fully incorporate vi-sual product cues or real-time market context. This paper presents Ad Genie, a multimodal generative AI framework for automated marketing campaign creation. The proposed system accepts a product image and a campaign or product description as input, extracts visual and semantic features using vision-language models, generates search queries for market intelligence, retrieves trend and review-oriented in-formation from online sources, and produces structured campaign assets using a large language model. The gen-erated outputs include social media posts, a blog concept, a short promotional video script, target audience persona, market trends, sentiment summary, and structured interme-diate results. The prototype demonstrates how multimodal AI, natural language processing, computer vision, retrieval-augmented generation, and web intelligence can be inte-grated into a unied workow for context-aware marketing assistance. The work contributes a practical architecture for AI-driven campaign automation and identies future direc-tions such as multilingual generation, brand voice learning,

Keshav Memorial Institute of Technology Hyderabad, Telangana, India

automatic publishing, analytics, and AI-assisted video pro-duction.

Keywords: Multimodal AI, Generative AI, Digital Market-ing, Web Intelligence, Vision-Language Models, Retrieval-Augmented Generation, Campaign Automation, Customer Persona, Content Strategy.

  1. Introduction

    Digital marketing has become a central component of mod-ern commerce. Online sellers, social media creators, small businesses, and startups depend on product visibility across e-commerce marketplaces, short-video platforms, social networks, blogs, and search engines. A products success is inuenced not only by quality but also by how effec-tively it is presented to the intended audience. Product images, captions, customer reviews, hashtags, blog narra-tives, short videos, and inuencer-style messages collec-tively shape purchase decisions.

    Creating high-quality marketing content is a multidisci-plinary task. It requires product interpretation, knowledge of customer psychology, awareness of market trends, com-petitor analysis, platform-specic writing ability, and cre-ative storytelling. Large companies often rely on dedicated marketing teams, analytics tools, and creative agencies. In contrast, small sellers and independent creators frequently lack the resources and expertise needed to conduct system-atic market research and produce polished campaign ma-terial. Their promotional content may therefore become

    generic, inconsistent, or poorly aligned with customer ex-pectations.

    Recent progress in large language models (LLMs) has made it possible to generate uent marketing copy from nat-ural language prompts [11, 9, 10]. At the same time, vision-language models have improved the ability of AI systems to interpret images and connect visual information with text [3, 4, 5, 6]. Retrieval-augmented generation and web in-telligence techniques further allow AI systems to ground their outputs in external information such as reviews, trends, news, and competitor activity [7]. These developments cre-ate an opportunity to move beyond simple text generation and toward complete, context-aware campaign generation.

    This paper proposes Ad Genie, a multimodal AI agent for intelligent campaign creation. The system accepts two primary inputs: a product image and a textual campaign de-scription. It analyzes the product visually and semantically, retrieves relevant market information from online sources, extracts customer and trend insights, and generates a struc-tured campaign package. The current prototype demon-strates this workow through a web interface where users upload a product image, describe a campaign goal, and re-ceive organized outputs including social media posts, a blog concept, a video script, audience persona, and market anal-ysis.

    The central research question addressed in this paper is:

    How can multimodal product understanding and web intelligence be integrated with generative AI to automate the creation of context-aware digital marketing campaigns?

    The main contributions of this work are:

    1. A unied multimodal campaign generation pipeline that combines product image understanding, textual cam-paign intent, web intelligence, and generative AI.

    2. A modular system architecture consisting of input vali-dation, multimodal feature extraction, query generation, market insight extraction, audience strategy generation, and content rendering.

    3. A practical prototype that demonstrates campaign cre-ation from a product image and campaign topic through a user-friendly web interface.

    4. A structured output design that produces promotional copy, market trends, persona insights, blog concepts, video scripts, and machine-readable intermediate results.

    5. A research-oriented evaluation framework for compar-ing multimodal web-grounded generation with text-only LLM prompting and manual campaign drafting.

  2. Background and Related Work

    1. Generative AI in Digital Marketing

      Generative AI has rapidly entered the marketing domain be-cause LLMs can produce uent text, summarize informa-tion, rewrite content in different tones, and generate cre-ative ideas [2, 11, 12]. Marketing applications include ad copywriting, product descriptions, email campaigns, blog outlines, customer support responses, and social media cap-tions. LLMs are particularly useful for reducing the time required for brainstorming and rst-draft creation.

      Despite these advantages, prompt-based content genera-tion has limitations. If an LLM receives only a short product description, it may generate content that sounds polished but lacks product-specic detail, current market conext, or au-dience precision. The generated text may also hallucinate unsupported claims. For marketing use cases, unsupported claims can mislead customers or damage brand trust. There-fore, marketing generation systems benet from grounding mechanisms that connect generated content to product at-tributes and external evidence.

    2. Multimodal AI and Vision-Language Models

      Product marketing is naturally multimodal. A product im-age communicates color, shape, aesthetic style, material, usage context, and emotional tone. A product description communicates functional details, intended use cases, tech-nical specications, and brand messaging. A campaign gen-eration system that uses only text misses visual cues that are important for creating relevant content.

      Vision-language models such as CLIP, BLIP, LLaVA, and GPT-4o-style multimodal systems have shown that visual and textual information can be represented and reasoned about jointly [3, 4, 5, 12]. CLIP aligns images and text in a shared embedding space, BLIP supports image caption-ing and vision-language understanding, and LLaVA-style systems connect visual encoders with language models to enable visual question answering and multimodal reason-ing. These models make it possible to extract product-level attributes such as dominant colors, product category, style, use case, and visual mood.

      Prior work on image advertisements also shows that vi-sual information can improve the interpretation of advertis-ing symbolism and creative intent [15]. This supports the design choice of treating product images as rst-class inputs rather than optional decoration.

      In Ad Genie, multimodal analysis is used to convert a product image and text prompt into a richer product repre-sentation. This representation informs both market query generation and campaign content generation.

    3. Web Intelligence and Retrieval-Augmented Generation

      Marketing content must be sensitive to current trends. Cus-tomer preferences, competitor positioning, seasonal de-mands, and social media discussions change frequently. Static model knowledge alone is not sufcient for trend-aware marketing. Retrieval-augmented generation ad-dresses this issue by combining generative models with ex-ternal information retrieval [7].

      For marketing applications, web intelligence may include search results, product reviews, frequently asked questions, competitor listings, news articles, social media trends, and video review transcripts. This information can be analyzed for sentiment, pain points, keywords, and emerging cus-tomer needs. A web-grounded campaign system can then generate content that is more relevant than a purely prompt-based system.

    4. Sentiment Analysis, Personas, and Strat-egy

      Effective campaigns require an understanding of target au-diences. Customer personas summarize demographic, psy-chographic, behavioral, and motivational characteristics of potential buyers [13, 14]. Sentiment analysis identies whether customer discussions are positive, negative, or neu-tral [8]. Keyword extraction and trend mining identify the language customers use when searching for or discussing products.

      In Ad Genie, the system generates an audience persona using product semantics, market trends, and campaign in-tent. The persona includes likely age group, user inter-ests, needs, pain points, psychographics, and recommended channels. This makes the generated output more strategic than simple ad copy.

    5. Connection to SOMONITOR

      The base paper used for this project, SOMONITOR, presents a framework for marketing analytics using explain-able AI, CTR prediction, LLM-based content pillar extrac-tion, persona mining, communication theme mining, and data-driven story generation [1]. SOMONITOR demon-strates how LLMs can support marketing workows by pro-cessing large amounts of advertising content, identifying au-dience segments, and creating actionable content briefs.

      Ad Genie is conceptually related to SOMONITOR but differs in its primary objective. SOMONITOR focuses on monitoring and analyzing existing marketing content and campaign performance. Ad Genie focuses on generating a new campaign package from product-level input. In this sense, Ad Genie adapts the broader idea of AI-assisted marketing intelligence into a product-centered, multimodal campaign generation workow.

  3. State-of-the-Art Positioning

    The current state of the art in AI-assisted marketing is shaped by four converging research directions: large lan-guage models for natural language generation, vision-language models for multimodal product understanding, retrieval-augmented generation for grounding responses in external knowledge, and explainable marketing analytics for converting data into actionable strategy. Ad Genie is posi-tioned at the intersection of these directions.

    Table 1 summarizes how the proposed framework dif-fers from adjacent approaches. Text-only LLM tools are fast and useful for rst drafts, but they depend heavily on prompt quality and may miss product-specic visual sig-nals. Vision-language models can describe product ap-pearance, but they do not independently produce a com-plete marketing strategy. Retrieval-augmented generation grounds outputs in external sources, but it must be con-nected to domain-specic insight extraction to become use-ful for marketing. SOMONITOR and related explainable advertising systems focus on analyzing existing campaigns and competitor content. Ad Genie combines these streams into a product-level campaign generation workow.

  4. Research Gap

    Although AI marketing tools and LLM-based copywriting systems are increasingly available, several gaps remain:

    1. Many systems are text-only and do not use product im-ages for campaign creation.

    2. Many generated outputs are generic because they are not grounded in real-time market context.

    3. Most tools generate isolated pieces of content rather than a complete campaign strategy.

    4. Existing tools may not produce explicit audience per-sonas, pain points, market trends, and recommended channels.

    5. Small sellers require simple end-to-end workows rather than separate tools for research, analysis, writing, and formatting.

    6. Pure LLM systems may hallucinate claims when no ex-ternal grounding is provided.

    7. Marketing analytics frameworks often analyze existing campaign data but do not directly generate ready-to-use product campaign assets.

      Ad Genie addresses these gaps by integrating multi-modal product understanding, web intelligence, strategic in-sight extraction, and structured generative output in a single workow.

  5. Proposed System

    Ad Genie is designed as a modular AI framework for auto-mated campaign creation. The system receives multimodal input, processes it through specialized modules, and pro-duces both human-readable and machine-readable outputs.

    1. System Objectives

      The major objectives are:

      1. Automate the generation of digital marketing campaign assets.

      2. Combine product image understanding with textual product or campaign descriptions.

      3. Retrieve relevant market information from online sources.

      4. Extract trends, keywords, sentiment, competitor insights, and audience needs.

      5. Generate structured outputs such as social media posts, blog concepts, video scripts, and target personas.

      6. Provide a simple interface suitable for non-technical users.

    2. Target Users

      The system is intended for small e-commerce sellers, local businesses, student entrepreneurs, freelancers, digital mar-keters, inuencers, personal-brand builders, and marketing agencies seeking rapid campaign drafts.

    3. High-Level Workow

      Figure 1 shows the overall workow. The system begins with a user-provided product image and campaign descrip-tion, validates the inputs, extracts visual and semantic fea-

      Table 1: State-of-the-art positioning of Ad Genie against adjacent AI marketing approaches.

      Approach Primary Capability Limitation for Small-Seller Campaign Creation

      Ad Genie Extension

      Text-only LLM copywriting Generates uent captions,

      blogs, and ad copy from prompts

      Vision-language product analysis Extracts visual attributes

      and image captions from product photos

      Retrieval-augmented generation Grounds generated

      responses in external doc-uments or search results

      Explainable marketing analytics Analyzes existing cam-

      paigns, audiences, and competitor content

      Ad Genie full pipeline Integrates image, text, re-trieval, insight extraction, and generation

      Often generic; lacks im-age grounding and live market context

      Does not automatically generate full campaign strategy

      Requires domain-specic retrieval and summariza-tion design

      Primarily analytic; not de-signed as a product-to-campaign generator

      Requires broader bench-marking and production hardening

      Adds product image analysis, market retrieval, persona generation, and structured outputs

      Uses visual features as input to query genera-tion, audience strategy, and campaign generation Converts retrieved market signals into trends, senti-ment, keywords, and con-tent angles

      Adapts explainable mar-keting insight into an end-to-end campaign creation agent

      Provides a unied proto-type for context-aware, product-specic cam-paign drafting

      Product image and campaign description

      Input validation

      Multimodal product understanding

      Intelligent query generation

      Web intelligence retrieval

      Insight extraction

      Audience strategy generation

      Campaign content generation

      Output rendering and export

ported, the uploaded le is not corrupted, the text input is not empty, and the system has enough information to per-form analysis.

Figure 1: End-to-end workow of the proposed Ad Genie frame-work.

tures, retrieves web intelligence, generates insights, and pro-duces campaign-ready assets.

  1. Methodology

    1. Input Layer

      The input layer accepts a product image and a textual cam-paign or product description. The image may be uploaded in standard formats such as JPG, JPEG, PNG, or WEBP. The text description captures the users campaign goal or product context. For example, a user may enter: gifting a coffee mug to a friend. The system may also provide an optional manual image description eld when an image is not available or when the vision model is disabled.

      The input layer validates that the image format is sup-

    2. Multimodal Product Understanding

      The multimodal module analyzes both image and text. The vision component extracts product category, dominant col-ors, physical design, aesthetic style, mood, and usage con-text. The text component extracts campaign theme, sub-themes, product benets, target use case, brand tone, and constraints such as affordability, energy, elegance, sustain-ability, or gift suitability.

      The outputs from both components are fused into a com-pact structured representation, as shown in Listing 1.

      {

      "topic_semantics": { "main_theme": "gifting",

      "subthemes": ["friendship", "coffee", "mug"], "brand_tone": "modern"

      },

      "visual_aesthetic": { "colors": ["white", "gray"],

      "style_keywords": ["minimalist", "clean"], "mood": "invigorating"

      }

      }

      Listing 1: Example structured representation produced by the multimodal module.

    3. Intelligent Query Generation

      After product understanding, the system generates search queries for market intelligence. These queries are derived from the product category, campaign theme, customer use case, and visual attributes. For a coffee mug gift exam-ple, possible queries include personalized coffee mug gift trends, best gifts for coffee lovers, unique gifts for friends, coffee mug customer reviews, and trending personalized gift ideas.

      Query generation improves retrieval relevance because the system searches for market-specic context rather than relying on the users original prompt alone.

    4. Web Intelligence Retrieval

      The web intelligence module retrieves data from online sources such as search engines, review pages, news sources, social platforms, and product listings. Depending on imple-mentation and API availability, sources may include search engine results, e-commerce reviews, product FAQs, social media discussions, news articles, and YouTube review con-tent or transcripts.

      The retrieved data is cleaned before analysis. Cleaning may include removing HTML tags, duplicate results, irrele-vant links, advertisements, stopwords, and noisy text.

    5. Market Insight Extraction

      The insight extraction module transforms retrieved data into actionable marketing intelligence. It identies market trends, customer pain points, common product expectations, customer sentiment, competitor strengths and weaknesses, SEO keywords, hashtags, and recommended content angles. For the prototype coffee mug case, the system identied trends such as personalized gifts, experiential gifts, and sub-scription services. The market sentiment was marked as neutral, and the system recognized the gift as thoughtful but

      potentially not unique enough for some recipients.

    6. Audience Strategy Generation

      The audience strategy module generates a target persona and recommended channels. A persona may include a name, demographic prole, psychographic characteristics, needs and pain points, buying motivations, and recom-mended platforms. For the coffee mug example, the proto-type generated a persona named Coffee-Loving Friends, with demographics such as young adults aged 1835, ur-ban dwellers, coffee enthusiasts, and professionals. The rec-ommended channels included Instagram, TikTok, Pinterest, and Facebook Groups.

    7. Campaign Content Generation

      The nal generative module produces campaign-ready as-sets. The system generates three social media posts or tweets, a blog title and outline, a blog concept summary, a short video script with scenes and voiceover cues, a struc-tured campaign summary, and JSON output for debugging, export, or integration.

      The content generator uses the structured product repre-sentation and market insights as input. This approach re-duces generic output by grounding the LLM in product-specic and market-specic context.

  2. System Architecture

    The system follows a modular architecture. The major com-ponents are:

    Table 2: Module-wise technology mapping for the proposed sys-tem.

    Module Representative Technologies

    User interface Streamlit prototype; extensible to

    React-based frontend

    Vision analysis BLIP, LLaVA, CLIP-style vision-language models

    Text strategy LLaMA/GPT-style large language

    models

    Web intelligence DuckDuckGo, Serper API, Bing

    Search API, review/news sources Insight extraction Sentiment analysis, keyword exrac-

    tion, trend mining

    Output rendering Tabbed UI, JSON data view, tex-t/PDF export

    User Interface

    Input Validator

    Multimodal Engine

    Web Intelligence Module

    Insight Extractor

    Content Generator

    Output Renderer

Figure 2: Modular architecture of Ad Genie.

  1. Client/UI Layer: Provides elds for campaign topic, product image upload, optional manual image descrip-tion, model settings, and result display.

  2. Input Validator: Checks le type, input completeness, and basic constraints.

  3. Multimodal Engine: Uses vision-language models to extract visual and semantic information.

  4. Query Generator: Produces search queries based on the product representation.

  5. Web Intelligence Module: Retrieves market data from online sources.

  6. NLP Insight Extractor: Performs sentiment analy-sis, keyword extraction, trend detection, and competitor summarization.

  7. Content Generator: Uses LLMs to produce campaign assets.

  8. Output Renderer: Displays the generated insights in structured tabs and provides export options.

    1. Deployment View

The prototype is demonstrated as a local web applica-tion. The interface shown in the project screenshots runs at localhost:8501, which indicates a Streamlit-based prototype. The broader architecture can also be extended to a production environment using a separate frontend, back-end API layer, cloud GPU runtime, and external APIs.

A scalable deployment can consist of a client browser, web server or backend API, AI inference runtime, search and scraping APIs, external data sources, and output storage

or export layer.

  1. Prototype Implementation

    1. User Interface

      The prototype provides a dark-themed web interface titled Ad Genie: Where Marketing Meets Magic. The sidebar includes model settings and indicates the active technolo-gies: LLaMA-3-8B for text strategy, BLIP/LLaVA for im-age analysis, and DuckDuckGo for market trends.

      The main interface includes campaign topic input, prod-uct image upload, manual image description, generate but-ton, and result tabs for market analysis, audience strategy, creative content, and data view.

    2. Product Example Used in Prototype

      The demonstrated prototype uses the campaign topic gift-ing a coffee mug to a friend and a product image showing a white and black Lazy Panda coffee mug. The uploaded image contains a white mug with a black handle, panda il-lustration, and a small panda gure on the lid. The visual design suggests a cute, friendly, and gift-oriented product.

    3. Prototype Output

      The system generated visual palette attributes such as white and gray colors, minimalist and clean style, and an invigo-rating mood. It identied the core theme as gifting, with subthemes of friendship, coffee, and mug. It recognized a modern brand voice, market trends such as personalized gifts and experiential gifts, and neutral market sentiment. It also generated the persona Coffee-Loving Friends, rec-ommended channels such as Instagram, TikTok, Pinterest, and Facebook Groups, and produced social media posts, a blog concept, video script, and structured pipeline output.

    4. Example Generated Content

      The prototype generated post drafts such as:

      1. Fuel their friendship with a personalized coffee mug! #coffee #giftideas

      2. Want to make a lasting impression? Try gifting an expe-rience like a coffee-tasting tour! #experientialgifts #cof-fee

      3. Ready to upgrade your gifting game? Discover unique and thoughtful presents that reect your friends inter-ests! #giftinspo #coffee

      These examples show that the system does not only describe the mug but also expands the campaign toward broader gift-positioning strategies.

  2. Experimental Design and Evalua-tion Framework

    The current project bundle demonstrates a working proto-type and sample outputs. For a full research evaluation, this

    paper proposes a structured experimental design. The eval-uation should compare Ad Genie with baseline approaches across multiple product categories.

    1. Dataset Design

      A small benchmark dataset can be created using product im-ages and descriptions from different categories: personal-ized gift items, wireless earbuds, eco-friendly reusable bot-tles, fashion accessories, skincare products, kitchen appli-ances, home decor items, stationery products, tness acces-sories, and mobile phone accessories. Each sample should include a product image, product title, short product de-scription, intended campaign goal, and human-written ref-erence campaign if available.

    2. Baselines

      Ad Genie should be compared against four baselines: man-ual campaign drafting, text-only LLM prompting, image captioning followed by LLM generation, and the full Ad Genie pipeline. The full pipeline uses image, text, web in-telligence, insight extraction, and content generation.

    3. Evaluation Metrics

      The evaluation should include functional, performance, and content-quality metrics.

      1. Functional Metrics

        Functional metrics include image upload success rate, in-valid input detection, web retrieval success rate, output gen-eration success rate, and export success rate.

      2. Performance Metrics

        Performance metrics include image analysis time, web re-trieval time, content generation time, end-to-end response time, and failure recovery time.

      3. Content Quality Metrics

        Human evaluators can rate outputs on a 15 scale using product relevance, creativity, audience t, market aware-ness, platform suitability, clarity, persuasiveness, practical usefulness, factual safety, and overall campaign quality.

  3. Results and Discussion

    1. Prototype Observation

      The prototype successfully demonstrates the end-to-end concept of Ad Genie. In the coffee mug example, the sys-tem accepted a product image and campaign topic, analyzed visual and semantic attributes, identied market-oriented trends, generated a target persona, and produced structured creative content.

      The generated output shows three important capabilities. First, the system performs visual grounding by recogniz-ing color and style cues from the uploaded product image. Second, it performs semantic grounding by connecting the

      Table 3: Comparison of Ad Genie with baseline campaign generation approaches.

      Method

      Image

      Web Trends

      Persona

      Posts

      Blog

      Video Script

      Expected Strength

      Manual drafting

      Yes

      Yes

      Yes

      Yes

      Yes

      Yes

      High quality but time-

      consuming

      Text-only LLM

      No

      Limited

      Partial

      Yes

      Yes

      Yes

      Fast but often generic

      Image caption + LLM

      Partial

      No

      Partial

      Yes

      Yes

      Yes

      Better product grounding

      than text-only generation

      Ad Genie full pipeline

      Yes

      Yes

      Yes

      Yes

      Yes

      Yes

      Structured, context-aware,

      and product-specic cam-

      paign generation

      Table 4: Representative test cases for system validation.

      ID Module Expected Output

      MM-01 Input Valid image and text are ac-cepted

      MM-02 Input Corrupted image triggers an error
      MM-03 Input Missing text triggers validation warning
      API-01 Retrieval Valid query returns market data
      API-02 Retrieval Timeout triggers retry or fallback
      IN-01 Insights Sentiment and keywords are extracted
      CG-01 Generation Three posts are generated
      CG-02 Generation Blog concept is generated
      CG-03 Generation Video script is generated
      OP-01 Output Results are displayed in tabs

      10.4 Discussion

      Ad Genie should be understood as a campaign drafting assistant rather than a fully autonomous marketing manager. Its outputs can reduce ideation time and help users create a first draft of campaign material. Human review remains necessary for brand accuracy, legal safety, factual correctness, and final publishing decisions.

    campaign topic to themes such as gifting, friendship, coffee, and mug usage. Third, it performs strategic expansion by moving from a simple mug gift into broader trends such as personalized gifts and experiential gifts.

    This suggests that multimodal AI can improve campaign generation by linking product appearance, campaign intent, and market positioning.

      1. Strengths

        The main strengths of Ad Genie are its end-to-end workflow, multimodal understanding, structured strategy generation, user accessibility, extensibility, and practical relevance. The system combines tasks that are usually separate: product inspection, trend research, audience thinking, and content drafting. Even when the generated content is not final, it gives users a structured starting point.

      2. Limitations

    The current system has limitations. Web intelligence quality depends on available search results and external APIs. LLM-generated content may still require human review before publication. Sentiment analysis may be inaccurate when retrieved data is noisy or limited. The prototype screenshots demonstrate one product example; broader evaluation is required. API rate limits and model latency can affect response time. Some product categories may require domain-specific prompts or fine-tuning.

  4. Ethical, Legal, and Practical Considerations

    AI-generated marketing content must be handled carefully. The system should avoid unsupported product claims, disclose AI assistance where appropriate, protect uploaded product images, respect API and platform terms, avoid biased persona assumptions, require human approval before publishing, avoid copying protected marketing material, and prevent misleading or manipulative advertisements.

    These safeguards are especially important if Ad Genie is extended to automatic publishing or paid advertising work-ows.
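As an illustration only, the safeguards listed above could be enforced as a simple pre-publish gate that blocks content until disclosure and human approval are in place. Every name below (`CampaignDraft`, `prepublish_issues`, the claim-marker list) is a hypothetical sketch, not part of the Ad Genie prototype.

```python
# Hypothetical pre-publish gate illustrating the listed safeguards.
from dataclasses import dataclass

# Phrases that often signal unsupported product claims (illustrative list).
UNSUPPORTED_CLAIM_MARKERS = ("guaranteed", "clinically proven", "best in the world")

@dataclass
class CampaignDraft:
    text: str
    ai_disclosure: bool = False   # discloses AI assistance
    human_approved: bool = False  # reviewed by a human before publishing

def prepublish_issues(draft: CampaignDraft) -> list[str]:
    """Return a list of safeguard violations; an empty list means publishable."""
    issues = []
    lowered = draft.text.lower()
    for marker in UNSUPPORTED_CLAIM_MARKERS:
        if marker in lowered:
            issues.append(f"possible unsupported claim: '{marker}'")
    if not draft.ai_disclosure:
        issues.append("missing AI-assistance disclosure")
    if not draft.human_approved:
        issues.append("human approval required before publishing")
    return issues

draft = CampaignDraft(text="Guaranteed to be the perfect gift!")
print(prepublish_issues(draft))
```

A real deployment would back such keyword checks with a moderation model and legal review, but even a rule-based gate makes the human-approval requirement explicit in the pipeline.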

  5. Future Enhancements

    Future versions of Ad Genie can include multilingual campaign generation, seller dashboard integration, automatic content publishing after user approval, AI video generation, brand voice learning, analytics dashboards, A/B testing, CTR prediction, price intelligence, and a mobile application. These extensions would allow Ad Genie to evolve from a campaign drafting assistant into a broader AI-powered digital marketing platform.
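One of the listed extensions, A/B testing, could for example compare click-through rates of two generated ad variants. The sketch below uses a standard two-proportion z-test; it is illustrative only and the function name and sample numbers are assumptions, not part of the prototype.

```python
# Illustrative A/B test for two ad variants using a two-proportion z-test.
from math import sqrt, erf

def ab_test_ctr(clicks_a: int, views_a: int, clicks_b: int, views_b: int):
    """Return (z statistic, two-sided p-value) comparing the two CTRs."""
    p_a, p_b = clicks_a / views_a, clicks_b / views_b
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    se = sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    z = (p_b - p_a) / se
    # Two-sided tail probability under the standard normal distribution.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

z, p = ab_test_ctr(clicks_a=50, views_a=1000, clicks_b=80, views_b=1000)
print(f"z={z:.2f}, p={p:.4f}")
```

A production A/B module would also handle sequential testing and minimum sample sizes, but the core comparison is this simple.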

  6. Conclusion

This paper presented Ad Genie, a multimodal generative AI framework for automated marketing campaign creation. The system integrates product image analysis, textual campaign understanding, web intelligence, market insight extraction, audience persona generation, and LLM-based content creation. Unlike text-only copywriting tools, Ad Genie uses both visual and semantic product information and grounds campaign generation in market-oriented insights.

The prototype demonstrates a practical workflow in which a user uploads a product image, enters a campaign topic, and receives structured marketing outputs including social media posts, a blog concept, a video script, persona information, and machine-readable intermediate data. The coffee mug case study illustrates how the system can transform a simple product and campaign goal into a more complete strategy involving gifting themes, audience segments, recommended platforms, and creative content.
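The workflow just described can be sketched as an orchestration skeleton that maps image features and a campaign topic to the structured assets the paper lists (three posts, a blog concept, a video script, a persona). All names here (`CampaignAssets`, `generate_campaign`, the placeholder strings) are hypothetical illustrations of the architecture, not the prototype's actual code, and the real system would call vision-language and LLM services at the marked step.

```python
# Hypothetical skeleton of the campaign workflow; names are illustrative.
from dataclasses import dataclass, field

@dataclass
class CampaignAssets:
    social_posts: list = field(default_factory=list)
    blog_concept: str = ""
    video_script: str = ""
    persona: dict = field(default_factory=dict)

def generate_campaign(image_features: dict, campaign_topic: str) -> CampaignAssets:
    """Combine visual features and the campaign topic into structured assets.
    A real system would invoke vision-language and LLM services here."""
    product = image_features.get("product", "product")
    theme = campaign_topic or "general promotion"
    return CampaignAssets(
        social_posts=[f"Post {i + 1}: {product} for {theme}" for i in range(3)],
        blog_concept=f"Blog idea: why a {product} fits {theme}",
        video_script=f"15s script: show the {product}, end on the {theme} message",
        persona={"segment": "gift shoppers", "theme": theme},
    )

assets = generate_campaign({"product": "coffee mug"}, "gifting")
print(assets.social_posts[0])
```

Returning one typed structure keeps the downstream UI (the tabbed result display tested as OP-01) decoupled from how each asset is generated.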

Ad Genie is not intended to replace human marketers. Instead, it serves as an AI-assisted campaign drafting and research tool that can reduce manual effort, support small businesses, and provide structured creative direction. With further evaluation, stronger retrieval grounding, multilingual support, analytics integration, and human-in-the-loop safeguards, Ad Genie can evolve into a more complete AI-powered digital marketing assistant.

Acknowledgment

The authors thank the Department of Computer Science and Engineering, Keshav Memorial Institute of Technology, and the project guide Ms. Gouthami for their guidance and support.

References

  1. A. Farseev, Q. Yang, M. Ongpin, I. Gossoudarev, Y.-Y. Chu-Farseeva, and S. Nikolenko, SOMONITOR: Combining Explainable AI and Large Language Models for Marketing Analytics, arXiv:2407.13117, 2024.

  2. A. Vaswani et al., Attention Is All You Need, in Advances in Neural Information Processing Systems, 2017.

  3. A. Radford et al., Learning Transferable Visual Mod-els From Natural Language Supervision, in Inter-national Conference on Machine Learning, 2021, arXiv:2103.00020.

  4. J. Li et al., BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation, in International Conference on Machine Learning, 2022, arXiv:2201.12086.

  5. H. Liu et al., Visual Instruction Tuning, arXiv:2304.08485, 2023.

  6. L. Baraldi et al., The Revolution of Multimodal Large Language Models: A Survey, in Findings of the Association for Computational Linguistics, 2024, arXiv:2402.12451.

  7. P. Lewis et al., Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, in Advances in Neural Information Processing Systems, 2020, arXiv:2005.11401.

  8. C. Hutto and E. Gilbert, VADER: A Parsimonious Rule-Based Model for Sentiment Analysis of Social Media Text, in International AAAI Conference on Web and Social Media, 2014.

  9. S. Makridakis, F. Petropoulos, and Y. Kang, Large Language Models: Their Success and Impact, Forecasting, vol. 5, no. 3, pp. 536–549, 2023.

  10. S. Minaee et al., Large Language Models: A Survey, arXiv preprint, 2024.

  11. OpenAI, GPT-4 Technical Report, arXiv preprint, 2023.

  12. OpenAI, Hello GPT-4o, OpenAI technical announcement, 2024. [Online]. Available: https://openai.com/index/hello-gpt-4o/

  13. A. Malik, Persona Based Marketing Strategies: Creation of Personas Through Data Analytics, Master's thesis, 2019.

  14. D. Pelleg and A. W. Moore, X-Means: Extending K-Means with Efficient Estimation of the Number of Clusters, in International Conference on Machine Learning, 2000.

  15. A. Savchenko et al., Ad Lingua: Text Classification Improves Symbolism Prediction in Image Advertisements, in International Conference on Computational Linguistics, 2020.