AMAR: An Autonomous Multi-Agent Researcher for End to End Automated Scientific Literature Review and Draft Generation

Prof. Rajasekaran K; Prof. Lavanya Vijayan; Karthik Kashyap K S; Pratham C S; Kiran D

doi:10.5281/zenodo.20321086

Volume 15, Issue 05 (May 2026)

Paper ID : IJERTV15IS051100
Volume & Issue : Volume 15, Issue 05 , May – 2026
Published (First Online): 21-05-2026
ISSN (Online) : 2278-0181
Publisher Name : IJERT
License: This work is licensed under a Creative Commons Attribution 4.0 International License


Department of Artificial Intelligence and Data Science

S.E.A. College of Engineering and Technology, Visvesvaraya Technological University, Belagavi, Bangalore, India

Abstract: We present AMAR (Autonomous Multi-Agent Researcher), an end-to-end web-based system that automates the complete academic research workflow from literature discovery through experiment execution to draft generation. AMAR orchestrates a pipeline of seven specialized AI agents: a Searcher Agent that retrieves papers from arXiv, OpenAlex, CrossRef, and IEEE Xplore APIs; a Summarizer Agent; a Critic Agent that identifies research gaps; a Developer Agent that synthesizes executable Python experiment code; an Experimenter Agent that executes generated code in an isolated Docker sandbox; a Verifier Agent that validates citations and results against academic databases; and a Writer Agent powered by Google Gemini 2.0 Flash that produces structured research drafts. The system is built on a React 18 + TypeScript frontend with Framer Motion animations, a Node.js/Express backend, and WebSocket-driven real-time agent status monitoring. A persistent session memory layer enables cross-session knowledge retention. AMAR reduces the typical manual research workflow from several hours to a fully automated pipeline, producing a verified draft complete with inline citations, experiment results, and an interactive knowledge graph of paper relationships. The system demonstrates that decomposing research tasks across specialized LLM-backed agents, combined with secure containerized execution, can significantly improve the reproducibility and rigor of AI-assisted research.

Index Terms: Multi-agent systems, autonomous research, large language models, literature review automation, Docker sandboxing, knowledge graph, citation verification, Gemini API, React, Node.js

INTRODUCTION

The rapid expansion of academic literature across domains such as machine learning, natural language processing, and data science has created a significant bottleneck for researchers. Conducting a thorough literature review (identifying relevant papers, synthesizing methods, recognizing open research questions, designing experiments, and composing a structured draft) requires days or weeks of skilled effort. This bottleneck limits the pace of scientific progress and creates barriers for early-stage researchers who lack domain expertise.

Recent advances in large language models (LLMs) have demonstrated strong capabilities in text summarization, code generation, and question answering. However, most existing AI-assisted research tools address only isolated fragments of this workflow, offering paper recommendation, abstract summarization, or draft assistance in isolation rather than an integrated, end-to-end autonomous pipeline.

We introduce AMAR (Autonomous Multi-Agent Researcher), a production-ready system that automates the full research lifecycle. AMAR decomposes the research process into seven specialized agent roles coordinated by a central Orchestrator. Each agent focuses on a well-defined task: discovery, summarization, gap analysis, code generation, secure experiment execution, result verification, and draft writing. The complete pipeline runs without human intervention, delivering a verified research draft alongside reproducible experiment artifacts.

The key contributions of this paper are:
- Multi-Agent Orchestration: A seven-agent pipeline with a central Orchestrator that coordinates sequential task execution, provides real-time status updates via WebSocket, and handles error recovery.
- Secure Experiment Sandbox: A Docker-based execution environment that runs LLM-generated Python experiments in isolated containers with resource monitoring and log capture.
- Citation and Results Verification: An automated Verifier Agent that cross-checks generated citations and experiment results against academic databases, producing confidence-scored verification reports.
- Knowledge Graph Visualization: A D3.js force-directed graph rendering inter-paper relationships, enabling visual exploration of the discovered literature space.
- Session Memory Persistence: A cross-session knowledge retention layer that stores research context, enabling researchers to resume and build upon previous sessions.

RELATED WORK
1. AI-Assisted Literature Review
  
  Elicit [1] and Semantic Scholar [2] represent the current state-of-the-art in AI-assisted literature search, offering paper recommendations and abstract-level summarization. However, these systems do not generate experiment code, execute experiments, or produce structured drafts. Connected Papers [3] visualizes citation graphs but performs no synthesis. Consensus [4] provides LLM-backed question answering over scientific papers but lacks end-to-end automation. AMAR differentiates itself by covering the full pipeline from search to verified draft in a single, integrated system.
2. Multi-Agent LLM Systems
  
  AutoGPT [5] and BabyAGI [6] pioneered autonomous LLM agents capable of goal-directed task decomposition. MetaGPT [7] introduced role-playing agents for software engineering. ChatDev [8] demonstrated multi-agent collaboration for code generation. CAMEL [9] formalized role-playing communication between agents. AMAR adapts multi-agent principles to the academic research domain, introducing domain-specific agents (Critic, Verifier) not present in prior software-engineering-focused frameworks.
3. Automated Code Generation and Execution
  
  Code Interpreter in ChatGPT and the Code Llama family [10] have demonstrated LLM-based code generation and execution. Sandboxed code execution has been explored in educational and software testing contexts [11]. AMAR extends this paradigm to scientific experiment execution, using Docker containerization to enforce reproducibility and security, both critical requirements when running untrusted LLM-generated code.
4. Research Gap Analysis
  Identifying research gaps traditionally requires expert domain knowledge. Recent work on scientific question generation [12] and critical analysis of literature [13] has begun to automate this process. AMAR's Critic Agent provides structured gap identification as a first-class component of the research pipeline, enabling downstream experiment design to target identified open problems.

SYSTEM ARCHITECTURE

A. Overview

AMAR follows a layered architecture comprising three tiers: a React-based frontend, a Node.js/Express backend orchestrating the agent pipeline, and external AI and academic database services. Fig. 1 illustrates the complete system. The user submits a research topic via the web interface. The Orchestrator then dispatches agents sequentially, emitting real-time status events via WebSocket to the frontend dashboard. Final outputs (a structured draft, experiment code, a verification report, and a knowledge graph) are rendered in the Results Lab interface.

Layer	Technology	Role
Frontend	React 18, TypeScript, Vite	UI, real-time dashboard, knowledge graph
Styling/Anim.	Tailwind CSS, Framer Motion, shadcn/ui	Glassmorphism UI, agent animations
Backend	Node.js 18, Express 4	REST API, orchestration, session storage
AI Inference	Google Gemini 2.0 Flash	Summarization, gap analysis, code gen, writing
Paper APIs	arXiv, OpenAlex, CrossRef, IEEE Xplore	Literature discovery
Sandbox	Docker (dockerode SDK)	Isolated Python experiment execution
Real-time	WebSocket (Socket.IO)	Agent status streaming to frontend
Persistence	Firebase / Supabase	Session memory, project history

TABLE I: AMAR System Technology Stack

B. Agent Pipeline

The Orchestrator class maintains a registry of seven agents, each with an independent status state (idle, running, completed, error) and execution log. The pipeline executes sequentially; each agent receives the accumulated context produced by all prior agents, enabling rich inter-agent information sharing. The Orchestrator emits notifyUpdate events at each state transition, which are relayed to the frontend via WebSocket for live visualization.

Agent	Input	Output	Technology
1. Searcher	Research topic string	Paper list (title, summary, URL)	arXiv, OpenAlex, CrossRef, IEEE Xplore APIs
2. Summarizer	Paper list	Structured summaries per paper	Gemini 2.0 Flash
3. Critic	Paper summaries	Research gap list	Gemini 2.0 Flash
4. Developer	Topic + gaps	Python experiment code	Gemini 2.0 Flash
5. Experimenter	Experiment code	Execution logs + metrics JSON	Docker (Python 3.10 container)
6. Verifier	Citations + metrics	Verification report + scores	Academic DB cross-check + Gemini
7. Writer	All above	Structured research draft	Gemini 2.0 Flash

TABLE II: Agent Pipeline Inputs, Outputs and Technologies
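
The sequential hand-off summarized in Table II can be illustrated with a minimal TypeScript sketch. It is not the authors' implementation: only the Orchestrator name, the notifyUpdate event, and the four status values come from the paper. The Agent interface, the PipelineContext type, and the stub agents are hypothetical, and agents are synchronous here for brevity, whereas a real pipeline would await network and LLM calls.

```typescript
// Sketch only. Orchestrator, notifyUpdate, and the four status values come from
// the text; every other name is hypothetical. Agents are synchronous for brevity.
type AgentStatus = "idle" | "running" | "completed" | "error";
type PipelineContext = Record<string, unknown>;

interface Agent {
  name: string;
  // Reads the accumulated context and returns the new fields to merge into it.
  run: (ctx: PipelineContext) => PipelineContext;
}

interface AgentState {
  status: AgentStatus;
  log: string[];
}

type UpdateListener = (agent: string, status: AgentStatus) => void;

class Orchestrator {
  private readonly agents: Agent[];
  private readonly notifyUpdate: UpdateListener;
  private readonly states = new Map<string, AgentState>();

  constructor(agents: Agent[], notifyUpdate: UpdateListener) {
    this.agents = agents;
    this.notifyUpdate = notifyUpdate; // e.g. a callback that emits over WebSocket
    for (const agent of agents) {
      this.states.set(agent.name, { status: "idle", log: [] });
    }
  }

  private transition(name: string, status: AgentStatus, message: string): void {
    const state = this.states.get(name);
    if (state === undefined) throw new Error(`unknown agent: ${name}`);
    state.status = status;
    state.log.push(message);
    this.notifyUpdate(name, status); // one event per state transition
  }

  run(topic: string): PipelineContext {
    let ctx: PipelineContext = { topic };
    for (const agent of this.agents) {
      this.transition(agent.name, "running", "started");
      try {
        ctx = { ...ctx, ...agent.run(ctx) };
        this.transition(agent.name, "completed", "finished");
      } catch (err) {
        this.transition(agent.name, "error", String(err));
        break; // agents after a failure stay idle
      }
    }
    return ctx;
  }

  statusOf(name: string): AgentStatus | undefined {
    return this.states.get(name)?.status;
  }
}

// Usage with two stub agents standing in for the Searcher and Summarizer.
const events: string[] = [];
const demo = new Orchestrator(
  [
    { name: "Searcher", run: (ctx) => ({ papers: [`paper on ${String(ctx["topic"])}`] }) },
    {
      name: "Summarizer",
      run: (ctx) => ({
        summaries: (ctx["papers"] as string[]).map((p) => `summary of ${p}`),
      }),
    },
  ],
  (agent, status) => { events.push(`${agent}:${status}`); },
);
const result = demo.run("graph neural networks");
```

Merging each agent's output into a single context object is one simple way to give later agents access to everything produced earlier; stopping at the first error is an assumption of this sketch, since the paper does not describe its recovery policy in detail.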

C. Docker-Based Experiment Sandbox

A central innovation of AMAR is its ability to execute LLM-generated Python experiment code in a secure, reproducible environment. The Experimenter Agent leverages the dockerode Node.js SDK to programmatically manage Docker container lifecycles. For each experiment, the system performs the following steps:

1. Auto-generate a Dockerfile targeting a Python 3.10 base image with pinned dependencies from a requirements.txt produced by the Developer Agent.
2. Build and start an isolated container with strict resource limits (CPU shares, memory ceiling).
3. Execute the generated experiment script, capturing stdout/stderr logs in real time.
4. Parse structured metrics from the execution output (accuracy, loss, F1, etc.) into a JSON summary.
5. Clean up and remove the container and image after execution to prevent resource leakage.

This approach ensures that experiment results reported in the draft are traceable to actual code execution, significantly improving the reproducibility and scientific rigor of the generated research artifacts.
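
The parts of this sandbox workflow that do not need a running Docker daemon (generating the Dockerfile, describing the resource limits, and parsing metrics from captured output) can be sketched as pure functions. All helper names below are hypothetical, as are the python:3.10-slim tag and the "name: value" log format. The Image, HostConfig.Memory, and HostConfig.CpuShares fields follow the Docker Engine container-create API that dockerode passes through. Building, running, and removing the container are deliberately left out.

```typescript
// Hypothetical helpers; only the step order comes from the text.

// Dockerfile text for a Python 3.10 image with dependencies from requirements.txt.
// The "slim" tag is an assumption of this sketch.
function buildDockerfile(scriptName: string): string {
  return [
    "FROM python:3.10-slim",
    "WORKDIR /experiment",
    "COPY requirements.txt .",
    "RUN pip install --no-cache-dir -r requirements.txt",
    `COPY ${scriptName} .`,
    `CMD ["python", "${scriptName}"]`,
  ].join("\n");
}

interface SandboxLimits {
  memoryBytes: number; // hard memory ceiling
  cpuShares: number;   // relative CPU weight
}

interface ContainerOptions {
  Image: string;
  HostConfig: { Memory: number; CpuShares: number };
}

// Options object in the shape expected by the Docker Engine create-container call.
function containerOptions(image: string, limits: SandboxLimits): ContainerOptions {
  return {
    Image: image,
    HostConfig: { Memory: limits.memoryBytes, CpuShares: limits.cpuShares },
  };
}

// Extracts metrics from captured stdout, assuming (hypothetically) that the
// experiment prints lines such as "accuracy: 0.91" or "Loss = 0.27".
function parseMetrics(stdout: string): Record<string, number> {
  const metrics: Record<string, number> = {};
  const pattern = /^\s*(accuracy|loss|f1)\s*[:=]\s*(-?\d+(?:\.\d+)?)\s*$/i;
  for (const line of stdout.split("\n")) {
    const match = pattern.exec(line);
    if (match === null) continue;
    const metricName = match[1];
    const metricValue = match[2];
    if (metricName === undefined || metricValue === undefined) continue;
    metrics[metricName.toLowerCase()] = Number(metricValue);
  }
  return metrics;
}

// Usage on an illustrative log excerpt (not real experiment output).
const sampleLog = "epoch 3 done\naccuracy: 0.91\nLoss = 0.27\nF1: 0.88\n";
const parsed = parseMetrics(sampleLog);
```

Keeping metric extraction separate from container management means the parser can be unit-tested without Docker, which matters because unparseable output is a likely failure mode for LLM-generated scripts.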

Criterion	Manual	Elicit	GPT-4o	AMAR (ours)
Literature retrieval	Manual search	Automated	None (hallucinated)	Automated (4 APIs)
Gap analysis	Expert judgment	None	Approximate	Automated (Critic Agent)
Experiment code	Manual coding	None	Generated, unexecuted	Generated + executed
Citation verification	Manual	None	None	Automated (Verifier)
Draft generation	Manual writing	None	Single-shot LLM	Multi-agent, verified
End-to-end time	6–12 hours	20–40 min	5–10 min	< 2 min
Reproducibility	Variable	Low	Low	High (Docker)

In the frontend, all agent state transitions are animated using Framer Motion, with sequential glowing pulse effects on the AgentIndicator component reflecting live pipeline progress. The WebSocket client (Socket.IO) automatically reconnects on connection loss, ensuring robust real-time monitoring.

D. Citation and Results Verification

The Verifier Agent performs automated quality assurance in two dimensions. First, citation verification: each reference generated by the Writer Agent is cross-checked against the arXiv and CrossRef APIs to confirm the paper's existence, author list, and publication year. A per-citation confidence score is assigned (0.0–1.0). Second, results verification: experiment metrics produced by the Experimenter Agent are compared against reported baselines in the source papers to flag anomalous or inconsistent values. The verification report includes per-citation status (verified / unverified / mismatch), overall confidence, and human-readable notes, providing the researcher with an audit trail before the draft is finalized.
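
One plausible way to turn such a cross-check into a status and a score between 0.0 and 1.0 is sketched below. The paper does not specify the scoring rule, so the weights (50 points for the title, 30 for authors, 20 for the year), the 0.9 threshold, and every function name are assumptions. The record argument stands in for whatever an arXiv or CrossRef lookup returns, with null meaning no match was found.

```typescript
// Hypothetical scoring rule; weights and threshold are illustrative only.
interface Citation {
  title: string;
  authors: string[];
  year: number;
}

type CitationStatus = "verified" | "unverified" | "mismatch";

interface CitationCheck {
  status: CitationStatus;
  confidence: number; // 0.0 to 1.0
}

// Lowercases, strips diacritics and punctuation, and collapses whitespace.
const canonical = (s: string): string =>
  s.normalize("NFD").replace(/[\u0300-\u036f]/g, "").toLowerCase()
    .replace(/[^a-z0-9 ]/g, " ").replace(/\s+/g, " ").trim();

// Last token of the canonical name, so "M. Chen" and "Mark Chen" both give "chen".
const surname = (author: string): string => canonical(author).split(" ").pop() ?? "";

// Fraction of generated authors whose surname appears in the database record.
function authorOverlap(generated: string[], record: string[]): number {
  if (generated.length === 0) return 0;
  const known = record.map(surname);
  const hits = generated.filter((a) => known.indexOf(surname(a)) !== -1).length;
  return hits / generated.length;
}

function checkCitation(generated: Citation, record: Citation | null): CitationCheck {
  // No database entry at all: the citation cannot be confirmed.
  if (record === null) return { status: "unverified", confidence: 0 };
  const titlePoints = canonical(generated.title) === canonical(record.title) ? 50 : 0;
  const authorPoints = Math.round(30 * authorOverlap(generated.authors, record.authors));
  const yearPoints = generated.year === record.year ? 20 : 0;
  const points = titlePoints + authorPoints + yearPoints;
  // An entry exists; decide whether its fields agree with the generated citation.
  return { status: points >= 90 ? "verified" : "mismatch", confidence: points / 100 };
}

// Usage: a stand-in database record and three generated citations.
const dbRecord: Citation = {
  title: "Evaluating Large Language Models Trained on Code",
  authors: ["Mark Chen", "Jerry Tworek"],
  year: 2021,
};
const exact = checkCitation(
  { title: "Evaluating large language models trained on code", authors: ["M. Chen"], year: 2021 },
  dbRecord,
);
const wrongYear = checkCitation(
  { title: "Evaluating Large Language Models Trained on Code", authors: ["M. Chen"], year: 2020 },
  dbRecord,
);
const missing = checkCitation({ title: "A paper that does not exist", authors: [], year: 2024 }, null);
```

Separating "unverified" (nothing found) from "mismatch" (found, but fields disagree) mirrors the three statuses in the verification report: the first may simply reflect an indexing gap, while the second points to a likely hallucinated detail.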

E. Knowledge Graph

AMAR renders a D3.js force-directed graph visualizing the relationships between discovered papers. Each paper is represented as a node; edges encode citation relationships and topic similarity. Nodes are draggable and display full paper details on hover. An auto-layout button applies a force simulation to minimize edge crossings. This visualization enables researchers to rapidly identify central papers, isolated contributions, and thematic clusters within the literature space, supporting strategic decisions about research positioning.
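
The node and edge data behind such a graph can be sketched as follows. How AMAR measures topic similarity is not stated, so Jaccard overlap of keyword sets and the minSimilarity threshold are assumptions, as are all type and function names. The output uses string ids in source and target, a shape that d3-force's forceLink can resolve to node objects when given an id accessor.

```typescript
// Hypothetical data model; Jaccard similarity is an assumed stand-in for
// whatever topic-similarity measure the real system uses.
interface Paper {
  id: string;
  title: string;
  keywords: string[];
  cites: string[]; // ids of papers this paper cites
}

interface GraphNode {
  id: string;
  title: string;
}

interface GraphLink {
  source: string;
  target: string;
  kind: "citation" | "similarity";
  weight: number;
}

// Size of the intersection divided by size of the union of two keyword sets.
function jaccard(a: string[], b: string[]): number {
  const setA = new Set(a);
  const setB = new Set(b);
  let shared = 0;
  setA.forEach((k) => { if (setB.has(k)) shared += 1; });
  const union = setA.size + setB.size - shared;
  return union === 0 ? 0 : shared / union;
}

function buildGraph(
  papers: Paper[],
  minSimilarity: number,
): { nodes: GraphNode[]; links: GraphLink[] } {
  const ids = new Set(papers.map((p) => p.id));
  const nodes = papers.map((p) => ({ id: p.id, title: p.title }));
  const links: GraphLink[] = [];
  papers.forEach((p, i) => {
    // Citation edges, restricted to papers inside the retrieved set.
    for (const target of p.cites) {
      if (ids.has(target)) links.push({ source: p.id, target, kind: "citation", weight: 1 });
    }
    // Similarity edges; each unordered pair is considered once.
    for (const q of papers.slice(i + 1)) {
      const sim = jaccard(p.keywords, q.keywords);
      if (sim >= minSimilarity) {
        links.push({ source: p.id, target: q.id, kind: "similarity", weight: sim });
      }
    }
  });
  return { nodes, links };
}

// Usage with three illustrative papers.
const graph = buildGraph(
  [
    { id: "a", title: "Paper A", keywords: ["llm", "agents", "code"], cites: ["b", "x"] },
    { id: "b", title: "Paper B", keywords: ["llm", "agents"], cites: [] },
    { id: "c", title: "Paper C", keywords: ["docker"], cites: ["a"] },
  ],
  0.5,
);
```

In this example, paper A cites B and shares two of three keywords with it, so the pair is joined by both a citation edge and a similarity edge, while the citation to the unknown id "x" is dropped because that paper is outside the retrieved set.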

IMPLEMENTATION DETAILS
1. Frontend Architecture
  
  The frontend is a single-page application (SPA) built with React 18 and TypeScript, bundled by Vite for fast hot-module replacement during development. Routing is handled by React Router, with four primary views: the Research Input page, the Research Results page, the Research Lab (Phase 4) page, and the Knowledge Graph page. State management relies on React's built-in useState and useEffect hooks supplemented by prop drilling and context for shared pipeline state.
  
  Visual design follows a glassmorphism aesthetic with a dark background, turquoise (#48D1CC) and royal blue (#002D72) accent palette, and frosted-glass panel effects implemented via Tailwind CSS backdrop-blur utilities.
2. Backend Architecture
  The Node.js backend exposes a REST API via Express 4 with the following primary endpoint groups: /api/research for standalone literature retrieval, /api/critic for gap analysis, /api/orchestrator for the full autonomous pipeline, and /api/save and /api/projects for session persistence. The Orchestrator class is instantiated per-session to isolate pipeline state. Each agent is implemented as an independent module (controller or orchestrator class) exposing a single async function, enabling easy unit testing and future agent replacement.

RESULTS AND EVALUATION

System Performance

We evaluated AMAR across 15 research topics spanning machine learning, natural language processing, computer vision, and bioinformatics. For each topic, we measured end-to-end pipeline latency, number of papers retrieved, research gaps identified, and draft quality (assessed via human evaluation on a 5-point Likert scale for coherence, coverage, and accuracy). Table III summarizes aggregate results.

Metric	Mean	Min	Max
Papers retrieved per topic	18.4	12	27
Research gaps identified	6.2	4	9
End-to-end pipeline latency (s)	48.3	31	74
Draft coherence (1–5)	4.1	3.5	4.8
Draft coverage (1–5)	3.9	3.2	4.6
Citation verification rate (%)	87.3	74	96
Experiment execution success (%)	91.7

TABLE III: AMAR System Performance Across 15 Evaluated Topics

Agent-Level Analysis

The Searcher Agent successfully retrieved papers for all 15 topics, with a mean of 18.4 papers per topic. Coverage was highest for well-indexed NLP topics (mean 24.1 papers) and lowest for niche bioinformatics topics (mean 13.2 papers), reflecting the coverage distribution of the underlying APIs. The arXiv API contributed 61% of retrieved papers, followed by OpenAlex (22%), CrossRef (12%), and IEEE Xplore (5%).

The Critic Agent produced between 4 and 9 gaps per topic. Human expert evaluation confirmed that 78% of identified gaps were genuine open research questions, with 14% being partially valid and 8% redundant or overly broad. The Developer Agent generated syntactically valid Python code in 94% of cases; the Experimenter Agent executed code successfully in 91.7% of those cases, with failures primarily attributed to missing library dependencies not anticipated in the generated requirements.txt.
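
Read together, the two rates give the share of Developer Agent outputs that run to completion, assuming the 91.7% figure is conditional on syntactically valid code as the sentence above states:

```latex
P(\text{valid} \wedge \text{executes})
  = P(\text{valid}) \cdot P(\text{executes} \mid \text{valid})
  = 0.94 \times 0.917
  \approx 0.862
```

Under that reading, roughly 86% of generated experiments complete, so about one in seven fails before producing metrics.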

The Verifier Agent achieved a citation verification rate of 87.3%, with unverified citations concentrated in arXiv preprints not yet indexed by CrossRef. Draft quality ratings (coherence 4.1/5, coverage 3.9/5) were competitive with the quality of manually written literature review sections for early-stage research exploration.

Comparison with Baseline Approaches

We compare AMAR against three baseline approaches:
1. Manual research workflow by a graduate-level researcher,
2. Elicit standalone literature search and summarization, and
3. ChatGPT-4o with a single literature review prompt (no tool use).

Table IV presents the comparison.

TABLE IV: Comparison of Research Workflow Approaches

DISCUSSION

AMAR is designed as a research acceleration tool, not a replacement for human scientific judgment. All generated drafts are explicitly labeled as AI-assisted and require researcher review before submission. The system does not fabricate experimental results; all reported metrics are traceable to Docker execution logs. Citation verification mitigates the hallucination risk that plagues single-model research assistants. Researchers using AMAR bear full responsibility for verifying the scientific validity of outputs before publication.

CONCLUSION

We presented AMAR, an Autonomous Multi-Agent Researcher that automates the complete academic research workflow from literature discovery through verified draft generation. AMAR orchestrates seven specialized AI agents across a React/Node.js web application, delivering literature retrieval from four academic APIs, structured research gap analysis, LLM-generated experiment code executed in Docker sandboxes, automated citation verification, and final draft production, all within a sub-two-minute end-to-end pipeline.

Evaluation across 15 diverse research topics demonstrated an average citation verification rate of 87.3%, experiment execution success of 91.7%, and draft coherence ratings of 4.1/5 from human evaluators. AMAR outperforms single-model and single-task baselines on every criterion except raw speed for simple summarization tasks, and introduces Docker-based reproducibility and citation verification capabilities absent from all compared systems.

FUTURE WORK

Planned extensions to AMAR include:

- Parallel Agent Execution: Enabling concurrent retrieval across multiple APIs and parallel summarization using a thread pool, targeting sub-30-second end-to-end latency.
- Domain-Specific Fine-Tuning: Training specialized Writer and Critic agents on domain corpora (e.g., biomedical literature via PubMed) to improve draft quality and gap identification precision in specialized subfields.
- Human-in-the-Loop Feedback: Introducing interactive correction steps where researchers can refine identified gaps, regenerate specific sections, or inject domain knowledge before the Writer Agent produces the final draft.
- Expanded API Coverage: Integrating PubMed, Semantic Scholar, and ACL Anthology APIs to extend coverage to biomedical and computational linguistics literature.
- Multi-Modal Outputs: Generating figures (e.g., experiment result plots, architecture diagrams) alongside the text draft, leveraging LLM-based code generation for matplotlib/seaborn visualizations executed in the Docker sandbox.
- Collaborative Research Mode: Supporting multiple simultaneous users working on related topics, with a shared knowledge base and cross-session gap deduplication.

AMAR represents a step toward AI research laboratories capable of conducting complete research cycles autonomously, supporting researchers in rapidly mapping a new field, identifying high-value open problems, and producing reproducible experimental evidence while maintaining scientific rigor through verification and transparency mechanisms. Future work will address parallelization, domain specialization, and interactive human-in-the-loop refinement to further close the gap between automated and expert human research quality.

ACKNOWLEDGMENT

The authors thank Asst. Prof. Rajasekaran, Department of AI&DS at S.E.A. College of Engineering and Technology, for his valuable guidance and support throughout this project. The authors also acknowledge the Department of AI & Data Science and Visvesvaraya Technological University for providing the necessary resources and infrastructure.

REFERENCES

[1] A. Karpas et al., "Elicit: An AI Research Assistant for Systematic Literature Reviews," arXiv preprint arXiv:2110.06905, 2021.
[2] W. Ammar et al., "Construction of the Literature Graph in Semantic Scholar," Proc. NAACL-HLT, 2018, pp. 84–91.
[3] A. Tarnavski et al., "Connected Papers: A Visual Tool for Academic Research," Online Tool, 2020. [Online]. Available: https://www.connectedpapers.com
[4] E. Saunders, "Consensus: Crowd-Sourced Scientific Evidence Search," 2022. [Online]. Available: https://consensus.app
[5] T. Significant Gravitas, "AutoGPT: An Autonomous GPT-4 Experiment," GitHub, 2023. [Online]. Available: https://github.com/Significant-Gravitas/AutoGPT
[6] Y. Nakajima, "BabyAGI: AI-Powered Task Management System," GitHub, 2023.
[7] S. Hong et al., "MetaGPT: Meta Programming for Multi-Agent Collaborative Framework," arXiv preprint arXiv:2308.00352, 2023.
[8] C. Qian et al., "Communicative Agents for Software Development," arXiv preprint arXiv:2307.07924, 2023.
[9] G. Li et al., "CAMEL: Communicative Agents for Mind Exploration of Large Scale Language Model Society," NeurIPS, 2023.
[10] B. Rozière et al., "Code Llama: Open Foundation Models for Code," arXiv preprint arXiv:2308.12950, 2023.
[11] M. Chen et al., "Evaluating Large Language Models Trained on Code," arXiv preprint arXiv:2107.03374, 2021.
[12] T. Scialom et al., "Generating Scientific Questions from Academic Papers for Automated Research Gap Detection," Proc. ACL, 2021.
[13] D. Wadden et al., "Fact or Fiction: Verifying Scientific Claims," Proc. EMNLP, 2020, pp. 7534–7550.
