
Humming-Based Music Retrieval and Generation Approaches

DOI: https://doi.org/10.5281/zenodo.18910602

Parthip TS, Vishnu Prakash, Nidhin N, Nevlin Jude Correya, Anaswara Dev S, Dr. Sabeena K, Chinchu M Pillai

College of Engineering Chengannur, Kerala, India

This work was supported by the College of Engineering Chengannur

ABSTRACT Query-by-humming (QbH) is a natural and intuitive approach within Music Information Retrieval (MIR), enabling users to retrieve or generate music using humming input. Unlike fingerprint-based recognition systems such as Shazam, QbH must process noisy, off-pitch, and rhythm-inconsistent queries. Over nearly three decades, research has progressed from early pitch contour and string-matching systems to modern deep learning embedding methods and generative pipelines. This survey reviews eight significant works: Ghias et al. (1995), Pham et al. (2024), Ranjan and Srivastava (2023), Tencent AI Lab (2023), Jin (2021), Amatov et al. (2023), Zhang (2024), and the Sri Lankan MIR review (2021). We analyze their methodologies, datasets, performance, contributions, and limitations. The review concludes that deep embedding-based retrieval provides the best accuracy, but hybrid systems combining noise-robust preprocessing, dataset scaling, and generative mashups offer the most promise for future applications such as Hummify.

INDEX TERMS Query by humming, music information retrieval, deep learning, Faiss, ArcFace, subsequence matching, dataset creation, music generation.

  1. INTRODUCTION

    Music is remembered by humans largely in terms of melody. Users often recall a tune without lyrics or title, making humming-based retrieval a natural query method. While Shazam and SoundHound achieve high accuracy for studio-quality recordings, they fail on humming due to instability in pitch, tempo, and background noise. Google's Hum to Search offers partial support but underperforms on regional or multilingual music.

    Research in QbH spans early contour-based systems (Ghias 1995) to embedding-based deep learning frameworks (Pham 2024), noise-robust feature extraction (Ranjan 2023), mashup generation pipelines (Tencent AI 2023), subsequence retrieval (Jin 2021), dataset expansion (Amatov 2023), and sentence-based retrieval (Zhang 2024). A 2021 Sri Lankan review further contextualizes MIR within machine learning and retrieval challenges.

    This survey analyzes these eight works, comparing methodologies, datasets, performance, contributions, and limitations to identify best practices and future directions for humming-based music retrieval and generation.

  2. BACKGROUND AND RELATED WORK

    Music Information Retrieval (MIR) has evolved significantly over the past three decades. Early research in the 1990s focused on symbolic representations of music, especially MIDI sequences, which allowed simple melody search through pitch contour encoding. The pioneering work of Ghias et al. (1995) introduced the concept of Query-by-Humming (QbH), encoding pitch movements as Up, Down, or Same and using approximate string matching to search small MIDI databases. Although this proved the feasibility of melody-based retrieval, it was highly sensitive to pitch errors and could not scale.

    Subsequent approaches integrated Dynamic Time Warping (DTW) and chroma-based features, which provided better tolerance to pitch variations and timing inconsistencies. These systems improved robustness but remained computationally expensive, particularly as databases grew larger.
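    To make the alignment idea concrete, the following is a minimal Python sketch of classic DTW over two one-dimensional pitch sequences; the note values are illustrative, not drawn from any reviewed dataset:

```python
import numpy as np

def dtw_distance(query, reference):
    """Classic dynamic time warping between two 1-D pitch sequences.

    Tolerates local tempo differences by allowing each query frame to
    align with one or more reference frames (and vice versa).
    """
    n, m = len(query), len(reference)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(query[i - 1] - reference[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # stretch the reference
                                 cost[i, j - 1],      # stretch the query
                                 cost[i - 1, j - 1])  # one-to-one match
    return cost[n, m]

# Example: a hummed contour that is a slowed-down version of the reference
hum = [60, 60, 62, 64, 64, 62, 60]
ref = [60, 62, 64, 62, 60]
print(dtw_distance(hum, ref))  # small cost despite the length mismatch
```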

    The deep learning era transformed QbH into an embedding problem, where humming queries and reference songs are mapped into a shared feature space. Works such as Pham et al. (2024) demonstrated the use of convolutional neural networks (CNNs), mel-spectrogram preprocessing, and metric learning losses like ArcFace to achieve high retrieval accuracy. At the same time, creative frameworks such as Tencent AI Lab's Humming2Music expanded the QbH scope from retrieval to full music generation, showing that humming can serve as both a query and a creative input for AI-driven composition.
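    As an illustration of the metric-learning component, below is a minimal PyTorch sketch of an ArcFace-style additive-angular-margin loss as commonly formulated; the scale and margin values are illustrative assumptions, not the settings used by Pham et al.:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArcFaceHead(nn.Module):
    """ArcFace-style classification head (a common formulation).

    Embeddings and per-song class weights are L2-normalized, so logits
    are cosine similarities; a margin m is added to the target angle,
    pushing same-song embeddings closer together on the hypersphere.
    """
    def __init__(self, emb_dim, num_songs, scale=30.0, margin=0.30):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_songs, emb_dim))
        self.scale, self.margin = scale, margin

    def forward(self, embeddings, labels):
        # Cosine similarity between each embedding and each song prototype
        cos = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
        target = F.one_hot(labels, cos.size(1)).bool()
        # Add the angular margin only on the ground-truth song's logit
        logits = torch.where(target, torch.cos(theta + self.margin), cos)
        return F.cross_entropy(self.scale * logits, labels)
```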

  3. METHODOLOGY AND DATASET ANALYSIS
    1. Early Pitch Contour and Pattern Matching

      Ghias et al. (1995) developed the first QbH system using pitch contour encoding (U=up, D=down, S=same). Queries were matched to MIDI databases with approximate string matching. The dataset contained 183 songs, with retrieval effective for sequences of 10–12 notes. The main limitation was slow pitch tracking and poor scalability.
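      A minimal sketch of this contour-plus-string-matching idea follows; the melodies and database entries are hypothetical:

```python
def contour(pitches):
    """Encode a pitch sequence as U/D/S moves, as in Ghias et al. (1995)."""
    return "".join("U" if cur > prev else "D" if cur < prev else "S"
                   for prev, cur in zip(pitches, pitches[1:]))

def edit_distance(a, b):
    """Levenshtein distance: the approximate-string-matching core."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[-1]

# Rank database songs by contour distance to the hummed query
query = contour([60, 62, 62, 64, 60])          # -> "USUD" (hypothetical hum)
db = {"song_a": "USUDD", "song_b": "DDSSU"}    # hypothetical contour index
print(sorted(db, key=lambda s: edit_distance(query, db[s])))
```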

    2. Embedding-Based Deep Learning

      Pham et al. (2024) preprocess humming into mel-spectrograms and train CNNs (ResNet34, MobileNetV2, AlexNet) with ArcFace loss. Embeddings are compared using Faiss similarity search. The Hum2Song dataset from the Zalo AI Challenge (9k clips, 1k Vietnamese songs) was used, achieving 94.2% MRR@10.
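      The retrieval step can be sketched with the Faiss library as follows; the embedding dimension and random vectors are placeholders standing in for CNN outputs, and cosine similarity via normalized inner products is an assumption about the setup:

```python
import numpy as np
import faiss  # pip install faiss-cpu

emb_dim = 512  # illustrative embedding size
# Stand-in for the CNN embeddings of 1,000 reference songs
song_embs = np.random.rand(1000, emb_dim).astype("float32")
faiss.normalize_L2(song_embs)        # cosine similarity via inner product

index = faiss.IndexFlatIP(emb_dim)   # exact inner-product search
index.add(song_embs)

# Stand-in for the embedding of one hummed query
query = np.random.rand(1, emb_dim).astype("float32")
faiss.normalize_L2(query)
scores, song_ids = index.search(query, 10)  # top-10 candidate songs
print(song_ids[0])
```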

    3. Noise-Robust Recognition

      Ranjan and Srivastava (2023) applied Total Variation Regularization (TVR) for denoising and trained a Fully Convolutional Network for melody classification. Evaluated on MIR-QBSH, the system improved robustness to noisy humming but was limited by small dataset size.
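      The paper's exact TVR formulation is not reproduced here; the toy sketch below applies 1-D total variation denoising to a pitch track by subgradient descent, which smooths jitter while preserving the step-like jumps at note boundaries:

```python
import numpy as np

def tv_denoise_1d(y, lam=1.0, step=0.05, iters=500):
    """1-D total variation denoising by subgradient descent on
    0.5*||x - y||^2 + lam * sum(|x[i+1] - x[i]|)."""
    x = y.astype(float).copy()
    for _ in range(iters):
        grad = x - y                 # gradient of the fidelity term
        d = np.sign(np.diff(x))      # subgradient of the TV term
        grad[:-1] -= lam * d         # each |x[i+1]-x[i]| pulls on both ends
        grad[1:] += lam * d
        x -= step * grad
    return x

# Noisy step-like pitch contour: TV keeps the note change, removes jitter
rng = np.random.default_rng(0)
noisy = np.r_[np.full(50, 60.0), np.full(50, 64.0)]
noisy += 0.3 * rng.standard_normal(100)
print(tv_denoise_1d(noisy, lam=2.0)[45:55].round(2))
```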

    4. Generative Music and Mashups

      Tencent AI Lab (2023) proposed Humming2Music, a five-stage pipeline: transcription, melody generation, chord progression, accompaniment, and audio synthesis. Transformers and YOLOX-inspired models were used. Though not retrieval-oriented, this framework demonstrated mashup potential, where multiple melodies could be aligned into a coherent output.

    5. Subsequence Matching

      Jin (2021) introduced subsequence-based matching of melody contours for long or multi-sentence humming queries. Tested on symbolic melody datasets, it was efficient for large-scale retrieval but less accurate under noisy real-world inputs.
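      Jin's precise algorithm is not reproduced here; the sketch below uses subsequence DTW, a standard way to match a short hum against any region of a stored melody, with free start and end points on the reference axis:

```python
import numpy as np

def subsequence_dtw(query, reference):
    """Align a short hummed query against the best-matching region
    of a full reference melody (free boundaries on the reference)."""
    n, m = len(query), len(reference)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, :] = 0.0                  # alignment may start anywhere
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(query[i - 1] - reference[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],
                                 cost[i, j - 1],
                                 cost[i - 1, j - 1])
    return cost[n, 1:].min()          # alignment may end anywhere

melody = [60, 62, 64, 65, 67, 65, 64, 62, 60, 59, 60]
hum = [64, 65, 67]                    # user recalls only a middle fragment
print(subsequence_dtw(hum, melody))   # ~0: the fragment is found inside
```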

    6. Dataset Expansion

      Amatov et al. (2023) built the CHAD dataset (308+ hours of humming and covers) using semi-supervised alignment and trained ResNet embeddings with metric learning. This addressed dataset scarcity but stopped short of full retrieval pipelines.

    7. Sentence-Based Retrieval

      Zhang (2024) improved multi-sentence humming retrieval with BDTW (bidirectional DTW) and sentence segmentation. Tested on MIR-QBSH and IOACAS datasets, it outperformed classical DTW on fragmented queries.
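      Zhang's BDTW itself is not reproduced here; as a rough illustration of the multi-sentence idea only, the sketch below segments a hummed pitch track at unvoiced pauses and averages per-sentence matching costs, where `match` can be any sentence-level matcher such as the subsequence DTW sketched above:

```python
def segment_sentences(pitch_track, min_pause=10):
    """Split a hummed pitch track into sentences at unvoiced gaps
    (pitch == 0 for at least min_pause consecutive frames)."""
    sentences, current, pause = [], [], 0
    for p in pitch_track:
        if p == 0:
            pause += 1
            if pause == min_pause and current:
                sentences.append(current)
                current = []
        else:
            pause = 0
            current.append(p)
    if current:
        sentences.append(current)
    return sentences

def multi_sentence_cost(pitch_track, reference, match):
    """Average per-sentence matching costs so each remembered phrase
    contributes independently to the final ranking score."""
    sentences = segment_sentences(pitch_track)
    return sum(match(s, reference) for s in sentences) / len(sentences)
```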

    8. Review of MIR Challenges

    The Sri Lankan review (2021) summarized challenges in machine learning-based MIR: semantic gaps, personalized retrieval, multilingual generalization, and the need for scalable datasets.

  4. DATASETS FOR QUERY-BY-HUMMING

    The effectiveness of QbH systems depends largely on the availability of diverse and representative datasets. Over time, researchers have introduced datasets of increasing scale and complexity:

    • Ghias MIDI Dataset (1995): Consisted of 183 MIDI songs. Retrieval was only reliable for ideal humming sequences of 10–12 notes. Limited diversity and unrealistic conditions restricted its long-term impact.
    • MIR-QBSH Dataset: A widely used benchmark linking humming recordings with symbolic MIDI data. It enabled comparative studies for DTW-based and machine learning methods but remained relatively small in size.

    • Hum2Song (Zalo AI Challenge, 2021): Featured about 9,000 humming clips mapped to 1,000 Vietnamese songs. It introduced real-world noise, pitch drift, and tempo variation, making it a valuable benchmark for deep learning retrieval systems.
    • CHAD Dataset (Amatov et al., 2023): Over 308 hours of humming and cover fragments, created with semi-supervised alignment. It addressed the problem of data scarcity and facilitated embedding generalization.
    • IOACAS Dataset (used in Zhang, 2024): A Chinese dataset that, along with MIR-QBSH, was applied to evaluate multi-sentence humming retrieval with Bidirectional DTW.
    • Proprietary and Synthetic Data (Tencent AI Lab, 2023): Used for music generation rather than retrieval. Synthetic augmentation allowed the system to generate mashups and full songs from humming inputs.

      These datasets reflect the field's progression from symbolic, small-scale collections to realistic, noisy, and large-scale corpora. They also highlight the persistent challenge of building multilingual and diverse QbH datasets for global applications.

  5. PERFORMANCE AND METRICS EVALUATION

    The performance of humming-based music retrieval systems has varied widely across time, depending on the datasets, methods, and evaluation metrics used. A comparative overview of the reviewed works is presented below.

    Ghias et al. (1995) tested their pitch contour-based QbH system on a small dataset of 183 MIDI songs. They reported that most songs could be successfully retrieved within the first 12 notes, achieving close to 90% success when queries were clean and ideally hummed. However, the system was limited by slow pitch tracking and was not robust to noisy or off-pitch input.

    Pham et al. (2024) achieved the highest reported performance among the surveyed works. Using the Hum2Song dataset, which contained about 9,000 clips across 1,000 Vietnamese songs, they trained CNN backbones with ArcFace loss and retrieved songs using Faiss similarity search. Their ResNet34 model achieved 94.2% Mean Reciprocal Rank at 10 (MRR@10) and 83.7% Top-1 accuracy, significantly outperforming lighter architectures such as MobileNetV2 and AlexNet.
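    For reference, the two reported metrics can be computed as follows; the ranked lists here are toy data:

```python
def mrr_at_k(ranked_lists, ground_truth, k=10):
    """Mean Reciprocal Rank at k: average of 1/rank of the correct song
    within the top-k results (0 if it does not appear)."""
    total = 0.0
    for results, truth in zip(ranked_lists, ground_truth):
        for rank, song in enumerate(results[:k], start=1):
            if song == truth:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

def top1_accuracy(ranked_lists, ground_truth):
    """Fraction of queries whose first result is the correct song."""
    hits = sum(r[0] == t for r, t in zip(ranked_lists, ground_truth))
    return hits / len(ground_truth)

# Toy example: three queries, correct songs at ranks 1, 2, and absent
ranked = [["a", "b"], ["c", "d"], ["e", "f"]]
truth = ["a", "d", "x"]
print(mrr_at_k(ranked, truth), top1_accuracy(ranked, truth))  # 0.5 0.333...
```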

    Ranjan and Srivastava (2023) evaluated their noise-resilient recognition approach on the MIR-QBSH dataset. Incorporating Total Variation Regularization improved melody extraction and classification accuracy, reaching about 93% under noisy conditions. Their results confirmed that preprocessing plays a critical role in real-world QbH performance, though scalability was not tested due to the small dataset size.

    Tencent AI Lab (2023) approached humming-based input differently, focusing on generative music production rather than retrieval. Their system, Humming2Music, was trained on a mixture of proprietary and synthetic data. Performance was measured subjectively: users consistently rated the generated songs and mashups as musically coherent and satisfying. However, since the evaluation was not benchmark-driven, direct comparison with retrieval models is not possible.

    Jin (2021) demonstrated the efficiency of subsequence-based melody contour matching. Tested on symbolic melody databases, their system offered faster retrieval for long queries and subsequences. Nevertheless, recognition accuracy was notably lower than deep learning approaches, especially in noisy humming scenarios.

    Amatov et al. (2023) primarily contributed through dataset creation. Their CHAD dataset contained more than 308 hours of humming and cover fragments, enabling better training of deep models. While they reported improvements in embedding generalization during benchmarking, they did not provide full retrieval accuracy metrics.

    Zhang (2024) evaluated their Bidirectional Dynamic Time Warping (BDTW) approach on MIR-QBSH and IOACAS datasets. Results showed that BDTW consistently outperformed traditional DTW in handling multi-sentence humming queries, particularly where songs were remembered in fragmented phrases.

    Finally, the Sri Lankan review (2021) did not perform experimental evaluations but identified gaps in performance across the MIR field, including issues with semantic gaps, personalization, and multilingual retrieval. These insights highlight the challenges that persist despite progress in algorithmic accuracy.

    In summary, the most accurate retrieval to date was reported by Pham et al. (2024), while Ranjan and Srivastava (2023) provided evidence of noise robustness, and Zhang (2024) enhanced handling of fragmented queries. Tencent AI Lab (2023) broadened the scope by emphasizing user satisfaction in generative mashups, marking a different but complementary performance dimension.

    The reviewed works collectively highlight the evolution of Query-by-Humming (QbH) and music information retrieval from symbolic pitch contour methods to deep learning embeddings and even generative pipelines. Each study introduced distinct innovations, while also facing inherent limitations due to dataset constraints, computational demands, or lack of generalization. In this section, we summarize the innovations, contributions, and limitations of each work.

    1. Innovations

      Several key innovations emerged from the surveyed works:

      • Ghias et al. (1995): Introduced the very first QbH system based on pitch contour encoding (Up, Down, Same) and approximate string matching, demonstrating the feasibility of music search by humming.
      • Pham et al. (2024): Proposed deep embedding-based retrieval using mel-spectrograms, convolutional networks, ArcFace loss, and Faiss similarity search. This was one of the first attempts to integrate state-of-the-art computer vision embedding strategies into MIR.

        • Ranjan & Srivastava (2023): Applied Total Variation Regularization (TVR) for robust melody extraction, tackling the problem of noise in humming signals.
      • Tencent AI Lab (2023): Expanded QbH beyond retrieval into music generation. Their Humming2Music pipeline could produce complete songs, including chords and accompaniment, from a simple humming input.
      • Jin (2021): Developed subsequence-based melody matching, allowing retrieval from long or fragmented humming queries, a practical advancement for real-world scenarios.
        • Amatov et al. (2023): Built the CHAD dataset (308+ hours), one of the largest humming/cover collections. This dataset improved the training of deep learning models and addressed the scarcity of humming data.
        • Zhang (2024): Introduced Bidirectional Dynamic Time Warping (BDTW) for multi-sentence humming queries, showing improvements over traditional DTW methods.
      • Sri Lankan Review (2021): Synthesized challenges in machine learning-based MIR, identifying semantic gaps, personalization, and multilingual retrieval as key frontiers.
    2. Contributions

      Each work contributed uniquely to the QbH research space:

      • Early research (Ghias 1995) established QbH as a research problem.
      • Pham et al. (2024) demonstrated that deep embeddings could outperform traditional DTW and chroma-based methods, reaching 94.2% MRR@10.
      • Ranjan & Srivastava (2023) showed that noise-robust techniques improve classification accuracy for humming inputs, even when datasets are small.
      • Tencent AI Lab (2023) contributed by bridging MIR and creative AI, showing how humming can inspire mashups and full compositions.
      • Jin (2021) provided efficient retrieval for subsequences, especially useful when users recall only fragments of melodies.
      • Amatov et al. (2023) contributed a benchmark dataset (CHAD) to enable future research, addressing a long-standing bottleneck in MIR.
      • Zhang (2024) advanced multi-sentence retrieval, enabling longer and natural humming queries to be processed effectively.
      • The Sri Lankan Review (2021) clarified the research landscape, pointing out unsolved challenges that guide future directions.
    3. Limitations

    Despite their progress, each work faced limitations:

      • Ghias (1995) lacked scalability and performed poorly with noisy queries.
      • Pham et al. (2024) required high computational resources and generalized poorly beyond Vietnamese datasets.

    • Ranjan & Srivastava (2023) were restricted to a small dataset, limiting real-world applicability.
    • Tencent AI Lab (2023) produced impressive generative results but required complex pipelines and did not focus on retrieval accuracy.
    • Jin (2021) struggled with noisy, real-world humming, where subsequence matching was less effective than deep learning embeddings.
    • Amatov et al. (2023) primarily contributed a dataset, but no complete retrieval or generation system was built.
    • Zhang (2024) improved DTW-based retrieval but relied heavily on accurate sentence segmentation, which can be error-prone.
      • The Sri Lankan Review (2021) was a conceptual survey, not an implementation, and thus did not provide empirical benchmarks.
  6. FINAL DISCUSSION AND INSIGHTS

    This review demonstrates the historical evolution from early pitch contour methods to deep learning embeddings and generative mashup pipelines. Pham et al. (2024) currently achieve the highest retrieval accuracy, while Ranjan & Srivastava (2023) contribute noise robustness. Tencent AI Lab (2023) expands QbH into creative domains by generating full music and mashups. Jin (2021) and Zhang (2024) improve scalability for long queries and sentences. Amatov et al. (2023) address the scarcity of humming data with the CHAD dataset, while the Sri Lankan review situates QbH within broader MIR challenges. Future QbH systems may integrate these strengths: embedding accuracy, noise-robust preprocessing, dataset scaling, subsequence retrieval, and generative mashups. Such hybrid frameworks align with the goals of Hummify, where users can identify songs by humming and automatically generate creative mashups.

  7. FUTURE DIRECTIONS

    While recent advances in deep embeddings and dataset creation have improved QbH accuracy and robustness, several challenges and opportunities remain for future research:

    • Multilingual and Cross-Cultural Music Retrieval: Most datasets are language- or region-specific (e.g., Vietnamese Hum2Song). Expanding coverage to diverse musical traditions would make QbH systems more inclusive.
    • Noise Robustness and Short Queries: Very short or off-key humming still leads to retrieval errors. Future systems must integrate advanced preprocessing, self-supervised denoising, and adaptive feature extraction.
    • On-Device Deployment: Deep models like ResNet34 require large computational resources. Model pruning, quantization, and mobile optimization will enable real-time humming retrieval on smartphones.
    • Generative Integration: Systems like Tencent AI Lab's Humming2Music show the potential of combining retrieval with music generation. Future frameworks may retrieve similar tracks and generate mashups or accompaniments, creating interactive creative tools.

    • Standardized Benchmarks: The lack of unified evaluation protocols limits fair comparison between methods. Establishing shared metrics (MRR, Top-k, and subjective scores) across multiple datasets would advance the field.

    In summary, future QbH research lies at the intersection of robust retrieval, dataset expansion, and generative creativity. Hybrid frameworks that combine these dimensions will form the foundation for next-generation systems like Hummify, enabling users not only to identify songs by humming but also to generate personalized mashups and compositions.

  8. CONCLUSION

This paper surveyed eight works on humming-based music retrieval and generation. Each approach contributes uniquely: early pitch contour prototypes, embedding-based retrieval, noise-robust methods, generative mashups, subsequence matching, dataset expansion, and multi-sentence retrieval. Comparative analysis shows that deep embeddings achieve state-of-the-art accuracy, but dataset bias, noise handling, and multilingual adaptation remain open challenges. Integrating generative pipelines into retrieval systems may lead to the next generation of intuitive and creative music search tools.

REFERENCES

  1. A. Ghias, J. Logan, D. Chamberlin, and B. Smith, "Query by Humming: Musical Information Retrieval in an Audio Database," Cornell University, 1995.
  2. Pham et al., "An Approach to Hummed-Tune and Song Sequences Matching," Zalo AI Challenge, 2024.
  3. Ranjan and Srivastava, "Incorporating Total Variation Regularization in QBH," 2023.
  4. Tencent AI Lab, "Humming2Music: Being a Composer as Long as You Can Humming," 2023.
  5. Jin, "Computer Music Query by Humming Considering Subsequence Matching," J. Phys.: Conf. Ser., 2021.
  6. Amatov et al., "Semi-Supervised Deep Learning for QBH Dataset Collection (CHAD Dataset)," 2023.
  7. Zhang, "Multi-Sentence Query-by-Humming Using Bidirectional DTW," 2024.
  8. S. K. Perera et al., "Music Information Retrieval Using Machine Learning Approaches: A Review," Informatica, 2021.