
Complaint Matching and Grouping Using NLP and Geo-Spatial Techniques: A Comparative Study

DOI : 10.17577/IJERTCONV14IS010001


Iyola Gloria Dsilva

Student, Master of Computer Applications, St. Joseph Engineering College

Rakshitha P

Assistant Professor, Master of Computer Applications, St. Joseph Engineering College

Mr. Hareesh B

Associate Professor, Department of Computer Applications,

St. Joseph Engineering College, Vamanjoor, Mangalore, Karnataka

Abstract – In civic complaint platforms, multiple users often report the same problem using different words or spellings. This results in fragmented information, making it difficult for administrators to efficiently manage and resolve issues. In this study, we compare six different models, ranging from traditional term-based methods to modern transformer architectures, to automatically group similar complaints. We test the models under noisy conditions, including spelling errors and varied sentence structures, and evaluate them on multiple metrics such as precision, recall, F1 score, processing time, and resource usage. Our results show that combining sentence-level understanding with geographic location provides the most reliable grouping performance.

Keywords: Complaint grouping, Semantic similarity, NLP, Clustering, Geolocation, SBERT, TF-IDF, DBSCAN

  1. INTRODUCTION

    Modern urban systems increasingly rely on citizen participation to identify and address local problems. Platforms that allow users to submit complaints about potholes, water issues, or public safety have become valuable tools. However, the lack of uniformity in how users describe problems poses a significant challenge.

    Two complaints about the same issue might be written entirely differently, leading to duplicate entries or overlooked reports. In this study, we explore how Natural Language Processing (NLP) models can be used to automatically group

    these complaints by meaning, even when phrased differently or written with minor errors.

  2. RELATED WORK

Traditional approaches like TF-IDF [1] are based on word counts and often fail to capture deeper meanings. Word embedding methods such as Word2Vec and GloVe [2, 3] introduced semantic awareness, but struggled with rare words or spelling errors. FastText [4] addressed this by representing words as combinations of subwords, making it more robust to typos. spaCy [5] leverages pretrained word vectors along with syntactic parsing to generate stronger sentence representations. More recently, transformer-based models such as Sentence-BERT (SBERT) [6] and the Universal Sentence Encoder (USE) have demonstrated superior results in semantic similarity tasks. Additionally, geospatial filtering [7] has been used to improve clustering by incorporating location as a contextual constraint.

  3. FEATURE SELECTION AND PROCESSING

      For meaningful clustering, we selected and processed the following key features:

      • Textual Content: The complaint title and description were combined and processed to capture semantic meaning. These were embedded using a range of NLP techniques including TF-IDF, Word2Vec (via GloVe), spaCy's large vector model, and Sentence-BERT (SBERT).

      • Geolocation: Latitude and longitude were used to assess the spatial proximity of complaints. Geographical distances were calculated using the haversine formula and then converted into similarity scores.

      • Department & Ward IDs: These attributes were used as rule-based filters after clustering to ensure that grouped complaints originated from the same administrative unit.

The fusion of semantic, spatial, and administrative features allowed us to rigorously test each model's ability to group similar complaints under real-world constraints.
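
The haversine-based geolocation feature can be sketched as follows. The paper does not specify how distances are mapped to similarity scores, so the `geo_similarity` decay function and its `scale_km` parameter below are illustrative assumptions:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def geo_similarity(lat1, lon1, lat2, lon2, scale_km=1.0):
    """Map distance to a (0, 1] similarity score (assumed decay, not from the paper)."""
    return 1.0 / (1.0 + haversine_km(lat1, lon1, lat2, lon2) / scale_km)
```

Identical coordinates yield a similarity of 1.0, and the score falls off smoothly as complaints move apart, with `scale_km` controlling how quickly spatial proximity stops mattering.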

  4. METHODOLOGY

    The complaint grouping process included several key steps:

    1. Text Embedding: Each complaint was converted into a numerical vector using one of the six models.

    2. Similarity Calculation: Cosine similarity was used to calculate pairwise distances between complaints.

    3. Clustering: DBSCAN, a density-based algorithm, was applied to form clusters of similar complaints.

    4. Rule-Based Filtering: We retained only those clusters where all complaints were from the same ward and department.

    5. Geo-Fusion (Optional): For one model, we fused semantic similarity with geographic distance to guide the clustering process.
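
The pipeline above can be sketched end-to-end. To stay self-contained, this sketch uses scikit-learn's `TfidfVectorizer` as the embedding stage; in the SBERT variants that stage would be replaced by a sentence-transformers model. The sample complaints, ward/department values, and the DBSCAN `eps` are illustrative, not the paper's configuration:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_distances
from sklearn.cluster import DBSCAN

# Illustrative complaints; ward and dept values are made up.
complaints = [
    {"text": "big pothole on main road", "ward": 3, "dept": "roads"},
    {"text": "huge pot hole near main road", "ward": 3, "dept": "roads"},
    {"text": "street light not working", "ward": 5, "dept": "electrical"},
]

# Steps 1-2: embed the text and compute pairwise cosine distances.
vectors = TfidfVectorizer().fit_transform([c["text"] for c in complaints])
dist = np.clip(cosine_distances(vectors), 0.0, None)  # guard against tiny negatives

# Step 3: density-based clustering on the precomputed distance matrix.
labels = DBSCAN(eps=0.8, min_samples=2, metric="precomputed").fit_predict(dist)

# Step 4: rule-based filter -- keep only clusters whose members
# all share the same ward and department.
clusters = {}
for idx, label in enumerate(labels):
    if label != -1:  # -1 marks noise points
        clusters.setdefault(label, []).append(idx)
valid = [
    members for members in clusters.values()
    if len({(complaints[i]["ward"], complaints[i]["dept"]) for i in members}) == 1
]
print(valid)  # the two pothole reports end up in one valid group
```

Using `metric="precomputed"` lets the same DBSCAN call accept any distance matrix, so swapping in SBERT embeddings or a geo-fused matrix (step 5) changes only how `dist` is built.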

    MODEL CATEGORIZATION:

    1. Traditional Model

      • TF-IDF: Relies on word frequency; ignores semantics and spelling errors.

    2. Static Word Embeddings

      • FastText: Captures subword information, offering resistance to typos.

      • spaCy: Provides dense word embeddings using pretrained GloVe vectors.

    3. Transformer-Based Models

      • SBERT: Fine-tuned on semantic similarity tasks, captures full sentence meaning.

      • SBERT + Geo Fusion: Combines SBERT embeddings with geographic coordinates.

      • USE: A general-purpose sentence encoder optimized for speed and moderate depth.
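
For the SBERT + Geo Fusion variant, one simple way to combine the two signals is a weighted average of the semantic and geographic similarity matrices. The paper does not state its exact fusion rule, so the weight `alpha` and the toy matrices below are illustrative assumptions:

```python
import numpy as np

def fuse_similarity(sem_sim, geo_sim, alpha=0.7):
    """Blend semantic and geographic similarity; alpha is an assumed weight."""
    return alpha * sem_sim + (1.0 - alpha) * geo_sim

# Toy 2x2 case: two complaints that read alike (sem 0.9)
# but sit in different parts of the city (geo 0.1).
sem = np.array([[1.0, 0.9], [0.9, 1.0]])
geo = np.array([[1.0, 0.1], [0.1, 1.0]])

fused = fuse_similarity(sem, geo)  # off-diagonal: 0.7*0.9 + 0.3*0.1 = 0.66
distance = 1.0 - fused             # usable with DBSCAN(metric="precomputed")
```

The geographic term pulls down the fused score for semantically similar but spatially distant pairs, which is how this variant avoids grouping look-alike complaints from different wards.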

        EVALUATION METRICS:

        We used the following metrics for evaluation:

      • Precision:

        Correct clusters / All predicted clusters

      • Recall:

        Correct clusters / All actual duplicates

      • F1 Score:

        Harmonic mean of precision and recall
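
The three metrics reduce to simple arithmetic over cluster counts. The counts below are hypothetical, purely to show the computation, not results from the study:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts for illustration only:
correct = 18    # correctly predicted duplicate groups
predicted = 20  # all predicted groups
actual = 24     # all true duplicate groups

precision = correct / predicted  # 0.90
recall = correct / actual        # 0.75
print(round(f1_score(precision, recall), 3))  # 0.818
```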

  5. RESULTS AND COMPARATIVE STUDY

    Fig. 4.1

    Fig. 4.1 presents the comparative performance of all six models across key evaluation metrics. TF-IDF, while lightweight, struggled with semantic variance. FastText and spaCy achieved perfect clustering results, largely due to their robustness to typos and phrasal variations. SBERT + Geo stood out as the most balanced model, combining semantic understanding with spatial logic. USE offered high accuracy at faster inference times, making it a practical compromise.

    Fig. 4.2

    Fig. 4.2 compares the models in terms of practical deployment factors: inference time, memory size, and tolerance to noisy inputs. While TF-IDF and spaCy are lightweight and fast, FastText's subword modeling comes at the cost of model size and processing time. SBERT models offer superior contextual understanding, but inference time and GPU dependency must be considered in real-time deployments. These insights are especially important when integrating such models into urban informatics platforms under resource constraints.

    Sample Grouped Output (SBERT + Geo + Rule):

    Fig. 4.3

    As shown in Fig. 4.3, the SBERT + Geo + Rule Filter model accurately grouped complaints with varied phrasing referring to the same local issue. By combining semantic and spatial similarity, it avoided mismatches across wards, a limitation seen in other models, highlighting its strength for location-sensitive applications.

  6. DISCUSSION

    Each model demonstrated unique strengths and limitations across semantic accuracy, noise tolerance, and spatial awareness.

      • TF-IDF performed poorly in real-world scenarios. Although it achieved full recall, it lacked precision due to its reliance on surface-level word overlap. It often grouped complaints with similar words but unrelated meanings.

      • FastText and spaCy achieved perfect grouping. Their strength lies in their robustness to typos and minor phrasing changes. FastText leverages subword vectors, while spaCy benefits from dense pretrained embeddings. However, both are limited in deeper contextual understanding and may fail in more abstract complaint cases.

      • SBERT, a transformer-based model, captured sentence meaning effectively. It produced high-precision clusters but sometimes grouped complaints from unrelated wards due to the lack of spatial filtering. This affected its recall in structured civic settings.

      • SBERT + Geo stood out by integrating geographic proximity with sentence-level semantics. It avoided both over-grouping and spatial mismatches, producing consistently high scores across all metrics. This combination offers a scalable solution for civic platforms where both meaning and location matter.

      • USE offered a strong middle ground, with good semantic understanding and fast inference. While not perfect in recall, it was efficient and well-suited for lightweight deployments.

        Overall, the analysis shows that semantic understanding alone is insufficient for civic complaints. Integrating location-based filtering, as in SBERT + Geo, provides the most reliable and context-aware grouping strategy.

  7. CONCLUSION & FUTURE WORK

In this study, we explored and compared six different NLP models for the task of grouping civic complaints, ranging from traditional TF-IDF to modern transformer-based approaches. Our findings highlight that while simple models like TF-IDF are fast and lightweight, they fail to capture semantic and contextual nuances needed for accurate clustering. FastText and spaCy showed better robustness to spelling variations but lacked deeper contextual reasoning.

SBERT, particularly when fused with geographic similarity and rule-based filtering, demonstrated the best overall performance. It was able to accurately group semantically similar complaints while maintaining spatial relevance, making it well-suited for real-world civic platforms where location context matters.

Despite strong results, there is still room for improvement and future development.

Future Work:

    • Real-time Integration: Incorporate complaint grouping into live civic portals using scheduled background tasks (e.g., Laravel cron jobs).

    • Admin Feedback Loop: Enable manual corrections or merges from administrators to continuously refine the model.

    • Visual Interface: Design a map-based visualization that shows clustered complaints as interactive nodes for easy exploration by city officials.

This work demonstrates the potential of combining semantic and spatial intelligence in NLP systems for urban governance and opens the door to building smarter, citizen-driven platforms.

REFERENCES

  1. G. Salton and C. Buckley, "Term-weighting approaches in automatic text retrieval," Information Processing & Management, 1988.

  2. T. Mikolov et al., "Efficient Estimation of Word Representations in Vector Space," arXiv:1301.3781, 2013.

  3. J. Pennington et al., "GloVe: Global Vectors for Word Representation," EMNLP, 2014.

  4. P. Bojanowski et al., "Enriching Word Vectors with Subword Information," TACL, 2017.

  5. Explosion.ai, "spaCy 3.0 Documentation", 2023.

  6. N. Reimers and I. Gurevych, "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks," EMNLP, 2019.

  7. R. Sinnott et al., "Use of Geospatial Data for Efficient Clustering," IEEE SmartCity, 2017.