
A Robust ETL-Based Framework for Healthcare Data Integration and Patient Record Deduplication Using SQL Server and SSIS

DOI : https://doi.org/10.5281/zenodo.18848564

Surendra Reddy Alavala

Independent Researcher Tallahassee, FL, USA

Abstract: The healthcare industry increasingly requires advanced approaches to capturing accurate patient data. Duplicate patient records remain a persistent challenge, leading to clinical errors, billing discrepancies, and reporting inconsistencies. To address this challenge, we propose a framework that uses Microsoft SQL Server to store data and SQL Server Integration Services (SSIS) to perform data extraction, transformation, and loading (ETL) from various sources into a centralized repository. The approach not only stores the data but also applies a mandatory filtration step prior to loading to prevent data duplication. By combining these technologies with data standardization, deterministic and probabilistic matching techniques, and governance rules, healthcare organizations can significantly improve data quality and patient identity integrity.

Index Terms: Data quality, healthcare data, patient identity, deduplication, Microsoft SQL Server, ETL, SQL Server Integration Services (SSIS).

  1. Introduction

    Healthcare organizations produce vast amounts of data, all of which is linked to patient demographics. Accurate patient information is vital to providing safe, effective, and timely care, especially in critical situations. However, even with modern electronic health record (EHR) systems such as Epic and Cerner, healthcare organizations still encounter duplicate records. The duplicated fields include patient names, dates of birth, addresses, phone numbers, emails, and other information.

    Duplicate demographic data most often arise from variations in data entry forms or setups, multiple registration points, the lack of patient finder tools, and the absence of a universal patient identifier. Common sources of duplication include laboratory systems, billing departments, external providers, and information collected by telephone.

    Duplicate demographic data are not only inconvenient for the parties involved, such as providers, laboratories, and patients, but also cause significant delays in treatment, additional billing costs, and clinical errors. Ensuring data quality and stability with Microsoft SQL Server and ETL technologies has therefore become a critical priority in creating a reliable framework for healthcare organizations.

    Fig. 1. System Architecture: Data Integration from Multiple Sources (Colored Nodes). Nodes: EHR Systems, Laboratories, Billing Systems, and Files / Other feed the ETL / SSIS Process, which loads the SQL Server Database.

  2. Digital Transformation in Healthcare and Current Challenges in Data Deduplication

    The healthcare industry is undergoing rapid digital transformation by adopting advanced data management and data analytics technologies [1]. This transformation involves the digitalization of medical records and healthcare processes to improve patient care and clinical efficiency and to reduce operational costs for healthcare organizations.

    The implementation of Electronic Health Records (EHRs) gave healthcare organizations the opportunity to generate, manage, and manipulate large amounts of structured and unstructured data. However, migration from legacy paper-based records often leads to data duplication, particularly when multiple sources and formats are involved [2].

    The most common source of data duplication is inconsistency during data entry. A single patient may appear as a different individual if a letter is misspelled or an extra space is added to the demographic data. Address and phone number formats can also differ between departments, leading to duplicate records and making patient identification difficult. Without data standardization practices, fields such as dates, names, and addresses can vary widely, and this lack of uniformity reduces the effectiveness of automated matching. Many healthcare systems lack a universal patient identifier and instead rely on partial identifiers such as dates of birth or names. These are useful in many cases, but they may not be sufficient to ensure accurate matching of patients [3].
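The standardization step described above can be illustrated with a minimal Python sketch. It is not the paper's SSIS implementation: the function names, the list of accepted date layouts, and the US-style phone handling are all assumptions chosen for illustration.

```python
import re
import unicodedata
from datetime import datetime

# Date layouts assumed to occur in the source systems (illustrative only).
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


def normalize_name(raw: str) -> str:
    """Uppercase, strip accents and punctuation, collapse repeated whitespace."""
    text = unicodedata.normalize("NFKD", raw)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^A-Za-z\s]", " ", text)
    return " ".join(text.upper().split())


def normalize_phone(raw: str) -> str:
    """Keep digits only and drop a leading US country code (assumed format)."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    return digits


def normalize_dob(raw: str) -> str:
    """Return the date of birth in ISO 8601 form (YYYY-MM-DD)."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {raw!r}")
```

After normalization, two registrations that differ only in spacing, case, or formatting produce identical field values, which is what makes the later exact-match step effective.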

    In most cases, patient data are collected on multiple platforms: some on paper, others through applications, phone calls, and emails. The data is spread across these platforms and passes through multiple teams before it reaches its storage location. Without proper coordination between teams, errors originating on one platform propagate through a series of platforms and ultimately reach central databases.

    This issue may seem minor in small organizations, where manual checks are feasible, but in larger organizations it can pose real risks, such as medication errors, delayed treatments, billing errors, and inaccurate reporting [4]. These problems can affect both patients and organizations over time.

    This paper explores the role of Microsoft SQL Server and the Extract, Transform, and Load (ETL) process in building a robust centralized healthcare data system. These technologies not only help eliminate duplicate data but also support data security and improve database performance by eliminating unnecessary data [5]. ETL tools such as SQL Server Integration Services (SSIS) facilitate the integration of data from various sources, including laboratory, pharmacy, and insurance systems, and from formats such as files, databases, and applications, into a unified central data warehouse.

  3. Understanding the Connection Between Microsoft SQL Server Databases and SQL Server Integration Services

  4. What It Means for Healthcare

    In today's healthcare ecosystem, there is a growing need to join and aggregate data from structurally different sources into a unified dataset for analysis. SQL Server and SQL Server Integration Services (SSIS) play a significant role in this process [6]: SQL Server is responsible for storing and managing data, while SSIS is used to move and transform data between systems.

    This integration not only enhances workflow efficiency and reduces costs but also improves accuracy through consistent processes and helps address patient data challenges while maintaining regulatory compliance. Together, these tools foster a reliable, scalable, and efficient data ecosystem that streamlines data collection, eases administrative burden, and ultimately supports better patient care. They help organizations maintain accurate, organized, and accessible information.

  5. Proposed approach using SQL Server and ETL

    The proposed deduplication approach combines Microsoft SQL Server and ETL processes to ensure clean, accurate, and unified data. The overall architecture involves gathering data from different sources and loading it into staging tables, where it is cleansed and standardized.
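As a rough illustration of the staging step, the sketch below loads two differently shaped extracts into one staging table. It uses Python with an in-memory SQLite database as a stand-in for SSIS and SQL Server; the sample data, the table name `stg_patient`, and the column names are hypothetical.

```python
import csv
import io
import json
import sqlite3

# Hypothetical extracts from two source systems with different layouts.
EHR_CSV = (
    "mrn,last_name,first_name,dob\n"
    "1001,Smith,John,1985-03-07\n"
    "1002,Lee,Ana,1990-11-21\n"
)
LAB_JSON = '[{"patient_id": "L-77", "name": "SMITH, JOHN", "birth_date": "1985-03-07"}]'


def extract_ehr(text):
    """Yield staging rows from the CSV extract, uppercasing the names."""
    for row in csv.DictReader(io.StringIO(text)):
        yield ("EHR", row["mrn"], row["last_name"].upper(),
               row["first_name"].upper(), row["dob"])


def extract_lab(text):
    """Yield staging rows from the JSON extract, splitting 'LAST, FIRST'."""
    for rec in json.loads(text):
        last, first = [part.strip().upper() for part in rec["name"].split(",", 1)]
        yield ("LAB", rec["patient_id"], last, first, rec["birth_date"])


def load_staging(conn, rows):
    """Create the staging table if needed and append the rows."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS stg_patient (
               source_system TEXT, source_id TEXT,
               last_name TEXT, first_name TEXT, dob TEXT)"""
    )
    conn.executemany("INSERT INTO stg_patient VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load_staging(conn, list(extract_ehr(EHR_CSV)) + list(extract_lab(LAB_JSON)))
staged = conn.execute(
    "SELECT source_system, last_name, first_name, dob "
    "FROM stg_patient ORDER BY source_system, source_id"
).fetchall()
```

Keeping the source system and source identifier on every staged row preserves lineage, so a later merge can be traced back to the records it came from.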

    Microsoft SQL Server is a relational database management system (RDBMS) designed to act as a high-security vault for structured information. In the healthcare sector, where data accuracy can quite literally be a matter of life or death, SQL Server provides the necessary guardrails through schemas and tables.

    By enforcing strict data integrity rules, such as unique constraints and relational mapping, the system ensures that patient records remain consistent across different departments. It does not just store data; it protects it. With built-in features like Role-Based Access Control (RBAC), organizations can ensure that only authorized personnel view sensitive demographics. Furthermore, its robust disaster recovery protocols mean that even in the event of a system failure, patient data remains recoverable, supporting the "always-on" requirement of modern clinical environments.
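The effect of a unique constraint can be shown with a small sketch. It uses Python's built-in `sqlite3` module rather than SQL Server, and the `patient` table, its columns, and the choice of the medical record number (`mrn`) as the unique key are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE patient (
        patient_id INTEGER PRIMARY KEY,
        mrn        TEXT NOT NULL,
        last_name  TEXT NOT NULL,
        dob        TEXT NOT NULL,
        CONSTRAINT uq_patient_mrn UNIQUE (mrn)
    )
    """
)


def insert_patient(mrn, last_name, dob):
    """Return True if the row was stored, False if the constraint rejected it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO patient (mrn, last_name, dob) VALUES (?, ?, ?)",
                (mrn, last_name, dob),
            )
        return True
    except sqlite3.IntegrityError:
        return False


first = insert_patient("MRN-1001", "SMITH", "1985-03-07")
second = insert_patient("MRN-1001", "SMYTH", "1985-03-07")  # same MRN, rejected
row_count = conn.execute("SELECT COUNT(*) FROM patient").fetchone()[0]
```

The constraint only blocks rows that repeat the key exactly. Duplicates that arrive under a different identifier still need the matching logic described later in this section.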

    SQL Server Integration Services (SSIS) is a data integration and workflow tool used to extract, transform, and load data from various sources to different destinations. It collects data, transforms and cleans it based on business rules, and loads it into SQL Server or other targets. SSIS helps reduce manual data handling, improve data quality, automate repetitive processes, and manage large amounts of data with robust error-handling techniques. Comparative analyses show that SSIS remains powerful and cost-effective within Microsoft-centric ecosystems, though it faces challenges in hybrid and multi-cloud environments compared to some competitors [7], [52].

    Fig. 2. ETL Pipeline Process with Colored Steps. Steps: Extract, Transform (Cleansing), Deduplication, Load into SQL Server.

    Once the data is loaded into staging tables, the data matching rules are applied to identify duplicates and generate a single Golden Record for each patient. This process also facilitates auditing and simplifies tracking.

    To identify duplicates, we can first use deterministic matching. This method relies on exact matches of the social security number, full name, and date of birth, using the Lookup transformation in SQL Server Integration Services (SSIS) [8], [16]. While highly precise, this approach may miss records that differ because of spelling errors, abbreviations, or extra spaces. To address these issues, probabilistic and fuzzy matching are applied using built-in features such as pre-sorting the data, the Aggregate transformation, the Script Component, Fuzzy Lookup, and Fuzzy Grouping [9], [10].
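The difference between the two matching styles can be sketched in Python. This is not the SSIS Lookup or Fuzzy Lookup transformation: the similarity measure comes from the standard library's `difflib`, and the record layout, the 0.85 threshold, and the rule that the date of birth must agree exactly are assumptions for illustration.

```python
from difflib import SequenceMatcher


def deterministic_key(rec):
    """Exact-match key: SSN, full name, and date of birth."""
    return (rec["ssn"], rec["last_name"].upper(), rec["first_name"].upper(), rec["dob"])


def similarity(a, b):
    """Case-insensitive string similarity in the range 0.0 to 1.0."""
    return SequenceMatcher(None, a.upper(), b.upper()).ratio()


def is_fuzzy_match(r1, r2, threshold=0.85):
    """Require the same date of birth, then compare names by average similarity."""
    if r1["dob"] != r2["dob"]:
        return False
    score = (
        similarity(r1["last_name"], r2["last_name"])
        + similarity(r1["first_name"], r2["first_name"])
    ) / 2
    return score >= threshold


# Fabricated sample records; the SSNs are placeholders, not real identifiers.
a = {"ssn": "000-00-0001", "last_name": "Johnson", "first_name": "Katherine", "dob": "1985-03-07"}
b = {"ssn": "000-00-0001", "last_name": "Jonson", "first_name": "Katherine", "dob": "1985-03-07"}
c = {"ssn": "000-00-0002", "last_name": "Miller", "first_name": "Robert", "dob": "1985-03-07"}

exact_ab = deterministic_key(a) == deterministic_key(b)  # misspelling breaks the exact key
fuzzy_ab = is_fuzzy_match(a, b)
fuzzy_ac = is_fuzzy_match(a, c)
```

Records `a` and `b` fail the deterministic comparison because of a one-letter misspelling but pass the fuzzy comparison, while `a` and `c` share only a birth date and are kept apart.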

    If records meet a certain similarity threshold, they are marked as duplicates. Once duplicates are identified, organization-level rules determine which record to retain, often based on the most recent data or the first entry sourced from clinical data [11]. In this scenario, SQL Server ranking and window functions can be used to determine which records are new and which are old. Window functions, which perform calculations on a set of table rows related to the current row, are also valuable for cohort analysis, time-series analysis, and patient trend tracking.
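A "most recent record wins" rule can be expressed with a ranking window function. The sketch below runs the query through Python's `sqlite3` module (SQLite 3.25 or later is needed for window functions) as a stand-in for SQL Server; the `ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...)` clause has the same form in T-SQL. The table, the `match_group` column that labels records already judged to be the same patient, and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE stg_patient (
        record_id   INTEGER PRIMARY KEY,
        match_group INTEGER,   -- records believed to be the same patient
        last_name   TEXT,
        phone       TEXT,
        updated_at  TEXT       -- ISO 8601 date, so text order is date order
    );
    INSERT INTO stg_patient VALUES
        (1, 10, 'SMITH', '8505550100', '2023-01-15'),
        (2, 10, 'SMITH', '8505550199', '2024-06-02'),
        (3, 20, 'LEE',   '8505550123', '2022-09-30');
    """
)

# Rank each group's records from newest to oldest.
RANKED = """
    SELECT record_id,
           match_group,
           ROW_NUMBER() OVER (
               PARTITION BY match_group
               ORDER BY updated_at DESC
           ) AS rn
    FROM stg_patient
"""

survivors = conn.execute(
    f"SELECT record_id FROM ({RANKED}) AS r WHERE rn = 1 ORDER BY record_id"
).fetchall()
superseded = conn.execute(
    f"SELECT record_id FROM ({RANKED}) AS r WHERE rn > 1 ORDER BY record_id"
).fetchall()
```

Rows with `rn = 1` are the candidates for the Golden Record, and rows with `rn > 1` are the older duplicates that an organization might archive or merge according to its own rules.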

    For instance, functions such as ROW_NUMBER(), RANK(), and DENSE_RANK() can be applied with PARTITION BY and ORDER BY clauses to rank duplicate records. Healthcare analysts can also use window functions to average a patient's lab results over a period, study rates of improvement or decline in health, or review hospital readmission rates [13]. Handling large amounts of healthcare data requires complex queries, which provide insights into patient care, operational costs, and other critical areas. Advanced techniques such as joins, subqueries, window functions, and CTEs enable analysts to perform intelligent analyses of complex healthcare datasets, ultimately improving patient care and organizational efficiency.

    Used correctly, SQL's aggregation and window functions also simplify tasks such as calculating patient averages or tracking health trends over time [14]. To further improve efficiency, an incremental deduplication process can be implemented and automated, ensuring that only new or changed records are processed instead of rechecking the entire dataset. For this purpose, Change Data Capture (CDC) in SQL Server can be used to provide near-real-time updates.
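The idea of processing only new or changed records can be sketched with a simple timestamp watermark. This is a deliberately simpler technique than SQL Server's Change Data Capture, which records changes from the transaction log into change tables; the `modified_at` field and the helper name below are assumptions for illustration.

```python
from datetime import datetime


def select_changed(rows, watermark):
    """Return the rows modified after the watermark and the advanced watermark."""
    changed = [r for r in rows if r["modified_at"] > watermark]
    new_watermark = max((r["modified_at"] for r in changed), default=watermark)
    return changed, new_watermark


rows = [
    {"record_id": 1, "modified_at": datetime(2024, 1, 5, 9, 0)},
    {"record_id": 2, "modified_at": datetime(2024, 1, 6, 14, 30)},
    {"record_id": 3, "modified_at": datetime(2024, 1, 7, 8, 15)},
]

# First run: only records changed since the previous run are selected.
last_run = datetime(2024, 1, 6, 0, 0)
batch, last_run = select_changed(rows, last_run)
batch_ids = [r["record_id"] for r in batch]

# Second run with no new changes: nothing is reprocessed.
next_batch, next_watermark = select_changed(rows, last_run)
```

Each run hands only the changed batch to the matching step and stores the new watermark, so the cost of a run depends on the volume of changes rather than on the size of the full dataset.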

    Fig. 3. Deterministic vs Probabilistic Matching with Colored Nodes. Nodes: Patient Records, Deterministic Matching, Probabilistic / Fuzzy Matching, and Golden Record / Unified Patient Record.

  6. Challenges of Integration

    Applying advanced SQL mechanisms to healthcare data analysis comes with several challenges, including the complexity of data analysis, scalability issues, and the handling of sensitive information. One of the biggest obstacles is integrating data from multiple, often unrelated sources and analyzing it effectively [12]. When applying this approach to healthcare, it is important to think beyond the technical setup. Because healthcare organizations hold millions of records, system performance needs to be carefully optimized: the data should be indexed and partitioned across multiple tables, unnecessary information should be minimized, and parallel ETL processes should be used to reduce database load and save time. Because healthcare data is highly sensitive, the process of merging information must be well defined to ensure no critical records are lost.

    Data access should be strictly role-based, and sensitive information must be masked and encrypted during processing and transfer. Detailed audit logs should also be maintained for future tracking. It is equally important to follow established best practices for using SQL to analyze healthcare data [15]. Testing also plays a crucial role in this process: testers should evaluate the process using precise metrics and include manual test scenarios where appropriate. To ensure the safety of the data, pilot testing should be conducted in non-production environments to confirm that the approach is reliable before it is deployed in production.

    VII. Implementation Challenges and Strategic Mitigation

    The deployment of a centralized data framework using Microsoft SQL Server and SSIS is not without its hurdles. To ensure a successful implementation, organizations must address several key areas:

    • Bridging the Technical Expertise Gap: A significant barrier in healthcare data management is the shortage of specialized SQL Server architects and SSIS developers. Organizations can mitigate this by investing in the long-term retention of in-house Database Administrators (DBAs) or by establishing a partnership with a managed service vendor [70]. Such partners can initiate the ETL architecture and manage complex data flows on behalf of the organization.
    • Resolving Logic Ambiguity via Data Discovery: Implementation often stalls when business processes and data lineages are poorly understood. The use of data profiling and process discovery tools within the SQL Server ecosystem provides a remedy for this. By mapping data dependencies prior to building SSIS packages, organizations ensure that the automation of the Golden Record is based on a transparent framework [10].
    • Standardization of ETL Architectures: A common challenge in SSIS implementation is the lack of a universal approach to package design. Different vendors may approach the same deduplication logic using varied methodologies, creating complications during system migration [5]. Modern metadata-driven ETL patterns are increasingly used to improve interoperability and simplify future transitions.
    • Strategic Platform Identification: Identifying the most impactful opportunities for data automation requires a deep understanding of infrastructure. Collaborative partnerships can assist healthcare organizations in selecting the SQL Server configuration (on-premises, cloud, or hybrid) that best aligns with their operational scale [41].
    • Tooling and End-to-End Execution: The absence of comprehensive development and monitoring tools can halt the adoption of advanced ETL processes. Proficiency in SQL Server Data Tools (SSDT) allows teams to maintain control over the data lifecycle, ensuring that deduplication and integration processes are executed seamlessly from ingestion to reporting [9].
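The role-based access, masking, and audit-logging practices recommended for the integration process can be sketched in Python. This is only an illustration, not a SQL Server feature demonstration: the role names, field names, and placeholder key are hypothetical, and the keyed hash shown here is a pseudonymization technique rather than encryption, which would be handled separately.

```python
import hashlib
import hmac
import json
from datetime import datetime, timezone

# Placeholder only; a real deployment would load the key from a secrets store.
SECRET_KEY = b"replace-with-a-managed-secret"

# Hypothetical mapping of roles to the fields they are allowed to see.
ROLE_FIELDS = {
    "registrar": {"name", "dob", "ssn_masked"},
    "analyst": {"dob", "ssn_token"},
}


def mask_ssn(ssn):
    """Show only the last four digits of the SSN."""
    digits = "".join(ch for ch in ssn if ch.isdigit())
    return "***-**-" + digits[-4:]


def pseudonymize(value):
    """Keyed hash so records can be linked without exposing the raw value."""
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()


def view_for_role(record, role):
    """Return only the fields that the given role is permitted to read."""
    allowed = ROLE_FIELDS.get(role, set())
    return {k: v for k, v in record.items() if k in allowed}


def audit_entry(user, action, record_id):
    """Build one JSON audit-log line with a UTC timestamp."""
    return json.dumps(
        {
            "ts": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "action": action,
            "record_id": record_id,
        }
    )


raw_ssn = "000-00-6789"  # fabricated placeholder value
record = {
    "name": "JOHN SMITH",
    "dob": "1985-03-07",
    "ssn_masked": mask_ssn(raw_ssn),
    "ssn_token": pseudonymize(raw_ssn),
}
analyst_view = view_for_role(record, "analyst")
log_line = audit_entry("etl_service", "MERGE", 2)
```

An unknown role receives an empty view, so access is denied by default, and every merge can be traced through its audit line.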

    Fig. 4. Data Security and Governance Framework for SQL Server. Nodes: Role-based Access, Encryption at Rest / In Transit, and Audit Logs around the SQL Server Database.

    TABLE I

    Implementation Challenges and Strategic Mitigation

    | Challenge | Mitigation Strategy |
    |---|---|
    | Data Quality Issues | ETL validation and cleansing procedures |
    | Integration Complexity | Standardized interfaces and ETL frameworks |
    | Technical Skill Gaps | Training in SQL and BI tools |
    | Security and Compliance | Encryption, auditing, access control |
    | Unclear data lineages and business processes | Data profiling and process discovery tools |
    | Standardization, lack of universal ETL package design | Metadata-driven ETL patterns and SSDT standards |
    | Sensitive data exposure during processing | Role-based access, masking, and encryption |
    | Deployment risks in production environments | Rigorous pilot testing in non-production environments |

    VIII. Conclusion

    SQL Server and SQL Server Integration Services provide significant value in today's healthcare data landscape. Together, these tools address the critical challenge of demographic deduplication, which is a major concern for healthcare organizations. By leveraging these technologies, organizations can reduce costs, automate routine tasks, and maintain greater consistency across workflows.

    Furthermore, supporting initiatives such as data standardization, deterministic and probabilistic matching, implementation of a master patient index, and enforcement of data governance can substantially enhance data quality and strengthen patient identity integrity. Overall, the integration of these solutions empowers healthcare organizations to manage their data more effectively and ultimately improve operational efficiency.

    References

    1. A. Boonstra and M. Broekhuis, "Barriers to the acceptance of electronic medical records by physicians from systematic review to taxonomy and interventions," BMC Health Serv. Res., vol. 10, Art. no. 231, 2010.
    2. S. J. Grannis, J. P. Overhage, and C. J. McDonald, "Analysis of identifier performance using a deterministic linkage algorithm," J. Am. Med. Inform. Assoc., vol. 9, no. 3, pp. 219-227, 2002.
    3. K. L. Riplinger, W. A. Marella, and T. A. Payne, "Association between patient matching problems and duplicate records in electronic health records," Perspect. Health Inf. Manag., vol. 14, Art. no. 1, 2017.
    4. B. Dixon, J. Vreeman, and S. J. Grannis, "The impact of data quality and standardization on patient matching," J. Biomed. Inform., vol. 52, pp. 65-73, 2014.
    5. R. Kimball and M. Ross, The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling, 3rd ed. Hoboken, NJ, USA: Wiley, 2013.
    6. H. Hammad, M. Barhoush, and B. H. Abed-Alguni, "A semantic-based approach for managing healthcare big data: A survey," J. Healthc. Eng., vol. 2020, Art. no. 8882211, 2020.
    7. M. A. Khan, "The art of ETL: A comprehensive guide to SSIS and data quality," Int. J. Sci. Eng. Technol., vol. 14, no. 1, 2025.
    8. I. P. Fellegi and A. B. Sunter, "A theory for record linkage," J. Am. Stat. Assoc., vol. 64, no. 328, pp. 1183-1210, 1969.
    9. A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios, "Duplicate record detection: A survey," IEEE Trans. Knowl. Data Eng., vol. 19, no. 1, pp. 1-16, 2007.
    10. E. Rahm and H. H. Do, "Data cleaning: Problems and current approaches," IEEE Data Eng. Bull., vol. 23, no. 4, pp. 3-13, 2000.
    11. T. Herzog, F. Scheuren, and W. E. Winkler, Data Quality and Record Linkage Techniques. New York, NY, USA: Springer, 2007.
    12. S. N. Turhan and Ö. Pinarer, "Query performance evaluation over health data," in Proc. 11th Int. Conf. e-Health (EH 2019), Porto, Portugal, Jul. 2019, pp. 101-108, doi:10.33965/ep019 201910L013.
    13. R. Ahmad, M. Z. Hussain, M. Z. Hasan, and M. A. Afaq, "Data models, semantics, query languages," UCP J. Eng. Inf. Technol., vol. 1, no. 1, pp. 8-19, 2023.
    14. R. Avula, "Healthcare data pipeline architectures for EHR integration, clinical trials management, and real-time patient monitoring," Quart. J. Emerg. Technol. Innov., vol. 8, no. 3, pp. 119-131, 2023.
    15. S. Dash, S. K. Shakyawar, M. Sharma, and S. Kaushik, "Big data in healthcare: Management, analysis and future prospects," J. Big Data, vol. 6, Art. no. 1, 2019.
    16. W. E. Winkler, "The state of record linkage and current research problems," U.S. Census Bureau, Washington, DC, USA, Tech. Rep. RR99/04, 1999.
    17. M. A. Hernández and S. J. Stolfo, "The merge/purge problem for large databases," in Proc. ACM SIGMOD Int. Conf. Manage. Data, 1995, pp. 127-138.
    18. P. Christen, "A survey of indexing techniques for scalable record linkage and deduplication," IEEE Trans. Knowl. Data Eng., vol. 24, no. 9, pp. 1537-1555, 2012.
    19. M. Stonebraker et al., "C-Store: A column-oriented DBMS," in Proc. 31st Int. Conf. Very Large Data Bases (VLDB), 2005, pp. 553-564.
    20. J. Lechtenbörger and G. Vossen, "On the computation of relational view complements," ACM Trans. Database Syst., vol. 28, no. 2, pp. 175-208, 2003.
    21. A. Ailamaki, D. J. DeWitt, M. D. Hill, and D. A. Wood, "DBMSs on a modern processor: Where does time go?" in Proc. 25th Int. Conf. Very Large Data Bases (VLDB), 1999, pp. 266-277.
    22. World Health Organization, "Digital health and data-driven healthcare," 2021. [Online]. Available: https://www.who.int/publications/i/item/9789240020924 [Accessed: Mar. 2, 2026].
    23. B. Ozaydin et al., "Healthcare research data infrastructure solutions," J. Med. Internet Res., vol. 22, Art. no. e19086, 2020.
    24. A. Johnson et al., "SQL for health data analysis improves transparency and validity," Inf. Med. Unlocked, vol. 30, Art. no. 100931, 2022.
    25. O. K. Atobatele et al., "SQL-driven dashboards for healthcare decision-making," Int. Sci. Refereed Res. J., vol. 3, no. 1, pp. 1-14, 2022.
    26. Florida Department of Health, "Public health data and surveillance systems," Tallahassee, FL, USA, 2023.
    27. J. Desmond, R. Wartmann, C. W. Lau, S. Thomas, P. M. Middleton, and J. A. Ginige, "OMOP ETL Framework for Semi-Structured Health Data," arXiv preprint arXiv:2511.09017, 2025. [Online]. Available: https://arxiv.org/abs/2511.09017.
    28. R. Chandra, S. Agarwal, N. Singh, and S. Tiwari, "A review of ontology-driven big data analytics in healthcare: Challenges, tools, and applications," arXiv preprint arXiv:2510.05738, 2025. [Online]. Available: https://arxiv.org/abs/2510.05738.
    29. E. A. Sauleau, J.-P. Paumier, and A. Buemi, "Medical record linkage in health information systems by approximate string matching and clustering," BMC Med. Inform. Decis. Mak., vol. 5, Art. no. 32, 2005.
    30. B. H. Just, D. Marc, M. Munns, and R. Sandefer, "Why patient matching is a challenge: Research on master patient index (MPI) data discrepancies in key identifying fields," Perspect. Health Inf. Manag., vol. 13, Art. no. 1, 2016.
    31. W. Nelson, J. Smith, and K. Thompson, "Optimizing patient record linkage in a master patient index using machine learning: Algorithm development and validation," JMIR Form. Res., vol. 7, Art. no. e44550, 2023.
    32. G. Hagger-Johnson, T. Harron, and A. Goldstein, "Probabilistic linkage to enhance deterministic algorithms and reduce data linkage errors in hospital administrative data," J. Innov. Health Inform., vol. 24, no. 1, pp. 112-121, 2017.
    33. B. P. Hejblum, M. Doupe, and G. Lix, "Probabilistic record linkage of de-identified research datasets with discrepancies using diagnosis codes," Sci. Data, vol. 6, Art. no. 190002, 2019.
    34. K. Y. Cheng, S. Pazmino, and B. Schreiweis, "ETL processes for integrating healthcare data tools and architecture patterns," Stud. Health Technol. Inform., vol. 299, pp. 151-156, 2022.
    35. E. Henke, P. Preuss, and J. Bauer, "An extract-transform-load process design for the incremental loading of German real-world data based on FHIR and OMOP CDM," JMIR Med. Inform., vol. 11, Art. no. e41551, 2023.
    36. H. Albu, M. Mendelson, and C. Zhou, "Challenges and recommendations for EHR data extraction and preparation for dynamic prediction modeling in hospitalized patients," J. Med. Internet Res., vol. 27, Art. no. e55123, 2025, doi: 10.2196/55123.
    37. S. Priou, S. Valentin, and L. Roche, "Where have my patients gone?: A simulation study on real-world data processing in clinical data warehouses," J. Biomed. Inform., vol. 140, Art. no. 104323, 2024.
    38. C. Daniel, L. Bouzillé, and F. Burgun, "Initializing a hospital-wide data quality program: The AP-HP experience," Int. J. Med. Inform., vol. 129, pp. 43-48, 2019.
    39. R. Grannis, S. Overhage, and K. McDonald, "Enhancing patient matching in support of national and regional health information exchange," AHRQ Final Report, 2025.
    40. V. Wheatley, "Quality impact of the master patient index," J. Health Inf. Manag., vol. 18, no. 2, pp. 28-36, 2022.
    41. HealthIT.gov, "Patient matching algorithm technical overview," U.S. Dept. of Health & Human Serv., 2019. [Online]. Available: https://www.healthit.gov/topic/scientific-initiatives/patient-matching
    42. S. Lim, H. Wong, R. Philip, A. V. Vegt, K.-K. R. Choo, J. D. Pole, and C. Sullivan, "Streamlining electronic medical record data extraction and validation in digital hospitals: A systematic review," Learn. Health Syst., vol. 9, no. 2, Art. no. e10411, 2025, doi: 10.1002/lrp.70024.
    43. H. Alami et al., "Clinical data integration challenges in healthcare caused by contemporary software design," Digit. Health, vol. 11, 2025.
    44. S. J. Grannis, S. Overhage, and C. J. McDonald, "Evaluating the effect of data standardization on record linkage quality," J. Am. Med. Inform. Assoc., vol. 20, no. 2, pp. 249-254, 2013.
    45. P. D. Ohno-Machado et al., "i2b2: Informatics for integrating biology and the bedside," J. Am. Med. Inform. Assoc., vol. 19, no. 2, pp. 41-46, 2012.
    46. K. Denecke and M. Deng, "Clinical natural language processing in data integration environments," Health Informatics J., vol. 24, no. 4, pp. 414-432, 2018.
    47. R. D. Boyce, G. Horvitz, and S. Shwe, "Automated discovery of drug adverse events from clinical narratives," J. Biomed. Inform., vol. 69, pp. 127-133, 2017.
    48. M. Hardin et al., "Data harmonization in multi-institution research networks," Am. J. Epidemiol., vol. 188, no. 7, pp. 1309-1319, 2019.
    49. S. T. V. Setty and S. Shin, "Challenges and opportunities in standardizing healthcare data," Health Syst., vol. 6, no. 3, pp. 185-198, 2017.
    50. J. B. Chapman, "Transaction processing and scalability in healthcare databases," Health Database Manage., vol. 37, no. 1, pp. 23-30, 2023.
    51. K. Sharma, P. Gupta, and S. Malhotra, "Data governance frameworks for health information systems: A systematic review," Int. J. Med. Inform., vol. 145, Art. no. 104312, 2021.
    52. S. Kumar and S. Singu, "Effective data integration solutions for healthcare: A comparative study of Informatica and SSIS," Int. J. Comput. Eng. Technol., vol. 15, no. 5, pp. 187-194, 2024.
    53. R. Varhol et al., "Using general practice data for chronic disease prevalence: The impact of record linkage on estimation accuracy," BMC Med. Inform. Decis. Mak., vol. 25, Art. no. 407, 2025.
    54. T. Mutemaringa et al., "Record linkage for routinely collected health data in an African health information exchange," Int. J. Popul. Data Sci., vol. 8, Art. no. 1771, 2023.
    55. T. Batra et al., "Unifying and linking data sources in medical and public health research," Glob. Epidemiol., vol. 2, 2024.
    56. A. K. Gupta et al., "A framework for a consistent and reproducible evaluation of patient matching algorithms," JAMIA Open, vol. 5, no. 2, 2022.
    57. J. Laidler, A. Imaz Blanco, and D. Balasubramanian, "Probabilistic linkage pipeline improving linkage quality and explainability in healthcare," Int. J. Popul. Data Sci., vol. 10, no. 4, 2025.
    58. R. W. Aldridge et al., "Accuracy of probabilistic linkage using the enhanced matching system for public health and epidemiological studies," PLoS ONE, vol. 10, no. 8, Art. no. e0136179, 2015.
    59. U. Tachinardi, "Privacy-preserving record linkage across disparate healthcare datasets," Learn. Health Syst., vol. 8, no. 1, 2024.
    60. P. Christen and K. Goiser, "Quality and complexity measures for data linkage and deduplication," Quality Training and Research Center, Australian National University, Tech. Rep. TR-CS-08-01, 2008.
    61. S. Asher et al., "An introduction to probabilistic record linkage with a focus on linkage processing," Int. J. Environ. Res. Public Health, vol. 17, no. 18, Art. no. 6711, 2020.
    62. Centers for Disease Control and Prevention (CDC), "IIS patient-level de-duplication best practices report," Atlanta, GA, USA, 2025.
    63. S. Dusetzina et al., An Overview of Record Linkage Methods Linking Data for Health Services Research. Rockville, MD, USA: Agency for Healthcare Research and Quality, 2014.
    64. S. Jo et al., "Cross-enterprise document sharing (XDS) and patient identity management standards," Health Informatics Standards, vol. 4, no. 2, 2023.
    65. M. G. Arellano and G. I. Weber, "Issues in identification and linkage of patient records across an integrated delivery system," J. Healthc. Inf. Manag., vol. 12, no. 3, pp. 43-52, 1998.
    66. X. Wang, "Multiple valued logic approach for matching patient records using fuzzy logic," J. Biomed. Inform., vol. 45, no. 6, 2012.
    67. D. Avoundjian et al., "Comparing methods for record linkage for public health action," JMIR Public Health Surveill., vol. 6, no. 3, Art. no. e17110, 2020.
    68. C. Sagili, "Data integration in healthcare: Bridging gaps for improved patient outcomes," Int. J. Comput. Eng. Technol., vol. 15, no. 6, pp. 616-630, 2024.
    69. J. George and M. K. Jeyakumar, "A comparative analysis of data integration and business intelligence tools with an emphasis on healthcare data," Int. J. Eng. Trends Technol., vol. 68, no. 9, pp. 5-9, 2020.
    70. P. Vassiliadis, "A survey of Extract-Transform-Load technology," Int. J. Data Warehous. Min. (IJDWM), vol. 5, no. 3, pp. 1-27, 2009.
    71. R. Kimball and J. Caserta, The Data Warehouse ETL Toolkit: Practical Techniques for Extracting, Cleaning, Conforming, and Delivering Data. Hoboken, NJ, USA: John Wiley & Sons, 2013.