
AI-Based De-identification System for Protecting Patient Privacy in Indian Healthcare Institutions

Priyanka Kalia

16 Jul 2024

Miimansa AI's new research reveals groundbreaking innovations in de-identifying Indian clinical discharge summaries using Large Language Models (LLMs). This study addresses the urgent need for robust data de-identification methods amid rapid digitization in Indian healthcare institutions. By generating synthetic clinical reports with LLMs, the research significantly improves de-identification performance, ensuring patient privacy and data utility.

Using a small set of de-identified patient discharge summaries provided by an Indian healthcare institution, the paper reports the nominal performance of de-identification algorithms based on language models trained on publicly available non-Indian datasets, pointing towards a lack of cross-institutional generalization. Similarly, experiments with off-the-shelf de-identification systems reveal potential risks in relying on them for Indian patient data.
“Medical advancements increasingly rely on safe and ethical patient data repurposing. In India, patient data is accruing fast, yet guidelines and systems for data de-identification have lagged behind. Large Language Models represent a powerful and relatively recent advancement in AI. My group wanted to study whether these models could be used to build robust de-identification systems. Given that clinical data is difficult to access, we used LLMs to augment the available clinical data with synthetic patient data for training the de-identification algorithm. We were excited to collaborate with Miimansa and hope that this work enables democratized access to patient data, accelerates medical research, and eventually improves treatment outcomes,” said Dr. Ashutosh Modi, Assistant Professor (CSE) at IIT Kanpur, Principal Investigator of the Exploration Lab, and a collaborator on the study.
To overcome data scarcity, the study explores generating synthetic clinical reports through in-context learning with LLMs, using both publicly available summaries and the Indian summaries as examples. The experiments show that the generated patient reports are an effective data augmentation strategy for building high-performing de-identification systems that generalize well.
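As a rough illustration of this augmentation strategy, the following Python sketch assembles a few-shot prompt from seed summaries and adds the generated reports to the training pool. It is not the study's code: the function names, the prompt wording, and the `generate` callable standing in for an LLM are all assumptions made for illustration.

```python
import random
from typing import Callable, List


def build_prompt(seed_summaries: List[str], k: int = 2, seed: int = 0) -> str:
    """Assemble a few-shot prompt from up to k seed summaries.

    The prompt wording is hypothetical; the study's actual prompts
    are not given in this article.
    """
    rng = random.Random(seed)
    shots = rng.sample(seed_summaries, k=min(k, len(seed_summaries)))
    parts = ["Write one new, fictional discharge summary in the style of the samples."]
    for i, text in enumerate(shots, start=1):
        parts.append(f"Example {i}:\n{text}")
    parts.append("New summary:")
    return "\n\n".join(parts)


def augment(real: List[str], generate: Callable[[str], str], n_synthetic: int) -> List[str]:
    """Return the real summaries followed by n_synthetic generated ones.

    `generate` is a placeholder for any LLM call that maps a prompt
    to generated text; it is an assumption, not a real API.
    """
    synthetic = [generate(build_prompt(real, seed=i)) for i in range(n_synthetic)]
    return real + synthetic
```

In practice `generate` would wrap a call to whichever LLM is available, and the synthetic reports would still need PHI annotations before they could be used to train a de-identification model.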
The study highlights several protected health information (PHI) elements in the dataset that are specific to India's medical language and its social and cultural practices, which makes them difficult to detect with systems designed for non-Indian patient texts. The study’s results suggest that the LLM-based data augmentation and training approach taken by the researchers is a potential route to rapidly developing a robust de-identification system for Indian patient data. Ethical data access pathways will accelerate medical research by ensuring patient privacy and improving the quality of care through advanced, data-driven insights and personalized treatment recommendations.
Dr. Pawan Kumar, Principal Scientist (NLP), Miimansa AI, says, “The study by Miimansa addresses the significant challenge of safely and ethically accessing patient data for secondary research purposes in India. Our idea is to use domain-adapted Large Language Models (LLMs) to develop robust data de-identification systems that can capture locale-specific identifiable elements and can be trained with a seed data set. The collaboration with IITK has facilitated academic synergy and technological advancement. The long-term impact includes deploying these models as high-performance, locally validated systems that support diverse medical research projects leveraging Indian healthcare data.”

The findings indicate that using synthetic data generated by LLMs can significantly improve the performance of de-identification systems, highlighting the potential of synthetic data to address privacy and data availability challenges in biomedical research. The research sets the foundation for the development of robust de-identification solutions tailored to the Indian context, ensuring the safe and effective use of digital health data.

About Miimansa AI


Miimansa AI is a health tech startup at the forefront of AI and machine learning applications in life sciences and healthcare. Specializing in clinical data management and biomedical research, Miimansa AI develops generative AI solutions for mining large medical repositories and automating clinical workflows. To support its aims of advancing evidence-based care, Miimansa is engaging with healthcare institutions and practitioners in the country on the problem of synthesizing real-world medical evidence using AI techniques. The company is led by former faculty and alumni from IIT Kanpur and Stanford University.

