Medical Data Management to enable the use of Machine Learning-based systems

MORA, SARA
2023-06-09

Abstract

Objective. Differential diagnosis is the process of formulating a precise and accurate diagnosis for a patient. In most cases, however, it is performed at first admission in a primary-care unit and relies mainly on the physician's experience. Delays and mistakes in formulating the diagnosis may lead to serious complications. The aim of my PhD project is to improve differential diagnosis by developing a system able to automatically extract both structured and unstructured data, enabling the use of Machine Learning (ML) methods on Real World Data (RWD), i.e., data collected during daily clinical practice outside traditional interventional controlled clinical trials.

Approach. I used novel standards-based architectures to extract structured data from the hospital Laboratory Information System (LIS) and transfer them into a SQL Server database, using the existing architecture of the Ligurian Infectious Diseases Network (LIDN). I then used NLP-based methods to build the most appropriate numerical representations for textual data, which were manually extracted and anonymized from Electronic Medical Records. To test the efficacy of the proposed pipeline, I used both structured and unstructured data from two different medical scenarios as input to the developed ML pipeline to support the diagnostic process.

Main results. The first main result is an intelligent system able to extract a dataset of 285 features based on structured RWD from a selected group of patients with a diagnosis of candidemia/bacteremia. Because each feature was obtained by reprocessing stored data, the output of the rule-based system needed to be validated. Specifically, clinicians manually validated the results for 381 patients randomly selected from the cohort, attesting that each of the selected features had an error below 1%. The second main result is an NLP and ML-based pipeline able to transform free text into the most appropriate numerical representation, using Bag of Words or Word Embedding techniques, to enable text classification or information extraction tasks. The first use case aimed at localizing the Epileptogenic Zone (EZ) in drug-resistant epilepsy patients using the textual semiological descriptions of seizures. I showed that all the numerical representations built by the pipeline localized the seizure onset zone accurately (F1-score up to 0.78 on a blind set). The second use case aimed at extracting information on the possible presence of a Central Venous Catheter (CVC) implanted at the time of candidemia diagnosis, to build a more complete picture of the patient. For this I used the clinical notes written by medical staff in a limited time span around the diagnosis. The developed pipeline reached mean F1-scores of up to 0.92 in determining whether a patient had a CVC implanted and up to 0.84 in determining whether the CVC was removed; both results were obtained on a blind test set. The third main result derives from feature selection, carried out through a majority voting process, on the complete dataset of structured and unstructured data for the candidemia/bacteremia use case. My results confirm that the CVC feature has a strong impact on the outcome, i.e., infection with invasive candidiasis (selected 100% of the time, with a mean coefficient of 0.12 in the LASSO matrix).

Significance. The developed NLP and ML-based pipeline accurately identifies the EZ location and the presence of a CVC from text alone. Its main advantage is that it embeds no knowledge specific to a medical discipline, so it can easily be reused in other scenarios; it also works on Italian text. More generally, the complete architecture exploits the data-reuse paradigm to support differential diagnosis, so an ever-growing amount of data will become available over time.
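As a rough illustration of two notions named in the abstract, the Bag of Words representation and the F1-score, the sketch below turns short free-text notes into token-count vectors and scores a set of binary predictions. It is not the thesis pipeline: the function names, the whitespace tokenizer, and the toy English notes are all assumptions made for this example, and a real pipeline would add proper tokenization, Italian text handling, and a trained classifier.

```python
from collections import Counter


def build_vocabulary(texts):
    """Map each distinct lowercase token to a column index (sorted for stability)."""
    tokens = sorted({tok for text in texts for tok in text.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}


def bag_of_words(text, vocabulary):
    """Return the token-count vector of `text`; out-of-vocabulary tokens are ignored."""
    counts = Counter(text.lower().split())
    vector = [0] * len(vocabulary)
    for tok, n in counts.items():
        if tok in vocabulary:
            vector[vocabulary[tok]] = n
    return vector


def f1_score(y_true, y_pred, positive=1):
    """F1 = 2 * precision * recall / (precision + recall) for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy notes, invented for illustration (not taken from the thesis data).
notes = ["cvc inserted today", "no cvc present", "cvc removed today"]
vocab = build_vocabulary(notes)
vectors = [bag_of_words(note, vocab) for note in notes]
```

Each row of `vectors` has one column per vocabulary token, so the notes become fixed-length numerical inputs that a classifier can consume; `f1_score` then summarizes how well predicted labels match the reference labels on a held-out set.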
Files in this item:
phdunige_3891823.pdf (Adobe PDF, 11.14 MB), under embargo until 2024-06-09
Type: Doctoral thesis

Use this identifier to cite or link to this document: https://hdl.handle.net/11567/1120760