MS Thesis : BDA-Lab

Discovering Homogeneous Patient Treatment Clusters in Pediatric ICU Mortality Data

This thesis uses cluster analysis techniques to discover the factors behind patient survival and death in the pediatric ICU. The data is acquired from a tertiary care hospital (Aga Khan University Hospital) located in Karachi, Pakistan, which needs to detect, quantify and generalize patterns of patient response in the pediatric ICU. The aim is to develop a framework that makes the treatment of future patients more efficient and effective. The hospital generates gigabytes of pediatric ICU data spread over multiple sources; this data was integrated, cleaned and prepared for analysis over a period of six months. A weighted feature selection approach is adopted to account for medical preferences, and multiple approaches to class imbalance are investigated. First, intrinsic and extrinsic measures are evaluated over state-of-the-art algorithms, and the best candidate is used for cluster analysis. The cluster results are then analyzed separately for survival and death (based on centroid and cluster membership values), which is a novel technique, and the final results are converted into dashboards to be shown to clinicians.
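As a minimal sketch of the difference between an intrinsic and an extrinsic measure, the example below computes a silhouette coefficient (which uses only the data and the cluster labels) and a purity score (which compares clusters against a known outcome such as survival or death). The toy points, labels and outcomes are hypothetical, the sketch assumes at least two clusters, and the thesis may well use other measures.

```python
from collections import Counter
from math import dist


def silhouette(points, labels):
    """Mean silhouette coefficient: intrinsic, needs no ground truth."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    scores = []
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            sum(dist(p, q) for q in other) / len(other)
            for k, other in clusters.items()
            if k != l
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


def purity(labels, outcomes):
    """Extrinsic: share of points that match their cluster's majority outcome."""
    by_cluster = {}
    for l, o in zip(labels, outcomes):
        by_cluster.setdefault(l, Counter())[o] += 1
    majority_total = sum(c.most_common(1)[0][1] for c in by_cluster.values())
    return majority_total / len(labels)


# Hypothetical toy data: two well-separated clusters of patients.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = [0, 0, 0, 1, 1, 1]
outcomes = ["survived", "survived", "died", "died", "died", "died"]

sil = silhouette(points, labels)
pur = purity(labels, outcomes)
print(f"silhouette={sil:.3f} purity={pur:.3f}")
```

A high silhouette says the clusters are compact and well separated; a high purity says they line up with the recorded outcome. The two can disagree, which is one reason to evaluate both before choosing an algorithm.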

Cluster Analysis of Mortality and Survival Patterns in Diabetes Mellitus Patients

Diabetes is a prevalent health condition that is rising more rapidly in low- and middle-income countries than in high-income countries. In 2021, diabetes was one of the leading causes of death, with an estimated 6.7 million deaths directly caused by it according to the International Diabetes Federation (IDF) factsheet. Global prevalence has nearly doubled since 1980, rising from 4.7% to 8.5% in the adult population. Globally, there were 537 million diabetic patients in 2021, and this number is predicted to reach 784 million by 2045. This thesis involves the use of cluster analysis techniques to analyze treatment patterns with respect to survival and mortality in patients suffering from Diabetes Mellitus. The data is acquired from a tertiary care hospital (Aga Khan University Hospital) located in Karachi, Pakistan. It comprises typical big medical record data, including laboratory tests and pharmaceutical data. This work develops a framework to make future treatment of admitted diabetic patients more effective and efficient, and also standardizes results from a regional perspective.

A Framework of Mortality Prediction for Pediatric ICU Patients in Pakistan

According to UNICEF, in 2021, the mortality rate for children aged up to 24 years was 2.4%, which is equivalent to approximately 24 deaths per 1000 children. This highlights the urgent need for pediatric intensive care units (PICUs) to reduce mortality rates. Children generally have a weaker, less developed immune system than adults, which makes them more susceptible to infections and leads to critical conditions in less time than in adult ICU patients. According to WHO, the current life expectancy is 73 years. Adults are closer to this age than children, so saving a child preserves more years of life. Moreover, children have a higher chance of survival after treatment than adults because their immune systems can develop to overcome critical conditions. Furthermore, PICU equipment is expensive, and there are fewer PICU facilities than ICU facilities, which adds load on these facilities and makes it all the more necessary to utilize them well. These facts make PICU mortality a critical issue, and its prediction is essential. In this thesis, our primary objective is to construct a mortality prediction framework for patients in the PICU, following extensive experimentation on various types of Electronic Health Record (EHR) data. The proposed framework, named PEDICTOR, is a novel design specifically tailored to PICU patients in a tertiary care hospital (Aga Khan University Hospital) in Karachi, Pakistan. Another distinctive feature of PEDICTOR is that it is trained on specific age-based groups of pediatric patients and provides predictions based on the group to which the patient belongs. We developed a novel feature selection approach that incorporates expert domain knowledge, which sets it apart from other methods. Our initial results demonstrate that PEDICTOR produces effective predictions, which can help doctors save lives and hospitals manage their resources efficiently.

Predicting Paediatrics Patient Hospital Length of Stay: A Comparative Analysis of Bayesian Inference and Machine Learning Approaches

Predicting the Length of Stay (LoS) of a hospitalized patient at the time of admission could greatly help in efficient hospital resource utilization. Accurate LoS estimates made beforehand are valuable for all stakeholders, including patients, doctors, and hospital administrators. As a longer LoS is associated with the severity of the illness, advance LoS estimates could allow early interventions to avoid complications of the disease. They also enable hospital management to utilize human resources and facilities more efficiently, which increases patient flow, minimizes non-value-added care time in hospitals, and helps patients with cost estimation. However, making accurate estimations of LoS can be an arduous task. In this study, we developed a non-disease-specific predictive model using machine learning techniques and Bayesian methods for predicting the hospital length of stay based on static inputs, that is, measures that are available at the time of admission. Although many traditional methods based on statistical regression techniques have been used to predict the length of stay of hospitalized patients, powerful machine learning techniques have not yet been explored much. Applying machine learning (ML) methods that handle multiple diverse inputs could strengthen predictive abilities and improve results. We compare and discuss the performance of various commonly used supervised machine learning algorithms against Bayesian predictive models, which have never been used in the literature for predicting the length of stay of admitted patients. The models are trained and validated on a dataset of pediatric patients admitted to Aga Khan University Hospital from 2015 to 2019.

A Novel Feature Selection Method for Predicting Mortality in Cardiac Patients

The aim of this study is to generate a universal predictive framework for mortality prediction of cardiac patients based on pre-operative variables for a tertiary care hospital (Tabba Heart Institute) in Karachi, Pakistan. The research also aims to find the factors that lead to operative mortality in cardiac patients. Once these factors are identified, they would serve as baseline variables for any such predictive model. To generate a universal, comprehensive framework, a novel feature selection method will be introduced that incorporates professional bias into the rankings. This will ensure that the feature selection process retains predictors that are diagnostically correct. Machine Learning (ML) models, Bayesian inference methods and Firth logistic regression will be employed, and their robustness on the imbalanced classification task will be checked.

Evaluating Bayesian Inference with Markov Chain Monte-Carlo Simulation Method for Length of Stay Prediction

Length of stay (LoS) prediction is deemed important for a medical institution's operational and logistical efficiency. Sound estimates of a patient’s stay increase clinical preparedness and reduce aberrations. Various statistical methods and techniques are used to quantify and predict the LoS of a patient based on pre-operative clinical features. However, the application of Bayesian predictive models to predicting the LoS of cardiac patients remains limited compared to traditional machine learning (ML) techniques. This study applies Bayesian inference methods (especially hierarchical Bayesian regression) to LoS prediction for patients undergoing coronary artery bypass grafting at a tertiary care hospital (Tabba Heart Institute) located in Karachi, Pakistan, and evaluates the results against those obtained from various ML models. The study devises a comparative framework to analyze the results of Bayesian and ML models when the target variable has high variability as well as high skewness. The dataset consists of 5,636 records of cardiac patients with 68 pre-operative variables including LoS. Of these, 44 features (apart from the target variable) are selected via the permutation feature importance method and used to build Bayesian regression (simple and hierarchical) models along with other traditional ML models. Appropriate priors and likelihood are chosen for the Bayesian models to estimate the posterior distribution via the No-U-Turn Sampler (NUTS). Finally, the results of the Bayesian and ML models are compared and evaluated. LoS estimates from the simple and hierarchical Bayesian regression models resulted in a root mean squared error (RMSE) of 3.22 and 1.49 respectively, while the average RMSE of the ML models remained at 3.52. LoS prediction can be better achieved through Bayesian inference methods (especially hierarchical Bayesian regression) than through other ML methods, given that the target variable is highly skewed and has high variability. Furthermore, Bayesian models offer greater interpretability of the estimated parameters for a better causal analysis.
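To illustrate the permutation feature importance idea mentioned above, here is a minimal sketch that measures how much the RMSE grows when one feature column is shuffled. The predictor, the feature values and the function names are hypothetical stand-ins, not the models or data used in the study.

```python
import random
from math import sqrt


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Mean increase in RMSE after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = rmse(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the rows with only column j replaced by shuffled values.
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            permuted = rmse(y, [predict(row) for row in X_perm])
            increases.append(permuted - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


def toy_predict(row):
    """Hypothetical fitted model: LoS depends on feature 0 only."""
    return 2.0 + 1.5 * row[0]


# Hypothetical data: feature 0 drives the target, feature 1 is irrelevant.
X = [[1.0, 5.0], [2.0, 3.0], [3.0, 8.0], [4.0, 1.0], [5.0, 9.0], [6.0, 2.0]]
y = [toy_predict(row) for row in X]

importances = permutation_importance(toy_predict, X, y)
print(importances)
```

Shuffling the irrelevant feature leaves the error unchanged, so its importance is zero, while shuffling the informative feature raises the RMSE. Ranking features by this increase and keeping the top ones is one way a subset such as the 44 features above could be chosen.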

Enhancing Accuracy in Pediatric Inpatient Hospital Cost Estimation: A Machine Learning-Based Approach

Accurately estimating inpatient billing costs during admission is important for financial planning in healthcare. Traditional methods have limitations in capturing the true cost; hence, data-driven approaches are needed to improve hospital cost estimation in complex and dynamic environments. The main objective of this study is to predict the deviation between the initial hospital bill estimate and the actual bill charged at the time of discharge for a tertiary care hospital (Aga Khan University Hospital) in Karachi, Pakistan. The study also focuses on identifying the major factors contributing to the cost of a hospital stay. It utilizes a dataset of approximately 22,000 pediatric patients (under 18 years of age). The main features of the dataset include medical conditions, hospital administration details, and socio-demographic information. The methodology uses named entity recognition techniques to extract structured data from unstructured textual data. Subsequently, a variety of machine learning classification models are trained and tested to predict deviations in hospital bill estimates. The boosting ensemble and artificial neural network classifiers performed best in predicting the deviations in the billing cost, with best accuracy, AUC and F1-scores of 80%, 77% and 77% respectively. The analysis of feature importance revealed that age, length of stay, and the financial status of patients are the main features for predicting deviation in hospital bill estimates. The results of our study demonstrate that machine learning techniques provide a reliable and efficient means of improving hospital billing estimations. These findings have significant implications for healthcare practitioners, enabling them to make more informed decisions and allocate resources effectively.

National Identity Card Detection and Data Extraction using OCR and Image Processing

Text recognition and object recognition are currently gaining significant attention in the field of computer vision. Text recognition has various applications, including the recognition of printed documents, newspapers, and handwritten text. To this end, numerous open-source platforms, libraries, and software such as Blink ID and Tesseract OCR have been developed. This research investigates OCR scanning and originality detection for Pakistani National Identity Cards (NICs). Although several systems exist for detecting ID cards, they cannot determine whether a card is an original or a photocopy. Our motivation for this research is to create a system that can extract information from identity cards, particularly Pakistani identity cards, and check their authenticity. To accomplish this, we have utilized image processing techniques and Convolutional Neural Networks (CNNs) to identify the name, father's name, and CNIC number. Moreover, to detect the originality of the card, we have used the OpenCV template matching algorithm to observe and identify the overlapping area of the face with a circular template, while YOLO has been employed for face detection.

Solving the Electric Vehicle Routing Problem through Reinforcement Learning

Electric Vehicles (EVs) will soon become the norm for transportation; however, even with the latest technology available, charging a vehicle takes more than 30 minutes. Through reinforcement learning, we address this problem by treating both the distance travelled and the wait time at a charging station as the cost to be minimized. The first question is how to use a road network to find the most efficient model, that is, the one that generates the maximum cumulative discounted reward when distance and time are to be minimized. Secondly, we examine how travelling times can be reduced in a congested network, where the decision to recharge an EV is taken along the route and the algorithm needs to account for the traffic at a charging station. By taking into account three major variables (distance to be travelled, State of Charge, and the number of EVs currently at a charging station), we find the most efficient route using Q-learning.
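The idea can be sketched with tabular Q-learning on a toy network. Everything below is an illustrative assumption rather than the thesis setup: the four-node graph, the edge lengths (which double as energy use), the fixed waiting times standing in for queue length at each station, and the hyperparameters. The state is the pair (node, state of charge), and the reward is the negative of distance or waiting time. The discount factor is set to 1.0 here only so that the learned values equal plain total costs.

```python
import random

# Hypothetical toy network: edge length in km, 1 unit of charge used per km.
EDGES = {"A": {"B": 4, "C": 3}, "B": {"D": 3}, "C": {"D": 5}, "D": {}}
# Assumed waiting time at each charging station (a busier station waits longer).
WAIT = {"B": 6, "C": 1}
CAPACITY = 6
START = ("A", 4)  # start at node A with 4 units of charge
GOAL = "D"


def actions(state):
    """Reachable neighbours, plus 'charge' at a station if not already full."""
    node, charge = state
    moves = [nxt for nxt, d in EDGES[node].items() if d <= charge]
    if node in WAIT and charge < CAPACITY:
        moves.append("charge")
    return moves


def step(state, action):
    """Return (next_state, reward); the reward is the negative cost."""
    node, charge = state
    if action == "charge":
        return (node, CAPACITY), -WAIT[node]
    d = EDGES[node][action]
    return (action, charge - d), -d


def q_learning(episodes=500, alpha=0.5, gamma=1.0, epsilon=0.2, seed=0):
    """Tabular Q-learning with an epsilon-greedy behaviour policy."""
    rng = random.Random(seed)
    q = {}
    for _ in range(episodes):
        state = START
        while state[0] != GOAL:
            acts = actions(state)
            if rng.random() < epsilon:
                action = rng.choice(acts)
            else:
                action = max(acts, key=lambda a: q.get((state, a), 0.0))
            nxt, reward = step(state, action)
            if nxt[0] == GOAL:
                future = 0.0
            else:
                future = max(q.get((nxt, a), 0.0) for a in actions(nxt))
            old = q.get((state, action), 0.0)
            q[(state, action)] = old + alpha * (reward + gamma * future - old)
            state = nxt
    return q


def greedy_route(q):
    """Follow the highest-valued action from the start and total up the cost."""
    state, route, cost = START, [START[0]], 0
    while state[0] != GOAL:
        action = max(actions(state), key=lambda a: q.get((state, a), 0.0))
        state, reward = step(state, action)
        route.append(action)
        cost -= reward
    return route, cost


route, cost = greedy_route(q_learning())
print(route, cost)
```

In this toy setup the route through B is shorter (7 km against 8 km) but its station has the longer wait, so the learned policy goes through C for a total cost of 9 instead of 13. This shows why distance alone is not a sufficient cost when stations are congested.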