A reference machine learning model for prediction of cholera epidemics based-on seasonal weather changes linkages in Tanzania
Cholera has remained a public health threat throughout history, affecting vulnerable populations that live with unreliable water supplies and sub-standard sanitary conditions. Studies have observed that the occurrence of cholera is also strongly linked to seasonal weather patterns. Over the past decades, considerable progress has been made in developing cholera epidemic models, most of which rely on mathematical techniques. However, most existing prediction systems have shortcomings: they lack flexibility, are not user friendly, are ineffective, and do not integrate essential weather variables. In addition, advanced technologies such as machine learning (ML) have not been explicitly deployed to model cholera epidemics in developing countries, including Tanzania, because of the challenges posed by the available datasets, such as missing information, data inconsistency, class imbalance and other uncertainties. The aim of this work was to address the limitations of existing cholera epidemic models by taking advantage of ML techniques, namely by developing an ML model capable of predicting cholera epidemic outbreaks from their linkage with seasonal weather changes in Tanzania. Secondary datasets from the Tanzania Meteorological Agency (TMA), the Ministry of Health and Social Welfare, and the Dar es Salaam Water and Sewerage Authority (DAWASCO) were used. The Adaptive Synthetic Sampling Approach (ADASYN) and Principal Component Analysis (PCA) were then applied to rebalance the classes and reduce the dimensionality of the dataset. To determine which ML algorithms were best able to predict (yes/no) whether a cholera epidemic would occur given the weather variables, ten classification algorithms were evaluated using the F1-score, sensitivity and balanced-accuracy metrics. The Friedman test was then used to determine whether the differences in performance among the models were statistically significant.
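The evaluation step described above can be illustrated with a minimal sketch. It is not the study's code: the function names `binary_metrics` and `friedman_statistic` are hypothetical, the labels and scores in the usage note are made-up toy values, and the Friedman statistic is computed in its basic chi-square form without a tie correction. Label 1 is assumed to mean "epidemic" and 0 "no epidemic".

```python
from typing import Dict, List, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Sensitivity, F1-score and balanced accuracy for 0/1 labels (1 = epidemic)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0   # recall on the positive class
    specificity = tn / (tn + fp) if (tn + fp) else 0.0   # recall on the negative class
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "f1": f1,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


def friedman_statistic(scores: List[List[float]]) -> float:
    """Friedman chi-square statistic (no tie correction).

    scores[i][j] is the score of model j on evaluation fold i.
    Within each fold the best score gets rank 1; tied scores share the average rank.
    """
    n, k = len(scores), len(scores[0])
    rank_sums = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda j: -row[j])   # model indices, best first
        start = 0
        while start < k:
            end = start
            while end + 1 < k and row[order[end + 1]] == row[order[start]]:
                end += 1
            average_rank = (start + end) / 2 + 1
            for pos in range(start, end + 1):
                rank_sums[order[pos]] += average_rank
            start = end + 1
    return 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)
```

As a usage example, `binary_metrics([1, 1, 1, 0, 0, 0, 0, 0], [1, 1, 0, 0, 0, 0, 1, 0])` reports a sensitivity of 2/3 and a balanced accuracy of about 0.733. The statistic returned by `friedman_statistic` is compared against a chi-square distribution with k - 1 degrees of freedom, where k is the number of models, to judge whether the models' rankings differ more than chance would explain.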
Results showed that the Random Forest, Bagging and ExtraTree classifiers had the best performance, with 74%, 74.1% and 71.9% accuracy respectively. The ensemble method of model fine-tuning was then applied to combine the three into a single model, which achieved an overall accuracy of 78.5%. Lastly, the selected final model was evaluated in four steps. The first evaluation re-ran the final model on the same dataset but without the weather variables, and confirmed that the model with weather variables performed better than the model without them. The second evaluation re-ran the model-development procedure using datasets from the Tanga and Songwe regions in order to illustrate how the adaptive reference model can be reused by other researchers. The third and fourth evaluations followed a mixed-design approach of quantitative and qualitative methods, using focus group discussions and interviewer-administered questionnaires with 500 and 20 stakeholders respectively (including medical officers, epidemiological analysts, nurses, environmental experts, ICT experts and cholera patients). In the third evaluation, 90% of the responses agreed that the developed model is robust and appropriate for use in least developed countries for the effective prediction of cholera epidemics. The fourth evaluation indicated that the cholera ML model is better than the cholera statistical models in terms of usability, expandability and computational complexity. Overall, the study improved our understanding of the significant role of ML strategies in health-care data. However, the problem could not be treated as a time series problem because of data collection bias, such as inconsistency of the data over time.
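The abstract does not state how the three best classifiers were combined into one model. One common way to do it is hard (majority) voting, sketched here purely as an assumption. The function name `majority_vote` and the prediction lists in the usage note are hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(predictions: Sequence[Sequence[int]]) -> List[int]:
    """Combine per-model label predictions by taking the most common label per sample.

    predictions[m][i] is the label (0 or 1) that model m assigns to sample i.
    With an odd number of models and binary labels, no ties can occur.
    """
    n_samples = len(predictions[0])
    if any(len(p) != n_samples for p in predictions):
        raise ValueError("all models must predict the same number of samples")
    combined = []
    for votes in zip(*predictions):          # one tuple of votes per sample
        label, _count = Counter(votes).most_common(1)[0]
        combined.append(label)
    return combined
```

For example, if hypothetical Random Forest, Bagging and ExtraTree predictions for four weeks were `[1, 0, 1, 0]`, `[1, 1, 0, 0]` and `[0, 1, 1, 0]`, the combined prediction would be `[1, 1, 1, 0]`: each week is flagged as an epidemic week when at least two of the three models say so.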
The study recommends a review of health-care systems in order to facilitate quality data collection and further deployment of ML techniques in the health sector in Tanzania.