Application of machine learning models in predicting length of stay among healthcare workers in underserved communities in South Africa



Human resource planning in healthcare can employ machine learning to predict how long recruited health workers stationed in rural areas will stay. While prior studies have identified a number of demographic factors related to general health practitioners’ decisions to stay in public health practice, recruitment agencies have no validated method for predicting how long these health workers will commit to their placement. We aim to use machine learning methods to predict health professionals’ length of practice in the rural public healthcare sector from their demographic information.


Recruitment and retention data from Africa Health Placements were used to develop machine learning models that predict health workers’ length of practice. Cross-validation was used to validate the models and to determine which model performs better, based on their aggregated prediction error rates. Length of stay was categorized into four groups for classification (less than 1 year, 1 to 2 years, 2 to 3 years, and more than 3 years). R, a statistical computing language, was used to train three machine learning models and to apply 10-fold cross-validation to obtain evaluation statistics.


The three models attained almost identical results, with negligible differences in accuracy. The “best”-performing model (multinomial logistic classifier) achieved a classification accuracy of 47.34% [SD 1.63], while the decision tree model achieved a comparable 45.82% [SD 1.69]. The three models achieved an average AUC of approximately 0.66, suggesting that the four selected categorical variables carry predictive signal.


Machine learning models give us a demonstrably effective tool to predict recruited health workers’ length of practice. These models can be adapted in future studies to incorporate information besides demographic details, such as placement location and income. Beyond predicting length of practice, this modelling technique will also support strategic planning and optimization of public healthcare recruitment.


The shortage of health workers is a global crisis for which numerous countries have proposed and implemented intervention plans [1, 2]. However, there are limited data on the impact of these interventions and their long-term sustainability. Research shows that the loss of healthcare workers in African countries (such as South Africa and Ghana) cripples already fragile health systems [3, 4]. Hence, the retention of health workers is essential for healthcare system performance. These studies also point out that recruitment should focus not only on nurses and physicians, but also on community health workers (CHWs), who help primary healthcare systems extend coverage and address the basic health needs of societies [4].

Specifically, healthcare systems in sub-Saharan Africa (SSA) face a serious human resource crisis, with recent estimates pointing to a shortfall of more than half a million nurses and midwives needed to meet the Millennium Development Goals of improving the health and wellbeing of the SSA population by 2015 [5]. One reason for this is human capital flight (“brain drain”) in the health profession, especially in the public sector [1, 6]. Migration of health workers from low- and middle-income countries (LMICs) to high-income countries is a controversial aspect of globalization and has attracted considerable attention in health policy discourse at both the technical and political levels [1, 7,8,9]. The migration of skilled health workers is a direct loss of considerable resources for the public sector of LMICs, because the benefits accrue to countries that did not invest in educating these young professionals. To make matters worse, in many sub-Saharan countries such as Sierra Leone and South Africa, the population has few alternatives, because the private sector or the next nearest health facility is too distant or too costly [10].

To maintain a functional health system, most countries have raised their retirement age in order to extend the working life of their staff. Furthermore, Botswana and South Africa have recruited from other countries within and outside the continent [7]. Despite various local and international frameworks, the effectiveness of these interventions is yet to be seen [7, 8]. Another challenge lies in the monitoring and evaluation of these frameworks. Recent cross-sectional reviews of currently available healthcare workforce databases show that, in most cases, the systems are fragmented and unreliable and cannot be integrated at either the national or the international level. For policy-makers to make data-driven decisions, better database management systems still need to be developed [1, 2, 8].

A high turnover rate in the health workforce is another concern, as it is costly and detrimental to organizational performance and quality of care. Healthcare organizations with high attrition rates face problems not only with the quality, consistency, and stability of the services provided to people in need, but also with the working conditions of the remaining staff, such as increased workloads, disrupted team cohesion, and decreased morale [11, 12].

Some studies have focused on the influence of individual and organizational factors on an employee’s intention to leave [13]. A World Health Organization (WHO) study of four African countries shows that the major reasons behind health worker migration are better salary, a safer environment, living conditions, lack of facilities, lack of promotion, and heavy workloads [8]. Other studies conclude that a better compensation package with a good work-life balance is the primary reason to migrate [6, 14, 15]. On the other hand, one obstacle to migration is the language barrier, because language underpins patient care [16, 17]. Patients express their distress by describing their symptoms and pain, and they report changes in health status to professionals. Nurses and doctors need current, technical language fluency to communicate under stress and duress with one another, with members of their teams, and with patients’ families [6].

Another healthcare policy concern is the maldistribution of the healthcare workforce between urban and rural areas. It prevents equitable access to health services, contributes to increased healthcare costs and the underutilization of health professionals’ skills in urban areas, and remains a barrier to universal health coverage [6].

Overall, the human capital flight of local health professionals, the high turnover rate, and the shortage of workers in the public sector of South Africa demand further investment in attracting and retaining foreign healthcare staff who stay for an extended period of time. The WHO has also issued global recommendations to improve the rural recruitment and retention of the health workforce [18]. This is pivotal to the delivery of healthcare in rural and remote areas of South Africa. One study has shown that 84% of the South African population uses public healthcare, served by only 30% of the trained and certified doctors [19]. More generally, sub-Saharan Africa faces a severe lack of healthcare workers, with only 3% of the world’s total medical staff while carrying 24% of the global burden of disease [8]. The arrival of foreign medical workers and their placement in the public health sector reduces the two-front maldistribution of physicians, alleviates the lack of human resources in public rural facilities, and improves access to healthcare for people in rural areas [8].

To date, greater efforts have focused on recruitment, with significantly less attention to workforce retention. As noted above, improving health access in rural areas requires maintaining a high retention rate in the medical workforce. Currently, there are few empirical studies of the factors that influence length of practice [14, 17]. Previous attempts to identify these factors mainly focus on worker satisfaction at medical facilities and the retention strategies of staffing agencies [17]. There is some recent research into the correlation between employee demographic information and the success of retention efforts in public health facilities [14].

This paper aims to develop a tool that predicts the length of practice of foreign healthcare workers from their demographic information. Machine learning methods are well suited to this challenge. Rather than considering the effect of each demographic variable on the length of practice one at a time, as traditional approaches do, machine learning methods examine all potential predictors simultaneously in an unbiased manner and identify patterns in the data that are useful for prediction.


Study design

A quantitative retrospective cohort study was conducted using secondary data collected by Africa Health Placements (AHP).

Study setting

The study setting is the South African health system, specifically the healthcare worker population in underserved communities and its distribution and retention levels. AHP recruits foreign and locally qualified health professionals to be placed in underserved communities in South Africa. Underserved areas, such as rural areas, often face challenges in recruiting and retaining health workers. The government has responded with programmes such as compulsory community service and a rural allowance to address this challenge.

Data acquisition

Longitudinal individual health worker records are maintained at AHP. These health workers included professionals from South Africa and the rest of the world seeking employment in underserved facilities in South Africa. Data were collected using two methods: (i) a customized online portal completed by healthcare workers (HCW) and (ii) interviews by recruitment officers through email, Skype, and telephone conversations. Data were captured in a database and customer management system called Docwize. The online portal is available on the AHP website as a contact form. Once registered, the HCW receives login details to complete their application on Docwize. This system allows them to enter personal and professional information, upload certificates (which are then verified with the respective regulatory authorities), and be informed about the next steps until they secure a job offer. The HCW have the option of completing the application online or supplying the details to the recruitment officers, who then update the system. The recruitment process takes an average of 18 months to complete, and 75% of the HCW were discouraged by the regulatory delays, resulting in incomplete data. The length of stay was continuously monitored during the employment contract. Email and telephone contact were used to establish the last date of employment at a particular facility.

Statistical analysis

Dataset description and manipulation

We took a complete cases approach, using only data from successfully recruited health workers without missing observations. The Africa Health Placements dataset contains 62 variables and 13 698 entries, in which there were 2079 successfully recruited practitioners. Among these 2079 professionals, some chose not to provide personal information such as marital status or gender. After data cleaning, there were 1838 entries with completed fields to meet the requirements of this study.
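The complete-case step described above can be sketched as a simple filter. This is a minimal illustration only: the field names and the three sample records are hypothetical and are not taken from the AHP dataset.

```python
# Hypothetical field names; the real AHP schema (62 variables) is not given in the text.
REQUIRED_FIELDS = ("nationality", "profession", "relationship", "gender")

def is_complete(record, fields=REQUIRED_FIELDS):
    """Return True when every required field is present and non-empty."""
    return all(str(record.get(field) or "").strip() for field in fields)

def complete_cases(records, fields=REQUIRED_FIELDS):
    """Keep only the records that have no missing required field."""
    return [record for record in records if is_complete(record, fields)]

# Made-up records for illustration.
records = [
    {"nationality": "South Africa", "profession": "Doctor",
     "relationship": "Single", "gender": "Female"},
    {"nationality": "Nigeria", "profession": "Nurse",
     "relationship": None, "gender": "Male"},        # missing relationship
    {"nationality": "Netherlands", "profession": "Doctor",
     "relationship": "Married", "gender": ""},       # empty gender
]
kept = complete_cases(records)
print(len(kept), "of", len(records), "records are complete")
```

Adding a field such as age to `REQUIRED_FIELDS` shrinks the retained set further, which is the trade-off the authors describe when they exclude age.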

The variables used to develop our machine learning models were chosen based on their availability in the AHP data system. They are nationality, profession, relationship status, and gender. Because the age variable has many missing values, a complete-case approach that included age would have further reduced the dataset to only 914 entries and undermined the ability of the models to learn from the existing data. Hence, we excluded age from the final analysis. Notably, all four of our predictors are categorical variables. A challenge with categorical variables in machine learning is that fully representing each variable requires a large number of dummy variables, one for each level but one. For example, since our data had records from 145 countries, we needed 144 dummy variables to represent all of them. This would result in a very sparse dataset, which is usually not useful in predictive modelling. Hence, we recoded each variable as follows:

  • Nationality: categorical data of 145 different countries. Instead of recording nationality as it is, the nationality variable is recoded into 4 categories based on the World Bank’s classification of countries: low income, lower middle income, upper middle income, and high income.

  • Profession: categorical data of 22 different registered professions, recoded into 3 categories: doctor, nurse, and other.

  • Gender: categorical data of 2 levels: male and female

  • Relationship status: categorical data of 3 levels: married, single, or other.
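The recoding in the list above can be sketched as a pair of lookup tables. The tables below are a small hypothetical subset for illustration, not the study's actual mapping, and the job titles are invented. The last line only counts dummy variables (levels minus one) for nationality and profession before and after recoding.

```python
# Illustrative lookup tables only: a small, hypothetical subset, not the
# study's actual mapping. World Bank income groups are revised annually.
INCOME_GROUP = {
    "South Africa": "upper middle income",
    "United Kingdom": "high income",
    "Netherlands": "high income",
    "Nigeria": "lower middle income",
}
# Hypothetical job titles; the text only says 22 professions became 3 groups.
PROFESSION_GROUP = {
    "medical officer": "doctor",
    "professional nurse": "nurse",
}

def recode(record):
    """Collapse high-cardinality fields into the coarse levels used for modelling."""
    relationship = record["relationship"].strip().lower()
    return {
        "income_group": INCOME_GROUP.get(record["nationality"], "unknown"),
        "profession": PROFESSION_GROUP.get(record["profession"].strip().lower(), "other"),
        "gender": record["gender"].strip().lower(),
        "relationship": relationship if relationship in ("married", "single") else "other",
    }

def dummy_count(levels):
    """Indicator columns needed to encode a factor with `levels` levels."""
    return levels - 1

example = recode({
    "nationality": "Nigeria",
    "profession": "Pharmacist",
    "gender": "Female",
    "relationship": "Cohabiting",
})
print(example)
# Nationality and profession: 145 and 22 levels before recoding, 4 and 3 after.
print(dummy_count(145) + dummy_count(22), "->", dummy_count(4) + dummy_count(3))
```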

Machine learning model development

With a large recruitment and retention dataset from AHP, we built three machine learning predictive models using relevant demographic data. We evaluated the models’ performance by doing 10-fold cross-validation. The aim was to choose a model that performs significantly better in predicting length of practice.

As shown in Table 1, three machine learning classification models (multinomial logistic regression, decision tree, and Naive Bayes classification) were trained on the dataset. The problem was approached as classification rather than regression, as we aimed to classify a successful recruit into one of four mutually exclusive groups (less than 1 year, 1 to 2 years, 2 to 3 years, and more than 3 years). A regression method is not optimal in this case, due to (i) the lack of quantitative numerical variables in our demographic information, (ii) the wide range of values of the dependent variable (length of practice measured in days), and (iii) the non-continuous nature of the dependent variable. A regression method would require a much larger dataset to arrive at a model of acceptable fit. With our currently available dataset, the experimental fit is approximately 18%, with a high internal sum of squares. Moreover, in strategic workforce planning, a precise prediction of the length of practice in days (or months) is generally not expected. A prediction of whether a specific healthcare worker will stay for 1 year, 2 years, or longer is usually acceptable for most intents and purposes.
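Turning length of practice in days into a class label can be sketched as follows. This reads the four groups as consecutive, non-overlapping bins. The 365-day year and the handling of exact boundaries are assumptions, since the text does not specify them.

```python
def stay_class(days, days_per_year=365):
    """Map a length of practice in days to one of four mutually exclusive groups.

    Assumption: bins are half-open, so exactly 365 days falls in "1 to 2 years".
    """
    if days < 0:
        raise ValueError("length of practice cannot be negative")
    years = days / days_per_year
    if years < 1:
        return "less than 1 year"
    if years < 2:
        return "1 to 2 years"
    if years < 3:
        return "2 to 3 years"
    return "3 years or more"

for days in (120, 365, 800, 1500):
    print(days, "->", stay_class(days))
```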

Table 1 Machine learning results


To decide which of the three models performs best, we have to assess their ability to generalize to new, unseen data. A challenge in our research was the lack of separate test data for model evaluation. Splitting our existing data conventionally in an 80/20 ratio (80% of the data for training and 20% for testing) was an option, but not an optimal one, as we wanted to use all available data for training.

We evaluated our three models with a technique called 10-fold cross-validation, which works as follows: we randomly partition the original dataset into 10 disjoint subsets, use nine of those subsets for training, make predictions on the remaining subset, and record the misclassification error. This is repeated so that each subset is held out once. To avoid results that depend on one fortunate split, we average the misclassification error across the 10 folds. Comparing the average misclassification errors of the three machine learning models allowed us to decide which model performs best on unseen data.
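The procedure can be sketched in a few lines. This is not the authors' R code: the majority-class baseline stands in for the three real classifiers, the data are synthetic, and the function names are hypothetical.

```python
import random
from collections import Counter
from statistics import mean, stdev

def k_fold_indices(n, k=10, seed=0):
    """Randomly partition the indices 0..n-1 into k disjoint folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def fit_majority(train_features, train_labels):
    """Placeholder model: always predicts the most frequent training label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return lambda features: majority

def cross_validate(features, labels, fit, k=10, seed=0):
    """Return the mean and standard deviation of the per-fold misclassification error."""
    errors = []
    for held_out in k_fold_indices(len(labels), k, seed):
        held = set(held_out)
        train = [i for i in range(len(labels)) if i not in held]
        predict = fit([features[i] for i in train], [labels[i] for i in train])
        wrong = sum(predict(features[i]) != labels[i] for i in held_out)
        errors.append(wrong / len(held_out))
    return mean(errors), stdev(errors)

# Synthetic data: 60 records of one class and 40 of another, with no real features.
labels = ["less than 1 year"] * 60 + ["1 to 2 years"] * 40
features = [None] * len(labels)
mean_error, sd_error = cross_validate(features, labels, fit_majority)
print(f"mean misclassification error: {mean_error:.2f} (SD {sd_error:.2f})")
```

Any function that takes training features and labels and returns a prediction function can be passed as `fit`, so the same loop can compare several models on identical folds.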


Three machine learning models were trained, and 10-fold cross-validation was used to obtain evaluation statistics. The three models attain almost identical results, with negligible differences in accuracy. The “best”-performing model (multinomial logistic classifier) achieves a classification accuracy of 47.34% [SD 1.63], while the decision tree model achieves a comparable 45.82% [SD 1.69] (Table 1).

Multiclass area under the curve (AUC) was computed by building multiple receiver operating characteristic (ROC) curves (one class versus another) and taking the average, as defined by Hand and Till [20]. The three models achieve an average AUC of 0.66 (multinomial logistic at 0.6652, decision tree at 0.6635, Naive Bayes at 0.6602), suggesting that the four selected categorical variables carry predictive signal.
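The pairwise averaging described by Hand and Till can be sketched as below. For each pair of classes, the AUC is estimated in both directions and averaged, and the multiclass value is the mean over all pairs. The input layout (one dictionary of class probabilities per record) and the toy data are assumptions made for illustration.

```python
from itertools import combinations

def pairwise_auc(positive_scores, negative_scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))

def hand_till_auc(labels, probabilities, classes):
    """Average pairwise AUC over all unordered class pairs.

    probabilities[k][c] is the predicted probability that record k is in class c.
    Every class must occur at least once in `labels`.
    """
    pair_values = []
    for i, j in combinations(classes, 2):
        in_i = [p for y, p in zip(labels, probabilities) if y == i]
        in_j = [p for y, p in zip(labels, probabilities) if y == j]
        a_i = pairwise_auc([p[i] for p in in_i], [p[i] for p in in_j])
        a_j = pairwise_auc([p[j] for p in in_j], [p[j] for p in in_i])
        pair_values.append((a_i + a_j) / 2)
    return sum(pair_values) / len(pair_values)

classes = ["a", "b", "c"]
labels = ["a", "b", "c"]
perfect = [{"a": 1.0, "b": 0.0, "c": 0.0},
           {"a": 0.0, "b": 1.0, "c": 0.0},
           {"a": 0.0, "b": 0.0, "c": 1.0}]
uninformative = [{"a": 1 / 3, "b": 1 / 3, "c": 1 / 3}] * 3
print(hand_till_auc(labels, perfect, classes))        # perfect separation
print(hand_till_auc(labels, uninformative, classes))  # no separation
```

A value of 0.5 corresponds to no separation between classes and 1.0 to perfect separation, which is the scale on which the reported 0.66 should be read.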

Overall, the three models had significant accuracy in classifying the length of stay of healthcare workers (p value < 2.2e−16) (Table 1). Kappa statistics were also computed, to measure how much better each classifier performs than a classifier that simply guesses at random according to the frequency of each class [21]. The Cohen’s Kappa statistics of the multinomial logistic, decision tree, and Naive Bayes models are 0.2658, 0.2649, and 0.2521 respectively, suggesting fair (but not substantial) agreement between prediction and response after adjusting for the amount of agreement expected by chance.
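As a worked illustration of the statistic, Cohen’s Kappa can be computed from a confusion matrix as (observed agreement − chance agreement) / (1 − chance agreement). The matrix below is made up and is not taken from Table 1.

```python
def cohen_kappa(confusion):
    """Cohen's Kappa for a square confusion matrix.

    confusion[i][j] is the number of records with true class i predicted as class j.
    """
    size = len(confusion)
    total = sum(sum(row) for row in confusion)
    observed = sum(confusion[i][i] for i in range(size)) / total
    # Chance agreement: product of the true and predicted class frequencies.
    expected = sum(
        sum(confusion[i]) * sum(row[i] for row in confusion)
        for i in range(size)
    ) / (total * total)
    return (observed - expected) / (1 - expected)

# Made-up two-class example: 35 of 50 records are classified correctly.
example = [[20, 5],
           [10, 15]]
print(round(cohen_kappa(example), 4))
```

Here the observed agreement is 0.70 and the chance agreement is 0.50, so Kappa is 0.40. A classifier with the same raw accuracy on more imbalanced classes would score lower.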

All three models perform reasonably well at identifying those who are likely to stay for less than 1 year (Table 2). The sensitivity for this class was greater than 75% for all three models, showing that they correctly identify more than three quarters of those who are likely to stay less than 1 year. Specificity for this class is not particularly high (lower than 65% in all cases), so the models do less well at identifying those who stay for more than 1 year. However, with a negative predictive value as high as 84% across the three techniques, when a model classifies a person as not belonging to the group that stays for less than 1 year, that classification is likely to be correct.
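The per-class figures discussed above follow from treating one class as positive and all others as negative. A minimal sketch, using a made-up three-class confusion matrix rather than the values in Table 2:

```python
def one_vs_rest_metrics(confusion, k):
    """Sensitivity, specificity, and negative predictive value for class k.

    confusion[i][j] is the number of records with true class i predicted as class j.
    """
    total = sum(sum(row) for row in confusion)
    true_pos = confusion[k][k]
    false_neg = sum(confusion[k]) - true_pos
    false_pos = sum(row[k] for row in confusion) - true_pos
    true_neg = total - true_pos - false_neg - false_pos
    return {
        "sensitivity": true_pos / (true_pos + false_neg),
        "specificity": true_neg / (true_neg + false_pos),
        "npv": true_neg / (true_neg + false_neg),
    }

# Made-up counts; row and column 0 play the role of "less than 1 year".
confusion = [[8, 1, 1],
             [3, 5, 2],
             [2, 2, 6]]
metrics = one_vs_rest_metrics(confusion, 0)
print({name: round(value, 3) for name, value in metrics.items()})
```

In this toy matrix the class is found 80% of the time (sensitivity) while 25% of other records are wrongly assigned to it (specificity 75%), a pattern similar in kind to the one reported for the shortest-stay group.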

Table 2 Predictions of length of stay across the three models

In contrast, all three models perform poorly at identifying those who stay between 2 and 3 years (Table 2). With sensitivity as low as 0% (decision tree) and specificity up to 100%, the three models appear to have learned to assign most records (all of them, in the decision tree’s case) to other classes. This is likely the result of an imbalanced sample with too few records in this class (Fig. 1).

Fig. 1 Number of subjects categorized by (from left to right, top to bottom) length of practice, profession, relationship status, and country

Comprehensive data analysis

In general, more males (997, 54%) than females (861, 46%) were recruited (Table 3). Males stayed on average 187.78 days longer than females. South Africa supplied the greatest number of health workers (381, 41%), followed by the United Kingdom (361, 39%), Nigeria (106, 11%), and the Netherlands (86, 9%); these percentages are shares of the 934 recruits from these four countries (Table 3). Doctors (1538, 83%) were the most recruited health workers, followed by other professionals (193, 10%) and nurses (107, 6%). With regard to relationship status, single healthcare workers constituted 61% of those recruited, 31% were married, and 8% were cohabiting (Table 3, Figs. 1, 2, and 3).

Table 3 Length of stay by gender, nationality, profession, and relationship status

Fig. 2 Length of stay as a function of relationship status, coloured by gender and faceted by income group

Fig. 3 Decision tree on income, gender, and profession

Figure 4 shows two world heat maps that represent (a) the number of successful recruits from each country and (b) the average length of practice of recruits from each country. The two maps point to an observation: AHP, as a health placement organization, is not very successful in recruiting from some countries (e.g. Russia), but once it does, the recruits tend to stay for an extended period of time. However, the sample size casts some doubt on this observation. Some countries have a very high average length of stay simply because very few recruits come from them.

Fig. 4 Maps showing the world distribution of (a) the number of candidates sourced from each country and (b) the average length of practice of the candidates from each country


This research shows that a majority of foreign qualified healthcare workers (1497 out of 1838, 81%) stay at their placement facilities for less than 3 years. While a constant rate of foreign recruitment per year can “fill the gap” on paper, the low average length of practice signals a hidden cost of recruiting, relocating, and training new healthcare professionals. Effective workforce planning by government or non-profit organizations thus requires a tool to predict the length of practice of incoming health professionals.

The three models attain results significantly above chance, with an average AUC of approximately 0.66 (multinomial logistic at 0.6652, decision tree at 0.6635, Naive Bayes at 0.6602), suggesting that the four selected categorical variables carry predictive signal. This indicates that, by applying and retraining machine learning models on available datasets, Human Resources for Health decision makers can effectively source healthcare workers who are most likely to stay the longest in underserved communities.

Machine learning should be applied together with qualitative methods such as exit interviews, so as to give an in-depth understanding of the healthcare worker perceptions and experiences that relate to length of stay. A mixed-methods design would have generated a better understanding of why health workers of a certain gender, country, age, or level of experience tend to stay longer than others.

Limitations of the study

Incomplete fields in the data were a limitation, as many candidates were excluded from the study due to missing information. We could not include age as a predictor, although we recognize that it could influence a health worker’s long-term plan to stay. Our problem with incomplete data relates directly to the ineffective database systems that are common in the public sector in South Africa [1, 2, 8]. Although installing and enabling a more effective database system imposes a cost on healthcare non-profits and the public sector in the short run, such a system is likely to have a substantial impact, as the machine learning models can be further improved by learning from a larger, higher-quality dataset. In the meantime, the public sector and NGOs could collaborate and share data to strengthen the training of machine learning algorithms.


Machine learning models give us an effective tool to predict recruited health workers’ length of practice. These models can be extended beyond demographic information (e.g. to information about placement location and income), allowing strategic planning and optimization of public healthcare recruitment.



Abbreviations

AUC: Area under the curve

HCW: Healthcare workers

LMICs: Low- and middle-income countries

NGO: Non-governmental organization

ROC: Receiver operating characteristic

SSA: Sub-Saharan Africa

WHO: World Health Organization


  1. Bangdiwala IS, Fonn S, Okoye O, Tollman S. Workforce resources for health in developing countries. Public Heal Rev. 2010;32(1):296–318.

  2. Viscomi M, Larkins S, Sen Gupta T. Recruitment and retention of general practitioners in rural Canada and Australia: a review of the literature. Can J Rural Med. 2013;18(1):13–24.

  3. Tshuma N, Mosikare O, Alaba OA, Muloongo K, Nyasulu PS. Acceptability of community-based adherence clubs among health facility staff in South Africa: a qualitative study. Patient Prefer Adherence. 2017;11:1523–31.

  4. Agyepong IA, Anafi P, Asiamah E, et al. Health worker (internal customer) satisfaction and motivation in the public sector in Ghana. Hum Resour Heal. 2012;11(247).

  5. Delobelle P, Rawlinson JL, Ntuli S, Malatsi I, Decock R, Depoorter AM. Job satisfaction and turnover intent of primary healthcare nurses in rural South Africa: a questionnaire survey. 2010:371–83.

  6. Habte D, Dussault G, Dovlo D. Challenges confronting the health workforce in sub-Saharan Africa. World Hosp Heal Serv. 2004;40(2):23–6.

  7. Dovlo D. The brain drain and retention of health professionals in Africa. In: A case study Prep a Reg Train Conf Improv Tert Educ sub-Saharan Africa Things that Work; 2003. p. 23–5.

  8. Hatcher AM, Onah M, Kornik S, Peacocke J, Reid S. Placement, support, and retention of health professionals: national, cross-sectional findings from medical and dental community service officers in South Africa. Hum Resour Health. 2014;12:14.

  9. Cometto G, Tulenko K, Muula AS, Krech R. Health workforce brain drain: from denouncing the challenge to solving the problem. PLoS Med. 2013;10(9):10–2.

  10. Mills A, Brugha R, Hanson K, McPake B. What can be done about the private health sector in low-income countries? Bull World Health Organ. 2002;80:325–30.

  11. Kok MC, Dieleman M, Taegtmeyer M, et al. Which intervention design factors influence performance of community health workers in low- and middle-income countries? A systematic review. Health Policy Plan. 2014;30(9):1207–27.

  12. Rosenthal EL, Brownstein JN, Rush CH, et al. Community health workers: part of the solution. Health Aff (Millwood). 2010;29(7):1338–42.

  13. Steinmetz S, De Vries DH, Tijdens KG. Should I stay or should I go? The impact of working time and wages on retention in the health workforce; 2014. p. 1–12.

  14. Ali Mohammed M, De Moraes A. Factors affecting employees’ job satisfaction in public hospitals: implications for recruitment and retention. J Gen Manag. 2009;34(4):51–66.

  15. Labonté R, Sanders D, Mathole T, et al. Health worker migration from South Africa: causes, consequences and policy responses. Hum Resour Health. 2015;13(1):92.

  16. Sieleunou I. Health worker migration and universal health care in sub-Saharan Africa. Pan Afr Med J. 2011;10:55.

  17. George G, Gow J, Bachoo S. Understanding the factors influencing health-worker employment decisions in South Africa. Hum Resour Health. 2013;11(1):15.

  18. Buchan J, Couper ID, Tangcharoensathien V, et al. Early implementation of WHO recommendations for the retention of health workers in remote and rural areas. Bull World Health Organ. 2013;91(11):834–40.

  19. NDoH. National Health Insurance; 2017.

  20. Hand DJ, Till RJ. A simple generalisation of the area under the ROC curve for multiple class classification problems. Mach Learn. 2001;45:171–86.

  21. Landis JR, Koch GG. An application of hierarchical kappa-type statistics in the assessment of majority agreement among multiple observers. Biometrics. 1977;33(2):363.



The authors would like to thank Africa Health Placements for providing the dataset used in the study.


This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.

Availability of data and materials

The dataset supporting the conclusions of this manuscript is available with the corresponding author and will be made available in an anonymized version on reasonable request.

Author information



All authors contributed toward conceptualization, data analysis, drafting, and critically revising the paper and agree to be accountable for all aspects of the work. All authors also read and approved the final manuscript.

Corresponding author

Correspondence to Sangiwe Moyo.

Ethics declarations

Ethics approval and consent to participate

Permission to conduct the study was obtained from Africa Health Placements. The researchers followed the highest standards to protect the confidentiality and anonymity of subject data. All identifying information about individual subjects, such as name, address, and date of birth, was removed from the dataset prior to the study.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver applies to the data made available in this article, unless otherwise stated.

About this article

Cite this article

Moyo, S., Doan, T.N., Yun, J.A. et al. Application of machine learning models in predicting length of stay among healthcare workers in underserved communities in South Africa. Hum Resour Health 16, 68 (2018).
