Dynamic prediction of cardiovascular disease using improved LSTM

Purpose – Previous dynamic prediction models rarely handle multi-period data with different intervals, and large-scale patient hospital records are not effectively used to improve prediction performance. This paper focuses on the prediction of cardiovascular disease using an improved long short-term memory (LSTM) model.
Design/methodology/approach – A new model based on the traditional LSTM is proposed to predict cardiovascular disease. The irregular time interval is smoothed to obtain a time parameter vector, which is used as an input of the forget gate of the LSTM to overcome the prediction obstacle caused by the irregular time interval.
Findings – The experimental results show that the dynamic prediction model proposed in this paper obtains significantly better classification performance than the traditional LSTM model.
Originality/value – In this paper, the authors improve the LSTM by smoothing the irregular time between different medical stages of the patient to obtain the temporal feature vector.


Introduction
Cardiovascular disease (CVD) is a class of diseases that involve the heart or blood vessels (Mendis et al., 2011). CVD is the chronic disease that poses the greatest threat to people, and it has become one of the leading causes of death around the world (Gibbons et al., 1999). According to data released by the World Health Organization in May 2017, approximately 17.7 million people died of CVD in 2015, accounting for 31 per cent of all global deaths. Therefore, medical professionals and researchers have carried out extensive studies on the treatments and interventions for CVD (Kagashe et al., 2017; Yan et al., 2016; Zhu et al., 2017).
Unlike acute diseases, chronic diseases pass through many stages during their development and evolution, and their characteristics vary across stages (Asaria et al., 2007). To enable early intervention and maintain the health of patients with CVD, various predictive models have been developed to identify high-risk groups or predict the development of the disease. Most previous disease prediction models were based on case-cohort studies that investigate the relationship between potential high-risk factors and morbidity and mortality (Ganna and Ingelsson, 2015). Body mass index (Rost et al., 2018), waist-hip ratio (Zwakenberg et al., 2018), and sitting time and sitting posture (Howell et al., 2017) have been found to correlate highly with the morbidity of CVD. However, because case-cohort studies are costly, the training data of these models are insufficient, and their prediction performance needs further improvement.
In recent years, with the development of information technology and the wide application of information systems in the medical industry, hospital information systems (HIS) have accumulated large-scale, multi-dimensional data including patient demographics, disease symptoms and diagnoses, and biochemical indicators (Ahmadi et al., 2017). HIS is an ideal data source to support risk assessment and the development of CVD prediction models with machine learning algorithms (Goldschmidt, 2005). Long short-term memory (LSTM) is a recurrent neural network (RNN) that is suitable for processing and predicting important events with relatively long intervals and delays in time series. However, in the medical context, the time interval between a patient's successive hospitalizations varies, and the traditional LSTM cannot effectively learn the important characteristics of a patient's medical condition, which limits the practical application of LSTM to medical problems.
In this paper, we improve the LSTM by smoothing the irregular time between different medical stages of the patient to obtain the temporal feature vector. The temporal feature vector is used as an input of the forget gate, which can effectively deal with the irregular time interval between the multi-period data and improve the predictive performance of the model.

Literature review
Machine learning algorithms have been used in various fields including disease prediction. Using logistic regression, Zhou et al. constructed a risk score model for type 2 diabetes in middle-aged male populations in rural China (Zhou et al., 2017). Lin et al. considered the problem of co-occurring diseases and constructed a Bayesian multi-task learning model for chronic diseases and their corresponding complications (Lin et al., 2017). To improve the accuracy of prediction, Long et al. proposed a hybrid heart disease prediction method, which combines rough set theory, a clustering algorithm, a genetic algorithm, naive Bayes and support vector machines, and the model showed obvious advantages over baseline models (Long et al., 2015). Based on electronic medical record data, Ye et al. used the XGBoost algorithm to predict patients' risk of hypertension (Ye et al., 2018).
However, existing models usually take a single period of sample data as input and ignore the time-series characteristics of clinical medical data, which matter especially for chronic diseases. Therefore, many studies have begun to add time-series features to static prediction models of chronic disease in order to construct dynamic prediction models. Marini et al. used a dynamic Bayesian model to simulate the long-term disease state of type 1 diabetes. The model can dynamically simulate the development of type 1 diabetes and predict future status (Marini et al., 2015). Bueno et al. proposed using a dynamic Bayesian network to model a patient's data over multiple periods and study the potential physiological changes that may occur after the patient receives drug treatment (Bueno et al., 2016). Jackson and Sharples constructed a three-stage hidden Markov model (HMM) to characterize and predict chronic rejection after six months of lung transplantation (Jackson and Sharples, 2002). Forkan and Khalil combined the HMM with neural network algorithms to learn and construct the probability of future disease in chronic diseases for elderly people living alone (Forkan and Khalil, 2017). However, both the dynamic Bayesian model and the HMM assume that the time interval between successive observations is fixed, and their computational complexity increases rapidly as the number of variables increases, which limits their ability to learn complex data.
RNN is a kind of neural network used to process sequential data (Graves et al., 2013b). The network memorizes previous information and applies it to the calculation of the current output; that is, the hidden-layer nodes of successive time steps are also connected. Specifically, the input of the hidden layer at time step t includes the input of the input layer at time step t and the output of the hidden layer at time step t-1 (Graves, 2013a). However, when learning long sequences, an RNN may suffer from vanishing gradients. In light of this, an improved version of RNN named LSTM was proposed, which addresses the vanishing gradient problem in learning long-term dependencies by introducing gate structures such as the forget gate (Graves, 1997).
However, the existing LSTM model also assumes a fixed time interval between different time slices, which limits its practical application to medical problems. In view of the above problems, this paper improves the internal structure of the LSTM unit by parameterizing the time interval between time slices, thus capturing important information about the influence of the time interval on the development of disease.

Model framework
This paper investigates how to predict the diagnosis result at time step $T$ ($Y_T$), given a patient's records from time step 1 to time step $T$ ($X_1, X_2, \ldots, X_T$). The number of records per patient and the time intervals between consecutive samples $X_{t-1}$, $X_t$ and $X_{t+1}$ may differ.
To use LSTM to process sequence data with irregular time intervals, we first adapt the gate structure of the LSTM unit to learn the temporal characteristics associated with CVD evolution at different time intervals. After that, we apply the target repeat prediction method to the output of the hidden layer at each time step, which simplifies model training on time series of different lengths. Finally, for the output layer of the model, the Sigmoid function is introduced as the activation function of the multi-label output, so that the patient's multiple diagnostic labels are predicted as output. The overall structure of the model is shown in Figure 1.

Introduction to long short-term memory
A common LSTM unit is composed of a cell, an input gate, an output gate and a forget gate. The cell remembers values over arbitrary time intervals, and the three gates regulate the flow of information into and out of the cell.
Figure 2 shows the structure of a traditional LSTM cell and illustrates the operations of the gates. There are three gates (input, forget and output) in the basic cell of LSTM, and each gate has a sigmoid activation function and a point-wise multiplication operation. In the defining equations of the basic cell, $f_t$ denotes the output of the forget gate at time step $t$ and $\sigma$ is the logistic sigmoid function; $i_t$ and $o_t$ denote the outputs of the input gate and the output gate, respectively.
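For reference, the three gates can be written in their standard form as follows. This is a sketch that reuses the symbols of this section, with $x_t$ the input and $h_{t-1}$ the previous hidden state; the concatenated-input convention $[h_{t-1}, x_t]$ is an assumption, since other equivalent parameterizations exist.

```latex
\begin{align}
f_t &= \sigma\left(W_f\,[h_{t-1}, x_t] + b_f\right) \\
i_t &= \sigma\left(W_i\,[h_{t-1}, x_t] + b_i\right) \\
o_t &= \sigma\left(W_o\,[h_{t-1}, x_t] + b_o\right)
\end{align}
```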

Improved long short-term memory
In the medical setting, patients with chronic diseases return to the hospital as the disease develops, for example through deterioration or recurrence. However, different patients may have different time intervals between hospitalizations due to their physical condition and other factors, and the difference may range from less than one month to several years. The irregularity of these time intervals poses difficulties and challenges for the study of clinical time series data.
To solve the problem of irregular time intervals, we propose to smooth the time interval to obtain a time parameter vector and use it as an input of the LSTM forget gate. The improved LSTM cell is shown in Figure 3. The forward propagation process of the improved LSTM network is described below.
The first step in the forward propagation of the LSTM network is the calculation of the forget gate. This gate determines which of the input information will be forgotten and will therefore not affect future time steps. In detail, the time interval between time step t-1 and time step t is smoothed to obtain a three-dimensional vector, and this time vector is used as an input parameter of the forget gate, as shown in equation (4).

In equation (4), the term $P_f\,p_{\Delta_{t-1:t}}$ contains $p_{\Delta_{t-1:t}}$, the vector obtained by smoothing the time interval between time slices; the smoothing formula is given in equation (5). In equation (5), $\Delta_{t-1:t}$ represents the time interval in days. Because patients are rarely rehospitalized within the same month, we choose two months as the first denominator, followed by half a year and one year, which keeps the vector $p_{\Delta_{t-1:t}}$ within a reasonable range. $P_f$ is the connection weight parameter for the time interval vector; it is optimized during training to capture the memory effect produced by the irregular time interval.
The second step of forward propagation determines what information is saved in the cell state. First, a temporary state is generated, and then the old cell state is updated, as shown in equations (6) and (7),
where $W_C$ and $b_C$ are the connection weight and bias of the temporary state, $\tilde{C}_t$ is the temporary state containing new candidate values, $C_{t-1}$ is the cell state of the previous time step and $C_t$ is the updated state at time step $t$. The third step of forward propagation determines the final network output, as shown in equation (8),
where $h_t$ is the current hidden state; $h_t$ and $C_t$ are used as inputs for the next time step.
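As a hedged sketch of equations (4) and (5): the forget gate gains one additive term driven by the smoothed interval, and the interval in days is divided by roughly two months, half a year and one year. The day counts 60, 180 and 365, the purely linear scaling and the concatenated-input form $[h_{t-1}, x_t]$ are assumptions made here for illustration; the text states only the three time scales and the role of $P_f$.

```latex
\begin{align}
f_t &= \sigma\left(W_f\,[h_{t-1}, x_t] + P_f\,p_{\Delta_{t-1:t}} + b_f\right) \tag{4} \\
p_{\Delta_{t-1:t}} &= \left(\frac{\Delta_{t-1:t}}{60},\; \frac{\Delta_{t-1:t}}{180},\; \frac{\Delta_{t-1:t}}{365}\right) \tag{5}
\end{align}
```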
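The three forward-propagation steps can be sketched in plain Python as below. This is an illustration under stated assumptions, not the authors' implementation: the divisors 60, 180 and 365 in `smooth_interval`, the `tanh` activations for the candidate state and the output, the concatenated input and all function and parameter names are hypothetical choices that follow the standard LSTM cell.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, v, b):
    """Compute W @ v + b, with W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + b_k for row, b_k in zip(W, b)]


def smooth_interval(days):
    """Assumed smoothing: scale the gap by ~2 months, half a year, one year."""
    return [days / 60.0, days / 180.0, days / 365.0]


def tlstm_step(x_t, h_prev, c_prev, days, params):
    """One forward step of the time-aware cell; returns (h_t, C_t)."""
    z = h_prev + x_t  # list concatenation, i.e. [h_{t-1}, x_t]
    zeros = [0.0] * len(h_prev)
    # Step 1: forget gate with the extra time term P_f @ p.
    time_term = affine(params["P_f"], smooth_interval(days), zeros)
    pre_f = affine(params["W_f"], z, params["b_f"])
    f = [sigmoid(a + t) for a, t in zip(pre_f, time_term)]
    # Step 2: input gate, candidate state and cell-state update.
    i = [sigmoid(a) for a in affine(params["W_i"], z, params["b_i"])]
    c_tilde = [math.tanh(a) for a in affine(params["W_C"], z, params["b_C"])]
    c_t = [f_k * c_k + i_k * n_k for f_k, c_k, i_k, n_k in zip(f, c_prev, i, c_tilde)]
    # Step 3: output gate and hidden state.
    o = [sigmoid(a) for a in affine(params["W_o"], z, params["b_o"])]
    h_t = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c_t)]
    return h_t, c_t


# Toy parameters (hypothetical values): 3 input features, 2 hidden units.
H, D = 2, 3
params = {name: [[0.1] * (H + D) for _ in range(H)] for name in ("W_f", "W_i", "W_o", "W_C")}
params.update({name: [0.0] * H for name in ("b_f", "b_i", "b_o", "b_C")})
params["P_f"] = [[-0.5, -0.5, -0.5] for _ in range(H)]

x = [1.0, 0.0, 0.5]
h_short, c_short = tlstm_step(x, [0.0] * H, [1.0] * H, days=30, params=params)
h_long, c_long = tlstm_step(x, [0.0] * H, [1.0] * H, days=720, params=params)
```

With the negative toy weights chosen for `P_f`, the longer gap produces a smaller forget gate, so less of the previous cell state is carried over. In a trained model the sign and size of this effect would be learned.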

Target repeat prediction
When constructing a traditional LSTM network model, generally only the output prediction of the last time step is given, and the error of the entire network is calculated from it to update the network weights. However, when a sample has, or is truncated into, a short time series, the prediction performance can worsen. To solve this problem, this paper adopts the target repeat prediction method. For the hidden-layer output of each time step, the prediction probability of the diagnosis is calculated by the Sigmoid activation function, and the prediction loss of each time step is obtained by comparing it with the true classification label. Finally, we use a weighted sum of the prediction loss of all time slices and the prediction loss of the last time slice as the loss function of the entire model to update its parameters.
The loss function for a single time step is given by equation (9). In equation (9), $\hat{y}$ represents the disease classification probability vector calculated by the Sigmoid function in a single time slice, and $\hat{y}_i$ represents the output probability of the $i$-th disease diagnosis. $y$ indicates the actual class label of the current sample, and $y_i$ indicates the classification label of the $i$-th disease diagnosis, taking the value 0 or 1. $C$ represents the dimension of the classification label vector. The overall loss function for the entire model is given by equation (10). In equation (10), $y^{(t)}$ is the true classification label of time slice $t$, and $\hat{y}^{(t)}$ is the corresponding predicted probability vector. $\alpha$ is a hyperparameter of the model that balances the summed prediction loss of all time slices against the prediction loss of the last time slice in the overall loss.
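The loss described in this section can be sketched as follows. Several details are assumptions: the per-step loss is taken to be binary cross-entropy averaged over the $C$ labels, the losses of all time slices are averaged rather than summed, and `alpha` is taken to weight the last time slice (the opposite convention is equally plausible). The function names are hypothetical.

```python
import math


def step_loss(y_hat, y, eps=1e-12):
    """Binary cross-entropy averaged over the C labels of one time slice."""
    total = 0.0
    for p, t in zip(y_hat, y):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return -total / len(y)


def target_repeat_loss(y_hats, ys, alpha):
    """(1 - alpha) * mean loss over all slices + alpha * loss of the last slice.

    Assumption: alpha weights the last time slice.
    """
    per_step = [step_loss(p, t) for p, t in zip(y_hats, ys)]
    mean_all = sum(per_step) / len(per_step)
    return (1.0 - alpha) * mean_all + alpha * per_step[-1]


# Two time slices, three labels each (toy values).
predictions = [[0.6, 0.2, 0.7], [0.9, 0.1, 0.8]]
labels = [[1, 0, 1], [1, 0, 1]]
loss = target_repeat_loss(predictions, labels, alpha=0.5)
```

Because every time slice contributes a loss term, short or truncated sequences still provide a training signal at each step instead of only at the final one.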

Multi-label classification
In actual medical practice, doctors make a disease diagnosis based on a patient's laboratory indicators. Patients may have multiple diseases at the same time, such as coronary heart disease and type 2 diabetes. Thus, we define the disease diagnosis task as a multi-label classification task. This paper proposes a prediction model for multi-label classification, whereas traditional models normally handle the single-label classification problem.
In the selection of the classification labels, in addition to CVD, diseases that may cause CVD and diseases that may be caused by CVD are also included (Jonnagaddala et al., 2015), such as hyperlipidemia and diabetes; these can be divided into eight categories ($c_1, c_2, \ldots, c_8$). The diagnosis output for each sample is represented as an eight-dimensional vector with Boolean values. The $i$-th dimension of the vector is 1 if the diagnosis belongs to $c_i$ and 0 otherwise.
Compared with the Softmax function, the elements of the output probability vector of the Sigmoid function are not constrained to sum to 1, which makes it more suitable for multi-label classification problems. Therefore, we use Sigmoid as the activation function of our model. In existing multi-label classification studies, the classification result consists of the $k$ elements with the highest values in the output probability vector, and the value of $k$ is determined according to the actual problem (Tsoumakas et al., 2007). In this paper, the average number of labels over all samples is about 3, so $k$ is set to 3.
Disease diagnoses use International Classification of Diseases 10th Revision (ICD-10) coding, and the test and inspection items use system-defined codes, so each can be uniquely identified.
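The top-k decoding rule described in this section can be sketched as follows: given the Sigmoid output probabilities for the eight labels, mark the $k = 3$ largest as predicted diagnoses. The function name and the toy probabilities are hypothetical.

```python
def top_k_labels(probs, k=3):
    """Return a 0/1 vector marking the k largest probabilities."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = set(ranked[:k])
    return [1 if i in chosen else 0 for i in range(len(probs))]


# Toy Sigmoid outputs for the eight diagnosis categories c_1 .. c_8.
probabilities = [0.9, 0.1, 0.8, 0.3, 0.7, 0.2, 0.05, 0.4]
predicted = top_k_labels(probabilities, k=3)
```

A fixed k always predicts exactly three labels per sample, which matches the reported average label count but cannot represent patients with fewer or more diagnoses.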

Data preprocess
To make the data meet the requirements and specifications, we only keep the records of patients whose "patient ID" and "hospital ID" are non-empty, whose number of hospitalizations is more than twice, whose "discharge method" is "normal", whose "age" is 18 or older, whose "admission time" and "discharge time" are valid, and whose diagnostic records include cardiovascular or related diseases.
After preprocessing, we obtained 12,545 hospital records generated by 3,805 patients, collected from 15 March 1999 to 7 July 2010, and calculated the length of the time series for each sample. As shown in Figure 4, the sample lengths are mostly concentrated between 2 and 5. Therefore, in the subsequent model training process, we set the maximum length of all samples to 5. Samples shorter than 5 are padded with zeros, and samples longer than 5 are truncated.
The data set consists of 2,176 males and 1,629 females. The age distribution is shown in Table III. Missing values of continuous variables are filled with the mean, and missing values of discrete variables are filled with the most frequent value.
To meet the input requirements of LSTM, we encode categorical features using one-hot encoding. For the test items, the mean, maximum and minimum values of the sequence data are extracted to achieve feature extraction and dimensionality reduction.
Different variables have different value ranges and units, and these differences have a great impact on the weight learning process of the model. In classification and clustering algorithms, z-score normalization is commonly used and generally achieves better results. Therefore, this paper uses the z-score standardization method to preprocess the input data.
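The fixed-length step can be sketched as below. The text does not say which end is truncated or where the zeros go, so this sketch assumes the most recent visits are kept and zero vectors are appended; it also returns the true length so that padded steps could be masked later. The function name is hypothetical.

```python
def pad_or_truncate(visits, max_len=5):
    """Return (fixed_length_visits, true_length) for one patient.

    Assumptions: keep the most recent max_len visits, append zero vectors.
    """
    n_features = len(visits[0])
    kept = visits[-max_len:]
    padding = [[0.0] * n_features for _ in range(max_len - len(kept))]
    return kept + padding, len(kept)


# A patient with two visits and two features per visit (toy values).
short_patient = [[1.0, 2.0], [3.0, 4.0]]
padded, true_len = pad_or_truncate(short_patient)
```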
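A minimal sketch of these two encodings, assuming each categorical feature has a known list of categories and each test item yields a list of numeric readings within one hospitalization. The function names and sample values are hypothetical.

```python
def one_hot(value, categories):
    """Encode a categorical value as a 0/1 vector over the given categories."""
    return [1 if value == c else 0 for c in categories]


def summarize(readings):
    """Reduce a sequence of readings to [mean, maximum, minimum]."""
    return [sum(readings) / len(readings), max(readings), min(readings)]


sex_vector = one_hot("female", ["male", "female"])
test_features = summarize([1.0, 3.0, 2.0])
```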
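For a single continuous variable, mean imputation followed by z-score standardization can be sketched as follows. Representing missing values as `None` and using the population standard deviation are assumptions of this sketch, as is the function name.

```python
import math


def zscore_column(values):
    """Fill missing entries (None) with the mean, then standardize."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    filled = [mean if v is None else v for v in values]
    variance = sum((v - mean) ** 2 for v in filled) / len(filled)
    std = math.sqrt(variance)
    if std == 0.0:
        return [0.0 for _ in filled]  # constant column carries no information
    return [(v - mean) / std for v in filled]


standardized = zscore_column([1.0, None, 3.0])
```

In practice the mean and standard deviation would be computed on the training folds only and reused on the held-out fold, to avoid leaking test statistics into training.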

Performance evaluation metrics
This paper focuses on the classification of disease diagnoses and on the classification performance for the diagnosis of different diseases. Therefore, $Precision_{micro}$, $Recall_{micro}$ and $F1_{micro}$ are selected as evaluation metrics. These three indicators are adapted from the corresponding single-label classification metrics by aggregating the counts over all labels. In addition, the AUC indicator, which is the area under the ROC curve, is often used to evaluate classifier performance. Therefore, in the multi-label classification problem of this paper, we use micro AUC as one of the model evaluation indicators.
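The micro-averaged metrics can be sketched as follows, using the usual definitions: true positives, false positives and false negatives are pooled over all samples and labels before precision, recall and F1 are computed. The function name and the toy labels are hypothetical.

```python
def micro_scores(y_true, y_pred):
    """Return (precision, recall, f1), micro-averaged over all labels."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Two samples with three labels each (toy values).
truth = [[1, 0, 1], [0, 1, 0]]
guess = [[1, 1, 0], [0, 1, 0]]
precision, recall, f1 = micro_scores(truth, guess)
```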

Experimental result
The LSTM learns the characteristics of the data set from the training set and predicts the classification labels of new samples, so the hyperparameters of the LSTM model need to be set. The proposed improved LSTM model is denoted T-LSTM-TR. We train the model and tune its parameters using 10-fold cross-validation.
The hyperparameters that need to be adjusted and optimized during training include the number of hidden-layer neurons $H$, the loss weight $\alpha$ of the last time slice and the dropout parameter. The model is trained with different parameter sets separately, and the test results are then compared. Finally, the optimal parameters of the T-LSTM-TR model are set to $H = 120$, $\alpha = 0.5$ and Dropout = 0.4. This paper selects the traditional LSTM model as the benchmark model for performance comparison. As shown in Table IV, the T-LSTM-TR model proposed in this paper performs similarly to the LSTM model in terms of precision, while it is significantly superior to the traditional LSTM model in terms of the other indicators. The results show that the classification performance of our model is effectively improved by adapting the internal structure of the traditional LSTM unit. The ROC curves in Figure 5 make the comparison between T-LSTM-TR and LSTM clearer.
For processing the hidden-layer features of all time slices, average pooling is an alternative method, after which the output prediction can be obtained using the Sigmoid function. The effectiveness of the proposed target repeat prediction method is validated against this alternative.

Conclusion
Based on the traditional LSTM, this paper proposed a new model by improving the input of the internal forget gate. First, the irregular time interval is smoothed to obtain the time parameter vector, and then this vector is used as an input of the forget gate to overcome the prediction obstacle caused by the irregular time interval. The experimental results show that the dynamic prediction model proposed in this paper significantly improves classification performance compared with the traditional LSTM model, which verifies the effectiveness of the proposed model. Some limitations remain for future studies. First, this paper assumes that the diagnostic labels of the samples are independent of each other, whereas in fact many diseases are correlated to varying degrees. Second, due to the limited data size, although the model of this paper improves significantly on existing models in the performance evaluation indicators, it still needs further improvement to meet the requirements of practical applications.
In the LSTM cell equations, $x_t$ and $h_{t-1}$ are the input and the previous hidden state, respectively. $W_f$, $W_i$ and $W_o$ are weight matrices and $b_f$, $b_i$ and $b_o$ are bias vectors, all of which are learned.
Figure 2. Traditional LSTM
Figure 3. Improved LSTM cell
Figure 4. Sample sequence length histogram
We compared the performance of the average pooling process with that of the target repeat prediction method. The average pooled model is denoted T-LSTM-MP, and the comparison results are shown in Table V. As shown in Table V, the T-LSTM-TR model obtained higher results than the T-LSTM-MP model on all four indicators, indicating that the target repeat prediction method performs better than the average pooling method. The ROC curves in Figure 5 make the comparison between T-LSTM-TR and T-LSTM-MP clearer.
Figure 5. ROC curve of T-LSTM-TR and LSTM

Data description
In the study, we used data collected from the HIS of a hospital. The data set contains age, sex, 23 test indicators and nine disease diagnosis labels. The specific test indicators we used are shown in Table I. The disease diagnosis labels are shown in Table II. All patient information recorded during hospitalization is identified by the patient ID and hospital ID. Disease diagnoses are coded with the International Classification of Diseases, 10th Revision (ICD-10).

Table I. Test indicators

Table III. Sample age distribution