Independent component analysis: An introduction

Independent component analysis (ICA) is a widely used blind source separation technique that has been applied in many fields. However, ICA is usually used as a black box, without an understanding of its internal details. Therefore, this paper presents the basics of ICA and shows how it works, to serve as a comprehensive source for researchers who are interested in this field. The paper starts by introducing the definition and underlying principles of ICA. Numerical examples are then worked through step by step to explain the preprocessing steps of ICA and the mixing and unmixing processes. Finally, different ICA algorithms, challenges, and applications are presented.


Introduction
Measurements cannot be isolated from noise, which has a great impact on measured signals. For example, the recorded sound of a person in a street also contains the sounds of footsteps, pedestrians, etc. Hence, it is difficult to record a clean measurement, because (1) source signals are always corrupted with noise, and (2) other independent signals (e.g. car sounds) are generated from different sources [31]. Thus, a measurement can be defined as a combination of many independent sources. The topic of separating these mixed signals is called blind source separation (BSS). The term blind indicates that the source signals can be separated even if little information is known about them.
One of the most widely used examples of BSS is separating the voice signals of people speaking at the same time; this is called the cocktail party problem [31]. The independent component analysis (ICA) technique is one of the most well-known algorithms for solving this problem [23]. The goal of this problem is to detect or extract the sound of a single object even though different sounds in the environment are superimposed on one another [31]. Figure 1 shows an example of the cocktail party problem. In this example, two voice signals are recorded from two different individuals, i.e., two independent source signals.

Moreover, two sensors, i.e., microphones, are used for recording the two signals, and the outputs of these sensors are two mixtures. The goal is to extract the original signals from the mixtures of signals. This problem can be solved using the independent component analysis (ICA) technique [23]. ICA was first introduced in the 1980s by J. Hérault, C. Jutten and B. Ans, who proposed an iterative real-time algorithm [15]. However, that paper presented no theoretical explanation, and the proposed algorithm was not applicable in a number of cases. The ICA technique remained mostly unknown until 1994, when the name ICA appeared and was introduced as a new concept [9]. The aim of ICA is to extract useful information or source signals from data (a set of measured mixture signals). These data can be in the form of images, stock markets, or sounds. Hence, ICA has been used for extracting source signals in many applications such as medical signals [7,34], biological assays [3], and audio signals [2]. ICA is also considered a dimensionality reduction algorithm when it is used to delete or retain a single source. This is also called a filtering operation, where some signals can be filtered or removed [31].
ICA is considered an extension of the principal component analysis (PCA) technique [9,33]. However, PCA optimizes the covariance matrix of the data, which represents second-order statistics, while ICA optimizes higher-order statistics such as kurtosis. Hence, PCA finds uncorrelated components while ICA finds independent components [21,33]. As a consequence, PCA can extract independent sources only when the higher-order correlations of the mixture data are small or insignificant [21].
ICA has many algorithms such as FastICA [18], projection pursuit [21], and Infomax [21]. The main goal of these algorithms is to extract independent components by (1) maximizing the non-Gaussianity, (2) minimizing the mutual information, or (3) using maximum likelihood (ML) estimation method [20]. However, ICA suffers from a number of problems such as overcomplete ICA and under-complete ICA.
Many studies treat the ICA technique as a black box without understanding its internal details. In this paper, the basic definitions of ICA and how to use ICA for extracting independent signals are introduced in a step-by-step approach. This paper is divided into eight sections. In Section 2, an overview of the definition and main idea of ICA and its background is given. This section begins by explaining, with illustrative numerical examples, how signals are mixed to form mixture signals, and then the unmixing process is presented. Section 3 introduces, with visualized steps and numerical examples, two preprocessing steps of ICA, which greatly help in extracting source signals. Section 4 presents the principles of how ICA extracts independent signals using different approaches such as maximizing the likelihood, maximizing the non-Gaussianity, or minimizing the mutual information. This section explains mathematically the steps of each approach. Different ICA algorithms are highlighted in Section 5. Section 6 lists some applications that use ICA for recovering independent sources from a set of sensed signals that result from mixing a set of source signals. In Section 7, the most common problems of ICA are explained. Finally, concluding remarks are given in Section 8.

Figure 1. Example of the cocktail party problem. Two source signals (e.g. sound signals) are generated from two individuals and then recorded by two sensors, e.g., microphones. The two microphones record linear mixtures of the two source signals. The goal of this problem is to recover the original signals from the mixed signals.

ICA background

2.1 Mixing signals
Each signal varies over time, and a signal is represented as $s_i = \{s_{i1}, s_{i2}, \ldots, s_{iN}\}$, where $N$ is the number of time steps and $s_{ij}$ represents the amplitude of the signal $s_i$ at the $j$th time step. Given two independent source signals $s_1 = \{s_{11}, s_{12}, \ldots, s_{1N}\}$ and $s_2 = \{s_{21}, s_{22}, \ldots, s_{2N}\}$ (see Figure 2), both signals can be represented as follows:

$$S = \begin{pmatrix} s_1 \\ s_2 \end{pmatrix} = \begin{pmatrix} s_{11} & s_{12} & \ldots & s_{1N} \\ s_{21} & s_{22} & \ldots & s_{2N} \end{pmatrix}$$

where $S \in \mathbb{R}^{p \times N}$ represents the space that is defined by the source signals and $p$ indicates the number of source signals. The source signals ($s_1$ and $s_2$) can be mixed as follows,

$$x_1 = a s_1 + b s_2$$

where $a$ and $b$ are the mixing coefficients and $x_1$ is the first mixture signal. Thus, the mixture $x_1$ is the weighted sum of the two source signals ($s_1$ and $s_2$). Similarly, another mixture ($x_2$) can be measured by changing the distance between the source signals and the sensing device, e.g. a microphone, and it is calculated as follows,

$$x_2 = c s_1 + d s_2$$

where $c$ and $d$ are mixing coefficients. The two mixing coefficients $a$ and $b$ are different from the coefficients $c$ and $d$ because the two sensing devices which are used for sensing these signals are in different locations, so that each sensor measures a different mixture of the source signals. As a consequence, each source signal has a different impact on the output signals. The two mixtures can be represented as follows:

$$X = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} a s_1 + b s_2 \\ c s_1 + d s_2 \end{pmatrix} = \begin{pmatrix} a & b \\ c & d \end{pmatrix} \begin{pmatrix} s_1 \\ s_2 \end{pmatrix} = AS$$

where $X \in \mathbb{R}^{n \times N}$ is the space that is defined by the mixture signals and $n$ is the number of mixtures. Therefore, the mixing coefficients ($a$, $b$, $c$, and $d$) are used for linearly transforming source signals from the $S$ space to mixed signals in the $X$ space as follows, $S \to X : X = AS$, where $A \in \mathbb{R}^{n \times p}$ is the mixing coefficients matrix (see Figure 2) and it is defined as:

$$A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}$$

2.1.1 Illustrative example. The goal of this example is to show the properties of source and mixture signals. Given two source signals $s_1 = \sin(a)$ and $s_2 = r - 0.5$, where $a$ is in the range of [1,30] with time step 0.05 and $r$ indicates a random number in the range of [0,1]. Figure 3 shows the source signals, histograms, and scatter diagram of both signals. As shown, the two source signals are independent and their histograms are not Gaussian. The scatter diagram in Figure 3(e) shows how the two source signals are independent, where each point represents the amplitude of both source signals. Figure 4 shows the mixture signals with their histograms and scatter diagram. As shown, the histograms of both mixture signals are approximately Gaussian, and the mixtures are not independent. Moreover, the mixture signals are more complex than the source signals. From this example, it is remarked that the mixed signals have the following properties:
1. Independence: if the source signals are independent (as in Figure 3(a and b)), their mixture signals are not (see Figure 4(a and b)). This is because the source signals are shared between both mixtures.
2. Normality: the histograms of the mixture signals are approximately bell-shaped, i.e., Gaussian or normal. This property can be used for searching for non-Gaussian signals within mixture signals to extract source or independent signals. In other words, the source signals must be non-Gaussian, and this assumption is a fundamental restriction in ICA. Hence, the ICA model cannot estimate Gaussian independent components.
3. Complexity: It is clear from the previous example that mixed signals are more complex than source signals.
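The first two properties can be checked numerically. The following is a minimal sketch, not the code behind Figures 3 and 4: the mixing matrix `A`, the random seed, and the helper `excess_kurtosis` are assumptions introduced here for illustration. It builds the two sources described above, mixes them, and compares the correlation and the normalized kurtosis (which is zero for a Gaussian) before and after mixing.

```python
import numpy as np

def excess_kurtosis(x):
    """Normalized kurtosis E[(x - m)^4] / var^2 - 3, which is zero for a Gaussian."""
    d = np.asarray(x, dtype=float) - np.mean(x)
    return np.mean(d**4) / np.mean(d**2) ** 2 - 3.0

rng = np.random.default_rng(0)           # fixed seed, chosen only for repeatability
a = np.arange(1.0, 30.0 + 1e-9, 0.05)    # a in [1, 30] with time step 0.05
s1 = np.sin(a)                           # first source: sin(a)
s2 = rng.random(a.size) - 0.5            # second source: r - 0.5, r uniform in [0, 1)
S = np.vstack([s1, s2])

# Hypothetical mixing matrix (an assumption; the text does not give one here).
A = np.array([[1.0, 2.0],
              [1.0, -1.0]])
X = A @ S                                # two mixture signals, one per row

corr_sources = np.corrcoef(S)[0, 1]      # close to zero for independent sources
corr_mixtures = np.corrcoef(X)[0, 1]     # clearly non-zero: mixtures share sources
kurt_sources = [excess_kurtosis(s) for s in S]
kurt_mixtures = [excess_kurtosis(x) for x in X]

print("correlation  sources:", corr_sources, " mixtures:", corr_mixtures)
print("kurtosis     sources:", kurt_sources, " mixtures:", kurt_mixtures)
```

Zero correlation is only a necessary condition for independence, so the first check illustrates the loss of independence after mixing rather than proving independence of the sources.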
From these properties we can conclude that if the signals extracted from the mixture signals are independent, have non-Gaussian histograms, or have lower complexity than the mixture signals, then these signals represent source signals.

2.1.2 Numerical example: Mixing signals. The goal of this example is to explain how source signals are mixed to form mixture signals. Figure 5 shows two source signals $s_1$ and $s_2$ which form the space $S$. The two axes of the $S$ space ($s_1$ and $s_2$) represent the x-axis and y-axis, respectively. Additionally, the vector with coordinates $(1\ 0)^T$ lies on the axis $s_1$ in $S$; hence, for simplicity, the symbol $s_1$ refers to this vector, and similarly $s_2$ refers to the vector with coordinates $(0\ 1)^T$. During the mixing process, the matrix $A$ transforms $s_1$ and $s_2$ in the $S$ space to $s'_1$ and $s'_2$ in the $X$ space, respectively:

$$s'_1 = A s_1 = (a\ c)^T, \qquad s'_2 = A s_2 = (b\ d)^T$$

In our example, assume that the mixing matrix is as follows,

$$A = \begin{pmatrix} 1 & 2 \\ 1 & -1 \end{pmatrix}$$

Given two source signals $s_1 = (1\ 2\ 1\ 2)$ and $s_2 = (1\ 1\ 2\ 2)$. These two signals can be represented by four points which are plotted in the $S$ space in black (see Figure 5). The coordinates of these points are as follows:

$$S = \begin{pmatrix} 1 & 2 & 1 & 2 \\ 1 & 1 & 2 & 2 \end{pmatrix}$$

The new axes in the $X$ space ($s'_1$ and $s'_2$) are plotted in solid red and blue, respectively (see Figure 5), and they can be calculated as follows:

$$s'_1 = A s_1 = \begin{pmatrix} 1 & 2 \\ 1 & -1 \end{pmatrix} \begin{pmatrix} 1 \\ 0 \end{pmatrix} = \begin{pmatrix} 1 \\ 1 \end{pmatrix}, \qquad s'_2 = A s_2 = \begin{pmatrix} 1 & 2 \\ 1 & -1 \end{pmatrix} \begin{pmatrix} 0 \\ 1 \end{pmatrix} = \begin{pmatrix} 2 \\ -1 \end{pmatrix}$$

The four points are transformed into the $X$ space; these points are plotted in red in Figure 5, and the values of these new points are

$$X = AS = \begin{pmatrix} 1 & 2 \\ 1 & -1 \end{pmatrix} \begin{pmatrix} 1 & 2 & 1 & 2 \\ 1 & 1 & 2 & 2 \end{pmatrix} = \begin{pmatrix} 3 & 4 & 5 & 6 \\ 0 & 1 & -1 & 0 \end{pmatrix}$$

Figure 5. The mixing process: the axes $s_1$ and $s_2$ of the source space $S$ are transformed to ($s'_1$ and $s'_2$) in the mixture space $X$. The two source signals can be represented by four points (in black) in the $S$ space. These points are also transformed using the mixing matrix $A$ into four different points (in red) in the $X$ space. Additionally, the vectors $w_1$ and $w_2$ are used to extract the source signals $s_1$ and $s_2$, and they are plotted in dotted red and blue lines, respectively. $w_1$ and $w_2$ are orthogonal to $s'_2$ and $s'_1$, respectively.
Assume that the second source $s_2$ is silent/OFF; hence, the sensors record only the signal that is generated from $s_1$ (see Figure 6(a)). The mixed signals lie along $s'_1 = (a\ c)^T$, and the distribution of the samples projected onto $s'_1$ is depicted in Figure 6(a). Similarly, Figure 6(b) shows the projection onto $s'_2 = (b\ d)^T$ when the first source is silent; this projection represents the mixed data. It is worth mentioning that the new axes $s'_1$ and $s'_2$ need not be orthogonal to $s_1$ and $s_2$, respectively. Figure 5 is the combination of Figure 6(a) and (b) when both source signals are played together and the sensors measure the two signals simultaneously.
A related point to consider is that the number of red points in Figure 6(a), which represent the points projected onto $s'_1$, is three, while the number of original points was four. This can be explained mathematically by calculating the coordinates of the points projected onto $s'_1$. For example, the projection of the first point $(1\ 1)^T$ is calculated as follows, $s'^T_1 (1\ 1)^T = (1\ 1)(1\ 1)^T = 2$. Similarly, the projections of the second, third, and fourth points are 3, 3, and 4, respectively. Therefore, the second and third samples were projected onto the same position on $s'_1$. This is the reason why the number of projected points is three.
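The mixing example can be reproduced in a few lines. This is a sketch under one stated assumption: the mixing matrix `A` below is inferred from the values quoted in the text ($s'_1 = (1\ 1)^T$ and the projections 2, 3, 3, 4), so it should be read as a reconstruction rather than as a quoted value.

```python
import numpy as np

# Mixing matrix (assumption: inferred from s'_1 = (1 1)^T and the quoted projections).
A = np.array([[1.0, 2.0],
              [1.0, -1.0]])

# Source signals from the example; each column is one point in the S space.
S = np.array([[1.0, 2.0, 1.0, 2.0],    # s1
              [1.0, 1.0, 2.0, 2.0]])   # s2

X = A @ S                              # the same four points in the X space

s1_axis = A @ np.array([1.0, 0.0])     # s'_1: image of the s1 axis, first column of A
s2_axis = A @ np.array([0.0, 1.0])     # s'_2: image of the s2 axis, second column of A

# Inner product of each original point with s'_1, as in the projection example.
proj = s1_axis @ S

print("X =\n", X)
print("s'_1 =", s1_axis, " s'_2 =", s2_axis)
print("projections onto s'_1:", proj)
print("distinct projected positions:", np.unique(proj).size)
```

The second and third points share the projected value 3, which is why only three distinct positions appear.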

Unmixing signals
In this section, the unmixing process for extracting source signals will be presented. Given a mixing matrix A, independent components can be estimated by inverting the linear system as in Eq. (2), but we know neither S nor A; hence, the problem is considerably more difficult. Assume that the matrix (A) is known; hence, source signals can be extracted. For simplicity, we assume that the number of sources and mixture signals are the same and hence the unmixing matrix is a square matrix.
Given two mixture signals $x_1$ and $x_2$, the aim is to extract the source signals, and this can be achieved by searching for unmixing coefficients as follows:

$$y_1 = \alpha x_1 + \beta x_2, \qquad y_2 = \gamma x_1 + \delta x_2$$

where $\alpha$, $\beta$, $\gamma$, and $\delta$ represent unmixing coefficients, which are used for transforming the mixture signals into a set of independent signals as follows, $X \to Y : Y = WX$, where $W$ is the unmixing coefficients matrix as shown in Figure 7. Simply, the first source signal, $y_1$, can be extracted from the mixtures ($x_1$ and $x_2$) using two unmixing coefficients ($\alpha$ and $\beta$). This pair of unmixing coefficients defines a point with coordinates $(\alpha, \beta)$, where $w_1 = (\alpha\ \beta)^T$ is a weight vector (see Eq. (11)). Similarly, $y_2$ can be extracted using the two unmixing coefficients $\gamma$ and $\delta$, which define the weight vector $w_2 = (\gamma\ \delta)^T$ (see Eq. (11)). $W = (w_1\ w_2)^T$ is the unmixing matrix and it represents the inverse of $A$. The unmixing process can be achieved by rotating the rows of $W$. This rotation continues until each row of $W$ ($w_1$ or $w_2$) finds the orientation which is orthogonal to the other transformed signals. For instance, in our example, $w_1$ is orthogonal to $s'_2$ (see Figure 5). The source signals are then extracted by projecting the mixture signals onto that orientation.
In practice, changing the length or orientation of the weight vectors has a great influence on the extracted signals ($Y$). This is the reason why the extracted signals may not be identical to the original source signals. The consequences of changing the length or orientation of the weight vectors are as follows:
Length: The length of the weight vector $w_1$ is $|w_1| = \sqrt{\alpha^2 + \beta^2}$. Assume that the length of $w_1$ is changed by a factor $\lambda$ as follows, $\lambda w_1 = (\lambda\alpha\ \lambda\beta)^T$. The extracted signal, or the best approximation of $s_1$, is denoted by $y_1 = w_1^T X$ and it is estimated as in Eq. (12); with the rescaled vector it becomes $(\lambda w_1)^T X = \lambda\, w_1^T X = \lambda y_1$. Hence, the extracted signal is a scaled version of the source signal, and the length of the weight vector affects only the amplitude of the extracted signal.
Orientation: As mentioned before, the source signals $s_1$ and $s_2$ in the $S$ space are transformed to $s'_1$ and $s'_2$ (see Eqs. (4) and (5)), and the weight vector $w_1$ is orthogonal to $s'_2$. The inner product of any orthogonal vectors is zero as follows, $w_1^T s'_2 = 0$, and the inner product of $w_1$ and $s'_1$ is as follows, $w_1^T s'_1 = |w_1||s'_1|\cos\theta = k$, where $\theta$ is the angle between $w_1$ and $s'_1$ as shown in Figure 5, and $k$ is a constant. The value of $k$ depends on the lengths of $w_1$ and $s'_1$ and the angle $\theta$. The extracted signal will be as follows,

$$y_1 = w_1^T X = w_1^T (s'_1 s_1 + s'_2 s_2) = (w_1^T s'_1) s_1 + (w_1^T s'_2) s_2 = k s_1$$

The extracted signal ($ks_1$) is a scaled version of the source signal ($s_1$), and $ks_1$ is extracted from $X$ by taking the inner product of all mixture signals with $w_1$, which is orthogonal to $s'_2$. Thus, it is difficult to recover the amplitude of the source signals.

Figure 7. An illustrative example of the process of extracting signals. Two source signals ($y_1$ and $y_2$) are extracted from two mixture signals ($x_1$ and $x_2$) using the unmixing matrix $W$.

Figure 8 displays the mixing and unmixing steps of ICA. As shown, the first mixture signal $x_1$ is observed using only the first row of the matrix $A$, where the first element in $x_1$ is calculated as follows, $x_{11} = a_{11}s_{11} + a_{12}s_{21} + \ldots + a_{1p}s_{p1}$. Moreover, the number of mixture signals and the number of source signals are not always the same, because the number of mixture signals depends on the number of sensors. Additionally, the dimension of $W$ does not agree with that of $X$; hence, $W$ is transposed, and the first element in the first extracted signal ($y_1$) is estimated as follows, $y_{11} = w_{11}x_{11} + w_{12}x_{21} + \ldots + w_{1n}x_{n1}$.

Figure 8. Block diagram of the ICA mixing and unmixing steps. $a_{ij}$ is the mixing coefficient for the $i$th mixture signal and $j$th source signal, and $w_{ij}$ is the unmixing coefficient for the $i$th extracted signal and $j$th mixture signal.

2.2.1 Numerical example. In the mixing example of Section 2.1.2, the vector $w_1$ is orthogonal to $s'_2$, i.e., $w_1^T s'_2 = (\frac{1}{3}\ \frac{2}{3})(2\ -1)^T = 0$, and similarly, the vector $w_2$ is orthogonal to $s'_1$ (see Figure 5). Moreover, the source signal $s_1$ is extracted as follows,

$$s_1 = w_1^T X = \left(\tfrac{1}{3}\ \tfrac{2}{3}\right) \begin{pmatrix} 3 & 4 & 5 & 6 \\ 0 & 1 & -1 & 0 \end{pmatrix} = (1\ 2\ 1\ 2)$$

and similarly, $s_2$ is extracted as follows,

$$s_2 = w_2^T X = \left(\tfrac{1}{3}\ -\tfrac{1}{3}\right) \begin{pmatrix} 3 & 4 & 5 & 6 \\ 0 & 1 & -1 & 0 \end{pmatrix} = (1\ 1\ 2\ 2)$$

Hence, the original source signals are extracted perfectly. This is because $k \approx 1$ and hence, according to Eq. (12), the extracted signal is identical to the source signal. As mentioned before, the value of $k$ is calculated as follows, $k = |w_1||s'_1|\cos\theta$, where $|w_1| = \sqrt{(\frac{1}{3})^2 + (\frac{2}{3})^2} = \sqrt{\frac{5}{9}}$ and $|s'_1| = \sqrt{1^2 + 1^2} = \sqrt{2}$. The angle between $s'_1$ and the $s_1$ axis is $45^\circ$ because $s'_1 = (1\ 1)^T$; similarly, the angle between $w_1$ and the $s_1$ axis is $\tan^{-1}(2) \approx 63^\circ$ (see Figure 5, top left corner). Therefore, $\theta \approx 63^\circ - 45^\circ \approx 18^\circ$, and hence $k = \sqrt{\frac{5}{9}}\sqrt{2}\cos 18^\circ \approx 1$. Changing the orientation of $w_1$ therefore leads to a different extracted signal.
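The unmixing example can be verified numerically. The sketch below assumes the mixing matrix is known and equal to the matrix `A` inferred from the quoted values of $s'_1$ and $w_1$ (an assumption of this illustration); the scale factor `lam` is a hypothetical value used only to show the effect of changing the length of $w_1$.

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [1.0, -1.0]])             # assumed mixing matrix (inferred, see lead-in)
S = np.array([[1.0, 2.0, 1.0, 2.0],
              [1.0, 1.0, 2.0, 2.0]])
X = A @ S                               # mixtures

W = np.linalg.inv(A)                    # unmixing matrix; rows are w_1^T and w_2^T
w1, w2 = W
Y = W @ X                               # extracted signals

s1_axis, s2_axis = A[:, 0], A[:, 1]     # s'_1 and s'_2

# k = |w1| |s'_1| cos(theta), with theta the angle between w1 and s'_1.
cos_theta = (w1 @ s1_axis) / (np.linalg.norm(w1) * np.linalg.norm(s1_axis))
theta_deg = np.degrees(np.arccos(cos_theta))
k = np.linalg.norm(w1) * np.linalg.norm(s1_axis) * cos_theta

# Changing the length of w1 by a factor lam only rescales the extracted signal.
lam = 2.5                               # hypothetical scale factor
y1_scaled = (lam * w1) @ X

print("W =\n", W)
print("w1 . s'_2 =", w1 @ s2_axis, "  w2 . s'_1 =", w2 @ s1_axis)
print("theta (deg) =", theta_deg, "  k =", k)
print("Y =\n", Y)
```

In a real ICA problem neither `A` nor `S` is available, so `W` cannot be obtained by direct inversion; the inversion here only confirms the geometry described above.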

Ambiguities of ICA
ICA has some ambiguities, such as:
The order of independent components: In ICA, the weight vector ($w_i$) is initialized randomly and then rotated to find one independent component. During the rotation, the value of $w_i$ is updated iteratively. Thus, $w_i$ extracts source signals, but not in a specific order.
The sign of independent components: Changing the sign of the independent components has no influence on the ICA model. In other words, we can multiply the weight vectors in $W$ by $-1$ without affecting the extracted signal apart from its sign. In our example, in Section 2.2.1, the value of $w_1$ was $(\frac{1}{3}\ \frac{2}{3})^T$. Multiplying $w_1$ by $-1$, i.e., $w_1 = (-\frac{1}{3}\ -\frac{2}{3})^T$, has no influence on the model because $w_1$ still lies along the same line with the same magnitude; hence, the magnitude of $k$ is not changed, and the extracted signal $s_1$ will have the same values but with a different sign, i.e., $s_1 = w_1^T X = (-1\ 0)^T$. As a result, the matrix $W$ in $n$-dimensional space has $2n$ local maxima, i.e., two local maxima for each independent component, corresponding to $s_i$ and $-s_i$ [21]. This problem is insignificant in many applications [16,19].
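Both ambiguities can be seen by applying a signed permutation to an unmixing matrix. The sketch below is illustrative only: the mixing matrix `A` is the one inferred for the running numerical example (an assumption), and the matrix `P` is a hypothetical signed permutation chosen for this demonstration.

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [1.0, -1.0]])            # assumed mixing matrix of the running example
S = np.array([[1.0, 2.0, 1.0, 2.0],
              [1.0, 1.0, 2.0, 2.0]])
X = A @ S

W = np.linalg.inv(A)                   # one valid unmixing matrix

# Hypothetical signed permutation: swap the two rows and flip one sign.
P = np.array([[0.0, -1.0],
              [1.0, 0.0]])
W_alt = P @ W                          # rows are now -w_2^T and w_1^T
Y_alt = W_alt @ X

print("Y_alt[0] (equals -s2):", Y_alt[0])
print("Y_alt[1] (equals  s1):", Y_alt[1])
```

`W_alt` separates the mixtures just as well as `W`: each output is still a single source, only reordered and sign-flipped, which is why ICA cannot fix the order or sign of the components.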

ICA: Preprocessing phase
This section explains the preprocessing steps of the ICA technique. This phase has two main steps: centering and whitening.

The centering step
The goal of this step is to center the data by subtracting the mean from all signals. Given $n$ mixture signals ($X$), the mean is $\mu$ and the centering step can be calculated as follows:

$$D = X - \mu = \begin{pmatrix} d_1 \\ d_2 \\ \vdots \\ d_n \end{pmatrix} = \begin{pmatrix} x_1 - \mu \\ x_2 - \mu \\ \vdots \\ x_n - \mu \end{pmatrix}$$

where $D$ is the mixture signals after the centering step (as in Figure 9(A)) and $\mu \in \mathbb{R}^{1 \times N}$ is the mean of all mixture signals. The mean vector can be added back to the independent components after applying ICA.

The whitening data step
This step aims to whiten the data, which means transforming the signals into uncorrelated signals and then rescaling each signal to have unit variance. This step includes two main steps as follows.
1. Decorrelation: The goal of this step is to decorrelate the signals; in other words, to make each signal uncorrelated with the others. Two random variables are considered uncorrelated if their covariance is zero. In ICA, the PCA technique can be used for decorrelating signals. In PCA, the eigenvectors which form the new PCA space are calculated. First, the covariance matrix is calculated. The covariance of any two variables ($x_i$, $x_j$) is defined as $\Sigma_{ij} = E[x_i x_j] - E[x_i]E[x_j]$. With many variables, the covariance matrix is calculated as follows, $\Sigma = E[DD^T]$, where $D$ is the centered data (see Figure 9(B)). The covariance matrix is solved by calculating the eigenvalues ($\lambda$) and eigenvectors ($V$) as follows, $V\Sigma = \lambda V$, where the eigenvectors represent the principal components, which are the directions of the PCA space, and the eigenvalues give the variance of the data along these directions. The eigenvector which has the maximum eigenvalue is the first principal component ($PC_1$) and it has the maximum variance [33]. For decorrelating the mixture signals, they are projected onto the calculated PCA space as follows, $U = VD$.

2. Scaling: The goal here is to scale each decorrelated signal to have unit variance. Hence, each signal in $U$ is rescaled to have unit variance as follows, $Z = \lambda^{-\frac{1}{2}} U = \lambda^{-\frac{1}{2}} VD$, where $Z$ is the whitened or sphered data and $\lambda^{-\frac{1}{2}} = \mathrm{diag}\{\lambda_1^{-\frac{1}{2}}, \lambda_2^{-\frac{1}{2}}, \ldots, \lambda_n^{-\frac{1}{2}}\}$. After the scaling step, the data becomes rotationally symmetric like a sphere; therefore, the whitening step is also called sphering [32].

Numerical example
Given eight mixture signals $X = \{x_1, x_2, \ldots, x_8\}$, each mixture signal is represented by one row in $X$ as in Eq. (14). The mean ($\mu$) was then calculated and its value was $\mu = (2.63\ 3.63)$.
In the centering step, the data are centered by subtracting the mean from each signal to obtain $D$. The covariance matrix of $D$ and its eigenvalues and eigenvectors are then calculated (Eq. (16)), which gives $\lambda_1 = 0.28$ and $\lambda_2 = 4.54$ with the eigenvectors $v_1 = (0.45\ -0.90)^T$ and $v_2 = (-0.90\ -0.45)^T$. From Eq. (16) it can be remarked that the two eigenvectors are orthogonal as shown in Figure 10, i.e., $v_1^T v_2 = (0.45\ -0.90)(-0.90\ -0.45)^T = 0$, where $v_1$ and $v_2$ represent the first and second eigenvectors, respectively. Moreover, the second eigenvalue ($\lambda_2$) is larger than the first one ($\lambda_1$), and $\lambda_2$ represents $\frac{4.54}{0.28 + 4.54} \approx 94.19\%$ of the total eigenvalues; thus, $v_2$ and $v_1$ represent the first and second principal components of the PCA space, respectively, and $v_2$ points in the direction of the maximum variance (see Figure 10).
The two signals are decorrelated by projecting the centered data onto the PCA space as follows, U ¼ VD.

ACI
The matrix $U$ is already centered; thus, the covariance matrix of $U$ is given by

$$E[UU^T] = \begin{pmatrix} 0.28 & 0 \\ 0 & 4.54 \end{pmatrix} \qquad (18)$$

From Eq. (18) it is remarked that the two mixture signals are decorrelated by projecting them onto the PCA space. Thus, the covariance matrix is diagonal, and the off-diagonal elements, which represent the covariance between the two mixture signals, are zero. Figure 10 shows that the contour of the two mixtures is an ellipse centered at the mean. The projection of the mixture signals onto the PCA space rotates the ellipse so that the principal components are aligned with the $x_1$ and $x_2$ axes. After the decorrelation step, the signals are rescaled to have unit variance (see Figure 10). The whitening can be calculated as follows, $Z = \lambda^{-\frac{1}{2}} VD$, and the covariance matrix of the whitened data is

$$E[ZZ^T] = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \qquad (20)$$

This means that the covariance matrix of the whitened data is the identity matrix (see Eq. (20)); in other words, the data are decorrelated and have unit variance. Figure 11 displays the scatter plot for two mixtures, where each mixture signal is represented by 500 time steps. As shown in Figure 11(a), the scatter of the original mixtures forms an ellipse centered at the origin. Projecting the mixture signals onto the PCA space rotates the principal components to be aligned with the $x_1$ and $x_2$ axes, and hence the ellipse is also rotated as shown in Figure 11(b). After the whitening step, the contour of the mixture signals forms a circle. This is because the signals have unit variance.
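The centering, decorrelation, and scaling steps can be sketched as follows. The data matrix `X` below is an assumption: it is an illustrative set of eight two-dimensional samples chosen because it reproduces the mean and the eigenvalues quoted in the text, and it is stored with one variable per row so that $U = VD$ and $Z = \lambda^{-\frac{1}{2}}VD$ can be applied directly. The sample covariance is computed with the $N-1$ normalization, which is also an assumption.

```python
import numpy as np

# Illustrative data (assumption): 2 variables x 8 samples, one column per sample.
X = np.array([[1.0, 1.0, 2.0, 0.0, 5.0, 4.0, 5.0, 3.0],
              [3.0, 2.0, 3.0, 3.0, 4.0, 5.0, 5.0, 4.0]])

# Centering: subtract the mean of each variable.
mu = X.mean(axis=1, keepdims=True)
D = X - mu

# Decorrelation: eigen-decomposition of the covariance matrix.
n = D.shape[1]
cov = D @ D.T / (n - 1)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
V = eigvecs.T                            # rows of V are the eigenvectors
U = V @ D                                # decorrelated signals

# Scaling: divide each decorrelated signal by its standard deviation.
Z = np.diag(eigvals ** -0.5) @ U         # whitened (sphered) data

cov_U = U @ U.T / (n - 1)                # diagonal, with the eigenvalues on the diagonal
cov_Z = Z @ Z.T / (n - 1)                # identity matrix

print("mean =", mu.ravel())
print("eigenvalues =", eigvals)
print("eigenvectors (rows) =\n", V)
print("cov(U) =\n", np.round(cov_U, 2))
print("cov(Z) =\n", np.round(cov_Z, 2))
```

The sign of each eigenvector returned by `np.linalg.eigh` is arbitrary, so the rows of `V` may differ from the quoted eigenvectors by a factor of $-1$ without affecting the whitening.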

Principles of ICA estimation
In ICA, the goal is to find the unmixing matrix ($W$) and then to project the whitened data onto that matrix for extracting independent signals. This matrix can be estimated using three main approaches to independence, which result in slightly different unmixing matrices. The first is based on non-Gaussianity, which can be measured by quantities such as negentropy and kurtosis; the goal of this approach is to find independent components which maximize the non-Gaussianity [25,30]. In the second approach, the ICA goal can be obtained by minimizing the mutual information [22,14]. Independent components can also be estimated by using maximum likelihood (ML) estimation [28]. All approaches simply search for a rotation or unmixing matrix $W$. Projecting the whitened data onto that rotation matrix extracts independent signals. The preprocessing steps are calculated from the data, but the rotation matrix is approximated numerically through an optimization procedure. Searching for the optimal solution is difficult because of the local minima that exist in the objective function. In this section, different approaches are introduced for extracting independent components.

Measures of non-Gaussianity
Searching for independent components can be achieved by maximizing the non-Gaussianity of the extracted signals [23]. Two measures are used for measuring the non-Gaussianity, namely, kurtosis and negative entropy.

4.1.1 Kurtosis. Kurtosis can be used as a measure of non-Gaussianity, and the extracted signal can be obtained by finding the unmixing vector which maximizes the kurtosis of the extracted signal [4]. In other words, the source signals can be extracted by finding the orientation of the weight vectors which maximizes the kurtosis.
Kurtosis is simple to calculate; however, it is sensitive to outliers. Thus, it is not robust enough for measuring the non-Gaussianity [21]. The kurtosis ($K$) of any probability density function (pdf) is defined as follows,

$$K(x) = E[x^4] - 3\left(E[x^2]\right)^2$$

where the normalized kurtosis ($\hat{K}$) is based on the ratio between the fourth and second central moments, and it is given by

$$\hat{K}(x) = \frac{E[x^4]}{\left(E[x^2]\right)^2} - 3 \qquad (22)$$

For whitened data ($Z$), $E[Z^2] = 1$ because $Z$ has unit variance. Therefore, the kurtosis will be

$$K(Z) = \hat{K}(Z) = E[Z^4] - 3 \qquad (23)$$

As reported in [20], the fourth moment for Gaussian signals is $3(E[Z^2])^2$ and hence $\hat{K}(x) = \frac{3(E[Z^2])^2}{(E[Z^2])^2} - 3 = 0$. As a consequence, Gaussian pdfs have zero kurtosis.
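As a quick numerical check of the definitions above, the sketch below estimates the normalized kurtosis of three sampled distributions. The function name, sample size, seed, and choice of distributions are assumptions made for this illustration; the Gaussian sample should give a value near zero, while the uniform (sub-Gaussian) and Laplace (super-Gaussian) samples should give clearly negative and positive values, respectively.

```python
import numpy as np

def normalized_kurtosis(x):
    """E[x^4] / (E[x^2])^2 - 3, computed on the centered sample."""
    d = np.asarray(x, dtype=float)
    d = d - d.mean()
    return np.mean(d**4) / np.mean(d**2) ** 2 - 3.0

rng = np.random.default_rng(0)          # fixed seed for repeatability
n = 200_000                             # sample size (assumption)

samples = {
    "gaussian": rng.standard_normal(n),
    "uniform": rng.uniform(-0.5, 0.5, n),
    "laplace": rng.laplace(0.0, 1.0, n),
}
kurt = {name: normalized_kurtosis(x) for name, x in samples.items()}

for name, value in kurt.items():
    print(f"{name:8s} normalized kurtosis = {value:+.3f}")
```

Because the fourth power amplifies large values, a few outliers can change the estimate noticeably, which is the sensitivity mentioned above.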

Kurtosis has an additivity property for independent signals $s_1$ and $s_2$ as follows:

$$K(s_1 + s_2) = K(s_1) + K(s_2) \qquad (24)$$

and for any scalar parameter $\alpha$,

$$K(\alpha s_1) = \alpha^4 K(s_1) \qquad (25)$$

These properties can be used for interpreting one of the ambiguities of ICA that are mentioned in Section 2.3, which is the sign of independent components. Given two source signals $s_1$ and $s_2$, and the matrix $Q = A^T W = (q_1\ q_2)^T$, the extracted signal is $Y = W^T X = W^T A S = Q^T S = q_1 s_1 + q_2 s_2$. Using the kurtosis properties in Eqs. (24) and (25), we have

$$K(Y) = K(q_1 s_1) + K(q_2 s_2) = q_1^4 K(s_1) + q_2^4 K(s_2)$$

Assume that $s_1$, $s_2$, and $Y$ have unit variance. This implies that $E[Y^2] = q_1^2 E[s_1^2] + q_2^2 E[s_2^2] = q_1^2 + q_2^2 = 1$. Geometrically, this means that $Q$ is constrained to the unit circle in the two-dimensional space. The aim of ICA is to maximize the kurtosis ($K(Y) = q_1^4 K(s_1) + q_2^4 K(s_2)$) on the unit circle. The optimal solutions, i.e., maxima, are the points where one element of $Q$ is zero and the other is nonzero; because of the unit circle constraint, the nonzero element must be 1 or $-1$ [11]. These optimal solutions are the ones which are used to extract $\pm s_i$. Generally, $Q = A^T W = I$ means that each vector in the matrix $Q$ extracts only one source signal.
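The claim that the kurtosis on the unit circle peaks where only one source contributes can be checked with a simple scan. The sketch below is illustrative: the two Laplace-distributed sources, the sample size, the seed, and the one-degree grid are all assumptions. It evaluates the kurtosis of $q_1 s_1 + q_2 s_2$ for $(q_1, q_2) = (\cos\varphi, \sin\varphi)$ and reports how far the best angle is from the nearest axis.

```python
import numpy as np

def kurtosis(y):
    """K(y) = E[y^4] - 3 (E[y^2])^2 on the centered sample."""
    y = y - y.mean()
    return np.mean(y**4) - 3.0 * np.mean(y**2) ** 2

rng = np.random.default_rng(0)
n = 100_000

# Two independent super-Gaussian sources, standardized to zero mean and unit variance.
S = rng.laplace(size=(2, n))
S = (S - S.mean(axis=1, keepdims=True)) / S.std(axis=1, keepdims=True)

# Scan the unit circle: q = (cos(phi), sin(phi)), so q1^2 + q2^2 = 1.
angles_deg = np.arange(0, 360)
K = np.array([
    kurtosis(np.cos(np.radians(p)) * S[0] + np.sin(np.radians(p)) * S[1])
    for p in angles_deg
])

best = int(angles_deg[np.argmax(K)])
offset = min(best % 90, 90 - best % 90)   # distance to the nearest axis in degrees

print("angle of maximum kurtosis (deg):", best)
print("distance to nearest axis (deg):", offset)
print("kurtosis at the maximum:", K.max(), " at 45 deg:", K[45])
```

With sources of negative kurtosis the same axes would appear as minima, so in that case the absolute value of the kurtosis would be the quantity to maximize.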
The ICs can be obtained by finding the weight vectors which maximize the kurtosis of the extracted signals $Y = W^T Z$. The kurtosis of $Y$ is then calculated as in Eq. (23), where the term $(E[y_i^2])^2$ in Eq. (22) is equal to one because $W$ is scaled to have unit length and $Z$ is the whitened data, which has unit variance. Thus, the kurtosis can be expressed as:

$$K(Y) = E[(W^T Z)^4] - 3$$

The gradient of the kurtosis of $Y$ is given by $\frac{\partial K(W^T Z)}{\partial W} = cE[Z(W^T Z)^3]$, where $c$ is a constant, which we set to unity for convenience. The weight vector is updated in each iteration as follows, $w_{new} = w_{old} + \eta E[Z(w_{old}^T Z)^3]$, where $\eta$ is the step size for the gradient ascent. Since we are optimizing the kurtosis on the unit circle $\|w\| = 1$, the gradient method must be complemented by projecting $w$ onto the unit circle after every step. This can be done by normalizing the weight vector $w_{new}$ through dividing it by its norm as follows, $w_{new} = \frac{w_{new}}{\|w_{new}\|}$.

4.1.2 Negative entropy. Negative entropy is termed negentropy, and it is defined as follows, $J(y) = H(y_{Gaussian}) - H(y)$, where $H(y_{Gaussian})$ is the entropy of a Gaussian random variable whose covariance matrix is equal to the covariance matrix of $y$. The entropy of a random variable $Q$ which has $N$ possible outcomes is

$$H(Q) = -E[\log p_q(Q)] = -\frac{1}{N}\sum_{t=1}^{N} \log p_q(q_t) \qquad (29)$$

where $p_q(q_t)$ is the probability of the event $q_t$, $t = 1, 2, \ldots, N$.

The negentropy is zero when all variables are Gaussian, i.e., $H(y_{Gaussian}) = H(y)$. Negentropy is always nonnegative because the entropy of a Gaussian variable is the maximum among all random variables with the same variance. Moreover, it is invariant under invertible linear transformations and it is scale-invariant [21]. However, calculating the entropy from finite data is computationally difficult. Hence, different approximations have been introduced for calculating the negentropy [21]. For example,

$$J(y) \approx \frac{1}{12}E[y^3]^2 + \frac{1}{48}K(y)^2$$

where $y$ is assumed to have zero mean. This approximation suffers from the sensitivity of kurtosis; therefore, Hyvarinen proposed another approximation based on the maximum entropy principle as follows [23]:

$$J(y) \approx \sum_{i} k_i \left(E[G_i(y)] - E[G_i(v)]\right)^2$$

where $k_i$ are some positive constants, $v$ indicates a Gaussian variable with zero mean and unit variance, and $G_i$ represent some non-quadratic functions [23,20]. The function $G$ has different choices such as

$$G_1(y) = \frac{1}{a_1}\log\cosh(a_1 y), \qquad G_2(y) = -\exp\left(-\frac{y^2}{2}\right)$$

where $1 \leq a_1 \leq 2$. These two functions are widely used, and these approximations give a very good compromise between the properties of kurtosis and negentropy, which are the two classical non-Gaussianity measures.
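A one-term version of the maximum-entropy approximation can be sketched with the log-cosh function. The sketch makes several assumptions that are not taken from the text: a single function $G$ is used, the constant $k$ is set to one (so only relative values are meaningful), and $E[G(v)]$ is estimated from a large Gaussian sample rather than computed analytically.

```python
import numpy as np

def log_cosh(y, a1=1.0):
    """G(y) = (1 / a1) * log(cosh(a1 * y)), evaluated in a numerically stable way."""
    u = a1 * np.asarray(y, dtype=float)
    return (np.logaddexp(u, -u) - np.log(2.0)) / a1

def negentropy_approx(y, v, a1=1.0):
    """One-term approximation (E[G(y)] - E[G(v)])^2 with the constant k set to 1."""
    y = (y - y.mean()) / y.std()        # standardize: zero mean, unit variance
    return (log_cosh(y, a1).mean() - log_cosh(v, a1).mean()) ** 2

rng = np.random.default_rng(0)
n = 200_000
v = rng.standard_normal(n)              # reference Gaussian sample for E[G(v)]

j_gauss = negentropy_approx(rng.standard_normal(n), v)
j_laplace = negentropy_approx(rng.laplace(size=n), v)
j_uniform = negentropy_approx(rng.uniform(-1.0, 1.0, n), v)

print("Gaussian:", j_gauss)
print("Laplace: ", j_laplace)
print("Uniform: ", j_uniform)
```

The Gaussian sample gives a value close to zero, while both non-Gaussian samples give larger values, in line with negentropy being zero only for Gaussian variables.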

Minimization of mutual information
Minimizing mutual information between independent components is one of the well-known approaches for ICA estimation. In ICA, maximizing the entropy of $Y = W^T X$ can be achieved by spreading out the points in $Y$ as much as possible. Signals $\hat{Y}$ can be obtained by transforming $Y$ by $g$ as follows, $\hat{Y} = g(Y)$, where $g$ is assumed to be the cumulative distribution function (cdf) of the source signals. Hence, $\hat{Y}$ has a uniform joint distribution. The pdf of the linear transformation $Y = W^T X$ is $p_Y(Y) = \frac{p_X(X)}{|W|}$, and the pdf of the transformed signals is $p_{\hat{Y}}(\hat{Y}) = \frac{p_Y(Y)}{|d\hat{Y}/dY|} = \frac{p_Y(Y)}{p_S(Y)}$, where $d\hat{Y}/dY = g'(Y) = p_S(Y)$ represents the pdf of the source signals ($p_S$). This can be substituted in Eq. (29) and the entropy will be

$$H(\hat{Y}) = -\frac{1}{N}\sum_{t=1}^{N} \log \frac{p_Y(Y_t)}{p_S(Y_t)} \qquad (33)$$

In Eq. (33), as the match between the extracted and source signals increases, the ratio $\frac{p_Y(Y)}{p_S(Y)}$ approaches one. As a consequence, $p_{\hat{Y}}(\hat{Y}) = \frac{p_Y(Y)}{p_S(Y)}$ becomes uniform, which maximizes the entropy of $p_{\hat{Y}}(\hat{Y})$. Moreover, the term $-\frac{1}{N}\sum_{t=1}^{N} \log p_X(X_t)$ represents the entropy of $X$; hence, Eq. (33) is given by

$$H(\hat{Y}) = \frac{1}{N}\sum_{t=1}^{N} \log p_S(Y_t) + \log|W| + H(X)$$

For maximum likelihood estimation, the expected log-likelihood has the form $E\left[\sum_{i} \log p_i(w_i^T X)\right] + \log|W|$. The first term $E\left[\sum_{i} \log p_i(w_i^T X)\right] = -\sum_{i} H(w_i^T X)$; therefore, the likelihood and mutual information are approximately equal, and they differ only by a sign and an additive constant. It is worth mentioning that maximum likelihood estimation will give wrong results if the prior information about the ICs is not correct, whereas the non-Gaussianity approach does not need any prior information [23].

ICA algorithms
In this section, different ICA algorithms are introduced.

Projection pursuit
Projection pursuit (PP) is a statistical technique for finding possible projections of multi-dimensional data [13]. In the basic one-dimensional projection pursuit, the aim is to find the directions for which the projections of the data have distributions that deviate from the Gaussian distribution, and this is exactly the same goal as ICA [13]. Hence, ICA is considered a variant of projection pursuit.
In PP, one source signal is extracted from each projection, which is different from ICA algorithms that extract $p$ signals simultaneously from $n$ mixtures. Simply, in PP, after finding the first projection which maximizes the non-Gaussianity, the same process is repeated to find new projections for extracting the next source signal(s) from the reduced set of mixture signals, and this sequential process is called deflation [17]. Given $n$ mixture signals which represent the axes of the $n$-dimensional space ($X$), the $n$th source signal can be extracted using the vector $w_n$ which is orthogonal to the other $n - 1$ axes. These mixture signals in the $n$-dimensional space are projected onto the $(n - 1)$-dimensional space which has $n - 1$ transformed axes. For example, assume $n = 3$; the third source signal can be extracted by finding $w_3$ which is orthogonal to the plane that is defined by the other two transformed axes $s'_1$ and $s'_2$; this plane is denoted by $p'_{1,2}$. Hence, the data points in the three-dimensional space are projected onto the plane $p'_{1,2}$, which is a two-dimensional space. This process is continued until all source signals are extracted [20,32].
Consider three source signals, each with 10000 time-steps, as shown in Figure 12. These signals represent sound signals that were taken from MATLAB: the first signal is called chirp, the second is called gong, and the third is called train. Figure 12 (d, e, and f) shows the histogram of each signal. As shown, the histograms are non-Gaussian. These three signals were mixed using the mixing matrix given in Eq. (44). Figure 13 shows the mixed signals and the histograms of these mixture signals. As shown in the figure, the mixture signals follow all the properties that were mentioned in Section 2.1.1: (1) source signals are more independent than mixture signals, (2) the histograms of the mixture signals in Figure 13 are much more Gaussian than the histograms of the source signals in Figure 12, and (3) mixture signals (see Figure 13) are more complex than source signals (see Figure 12).
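The second property can be illustrated with a short numerical sketch. The sound signals and the mixing matrix of Eq. (44) are not reproduced here; the sources below (uniform noise, a random binary signal, and a sine wave) and the matrix A are hypothetical stand-ins chosen only for illustration. The excess kurtosis, which is zero for a Gaussian signal, is used as the measure of non-Gaussianity.

```python
import numpy as np

def excess_kurtosis(x):
    """Excess kurtosis of a signal: zero for Gaussian data."""
    z = (x - x.mean()) / x.std()
    return np.mean(z**4) - 3.0

rng = np.random.default_rng(0)
t = np.arange(10000)                      # 10000 time-steps, as in the example

# Hypothetical non-Gaussian sources (stand-ins for the sound signals).
S = np.vstack([
    rng.uniform(-1.0, 1.0, t.size),       # uniform noise
    np.sign(rng.standard_normal(t.size)), # random binary signal
    np.sin(2.0 * np.pi * 0.013 * t),      # sine wave
])
# Standardize every source to zero mean and unit variance.
S = (S - S.mean(axis=1, keepdims=True)) / S.std(axis=1, keepdims=True)

# Hypothetical mixing matrix; this is NOT the matrix of Eq. (44).
A = np.array([[0.6, 0.5, 0.4],
              [0.4, 0.6, 0.5],
              [0.5, 0.4, 0.6]])
X = A @ S                                 # each row of X is one mixture signal

source_kurt = np.array([excess_kurtosis(s) for s in S])
mixture_kurt = np.array([excess_kurtosis(x) for x in X])
print("sources :", source_kurt)
print("mixtures:", mixture_kurt)
```

Under these assumptions, the kurtosis values of the mixtures should lie closer to zero than those of the sources, which agrees with the observation that the histograms of the mixture signals are more Gaussian.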
In the projection pursuit algorithm, the mixture signals are first whitened, and then the values of the first weight vector ($w_1$) are initialized randomly. The value of $w_1$ is listed in Table 1. This weight vector is then normalized and used for extracting one source signal ($y_1$). The kurtosis of the extracted signal is then calculated, and the weight vector is updated iteratively to maximize the kurtosis. Table 1 shows the kurtosis of the extracted signal during some iterations of the projection pursuit algorithm. The kurtosis increases during the iterations, as shown in Figure 14(a). Moreover, in this example, the correlation between the extracted signal ($y_1$) and all source signals ($s_1$, $s_2$, and $s_3$) was calculated. This helps to show how the extracted signal becomes correlated with one source signal and uncorrelated with the others. From the table, it can be seen that the correlation between $y_1$ and the source signals changes iteratively, and the correlation between $y_1$ and $s_1$ was 1 at the end of the iterations. Figure 15 shows the histogram of the extracted signal during the iterations. As shown in Figure 15(a), the extracted signal is initially close to Gaussian; hence, its kurtosis value, which is the measure of non-Gaussianity in the projection pursuit algorithm, is small (0.18). The kurtosis value of the extracted signal increased to 0.21, 3.92, and 4.06 after the 10th, 100th, and 1000th iterations, respectively. This reflects that the non-Gaussianity of $y_1$ increased during the iterations of the projection pursuit algorithm. Additionally, Figure 14(b) shows the angle ($\alpha$) between the optimal vector and the gradient vector. As shown, the angle decreases sharply until it reaches zero, which means that the optimal and gradient vectors have the same direction.
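The steps above (whitening, random initialization, normalization, and iterative kurtosis-based updates) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation behind Table 1: the sources and the mixing matrix are hypothetical stand-ins, the step size and iteration count are arbitrary, and the sketch maximizes the absolute value of the kurtosis (through its sign) so that it also works for the sub-Gaussian stand-in sources, whereas the example in the text maximizes the kurtosis itself.

```python
import numpy as np

def whiten(X):
    """Center X (signals in rows) and transform it to identity covariance."""
    Xc = X - X.mean(axis=1, keepdims=True)
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc))
    return (eigvecs / np.sqrt(eigvals)) @ eigvecs.T @ Xc

def kurtosis(y):
    """Excess kurtosis of a zero-mean, unit-variance signal."""
    return np.mean(y**4) - 3.0

def extract_one(Z, n_iter=1000, step=0.1, seed=0):
    """Gradient ascent on |kurtosis| of y = w^T Z subject to ||w|| = 1."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(Z.shape[0])   # random initialization
    w /= np.linalg.norm(w)                # normalization
    for _ in range(n_iter):
        y = w @ Z
        # Gradient of kurtosis for whitened data with ||w|| = 1.
        grad = 4.0 * ((Z @ y**3) / Z.shape[1] - 3.0 * w)
        w = w + step * np.sign(kurtosis(y)) * grad
        w /= np.linalg.norm(w)            # project back onto the unit sphere
    return w

# Hypothetical sources and mixing matrix (stand-ins for Figure 12 and Eq. (44)).
rng = np.random.default_rng(1)
t = np.arange(10000)
S = np.vstack([rng.uniform(-1.0, 1.0, t.size),
               np.sign(rng.standard_normal(t.size)),
               np.sin(2.0 * np.pi * 0.013 * t)])
A = np.array([[0.6, 0.5, 0.4],
              [0.4, 0.6, 0.5],
              [0.5, 0.4, 0.6]])

Z = whiten(A @ S)
w1 = extract_one(Z)
y1 = w1 @ Z
# Correlation between the extracted signal and every source signal.
corr = np.array([np.corrcoef(y1, s)[0, 1] for s in S])
print("kurtosis of y1:", kurtosis(y1))
print("correlations  :", corr)
```

In this sketch, the extracted signal is expected to correlate strongly (up to sign) with exactly one of the sources, which is the behavior summarized in Table 1 for $y_1$ and $s_1$.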

FastICA
The FastICA algorithm extracts independent components by maximizing the non-Gaussianity of the extracted signals, measured by the negentropy, using a fixed-point iteration scheme [18].

FastICA has a cubic, or at least quadratic, convergence speed, and hence it is much faster than gradient-based algorithms, which have linear convergence. Additionally, FastICA has no learning rate or other adjustable parameters, which makes it easy to use. FastICA can be used for extracting one IC; this is called the one-unit version, where FastICA finds the weight vector ($w$) that extracts one independent component. The values of $w$ are updated by a learning rule that searches for a direction that maximizes the non-Gaussianity.
The derivative of the function $G$ in Eq. (31) is denoted by $g$, and the derivatives of $G_1$ and $G_2$ in Eq. (32) are $g_1(u) = \tanh(a_1 u)$ and $g_2(u) = u \exp(-u^2/2)$, where $1 \le a_1 \le 2$ is a constant, and often $a_1 = 1$. In FastICA, convergence means that the dot product between the current and old weight vectors is almost equal to one, and hence the new and old weight vectors point in the same direction. The maxima of the approximation of the negentropy of $w^T X$ are obtained at certain optima of $E[G(w^T X)]$, where $E[(w^T X)^2] = \|w\|^2 = 1$. The optimal solution is obtained where $E[X g(w^T X)] - \beta w = 0$, and this equation can be solved using Newton's method. Let $F(w) = E[X g(w^T X)] - \beta w$; hence, the Jacobian matrix is given by $JF(w) = \partial F / \partial w = E[X X^T g'(w^T X)] - \beta I$. Since the data are whitened, $E[X X^T g'(w^T X)] \approx E[X X^T] E[g'(w^T X)] \Rightarrow E[X X^T g'(w^T X)] = E[g'(w^T X)] I$, and hence the Jacobian matrix becomes diagonal, which is easily inverted. The value of $w$ can be updated according to Newton's method as follows:

$$w^{+} = w - \frac{E[X g(w^T X)] - \beta w}{E[g'(w^T X)] - \beta} \qquad (46)$$

Eq. (46) can be further simplified by multiplying both sides by $\beta - E[g'(w^T X)]$ as follows:

$$w^{+} = E[X g(w^T X)] - E[g'(w^T X)] w \qquad (47)$$

Several units of FastICA can be used for extracting several independent components. The output $w_i^T X$ is decorrelated iteratively with the other outputs that were calculated in the previous iterations $(w_1^T X, w_2^T X, \ldots, w_{i-1}^T X)$. This decorrelation step prevents different vectors from converging to the same optimum. The deflation orthogonalization method is similar to projection pursuit, where the independent components are estimated one by one. In each iteration, the projections of $w_p$ onto the previously estimated weight vectors, $(w_p^T w_j) w_j$ for $j = 1, 2, \ldots, p - 1$, are subtracted from $w_p$, and then $w_p$ is normalized as in Eq. (48). In this method, estimation errors in the first vectors accumulate in the subsequent ones through the orthogonalization.
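The fixed-point update and the deflation orthogonalization can be sketched together as follows. This is a minimal illustration under stated assumptions rather than a reference implementation: it uses $g_1$ with $a_1 = 1$, the function names whiten and fastica_deflation are hypothetical, the sources and the mixing matrix are stand-ins for the example data, and the tolerance and iteration limit are arbitrary choices.

```python
import numpy as np

def whiten(X):
    """Center X (signals in rows) and transform it to identity covariance."""
    Xc = X - X.mean(axis=1, keepdims=True)
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc))
    return (eigvecs / np.sqrt(eigvals)) @ eigvecs.T @ Xc

def fastica_deflation(Z, n_components, max_iter=200, tol=1e-8, seed=0):
    """Estimate the weight vectors one by one from whitened data Z."""
    rng = np.random.default_rng(seed)
    W = np.zeros((n_components, Z.shape[0]))
    for p in range(n_components):
        w = rng.standard_normal(Z.shape[0])
        w /= np.linalg.norm(w)
        for _ in range(max_iter):
            y = w @ Z
            g = np.tanh(y)                  # g_1(u) = tanh(a_1 u) with a_1 = 1
            g_prime = 1.0 - g**2            # derivative of tanh
            # Fixed-point update: E[X g(w^T X)] - E[g'(w^T X)] w
            w_new = (Z @ g) / Z.shape[1] - g_prime.mean() * w
            # Deflation: subtract projections onto previously found vectors.
            w_new -= W[:p].T @ (W[:p] @ w_new)
            w_new /= np.linalg.norm(w_new)
            # Convergence: old and new vectors point in the same direction
            # (up to sign), so the absolute dot product is close to one.
            converged = abs(w_new @ w) > 1.0 - tol
            w = w_new
            if converged:
                break
        W[p] = w
    return W

# Hypothetical sources and mixing matrix (stand-ins for the example data).
rng = np.random.default_rng(1)
t = np.arange(10000)
S = np.vstack([rng.uniform(-1.0, 1.0, t.size),
               np.sign(rng.standard_normal(t.size)),
               np.sin(2.0 * np.pi * 0.013 * t)])
A = np.array([[0.6, 0.5, 0.4],
              [0.4, 0.6, 0.5],
              [0.5, 0.4, 0.6]])

Z = whiten(A @ S)
W = fastica_deflation(Z, n_components=3)
Y = W @ Z                                   # extracted signals, one per row
# Absolute correlations: rows are extracted signals, columns are sources.
C = np.abs(np.corrcoef(np.vstack([Y, S]))[:3, 3:])
print(np.round(C, 3))
```

Because ICA recovers the sources only up to order and sign, the check uses absolute correlations: under these assumptions, each source should be matched by one extracted signal. The absolute value in the convergence test accounts for the sign flips that the fixed-point update may produce between iterations.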
Table 1. Results of the projection pursuit algorithm in terms of the correlation between the extracted signal ($y_1$) and the source signals, the values of the weight vector ($w_1$), the kurtosis of $y_1$, and the angle ($\alpha$) between the optimal vector and the gradient vector during the iterations of the projection pursuit algorithm.

The symmetric orthogonalization method can be used when a symmetric decorrelation, i.e., no vectors are privileged over others, is required [18]. Hence, the vectors $w_i$ can be estimated in