Spline functions for Arabic morphological disambiguation

In this paper, we develop a hybrid morphological disambiguation system for the Arabic language that identifies the stem, lemma and root of each word of a given sentence. Following an out-of-context analysis performed by the morphological analyser Alkhalil Morpho Sys, the system first identifies all the potential tags of each word of the sentence. Then, a disambiguation phase chooses for each word the right solution among those obtained during the first phase. We solve this problem by equating the disambiguation issue with a surface optimization problem of spline functions. Tests have shown the interest of this approach and the superiority of its performance compared with that of the state of the art.


Introduction
One of the main challenges facing Natural Language Processing (NLP) is the presence of ambiguity in a larger or smaller set of words, depending on the language. A word is ambiguous if its analysis provides more than one solution. Thus, a word is morphologically ambiguous if it can accept several segmentations into (proclitic + stem + enclitic) or different morphological tags (for example, several stems, lemmas or roots). Recall that the root of a word is an abstract unit representing its origin. The lemma is the minimal lexical unit that has a meaning; it is obtained by derivation from a root according to a scheme. Finally, the stem is an inflected form of a lemma [1]. Similarly, a word is syntactically ambiguous if it can have several syntactic functions (for example, subject or object). Finally, a word is semantically ambiguous if it has several meanings [2].
Ambiguity is very present in the Arabic language because of its agglutinating and derivational characteristics [3]. Moreover, the absence of diacritical marks in the vast majority of Arabic texts greatly amplifies ambiguity [4]. For example, the non-diacritized word "فرمت" /frmt/ may be the verb "فَرَمَتْ" /faramato/ that has two meanings depending on its context: (and she threw) whose root is "ر م ي" /r m y/, or (she cut) whose root is "ف ر م" /f r m/. Likewise, this word can be the verb "فُرِمَتْ" /furimato/ (it was cut off) whose root is "ف ر م" /f r m/. Habash [5] argues that on average a non-diacritized word can have 12 different morphological analyses. Similarly, in the statistical study conducted in [4] on a corpus of more than 82 million non-diacritized words, the authors showed that a word analysed out of context has on average 2.63 roots, 5.11 lemmas, 4.79 stems and 3.86 POS tags.
Morphological analysis is an essential step in the vast majority of NLP applications (text classification, machine translation, indexing, opinion mining, etc.). Indeed, these applications use canonical forms of the words to be analysed (stem, lemma, or root) to better represent them, and thereby improve the performance of information retrieval systems [6,7].
Most of the disambiguation systems developed for the Arabic language proceed in two phases. In the first, the system performs a morphological analysis of the words taken out of their contexts. Thus, the system provides all possible morphological analyses of each word. Then, in the second phase, the system performs the disambiguation proper, which consists in choosing the most appropriate solution among those proposed in the first phase. Methods used in the disambiguation phase can be probabilistic (HMM, Maximum entropy, N-grams, SVM) or neural.
In this contribution, we propose a new method of disambiguation of Arabic texts. We start with an out-of-context morphological analysis of the words of the sentence. For this, we use the second version of the morphological analyser Alkhalil Morpho Sys [8]. Then, we propose in the disambiguation phase a new approach based on spline functions [9]. This approach consists of equating the disambiguation issue with a surface optimization problem of spline functions.
The use of splines in the disambiguation phase has several advantages over statistical methods. Indeed, the simplicity of their expressions makes their implementation very easy. In addition, the performance of the spline disambiguation systems in terms of accuracy and speed is better than that of systems based on HMM or SVM. Finally, the use of splines exempts us from resorting to the smoothing methods widely used in the training phases of the statistical approaches to circumvent the absence of some words or transitions in the training corpus.
The paper is organized as follows. We give in the second section a state of the art on Arabic morphological disambiguation. Then, we briefly present in the third section the morphological analyser Alkhalil Morpho Sys and the Nemlar corpus used in the training and testing phases, and we recall the definition and some properties of the spline functions. We give in the fourth section a description of our system, and we reserve the following section for an evaluation of its performance. We end the paper with a conclusion and some perspectives.

Related work
We distinguish two classes of disambiguation systems: systems that are limited to the study of a single morphological tag, and those which analyse several morphological tags. The approaches developed in these disambiguation systems are either rule-based, entirely statistical, or hybrid, using both linguistic rules and statistical processing.
Given the interest of roots in many NLP applications, many researchers have been interested in the development of a root extractor (heavy stemmer). The work of [10] is one of the first in this field. Their system is rule-based and consists of first eliminating the clitics of the word and then deducing the root. Yousef et al. [11] adopted a statistical approach to develop a heavy stemmer. They used transducers and rational kernels to model the word patterns of the Arabic language, and thereby extract a single root for each analysed word. Similarly, Boudlal et al. [12] developed a hybrid method for root extraction. The first step of this system consists of an out-of-context morphological analysis of the words, and the second step is a disambiguation phase based on the HMM.
The research on stems (light stemming) has also interested many researchers, and several light stemmers have been developed. Larkey et al. [13] developed an evolved version of light stemmers previously developed by the same authors [14]. They used a set of rules to eliminate clitics attached to words. They then proved the effectiveness of this light stemmer by testing it in the field of information retrieval. Ababneh et al. [15] have also developed a rule-based light stemmer, exploiting the rules to solve some ambiguity problems.
The extraction of lemmas for the Arabic language has aroused interest only in recent years. El-shishtawy et al. [16] proposed a system based on linguistic rules and resources. The system begins by searching for the stem and the pattern of the word. Then, it uses the pattern to identify the POS tag, and exploits this information to extract the word lemma. Similarly, [17] presented a hybrid method for the extraction of lemmas. It consists in using a learning classifier that can predict the lemma pattern of a given stem, and then retrieving the lemma using this pattern and some rules.
Other works have recently focused on the analysis of several tags instead of just one. MADAMIRA is a hybrid system of morphological disambiguation widely used by researchers [18]. After an out-of-context morphological analysis, the system uses SVM and language models in the disambiguation phase. The tags obtained following the analysis of a word by this system are its vocalized form, its lemma, its stem and its POS tag. CamelParser is an Arabic syntactic dependency analysis system aligned with contextually disambiguated morphological features [19]. Based on MADAMIRA, it improves its results and generates syntactically enriched syntactic dependencies. Similarly, Bounhas et al. [20] developed an approach combining linguistic rules and statistical classification for the morphological disambiguation of non-vocalized Arabic texts. They first performed unsupervised training from unlabelled vocalized Arabic corpora. Then, to deal with imperfect data, they compared a possibilistic approach to a data transformation-based approach. Experiments have shown that on classical texts, the possibilistic approach applied to 14 morphological features gives better results than the one based on transformation. The system Bel-Arabi developed in [21] is a grammar analyser based on basic morphology analysis and some Arabic grammar rules. This system provides the stem, the POS tag and the base phrase chunking. Farasa is a morphosyntactic disambiguation system developed by QCRI Arabic Language Technologies [22]. It is an open-source tool and consists of a segmentation module, a POS tagger, an Arabic text diacritizer and a dependency parser. The segmentation module is based on an SVM ranking model [23], while the POS tagger and dependency parser are based on the randomized greedy algorithm [24].

Alkhalil Morpho Sys 2
AlKhalil Morpho Sys 2 is an open-source morphosyntactic analyser developed by the Computer Research Laboratory of Mohammed First University, Morocco [8]. It analyses non-vocalized Arabic words as well as partially or completely vocalized words. The analysis is done out of context and the results for a given word are its possible vocalized forms. For each possible vocalized form, the system provides the segmentation of the word into proclitic + stem + enclitic, its POS tag, its syntactic state (case for nouns and mood for verbs), and its lemma and stem accompanied by their patterns.

Spline functions for disambiguation
We used this analyser in the first phase of our system, retaining only the stem, lemma and root tags.

Training and testing corpora
The Nemlar project (Network for Euro-Mediterranean Language Resources) started in 2003 in the framework of the MED-Unco program supported by the European Union [25], which brought together 14 partners from various countries. It was aimed at developing the resources of the Arabic language. The Nemlar corpus is a set of Arabic-language texts originally annotated by the Egyptian company RDI on behalf of the Nemlar consortium, which holds the rights. It was collected from 13 different domains spread across 489 files that contain about 500,000 words.
This corpus has recently been corrected and enriched with other tags. Its latest version is open and can be downloaded at the following address: http://oujda-nlp-team.net/en/corpora/nemlar-corpus-en/. The tags provided for a given word are: its vocalized form, its stem, the clitics attached to the stem, its lemma, its grammatical category and its scheme.
To build our systems and evaluate them, we segmented this corpus into ten parts of the same size. The segmentation was performed randomly at the sentence level. Nine parts (about 90% of the corpus) constitute the training set $E_A$, which is used in the training phases of our systems. The test set $E_T$, consisting of the remaining 10% of the Nemlar corpus and containing around 50,000 words, is used in the testing phases.
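The sentence-level split described above could be reproduced along the following lines. This is a minimal sketch, not the authors' script: the names `split_into_folds` and `held_out_split`, the fixed seed, and the representation of the corpus as a Python list of sentences are assumptions made for illustration. Here the parts are balanced by number of sentences, which only approximates parts of equal size in words.

```python
import random


def split_into_folds(sentences, n_folds=10, seed=0):
    """Shuffle the sentences and deal them into n_folds parts of near-equal size."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    # Part f receives every n_folds-th sentence, starting at offset f.
    return [shuffled[f::n_folds] for f in range(n_folds)]


def held_out_split(folds, test_index):
    """Use one part as the test set and the union of the others as the training set."""
    test = list(folds[test_index])
    train = [s for f, fold in enumerate(folds) if f != test_index for s in fold]
    return train, test
```

Calling `held_out_split(folds, f)` for each `f` in turn gives the ten train/test pairs needed for a cross-validation run.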

Spline functions
A spline function $f$ is a piecewise polynomial function on an interval $[a, b]$ [9]. Thus, $f$ is associated with a subdivision $(x_i)_{1 \le i \le m}$ of $[a, b]$ such that the restriction of $f$ to each interval $[x_i, x_{i+1}]$ is a polynomial (the points $(x_i)_{1 \le i \le m}$ are also called the knots of the spline).
The spline $f$ is of class $C^r$ and degree $n$ if it is of class $C^r$ over the whole interval $[a, b]$, and its restriction to each interval $[x_i, x_{i+1}]$ is a polynomial of degree at most $n$. In general, spline functions are used to approximate a function or interpolate data. In the following, we recall the classical theorem of interpolation theory.

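Theorem 1 Given two distinct points $c$ and $d$ of the interval $[a, b]$ and a set of data $(f_c^{(i)})_{0 \le i \le p}$ and $(f_d^{(j)})_{0 \le j \le q}$, there exists a unique polynomial $h$ of degree at most $p + q + 1$ verifying the following Hermite interpolation conditions:
$$h^{(i)}(c) = f_c^{(i)}, \quad 0 \le i \le p, \qquad h^{(j)}(d) = f_d^{(j)}, \quad 0 \le j \le q,$$
where $h^{(i)}$ denotes the derivative of order $i$ of the function $h$.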

Description of the proposed approach
In order to improve their performance, many Arabic NLP applications do not seek to extract information directly from text words. Instead, these applications begin by replacing the words by one of their canonical forms (stem, lemma or root), and then analyse these forms. Thus, in this work we are interested in the extraction of these three tags: stem, lemma and root. The proposed approach is composed of two modules. The first one performs an out-of-context morphological analysis of the sentence words. Thus, the system provides for each analysed word the list of its possible solutions. Then, in the second module, the system uses spline functions to eliminate the ambiguity by choosing for each word the right stem among these solutions (see Figure 1).

Morphological analysis
After a pre-processing step that consists of segmenting the text into sentences and then the sentences into words, the system performs for each sentence an out-of-context morphological analysis of its words. The morphological analyser Alkhalil Morpho Sys 2 was used for this task. Thus, we obtain the different possible analyses of each word (see example in Figure 2).

Disambiguation phase by spline functions
The disambiguation phase will be carried out tag by tag according to the same process.We will present in the following our stem disambiguation approach; the other tags (lemma and root) will be treated in the same way.
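Given a sentence $S$ consisting of the words $w_1, w_2, \ldots, w_k$, the morphological analysis of the first phase gives us the possible stems of each word of the sentence (see Figure 3). The disambiguation then consists of identifying the optimal path of stems among the possible paths $(s^1_{j_1}, \ldots, s^k_{j_k})$, where $s^i_{j_i} \in S_i = \{s^i_1, \ldots, s^i_{n_i}\}$, the set of the potential stems of the word $w_i$.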

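To identify this optimal path, we associate with each potential path $(s^1_{j_1}, \ldots, s^k_{j_k})$ a spline function $\psi_{s^1_{j_1}, \ldots, s^k_{j_k}}$, and we retain the path whose spline has the largest surface, that is, the largest integral over its interval of definition. Given that all paths are composed of the same number of stems, and since it is the position of the stem in the path that is important, we have assimilated the stems to their positions in the path to construct the associated spline. Thus, the knots of the spline $\psi_{s^1_{j_1}, \ldots, s^k_{j_k}}$ are the integers between 1 and $k$.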

ACI
In order to evaluate the impact of using the context in the disambiguation phase, we will experiment with three families of splines.The first one concerns linear splines built solely from information on the stems of words and without the use of information on their contexts.The quadratic splines of the second family exploit, in addition to information used in the construction of linear splines, the left context (in the order of the Arabic script) of the analysed words.Finally, the construction of the cubic splines of the third family is based on the exploitation of the right context of the analysed words in addition to the information above.
4.2.1 Disambiguation by linear splines. The linear spline associated with a potential path $(s_1, \ldots, s_k)$, $s_i \in S_i$, is obtained by choosing as Hermite interpolation data at each knot $i$, $1 \le i \le k$, the weight in the Arabic language of the stem $s_i$. Thus, the spline $\psi^L_{(s_1, \ldots, s_k)}$ associated with this path satisfies the following Hermite interpolation conditions:
$$\psi^L_{(s_1, \ldots, s_k)}(i) = p_i, \quad 1 \le i \le k,$$
where $p_i$ is the weight in the Arabic language of the stem $s_i$.
The following result is a consequence of Theorem 1 above.
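Theorem 2 Given a sequence of values $(p_i)_{1 \le i \le k}$, there is a single continuous linear spline $\psi^L_{(s_1, \ldots, s_k)}$ (of degree 1 and of class $C^0$ in $[1, k]$) satisfying the interpolation conditions $\psi^L_{(s_1, \ldots, s_k)}(i) = p_i$, $1 \le i \le k$. The expression of $\psi^L_{(s_1, \ldots, s_k)}$ in the $[i, i+1]$ interval is given by:
$$\psi^L_{(s_1, \ldots, s_k)}(x) = p_i + (p_{i+1} - p_i)(x - i), \tag{5}$$
and
$$\int_1^k \psi^L_{(s_1, \ldots, s_k)}(x)\,dx = \sum_{i=1}^{k-1} \frac{p_i + p_{i+1}}{2}.$$
It is easy to verify that if the Hermite data $(p_i)_{1 \le i \le k}$ are positive, then the spline $\psi^L_{(s_1, \ldots, s_k)}$ is positive.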
We note that this spline is built solely from information on the stems and does not exploit the context of the associated words.
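As an illustration, the linear case can be sketched in a few lines. The function names are hypothetical, and the closed form used for the surface is simply the trapezoid area under the line joining two consecutive stem weights, which is our own integration of the linear interpolant rather than a formula quoted from the paper.

```python
def linear_piece(p_i, p_next, x, i):
    """Value at x in [i, i+1] of the line through (i, p_i) and (i+1, p_next)."""
    return p_i + (p_next - p_i) * (x - i)


def linear_surface(weights):
    """Integral over [1, k] of the linear spline interpolating `weights` at knots 1..k."""
    # Each interval contributes the area of a trapezoid of width 1.
    return sum((a + b) / 2 for a, b in zip(weights, weights[1:]))
```

With stem weights `[0.2, 0.6, 0.4]`, the two trapezoids contribute 0.4 and 0.5 to the surface.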
4.2.2 Disambiguation by quadratic splines. The quadratic spline relative to a potential path $(s_1, \ldots, s_k)$ is obtained by choosing as the value at the knot $i \in \{1, \ldots, k\}$ the weight in the Arabic language of the stem $s_i$, and as right derivative the weight in the Arabic language of the transition from the stem $s_i$ to the stem $s_{i+1}$. Thus, by applying Theorem 1 above, we have the following result.
Theorem 3 Given two series of values $(p_i)_{1 \le i \le k}$ and $(t_i)_{1 \le i < k}$, there exists a single continuous quadratic spline $\psi^Q_{(s_1, \ldots, s_k)}$ (of degree 2 and of class $C^0$ in $[1, k]$) satisfying in each $[i, i+1]$ interval, $1 \le i < k$, the following interpolation conditions:
$$\psi^Q_{(s_1, \ldots, s_k)}(i) = p_i, \qquad \psi^Q_{(s_1, \ldots, s_k)}(i+1) = p_{i+1}, \qquad D_r \psi^Q_{(s_1, \ldots, s_k)}(i) = t_i,$$
where $D_r g(x)$ is the right derivative of the function $g$ at the point $x$.

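The expression of the spline $\psi^Q_{(s_1, \ldots, s_k)}$ in the $[i, i+1]$ interval is given by:
$$\psi^Q_{(s_1, \ldots, s_k)}(x) = p_i + t_i (x - i) + (p_{i+1} - p_i - t_i)(x - i)^2. \tag{8}$$
As for the linear spline, it is easy to verify that if the Hermite data $(p_i)_{1 \le i \le k}$ and $(t_i)_{1 \le i < k}$ are positive, then the spline is positive. Furthermore,
$$\int_1^k \psi^Q_{(s_1, \ldots, s_k)}(x)\,dx = \sum_{i=1}^{k-1} \left( \frac{2 p_i + p_{i+1}}{3} + \frac{t_i}{6} \right). \tag{9}$$
4.2.3 Disambiguation by cubic splines. In addition to the values at knots $i$ and $i+1$ and the right derivative at knot $i$ necessary for the construction of the quadratic spline, the cubic spline also interpolates the left derivative at knot $i+1$.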
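Theorem 4 Given a potential path $(s_1, \ldots, s_k)$ and three value sequences $(p_i)_{1 \le i \le k}$, $(t_i)_{1 \le i < k}$ and $(T_i)_{1 < i \le k}$, there is a single continuous cubic spline $\psi^C_{(s_1, \ldots, s_k)}$ (of degree 3 and of class $C^0$) satisfying in each $[i, i+1]$ interval, $1 \le i < k$, the following interpolation conditions:
$$\psi^C_{(s_1, \ldots, s_k)}(i) = p_i, \quad \psi^C_{(s_1, \ldots, s_k)}(i+1) = p_{i+1}, \quad D_r \psi^C_{(s_1, \ldots, s_k)}(i) = t_i, \quad D_l \psi^C_{(s_1, \ldots, s_k)}(i+1) = T_{i+1}, \tag{10}$$
where $D_l g(x)$ is the left derivative of the function $g$ at the point $x$.
The expression of the spline $\psi^C_{(s_1, \ldots, s_k)}$ in the $[i, i+1]$ interval is given by:
$$\psi^C_{(s_1, \ldots, s_k)}(x) = p_i + t_i (x - i) + \big(3(p_{i+1} - p_i) - 2 t_i - T_{i+1}\big)(x - i)^2 + \big(2(p_i - p_{i+1}) + t_i + T_{i+1}\big)(x - i)^3, \tag{11}$$
and
$$\int_1^k \psi^C_{(s_1, \ldots, s_k)}(x)\,dx = \sum_{i=1}^{k-1} \left( \frac{p_i + p_{i+1}}{2} + \frac{t_i - T_{i+1}}{12} \right).$$
Unlike the linear and quadratic splines, the cubic spline does not maintain the positivity of the data. Indeed, even if the Hermite data $(p_i)_{1 \le i \le k}$, $(t_i)_{1 \le i < k}$ and $(T_i)_{1 < i \le k}$ are all positive, the spline $\psi^C_{(s_1, \ldots, s_k)}$ is not necessarily positive. For example, if we take on the $[1, 2]$ interval the data $p_1 = 0.4$, $p_2 = 0.01$, $t_1 = 0.6$ and $T_2 = 0.8$, then the associated cubic spline is negative in the $[1.75, 1.95]$ interval.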
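The loss of positivity can be checked numerically. The sketch below builds, on one interval, the cubic determined by the four Hermite conditions (two end values, a right derivative at the left knot and a left derivative at the right knot) and evaluates it for the data of the example. The function names and the power-basis form of the cubic are our own choices, obtained by solving the four conditions; they are not taken from the paper.

```python
def cubic_piece(p_i, p_next, t_i, T_next, x, i):
    """Cubic on [i, i+1] with values p_i, p_next and end derivatives t_i, T_next."""
    u = x - i
    a2 = 3 * (p_next - p_i) - 2 * t_i - T_next
    a3 = 2 * (p_i - p_next) + t_i + T_next
    return p_i + t_i * u + a2 * u**2 + a3 * u**3


def cubic_piece_surface(p_i, p_next, t_i, T_next):
    """Exact integral of the cubic piece over its unit interval."""
    return (p_i + p_next) / 2 + (t_i - T_next) / 12


# Data of the example: p1 = 0.4, p2 = 0.01, t1 = 0.6, T2 = 0.8 on [1, 2].
samples = [1.75 + 0.01 * j for j in range(21)]
values = [cubic_piece(0.4, 0.01, 0.6, 0.8, x, 1) for x in samples]
```

Every entry of `values` is negative, even though the four Hermite data are positive, which agrees with the claim made for the $[1.75, 1.95]$ interval.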

Estimation of model parameters
To estimate the different Hermite data defining the models (i.e. the three sequences $(p_i)_{1 \le i \le k}$, $(t_i)_{1 \le i < k}$ and $(T_i)_{1 < i \le k}$), we use a labeled training corpus $C$, and we propose several choices based on the maximum likelihood [26].
4.3.1 Estimation of stem weights. The weight $p_i$ of a stem $s_i$ associated with a word $w_i$ can be estimated by one of the following two formulas:
$$p_i = \frac{\mathrm{Occ}(w_i, s_i)}{\mathrm{Occ}(w_i)} \tag{P1}$$
$$p_i = \frac{\mathrm{Occ}(s_i)}{\sum_{j=1}^{n_i} \mathrm{Occ}(s^i_j)} \tag{P2}$$
where $\mathrm{Occ}(w_i, s_i)$ is the number of occurrences in the training corpus $C$ of the word $w_i$ associated with the stem $s_i$, $\mathrm{Occ}(w_i)$ is the number of occurrences in $C$ of the word $w_i$, $\mathrm{Occ}(s_i)$ is the number of occurrences in $C$ of the stem $s_i$, and $\{s^i_1, \ldots, s^i_{n_i}\}$ is the set of the potential stems of the word $w_i$ proposed by the morphological analyser. (P1) is none other than the estimate of the probability that the word $w_i$ is associated with the stem $s_i$ in the corpus $C$. In contrast, (P2) represents the frequency of the stem $s_i$ in the set of potential stems $\{s^i_1, \ldots, s^i_{n_i}\}$ of the word $w_i$.
4.3.2 Estimation of right derivatives. The right derivative at knot $i$ is equal to the weight $t_i$ of the transition between the stems $s_i$ and $s_{i+1}$ associated respectively with the words $w_i$ and $w_{i+1}$. To estimate it, we propose three formulas, denoted (Tr1), (Tr2) and (Tr3), built from the following counts: $\mathrm{Occ}(s_i, s_{i+1})$, the number of occurrences in the training corpus $C$ of the stem $s_i$ followed by the stem $s_{i+1}$, and $\mathrm{Occ}((w_i, s_i), (w_{i+1}, s_{i+1}))$, the number of occurrences in $C$ of the word $w_i$ associated with the stem $s_i$ followed by the word $w_{i+1}$ associated with the stem $s_{i+1}$.
Remark: the formula (Tr1) estimates the weight of the transition $t_i$ only from information on the stems $s_i$ and $s_{i+1}$, without taking into account the associated words $w_i$ and $w_{i+1}$. By contrast, the formula (Tr2) partially takes into account the words $w_i$ and $w_{i+1}$, since the normalization (the denominator value) is computed from information on the potential stems of these two words. Finally, the formula (Tr3) estimates the transition between two stems of two consecutive words using only the frequencies of appearance in the training corpus $C$ of these words accompanied by their stems.
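The counts on which these estimates rely can be gathered in a single pass over an annotated corpus. The sketch below is illustrative only. It assumes a corpus given as a list of sentences, each a list of (word, stem) pairs. It reads (P1) as the share of occurrences of a word that carry a given stem, and (P2) as the share of a stem among the occurrences of all candidate stems of the word. Returning 0 for unseen items is a convention chosen here, not one stated in the paper. Only the raw transition counts are computed, since the normalizations of (Tr1) to (Tr3) are not reproduced here.

```python
from collections import Counter


def count_occurrences(corpus):
    """Count (word, stem) pairs, words and stems in a corpus of tagged sentences."""
    word_stem, word, stem = Counter(), Counter(), Counter()
    for sentence in corpus:
        for w, s in sentence:
            word_stem[(w, s)] += 1
            word[w] += 1
            stem[s] += 1
    return word_stem, word, stem


def count_transitions(corpus):
    """Count consecutive stem pairs and consecutive (word, stem) pairs."""
    stem_pairs, tagged_pairs = Counter(), Counter()
    for sentence in corpus:
        for (w1, s1), (w2, s2) in zip(sentence, sentence[1:]):
            stem_pairs[(s1, s2)] += 1
            tagged_pairs[((w1, s1), (w2, s2))] += 1
    return stem_pairs, tagged_pairs


def weight_p1(w, s, word_stem, word):
    """Share of the occurrences of word w that are tagged with stem s."""
    return word_stem[(w, s)] / word[w] if word[w] else 0.0


def weight_p2(s, candidates, stem):
    """Share of stem s among the occurrences of all candidate stems of a word."""
    total = sum(stem[c] for c in candidates)
    return stem[s] / total if total else 0.0
```

Because `Counter` returns 0 for missing keys, unseen words, stems and transitions need no special handling before the division guards above.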

Estimation of left derivatives.
To estimate the left derivative $T_{i+1}$ at knot $(i+1)$, we considered a convex combination of the weight of the transition $t_i$ from the stem $s_i$ to the stem $s_{i+1}$ and that of the transition $t_{i+1}$ from the stem $s_{i+1}$ to the stem $s_{i+2}$. So,
$$T_{i+1} = \alpha\, t_i + (1 - \alpha)\, t_{i+1}, \tag{Cu}$$
where $\alpha$ is a parameter to choose between 0 and 1.
If we choose $\alpha = 0$, then $T_{i+1}$ at each knot $(i+1)$ is equal to the right derivative $t_{i+1}$, so the spline $\psi^C_{(s_1, \ldots, s_k)}$ will be of class $C^1$ on the $[1, k]$ interval.
On the other hand, if $\alpha = 1$ then the left derivative $T_{i+1}$ of the spline $\psi^C_{(s_1, \ldots, s_k)}$ at knot $(i+1)$ is equal to the right derivative $t_i$ at knot $i$. In this case $\int_1^k \psi^C_{(s_1, \ldots, s_k)}(x)\,dx = \int_1^k \psi^L_{(s_1, \ldots, s_k)}(x)\,dx$, and this implies that the optimal path corresponding to the cubic spline is identical to that corresponding to the linear spline.
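The effect of $\alpha$ can be illustrated with a small numerical sketch. The function names are hypothetical, and two points go beyond the text and are assumptions. First, at the last knot no outgoing transition exists, so the sketch reuses the incoming transition weight there. Second, the per-interval surface `(p_i + p_next)/2 + (t_i - T_next)/12` is our own integration of the cubic Hermite interpolant.

```python
def left_derivatives(t, alpha):
    """Convex combination T_{i+1} = alpha * t_i + (1 - alpha) * t_{i+1}.

    `t` holds t_1..t_{k-1}; the result holds T_2..T_k.
    Assumption: at the last knot, where t_k is undefined, T_k = t_{k-1}.
    """
    T = []
    for j, t_j in enumerate(t):
        t_next = t[j + 1] if j + 1 < len(t) else t_j
        T.append(alpha * t_j + (1 - alpha) * t_next)
    return T


def linear_surface(p):
    """Surface of the linear spline through the stem weights p_1..p_k."""
    return sum((a + b) / 2 for a, b in zip(p, p[1:]))


def cubic_surface(p, t, T):
    """Surface of the cubic spline with values p, right derivatives t, left derivatives T."""
    return sum((p[j] + p[j + 1]) / 2 + (t[j] - T[j]) / 12 for j in range(len(t)))
```

With `alpha = 1` every term `t[j] - T[j]` vanishes, so the cubic and linear surfaces coincide for any data, which is the reason both splines then select the same path.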
For more readability, we consider the following example: /… AlmElm xSA}S Al$Er AlmEASr/ (The teacher explained the characteristics of contemporary poetry). We present in Figure 4 the potential stems of each word obtained by the morphological analysis of the first step.
We then used the relation (8) to calculate the expressions of the two quadratic splines associated with two of the possible paths. We used the equations (P1) and (Tr2) to estimate the Hermite data. The surfaces of these two splines are given by the relation (9) and are equal to 1.92 and 3.41 respectively. We present in Figure 5 the graphs of these two splines.

Viterbi algorithm
We recall that the disambiguation phase consists of looking for the optimal path $(s^*_1, \ldots, s^*_k)$ of stems associated with the words $(w_1, \ldots, w_k)$ among the $(n_1 \times \cdots \times n_k)$ possible paths $(s^1_{j_1}, \ldots, s^k_{j_k})$, where $s^i_{j_i} \in S_i = \{s^i_1, \ldots, s^i_{n_i}\}$, the set of the potential stems of the word $w_i$ (see Figure 3). This path satisfies the following optimization equation:
$$(s^*_1, \ldots, s^*_k) = \arg\max_{(s^1_{j_1}, \ldots, s^k_{j_k})} \int_1^k \psi_{s^1_{j_1}, \ldots, s^k_{j_k}}(x)\,dx,$$
where $\psi_{s^1_{j_1}, \ldots, s^k_{j_k}}$ is the spline associated with the path of stems $(s^1_{j_1}, \ldots, s^k_{j_k})$. To optimize the search time of this path, we have developed an algorithm inspired by that of Viterbi [27].
In what follows, we will work with the following notations. Given two potential stems $s^i_u$ and $s^{i+1}_v$ ($1 \le u \le n_i$, $1 \le v \le n_{i+1}$) associated with the two words $w_i$ and $w_{i+1}$, we denote by $f_{u,v,i}$ the polynomial that interpolates the data of the stems $s^i_u$ and $s^{i+1}_v$ ($f_{u,v,i}$ corresponds to one of the polynomials given by the equations (5), (8) or (11)) and by $\int_i^{i+1} f_{u,v,i}(x)\,dx$ its surface on the $[i, i+1]$ interval. For a partial path of potential stems $(s^1_{j_1}, \ldots, s^{i-1}_{j_{i-1}}, s^i_u)$ leading to the stem $s^i_u$ of the word $w_i$, $1 \le i \le k$ and $1 \le u \le n_i$, we also denote by $\psi_{s^1_{j_1}, \ldots, s^{i-1}_{j_{i-1}}, s^i_u}$ the spline associated with this path. Let $\Lambda(i, s^i_u)$ be the maximum surface over all the splines associated with the partial paths leading to the stem $s^i_u$:
$$\Lambda(i, s^i_u) = \max_{(j_1, \ldots, j_{i-1})} \int_1^i \psi_{s^1_{j_1}, \ldots, s^{i-1}_{j_{i-1}}, s^i_u}(x)\,dx.$$

Figure 5. Graphs of the splines associated with the two stem paths of the example (left and right).

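It is easy to verify that, for $2 \le i \le k$ and $1 \le u \le n_i$,
$$\Lambda(i, s^i_u) = \max_{1 \le v \le n_{i-1}} \left[ \Lambda(i-1, s^{i-1}_v) + \int_{i-1}^{i} f_{v,u,i-1}(x)\,dx \right], \tag{16}$$
with $\Lambda(1, s^1_u) = 0$. This formula will allow us to calculate the function $\Lambda$ by induction.
Similarly, we denote by $\gamma(i, s^i_u)$ the index of the stem associated with the word $w_{i-1}$ in the optimal path leading to the stem $s^i_u$:
$$\gamma(i, s^i_u) = \arg\max_{1 \le v \le n_{i-1}} \left[ \Lambda(i-1, s^{i-1}_v) + \int_{i-1}^{i} f_{v,u,i-1}(x)\,dx \right], \tag{17}$$
and by $\Gamma(i, s^i_u)$ the optimal path leading to the stem $s^i_u$ (i.e. the one with the maximum associated spline surface):
$$\Gamma(i, s^i_u) = \arg\max_{(j_1, \ldots, j_{i-1})} \int_1^i \psi_{s^1_{j_1}, \ldots, s^{i-1}_{j_{i-1}}, s^i_u}(x)\,dx.$$
Then $\Gamma(i, s^i_u) \in \{1, \ldots, n_1\} \times \cdots \times \{1, \ldots, n_{i-1}\}$ and satisfies the following recurrence relation:
$$\Gamma(i, s^i_u) = \left( \Gamma\big(i-1, s^{i-1}_{\gamma(i, s^i_u)}\big),\ \gamma(i, s^i_u) \right). \tag{19}$$
The relations (16), (17) and (19) will allow us to identify the optimal path according to the following algorithm:
Step 1 (initialization): Calculate for each $1 \le u \le n_2$:
$$\Lambda(2, s^2_u) = \max_{1 \le v \le n_1} \int_1^2 f_{v,u,1}(x)\,dx, \qquad \gamma(2, s^2_u) = \arg\max_{1 \le v \le n_1} \int_1^2 f_{v,u,1}(x)\,dx, \qquad \Gamma(2, s^2_u) = \gamma(2, s^2_u).$$
Step 2 (induction): For each $3 \le i \le k$ and $1 \le u \le n_i$, calculate $\Lambda(i, s^i_u)$, $\gamma(i, s^i_u)$ and $\Gamma(i, s^i_u)$ using the induction relations (16), (17) and (19).
Step 3 (termination): The optimal path is $\big(\Gamma(k, s^k_{u^*}), u^*\big)$, where $u^* = \arg\max_{1 \le u \le n_k} \Lambda(k, s^k_u)$.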
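The dynamic programme above can be sketched as follows for the quadratic case. This is a minimal illustration under stated assumptions, not the authors' implementation. The names `best_path`, `path_surface` and `brute_force` are hypothetical, words and candidates are indexed from 0, and the transition weights are supplied by a caller-provided function. The per-interval surface `(2 * p_u + p_v) / 3 + t_uv / 6` is our own integration of the quadratic with end values `p_u`, `p_v` and right derivative `t_uv`. A brute-force search over all paths is included only to check the result on small inputs.

```python
from itertools import product


def quadratic_surface(p_u, p_v, t_uv):
    """Surface on one interval of the quadratic with values p_u, p_v and right derivative t_uv."""
    return (2 * p_u + p_v) / 3 + t_uv / 6


def path_surface(weights, transition, path):
    """Total surface of the spline associated with one path of candidate indices."""
    return sum(
        quadratic_surface(weights[i][path[i]], weights[i + 1][path[i + 1]],
                          transition(i, path[i], path[i + 1]))
        for i in range(len(path) - 1)
    )


def best_path(weights, transition):
    """Return (maximum surface, path) using a Viterbi-style recurrence."""
    scores = [0.0] * len(weights[0])   # Lambda at the first word: empty integral
    backpointers = []                  # one list of gamma values per later word
    for i in range(1, len(weights)):
        new_scores, pointers = [], []
        for v, p_v in enumerate(weights[i]):
            candidates = [
                scores[u] + quadratic_surface(p_u, p_v, transition(i - 1, u, v))
                for u, p_u in enumerate(weights[i - 1])
            ]
            u_best = max(range(len(candidates)), key=candidates.__getitem__)
            new_scores.append(candidates[u_best])
            pointers.append(u_best)
        scores = new_scores
        backpointers.append(pointers)
    last = max(range(len(scores)), key=scores.__getitem__)
    path = [last]
    for pointers in reversed(backpointers):   # follow gamma back to the first word
        path.append(pointers[path[-1]])
    path.reverse()
    return scores[last], path


def brute_force(weights, transition):
    """Exhaustive search over all n_1 x ... x n_k paths, for checking only."""
    paths = product(*(range(len(w)) for w in weights))
    best = max(paths, key=lambda p: path_surface(weights, transition, p))
    return path_surface(weights, transition, best), list(best)
```

The recurrence visits each pair of consecutive candidates once, so its cost grows with the sum of the products $n_i n_{i+1}$ rather than with the product $n_1 \times \cdots \times n_k$ explored by the exhaustive search.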

Results and discussion
We will first evaluate the impact of spline degree and Hermite data choices on the accuracy of the disambiguation system. This will allow us to determine how much the performance of the system is improved by the use of the context (disambiguation by the quadratic and cubic splines). Then we will compare the disambiguation system using splines with those based on HMM and SVM. This comparison is performed under identical training and testing conditions for the three systems. We end by giving the accuracy of the disambiguation systems relating to the three tags stem, lemma and root. For all these evaluations, we used the Nemlar corpus introduced in section 3.2. We first extracted randomly, at the sentence level, about 90% of the corpus, denoted $E_A$, which we used in the training phases of all the experiments carried out in the first two subsections of this section. The training phase consists of estimating the Hermite data needed for calculating the splines used in the disambiguation phase. These estimates are based on equations (P1), (P2), (Tr1), (Tr2), (Tr3) and (Cu).
The set $E_T$, consisting of the remaining 10% of the Nemlar corpus and containing approximately 50,000 words, was used in the test phases of our systems.
The performance measure used is the accuracy, which corresponds to the percentage of the words of the set $E_T$ correctly labeled. It is defined by:
$$\text{accuracy} = \frac{\text{number of words of } E_T \text{ correctly labeled}}{\text{size of } E_T}$$
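For concreteness, this measure can be computed as below. The function name and the representation of the system output and the reference annotation as two parallel lists of tags are assumptions made for this sketch, as is the choice to return a percentage.

```python
def accuracy(predicted, gold):
    """Percentage of test words whose predicted tag equals the reference tag."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("the test set is empty")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return 100.0 * correct / len(gold)
```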

Impact of the spline choice
As presented in section 4.2, we can use three types of splines (linear, quadratic or cubic). Similarly, we have several choices for estimating spline parameters (section 4.3).
In the following tables, we present the results for the different possible choices.
5.1.1 Evaluation of disambiguation by linear splines. For linear splines, we have two choices for estimating the weights of the stems, expressed above by the formulas (P1) and (P2).
By testing these choices on the test set $E_T$, we obtained the results presented in Table 1. We find that the accuracy relative to the estimation choice of the stem weights (P1) is better than that obtained with the choice (P2). This can be explained in part by the nature of (P1), which estimates the probability that a stem is associated with a word, while (P2) estimates the weight of a stem in the set of potential stems of a given word.
5.1.2 Evaluation of disambiguation by quadratic splines. In addition to the two estimation choices of the stem weights, we have for the right derivatives of the stems three estimation choices given above by the formulas (Tr1), (Tr2) and (Tr3). The evaluation results of these different choices on the test set $E_T$ are presented in Table 2.
Let us first note that the quadratic splines with the choice (P1) for estimating the stem weights perform better than linear splines. Moreover, the results obtained with the choice (P1) are, as for linear splines, better than those relating to the choice (P2). Similarly, the (Tr2) estimate of the right derivatives provides the best performance. The relatively weak results obtained with the choices (Tr1) and (Tr3) can be explained by the demanding nature of the estimate (Tr3), which counts only the transitions of the words accompanied by their respective stems, and by the fact that the estimate (Tr1) is based solely on information on stems without taking into account their associated words.
5.1.3 Evaluation of disambiguation by cubic splines. To build cubic splines, we need, in addition to the estimation of the weights and the right derivatives of the stems, to estimate their left derivatives from the above formula (Cu). Since left derivative estimates depend on a parameter $\alpha \in [0, 1]$, we performed tests on the test set $E_T$ for the values of $\alpha$ equal to the five nodes of a uniform subdivision of the interval $[0, 1]$ (0, 0.25, 0.5, 0.75 and 1). We present in Table 3 the evaluation results of these tests.
These results confirm the conclusions obtained with linear and quadratic splines. Indeed, for every choice of $\alpha$, the system accuracy corresponding to the (P1) stem weight estimate is better than that corresponding to the (P2) estimate. Moreover, the (Tr2) estimate of the right derivatives provides the best performance. Finally, it is clear that the system performance decreases as $\alpha$ increases. This implies that the more the estimate (Cu) of the left derivatives favours the right context (i.e., small $\alpha$ and so more weight on $t_{i+1}$ than on $t_i$ in the (Cu) formula), the better the results. Thus, the best results are obtained for the single cubic spline of class $C^1$, corresponding to the choice $\alpha = 0$.

By comparing the results obtained by each spline according to the different choices of estimation of its parameters, we conclude that the use of the quadratic splines with (P1) stem weight estimate and (Tr2) estimate of the right derivatives provides the best performances.

Comparison of the spline-based model with other models
To evaluate the impact of spline use in the disambiguation phase, we will compare the performances of the spline-based model with two other models successively using HMM and SVM in the disambiguation phase.
In this part, we will limit ourselves to the spline-based model that has performed best, namely the model based on the quadratic spline built from the (P1) stem weight estimate and the (Tr2) right derivative estimate. We kept the morphological phase for the three models, and only modified the disambiguation phase by first using the HMM, then the SVM (for more details on these models see for example [28,29]).
The training phase of each of these three models (Spline, HMM and SVM) was carried out from the same corpus $E_A$, and they were tested on the same test corpus $E_T$.
We calculated both the accuracy of each model and the speed, which is equal to the number of analysed words per second. The obtained results are shown in Table 4.
It is clear that the model based on the quadratic splines achieves the best performance. Indeed, the accuracy of this model exceeds 94% whereas those of the models based on the HMM and the SVM reach only 92.35% and 88.43% respectively. In addition, this model is the fastest, given that it analyses on average 290 words per second against only 254 for the HMM-based model and 210 for the one based on SVM. Finally, the smoothing methods used in the training phases of statistical approaches to deal with the absence of certain words or transitions in the training corpus are not essential for spline-based methods. Indeed, in HMM-based methods, the search for the solution consists of identifying the path with the highest probability. Since the probability of a path is expressed as a product of transition and emission probabilities, the absence of one of these transitions in the training corpus implies that the probability of the path will be estimated as zero. However, in the spline-based method, the surface of the spline associated with a path is not estimated as zero even though some transitions are absent in the training corpus, because the surface is expressed as a sum and not as a product of transition and emission probabilities.
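The contrast between a product and a sum can be seen on a toy path containing one unseen transition. The sketch below is illustrative only. `path_probability` is a deliberately simplified HMM-style score (a product of the given emission and transition estimates, with no initial distribution), and `quadratic_path_surface` uses our own per-interval integration of the quadratic spline. Both function names and all numbers are hypothetical.

```python
import math


def path_probability(emissions, transitions):
    """Simplified HMM-style score: product of emission and transition estimates."""
    return math.prod(emissions) * math.prod(transitions)


def quadratic_path_surface(weights, transitions):
    """Surface of the quadratic spline: a sum of per-interval contributions."""
    return sum((2 * p + q) / 3 + t / 6
               for p, q, t in zip(weights, weights[1:], transitions))


# Toy path of three stems whose second transition never occurs in training.
stem_weights = [0.5, 0.4, 0.8]
transition_weights = [0.3, 0.0]

product_score = path_probability(stem_weights, transition_weights)
surface_score = quadratic_path_surface(stem_weights, transition_weights)
```

Here `product_score` collapses to zero, so the path can no longer be ranked against others without smoothing, while `surface_score` stays strictly positive because the unseen transition removes only one term of the sum.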
To check whether the choice of training and test sets impacts system performance, we used the 10-fold cross-validation test for the spline-based and HMM-based models. We did not use this test for the SVM-based model due to its performance limitations. We present in Table 5 the accuracy of each fold. We note that for each fold the spline-based model surpasses the HMM-based model, and that the 10 tests achieve accuracies close to the average.
It remains to be noted that we did not consider it necessary to make comparisons with other disambiguation systems such as Madamira or Farasa since in [30] the authors compared Madamira with their lemmatization system based on HMMs, which is equivalent to the one used in Table 4, and they showed the superiority of the performances of their system.

Evaluation of the disambiguation system for the three tags
We will evaluate in this section our model on the three tags root, lemma and stem. The spline considered is the quadratic spline constructed from the (P1) stem weight estimate and the (Tr2) right derivative estimate. We used the 10-fold cross-validation test for the spline-based and HMM-based models. Table 6 presents the test results obtained for each tag analysed alone. Thus, lines 2, 3 and 4 present the average accuracy over all folds relating to each of the three tags stem, lemma and root respectively. The line 'All correct' displays the average percentage of words of the test corpus for which the three tags provided by the system coincide with those given by the annotators. Finally, the last line 'All wrong' displays the average percentage of tested words for which each tag provided by the system differs from that proposed by the annotators.
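The two joint measures can be computed as in the following sketch. The function name and the representation of each word's output as a (stem, lemma, root) triple are assumptions made for illustration.

```python
def joint_rates(predicted, gold):
    """Return (% of words with all three tags correct, % with all three tags wrong).

    `predicted` and `gold` are parallel lists of (stem, lemma, root) triples.
    """
    if len(predicted) != len(gold) or not gold:
        raise ValueError("predicted and gold must be non-empty and of equal length")
    all_correct = sum(p == g for p, g in zip(predicted, gold))
    all_wrong = sum(all(a != b for a, b in zip(p, g))
                    for p, g in zip(predicted, gold))
    n = len(gold)
    return 100.0 * all_correct / n, 100.0 * all_wrong / n
```

Words with one or two correct tags count in neither rate, so the two percentages need not sum to 100.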
We observe that the system performs well on the three tags. The difference in results for the three tags can be explained in part by the differences between the numbers of occurrences of each tag in the training corpus. Indeed, the occurrences of the roots are greater than those of the corresponding lemmas, which in turn are greater than those of the associated stems. As these occurrences are used in the training phase of the system (estimation of the Hermite data), the estimates of the weights of the roots are more precise than those of the lemmas, and the latter are more precise than those of stems. Moreover, by comparing the accuracies of the three tags with those of the last two lines of Table 6, we note that for almost all the tested words, the system tends to provide either three correct tags for each word or three wrong tags. Finally, the spline-based model performs better for the three tags than the HMM-based model.

Conclusion
In this paper we have presented an approach to identify the root, lemma and stem of words in Arabic language sentences. This approach proceeds in two steps. The analyser Alkhalil Morpho Sys is used in the first phase to identify for each word analysed out of context all its possible solutions (roots, lemmas and stems). Then, a disambiguation phase is carried out to identify the correct tag among these solutions. This phase is based on the use of spline functions.
The results obtained are very encouraging and we plan to improve them by using a larger training corpus and smoothing techniques to circumvent the absence of some transitions between word tags in the training corpus. We also intend to take advantage of the additional information provided by the analyser Alkhalil to better filter the transitions between the successive word tags. Finally, we plan to develop a spline-based system that analyses the three tags together. A comparison between this system and the three presented in this paper will be made.
We are currently working on exploiting this approach to develop a POS tagger and a diacritisation system for the Arabic language. We have developed a demo for the three systems presented here (light stemmer, lemmatizer and heavy stemmer), which can be consulted at the following address: http://demo.oujda-nlp-team.net/AlKhalil-Analyser/.
Once we finish developing the POS tagger and the diacritisation system, we intend to integrate them into the demo and then make the five systems open source.

Figure 3. Possible stems of the sentence words.

Figure 4. Results of morphological analysis of sentence words.

Table 4. Comparison between the three models.

Table 5. Results of the 10-fold cross-validation test.

Table 6. Accuracies relating to the three tags.